raunakab
Contributor

Description

Query history export currently fetches the entire history and loads it into memory. This puts significant pressure on memory and can lead to OOMs for very large datasets.

This PR updates the logic to incrementally fetch pages of data, transform them, and write them to file, instead of fetching the entire dataset at once, transforming each row, and then writing it to file.

I.e., we no longer collect / materialize the entire dataset; we process it one page at a time instead.
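A minimal sketch of the page-at-a-time approach described above, not the code in this PR. It assumes a hypothetical `fetch_page(page_num, page_size)` callable that returns one page of rows; `iter_rows`, `export_csv`, and the in-memory `_FAKE_DB` are likewise illustrative names, and the page size of 100 is an assumption for the example.

```python
import csv
import io
from collections.abc import Callable, Iterator
from typing import TextIO

PAGE_SIZE = 100  # assumed page size, for illustration only

Row = dict[str, str]


def iter_rows(
    fetch_page: Callable[[int, int], list[Row]],
    page_size: int = PAGE_SIZE,
) -> Iterator[Row]:
    """Yield rows lazily; at most one page is held in memory at a time."""
    page_num = 0
    while True:
        page = fetch_page(page_num, page_size)
        if not page:
            break  # nothing left to read
        yield from page
        page_num += 1


def export_csv(rows: Iterator[Row], out: TextIO, fieldnames: list[str]) -> int:
    """Write rows to `out` as they arrive and return the number written."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical in-memory stand-in for the real data source.
_FAKE_DB: list[Row] = [{"id": str(i), "query": f"q{i}"} for i in range(250)]


def fake_fetch_page(page_num: int, page_size: int) -> list[Row]:
    start = page_num * page_size
    return _FAKE_DB[start : start + page_size]


if __name__ == "__main__":
    buf = io.StringIO()
    written = export_csv(iter_rows(fake_fetch_page), buf, ["id", "query"])
    print(f"wrote {written} rows")
```

Because `iter_rows` is a generator, the writer pulls rows on demand, so peak memory is bounded by the page size rather than by the size of the full history.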

This PR also parallelizes the chat-session fetching logic (given that chat-session reading does not have any cross-thread dependencies [I think?]).
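A rough sketch of what parallel per-session fetching could look like, using the standard library's `ThreadPoolExecutor` rather than whatever helper the project actually uses. `load_snapshot` and `snapshots_for_page` are hypothetical names, and the sketch assumes each call is independent (no shared mutable state between threads), which is the same assumption the description makes.

```python
from collections.abc import Callable, Iterable, Iterator
from concurrent.futures import ThreadPoolExecutor

Snapshot = dict[str, int]


def load_snapshot(session_id: int) -> Snapshot:
    """Hypothetical stand-in for reading one chat session."""
    return {"session_id": session_id, "message_count": session_id % 3}


def snapshots_for_page(
    session_ids: Iterable[int],
    loader: Callable[[int], Snapshot],
    max_workers: int = 4,
) -> Iterator[Snapshot]:
    """Run `loader` on a thread pool and yield results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map submits every item up front and yields results in the
        # order of the inputs, so this is best applied to one page at a time.
        yield from pool.map(loader, session_ids)


if __name__ == "__main__":
    for snapshot in snapshots_for_page([3, 1, 2], load_snapshot):
        print(snapshot)
```

Applying this per page keeps the number of in-flight results bounded by the page size, so the parallelism does not undo the memory savings from pagination.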

Addresses: https://linear.app/danswer/issue/DAN-1984/improve-performance-of-query-history-exporting.

How Has This Been Tested?

This is a performance change, not a feature addition. We don't have much (if any) performance testing / benchmarking in our code, so this change has not been benchmarked.

The feature still produces the same output for the same input.

@raunakab raunakab requested a review from a team as a code owner May 18, 2025 02:44

Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

This PR optimizes query history export by implementing generator-based pagination and parallel processing to reduce memory usage.

  • Refactored fetch_and_process_chat_session_history in /backend/ee/onyx/server/query_history/api.py to use paginated fetching with PAGE_SIZE=100 instead of loading all data at once
  • Added parallel processing of chat session snapshots using parallel_yield to improve performance while maintaining thread safety
  • Modified CSV writing in /backend/ee/onyx/background/celery/apps/heavy.py to process data incrementally as it's fetched rather than materializing the entire dataset
  • Fixed a pagination bug where the break condition checked whether the page length was equal to PAGE_SIZE instead of less than PAGE_SIZE
  • Removed redundant list comprehensions and memory-intensive operations throughout the codebase
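The pagination points above can be illustrated with a small sketch. This is one reading of the fix, not the code from the PR: the loop stops when a page comes back shorter than the page size, and also stops early on an empty page (matching the "early break if list of chat_sessions is empty" commit). `paginate` and `make_fetcher` are hypothetical names.

```python
from collections.abc import Callable, Iterator

PAGE_SIZE = 100  # value taken from the summary above


def paginate(
    fetch_page: Callable[[int, int], list[int]],
    page_size: int = PAGE_SIZE,
) -> Iterator[list[int]]:
    """Yield pages until the source is exhausted."""
    page_num = 0
    while True:
        page = fetch_page(page_num, page_size)
        if not page:
            break  # empty page: nothing to process
        yield page
        if len(page) < page_size:
            break  # a short page must be the last one
        page_num += 1


def make_fetcher(total: int) -> tuple[Callable[[int, int], list[int]], list[int]]:
    """Hypothetical data source of `total` items that records each call."""
    calls: list[int] = []

    def fetch_page(page_num: int, page_size: int) -> list[int]:
        calls.append(page_num)
        start = page_num * page_size
        return list(range(start, min(start + page_size, total)))

    return fetch_page, calls


if __name__ == "__main__":
    fetch, calls = make_fetcher(250)
    sizes = [len(page) for page in paginate(fetch)]
    print(sizes, "fetch calls:", len(calls))
```

With a `len(page) == page_size` break instead, a loop like this would stop after the first full page and silently drop the rest of the history, which is why the comparison direction matters.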

2 file(s) reviewed, 1 comment(s)

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@raunakab raunakab enabled auto-merge May 19, 2025 19:52
@raunakab raunakab added this pull request to the merge queue May 19, 2025
Merged via the queue into main with commit fd735c9 May 19, 2025
10 of 11 checks passed
@raunakab raunakab deleted the perf/query-history-export branch May 19, 2025 21:20
ferdinandl007 pushed a commit to ferdinandl007/onyx that referenced this pull request May 27, 2025
…ully into memory (onyx-dot-app#4729)

* Change query-exporting to use generators instead of expanding fully into memory

* Fix pagination logic

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add type annotation

* Add early break if list of chat_sessions is empty

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
aronszanto pushed a commit to aronszanto/onyx that referenced this pull request May 27, 2025
…ully into memory (onyx-dot-app#4729)

ZhipengHe pushed a commit to ZhipengHe/onyx that referenced this pull request Jun 6, 2025
…ully into memory (onyx-dot-app#4729)

AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
…ully into memory (onyx-dot-app#4729)

Development

Successfully merging this pull request may close these issues.

Severe Memory Spike/OOM in export_query_history_task During Chat History Export