perf: implement chunked fulltext indexing for large datasets #256
Conversation
Hi @cgoldshtein,
Thanks for the PR.
For us to accept this commit, can you please enable the CI/CD pipeline so we can have confidence it passes all tests: https://github.com/cgoldshtein/ckanext-xloader/actions
Secondly, please add your new config option to the config_declaration.yaml.
As you are testing with chunking, could you also add a test that sets the chunk config to a small number, say 5, and checks via the log output that it loops correctly (a rough sketch of what that might look like follows this comment).
Once we get a green build, we will get this merged and cut a release.
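A sketch of the kind of test being asked for, using CKAN's `ckan_config` pytest marker; the `load_test_resource` fixture and the exact log wording are illustrative assumptions, not the extension's actual test helpers:

```python
import pytest


@pytest.mark.ckan_config('ckanext.xloader.search_update_chunks', '5')
def test_fulltext_indexing_loops_in_small_chunks(caplog, load_test_resource):
    # Hypothetical fixture that xloads a resource with 12 rows; with a chunk
    # size of 5 the fulltext update should run three times (5 + 5 + 2 rows).
    load_test_resource(rows=12)

    chunk_logs = [r.getMessage() for r in caplog.records
                  if 'chunk' in r.getMessage().lower()]
    assert len(chunk_logs) == 3
```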
if rows_count:
    # Configure chunk size - prevents timeouts and memory issues on large datasets
    # Default 100,000 rows per chunk balances performance vs. resource usage
    chunks = int(config.get('ckanext.xloader.search_update_chunks', 100000))
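For context, here is a minimal sketch of how the `chunks` value could drive the fulltext update. It only illustrates the chunking idea under assumed names (`connection`, `fields`, `logger`) and naive identifier quoting; it is not the PR's actual code:

```python
def update_fulltext_in_chunks(connection, resource_id, fields, rows_count,
                              chunks, logger):
    # Build a to_tsvector() expression over all non-system columns.
    # (Real code should quote identifiers properly; this is a naive sketch.)
    cols = " || ' ' || ".join(
        'coalesce("{0}"::text, \'\')'.format(f['id'])
        for f in fields if not f['id'].startswith('_'))
    sql = ('UPDATE "{table}" SET _full_text = to_tsvector({cols}) '
           'WHERE _id > %s AND _id <= %s').format(table=resource_id, cols=cols)

    # Walk the table in windows of `chunks` rows so each UPDATE stays small,
    # keeping memory use and statement run time bounded on very large tables.
    for start in range(0, rows_count, chunks):
        connection.execute(sql, (start, start + chunks))
        logger.info('fulltext chunk updated: rows %s-%s of %s',
                    start + 1, min(start + chunks, rows_count), rows_count)
```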
please add this config item to https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/config_declaration.yaml with a description and example. You can also set the default value there so it's not hidden in the code.
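For reference, an options entry of the kind being requested might look roughly like this; the key and default come from the code above, while the description wording is only a suggestion:

```yaml
- key: ckanext.xloader.search_update_chunks
  default: 100000
  type: int
  example: 50000
  description: |
    Number of rows to process per batch when rebuilding the full-text
    search index after a load. Smaller values reduce memory use and the
    risk of statement timeouts on very large resources.
```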
Done.
How large a table can now be loaded with this change?
That's a very good question, and I think it now comes down to how beefy the database and the worker are. Chunking the data ensures you don't hit RAM or contention limits.
@duttonw I'm really happy to see this change. We've got some 11M-row datasets that wouldn't load with xloader, likely due to this issue. So far we've worked around it by using Table Designer for the resources (to use the most efficient column types) and loading the data with the API, but it would be simpler for our client to use xloader, since the schema changes fairly often and we typically can't make use of the partial update ability.

The second problem has been slow pagination in streaming downloads of such a large dataset; this is being addressed in ckan/ckan#9028.

The third problem is not having a way to define composite indexes with the datastore API. Sorting efficiently requires an index with the column(s) being sorted followed by the …
I'll keep a note of how long these record sets take to load once we get this deployed to www.data.qld.gov.au:
https://www.data.qld.gov.au/dataset/unclaimed-monies/resource/872065ae-ddfd-4b5f-ad15-e1935dadd883 (2,295,221 records)
https://www.data.qld.gov.au/dataset/queensland-covid-19-case-line-list-location-source-of-infection/resource/1dbae506-d73c-4c19-b727-e8654b8be95a (1,854,305 records)
Description
Implement chunked processing for fulltext indexing to prevent timeouts on large datasets.
Fixes #255
Changes