
Conversation

cgoldshtein

Description

Implement chunked processing for fulltext indexing to prevent timeouts on large datasets.

Fixes #255

Changes

  • Replace single UPDATE query with configurable chunked processing
  • Add progress logging for each processed chunk
  • Add new config option: ckanext.xloader.search_update_chunks
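The core idea of the change is to issue one bounded UPDATE per `_id` range instead of a single table-wide UPDATE. A minimal sketch of that approach (the SQL text and helper names below are illustrative, not xloader's actual code):

```python
# Sketch: split the fulltext update into (low, high] _id ranges so each
# UPDATE statement touches at most `chunk_size` rows. The SQL string and
# `build_chunk_bounds` are illustrative stand-ins, not xloader internals.

CHUNK_SQL = (
    'UPDATE "{table}" SET _full_text = to_tsvector(...) '
    'WHERE _id > %s AND _id <= %s'
)

def build_chunk_bounds(max_id, chunk_size):
    """Return (low, high] _id bounds covering rows 1..max_id."""
    bounds = []
    low = 0
    while low < max_id:
        high = min(low + chunk_size, max_id)
        bounds.append((low, high))
        low = high
    return bounds

# 250,000 rows with the default 100,000-row chunks -> three bounded UPDATEs
for low, high in build_chunk_bounds(250000, 100000):
    pass  # e.g. cursor.execute(CHUNK_SQL.format(table=resource_id), (low, high))
```

Each chunk commits independently, so a long-running index rebuild no longer holds one statement open for the whole table.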

Collaborator

@duttonw duttonw left a comment


Hi @cgoldshtein ,

Thanks for the PR.

For us to accept this commit, can you please enable the CI/CD pipeline so we can have confidence it passes all tests: https://github.com/cgoldshtein/ckanext-xloader/actions

Second, please add your new config option to the config_declaration.yaml.

Since you are testing with chunking, could you add a test that sets the chunk config to, say, 5 and verifies that it loops correctly in the log output.
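A self-contained sketch of the suggested test: with 12 rows and the chunk config set to 5, the update loop should run three times, once per chunk. `update_fulltext_in_chunks` is a hypothetical stand-in for xloader's loop, and the log message format is an assumption:

```python
# Hypothetical test sketch: count per-chunk log messages to verify the
# loop runs once per chunk. Names and message format are illustrative.
import logging

log = logging.getLogger("xloader.chunk_sketch")

def update_fulltext_in_chunks(rows_count, chunk_size):
    chunk_logs = []
    for start in range(0, rows_count, chunk_size):
        end = min(start + chunk_size, rows_count)
        msg = "fulltext update: rows %d-%d of %d" % (start + 1, end, rows_count)
        log.info(msg)
        chunk_logs.append(msg)
    return chunk_logs

logs = update_fulltext_in_chunks(12, 5)
assert len(logs) == 3  # chunks of 5, 5 and 2 rows
```

In a real pytest test, the same assertion could be made against captured log records (e.g. via the `caplog` fixture) after running the loader with the config option set to 5.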

Once we get a green build, we will get this merged and cut a release.

if rows_count:
    # Configure chunk size - prevents timeouts and memory issues on large datasets
    # Default 100,000 rows per chunk balances performance vs. resource usage
    chunks = int(config.get('ckanext.xloader.search_update_chunks', 100000))
Collaborator


please add this config item to https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/config_declaration.yaml with a description and example. You can also set the default value there so it's not hidden in the code.
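A hypothetical entry of the kind being requested; the key name matches the PR, but the description text and exact schema fields are assumptions about how config_declaration.yaml is structured:

```yaml
# Hypothetical config_declaration.yaml entry; field names are assumed.
- key: ckanext.xloader.search_update_chunks
  default: 100000
  example: 50000
  description: |
    Number of rows per UPDATE statement when rebuilding the fulltext
    search index. Smaller values reduce memory use and lock time on
    large tables at the cost of more round trips.
```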

Author


Done.

@duttonw duttonw merged commit 594891a into ckan:master Aug 10, 2025
4 checks passed
@wardi
Contributor

wardi commented Aug 10, 2025

How large of a table can now be loaded with this change?

@duttonw
Collaborator

duttonw commented Aug 11, 2025

That's a very good question, and I think it comes down to how beefy the database and the worker are. Chunking the data ensures you don't hit RAM/contention states.

@wardi
Contributor

wardi commented Aug 11, 2025

@duttonw I'm really happy to see this change. We've got some 11M row datasets that wouldn't load with xloader likely due to this issue.

So far we've worked around the issue by using table designer for the resources (to use the most efficient column types) and loading the data with the API, but it would be simpler for our client to use xloader, since the schema changes fairly often and we typically can't make use of the partial update ability.

The second problem has been slow pagination in streaming downloads of such a large dataset; this is being addressed in ckan/ckan#9028.

The third problem is not having a way to define composite indexes with the datastore API. Sorting efficiently requires an index with the column(s) being sorted followed by the _id field because that's used to provide stable ordering for pagination. We've worked around this by manually adding indexes to these large tables with psql, but hope to work on extending the datastore_create API for adding indexes soon.
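The manual workaround described above looks something like the following; the column name is hypothetical, and the quoted table name follows the datastore convention of naming tables after the resource id:

```sql
-- Sketch of a manually added composite index via psql. "region" is a
-- hypothetical sort column; the trailing _id matches the stable
-- ordering the datastore uses for pagination.
CREATE INDEX CONCURRENTLY my_resource_region_idx
    ON "872065ae-ddfd-4b5f-ad15-e1935dadd883" ("region", _id);
```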

@duttonw
Collaborator

duttonw commented Aug 11, 2025

I'll keep a note of how long this record set takes to load once we get this deployed to www.data.qld.gov.au

https://www.data.qld.gov.au/dataset/unclaimed-monies/resource/872065ae-ddfd-4b5f-ad15-e1935dadd883 2,295,221 records
start: August 11, 2025, 10:25 (AEST)
copy done: August 11, 2025, 10:26 (AEST)
complete: August 11, 2025, 10:37 (AEST)

https://www.data.qld.gov.au/dataset/queensland-covid-19-case-line-list-location-source-of-infection/resource/1dbae506-d73c-4c19-b727-e8654b8be95a 1,854,305 records
