
Conversation

cgoldshtein

Description

Implement chunked processing for fulltext indexing to prevent timeouts on large datasets.

Fixes #255

Changes

  • Replace single UPDATE query with configurable chunked processing
  • Add progress logging for each processed chunk
  • Add new config option: ckanext.xloader.search_update_chunks
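The core idea of the change is to issue one bounded UPDATE per `_id` range instead of a single table-wide UPDATE. A minimal sketch of that approach (the SQL text and helper names below are illustrative, not xloader's actual code):

```python
# Sketch: split the fulltext update into (low, high] _id ranges so each
# UPDATE statement touches at most `chunk_size` rows. The SQL string and
# `build_chunk_bounds` are illustrative stand-ins, not xloader internals.

CHUNK_SQL = (
    'UPDATE "{table}" SET _full_text = to_tsvector(...) '
    'WHERE _id > %s AND _id <= %s'
)

def build_chunk_bounds(max_id, chunk_size):
    """Return (low, high] _id bounds covering rows 1..max_id."""
    bounds = []
    low = 0
    while low < max_id:
        high = min(low + chunk_size, max_id)
        bounds.append((low, high))
        low = high
    return bounds

# 250,000 rows with the default 100,000-row chunks -> three bounded UPDATEs
for low, high in build_chunk_bounds(250000, 100000):
    pass  # e.g. cursor.execute(CHUNK_SQL.format(table=resource_id), (low, high))
```

Each chunk commits independently, so a long-running index rebuild no longer holds one statement open for the whole table.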

Collaborator

@duttonw duttonw left a comment


Hi @cgoldshtein ,

Thanks for the PR.

For us to accept this commit, can you please enable the CI/CD pipeline so we can have confidence it passes all tests: https://github.com/cgoldshtein/ckanext-xloader/actions

Second, please add your new config option to the config_declaration.yaml.

Since you are testing with chunking, could you add a test that sets the chunk config to, say, 5 and verifies that it loops correctly in the log output.
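A self-contained sketch of the suggested test: with 12 rows and the chunk config set to 5, the update loop should run three times, once per chunk. `update_fulltext_in_chunks` is a hypothetical stand-in for xloader's loop, and the log message format is an assumption:

```python
# Hypothetical test sketch: count per-chunk log messages to verify the
# loop runs once per chunk. Names and message format are illustrative.
import logging

log = logging.getLogger("xloader.chunk_sketch")

def update_fulltext_in_chunks(rows_count, chunk_size):
    chunk_logs = []
    for start in range(0, rows_count, chunk_size):
        end = min(start + chunk_size, rows_count)
        msg = "fulltext update: rows %d-%d of %d" % (start + 1, end, rows_count)
        log.info(msg)
        chunk_logs.append(msg)
    return chunk_logs

logs = update_fulltext_in_chunks(12, 5)
assert len(logs) == 3  # chunks of 5, 5 and 2 rows
```

In a real pytest test, the same assertion could be made against captured log records (e.g. via the `caplog` fixture) after running the loader with the config option set to 5.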

Once we get a green build, we will get this merged and cut a release.

if rows_count:
    # Configure chunk size - prevents timeouts and memory issues on large datasets
    # Default 100,000 rows per chunk balances performance vs. resource usage
    chunks = int(config.get('ckanext.xloader.search_update_chunks', 100000))
Collaborator


please add this config item to https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/config_declaration.yaml with a description and example. You can also set the default value there so it's not hidden in the code.
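A hypothetical entry of the kind being requested; the key name matches the PR, but the description text and exact schema fields are assumptions about how config_declaration.yaml is structured:

```yaml
# Hypothetical config_declaration.yaml entry; field names are assumed.
- key: ckanext.xloader.search_update_chunks
  default: 100000
  example: 50000
  description: |
    Number of rows per UPDATE statement when rebuilding the fulltext
    search index. Smaller values reduce memory use and lock time on
    large tables at the cost of more round trips.
```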

Author


Done.

@duttonw duttonw merged commit 594891a into ckan:master Aug 10, 2025
4 checks passed
@wardi
Contributor

wardi commented Aug 10, 2025

How large of a table can now be loaded with this change?

@duttonw
Collaborator

duttonw commented Aug 11, 2025

That's a very good question, and I think it comes down to how beefy the database and the worker are. Chunking the data ensures you don't hit RAM/contention states.

@wardi
Contributor

wardi commented Aug 11, 2025

@duttonw I'm really happy to see this change. We've got some 11M row datasets that wouldn't load with xloader likely due to this issue.

So far we've worked around the issue by using table designer for the resources (to use the most efficient column types) and loading the data with the API, but it would be simpler for our client to use xloader, since the schema changes fairly often and we typically can't make use of the partial update ability.

The second problem has been slow pagination in streaming downloads of such a large dataset; this is being addressed in ckan/ckan#9028.

The third problem is not having a way to define composite indexes with the datastore API. Sorting efficiently requires an index with the column(s) being sorted followed by the _id field because that's used to provide stable ordering for pagination. We've worked around this by manually adding indexes to these large tables with psql, but hope to work on extending the datastore_create API for adding indexes soon.
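The manual workaround described above looks something like the following; the column name is hypothetical, and the quoted table name follows the datastore convention of naming tables after the resource id:

```sql
-- Sketch of a manually added composite index via psql. "region" is a
-- hypothetical sort column; the trailing _id matches the stable
-- ordering the datastore uses for pagination.
CREATE INDEX CONCURRENTLY my_resource_region_idx
    ON "872065ae-ddfd-4b5f-ad15-e1935dadd883" ("region", _id);
```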

@duttonw
Collaborator

duttonw commented Aug 11, 2025

I'll keep a note of how long this record set takes to load once we get this deployed to www.data.qld.gov.au

https://www.data.qld.gov.au/dataset/unclaimed-monies/resource/872065ae-ddfd-4b5f-ad15-e1935dadd883 2,295,221 records
start: August 11, 2025, 10:25 (AEST)
copy done: August 11, 2025, 10:26 (AEST)
complete: August 11, 2025, 10:37 (AEST)

https://www.data.qld.gov.au/dataset/queensland-covid-19-case-line-list-location-source-of-infection/resource/1dbae506-d73c-4c19-b727-e8654b8be95a 1,854,305 records
