Store extracted text in nidx #3588

Merged
jotare merged 33 commits into main from
store-extracted-text-in-nidx
Apr 15, 2026
Conversation

@jotare
Contributor

@jotare jotare commented Apr 9, 2026

Description

Object storage response time for extracted text blobs is around 50-200 ms, depending on the region, activity, and so on. This is a first step toward storing the text in nidx, which provides response times on the order of µs inside nidx and ~10 ms for a gRPC call (mostly network).

This is a fairly naive implementation, and many things can be improved.
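
To illustrate the read path this change aims for, here is a minimal sketch, not the code in this PR. It assumes a feature flag selects the nidx path and that a miss in nidx falls back to object storage; the function name, the `use_nidx` parameter, and both fetcher callables are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical fetcher type: takes a field key and returns its extracted
# text, or None when nothing is stored under that key.
TextFetcher = Callable[[str], Awaitable[Optional[str]]]


async def get_extracted_text(
    key: str,
    *,
    use_nidx: bool,
    fetch_from_nidx: TextFetcher,
    fetch_from_storage: TextFetcher,
) -> Optional[str]:
    """Return extracted text, trying the fast path first when the flag is on.

    Assumption: a miss in nidx falls back to object storage. The actual
    behaviour in this PR may differ.
    """
    if use_nidx:
        # Fast path: one gRPC call (~10 ms, mostly network).
        text = await fetch_from_nidx(key)
        if text is not None:
            return text
        # Not stored in nidx (yet): fall through to the slower blob read.
    # Slow path: object storage read (around 50-200 ms).
    return await fetch_from_storage(key)
```

Passing the fetchers as parameters keeps the sketch independent of any real client, so the flag can be flipped per request and both paths can be exercised with in-memory fakes.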

How was this PR tested?

Describe how you tested this PR.

@jotare jotare requested a review from a team April 9, 2026 14:58
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 21.48438% with 201 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.26%. Comparing base (0c43f8f) to head (d1bf5ca).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| nidx/nidx_text/src/lib.rs | 0.00% | 48 Missing ⚠️ |
| nidx/nidx_text/src/reader.rs | 0.00% | 48 Missing ⚠️ |
| nidx/src/searcher/shard_text.rs | 0.00% | 47 Missing ⚠️ |
| nucliadb/src/nucliadb/common/cache.py | 33.33% | 20 Missing ⚠️ |
| nucliadb/src/nucliadb/search/search/paragraphs.py | 41.17% | 20 Missing ⚠️ |
| ...cliadb_utils/src/nucliadb_utils/featureflagging.py | 60.00% | 10 Missing ⚠️ |
| nidx/src/api/shards.rs | 60.00% | 1 Missing and 1 partial ⚠️ |
| nidx/src/searcher/grpc.rs | 0.00% | 2 Missing ⚠️ |
| nidx/nidx_text/src/schema.rs | 75.00% | 1 Missing ⚠️ |
| nidx/src/api/grpc.rs | 0.00% | 1 Missing ⚠️ |

... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3588      +/-   ##
==========================================
- Coverage   85.62%   85.26%   -0.36%     
==========================================
  Files         551      552       +1     
  Lines       46872    47099     +227     
  Branches    13293    13448     +155     
==========================================
+ Hits        40133    40161      +28     
- Misses       6151     6348     +197     
- Partials      588      590       +2     
| Flag | Coverage Δ |
|---|---|
| nidx | 79.77% <5.69%> (-0.91%) ⬇️ |
| nucliadb | 72.97% <46.93%> (-0.09%) ⬇️ |
| nucliadb-ingest | 43.59% <30.61%> (-0.04%) ⬇️ |
| nucliadb-reader | 43.65% <30.61%> (-0.06%) ⬇️ |
| nucliadb-search | 53.96% <43.87%> (-0.06%) ⬇️ |
| nucliadb-standalone | 46.06% <30.61%> (-0.02%) ⬇️ |
| nucliadb-train | 44.70% <30.61%> (-0.04%) ⬇️ |
| nucliadb-writer | 47.01% <30.61%> (-0.04%) ⬇️ |
| nucliadb_dataset | 73.76% <ø> (ø) |
| nucliadb_models | 71.63% <ø> (ø) |
| nucliadb_sdk | 83.49% <ø> (ø) |
| nucliadb_telemetry | 82.74% <ø> (-0.18%) ⬇️ |
| nucliadb_utils | 80.61% <55.55%> (-0.19%) ⬇️ |

@jotare jotare force-pushed the store-extracted-text-in-nidx branch from 87a8e7c to fdfd9f6 Compare April 10, 2026 10:10
Comment thread nidx/nidx_protos/nidx.proto Outdated
Comment thread nucliadb/src/nucliadb/common/cluster/rollover.py
Comment thread nucliadb/src/nucliadb/common/cache.py Outdated
Comment thread nucliadb_utils/src/nucliadb_utils/featureflagging.py Outdated
Comment thread nucliadb_utils/src/nucliadb_utils/featureflagging.py Outdated
Comment thread nucliadb_utils/src/nucliadb_utils/featureflagging.py Outdated
@jotare jotare merged commit 039c2f2 into main Apr 15, 2026
46 checks passed
@jotare jotare deleted the store-extracted-text-in-nidx branch April 15, 2026 08:09
2 participants