feat(api): add /shard endpoint for row-to-shard mapping #3276

Open · The-Obstacle-Is-The-Way wants to merge 5 commits into huggingface:main
Conversation
Implements a Row-to-Shard API that maps a row index to its original input shard and parquet output shard. This enables data provenance tracking for datasets using the new `original_shard_lengths` field from huggingface/datasets PR #7897.

API: GET /shard?dataset=X&config=Y&split=Z&row=N

- Adds core algorithm in libapi/shard_utils.py
- Registers /shard route in API service (no nginx changes needed)
- Uses a single cache call to config-parquet-and-info as an optimization
- Handles missing original_shard_lengths for legacy datasets
- Includes unit, integration, and E2E tests
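For readers skimming the diff, here is a minimal sketch of the row-to-shard mapping this commit describes. The function name, signature, and cumulative-sum approach are assumptions for illustration, not the PR's actual code in libapi/shard_utils.py.

```python
from bisect import bisect_right
from itertools import accumulate


def map_row_to_original_shard(row: int, original_shard_lengths: list[int]) -> dict:
    """Map an absolute row index onto the original input shard it came from.

    Hypothetical helper; assumes original_shard_lengths lists the row count
    of each original input shard, in order.
    """
    total_rows = sum(original_shard_lengths)
    if not 0 <= row < total_rows:
        raise ValueError(f"row {row} out of range [0, {total_rows})")
    # Cumulative end offsets: shard i covers rows [ends[i-1], ends[i]).
    ends = list(accumulate(original_shard_lengths))
    shard_index = bisect_right(ends, row)
    start = ends[shard_index - 1] if shard_index > 0 else 0
    return {
        "row_index": row,
        "original_shard_index": shard_index,
        "original_shard_start_row": start,
        "original_shard_end_row": ends[shard_index] - 1,
    }
```

With `original_shard_lengths = [100, 100]`, row 150 resolves to shard 1 covering rows 100-199, matching the response example in the PR description below.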
- Consolidate single-line statements within the 119-char limit
- Sort imports per isort rules (I001)
- Add trailing commas to dict literals
- Expand long dict entries to multi-line format

All CI quality checks verified locally:
- ruff check src/tests: PASS
- ruff format --check src/tests: PASS
- mypy src/tests: PASS
- bandit -r src --skip B615: PASS
Critical fixes:
- Add ResponseNotFoundError/ResponseNotReadyError to exception handling (fixes 404 being incorrectly returned as 500)
- Add shard_lengths validation to catch corrupted parquet metadata early
- Add headers to the 400 response in the OpenAPI spec for consistency
- Add a 500 response to the OpenAPI spec for completeness

Minor improvements:
- DRY: store sum(original_shard_lengths) in a variable
- Improve test assertion specificity (assert the error code value)
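A hedged sketch of the shard_lengths validation mentioned above; the exact checks and the function name are assumptions, grounded only in the commit's note about catching corrupted parquet metadata early.

```python
def validate_shard_lengths(shard_lengths: list[int], num_rows: int) -> None:
    """Hypothetical early check: every shard length must be a positive
    integer and the lengths must account for the split's full row count;
    anything else indicates corrupted parquet metadata."""
    if any(not isinstance(n, int) or n <= 0 for n in shard_lengths):
        raise ValueError(f"invalid shard_lengths: {shard_lengths!r}")
    if sum(shard_lengths) != num_rows:
        raise ValueError(
            f"shard_lengths sum to {sum(shard_lengths)}, expected {num_rows} rows"
        )
```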
Add missing return type annotations and parameter type hints to fixtures and test functions to satisfy mypy strict checking.
AI-generated code was fabricating filenames instead of raising errors:
- Empty parquet_files -> was returning fabricated "{split}.parquet"
- More shards than files -> was fabricating "{split}-{idx:05d}.parquet"
Now follows codebase pattern (duckdb.py:97-98): raise ValueError for
metadata inconsistencies instead of hiding data corruption.
Added tests for both error cases.
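A sketch of the fail-loud behavior this commit describes, under a hypothetical helper name; the PR's real code follows the duckdb.py:97-98 pattern mentioned above.

```python
def resolve_parquet_shard_file(
    split: str, parquet_files: list[str], shard_index: int
) -> str:
    """Return the parquet file for a shard, raising ValueError on
    inconsistent metadata instead of fabricating a filename."""
    if not parquet_files:
        # Previously fabricated "{split}.parquet" here.
        raise ValueError(f"no parquet files found for split {split!r}")
    if shard_index >= len(parquet_files):
        # Previously fabricated "{split}-{idx:05d}.parquet" here.
        raise ValueError(
            f"shard index {shard_index} exceeds the {len(parquet_files)} "
            f"parquet files listed for split {split!r}"
        )
    return parquet_files[shard_index]
```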
Summary
Adds a `/shard` endpoint that maps a row index to its corresponding shard information. This enables data provenance tracking: users can determine which original input file a specific row came from. This is the "next step" mentioned in huggingface/datasets#7897, which added `original_shard_lengths` to split info.

API

GET /shard?dataset=X&config=Y&split=Z&row=N
Response:
{ "row_index": 150, "original_shard_index": 1, "original_shard_start_row": 100, "original_shard_end_row": 199, "parquet_shard_index": 0, "parquet_shard_file": "train-00000-of-00002.parquet" }For legacy datasets without
original_shard_lengths, the original shard fields returnnullwith an explanatory message.Implementation
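As a usage illustration, a hypothetical client call: the dataset, config, and split values are placeholders, and the datasets-server host is an assumption about where this endpoint would be served.

```python
import requests

# Placeholder dataset/config/split; host assumed from the public
# dataset-viewer API.
resp = requests.get(
    "https://datasets-server.huggingface.co/shard",
    params={"dataset": "user/my-dataset", "config": "default", "split": "train", "row": 150},
)
resp.raise_for_status()
print(resp.json()["parquet_shard_file"])  # e.g. "train-00000-of-00002.parquet"
```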
Implementation

- Single cache call to `config-parquet-and-info` for efficiency
- Raises `ValueError` on inconsistent parquet metadata, following the existing pattern (`duckdb.py`)

Files Changed
- libs/libapi/src/libapi/shard_utils.py
- services/api/src/api/routes/shard.py
- services/api/src/api/app.py
- docs/source/openapi.json

Test Plan
- Unit tests (libs/libapi/tests/test_shard_utils.py)
- Integration tests (services/api/tests/routes/test_shard.py)
- E2E tests (e2e/tests/test_56_shard.py)
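To make the test plan concrete, a hedged sketch of what one of the unit tests might look like, reusing the hypothetical map_row_to_original_shard helper sketched earlier; the PR's real tests live in the files listed above.

```python
import pytest

# Assumes the map_row_to_original_shard sketch from earlier is importable
# from a hypothetical module.
from shard_utils_sketch import map_row_to_original_shard


def test_row_maps_to_second_original_shard() -> None:
    # Two original shards of 100 rows each; row 150 falls in shard 1,
    # matching the response example in the PR description.
    info = map_row_to_original_shard(150, [100, 100])
    assert info["original_shard_index"] == 1
    assert info["original_shard_start_row"] == 100
    assert info["original_shard_end_row"] == 199


def test_out_of_range_row_raises() -> None:
    with pytest.raises(ValueError):
        map_row_to_original_shard(200, [100, 100])
```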