feat(api): add /shard endpoint for row-to-shard mapping #3276

Open · The-Obstacle-Is-The-Way wants to merge 5 commits into huggingface:main
Conversation
Implements a Row-to-Shard API that maps a row index to its original input shard and parquet output shard. This enables data provenance tracking for datasets using the new `original_shard_lengths` field from huggingface/datasets PR #7897.

API: GET /shard?dataset=X&config=Y&split=Z&row=N

- Adds core algorithm in libapi/shard_utils.py
- Registers /shard route in API service (no nginx changes needed)
- Uses a single cache call to config-parquet-and-info as an optimization
- Handles missing original_shard_lengths for legacy datasets
- Includes unit, integration, and E2E tests
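For readers skimming the diff, here is a minimal sketch of the row-to-shard mapping this commit describes. The function name, signature, and cumulative-sum approach are assumptions for illustration, not the PR's actual code in libapi/shard_utils.py.

```python
from bisect import bisect_right
from itertools import accumulate


def map_row_to_original_shard(row: int, original_shard_lengths: list[int]) -> dict:
    """Map an absolute row index onto the original input shard it came from.

    Hypothetical helper; assumes original_shard_lengths lists the row count
    of each original input shard, in order.
    """
    total_rows = sum(original_shard_lengths)
    if not 0 <= row < total_rows:
        raise ValueError(f"row {row} out of range [0, {total_rows})")
    # Cumulative end offsets: shard i covers rows [ends[i-1], ends[i]).
    ends = list(accumulate(original_shard_lengths))
    shard_index = bisect_right(ends, row)
    start = ends[shard_index - 1] if shard_index > 0 else 0
    return {
        "row_index": row,
        "original_shard_index": shard_index,
        "original_shard_start_row": start,
        "original_shard_end_row": ends[shard_index] - 1,
    }
```

With `original_shard_lengths = [100, 100]`, row 150 resolves to shard 1 covering rows 100-199, matching the response example in the PR description below.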
- Consolidate single-line statements within the 119-char limit
- Sort imports per isort rules (I001)
- Add trailing commas to dict literals
- Expand long dict entries to multi-line format

All CI quality checks verified locally:
- ruff check src/tests: PASS
- ruff format --check src/tests: PASS
- mypy src/tests: PASS
- bandit -r src --skip B615: PASS
Critical fixes:
- Add ResponseNotFoundError/ResponseNotReadyError to exception handling (fixes 404 being incorrectly returned as 500)
- Add shard_lengths validation to catch corrupted parquet metadata early
- Add headers to the 400 response in the OpenAPI spec for consistency
- Add a 500 response to the OpenAPI spec for completeness

Minor improvements:
- DRY: store sum(original_shard_lengths) in a variable
- Improve test assertion specificity (assert the error code value)
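A hedged sketch of the shard_lengths validation mentioned above; the exact checks and the function name are assumptions, grounded only in the commit's note about catching corrupted parquet metadata early.

```python
def validate_shard_lengths(shard_lengths: list[int], num_rows: int) -> None:
    """Hypothetical early check: every shard length must be a positive
    integer and the lengths must account for the split's full row count;
    anything else indicates corrupted parquet metadata."""
    if any(not isinstance(n, int) or n <= 0 for n in shard_lengths):
        raise ValueError(f"invalid shard_lengths: {shard_lengths!r}")
    if sum(shard_lengths) != num_rows:
        raise ValueError(
            f"shard_lengths sum to {sum(shard_lengths)}, expected {num_rows} rows"
        )
```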
Add missing return type annotations and parameter type hints to fixtures and test functions to satisfy mypy strict checking.
AI-generated code was fabricating filenames instead of raising errors:
- Empty parquet_files -> was returning fabricated "{split}.parquet"
- More shards than files -> was fabricating "{split}-{idx:05d}.parquet"
Now follows codebase pattern (duckdb.py:97-98): raise ValueError for
metadata inconsistencies instead of hiding data corruption.
Added tests for both error cases.
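A sketch of the fail-loud behavior this commit describes, under a hypothetical helper name; the PR's real code follows the duckdb.py:97-98 pattern mentioned above.

```python
def resolve_parquet_shard_file(
    split: str, parquet_files: list[str], shard_index: int
) -> str:
    """Return the parquet file for a shard, raising ValueError on
    inconsistent metadata instead of fabricating a filename."""
    if not parquet_files:
        # Previously fabricated "{split}.parquet" here.
        raise ValueError(f"no parquet files found for split {split!r}")
    if shard_index >= len(parquet_files):
        # Previously fabricated "{split}-{idx:05d}.parquet" here.
        raise ValueError(
            f"shard index {shard_index} exceeds the {len(parquet_files)} "
            f"parquet files listed for split {split!r}"
        )
    return parquet_files[shard_index]
```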
Summary
Adds a `/shard` endpoint that maps a row index to its corresponding shard information. This enables data provenance tracking: users can determine which original input file a specific row came from. This is the "next step" mentioned in huggingface/datasets#7897, which added `original_shard_lengths` to split info.

API

GET /shard?dataset=X&config=Y&split=Z&row=N
Response:
{ "row_index": 150, "original_shard_index": 1, "original_shard_start_row": 100, "original_shard_end_row": 199, "parquet_shard_index": 0, "parquet_shard_file": "train-00000-of-00002.parquet" }For legacy datasets without
original_shard_lengths, the original shard fields returnnullwith an explanatory message.Implementation
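As a usage illustration, a hypothetical client call: the dataset, config, and split values are placeholders, and the datasets-server host is an assumption about where this endpoint would be served.

```python
import requests

# Placeholder dataset/config/split; host assumed from the public
# dataset-viewer API.
resp = requests.get(
    "https://datasets-server.huggingface.co/shard",
    params={"dataset": "user/my-dataset", "config": "default", "split": "train", "row": 150},
)
resp.raise_for_status()
print(resp.json()["parquet_shard_file"])  # e.g. "train-00000-of-00002.parquet"
```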
Implementation

- Single cache call to `config-parquet-and-info` for efficiency
- Raises `ValueError` on inconsistent parquet metadata, following the existing pattern (`duckdb.py`)

Files Changed
- libs/libapi/src/libapi/shard_utils.py
- services/api/src/api/routes/shard.py
- services/api/src/api/app.py
- docs/source/openapi.json

Test Plan
- Unit tests (libs/libapi/tests/test_shard_utils.py)
- Integration tests (services/api/tests/routes/test_shard.py)
- E2E tests (e2e/tests/test_56_shard.py)
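To make the test plan concrete, a hedged sketch of what one of the unit tests might look like, reusing the hypothetical map_row_to_original_shard helper sketched earlier; the PR's real tests live in the files listed above.

```python
import pytest

# Assumes the map_row_to_original_shard sketch from earlier is importable
# from a hypothetical module.
from shard_utils_sketch import map_row_to_original_shard


def test_row_maps_to_second_original_shard() -> None:
    # Two original shards of 100 rows each; row 150 falls in shard 1,
    # matching the response example in the PR description.
    info = map_row_to_original_shard(150, [100, 100])
    assert info["original_shard_index"] == 1
    assert info["original_shard_start_row"] == 100
    assert info["original_shard_end_row"] == 199


def test_out_of_range_row_raises() -> None:
    with pytest.raises(ValueError):
        map_row_to_original_shard(200, [100, 100])
```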