Skip to content

(refactor) remove redundant logic in _check_valid_index_key #7490

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

suzyahyah
Copy link

@suzyahyah suzyahyah commented Mar 30, 2025

This PR contributes a minor refactor, in a small function in src/datasets/formatting/formatting.py. No change in logic.

In the original code, there are separate if-conditionals for isinstance(key, range) and isinstance(key, Iterable), with essentially the same logic.

This PR combines these two using a single if statement.

Considerations

  1. Although range in python is guaranteed to have integers, internally calling int() on an object that is already an int is negligible. (In python it returns the original object. It doesn't create a new integer object or perform any actual conversion)

  2. Technically a range is already an Iterable, and we could just do isinstance(key, Iterable) but I explicitly did isinstance(key, (range, Iterable)) just to be super obvious and consistent that both cases are handled because I see slice, range, Iterable everywhere in this formatting.py

  3. This PR removes the if len(key)>0 conditional. I think it is cleaner to have it this way for three reasons.

  • There was originally no else statement and the code would have failed silently anyway.
  • The if len(key)>0 should be caught much earlier, rather than in formatting.py.
  • There are actually multiple cases where this would fail, if len(key)>0, if key is non numeric or float, or if key is a list of lists. It's clunky to state all this and the error be thrown during max or indexing.

Previous PR and Issues Checks

  1. No known PR or Issues (both closed or open) in hf datasets repository

Tests

  1. Tested using Dataset (load_dataset("wikitext", "wikitext-103-raw-v1")), Pytorch DataLoader, with a Pytorch BatchSampler (list of indexes returned instead of single index).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant