Fix misleading add_column() usage example in docstring #7648

Merged
merged 4 commits into from
Jul 17, 2025

Conversation

ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jun 27, 2025

Fixes #7611

This PR fixes the usage example in the Dataset.add_column() docstring, which previously implied that add_column() modifies the dataset in-place.

Why:
The method returns a new dataset with the additional column, and users must assign the result to a variable to preserve the change.

This should make the behavior clearer for users.
@lhoestq @davanstrien

@lhoestq
Member

lhoestq commented Jul 7, 2025

I believe there are other occurrences of cases like this, like select_columns, select, filter, shard and flatten. Could you also fix the docstrings for them as well before we merge?

Fix misleading docstring examples for select_columns, select, filter, shard, and flatten

- Updated usage examples to show correct behavior (methods return new datasets)
- Added inline comments to clarify that methods do not modify in-place
- Follow-up to issue huggingface#7611 and @lhoestq’s review on PR huggingface#7648
@ArjunJagdale
Contributor Author

Done, @lhoestq! I've updated the docstring examples for the following methods to clarify that they return new datasets instead of modifying in place:

  • select_columns
  • select
  • filter
  • shard
  • flatten

@ArjunJagdale
Contributor Author

Also, any suggestions on what kind of issues I should work on next? I tried looking on my own, but I’d be happy if you could assign me something — I’ll do my best!

@ArjunJagdale
Contributor Author

Hi! Any update on this PR?

Member

@lhoestq lhoestq left a comment


thanks for updating the other ones as well :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq merged commit 7af7ace into huggingface:main Jul 17, 2025
@lhoestq
Member

lhoestq commented Jul 17, 2025

Also, any suggestions on what kind of issues I should work on next? I tried looking on my own, but I’d be happy if you could assign me something — I’ll do my best!

Hmm. One long-lasting issue is the one about being able to download only one split of a dataset (currently load_dataset() downloads all the splits, even when only one of train/test/validation is passed with load_dataset(..., split=split)).

This makes some downloads pretty long. I remember Mario started to work on this in this PR but couldn't finish it: #6832

I think it would be a challenging but pretty impactful addition, and feel free to ping me if you have questions or if I can help. You can also take a look at Mario's first PR which was already in an advanced state.

Let me know if it sounds like the kind of contribution you're looking for :)
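The gist of the feature can be sketched in plain Python: given a mapping of split names to download URLs, only the requested split's files would be fetched. (The function and variable names here are hypothetical, not the real datasets internals.)

```python
from typing import Optional


def filter_splits(split_urls: dict, requested: Optional[str]) -> dict:
    """Keep only the split the user asked for; None means all splits."""
    if requested is None:
        return split_urls
    if requested not in split_urls:
        raise ValueError(
            f"Unknown split {requested!r}; available: {sorted(split_urls)}"
        )
    return {requested: split_urls[requested]}


urls = {"train": "train.csv", "test": "test.csv", "validation": "val.csv"}
print(filter_splits(urls, "train"))  # {'train': 'train.csv'}
```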

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 17, 2025

Hi @lhoestq, thanks for the thoughtful suggestion!

The issue you mentioned sounds like a meaningful problem to tackle, and I’d love to take a closer look at it. I’ll start by reviewing Mario’s PR (#6832), understand what was implemented so far, and what remains to be done.

If I have any questions or run into anything unclear, I’ll be sure to reach out.

I plan to give this a solid try. Thanks again — contributing to Hugging Face is something I truly hope to grow into.


Once again, the main issue is to:

Allow users to download only the requested split(s) in load_dataset(...), avoiding unnecessary processing/downloading of the full dataset (especially important for large datasets like svhn, squad, glue).

Right?

Also, I have gone through some related/mentioned issues and PRs.


If I am not wrong, #2249 had some limitations:

  • It only worked for some dataset scripts where the download dict had split names as keys (like natural_questions).

  • It would fail or cause confusing behavior on datasets with:
    1. Custom download keys (TRAIN_DOWNLOAD_URL, val_nyt, metadata)
    2. Files passed one by one to dl_manager.download(), not as a dict

  • It reused DownloadConfig, which blurred the separation between cached_path, DownloadManager, and dataset logic.

  • It needed to modify each dataset's _split_generators() to fully support split filtering.

  • It risked partial or inconsistent caching if the logic wasn’t tight.
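To illustrate the first limitation: split-based filtering is only straightforward when the download dict is keyed by split names; with custom keys there is no generic way to tell which file belongs to which split. (The dict contents and helper below are hypothetical, not real dataset scripts.)

```python
# Case 1: keys are split names -> filtering by split is trivial
keyed_by_split = {"train": "t.jsonl", "validation": "v.jsonl"}

# Case 2: custom keys -> no reliable split-to-file mapping
custom_keys = {"TRAIN_DOWNLOAD_URL": "t.jsonl", "val_nyt": "v.jsonl", "metadata": "m.json"}


def splits_coverable(download_dict, known_splits=("train", "test", "validation")):
    """A split can be filtered generically only if every key is a known split name."""
    return all(key in known_splits for key in download_dict)


print(splits_coverable(keyed_by_split))  # True
print(splits_coverable(custom_keys))     # False
```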

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 28, 2025


Also, #7706 (comment) is handling this now.

Development

Successfully merging this pull request may close these issues.

Code example for dataset.add_column() does not reflect correct way to use function
3 participants