Fix misleading add_column() usage example in docstring #7648

Merged
merged 4 commits into from
Jul 17, 2025

Conversation

ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jun 27, 2025

Fixes #7611

This PR fixes the usage example in the Dataset.add_column() docstring, which previously implied that add_column() modifies the dataset in-place.

Why:
The method returns a new dataset with the additional column, and users must assign the result to a variable to preserve the change.

This should make the behavior clearer for users.
@lhoestq @davanstrien

@lhoestq
Member

lhoestq commented Jul 7, 2025

I believe there are other occurrences of cases like this, like select_columns, select, filter, shard and flatten. Could you also fix the docstrings for them as well before we merge?

Fix misleading docstring examples for select_columns, select, filter, shard, and flatten

- Updated usage examples to show correct behavior (methods return new datasets)
- Added inline comments to clarify that methods do not modify in-place
- Follow-up to issue huggingface#7611 and @lhoestq’s review on PR huggingface#7648
@ArjunJagdale
Contributor Author

Done, @lhoestq! I've updated the docstring examples for the following methods to clarify that they return new datasets instead of modifying in place:

  • select_columns
  • select
  • filter
  • shard
  • flatten

@ArjunJagdale
Contributor Author

Also, any suggestions on what kind of issues I should work on next? I tried looking on my own, but I’d be happy if you could assign me something — I’ll do my best!

@ArjunJagdale
Contributor Author

Hi! Any update on this PR?

Member

@lhoestq lhoestq left a comment


thanks for updating the other ones as well :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq merged commit 7af7ace into huggingface:main Jul 17, 2025
@lhoestq
Member

lhoestq commented Jul 17, 2025

Also, any suggestions on what kind of issues I should work on next? I tried looking on my own, but I’d be happy if you could assign me something — I’ll do my best!

Hmm. One long-lasting issue is the one about being able to download only one split of a dataset (currently load_dataset() downloads all the splits, even when only one of train/test/validation is passed with load_dataset(..., split=split)).

This makes some downloads pretty long. I remember Mario started to work on this in this PR but couldn't finish it: #6832

I think it would be a challenging but pretty impactful addition, and feel free to ping me if you have questions or if I can help. You can also take a look at Mario's first PR which was already in an advanced state.

Let me know if it sounds like the kind of contribution you're looking for :)
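The gist of the feature can be sketched in plain Python: given a mapping of split names to download URLs, only the requested split's files would be fetched. (The function and variable names here are hypothetical, not the real datasets internals.)

```python
from typing import Optional


def filter_splits(split_urls: dict, requested: Optional[str]) -> dict:
    """Keep only the split the user asked for; None means all splits."""
    if requested is None:
        return split_urls
    if requested not in split_urls:
        raise ValueError(
            f"Unknown split {requested!r}; available: {sorted(split_urls)}"
        )
    return {requested: split_urls[requested]}


urls = {"train": "train.csv", "test": "test.csv", "validation": "val.csv"}
print(filter_splits(urls, "train"))  # {'train': 'train.csv'}
```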

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 17, 2025

Hi @lhoestq, thanks for the thoughtful suggestion!

The issue you mentioned sounds like a meaningful problem to tackle, and I’d love to take a closer look at it. I’ll start by reviewing Mario’s PR (#6832), understand what was implemented so far, and what remains to be done.

If I have any questions or run into anything unclear, I’ll be sure to reach out.

I plan to give this a solid try. Thanks again — contributing to Hugging Face is something I truly hope to grow into.


Once again, the main issue is to:

Allow users to download only the requested split(s) in load_dataset(...), avoiding unnecessary processing/downloading of the full dataset (especially important for large datasets like svhn, squad, glue).

Right?

Also, I have gone through some related/mentioned issues and PRs.


If I am not wrong, #2249 had some limitations:

  • It only worked for some dataset scripts where the download dict had split names as keys (like natural_questions).

  • It would fail or cause confusing behavior on datasets with:
    1. Custom download keys (TRAIN_DOWNLOAD_URL, val_nyt, metadata)
    2. Files passed one by one to dl_manager.download(), not as a dict

  • It reused DownloadConfig, which blurred the separation between cached_path, DownloadManager, and dataset logic.

  • It needed to modify each dataset's _split_generators() to fully support split filtering.

  • It risked partial or inconsistent caching if the logic wasn’t tight.
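To illustrate the first limitation: split-based filtering is only straightforward when the download dict is keyed by split names; with custom keys there is no generic way to tell which file belongs to which split. (The dict contents and helper below are hypothetical, not real dataset scripts.)

```python
# Case 1: keys are split names -> filtering by split is trivial
keyed_by_split = {"train": "t.jsonl", "validation": "v.jsonl"}

# Case 2: custom keys -> no reliable split-to-file mapping
custom_keys = {"TRAIN_DOWNLOAD_URL": "t.jsonl", "val_nyt": "v.jsonl", "metadata": "m.json"}


def splits_coverable(download_dict, known_splits=("train", "test", "validation")):
    """A split can be filtered generically only if every key is a known split name."""
    return all(key in known_splits for key in download_dict)


print(splits_coverable(keyed_by_split))  # True
print(splits_coverable(custom_keys))     # False
```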

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 28, 2025


Also, #7706 (comment) is handling this now.

Development

Successfully merging this pull request may close these issues.

Code example for dataset.add_column() does not reflect correct way to use function
3 participants