-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Fix misleading add_column() usage example in docstring #7648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR fixes the usage example in the Dataset.add_column() docstring, which previously implied that add_column() modifies the dataset in-place. Why: The method returns a new dataset with the additional column, and users must assign the result to a variable to preserve the change. Fixes huggingface#7611
I believe there are other occurences of cases like this, like select_columns, select, filter, shard and flatten, could you also fix the docstring for them as well before we merge ? |
… shard, and flatten Fix misleading docstring examples for select_columns, select, filter, shard, and flatten - Updated usage examples to show correct behavior (methods return new datasets) - Added inline comments to clarify that methods do not modify in-place - Fixes follow-up from issue huggingface#7611 and @lhoestq’s review on PR huggingface#7648
Done! @lhoestq! I've updated the docstring examples for the following methods to clarify that they return new datasets instead of modifying in-place:
|
Also, any suggestions on what kind of issues I should work on next? I tried looking on my own, but I’d be happy if you could assign me something — I’ll do my best! |
Hi! any update on this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for updating the other ones as well :)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Hmm. One long lasting issue is the one about being able to download only one split of a dataset (currently This makes some downloads pretty long, I remember Mario started to work on this in this PR but couldn't finish it: #6832 I think it would be a challenging but pretty impactful addition, and feel free to ping me if you have questions or if I can help. You can also take a look at Mario's first PR which was already in an advanced state. Let me know if it sounds like the kind of contribution you're looking for :) |
Hi @lhoestq, thanks for the thoughtful suggestion! The issue you mentioned sounds like a meaningful problem to tackle, and I’d love to take a closer look at it. I’ll start by reviewing Mario’s PR (#6832), understand what was implemented so far, and what remains to be done. If I have any questions or run into anything unclear, I’ll be sure to reach out. I plan to give this a solid try. Thanks again — contributing to Hugging Face is something I truly hope to grow into. Once again the the main Issue is to -
right? Also I have gone through some related / mentioned issues and PRs -
IF I am not wrong, #2249 had some limitations -
|
Also This one is in the charge now - #7706 (comment) |
Fixes #7611
This PR fixes the usage example in the Dataset.add_column() docstring, which previously implied that add_column() modifies the dataset in-place.
Why:
The method returns a new dataset with the additional column, and users must assign the result to a variable to preserve the change.
This should make the behavior clearer for users.
@lhoestq @davanstrien