feat: Add documentation on data config selection #567
base: main
Signed-off-by: Akash-Nayak <akash19nayak@gmail.com>
6dcbeb5 to 1e977a5
ashokponkumar
left a comment
The indentation seems to be off in most of the YAML snippets. Can we please check and fix it?
docs/data-config-selection.md
Outdated
- name: dataset_1
# sampling: 1.0
data_paths:
- "tests/artifacts/jsonl/twitter_complaints_small.jsonl"
# Either the below data_handlers section can be used or the dataset_text_field in the tuning config can be used for specifying the field in the dataset that contains the training text for EPT.
# In this sample ept_data, "output" field contains the text for training. Please change it according to your data.
# If your data is already tokenized data, then comment the data handlers section
data_handlers:
- name: tokenize
arguments:
remove_columns: all
batched: false
fn_kwargs:
text_column_name: "output"
max_length: 4096
The indentation seems to be off.
I have corrected the indentation.
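For readers following the thread, here is a sketch of how the quoted fragment might nest once it is properly indented. The page does not preserve the original whitespace, so the nesting is an assumption inferred from the key order alone, and it is not necessarily the layout that was merged. The parent key that holds this list entry lies outside the quoted excerpt and is left out.

```yaml
# Assumed nesting, inferred from the key order in the quoted diff.
- name: dataset_1
  # sampling: 1.0
  data_paths:
    - "tests/artifacts/jsonl/twitter_complaints_small.jsonl"
  # Use either this data_handlers section or dataset_text_field in the
  # tuning config to name the field that holds the training text for EPT.
  # Here the "output" field holds the text; change it to match your data.
  # If the data is already tokenized, comment out the data_handlers section.
  data_handlers:
    - name: tokenize
      arguments:
        remove_columns: all
        batched: false
        fn_kwargs:
          text_column_name: "output"
          max_length: 4096
```

Each list item's keys are indented under its dash, which is the kind of alignment the review comments above are asking for.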
Signed-off-by: Akash-Nayak <akash19nayak@gmail.com>
dushyantbehl
left a comment
LGTM
Description of the change
This PR adds documentation providing guidance on selecting the appropriate data config based on the format of the training data.