bayo-ibm (Contributor) commented Sep 8, 2025

Description of the change

This PR improves the tokenized data function in fms_mo. Previously, the training data and test data were returned in different data formats, and the tokenized train_data was not generated with a unified approach, which left room to improve the associated code.

The training and test datasets now share the same data format.
The data structure of the training dataset was changed from a list to a dict.
The format of the test_dataset was changed from BatchEncoding to dict.
The DQ code was also edited so that it can read the updated train and test datasets.
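The format change described above can be illustrated with a minimal sketch. It is not the code from this PR: the helper names `samples_to_dict` and `to_plain_dict` are hypothetical, the training data is assumed to have been a list of per-sample dicts, and `collections.UserDict` is used only as a stand-in for a mapping-like tokenizer output such as BatchEncoding.

```python
from collections import UserDict


def samples_to_dict(samples):
    """Collate a list of per-sample dicts into a single dict of lists."""
    if not samples:
        return {}
    keys = samples[0].keys()
    return {key: [sample[key] for sample in samples] for key in keys}


def to_plain_dict(encoding):
    """Copy a mapping-like object into a plain dict."""
    return dict(encoding)


# Hypothetical training samples, one dict per sample (assumed old format).
train_samples = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]},
    {"input_ids": [4, 5, 6], "attention_mask": [1, 1, 0]},
]
# Stand-in for a tokenizer output object; not the real BatchEncoding.
test_encoding = UserDict(
    {"input_ids": [[7, 8, 9]], "attention_mask": [[1, 1, 1]]}
)

train_data = samples_to_dict(train_samples)
test_data = to_plain_dict(test_encoding)

# Both datasets are now plain dicts with the same keys,
# so downstream code can read them the same way.
assert type(train_data) is dict and type(test_data) is dict
assert train_data.keys() == test_data.keys()
```

With both datasets as plain dicts keyed by the same field names, a consumer such as the DQ code can index either one by key without special-casing the container type.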

Related issues or PRs

181

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added (if that coverage is difficult, please briefly explain the reason)
  • I have ensured all unit tests pass

Checklist for passing CI/CD:

  • All commits are signed showing "Signed-off-by: Name <email@domain.com>" with git commit --signoff or equivalent
  • PR title and commit messages adhere to Conventional Commits
  • [x] Contribution is formatted with tox -e fix
  • Contribution passes linting with tox -e lint
  • Contribution passes spellcheck with tox -e spellcheck
  • Contribution passes all unit tests with tox -e unit

Note: CI/CD performs unit tests on multiple versions of Python from a fresh install. There may be differences with your local environment and the test environment.

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
@bayo-ibm changed the title from "Fix: Improving Data Handling Capability" to "Fix: improving the output of tokenized data generation for DQ" on Sep 8, 2025
@bayo-ibm changed the title from "Fix: improving the output of tokenized data generation for DQ" to "fix: improving the output of tokenized data generation for DQ" on Sep 8, 2025
github-actions bot added the fix label on Sep 8, 2025
@bayo-ibm changed the title from "fix: improving the output of tokenized data generation for DQ" to "fix: improving the output of get tokenized data output for DQ" on Sep 8, 2025
@bayo-ibm changed the title from "fix: improving the output of get tokenized data output for DQ" to "fix: improving the output of get tokenized data function for DQ" on Sep 8, 2025