fix: improving the output of get tokenized data function for DQ #184

bayo-ibm · 2025-09-08T18:19:22Z

Description of the change

This PR improves the tokenized data function in fms_mo. Previously, the train data and test data were returned in different data formats, and the tokenized train_data was not generated through a unified approach, which left room to improve the associated code.

The training dataset and test dataset are now in the same data format.
The data structure of the training dataset was changed from a list to a dict.
The format of the test_dataset was changed from BatchEncoding to dict.
The DQ code was also edited so that it can read the test and train datasets in the new format.
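The format change above can be illustrated with a minimal sketch. This is not the code from this PR: the helper names `samples_to_dict` and `encoding_to_dict` are hypothetical, the training data is assumed to have been a list of per-sample dicts keyed by `input_ids`, and a `collections.UserDict` stands in for `BatchEncoding` (which is itself a `UserDict` subclass) so the example runs without `transformers` installed.

```python
from collections import UserDict


def samples_to_dict(samples):
    """Merge a list of per-sample mappings into one dict of lists.

    Hypothetical helper: assumes every sample is a mapping, e.g.
    {"input_ids": [...]}.
    """
    merged = {}
    for sample in samples:
        for key, value in sample.items():
            merged.setdefault(key, []).append(value)
    return merged


def encoding_to_dict(encoding):
    """Copy a mapping-like object (e.g. a BatchEncoding) into a plain dict."""
    return dict(encoding)


# Assumed "before" shapes: a list for train, a BatchEncoding-like object for test.
train_list = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}]
test_encoding = UserDict({"input_ids": [[7, 8, 9]]})

# "After" shapes: both are plain dicts with the same keys.
train_data = samples_to_dict(train_list)
test_data = encoding_to_dict(test_encoding)
```

With both datasets as plain dicts, downstream code such as the DQ path can read them through the same key-based access instead of branching on the container type.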

Related issues or PRs

#181

How to verify the PR

Was the PR tested

I have added >=1 unit test(s) for every new method I have added (if that coverage is difficult, please briefly explain the reason)
I have ensured all unit tests pass

Checklist for passing CI/CD:

All commits are signed showing "Signed-off-by: Name <email@domain.com>" with git commit --signoff or equivalent
PR title and commit messages adhere to Conventional Commits
[x] Contribution is formatted with tox -e fix
Contribution passes linting with tox -e lint
Contribution passes spellcheck with tox -e spellcheck
Contribution passes all unit tests with tox -e unit

Note: CI/CD runs the unit tests on multiple versions of Python from a fresh install. There may be differences between your local environment and the test environment.

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>

bayo-ibm added 3 commits September 3, 2025 11:55

fix: optimize the data_handling for DQ

a2000c3

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>

fix: corrected the trailing spaces

cfb9ea9

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>

fix: adding hints to the arguments and returns

a0f74da

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>

bayo-ibm changed the title ~~Fix: Improving Data Handling Capability~~ Fix: improving the output of tokenized data generation for DQ Sep 8, 2025

bayo-ibm changed the title ~~Fix: improving the output of tokenized data generation for DQ~~ fix: improving the output of tokenized data generation for DQ Sep 8, 2025

github-actions bot added the fix label Sep 8, 2025

bayo-ibm changed the title ~~fix: improving the output of tokenized data generation for DQ~~ fix: improving the output of get tokenized data output for DQ Sep 8, 2025

bayo-ibm changed the title ~~fix: improving the output of get tokenized data output for DQ~~ fix: improving the output of get tokenized data function for DQ Sep 8, 2025
