Add RoBERTa FP8 support with refactoring #72


Merged: 23 commits into foundation-model-stack:main, Jul 3, 2025

Conversation

andrea-fasoli (Collaborator):

This PR introduces launch scripts and utility functions for FP8 encoder inference: RoBERTa is supported for the QuestionAnswering and MaskedLM architectures and tasks.

The PR incorporates several aspects of the ongoing refactoring process for INT8 LLMs from #34 (including addressing the issues raised during that review), but limits updates to encoder models (i.e., no changes to LLMs or to inference.py) and focuses on FP8 quantization instead. It assumes FP8 support has been incorporated in FMS and FMS-MO (an ongoing effort).

The main launch script for encoders is scripts/encoders_inference.py.
Utilities for the various setup steps and for inference are listed below (an illustrative sketch of how they fit together follows the list):

aiu_setup               sets up AIU environment variables
args_parsing            defines script arguments across all model configurations
encoders_utils          classes to run the QA and MaskedLM tasks
model_setup             defines model dtype, device, and distributed strategy
quantization_setup      imports FMS-MO addons and defines linear_config if needed
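
For illustration, a minimal self-contained sketch of how these pieces could fit together; every flag, default, and config key below is an assumption for illustration, not the actual API introduced by this PR:

```python
# Illustrative sketch only: mirrors the module responsibilities listed
# above without importing them. Flags, defaults, and the linear_config
# contents are assumptions, not the PR's actual API.
import argparse


def parse_args() -> argparse.Namespace:
    # args_parsing: script arguments shared across model configurations
    p = argparse.ArgumentParser(description="FP8 encoder inference (sketch)")
    p.add_argument("--architecture", default="roberta_question_answering")
    p.add_argument("--quantization", default=None, choices=["fp8"])
    p.add_argument("--device_type", default="cpu", choices=["cpu", "aiu"])
    return p.parse_args()


def get_linear_config(args: argparse.Namespace):
    # quantization_setup: define linear_config only when quantizing;
    # with FP8, the FMS-MO addons would register the quantized linear type.
    if args.quantization == "fp8":
        return {"linear_type": "fp8"}  # illustrative shape, not real keys
    return None


def main() -> None:
    args = parse_args()
    linear_config = get_linear_config(args)
    # model_setup and encoders_utils would then pick dtype, device, and
    # distributed strategy, and run the QA or MaskedLM task.
    print(args.architecture, args.device_type, linear_config)


if __name__ == "__main__":
    main()
```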

andrea-fasoli (Collaborator, Author):

cc: @ani300 @JRosenkranz

ani300 requested review from ani300 and JRosenkranz on July 1, 2025.
Comment on lines 60 to 61
if local_size < 0:
    local_size = world_size
Contributor:

we'll eventually need local_rank (when multi-aiu moves to multi-node multi-aiu)

andrea-fasoli (Collaborator, Author):

Good point to keep in mind. Should we add it back at that time, or now?

Contributor:

Generally, torchrun already provides it, so we might as well keep it even if it's the same as RANK for now. Some models/algorithms might even expect to use it instead of the global rank, so it's good to have both.

andrea-fasoli (Collaborator, Author):

OK, local_rank and local_size are not in use right now, but we keep them in the setup for future needs.
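
For reference, a minimal sketch of the env-var handling discussed here; torchrun does set RANK, WORLD_SIZE, LOCAL_RANK, and LOCAL_WORLD_SIZE, but the function name and fallbacks below are illustrative, not the PR's actual aiu_setup code:

```python
import os


def dist_env() -> tuple[int, int, int, int]:
    # torchrun exports all four variables; the fallbacks keep plain
    # single-process runs (no torchrun) working.
    rank = int(os.getenv("RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    # On a single node these match the global values; on multi-node
    # multi-AIU they identify the process within its own node.
    local_rank = int(os.getenv("LOCAL_RANK", str(rank)))
    local_size = int(os.getenv("LOCAL_WORLD_SIZE", str(world_size)))
    return rank, world_size, local_rank, local_size
```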

max_prompt_length = (
    args.max_prompt_length
    if args.max_prompt_length is not None
    else 384
)
Contributor:

should we get this from the model itself?

andrea-fasoli (Collaborator, Author):

I don't like the hardcoding either, but I think the 384 limit is a task-related (QA) default.

Contributor:

Yeah, I was looking for that in the config.json for RoBERTa, but the limit there is 514. Let's add a comment that this limit is QA-related.
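
For illustration, the resolved snippet could read as follows (a sketch of the agreed comment, not necessarily the merged code):

```python
# 384 is the conventional max sequence length for SQuAD-style QA
# fine-tuning; it is a task default, not the model's positional limit
# (RoBERTa's config.json reports 514).
max_prompt_length = (
    args.max_prompt_length
    if args.max_prompt_length is not None
    else 384
)
```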

ani300 (Contributor) left a review:

I have left a few comments, but I'm optimistic about merging this tomorrow

ani300 (Contributor) commented on Jul 1, 2025:

Can we also remove the current roberta example script in FMS? I think this supersedes it.

andrea-fasoli (Collaborator, Author):

do we like encoders_inference.py as a name for the entry point script?

andrea-fasoli (Collaborator, Author) commented on Jul 1, 2025:

The pre- and post-processing functions of the QA task are based on the torch QA example. I am sure they can be streamlined (as a non-urgent item). If there's any concern with this, let me know.

andrea-fasoli (Collaborator, Author):

do we need any updates to tests/models/test_encoder.py?

ani300 (Contributor) commented on Jul 2, 2025:

> do we like encoders_inference.py as a name for the entry point script?

I like run_encoders.py better, and it matches test_encoders

> the pre- and post-processing functions of the QA task are based on the torch QA example. I am sure they can be streamlined (as a non-urgent item). If there's any concern with this, let me know.

Preprocessing needs to run on every process if multi-AIU, but post-processing should only run on a single rank (unless we want to ensure all ranks get the same output?). Maybe keep it as is, since that makes it easier to debug. A sketch of the rank split is below, after this comment.

> do we need any updates to tests/models/test_encoder.py?

I think the tests already do what they need, which is to ensure the model outputs stay consistent over time using the model signatures framework (which comes from FMS). I don't know if aiu-fms-testing-utils is the place to have validation/e2e tests for specific tasks; @dpatel-ops, what's your opinion?
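
As mentioned above, a minimal sketch of the rank split (preprocess everywhere, post-process on rank 0 only); preprocess and postprocess are placeholders for the QA helpers, and the sketch assumes the process group, if any, is already initialized:

```python
import torch.distributed as dist


def preprocess(examples):
    # Placeholder for QA feature extraction; must be identical on all ranks.
    return examples


def postprocess(outputs):
    # Placeholder for answer extraction from model outputs.
    return outputs


def run_qa(examples, model):
    features = preprocess(examples)   # every rank builds the same inputs
    outputs = model(features)         # collectives happen inside the model
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        return postprocess(outputs)   # single-rank post-processing
    return None
```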

ani300 (Contributor) left a review:

lgtm!

ani300 merged commit 8d703be into foundation-model-stack:main on Jul 3, 2025. 1 check passed.