
Refactor inference.py for LLM and RoBERTa support #34


Draft
wants to merge 30 commits into main

Conversation

andrea-fasoli
Collaborator

@andrea-fasoli andrea-fasoli commented Apr 23, 2025

This PR implements a substantial refactoring of inference.py, which becomes the single entry point for LLMs and RoBERTa models. Support covers non-quantized, GPTQ W4A16, and INT8 models.

The inference.py code has been streamlined. It is now structured into the following sections:

args_parsing            define script arguments across all model configurations
aiu_setup               set up AIU environment variables
model_setup             define model dtype, device, and distributed strategy
quantization_setup      import FMS-MO addons and define linear_config for FMS
direct_quantization     quantize a non-quantized model to INT8 (WIP)
decoders                run token generation task with LLMs
encoders                run QA or MLM task with RoBERTa

Extensive code validation is needed prior to merging.

Signed-off-by: Andrea Fasoli <andrea.fasoli@ibm.com>
@andrea-fasoli
Collaborator Author

cc for review: @ani300 @JRosenkranz

@@ -57,6 +59,8 @@ def aiu_setup(rank=0, world_size=1, local_rank=0, local_size=1, verbose=False):
def aiu_dist_setup(rank, world_size, local_rank=-0, local_size=-1, verbose=False):
if local_rank < 0:
local_rank = rank

# FIXME: local_size not in use ?
Contributor

maybe it's no longer needed? if you can't find any reference to it feel free to delete

Comment on lines 82 to 106
_target_cache_size = max(
int(args.max_new_tokens * 2),
int(args.min_pad_length * 2.5),
int(args.fixed_prompt_length * 2.5),
)
_prompt_size = max(int(args.min_pad_length), int(args.fixed_prompt_length))
if hasattr(torch._dynamo.config, "accumulated_cache_size_limit"):
if _target_cache_size > torch._dynamo.config.accumulated_cache_size_limit:
_prev = torch._dynamo.config.accumulated_cache_size_limit
torch._dynamo.config.accumulated_cache_size_limit = _target_cache_size
dprint(
"NOTICE: Adjusting torch._dynamo.config.accumulated_cache_size_limit "
f"from {_prev} to {torch._dynamo.config.accumulated_cache_size_limit} "
f"to accomodate prompt size of {_prompt_size} and decode tokens of "
f"{args.max_new_tokens}"
)

if _target_cache_size > torch._dynamo.config.cache_size_limit:
_prev = torch._dynamo.config.cache_size_limit
torch._dynamo.config.cache_size_limit = _target_cache_size
dprint(
f"NOTICE: Adjusting torch._dynamo.config.cache_size_limit from {_prev} to "
f"{torch._dynamo.config.cache_size_limit} to accomodate prompt size of "
f"{_prompt_size} and decode tokens of {args.max_new_tokens}"
)
Contributor

this is only needed if compile_dynamic is disabled, can we gate it?
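
A minimal sketch of how that gating might look, assuming the same `args` fields used in the snippet above (`compile_dynamic`, `max_new_tokens`, `min_pad_length`, `fixed_prompt_length`):

```python
import torch

# Sketch only: skip the dynamo cache-limit bump entirely when dynamic compilation is on,
# since static shapes are what cause one graph per sequence length to accumulate.
def maybe_raise_dynamo_cache_limits(args) -> None:
    if args.compile_dynamic:
        return
    target = max(
        int(args.max_new_tokens * 2),
        int(args.min_pad_length * 2.5),
        int(args.fixed_prompt_length * 2.5),
    )
    if target > torch._dynamo.config.cache_size_limit:
        torch._dynamo.config.cache_size_limit = target
    if hasattr(torch._dynamo.config, "accumulated_cache_size_limit"):
        if target > torch._dynamo.config.accumulated_cache_size_limit:
            torch._dynamo.config.accumulated_cache_size_limit = target
```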

Comment on lines 114 to 117
os.environ.setdefault("SENCORES", "32")
os.environ.setdefault("SENCORELETS", "2")
os.environ.setdefault("DATA_PREC", "fp16")
os.environ.setdefault("FLEX_OVERWRITE_NMB_FRAME", "1")
Contributor

I think some of these are already set by default on the e2e_stable image, can we check and remove the ones we don't need anymore?

Contributor

confirmed these are no longer needed

Collaborator Author

is os.environ.setdefault("DTCOMPILER_KEEP_EXPORT", "true") still needed or not? It's the env var that was set after these ones

os.environ.setdefault("FLEX_OVERWRITE_NMB_FRAME", "1")
os.environ.setdefault("DTCOMPILER_KEEP_EXPORT", "true")

os.environ.setdefault("COMPILATION_MODE", "offline_decoder")
Contributor

this one is only needed for decoder models, for roberta it will probably make it not work

Contributor

confirmed it will not work for roberta, we need to set it depending on the kind of model
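
A sketch of what that gating could look like (the decoder value comes from the snippet above; whether encoders need a different value, or none at all, is left open):

```python
import os

def set_compilation_mode(is_encoder: bool) -> None:
    # Decoders keep the existing value; for encoders (RoBERTa) we simply don't set it here.
    if not is_encoder:
        os.environ.setdefault("COMPILATION_MODE", "offline_decoder")
```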

print("must set AIU_WORLD_RANK_0")
exit()
os.environ.setdefault("FLEX_COMPUTE", "SENTIENT")
os.environ.setdefault("FLEX_DEVICE", "VFIO")
Contributor

I think VFIO is now PF or VF, depending on the AIU setup; we can set PF by default, as most cards are running in PF mode

Contributor

confirmed it is now PF for all clusters we have access to and will eventually be VF

Collaborator Author

I set it to PF but do we need an argument for this?
One using: choices=["VF", "PF"], default="PF"
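
Something along these lines, perhaps (the flag name is only a suggestion; the choices and default are the ones mentioned above):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--flex_device",
    type=str,
    choices=["PF", "VF"],
    default="PF",
    help="Value to export as FLEX_DEVICE for the AIU setup",
)
```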

action="store_true",
help=(
"If set to True, this will unfuse any fused weight modules that "
"support the unfuse_weights method"
Contributor

this comment can be upgraded to "If set to True, this will unfuse any fused weights in the model" as the way it's done doesn't involve "unfuse_weights" anymore

"--seed",
type=int,
default=81072,
help="Run seed (only needed if eval dataset is shuffled)",
Contributor

run seed can also be relevant for randomly initialized models, which we sometimes use

parser.add_argument(
"--deterministic",
action="store_true",
help="`deterministic` requires env variable `CUBLAS_WORKSPACE_CONFIG=:4096:8`",
Contributor

this only applies to cpu/cuda; on aiu it doesn't change anything
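
A possible way to gate it, as a sketch; `device_type` is a placeholder for however the script ends up tracking the target device:

```python
import torch

def maybe_enable_determinism(deterministic: bool, device_type: str) -> None:
    # Deterministic algorithms only matter on cpu/cuda; on AIU this is intentionally a no-op.
    if deterministic and device_type in ("cpu", "cuda"):
        torch.use_deterministic_algorithms(True)
```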

'-v', '--verbose',
action='count',
default=0,
help="Set verbosity level (pass flag as `-v`, `-vv`, `-vvv`)"
Contributor

where is this used?

Collaborator Author

mostly in the quantization functions of int8 roberta for now (printing out model parameters if so desired - it's crucial for debugging), but I suppose it could be a useful flag for decoders too. I am not using the count functionality at this time; it could be just a True/False flag.

Comment on lines 294 to 311
parser.add_argument(
"--max_seq_length",
type=int,
default=384,
help=(
"The maximum total input sequence length after tokenization. "
"Sequences longer than this will be truncated, "
"sequences shorter will be padded if `--pad_to_max_length` is passed."
),
)
parser.add_argument(
"--pad_to_max_length",
action="store_true",
help=(
"If passed, pad all samples to `max_seq_length`. "
"Otherwise, dynamic padding is used."
),
)
Contributor

Isn't this repeated, or at least very similar to the decoder arguments? It might be worth using argument groups (https://docs.python.org/3/library/argparse.html#argument-groups) and making the RoBERTa- and decoder-specific arguments mutually exclusive based on what you pass to some other argument.
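
For illustration, a small sketch of the argument-group idea (flag names follow the ones already in the script; defaults are illustrative). Note that argument groups only organize the --help output, so mutual exclusivity would still need explicit validation:

```python
import argparse

parser = argparse.ArgumentParser()

args_decoder = parser.add_argument_group("decoder", "LLM generation arguments")
args_decoder.add_argument("--max_new_tokens", type=int, default=100)
args_decoder.add_argument("--min_pad_length", type=int, default=0)

args_encoder = parser.add_argument_group("encoder", "RoBERTa QA/MLM arguments")
args_encoder.add_argument("--max_seq_length", type=int, default=384)
args_encoder.add_argument("--pad_to_max_length", action="store_true")
```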

Collaborator Author

there is some overlap indeed.
I combined max_seq_length (enc) with max_prompt_len (dec), as they share the same meaning (although the default values differed).
pad_to_max_length (enc) conceptually overlaps with min_pad_length and fixed_prompt_length (dec), but the first is a boolean while the others are int. I couldn't find a clean way to implement a 3-way exclusivity but I'll add some argument validation at the time of loading encoder vs. decoder.

args = parser.parse_args()

# Add convenient arguments to parser
args.is_encoder = "bert" in args.architecture.lower() # TODO: improve this check
Contributor

maybe add a real "is_encoder" argument; that would also help with the mutually exclusive argument groups
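
For example, something like this instead of the substring check (sketch only):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--is_encoder",
    action="store_true",
    help="Treat the architecture as an encoder (e.g., RoBERTa) rather than a decoder LLM",
)
```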

import os

# Third Party
from transformers import PreTrainedModel
Contributor

make this import gated and fail only if args.is_quantized is True in inference.py
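
A sketch of how the gated import might look (the check function name and error message are placeholders):

```python
# Defer the hard failure until quantization is actually requested.
try:
    from transformers import PreTrainedModel  # noqa: F401
except ImportError:
    PreTrainedModel = None

def check_quantization_requirements(args) -> None:
    if args.is_quantized and PreTrainedModel is None:
        raise ImportError("transformers must be installed to run quantized models")
```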


if not args.compile_dynamic:
torch._dynamo.config.assume_static_by_default = True
torch._dynamo.config.dynamic_shapes = False
Contributor

this is now deprecated in pytorch, only assume and automatic are needed
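
If "assume and automatic" refers to the two current torch._dynamo.config knobs, the replacement would look roughly like this (a sketch, not verified against the PR):

```python
import torch

def force_static_shapes() -> None:
    # torch._dynamo.config.dynamic_shapes is deprecated; these two settings cover it.
    torch._dynamo.config.assume_static_by_default = True
    torch._dynamo.config.automatic_dynamic_shapes = False
```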

parser.add_argument(
"--quantization",
type=str,
choices=["gptq", "int8"],
Contributor

please add a TODO to add FP8 inference once that lands too

help="Enable smoothquant in INT8 quantized model",
)
parser.add_argument( # NOTE: roberta only so far but should expand to LLM
"--direct_quantization",
Contributor

add the int8 prefix

help="Train INT8 model with Direct Quantization",
)
parser.add_argument(
"--num_dq_samples",
Contributor

add the int8 prefix

"--compile_dynamic",
action="store_true",
help="Use dynamic shapes with torch.compile",
)
Contributor

there is a new --compile_dynamic_sendnn that needs to be added here

Contributor

Now that we are cleaning up the args, I wonder if it would make sense to combine compile_dynamic and compile_dynamic_sendnn. What would happen if a user does compile_dynamic_sendnn and not compile_dynamic? We may want to make this something like --compile_dynamic=<static_inputs, symbolic_inputs>
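
A sketch of the combined flag floated here; the choice names come from the comment above and the default is only a guess:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--compile_dynamic",
    type=str,
    choices=["static_inputs", "symbolic_inputs"],
    default="static_inputs",
    help="Shape handling for torch.compile (static_inputs or symbolic_inputs)",
)
```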

),
)

# RoBERTa-specific evaluation arguments
Contributor

should we add a prefix to mark these as encoder-specific?

Comment on lines 25 to 26
model: PreTrainedModel,
tokenizer: PreTrainedTokenizerBase,
Contributor

These are the wrong types.

Contributor

@ani300 ani300 left a comment

There are plenty of things to change, and I'd like to think more on the general architecture... Given how different the parameters and general flow are for encoder tasks and decoders, it might be worth splitting inference.py into encoders.py and decoders.py. Most of the code can be reused anyway if the arguments are grouped using the argparse API for this and each script then just picks the relevant groups.

Of course, documentation (README.md) also needs to be updated for this new structure.

Signed-off-by: Andrea Fasoli <andrea.fasoli@ibm.com>
help="A csv or a json file containing the validation data.",
)
args_encoder.add_argument(
"--pad_to_max_length",
Contributor

should we use min_pad_length here? If the min_pad_length is not specified, then it will implicitly pad to max length. If min_pad_length is specified and is larger than the largest sequence, this will add extra pads (if we want to simulate a different sequence length)

Collaborator Author

I put some thought into this and it is surprisingly tricky, because for encoders the tokenization is performed under the hood of a transformers PreTrainedTokenizer, which does not handle truncation and padding the same way as FMS (i.e., with explicit calls to the FMS truncate_prompts_to_max_length and pad_input_ids).

PreTrainedTokenizer receives a max_length argument for truncation AND padding, and a padding argument which can be True (pad to the longest sequence in the batch), False (do not pad), or the string "max_length" (which will pad to the max_length argument).

To make it behave like our decoder tokenization, we would need to adjust max_length and padding based on the tokenized sequence length... which is not known yet at the time of the PreTrainedTokenizer call.

We could eventually change the whole tokenization and feature preparation process (which right now is mostly based on a pytorch example and would be nice to rework), but for the time being I would keep the pad_to_max_length argument for encoders only.
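
For reference, a sketch of the transformers call being described; the feature columns ("question"/"context") are placeholders from a typical QA preprocessing setup:

```python
def tokenize_eval_features(tokenizer, examples, args):
    # padding=True pads to the longest sequence in the batch, "max_length" pads to
    # max_length, and False disables padding; truncation caps sequences at max_length.
    return tokenizer(
        examples["question"],
        examples["context"],
        truncation=True,
        max_length=args.max_seq_length,
        padding="max_length" if args.pad_to_max_length else False,
    )
```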


tokens = self.tokenizer.tokenize(prompt)
ids = self.tokenizer.convert_tokens_to_ids(tokens)
if self.add_special_tokens:
Contributor

Is this handling the case where tokenizer.bos_token_id != tokenizer.eos_token_id?

Collaborator Author

I reproduced this from inference.py. The only difference is where self.add_special_tokens is defined. Now it's updated at the beginning of process_eval_set (before any ids_for_prompt call) as:

        self.add_special_tokens = (
            self.tokenizer.bos_token_id != self.tokenizer.eos_token_id
        )

Should be correct.

f"Architecture {args.architecture} should be run as an encoder model."
)

def ids_for_prompt(self, prompt):
Contributor

We may be able to reuse the function ids_for_prompt in utils/__init__.py

Collaborator Author

good catch, it's the same function. It is also duplicated in the current inference.py...

Collaborator Author

On second thought, this function may belong to the DecoderInfer class, as the encoders use a different approach. I could remove the duplicate from utils/__init__.py

Collaborator Author

never mind, ids_for_prompt in utils/__init__.py is also used to generate ids for testing. inference.py and validation.py duplicate this function, though. We should sort this out.

raise ValueError(
"Running encoder model but is_encoder argument is either not set or False."
)
if args.min_pad_length != 0:
Contributor

can we use the same arguments for this as decoder rather than introducing pad_to_max_length?

Collaborator Author

see answer above regarding tokenization in encoders

Signed-off-by: Andrea Fasoli <andrea.fasoli@ibm.com>