
Conversation

bayo-ibm
Contributor

Description of the change

This PR enables fast inference loading for FP8 DQ. It also adds support for saving an fms_mo checkpoint in a format compatible with HF/vLLM, and a function that converts a previously saved fms_mo checkpoint directly to the HF/vLLM checkpoint format.

Related issues or PRs

How to verify the PR

Was the PR tested

yes

  • I have added >=1 unit test(s) for every new method I have added (if that coverage is difficult, please briefly explain the reason)
  • I have ensured all unit tests pass

Checklist for passing CI/CD:

  • All commits are signed showing "Signed-off-by: Name <email@domain.com>" with git commit --signoff or equivalent
  • PR title and commit messages adhere to Conventional Commits
  • Contribution is formatted with tox -e fix
  • Contribution passes linting with tox -e lint
  • [x] Contribution passes spellcheck with tox -e spellcheck
  • Contribution passes all unit tests with tox -e unit

Note: CI/CD performs unit tests on multiple versions of Python from a fresh install. There may be differences with your local environment and the test environment.

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
@github-actions bot added the feat label Aug 20, 2025
@bayo-ibm changed the title from "feat: fast loading and savings functionality" to "feat: fast loading and saving functionality" Aug 20, 2025
Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
"Otherwise please create an equivalent Linear wrapper and change qcfg['mapping']."
)

if available_packages["compressed_tensors"]:
Collaborator

This block is incorrect: if compressed_tensors is not installed, the warning is never triggered.

Collaborator

you can incorporate the 3 nested ifs into a single check:

if (
    module.__class__ != nn.Linear
    and (
        not available_packages["compressed_tensors"]
        or not isinstance(
            module, compressed_tensors.linear.compressed_linear.CompressedLinear
        )
    )
):
    logger.warning(...)

Collaborator

if you set it up this way, add a comment to explain when this if clause is triggered, because it is not immediately obvious

Collaborator

not yet fixed

Contributor Author

fixed now

pyproject.toml Outdated
[project.optional-dependencies]
examples = ["ninja>=1.11.1.1,<2.0", "evaluate", "huggingface_hub"]
-fp8 = ["llmcompressor", "torchao==0.11"]
+fp8 = ["llmcompressor", "torchao==0.11", "compressed_tensors"]
Collaborator

we have to be careful with this because it adds another requirement to our FP8 configuration, which would only be used in the specific scenario of loading an llm_compressor model back into fms-mo for evaluation.

This is unless compressed_tensors is already a requirement of llmcompressor, in which case this additional import is not needed.

cc: @BrandonGroth @tharapalanivel

Collaborator

following up on this to confirm that compressed_tensors is already a dependency of llmcompressor:

from llm-compressor/setup.py:

install_requires=[
...
            "compressed-tensors==0.11.0"
            if BUILD_TYPE == "release"
            else "compressed-tensors>=0.11.1a2"

Contributor Author

corrected

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
raise ValueError("This quantization method is not supported for inferencing")


def load_inference_qconfig_file(model_args, fms_mo_args):
Collaborator

add type hint for return (I suppose dict, please check)

Collaborator

also add type hint for input arguments
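As an illustration of the suggestion, a minimal sketch of what the annotated signature could look like (ModelArguments and FMSMOArguments are placeholder type names, not necessarily the actual fms_mo dataclasses):

# Sketch only; the argument types below are placeholders.
def load_inference_qconfig_file(
    model_args: "ModelArguments", fms_mo_args: "FMSMOArguments"
) -> dict:
    """Load the saved inference quantization config and return it as a dict."""
    ...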

Contributor Author

done

return qcfg


def update_qcfg_from_model_config(model_args, qcfg):
Collaborator

add type hints on input arguments and return values across all functions in this file

Contributor Author

done

fms_mo/dq.py Outdated
logger.info(f"Tokenizer is {tokenizer}, block size is {block_size}")
qcfg = qconfig_init(recipe="dq", args=fms_mo_args)

quant_mode = check_quantization_setting(model)
Collaborator

quant_mode is boolean but it isn't clear from the name of this variable what its role is (to me, it implies a quantization strategy). I think do_quantization or run_quantization or (with opposite meaning) inference_only would be clearer names. Consider updating this name.

Contributor Author

changed to inference_only

Contributor Author

corrected

if quant_config["config_groups"]["group_0"]["weights"]["num_bits"] != 8:
    raise ValueError("Only 8-bit FP weight quantization is supported")

if quant_config["kv_cache_scheme"] is None:
Collaborator

replace empty if branch that uses pass with a check against is not None
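For example, a small sketch of the pattern change (check_kv_cache_scheme is a hypothetical helper, used only for illustration):

# Before: empty branch just to skip the None case
if quant_config["kv_cache_scheme"] is None:
    pass
else:
    check_kv_cache_scheme(quant_config["kv_cache_scheme"])  # hypothetical helper

# After: invert the condition and drop the empty branch
if quant_config["kv_cache_scheme"] is not None:
    check_kv_cache_scheme(quant_config["kv_cache_scheme"])  # hypothetical helper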

Contributor Author

corrected

if quant_config is None:
    return False

logger.info("Validating config settings")
Collaborator

check if quant_config dict has a quant_method key. If not, raise error.

Contributor Author

done.

Contributor Author

done


return True

raise ValueError("This quantization method is not supported for inferencing")
Collaborator

only "compressed-tensors" quant_method is supported for inference-only runs, correct? If so, update this error to be more specific about which quant_method options (a single one) are supported.
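For instance, a possible wording (sketch only; the exact message merged in the PR may differ):

raise ValueError(
    f"quant_method '{quant_config['quant_method']}' is not supported for "
    "inference-only runs; only 'compressed-tensors' checkpoints are supported."
)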

Contributor Author

corrected

Contributor Author

Done

        recipe=model_args.model_name_or_path + "/qcfg", args=fms_mo_args
    )
else:
    logger.info("qcfg file found, loading the qcfg file ")
Collaborator

I would rephrase this message as:

logger.info(f"loading quantization configuration from {model_args.model_name_or_path + '/qcfg.json'}")

Contributor Author

done

Contributor Author

done

    )
else:
    logger.info("qcfg file found, loading the qcfg file ")
    qcfg = qconfig_init(recipe=model_args.model_name_or_path + "/qcfg")
Collaborator

is it OK here if we use "/qcfg" instead of "/qcfg.json"?

Contributor Author

I don't think "qcfg" alone will work.

return qcfg


# def rename_fms_dict_to_vllm_dict (model_dict : dict= None, qcfg : dict = None):
Collaborator

remove comment

Contributor Author

done

Contributor Author

removed

if ".weight" in k:
    key = k.split("weight")[0]
    if key + "quantize_weight.scale" in keys:
        weight, scale = to_fp8_scaled_perCh(v, emulate=False)
Collaborator

if missing, add a warning (somewhere, could be outside this function) that only static per-channel weight quantization is supported for conversion from FMS-MO to vLLM at this time
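A minimal sketch of such a warning (placement and exact wording are assumptions):

logger.warning(
    "Only static per-channel weight quantization is currently supported when "
    "converting an FMS-MO checkpoint to the HF/vLLM compressed format."
)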

Contributor Author

warning added

Contributor Author

added

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
… layers

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
return False

logger.info("Validating config settings")
if "quant_method" in quant_config.keys():
Collaborator
@chichun-charlie-liu Sep 22, 2025

this nested if/raise can be further cleaned-up a bit, for example

if "quant_method" in quant_config:  # NOTE "in dict.keys()" can be simplified as "in dict" 
    if quant_config["quant_method"] == "compressed-tensors":
        if quant_config["format"] != "float-quantized":
            <do something....>
    raise Error1
raise Error2

can be rewritten as

if "quant_method" not in quant_config:
    raise Error2

if quant_config.get("quant_method", None) != "compressed-tensors":
    raise Error1

if quant_config["format"] != "float-quantized":
    <do something....>

should give us fewer indents and possibly fewer line breaks => improves readability
Also, @andrea-fasoli's note was not addressed earlier: use dict.get(xxx, default) to avoid missing keys or additional checks.

    raise ValueError("Only 8-bit FP weight quantization is supported")

if quant_config["kv_cache_scheme"] is not None:
    if quant_config["kv_cache_scheme"]["type"] is not float:
Collaborator
@chichun-charlie-liu Sep 22, 2025

we may want to allow dq to run with no KV-cache compression as well, i.e. it does not always have to be 8-bit.
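One way to express that, as a sketch (the kv_cache_scheme keys below follow the diff context above and the compressed-tensors config layout; treat them as assumptions):

kv_scheme = quant_config.get("kv_cache_scheme")
if kv_scheme is not None:
    # Only validate when KV-cache quantization is actually requested;
    # a missing/None scheme simply means the KV cache stays unquantized.
    if kv_scheme.get("type") != "float" or kv_scheme.get("num_bits") != 8:
        raise ValueError("Only FP8 KV-cache quantization is supported")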

logger.info(f"Tokenizer is {tokenizer}, block size is {block_size}")
qcfg = qconfig_init(recipe="dq", args=fms_mo_args)

inference_only = check_quantization_setting(model)
Collaborator
@chichun-charlie-liu Sep 22, 2025

"inference only" seems more appropriate as a new "mode" in run_quant.quantize() rather than part of dq; we may want to separate the "inference only" code here into an independent function/file and add it to run_quant.


def swap_qbmm(model: nn.Module, qcfg: dict):
    """Go through all model.named_modules(), try to create an equivalent
    Qbmm layer to replace each of the existing linear Bmm layers.
Collaborator
@chichun-charlie-liu Sep 22, 2025

This func name and description seem inaccurate and may cause confusion. What this func does is create and attach new Qbmm modules to any module where torch.matmul/torch.bmm is used; it has nothing to do with "swap" or "replace linear Bmm layers"...
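A possible rewording along those lines (sketch only; attach_qbmm is a hypothetical name and the final wording is up to the author):

from torch import nn

# Hypothetical rename and docstring, not the merged code.
def attach_qbmm(model: nn.Module, qcfg: dict):
    """Go through model.named_modules() and create/attach a new Qbmm module to
    every module that calls torch.matmul/torch.bmm, so those ops can be
    quantized. No existing Linear layers are swapped or replaced here.
    """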
