
Conversation

bayo-ibm
Contributor

Description of the change

This PR enables fast inference loading for FP8 DQ. It also adds support for saving an fms_mo checkpoint in a format compatible with HF/vLLM, and a function that converts a previously saved fms_mo checkpoint directly to the HF/vLLM checkpoint format.

Related issues or PRs

How to verify the PR

Was the PR tested

yes

  • I have added >=1 unit test(s) for every new method I have added (if that coverage is difficult, please briefly explain the reason)
  • I have ensured all unit tests pass

Checklist for passing CI/CD:

  • All commits are signed showing "Signed-off-by: Name <email@domain.com>" with git commit --signoff or equivalent
  • PR title and commit messages adhere to Conventional Commits
  • Contribution is formatted with tox -e fix
  • Contribution passes linting with tox -e lint
  • [x] Contribution passes spellcheck with tox -e spellcheck
  • Contribution passes all unit tests with tox -e unit

Note: CI/CD performs unit tests on multiple versions of Python from a fresh install. There may be differences with your local environment and the test environment.

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
@github-actions bot added the feat label Aug 20, 2025
@bayo-ibm changed the title from "feat: fast loading and savings functionality" to "feat: fast loading and saving functionality" Aug 20, 2025
Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
"Otherwise please create an equivalent Linear wrapper and change qcfg['mapping']."
)

if available_packages["compressed_tensors"]:
Collaborator

This block is incorrect: if compressed_tensors is not installed, the warning is never triggered.

Collaborator

you can incorporate the 3 nested ifs into a single check:

if (
    module.__class__ != nn.Linear
    and (
        not available_packages["compressed_tensors"]
        or not isinstance(
            module, compressed_tensors.linear.compressed_linear.CompressedLinear
        )
    )
):
    logger.warning(...)

Collaborator

if you set it up this way, add a comment to explain when this if clause is triggered, because it is not immediately obvious

Collaborator

not yet fixed

Contributor Author

fixed now

pyproject.toml Outdated
[project.optional-dependencies]
examples = ["ninja>=1.11.1.1,<2.0", "evaluate", "huggingface_hub"]
-fp8 = ["llmcompressor", "torchao==0.11"]
+fp8 = ["llmcompressor", "torchao==0.11", "compressed_tensors"]
Collaborator

we have to be careful with this because it adds another requirement to our FP8 configuration, which would only be used in the specific scenario of loading an llm_compressor model back into fms-mo for evaluation.

This is unless compressed_tensors is already a requirement of llmcompressor, in which case this additional import is not needed.

cc: @BrandonGroth @tharapalanivel

Collaborator

following up on this to confirm that compressed_tensors is already a dependency of llmcompressor:

from llm-compressor/setup.py:

install_requires=[
...
            "compressed-tensors==0.11.0"
            if BUILD_TYPE == "release"
            else "compressed-tensors>=0.11.1a2"

Contributor Author

corrected

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
raise ValueError("This quantization method is not supported for inferencing")


def load_inference_qconfig_file(model_args, fms_mo_args):
Collaborator

add type hint for return (I suppose dict, please check)

Collaborator

also add type hint for input arguments
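As an illustration of the suggestion, a minimal sketch of what the annotated signature could look like (ModelArguments and FMSMOArguments are placeholder type names, not necessarily the actual fms_mo dataclasses):

# Sketch only; the argument types below are placeholders.
def load_inference_qconfig_file(
    model_args: "ModelArguments", fms_mo_args: "FMSMOArguments"
) -> dict:
    """Load the saved inference quantization config and return it as a dict."""
    ...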

Contributor Author

done

return qcfg


def update_qcfg_from_model_config(model_args, qcfg):
Collaborator

add type hints on input arguments and return values across all functions in this file

Contributor Author

done

fms_mo/dq.py Outdated
logger.info(f"Tokenizer is {tokenizer}, block size is {block_size}")
qcfg = qconfig_init(recipe="dq", args=fms_mo_args)

quant_mode = check_quantization_setting(model)
Collaborator

quant_mode is boolean but it isn't clear from the name of this variable what its role is (to me, it implies a quantization strategy). I think do_quantization or run_quantization or (with opposite meaning) inference_only would be clearer names. Consider updating this name.

Contributor Author

changed to inference_only

Contributor Author

corrected

if quant_config["config_groups"]["group_0"]["weights"]["num_bits"] != 8:
    raise ValueError("Only 8-bit FP weight quantization is supported")

if quant_config["kv_cache_scheme"] is None:
Collaborator

replace empty if branch that uses pass with a check against is not None
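For example, a small sketch of the pattern change (check_kv_cache_scheme is a hypothetical helper, used only for illustration):

# Before: empty branch just to skip the None case
if quant_config["kv_cache_scheme"] is None:
    pass
else:
    check_kv_cache_scheme(quant_config["kv_cache_scheme"])  # hypothetical helper

# After: invert the condition and drop the empty branch
if quant_config["kv_cache_scheme"] is not None:
    check_kv_cache_scheme(quant_config["kv_cache_scheme"])  # hypothetical helper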

Contributor Author

corrected

if quant_config is None:
    return False

logger.info("Validating config settings")
Collaborator

check if quant_config dict has a quant_method key. If not, raise error.

Contributor Author

done.

Contributor Author

done


return True

raise ValueError("This quantization method is not supported for inferencing")
Collaborator

only "compressed-tensors" quant_method is supported for inference-only runs, correct? If so, update this error to be more specific about which quant_method options (a single one) are supported.
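For instance, a possible wording (sketch only; the exact message merged in the PR may differ):

raise ValueError(
    f"quant_method '{quant_config['quant_method']}' is not supported for "
    "inference-only runs; only 'compressed-tensors' checkpoints are supported."
)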

Contributor Author

corrected

Contributor Author

Done

        recipe=model_args.model_name_or_path + "/qcfg", args=fms_mo_args
    )
else:
    logger.info("qcfg file found, loading the qcfg file ")
Collaborator

I would rephrase this message as:

logger.info(f"loading quantization configuration from {model_args.model_name_or_path + '/qcfg.json'}")

Contributor Author

done

Contributor Author

done

    )
else:
    logger.info("qcfg file found, loading the qcfg file ")
    qcfg = qconfig_init(recipe=model_args.model_name_or_path + "/qcfg")
Collaborator

is it OK here if we use "/qcfg" instead of "/qcfg.json"?

Contributor Author

I don't think "qcfg" alone will work.

return qcfg


# def rename_fms_dict_to_vllm_dict (model_dict : dict= None, qcfg : dict = None):
Collaborator

remove comment

Contributor Author

done

Contributor Author

removed

if ".weight" in k:
    key = k.split("weight")[0]
    if key + "quantize_weight.scale" in keys:
        weight, scale = to_fp8_scaled_perCh(v, emulate=False)
Collaborator

if missing, add a warning (somewhere, could be outside this function) that only static per-channel weight quantization is supported for conversion from FMS-MO to vLLM at this time
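A minimal sketch of such a warning (placement and exact wording are assumptions):

logger.warning(
    "Only static per-channel weight quantization is currently supported when "
    "converting an FMS-MO checkpoint to the HF/vLLM compressed format."
)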

Contributor Author

warning added

Contributor Author

added

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
… layers

Signed-off-by: Omobayode Fagbohungbe <omobayode.fagbohungbe@ibm.com>
return False

logger.info("Validating config settings")
if "quant_method" in quant_config.keys():
Collaborator
@chichun-charlie-liu Sep 22, 2025

this nested if/raise can be further cleaned-up a bit, for example

if "quant_method" in quant_config:  # NOTE "in dict.keys()" can be simplified as "in dict" 
    if quant_config["quant_method"] == "compressed-tensors":
        if quant_config["format"] != "float-quantized":
            <do something....>
    raise Error1
raise Error2

can be rewritten as

if "quant_method" not in quant_config:
    raise Error2

if quant_config.get("quant_method", None) != "compressed-tensors":
    raise Error1

if quant_config["format"] != "float-quantized":
    <do something....>

should give us fewer indents and possibly fewer line breaks => improves readability
Also, @andrea-fasoli's note was not addressed earlier: use dict.get(xxx, default) to avoid missing keys or additional checks.

    raise ValueError("Only 8-bit FP weight quantization is supported")

if quant_config["kv_cache_scheme"] is not None:
    if quant_config["kv_cache_scheme"]["type"] is not float:
Collaborator
@chichun-charlie-liu Sep 22, 2025

we may want to allow dq to run with no KV-cache compression as well, i.e. it does not always have to be 8-bit.
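One way to express that, as a sketch (the kv_cache_scheme keys below follow the diff context above and the compressed-tensors config layout; treat them as assumptions):

kv_scheme = quant_config.get("kv_cache_scheme")
if kv_scheme is not None:
    # Only validate when KV-cache quantization is actually requested;
    # a missing/None scheme simply means the KV cache stays unquantized.
    if kv_scheme.get("type") != "float" or kv_scheme.get("num_bits") != 8:
        raise ValueError("Only FP8 KV-cache quantization is supported")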

logger.info(f"Tokenizer is {tokenizer}, block size is {block_size}")
qcfg = qconfig_init(recipe="dq", args=fms_mo_args)

inference_only = check_quantization_setting(model)
Collaborator
@chichun-charlie-liu Sep 22, 2025

"inference only" seems more appropriate as a new "mode" in run_quant.quantize() rather than part of dq; we may want to separate the "inference only" code here into an independent function/file and add it to run_quant.


def swap_qbmm(model: nn.Module, qcfg: dict):
    """Go through all model.named_modules(), try to create an equivalent
    Qbmm layer to replace each of the existing linear Bmm layers.
Collaborator
@chichun-charlie-liu Sep 22, 2025

This func name and description seem inaccurate and may cause confusion. What this func does is create and attach new Qbmm modules to any module where torch.matmul/torch.bmm is used; it has nothing to do with "swap" or "replace linear Bmm layers"...
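A possible rewording along those lines (sketch only; attach_qbmm is a hypothetical name and the final wording is up to the author):

from torch import nn

# Hypothetical rename and docstring, not the merged code.
def attach_qbmm(model: nn.Module, qcfg: dict):
    """Go through model.named_modules() and create/attach a new Qbmm module to
    every module that calls torch.matmul/torch.bmm, so those ops can be
    quantized. No existing Linear layers are swapped or replaced here.
    """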
