
Conversation

@SunMarc (Member) commented Nov 5, 2025

What does this PR do?

This PR fixes bnb support in the new weight loading logic.

Testing

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.2-3B-Instruct"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# Alternatively, to test loading an already-quantized checkpoint:
# model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"
# (and don't pass quantization_config below)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=0
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, do_sample=False, max_new_tokens=1024)
print(tokenizer.decode(outputs[0]))

Remaining TODOs:
  • check why the memory usage is way too high when quantizing on the fly
  • add bnb tests

@SunMarc SunMarc requested a review from MekkCyber November 5, 2025 16:29
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@MekkCyber (Contributor) left a comment

Niice! makes sense

@SunMarc SunMarc changed the title Fix bnb on the fly for the weights refactor Fix bnb for the weights refactor Nov 6, 2025

@ArthurZucker (Collaborator) left a comment

nice!

 from .eetq import replace_with_eetq_linear
 from .fbgemm_fp8 import FbgemmFp8Linear, FbgemmFp8Llama4TextExperts, replace_with_fbgemm_fp8_linear
-from .finegrained_fp8 import FP8Linear, replace_with_fp8_linear
+from .finegrained_fp8 import FP8Linear, Fp8Quantize, replace_with_fp8_linear

Collaborator:

I think we removed it because it was imported with a jit decorator, making it slow.

Member Author (@SunMarc):

ok, I will import it directly from the correct file then

Comment on lines +53 to +55
# Save the states for later quantization when they are all gathered
if not hasattr(self.hf_quantizer, "param_quant_stats"):
    self.hf_quantizer.param_quant_stats = defaultdict(dict)

Collaborator:

Gathered from what? Sorry, I am not familiar with it; which states do you need?

Member Author (@SunMarc):

Basically, we need to store some parameters to create the quantized weight. For example, bnb requires 6 values that are stored in the checkpoint to recover the quantized weight. So we store them in a dict that is kept on the hf_quantizer for now, since we can't save it in the op: we create one op per tensor.
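
A minimal sketch of this gather-then-quantize pattern, assuming an op-per-tensor loading loop; the class name, method signature, and return convention here are illustrative assumptions, not the PR's actual API:

from collections import defaultdict


class BnbGatherSketch:
    """Illustrative only: collect per-weight quant stats until all pieces have arrived."""

    def __init__(self, hf_quantizer, n_expected_stats=6):
        self.hf_quantizer = hf_quantizer
        self.n_expected_stats = n_expected_stats  # e.g. absmax, quant map, nested stats for nf4

    def convert(self, key_and_tensor):
        # The shared store lives on hf_quantizer, as in the snippet discussed above,
        # because a new op is created per tensor and cannot hold state across tensors.
        if not hasattr(self.hf_quantizer, "param_quant_stats"):
            self.hf_quantizer.param_quant_stats = defaultdict(dict)

        (full_key, tensor), = key_and_tensor.items()
        weight_key, _, stat_name = full_key.rpartition(".")
        self.hf_quantizer.param_quant_stats[weight_key][stat_name] = tensor

        stats = self.hf_quantizer.param_quant_stats[weight_key]
        if len(stats) < self.n_expected_stats:
            return {}  # not all pieces gathered yet, nothing to materialize
        # All pieces are present: the real code would now rebuild the 4-bit bnb
        # parameter from the gathered stats; here we just hand them back.
        return {weight_key: stats}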

Comment on lines +426 to +431
def is_valid_unexpected_keys(self, k):
    """
    Check if the key is valid even if it is not in the state_dict of the meta model.
    This is needed because the state dict of the model might change after quantization, e.g. for 4-bit bnb.
    """
    return False

Collaborator:

I would love to avoid this. Can we make sure the first call to hf_quantizer.quantize_model just properly prepares the meta model?

Member Author (@SunMarc):

This is more to take care of the case where we load a pre-quantized checkpoint. I don't think there is a good way to fix this, so let's keep it for now and think about fixing it later.
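
For context, a hedged sketch of the kind of override a bnb quantizer could use when loading a pre-quantized checkpoint; the exact key suffixes are an assumption, not taken from this PR:

class Bnb4BitQuantizerSketch:
    # Hypothetical: these stat keys only appear once a weight has been 4-bit quantized,
    # so they are absent from the meta model's state_dict but valid in the checkpoint.
    _BNB_STAT_SUFFIXES = (
        ".absmax",
        ".quant_map",
        ".nested_absmax",
        ".nested_quant_map",
        ".quant_state.bitsandbytes__nf4",
        ".quant_state.bitsandbytes__fp4",
    )

    def is_valid_unexpected_keys(self, k):
        return k.endswith(self._BNB_STAT_SUFFIXES)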


-if ref is not None and ref.shape != param_value.shape:
+# skip mismatch for hf_quantizer for now
+if ref is not None and ref.shape != param_value.shape and hf_quantizer is None:

Collaborator:

Why is the shape of the BnbLinear not correct? I also don't think we want this long term, no?

Member Author (@SunMarc):

This is because when we initialize the meta model with nn.Linear4bit, the params don't have the right shape yet, since the weights are not quantized. Maybe we can fix this by overwriting the shape of the param when replacing the layers. In the long term, we will remove this, yes.
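
As a small illustration of that mismatch (assuming bitsandbytes is installed and a CUDA device is available; the packed shape reflects bnb's usual two-4-bit-values-per-byte storage):

import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(128, 256, quant_type="nf4")
print(layer.weight.shape)   # torch.Size([256, 128]): not quantized yet, full shape
layer = layer.to("cuda")    # moving to the GPU triggers the 4-bit quantization
print(layer.weight.shape)   # packed uint8 storage, roughly torch.Size([256 * 128 // 2, 1])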

from .integrations.finegrained_fp8 import Fp8Quantize

converter.quantization_operation = Fp8Quantize() # TODO support other methods
if hf_quantizer is not None and hf_quantizer.is_valid_unexpected_keys(t):

Collaborator:

ditto 😉

Comment on lines +499 to +504
converter.quantization_operation = hf_quantizer.get_quantize_ops()
# TODO: to clean later. We need to use the empty_param from the checkpoint to decide if we upcast the param to a specific dtype
k_dtype = tensor.get_dtype()
dtype = str_to_torch_dtype[k_dtype]
empty_param_checkpoint = torch.empty(size=tensor.get_shape(), dtype=dtype, device="meta")
_, _dtype = _infer_parameter_dtype(model, t, empty_param_checkpoint, hf_quantizer)

Collaborator:

Why? Is it because BNB always needs, say, bf16? Can you elaborate? I don't upcast any of the parameters; they just have the _dtype on meta and then get whatever was loaded from the weights.

Member Author (@SunMarc):

We need to infer the right dtype for each value in the checkpoint (a sketch of this follows below):

  • Some of the values are not parameters or buffers of the model, so we shouldn't change their dtype.
  • For some parameters/buffers, we should also keep the same dtype as the checkpoint (empty_param_checkpoint), because the _dtype on meta is not correct (fp16 instead of int8). This could potentially be fixed by initializing the correct dtype from the start; for bnb it should work, but I'm not sure about other methods like torchao, where the dtype is hard to infer from the beginning.
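
A minimal sketch of that dtype inference, loosely mirroring the _infer_parameter_dtype call quoted above; the helper name and exact rules are assumptions:

import torch


def infer_loading_dtype(model, key, empty_param_checkpoint, hf_quantizer=None, target_dtype=torch.bfloat16):
    """Illustrative only: pick the dtype a checkpoint tensor should be loaded with."""
    named = dict(model.named_parameters())
    named.update(dict(model.named_buffers()))

    # Keys that are not parameters/buffers of the model (e.g. bnb quant stats):
    # keep whatever dtype the checkpoint stored.
    if key not in named:
        return empty_param_checkpoint.dtype

    # Quantized storage (int8/uint8, ...) must keep the checkpoint dtype, since the
    # meta model may still report fp16 for a weight that is not quantized yet.
    if hf_quantizer is not None and not empty_param_checkpoint.dtype.is_floating_point:
        return empty_param_checkpoint.dtype

    # Regular floating-point params/buffers follow the requested target dtype.
    return target_dtype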


Collaborator:

But if we quantize, we never change the dtype of the param, which is the source of truth.

-op.convert(
-    {k: realized_value.pop(k)}, quant_config=quantizer.quantization_config
-)
+op.convert({k: realized_value.pop(k)}, model=model)

Collaborator:

not a fan of passing the whole model!

Member Author (@SunMarc):

I wish I could avoid that, but let's keep this for now.

Comment on lines +635 to +638
is_torch_e4m3fn_available = hasattr(torch, "float8_e4m3fn")
# We convert floating dtypes to the `dtype` passed except for float8_e4m3fn type. We also want to keep the buffers/params
# in int/uint/bool and not cast them.
casting_dtype = None
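
A tiny sketch of the casting rule described in the comment above; the function name and the None-means-no-cast convention are assumptions:

import torch


def infer_casting_dtype(tensor, requested_dtype):
    # Cast floating-point values to the requested dtype, except float8_e4m3fn;
    # keep int/uint/bool values untouched (None means "do not cast").
    float8_e4m3fn = getattr(torch, "float8_e4m3fn", None)
    if tensor.dtype.is_floating_point and tensor.dtype != float8_e4m3fn:
        return requested_dtype
    return None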

Collaborator:

Can't the methods do this in the quantize step? Because they should!

@ArthurZucker ArthurZucker force-pushed the refactor-weight-loading branch from 28b620d to f692f4b on November 7, 2025 07:55

github-actions bot commented Nov 7, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: finegrained_fp8

SunMarc and others added 3 commits November 7, 2025 14:21
Comment on lines +309 to +327
    base_cls: an nn.Parameter subclass (or nn.Parameter)
    Returns a new class that combines the base_cls with LoadedParameterMixin
    """

    class LoadedParam(base_cls):
        _inplace_methods = [
            'add_', 'mul_', 'clamp_', 'zero_', 'fill_', 'normal_', 'uniform_',
            'copy_', 'erfinv_', 'log_'
        ]

        def __new__(cls, from_existing, **kwargs):
            inst = super().__new__(cls, from_existing.data, from_existing.requires_grad, **from_existing.__dict__)
            inst._original_param = from_existing
            # Explicitly override all in-place methods per instance
            for method_name in inst._inplace_methods:
                setattr(inst, method_name, MethodType(inst._skip, inst))

            return inst

        def _skip(self, *args, **kwargs):
            """Helper to skip in-place operations."""

@SunMarc (Member Author) commented Nov 7, 2025:

Simplified a bit to accommodate subclasses of nn.Parameter, cc @ArthurZucker. Also, if we are using this class, it means the param is already initialized, as you said, so let's simplify everything.
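
For illustration, a minimal self-contained sketch of how such a wrapper behaves; the factory name make_loaded_param_class and the trimmed method list are hypothetical, not the PR's API:

from types import MethodType

import torch
from torch import nn


def make_loaded_param_class(base_cls=nn.Parameter):
    # Hypothetical restatement of the pattern above, trimmed to the essentials.
    class LoadedParam(base_cls):
        _inplace_methods = ["normal_", "uniform_", "zero_", "fill_", "copy_"]

        def __new__(cls, from_existing):
            inst = super().__new__(cls, from_existing.data, from_existing.requires_grad)
            # In-place initializers become per-instance no-ops, so values loaded from
            # the checkpoint survive a later _init_weights pass.
            for name in inst._inplace_methods:
                setattr(inst, name, MethodType(lambda self, *a, **kw: self, inst))
            return inst

    return LoadedParam


param = nn.Parameter(torch.ones(2, 2))
loaded = make_loaded_param_class()(param)
loaded.normal_()   # skipped: the loaded values are preserved
print(loaded)      # still all ones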

Collaborator:

yep absolutely, ty
