
Commit c656d25

feat: Qwen-image/FLUX 4bits w/ nunchaku + cache (#285)

* feat: add quantize examples

1 parent adf46ae commit c656d25

File tree: 11 files changed (+442 −36 lines)

README.md
Lines changed: 2 additions & 1 deletion

@@ -147,14 +147,15 @@ You can install the stable release of cache-dit from PyPI, or the latest develop
 - **[🎉Easy New Model Integration](./docs/User_Guide.md#automatic-block-adapter)**: Features like **Unified Cache APIs**, **Forward Pattern Matching**, **Automatic Block Adapter**, **Hybrid Forward Pattern**, and **Patch Functor** make it highly functional and flexible. For example, we achieved 🎉 Day 1 support for [HunyuanImage-2.1](https://github.com/Tencent-Hunyuan/HunyuanImage-2.1) with a 1.7x speedup w/o precision loss—even before it was available in the Diffusers library.
 - **[🎉State-of-the-Art Performance](./bench/)**: Compared with algorithms including Δ-DiT, Chipmunk, FORA, DuCa, TaylorSeer, and FoCa, cache-dit achieved **SOTA** performance w/ a **7.4x↑🎉** speedup on ClipScore!
 - **[🎉Support for 4/8-Steps Distilled Models](./bench/)**: Surprisingly, cache-dit's **DBCache** works for extremely few-step distilled models—something many other methods fail to do.
-- **[🎉Compatibility with Other Optimizations](./docs/User_Guide.md#️torch-compile)**: Designed to work seamlessly with torch.compile, model CPU offload, sequential CPU offload, group offloading, etc.
+- **[🎉Compatibility with Other Optimizations](./docs/User_Guide.md#️torch-compile)**: Designed to work seamlessly with torch.compile, model CPU offload, sequential CPU offload, group offloading, quantization (**[torchao](./examples/quantize/)**, **[🔥nunchaku](./examples/quantize/)**), etc.
 - **[🎉Hybrid Cache Acceleration](./docs/User_Guide.md#taylorseer-calibrator)**: Now supports hybrid **Block-wise Cache + Calibrator** schemes (e.g., DBCache or DBPrune + TaylorSeerCalibrator). DBCache or DBPrune acts as the **Indicator** that decides *when* to cache, while the Calibrator decides *how* to cache. More mainstream cache acceleration algorithms (e.g., FoCa) will be supported in the future, along with additional benchmarks—stay tuned for updates!
 - **[🤗Diffusers Ecosystem Integration](https://huggingface.co/docs/diffusers/main/en/optimization/cache_dit)**: 🔥**cache-dit** has joined the Diffusers community ecosystem as the **first** DiT-specific cache acceleration framework! Check out the documentation here: <a href="https://huggingface.co/docs/diffusers/main/en/optimization/cache_dit"><img src=https://img.shields.io/badge/🤗Diffusers-ecosystem-yellow.svg ></a>

 ![](https://github.com/vipshop/cache-dit/raw/main/assets/clip-score-bench.png)

 ## 🔥Important News

+- 2025.10.15: 🎉cache-dit now supports [**🔥nunchaku**](https://github.com/nunchaku-tech/nunchaku): Qwen-Image/FLUX.1 [4-bit examples](./examples/quantize/)
 - 2025.10.13: 🎉cache-dit achieved **SOTA** performance w/ a **7.4x↑🎉** speedup on ClipScore!
 - 2025.10.10: 🔥[**Qwen-Image-ControlNet-Inpainting**](https://huggingface.co/InstantX/Qwen-Image-ControlNet-Inpainting) **2.3x↑🎉** speedup! Check the [example](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline/run_qwen_image_controlnet_inpaint.py).
 - 2025.09.26: 🔥[**Qwen-Image-Edit-Plus(2509)**](https://github.com/QwenLM/Qwen-Image) **2.1x↑🎉** speedup! Please check the [example](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline/run_qwen_image_edit_plus.py).
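The hybrid Block-wise Cache + Calibrator scheme and the torch.compile compatibility described in the bullets above reduce to a single enable_cache call plus a standard compile. The following is a minimal sketch distilled from the nunchaku examples added later in this commit; the block counts, threshold, and Taylor order are illustrative values, not tuned defaults.

import torch
from diffusers import FluxPipeline
import cache_dit
from cache_dit import DBCacheConfig, TaylorSeerCalibratorConfig

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# DBCache is the Indicator (decides *when* to cache); the TaylorSeer
# calibrator decides *how* to cache. All numeric values are illustrative.
cache_dit.enable_cache(
    pipe,
    cache_config=DBCacheConfig(
        Fn_compute_blocks=8,  # illustrative, not a tuned default
        Bn_compute_blocks=8,  # illustrative, not a tuned default
        residual_diff_threshold=0.08,  # illustrative, not a tuned default
    ),
    calibrator_config=TaylorSeerCalibratorConfig(taylorseer_order=1),
)

# cache-dit composes with other optimizations; the examples in this commit
# apply torch.compile after enabling the cache:
cache_dit.set_compile_configs()
pipe.transformer = torch.compile(pipe.transformer)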

examples/pipeline/run_hunyuan_image_2.1.py
Lines changed: 0 additions & 5 deletions

@@ -1,6 +1,5 @@
 import os
 import sys
-import gc

 sys.path.append("..")
 sys.path.append(os.environ.get("HYIMAGE_PKG_DIR", "."))
@@ -67,10 +66,6 @@
         pipe.text_encoder,
         quant_type=args.quantize_type,
     )
-    time.sleep(0.5)
-    torch.cuda.empty_cache()
-    gc.collect()
-

 pipe.to("cuda")
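Design note: the deleted time.sleep/empty_cache/gc.collect sequence was a manual VRAM scrub between quantizing the text encoder and moving the pipeline to CUDA; this commit drops it, presumably because the explicit collection proved unnecessary.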

examples/quantize/run_flux_ao.py
Lines changed: 65 additions & 0 deletions

@@ -0,0 +1,65 @@
+import os
+import sys
+
+sys.path.append("..")
+
+import time
+import torch
+from diffusers import FluxPipeline, FluxTransformer2DModel
+from utils import get_args, strify, cachify
+import cache_dit
+
+
+args = get_args()
+print(args)
+
+
+pipe: FluxPipeline = FluxPipeline.from_pretrained(
+    os.environ.get(
+        "FLUX_DIR",
+        "black-forest-labs/FLUX.1-dev",
+    ),
+    torch_dtype=torch.bfloat16,
+).to("cuda")
+
+
+if args.cache:
+    cachify(args, pipe)
+
+
+if args.quantize:
+    assert isinstance(pipe.transformer, FluxTransformer2DModel)
+    pipe.transformer = cache_dit.quantize(
+        pipe.transformer,
+        quant_type=args.quantize_type,
+    )
+
+
+def run_pipe(pipe: FluxPipeline):
+    image = pipe(
+        "A cat holding a sign that says hello world",
+        num_inference_steps=28,
+        generator=torch.Generator("cpu").manual_seed(0),
+    ).images[0]
+    return image
+
+
+if args.compile:
+    assert isinstance(pipe.transformer, FluxTransformer2DModel)
+    pipe.transformer.compile_repeated_blocks(fullgraph=True)
+
+# warmup
+_ = run_pipe(pipe)
+
+
+start = time.time()
+image = run_pipe(pipe)
+end = time.time()
+
+cache_dit.summary(pipe)
+
+time_cost = end - start
+save_path = f"flux.ao.{strify(args, pipe)}.png"
+print(f"Time cost: {time_cost:.2f}s")
+print(f"Saving image to {save_path}")
+image.save(save_path)
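Design note: the script keeps its three optimizations independent and flag-gated. cachify(args, pipe) wires up caching, cache_dit.quantize swaps the transformer weights through the torchao backend, and compile_repeated_blocks(fullgraph=True) compiles only the repeated transformer blocks rather than the whole module; the untimed warmup run then absorbs one-off compilation cost so the timed run measures steady-state latency. utils.py is not part of this diff, so the following is a hypothetical sketch of what cachify presumably wraps, namely the same enable_cache call the nunchaku examples below inline directly:

# Hypothetical sketch only: utils.cachify is not shown in this diff.
import cache_dit
from cache_dit import DBCacheConfig

def cachify(args, pipe):
    # Enable DBCache on the pipeline using the parsed CLI arguments.
    cache_dit.enable_cache(
        pipe,
        cache_config=DBCacheConfig(
            Fn_compute_blocks=args.Fn,
            Bn_compute_blocks=args.Bn,
            residual_diff_threshold=args.rdt,
        ),
    )

Assuming get_args exposes flags mirroring the attribute names it reads (also an assumption), a typical invocation would be: python run_flux_ao.py --cache --quantize --compile.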
Lines changed: 102 additions & 0 deletions

@@ -0,0 +1,102 @@
+import os
+import sys
+
+sys.path.append("..")
+import time
+
+import torch
+from diffusers import FluxPipeline, FluxTransformer2DModel
+
+from nunchaku.models.transformers.transformer_flux_v2 import (
+    NunchakuFluxTransformer2DModelV2,
+)
+from utils import get_args, strify
+import cache_dit
+
+args = get_args()
+print(args)
+
+nunchaku_flux_dir = os.environ.get(
+    "NUNCHAKA_FLUX_DIR",
+    "nunchaku-tech/nunchaku-flux.1-dev",
+)
+transformer = NunchakuFluxTransformer2DModelV2.from_pretrained(
+    f"{nunchaku_flux_dir}/svdq-int4_r32-flux.1-dev.safetensors",
+)
+pipe: FluxPipeline = FluxPipeline.from_pretrained(
+    os.environ.get("FLUX_DIR", "black-forest-labs/FLUX.1-dev"),
+    transformer=transformer,
+    torch_dtype=torch.bfloat16,
+).to("cuda")
+
+
+if args.cache:
+    from cache_dit import (
+        ParamsModifier,
+        DBCacheConfig,
+        TaylorSeerCalibratorConfig,
+    )
+
+    cache_dit.enable_cache(
+        pipe,
+        cache_config=DBCacheConfig(
+            Fn_compute_blocks=args.Fn,
+            Bn_compute_blocks=args.Bn,
+            max_warmup_steps=args.max_warmup_steps,
+            max_cached_steps=args.max_cached_steps,
+            max_continuous_cached_steps=args.max_continuous_cached_steps,
+            residual_diff_threshold=args.rdt,
+        ),
+        calibrator_config=(
+            TaylorSeerCalibratorConfig(
+                taylorseer_order=args.taylorseer_order,
+            )
+            if args.taylorseer
+            else None
+        ),
+        params_modifiers=[
+            ParamsModifier(
+                # transformer_blocks
+                cache_config=DBCacheConfig().reset(
+                    residual_diff_threshold=args.rdt
+                ),
+            ),
+            ParamsModifier(
+                # single_transformer_blocks
+                cache_config=DBCacheConfig().reset(
+                    residual_diff_threshold=args.rdt * 3
+                ),
+            ),
+        ],
+    )
+
+
+def run_pipe(pipe: FluxPipeline):
+    image = pipe(
+        "A cat holding a sign that says hello world",
+        num_inference_steps=28,
+        generator=torch.Generator("cpu").manual_seed(0),
+    ).images[0]
+    return image
+
+
+if args.compile:
+    assert isinstance(pipe.transformer, FluxTransformer2DModel)
+    cache_dit.set_compile_configs()
+    pipe.transformer = torch.compile(pipe.transformer)
+
+# warmup
+_ = run_pipe(pipe)
+
+
+start = time.time()
+image = run_pipe(pipe)
+end = time.time()
+
+cache_dit.summary(pipe)
+
+time_cost = end - start
+save_path = f"flux.nunchaku.int4.{strify(args, pipe)}.png"
+print(f"Time cost: {time_cost:.2f}s")
+print(f"Saving image to {save_path}")
+image.save(save_path)
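Design note: the two ParamsModifier entries apply per-block-group overrides in declaration order: FLUX's transformer_blocks keep the base residual_diff_threshold, while single_transformer_blocks get a 3x looser one (args.rdt * 3), presumably because cache error in the cheaper single blocks is more tolerable. Compilation here also uses plain torch.compile on the whole transformer instead of compile_repeated_blocks, which suggests the Nunchaku transformer does not expose the repeated-block compile hook.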
Lines changed: 135 additions & 0 deletions

@@ -0,0 +1,135 @@
+import os
+import sys
+
+sys.path.append("..")
+
+import time
+import torch
+from diffusers.quantizers import PipelineQuantizationConfig
+from diffusers import QwenImagePipeline, QwenImageTransformer2DModel
+from nunchaku.models.transformers.transformer_qwenimage import (
+    NunchakuQwenImageTransformer2DModel,
+)
+
+from utils import get_args, strify
+import cache_dit
+
+
+args = get_args()
+print(args)
+
+nunchaku_qwen_image_dir = os.environ.get(
+    "NUNCHAKA_QWEN_IMAGE_DIR",
+    "nunchaku-tech/nunchaku-qwen-image.1-dev",
+)
+transformer = NunchakuQwenImageTransformer2DModel.from_pretrained(
+    f"{nunchaku_qwen_image_dir}/svdq-int4_r32-qwen-image.safetensors"
+)
+
+# Minimize VRAM required: 20GiB
+pipe = QwenImagePipeline.from_pretrained(
+    os.environ.get(
+        "QWEN_IMAGE_DIR",
+        "Qwen/Qwen-Image",
+    ),
+    transformer=transformer,
+    torch_dtype=torch.bfloat16,
+    quantization_config=PipelineQuantizationConfig(
+        quant_backend="bitsandbytes_4bit",
+        quant_kwargs={
+            "load_in_4bit": True,
+            "bnb_4bit_quant_type": "nf4",
+            "bnb_4bit_compute_dtype": torch.bfloat16,
+        },
+        components_to_quantize=["text_encoder"],
+    ),
+).to("cuda")
+
+
+if args.cache:
+    from cache_dit import (
+        DBCacheConfig,
+        TaylorSeerCalibratorConfig,
+    )
+
+    cache_dit.enable_cache(
+        pipe,
+        cache_config=DBCacheConfig(
+            Fn_compute_blocks=args.Fn,
+            Bn_compute_blocks=args.Bn,
+            max_warmup_steps=args.max_warmup_steps,
+            max_cached_steps=args.max_cached_steps,
+            max_continuous_cached_steps=args.max_continuous_cached_steps,
+            residual_diff_threshold=args.rdt,
+        ),
+        calibrator_config=(
+            TaylorSeerCalibratorConfig(
+                taylorseer_order=args.taylorseer_order,
+            )
+            if args.taylorseer
+            else None
+        ),
+    )
+
+
+positive_magic = {
+    "en": ", Ultra HD, 4K, cinematic composition.",  # for english prompt
+    "zh": ", 超清,4K,电影级构图.",  # for chinese prompt
+}
+
+# Generate image
+prompt = """A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197". Ultra HD, 4K, cinematic composition"""

+# using an empty string if you do not have specific concept to remove
+negative_prompt = " "
+
+
+# Generate with different aspect ratios
+aspect_ratios = {
+    "1:1": (1328, 1328),
+    "16:9": (1664, 928),
+    "9:16": (928, 1664),
+    "4:3": (1472, 1140),
+    "3:4": (1140, 1472),
+    "3:2": (1584, 1056),
+    "2:3": (1056, 1584),
+}
+
+width, height = aspect_ratios["16:9"]
+
+assert isinstance(pipe.transformer, QwenImageTransformer2DModel)
+
+
+def run_pipe():
+    # do_true_cfg = true_cfg_scale > 1 and has_neg_prompt
+    image = pipe(
+        prompt=prompt + positive_magic["en"],
+        negative_prompt=negative_prompt,
+        width=width,
+        height=height,
+        num_inference_steps=50,
+        true_cfg_scale=4.0,
+        generator=torch.Generator(device="cpu").manual_seed(42),
+    ).images[0]
+    return image
+
+
+if args.compile:
+    cache_dit.set_compile_configs()
+    pipe.transformer = torch.compile(pipe.transformer)
+
+# warmup
+run_pipe()
+
+
+start = time.time()
+image = run_pipe()
+end = time.time()
+
+stats = cache_dit.summary(pipe)
+
+time_cost = end - start
+save_path = f"qwen-image.nunchaku.{strify(args, stats)}.png"
+print(f"Time cost: {time_cost:.2f}s")
+print(f"Saving image to {save_path}")
+image.save(save_path)
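Design note: this example stacks two quantizers: the transformer loads nunchaku's SVDQuant INT4 checkpoint (svdq-int4_r32), while the text encoder is quantized to NF4 on load via diffusers' PipelineQuantizationConfig with the bitsandbytes_4bit backend. Quantizing both large components is what brings peak VRAM down to roughly the 20GiB noted in the comment.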

src/cache_dit/cache_factory/block_adapters/__init__.py
Lines changed: 4 additions & 1 deletion

@@ -12,7 +12,10 @@ def flux_adapter(pipe, **kwargs) -> BlockAdapter:
     from cache_dit.utils import is_diffusers_at_least_0_3_5

     assert isinstance(pipe.transformer, FluxTransformer2DModel)
-    if is_diffusers_at_least_0_3_5():
+    transformer_cls_name: str = pipe.transformer.__class__.__name__
+    if is_diffusers_at_least_0_3_5() and not transformer_cls_name.startswith(
+        "Nunchaku"
+    ):
         return BlockAdapter(
             pipe=pipe,
             transformer=pipe.transformer,
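Design note: the preceding assert requires pipe.transformer to be a FluxTransformer2DModel, so Nunchaku's transformer must pass that isinstance check; the new class-name prefix test is what tells it apart, routing Nunchaku models to the fallback BlockAdapter instead of the newer diffusers path, presumably because Nunchaku's fused block layout does not match what that path assumes.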
