[Performance] Parallelize modifier compression #1558


Draft · wants to merge 41 commits into main

Conversation

@kylesayrs (Collaborator) commented on Jun 16, 2025

#1382

A promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:

  1. Add the option to dispatch a sequential target across N GPUs (where N is the number available). This dispatch would occur before calibration.
  2. Implement async GPTQ quantization (each quantization step kicks off an async thread that operates on the same device as the module and its hessian).

Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and as long as one layer plus its hessians fits across their N GPUs, all quantization operations would be fully parallelized.

This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.
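Below is a rough sketch of how the two pieces could fit together, not code from this PR; `dispatch_layer_across_gpus` and `gptq_quantize_weight` are hypothetical stand-ins for the real dispatch and GPTQ update logic.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import torch


def dispatch_layer_across_gpus(layer: torch.nn.Module) -> None:
    """Round-robin the layer's Linear submodules across all visible GPUs (before calibration)."""
    num_gpus = torch.cuda.device_count()
    linears = [m for m in layer.modules() if isinstance(m, torch.nn.Linear)]
    for index, module in enumerate(linears):
        device = f"cuda:{index % num_gpus}" if num_gpus > 0 else "cpu"
        module.to(device)


def gptq_quantize_weight(module: torch.nn.Linear, hessian: torch.Tensor) -> None:
    # placeholder for the real error-compensated GPTQ update; it runs on the
    # device that already holds the module's weight and its hessian
    assert hessian.device == module.weight.device
    module.weight.data = module.weight.data.round()


def quantize_layer_async(layer: torch.nn.Module, hessians: dict) -> None:
    """Kick off one quantization task per Linear so all GPUs work concurrently."""
    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(gptq_quantize_weight, module, hessians[module])
            for module in layer.modules()
            if isinstance(module, torch.nn.Linear)
        ]
        for future in as_completed(futures):
            future.result()  # re-raise any exception from a worker thread
```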

Notes:

It seems that parallel compression of layers is slower (33s vs 18s). I suspect this is because the GPTQ algorithm is very instruction-intensive and has lots of branching. This change may need to be preceded by a change to the quantize_weight function to make it more amenable to torch.compile.
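For reference, a toy illustration of the torch.compile angle; the real quantize_weight is far more involved, and `rtn_quantize` below is a hypothetical round-to-nearest kernel, not the GPTQ implementation.

```python
import torch


def rtn_quantize(weight: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    # branch-free, fully tensorized quantize/dequantize round trip, which is
    # the kind of structure torch.compile handles well
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale


compiled_rtn_quantize = torch.compile(rtn_quantize)
```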

kylesayrs and others added 30 commits June 5, 2025 13:57
kylesayrs added 11 commits June 16, 2025 00:01
@gemini-code-assist bot left a comment

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces parallel processing to the OBCQ (Optimal Brain Compression Quantization) algorithm, specifically targeting the module quantization step within the GPTQ modifier. The primary goal is to improve performance by allowing multiple layers to be quantized concurrently using a thread pool. While the PR description notes that initial tests show a performance regression, this change lays the groundwork for future optimizations and ensures thread-safe execution of underlying PyTorch linear algebra operations.

Highlights

  • Parallelization of Module Quantization: The compress_modules method in GPTQModifier has been refactored to leverage concurrent.futures.ThreadPoolExecutor for parallelizing the quantization of individual modules. This aims to speed up the overall compression process by distributing the workload across multiple threads.
  • Extraction of Module Compression Logic: The core logic for quantizing a single module has been extracted into a new private helper method, _compress_module. This refactoring makes the code more modular and enables its concurrent execution within the thread pool.
  • Thread-Safe Linear Algebra Initialization: A new function, initialize_linalg, has been introduced and integrated to pre-load torch.linalg operations on specific devices. This addresses a known issue where lazy loading of torch.linalg within multiple threads could lead to assertion errors, ensuring stability in the parallelized environment.
  • Logging and Performance Tracking: The CompressionLogger context manager has been removed from the per-module loop, and overall timing for the parallel compression process is now explicitly measured and logged, providing a clearer picture of the total execution time.
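The summary above mentions initialize_linalg without showing it. As a hedged sketch of what such a pre-loading helper might look like, based only on the description above and not on the PR's exact code:

```python
import torch


def initialize_linalg(device: torch.device) -> None:
    # run a trivial torch.linalg op up front so the lazily-loaded linalg
    # backend is initialized on this device before worker threads touch it
    torch.linalg.cholesky(torch.eye(2, device=device))
```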

@gemini-code-assist bot left a comment

Code Review

This pull request aims to improve performance by parallelizing the OBCQ compression process using ThreadPoolExecutor. It introduces a helper function initialize_linalg to pre-load torch.linalg and prevent potential lazy loading issues in threaded environments.

While the intent is performance improvement, the PR description notes that the parallelized version is currently slower. My main feedback points revolve around this performance regression, the choice of ThreadPoolExecutor for potentially CPU-bound tasks, error handling in the parallel execution, and a minor style point. Addressing the performance issue is key, and further investigation into why the threaded version is slower, possibly exploring ProcessPoolExecutor or optimizing the quantize_weight function itself (as you suggested regarding torch.compile), will be important.
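On the error-handling point, one common pattern with concurrent.futures, not necessarily what this PR does, is to track submitted futures and re-raise the first worker failure instead of submitting fire-and-forget:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all_or_raise(tasks) -> None:
    # tasks: iterable of (name, zero-arg callable) pairs; a hypothetical shape
    # chosen for illustration only
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(fn): name for name, fn in tasks}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                for other in futures:
                    other.cancel()  # skip tasks that have not started yet
                raise RuntimeError(f"task {futures[future]!r} failed") from exc
```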


Base automatically changed from kylesayrs/sequential-onloading to main June 17, 2025 20:45
@kylesayrs kylesayrs mentioned this pull request Jun 18, 2025
@kylesayrs kylesayrs changed the title [Performance] Parallelize OBCQ compression [Performance] Parallelize modifier compression Jun 24, 2025
brian-dellabetta added a commit that referenced this pull request Jul 24, 2025
SUMMARY:

Add AWQ activation-smooth mapping for `DeepseekV3ForCausalLM`.


TEST PLAN:


[examples/quantizing_moe/deepseek_r1_example.py](./examples/quantizing_moe/deepseek_r1_example.py)
but recipe adapted to use `AWQModifier` instead:

```python
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modeling import prepare_for_calibration
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.

# This script takes about 48 hours on 1xA100 to complete.
# Future improvements will reduce this runtime (#1561, #1558).

# For DeepSeek-R1, we require a full precision model in order to properly calibrate
# `DeepSeek-R1-0528-BF16` is a DeepSeek-V3 FP8 model which has been converted to BF16

model_id = "unsloth/DeepSeek-R1-0528-BF16"
config = AutoConfig.from_pretrained(model_id)
del config.quantization_config  # fp8 qconfig no longer applies to bf16 model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", config=config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = prepare_for_calibration(model)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# since the MoE gate layers are sensitive to quantization, we add them to the ignore
# list so they remain at full precision
recipe = AWQModifier(
    targets="Linear", scheme="W4A16", ignore=["lm_head", "re:.*mlp.gate$"]
)

# Apply algorithms.
# due to the large size of DeepSeekV3, we specify sequential targets such that
# only one MLP is loaded into GPU memory at a time
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    sequential_targets=["DeepseekV3Attention", "DeepseekV3MLP"],
)

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

```

---------

Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025