[Performance] Parallelize modifier compression #1558


Draft · wants to merge 41 commits into main

Conversation

@kylesayrs (Collaborator) commented on Jun 16, 2025

#1382

A promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:

  1. Add the option to dispatch a sequential target across N GPUs (where N is the number available). This dispatch would occur before calibration.
  2. Implement async GPTQ quantization (each quantization step kicks off an async thread that operates on the same device as the module and its hessian).

Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and as long as one layer plus its hessians fits across their N GPUs, all quantization operations would be fully parallelized.

This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.
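Below is a rough sketch of how the two pieces could fit together, not code from this PR; `dispatch_layer_across_gpus` and `gptq_quantize_weight` are hypothetical stand-ins for the real dispatch and GPTQ update logic.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import torch


def dispatch_layer_across_gpus(layer: torch.nn.Module) -> None:
    """Round-robin the layer's Linear submodules across all visible GPUs (before calibration)."""
    num_gpus = torch.cuda.device_count()
    linears = [m for m in layer.modules() if isinstance(m, torch.nn.Linear)]
    for index, module in enumerate(linears):
        device = f"cuda:{index % num_gpus}" if num_gpus > 0 else "cpu"
        module.to(device)


def gptq_quantize_weight(module: torch.nn.Linear, hessian: torch.Tensor) -> None:
    # placeholder for the real error-compensated GPTQ update; it runs on the
    # device that already holds the module's weight and its hessian
    assert hessian.device == module.weight.device
    module.weight.data = module.weight.data.round()


def quantize_layer_async(layer: torch.nn.Module, hessians: dict) -> None:
    """Kick off one quantization task per Linear so all GPUs work concurrently."""
    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(gptq_quantize_weight, module, hessians[module])
            for module in layer.modules()
            if isinstance(module, torch.nn.Linear)
        ]
        for future in as_completed(futures):
            future.result()  # re-raise any exception from a worker thread
```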

Notes:

It seems that parallel compression of layers is slower (33s vs 18s). I suspect this is because the GPTQ algorithm is very instruction-intensive and has lots of branching. This change may need to be preceded by a change to the quantize_weight function to make it more amenable to torch.compile.
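For reference, a toy illustration of the torch.compile angle; the real quantize_weight is far more involved, and `rtn_quantize` below is a hypothetical round-to-nearest kernel, not the GPTQ implementation.

```python
import torch


def rtn_quantize(weight: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    # branch-free, fully tensorized quantize/dequantize round trip, which is
    # the kind of structure torch.compile handles well
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale


compiled_rtn_quantize = torch.compile(rtn_quantize)
```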

kylesayrs and others added 30 commits June 5, 2025 13:57
kylesayrs added 11 commits June 16, 2025 00:01
@gemini-code-assist bot left a comment

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces parallel processing to the OBCQ (Optimal Brain Compression Quantization) algorithm, specifically targeting the module quantization step within the GPTQ modifier. The primary goal is to improve performance by allowing multiple layers to be quantized concurrently using a thread pool. While the PR description notes that initial tests show a performance regression, this change lays the groundwork for future optimizations and ensures thread-safe execution of underlying PyTorch linear algebra operations.

Highlights

  • Parallelization of Module Quantization: The compress_modules method in GPTQModifier has been refactored to leverage concurrent.futures.ThreadPoolExecutor for parallelizing the quantization of individual modules. This aims to speed up the overall compression process by distributing the workload across multiple threads.
  • Extraction of Module Compression Logic: The core logic for quantizing a single module has been extracted into a new private helper method, _compress_module. This refactoring makes the code more modular and enables its concurrent execution within the thread pool.
  • Thread-Safe Linear Algebra Initialization: A new function, initialize_linalg, has been introduced and integrated to pre-load torch.linalg operations on specific devices. This addresses a known issue where lazy loading of torch.linalg within multiple threads could lead to assertion errors, ensuring stability in the parallelized environment.
  • Logging and Performance Tracking: The CompressionLogger context manager has been removed from the per-module loop, and overall timing for the parallel compression process is now explicitly measured and logged, providing a clearer picture of the total execution time.
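The summary above mentions initialize_linalg without showing it. As a hedged sketch of what such a pre-loading helper might look like, based only on the description above and not on the PR's exact code:

```python
import torch


def initialize_linalg(device: torch.device) -> None:
    # run a trivial torch.linalg op up front so the lazily-loaded linalg
    # backend is initialized on this device before worker threads touch it
    torch.linalg.cholesky(torch.eye(2, device=device))
```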

@gemini-code-assist bot left a comment

Code Review

This pull request aims to improve performance by parallelizing the OBCQ compression process using ThreadPoolExecutor. It introduces a helper function initialize_linalg to pre-load torch.linalg and prevent potential lazy loading issues in threaded environments.

While the intent is performance improvement, the PR description notes that the parallelized version is currently slower. My main feedback points revolve around this performance regression, the choice of ThreadPoolExecutor for potentially CPU-bound tasks, error handling in the parallel execution, and a minor style point. Addressing the performance issue is key, and further investigation into why the threaded version is slower, possibly exploring ProcessPoolExecutor or optimizing the quantize_weight function itself (as you suggested regarding torch.compile), will be important.
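On the error-handling point, one common pattern with concurrent.futures, not necessarily what this PR does, is to track submitted futures and re-raise the first worker failure instead of submitting fire-and-forget:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all_or_raise(tasks) -> None:
    # tasks: iterable of (name, zero-arg callable) pairs; a hypothetical shape
    # chosen for illustration only
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(fn): name for name, fn in tasks}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                for other in futures:
                    other.cancel()  # skip tasks that have not started yet
                raise RuntimeError(f"task {futures[future]!r} failed") from exc
```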


Base automatically changed from kylesayrs/sequential-onloading to main June 17, 2025 20:45
@kylesayrs kylesayrs mentioned this pull request Jun 18, 2025
@kylesayrs kylesayrs changed the title [Performance] Parallelize OBCQ compression [Performance] Parallelize modifier compression Jun 24, 2025
brian-dellabetta added a commit that referenced this pull request Jul 24, 2025
SUMMARY:

Add AWQ activation-smooth mapping for `DeepseekV3ForCausalLM`.


TEST PLAN:


[examples/quantizing_moe/deepseek_r1_example.py](./examples/quantizing_moe/deepseek_r1_example.py)
but recipe adapted to use `AWQModifier` instead:

```python
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modeling import prepare_for_calibration
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.

# This script takes about 48 hours on 1xA100 to complete.
# Future improvements will reduce this runtime (#1561, #1558).

# For DeepSeek-R1, we require a full precision model in order to properly calibrate
# `DeepSeek-R1-0528-BF16` is a DeepSeek-V3 FP8 model which has been converted to BF16

model_id = "unsloth/DeepSeek-R1-0528-BF16"
config = AutoConfig.from_pretrained(model_id)
del config.quantization_config  # fp8 qconfig no longer applies to bf16 model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", config=config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = prepare_for_calibration(model)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# since the MoE gate layers are sensitive to quantization, we add them to the ignore
# list so they remain at full precision
recipe = AWQModifier(
    targets="Linear", scheme="W4A16", ignore=["lm_head", "re:.*mlp.gate$"]
)

# Apply algorithms.
# due to the large size of DeepSeekV3, we specify sequential targets such that
# only one MLP is loaded into GPU memory at a time
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    sequential_targets=["DeepseekV3Attention", "DeepseekV3MLP"],
)

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

```

---------

Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025