
Commit 7d7f274

feedback
1 parent 3102eb2 commit 7d7f274

1 file changed: +11 -16 lines

docs/source/en/optimization/speed-memory-optims.md

Lines changed: 11 additions & 16 deletions
@@ -12,26 +12,26 @@ specific language governing permissions and limitations under the License.
 
 # Compile and offloading
 
-There are trade-offs associated with optimizing solely for [inference speed](./fp16) or [memory-usage](./memory). For example, [caching](./cache) increases inference speed but requires more memory to store the intermediate outputs from the attention layers.
+When optimizing models, you often face trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it comes at the cost of increased memory consumption since it needs to store intermediate attention layer outputs.
 
-If your hardware is sufficiently powerful, you can choose to focus on one or the other. For a more balanced approach that doesn't sacrifice too much in terms of inference speed and memory-usage, try compiling and offloading a model.
+A more balanced optimization strategy combines [torch.compile](./fp16#torchcompile) with various offloading methods. This approach not only accelerates inference but also helps lower memory-usage.
 
-Refer to the table below for the latency and memory-usage of each combination.
+The table below provides a comparison of optimization strategy combinations and their impact on latency and memory-usage.
 
-| combination | latency | memory usage |
+| combination | latency | memory-usage |
 |---|---|---|
 | quantization, torch.compile | | |
 | quantization, torch.compile, model CPU offloading | | |
 | quantization, torch.compile, group offloading | | |
 
-This guide will show you how to compile and offload a model to improve both inference speed and memory-usage.
+This guide will show you how to compile and offload a model.
 
 ## Quantization and torch.compile
 
 > [!TIP]
 > The quantization backend, such as [bitsandbytes](../quantization/bitsandbytes#torchcompile), must be compatible with torch.compile. Refer to the quantization [overview](https://huggingface.co/docs/transformers/quantization/overview#overview) table to see which backends support torch.compile.
 
-Start by [quantizing](../quantization/overview) a model to reduce the memory required to store it and [compiling](./fp16#torchcompile) it to accelerate inference.
+Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.
 
 ```py
 import torch
@@ -52,9 +52,7 @@ pipeline = DiffusionPipeline.from_pretrained(
 
 # compile
 pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.transformer = torch.compile(
-    pipeline.transformer, mode="ax-autotune", fullgraph=True
-)
+pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
 pipeline("""
 cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
 highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
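
Note that the hunks above only show the tail of the quantization and torch.compile example; the quantization and pipeline-loading code sits outside the diff context. A minimal sketch of that setup, assuming a FLUX.1-dev checkpoint and 4-bit bitsandbytes quantization of the transformer (both are assumptions, not shown in this diff), could look like:

```py
import torch
from diffusers import BitsAndBytesConfig, DiffusionPipeline, FluxTransformer2DModel

# assumption: quantize only the transformer with 4-bit bitsandbytes
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed checkpoint, for illustration only
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

# compile the quantized transformer, matching the "+" lines in the hunk above
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
```

The first pipeline call after this is slow because it triggers compilation; later calls run at the compiled speed.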
@@ -93,9 +91,7 @@ pipeline.enable_model_cpu_offload()
 
 # compile
 pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.transformer = torch.compile(
-    pipeline.transformer, mode="ax-autotune", fullgraph=True
-)
+pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
 pipeline(
     "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
 ).images[0]
@@ -132,13 +128,12 @@ offload_device = torch.device("cpu")
 
 pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)
 pipeline.vae.enable_group_offload(onload_device=onload_device, offload_type="leaf_level", use_stream=True)
-apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=1, use_stream=True)
+apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="leaf_level", use_stream=True)
+apply_group_offloading(pipeline.text_encoder_2, onload_device=onload_device, offload_type="leaf_level", use_stream=True)
 
 # compile
 pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.transformer = torch.compile(
-    pipeline.transformer, mode="ax-autotune", fullgraph=True
-)
+pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
 pipeline(
     "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
 ).images[0]
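
The latency and memory-usage columns in the comparison table above are left empty in this commit. A minimal sketch of how such numbers are typically measured, assuming a CUDA device and a `pipeline` object built with one of the combinations above (the prompt and variable names are illustrative):

```py
import time
import torch

prompt = "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California"

# warmup: the first call triggers torch.compile compilation and is not representative
pipeline(prompt)

# clear the peak-memory counter so the measurement only covers the timed call
torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
image = pipeline(prompt).images[0]
torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
latency = time.perf_counter() - start

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"latency: {latency:.2f} s, peak memory: {peak_gb:.2f} GB")
```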
