docs/source/en/optimization/speed-memory-optims.md
11 additions & 16 deletions
@@ -12,26 +12,26 @@ specific language governing permissions and limitations under the License.
# Compile and offloading

- There are trade-offs associated with optimizing solely for [inference speed](./fp16) or [memory-usage](./memory). For example, [caching](./cache) increases inference speed but requires more memory to store the intermediate outputs from the attention layers.
+ When optimizing models, you often face trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it comes at the cost of increased memory consumption since it needs to store intermediate attention layer outputs.
- If your hardware is sufficiently powerful, you can choose to focus on one or the other. For a more balanced approach that doesn't sacrifice too much in terms of inference speed and memory-usage, try compiling and offloading a model.
+ A more balanced optimization strategy combines [torch.compile](./fp16#torchcompile) with various offloading methods. This approach not only accelerates inference but also helps lower memory-usage.
- Refer to the table below for the latency and memory-usage of each combination.
+ The table below provides a comparison of optimization strategy combinations and their impact on latency and memory-usage.
- | combination | latency | memoryusage |
+ | combination | latency | memory-usage |
|---|---|---|
| quantization, torch.compile |||
| quantization, torch.compile, model CPU offloading |||
| quantization, torch.compile, group offloading |||
- This guide will show you how to compile and offload a model to improve both inference speed and memory-usage.
+ This guide will show you how to compile and offload a model.
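To make the combination described above concrete, here is a minimal sketch of one row of the table: torch.compile together with model CPU offloading. It is illustrative only, not the guide's reference code; the checkpoint (`black-forest-labs/FLUX.1-dev`), dtype, and prompt are assumptions.

```py
import torch
from diffusers import DiffusionPipeline

# Illustrative checkpoint and dtype; any supported text-to-image model works similarly.
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Keep whole sub-models (text encoders, transformer, VAE) on the CPU and move each
# one to the GPU only while it is running, lowering peak memory-usage.
pipeline.enable_model_cpu_offload()

# Compile the denoiser in place (PyTorch 2.1+) to speed up the repeated forward
# passes of the sampling loop.
pipeline.transformer.compile()

image = pipeline("a cat wearing sunglasses by a pool, cinematic lighting").images[0]
```

Model CPU offloading trades some latency for lower memory because each sub-model is copied to the GPU on demand, which is why the table tracks both metrics.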
## Quantization and torch.compile

> [!TIP]
> The quantization backend, such as [bitsandbytes](../quantization/bitsandbytes#torchcompile), must be compatible with torch.compile. Refer to the quantization [overview](https://huggingface.co/docs/transformers/quantization/overview#overview) table to see which backends support torch.compile.
- Start by [quantizing](../quantization/overview) a model to reduce the memory required to store it and [compiling](./fp16#torchcompile) it to accelerate inference.
+ Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.
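A minimal sketch of this first step follows, assuming a Flux checkpoint quantized to 4-bit NF4 with bitsandbytes; the checkpoint, quantization settings, and use of `FluxPipeline`/`FluxTransformer2DModel` are illustrative assumptions, while the prompt string is the one from the snippet in this diff.

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Illustrative assumption: 4-bit NF4 quantization with a bfloat16 compute dtype.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize the transformer, the largest component, to shrink the memory needed to store it.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the quantized denoiser in place to accelerate the sampling loop.
pipeline.transformer.compile()

prompt = "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
image = pipeline(prompt).images[0]
image.save("cat.png")
```

The first call is slow while torch.compile traces and optimizes the model; later calls reuse the compiled graph, which is where the latency benefit comes from.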