From 336ce6f0fd9b76ee263eb88f3ed0a5787fe9bc7d Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Tue, 15 Jul 2025 14:29:51 -0700 Subject: [PATCH 01/27] draft --- docs/configuration/other_hardware/tpu_tips.md | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 docs/configuration/other_hardware/tpu_tips.md diff --git a/docs/configuration/other_hardware/tpu_tips.md b/docs/configuration/other_hardware/tpu_tips.md new file mode 100644 index 000000000000..7cfd85cee1f1 --- /dev/null +++ b/docs/configuration/other_hardware/tpu_tips.md @@ -0,0 +1,22 @@ +# TPU Optimization Tips + +This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload. + +### TPU workload sizing +- [link to easy HBM calculator colab] + +### Optimize based on your data +- max model len +- most model len +- padding + +### If possible, use the precision supported by the chip +- v5e has bf16 hardware acceleration +- v6e has int8/int4 hardware acceleration + +### Don't set TP to be less than the number of chips on the host +- If you need 1 or 4 chips, just create an instance with 1 or 4 chips, don't try to fragment 2 different workloads across 8 chips. + + + + From ecd6308c104bacc1fdabbd1b4b19039a82709aa4 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Tue, 15 Jul 2025 14:32:59 -0700 Subject: [PATCH 02/27] rename --- docs/configuration/{other_hardware/tpu_tips.md => tpu/README.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/configuration/{other_hardware/tpu_tips.md => tpu/README.md} (100%) diff --git a/docs/configuration/other_hardware/tpu_tips.md b/docs/configuration/tpu/README.md similarity index 100% rename from docs/configuration/other_hardware/tpu_tips.md rename to docs/configuration/tpu/README.md From de7986ef71aae049e4ab80eff812d1f1469b7db2 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Tue, 15 Jul 2025 14:37:39 -0700 Subject: [PATCH 03/27] add tuner --- docs/configuration/tpu/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 7cfd85cee1f1..436869157d6e 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -17,6 +17,5 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU wo ### Don't set TP to be less than the number of chips on the host - If you need 1 or 4 chips, just create an instance with 1 or 4 chips, don't try to fragment 2 different workloads across 8 chips. - - - +### Tune your workloads! +- Although we try to have great default configs, we strongly recommend you checkout our auto-tuner and optimize for your workload[LINK]. From 19164c87a9bb5efe47d3668bbbe2b260fd17c13c Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Tue, 15 Jul 2025 15:05:19 -0700 Subject: [PATCH 04/27] add information about calculator --- docs/configuration/tpu/README.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 436869157d6e..3245f0ed4c1d 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -3,7 +3,13 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload. 
### TPU workload sizing -- [link to easy HBM calculator colab] +- The following calculator will tell you: + - KV cache size requirement per token and per request + - GPU memory consumed by the model weights + - GPU memory allocated for KV cache + - Maximum # of requests you can set (--max-num-seqs) + +[link to easy HBM calculator colab] ### Optimize based on your data - max model len From a0eedfa81cb298797d071a36d49fdbd8b0f1ba4b Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Thu, 17 Jul 2025 16:59:39 -0700 Subject: [PATCH 05/27] updating the docs --- docs/configuration/tpu/README.md | 73 ++++++++++++++++------ docs/configuration/tpu/most_model_len.png | Bin 0 -> 12126 bytes 2 files changed, 54 insertions(+), 19 deletions(-) create mode 100644 docs/configuration/tpu/most_model_len.png diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 3245f0ed4c1d..86547243f3de 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -1,27 +1,62 @@ -# TPU Optimization Tips +# **TPU Optimization Tips** This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload. -### TPU workload sizing -- The following calculator will tell you: - - KV cache size requirement per token and per request - - GPU memory consumed by the model weights - - GPU memory allocated for KV cache - - Maximum # of requests you can set (--max-num-seqs) +### **Get started** -[link to easy HBM calculator colab] +Looking for setup and installation instructions? Find them [here](https://vllm--19708.org.readthedocs.build/en/19708/getting_started/installation/google_tpu.html). -### Optimize based on your data -- max model len -- most model len -- padding +### **TPU workload sizing** -### If possible, use the precision supported by the chip -- v5e has bf16 hardware acceleration -- v6e has int8/int4 hardware acceleration +When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed. -### Don't set TP to be less than the number of chips on the host -- If you need 1 or 4 chips, just create an instance with 1 or 4 chips, don't try to fragment 2 different workloads across 8 chips. +The following colab [calculator](https://colab.sandbox.google.com/drive/1M_f3xZm-_Ce2D-UMAyGNyacEIN-6rUbf) will tell you: -### Tune your workloads! -- Although we try to have great default configs, we strongly recommend you checkout our auto-tuner and optimize for your workload[LINK]. +- KV cache size requirement per token and per request +- TPU/GPU memory consumed by the model weights +- TPU/GPU memory allocated for the KV cache +- Maximum \# of requests you can approximately set (--max-num-seqs) + +This approach serves as a general rule of thumb. As latency becomes more important, you may want to reduce –max-num-seqs and/or increase the number of chips in increments of 128. + +### **Optimize based on your data** + +#### *max model len vs. most model len* + +![image](most_model_len.png) + +Here's a nifty trick: Setting max model len too high can hurt performance when the bulk of your traffic isn't anywhere near the max length, but you still need to retain the ability to handle the off max length request. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. 
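+
+As a rough, hypothetical sketch of this setting (the model name and token counts below are placeholders, not values prescribed by this doc), it can be combined with the max model length via the offline `LLM` API:
+
+```python
+import os
+
+# Assumed traffic mix: mostly ~2k-token requests with rare 32k outliers.
+os.environ["VLLM_TPU_MOST_MODEL_LEN"] = "2000"
+
+from vllm import LLM, SamplingParams
+
+# Placeholder model; any TPU-supported checkpoint is configured the same way.
+llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=32000)
+outputs = llm.generate(["Hello from TPU!"], SamplingParams(max_tokens=32))
+print(outputs[0].outputs[0].text)
+```
+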
+ +For example, 1% requests are 32k length and 99% requests are 2k length. You can pass 32k into `--max-model-len 32000` and use `VLLM_TPU_MOST_MODEL_LEN=2000`. + +The requests get subdivided into max-model-len and most-model-len categories, for the latter category, we can gain better performance since the server can process more requests at a time. + +#### *Padding* + +For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128: 128, 256, etc. + +The server pads the requests into fixed lengths before sending them to the model to avoid recompilation. Currently, there are 2 ways to pad the requests: + +1) the default exponential padding (pad to the nearest power of 2) +2) bucket padding (pad to the nearest linearly increasing bucket). + +When using bucket padding, the buckets start from 16, end at max_model_len, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`. + +For example, max_model_len=512, padding_gap=64, the buckets will be [16, 32, 64, 128, 192, 256, 320, 384, 448, 512]. + +The fewer tokens we pad, the less unnecessary computation TPU does, the better performance we can get. For example, if num_tokens=300, with exponential padding, we pad to 512, with the bucket_padding above, we pad to 320. + +However, you need to be careful to choose the padding gap. If the gap is too small, it means the number of buckets is large, leading to increased warmup (precompile) time and higher memory to store the compiled graph. Too many compilaed graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding. + +### **If possible, use the precision that matches the chip’s hardware acceleration** + +- v5e has int4/int8 hardware acceleration in the MXU +- v6e has int4/int8 hardware acceleration in the MXU + +### **Don't set TP to be less than the number of chips on a single-host deployment** + +Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). + +### **Tune your workloads!** + +Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](https://github.com/vllm-project/vllm/pull/20779/files?short_path=f9b273a#diff-f9b273a10e0688ba63c38bd93a2e64ceb54d4fdd7ff7b82d347df06d0d34e39c) to optimize your workloads for your use case. 
\ No newline at end of file diff --git a/docs/configuration/tpu/most_model_len.png b/docs/configuration/tpu/most_model_len.png new file mode 100644 index 0000000000000000000000000000000000000000..344a81ed90801ee1a2ff1343f3609c8318c96f75 GIT binary patch literal 12126 zcmds73p~_m_aD~emZDhORwAoha+}F5>o&x=MZ>sX8fKVb#*8uUAw{}ucnc{ZMGRq+ zyO3@w6^WF4wGtUp?za4&-C?O25Nbqf5JL`WB8NcWXfzc(3Fk}4QG!$g2n-+s<2YYJ0B3?VHG)JY3k;7IVYBmwQF z8b~b_q$Uu{TcYi)t>h5K;5(V*PXNEn2;TlwXo)eAPNe{esX9_yMGg7`gmyR|9G$d6 z1+>!;0zHUC4Ol)50-6kTLjytHI5J^btiCMqqSEmM`mzY{c7awnFE1a;ziZ>_q=cE#aWtYe70XBbJqml{2sPhe5}pvm(YCyjL8X!zB-$^9 z-qe5qf;R_Bjz%1vP7VF#G#@INb2`o(8aVhbv_k^BUzrq=I+E}V;0us5X=`w{2g%U{ zk}r|_W_2~r2nENT%n<|;ad>Lz@_M#tB`V8y5`KwCM?okR&UCXgYA&r=zX~$*BR1 z6>=Z;50D2j=v05giV=i~mL`A;if$;7Xt$@Y_(V!1QNS*sTqjL_>4h95T*}K(?g@kvOT7zT9p81UhKUe=+_O z_*^$36MUcz{H~L%NXyE9TVCGDH6_OgJb%uUvxkyHG-A))}<>AyOpe=2=B@FBUe zy5d^>hlKQh(9`^rL;7!vXas^&jdCk39nSL#m*58P%1VnH_0`=v@^n@Qq1M(j>lG~GKQ#1c>F!h=cwHA7?12T-T~)Y z^k&bMF4oL`+lagqoH;SPIIpk2Wq9c0{Qil0L38kPCllbt4t#Y-Va8czZ=t<{YYjbbvo`Je!t`6; z#TnLB(vwThlP_;mo?dni8O{vtzISSSv2SZuW!R)wOSVDdG^?kpHiR|W4YpJ?J=)ol zIZzwyzi}6%fYM!l4~y9nJ(qK0v+mPRKYp9UOd-u%+_#!;Bs$h2bA30vr zoE0_W+~pf@ohy@PYj1dBlTJInGsl?UwZbn|*te!HxXWEL@TEzHK}>^nwtnBEYy&FL zYB^vuG%Da`d?48?0xutHi!OG$a-%-hf6MDLzFAkEz+Hl4FBV#)_EZPHa$8(T z-ENcBl9M|l?xGX)nt{7?Ks#hCy)Sx^RT=oov6p@L3 z+Zysa4yUAjD0D0xpUJTAyf*jp_)HW3mg=Yrt@?`(^sLSk^jRN&xcjf0EJtP;dc?Lf zxIWpXDj>6G{S$iso5vU3VSW=paiK?xM~A!wQBv;$%ki<&3YRO_%}>-05j`tfSbRMI!6n5fl@4Q%(bwH zEem0}%&NulVtDe5TaF5?#v&v38QFXqDZmz`QJ$CG7AFg(ZqEw4feqPZAFNP4HIHuG zV?i0^S;wjvAwIqk|1d+V6$>`57~Xk$!?5sd$|PKX`JKVx-X;yc^4{PcN>z;(Q7@#} zGa`BLzHw5bf8(~^z2(HcO7;b|g_*%8^de`RnhwHMC#>W5JdN&T+U563IlsTJte|jD zna87=)}*&DkbuoEJ%o`68_MYdDTHN^{WP@fS$$-c>w?{Yj_BRHym9uXQhEzgNS=D8 zs!O>%IkK?9ucJWhamAF=HK~}S&YaSV?NUzF$r$Y2SVJEP>0HFyyL8VH#q&aDonrjF z0w&u*8XInYE)WOC(!Ub8WqKg_*a_KU>33UrP{#yg4as~0%xBu|LcDR}L9kdDc!JrI z5DoAtcaP&^9`%So+$Tr zHQly5La5Y?J%W}dqp`&c21&K)l_lS7lTt@1QbNwfgZ<2HHAK%N)=hl@op(?XxQ02q zx*046+hn!1sYBZYQXR;U(x0>RI#wpNZC6vzqjvV5W9QZhY{6h(3CGs4(S2tt{rJ1& z_cXUR>(~IE5yM!0!9E-nE}%e7IGiX09Z?kX0I~#(=W~|ch_}ykTz`#4E%08dYF$)m zH-fet`;~y(^s84A7;Moe)m7H~PfWaCiuE863h8%9640TpEAgQ27ri)-WQWB)aO~G# z{Gl`dwu^rbQk?Pqo_^D3>d_DO$L+he37qbomGn(4hB<9%^{rQ&yOk6om~Xb2a<&Z(tJR|@) zcU2kA-vuQrLvc_ViHDtIU#6m$L6dq}wODp} z2@Q+ZAhuoC>vL?3{wROedcbZPpPl&NU5eL&1w9x#Wfc6z;S^p)*~X^zm=RrOHKbBh zDfWBM2St90va32rJBzG&QpFafig@xiLkp0CAoA*;D-|$SOMUwc3=iwikOXa`R&UO^ zZ=P)zgQQEOpV=lkdl1^Q673=Nc*>6tJtZxqRx@j8@*Su~rLmILwYJmi{ZuQus>?FJ zYe-7_+Q#D$n0_n6@cclXs<^iQ)5|q(sdA+cZm>4b+ZW&N!4F15fjYp)E5f8{73~v> zmy9%Pu;yQYmd(l0ZW>!ul};2j_y5ateQZDE+xkMeuIOTyiu*iaisT8M-6*BREv z+xyJzc{`~eUL6CSh6qvu5!cNw3eW)LRG(dYOrRVn-~#f4OZp5uKeX>iR*}gwYnzWK z*qO9p`&OVoP!@m8noiYi}Vd&^=%Ch zN*T_IeB^G{ga{dPDgu3uYj@ zldb<0Fek!T3^*s?iHaRdJuXfBPy&+?{%z?2$p1&UVDE`cv?)>^x4t%+ufYjY~I!3?m0X8HdPa(54 zz#U+*<$`W&6tFr%M?7Obc4Dm~zHope1A4Vx%kW1R0i_`&x!U^d&Ok>y;?RRLPds8i z20PYUozs52D`boq^W$Vfra??U>qD+jjjA|C$>FjZYw1CjUQbr|>kYb53*|+HfD`fo z{d*%Y>qEe#G__6^&t2M=Py`-u zU!$Nrudojs_J_3liJNr1AkIZ1rJpFlC!uKnJz$nx8}`p{>N{_Ai2Ul#iAu%@MsiKz zHAcb0Uc%l>R%ec&{JKiqOE8$ui`RPr@8e$zg2|y_f4HQ83ubkcM@+uon|s@{84}0S z_@=KUhW3XyH!OXMbd(s4F8Q|O(zwz|3k#Xz4<+B?rvNX}Lv8%{I!!*fz;6+_K*{{z zst;?SvSqNF_q1`LS&~#qyFh%$TSb_3bs7?a;3WC)i>dX zZdh(n;@b+?m)og3oDsK^{Qya%_x`TzxJzN8{XKc@hEEIrRX<)zK$d&v8e7H)*<)=B zXT)<0+u4LjW~rb2$z(-`%jp_`f+$zb6S8RrR(6+Qs&!JdwjUvTcWyz z6_j(B$DU-R&m=U=zuxHIS(xhT(Sdnowj~M=4s=ymQl&c2LeDlwj|p7jEU#(jd6bh8 za^H2>+YY9kiA>(2Bkk=%!~v5oMtC-3{TAlY0#=*<_F3Fm0i9JK7O08e1j;nlLC5)2-T`J^jOlHDHr!cGac}!I)h( 
zHmeRNf}By8FC+25@o)#Ajo*~CvOC9Q&DbM=Pf0pH@%(BtYi<0@CDiGshj=5(_C)Tz zw-&oALIBkyIc~fAQr@i&sR(foKou&RC2l%x^F(%N)b+)UcvaQr_)nOa{Crr&RKD5E zQ(OFO7I81u)K}W(pqA#9^qv~8+`(VlI0U@ z1J|TA`5bhKO4DRPO}#Pafd>0!?cW%p^NyZ4#eN8ggUJNTrqrxrUvOu^yq#A9+1=&! e6QXl5M|<@J;ajtk!GD;CneX0XQi9rd@P7b99exG? literal 0 HcmV?d00001 From 62e520eff66c33c7e43e637731e02e50d200486e Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Thu, 17 Jul 2025 17:05:09 -0700 Subject: [PATCH 06/27] update --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 86547243f3de..7cc788bbc7c1 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -25,7 +25,7 @@ This approach serves as a general rule of thumb. As latency becomes more importa ![image](most_model_len.png) -Here's a nifty trick: Setting max model len too high can hurt performance when the bulk of your traffic isn't anywhere near the max length, but you still need to retain the ability to handle the off max length request. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. +If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. For example, 1% requests are 32k length and 99% requests are 2k length. You can pass 32k into `--max-model-len 32000` and use `VLLM_TPU_MOST_MODEL_LEN=2000`. From caf6377158fd176e0201ab8787f2eb9122c2ffcf Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Thu, 17 Jul 2025 17:17:41 -0700 Subject: [PATCH 07/27] add link to more docs --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 7cc788bbc7c1..25861bb4ced7 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -35,7 +35,7 @@ The requests get subdivided into max-model-len and most-model-len categories, fo For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128: 128, 256, etc. -The server pads the requests into fixed lengths before sending them to the model to avoid recompilation. Currently, there are 2 ways to pad the requests: +The server pads the requests into fixed lengths before sending them to the model to avoid recompilation. To read more about tpu padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are 2 ways to pad the requests: 1) the default exponential padding (pad to the nearest power of 2) 2) bucket padding (pad to the nearest linearly increasing bucket). 
From a764418986aedcf27eee1abe7294b0f62a5d6a5c Mon Sep 17 00:00:00 2001 From: Brittany <24945384+bvrockwell@users.noreply.github.com> Date: Thu, 17 Jul 2025 17:56:00 -0700 Subject: [PATCH 08/27] Update docs/configuration/tpu/README.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 25861bb4ced7..47224bfe688f 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -17,7 +17,7 @@ The following colab [calculator](https://colab.sandbox.google.com/drive/1M_f3xZm - TPU/GPU memory allocated for the KV cache - Maximum \# of requests you can approximately set (--max-num-seqs) -This approach serves as a general rule of thumb. As latency becomes more important, you may want to reduce –max-num-seqs and/or increase the number of chips in increments of 128. +This approach serves as a general rule of thumb. As latency becomes more important, you may want to reduce --max-num-seqs and/or increase the number of chips in increments of 128. ### **Optimize based on your data** From 415de9c6ac25e7f76d8d8290873cd06271574527 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Mon, 21 Jul 2025 15:45:40 -0700 Subject: [PATCH 09/27] expand on certain topics --- docs/configuration/tpu/README.md | 36 ++++++++++++++++++++++++++++++-- 1 file changed, 34 insertions(+), 2 deletions(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 47224bfe688f..f2448b140edc 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -17,7 +17,28 @@ The following colab [calculator](https://colab.sandbox.google.com/drive/1M_f3xZm - TPU/GPU memory allocated for the KV cache - Maximum \# of requests you can approximately set (--max-num-seqs) -This approach serves as a general rule of thumb. As latency becomes more important, you may want to reduce --max-num-seqs and/or increase the number of chips in increments of 128. +This approach serves as a general rule of thumb. + +#### Latency-throughput tradeoff + +As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` in increments of 128 and/or increasing the number of chips can help reduce latency. + +`--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request. + +Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload. + +#### Compilation and Caching + +Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process. 
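+
+To make the shape sensitivity concrete, here is a tiny illustration in plain JAX (not vLLM internals; the shapes are arbitrary): each new input shape triggers its own XLA compilation, while a previously seen shape reuses the cached graph.
+
+```python
+import jax
+import jax.numpy as jnp
+
+jax.config.update("jax_log_compiles", True)  # ask JAX to log each XLA compilation
+
+@jax.jit
+def score(x, w):
+    return jnp.dot(x, w)
+
+w = jnp.ones((128, 128))
+score(jnp.ones((8, 128)), w)   # shape (8, 128) is new  -> compile
+score(jnp.ones((16, 128)), w)  # shape (16, 128) is new -> compile again
+score(jnp.ones((8, 128)), w)   # shape (8, 128) cached  -> no recompilation
+```
+
+This is why vLLM pads requests to a small set of fixed lengths on TPU, as discussed in the padding section below.
+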
+ +To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves these compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). This process can range significantly, anywhere from a few minutes to an hour depending on the size of the model and context length used. + +Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs. + +Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future launches. + +#### Reducing compilation time +This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-model-len`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. ### **Optimize based on your data** @@ -59,4 +80,15 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ ### **Tune your workloads!** -Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](https://github.com/vllm-project/vllm/pull/20779/files?short_path=f9b273a#diff-f9b273a10e0688ba63c38bd93a2e64ceb54d4fdd7ff7b82d347df06d0d34e39c) to optimize your workloads for your use case. \ No newline at end of file +Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](https://github.com/vllm-project/vllm/tree/main/benchmarks/auto_tune) to optimize your workloads for your use case. + + +### Future Topics We'll Cover + +#### **Profiling** + +The above auto tuner will produce a profile of the most optimized configs as a last step, however, we acknoledge interpreting the profile can be challenging for new users.We will expand on this section in the future, but feel free to read up on [how to collect a TPU profile](https://docs.vllm.ai/en/latest/examples/offline_inference/profiling_tpu.html) in the meantime natively in vLLM. + +#### **SPMD** + +#### Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips. \ No newline at end of file From 95c0174feb3958e87a86b7a60c3392d756ad353a Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Mon, 21 Jul 2025 15:50:54 -0700 Subject: [PATCH 10/27] clean up --- docs/configuration/tpu/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index f2448b140edc..b3444a80806a 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -87,8 +87,9 @@ Although we try to have great default configs, we strongly recommend you check o #### **Profiling** -The above auto tuner will produce a profile of the most optimized configs as a last step, however, we acknoledge interpreting the profile can be challenging for new users.We will expand on this section in the future, but feel free to read up on [how to collect a TPU profile](https://docs.vllm.ai/en/latest/examples/offline_inference/profiling_tpu.html) in the meantime natively in vLLM. +The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. 
We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](https://docs.vllm.ai/en/latest/examples/offline_inference/profiling_tpu.html). This profile can provide valuable insights into your workload's performance. #### **SPMD** +More details to come. #### Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips. \ No newline at end of file From ae718904c56d53ae9583130dae1aedb8cf288bfd Mon Sep 17 00:00:00 2001 From: Brittany <24945384+bvrockwell@users.noreply.github.com> Date: Mon, 21 Jul 2025 16:07:00 -0700 Subject: [PATCH 11/27] Update docs/configuration/tpu/README.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index b3444a80806a..91060dc84843 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -4,7 +4,7 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU wo ### **Get started** -Looking for setup and installation instructions? Find them [here](https://vllm--19708.org.readthedocs.build/en/19708/getting_started/installation/google_tpu.html). +Looking for setup and installation instructions? Find them [here](https://docs.vllm.ai/en/latest/getting_started/installation/google_tpu.html). ### **TPU workload sizing** From d6d6127e9429589fe42c76615e7afc6334363879 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Mon, 21 Jul 2025 17:26:15 -0700 Subject: [PATCH 12/27] update --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 91060dc84843..2719aa26cd99 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -44,7 +44,7 @@ This initial compilation time ranges significantly and is impacted by many of th #### *max model len vs. most model len* -![image](most_model_len.png) +![image](docs/configuration/tpu/most_model_len.png) If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. From e24df037cf099b3024929c444bf71996fd162083 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Mon, 21 Jul 2025 17:27:05 -0700 Subject: [PATCH 13/27] update --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 2719aa26cd99..91060dc84843 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -44,7 +44,7 @@ This initial compilation time ranges significantly and is impacted by many of th #### *max model len vs. most model len* -![image](docs/configuration/tpu/most_model_len.png) +![image](most_model_len.png) If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. 
In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. From cd36e2ae0a4deca3d34eb2248c0078536fdfa501 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Tue, 22 Jul 2025 16:01:10 -0700 Subject: [PATCH 14/27] update calculator URL --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 91060dc84843..2122059eaaa8 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -10,7 +10,7 @@ Looking for setup and installation instructions? Find them [here](https://docs.v When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed. -The following colab [calculator](https://colab.sandbox.google.com/drive/1M_f3xZm-_Ce2D-UMAyGNyacEIN-6rUbf) will tell you: +The following colab [calculator](https://colab.research.google.com/github/ericehanley/rightsize-vllm/blob/main/HBM_Calculator.ipynb) will tell you: - KV cache size requirement per token and per request - TPU/GPU memory consumed by the model weights From e3f9818e2afc91f48a0d9373df6d41561598ba21 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Tue, 22 Jul 2025 16:22:38 -0700 Subject: [PATCH 15/27] update image path --- .../design/v1}/tpu/most_model_len.png | Bin docs/configuration/tpu/README.md | 4 +++- 2 files changed, 3 insertions(+), 1 deletion(-) rename docs/{configuration => assets/design/v1}/tpu/most_model_len.png (100%) diff --git a/docs/configuration/tpu/most_model_len.png b/docs/assets/design/v1/tpu/most_model_len.png similarity index 100% rename from docs/configuration/tpu/most_model_len.png rename to docs/assets/design/v1/tpu/most_model_len.png diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 2122059eaaa8..38a0d9924dd4 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -44,7 +44,9 @@ This initial compilation time ranges significantly and is impacted by many of th #### *max model len vs. most model len* -![image](most_model_len.png) +
+![most_model_len](../assets/design/v1/tpu/most_model_len.png) +
If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. From 8527af4ac0e0e373cf3135c59a2bd340862dea58 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Tue, 22 Jul 2025 16:24:00 -0700 Subject: [PATCH 16/27] update --- docs/configuration/tpu/README.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 38a0d9924dd4..711a10bd17c7 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -44,9 +44,7 @@ This initial compilation time ranges significantly and is impacted by many of th #### *max model len vs. most model len* -
![most_model_len](../assets/design/v1/tpu/most_model_len.png) -
If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. From 661debe517eb6fe7bfa0a10caec070af601799e6 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Wed, 23 Jul 2025 11:18:37 -0700 Subject: [PATCH 17/27] fix image --- docs/configuration/tpu/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 711a10bd17c7..1e52f2111028 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -44,7 +44,7 @@ This initial compilation time ranges significantly and is impacted by many of th #### *max model len vs. most model len* -![most_model_len](../assets/design/v1/tpu/most_model_len.png) +![most_model_len](../../assets/design/v1/tpu/most_model_len.png) If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. From 6f581e8bf173df7b7b5acc7deb2f6f2bbe3c0025 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Wed, 23 Jul 2025 18:59:50 -0700 Subject: [PATCH 18/27] updates --- docs/configuration/tpu/README.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu/README.md index 1e52f2111028..1e74de977dae 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu/README.md @@ -21,12 +21,14 @@ This approach serves as a general rule of thumb. #### Latency-throughput tradeoff -As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` in increments of 128 and/or increasing the number of chips can help reduce latency. +As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` and/or increasing the number of chips can help reduce latency. `--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request. Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload. +In a similar way, `--max-num-batch-tokens` can be adjusted down to improve latency, or adjusted up to improve throughput. + #### Compilation and Caching Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process. 
@@ -38,7 +40,7 @@ Although the first compilation can take some time, for all subsequent server lau Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future launches. #### Reducing compilation time -This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-model-len`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. +This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. ### **Optimize based on your data** @@ -48,7 +50,7 @@ This initial compilation time ranges significantly and is impacted by many of th If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. -For example, 1% requests are 32k length and 99% requests are 2k length. You can pass 32k into `--max-model-len 32000` and use `VLLM_TPU_MOST_MODEL_LEN=2000`. +For example, 1% requests are 32k length and 99% requests are 2k length. You can pass 32k into `--max-model-len 32768` and use `VLLM_TPU_MOST_MODEL_LEN=2048`. The requests get subdivided into max-model-len and most-model-len categories, for the latter category, we can gain better performance since the server can process more requests at a time. From 84d4e0c171b28157c849a6a5587173bd1e295905 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Fri, 25 Jul 2025 11:01:26 -0700 Subject: [PATCH 19/27] clean up --- docs/configuration/{tpu/README.md => tpu.md} | 24 ++++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) rename docs/configuration/{tpu/README.md => tpu.md} (92%) diff --git a/docs/configuration/tpu/README.md b/docs/configuration/tpu.md similarity index 92% rename from docs/configuration/tpu/README.md rename to docs/configuration/tpu.md index 1e74de977dae..7e2eb4737085 100644 --- a/docs/configuration/tpu/README.md +++ b/docs/configuration/tpu.md @@ -1,12 +1,12 @@ -# **TPU Optimization Tips** +# TPU Optimization Tips This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload. -### **Get started** +### Get started Looking for setup and installation instructions? Find them [here](https://docs.vllm.ai/en/latest/getting_started/installation/google_tpu.html). -### **TPU workload sizing** +### PU workload sizing When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed. @@ -42,9 +42,9 @@ Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for #### Reducing compilation time This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. 
-### **Optimize based on your data** +### Optimize based on your data -#### *max model len vs. most model len* +#### max model len vs. most model len ![most_model_len](../../assets/design/v1/tpu/most_model_len.png) @@ -54,7 +54,7 @@ For example, 1% requests are 32k length and 99% requests are 2k length. You can The requests get subdivided into max-model-len and most-model-len categories, for the latter category, we can gain better performance since the server can process more requests at a time. -#### *Padding* +#### Padding For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128: 128, 256, etc. @@ -71,27 +71,27 @@ The fewer tokens we pad, the less unnecessary computation TPU does, the better p However, you need to be careful to choose the padding gap. If the gap is too small, it means the number of buckets is large, leading to increased warmup (precompile) time and higher memory to store the compiled graph. Too many compilaed graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding. -### **If possible, use the precision that matches the chip’s hardware acceleration** +**If possible, use the precision that matches the chip’s hardware acceleration** - v5e has int4/int8 hardware acceleration in the MXU - v6e has int4/int8 hardware acceleration in the MXU -### **Don't set TP to be less than the number of chips on a single-host deployment** +**Don't set TP to be less than the number of chips on a single-host deployment** Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). -### **Tune your workloads!** +### Tune your workloads! Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](https://github.com/vllm-project/vllm/tree/main/benchmarks/auto_tune) to optimize your workloads for your use case. ### Future Topics We'll Cover -#### **Profiling** +#### Profiling The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](https://docs.vllm.ai/en/latest/examples/offline_inference/profiling_tpu.html). This profile can provide valuable insights into your workload's performance. -#### **SPMD** +#### SPMD More details to come. -#### Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips. \ No newline at end of file +**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. 
We'd love to hear your questions or tips.** \ No newline at end of file From f3cc0ee34a1e6e997dfd750de19175b6b1b74e53 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Fri, 25 Jul 2025 11:11:07 -0700 Subject: [PATCH 20/27] updating --- docs/configuration/tpu.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 7e2eb4737085..e0b905fdfe9e 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -4,7 +4,7 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU wo ### Get started -Looking for setup and installation instructions? Find them [here](https://docs.vllm.ai/en/latest/getting_started/installation/google_tpu.html). +Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md). ### PU workload sizing @@ -46,7 +46,7 @@ This initial compilation time ranges significantly and is impacted by many of th #### max model len vs. most model len -![most_model_len](../../assets/design/v1/tpu/most_model_len.png) +![most_model_len](../assets/design/v1/tpu/most_model_len.png) If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. @@ -82,14 +82,14 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ ### Tune your workloads! -Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](https://github.com/vllm-project/vllm/tree/main/benchmarks/auto_tune) to optimize your workloads for your use case. +Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case. ### Future Topics We'll Cover #### Profiling -The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](https://docs.vllm.ai/en/latest/examples/offline_inference/profiling_tpu.html). This profile can provide valuable insights into your workload's performance. +The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../../examples/offline_inference/profiling_tpu/README.md). This profile can provide valuable insights into your workload's performance. #### SPMD More details to come. 
From b58940d43b45f5ad9508b31abaa4ffe86c91bbf6 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Fri, 25 Jul 2025 11:12:30 -0700 Subject: [PATCH 21/27] fix --- docs/configuration/tpu.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index e0b905fdfe9e..66fba3263d9a 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -6,7 +6,7 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU wo Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md). -### PU workload sizing +### TPU workload sizing When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed. From 4194f12c22b653c0888ac4537e74d2b6a22a6d30 Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Fri, 25 Jul 2025 11:14:51 -0700 Subject: [PATCH 22/27] reword --- docs/configuration/tpu.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 66fba3263d9a..4b9d628d56d0 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -37,7 +37,7 @@ To manage this, vLLM performs a one-time "warmup" process when you first launch Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs. -Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future launches. +Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling). #### Reducing compilation time This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. From 34d6ce6a13f53964bcc17685d4a4a91d8293f2fb Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Fri, 25 Jul 2025 15:09:45 -0700 Subject: [PATCH 23/27] update quantization --- docs/configuration/tpu.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 4b9d628d56d0..07ada0cbfd5f 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -76,6 +76,13 @@ However, you need to be careful to choose the padding gap. If the gap is too sma - v5e has int4/int8 hardware acceleration in the MXU - v6e has int4/int8 hardware acceleration in the MXU +Supported quantized formats and features in vLLM on TPU [Jul '25] +- INT8 W8A8 +- INT8 W8A16 +- FP8 KV cache +- [WIP] FP8 W8A8 +- [WIP] AWQ + **Don't set TP to be less than the number of chips on a single-host deployment** Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). 
From 6806be75551c5be944649d660bb62b8e56b98d4f Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Fri, 25 Jul 2025 15:17:49 -0700 Subject: [PATCH 24/27] quantization update --- docs/configuration/tpu.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 07ada0cbfd5f..11cb977c2321 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -82,6 +82,7 @@ Supported quantized formats and features in vLLM on TPU [Jul '25] - FP8 KV cache - [WIP] FP8 W8A8 - [WIP] AWQ +- [WIP] FP4 W4A8 **Don't set TP to be less than the number of chips on a single-host deployment** From 13e46258f3f55929be1f8ee2ce558c647d0b78cc Mon Sep 17 00:00:00 2001 From: bvrockwell Date: Fri, 25 Jul 2025 17:41:40 -0700 Subject: [PATCH 25/27] lint --- docs/configuration/tpu.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 11cb977c2321..ba97513b0a2b 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -2,7 +2,7 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload. -### Get started +## Get started Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md). @@ -17,7 +17,7 @@ The following colab [calculator](https://colab.research.google.com/github/ericeh - TPU/GPU memory allocated for the KV cache - Maximum \# of requests you can approximately set (--max-num-seqs) -This approach serves as a general rule of thumb. +This approach serves as a general rule of thumb. #### Latency-throughput tradeoff @@ -35,12 +35,12 @@ Coming from a GPU background, one of the key differences you'll notice with TPUs To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves these compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). This process can range significantly, anywhere from a few minutes to an hour depending on the size of the model and context length used. -Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs. +Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs. -Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling). +Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling). #### Reducing compilation time -This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. +This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. 
### Optimize based on your data @@ -61,9 +61,9 @@ For online serving with latency requirements, consider switching to bucket paddi The server pads the requests into fixed lengths before sending them to the model to avoid recompilation. To read more about tpu padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are 2 ways to pad the requests: 1) the default exponential padding (pad to the nearest power of 2) -2) bucket padding (pad to the nearest linearly increasing bucket). +2) bucket padding (pad to the nearest linearly increasing bucket). -When using bucket padding, the buckets start from 16, end at max_model_len, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`. +When using bucket padding, the buckets start from 16, end at max_model_len, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`. For example, max_model_len=512, padding_gap=64, the buckets will be [16, 32, 64, 128, 192, 256, 320, 384, 448, 512]. @@ -92,8 +92,7 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case. - -### Future Topics We'll Cover +### Future Topics We'll Cover #### Profiling @@ -102,4 +101,4 @@ The auto-tuner provides a profile of optimized configurations as its final step. #### SPMD More details to come. -**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.** \ No newline at end of file +**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.** From f981ae6cac568ffad0fbfde593f66048abee43bc Mon Sep 17 00:00:00 2001 From: Brittany <24945384+bvrockwell@users.noreply.github.com> Date: Mon, 28 Jul 2025 09:55:12 -0700 Subject: [PATCH 26/27] Update docs/configuration/tpu.md Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> --- docs/configuration/tpu.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index ba97513b0a2b..0c2db8e74b00 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -96,7 +96,7 @@ Although we try to have great default configs, we strongly recommend you check o #### Profiling -The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../../examples/offline_inference/profiling_tpu/README.md). This profile can provide valuable insights into your workload's performance. +The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu/README.md). This profile can provide valuable insights into your workload's performance. #### SPMD More details to come. 
From ea67d204d4f05cc282f5e2c221e236f42d864efa Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Tue, 29 Jul 2025 15:05:19 +0100 Subject: [PATCH 27/27] Fix example link Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> --- docs/configuration/tpu.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 0c2db8e74b00..005b7f78f440 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -96,7 +96,7 @@ Although we try to have great default configs, we strongly recommend you check o #### Profiling -The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu/README.md). This profile can provide valuable insights into your workload's performance. +The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance. #### SPMD More details to come.