[Feature] Support using prefix-caching + cudagraph for inference #2924


Merged: 14 commits merged into PaddlePaddle:develop on Jul 22, 2025

Conversation

zeroRains (Contributor) commented Jul 19, 2025

pcard-71500

Support the scenario where prefix-caching and cudagraph are enabled at the same time.

Verification results:
All four scenarios start successfully as both multi-GPU and single-GPU services, and the inference output contains no garbled text.
cudagraph + no_profile
cudagraph + profile
cudagraph + profile + prefix-caching
cudagraph + no_profile + prefix-caching

Analysis of the incompatibilities and their solutions (a minimal sketch of the handshake follows this list):

  1. CUDA graph capture uses dummy_run, which never triggers fetching cache from the cache_manager.
    • Add a condition: when prefix-caching is enabled, fetch cache from the cache_manager before running dummy_run.
  2. When running in profile mode there is a circular dependency that prevents profile mode from working: graph capture requires the cache_manager to be started, starting the cache_manager requires the worker to finish starting, and the worker only finishes starting after graph capture completes.
    • Add a signal: when the global num_blocks is obtained, raise the signal to notify the engine to start the cache_manager.
    • Before capturing the graph, wait until the cache_manager has finished starting.
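
A minimal sketch of this handshake, assuming a plain threading.Event stands in for the real cross-process signal (in FastDeploy the signal crosses process boundaries); all names and bodies below are illustrative, not FastDeploy's actual API:

import threading

num_blocks_ready = threading.Event()      # worker -> engine: global num_blocks is known
cache_manager_ready = threading.Event()   # engine -> worker: cache_manager is up

def engine_side():
    # The engine may only start the cache_manager once the worker's
    # profile run has determined the global num_blocks.
    num_blocks_ready.wait()
    print("engine: starting cache_manager")   # stands in for the real startup
    cache_manager_ready.set()

def worker_side(enable_prefix_caching: bool):
    print("worker: profile_run determines num_gpu_blocks")
    num_blocks_ready.set()       # notify the engine to start the cache_manager

    cache_manager_ready.wait()   # do not capture graphs before it is up

    if enable_prefix_caching:
        # dummy_run alone never pulls cache from the cache_manager,
        # so fetch it explicitly before capture.
        print("worker: fetching cache from cache_manager")
    print("worker: dummy_run captures the CUDA graph")

if __name__ == "__main__":
    t = threading.Thread(target=engine_side, daemon=True)
    t.start()
    worker_side(enable_prefix_caching=True)
    t.join()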

Server launch command:

export CUDA_VISIBLE_DEVICES=2
MODEL_DIR="/workspace/EB45T-21B-Paddle"
# enable-prefix-caching controls whether prefix-caching is used
# num-gpu-blocks-override controls whether profile_run is needed
python -m fastdeploy.entrypoints.openai.api_server --model ${MODEL_DIR} \
  --max-num-seqs 6 --max-model-len 8192 \
  --host 127.0.0.1 \
  --port 18888 --engine-worker-queue-port 27102 \
  --metrics-port 17203 --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --num-gpu-blocks-override 3000 \
  --graph-optimization-config '{"use_cudagraph":true, "graph_opt_level":0, "cudagraph_capture_sizes": [1, 3]}'
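
Once the server is up, a quick request against the OpenAI-compatible endpoint checks that the output is not garbled; a minimal sketch, assuming the requests package is installed and using a placeholder model name:

import requests

resp = requests.post(
    "http://127.0.0.1:18888/v1/chat/completions",
    json={
        "model": "default",  # placeholder; the served model name may differ
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])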

paddle-bot bot commented Jul 19, 2025

Thanks for your contribution!

gongshaotian (Collaborator) previously approved these changes Jul 21, 2025

LGTM

@zeroRains zeroRains changed the title [Fix] fix the bug when use prefix-caching + cudagraph [Feature] support using prefix-caching + cudagraph for inference Jul 21, 2025
@zeroRains zeroRains changed the title [Feature] support using prefix-caching + cudagraph for inference [Feature] Support using prefix-caching + cudagraph for inference Jul 21, 2025

gongshaotian (Collaborator) commented:

773fe668b235e856df43682de18056df delete this branch

gongshaotian (Collaborator) commented:

Unify the calls to gpu_model_runner.initialize_kv_cache and gpu_model_runner.update_share_input_block_num in gpu_worker.

zeroRains (Contributor, Author) commented Jul 21, 2025

> Unify the calls to gpu_model_runner.initialize_kv_cache and gpu_model_runner.update_share_input_block_num in gpu_worker.

These two stand in a containment relationship: update_share_input_block_num already includes initialize_kv_cache, and update_share_input_block_num must always be called because, besides initializing the kv_cache, it also adds several variables to self.share_inputs. Also, gpu_model_runner.initialize_kv_cache presumably refers to the init_kv_cache inside profile_run; that is already encapsulated in profile_run, so it has not been factored out. (See the sketch below.)
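
A minimal sketch of that containment, with illustrative method bodies (the real FastDeploy methods do much more):

class GPUModelRunner:
    def __init__(self):
        self.share_inputs = {}
        self.kv_cache = None

    def initialize_kv_cache(self):
        # Allocate the kv_cache (placeholder allocation here).
        self.kv_cache = object()

    def update_share_input_block_num(self):
        # Containment: this method always performs the kv_cache init ...
        self.initialize_kv_cache()
        # ... and additionally registers variables in self.share_inputs,
        # which is why it must always be called.
        self.share_inputs["block_num"] = 3000  # placeholder value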

gongshaotian (Collaborator) previously approved these changes Jul 22, 2025

LGTM

Comment on lines +1184 to +1187
checking_worker_init_kv_cache_status_thread = None
if self.do_profile:
    # In profile mode, spawn a thread that waits for the cache_manager
    # start signal and then starts the cache_manager via _stop_profile.
    checking_worker_init_kv_cache_status_thread = threading.Thread(target=self._stop_profile, daemon=True)
    checking_worker_init_kv_cache_status_thread.start()
zeroRains (Contributor, Author) commented:

This adds a new thread that waits for the cache_manager start signal and then starts the cache_manager.

Comment on lines 1181 to 1183

self.checking_worker_status_thread = threading.Thread(target=detect_thread, daemon=True)
self.checking_worker_status_thread.start()
gongshaotian (Collaborator) commented:

These lines also need to be deleted.

zeroRains (Contributor, Author) commented Jul 22, 2025

If these lines are deleted, the engine-side progress bar for model loading disappears; the loading progress can then only be seen in workerlog.0.

zeroRains (Contributor, Author) commented:

In the current version the engine also relies on this thread to detect that the worker has finished starting. Does this mean the existing way the engine detects worker startup should be changed?

gongshaotian (Collaborator) commented:

The line range was marked incorrectly.

Jiang-Jia-Jun (Collaborator) left a comment

LGTM

@yuanlehome yuanlehome merged commit 89a485b into PaddlePaddle:develop Jul 22, 2025
4 of 5 checks passed