[Feature] Support using prefix-caching + cudagraph for inference #2924


Merged: 14 commits merged into PaddlePaddle:develop on Jul 22, 2025

Conversation

zeroRains (Contributor) commented Jul 19, 2025

pcard-71500

Support the scenario where prefix-caching and cudagraph are enabled at the same time.

Verification results:
All four scenarios start successfully as both multi-GPU and single-GPU services, and the inference output contains no garbled text.
cudagraph + no_profile
cudagraph + profile
cudagraph + profile + prefix-caching
cudagraph + no_profile + prefix-caching

Analysis of the incompatibilities and their solutions (a minimal sketch of the handshake follows this list):

  1. CUDA graph capture uses dummy_run, which never triggers fetching cache from the cache_manager.
    • Add a condition: when prefix-caching is enabled, fetch cache from the cache_manager before running dummy_run.
  2. When running in profile mode there is a circular dependency that prevents profile mode from working: graph capture requires the cache_manager to be started, starting the cache_manager requires the worker to finish starting, and the worker only finishes starting after graph capture completes.
    • Add a signal: when the global num_blocks is obtained, raise the signal to notify the engine to start the cache_manager.
    • Before capturing the graph, wait until the cache_manager has finished starting.
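
A minimal sketch of this handshake, assuming a plain threading.Event stands in for the real cross-process signal (in FastDeploy the signal crosses process boundaries); all names and bodies below are illustrative, not FastDeploy's actual API:

import threading

num_blocks_ready = threading.Event()      # worker -> engine: global num_blocks is known
cache_manager_ready = threading.Event()   # engine -> worker: cache_manager is up

def engine_side():
    # The engine may only start the cache_manager once the worker's
    # profile run has determined the global num_blocks.
    num_blocks_ready.wait()
    print("engine: starting cache_manager")   # stands in for the real startup
    cache_manager_ready.set()

def worker_side(enable_prefix_caching: bool):
    print("worker: profile_run determines num_gpu_blocks")
    num_blocks_ready.set()       # notify the engine to start the cache_manager

    cache_manager_ready.wait()   # do not capture graphs before it is up

    if enable_prefix_caching:
        # dummy_run alone never pulls cache from the cache_manager,
        # so fetch it explicitly before capture.
        print("worker: fetching cache from cache_manager")
    print("worker: dummy_run captures the CUDA graph")

if __name__ == "__main__":
    t = threading.Thread(target=engine_side, daemon=True)
    t.start()
    worker_side(enable_prefix_caching=True)
    t.join()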

Server launch command:

export CUDA_VISIBLE_DEVICES=2
MODEL_DIR="/workspace/EB45T-21B-Paddle"
# enable-prefix-caching controls whether prefix-caching is used
# num-gpu-blocks-override controls whether profile_run is needed
python -m fastdeploy.entrypoints.openai.api_server --model ${MODEL_DIR} \
  --max-num-seqs 6 --max-model-len 8192 \
  --host 127.0.0.1 \
  --port 18888 --engine-worker-queue-port 27102 \
  --metrics-port 17203 --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --num-gpu-blocks-override 3000 \
  --graph-optimization-config '{"use_cudagraph":true, "graph_opt_level":0, "cudagraph_capture_sizes": [1, 3]}'
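
Once the server is up, a quick request against the OpenAI-compatible endpoint checks that the output is not garbled; a minimal sketch, assuming the requests package is installed and using a placeholder model name:

import requests

resp = requests.post(
    "http://127.0.0.1:18888/v1/chat/completions",
    json={
        "model": "default",  # placeholder; the served model name may differ
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])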

paddle-bot bot commented Jul 19, 2025

Thanks for your contribution!

gongshaotian (Collaborator) previously approved these changes Jul 21, 2025

LGTM

@zeroRains zeroRains changed the title [Fix] fix the bug when use prefix-caching + cudagraph [Feature] support using prefix-caching + cudagraph for inference Jul 21, 2025
@zeroRains zeroRains changed the title [Feature] support using prefix-caching + cudagraph for inference [Feature] Support using prefix-caching + cudagraph for inference Jul 21, 2025

gongshaotian (Collaborator) commented:

773fe668b235e856df43682de18056df delete this branch

gongshaotian (Collaborator) commented:

Unify the calls to gpu_model_runner.initialize_kv_cache and gpu_model_runner.update_share_input_block_num in gpu_worker.

zeroRains (Contributor, Author) commented Jul 21, 2025

> Unify the calls to gpu_model_runner.initialize_kv_cache and gpu_model_runner.update_share_input_block_num in gpu_worker.

These two stand in a containment relationship: update_share_input_block_num already includes initialize_kv_cache, and update_share_input_block_num must always be called because, besides initializing the kv_cache, it also adds several variables to self.share_inputs. Also, gpu_model_runner.initialize_kv_cache presumably refers to the init_kv_cache inside profile_run; that is already encapsulated in profile_run, so it has not been factored out. (See the sketch below.)
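
A minimal sketch of that containment, with illustrative method bodies (the real FastDeploy methods do much more):

class GPUModelRunner:
    def __init__(self):
        self.share_inputs = {}
        self.kv_cache = None

    def initialize_kv_cache(self):
        # Allocate the kv_cache (placeholder allocation here).
        self.kv_cache = object()

    def update_share_input_block_num(self):
        # Containment: this method always performs the kv_cache init ...
        self.initialize_kv_cache()
        # ... and additionally registers variables in self.share_inputs,
        # which is why it must always be called.
        self.share_inputs["block_num"] = 3000  # placeholder value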

gongshaotian (Collaborator) previously approved these changes Jul 22, 2025

LGTM

Comment on lines +1184 to +1187
checking_worker_init_kv_cache_status_thread = None
if self.do_profile:
    # In profile mode, spawn a thread that waits for the cache_manager
    # start signal and then starts the cache_manager via _stop_profile.
    checking_worker_init_kv_cache_status_thread = threading.Thread(target=self._stop_profile, daemon=True)
    checking_worker_init_kv_cache_status_thread.start()
zeroRains (Contributor, Author) commented:

This adds a new thread that waits for the cache_manager start signal and then starts the cache_manager.

Comment on lines 1181 to 1183

self.checking_worker_status_thread = threading.Thread(target=detect_thread, daemon=True)
self.checking_worker_status_thread.start()
gongshaotian (Collaborator) commented:

These lines also need to be deleted.

zeroRains (Contributor, Author) commented Jul 22, 2025

If these lines are deleted, the engine-side progress bar for model loading disappears; the loading progress can then only be seen in workerlog.0.

zeroRains (Contributor, Author) commented:

In the current version the engine also relies on this thread to detect that the worker has finished starting. Does this mean the existing way the engine detects worker startup should be changed?

gongshaotian (Collaborator) commented:

The line range was marked incorrectly.

Jiang-Jia-Jun (Collaborator) left a comment

LGTM

@yuanlehome yuanlehome merged commit 89a485b into PaddlePaddle:develop Jul 22, 2025
4 of 5 checks passed