[Feature] Support using prefix-caching + cudagraph for inference #2924
…profile Change-Id: Ibf2ba3f2e3b08641d03f4b1391d7c862c3efa397
Thanks for your contribution!
LGTM
LGTM
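In gpu_worker, unify the calls to gpu_model_runner.initialize_kv_cache and gpu_model_runner.update_share_input_block_num.
One contains the other: update_share_input_block_num already calls initialize_kv_cache. update_share_input_block_num must always be called, because besides initializing the kv_cache it also adds several variables to self.share_inputs. Also, the gpu_model_runner.initialize_kv_cache mentioned here presumably refers to the init_kv_cache inside profile_run; that call is already encapsulated in profile_run, so it was not split out separately.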
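The containment relationship described in this reply can be sketched roughly as follows. This is only an illustration, not the project's code: the class `ModelRunnerSketch`, the list-based cache, and the keys `free_list` and `free_list_len` are hypothetical placeholders for whatever the real method stores in self.share_inputs.

```python
class ModelRunnerSketch:
    """Hypothetical stand-in for gpu_model_runner (not the real class)."""

    def __init__(self):
        self.share_inputs = {}
        self.kv_cache = None

    def initialize_kv_cache(self, num_blocks):
        # Placeholder allocation; the real method would allocate GPU tensors.
        self.kv_cache = [None] * num_blocks

    def update_share_input_block_num(self, num_blocks):
        # Step 1: (re)initialize the KV cache for the given block count.
        self.initialize_kv_cache(num_blocks)
        # Step 2: record block bookkeeping in share_inputs.
        # These key names are assumptions made for illustration only.
        self.share_inputs["free_list"] = list(range(num_blocks - 1, -1, -1))
        self.share_inputs["free_list_len"] = num_blocks


runner = ModelRunnerSketch()
runner.update_share_input_block_num(4)
```

Under this reading, calling update_share_input_block_num alone leaves both the cache and the share_inputs bookkeeping in place, which is why a separate initialize_kv_cache call from the worker would be redundant.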
LGTM
checking_worker_init_kv_cache_status_thread = None
if self.do_profile:
    checking_worker_init_kv_cache_status_thread = threading.Thread(target=self._stop_profile, daemon=True)
    checking_worker_init_kv_cache_status_thread.start()
Adds a new thread that waits for the cache_manager start signal and then starts the cache_manager.
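A minimal sketch of this wait-then-start pattern, assuming the signal can be modeled as a `threading.Event`. The class `EngineSketch`, the event `kv_cache_ready`, the helper `_start_cache_manager`, and the direct-start branch for the no-profile case are all hypothetical; the actual signalling mechanism in this PR may differ.

```python
import threading


class EngineSketch:
    """Hypothetical engine; only the threading pattern is of interest."""

    def __init__(self, do_profile):
        self.do_profile = do_profile
        # Assumed signal: set once the worker has finished initializing kv_cache.
        self.kv_cache_ready = threading.Event()
        self.cache_manager_started = False

    def _start_cache_manager(self):
        # Placeholder for launching the real cache_manager.
        self.cache_manager_started = True

    def _stop_profile(self):
        # Block until the signal arrives, then start the cache_manager.
        self.kv_cache_ready.wait()
        self._start_cache_manager()

    def start(self):
        thread = None
        if self.do_profile:
            thread = threading.Thread(target=self._stop_profile, daemon=True)
            thread.start()
        else:
            # Assumption: without profiling the block count is known up front,
            # so the cache_manager can be started immediately.
            self._start_cache_manager()
        return thread


engine = EngineSketch(do_profile=True)
waiter = engine.start()
engine.kv_cache_ready.set()  # simulate the worker reporting readiness
waiter.join(timeout=5)
```

Because the waiter is a daemon thread, it does not keep the process alive if the signal never arrives.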
self.checking_worker_status_thread = threading.Thread(target=detect_thread, daemon=True)
self.checking_worker_status_thread.start()
These lines also need to be deleted.
If this is deleted, the model-loading progress bar on the engine startup side disappears, and the loading progress can only be seen in workerlog.0.
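In the current version, the engine also relies on this thread to detect that worker startup has finished. Does the way the engine detects worker startup need to change?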
The range was marked incorrectly.
LGTM
pcard-71500
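Supports enabling prefix-caching and cudagraph at the same time.
Results:
In all four scenarios, both the multi-GPU and single-GPU services start successfully, and the inference output is not garbled.
cudagraph + no_profile
cudagraph + profile
cudagraph + profile + prefix-caching
cudagraph + no_profile + prefix-caching
Root-cause analysis of the incompatibility and the solution:
Service launch command: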