[RFC][core][V1] generalize structured output manager and backends #17503
Conversation
Thanks for your contribution!
I think the main concern with sharing GPU and TPU code is that changes that clearly benefit the GPU side may impact the TPU side catastrophically.
In particular, we need to guarantee here that whatever operation is done inside apply_grammar_bitmask
happens on the CPU, and that the resulting bitmask is only then moved to the device.
If we foresee any possible changes to this invariant, then we may need to re-think the work here or leave TPU as is.
cc @Chenyaaang, who led structured decoding support on TPU.
I admit I didn't put a lot of thought into the TPU implementation when I designed this. Reviewing it now, I could restructure it into a strategy template, where the bitmask strategy defines the CPU logic and the backends define the device logic. You can then add as many backends and strategies as you wish without any duplication. Init would look something like this:

if xgrammar:
    strategy = XgrammarStrategy()
elif guidance:
    strategy = GuidanceStrategy()
...
if gpu:
    backend = GPUBitMaskBackend(strategy)
elif tpu:
    backend = TPUBitMaskBackend(strategy)
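The strategy/backend split described above can be sketched in plain Python. This is a minimal illustration, not the PR's actual code: the class names mirror the comment, but the method signatures and the list-based "bitmask" are assumptions made to keep the example self-contained (the real code operates on torch tensors).

```python
from abc import ABC, abstractmethod


class BitmaskStrategy(ABC):
    """CPU-side grammar logic: decides which token ids are allowed."""

    @abstractmethod
    def fill_bitmask(self, bitmask: list[int], allowed: set[int]) -> None:
        ...


class XgrammarStrategy(BitmaskStrategy):
    def fill_bitmask(self, bitmask: list[int], allowed: set[int]) -> None:
        # Mark allowed token ids with 1; everything else stays masked (0).
        for token_id in allowed:
            bitmask[token_id] = 1


class BitmaskBackend(ABC):
    """Device-side logic: owns how/where the bitmask is applied."""

    def __init__(self, strategy: BitmaskStrategy):
        self.strategy = strategy

    @abstractmethod
    def apply(self, logits: list[float], allowed: set[int]) -> list[float]:
        ...


class GPUBitMaskBackend(BitmaskBackend):
    def apply(self, logits: list[float], allowed: set[int]) -> list[float]:
        # Build the mask on CPU via the strategy, then apply it to the logits
        # (a real backend would move the mask to the device first).
        bitmask = [0] * len(logits)
        self.strategy.fill_bitmask(bitmask, allowed)
        return [l if b else float("-inf") for l, b in zip(logits, bitmask)]


backend = GPUBitMaskBackend(XgrammarStrategy())
masked = backend.apply([0.1, 0.5, 0.2], allowed={1})
```

A `TPUBitMaskBackend` would reuse the same strategy object and only override the device-side `apply` step, which is the point of the split.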
Hey, sure, having a separate backend would work for this case. I am wondering whether we could strive for a simpler hierarchy here, though, given that TPU is the sole exception. Then again, I am not sure what kind of expansion you are anticipating on the structured output side in terms of strategies.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 9bc0c1e to 84a0241.
I have tested the latest version on a TPU, and it seems to have introduced an intermittent bug. I'm still not sure what the cause is...
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ca981e9 to 9b983cf.
Force-pushed from 9b983cf to 9e001b2.
I have tested this now on V0 and V1 with Qwen & Llama, single node, on GPU and TPU, and it seems to perform well.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 9e001b2 to fe6a707.
Hey, left some comments, please take a look when you have the time.
Overall I am a bit concerned about scattering TPU logic around the code base with the current implementation.
What I wanted to propose was to use a single interface for some of the methods but hide the platform details from direct use; e.g. the top-k/top-p sampler provides a single forward() entrypoint whose behavior depends on the platform.
What I would want to avoid is having some generic class (BitmaskStructuredOutputBackend here) which is used like obj.do_tpu_specific_thing (e.g. prepare_structured_decoding_input_tpu).
I'd rather have the runner use said obj while dealing with TPU details, like:
# runner pseudocode, compilation
for size in compile_sizes:
dummy_data = ..
obj.forward(dummy_data)
@yaochengji can you help me review the TPU part here?
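The "single entrypoint" pattern referenced above (as in the top-k/top-p sampler) can be sketched as follows. This is a hypothetical illustration with made-up names, not vLLM's sampler code: callers see one forward() method, and the platform-specific behavior is selected internally at construction time.

```python
class BitmaskApplier:
    """One public forward(); the platform branch is hidden inside."""

    def __init__(self, platform: str):
        # Dispatch is decided once, at init, rather than at every call site.
        self._impl = self._tpu_forward if platform == "tpu" else self._gpu_forward

    def forward(self, logits: list[float], mask: list[int]) -> list[float]:
        return self._impl(logits, mask)

    def _gpu_forward(self, logits: list[float], mask: list[int]) -> list[float]:
        return [l if m else float("-inf") for l, m in zip(logits, mask)]

    def _tpu_forward(self, logits: list[float], mask: list[int]) -> list[float]:
        # A TPU path might e.g. pad the mask to a compiled shape first;
        # same signature, same observable behavior for the caller.
        padded = mask + [0] * (len(logits) - len(mask))
        return [l if m else float("-inf") for l, m in zip(logits, padded)]


applier = BitmaskApplier("gpu")
out = applier.forward([1.0, 2.0], [1, 0])
```

The runner then only ever calls `applier.forward(...)`, and no `do_tpu_specific_thing` methods leak into the shared code path.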
def precompile(self, num_reqs_paddings: list[int], device: torch.device,
               hidden_states_dtype: torch.dtype):
    if current_platform.is_tpu():
        for num_reqs in num_reqs_paddings:
            dummy_logits = torch.zeros((num_reqs, self.tpu_vocab_size),
                                       device=device,
                                       dtype=hidden_states_dtype)
Same thing here: I think it's best to keep the TPU code inside the runner, and then have the runner use the bitmask object to compile.
I don't think we should scatter the TPU init logic through the code base.
This should be do-able if I create a dummy scheduler_output and input_batch within the tpu worker precompile code
Force-pushed from fe6a707 to fe34514.
The latest version moves to the aforementioned strategy template for separating platform logic. The manager now constructs a strategy (xgrammar or guidance) and then constructs a platform-specific backend using the desired strategy.
Nice work here! I am much happier with the tpu changes now, thanks for the patience.
Let's wait for green tests.
super().__init__(vllm_config, tokenizer, vocab_size, reasoner,
                 strategy)
self._grammar_bitmask: Optional[torch.Tensor] = None
self.max_num_reqs: Optional[int] = None
Do we need to keep self.max_num_reqs around?
If it is only needed for initializing the tensors, we could check whether one of the tensors is not None, or have the TPU model runner call init_tensors.
I was purely using it to manage state: if it's None, we need to init the tensors. I didn't want to overcomplicate the type checking by checking that all three tensors are not None. If we did this:

self.require_structured_out_cpu: Optional[torch.Tensor] = None
self.structured_decode_arange: Optional[torch.Tensor] = None
self.grammar_bitmask_cpu: Optional[torch.Tensor] = None

we would need assertions in all the compute methods, or to ignore typing at each occurrence.
cc @russellb
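The lazy-init trade-off discussed above can be sketched with plain lists standing in for the CPU tensors. This is an illustrative sketch only: the attribute names follow the diff, but the shapes and the `compute` method are assumptions. Checking one tensor for None replaces a separate `max_num_reqs` sentinel, at the cost of a typing assertion where the Optionals are consumed.

```python
from typing import Optional


class BitmaskState:
    def __init__(self) -> None:
        # All three start as None; plain lists stand in for torch tensors.
        self.grammar_bitmask_cpu: Optional[list[int]] = None
        self.structured_decode_arange: Optional[list[int]] = None

    def _ensure_tensors(self, max_num_reqs: int) -> None:
        # One None-check on a single tensor is the sentinel, instead of
        # tracking a separate self.max_num_reqs attribute.
        if self.grammar_bitmask_cpu is None:
            self.grammar_bitmask_cpu = [0] * max_num_reqs
            self.structured_decode_arange = list(range(max_num_reqs))

    def compute(self, max_num_reqs: int) -> list[int]:
        self._ensure_tensors(max_num_reqs)
        # The assertion the comment mentions: needed to narrow the Optional
        # for the type checker in every compute method.
        assert self.grammar_bitmask_cpu is not None
        return self.grammar_bitmask_cpu


state = BitmaskState()
bitmask = state.compute(4)
```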
Force-pushed from 5d24e8e to 3d6863c.
Ready to merge now @russellb @NickLucche
Signed-off-by: william-baker-inflection <william.baker@inflection.ai>
Force-pushed from 3d6863c to 714292e.
I want to understand more of the standardization reasoning behind this. If we need a design to support out-of-tree plugins, then I believe we will need some design decisions here. The initial reason for the code to live in Scheduler and gpu_model_runner is that it provides a better separation of concerns. The manager acts more like a holder for the scheduler, so I can see the point of moving some scheduler logic here.

> Simplify the GPU runner to have well-defined entry points to structured outputs
> add flexibility for future structured outputs implementations
> expose logits directly to the structured output backend (the code is already moving in this direction in tpu_model_runner)

My general philosophy is that we should only establish an interface when we can't find a simpler way to do it. Right now, I think this is too complicated for little improvement in readability (for example the init_batch function, or the structured outputs worker abstraction).
A priori assumptions that can be made:
- We will probably only deal with a bitmask implementation (as I don't know any other at the moment), given that it yields the most performance (as observed with xgrammar + llguidance).
- The logic to apply the bitmask should be worker-specific.
If we want to improve readability, I think we can create a mixin instead of another worker abstraction, which I think is a bit overkill.
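The mixin alternative suggested above can be sketched as follows. This is a hypothetical illustration, not vLLM code: the class and method names are made up, and plain lists stand in for logit tensors. The worker-level structured-output hooks live in a mixin that the existing runner inherits, so no separate worker abstraction is introduced.

```python
class StructuredOutputsWorkerMixin:
    """Worker-level structured-output hooks, mixed into the runner."""

    def apply_bitmask(self, logits: list[float], mask: list[int]) -> list[float]:
        # Disallowed tokens (mask == 0) are driven to -inf before sampling.
        return [l if m else float("-inf") for l, m in zip(logits, mask)]


class GPUModelRunner(StructuredOutputsWorkerMixin):
    def sample(self, logits: list[float], mask: list[int]) -> int:
        masked = self.apply_bitmask(logits, mask)
        # Greedy sampling stand-in: pick the argmax of the masked logits.
        return max(range(len(masked)), key=masked.__getitem__)


runner = GPUModelRunner()
token = runner.sample([0.1, 0.9, 0.3], [1, 0, 1])
```

A TPU runner would inherit the same mixin and override only the hooks whose device behavior differs.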
@@ -0,0 +1,56 @@
# SPDX-License-Identifier: Apache-2.0
I'm a bit hesitant to introduce the worker abstraction to structured outputs, as it is confusing to maintain and overlaps with the worker concept in vLLM.
I think if we just need an interface to be implemented at the worker level, it is better to have certain abstract functions that live inside v1/workers, as structured_outputs_worker_mixin or something.
The goals for v1/structured_output are mostly to include the backend implementation plus utilities for structured output, so I don't really see how having another interface for workers in here makes much sense.
def get_worker_backend(
        vllm_config: VllmConfig) -> StructuredOutputWorkerBackend:
    if current_platform.is_tpu():
        return BitmaskTPUBackend(vllm_config)
    else:
        return BitmaskGPUBackend(vllm_config)
ditto wrt reasoning behind worker backend abstraction.
@@ -81,7 +78,7 @@ def _validate_sampling_params(
         params: SamplingParams,
         lora_request: Optional[LoRARequest],
     ) -> None:
-        self._validate_structured_output(params)
+        StructuredOutputManager.validate_request(params, self.vllm_config)
I don't see much added value in this being simply a validation function.
If you want this validation function, then it is probably more useful to create a structured_outputs/utils.py. We are going to move some utility functions from v0 to v1 soon anyway, so it might be good to have a utils file live somewhere in v1/structured_outputs.
# Metadata for structured output batches.
# By default this holds only the structured_output_request_ids,
# but backends may extend this to hold more data for the batch.
structured_output_meta: Optional[StructuredOutputBatchMetaData]
Suggested change:
- structured_output_meta: Optional[StructuredOutputBatchMetaData]
+ structured_output_metadata: Optional[StructuredOutputBatchMetadata]
The a priori assumption is that we will always deal with a batch in the scheduler in the v1 design, hence I don't think we need the name to explicitly say BatchMetadata. However, I do see the point of having verbose naming here.
grammar_bitmask: torch.Tensor


class BitmaskStructuredOutputBackend(StructuredOutputBackend):
I don't really see the benefit of subclassing this from the original backend class and calling it a bitmask backend.
I think going forward we will only support the bitmask backend; I don't know of a better implementation than bitmasking.
If we need to return logits for some reason, then it is not within the scope of structured outputs I think (logits would then be returned after sampling, whereas with bitmask apply we do it before sampling).
class BitmaskSOBatchMetaData(StructuredOutputBatchMetaData):
    """
    This class is used to store the bitmask for structured output requests.
    It is used to pass the bitmask to the GPU workers.
    """

    grammar_bitmask: torch.Tensor
Let's just fold it all into the batch metadata class; I think this makes the code way too complex.
def init_batch(
        self, requests: dict[str, Request],
        structured_output_request_ids: dict[str, int],
        scheduled_spec_decode_tokens: dict[str, list[int]]
) -> StructuredOutputBatchMetaData:
    bitmask = self.grammar_bitmask(requests, structured_output_request_ids,
                                   scheduled_spec_decode_tokens)
    return BitmaskSOBatchMetaData(structured_output_request_ids, bitmask)
IMO this function is essentially two indirections, which introduces unnecessary complexity.
Instead of calling backend.init_batch elsewhere, I don't see the difference with:

structured_output_metadata = StructuredOutputMetadata(
    structured_output_request_ids=structured_output_request_ids,
    bitmask=structured_output_manager.grammar_bitmask(...))
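The direct-construction alternative argued for above can be sketched with a small dataclass. Everything here is an illustrative stand-in: the class name follows the snippet, while `grammar_bitmask` is reduced to a trivial function so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class StructuredOutputMetadata:
    structured_output_request_ids: dict[str, int]
    bitmask: list[int]


def grammar_bitmask(request_ids: dict[str, int]) -> list[int]:
    # Stand-in for structured_output_manager.grammar_bitmask(...): one mask
    # entry per structured-output request in the batch.
    return [1] * len(request_ids)


request_ids = {"req-0": 0, "req-1": 1}
# Construct the metadata directly at the call site, with no init_batch hop.
metadata = StructuredOutputMetadata(
    structured_output_request_ids=request_ids,
    bitmask=grammar_bitmask(request_ids),
)
```

The behavior is identical to routing through `backend.init_batch(...)`; the only difference is that the caller sees the construction explicitly.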
Thank you for your prompt, detailed feedback; you make some really good points! I had wanted to give an update on my progress. Let me provide some context for these changes. We are developing structured output mechanisms that extend beyond the current xgrammar and guidance implementations. Our approach involves a logit modification module that goes beyond simple bitmasking, which led us to implement it as a separate backend. The core objectives of this PR are:
Earlier versions of this PR didn't properly account for the multiprocess nature of distributed workers; this was due to both my limited familiarity with vLLM's architecture and the absence of distributed testing. I hope this addresses your concerns about the complexity of abstracting away from bitmasking.
This pull request has merge conflicts that must be resolved before it can be merged.
Proposing to move all Structured Outputs related logic into v1/structured_output and standardise the interfaces with structured outputs in the rest of the code. Currently, structured outputs logic is scattered and fragmented through the code base, with logic in Scheduler, StructuredOutputManager and gpu_model_runner.

Benefits:
- an init_batch callback could be used to trigger the re-shuffling of the grammar mask asynchronously, rather than synchronously in the gpu_model_runner as is currently implemented

There are no functional changes to vLLM, only re-arranging code and changing the interfaces for xgrammar and guidance in vllm/v1/structured_output/backend_guidance.py and vllm/v1/structured_output/backend_xgrammar.py, with the largest change being to move bitmasking logic from the gpu_model_runner to vllm/v1/structured_output/bitmasking_grammar.py.

I have tested this with xgrammar and guidance.

WARNING: I have yet to refactor tpu_model_runner with this new logic, but I think this will help clean up the code duplication between gpu_model_runner and tpu_model_runner and add any TPU logic into bitmasking_grammar.py. I wanted to gather feedback before taking the time to make changes to the tpu_model_runner.