
[RFC][core][V1] generalize structured output manager and backends #17503


Open · wants to merge 1 commit into base: main

Conversation


@william-baker-inflection commented Apr 30, 2025

Proposing to move all structured-output-related logic into v1/structured_output and standardise the interfaces between structured outputs and the rest of the code. Currently, structured output logic is scattered and fragmented throughout the code base, with pieces in the Scheduler, StructuredOutputManager and gpu_model_runner.

Benefits:

  • Simplify the GPU runner to have well-defined entry points to structured outputs
  • add flexibility for future structured outputs implementations
  • expose logits directly to the structured output backend (the code is already moving in this direction in tpu_model_runner)
    • This in turn will allow the backend to make use of the logits if needed, e.g. a verbose mode
  • remove dependency on xgrammar within GPU runner
  • generalise where possible for expandability
    • The init_batch callback could be used to trigger the re-shuffling of the grammar mask asynchronously rather than synchronously in the gpu_model_runner as is currently implemented
  • no performance impact

There are no functional changes to vLLM; this only rearranges code and changes the interfaces for xgrammar and guidance in vllm/v1/structured_output/backend_guidance.py and vllm/v1/structured_output/backend_xgrammar.py. The largest change is moving the bitmasking logic from the gpu_model_runner to vllm/v1/structured_output/bitmasking_grammar.py.
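For readers unfamiliar with the bitmasking approach, here is a minimal, illustrative sketch (not the code in this PR) of what applying a packed grammar bitmask to a logits tensor looks like; the packing convention of 32 vocab entries per int32 word follows what xgrammar-style bitmasks typically use, and both tensors are assumed to already live on the same device.

```python
import torch


def apply_grammar_bitmask(logits: torch.Tensor, bitmask: torch.Tensor) -> None:
    """Mask out grammar-disallowed tokens in place.

    logits:  [batch, vocab_size] float tensor
    bitmask: [batch, ceil(vocab_size / 32)] int32 tensor, 32 tokens per word
    """
    batch, vocab_size = logits.shape
    # Unpack each int32 word into 32 boolean "token allowed" flags.
    bit_positions = torch.arange(32, device=bitmask.device, dtype=torch.int32)
    allowed = (bitmask.unsqueeze(-1) >> bit_positions) & 1   # [batch, words, 32]
    allowed = allowed.reshape(batch, -1)[:, :vocab_size].bool()
    # Disallowed tokens get -inf so they can never be sampled.
    logits.masked_fill_(~allowed, float("-inf"))
```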

I have tested this with xgrammar and guidance.

WARNING: I have yet to refactor tpu_model_runner with this new logic, but I think this will help clean up the code duplication between gpu_model_runner and tpu_model_runner and move any TPU logic into bitmasking_grammar.py. I wanted to gather feedback before taking the time to make changes to the tpu_model_runner.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@NickLucche (Contributor) left a comment

Thanks for your contribution!

I think the main concern with sharing GPU and TPU code is that changes that clearly benefit the GPU side may impact TPU catastrophically.
In particular here we need to guarantee that whatever operation is done inside apply_grammar_bitmask is done on CPU and the resulting bitmask is only then moved to device.

If we foresee any possible changes to this invariant, then we may need to re-think the work here or leave TPU as is.

cc @Chenyaaang, who led structured decoding support on TPU
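To make the invariant above concrete, a hedged sketch (the function and argument names are illustrative, not vLLM's actual API): all per-request mask rows are assembled and reordered on the CPU, and only the finished tensor is copied to the device in a single transfer.

```python
import torch


def build_and_ship_bitmask(cpu_rows: dict[int, torch.Tensor],
                           batch_size: int,
                           words_per_row: int,
                           device: torch.device) -> torch.Tensor:
    # Start from an "all tokens allowed" mask on the host (-1 == all bits set).
    bitmask_cpu = torch.full((batch_size, words_per_row), -1, dtype=torch.int32)
    for batch_index, row in cpu_rows.items():
        bitmask_cpu[batch_index] = row          # pure CPU work, no device ops
    # One host-to-device copy at the end; in practice a pre-allocated pinned
    # buffer would be reused instead of allocating a fresh tensor here.
    return bitmask_cpu.to(device, non_blocking=True)
```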

@william-baker-inflection (Author) commented:

> Thanks for your contribution!
>
> I think the main concern with sharing GPU and TPU code is that changes that clearly benefit the GPU side may impact TPU catastrophically. In particular here we need to guarantee that whatever operation is done inside apply_grammar_bitmask is done on CPU and the resulting bitmask is only then moved to device.
>
> If we foresee any possible changes to this invariant, then we may need to re-think the work here or leave TPU as is.
>
> cc @Chenyaaang, who led structured decoding support on TPU

I admit I didn't put a lot of thought into the TPU implementation when I designed this, and reviewing the tpu_model_runner further I can see how the implementations may need to be independent. I can restructure the code fairly easily to achieve this; here is the current structure:
[class diagram: current structure]

I could restructure into a strategy template like so:
[class diagram: proposed strategy template]

The bitmask strategy defines the CPU logic and the backends define the device logic. You can then add as many backends and strategies as you wish without any duplication.

Init would look something like this:

if xgrammar:
    strategy = XgrammarStrategy()
elif guidance:
    strategy = GuidanceStrategy()
...
if gpu:
    backend = GPUBitMaskBackend(strategy)
elif tpu:
    backend = TPUBitMaskBackend(strategy)
I'll also add proper docstrings in the next version.
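A minimal sketch of that split (the backend class names follow the init snippet above; everything else is an assumption, not the PR's final code): the strategy owns the CPU/grammar logic, the backend owns the device-side application, and the init block above simply wires one into the other.

```python
from abc import ABC, abstractmethod

import torch


class BitmaskStrategy(ABC):
    """CPU-side logic: grammar compilation and filling bitmask rows."""

    @abstractmethod
    def fill_bitmask(self, request_id: str, out_row: torch.Tensor) -> None:
        ...


class BitmaskBackend(ABC):
    """Device-side logic: moving the mask and applying it to logits."""

    def __init__(self, strategy: BitmaskStrategy):
        self.strategy = strategy

    @abstractmethod
    def apply(self, logits: torch.Tensor, bitmask_cpu: torch.Tensor) -> torch.Tensor:
        ...


class GPUBitMaskBackend(BitmaskBackend):
    def apply(self, logits: torch.Tensor, bitmask_cpu: torch.Tensor) -> torch.Tensor:
        # bitmask_cpu here holds one 0/1 flag per vocab token for simplicity.
        mask = bitmask_cpu.to(logits.device, non_blocking=True)
        logits.masked_fill_(~mask.bool(), float("-inf"))
        return logits


class TPUBitMaskBackend(BitmaskBackend):
    def apply(self, logits: torch.Tensor, bitmask_cpu: torch.Tensor) -> torch.Tensor:
        # TPU path keeps mask construction on CPU and uses an out-of-place,
        # XLA-friendly op on device.
        mask = bitmask_cpu.to(logits.device)
        neg_inf = torch.full_like(logits, torch.finfo(logits.dtype).min)
        return torch.where(mask.bool(), logits, neg_inf)
```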

@NickLucche (Contributor) commented May 8, 2025

Hey, sure, having a separate backend would work for this case. I am wondering whether we could strive for a simpler hierarchy here though, given that TPU is the sole exception.
I think something like TopKTopPSampler https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_sampler.py#L75 is very easy to maintain.

But then again, I am not sure what kind of expansion you are anticipating on the structured output side in terms of strategies.
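For reference, a sketch of the single-class pattern being pointed to (the class and method names here are hypothetical stand-ins for a structured output equivalent): one forward entry point whose implementation is chosen per platform in __init__, rather than a parallel class hierarchy.

```python
import torch
from torch import nn


class GrammarBitmaskApplier(nn.Module):
    """Single entry point; the platform-specific path is bound once at init."""

    def __init__(self, is_tpu: bool):
        super().__init__()
        self.forward = self.forward_tpu if is_tpu else self.forward_native

    def forward_native(self, logits: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
        # In-place masking is fine on GPU/CPU.
        logits.masked_fill_(~mask.bool(), float("-inf"))
        return logits

    def forward_tpu(self, logits: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
        # Out-of-place variant that plays nicer with XLA tracing.
        neg_inf = torch.full_like(logits, torch.finfo(logits.dtype).min)
        return torch.where(mask.bool(), logits, neg_inf)
```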


mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 9bc0c1e to 84a0241 Compare May 12, 2025 16:26
@mergify mergify bot removed the needs-rebase label May 12, 2025
@william-baker-inflection (Author) commented:

I have tested the latest version on a TPU and it seems to have introduced an intermittent bug; I'm still not sure what the cause is...


mergify bot commented May 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 14, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from ca981e9 to 9b983cf Compare May 15, 2025 14:12
@mergify mergify bot removed the needs-rebase label May 15, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 9b983cf to 9e001b2 Compare May 15, 2025 14:19
@william-baker-inflection (Author) commented:

I have now tested this on V0 and V1 with Qwen & Llama, single node, on GPU and TPU, and it seems to perform well.
Pending passing tests, this should be ready to merge.
In addition to the benefits mentioned at the beginning of this PR, this now also improves TPU performance by modifying logits in-place.


mergify bot commented May 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 17, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 9e001b2 to fe6a707 Compare May 19, 2025 09:33
@mergify mergify bot removed the needs-rebase label May 19, 2025
@NickLucche (Contributor) left a comment

Hey, left some comments, please take a look when you have the time.
Overall I am a bit concerned with scattering tpu logic around the code base with the current implementation.

I think what I wanted to propose was to use a single interface for some of the methods but hide that away from direct use; e.g. the topk/topp sampler provides a single forward() entry point whose behavior depends on the platform.

What I would want to avoid is having some generic class (BitmaskStructuredOutputBackend here) which I use like obj.do_tpu_specific_thing (e.g. prepare_structured_decoding_input_tpu).
I'd rather have the runner use said obj while dealing with TPU details, like

# runner pseudocode, compilation
for size in compile_sizes:
    dummy_data = ..
    obj.forward(dummy_data)

@yaochengji can you help me review the TPU part here?

Comment on lines 343 to 349
def precompile(self, num_reqs_paddings: list[int], device: torch.device,
               hidden_states_dtype: torch.dtype):
    if current_platform.is_tpu():
        for num_reqs in num_reqs_paddings:
            dummy_logits = torch.zeros((num_reqs, self.tpu_vocab_size),
                                       device=device,
                                       dtype=hidden_states_dtype)

same thing here, I think it's best to keep the TPU code inside the runner, and then have the runner use the bitmask object to compile.
I don't think we should scatter the TPU init logic throughout the code base.


This should be doable if I create a dummy scheduler_output and input_batch within the TPU worker precompile code.

@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from fe6a707 to fe34514 Compare May 21, 2025 10:07
@william-baker-inflection (Author) commented:

The latest version moves to the aforementioned strategy template for separating platform logic. The manager now constructs a strategy (xgrammar or guidance) and then constructs a platform-specific backend using the chosen strategy.
Precompilation has been simplified to depend solely on dummy logits. kwargs have been added to the logit-filtering and precompile calls to support platform-specific args.
I have tested fully on GPU and partially on TPU, as the current version of main gives NaN model outputs on Qwen.
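A hedged sketch of what that kwargs plumbing could look like (the method and argument names are assumptions, not the PR's exact API): the shared interface stays platform-agnostic, and platform-specific arguments ride through **kwargs to the backend that understands them.

```python
from typing import Optional

import torch


class BitmaskBackendBase:
    def precompile(self, dummy_logits: torch.Tensor, **kwargs) -> None:
        # Default: nothing to precompile.
        pass

    def filter_logits(self, logits: torch.Tensor, bitmask: torch.Tensor,
                      **kwargs) -> torch.Tensor:
        logits.masked_fill_(~bitmask.bool(), float("-inf"))
        return logits


class TPUBitmaskBackend(BitmaskBackendBase):
    def precompile(self, dummy_logits: torch.Tensor, *,
                   num_reqs_paddings: Optional[list[int]] = None,
                   **kwargs) -> None:
        # The TPU-only argument is consumed here; GPU callers never pass it.
        dummy_mask = torch.ones_like(dummy_logits, dtype=torch.int32)
        for _ in (num_reqs_paddings or []):
            self.filter_logits(dummy_logits.clone(), dummy_mask)
```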

@NickLucche (Contributor) left a comment

Nice work here! I am much happier with the tpu changes now, thanks for the patience.
Let's wait for green tests.

super().__init__(vllm_config, tokenizer, vocab_size, reasoner,
                 strategy)
self._grammar_bitmask: Optional[torch.Tensor] = None
self.max_num_reqs: Optional[int] = None

do we need to keep self.max_num_reqs around?
If this is only needed for initializing the tensors, we could check whether one of the tensors is not None, or have the TPU model runner call init_tensors.


I was purely using it to manage state (if it's None we need to init tensors); I didn't want to over-complicate the type checking by checking that all three tensors are not None. If we did this:

self.require_structured_out_cpu: Optional[torch.Tensor] = None
self.structured_decode_arange: Optional[torch.Tensor] = None
self.grammar_bitmask_cpu: Optional[torch.Tensor] = None

We would need assertions in all the compute methods, or to ignore typing at each occurrence.
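A small illustrative sketch of that trade-off (the attribute names come from the snippet above; the class and method are hypothetical): without a sentinel such as max_num_reqs, each compute method has to re-assert that the Optional tensors are initialized before the type checker allows their use.

```python
from typing import Optional

import torch


class TPUBitmaskState:
    """Illustrates the assertion burden without a single sentinel value."""

    def __init__(self) -> None:
        self.require_structured_out_cpu: Optional[torch.Tensor] = None
        self.structured_decode_arange: Optional[torch.Tensor] = None
        self.grammar_bitmask_cpu: Optional[torch.Tensor] = None

    def compute(self) -> torch.Tensor:
        # Every compute method needs assertions like these (or `# type: ignore`)
        # before the tensors can be used.
        assert self.grammar_bitmask_cpu is not None
        assert self.structured_decode_arange is not None
        return self.grammar_bitmask_cpu[self.structured_decode_arange]
```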

@NickLucche (Contributor) commented:

cc @russellb

@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 5d24e8e to 3d6863c Compare May 23, 2025 19:01
@william-baker-inflection (Author) commented:

Ready to merge now @russellb @NickLucche

Signed-off-by: william-baker-inflection <william.baker@inflection.ai>
@aarnphm (Collaborator) left a comment

I want to understand more of the standardization reasoning here. If we need a design to support out-of-tree (OOT) plugins, then I believe we will need some design decisions here. The initial reason for the code to live in the Scheduler and gpu_model_runner is that it provides a better separation of concerns. The manager acts more like a holder for the scheduler, so I can see the point of moving some scheduler logic here.

> Simplify the GPU runner to have well-defined entry points to structured outputs
> add flexibility for future structured outputs implementations
> expose logits directly to the structured output backend (the code is already moving in this direction in tpu_model_runner)

My general philosophy is that we should only establish an interface when we can't find a simpler way to do it. Right now, I think this is too complicated for the little improvement in readability (for example the init_batch function, or the structured outputs worker abstraction).

A priori assumptions that can be made:

  • We will probably only deal with a bitmask implementation (I don't know of any other at the moment), given that it yields the best performance (as observed with xgrammar + llguidance)
  • the logic to apply the bitmask should be worker-specific.

> There are no functional changes to vLLM; this only rearranges code and changes the interfaces for xgrammar and guidance in vllm/v1/structured_output/backend_guidance.py and vllm/v1/structured_output/backend_xgrammar.py. The largest change is moving the bitmasking logic from the gpu_model_runner to vllm/v1/structured_output/bitmasking_grammar.py.

> WARNING: I have yet to refactor tpu_model_runner with this new logic, but I think this will help clean up the code duplication between gpu_model_runner and tpu_model_runner and move any TPU logic into bitmasking_grammar.py. I wanted to gather feedback before taking the time to make changes to the tpu_model_runner.

If we want to improve readability, I think we can create a mixin instead of creating another worker abstraction, which I think is a bit overkill.
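A hedged sketch of the mixin idea (the mixin name mirrors the structured_outputs_worker_mixin suggested later in this review; the method bodies and the simplified runner are assumptions, not vLLM code): the model runner inherits the structured output helpers rather than holding a separate worker-backend object.

```python
import torch


class StructuredOutputsWorkerMixin:
    """Bitmask helpers inherited by model runners instead of a backend object."""

    def apply_grammar_bitmask_(self, logits: torch.Tensor,
                               grammar_bitmask: torch.Tensor) -> None:
        # In-place masking; the runner decides when tensors move between devices.
        logits.masked_fill_(~grammar_bitmask.bool(), float("-inf"))


class SimplifiedGPUModelRunner(StructuredOutputsWorkerMixin):
    """Greatly simplified stand-in for the real runner, for illustration only."""

    def sample(self, logits: torch.Tensor,
               grammar_bitmask: torch.Tensor) -> torch.Tensor:
        self.apply_grammar_bitmask_(logits, grammar_bitmask)
        return logits.argmax(dim=-1)
```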

@@ -0,0 +1,56 @@
# SPDX-License-Identifier: Apache-2.0

I'm a bit hesitant to introduce the worker abstraction to structured outputs, as it is confusing to maintain and overlaps with the worker concept in vLLM.

I think if we just need an interface to be implemented at the worker level, it is better to have certain abstract functions that live inside v1/workers as a structured_outputs_worker_mixin or something.

The goal of v1/structured_output is mostly to include the backend implementation plus utilities for structured output, so I don't really see how having another interface for the worker here makes much sense.

Comment on lines +74 to +79
def get_worker_backend(
        vllm_config: VllmConfig) -> StructuredOutputWorkerBackend:
    if current_platform.is_tpu():
        return BitmaskTPUBackend(vllm_config)
    else:
        return BitmaskGPUBackend(vllm_config)

ditto w.r.t. the reasoning behind the worker backend abstraction.

@@ -81,7 +78,7 @@ def _validate_sampling_params(
     params: SamplingParams,
     lora_request: Optional[LoRARequest],
 ) -> None:
-    self._validate_structured_output(params)
+    StructuredOutputManager.validate_request(params, self.vllm_config)

I don't see much added value in this being simply a validation function.

If you want this validation function, it is probably more useful to create a structured_outputs/utils.py. We are going to move some utility functions from v0 to v1 soon anyway, so it might be good to have a utils file live somewhere in v1/structured_outputs.

# Meta data for structured output batches
# By default this holds only the structured_output_request_ids
# but backends may extend this to hold more data for the batch
structured_output_meta: Optional[StructuredOutputBatchMetaData]

Suggested change:
- structured_output_meta: Optional[StructuredOutputBatchMetaData]
+ structured_output_metadata: Optional[StructuredOutputBatchMetadata]

The a priori assumption is that we will always deal with batches in the v1 scheduler design, hence I don't think we need the name to explicitly say BatchMetadata. However, I do see the point of having verbose naming here.

grammar_bitmask: torch.Tensor


class BitmaskStructuredOutputBackend(StructuredOutputBackend):

I don't really see the benefit of subclassing this from the original backend class and calling it a bitmask backend.

I think going forward we will only support the bitmask backend. I don't know of a better implementation than bitmasking.

If we need to return logits for some reason, then it is not within the scope of structured outputs, I think (logits should then be returned after sampling, whereas with bitmask apply we do it before sampling).

Comment on lines +30 to +36
class BitmaskSOBatchMetaData(StructuredOutputBatchMetaData):
    """
    This class is used to store the bitmask for structured output requests.
    It is used to pass the bitmask to the GPU workers.
    """

    grammar_bitmask: torch.Tensor

Let's just make it all into the BatchMetadata class; I think this makes the code way too complex.

Comment on lines +131 to +138
def init_batch(
    self, requests: dict[str, Request],
    structured_output_request_ids: dict[str, int],
    scheduled_spec_decode_tokens: dict[str, list[int]]
) -> StructuredOutputBatchMetaData:
    bitmask = self.grammar_bitmask(requests, structured_output_request_ids,
                                   scheduled_spec_decode_tokens)
    return BitmaskSOBatchMetaData(structured_output_request_ids, bitmask)

imo this function is essentially two indirections, which introduces unnecessary complexity.

Instead of calling backend.init_batch elsewhere, I don't see the difference from

structured_output_metadata = StructuredOutputMetadata(
    structured_output_request_ids=structured_output_request_ids,
    bitmask=structured_output_manager.grammar_bitmask(...))

@william-baker-inflection (Author) commented:

Thank you for your prompt, detailed feedback; you make some really good points! I had wanted to give an update on my progress.

Let me provide some context for these changes. We are developing structured output mechanisms that extend beyond the current xgrammar and guidance implementations. Our approach involves a logit modification module that goes beyond simple bitmasking, which led us to implement it as a separate backend.
Given that other organizations working on structured outputs likely face similar requirements, we saw an opportunity to contribute a more flexible architecture to the vLLM community.

The core objectives of this PR are:

  1. Decouple structured outputs from bitmasking: Remove the hard requirement for structured output implementations to use bitmasking exclusively
  2. Simplify backend development: Make it easier to contribute new backend implementations by consolidating the necessary interfaces within the v1/structured_output folder, including validation logic and abstract metadata handling

Earlier versions of this PR didn't properly account for the multiprocess nature of distributed workers; this was due to both my limited familiarity with vLLM's architecture and the absence of distributed testing in test_struct_output_generate.py. The current version fixes this while maintaining support for multiple backend implementations. I envision adding a serving argument to switch between worker backends in the future, similar to how attention implementations can be selected. This has changed the initial stance of the PR, which originally brought the GPU logic into the structured output backends; I now know this isn't technically possible.

I hope this addresses your concerns about the complexity of abstracting away from bitmasking.
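To illustrate objective 1, a sketch of what a logits-level contract could look like (all class and method names here are hypothetical, not part of this PR or vLLM): the backend interface promises only "transform the logits for these requests", and bitmasking is just one implementation of it.

```python
from abc import ABC, abstractmethod

import torch


class LogitsTransformBackend(ABC):
    """Contract is 'transform logits'; bitmasking becomes one implementation."""

    @abstractmethod
    def transform_logits(self, request_ids: list[str],
                         logits: torch.Tensor) -> torch.Tensor:
        ...


class BitmaskBackend(LogitsTransformBackend):
    def __init__(self, masks: dict[str, torch.Tensor]):
        self.masks = masks  # per-request boolean allow-masks

    def transform_logits(self, request_ids: list[str],
                         logits: torch.Tensor) -> torch.Tensor:
        for i, rid in enumerate(request_ids):
            logits[i].masked_fill_(~self.masks[rid].bool(), float("-inf"))
        return logits


class SoftPenaltyBackend(LogitsTransformBackend):
    """A non-bitmask example: soft penalties instead of hard -inf masks."""

    def __init__(self, penalties: dict[str, torch.Tensor]):
        self.penalties = penalties

    def transform_logits(self, request_ids: list[str],
                         logits: torch.Tensor) -> torch.Tensor:
        for i, rid in enumerate(request_ids):
            logits[i] -= self.penalties[rid]
        return logits
```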


mergify bot commented Jun 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 12, 2025