
[RFC][core][V1] generalize structured output manager and backends #17503


Open · wants to merge 1 commit into base: main

Conversation


@william-baker-inflection commented Apr 30, 2025

Proposing to move all structured-output-related logic into v1/structured_output and standardise the interfaces between structured outputs and the rest of the code. Currently, structured output logic is scattered and fragmented throughout the code base, with pieces in the Scheduler, StructuredOutputManager and gpu_model_runner.

Benefits:

  • Simplify the GPU runner to have well-defined entry points to structured outputs
  • add flexibility for future structured outputs implementations
  • expose logits directly to the structured output backend (the code is already moving in this direction in tpu_model_runner)
    • This in turn will allow the backend to make use of the logits if needed, e.g. a verbose mode
  • remove dependency on xgrammar within GPU runner
  • generalise where possible for expandability
    • The init_batch callback could be used to trigger the re-shuffling of the grammar mask asynchronously rather than synchronously in the gpu_model_runner as is currently implemented
  • no performance impact

There are no functional changes to vLLM; this only rearranges code and changes the interfaces for xgrammar and guidance in vllm/v1/structured_output/backend_guidance.py and vllm/v1/structured_output/backend_xgrammar.py. The largest change is moving the bitmasking logic from the gpu_model_runner to vllm/v1/structured_output/bitmasking_grammar.py.
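For readers unfamiliar with the bitmasking approach, here is a minimal, illustrative sketch (not the code in this PR) of what applying a packed grammar bitmask to a logits tensor looks like; the packing convention of 32 vocab entries per int32 word follows what xgrammar-style bitmasks typically use, and both tensors are assumed to already live on the same device.

```python
import torch


def apply_grammar_bitmask(logits: torch.Tensor, bitmask: torch.Tensor) -> None:
    """Mask out grammar-disallowed tokens in place.

    logits:  [batch, vocab_size] float tensor
    bitmask: [batch, ceil(vocab_size / 32)] int32 tensor, 32 tokens per word
    """
    batch, vocab_size = logits.shape
    # Unpack each int32 word into 32 boolean "token allowed" flags.
    bit_positions = torch.arange(32, device=bitmask.device, dtype=torch.int32)
    allowed = (bitmask.unsqueeze(-1) >> bit_positions) & 1   # [batch, words, 32]
    allowed = allowed.reshape(batch, -1)[:, :vocab_size].bool()
    # Disallowed tokens get -inf so they can never be sampled.
    logits.masked_fill_(~allowed, float("-inf"))
```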

I have tested this with xgrammar and guidance.

WARNING: I have yet to refactor tpu_model_runner with this new logic, but I think this will help clean up the code duplication between gpu_model_runner and tpu_model_runner and move any TPU logic into bitmasking_grammar.py. I wanted to gather feedback before taking the time to make changes to the tpu_model_runner.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@NickLucche (Contributor) left a comment

Thanks for your contribution!

I think the main concern with sharing GPU and TPU code is that changes that clearly benefit the GPU side may impact TPU catastrophically.
In particular here we need to guarantee that whatever operation is done inside apply_grammar_bitmask is done on CPU and the resulting bitmask is only then moved to device.

If we foresee any possible changes to this invariant, then we may need to re-think the work here or leave TPU as is.

cc @Chenyaaang, who led structured decoding support on TPU
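To make the invariant above concrete, a hedged sketch (the function and argument names are illustrative, not vLLM's actual API): all per-request mask rows are assembled and reordered on the CPU, and only the finished tensor is copied to the device in a single transfer.

```python
import torch


def build_and_ship_bitmask(cpu_rows: dict[int, torch.Tensor],
                           batch_size: int,
                           words_per_row: int,
                           device: torch.device) -> torch.Tensor:
    # Start from an "all tokens allowed" mask on the host (-1 == all bits set).
    bitmask_cpu = torch.full((batch_size, words_per_row), -1, dtype=torch.int32)
    for batch_index, row in cpu_rows.items():
        bitmask_cpu[batch_index] = row          # pure CPU work, no device ops
    # One host-to-device copy at the end; in practice a pre-allocated pinned
    # buffer would be reused instead of allocating a fresh tensor here.
    return bitmask_cpu.to(device, non_blocking=True)
```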

@william-baker-inflection (Author) commented:

> Thanks for your contribution!
>
> I think the main concern with sharing GPU and TPU code is that changes that clearly benefit the GPU side may impact TPU catastrophically. In particular here we need to guarantee that whatever operation is done inside apply_grammar_bitmask is done on CPU and the resulting bitmask is only then moved to device.
>
> If we foresee any possible changes to this invariant, then we may need to re-think the work here or leave TPU as is.
>
> cc @Chenyaaang, who led structured decoding support on TPU

I admit I didn't put a lot of thought into the TPU implementation when I designed this, and reviewing the tpu_model_runner further I can see how the implementations may need to be independent. I can restructure the code fairly easily to achieve this; here is the current structure:
[class diagram: current structure]

I could restructure into a strategy template like so:
[class diagram: proposed strategy template]

The bitmask strategy defines the CPU logic and the backends define the device logic. You can then add as many backends and strategies as you wish without any duplication.

Init would look something like this:

if xgrammar:
    strategy = XgrammarStrategy()
elif guidance:
    strategy = GuidanceStrategy()
...
if gpu:
    backend = GPUBitMaskBackend(strategy)
elif tpu:
    backend = TPUBitMaskBackend(strategy)
I'll also add proper docstrings in the next version.
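A minimal sketch of that split (the backend class names follow the init snippet above; everything else is an assumption, not the PR's final code): the strategy owns the CPU/grammar logic, the backend owns the device-side application, and the init block above simply wires one into the other.

```python
from abc import ABC, abstractmethod

import torch


class BitmaskStrategy(ABC):
    """CPU-side logic: grammar compilation and filling bitmask rows."""

    @abstractmethod
    def fill_bitmask(self, request_id: str, out_row: torch.Tensor) -> None:
        ...


class BitmaskBackend(ABC):
    """Device-side logic: moving the mask and applying it to logits."""

    def __init__(self, strategy: BitmaskStrategy):
        self.strategy = strategy

    @abstractmethod
    def apply(self, logits: torch.Tensor, bitmask_cpu: torch.Tensor) -> torch.Tensor:
        ...


class GPUBitMaskBackend(BitmaskBackend):
    def apply(self, logits: torch.Tensor, bitmask_cpu: torch.Tensor) -> torch.Tensor:
        # bitmask_cpu here holds one 0/1 flag per vocab token for simplicity.
        mask = bitmask_cpu.to(logits.device, non_blocking=True)
        logits.masked_fill_(~mask.bool(), float("-inf"))
        return logits


class TPUBitMaskBackend(BitmaskBackend):
    def apply(self, logits: torch.Tensor, bitmask_cpu: torch.Tensor) -> torch.Tensor:
        # TPU path keeps mask construction on CPU and uses an out-of-place,
        # XLA-friendly op on device.
        mask = bitmask_cpu.to(logits.device)
        neg_inf = torch.full_like(logits, torch.finfo(logits.dtype).min)
        return torch.where(mask.bool(), logits, neg_inf)
```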

@NickLucche (Contributor) commented May 8, 2025

Hey, sure, having a separate backend would work for this case. I am wondering whether we could strive for a simpler hierarchy here though, given that TPU is the sole exception.
I think something like TopKTopPSampler https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_sampler.py#L75 is very easy to maintain.

But then again, I am not sure what kind of expansion you are anticipating on the structured output side in terms of strategies.
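For reference, a sketch of the single-class pattern being pointed to (the class and method names here are hypothetical stand-ins for a structured output equivalent): one forward entry point whose implementation is chosen per platform in __init__, rather than a parallel class hierarchy.

```python
import torch
from torch import nn


class GrammarBitmaskApplier(nn.Module):
    """Single entry point; the platform-specific path is bound once at init."""

    def __init__(self, is_tpu: bool):
        super().__init__()
        self.forward = self.forward_tpu if is_tpu else self.forward_native

    def forward_native(self, logits: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
        # In-place masking is fine on GPU/CPU.
        logits.masked_fill_(~mask.bool(), float("-inf"))
        return logits

    def forward_tpu(self, logits: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
        # Out-of-place variant that plays nicer with XLA tracing.
        neg_inf = torch.full_like(logits, torch.finfo(logits.dtype).min)
        return torch.where(mask.bool(), logits, neg_inf)
```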


mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 9bc0c1e to 84a0241 Compare May 12, 2025 16:26
@mergify mergify bot removed the needs-rebase label May 12, 2025
@william-baker-inflection (Author) commented:

I have tested the latest version on a TPU and it seems to have introduced an intermittent bug; I'm still not sure what the cause is...


mergify bot commented May 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 14, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from ca981e9 to 9b983cf Compare May 15, 2025 14:12
@mergify mergify bot removed the needs-rebase label May 15, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 9b983cf to 9e001b2 Compare May 15, 2025 14:19
@william-baker-inflection (Author) commented:

I have now tested this on V0 and V1 with Qwen & Llama, single node, on GPU and TPU, and it seems to perform well.
Pending passing tests, this should be ready to merge.
In addition to the benefits mentioned at the beginning of this PR, this now also improves TPU performance by modifying logits in-place.


mergify bot commented May 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 17, 2025
@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 9e001b2 to fe6a707 Compare May 19, 2025 09:33
@mergify mergify bot removed the needs-rebase label May 19, 2025
@NickLucche (Contributor) left a comment

Hey, left some comments, please take a look when you have the time.
Overall I am a bit concerned with scattering tpu logic around the code base with the current implementation.

I think what I wanted to propose was to use a single interface for some of the methods but hide that away from direct use; e.g. the topk/topp sampler provides a single forward() entry point whose behavior depends on the platform.

What I would want to avoid is having some generic class (BitmaskStructuredOutputBackend here) which I use like obj.do_tpu_specific_thing (e.g. prepare_structured_decoding_input_tpu).
I'd rather have the runner use said obj while dealing with TPU details, like

# runner pseudocode, compilation
for size in compile_sizes:
    dummy_data = ..
    obj.forward(dummy_data)

@yaochengji can you help me review the TPU part here?

Comment on lines 343 to 349
def precompile(self, num_reqs_paddings: list[int], device: torch.device,
               hidden_states_dtype: torch.dtype):
    if current_platform.is_tpu():
        for num_reqs in num_reqs_paddings:
            dummy_logits = torch.zeros((num_reqs, self.tpu_vocab_size),
                                       device=device,
                                       dtype=hidden_states_dtype)

same thing here, I think it's best to keep the TPU code inside the runner, and then have the runner use the bitmask object to compile.
I don't think we should scatter the TPU init logic throughout the code base.


This should be doable if I create a dummy scheduler_output and input_batch within the TPU worker precompile code.

@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from fe6a707 to fe34514 Compare May 21, 2025 10:07
@william-baker-inflection (Author) commented:

The latest version moves to the aforementioned strategy template for separating platform logic. The manager now constructs a strategy (xgrammar or guidance) and then constructs a platform-specific backend using the chosen strategy.
Precompilation has been simplified to depend solely on dummy logits. kwargs have been added to the logit-filtering and precompile calls to support platform-specific args.
I have tested fully on GPU and partially on TPU, as the current version of main gives NaN model outputs on Qwen.
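A hedged sketch of what that kwargs plumbing could look like (the method and argument names are assumptions, not the PR's exact API): the shared interface stays platform-agnostic, and platform-specific arguments ride through **kwargs to the backend that understands them.

```python
from typing import Optional

import torch


class BitmaskBackendBase:
    def precompile(self, dummy_logits: torch.Tensor, **kwargs) -> None:
        # Default: nothing to precompile.
        pass

    def filter_logits(self, logits: torch.Tensor, bitmask: torch.Tensor,
                      **kwargs) -> torch.Tensor:
        logits.masked_fill_(~bitmask.bool(), float("-inf"))
        return logits


class TPUBitmaskBackend(BitmaskBackendBase):
    def precompile(self, dummy_logits: torch.Tensor, *,
                   num_reqs_paddings: Optional[list[int]] = None,
                   **kwargs) -> None:
        # The TPU-only argument is consumed here; GPU callers never pass it.
        dummy_mask = torch.ones_like(dummy_logits, dtype=torch.int32)
        for _ in (num_reqs_paddings or []):
            self.filter_logits(dummy_logits.clone(), dummy_mask)
```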

@NickLucche (Contributor) left a comment

Nice work here! I am much happier with the tpu changes now, thanks for the patience.
Let's wait for green tests.

super().__init__(vllm_config, tokenizer, vocab_size, reasoner,
                 strategy)
self._grammar_bitmask: Optional[torch.Tensor] = None
self.max_num_reqs: Optional[int] = None

do we need to keep self.max_num_reqs around?
If this is only needed for initializing the tensors, we could check whether one of the tensors is not None, or have the TPU model runner call init_tensors.


I was purely using it to manage state (if it's None we need to init tensors); I didn't want to over-complicate the type checking by checking that all three tensors are not None. If we did this:

self.require_structured_out_cpu: Optional[torch.Tensor] = None
self.structured_decode_arange: Optional[torch.Tensor] = None
self.grammar_bitmask_cpu: Optional[torch.Tensor] = None

We would need assertions in all the compute methods, or to ignore typing at each occurrence.
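A small illustrative sketch of that trade-off (the attribute names come from the snippet above; the class and method are hypothetical): without a sentinel such as max_num_reqs, each compute method has to re-assert that the Optional tensors are initialized before the type checker allows their use.

```python
from typing import Optional

import torch


class TPUBitmaskState:
    """Illustrates the assertion burden without a single sentinel value."""

    def __init__(self) -> None:
        self.require_structured_out_cpu: Optional[torch.Tensor] = None
        self.structured_decode_arange: Optional[torch.Tensor] = None
        self.grammar_bitmask_cpu: Optional[torch.Tensor] = None

    def compute(self) -> torch.Tensor:
        # Every compute method needs assertions like these (or `# type: ignore`)
        # before the tensors can be used.
        assert self.grammar_bitmask_cpu is not None
        assert self.structured_decode_arange is not None
        return self.grammar_bitmask_cpu[self.structured_decode_arange]
```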

@NickLucche (Contributor) commented:

cc @russellb

@william-baker-inflection william-baker-inflection force-pushed the generalized-structured-decoding branch from 5d24e8e to 3d6863c Compare May 23, 2025 19:01
@william-baker-inflection (Author) commented:

Ready to merge now @russellb @NickLucche

Signed-off-by: william-baker-inflection <william.baker@inflection.ai>
@aarnphm (Collaborator) left a comment

I want to understand more of the standardization reasoning here. If we need a design to support out-of-tree (OOT) plugins, then I believe we will need some design decisions here. The initial reason for the code to live in the Scheduler and gpu_model_runner is that it provides a better separation of concerns. The manager acts more like a holder for the scheduler, so I can see the point of moving some scheduler logic here.

> Simplify the GPU runner to have well-defined entry points to structured outputs
> add flexibility for future structured outputs implementations
> expose logits directly to the structured output backend (the code is already moving in this direction in tpu_model_runner)

My general philosophy is that we should only establish an interface when we can't find a simpler way to do it. Right now, I think this is too complicated for the little improvement in readability (for example the init_batch function, or the structured outputs worker abstraction).

A priori assumptions that can be made:

  • We will probably only deal with a bitmask implementation (I don't know of any other at the moment), given that it yields the best performance (as observed with xgrammar + llguidance)
  • the logic to apply the bitmask should be worker-specific.

> There are no functional changes to vLLM; this only rearranges code and changes the interfaces for xgrammar and guidance in vllm/v1/structured_output/backend_guidance.py and vllm/v1/structured_output/backend_xgrammar.py. The largest change is moving the bitmasking logic from the gpu_model_runner to vllm/v1/structured_output/bitmasking_grammar.py.

> WARNING: I have yet to refactor tpu_model_runner with this new logic, but I think this will help clean up the code duplication between gpu_model_runner and tpu_model_runner and move any TPU logic into bitmasking_grammar.py. I wanted to gather feedback before taking the time to make changes to the tpu_model_runner.

If we want to improve readability, I think we can create a mixin instead of creating another worker abstraction, which I think is a bit overkill.
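A hedged sketch of the mixin idea (the mixin name mirrors the structured_outputs_worker_mixin suggested later in this review; the method bodies and the simplified runner are assumptions, not vLLM code): the model runner inherits the structured output helpers rather than holding a separate worker-backend object.

```python
import torch


class StructuredOutputsWorkerMixin:
    """Bitmask helpers inherited by model runners instead of a backend object."""

    def apply_grammar_bitmask_(self, logits: torch.Tensor,
                               grammar_bitmask: torch.Tensor) -> None:
        # In-place masking; the runner decides when tensors move between devices.
        logits.masked_fill_(~grammar_bitmask.bool(), float("-inf"))


class SimplifiedGPUModelRunner(StructuredOutputsWorkerMixin):
    """Greatly simplified stand-in for the real runner, for illustration only."""

    def sample(self, logits: torch.Tensor,
               grammar_bitmask: torch.Tensor) -> torch.Tensor:
        self.apply_grammar_bitmask_(logits, grammar_bitmask)
        return logits.argmax(dim=-1)
```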

@@ -0,0 +1,56 @@
# SPDX-License-Identifier: Apache-2.0

I'm a bit hesitant to introduce the worker abstraction to structured outputs, as it is confusing to maintain and overlaps with the worker concept in vLLM.

I think if we just need an interface to be implemented at the worker level, it is better to have certain abstract functions that live inside v1/workers as a structured_outputs_worker_mixin or something.

The goal of v1/structured_output is mostly to include the backend implementation plus utilities for structured output, so I don't really see how having another interface for the worker here makes much sense.

Comment on lines +74 to +79
def get_worker_backend(
        vllm_config: VllmConfig) -> StructuredOutputWorkerBackend:
    if current_platform.is_tpu():
        return BitmaskTPUBackend(vllm_config)
    else:
        return BitmaskGPUBackend(vllm_config)

ditto w.r.t. the reasoning behind the worker backend abstraction.

@@ -81,7 +78,7 @@ def _validate_sampling_params(
     params: SamplingParams,
     lora_request: Optional[LoRARequest],
 ) -> None:
-    self._validate_structured_output(params)
+    StructuredOutputManager.validate_request(params, self.vllm_config)

I don't see much added value in this being simply a validation function.

If you want this validation function, it is probably more useful to create a structured_outputs/utils.py. We are going to move some utility functions from v0 to v1 soon anyway, so it might be good to have a utils file live somewhere in v1/structured_outputs.

# Meta data for structured output batches
# By default this holds only the structured_output_request_ids
# but backends may extend this to hold more data for the batch
structured_output_meta: Optional[StructuredOutputBatchMetaData]

Suggested change:
- structured_output_meta: Optional[StructuredOutputBatchMetaData]
+ structured_output_metadata: Optional[StructuredOutputBatchMetadata]

The a priori assumption is that we will always deal with batches in the v1 scheduler design, hence I don't think we need the name to explicitly say BatchMetadata. However, I do see the point of having verbose naming here.

grammar_bitmask: torch.Tensor


class BitmaskStructuredOutputBackend(StructuredOutputBackend):

I don't really see the benefit of subclassing this from the original backend class and calling it a bitmask backend.

I think going forward we will only support the bitmask backend. I don't know of a better implementation than bitmasking.

If we need to return logits for some reason, then it is not within the scope of structured outputs, I think (logits should then be returned after sampling, whereas with bitmask apply we do it before sampling).

Comment on lines +30 to +36
class BitmaskSOBatchMetaData(StructuredOutputBatchMetaData):
    """
    This class is used to store the bitmask for structured output requests.
    It is used to pass the bitmask to the GPU workers.
    """

    grammar_bitmask: torch.Tensor

Let's just make it all into the BatchMetadata class; I think this makes the code way too complex.

Comment on lines +131 to +138
def init_batch(
    self, requests: dict[str, Request],
    structured_output_request_ids: dict[str, int],
    scheduled_spec_decode_tokens: dict[str, list[int]]
) -> StructuredOutputBatchMetaData:
    bitmask = self.grammar_bitmask(requests, structured_output_request_ids,
                                   scheduled_spec_decode_tokens)
    return BitmaskSOBatchMetaData(structured_output_request_ids, bitmask)

imo this function is essentially two indirections, which introduces unnecessary complexity.

Instead of calling backend.init_batch elsewhere, I don't see the difference from

structured_output_metadata = StructuredOutputMetadata(
    structured_output_request_ids=structured_output_request_ids,
    bitmask=structured_output_manager.grammar_bitmask(...))

@william-baker-inflection (Author) commented:

Thank you for your prompt, detailed feedback; you make some really good points! I had wanted to give an update on my progress.

Let me provide some context for these changes. We are developing structured output mechanisms that extend beyond the current xgrammar and guidance implementations. Our approach involves a logit modification module that goes beyond simple bitmasking, which led us to implement it as a separate backend.
Given that other organizations working on structured outputs likely face similar requirements, we saw an opportunity to contribute a more flexible architecture to the vLLM community.

The core objectives of this PR are:

  1. Decouple structured outputs from bitmasking: Remove the hard requirement for structured output implementations to use bitmasking exclusively
  2. Simplify backend development: Make it easier to contribute new backend implementations by consolidating the necessary interfaces within the v1/structured_output folder, including validation logic and abstract metadata handling

Earlier versions of this PR didn't properly account for the multiprocess nature of distributed workers; this was due to both my limited familiarity with vLLM's architecture and the absence of distributed testing in test_struct_output_generate.py. The current version fixes this while maintaining support for multiple backend implementations. I envision adding a serving argument to switch between worker backends in the future, similar to how attention implementations can be selected. This has changed the initial stance of the PR, which originally brought the GPU logic into the structured output backends; I now know this isn't technically possible.

I hope this addresses your concerns about the complexity of abstracting away from bitmasking.
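To illustrate objective 1, a sketch of what a logits-level contract could look like (all class and method names here are hypothetical, not part of this PR or vLLM): the backend interface promises only "transform the logits for these requests", and bitmasking is just one implementation of it.

```python
from abc import ABC, abstractmethod

import torch


class LogitsTransformBackend(ABC):
    """Contract is 'transform logits'; bitmasking becomes one implementation."""

    @abstractmethod
    def transform_logits(self, request_ids: list[str],
                         logits: torch.Tensor) -> torch.Tensor:
        ...


class BitmaskBackend(LogitsTransformBackend):
    def __init__(self, masks: dict[str, torch.Tensor]):
        self.masks = masks  # per-request boolean allow-masks

    def transform_logits(self, request_ids: list[str],
                         logits: torch.Tensor) -> torch.Tensor:
        for i, rid in enumerate(request_ids):
            logits[i].masked_fill_(~self.masks[rid].bool(), float("-inf"))
        return logits


class SoftPenaltyBackend(LogitsTransformBackend):
    """A non-bitmask example: soft penalties instead of hard -inf masks."""

    def __init__(self, penalties: dict[str, torch.Tensor]):
        self.penalties = penalties

    def transform_logits(self, request_ids: list[str],
                         logits: torch.Tensor) -> torch.Tensor:
        for i, rid in enumerate(request_ids):
            logits[i] -= self.penalties[rid]
        return logits
```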


mergify bot commented Jun 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @william-baker-inflection.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 12, 2025