Closes vllm-project#19493
Closes vllm-project#18376
Related to vllm-project#18780
Several people have noticed errors when using both the `xgrammar` and
`guidance` backends where we would start generating invalid tokens for a
request and they would be continuously rejected by the backend currently
in use. The conditions seemed to be:
- Only impacts certain models
- Occurs with concurrent structured output requests
After further investigation, once an easy way to reproduce was provided
via vllm-project#19493, I identified more details about the failure:
- When the failure occurred in my test using a concurrency of 2,
  whichever request came in first was always successful. It was the
  second request that would fail.
Debugging further identified that the bitmask was not being applied
correctly, but only for that second request. In the GPU model runner,
this translates to the 2nd row in the bitmask tensor and the 2nd row
of the logits tensor. I could see that a couple of bytes were left
unmasked.
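
For illustration, a row of the bitmask can be decoded like this to see which
tokens it still allows. This is a minimal debugging sketch, not code from this
change; the helper name is mine, and it assumes xgrammar's packing of one bit
per token into int32 words, where a set bit means the token is allowed:

```python
import torch


def allowed_token_ids(bitmask_row: torch.Tensor, vocab_size: int) -> list[int]:
    """Decode one row of the packed int32 bitmask into the token ids it allows.

    Hypothetical helper for debugging only; assumes one bit per token packed
    little-endian into int32 words, with a set bit meaning "allowed".
    """
    allowed = []
    for token_id in range(vocab_size):
        word = int(bitmask_row[token_id // 32].item())
        if (word >> (token_id % 32)) & 1:
            allowed.append(token_id)
    return allowed
```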
I suspect the reason the issue appears to be model specific has to do
with the vocab and which tokens were left unmasked, though I have not
verified that part.
It occurred with both structured output backends because we use the
`xgrammar` library's implementation of applying the bitmask in all
cases.
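
For context, the flow looks roughly like standard xgrammar usage. This is a
simplified sketch, not the exact vLLM code path; the example batch and vocab
sizes are made up:

```python
import torch
import xgrammar as xgr

batch_size, vocab_size = 2, 32000  # example values

# One bitmask row per request; each matcher fills in the tokens its grammar
# currently allows (per-request fill calls omitted here).
bitmask = xgr.allocate_token_bitmask(batch_size, vocab_size)

logits = torch.randn(batch_size, vocab_size, device="cuda")

# On CUDA, xgrammar's wrapper dispatches to its Triton kernel by default.
xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))
```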
Xgrammar on CUDA, by default, uses a Triton kernel for applying the
bitmask. I identified that by forcing it to use the `torch.compile`
implementation instead, the problem is resolved. The torch
implementation is used for all other accelerator types in Xgrammar's
logic, so it seems fine to just force the use of that implementation.
I have not yet narrowed down the problem in the Triton kernel, but this
change works around the problem for vLLM.
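
The change amounts to bypassing xgrammar's dispatching wrapper and calling the
torch-based kernel directly. A rough sketch; the import path below is my best
reading of xgrammar's layout and may differ between xgrammar versions:

```python
# Skip xgrammar's dispatching wrapper (which picks the Triton kernel on CUDA)
# and always use the torch.compile-based kernel. Import path is an assumption
# and may vary by xgrammar version.
from xgrammar.kernels.apply_token_bitmask_inplace_torch_compile import (
    apply_token_bitmask_inplace_torch_compile as apply_token_bitmask_inplace,
)

# Call sites stay the same: the bitmask is still applied in place to the
# logits rows belonging to structured output requests, e.g.
#   apply_token_bitmask_inplace(logits, bitmask, indices)
```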
We can move back to Xgrammar's wrapper that chooses which implementation
to use once we can verify everything is working properly again.
Signed-off-by: Russell Bryant <rbryant@redhat.com>