Closes vllm-project#19493
Closes vllm-project#18376
Related to vllm-project#18780
Several people have noticed errors when using both the `xgrammar` and
`guidance` backends where we would start generating invalid tokens for a
request and they would be continuously rejected by the backend currently
in use. The conditions seemed to be:
- Only impacts certain models
- Occurs with concurrent structured output requests
After further investigation, once an easy way to reproduce was provided
via vllm-project#19493, I identified more details about the failure:
- When the failure occurred in my test using a concurrency of 2,
  whichever request came in first was always successful. It was the
  second request that would fail.
Debugging further identified that the bitmask was not being applied
correctly, but only for that second request. In the GPU model runner,
this translates to the 2nd row in the bitmask tensor and the 2nd row
of the logits tensor. I could see that a couple of bytes were left
unmasked.
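
For illustration, a row of the bitmask can be decoded like this to see which
tokens it still allows. This is a minimal debugging sketch, not code from this
change; the helper name is mine, and it assumes xgrammar's packing of one bit
per token into int32 words, where a set bit means the token is allowed:

```python
import torch


def allowed_token_ids(bitmask_row: torch.Tensor, vocab_size: int) -> list[int]:
    """Decode one row of the packed int32 bitmask into the token ids it allows.

    Hypothetical helper for debugging only; assumes one bit per token packed
    little-endian into int32 words, with a set bit meaning "allowed".
    """
    allowed = []
    for token_id in range(vocab_size):
        word = int(bitmask_row[token_id // 32].item())
        if (word >> (token_id % 32)) & 1:
            allowed.append(token_id)
    return allowed
```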
I suspect the reason the issue appears to be model specific has to do
with the vocab and which tokens were left unmasked, though I have not
verified that part.
It occurred with both structured output backends because we use the
`xgrammar` library's implementation of applying the bitmask in all
cases.
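
For context, the flow looks roughly like standard xgrammar usage. This is a
simplified sketch, not the exact vLLM code path; the example batch and vocab
sizes are made up:

```python
import torch
import xgrammar as xgr

batch_size, vocab_size = 2, 32000  # example values

# One bitmask row per request; each matcher fills in the tokens its grammar
# currently allows (per-request fill calls omitted here).
bitmask = xgr.allocate_token_bitmask(batch_size, vocab_size)

logits = torch.randn(batch_size, vocab_size, device="cuda")

# On CUDA, xgrammar's wrapper dispatches to its Triton kernel by default.
xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))
```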
Xgrammar on CUDA, by default, uses a Triton kernel for applying the
bitmask. I identified that by forcing it to use the `torch.compile`
implementation instead, the problem is resolved. The torch
implementation is used for all other accelerator types in Xgrammar's
logic, so it seems fine to just force the use of that implementation.
I have not yet narrowed down the problem in the Triton kernel, but this
change works around the problem for vLLM.
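
The change amounts to bypassing xgrammar's dispatching wrapper and calling the
torch-based kernel directly. A rough sketch; the import path below is my best
reading of xgrammar's layout and may differ between xgrammar versions:

```python
# Skip xgrammar's dispatching wrapper (which picks the Triton kernel on CUDA)
# and always use the torch.compile-based kernel. Import path is an assumption
# and may vary by xgrammar version.
from xgrammar.kernels.apply_token_bitmask_inplace_torch_compile import (
    apply_token_bitmask_inplace_torch_compile as apply_token_bitmask_inplace,
)

# Call sites stay the same: the bitmask is still applied in place to the
# logits rows belonging to structured output requests, e.g.
#   apply_token_bitmask_inplace(logits, bitmask, indices)
```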
We can move back to Xgrammar's wrapper that chooses which implementation
to use once we can verify everything is working properly again.
Signed-off-by: Russell Bryant <rbryant@redhat.com>