Description
Your current environment
(see repro info below)
🐛 Describe the bug
With the google/gemma-3-27b-it model, if I request a JSON-formatted response, only a couple of tokens are generated before the request hangs. Subsequent requests to the server then all hang as well, even after cancelling the first hung request (and vLLM logging that it aborted the request).
To reproduce, simply run the server:
vllm serve google/gemma-3-27b-it --max-model-len 32000 --max-num-seqs 16 --enforce-eager
and send a request with JSON structured output:
curl --request POST 'http://localhost:8000/v1/chat/completions' -H 'Content-Type: application/json' -H 'Accept: application/json' --data-raw '{
  "model": "google/gemma-3-27b-it",
  "stream": true,
  "response_format": {"type": "json_object"},
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Generate a sample JSON document."
        }
      ]
    }
  ]
}'
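For reference, an equivalent request via the OpenAI Python client (a minimal sketch, assuming the `openai` package is installed and the server is running locally as started above; the API key is just a placeholder since vLLM does not require one by default):

from openai import OpenAI

# Sketch of the same request through the OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# JSON-mode structured output, streamed, same payload as the curl above.
stream = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    stream=True,
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "user",
            "content": [{"type": "text", "text": "Generate a sample JSON document."}],
        }
    ],
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)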
Full Logs
INFO 07-17 22:44:02 [__init__.py:244] Automatically detected platform cuda.
INFO 07-17 22:44:05 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-17 22:44:05 [cli_args.py:325] non-default args: {'model': 'google/gemma-3-27b-it', 'max_model_len': 64000, 'enforce_eager': True, 'max_num_seqs': 16}
INFO 07-17 22:44:12 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-17 22:44:12 [config.py:1472] Using max model len 64000
INFO 07-17 22:44:12 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 07-17 22:44:12 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 07-17 22:44:14 [core.py:526] Waiting for init message from front-end.
INFO 07-17 22:44:14 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='google/gemma-3-27b-it', speculative_config=None, tokenizer='google/gemma-3-27b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=64000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=google/gemma-3-27b-it, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 07-17 22:44:15 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 07-17 22:44:22 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-17 22:44:22 [gpu_model_runner.py:1770] Starting to load model google/gemma-3-27b-it...
INFO 07-17 22:44:22 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-17 22:44:22 [cuda.py:287] Using FlexAttention backend on V1 engine.
INFO 07-17 22:44:23 [cuda.py:284] Using Flash Attention backend on V1 engine.
INFO 07-17 22:44:23 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/12 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 8% Completed | 1/12 [00:00<00:08, 1.30it/s]
Loading safetensors checkpoint shards: 17% Completed | 2/12 [00:01<00:08, 1.19it/s]
Loading safetensors checkpoint shards: 25% Completed | 3/12 [00:02<00:07, 1.15it/s]
Loading safetensors checkpoint shards: 33% Completed | 4/12 [00:03<00:07, 1.13it/s]
Loading safetensors checkpoint shards: 42% Completed | 5/12 [00:04<00:06, 1.11it/s]
Loading safetensors checkpoint shards: 50% Completed | 6/12 [00:05<00:05, 1.10it/s]
Loading safetensors checkpoint shards: 58% Completed | 7/12 [00:06<00:04, 1.09it/s]
Loading safetensors checkpoint shards: 67% Completed | 8/12 [00:07<00:03, 1.09it/s]
Loading safetensors checkpoint shards: 75% Completed | 9/12 [00:08<00:02, 1.09it/s]
Loading safetensors checkpoint shards: 83% Completed | 10/12 [00:08<00:01, 1.10it/s]
Loading safetensors checkpoint shards: 92% Completed | 11/12 [00:09<00:00, 1.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:10<00:00, 1.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:10<00:00, 1.19it/s]
INFO 07-17 22:44:33 [default_loader.py:272] Loading weights took 10.11 seconds
INFO 07-17 22:44:34 [gpu_model_runner.py:1801] Model loading took 51.4478 GiB and 10.975492 seconds
INFO 07-17 22:44:34 [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
INFO 07-17 22:44:37 [gpu_worker.py:232] Available KV cache memory: 18.83 GiB
WARNING 07-17 22:44:37 [kv_cache_utils.py:831] Add 8 padding layers, may waste at most 15.38% KV cache memory
INFO 07-17 22:44:37 [kv_cache_utils.py:873] GPU KV cache size: 35,248 tokens
INFO 07-17 22:44:37 [kv_cache_utils.py:877] Maximum concurrency for 64,000 tokens per request: 2.99x
INFO 07-17 22:44:37 [core.py:172] init engine (profile, create kv cache, warmup model) took 3.67 seconds
INFO 07-17 22:44:39 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 15426
WARNING 07-17 22:44:39 [config.py:1392] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 07-17 22:44:39 [serving_chat.py:125] Using default chat sampling params from model: {'top_k': 64, 'top_p': 0.95}
INFO 07-17 22:44:39 [serving_completion.py:72] Using default completion sampling params from model: {'top_k': 64, 'top_p': 0.95}
INFO 07-17 22:44:39 [api_server.py:1457] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 07-17 22:44:39 [launcher.py:29] Available routes are:
INFO 07-17 22:44:39 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
INFO 07-17 22:44:39 [launcher.py:37] Route: /docs, Methods: GET, HEAD
INFO 07-17 22:44:39 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 07-17 22:44:39 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
INFO 07-17 22:44:39 [launcher.py:37] Route: /health, Methods: GET
INFO 07-17 22:44:39 [launcher.py:37] Route: /load, Methods: GET
INFO 07-17 22:44:39 [launcher.py:37] Route: /ping, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /ping, Methods: GET
INFO 07-17 22:44:39 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 07-17 22:44:39 [launcher.py:37] Route: /version, Methods: GET
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /pooling, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /classify, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /score, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /rerank, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /invocations, Methods: POST
INFO 07-17 22:44:39 [launcher.py:37] Route: /metrics, Methods: GET
INFO: Started server process [9140]
INFO: Waiting for application startup.
INFO: Application startup complete.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 07-17 22:45:43 [chat_utils.py:444] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
INFO 07-17 22:45:43 [logger.py:43] Received request chatcmpl-ca6de8da7d9644f69f4304590c5b4693: prompt: '<bos><start_of_turn>user\nGenerate a sample JSON document.<end_of_turn>\n<start_of_turn>model\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=63985, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:44890 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-17 22:45:43 [async_llm.py:270] Added request chatcmpl-ca6de8da7d9644f69f4304590c5b4693.
[rank0]:W0717 22:45:44.070000 9411 torch/_dynamo/convert_frame.py:964] [0/8] torch._dynamo hit config.recompile_limit (8)
[rank0]:W0717 22:45:44.070000 9411 torch/_dynamo/convert_frame.py:964] [0/8] function: 'forward_static' (/tmp/home/my-vllm/lib64/python3.12/site-packages/vllm/model_executor/layers/layernorm.py:241)
[rank0]:W0717 22:45:44.070000 9411 torch/_dynamo/convert_frame.py:964] [0/8] last reason: 0/7: expected type of 'residual' to be a tensor type, ' but found <class 'NoneType'>
[rank0]:W0717 22:45:44.070000 9411 torch/_dynamo/convert_frame.py:964] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:W0717 22:45:44.070000 9411 torch/_dynamo/convert_frame.py:964] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
INFO 07-17 22:45:47 [async_llm.py:431] Aborted request chatcmpl-ca6de8da7d9644f69f4304590c5b4693.
INFO 07-17 22:45:47 [async_llm.py:339] Request chatcmpl-ca6de8da7d9644f69f4304590c5b4693 aborted.
INFO 07-17 22:45:50 [loggers.py:118] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 3.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
INFO 07-17 22:46:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
INFO 07-17 22:46:01 [logger.py:43] Received request chatcmpl-0fc7ae25138246a2b6b525e6565e5600: prompt: '<bos><start_of_turn>user\nGenerate a sample JSON document.<end_of_turn>\n<start_of_turn>model\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=63985, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=True, backend=None, backend_was_auto=False, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, whitespace_pattern=None, structural_tag=None), extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:60988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-17 22:46:01 [async_llm.py:270] Added request chatcmpl-0fc7ae25138246a2b6b525e6565e5600.
INFO 07-17 22:46:10 [loggers.py:118] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
** Hung here
INFO 07-17 22:46:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
I also ran the repro against recent past versions of vLLM (and their dependencies) and found:
- this worked fine back in v0.8.4
- in v0.8.5 and v0.8.5.post1, the response streams empty tokens and logs an interesting warning indicating that no tokens match the guide:
ERROR 07-17 22:03:54 [backend_xgrammar.py:167] Failed to advance FSM for request chatcmpl-78eca22187e24416a39e4c12c73dad75 for tokens 0. Please file an issue.
- in v0.9.0 and later, it streams one token and then hangs without any ERROR message
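In case it helps triage, here is a rough sketch of exercising the same json_object structured output through the offline LLM API, i.e. outside the OpenAI-compatible server, assuming the 0.9.x interfaces visible in the request log above (SamplingParams with GuidedDecodingParams(json_object=True)); I have not confirmed whether this path also stalls:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Sketch only: drive json_object structured output through the offline API
# to see whether the stall also occurs outside the API server.
llm = LLM(
    model="google/gemma-3-27b-it",
    max_model_len=32000,
    max_num_seqs=16,
    enforce_eager=True,
)
params = SamplingParams(
    max_tokens=256,
    guided_decoding=GuidedDecodingParams(json_object=True),
)
outputs = llm.chat(
    [{"role": "user", "content": "Generate a sample JSON document."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)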
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.