[BUG] Qwen 1.5B Inference Crash on Vertex AI Platform #144

Description

@rothn

I'm trying to deploy Qwen 1.5B on Vertex AI Endpoints, and it crashes on deployment, while Qwen 7B deploys perfectly fine using the same HuggingFace TRL configuration (other than the base model) to train both. Note that training and local inference work fine for both 1.5B and 7B. The serving container I'm using is us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-4.ubuntu2204.py311. My requirements.txt for the training / local-inference setup is as follows:

accelerate==1.4.0
deepspeed==0.16.3
importlib-metadata==8.6.1
transformers==4.49.0
trl @ git+https://github.com/huggingface/trl@v0.15.1
protobuf==5.29.3
sentencepiece==0.2.0
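
For reference, here is roughly the sanity check I run locally against the fine-tuned checkpoint (the checkpoint path, dtype, and prompt are placeholders for my actual values):

# local_inference_check.py -- confirm the fine-tuned model generates locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./qwen-1.5b-finetuned"  # placeholder: wherever TRL saved the model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # assumption: bf16; match your training dtype
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))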

Logs from the container referenced above:
aiplatform_endpoints_crash.log

Container environment variables:

serving_container_environment_variables={
    "NUM_SHARD": "1",
    "MAX_INPUT_TOKENS": "512",
    "MAX_TOTAL_TOKENS": "1024",
    "MAX_BATCH_PREFILL_TOKENS": "1512",
    "CUDA_LAUNCH_BLOCKING": "1",  # debug for Qwen 1.5B
    "TORCH_USE_CUDA_DSA": "1",    # debug for Qwen 1.5B
}
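
For context, the full deployment call looks roughly like this, sketched with the google-cloud-aiplatform SDK (project, display name, artifact URI, and machine/accelerator types are placeholders for my actual values):

# deploy.py -- rough shape of the Vertex AI deployment.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model.upload(
    display_name="qwen-1.5b-finetuned",       # placeholder
    artifact_uri="gs://my-bucket/qwen-1.5b",  # placeholder: where the checkpoint lives
    serving_container_image_uri=(
        "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/"
        "huggingface-text-generation-inference-cu124.2-4.ubuntu2204.py311"
    ),
    serving_container_environment_variables={
        "NUM_SHARD": "1",
        "MAX_INPUT_TOKENS": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "1512",
        "CUDA_LAUNCH_BLOCKING": "1",
        "TORCH_USE_CUDA_DSA": "1",
    },
)

endpoint = model.deploy(
    machine_type="g2-standard-8",  # placeholder; assumption: 1.5B fits on one L4
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)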

I wonder if there's some version mismatch between the training and serving containers, or whether TGI 2.4.0 is simply too old/buggy, since the latest release of text-generation-inference appears to be 3.1.0. Is there a newer container I can try?
