Description
I'm trying to deploy Qwen 1.5B on Vertex AI Endpoints. The deployment crashes for Qwen 1.5B, while Qwen 7B deploys perfectly fine, using the same Hugging Face TRL configuration (other than the base model) to train both. Note that training and local inference work fine for both 1.5B and 7B (a local-inference sketch follows the requirements list below). The container I'm using is us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-4.ubuntu2204.py311. My requirements.txt for the training / local-inference setup is as follows:
accelerate==1.4.0
deepspeed==0.16.3
importlib-metadata==8.6.1
transformers==4.49.0
trl @ git+https://github.com/huggingface/trl@v0.15.1
protobuf==5.29.3
sentencepiece==0.2.0
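
For reference, local inference on the fine-tuned checkpoints is nothing exotic; a standard transformers load-and-generate along these lines works (the checkpoint path here is a placeholder for my actual output directory, not a real path):

```python
# Minimal local-inference sanity check (sketch; "./qwen-1.5b-finetuned"
# is a placeholder for the fine-tuned checkpoint directory).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./qwen-1.5b-finetuned"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate, pinned above
)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```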
Crash logs from the container referenced above are attached:
aiplatform_endpoints_crash.log
Container environment variables:
serving_container_environment_variables={
"NUM_SHARD": "1",
"MAX_INPUT_TOKENS": "512",
"MAX_TOTAL_TOKENS": "1024",
"MAX_BATCH_PREFILL_TOKENS": "1512",
"CUDA_LAUNCH_BLOCKING": "1", # Debug for Qwen 1.5B
"TORCH_USE_CUDA_DSA": "1", # Debug for Qwen 1.5B
}
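
For completeness, the upload/deploy call looks roughly like this (a sketch using the google-cloud-aiplatform SDK; the project, region, artifact URI, and machine/accelerator types are placeholders, not my exact values):

```python
# Sketch of the Vertex AI upload/deploy flow (google-cloud-aiplatform SDK).
# Project, location, artifact URI, and machine/accelerator types are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model.upload(
    display_name="qwen-1_5b-tgi",
    serving_container_image_uri=(
        "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/"
        "huggingface-text-generation-inference-cu124.2-4.ubuntu2204.py311"
    ),
    artifact_uri="gs://my-bucket/qwen-1.5b-finetuned",  # placeholder
    serving_container_environment_variables={
        "NUM_SHARD": "1",
        "MAX_INPUT_TOKENS": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "1512",
        "CUDA_LAUNCH_BLOCKING": "1",  # debug for Qwen 1.5B
        "TORCH_USE_CUDA_DSA": "1",    # debug for Qwen 1.5B
    },
)

endpoint = model.deploy(
    machine_type="g2-standard-8",   # placeholder
    accelerator_type="NVIDIA_L4",   # placeholder
    accelerator_count=1,
)
```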
I wonder whether there's some version mismatch between the training and serving containers, or whether TGI 2.4.0 is simply too old/buggy, since the latest release of text-generation-inference appears to be 3.1.0. Is there a newer container I can try?