Support multimodal models in vLLM with transformers backend #37780

Open
2 of 7 tasks
zucchini-nlp opened this issue Apr 25, 2025 · 0 comments
Labels: External (Using the library with external tools: onnx, tflite, ...), VLM

@zucchini-nlp (Member)

This is a tracker issue: there are now many PRs scattered here and there, each standardizing something to help vLLM support vision LLMs through the transformers backend.

What we need is:

  • Identical naming for special multimodal tokens, the same way we have config.pad_token_id (🚨 [VLMs] use only xxx_token_id for multimodal tokens #37573)
  • Add a base model and nudge new models to follow the LLM-like format: the base model holds everything except the head, and the ConditionalGeneration class holds "base model + head" (🔴 [VLM] Add base model without head #37033); see the model-side sketch after this list
  • A helper fn to obtain multimodal embeddings from the base model, because each model has its own pre/post projection layers on top ([VLMs] add helpers to get multimodal encodings #37743)
  • Clean up the Qwen models, which are structured completely differently from existing VLMs ([qwen-vl] Standardize config #37268); hopefully, after everything above is done, they will be fine. Needs verification
  • Support attention backends for all models by using the new attention interface and correctly propagating kwargs ([VLMs] support attention backends #37576). We also need to use self.loss_fn after this PR, to fix issues in Trainer with gradient accumulation, but that is not vLLM-related and will come in a subsequent PR
  • Processors need helpers to calculate num_multimodal_tokens given input image sizes and to return mm_token_type_ids (there is a local draft); see the processor-side sketch after this list
  • All vision backbones should return embeds of shape (bs, ..., dim). I found that Pixtral doesn't, and we will need to go over the other models as well. We need to check whether this is doable in the vision encoder (BC-breaking?) or whether we have to post-process the embeds in _get_image_features
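
For illustration, a minimal toy sketch (hypothetical ToyVLM* classes, not the actual transformers implementation) of the model-side layout the items above aim for: a config exposing image_token_id next to pad_token_id, a base model that holds everything except the head, a ForConditionalGeneration class that holds "base model + head", a get_image_features helper that external backends can call, and image embeddings always shaped (batch_size, num_image_tokens, hidden_size):

```python
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class ToyVLMConfig:
    vocab_size: int = 32000
    hidden_size: int = 64
    vision_hidden_size: int = 48
    pad_token_id: int = 0
    image_token_id: int = 31999  # unified xxx_token_id naming, next to pad_token_id


class ToyVLMModel(nn.Module):
    """Base model: vision tower + projector + text backbone, no LM head."""

    def __init__(self, config: ToyVLMConfig):
        super().__init__()
        self.config = config
        self.vision_tower = nn.Linear(config.vision_hidden_size, config.vision_hidden_size)
        self.multi_modal_projector = nn.Linear(config.vision_hidden_size, config.hidden_size)
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(config.hidden_size, nhead=4, batch_first=True),
            num_layers=1,
        )

    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Helper a serving backend can call directly: returns projected image
        # embeddings, always shaped (batch_size, num_image_tokens, hidden_size).
        return self.multi_modal_projector(self.vision_tower(pixel_values))

    def forward(self, input_ids: torch.Tensor, pixel_values: torch.Tensor) -> torch.Tensor:
        inputs_embeds = self.embed_tokens(input_ids)
        image_embeds = self.get_image_features(pixel_values)
        # Scatter image embeddings into the placeholder positions marked by image_token_id
        # (assumes the prompt contains exactly as many placeholders as image embeddings).
        image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1)
        inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds.to(inputs_embeds.dtype))
        return self.language_model(inputs_embeds)


class ToyVLMForConditionalGeneration(nn.Module):
    """Head model: "base model + head", mirroring the LLM-style split."""

    def __init__(self, config: ToyVLMConfig):
        super().__init__()
        self.model = ToyVLMModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor, pixel_values: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.model(input_ids, pixel_values))


# Usage: a prompt with 4 image placeholders and 4 "patch" embeddings.
config = ToyVLMConfig()
model = ToyVLMForConditionalGeneration(config)
input_ids = torch.tensor([[1, 31999, 31999, 31999, 31999, 5]])
pixel_values = torch.randn(1, 4, config.vision_hidden_size)
logits = model(input_ids, pixel_values)  # (1, 6, vocab_size)
```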
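
For the processor-side item, a hypothetical sketch of the two helpers: computing num_multimodal_tokens from the input image sizes, and returning mm_token_type_ids that mark placeholder positions. The patch-grid math and the class/method names are illustrative assumptions, not the planned API:

```python
from typing import List, Tuple

import torch


class ToyVLMProcessorHelpers:
    def __init__(self, patch_size: int = 14, image_token_id: int = 31999):
        self.patch_size = patch_size
        self.image_token_id = image_token_id

    def get_num_multimodal_tokens(self, image_sizes: List[Tuple[int, int]]) -> List[int]:
        """Number of placeholder tokens each image expands to (simple ViT-style patch grid)."""
        return [
            (height // self.patch_size) * (width // self.patch_size)
            for height, width in image_sizes
        ]

    def get_mm_token_type_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        """1 where the token is a multimodal placeholder, 0 for regular text tokens."""
        return (input_ids == self.image_token_id).long()


# Usage
helpers = ToyVLMProcessorHelpers(patch_size=14, image_token_id=31999)
print(helpers.get_num_multimodal_tokens([(224, 224), (336, 448)]))  # [256, 768]
input_ids = torch.tensor([[1, 31999, 31999, 5, 6]])
print(helpers.get_mm_token_type_ids(input_ids))  # tensor([[0, 1, 1, 0, 0]])
```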

cc @hmellor so you can keep track
cc @ArthurZucker @Cyrilvallez, I will be pinging you for reviews 😄

@zucchini-nlp self-assigned this Apr 25, 2025
@zucchini-nlp added the External and VLM labels Apr 25, 2025