This is a tracker issue, since there are now too many PRs here and there, each introducing some sort of standardization to help vLLM support vision LLMs.
What we need is:
- Identical naming for special multimodal tokens, the same way we have `config.pad_token_id` (🚨 [VLMs] use only `xxx_token_id` for multimodal tokens #37573). A small config sketch is shown after this list
- Add a base model and nudge new models to follow the LLM-like format. In other words, the base model holds everything without the head, and the `ConditionalGeneration` class holds "base model + head" (🔴 [VLM] Add base model without head #37033). See the layout sketch after this list
- Helper fn to obtain multimodal embeddings from the base model, because each model has its own pre/post projection layers on top ([VLMs] add helpers to get multimodal encodings #37743). The same layout sketch below illustrates this helper
- Clean up the Qwen models, which are structured completely differently from existing VLMs ([qwen-vl] Standardize config #37268). Hopefully everything will be okay after all of the above is done; needs verification
- Support attention backends for all models by using the new attention interface and correctly propagating `kwargs` (see the attention sketch after this list). BTW, we also need to use `self.loss_fn` after this PR, to fix issues in Trainer with gradient accumulation, but that is not vLLM related and will come in a subsequent PR ([VLMs] support attention backends #37576)
- Processors need helpers to calculate `num_multimodal_tokens` given input image sizes and to return `mm_token_type_ids` ([transformers x vLLM] standardize processors #37915). A processor-helper sketch is included after the list
- All vision backbones should return `embeds` of shape `(bs, ..., dim)`. I found that Pixtral doesn't do so, and we will need to go over other models as well. To check whether this is doable in the vision encoder (BC breaking?) or whether we need to postprocess `embeds` in `_get_image_features` (a postprocessing sketch follows the list)
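
As a quick illustration of the first item, here is a minimal sketch of the access pattern once configs share a single naming scheme. The checkpoint name is just an example, and the `getattr` fallback is only there because not every model has been migrated yet.

```python
# Minimal sketch: multimodal token ids live on the config under one naming
# scheme, mirroring `config.pad_token_id`.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")  # example checkpoint

# Instead of per-model names (image_token_index, image_token_id, ...), every
# VLM config would expose the same `<modality>_token_id` attributes.
image_token_id = getattr(config, "image_token_id", None)
video_token_id = getattr(config, "video_token_id", None)
print(image_token_id, video_token_id)
```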
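
For the base-model and embeddings-helper items, a rough layout sketch of the intended structure. Class and attribute names here are illustrative placeholders, not the actual transformers implementations.

```python
# Rough sketch (hypothetical names): the base *Model owns vision tower +
# projector + headless language model, and *ForConditionalGeneration adds the head.
import torch
from torch import nn


class MyVLMModel(nn.Module):
    """Base model: everything except the LM head."""

    def __init__(self, vision_tower: nn.Module, projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower
        self.multi_modal_projector = projector
        # Assumed to be a headless, transformers-style decoder returning hidden states.
        self.language_model = language_model

    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # The helper from the third item: callers (e.g. vLLM) get projected
        # multimodal embeddings without knowing each model's pre/post layers.
        return self.multi_modal_projector(self.vision_tower(pixel_values))

    def forward(self, input_ids, pixel_values=None, **kwargs):
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        if pixel_values is not None:
            image_embeds = self.get_image_features(pixel_values)
            # ...merge image_embeds into inputs_embeds at image-token positions...
        return self.language_model(inputs_embeds=inputs_embeds, **kwargs)


class MyVLMForConditionalGeneration(nn.Module):
    """Base model + head, mirroring the *Model / *ForCausalLM split of LLMs."""

    def __init__(self, model: MyVLMModel, hidden_size: int, vocab_size: int):
        super().__init__()
        self.model = model
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids, pixel_values=None, **kwargs):
        # Assuming the base model returns the final hidden states.
        hidden_states = self.model(input_ids, pixel_values=pixel_values, **kwargs)
        return self.lm_head(hidden_states)
```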
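
For the attention-backend item, a hedged sketch of the idea behind propagating `kwargs`: the attention layer accepts arbitrary keyword arguments and passes them straight to whichever backend is selected, so backend-specific arguments are never silently dropped. The registry and function names are made up for the example, not the real transformers interface.

```python
import torch
from torch import nn


def eager_attention(query, key, value, attention_mask=None, scaling=1.0, **kwargs):
    # Backend-specific kwargs (e.g. cumulative sequence lengths for a flash
    # backend) are accepted and ignored here; a flash backend would use them.
    attn_weights = (query @ key.transpose(-2, -1)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    return attn_weights.softmax(dim=-1) @ value


ATTENTION_BACKENDS = {"eager": eager_attention}  # sdpa / flash-attn would register similarly


class SelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, backend: str = "eager"):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.backend = backend

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        bs, seq, dim = hidden_states.shape
        q, k, v = self.qkv(hidden_states).chunk(3, dim=-1)
        q, k, v = (t.view(bs, seq, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # The key point: kwargs are propagated untouched to the chosen backend.
        attn = ATTENTION_BACKENDS[self.backend](q, k, v, scaling=self.head_dim**-0.5, **kwargs)
        return self.out(attn.transpose(1, 2).reshape(bs, seq, dim))


layer = SelfAttention(dim=32, num_heads=4)
out = layer(torch.randn(2, 5, 32))  # extra backend kwargs would flow through forward(...)
```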
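
For the processor item, an illustrative sketch of what such helpers could compute; the function names, patch size, and merge factor are hypothetical and model-dependent, not the merged API.

```python
# Illustrative sketch: how many placeholder tokens an image of a given size
# expands to, plus token-type ids marking multimodal positions.
def num_multimodal_tokens(height: int, width: int, patch_size: int = 14, merge_size: int = 2) -> int:
    # For a ViT-style backbone, each image contributes (H/ps) * (W/ps) patches,
    # optionally reduced by a spatial merge factor before reaching the LM.
    patches_h, patches_w = height // patch_size, width // patch_size
    return (patches_h * patches_w) // (merge_size**2)


def mm_token_type_ids(input_ids: list[int], image_token_id: int) -> list[int]:
    # 0 for text tokens, 1 for image placeholder tokens.
    return [1 if tok == image_token_id else 0 for tok in input_ids]


# Example: a 336x336 image with 14px patches and 2x2 merging -> 144 placeholders.
print(num_multimodal_tokens(336, 336))                                    # 144
print(mm_token_type_ids([5, 7, 32000, 32000, 9], image_token_id=32000))   # [0, 0, 1, 1, 0]
```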
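
For the last item, a sketch of the postprocessing option: if an encoder returns a flat `(total_patches, dim)` tensor (as Pixtral does today), `_get_image_features` could split it back per image so callers always see a leading batch dimension. The helper name is hypothetical, and returning a list is just one way to handle variable patch counts per image.

```python
import torch


def to_batched_embeds(flat_embeds: torch.Tensor, patches_per_image: list[int]) -> list[torch.Tensor]:
    # flat_embeds: (sum(patches_per_image), dim) -> list of (num_patches_i, dim)
    return list(torch.split(flat_embeds, patches_per_image, dim=0))


embeds = torch.randn(10, 8)        # 10 patches total, dim=8
per_image = [6, 4]                 # two images with different sizes
batched = to_batched_embeds(embeds, per_image)
print([t.shape for t in batched])  # [torch.Size([6, 8]), torch.Size([4, 8])]
```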
cc @hmellor so you can keep track
cc @ArthurZucker @Cyrilvallez, I will be pinging you for reviews 😄