This is a tracker issue, since there are now too many PRs here and there, each introducing some sort of standardization to help vLLM support vision LLMs.
What we need is:
- Identical naming for special multimodal tokens, the same way we have `config.pad_token_id` (🚨 [VLMs] use only `xxx_token_id` for multimodal tokens #37573). A small config sketch is shown after this list
- Add a base model and nudge new models to follow the LLM-like format. In other words, the base model holds everything without the head, and the `ConditionalGeneration` class holds "base model + head" (🔴 [VLM] Add base model without head #37033). See the layout sketch after this list
- Helper fn to obtain multimodal embeddings from the base model, because each model has its own pre/post projection layers on top ([VLMs] add helpers to get multimodal encodings #37743). The same layout sketch below illustrates this helper
- Clean up the Qwen models, which are structured completely differently from existing VLMs ([qwen-vl] Standardize config #37268). Hopefully everything will be okay after all of the above is done; needs verification
- Support attention backends for all models by using the new attention interface and correctly propagating `kwargs` (see the attention sketch after this list). BTW, we also need to use `self.loss_fn` after this PR, to fix issues in Trainer with gradient accumulation, but that is not vLLM related and will come in a subsequent PR ([VLMs] support attention backends #37576)
- Processors need helpers to calculate `num_multimodal_tokens` given input image sizes and to return `mm_token_type_ids` ([transformers x vLLM] standardize processors #37915). A processor-helper sketch is included after the list
- All vision backbones should return `embeds` of shape `(bs, ..., dim)`. I found that Pixtral doesn't do so, and we will need to go over other models as well. To check whether this is doable in the vision encoder (BC breaking?) or whether we need to postprocess `embeds` in `_get_image_features` (a postprocessing sketch follows the list)
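
As a quick illustration of the first item, here is a minimal sketch of the access pattern once configs share a single naming scheme. The checkpoint name is just an example, and the `getattr` fallback is only there because not every model has been migrated yet.

```python
# Minimal sketch: multimodal token ids live on the config under one naming
# scheme, mirroring `config.pad_token_id`.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")  # example checkpoint

# Instead of per-model names (image_token_index, image_token_id, ...), every
# VLM config would expose the same `<modality>_token_id` attributes.
image_token_id = getattr(config, "image_token_id", None)
video_token_id = getattr(config, "video_token_id", None)
print(image_token_id, video_token_id)
```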
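
For the base-model and embeddings-helper items, a rough layout sketch of the intended structure. Class and attribute names here are illustrative placeholders, not the actual transformers implementations.

```python
# Rough sketch (hypothetical names): the base *Model owns vision tower +
# projector + headless language model, and *ForConditionalGeneration adds the head.
import torch
from torch import nn


class MyVLMModel(nn.Module):
    """Base model: everything except the LM head."""

    def __init__(self, vision_tower: nn.Module, projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower
        self.multi_modal_projector = projector
        # Assumed to be a headless, transformers-style decoder returning hidden states.
        self.language_model = language_model

    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # The helper from the third item: callers (e.g. vLLM) get projected
        # multimodal embeddings without knowing each model's pre/post layers.
        return self.multi_modal_projector(self.vision_tower(pixel_values))

    def forward(self, input_ids, pixel_values=None, **kwargs):
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        if pixel_values is not None:
            image_embeds = self.get_image_features(pixel_values)
            # ...merge image_embeds into inputs_embeds at image-token positions...
        return self.language_model(inputs_embeds=inputs_embeds, **kwargs)


class MyVLMForConditionalGeneration(nn.Module):
    """Base model + head, mirroring the *Model / *ForCausalLM split of LLMs."""

    def __init__(self, model: MyVLMModel, hidden_size: int, vocab_size: int):
        super().__init__()
        self.model = model
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids, pixel_values=None, **kwargs):
        # Assuming the base model returns the final hidden states.
        hidden_states = self.model(input_ids, pixel_values=pixel_values, **kwargs)
        return self.lm_head(hidden_states)
```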
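
For the attention-backend item, a hedged sketch of the idea behind propagating `kwargs`: the attention layer accepts arbitrary keyword arguments and passes them straight to whichever backend is selected, so backend-specific arguments are never silently dropped. The registry and function names are made up for the example, not the real transformers interface.

```python
import torch
from torch import nn


def eager_attention(query, key, value, attention_mask=None, scaling=1.0, **kwargs):
    # Backend-specific kwargs (e.g. cumulative sequence lengths for a flash
    # backend) are accepted and ignored here; a flash backend would use them.
    attn_weights = (query @ key.transpose(-2, -1)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    return attn_weights.softmax(dim=-1) @ value


ATTENTION_BACKENDS = {"eager": eager_attention}  # sdpa / flash-attn would register similarly


class SelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, backend: str = "eager"):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.backend = backend

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        bs, seq, dim = hidden_states.shape
        q, k, v = self.qkv(hidden_states).chunk(3, dim=-1)
        q, k, v = (t.view(bs, seq, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # The key point: kwargs are propagated untouched to the chosen backend.
        attn = ATTENTION_BACKENDS[self.backend](q, k, v, scaling=self.head_dim**-0.5, **kwargs)
        return self.out(attn.transpose(1, 2).reshape(bs, seq, dim))


layer = SelfAttention(dim=32, num_heads=4)
out = layer(torch.randn(2, 5, 32))  # extra backend kwargs would flow through forward(...)
```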
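
For the processor item, an illustrative sketch of what such helpers could compute; the function names, patch size, and merge factor are hypothetical and model-dependent, not the merged API.

```python
# Illustrative sketch: how many placeholder tokens an image of a given size
# expands to, plus token-type ids marking multimodal positions.
def num_multimodal_tokens(height: int, width: int, patch_size: int = 14, merge_size: int = 2) -> int:
    # For a ViT-style backbone, each image contributes (H/ps) * (W/ps) patches,
    # optionally reduced by a spatial merge factor before reaching the LM.
    patches_h, patches_w = height // patch_size, width // patch_size
    return (patches_h * patches_w) // (merge_size**2)


def mm_token_type_ids(input_ids: list[int], image_token_id: int) -> list[int]:
    # 0 for text tokens, 1 for image placeholder tokens.
    return [1 if tok == image_token_id else 0 for tok in input_ids]


# Example: a 336x336 image with 14px patches and 2x2 merging -> 144 placeholders.
print(num_multimodal_tokens(336, 336))                                    # 144
print(mm_token_type_ids([5, 7, 32000, 32000, 9], image_token_id=32000))   # [0, 0, 1, 1, 0]
```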
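
For the last item, a sketch of the postprocessing option: if an encoder returns a flat `(total_patches, dim)` tensor (as Pixtral does today), `_get_image_features` could split it back per image so callers always see a leading batch dimension. The helper name is hypothetical, and returning a list is just one way to handle variable patch counts per image.

```python
import torch


def to_batched_embeds(flat_embeds: torch.Tensor, patches_per_image: list[int]) -> list[torch.Tensor]:
    # flat_embeds: (sum(patches_per_image), dim) -> list of (num_patches_i, dim)
    return list(torch.split(flat_embeds, patches_per_image, dim=0))


embeds = torch.randn(10, 8)        # 10 patches total, dim=8
per_image = [6, 4]                 # two images with different sizes
batched = to_batched_embeds(embeds, per_image)
print([t.shape for t in batched])  # [torch.Size([6, 8]), torch.Size([4, 8])]
```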
cc @hmellor so you can keep track
cc @ArthurZucker @Cyrilvallez, I will be pinging you for reviews 😄