This is a tracker issue, since there are now too many PRs here and there, all making some sort of standardization to help vLLM support vision LLMs.

What we need is:

- Add a base model and nudge new models to follow the LLM-like format. In other words, the base model holds everything without the head, and `ConditionalGeneration` holds "base model + head" (🔴 [VLM] Add base model without head #37033)
- Clean up the Qwen models, which are completely different in structure from existing VLMs ([qwen-vl] Standardize config #37268); hopefully after all that is done they will be okay. Needs verification
- Support attention backends for all models by using the new attention interface and correctly propagating `kwargs`. BTW, after this PR we also need to use `self.loss_fn` to fix issues in Trainer with grad accum, but that is not vLLM related and comes in a subsequent PR ([VLMs] support attention backends #37576)
- Add processor helpers to calculate `num_multimodal_tokens` given input image sizes and to return `mm_token_type_ids` (there is a draft locally)
- Make all vision backbones return embeds of shape `(bs, ..., dim)`. I found Pixtral doesn't do so, and will need to go over other models as well. To check whether this is doable in the vision encoder (BC breaking?) or whether we need to postprocess the embeds in `_get_image_features`
- Use only `xxx_token_id` for multimodal tokens instead of `config.pad_token_id` (🚨 [VLMs] use only `xxx_token_id` for multimodal tokens #37573)
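The intended split between a headless base model and a `ConditionalGeneration` wrapper, together with the `(bs, ..., dim)` contract for vision embeds, could be sketched roughly like this. This is a minimal sketch with made-up class names and tiny stand-in layers, not the actual transformers implementation:

```python
import torch
import torch.nn as nn


class VlmLikeModel(nn.Module):
    """Base model: vision tower + projector + language model, no head."""

    def __init__(self, vision_dim=32, hidden_dim=64, vocab_size=100):
        super().__init__()
        self.vision_tower = nn.Linear(vision_dim, hidden_dim)        # stand-in backbone
        self.multi_modal_projector = nn.Linear(hidden_dim, hidden_dim)
        self.language_model = nn.Embedding(vocab_size, hidden_dim)   # stand-in decoder

    def _get_image_features(self, pixel_values):
        # Contract: always return embeds of shape (bs, ..., dim)
        embeds = self.vision_tower(pixel_values)
        return self.multi_modal_projector(embeds)

    def forward(self, input_ids, pixel_values=None, **kwargs):
        # kwargs flow through so an attention interface could pick them up
        hidden = self.language_model(input_ids)
        if pixel_values is not None:
            image_embeds = self._get_image_features(pixel_values)
            hidden = torch.cat([image_embeds, hidden], dim=1)
        return hidden


class VlmLikeForConditionalGeneration(nn.Module):
    """Head model: base model + lm_head."""

    def __init__(self, vision_dim=32, hidden_dim=64, vocab_size=100):
        super().__init__()
        self.model = VlmLikeModel(vision_dim, hidden_dim, vocab_size)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, input_ids, pixel_values=None, **kwargs):
        hidden = self.model(input_ids, pixel_values=pixel_values, **kwargs)
        return self.lm_head(hidden)


model = VlmLikeForConditionalGeneration()
ids = torch.randint(0, 100, (2, 5))
pix = torch.randn(2, 3, 32)  # (bs, num_patches, vision_dim)
logits = model(ids, pixel_values=pix)
assert logits.shape == (2, 8, 100)  # (bs, 3 image tokens + 5 text tokens, vocab)
```

The point of the split is that checkpoints and features (attention backends, loss functions) live on the headless base, while the generation class only adds the head on top.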
cc @hmellor so you can keep track
cc @ArthurZucker @Cyrilvallez, I will be pinging you for reviews 😄
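For reference, the processor helpers mentioned above could look something like this. The helper names come from the issue, but the signatures and the patch-based counting are purely my guess, not the actual draft:

```python
def num_multimodal_tokens(image_sizes, patch_size=14):
    """Hypothetical: number of placeholder tokens per image, given (height, width) sizes."""
    return [(h // patch_size) * (w // patch_size) for h, w in image_sizes]


def mm_token_type_ids(input_ids, image_token_id):
    """Hypothetical: 1 where a position holds an image token, 0 for text."""
    return [[1 if tok == image_token_id else 0 for tok in seq] for seq in input_ids]


print(num_multimodal_tokens([(336, 336)]))               # [576] with 14px patches
print(mm_token_type_ids([[5, 32000, 32000, 7]], 32000))  # [[0, 1, 1, 0]]
```

Whatever the real signatures end up being, the useful property is that inference engines can compute placeholder counts from image sizes alone, without running the vision tower.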