[New Model]: Support ColQwen2VL #19381

@mgoin

The model to consider.

ColQwen2VL is an efficient document retrieval vision language model based on Qwen2VL, as described in the paper "ColPali: Efficient Document Retrieval with Vision Language Models". The model is designed to generate embeddings rather than text outputs, making it suitable for document retrieval applications.
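Because the model returns embeddings rather than generated text, the caller has to score query/document pairs itself. ColPali-style retrievers typically emit one vector per token or image patch and compare them with late-interaction (MaxSim) scoring. Below is a minimal pure-Python sketch of that scoring, assuming the multi-vector embeddings are already available. The function name `maxsim_score` and the toy vectors are illustrative only, not part of vLLM or Transformers.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query: List[Vector], document: List[Vector]) -> float:
    """Late-interaction score: for each query vector, take its best-matching
    document vector (max dot product), then sum over the query vectors."""
    if not query or not document:
        raise ValueError("query and document must each have at least one vector")
    return sum(max(dot(q, d) for d in document) for q in query)


# Toy 2-d embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.0, 0.5]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best = max(scores, key=scores.get)  # page with the highest score
```

Ranking documents then amounts to sorting pages by this score. A real serving path would do the same computation with batched tensor operations.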

Support was added to HF Transformers in huggingface/transformers#35778.

An initial attempt to support the model was posted in #14291, but it was made before the HF definition was finalized, so it is now out of date.

The closest model vLLM already supports.

Qwen2VL is the base model, so supporting ColQwen2VL is mostly a matter of wrapping that backbone.
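As a rough picture of what wrapping the backbone could look like: ColPali-style models take the backbone's per-token hidden states, apply a small linear projection, and L2-normalize each resulting vector. The sketch below uses plain Python lists and a stub backbone. `StubBackbone`, `EmbeddingWrapper`, and the weights are hypothetical stand-ins, not vLLM or Transformers classes.

```python
import math
from typing import List

Matrix = List[List[float]]


class StubBackbone:
    """Hypothetical stand-in for the Qwen2VL backbone: returns one
    hidden-state vector per input token."""

    def hidden_states(self, token_ids: List[int]) -> Matrix:
        # Fake 3-d hidden states derived from the token id.
        return [[float(t), float(t) + 1.0, 1.0] for t in token_ids]


class EmbeddingWrapper:
    """Hypothetical wrapper: backbone -> linear projection -> L2 normalize."""

    def __init__(self, backbone: StubBackbone, proj_weight: Matrix) -> None:
        self.backbone = backbone
        self.proj_weight = proj_weight  # shape: (out_dim, hidden_dim)

    def embed(self, token_ids: List[int]) -> Matrix:
        out = []
        for h in self.backbone.hidden_states(token_ids):
            # Linear projection: one output value per weight row.
            v = [sum(w * x for w, x in zip(row, h)) for row in self.proj_weight]
            # L2-normalize; fall back to 1.0 to avoid dividing by zero.
            norm = math.sqrt(sum(x * x for x in v)) or 1.0
            out.append([x / norm for x in v])
        return out


wrapper = EmbeddingWrapper(
    StubBackbone(),
    proj_weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
embeddings = wrapper.embed([3, 0])  # one unit-length vector per token
```

The point of the sketch is that the output is a list of vectors, one per token, rather than a single pooled vector or generated text.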

What's your difficulty of supporting the model you want?

See the previous attempt in #14291.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: multi-modality (Related to multi-modality (#4194)), new-model (Requests to new models)

Status: Abandoned