[OpenVINO] Add support for GLM-4.1V-9B-Thinking #1387
Conversation
Thanks for the PR @openvino-dev-samples
```python
    return causal_mask


def _glm4v_update_causal_mask(
```
Why is this needed? I was thinking that we can remove `_glm4v_update_causal_mask` and `_glm4v_prepare_4d_causal_attention_mask_with_cache_position`, as we should be compatible with `create_causal_mask`: https://github.com/huggingface/transformers/blob/v4.53.0/src/transformers/models/glm4v/modular_glm4v.py#L982 (#1377 will soon be merged)
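For reference, a minimal sketch of what relying on the shared helper could look like; the keyword arguments mirror the pattern used by transformers v4.53 decoder models and are assumptions here, not code from this PR:

```python
# Sketch only: replacing the custom mask helpers with transformers'
# shared create_causal_mask. Argument names follow the pattern used by
# transformers v4.53 decoder models; check against the pinned version.
from transformers.masking_utils import create_causal_mask

causal_mask = create_causal_mask(
    config=self.config,
    input_embeds=inputs_embeds,
    attention_mask=attention_mask,
    cache_position=cache_position,
    past_key_values=past_key_values,
)
```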
Currently there is an accuracy issue when converting `torch.vmap`, which is used inside `create_causal_mask`.
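For context, transformers' masking utilities vectorize a per-position mask predicate with `torch.vmap`, roughly like the illustrative reduction below (not the library's exact code); this is the pattern the conversion currently struggles with:

```python
# Illustrative reduction of the vmap-based masking pattern; not the
# exact transformers code. mask[q, kv] is True where key position kv is
# visible to query position q, i.e. a lower-triangular causal mask.
import torch

def causal_predicate(q_idx, kv_idx):
    return kv_idx <= q_idx

q = torch.arange(5)
kv = torch.arange(5)
# vmap over key positions (inner), then over query positions (outer)
mask = torch.vmap(torch.vmap(causal_predicate, in_dims=(None, 0)), in_dims=(0, None))(q, kv)
print(mask)  # 5x5 lower-triangular boolean mask
```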
```python
def init_model_configs():
    if "open_clip" not in TasksManager._LIBRARY_TO_SUPPORTED_MODEL_TYPES:
        TasksManager._LIBRARY_TO_SUPPORTED_MODEL_TYPES["open_clip"] = {}
    TasksManager._CUSTOM_CLASSES[("pt", "glm4v", "image-text-to-text")] = (
```
Why not use `AutoModelForImageTextToText` directly here to load all image-text-to-text task models? https://github.com/huggingface/transformers/blob/5dba4bc7b2c1ef517ed44bba76bb70b59001c737/src/transformers/models/auto/modeling_auto.py#L941
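That is, something along these lines; the checkpoint id is shown purely for illustration:

```python
# Sketch: the generic auto class resolves any architecture registered
# under the image-text-to-text auto mapping, so no per-model entry in
# _CUSTOM_CLASSES would be needed. Checkpoint id is illustrative.
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("zai-org/GLM-4.1V-9B-Thinking")
```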
"phi4_multimodal": _OVPhi4MMForCausalLM, | ||
"llama4": _OVLlama4ForCausalLM, | ||
"glm4v": _OVGlm4vForCausalLM, | ||
} |
Would you mind adding a test with a tiny random model?
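For illustration, such tests are typically wired by registering the new architecture in the test matrix alongside a tiny-random checkpoint; a hedged sketch, where the variable names and the checkpoint id are assumptions rather than the repository's actual entries:

```python
# Hedged sketch of adding a tiny random model to the test suite.
# The checkpoint id below is hypothetical.
MODEL_NAMES = {
    # ... existing entries ...
    "glm4v": "hf-internal-testing/tiny-random-glm4v",  # hypothetical id
}

SUPPORTED_ARCHITECTURES = [
    # ... existing entries ...
    "glm4v",
]
```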
```python
        )

        if input_name == "seqlens":
            return torch.tensor([grid_t * grid_h * grid_w], dtype=torch.int64)
```
Question: do we need to generate `seqlens` in the input generator, or should we infer it directly in the patch instead (given `hidden_states`, for example)?
@echarlaix Sorry, I don't understand what "infer it directly in the patch instead" means. Could you share a link to some example code? Thanks!
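A minimal sketch of that alternative, deriving `seqlens` inside the patched vision forward from tensors that are already available rather than exporting it as a dummy input; the function name and tensor layout below are assumptions, not the actual patch in this PR:

```python
import torch

# Illustrative only: derive `seqlens` from grid_thw inside the patch
# instead of generating it as an export input.
def infer_seqlens(grid_thw: torch.Tensor) -> torch.Tensor:
    # grid_thw has shape [num_images, 3] = (t, h, w); each image yields
    # one sequence of t * h * w patch tokens
    return (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).to(torch.int64)

grid_thw = torch.tensor([[1, 16, 16]])
print(infer_seqlens(grid_thw))  # tensor([256])
```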
```python
                [grid_t * grid_h * grid_w, 2], max_value=grid_h, framework=framework, dtype=int_dtype
            )

        if input_name == "grid_thw":
```
Same question: shouldn't it be inferred in the patch?
It comes directly from the forward function.
```python
            dim = self.embed_dim // self.num_heads // 2
            return self.random_float_tensor([grid_h * grid_t * grid_w, dim], framework=framework, dtype=float_dtype)

        if input_name == "image_type_ids":
```
It looks like `DummyGlm4vVisionEmbedInputGenerator` could inherit from `DummyQwen2VLVisionEmbedInputGenerator`, as the two are very close (you would just need to override `generate` and add `image_type_ids`). What do you think?
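A hedged sketch of the proposed inheritance; the base-class attributes and the `random_int_tensor` helper follow optimum's dummy input generator conventions, but the exact names in the repo may differ:

```python
# Sketch only: reuse the Qwen2-VL generator and add image_type_ids.
class DummyGlm4vVisionEmbedInputGenerator(DummyQwen2VLVisionEmbedInputGenerator):
    SUPPORTED_INPUT_NAMES = DummyQwen2VLVisionEmbedInputGenerator.SUPPORTED_INPUT_NAMES + (
        "image_type_ids",
    )

    def generate(self, input_name, framework="pt", int_dtype="int64", float_dtype="fp32"):
        if input_name == "image_type_ids":
            num_patches = self.grid_t * self.grid_h * self.grid_w  # assumed attributes
            return self.random_int_tensor([num_patches], max_value=1, framework=framework, dtype=int_dtype)
        return super().generate(input_name, framework, int_dtype, float_dtype)
```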
f"Initialization model for {self.config.model_type} required at least transformers >= 4.45" | ||
) | ||
|
||
def get_rope_index( |
Would you mind adding a link to the original code, https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/models/glm4v/modular_glm4v.py#L1014, and highlighting any modifications, if any?
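For instance, a short provenance comment above the copied function; the signature shown here is illustrative, not necessarily the one in this PR:

```python
# Copied from
# https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/models/glm4v/modular_glm4v.py#L1014
# Modifications from the original, if any, are noted below.
def get_rope_index(self, input_ids, image_grid_thw=None, video_grid_thw=None, attention_mask=None):
    """See the linked transformers source for the reference implementation."""
```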
No description provided.