fix: improve video processing fps assignment logic #42009

Xqle · 2025-11-04T13:12:44Z

What does this PR do?

1. Improve `fps` handling in `Qwen2_5_VLProcessor.call()`

Previously, the __call__() function always defaulted to fps=2, regardless of the input video.
Now, when fps is not explicitly provided and cannot be read from the video metadata, it falls back to 24 by default. Otherwise, the provided or metadata fps is used. It follows a priority order:
```
kwargs["fps"] → video_metadata["fps"] → 24 (default)
```
The default value 24 is chosen as it represents a common frame rate for most videos.
This prevents potential errors in second_per_grid_ts calculation when the actual video frame rate differs from the assumed default.

2. More robust `do_sample_frames` behavior

If fps is manually provided but do_sample_frames is not specified, do_sample_frames will now be automatically set to True.
This avoids cases where custom fps is ignored due to missing sampling flag.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@zucchini-nlp

Rocketknight1 · 2025-11-05T12:15:45Z

cc @molbap @yonigozlan for processors

zucchini-nlp

Hey @Xqle , thanks for the PR though I am not convinced we have to automatically change the flag depending on input args. Please see my comments below

zucchini-nlp · 2025-11-05T12:50:39Z

src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py

+                output_kwargs["videos_kwargs"]["do_sample_frames"] = True
+                logger.info(
+                    "User specified 'fps' without 'do_sample_frames'; "
+                    "'do_sample_frames' has been automatically enabled."
+                )


i don't think this is a good idea. We can't automatically change the flag only for a single model. If we are to change to all models, it will be much harder because the sampling can be set either with num_frames or fps and optionally other args. Checking for what required args are present per model is not as straightforward always

For this part, we can just revert the change.

zucchini-nlp · 2025-11-05T12:53:54Z

src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py

+            fps = output_kwargs["videos_kwargs"].get(
+                "fps", [metadata.fps if metadata.fps is not None else 24 for metadata in video_metadata]
+            )


i don't think this is the fps if the original video here we are using. It's more probably the sampling fps rate used which is in videos_kwargs

The sampling fps however isn't always there, we can sample with number of frames, with default values of video processor or do not sample at all. So the 2 here is just the default value of the video processor

@zucchini-nlp

Qwen2VL Video Processor Default Parameters

Thanks for the feedback.

In Qwen2.5-VL, the video processor we are using is the Qwen2-VL video processor, which does not initialize fps or num_frames by default, see above. So the model will not sample the input video by default.

And I think the fps in this context is not only used for frame sampling, but is later involved in computing seconds_per_grid_ts, which directly affects the temporal RoPE index inside get_rope_index. So, the fps used here should correspond to the effective fps of the processed video (i.e., the fps after potential frame sampling).

Specifically:

When fps and do_sample_frames are passed in video_kwargs, the processor will sample the video according to that fps, so the processed video’s fps should match the sampling fps in video_kwargs.

When eitherfps ordo_sample_frames is not provided, no sampling occurs, and the processed video keeps its original fps, which is metadata.fps. In that case, using metadata.fps ensures that seconds_per_grid_ts and thus the temporal RoPE positions remain consistent with the actual video duration.

Finally, if no sampling occurs and metadata.fps is unavailable, it’s reasonable to fall back to a typical video frame rate such as 24 fps, to maintain temporal consistency in downstream computations.

So, I think we can revert the part of automatic change of flag do_sample_frames, and change this part to something like:

fps = [metadata.fps if metadata.fps is not None else 24 for metadata in video_metadata] if output_kwargs["videos_kwargs"].get("do_sample_frames") is not None: if output_kwargs["videos_kwargs"].get("fps") is not None: fps = output_kwargs["videos_kwargs"]["fps"] elif output_kwargs["videos_kwargs"].get("num_frames") is not None: fps = [output_kwargs["videos_kwargs"]["num_frames"] / metadata.duration for metadata in video_metadata]

Thanks for detailed explanation. I looked through the qwen paper and the m-rope documentation and it looks like the FPS is indeed the video's fps after the optional sampling. Though it will be breaking wrt the current main branch because the default fps when not provided is 24 now. I want to also ask the authors @BakerBunker to take a look and confirm this is how QwenVL should work

For impl, imo it's better if we add a new field in metadata for sampled_fps and use it later. The field might be useful not only in QwenVL but other models also

@zucchini-nlp Sounds good. I can try to work on it later.

Currently, fps refers to the frame rate expected by the language model, not the actual frame rate of the video itself.

Currently, fps refers to the frame rate expected by the language model, not the actual frame rate of the video itself.

Thanks for the reply. I am wondering if that is the case, will the rope index be inconsistent with the input video since videos with different fps, assuming no sampling is applied, will be using the same temporl rope index.

The input video should be resampled by video processor, so we assume the videos have same fps

Xqle · 2025-11-05T15:14:46Z

@zucchini-nlp

Thanks for the review. But I think the fps should align with the fps of video after video processing, no matter the video is sampled during video processing or not.
Please see my reply for your comments above.

Avoid error when 'num_frames' is passed rather than 'fps'.

Xqle · 2025-11-07T03:32:25Z

@zucchini-nlp I've added sampled_fps to the metadata as discussed. If this implementation looks good to you, I can also update the other models that use FPS in a similar way. From what I’ve checked, only qwen3_omni_moe and qwen2_5_omni would need similar changes.

Also, I noticed a potential inconsistency regarding the previously discussed do_sample_frames flag.

When calling processor.apply_chat_template(..., tokenize=True, fps=1), do_sample_frames is automatically set to True (see code in processor_utils.py).
However, if we call processor.apply_chat_template(..., tokenize=False) and then use processor(..., fps=1) manually for processing, do_sample_frames remains unset. So the video will not be sampled and there’s no warning or error about it.

I understand your previous concern that adding this logic inside the processor might require updating every model’s processor implementation, which could be non-trivial. Maybe we could move the flag assignment from processor_utils.apply_chat_template to video_processing_utils.preprocess?

If that sounds reasonable to you, I can open a separate PR to address this specifically.

zucchini-nlp · 2025-11-07T10:30:46Z

Also, I noticed a potential inconsistency regarding the previously discussed do_sample_frames flag.

ah yeah, we had to do the automatic switch trick for BC. Prev we didn't have a dedicated video processor and thus no flag to control the sampling behavior. Sampling was done only when applying chat template by helper functions. So we had to keep it and always same frames when the processor is called via apply_chat_template

Though when simply calling, I still think it should not be automatically changing. For all new models it is set to True anyway so it will be sampling by default. For old model we have the flag set to False (again for BC) and will require users to toggle it if they want sampling

zucchini-nlp · 2025-11-07T10:31:41Z

src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py

            video_grid_thw = videos_inputs["video_grid_thw"]

+            # Get video metadata
+            if "return_metadata" not in kwargs:


it could be in kwargs but set to False, let's check it's value if the key exists

Thanks for pointing that out. I will udpateqwen2_5_vl in this PR.
I also checked other models and found that a few of them have the same issue.
To keep the changes clean and consistent, I think it might be better to handle other models in a separate PR.

zucchini-nlp · 2025-11-07T10:38:53Z

src/transformers/video_utils.py

+    # Case 1: functools.partial
+    if isinstance(sample_indices_fn, partial):
+        num_frames = sample_indices_fn.keywords.get("num_frames")
+        fps = sample_indices_fn.keywords.get("fps")
+
+    # Case 2: normal function with closure
+    elif isinstance(sample_indices_fn, FunctionType):
+        if sample_indices_fn.__closure__:
+            closure_vars = {
+                var: cell.cell_contents
+                for var, cell in zip(sample_indices_fn.__code__.co_freevars, sample_indices_fn.__closure__)
+            }
+            num_frames = closure_vars.get("num_frames")
+            fps = closure_vars.get("fps")
+


personally I don't like examining the function to get the sampling FPS. Looks a bit overengineered, so maybe we could obtain it by returning directly from sample_indices_fn? Or have it as a property of VideoMetadata and infer from total frames and length of sampled frames?

Yep, I think it can be inferred directly from indices, and I’ve updated the code accordingly. The logic looks much cleaner now, thanks for the suggestion.

zucchini-nlp · 2025-11-10T08:16:37Z

src/transformers/video_utils.py

-    indices = sample_indices_fn(metadata=metadata, **kwargs)

+    indices = sample_indices_fn(metadata=metadata, **kwargs)
+    sampled_fps = len(indices) / total_num_frames * video_fps if video_fps else 24


we infer it in the same way everywhere so imo we can move it as a property of class VideoMetadata, similar to how timestamps work

zucchini-nlp

Thanks! The CI is failing for unrelated reasons, should be fixed soon. Then we'll merge

github-actions · 2025-11-10T12:44:08Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_5_vl

HuggingFaceDocBuilderDev · 2025-11-10T15:07:53Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Xqle added 4 commits November 4, 2025 12:55

fix: improve video processing fps and do_sample_frames assignment logic

c1a2ab3

fix: set return_metadata=True to get metadata

d2fbd5a

reformat the modular file

9b1fb16

fix typo

93c6cb0

zucchini-nlp reviewed Nov 5, 2025

View reviewed changes

revert flag change and fix fps assignment

d84fa90

Xqle added 4 commits November 5, 2025 15:26

Taking 'num_frames' into considered.

67e33dc

Avoid error when 'num_frames' is passed rather than 'fps'.

fix

1087852

fix: avoid potential reference before assignment error

3bbf6db

fix

f6ca84c

Xqle changed the title ~~fix: improve video processing fps and do_sample_frames assignment logic~~ fix: improve video processing fps assignment logic Nov 6, 2025

Xqle added 5 commits November 6, 2025 08:54

add 'sample_fps' to 'VideoMetadata'

ea944c4

fix missing comma

af2ae81

fix trailing whitespace

d1465dd

Handle different 'sample_indices_fn'

fb42c12

Cleaning white space

c410427

Xqle force-pushed the fix-qwen25vl-video-fps branch from 1690273 to c410427 Compare November 6, 2025 11:05

import callable from collections.abc

d2090ff

zucchini-nlp reviewed Nov 7, 2025

View reviewed changes

Xqle added 4 commits November 8, 2025 04:01

calculate sampled_fps using indices

80137c5

correct the order

0a41c09

fix

939fbb1

properly check value in kwargs

7b1f3b9

Xqle mentioned this pull request Nov 8, 2025

Fix return metadata checking logic #42108

Merged

5 tasks

zucchini-nlp reviewed Nov 10, 2025

View reviewed changes

handle sampled_fps as property

4908f15

Xqle added 4 commits November 10, 2025 08:33

remove duplicated definition

0aa1f33

fix

a1142f7

fix

dad654e

add safety check

e40de1b

Xqle force-pushed the fix-qwen25vl-video-fps branch from 3aa3deb to e40de1b Compare November 10, 2025 09:38

zucchini-nlp approved these changes Nov 10, 2025

View reviewed changes

Merge branch 'main' into fix-qwen25vl-video-fps

09ea379

zucchini-nlp merged commit 3c0b2b1 into huggingface:main Nov 11, 2025
23 checks passed

Xqle deleted the fix-qwen25vl-video-fps branch November 11, 2025 10:01

fix: improve video processing fps assignment logic #42009

fix: improve video processing fps assignment logic #42009

Uh oh!

Conversation

Xqle commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

1. Improve fps handling in Qwen2_5_VLProcessor.__call__()

2. More robust do_sample_frames behavior

Before submitting

Who can review?

Uh oh!

Rocketknight1 commented Nov 5, 2025

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Xqle Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Xqle commented Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Xqle commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zucchini-nlp commented Nov 7, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Xqle Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 10, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Nov 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Xqle commented Nov 4, 2025 •

edited

Loading

1. Improve `fps` handling in `Qwen2_5_VLProcessor.call()`

2. More robust `do_sample_frames` behavior

Xqle Nov 5, 2025 •

edited

Loading

Xqle commented Nov 5, 2025 •

edited

Loading

Xqle commented Nov 7, 2025 •

edited

Loading

Xqle Nov 8, 2025 •

edited

Loading