-
Notifications
You must be signed in to change notification settings - Fork 31.1k
fix: improve video processing fps assignment logic #42009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @molbap @yonigozlan for processors |
zucchini-nlp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Xqle , thanks for the PR though I am not convinced we have to automatically change the flag depending on input args. Please see my comments below
| output_kwargs["videos_kwargs"]["do_sample_frames"] = True | ||
| logger.info( | ||
| "User specified 'fps' without 'do_sample_frames'; " | ||
| "'do_sample_frames' has been automatically enabled." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think this is a good idea. We can't automatically change the flag only for a single model. If we are to change to all models, it will be much harder because the sampling can be set either with num_frames or fps and optionally other args. Checking for what required args are present per model is not as straightforward always
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this part, we can just revert the change.
| fps = output_kwargs["videos_kwargs"].get( | ||
| "fps", [metadata.fps if metadata.fps is not None else 24 for metadata in video_metadata] | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think this is the fps if the original video here we are using. It's more probably the sampling fps rate used which is in videos_kwargs
The sampling fps however isn't always there, we can sample with number of frames, with default values of video processor or do not sample at all. So the 2 here is just the default value of the video processor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback.
In Qwen2.5-VL, the video processor we are using is the Qwen2-VL video processor, which does not initialize fps or num_frames by default, see above. So the model will not sample the input video by default.
And I think the fps in this context is not only used for frame sampling, but is later involved in computing seconds_per_grid_ts, which directly affects the temporal RoPE index inside get_rope_index. So, the fps used here should correspond to the effective fps of the processed video (i.e., the fps after potential frame sampling).
Specifically:
-
When
fpsanddo_sample_framesare passed invideo_kwargs, the processor will sample the video according to that fps, so the processed video’s fps should match the samplingfpsinvideo_kwargs. -
When either
fpsordo_sample_framesis not provided, no sampling occurs, and the processed video keeps its original fps, which ismetadata.fps. In that case, usingmetadata.fpsensures thatseconds_per_grid_tsand thus the temporal RoPE positions remain consistent with the actual video duration. -
Finally, if no sampling occurs and
metadata.fpsis unavailable, it’s reasonable to fall back to a typical video frame rate such as 24 fps, to maintain temporal consistency in downstream computations.
So, I think we can revert the part of automatic change of flag do_sample_frames, and change this part to something like:
fps = [metadata.fps if metadata.fps is not None else 24 for metadata in video_metadata]
if output_kwargs["videos_kwargs"].get("do_sample_frames") is not None:
if output_kwargs["videos_kwargs"].get("fps") is not None:
fps = output_kwargs["videos_kwargs"]["fps"]
elif output_kwargs["videos_kwargs"].get("num_frames") is not None:
fps = [output_kwargs["videos_kwargs"]["num_frames"] / metadata.duration for metadata in video_metadata]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for detailed explanation. I looked through the qwen paper and the m-rope documentation and it looks like the FPS is indeed the video's fps after the optional sampling. Though it will be breaking wrt the current main branch because the default fps when not provided is 24 now. I want to also ask the authors @BakerBunker to take a look and confirm this is how QwenVL should work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For impl, imo it's better if we add a new field in metadata for sampled_fps and use it later. The field might be useful not only in QwenVL but other models also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zucchini-nlp Sounds good. I can try to work on it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, fps refers to the frame rate expected by the language model, not the actual frame rate of the video itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently,
fpsrefers to the frame rate expected by the language model, not the actual frame rate of the video itself.
Thanks for the reply. I am wondering if that is the case, will the rope index be inconsistent with the input video since videos with different fps, assuming no sampling is applied, will be using the same temporl rope index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input video should be resampled by video processor, so we assume the videos have same fps
|
Thanks for the review. But I think the |
Avoid error when 'num_frames' is passed rather than 'fps'.
1690273 to
c410427
Compare
|
@zucchini-nlp I've added Also, I noticed a potential inconsistency regarding the previously discussed
I understand your previous concern that adding this logic inside the processor might require updating every model’s processor implementation, which could be non-trivial. Maybe we could move the flag assignment from If that sounds reasonable to you, I can open a separate PR to address this specifically. |
ah yeah, we had to do the automatic switch trick for BC. Prev we didn't have a dedicated video processor and thus no flag to control the sampling behavior. Sampling was done only when applying chat template by helper functions. So we had to keep it and always same frames when the processor is called via Though when simply calling, I still think it should not be automatically changing. For all new models it is set to |
| video_grid_thw = videos_inputs["video_grid_thw"] | ||
|
|
||
| # Get video metadata | ||
| if "return_metadata" not in kwargs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could be in kwargs but set to False, let's check it's value if the key exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out. I will udpateqwen2_5_vl in this PR.
I also checked other models and found that a few of them have the same issue.
To keep the changes clean and consistent, I think it might be better to handle other models in a separate PR.
src/transformers/video_utils.py
Outdated
| # Case 1: functools.partial | ||
| if isinstance(sample_indices_fn, partial): | ||
| num_frames = sample_indices_fn.keywords.get("num_frames") | ||
| fps = sample_indices_fn.keywords.get("fps") | ||
|
|
||
| # Case 2: normal function with closure | ||
| elif isinstance(sample_indices_fn, FunctionType): | ||
| if sample_indices_fn.__closure__: | ||
| closure_vars = { | ||
| var: cell.cell_contents | ||
| for var, cell in zip(sample_indices_fn.__code__.co_freevars, sample_indices_fn.__closure__) | ||
| } | ||
| num_frames = closure_vars.get("num_frames") | ||
| fps = closure_vars.get("fps") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
personally I don't like examining the function to get the sampling FPS. Looks a bit overengineered, so maybe we could obtain it by returning directly from sample_indices_fn? Or have it as a property of VideoMetadata and infer from total frames and length of sampled frames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think it can be inferred directly from indices, and I’ve updated the code accordingly. The logic looks much cleaner now, thanks for the suggestion.
src/transformers/video_utils.py
Outdated
| indices = sample_indices_fn(metadata=metadata, **kwargs) | ||
|
|
||
| indices = sample_indices_fn(metadata=metadata, **kwargs) | ||
| sampled_fps = len(indices) / total_num_frames * video_fps if video_fps else 24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we infer it in the same way everywhere so imo we can move it as a property of class VideoMetadata, similar to how timestamps work
3aa3deb to
e40de1b
Compare
zucchini-nlp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! The CI is failing for unrelated reasons, should be fixed soon. Then we'll merge
|
[For maintainers] Suggested jobs to run (before merge) run-slow: qwen2_5_vl |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |

What does this PR do?
1. Improve
fpshandling inQwen2_5_VLProcessor.__call__()Previously, the
__call__()function always defaulted tofps=2, regardless of the input video.Now, when fps is not explicitly provided and cannot be read from the video metadata, it falls back to 24 by default. Otherwise, the provided or metadata fps is used. It follows a priority order:
The default value 24 is chosen as it represents a common frame rate for most videos.
This prevents potential errors in
second_per_grid_tscalculation when the actual video frame rate differs from the assumed default.2. More robust
do_sample_framesbehaviorfpsis manually provided butdo_sample_framesis not specified,do_sample_frameswill now be automatically set to True.Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@zucchini-nlp