[Feature] Add FluxViT Model - Towards Deployment-Efficient Video Models #2902


Open
sarim-next opened this issue Mar 19, 2025 · 1 comment

What is the problem this feature will solve?

Current popular video training methods operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid. This leads to suboptimal accuracy-computation trade-offs due to inherent video redundancy. Additionally, these models lack adaptability to varying computational budgets for downstream tasks, hindering the application of competitive models in real-world scenarios with limited resources.

What is the feature?

The feature request is to add the FluxViT model, as described in the paper "Make Your Training Flexible: Towards Deployment-Efficient Video Models". FluxViT introduces a new test setting, "Token Optimization," which maximizes input information under a given computational budget by optimizing the set of input tokens: tokens are selected from more suitably sampled videos rather than from a fixed grid. It relies on a novel augmentation tool called "Flux" that makes the sampling grid flexible and leverages token selection; integrating Flux into video training frameworks boosts model robustness at minimal additional cost. The paper demonstrates that FluxViT achieves state-of-the-art results across various video understanding tasks at standard cost, and can match the performance of previous state-of-the-art models at a significantly reduced computational cost (e.g., using only 1/4 of the tokens).
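To make the "Token Optimization" idea concrete, here is a minimal sketch of budget-constrained token selection: a video is densely patchified into a spatiotemporal token grid, each token gets an importance score, and only the top-scoring tokens are kept under the budget. This is a hypothetical illustration (the function name, the random stand-in scores, and the shapes are my assumptions, not the paper's actual implementation):

```python
import numpy as np

def select_tokens(video_tokens, scores, budget):
    """Keep the `budget` highest-scoring tokens from a flattened
    spatiotemporal token grid (illustrative sketch, not FluxViT's code).

    video_tokens: (num_tokens, dim) array of patch embeddings
    scores:       (num_tokens,) importance score per token
    budget:       number of tokens to keep
    """
    keep = np.argsort(scores)[::-1][:budget]  # indices of top-`budget` scores
    keep.sort()  # restore original spatiotemporal order of the kept tokens
    return video_tokens[keep]

# Example: 8 frames of 14x14 patches -> 1568 tokens; keep 1/4 of them.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8 * 14 * 14, 64))  # stand-in patch embeddings
scores = rng.random(8 * 14 * 14)             # stand-in importance scores
selected = select_tokens(tokens, scores, budget=392)
print(selected.shape)  # (392, 64)
```

The same budget can then be met from differently sampled inputs (more frames at lower resolution, or fewer frames at higher resolution), which is the degree of freedom the flexible "Flux" sampling grid exploits.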

What alternatives have you considered?

The paper discusses alternatives like token reduction on densely sampled tokens and existing methods for flexible network training that operate at different spatial or temporal resolutions. However, it argues that these approaches are suboptimal as they either suffer from performance degradation with significant reduction rates or fail to optimize token capacity utilization under computational constraints.
