[Feature] Add FluxViT Model - Towards Deployment-Efficient Video Models #2902


Open
sarim-next opened this issue Mar 19, 2025 · 1 comment

What is the problem this feature will solve?

Current popular video training methods operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid. This leads to suboptimal accuracy-computation trade-offs due to inherent video redundancy. Additionally, these models lack adaptability to varying computational budgets for downstream tasks, hindering the application of competitive models in real-world scenarios with limited resources.

What is the feature?

The feature request is to add the FluxViT model, as described in the paper "Make Your Training Flexible: Towards Deployment-Efficient Video Models". FluxViT introduces a new test setting, "Token Optimization," which maximizes input information under a given computational budget by optimizing the set of input tokens: tokens are selected from more suitably sampled videos rather than from a fixed grid. It relies on a novel augmentation tool called "Flux" that makes the sampling grid flexible and leverages token selection; integrating Flux into video training frameworks boosts model robustness at minimal additional cost. The paper demonstrates that FluxViT achieves state-of-the-art results across various video understanding tasks at standard cost, and can match the performance of previous state-of-the-art models at a significantly reduced computational cost (e.g., using only 1/4 of the tokens).
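To make the "Token Optimization" idea concrete, here is a minimal sketch of budget-constrained token selection: a video is densely patchified into a spatiotemporal token grid, each token gets an importance score, and only the top-scoring tokens are kept under the budget. This is a hypothetical illustration (the function name, the random stand-in scores, and the shapes are my assumptions, not the paper's actual implementation):

```python
import numpy as np

def select_tokens(video_tokens, scores, budget):
    """Keep the `budget` highest-scoring tokens from a flattened
    spatiotemporal token grid (illustrative sketch, not FluxViT's code).

    video_tokens: (num_tokens, dim) array of patch embeddings
    scores:       (num_tokens,) importance score per token
    budget:       number of tokens to keep
    """
    keep = np.argsort(scores)[::-1][:budget]  # indices of top-`budget` scores
    keep.sort()  # restore original spatiotemporal order of the kept tokens
    return video_tokens[keep]

# Example: 8 frames of 14x14 patches -> 1568 tokens; keep 1/4 of them.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8 * 14 * 14, 64))  # stand-in patch embeddings
scores = rng.random(8 * 14 * 14)             # stand-in importance scores
selected = select_tokens(tokens, scores, budget=392)
print(selected.shape)  # (392, 64)
```

The same budget can then be met from differently sampled inputs (more frames at lower resolution, or fewer frames at higher resolution), which is the degree of freedom the flexible "Flux" sampling grid exploits.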

What alternatives have you considered?

The paper discusses alternatives like token reduction on densely sampled tokens and existing methods for flexible network training that operate at different spatial or temporal resolutions. However, it argues that these approaches are suboptimal as they either suffer from performance degradation with significant reduction rates or fail to optimize token capacity utilization under computational constraints.
