What is the problem this feature will solve?
Popular video training methods operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, which leads to suboptimal accuracy-computation trade-offs because of the inherent redundancy in video. These models also cannot adapt to varying computational budgets in downstream tasks, which hinders the use of competitive models in resource-constrained real-world scenarios.
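For intuition on why the token count, rather than resolution alone, drives the cost: self-attention scales quadratically in the number of tokens, so a budget of 1/4 of the tokens costs well under 1/4 of the compute. Below is a back-of-the-envelope sketch with assumed ViT-Base-like constants; `vit_flops` is a hypothetical helper for illustration, not code or numbers from the paper.

```python
# Rough per-layer ViT FLOPs model (assumed ViT-Base-like constants,
# illustrative only -- not figures from the FluxViT paper).
def vit_flops(n_tokens: int, dim: int = 768, layers: int = 12) -> int:
    proj = 4 * n_tokens * dim**2      # QKV + output projections
    attn = 2 * n_tokens**2 * dim      # QK^T and attention-weighted V matmuls
    mlp = 8 * n_tokens * dim**2       # two matmuls with a 4x expansion
    return layers * (proj + attn + mlp)

dense = vit_flops(8 * 14 * 14)           # e.g., 8 frames on a 14x14 patch grid
reduced = vit_flops(8 * 14 * 14 // 4)    # same model at a 1/4 token budget
print(f"1/4 tokens -> {reduced / dense:.0%} of the dense compute")  # ~20%
```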
What is the feature?
The feature request is to add the FluxViT model, as described in the paper "Make Your Training Flexible: Towards Deployment-Efficient Video Models". FluxViT introduces a new test setting, "Token Optimization," which maximizes input information across different computational budgets by optimizing the set of input tokens via token selection from more suitably sampled videos. It relies on a novel augmentation tool called "Flux," which makes the sampling grid flexible and leverages token selection. Integrating Flux into video training frameworks boosts model robustness at minimal additional cost. The paper demonstrates that FluxViT achieves state-of-the-art results across various video understanding tasks at standard cost, and can match the performance of previous state-of-the-art models at significantly reduced computational cost (e.g., using only 1/4 of the tokens).
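To make "token selection from more suitably sampled videos" concrete, here is a minimal sketch under stated assumptions: the video is sampled on a denser grid than the token budget would normally allow, tokens are scored, and only the top-scoring ones are kept. `flux_token_selection` is a hypothetical name, and the magnitude-based score is a stand-in for FluxViT's learned selector, not the official implementation.

```python
import torch

def flux_token_selection(frames: torch.Tensor, token_budget: int,
                         patch: int = 16) -> torch.Tensor:
    """Toy Flux-style selection sketch (hypothetical, not the official API).

    frames: (T, C, H, W) video sampled on a denser grid than the budget
            would normally permit. Real video ViTs also patchify time
            (tubelets); this sketch keeps per-frame patches for brevity.
    """
    T, C, H, W = frames.shape
    # Patchify: (T, C, H, W) -> (T * H/p * W/p, C * p * p)
    tokens = (frames
              .unfold(2, patch, patch)            # split H into patches
              .unfold(3, patch, patch)            # split W into patches
              .permute(0, 2, 3, 1, 4, 5)
              .reshape(-1, C * patch * patch))
    # Stand-in saliency score: feature magnitude. FluxViT learns its
    # selector; the point here is only "fill a fixed budget with the
    # most informative tokens from a richer sample".
    scores = tokens.norm(dim=-1)
    keep = scores.topk(min(token_budget, scores.numel())).indices
    return tokens[keep]

video = torch.randn(16, 3, 224, 224)              # denser sampling: 16 frames
selected = flux_token_selection(video, token_budget=1024)
print(selected.shape)                             # torch.Size([1024, 768])
```

Under Token Optimization, the same 1024-token budget is filled from a 16-frame sample rather than a fixed 8-frame grid, so the kept tokens carry more of the clip's information.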
What alternatives have you considered?
The paper discusses alternatives such as token reduction on densely sampled tokens and existing flexible-training methods that operate at different spatial or temporal resolutions. However, it argues that these approaches are suboptimal: they either suffer performance degradation at high reduction rates or fail to fully utilize the available token capacity under a given computational constraint.