
🌟 Comprehensive Video Intelligence: An All-in-One Framework for Understanding, Editing, and Remaking
In this video, we demonstrate how to use VideoAgent to:
- Clearly articulate user requirements
- Perform intent analysis with autonomous tool use & planning
- Create multi-modal products with detailed workflows
- Generate video overviews fully automatically
🧠 - Understanding Video Content
Enable in-depth analysis, summarization, and insight extraction from video media with advanced multi-modal intelligence capabilities.
✂️ - Editing Video Clips
Provide intuitive tools for assembling, clipping, and reconfiguring content with seamless workflow integration.
🎨 - Remaking Creative Videos
Utilize generative technologies to produce new, imaginative video content through AI-powered creative assistance.
🔧 - Multi-Modal Agentic Framework
Deliver comprehensive video intelligence through an integrated framework that combines multiple AI modalities for enhanced performance.
🚀 - Seamless Natural Language Experience
Transform video interaction and creation through pure conversational AI - no complex interfaces or technical expertise required, just natural dialogue with VideoAgent.
```mermaid
graph TB
    A[🎬 VideoAgent Framework] --> B[🧠 Video Understanding & Summarization]
    A --> C[✂️ Video Editing]
    A --> D[🎨 Video Remaking]
    B --> B1[Video Q&A]
    B --> B2[Video Summarization]
    C --> C1[Movie Edits]
    C --> C2[Commentary Video]
    C --> C3[Video Overview]
    D --> D1[Meme Videos]
    D --> D2[Music Videos]
    D --> D3[Cross-Cultural Comedy]
```
| Feature | VideoAgent | Director | Funclip | NarratoAI | NotebookLM |
|---|---|---|---|---|---|
| Beat-synced Edits | ✅ | ✅ | ✅ | — | — |
| Storytelling Video | ✅ | — | — | — | — |
| Video Overview | ✅ | ✅ | ✅ | ✅ | ✅ |
| Meme Video Remaking | ✅ | — | — | — | — |
| Song Remixes | ✅ | — | — | — | — |
| Cross-lingual Adaptations | ✅ | — | — | — | — |
| Video Q&A | ✅ | ✅ | — | — | ✅ |
| Sound Effects Tools | ✅ | — | — | — | — |
| 🧠 Easy-to-Use | 🚀 Boundless Creativity | 🎨 High-Quality |
|---|---|---|
| One-Prompt Video Creation | Create From Any Ideas | Human-Quality Video Production |
| Transform your ideas into professional videos | Workflow generation for your unique ideas | Deliver videos that meet professional standards |
Our system introduces three key innovations for automated video processing. Intent Analysis captures both explicit and implicit sub-intents beyond surface-level user commands. Autonomous Tool Use & Planning employs graph-powered workflow generation with adaptive feedback loops for automated agent orchestration. Multi-Modal Understanding transforms raw input into semantically aligned visual queries for enhanced retrieval.
🔍 VideoAgent intelligently decomposes user instructions into both explicit and implicit sub-intents, capturing nuanced requirements that users may not explicitly state. This advanced parsing ensures comprehensive understanding of user goals beyond surface-level commands.
🎯 Through an intent-to-agent mapping mechanism, the system identifies precisely which capabilities within the multi-agent framework are needed. This targeted approach enables efficient activation of relevant system components while avoiding unnecessary computational overhead for optimal task execution.
⚙️ A graph-powered framework automatically translates user intents into executable workflows. The system dynamically selects appropriate agents and constructs optimal execution sequences. Nodes represent tool capabilities while edges define workflow connections for complex video tasks.
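The node-and-edge planning idea described above can be sketched as a search over a small tool graph. Everything in this snippet is illustrative: the `ToolGraph` class, the tool names, and the BFS planner are assumptions for exposition, not VideoAgent's actual API.

```python
# Illustrative sketch of a graph-powered planner. Node and tool names are
# hypothetical; VideoAgent's real tool graph and selection logic may differ.
from collections import defaultdict, deque

class ToolGraph:
    """Nodes are tool capabilities; edges are valid workflow connections."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def plan(self, start, goal):
        """Breadth-first search for a shortest executable tool sequence."""
        queue = deque([[start]])
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in self.edges[path[-1]]:
                if nxt not in path:  # avoid cycles
                    queue.append(path + [nxt])
        return None  # no workflow connects start to goal

graph = ToolGraph()
graph.add_edge("transcribe", "summarize")
graph.add_edge("transcribe", "retrieve_clips")
graph.add_edge("retrieve_clips", "assemble_edit")
graph.add_edge("summarize", "assemble_edit")
print(graph.plan("transcribe", "assemble_edit"))
```

BFS is used here only because it keeps the sketch short; a real planner would also score candidate workflows against the parsed intent.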
🔄 Adaptive feedback loops continuously refine the planning process through two-step self-evaluation. This ensures robust automated decision-making and seamless execution. The system self-corrects and optimizes performance throughout the entire task lifecycle.
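A two-step evaluate-then-revise loop of the kind described above can be sketched in a few lines. The scoring and revision functions below are toys chosen for illustration, not VideoAgent's internals.

```python
# Hypothetical sketch of a two-step self-evaluation loop:
# (1) score the candidate workflow, (2) revise it if the score is too low.
def refine(workflow, evaluate, revise, max_rounds=3, threshold=0.9):
    for _ in range(max_rounds):
        score = evaluate(workflow)          # step 1: self-evaluation
        if score >= threshold:
            break
        workflow = revise(workflow, score)  # step 2: self-correction
    return workflow

# Toy example: each revision appends one missing step until the plan is complete.
required = ["transcribe", "retrieve", "assemble"]
evaluate = lambda wf: sum(s in wf for s in required) / len(required)
revise = lambda wf, score: wf + [next(s for s in required if s not in wf)]
print(refine(["transcribe"], evaluate, revise))  # → ['transcribe', 'retrieve', 'assemble']
```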
📋 The Storyboard Agent transforms raw user input into optimized visual queries. It first analyzes pre-captioned video material banks to understand available resources. This foundational analysis ensures the system knows exactly what content is accessible for query processing.
💡 The agent then decomposes user input into fine-grained sub-queries that are both visually and semantically aligned. This sophisticated breakdown enables enhanced video retrieval by matching user intentions with the most relevant visual content in the database.
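To make the sub-query-to-clip matching concrete, here is a toy matcher over a pre-captioned clip bank. Token overlap stands in for the embedding-based similarity a real retriever would use, and every name here is invented for illustration.

```python
# Toy illustration of sub-query → clip matching against a captioned bank.
# Token overlap is a stand-in for real multimodal embedding similarity.
def match_clips(sub_queries, caption_bank):
    results = {}
    for query in sub_queries:
        q_tokens = set(query.lower().split())
        # pick the clip whose caption shares the most tokens with the query
        best = max(
            caption_bank,
            key=lambda clip: len(q_tokens & set(caption_bank[clip].lower().split())),
        )
        results[query] = best
    return results

bank = {
    "clip_01": "a chef slices vegetables in a bright kitchen",
    "clip_02": "a crowd cheers at a night concert",
}
print(match_clips(["chef slicing vegetables", "crowd cheers at a concert"], bank))
```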
We conduct extensive experiments across multiple dimensions to validate the effectiveness of VideoAgent in addressing key challenges.
To evaluate VideoAgent's boundless creativity through automatic workflow construction, we compared five broadly applicable agents across three backbone models. VideoAgent significantly outperforms the other baselines on the Audio and Video datasets, demonstrating creative workflow generation driven by graph-structured guidance and self-reflection with dedicated self-evaluation feedback. Furthermore, VideoAgent exhibits superior and more stable creative performance under the Claude 3.7 backbone than under GPT-4o and DeepSeek-V3, whereas the baseline methods fluctuate across backbones. This highlights VideoAgent's ability to unleash boundless creativity by automatically constructing diverse, effective workflows that adapt to varied user requirements, with more capable LLMs achieving deeper comprehension and providing more robust creative solutions for complex graph-based tasks.
To validate our multimodal understanding capabilities, we conducted text-to-video retrieval experiments using shuffled caption queries. The evaluation employs three metrics to assess our model's ability to retrieve corresponding visual content: Recall measures the model's ability to correctly reorder shuffled video clips by comparing retrieved clip midpoints against ground truth positions; Embedding Matching-based score assesses coarse-grained alignment between generated videos and high-level caption summaries; and Intersection over Union quantifies temporal alignment accuracy at the clip level by computing the ratio of temporal overlap to total coverage between retrieved and ground truth intervals. The experimental results demonstrate that our approach can retrieve more accurate video segments, thereby showcasing our precise multimodal understanding capabilities.
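The clip-level temporal Intersection over Union described above can be computed directly from interval endpoints. This is a generic sketch of the metric, not the paper's evaluation code.

```python
# Temporal IoU for one retrieved clip vs. one ground-truth clip:
# intersection of the two time intervals divided by their union.
def temporal_iou(pred, gt):
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0

print(temporal_iou((10.0, 20.0), (15.0, 25.0)))  # → 0.3333333333333333
```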
We investigate VideoAgent's iterative refinement capabilities by analyzing the impact of reflection rounds on performance. Through comprehensive hyperparameter experiments on workflow composition across two datasets using three LLM backbones, we demonstrate VideoAgent's notable self-improvement ability. The results reveal that while early iterations produce baseline results, our system's adaptive reflection mechanism drives significant performance gains with each subsequent round. VideoAgent achieves consistent workflow composition success rates of 0.95 across all tested configurations, showcasing its robust self-correction capabilities and reliable high-quality output regardless of the underlying LLM backbone.
- GPU Memory: 8GB
- OS: Linux, Windows
```bash
git clone https://github.com/HKUDS/VideoAgent.git
cd VideoAgent
conda create --name videoagent python=3.10
conda activate videoagent
conda install -y -c conda-forge pynini==2.1.5 ffmpeg
pip install -r requirements.txt
```
```bash
# Run each download from the repository root

# Download CosyVoice
cd tools/CosyVoice
huggingface-cli download PillowTa1k/CosyVoice --local-dir pretrained_models

# Download fish-speech
cd tools/fish-speech
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5

# Download seed-vc
cd tools/seed-vc
huggingface-cli download PillowTa1k/seed-vc --local-dir checkpoints

# Download DiffSinger
cd tools/DiffSinger
huggingface-cli download PillowTa1k/DiffSinger --local-dir checkpoints

# Download Whisper
cd tools
huggingface-cli download openai/whisper-large-v3-turbo --local-dir whisper-large-v3-turbo

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# Download ImageBind
cd tools
mkdir .checkpoints
cd .checkpoints
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
```
🌟 Multiple models are available for your convenience; you may wish to download only those relevant to your project.
| Feature Type | Video Demo | Required Models |
|---|---|---|
| Cross Talk | English Stand-up Comedy to Chinese Crosstalk | CosyVoice, Whisper, ImageBind |
| Talk Show | Chinese Crosstalk to English Stand-up Comedy | CosyVoice, Whisper, ImageBind |
| MAD TTS | Xiao-Ming-Jian-Mo (小明剑魔) Meme | fish-speech |
| MAD SVC | AI Music Videos | DiffSinger, seed-vc, Whisper, ImageBind |
| Rhythm | Spider-Man: Across the Spider-Verse | Whisper, ImageBind |
| Comm | Commentary Video | CosyVoice, Whisper, ImageBind |
| News | Tech News: OpenAI's GPT-4o Image Generation Release | CosyVoice, Whisper, ImageBind |
| Video QA/Summarization | Dune 2 Movie Cast Update Podcast | Whisper |
```yaml
# VideoAgent\environment\config\config.yml
# Applicable scenarios and LLM configuration
# Claude is required as it powers the Agentic Graph Router
llm:
  # Video Remixing/TTS/SVC/Stand-up/CrossTalk
  deepseek_api_key: ""
  deepseek_base_url: ""
  # Agentic Graph Router/TTS/SVC/Stand-up/CrossTalk
  claude_api_key: ""
  claude_base_url: ""
  # Video Editing/Overview/Summarization/QA/Commentary Video
  gpt_api_key: ""
  gpt_base_url: ""
  # MLLM for caption and fine-grained video understanding
  gemini_api_key: ""
  gemini_base_url: ""
```
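Before launching, a quick sanity check can confirm that the keys you need are non-empty. This snippet is purely illustrative: it hard-codes a dict mirroring the `llm:` section with made-up values rather than parsing the actual file.

```python
# Illustrative pre-flight check that required API keys are filled in.
# The dict mirrors the llm: section of config.yml; values here are examples.
cfg = {
    "llm": {
        "claude_api_key": "sk-...",            # filled in
        "claude_base_url": "https://example",  # filled in
        "gpt_api_key": "",                     # still blank
        "gpt_base_url": "",                    # still blank
    }
}
missing = [key for key, value in cfg["llm"].items() if not value]
print("missing:", missing)  # → missing: ['gpt_api_key', 'gpt_base_url']
```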
```bash
# With the configuration now complete, proceed to run the following instructions:
python main.py

# The console will output:
# User Requirement: ...

# Requirement Examples:
# 1. I need to create a reworded version of an existing video where the speech content is modified while maintaining the original speaker's voice. The video should have the same visuals as the original, but with updated dialogue that follows my specific requirements.
# 2. I have a standup comedy script that I'd like to turn into a professional-looking video. I need the script to be performed with good comedic timing and audience reactions, then matched with relevant video footage to create a complete standup comedy special. I already have a reference script and some footage I want to use for the video.
```
The current LLM selections are optimized for each function. You can also adjust the model names in `VideoAgent\environment\config\llm.py` if needed.
For additional demo usage details, please refer to:
👉 Demos Documentation
You can find more fun videos on our Bilibili channel here:
👉 Bilibili Homepage
Feel free to check it out for more entertaining content! 😊
Note: All videos are used for research and demonstration purposes only. The audio and visual assets are sourced from the Internet. Please contact us if you believe any content infringes upon your intellectual property rights.
We express our deepest gratitude to the numerous individuals and organizations that have made VideoAgent possible. This framework stands on the shoulders of giants, benefiting from the collective wisdom of the open-source community and the groundbreaking work of researchers worldwide.
Our work has been significantly enriched by the creative contributions of content creators across various platforms. We acknowledge:
- 🎬 Content Creators: The talented creators behind the original video content used for testing and demonstration
- 🎭 Comedy Artists: Those whose work inspired our cross-cultural adaptations
- 🎥 Filmmakers: The production teams behind the movies and TV shows featured in our demos