
🌟 Comprehensive Video Intelligence: An All-in-One Framework for Understanding, Editing, and Remaking
In this video, we demonstrate how to use VideoAgent to:
- Clearly articulate user requirements
- Perform intent analysis with autonomous tool use & planning
- Create multi-modal products with detailed workflows
- Generate video overviews fully automatically
🧠 - Understanding Video Content
Enable in-depth analysis, summarization, and insight extraction from video media with advanced multi-modal intelligence capabilities.
✂️ - Editing Video Clips
Provide intuitive tools for assembling, clipping, and reconfiguring content with seamless workflow integration.
🎨 - Remaking Creative Videos
Utilize generative technologies to produce new, imaginative video content through AI-powered creative assistance.
🔧 - Multi-Modal Agentic Framework
Deliver comprehensive video intelligence through an integrated framework that combines multiple AI modalities for enhanced performance.
🚀 - Seamless Natural Language Experience
Transform video interaction and creation through pure conversational AI - no complex interfaces or technical expertise required, just natural dialogue with VideoAgent.
```mermaid
graph TB
    A[🎬 VideoAgent Framework] --> B[🧠 Video Understanding & Summarization]
    A --> C[✂️ Video Editing]
    A --> D[🎨 Video Remaking]
    B --> B1[Video Q&A]
    B --> B2[Video Summarization]
    C --> C1[Movie Edits]
    C --> C2[Commentary Video]
    C --> C3[Video Overview]
    D --> D1[Meme Videos]
    D --> D2[Music Videos]
    D --> D3[Cross-Cultural Comedy]
```
| Feature | VideoAgent | Director | Funclip | NarratoAI | NotebookLM |
|---|---|---|---|---|---|
| Beat-synced Edits | ✅ | ✅ | ✅ | — | — |
| Storytelling Video | ✅ | — | — | — | — |
| Video Overview | ✅ | ✅ | ✅ | ✅ | ✅ |
| Meme Video Remaking | ✅ | — | — | — | — |
| Song Remixes | ✅ | — | — | — | — |
| Cross-lingual Adaptations | ✅ | — | — | — | — |
| Video Q&A | ✅ | ✅ | — | — | ✅ |
| Sound Effects Tools | ✅ | — | — | — | — |
| 🧠 Easy-to-Use | 🚀 Boundless Creativity | 🎨 High-Quality |
|---|---|---|
| One-Prompt Video Creation | Create From Any Ideas | Human-Quality Video Production |
| Transform your ideas into professional videos | Workflow generation for your unique ideas | Deliver videos that meet professional standards |
Our system introduces three key innovations for automated video processing. Intent Analysis captures both explicit and implicit sub-intents beyond surface-level user commands. Autonomous Tool Use & Planning employs graph-powered workflow generation with adaptive feedback loops for automated agent orchestration. Multi-Modal Understanding transforms raw input into semantically aligned visual queries for enhanced retrieval.
🔍 VideoAgent intelligently decomposes user instructions into both explicit and implicit sub-intents, capturing nuanced requirements that users may not explicitly state. This advanced parsing ensures comprehensive understanding of user goals beyond surface-level commands.
🎯 Through an intent-to-agent mapping mechanism, the system identifies precisely which capabilities within the multi-agent framework are needed. This targeted approach enables efficient activation of relevant system components while avoiding unnecessary computational overhead for optimal task execution.
⚙️ A graph-powered framework automatically translates user intents into executable workflows. The system dynamically selects appropriate agents and constructs optimal execution sequences. Nodes represent tool capabilities while edges define workflow connections for complex video tasks.
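The node-and-edge planning idea described above can be sketched as a search over a small tool graph. Everything in this snippet is illustrative: the `ToolGraph` class, the tool names, and the BFS planner are assumptions for exposition, not VideoAgent's actual API.

```python
# Illustrative sketch of a graph-powered planner. Node and tool names are
# hypothetical; VideoAgent's real tool graph and selection logic may differ.
from collections import defaultdict, deque

class ToolGraph:
    """Nodes are tool capabilities; edges are valid workflow connections."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def plan(self, start, goal):
        """Breadth-first search for a shortest executable tool sequence."""
        queue = deque([[start]])
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in self.edges[path[-1]]:
                if nxt not in path:  # avoid cycles
                    queue.append(path + [nxt])
        return None  # no workflow connects start to goal

graph = ToolGraph()
graph.add_edge("transcribe", "summarize")
graph.add_edge("transcribe", "retrieve_clips")
graph.add_edge("retrieve_clips", "assemble_edit")
graph.add_edge("summarize", "assemble_edit")
print(graph.plan("transcribe", "assemble_edit"))
```

BFS is used here only because it keeps the sketch short; a real planner would also score candidate workflows against the parsed intent.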
🔄 Adaptive feedback loops continuously refine the planning process through two-step self-evaluation. This ensures robust automated decision-making and seamless execution. The system self-corrects and optimizes performance throughout the entire task lifecycle.
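A two-step evaluate-then-revise loop of the kind described above can be sketched in a few lines. The scoring and revision functions below are toys chosen for illustration, not VideoAgent's internals.

```python
# Hypothetical sketch of a two-step self-evaluation loop:
# (1) score the candidate workflow, (2) revise it if the score is too low.
def refine(workflow, evaluate, revise, max_rounds=3, threshold=0.9):
    for _ in range(max_rounds):
        score = evaluate(workflow)          # step 1: self-evaluation
        if score >= threshold:
            break
        workflow = revise(workflow, score)  # step 2: self-correction
    return workflow

# Toy example: each revision appends one missing step until the plan is complete.
required = ["transcribe", "retrieve", "assemble"]
evaluate = lambda wf: sum(s in wf for s in required) / len(required)
revise = lambda wf, score: wf + [next(s for s in required if s not in wf)]
print(refine(["transcribe"], evaluate, revise))  # → ['transcribe', 'retrieve', 'assemble']
```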
📋 The Storyboard Agent transforms raw user input into optimized visual queries. It first analyzes pre-captioned video material banks to understand available resources. This foundational analysis ensures the system knows exactly what content is accessible for query processing.
💡 The agent then decomposes user input into fine-grained sub-queries that are both visually and semantically aligned. This sophisticated breakdown enables enhanced video retrieval by matching user intentions with the most relevant visual content in the database.
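To make the sub-query-to-clip matching concrete, here is a toy matcher over a pre-captioned clip bank. Token overlap stands in for the embedding-based similarity a real retriever would use, and every name here is invented for illustration.

```python
# Toy illustration of sub-query → clip matching against a captioned bank.
# Token overlap is a stand-in for real multimodal embedding similarity.
def match_clips(sub_queries, caption_bank):
    results = {}
    for query in sub_queries:
        q_tokens = set(query.lower().split())
        # pick the clip whose caption shares the most tokens with the query
        best = max(
            caption_bank,
            key=lambda clip: len(q_tokens & set(caption_bank[clip].lower().split())),
        )
        results[query] = best
    return results

bank = {
    "clip_01": "a chef slices vegetables in a bright kitchen",
    "clip_02": "a crowd cheers at a night concert",
}
print(match_clips(["chef slicing vegetables", "crowd cheers at a concert"], bank))
```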
We conduct extensive experiments across multiple dimensions to validate the effectiveness of VideoAgent in addressing key challenges.
To evaluate VideoAgent's boundless creativity through automatic workflow construction, we compared five broadly applicable agents across three backbone models. VideoAgent significantly outperforms the other baselines on the Audio and Video datasets, demonstrating creative workflow generation driven by graph-structured guidance and self-reflection with dedicated self-evaluation feedback. Furthermore, VideoAgent exhibits superior and more stable creative performance under the Claude 3.7 backbone than under GPT-4o and DeepSeek-V3, whereas the baseline methods fluctuate across backbones. This highlights VideoAgent's ability to unleash boundless creativity by automatically constructing diverse, effective workflows that adapt to varied user requirements, with more capable LLMs achieving deeper comprehension and providing more robust creative solutions for complex graph-based tasks.
To validate our multimodal understanding capabilities, we conducted text-to-video retrieval experiments using shuffled caption queries. The evaluation employs three metrics to assess our model's ability to retrieve corresponding visual content: Recall measures the model's ability to correctly reorder shuffled video clips by comparing retrieved clip midpoints against ground truth positions; Embedding Matching-based score assesses coarse-grained alignment between generated videos and high-level caption summaries; and Intersection over Union quantifies temporal alignment accuracy at the clip level by computing the ratio of temporal overlap to total coverage between retrieved and ground truth intervals. The experimental results demonstrate that our approach can retrieve more accurate video segments, thereby showcasing our precise multimodal understanding capabilities.
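The clip-level temporal Intersection over Union described above can be computed directly from interval endpoints. This is a generic sketch of the metric, not the paper's evaluation code.

```python
# Temporal IoU for one retrieved clip vs. one ground-truth clip:
# intersection of the two time intervals divided by their union.
def temporal_iou(pred, gt):
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0

print(temporal_iou((10.0, 20.0), (15.0, 25.0)))  # → 0.3333333333333333
```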
We investigate VideoAgent's iterative refinement capabilities by analyzing the impact of reflection rounds on performance. Through comprehensive hyperparameter experiments on workflow composition across two datasets using three LLM backbones, we demonstrate VideoAgent's notable self-improvement ability. The results reveal that while early iterations produce baseline results, our system's adaptive reflection mechanism drives significant performance gains with each subsequent round. VideoAgent achieves consistent workflow composition success rates of 0.95 across all tested configurations, showcasing its robust self-correction capabilities and reliable high-quality output regardless of the underlying LLM backbone.
- GPU Memory: 8GB
- OS: Linux, Windows
```bash
git clone https://github.com/HKUDS/VideoAgent.git
cd VideoAgent
conda create --name videoagent python=3.10
conda activate videoagent
conda install -y -c conda-forge pynini==2.1.5 ffmpeg
pip install -r requirements.txt
```
```bash
# Run each download from the repository root

# Download CosyVoice
cd tools/CosyVoice
huggingface-cli download PillowTa1k/CosyVoice --local-dir pretrained_models

# Download fish-speech
cd tools/fish-speech
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5

# Download seed-vc
cd tools/seed-vc
huggingface-cli download PillowTa1k/seed-vc --local-dir checkpoints

# Download DiffSinger
cd tools/DiffSinger
huggingface-cli download PillowTa1k/DiffSinger --local-dir checkpoints

# Download Whisper
cd tools
huggingface-cli download openai/whisper-large-v3-turbo --local-dir whisper-large-v3-turbo

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# Download ImageBind
cd tools
mkdir .checkpoints
cd .checkpoints
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
```
🌟 Multiple models are available for your convenience; you may wish to download only those relevant to your project.
| Feature Type | Video Demo | Required Models |
|---|---|---|
| Cross Talk | English Stand-up Comedy to Chinese Crosstalk | CosyVoice, Whisper, ImageBind |
| Talk Show | Chinese Crosstalk to English Stand-up Comedy | CosyVoice, Whisper, ImageBind |
| MAD TTS | Xiao-Ming-Jian-Mo (小明剑魔) Meme | fish-speech |
| MAD SVC | AI Music Videos | DiffSinger, seed-vc, Whisper, ImageBind |
| Rhythm | Spider-Man: Across the Spider-Verse | Whisper, ImageBind |
| Comm | Commentary Video | CosyVoice, Whisper, ImageBind |
| News | Tech News: OpenAI's GPT-4o Image Generation Release | CosyVoice, Whisper, ImageBind |
| Video QA/Summarization | Dune 2 Movie Cast Update Podcast | Whisper |
```yaml
# VideoAgent\environment\config\config.yml
# Applicable scenarios and LLM configuration
# Claude is required as it powers the Agentic Graph Router
llm:
  # Video Remixing/TTS/SVC/Stand-up/CrossTalk
  deepseek_api_key: ""
  deepseek_base_url: ""
  # Agentic Graph Router/TTS/SVC/Stand-up/CrossTalk
  claude_api_key: ""
  claude_base_url: ""
  # Video Editing/Overview/Summarization/QA/Commentary Video
  gpt_api_key: ""
  gpt_base_url: ""
  # MLLM for caption and fine-grained video understanding
  gemini_api_key: ""
  gemini_base_url: ""
```
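Before launching, a quick sanity check can confirm that the keys you need are non-empty. This snippet is purely illustrative: it hard-codes a dict mirroring the `llm:` section with made-up values rather than parsing the actual file.

```python
# Illustrative pre-flight check that required API keys are filled in.
# The dict mirrors the llm: section of config.yml; values here are examples.
cfg = {
    "llm": {
        "claude_api_key": "sk-...",            # filled in
        "claude_base_url": "https://example",  # filled in
        "gpt_api_key": "",                     # still blank
        "gpt_base_url": "",                    # still blank
    }
}
missing = [key for key, value in cfg["llm"].items() if not value]
print("missing:", missing)  # → missing: ['gpt_api_key', 'gpt_base_url']
```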
```bash
# With the configuration now complete, proceed to run the following instructions:
python main.py

# The console will output:
# User Requirement: ...

# Requirement Examples:
# 1. I need to create a reworded version of an existing video where the speech content is modified while maintaining the original speaker's voice. The video should have the same visuals as the original, but with updated dialogue that follows my specific requirements.
# 2. I have a standup comedy script that I'd like to turn into a professional-looking video. I need the script to be performed with good comedic timing and audience reactions, then matched with relevant video footage to create a complete standup comedy special. I already have a reference script and some footage I want to use for the video.
```
The current LLM selections are optimized for each function. You can also adjust the model names in `VideoAgent\environment\config\llm.py` if needed.
For additional demo usage details, please refer to:
👉 Demos Documentation
You can find more fun videos on our Bilibili channel here:
👉 Bilibili Homepage
Feel free to check it out for more entertaining content! 😊
Note: All videos are used for research and demonstration purposes only. The audio and visual assets are sourced from the Internet. Please contact us if you believe any content infringes upon your intellectual property rights.
We express our deepest gratitude to the numerous individuals and organizations that have made VideoAgent possible. This framework stands on the shoulders of giants, benefiting from the collective wisdom of the open-source community and the groundbreaking work of researchers worldwide.
Our work has been significantly enriched by the creative contributions of content creators across various platforms. We acknowledge:
- 🎬 Content Creators: The talented creators behind the original video content used for testing and demonstration
- 🎭 Comedy Artists: Those whose work inspired our cross-cultural adaptations
- 🎥 Filmmakers: The production teams behind the movies and TV shows featured in our demos