
Commit 8223242

add wan2.1 notebook
1 parent 35ceb0a commit 8223242

File tree: 4 files changed, +1405 −0 lines changed
notebooks/wan2.1-text-to-video/README.md
Lines changed: 35 additions & 0 deletions
# Text to Video generation with Wan2.1 and OpenVINO

Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation.

Built upon the mainstream diffusion transformer paradigm, Wan2.1 achieves significant advancements in generative capabilities through a series of innovations, including a novel spatio-temporal variational autoencoder (VAE), scalable pre-training strategies, large-scale data construction, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility.

You can find more details about the model in the [model card](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) and the [original repository](https://github.com/Wan-Video/Wan2.1).
In this tutorial, we consider how to convert, optimize, and run the Wan2.1 model using OpenVINO.
Additionally, to speed up inference, we will apply the [CausVid](https://causvid.github.io/) distillation approach using LoRA, as sketched below.
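As a rough sketch of the starting point (not the notebook's exact code), the original PyTorch pipeline can be loaded through diffusers and a CausVid LoRA applied on top of it; the LoRA checkpoint path below is a placeholder, not a real file:

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

# The Wan VAE is typically kept in float32 for numerical stability.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.float32)

# Apply a CausVid distillation LoRA. The path below is a placeholder;
# substitute an actual CausVid LoRA checkpoint for Wan2.1-T2V-1.3B.
pipe.load_lora_weights("path/to/causvid_lora.safetensors", adapter_name="causvid")
pipe.fuse_lora()
```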
![CausVid method overview](https://causvid.github.io/images/methods.jpg)

Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies: generating a single frame requires the model to process the entire sequence, including future frames. CausVid addresses this limitation by adapting a pretrained bidirectional diffusion transformer into an autoregressive transformer that generates frames on the fly. To further reduce latency, the authors extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator.
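For reference, the core DMD objective (as described in the original DMD work, sketched here rather than taken from this notebook) matches the student's output distribution to the teacher's; its gradient is typically written as

$$\nabla_\theta \mathcal{L}_\mathrm{DMD} \approx \mathbb{E}_{z,t}\left[\left(s_\mathrm{fake}(x_t, t) - s_\mathrm{real}(x_t, t)\right)\frac{\partial G_\theta(z)}{\partial \theta}\right],$$

where $G_\theta$ is the few-step student, $x_t$ is a noised version of the student sample $G_\theta(z)$, $s_\mathrm{real}$ is the teacher's score function, and $s_\mathrm{fake}$ is the score of an auxiliary model tracking the student's output distribution.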
The method distills a many-step, bidirectional video diffusion model into a 4-step, causal generator. The training process consists of two stages (see the sketch after this list):

1. Student Initialization: the causal student is initialized by pretraining it on a small set of ODE solution pairs generated by the bidirectional teacher. This step helps stabilize the subsequent distillation training.
2. Asymmetric Distillation: using the bidirectional teacher model, the causal student generator is trained through a distribution matching distillation loss.
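A minimal, schematic sketch of the two stages follows; every name in it (`teacher`, `student`, `ode_pairs`, `dmd_loss`, and so on) is a hypothetical placeholder rather than an API from this notebook or the CausVid codebase:

```python
# Stage 1: student initialization on teacher ODE solution pairs.
for noise, teacher_video in ode_pairs:  # pairs precomputed with the bidirectional teacher
    pred = student(noise)  # causal (autoregressive) student
    loss = regression_loss(pred, teacher_video)  # match the teacher's ODE solution
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 2: asymmetric distillation with a distribution matching loss.
for noise in noise_loader:
    fake_video = student(noise)  # few-step causal generation
    loss = dmd_loss(fake_video, teacher)  # bidirectional teacher guides the student
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```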
More details about CausVid can be found in the [paper](https://arxiv.org/abs/2412.07772), [original repository](https://github.com/tianweiy/CausVi), and [project page](https://causvid.github.io/).
## Notebook contents

This tutorial consists of the following steps:

- Prerequisites
- Convert and Optimize model (a conversion sketch follows this list)
- Run inference pipeline
- Interactive inference
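The conversion step roughly follows the standard OpenVINO flow; a simplified sketch is below, where `pipe.transformer` and `example_input` stand in for the actual submodule and dummy inputs prepared in the notebook:

```python
import openvino as ov
import nncf

# Convert a PyTorch submodule (e.g. the diffusion transformer) to OpenVINO IR.
# `example_input` must be dummy tensors matching the module's forward signature.
ov_transformer = ov.convert_model(pipe.transformer, example_input=example_input)

# Optional: compress weights (INT8 by default) to reduce size and speed up inference.
ov_transformer = nncf.compress_weights(ov_transformer)
ov.save_model(ov_transformer, "wan_transformer.xml")

# Compile the converted model for a target device.
core = ov.Core()
compiled_transformer = core.compile_model("wan_transformer.xml", "CPU")
```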
## Installation instructions

This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/wan2.1-text-to-video/README.md" />
Lines changed: 59 additions & 0 deletions
"""Gradio demo helper for the Wan2.1 text-to-video OpenVINO notebook."""

import gradio as gr
import torch
from diffusers.utils import export_to_video
import numpy as np

# Upper bound for the seed slider (largest 32-bit signed integer).
MAX_SEED = np.iinfo(np.int32).max


def make_demo(pipeline):
    def generate_video(prompt, negative_prompt="", guidance_scale=1.0, seed=42, progress=gr.Progress(track_tqdm=True)):
        # Run the 4-step (CausVid-distilled) pipeline; frames[0] is the first
        # (and only) generated video in the batch.
        output = pipeline(
            prompt=prompt,
            negative_prompt=negative_prompt,
            height=480,
            width=832,
            num_frames=20,
            guidance_scale=guidance_scale,
            num_inference_steps=4,
            generator=torch.Generator().manual_seed(seed),
        ).frames[0]

        # Save the generated frames as an MP4 file and hand the path back to Gradio.
        video_path = "output.mp4"
        export_to_video(output, video_path, fps=10)
        return video_path

    iface = gr.Interface(
        fn=generate_video,
        inputs=[
            gr.Textbox(label="Prompt", placeholder="Enter your video prompt here"),
            gr.Textbox(label="Negative Prompt", placeholder="Optional negative prompt", value=""),
            gr.Slider(
                label="Guidance scale",
                minimum=0.0,
                maximum=20.0,
                step=0.1,
                value=1.0,
            ),
            gr.Slider(
                label="Seed",
                minimum=0,
                maximum=MAX_SEED,
                step=1,
                value=42,
            ),
        ],
        outputs=gr.Video(label="Generated Video"),
        title="Wan2.1-T2V-1.3B OpenVINO Video Generator",
        flagging_mode="never",
        examples=[
            ["a penguin playfully dancing in the snow, Antarctica", "", 1.0, 42],
            [
                "A cat walks on the grass, realistic",
                "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards",
                2.5,
                678,
            ],
        ],
    )
    return iface
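Assuming a `pipeline` object with a diffusers-style call signature (for example, the OpenVINO-backed pipeline built earlier in the notebook), the helper is used like this:

```python
demo = make_demo(pipeline)
demo.launch()
```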
