[Question] Applicability of RIFLEx to CogVideoX Image-to-Video Version
Problem Description
I encountered significant video quality degradation when attempting to apply the RIFLEx method to the Image-to-Video (I2V) version of CogVideoX-5b. When using RIFLEx to modify RoPE, the generated video frames exhibit blurring and distortion.
Environment Information
- Model Version: CogVideoX-5b-I2V
- Hardware: NVIDIA A100
- Software: PyTorch 2.0, transformers 4.30.2
Reproduction Steps
I used the following code to test RIFLEx on the I2V version:
import argparse
from functools import partial
from types import MethodType

import torch
from diffusers import (
    CogVideoXDPMScheduler,
    CogVideoXImageToVideoPipeline,
    CogVideoXTransformer3DModel,
)
from diffusers.utils import export_to_video, load_image

# _prepare_rotary_positional_embeddings_riflex is assumed to be the RoPE helper
# from the RIFLEx code; its definition and import are not shown in this issue.

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, help='Random seed',
                        default=1234)
    parser.add_argument('--k', type=int, help='Index of intrinsic frequency',
                        default=2)
    parser.add_argument('--N_k', type=int, help='The period of intrinsic frequency in latent space',
                        default=20)
    parser.add_argument('--num_frames', type=int, help='Number of frames for inference',
                        default=97)
    parser.add_argument('--finetune', help='Whether finetuned version', action='store_true')
    parser.add_argument('--model_id', type=str, help='huggingface path for models',
                        default="THUDM/CogVideoX-5b-I2V")
    parser.add_argument('--image', type=str, help='Image for generation',
                        default="CogKit/quickstart/data/i2v/train/images/1d50a3d9703f152758d5422c8b48010f.png")
    parser.add_argument('--prompt', type=str, help='Prompts for generation',
                        default="A dynamic sequence unfolds on the deck of a ship, where a small, mouse-like character with large ears and short pants enthusiastically steers the vessel using a wheel. A larger, bulky character with a long pole engages in a playful confrontation, asserting dominance or playfully provoking the smaller one. Expressive gestures and movements convey emotions and intentions, set against a nautical backdrop featuring a steering wheel, life preserver, and bell. The two characters interact in a lively, competitive, or friendly exchange.")
    args = parser.parse_args()

    # The VAE compresses time by a factor of 4, so num_frames must be 4 * k + 1.
    assert (args.num_frames - 1) % 4 == 0, "num_frames should be 4 * k + 1"
    L_test = (args.num_frames - 1) // 4 + 1  # latent frames

    # Load the transformer and the I2V pipeline from the same checkpoint.
    transformer = CogVideoXTransformer3DModel.from_pretrained(
        args.model_id,
        subfolder="transformer",
        torch_dtype=torch.bfloat16,
    )
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        args.model_id,
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    ).to("cuda")
    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    generator = torch.Generator("cuda").manual_seed(args.seed)

    # For training-free, if extrapolate length exceeds the period of intrinsic frequency, modify RoPE
    if L_test > args.N_k and not args.finetune:
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

    # We fine-tune the model on new theta_k and N_k, and thus modify RoPE to match the fine-tuning setting.
    if args.finetune:
        L_test = args.N_k  # the fine-tuning frequency setting
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

    image = load_image(args.image)
    video = pipe(image=image, prompt=args.prompt, num_frames=args.num_frames, height=480, width=720,
                 guidance_scale=6, num_inference_steps=50, generator=generator).frames[0]
    export_to_video(video, f"seed_{args.seed}_{args.prompt[:20]}.mp4", fps=8)
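The frame arithmetic that decides whether the script patches RoPE can be checked in isolation. This is a minimal sketch, assuming the temporal compression factor of 4 implied by the `(num_frames - 1) // 4 + 1` line above; `latent_frames` and `riflex_applies` are hypothetical helper names, not part of diffusers or RIFLEx.

```python
def latent_frames(num_frames: int, temporal_factor: int = 4) -> int:
    """Number of latent frames for a given number of output frames."""
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames should be temporal_factor * k + 1")
    return (num_frames - 1) // temporal_factor + 1


def riflex_applies(num_frames: int, n_k: int, finetune: bool = False) -> bool:
    """Mirror the two branches in the script above.

    RoPE is patched either in the fine-tuned setting, or (training-free)
    when the latent length exceeds the intrinsic period n_k.
    """
    return finetune or latent_frames(num_frames) > n_k


if __name__ == "__main__":
    for frames in (49, 97):
        print(frames, latent_frames(frames), riflex_applies(frames, n_k=20))
```

With the defaults used here (`num_frames=97`, `N_k=20`) this gives 25 latent frames, so the training-free branch should be taken.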
Expected Results
Applying RIFLEx should generate coherent, high-quality video sequences that maintain content consistency and temporal coherence with the input image.
Actual Results
- When RIFLEx is enabled, the generated video frames exhibit blurring and distortion.
Questions
- Is RIFLEx designed to work with the I2V version of CogVideoX, or is it only applicable to the Text-to-Video version?
- Are there any special configurations or parameter adjustments required for using RIFLEx with the I2V version?
Additional Information
- I also generated a 97-frame video with the unmodified CogVideoX-5b-I2V (no RIFLEx), and the result was exactly the same as with RIFLEx + CogVideoX-5b-I2V. Does this suggest that RIFLEx has no effect on CogVideoX-5b-I2V?
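One way to rule out a silent no-op would be to confirm that the patched method is actually invoked during generation. The sketch below is only an illustration of the `MethodType(partial(...), pipe)` pattern from the script, using a stand-in class instead of the real pipeline; `DummyPipe`, `riflex_stub`, and `counting` are hypothetical names, and the real `_prepare_rotary_positional_embeddings` signature may differ.

```python
from functools import partial
from types import MethodType


class DummyPipe:
    """Stand-in for the pipeline; it only mimics the method lookup."""

    def _prepare_rotary_positional_embeddings(self, num_frames):
        return ("original", num_frames)

    def __call__(self, num_frames):
        # Instance attributes shadow class methods, so a patch that was
        # set on the instance is the one picked up here.
        return self._prepare_rotary_positional_embeddings(num_frames)


def riflex_stub(self, num_frames, k=None, L_test=None):
    """Placeholder for the RIFLEx RoPE helper (assumption, not the real one)."""
    return ("riflex", num_frames, k, L_test)


def counting(fn, calls):
    """Wrap fn so that every call is recorded in the `calls` list."""
    def wrapper(*args, **kwargs):
        calls.append((args[1:], kwargs))  # drop `self` from the record
        return fn(*args, **kwargs)
    return wrapper


pipe = DummyPipe()
calls = []
pipe._prepare_rotary_positional_embeddings = MethodType(
    partial(counting(riflex_stub, calls), k=2, L_test=25), pipe
)
result = pipe(25)
print(result, len(calls))
```

If the same wrapper were applied to the real helper and `calls` stayed empty after a generation run, that would suggest the pipeline never reaches the patched method.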
Thank you for your assistance!
BaseMe2