Applicability of RIFLEx to CogVideoX Image-to-Video Version #24

@fuyuchenIfyw

Description

Problem Description

I encountered significant video quality degradation when attempting to apply the RIFLEx method to the Image-to-Video (I2V) version of CogVideoX-5b. When using RIFLEx to modify RoPE, the generated video frames exhibit blurring and distortion.

Environment Information

  • Model Version: CogVideoX-5b-I2V
  • Hardware: NVIDIA A100
  • Software: PyTorch 2.0, transformers 4.30.2

Reproduction Steps

I used the following code to test RIFLEx on the I2V version:

  import argparse
  from functools import partial
  from types import MethodType

  import torch
  from diffusers import (
      CogVideoXDPMScheduler,
      CogVideoXImageToVideoPipeline,
      CogVideoXTransformer3DModel,
  )
  from diffusers.utils import export_to_video, load_image

  # _prepare_rotary_positional_embeddings_riflex comes from the RIFLEx repository;
  # import it from wherever it lives in your local checkout.

  if __name__ == "__main__":
      parser = argparse.ArgumentParser()
      parser.add_argument('--seed', type=int, help='Random seed',
                          default=1234)
      parser.add_argument('--k', type=int, help='Index of the intrinsic frequency',
                          default=2)
      parser.add_argument('--N_k', type=int, help='Period of the intrinsic frequency in latent space',
                          default=20)
      parser.add_argument('--num_frames', type=int, help='Number of frames for inference',
                          default=97)
      parser.add_argument('--finetune', help='Whether to use the fine-tuned version', action='store_true')
      parser.add_argument('--model_id', type=str, help='Hugging Face path of the model',
                          default="THUDM/CogVideoX-5b-I2V")
      parser.add_argument('--image', type=str, help='Image for generation',
                          default="CogKit/quickstart/data/i2v/train/images/1d50a3d9703f152758d5422c8b48010f.png")
      parser.add_argument('--prompt', type=str, help='Prompt for generation',
                          default="A dynamic sequence unfolds on the deck of a ship, where a small, mouse-like character with large ears and short pants enthusiastically steers the vessel using a wheel. A larger, bulky character with a long pole engages in a playful confrontation, asserting dominance or playfully provoking the smaller one. Expressive gestures and movements convey emotions and intentions, set against a nautical backdrop featuring a steering wheel, life preserver, and bell. The two characters interact in a lively, competitive, or friendly exchange.")
      args = parser.parse_args()

      assert (args.num_frames - 1) % 4 == 0, "num_frames should be of the form 4 * n + 1"
      # CogVideoX's VAE compresses time 4x, so the latent clip has (num_frames - 1) // 4 + 1 frames.
      L_test = (args.num_frames - 1) // 4 + 1  # latent frames

      transformer = CogVideoXTransformer3DModel.from_pretrained(
          args.model_id,
          subfolder="transformer",
          torch_dtype=torch.bfloat16,
      )

      pipe = CogVideoXImageToVideoPipeline.from_pretrained(
          "THUDM/CogVideoX-5b-I2V",
          transformer=transformer,
          torch_dtype=torch.bfloat16,
      ).to("cuda")

      pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
      pipe.vae.enable_slicing()
      pipe.vae.enable_tiling()

      generator = torch.Generator("cuda").manual_seed(args.seed)

      # Training-free: if the extrapolated length exceeds the period of the intrinsic frequency, modify RoPE.
      if L_test > args.N_k and not args.finetune:
          pipe._prepare_rotary_positional_embeddings = MethodType(
              partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

      # Fine-tuned: the model was fine-tuned with new theta_k and N_k, so modify RoPE to match that setting.
      if args.finetune:
          L_test = args.N_k  # the fine-tuning frequency setting
          pipe._prepare_rotary_positional_embeddings = MethodType(
              partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

      image = load_image(args.image)

      video = pipe(image=image, prompt=args.prompt, num_frames=args.num_frames, height=480, width=720,
                   guidance_scale=6, num_inference_steps=50, generator=generator).frames[0]
      export_to_video(video, f"seed_{args.seed}_{args.prompt[:20]}.mp4", fps=8)
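
For context, my understanding of what _prepare_rotary_positional_embeddings_riflex does (paraphrasing the idea, not the repository's actual implementation) is that it only lowers the k-th temporal RoPE frequency so that its period covers the extrapolated latent length, and leaves the rest of the embedding preparation unchanged. A minimal sketch of that idea, with freqs standing in for the temporal frequency vector and k assumed to be 1-indexed:

  import math
  import torch

  def lower_intrinsic_frequency(freqs: torch.Tensor, k: int, L_test: int) -> torch.Tensor:
      """Return a copy of the temporal RoPE frequencies in which the k-th
      component (1-indexed here, an assumption) is lowered to 2*pi / L_test,
      so that its period spans the whole extrapolated latent length L_test."""
      freqs = freqs.clone()
      freqs[k - 1] = 2.0 * math.pi / L_test
      return freqs

With k=2 and num_frames=97 (L_test=25), that component would complete at most one full cycle over the clip.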

Expected Results

Applying RIFLEx should generate coherent, high-quality video sequences that maintain content consistency and temporal coherence with the input image.

Actual Results

  • When RIFLEx is enabled, the generated videos exhibit noticeable blurring and distortion (two example frames are attached to this issue).

Questions

  1. Is RIFLEx designed to work with the I2V version of CogVideoX, or is it only applicable to the Text-to-Video version?
  2. Are there any special configurations or parameter adjustments required for using RIFLEx with the I2V version?

Additional Information

  • I also tried generating a 97-frame video with unmodified CogVideoX-5b-I2V, and the result was exactly the same as with RIFLEx + CogVideoX-5b-I2V. Does this suggest that RIFLEx has no effect on CogVideoX-5b-I2V? A quick way to check whether the patched method is even being invoked is sketched below.
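
The quick check mentioned above wraps whatever is currently bound to pipe._prepare_rotary_positional_embeddings and logs each call, to confirm that the I2V pipeline actually goes through the patched method. This continues the script above, so pipe and MethodType are already in scope; the wrapper name is mine:

  _original_prepare_rope = pipe._prepare_rotary_positional_embeddings

  def _logged_prepare_rope(self, *args, **kwargs):
      # Print every invocation so a silent "patch never called" failure is visible.
      print("RoPE preparation called, args:", args, "kwargs:", kwargs)
      return _original_prepare_rope(*args, **kwargs)

  pipe._prepare_rotary_positional_embeddings = MethodType(_logged_prepare_rope, pipe)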

Thank you for your assistance!
