Applicability of RIFLEx to CogVideoX Image-to-Video Version #24

@fuyuchenIfyw

Description

Problem Description

I encountered significant video quality degradation when attempting to apply the RIFLEx method to the Image-to-Video (I2V) version of CogVideoX-5b. When using RIFLEx to modify RoPE, the generated video frames exhibit blurring and distortion.

Environment Information

  • Model Version: CogVideoX-5b-I2V
  • Hardware: NVIDIA A100
  • Software: PyTorch 2.0, transformers 4.30.2

Reproduction Steps

I used the following code to test RIFLEx on the I2V version:

  import argparse
  from functools import partial
  from types import MethodType

  import torch
  from diffusers import (
      CogVideoXDPMScheduler,
      CogVideoXImageToVideoPipeline,
      CogVideoXTransformer3DModel,
  )
  from diffusers.utils import export_to_video, load_image

  # _prepare_rotary_positional_embeddings_riflex comes from the RIFLEx repository;
  # import it from wherever it lives in your local checkout.

  if __name__ == "__main__":
      parser = argparse.ArgumentParser()
      parser.add_argument('--seed', type=int, help='Random seed',
                          default=1234)
      parser.add_argument('--k', type=int, help='Index of the intrinsic frequency',
                          default=2)
      parser.add_argument('--N_k', type=int, help='Period of the intrinsic frequency in latent space',
                          default=20)
      parser.add_argument('--num_frames', type=int, help='Number of frames for inference',
                          default=97)
      parser.add_argument('--finetune', help='Whether to use the fine-tuned version', action='store_true')
      parser.add_argument('--model_id', type=str, help='Hugging Face path of the model',
                          default="THUDM/CogVideoX-5b-I2V")
      parser.add_argument('--image', type=str, help='Image for generation',
                          default="CogKit/quickstart/data/i2v/train/images/1d50a3d9703f152758d5422c8b48010f.png")
      parser.add_argument('--prompt', type=str, help='Prompt for generation',
                          default="A dynamic sequence unfolds on the deck of a ship, where a small, mouse-like character with large ears and short pants enthusiastically steers the vessel using a wheel. A larger, bulky character with a long pole engages in a playful confrontation, asserting dominance or playfully provoking the smaller one. Expressive gestures and movements convey emotions and intentions, set against a nautical backdrop featuring a steering wheel, life preserver, and bell. The two characters interact in a lively, competitive, or friendly exchange.")
      args = parser.parse_args()

      assert (args.num_frames - 1) % 4 == 0, "num_frames should be of the form 4 * n + 1"
      # CogVideoX's VAE compresses time 4x, so the latent clip has (num_frames - 1) // 4 + 1 frames.
      L_test = (args.num_frames - 1) // 4 + 1  # latent frames

      transformer = CogVideoXTransformer3DModel.from_pretrained(
          args.model_id,
          subfolder="transformer",
          torch_dtype=torch.bfloat16,
      )

      pipe = CogVideoXImageToVideoPipeline.from_pretrained(
          "THUDM/CogVideoX-5b-I2V",
          transformer=transformer,
          torch_dtype=torch.bfloat16,
      ).to("cuda")

      pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
      pipe.vae.enable_slicing()
      pipe.vae.enable_tiling()

      generator = torch.Generator("cuda").manual_seed(args.seed)

      # Training-free: if the extrapolated length exceeds the period of the intrinsic frequency, modify RoPE.
      if L_test > args.N_k and not args.finetune:
          pipe._prepare_rotary_positional_embeddings = MethodType(
              partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

      # Fine-tuned: the model was fine-tuned with new theta_k and N_k, so modify RoPE to match that setting.
      if args.finetune:
          L_test = args.N_k  # the fine-tuning frequency setting
          pipe._prepare_rotary_positional_embeddings = MethodType(
              partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

      image = load_image(args.image)

      video = pipe(image=image, prompt=args.prompt, num_frames=args.num_frames, height=480, width=720,
                   guidance_scale=6, num_inference_steps=50, generator=generator).frames[0]
      export_to_video(video, f"seed_{args.seed}_{args.prompt[:20]}.mp4", fps=8)
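
For context, my understanding of what _prepare_rotary_positional_embeddings_riflex does (paraphrasing the idea, not the repository's actual implementation) is that it only lowers the k-th temporal RoPE frequency so that its period covers the extrapolated latent length, and leaves the rest of the embedding preparation unchanged. A minimal sketch of that idea, with freqs standing in for the temporal frequency vector and k assumed to be 1-indexed:

  import math
  import torch

  def lower_intrinsic_frequency(freqs: torch.Tensor, k: int, L_test: int) -> torch.Tensor:
      """Return a copy of the temporal RoPE frequencies in which the k-th
      component (1-indexed here, an assumption) is lowered to 2*pi / L_test,
      so that its period spans the whole extrapolated latent length L_test."""
      freqs = freqs.clone()
      freqs[k - 1] = 2.0 * math.pi / L_test
      return freqs

With k=2 and num_frames=97 (L_test=25), that component would complete at most one full cycle over the clip.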

Expected Results

Applying RIFLEx should generate coherent, high-quality video sequences that maintain content consistency and temporal coherence with the input image.

Actual Results

  • When RIFLEx is enabled, the generated videos exhibit noticeable blurring and distortion (two example frames are attached to this issue).

Questions

  1. Is RIFLEx designed to work with the I2V version of CogVideoX, or is it only applicable to the Text-to-Video version?
  2. Are there any special configurations or parameter adjustments required for using RIFLEx with the I2V version?

Additional Information

  • I also tried generating a 97-frame video with unmodified CogVideoX-5b-I2V, and the result was exactly the same as with RIFLEx + CogVideoX-5b-I2V. Does this suggest that RIFLEx has no effect on CogVideoX-5b-I2V? A quick way to check whether the patched method is even being invoked is sketched below.
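
The quick check mentioned above wraps whatever is currently bound to pipe._prepare_rotary_positional_embeddings and logs each call, to confirm that the I2V pipeline actually goes through the patched method. This continues the script above, so pipe and MethodType are already in scope; the wrapper name is mine:

  _original_prepare_rope = pipe._prepare_rotary_positional_embeddings

  def _logged_prepare_rope(self, *args, **kwargs):
      # Print every invocation so a silent "patch never called" failure is visible.
      print("RoPE preparation called, args:", args, "kwargs:", kwargs)
      return _original_prepare_rope(*args, **kwargs)

  pipe._prepare_rotary_positional_embeddings = MethodType(_logged_prepare_rope, pipe)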

Thank you for your assistance!
