Quality degradation after many frames with no action conditioning #35

@yondonfu

I observed that after generating many blocks of frames without action conditioning (i.e. mouse input = "u", keyboard input = "q"), the frames keep showing the same scene but the quality degrades over time (e.g. distortion, visual artifacts).

This can be reproduced with both inference.py and inference_streaming.py using this branch, which contains a small change to generate mouse and keyboard condition tensors with all zeros (i.e. no mouse or keyboard input).

For inference.py, I used this command:

python inference.py \
 --config_path configs/inference_yaml/inference_universal.yaml \
 --checkpoint_path Matrix-Game-2.0/base_distilled_model/base_distill.safetensors \
 --img_path demo_images/universal/0011.png \
 --output_folder outputs \
 --num_output_frames 150 \
 --seed 42 \
 --pretrained_model_path Matrix-Game-2.0

This is the resulting video:

repro_static_quality_degradation_inference.mp4

For inference_streaming.py, I used this command:

python inference_streaming.py \
 --config_path configs/inference_yaml/inference_universal.yaml \
 --checkpoint_path Matrix-Game-2.0/base_distilled_model/base_distill.safetensors \
 --output_folder outputs \
 --seed 42 \
 --pretrained_model_path Matrix-Game-2.0

I also used demo_images/universal/0011.png as the input image and repeatedly entered "u" for mouse input and "q" for keyboard input.

This is the resulting video:

repro_static_quality_degradation_inference_streaming.mp4

Is this known/expected behavior? If so, I'm wondering if anyone has thoughts on the best way to avoid the quality degradation, because as-is any pause in action that results in a static scene causes the quality to degrade. A workaround might be to skip frame generation when there are no actions, but the downside is that you would lose the ability to get frames of the same scene with minor animation details, e.g. changes in the water for the demo image used above.
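To make the workaround concrete, here is a minimal sketch of what I have in mind. It is not code from this repo: `generate_with_idle_skip`, `generate_block`, `first_block`, and `max_idle_blocks` are hypothetical names, and `generate_block` stands in for whatever call produces the next block of frames for one (mouse, keyboard) input. The only thing taken from the repro above is that "u" and "q" mean no mouse and no keyboard input.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

# From the repro above: "u" = no mouse input, "q" = no keyboard input.
NO_MOUSE = "u"
NO_KEY = "q"

Action = Tuple[str, str]  # (mouse_input, keyboard_input)
Block = TypeVar("Block")  # whatever the generator returns for one block of frames


def is_idle(action: Action) -> bool:
    """Return True when the action carries no mouse and no keyboard input."""
    mouse, key = action
    return mouse == NO_MOUSE and key == NO_KEY


def generate_with_idle_skip(
    actions: Iterable[Action],
    generate_block: Callable[[Action], Block],  # hypothetical model call
    first_block: Block,                         # block to show if we idle from the start
    max_idle_blocks: int = 0,                   # idle blocks still generated before freezing
) -> Iterator[Block]:
    """Yield one block per action, reusing the last block during long idle runs."""
    last_block = first_block
    idle_streak = 0
    for action in actions:
        # Count consecutive idle actions; any real input resets the count.
        idle_streak = idle_streak + 1 if is_idle(action) else 0
        if idle_streak > max_idle_blocks:
            # Skip generation entirely and repeat the last generated block.
            yield last_block
            continue
        last_block = generate_block(action)
        yield last_block
```

With `max_idle_blocks=0` every idle input reuses the previous block, which is the plain "skip generation" workaround and freezes details like the water. A small positive value would still generate a few idle blocks before freezing, but I have not tested whether that is enough to stay ahead of the degradation.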
