Quality degradation after many frames with no action conditioning

I observed that after generating many blocks of frames without action conditioning (i.e. mouse input = "u", keyboard input = "q"), the frames still show the same scene, but their quality degrades over time (e.g. distortion and visual artifacts).

This can be reproduced with both `inference.py` and `inference_streaming.py` using [this branch](https://github.com/yondonfu/Matrix-Game/tree/test-static-scene) which contains a [small change](https://github.com/yondonfu/Matrix-Game/commit/0d3c828a0c5488c0979fb8ccfe80aa021d489f29) to generate mouse and keyboard condition tensors with all zeros (i.e. no mouse or keyboard input).

For `inference.py`, I used this command:

```
python inference.py \
 --config_path configs/inference_yaml/inference_universal.yaml \
 --checkpoint_path Matrix-Game-2.0/base_distilled_model/base_distill.safetensors \
 --img_path demo_images/universal/0011.png \
 --output_folder outputs \
 --num_output_frames 150 \
 --seed 42 \
 --pretrained_model_path Matrix-Game-2.0
```

This is the resulting video:

https://github.com/user-attachments/assets/45960183-82d8-499d-863f-ad2888d49764

For `inference_streaming.py`, I used this command:

```
python inference_streaming.py \
 --config_path configs/inference_yaml/inference_universal.yaml \
 --checkpoint_path Matrix-Game-2.0/base_distilled_model/base_distill.safetensors \
 --output_folder outputs \
 --seed 42 \
 --pretrained_model_path Matrix-Game-2.0
```

I also used `demo_images/universal/0011.png` as the input image and repeatedly entered "u" for the mouse input and "q" for the keyboard input.

This is the resulting video:

https://github.com/user-attachments/assets/cd6da561-ae44-46b5-8dbd-ac1c966bfaef

Is this known or expected behavior? If so, does anyone have thoughts on the best way to avoid the quality degradation? As-is, any pause in actions that results in a static scene causes the quality to degrade. One workaround might be to skip frame generation when there are no actions, but the downside is that you would lose the ability to get frames of the same scene with minor animation details, e.g. changes in the water in the demo image used above.
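To make the skip-generation workaround concrete, here is a minimal sketch of what I have in mind. `generate_block` and `run_session` are hypothetical names for illustration, not functions from the Matrix-Game repository, and I have not tested whether the model's internal state tolerates skipped blocks.

```python
from typing import Callable, List, Sequence, Tuple

# Per the inputs used above: "u" means no mouse input, "q" means no keyboard input.
NO_MOUSE = "u"
NO_KEY = "q"


def is_idle(mouse: str, keyboard: str) -> bool:
    """Return True when neither input carries an action."""
    return mouse == NO_MOUSE and keyboard == NO_KEY


def run_session(
    actions: Sequence[Tuple[str, str]],
    generate_block: Callable[[str, str], List],
    initial_block: List,
) -> List[List]:
    """Return one block of frames per action, skipping generation when idle.

    generate_block is a hypothetical stand-in for the model call that
    produces the next block of frames for a (mouse, keyboard) input.
    """
    last_block = list(initial_block)
    blocks = []
    for mouse, keyboard in actions:
        if not is_idle(mouse, keyboard):
            # Only call the model when there is an actual action.
            last_block = generate_block(mouse, keyboard)
        # When idle, the previous block is repeated, so the scene freezes.
        blocks.append(last_block)
    return blocks
```

With this approach the output is frozen for as long as the input stays idle, which is exactly the loss of ambient animation described above.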
