Hi Wan team,
I’m trying to fine-tune Wan-2.2-Animate and would like to clarify how the loss is calculated for the two “extra” latents:
- noisy reference latents
- noisy temporal latents
From the inference code I see that the denoising network receives three kinds of noisy latents:
- noisy video latents
- noisy reference latents
- noisy temporal latents
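To make sure I'm reading the inference code correctly: a minimal numpy sketch of what I understand the denoiser input to look like. The shapes, variable names, and concatenation axis here are my own guesses, not taken from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes (channels, frames, h, w) -- purely illustrative.
noisy_ref      = rng.standard_normal((16, 1, 8, 8))   # noisy reference latents
noisy_temporal = rng.standard_normal((16, 4, 8, 8))   # noisy temporal latents
noisy_video    = rng.standard_normal((16, 21, 8, 8))  # noisy video latents

# One plausible layout: the three noisy streams concatenated along the
# frame axis before being fed to the denoising network.
model_input = np.concatenate([noisy_ref, noisy_temporal, noisy_video], axis=1)
print(model_input.shape)  # (16, 26, 8, 8)
```

If the actual layout differs (e.g. reference latents passed through a separate branch), please correct me.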
The reference latents are also denoised step by step, so their values change during sampling.
This implies that you must have computed a loss on both the reference and the temporal latent during training. Could you share the details?
Did you simply apply the same flow-matching loss to all three tensors and sum/average them?
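For concreteness, here is the training objective I am hypothesizing: a self-contained numpy sketch that applies the same rectified-flow velocity target (v = noise - x0) to each of the three latent streams and sums the per-stream losses. The shapes and the `flow_matching_loss` helper are illustrative assumptions, not code from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(pred_velocity, x0, noise):
    """MSE against the rectified-flow velocity target v = noise - x0 (assumed)."""
    target = noise - x0
    return float(np.mean((pred_velocity - target) ** 2))

# Hypothetical latent shapes (batch, frames, h, w) for the three streams.
shapes = {
    "video":     (1, 21, 8, 8),
    "reference": (1, 1, 8, 8),
    "temporal":  (1, 4, 8, 8),
}

losses = {}
for name, shape in shapes.items():
    x0    = rng.standard_normal(shape)  # clean latents
    noise = rng.standard_normal(shape)  # sampled noise
    pred  = rng.standard_normal(shape)  # stand-in for the denoiser's prediction
    losses[name] = flow_matching_loss(pred, x0, noise)

# The hypothesis in question: one flow-matching loss per stream, then summed.
total = sum(losses.values())
print(losses, total)
```

Is this roughly what the training loop does, or are the reference/temporal losses weighted (or computed) differently?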
Finally, I would like to ask: in current image/video generation models such as Flux-Kontext, Stand-in, Phantom, and RealisDance-DiT, the noisy latents never contain the reference image, and the reference is not supervised by any loss during training. What was the motivation behind Wan-2.2-Animate's decision to add noise to the reference latents?