Wan 2.2 Animate: Question about loss computation on noisy reference and temporal latents. #200

@0xLDF

Description

Hi Wan team,
I’m trying to fine-tune Wan-2.2-Animate and would like to clarify how the loss is calculated for the two “extra” latents:

  • noisy reference latents
  • noisy temporal latents

From the inference code I see that the denoising network receives three kinds of noisy latents:

  • noisy video latents
  • noisy reference latents
  • noisy temporal latents

The reference latents are also denoised step by step, so their values change during sampling.
This implies that a loss must have been computed on both the reference and the temporal latents during training. Could you share the details?
Did you simply apply the same flow-matching loss to all three tensors and sum/average them?
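To make the question concrete, here is a minimal sketch of what I mean by "the same flow-matching loss applied to all three tensors": the three latent streams are concatenated, a single linear-interpolation noising path is applied, and one MSE on the predicted velocity covers everything. The tensor shapes, concatenation axis, and model signature are my assumptions, not the actual Wan-2.2-Animate training code.

```python
import torch
import torch.nn.functional as F

def joint_flow_matching_loss(model, video_lat, ref_lat, temp_lat):
    # Assumed layout: (batch, frames, channels); concatenating along the
    # frame axis lets one velocity prediction cover all three streams.
    x0 = torch.cat([video_lat, ref_lat, temp_lat], dim=1)
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)       # one timestep per sample
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))            # broadcast over latent dims
    x_t = (1 - t_) * x0 + t_ * noise                    # linear interpolation path
    v_target = noise - x0                               # flow-matching velocity target
    v_pred = model(x_t, t)                              # hypothetical model signature
    # A single MSE over the concatenated tensor is equivalent to a
    # size-weighted average of the three per-stream losses.
    return F.mse_loss(v_pred, v_target)
```

Is this roughly what the training objective looks like, or are the reference/temporal terms weighted or scheduled differently from the video term?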

Finally, I would like to ask: in current image/video generation models such as Flux-Kontext, Stand-in, Phantom, and RealisDance-DiT, the noisy latents never contain the reference image, and the reference is not supervised by any loss during training. What was the motivation behind Wan-2.2-Animate's decision to add noise to the reference latents?
