Hi Wan team,
I’m trying to fine-tune Wan-2.2-Animate and would like to clarify how the loss is calculated for the two “extra” latents:
- noisy reference latents
- noisy temporal latents
From the inference code I see that the denoising network receives three kinds of noisy latents:
- noisy video latents
- noisy reference latents
- noisy temporal latents
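To make sure I'm reading the inference code correctly: a minimal numpy sketch of what I understand the denoiser input to look like. The shapes, variable names, and concatenation axis here are my own guesses, not taken from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes (channels, frames, h, w) -- purely illustrative.
noisy_ref      = rng.standard_normal((16, 1, 8, 8))   # noisy reference latents
noisy_temporal = rng.standard_normal((16, 4, 8, 8))   # noisy temporal latents
noisy_video    = rng.standard_normal((16, 21, 8, 8))  # noisy video latents

# One plausible layout: the three noisy streams concatenated along the
# frame axis before being fed to the denoising network.
model_input = np.concatenate([noisy_ref, noisy_temporal, noisy_video], axis=1)
print(model_input.shape)  # (16, 26, 8, 8)
```

If the actual layout differs (e.g. reference latents passed through a separate branch), please correct me.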
The reference latents are also denoised step by step, so their values change during sampling.
This implies that you must have computed a loss on both the reference and the temporal latent during training. Could you share the details?
Did you simply apply the same flow-matching loss to all three tensors and sum/average them?
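For concreteness, here is the training objective I am hypothesizing: a self-contained numpy sketch that applies the same rectified-flow velocity target (v = noise - x0) to each of the three latent streams and sums the per-stream losses. The shapes and the `flow_matching_loss` helper are illustrative assumptions, not code from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(pred_velocity, x0, noise):
    """MSE against the rectified-flow velocity target v = noise - x0 (assumed)."""
    target = noise - x0
    return float(np.mean((pred_velocity - target) ** 2))

# Hypothetical latent shapes (batch, frames, h, w) for the three streams.
shapes = {
    "video":     (1, 21, 8, 8),
    "reference": (1, 1, 8, 8),
    "temporal":  (1, 4, 8, 8),
}

losses = {}
for name, shape in shapes.items():
    x0    = rng.standard_normal(shape)  # clean latents
    noise = rng.standard_normal(shape)  # sampled noise
    pred  = rng.standard_normal(shape)  # stand-in for the denoiser's prediction
    losses[name] = flow_matching_loss(pred, x0, noise)

# The hypothesis in question: one flow-matching loss per stream, then summed.
total = sum(losses.values())
print(losses, total)
```

Is this roughly what the training loop does, or are the reference/temporal losses weighted (or computed) differently?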
Finally, I would like to ask: in current image/video generation models such as Flux-Kontext, Stand-in, Phantom, and RealisDance-DiT, the noisy latents never contain the reference image, and the reference is not supervised by any loss during training. What was the motivation behind Wan-2.2-Animate's decision to add noise to the reference latents?