Description
I'm using selective activation checkpointing to train a video generation model, which is quite different from training large language models: video generation models have fewer parameters but far more intermediate activations.
I'm training the HunyuanVideo model, which you can find in the diffusers library. When I try to combine selective activation checkpointing with CPU offload (`save_on_cpu`) to reduce GPU memory, I still hit an OOM inside the `save_on_cpu` hook. The traceback shows torch trying to allocate a `head_num * seqlen * seqlen` tensor in the `save_on_cpu` hook right before `F.scaled_dot_product_attention`, which is huge (about 69 GB of VRAM). That seems really abnormal: since I'm already computing attention with `F.scaled_dot_product_attention`, there shouldn't be any `seqlen * seqlen` intermediates at all.
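
Roughly, my setup looks like the following minimal sketch (the `run_block` wrapper and the policy are placeholders rather than the exact diffusers code; I'm assuming PyTorch >= 2.4 for the selective activation checkpoint API):

```python
import functools

import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

# Hypothetical policy: keep the fused SDPA output so attention is not
# recomputed in backward, and prefer recomputing everything else.
def policy_fn(ctx, op, *args, **kwargs):
    if op == torch.ops.aten._scaled_dot_product_flash_attention.default:
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

context_fn = functools.partial(create_selective_checkpoint_contexts, policy_fn)

def run_block(block, hidden_states):
    # Whatever the checkpoint decides to save is moved to host memory
    # by the save_on_cpu saved-tensor hooks.
    with torch.autograd.graph.save_on_cpu(pin_memory=True):
        return checkpoint(
            block,
            hidden_states,
            use_reentrant=False,
            context_fn=context_fn,
        )
```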
Can anyone explain what's going on? And how can I use CPU offload to take advantage of my large host memory?
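
One thing I suspect (unconfirmed) is that SDPA is silently falling back to the math backend inside the checkpointed region; the math backend does materialize the full attention matrix for backward, which would explain the `head_num * seqlen * seqlen` tensor the hook is trying to offload. A minimal sketch of how to test this, assuming PyTorch >= 2.3 for `torch.nn.attention.sdpa_kernel` (the tensor shapes below are made up for illustration):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Dummy q/k/v just to exercise the kernel selection.
q = k = v = torch.randn(1, 24, 4096, 128, device="cuda", dtype=torch.bfloat16)

# Restrict SDPA to fused kernels that never materialize the full
# seqlen x seqlen attention matrix; if the OOM disappears, the math
# fallback was being selected inside the checkpointed region.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v)
```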