Description
I'm using selective activation checkpointing to train a video generation model, which is quite different from training large language models: video generation models have fewer parameters but far more intermediate activations.
I'm training the HunyuanVideo model, which you can find in the diffusers library. When I try to combine selective activation checkpointing with CPU offload (`save_on_cpu`) to reduce GPU memory, I still hit an OOM inside the `save_on_cpu` hook. The traceback shows torch trying to allocate a `head_num * seqlen * seqlen` tensor in the `save_on_cpu` hook right before `F.scaled_dot_product_attention`, which is huge (about 69 GB of VRAM). That seems really abnormal: since I'm already computing attention with `F.scaled_dot_product_attention`, there shouldn't be any `seqlen * seqlen` intermediates at all.
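
Roughly, my setup looks like the following minimal sketch (the `run_block` wrapper and the policy are placeholders rather than the exact diffusers code; I'm assuming PyTorch >= 2.4 for the selective activation checkpoint API):

```python
import functools

import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

# Hypothetical policy: keep the fused SDPA output so attention is not
# recomputed in backward, and prefer recomputing everything else.
def policy_fn(ctx, op, *args, **kwargs):
    if op == torch.ops.aten._scaled_dot_product_flash_attention.default:
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

context_fn = functools.partial(create_selective_checkpoint_contexts, policy_fn)

def run_block(block, hidden_states):
    # Whatever the checkpoint decides to save is moved to host memory
    # by the save_on_cpu saved-tensor hooks.
    with torch.autograd.graph.save_on_cpu(pin_memory=True):
        return checkpoint(
            block,
            hidden_states,
            use_reentrant=False,
            context_fn=context_fn,
        )
```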
Can anyone explain what's going on? And how can I use CPU offload to take advantage of my large host memory?
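
One thing I suspect (unconfirmed) is that SDPA is silently falling back to the math backend inside the checkpointed region; the math backend does materialize the full attention matrix for backward, which would explain the `head_num * seqlen * seqlen` tensor the hook is trying to offload. A minimal sketch of how to test this, assuming PyTorch >= 2.3 for `torch.nn.attention.sdpa_kernel` (the tensor shapes below are made up for illustration):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Dummy q/k/v just to exercise the kernel selection.
q = k = v = torch.randn(1, 24, 4096, 128, device="cuda", dtype=torch.bfloat16)

# Restrict SDPA to fused kernels that never materialize the full
# seqlen x seqlen attention matrix; if the OOM disappears, the math
# fallback was being selected inside the checkpointed region.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v)
```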