Cannot explain recurring OOM error 

Hi there,

I am trying to run inference with the int8-quantized BLOOM 175B model, closely following the `bloom-accelerate-inference.py` script. I have about 1000 prompts that I need outputs for. I use a beam size of 1 (greedy search) and a batch size of 1, since I can't fit more into GPU memory (I have 4 × 80 GB A100 GPUs). `max_new_tokens` is set to 64.

When running inference on this list of prompts, my script generates outputs for the first 61 prompts successfully and then crashes with an OOM error:

`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 79.17 GiB total capacity; 77.63 GiB already allocated; 11.31 MiB free; 77.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`
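
Since the message suggests checking for fragmentation, the reported numbers can be compared directly. Below is a minimal sketch of that arithmetic, using only the figures quoted above; `parse_oom` is a hypothetical helper written for illustration and is not part of PyTorch.

```python
import re

# The OOM message quoted above, trimmed to the part that carries the numbers.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 14.00 MiB "
    "(GPU 0; 79.17 GiB total capacity; 77.63 GiB already allocated; "
    "11.31 MiB free; 77.92 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Hypothetical helper: return the sizes in an OOM message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(OOM_MESSAGE)

# Memory held by the caching allocator but not backing any live tensor.
slack = sizes["reserved"] - sizes["allocated"]
print(f"reserved but unallocated: {slack:.2f} MiB")
print(f"allocated share of GPU 0: {sizes['allocated'] / sizes['total']:.1%}")
```

The gap between reserved and allocated memory is about 0.29 GiB, which is small next to the 77.63 GiB already allocated, so the "reserved memory is >> allocated memory" condition does not seem to apply here. GPU 0 looks genuinely full rather than fragmented.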

Long prompts often cause OOM, but I do not think the length of the current prompt is the cause here. I logged the prompt lengths to make sure: prompts longer than the failing one were generated successfully earlier in the run (among the first 61 prompts).
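
For anyone who wants to reproduce the check, here is a rough sketch of the kind of per-prompt logging I mean. It is not the exact code from my script: `gpu_memory_mib`, `log_step`, and `longest_so_far` are hypothetical names, and the sketch assumes the prompt length in tokens is available before each `generate` call. It falls back to an empty memory list when PyTorch or CUDA is unavailable.

```python
try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def gpu_memory_mib():
    """Allocated memory per visible GPU in MiB (empty list without CUDA)."""
    if torch is None or not torch.cuda.is_available():
        return []
    return [
        torch.cuda.memory_allocated(i) / 2**20
        for i in range(torch.cuda.device_count())
    ]


def log_step(index, prompt_tokens, history):
    """Record the prompt length and current GPU memory for one prompt."""
    record = {
        "index": index,
        "prompt_tokens": prompt_tokens,
        "allocated_mib": gpu_memory_mib(),
    }
    history.append(record)
    return record


def longest_so_far(history):
    """Longest prompt (in tokens) among the prompts logged so far."""
    return max(record["prompt_tokens"] for record in history)


# Illustrative values only, not my real prompt lengths.
history = []
for index, prompt_tokens in enumerate([120, 340, 95]):
    log_step(index, prompt_tokens, history)

print(longest_so_far(history))
```

If the failing prompt is shorter than `longest_so_far(history)`, prompt length alone does not explain the crash. If `allocated_mib` also grows from one prompt to the next, that would point to memory being retained across iterations.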

I am unable to figure out what the reason could be. Any suggestions or ideas?
