This repository was archived by the owner on Oct 9, 2024. It is now read-only.
I am trying to run inference with the int8-quantized BLOOM 175B model, closely following the bloom-accelerate-inference.py script. I have about 1000 prompts for which I need outputs. I use a beam size of 1 (greedy search) and a batch size of 1, since I can't fit more into GPU memory (I have 4 × 80 GB A100 GPUs). max_new_tokens is set to 64.
When I run inference on this list of prompts, the script generates successfully for the first few prompts (61 in this case) and then crashes with an OOM error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 79.17 GiB total capacity; 77.63 GiB already allocated; 11.31 MiB free; 77.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
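The hint in the message about fragmentation can be checked with simple arithmetic on the numbers it reports. The sketch below parses the message text quoted above and computes the gap between reserved and allocated memory; the helper name `parse_oom_message` and the regular expressions are illustrative assumptions written for this one message format, not part of PyTorch.

```python
import re

# The error text quoted above (only the part containing the numbers).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 79.17 GiB total "
    "capacity; 77.63 GiB already allocated; 11.31 MiB free; 77.92 GiB "
    "reserved in total by PyTorch)"
)

# Conversion factors to GiB for the two units that appear in the message.
TO_GIB = {"MiB": 1.0 / 1024.0, "GiB": 1.0}

# Hypothetical helper: one pattern per field, written for this message format.
FIELD_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"([\d.]+) (MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved",
}


def parse_oom_message(message):
    """Return every size reported in the message, converted to GiB."""
    stats = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        stats[name] = float(match.group(1)) * TO_GIB[match.group(2)]
    return stats


stats = parse_oom_message(OOM_MESSAGE)
# Memory held by the caching allocator but not backing any live tensor.
cached_but_unused = stats["reserved"] - stats["allocated"]
print(f"reserved - allocated = {cached_but_unused:.2f} GiB")
print(f"allocated / total    = {stats['allocated'] / stats['total']:.1%}")
```

The gap is about 0.29 GiB against 77.63 GiB allocated, so reserved memory is not much larger than allocated memory here. Under that reading, the max_split_size_mb suggestion is unlikely to help much, and GPU 0 appears to be nearly full of live allocations.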
Long prompts often cause OOM, but I do not think the length of the current prompt is the cause here. I checked my logs to make sure: prompts longer than the failing one were generated successfully earlier (among the first 61 prompts mentioned above).
I am unable to figure out what the possible reason could be. Any suggestions/ideas?
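One way to narrow this down is to record peak GPU memory for every prompt and see whether it climbs steadily across prompts (which would point to something being retained between calls) or jumps only on particular inputs. The following is a minimal sketch under stated assumptions: `run_prompts` and `generate_one` are hypothetical names, and `generate_one` stands in for whatever wraps tokenization, `model.generate`, and decoding to a plain string. Only standard `torch.cuda` memory functions are used, and the sketch degrades to a plain loop when CUDA is unavailable.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def cuda_available():
    """True only if PyTorch is installed and at least one GPU is visible."""
    return torch is not None and torch.cuda.is_available()


def peak_allocated_gib():
    """Peak allocated memory per GPU since the last reset, in GiB."""
    if not cuda_available():
        return []
    return [
        torch.cuda.max_memory_allocated(i) / 2**30
        for i in range(torch.cuda.device_count())
    ]


def run_prompts(prompts, generate_one):
    """Run generate_one on each prompt and log per-prompt peak GPU memory.

    generate_one is a hypothetical callable that should return a plain
    Python string, so that no GPU tensors are kept alive in the results.
    """
    results = []
    for index, prompt in enumerate(prompts):
        if cuda_available():
            for i in range(torch.cuda.device_count()):
                torch.cuda.reset_peak_memory_stats(i)

        results.append(generate_one(prompt))

        peaks = ", ".join(f"{p:.2f}" for p in peak_allocated_gib())
        print(f"prompt {index}: {len(prompt)} chars, peak GiB per GPU: [{peaks}]")

        # Drop unreachable objects and return cached, unused blocks to the
        # driver. This does not free tensors that are still referenced.
        gc.collect()
        if cuda_available():
            torch.cuda.empty_cache()
    return results
```

If the logged peak on GPU 0 grows from prompt to prompt regardless of prompt length, the likely place to look is anything that keeps GPU tensors alive across iterations, such as storing raw `generate` outputs instead of decoded strings. That is a possibility to check, not a confirmed diagnosis.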