Commit 095b222
Merge pull request #152 from pytorch-labs/malfet-patch-1

Discovered by @HDCharles

Test plan:

```
% python3 quantize.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --mode int4 --device cuda
% python3 generate.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model_int4.g32.cuda.pth --prompt "Once upon a time" --device cuda
...
Using int4 weight-only quantization!
Time to load model: 3.20 seconds
Once upon a time I was a kid. And that kid, as I understand, went through a phase as a teen where he binge watched a whole bunch of movies. I don’t remember the exact number, but it seems like at least 50 movies in succession. I read somewhere that people would record movies on VHS tapes and then binge watched them, so maybe that’s what this kid was doing. I also read somewhere that the person had never binge watched 50 movies in succession again. That’s the truth and it’s a shame. That’s how you know the world is changing in a horrible way. The binge watcher, the VHS watcher, the guy who turns a whole bunch of movies into a marathon and then stops. The person who made that guy stop. That’s why I’m writing this: to prevent you from reading this, and I’m sorry. I’m sorry that you’ll never turn
Time for inference 1: 8.27 sec total, 24.17 tokens/sec
Bandwidth achieved: 106.17 GB/s
```

and

```
% python3 quantize.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --mode int4 --device cpu
% python3 generate.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model_int4.g32.cpu.pth --prompt "Once upon a time" --device cpu
...
Using int4 weight-only quantization!
Time to load model: 0.09 seconds
Once upon a time, I was ith the new movie. Welcome to the third installment of the Once Upon a Time! series. This time around, I’ve decided to focus on a movie that has had its fair share of publicity and fame, but one that I was not familiar with before.
The movie in question is the 2004 remake of the classic fairy tale The Three Little Pigs, which was released the same year as Pirates of the Caribbean: The Curse of the Black Pearl and the 2007 adaptation of the classic novel The Lion King. It was the first film in the Once Upon a Time! series that I had not seen, and as such, I was only familiar with the first half of the story. I was intrigued by the story, and I knew that I would be interested in seeing the movie when I was able. I had watched a bunch of trailers and clips to get an idea of what the movie was going
Time for inference 2: 27.75 sec total, 7.21 tokens/sec
Bandwidth achieved: 31.65 GB/s
```
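The "Bandwidth achieved" figure in the logs above is, to my reading of gpt-fast's generate.py, the model's byte size times the decode rate: token-by-token decoding is memory-bound, so each generated token streams the full weight set from memory once. A minimal sketch under that assumption (the function name here is mine, not from the repo):

```python
def achieved_bandwidth_gbs(model_size_bytes: float, tokens_per_sec: float) -> float:
    """Bandwidth estimate for weight-bound decoding: every generated token
    reads all weights once, so GB/s ~= model bytes * tokens/sec."""
    return model_size_bytes * tokens_per_sec / 1e9

# Sanity check: both runs above should imply the same checkpoint size.
cuda_gb = 106.17 / 24.17  # ~4.39 GB implied by the CUDA run
cpu_gb = 31.65 / 7.21     # ~4.39 GB implied by the CPU run
```

Both runs back out the same ~4.4 GB int4 checkpoint, which is a useful consistency check on the reported numbers.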
2 parents 7d45270 + bc50dc0 commit 095b222

File tree

1 file changed: +5 −11 lines changed

quantize.py

Lines changed: 5 additions & 11 deletions
```diff
@@ -486,7 +486,7 @@ def __init__(
         bias=True, device=None, dtype=None, groupsize: int = 128, inner_k_tiles: int = 8, use_cuda=True,
     ) -> None:
         super().__init__()
-        self.padding = _check_linear_int4_k(in_features, groupsize, inner_k_tiles)
+        self.padding = not _check_linear_int4_k(in_features, groupsize, inner_k_tiles)
         if self.padding:
             from model import find_multiple
             self.origin_in_features = in_features
@@ -500,16 +500,10 @@ def __init__(

         assert out_features % 8 == 0, "require out_features % 8 == 0"
         assert in_features % (inner_k_tiles * 16) == 0, "require in_features % (innerKTiles * 16) == 0"
-        if use_cuda:
-            self.register_buffer(
-                "weight",
-                torch.empty((out_features // 8, in_features // (inner_k_tiles * 16), 32, inner_k_tiles // 2), dtype=torch.int32)
-            )
-        else:
-            self.register_buffer(
-                "weight",
-                torch.empty((out_features, in_features // 2), dtype=torch.uint8)
-            )
+        self.register_buffer(
+            "weight",
+            torch.empty((out_features // 8, in_features // (inner_k_tiles * 16), 32, inner_k_tiles // 2), dtype=torch.int32)
+        )
         self.register_buffer(
             "scales_and_zeros",
             torch.empty((in_features // groupsize, out_features, 2), dtype=torch.bfloat16)
```
