Commit 095b222
Merge pull request #152 from pytorch-labs/malfet-patch-1

Discovered by @HDCharles

Test plan:

```
% python3 quantize.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --mode int4 --device cuda
% python3 generate.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model_int4.g32.cuda.pth --prompt "Once upon a time" --device cuda
...
Using int4 weight-only quantization!
Time to load model: 3.20 seconds
Once upon a time I was a kid. And that kid, as I understand, went through a phase as a teen where he binge watched a whole bunch of movies. I don’t remember the exact number, but it seems like at least 50 movies in succession. I read somewhere that people would record movies on VHS tapes and then binge watched them, so maybe that’s what this kid was doing. I also read somewhere that the person had never binge watched 50 movies in succession again. That’s the truth and it’s a shame. That’s how you know the world is changing in a horrible way. The binge watcher, the VHS watcher, the guy who turns a whole bunch of movies into a marathon and then stops. The person who made that guy stop. That’s why I’m writing this: to prevent you from reading this, and I’m sorry. I’m sorry that you’ll never turn
Time for inference 1: 8.27 sec total, 24.17 tokens/sec
Bandwidth achieved: 106.17 GB/s
```

and

```
% python3 quantize.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --mode int4 --device cpu
% python3 generate.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model_int4.g32.cpu.pth --prompt "Once upon a time" --device cpu
...
Using int4 weight-only quantization!
Time to load model: 0.09 seconds
Once upon a time, I was ith the new movie. Welcome to the third installment of the Once Upon a Time! series. This time around, I’ve decided to focus on a movie that has had its fair share of publicity and fame, but one that I was not familiar with before.
The movie in question is the 2004 remake of the classic fairy tale The Three Little Pigs, which was released the same year as Pirates of the Caribbean: The Curse of the Black Pearl and the 2007 adaptation of the classic novel The Lion King. It was the first film in the Once Upon a Time! series that I had not seen, and as such, I was only familiar with the first half of the story. I was intrigued by the story, and I knew that I would be interested in seeing the movie when I was able. I had watched a bunch of trailers and clips to get an idea of what the movie was going
Time for inference 2: 27.75 sec total, 7.21 tokens/sec
Bandwidth achieved: 31.65 GB/s
```
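The "Bandwidth achieved" figure in the logs above is, to my reading of gpt-fast's generate.py, the model's byte size times the decode rate: token-by-token decoding is memory-bound, so each generated token streams the full weight set from memory once. A minimal sketch under that assumption (the function name here is mine, not from the repo):

```python
def achieved_bandwidth_gbs(model_size_bytes: float, tokens_per_sec: float) -> float:
    """Bandwidth estimate for weight-bound decoding: every generated token
    reads all weights once, so GB/s ~= model bytes * tokens/sec."""
    return model_size_bytes * tokens_per_sec / 1e9

# Sanity check: both runs above should imply the same checkpoint size.
cuda_gb = 106.17 / 24.17  # ~4.39 GB implied by the CUDA run
cpu_gb = 31.65 / 7.21     # ~4.39 GB implied by the CPU run
```

Both runs back out the same ~4.4 GB int4 checkpoint, which is a useful consistency check on the reported numbers.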
2 parents 7d45270 + bc50dc0 commit 095b222

File tree

1 file changed: +5 −11 lines changed

quantize.py

Lines changed: 5 additions & 11 deletions
```diff
@@ -486,7 +486,7 @@ def __init__(
         bias=True, device=None, dtype=None, groupsize: int = 128, inner_k_tiles: int = 8, use_cuda=True,
     ) -> None:
         super().__init__()
-        self.padding = _check_linear_int4_k(in_features, groupsize, inner_k_tiles)
+        self.padding = not _check_linear_int4_k(in_features, groupsize, inner_k_tiles)
         if self.padding:
             from model import find_multiple
             self.origin_in_features = in_features
@@ -500,16 +500,10 @@ def __init__(

         assert out_features % 8 == 0, "require out_features % 8 == 0"
         assert in_features % (inner_k_tiles * 16) == 0, "require in_features % (innerKTiles * 16) == 0"
-        if use_cuda:
-            self.register_buffer(
-                "weight",
-                torch.empty((out_features // 8, in_features // (inner_k_tiles * 16), 32, inner_k_tiles // 2), dtype=torch.int32)
-            )
-        else:
-            self.register_buffer(
-                "weight",
-                torch.empty((out_features, in_features // 2), dtype=torch.uint8)
-            )
+        self.register_buffer(
+            "weight",
+            torch.empty((out_features // 8, in_features // (inner_k_tiles * 16), 32, inner_k_tiles // 2), dtype=torch.int32)
+        )
         self.register_buffer(
             "scales_and_zeros",
             torch.empty((in_features // groupsize, out_features, 2), dtype=torch.bfloat16)
```
