**[Gemma 2](https://arxiv.org/pdf/2408.00118)** is Google's open-weight language model family (2B, 9B, 27B parameters) featuring interleaved local-global attention (4K sliding window + 8K global context), knowledge distillation for smaller models, and GQA for efficient inference. The 27B variant rivals models twice its size, scoring 75.2 on MMLU and 74.0 on GSM8K, while the instruction-tuned versions excel in multi-turn chat.
Key improvements over Gemma 1 include deeper networks, logit soft-capping, and stricter safety filters (<0.1% memorization). Available in base and instruction-tuned variants.
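These architectural details are exposed on the model config. Below is a small inspection sketch, assuming the `google/gemma-2-9b` checkpoint and the attribute names used by `Gemma2Config` in recent transformers releases:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-2-9b")
print(config.sliding_window)           # size of the local attention window on the sliding-window layers
print(config.num_attention_heads)      # number of query heads
print(config.num_key_value_heads)      # fewer key/value heads than query heads -> GQA
print(config.final_logit_softcapping)  # logit soft-capping threshold
```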
The original checkpoints of Gemma 2 can be found [here](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315).
> [!TIP]
> Click on the Gemma 2 models in the right sidebar for more examples of how to apply Gemma 2 to different language tasks.
- The original checkpoints can be converted using the conversion script `src/transformers/models/Gemma2/convert_Gemma2_weights_to_hf.py`.
<Tip warning={true}>
- Gemma2 uses sliding window attention every second layer, which makes it unsuitable for typical kv caching with [`~DynamicCache`] or tuples of tensors. To enable caching in the Gemma2 forward call, you must initialize a [`~HybridCache`] instance and pass it as `past_key_values` to the forward call. Note that you also have to prepare `cache_position` if the `past_key_values` already contains previous keys and values. A minimal sketch follows this tip.
</Tip>
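A minimal sketch of the cache setup described above, assuming the `google/gemma-2-2b` checkpoint; the exact `HybridCache` keyword arguments may differ slightly across transformers versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HybridCache

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("My name is Gemma.", return_tensors="pt").to(model.device)

# Pre-allocate a HybridCache sized for the prompt plus the tokens to be generated
past_key_values = HybridCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=inputs.input_ids.shape[1] + 20,
    device=model.device,
    dtype=model.dtype,
)
outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

`generate` keeps `cache_position` up to date on its own; only when calling `forward` directly with a pre-filled cache do you need to prepare it yourself, as noted in the tip.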
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Pedro Cuenca](https://huggingface.co/pcuenq) and [Tom Arsen]().
The example below demonstrates how to generate text with the [`Pipeline`] or [`AutoModel`] class, or from the command line.

```bash
echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model google/gemma-2-2b --device 0
```
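A minimal [`Pipeline`] sketch, assuming the same `google/gemma-2-2b` checkpoint used above (any Gemma 2 size works the same way):

```python
import torch
from transformers import pipeline

# Text-generation pipeline; device_map="auto" places the model on available accelerators
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
out = pipe("Plants create energy through a process known as", max_new_tokens=30)
print(out[0]["generated_text"])
```

The [`AutoModel`] route follows the same pattern as the quantized example below, minus the `quantization_config`.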
### Quantized version through `bitsandbytes`
Quantization reduces model size and speeds up inference by converting high-precision numbers (e.g., 32-bit floats) to lower-precision formats (e.g., 8-bit integers), with minimal accuracy loss.
#### Using 8-bit precision (int8)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to int8 at load time
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
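A quick generation check with the quantized model, reusing the prompt from the command-line example above:

```python
inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```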