</div>
</div>
# Gemma2
## Overview
[Gemma 2](https://huggingface.co/papers/2408.00118) is a family of language models with pretrained and instruction-tuned variants, available in 2B, 9B, and 27B parameter sizes. The architecture is similar to the previous Gemma, except it features interleaved local attention (4096 tokens) and global attention (8192 tokens), and grouped-query attention (GQA) to speed up inference.
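
As a quick sanity check of these settings, you can inspect a checkpoint's configuration. The sketch below assumes the attribute names exposed by the Gemma 2 config in Transformers and uses the 9B checkpoint only as an example.

```python
from transformers import AutoConfig

# Gemma 2 checkpoints are gated, so you may need to log in to the Hub first
config = AutoConfig.from_pretrained("google/gemma-2-9b")

print(config.sliding_window)            # local attention window (4096 tokens)
print(config.max_position_embeddings)   # global context length (8192 tokens)
print(config.num_attention_heads, config.num_key_value_heads)  # GQA: fewer key/value heads than query heads
```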
The 2B and 9B models are trained with knowledge distillation, and the instruction-tuned variant was post-trained with supervised fine-tuning and reinforcement learning.
You can find all the original Gemma 2 checkpoints under the [Gemma 2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) release.
> [!TIP]
> Click on the Gemma 2 models in the right sidebar for more examples of how to apply Gemma to different language tasks.
<Tip warning={true}>
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Pedro Cuenca](https://huggingface.co/pcuenq) and [Tom Arsen]().
The example below demonstrates how to chat with the model using [`Pipeline`], the [`AutoModel`] class, or the command line.
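
A minimal [`Pipeline`] sketch is shown below; the instruction-tuned `google/gemma-2-9b-it` checkpoint and the prompt are only examples, and passing chat messages directly to the pipeline assumes a recent Transformers release.

```python
import torch
from transformers import pipeline

# load an instruction-tuned Gemma 2 checkpoint (example choice) in bfloat16
pipe = pipeline(
    task="text-generation",
    model="google/gemma-2-9b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# chat-style input; the pipeline applies the model's chat template
messages = [{"role": "user", "content": "Explain the difference between local and global attention."}]
output = pipe(messages, max_new_tokens=64)
print(output[0]["generated_text"][-1]["content"])
```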
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# quantize only the weights to 4-bit with bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# the 27B checkpoint is used as an example; any Gemma 2 checkpoint works
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")

inputs = tokenizer("Explain how sliding window attention works.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
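
A minimal sketch of how the visualizer might be used, assuming the class at that path accepts a checkpoint name and is then called on a prompt (the 2B checkpoint and the prompt are placeholders):

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

# visualize which tokens each position can attend to for a Gemma 2 checkpoint
visualizer = AttentionMaskVisualizer("google/gemma-2-2b")
visualizer("You are an assistant. Make sure you print me")
```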