* Update Cohere model card to follow standard template
* Update docs/source/en/model_doc/cohere.md
* Update code snippet for AutoModel, quantization, and transformers-cli

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

# Cohere

Cohere Command-R is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens.

Command-R was introduced in the blog post [Command-R: Retrieval Augmented Generation at Production Scale](https://txt.cohere.com/command-r/) by the Cohere team:

*Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. Today, we are introducing Command-R, a new LLM aimed at large-scale production workloads. Command-R targets the emerging “scalable” category of models that balance high efficiency with strong accuracy, enabling companies to move beyond proof of concept, and into production.*

*Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with our industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:*

- Strong accuracy on RAG and Tool Use
- Low latency, and high throughput
- Longer 128k context and lower pricing
- Strong capabilities across 10 key languages
- Model weights available on HuggingFace for research and evaluation

You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection; the checkpoint used in the examples below is [c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01).

This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The Hugging Face implementation is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox).

> [!TIP]
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.

The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which the `AutoModel` API uses to cast the checkpoints from `torch.float32` to `torch.float16`.

Training the model in `float16` is not recommended and is known to produce `nan` values; the model should instead be trained in `bfloat16`.

The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The model is first downloaded (using the `dtype` of the checkpoints online), then cast to the default `dtype` of `torch` (`torch.float32`), and finally, if a `torch_dtype` is provided in the config, it is used.
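
For example, a minimal sketch of loading the checkpoint directly in the `dtype` saved in its config (here `float16`) rather than upcasting to `float32`:

```python
from transformers import AutoModelForCausalLM

# torch_dtype="auto" reads the dtype from the checkpoint config instead of defaulting to float32
model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", torch_dtype="auto")
```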

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], or from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="text-generation",
    model="CohereForAI/c4ai-command-r-v01",
    torch_dtype=torch.float16,
    device=0
)
pipeline("Plants create energy through a process known as")
```

</hfoption>
<hfoption id="AutoModel">

The model and tokenizer can be loaded as follows:

```python
# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-v01",
    torch_dtype=torch.float16,
    device_map="auto"
)

inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

</hfoption>
<hfoption id="transformers-cli">

```bash
# pip install -U flash-attn --no-build-isolation
transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.

```python
# pip install bitsandbytes accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-v01",
    quantization_config=quantization_config,
    device_map="auto"
)

inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
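
Below is a minimal sketch of how the visualizer might be used, assuming it is constructed from a checkpoint name and called on a prompt as in other model cards:

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

# Build the visualizer for the Command-R checkpoint and render the attention mask for a prompt
visualizer = AttentionMaskVisualizer("CohereForAI/c4ai-command-r-v01")
visualizer("Plants create energy through a process known as")
```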

## Notes

- Don't use the `torch_dtype` parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports `fp16` or `bf16`. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set `fp16` or `bf16` to `True` if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast), as in the sketch below.
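
A minimal sketch of that setup (the `output_dir` value is just a placeholder):

```python
# pip install -U flash-attn --no-build-isolation
from transformers import AutoModelForCausalLM, TrainingArguments

# Load with FlashAttention-2 but without torch_dtype; mixed precision handles the fp16/bf16 cast
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-v01",
    attn_implementation="flash_attention_2",
)

# With Trainer, enable mixed precision by setting bf16=True (or fp16=True)
args = TrainingArguments(output_dir="command-r-finetune", bf16=True)
```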