From a50e290b8b6713458938677748a2298f2022e487 Mon Sep 17 00:00:00 2001
From: bimal-gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Thu, 27 Mar 2025 16:54:20 -0700
Subject: [PATCH 01/10] Update Cohere model card to follow standard template
---
docs/source/en/model_doc/cohere.md | 129 +++++++++++++++--------------
1 file changed, 67 insertions(+), 62 deletions(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 2ab75e9d1c8b..093177420491 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -1,42 +1,52 @@
-# Cohere
-
-
-

-

-

+
-## Overview
-The Cohere Command-R model was proposed in the blogpost [Command-R: Retrieval Augmented Generation at Production Scale](https://txt.cohere.com/command-r/) by the Cohere Team.
+# Cohere
-The abstract from the paper is the following:
+The **Cohere Command-R** model was proposed in the blog post: [Command-R: Retrieval Augmented Generation at Production Scale](https://cohere.com/blog/command-r) by the Cohere Team.
-*Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. Today, we are introducing Command-R, a new LLM aimed at large-scale production workloads. Command-R targets the emerging “scalable” category of models that balance high efficiency with strong accuracy, enabling companies to move beyond proof of concept, and into production.*
+Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases.
-*Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with our industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:
-- Strong accuracy on RAG and Tool Use
-- Low latency, and high throughput
-- Longer 128k context and lower pricing
-- Strong capabilities across 10 key languages
-- Model weights available on HuggingFace for research and evaluation
+Key highlights:
+- Strong accuracy on RAG and Tool Use
+- Low latency and high throughput
+- Longer 128k token context length
+- Multilingual support across 10 key languages
+- Model weights available on Hugging Face for research and evaluation
-Checkout model checkpoints [here](https://huggingface.co/CohereForAI/c4ai-command-r-v01).
-This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox).
+You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection.
-## Usage tips
+This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox).
-
+> [!TIP]
+> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
-The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be
-used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
+The example below demonstrates how to generate text with [`Pipeline`] and [`AutoModel`].
-The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used.
+
+
-Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="text-generation",
+    model="CohereForAI/c4ai-command-r-v01",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Plants create energy through a process known as")
+```
-
-The model and tokenizer can be loaded via:
+
+
```python
# pip install transformers
@@ -62,42 +72,13 @@ gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)
```
-- When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, it is simply specifying either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because the Flash Attention only support `fp16` and `bf16` data type.
-
-
-## Resources
-
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Command-R. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
-
-
-
-
-Loading FP16 model
-```python
-# pip install transformers
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-
-# Format message with the command-r chat template
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-## <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+
+
-gen_tokens = model.generate(
-    input_ids,
-    max_new_tokens=100,
-    do_sample=True,
-    temperature=0.3,
-    )
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
-```
+The example below demonstrates loading a 4bit quantized model using [bitsandbytes](../quantization/bitsandbytes).
-Loading bitsnbytes 4bit quantized model
```python
# pip install transformers bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
@@ -119,6 +100,32 @@ gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)
```
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
+
+```py
+from transformers.utils.attention_visualizer import AttentionMaskVisualizer
+
+visualizer = AttentionMaskVisualizer("CohereForAI/c4ai-command-r-v01")
+visualizer("Plants create energy through a process known as")
+```
+
+
+

+
+
+
+## Notes
+
+The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be
+used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
+
+The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model is first downloaded (using the `dtype` of the online checkpoints), then cast to the default `dtype` of `torch` (`torch.float32`), and finally, if a `torch_dtype` is provided in the config, it is used.
+
+Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
+
+
+When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training instead. When using `Trainer`, simply set either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because Flash Attention only supports the `fp16` and `bf16` data types.
+
## CohereConfig
@@ -142,6 +149,4 @@ print(gen_text)
## CohereForCausalLM
[[autodoc]] CohereForCausalLM
- - forward
-
-
+ - forward
\ No newline at end of file
From 00fa1392c36d5a33897811b31beef692598dccdb Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 17:12:05 -0700
Subject: [PATCH 02/10] Update docs/source/en/model_doc/cohere.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/cohere.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 093177420491..0400e8ffbc96 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -9,7 +9,7 @@
# Cohere
-The **Cohere Command-R** model was proposed in the blog post: [Command-R: Retrieval Augmented Generation at Production Scale](https://cohere.com/blog/command-r) by the Cohere Team.
+Cohere Command-R is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens.
Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases.
From 83fa9384faeb17c819d47acb91bf0005359db8f6 Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 17:12:20 -0700
Subject: [PATCH 03/10] Update docs/source/en/model_doc/cohere.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/cohere.md | 6 ------
1 file changed, 6 deletions(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 0400e8ffbc96..85d9adde5d81 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -13,12 +13,6 @@ Cohere Command-R is a 35B parameter multilingual large language model designed f
Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases.
-Key highlights:
-- Strong accuracy on RAG and Tool Use
-- Low latency and high throughput
-- Longer 128k token context length
-- Multilingual support across 10 key languages
-- Model weights available on Hugging Face for research and evaluation
You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection.
From bd65112ca2c26912c7b977746014442ab9f65e39 Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 17:12:28 -0700
Subject: [PATCH 04/10] Update docs/source/en/model_doc/cohere.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/cohere.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 85d9adde5d81..c26770cbcb4c 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -14,7 +14,7 @@ Cohere Command-R is a 35B parameter multilingual large language model designed f
Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases.
-You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection.
+You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox).
From 8e88308d322c4bcb8ddb328c5d0790c2ea5bc163 Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 17:12:36 -0700
Subject: [PATCH 05/10] Update docs/source/en/model_doc/cohere.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/cohere.md | 1 -
1 file changed, 1 deletion(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index c26770cbcb4c..34ee3db34f76 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -16,7 +16,6 @@ Command-R is a **scalable generative model** optimized for long-context tasks su
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
-This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox).
> [!TIP]
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
From 8928e581602e5fbcf6854f33e690dfa57a09f7a1 Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 17:16:02 -0700
Subject: [PATCH 06/10] Update docs/source/en/model_doc/cohere.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/cohere.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 34ee3db34f76..544d1e9ed968 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -70,7 +70,7 @@ print(gen_text)
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
-The example below demonstrates loading a 4bit quantized model using [bitsandbytes](../quantization/bitsandbytes).
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
```python
# pip install transformers bitsandbytes accelerate
From 4a511af87ce34925d8b43bc969dab689a246d9ab Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 17:16:13 -0700
Subject: [PATCH 07/10] Update docs/source/en/model_doc/cohere.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/cohere.md | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 544d1e9ed968..3b29894d0b7f 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -108,17 +108,7 @@ visualizer("Plants create energy through a process known as")
## Notes
-
-The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be
-used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
-
-The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model is first downloaded (using the `dtype` of the online checkpoints), then cast to the default `dtype` of `torch` (`torch.float32`), and finally, if a `torch_dtype` is provided in the config, it is used.
-
-Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
-
-
-When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training instead. When using `Trainer`, simply set either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because Flash Attention only supports the `fp16` and `bf16` data types.
-
+- Don’t use the `torch_dtype` parameter in [`~AutoModel.from_pretrained`] if you’re using FlashAttention-2 because it only supports `fp16` or `bf16`. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set `fp16` or `bf16` to `True` if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
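+
+    A minimal sketch of the [`Trainer`] route described above (the `output_dir` value is only an illustrative placeholder):
+
+    ```python
+    from transformers import TrainingArguments
+
+    # enable bf16 mixed precision through Trainer instead of passing torch_dtype at load time
+    training_args = TrainingArguments(output_dir="command-r-finetune", bf16=True)
+    # pass training_args to Trainer(...) along with a model loaded without torch_dtype
+    ```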
## CohereConfig
From 1c4136cdae7ee75172dc64574ba7c7fd515ba63f Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 18:06:44 -0700
Subject: [PATCH 08/10] Update cohere.md
Update code snippet for AutoModel, quantization, and transformers-cli
---
docs/source/en/model_doc/cohere.md | 55 ++++++++++++++++--------------
1 file changed, 29 insertions(+), 26 deletions(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 3b29894d0b7f..529952d2b4cc 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -20,7 +20,7 @@ You can find all the original Command-R checkpoints under the [Command Models](h
> [!TIP]
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
-The example below demonstrates how to generate text with [`Pipeline`] and [`AutoModel`].
+The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
@@ -42,27 +42,30 @@ pipeline("Plants create energy through a process known as")
```python
-# pip install transformers
+import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-
-# Format message with the command-r chat template
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-## <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
-gen_tokens = model.generate(
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "How do plants make energy?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
-    )
+    cache_implementation="static",
+)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+
+
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
+```bash
+transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2
```
@@ -73,24 +76,24 @@ Quantization reduces the memory burden of large models by representing the weigh
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
```python
-# pip install transformers bitsandbytes accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+import torch
+from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", torch_dtype=torch.float16, device_map="auto", quantization_config=bnb_config, attn_implementation="sdpa")
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
-
-gen_tokens = model.generate(
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "How do plants make energy?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
-    )
-
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
+    cache_implementation="static",
+)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
@@ -132,4 +135,4 @@ visualizer("Plants create energy through a process known as")
## CohereForCausalLM
[[autodoc]] CohereForCausalLM
- - forward
\ No newline at end of file
+ - forward
From 8dc80cfd7bfe999fa721710390380a73c5f121ce Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Wed, 2 Apr 2025 21:15:38 -0700
Subject: [PATCH 09/10] Update cohere.md
---
docs/source/en/model_doc/cohere.md | 3 ---
1 file changed, 3 deletions(-)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 529952d2b4cc..02215fa30fba 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -11,9 +11,6 @@
Cohere Command-R is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens.
-Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases.
-
-
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
From 9cb6478b6c638c23f06062d5648b27bdf2de9e68 Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Wed, 2 Apr 2025 21:16:22 -0700
Subject: [PATCH 10/10] Update docs/source/en/model_doc/cohere.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/cohere.md | 1 +
1 file changed, 1 insertion(+)
diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 02215fa30fba..48b924e1ff13 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -62,6 +62,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```bash
+# pip install -U flash-attn --no-build-isolation
transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2
```