From a50e290b8b6713458938677748a2298f2022e487 Mon Sep 17 00:00:00 2001
From: bimal-gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Thu, 27 Mar 2025 16:54:20 -0700
Subject: [PATCH 01/10] Update Cohere model card to follow standard template

---
 docs/source/en/model_doc/cohere.md | 129 +++++++++++++++--------------
 1 file changed, 67 insertions(+), 62 deletions(-)

diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 2ab75e9d1c8b..093177420491 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -1,42 +1,52 @@
-# Cohere
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
-## Overview
-The Cohere Command-R model was proposed in the blogpost [Command-R: Retrieval Augmented Generation at Production Scale](https://txt.cohere.com/command-r/) by the Cohere Team.
+# Cohere
 
-The abstract from the paper is the following:
+The **Cohere Command-R** model was proposed in the blog post: [Command-R: Retrieval Augmented Generation at Production Scale](https://cohere.com/blog/command-r) by the Cohere Team.
 
-*Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. Today, we are introducing Command-R, a new LLM aimed at large-scale production workloads. Command-R targets the emerging “scalable” category of models that balance high efficiency with strong accuracy, enabling companies to move beyond proof of concept, and into production.*
+Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases.
 
-*Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with our industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:
-- Strong accuracy on RAG and Tool Use
-- Low latency, and high throughput
-- Longer 128k context and lower pricing
-- Strong capabilities across 10 key languages
-- Model weights available on HuggingFace for research and evaluation
+Key highlights:
+- Strong accuracy on RAG and Tool Use
+- Low latency and high throughput
+- Longer 128k token context length
+- Multilingual support across 10 key languages
+- Model weights available on Hugging Face for research and evaluation
 
-Checkout model checkpoints [here](https://huggingface.co/CohereForAI/c4ai-command-r-v01).
-This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox).
+You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection.
 
-## Usage tips
+This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox).
 
-<Tip warning={true}>
+> [!TIP]
+> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
 
-The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be
-used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
+The example below demonstrates how to generate text with [`Pipeline`] and [`AutoModel`].
 
-The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`.
-The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used.
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="text-generation",
+    model="CohereForAI/c4ai-command-r-v01",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Plants create energy through a process known as")
+```
 
-</Tip>
-The model and tokenizer can be loaded via:
+</hfoption>
+<hfoption id="AutoModel">
 
 ```python
 # pip install transformers
@@ -62,42 +72,13 @@ gen_text = tokenizer.decode(gen_tokens[0])
 print(gen_text)
 ```
 
-- When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, it is simply specifying either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because the Flash Attention only support `fp16` and `bf16` data type.
-
-
-## Resources
-
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Command-R. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
-
-<PipelineTag pipeline="text-generation"/>
-
-Loading FP16 model
-```python
-# pip install transformers
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-
-# Format message with the command-r chat template
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-## <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+</hfoption>
+</hfoptions>
 
-gen_tokens = model.generate(
-    input_ids,
-    max_new_tokens=100,
-    do_sample=True,
-    temperature=0.3,
-    )
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
 
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
-```
+The example below demonstrates loading a 4bit quantized model using [bitsandbytes](../quantization/bitsandbytes).
 
-Loading bitsnbytes 4bit quantized model
 ```python
 # pip install transformers bitsandbytes accelerate
 from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
@@ -119,6 +100,32 @@ gen_text = tokenizer.decode(gen_tokens[0])
 print(gen_text)
 ```
 
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
+
+```py
+from transformers.utils.attention_visualizer import AttentionMaskVisualizer
+
+visualizer = AttentionMaskVisualizer("CohereForAI/c4ai-command-r-v01")
+visualizer("Plants create energy through a process known as")
+```
+ + +## Notes + +The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be +used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. + +The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used. + +Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`. + + +When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, it is simply specifying either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because the Flash Attention only support `fp16` and `bf16` data type. + ## CohereConfig @@ -142,6 +149,4 @@ print(gen_text) ## CohereForCausalLM [[autodoc]] CohereForCausalLM - - forward - - + - forward \ No newline at end of file From 00fa1392c36d5a33897811b31beef692598dccdb Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:05 -0700 Subject: [PATCH 02/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 093177420491..0400e8ffbc96 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -9,7 +9,7 @@ # Cohere -The **Cohere Command-R** model was proposed in the blog post: [Command-R: Retrieval Augmented Generation at Production Scale](https://cohere.com/blog/command-r) by the Cohere Team. +Cohere Command-R is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens. Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. 
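The `torch_dtype` behavior documented in the Notes section added by PATCH 01 is easy to verify. The following is a minimal sketch, assuming only the public `from_pretrained` API and the fp16 Command-R checkpoint referenced throughout the series (loading the 35B model twice is purely for illustration):

```python
from transformers import AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"

# Without torch_dtype, weights are cast to torch's default dtype (torch.float32),
# even though the checkpoint on the Hub is stored in float16.
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)
print(model_fp32.dtype)  # torch.float32

# With torch_dtype="auto", the dtype saved in the checkpoint config is kept.
model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
print(model_fp16.dtype)  # torch.float16
```

The same flag shows up later as `--torch_dtype auto` in the `transformers-cli` example added by PATCH 08.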
From 83fa9384faeb17c819d47acb91bf0005359db8f6 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:20 -0700 Subject: [PATCH 03/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 0400e8ffbc96..85d9adde5d81 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -13,12 +13,6 @@ Cohere Command-R is a 35B parameter multilingual large language model designed f Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. -Key highlights: -- Strong accuracy on RAG and Tool Use -- Low latency and high throughput -- Longer 128k token context length -- Multilingual support across 10 key languages -- Model weights available on Hugging Face for research and evaluation You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection. From bd65112ca2c26912c7b977746014442ab9f65e39 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:28 -0700 Subject: [PATCH 04/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 85d9adde5d81..c26770cbcb4c 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -14,7 +14,7 @@ Cohere Command-R is a 35B parameter multilingual large language model designed f Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. -You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection. +You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox). 
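The chat template that the code snippets in this series rely on can be inspected directly. A short sketch, assuming only the tokenizer and the `apply_chat_template` API used in the patches above: passing `tokenize=False` returns the formatted prompt string, which makes the Command-R turn tokens visible.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

messages = [{"role": "user", "content": "Hello, how are you?"}]
# tokenize=False returns the prompt string instead of token ids
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
```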
From 8e88308d322c4bcb8ddb328c5d0790c2ea5bc163 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:36 -0700 Subject: [PATCH 05/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index c26770cbcb4c..34ee3db34f76 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -16,7 +16,6 @@ Command-R is a **scalable generative model** optimized for long-context tasks su You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. -This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox). > [!TIP] > Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks. From 8928e581602e5fbcf6854f33e690dfa57a09f7a1 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:16:02 -0700 Subject: [PATCH 06/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 34ee3db34f76..544d1e9ed968 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -70,7 +70,7 @@ print(gen_text) Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. -The example below demonstrates loading a 4bit quantized model using [bitsandbytes](../quantization/bitsandbytes). +The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits. ```python # pip install transformers bitsandbytes accelerate From 4a511af87ce34925d8b43bc969dab689a246d9ab Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:16:13 -0700 Subject: [PATCH 07/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 544d1e9ed968..3b29894d0b7f 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -108,17 +108,7 @@ visualizer("Plants create energy through a process known as") ## Notes - -The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be -used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. - -The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. 
-The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used.
-
-Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
-
-
-When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, it is simply specifying either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because the Flash Attention only support `fp16` and `bf16` data type.
-
+- Don’t use the `torch_dtype` parameter in [`~AutoModel.from_pretrained`] if you’re using FlashAttention-2 because it only supports `fp16` or `bf16`. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set `fp16` or `bf16` to `True` if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
 
 ## CohereConfig
 
From 1c4136cdae7ee75172dc64574ba7c7fd515ba63f Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 18:06:44 -0700
Subject: [PATCH 08/10] Update cohere.md

Update code snippet for AutoModel, quantization, and transformers-cli
---
 docs/source/en/model_doc/cohere.md | 55 ++++++++++++++++--------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 3b29894d0b7f..529952d2b4cc 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -20,7 +20,7 @@ You can find all the original Command-R checkpoints under the [Command Models](h
 > [!TIP]
 > Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
 
-The example below demonstrates how to generate text with [`Pipeline`] and [`AutoModel`].
+The example below demonstrates how to generate text with [`Pipeline`] or [`AutoModel`], and from the command line.
@@ -42,27 +42,30 @@
 
 ```python
-# pip install transformers
+import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-
-# Format message with the command-r chat template
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-## <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
 
-gen_tokens = model.generate(
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "How do plants make energy?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+output = model.generate(
     input_ids,
     max_new_tokens=100,
     do_sample=True,
     temperature=0.3,
-    )
+    cache_implementation="static",
+)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+</hfoption>
+<hfoption id="transformers-cli">
+
+```bash
+transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2
 ```
 
@@ -73,24 +76,24 @@ Quantization reduces the memory burden of large models by representing the weigh
 The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
 
 ```python
-# pip install transformers bitsandbytes accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+import torch
+from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
 
 bnb_config = BitsAndBytesConfig(load_in_4bit=True)
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", torch_dtype=torch.float16, device_map="auto", quantization_config=bnb_config, attn_implementation="sdpa")
 
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
-
-gen_tokens = model.generate(
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "How do plants make energy?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+output = model.generate(
     input_ids,
     max_new_tokens=100,
     do_sample=True,
     temperature=0.3,
-    )
-
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
+    cache_implementation="static",
+)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
 Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
@@ -132,4 +135,4 @@ visualizer("Plants create energy through a process known as") ## CohereForCausalLM [[autodoc]] CohereForCausalLM - - forward \ No newline at end of file + - forward From 8dc80cfd7bfe999fa721710390380a73c5f121ce Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Wed, 2 Apr 2025 21:15:38 -0700 Subject: [PATCH 09/10] Update cohere.md --- docs/source/en/model_doc/cohere.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 529952d2b4cc..02215fa30fba 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -11,9 +11,6 @@ Cohere Command-R is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens. -Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. - - You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. From 9cb6478b6c638c23f06062d5648b27bdf2de9e68 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Wed, 2 Apr 2025 21:16:22 -0700 Subject: [PATCH 10/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 02215fa30fba..48b924e1ff13 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -62,6 +62,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ```bash +# pip install -U flash-attn --no-build-isolation transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2 ```
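The FlashAttention-2 guidance condensed in PATCH 07 pairs with the install hint added in PATCH 10. The following minimal sketch of that pattern loads the model without `torch_dtype` and supplies the compute precision through [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast); it assumes a CUDA device with the flash-attn package installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Per the note in PATCH 07, no torch_dtype is passed here; the FlashAttention-2
# kernels only run in fp16/bf16, so precision comes from autocast at runtime.
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-v01",
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```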