From a50e290b8b6713458938677748a2298f2022e487 Mon Sep 17 00:00:00 2001
From: bimal-gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Thu, 27 Mar 2025 16:54:20 -0700
Subject: [PATCH 01/10] Update Cohere model card to follow standard template

---
 docs/source/en/model_doc/cohere.md | 129 +++++++++++++++--------------
 1 file changed, 67 insertions(+), 62 deletions(-)

diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 2ab75e9d1c8b..093177420491 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -1,42 +1,52 @@
-# Cohere
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
-## Overview
-The Cohere Command-R model was proposed in the blogpost [Command-R: Retrieval Augmented Generation at Production Scale](https://txt.cohere.com/command-r/) by the Cohere Team.
+# Cohere
 
-The abstract from the paper is the following:
+The **Cohere Command-R** model was proposed in the blog post: [Command-R: Retrieval Augmented Generation at Production Scale](https://cohere.com/blog/command-r) by the Cohere Team.
 
-*Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. Today, we are introducing Command-R, a new LLM aimed at large-scale production workloads. Command-R targets the emerging “scalable” category of models that balance high efficiency with strong accuracy, enabling companies to move beyond proof of concept, and into production.*
+Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases.
 
-*Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with our industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:
-- Strong accuracy on RAG and Tool Use
-- Low latency, and high throughput
-- Longer 128k context and lower pricing
-- Strong capabilities across 10 key languages
-- Model weights available on HuggingFace for research and evaluation
+Key highlights:
+- Strong accuracy on RAG and Tool Use
+- Low latency and high throughput
+- Longer 128k token context length
+- Multilingual support across 10 key languages
+- Model weights available on Hugging Face for research and evaluation
 
-Checkout model checkpoints [here](https://huggingface.co/CohereForAI/c4ai-command-r-v01).
-This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox).
+You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection.
 
-## Usage tips
+This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox).
 
-<Tip warning={true}>
+> [!TIP]
+> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
 
-The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be
-used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
+The example below demonstrates how to generate text with [`Pipeline`] and [`AutoModel`].
 
-The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`.
-The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used.
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="text-generation",
+    model="CohereForAI/c4ai-command-r-v01",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Plants create energy through a process known as")
+```
 
-</Tip>
-The model and tokenizer can be loaded via:
+</hfoption>
+<hfoption id="AutoModel">
 
 ```python
 # pip install transformers
@@ -62,42 +72,13 @@ gen_text = tokenizer.decode(gen_tokens[0])
 print(gen_text)
 ```
 
-- When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, it is simply specifying either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because the Flash Attention only support `fp16` and `bf16` data type.
-
-
-## Resources
-
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Command-R. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
-
-<PipelineTag pipeline="text-generation"/>
-
-Loading FP16 model
-```python
-# pip install transformers
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-
-# Format message with the command-r chat template
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-## <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+</hfoption>
+</hfoptions>
 
-gen_tokens = model.generate(
-    input_ids,
-    max_new_tokens=100,
-    do_sample=True,
-    temperature=0.3,
-    )
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
 
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
-```
+The example below demonstrates loading a 4bit quantized model using [bitsandbytes](../quantization/bitsandbytes).
 
-Loading bitsnbytes 4bit quantized model
 ```python
 # pip install transformers bitsandbytes accelerate
 from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
@@ -119,6 +100,32 @@ gen_text = tokenizer.decode(gen_tokens[0])
 print(gen_text)
 ```
 
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
+
+```py
+from transformers.utils.attention_visualizer import AttentionMaskVisualizer
+
+visualizer = AttentionMaskVisualizer("CohereForAI/c4ai-command-r-v01")
+visualizer("Plants create energy through a process known as")
+```
+ + +## Notes + +The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be +used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. + +The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used. + +Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`. + + +When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, it is simply specifying either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because the Flash Attention only support `fp16` and `bf16` data type. + ## CohereConfig @@ -142,6 +149,4 @@ print(gen_text) ## CohereForCausalLM [[autodoc]] CohereForCausalLM - - forward - - + - forward \ No newline at end of file From 00fa1392c36d5a33897811b31beef692598dccdb Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:05 -0700 Subject: [PATCH 02/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 093177420491..0400e8ffbc96 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -9,7 +9,7 @@ # Cohere -The **Cohere Command-R** model was proposed in the blog post: [Command-R: Retrieval Augmented Generation at Production Scale](https://cohere.com/blog/command-r) by the Cohere Team. +Cohere Command-R is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens. Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. 
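The `torch_dtype` behavior documented in the Notes section added by PATCH 01 is easy to verify. The following is a minimal sketch, assuming only the public `from_pretrained` API and the fp16 Command-R checkpoint referenced throughout the series (loading the 35B model twice is purely for illustration):

```python
from transformers import AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"

# Without torch_dtype, weights are cast to torch's default dtype (torch.float32),
# even though the checkpoint on the Hub is stored in float16.
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)
print(model_fp32.dtype)  # torch.float32

# With torch_dtype="auto", the dtype saved in the checkpoint config is kept.
model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
print(model_fp16.dtype)  # torch.float16
```

The same flag shows up later as `--torch_dtype auto` in the `transformers-cli` example added by PATCH 08.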
From 83fa9384faeb17c819d47acb91bf0005359db8f6 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:20 -0700 Subject: [PATCH 03/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 0400e8ffbc96..85d9adde5d81 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -13,12 +13,6 @@ Cohere Command-R is a 35B parameter multilingual large language model designed f Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. -Key highlights: -- Strong accuracy on RAG and Tool Use -- Low latency and high throughput -- Longer 128k token context length -- Multilingual support across 10 key languages -- Model weights available on Hugging Face for research and evaluation You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection. From bd65112ca2c26912c7b977746014442ab9f65e39 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:28 -0700 Subject: [PATCH 04/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 85d9adde5d81..c26770cbcb4c 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -14,7 +14,7 @@ Cohere Command-R is a 35B parameter multilingual large language model designed f Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. -You can find all the original Command-R checkpoints under the [Cohere Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) collection. +You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox). 
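The chat template that the code snippets in this series rely on can be inspected directly. A short sketch, assuming only the tokenizer and the `apply_chat_template` API used in the patches above: passing `tokenize=False` returns the formatted prompt string, which makes the Command-R turn tokens visible.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

messages = [{"role": "user", "content": "Hello, how are you?"}]
# tokenize=False returns the prompt string instead of token ids
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
```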
From 8e88308d322c4bcb8ddb328c5d0790c2ea5bc163 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:12:36 -0700 Subject: [PATCH 05/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index c26770cbcb4c..34ee3db34f76 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -16,7 +16,6 @@ Command-R is a **scalable generative model** optimized for long-context tasks su You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. -This model was contributed by [Saurabh Dash](https://huggingface.co/saurabhdash) and [Ahmet Üstün](https://huggingface.co/ahmetustun). The code of the implementation in Hugging Face is based on [GPT-NeoX](https://github.com/EleutherAI/gpt-neox). > [!TIP] > Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks. From 8928e581602e5fbcf6854f33e690dfa57a09f7a1 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:16:02 -0700 Subject: [PATCH 06/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 34ee3db34f76..544d1e9ed968 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -70,7 +70,7 @@ print(gen_text) Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. -The example below demonstrates loading a 4bit quantized model using [bitsandbytes](../quantization/bitsandbytes). +The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits. ```python # pip install transformers bitsandbytes accelerate From 4a511af87ce34925d8b43bc969dab689a246d9ab Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Mon, 31 Mar 2025 17:16:13 -0700 Subject: [PATCH 07/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 544d1e9ed968..3b29894d0b7f 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -108,17 +108,7 @@ visualizer("Plants create energy through a process known as") ## Notes - -The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be -used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. - -The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. 
-The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used.
-
-Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
-
-
-When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, it is simply specifying either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because the Flash Attention only support `fp16` and `bf16` data type.
-
+- Don’t use the `torch_dtype` parameter in [`~AutoModel.from_pretrained`] if you’re using FlashAttention-2 because it only supports `fp16` or `bf16`. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set `fp16` or `bf16` to `True` if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
 
 ## CohereConfig
 
From 1c4136cdae7ee75172dc64574ba7c7fd515ba63f Mon Sep 17 00:00:00 2001
From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com>
Date: Mon, 31 Mar 2025 18:06:44 -0700
Subject: [PATCH 08/10] Update cohere.md

Update code snippet for AutoModel, quantization, and transformers-cli
---
 docs/source/en/model_doc/cohere.md | 55 ++++++++++++++++--------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md
index 3b29894d0b7f..529952d2b4cc 100644
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@@ -20,7 +20,7 @@ You can find all the original Command-R checkpoints under the [Command Models](h
 > [!TIP]
 > Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
 
-The example below demonstrates how to generate text with [`Pipeline`] and [`AutoModel`].
+The example below demonstrates how to generate text with [`Pipeline`] or [`AutoModel`], and from the command line.
@@ -42,27 +42,30 @@
 
 ```python
-# pip install transformers
+import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-
-# Format message with the command-r chat template
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-## <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
 
-gen_tokens = model.generate(
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "How do plants make energy?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+output = model.generate(
     input_ids,
     max_new_tokens=100,
     do_sample=True,
     temperature=0.3,
-    )
+    cache_implementation="static",
+)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+</hfoption>
+<hfoption id="transformers-cli">
+
+```bash
+transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2
 ```
 
@@ -73,24 +76,24 @@ Quantization reduces the memory burden of large models by representing the weigh
 The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
 
 ```python
-# pip install transformers bitsandbytes accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+import torch
+from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
 
 bnb_config = BitsAndBytesConfig(load_in_4bit=True)
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", torch_dtype=torch.float16, device_map="auto", quantization_config=bnb_config, attn_implementation="sdpa")
 
-model_id = "CohereForAI/c4ai-command-r-v01"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
-
-gen_tokens = model.generate(
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "How do plants make energy?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+output = model.generate(
     input_ids,
     max_new_tokens=100,
     do_sample=True,
     temperature=0.3,
-    )
-
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
+    cache_implementation="static",
+)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
 Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
@@ -132,4 +135,4 @@ visualizer("Plants create energy through a process known as") ## CohereForCausalLM [[autodoc]] CohereForCausalLM - - forward \ No newline at end of file + - forward From 8dc80cfd7bfe999fa721710390380a73c5f121ce Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Wed, 2 Apr 2025 21:15:38 -0700 Subject: [PATCH 09/10] Update cohere.md --- docs/source/en/model_doc/cohere.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 529952d2b4cc..02215fa30fba 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -11,9 +11,6 @@ Cohere Command-R is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens. -Command-R is a **scalable generative model** optimized for long-context tasks such as retrieval-augmented generation (RAG) and external tool/API use. It is designed to work alongside Cohere’s Embed and Rerank models to provide best-in-class performance for RAG pipelines, especially in enterprise use cases. - - You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. From 9cb6478b6c638c23f06062d5648b27bdf2de9e68 Mon Sep 17 00:00:00 2001 From: Bimal Gajera <90305421+bimal-gajera@users.noreply.github.com> Date: Wed, 2 Apr 2025 21:16:22 -0700 Subject: [PATCH 10/10] Update docs/source/en/model_doc/cohere.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/cohere.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 02215fa30fba..48b924e1ff13 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -62,6 +62,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ```bash +# pip install -U flash-attn --no-build-isolation transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2 ```
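The FlashAttention-2 guidance condensed in PATCH 07 pairs with the install hint added in PATCH 10. The following minimal sketch of that pattern loads the model without `torch_dtype` and supplies the compute precision through [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast); it assumes a CUDA device with the flash-attn package installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Per the note in PATCH 07, no torch_dtype is passed here; the FlashAttention-2
# kernels only run in fp16/bf16, so precision comes from autocast at runtime.
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-v01",
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```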