Tokenizing with apply_chat_template behaves differently from regular tokenizing #37686

Open
sayanshaw24 opened this issue Apr 22, 2025 · 1 comment
@sayanshaw24

System Info

Using the latest transformers v4.51.3 and Python 3.11.9 on Linux (though the problem is platform-agnostic), tokenizing with apply_chat_template and tokenize=True behaves differently from first calling apply_chat_template with tokenize=False and then calling encode on the result.

For instance, with the google/gemma-3-1b-it tokenizer in this example:

Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
>>> model_id = "google/gemma-3-1b-it"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [{"type": "text", "text": "What is 2 + 3?"},]
...         },
...     ],
... ]
>>> tokenizer.apply_chat_template(messages, tokenize=False)
['<bos><start_of_turn>user\nWhat is 2 + 3?<end_of_turn>\n']

The result of tokenizing with apply_chat_template and tokenize=True is as follows:

>>> tokenizer.apply_chat_template(messages, tokenize=True)
[[2, 105, 2364, 107, 3689, 563, 236743, 236778, 900, 236743, 236800, 236881, 106, 107]]

whereas doing this separately using encode results in:

>>> tokenizer.encode(tokenizer.apply_chat_template(messages, tokenize=False)[0])
[2, 2, 105, 2364, 107, 3689, 563, 236743, 236778, 900, 236743, 236800, 236881, 106, 107]

Note that calling apply_chat_template first and then encode on its output produces two BOS tokens (input id 2), which is redundant; this is the only difference, and it holds for other examples as well. Logically, the result should be the same as tokenizing with apply_chat_template and tokenize=True.

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Ensure environment parity (latest transformers v4.51.3 and Python 3.11.9 on Linux)
  2. Load HF tokenizer for google/gemma-3-1b-it
  3. Run the example above, i.e., call apply_chat_template with tokenize=True and compare the result with first calling apply_chat_template with tokenize=False and then calling encode on its output (see the consolidated sketch below).
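
A minimal, self-contained sketch of the comparison (model ID and example message taken from the session above):

from transformers import AutoTokenizer

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    [
        {"role": "user", "content": [{"type": "text", "text": "What is 2 + 3?"}]},
    ],
]

# Path 1: apply_chat_template tokenizes directly.
ids_direct = tokenizer.apply_chat_template(messages, tokenize=True)[0]

# Path 2: render the prompt first, then encode it separately.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)[0]
ids_two_step = tokenizer.encode(prompt)

print(ids_direct)    # [2, 105, 2364, ...]
print(ids_two_step)  # [2, 2, 105, 2364, ...]  <- extra BOS at the front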

Expected behavior

Tokenizing with apply_chat_template and tokenize=True should produce the same result as first calling apply_chat_template with tokenize=False and then calling encode on its output.

Also, there should not be two BOS tokens in the result of the latter, since the second one is redundant.
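
In terms of the sketch above (variable names taken from that sketch), the expectation is simply:

assert ids_direct == ids_two_step  # currently fails: ids_two_step has an extra leading BOS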

@zucchini-nlp
Member

Hey @sayanshaw24! This is expected behavior, as the chat template itself adds the special tokens in the Jinja template. If you want to format the prompt and encode it later, you need to set add_special_tokens=False.

We have it documented in https://huggingface.co/docs/transformers/main/en/chat_templating#model-training, near the end of the section :)
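
A quick sketch of the suggested fix for the two-step path (same tokenizer and messages as in the report):

prompt = tokenizer.apply_chat_template(messages, tokenize=False)[0]

# The chat template already inserted <bos>, so tell encode not to add special tokens again.
ids_two_step = tokenizer.encode(prompt, add_special_tokens=False)

assert ids_two_step == tokenizer.apply_chat_template(messages, tokenize=True)[0]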
