Tokenizing with apply_chat_template behaves differently from regular tokenizing #37686

Open
sayanshaw24 opened this issue Apr 22, 2025 · 1 comment
@sayanshaw24

System Info

Using the latest transformers v4.51.3 and Python 3.11.9 on Linux (though the problem is platform-agnostic), tokenizing with apply_chat_template and tokenize=True behaves differently from first calling apply_chat_template with tokenize=False and then calling encode on the result.

For instance, with the google/gemma-3-1b-it tokenizer in this example:

Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
>>> model_id = "google/gemma-3-1b-it"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [{"type": "text", "text": "What is 2 + 3?"},]
...         },
...     ],
... ]
>>> tokenizer.apply_chat_template(messages, tokenize=False)
['<bos><start_of_turn>user\nWhat is 2 + 3?<end_of_turn>\n']

The result of tokenizing with apply_chat_template and tokenize=True is as follows:

>>> tokenizer.apply_chat_template(messages, tokenize=True)
[[2, 105, 2364, 107, 3689, 563, 236743, 236778, 900, 236743, 236800, 236881, 106, 107]]

whereas doing this separately using encode results in:

>>> tokenizer.encode(tokenizer.apply_chat_template(messages, tokenize=False)[0])
[2, 2, 105, 2364, 107, 3689, 563, 236743, 236778, 900, 236743, 236800, 236881, 106, 107]

Note that calling apply_chat_template first and then encode on its output produces two BOS tokens (input id 2), which is redundant; this is the only difference, and it holds for other examples as well. Logically, the result should be the same as tokenizing with apply_chat_template and tokenize=True.

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Ensure environment parity (latest transformers v4.51.3 and Python 3.11.9 on Linux)
  2. Load HF tokenizer for google/gemma-3-1b-it
  3. Run the example above, i.e., call apply_chat_template with tokenize=True and compare the result with first calling apply_chat_template with tokenize=False and then calling encode on its output (see the consolidated sketch below).
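
A minimal, self-contained sketch of the comparison (model ID and example message taken from the session above):

from transformers import AutoTokenizer

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    [
        {"role": "user", "content": [{"type": "text", "text": "What is 2 + 3?"}]},
    ],
]

# Path 1: apply_chat_template tokenizes directly.
ids_direct = tokenizer.apply_chat_template(messages, tokenize=True)[0]

# Path 2: render the prompt first, then encode it separately.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)[0]
ids_two_step = tokenizer.encode(prompt)

print(ids_direct)    # [2, 105, 2364, ...]
print(ids_two_step)  # [2, 2, 105, 2364, ...]  <- extra BOS at the front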

Expected behavior

Tokenizing with apply_chat_template and tokenize=True should produce the same result as first calling apply_chat_template with tokenize=False and then calling encode on its output.

Also, there should not be two BOS tokens in the result of the latter, since the second one is redundant.
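
In terms of the sketch above (variable names taken from that sketch), the expectation is simply:

assert ids_direct == ids_two_step  # currently fails: ids_two_step has an extra leading BOS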

@zucchini-nlp
Member

Hey @sayanshaw24! This is expected behavior, as the chat template itself adds the special tokens in the Jinja template. If you want to format the prompt and encode it later, you need to set add_special_tokens=False.

We have it documented in https://huggingface.co/docs/transformers/main/en/chat_templating#model-training, near the end of the section :)
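
A quick sketch of the suggested fix for the two-step path (same tokenizer and messages as in the report):

prompt = tokenizer.apply_chat_template(messages, tokenize=False)[0]

# The chat template already inserted <bos>, so tell encode not to add special tokens again.
ids_two_step = tokenizer.encode(prompt, add_special_tokens=False)

assert ids_two_step == tokenizer.apply_chat_template(messages, tokenize=True)[0]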
