Tokenizing with apply_chat_template behaves differently from regular tokenizing #37686
System Info
Using the latest `transformers` v4.51.3 and Python 3.11.9 on Linux (but the problem is platform-generic), tokenization with `apply_chat_template` when setting `tokenize = True` behaves differently from first calling `apply_chat_template` and then calling `encode` on the result of that. For instance, with the `google/gemma-3-1b-it` tokenizer, the two-step variant yields two BOS tokens (input id `2`) at the start, which is redundant; this is the only difference, even on other examples. Logically, the two-step result should be the same as tokenization with `apply_chat_template` when setting `tokenize = True`.
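A minimal sketch of the comparison (the chat messages and the helper name `compare_chat_tokenization` are illustrative, not from the original report):

```python
def compare_chat_tokenization(tokenizer, messages):
    """Tokenize a chat two ways and return both id lists.

    one_step: apply_chat_template(..., tokenize=True)
    two_step: apply_chat_template(..., tokenize=False), then encode()
    """
    one_step = tokenizer.apply_chat_template(messages, tokenize=True)
    rendered = tokenizer.apply_chat_template(messages, tokenize=False)
    two_step = tokenizer.encode(rendered)
    return one_step, two_step

# Usage (downloads the tokenizer, so shown commented out):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
# one_step, two_step = compare_chat_tokenization(
#     tok, [{"role": "user", "content": "Hello!"}]
# )
# Reported: two_step has an extra BOS (id 2) at the front; otherwise identical.
```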
Who can help?
@ArthurZucker and @itazap
Reproduction
(Using `transformers` v4.51.3 and Python 3.11.9 on Linux.) Load the `google/gemma-3-1b-it` tokenizer, call `apply_chat_template` with `tokenize = True`, and compare with the results of first calling `apply_chat_template` with `tokenize = False` and then calling `encode` on the result of that.

Expected behavior
Tokenization with `apply_chat_template` when setting `tokenize = True` should produce the same result as first calling `apply_chat_template` and then calling `encode` on the result of that. Also, there should logically not be two BOS tokens in the result of the latter, since the second one is redundant.
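Presumably the duplication arises because the rendered chat template already begins with the BOS text while `encode` adds special tokens again by default, so passing `add_special_tokens=False` to `encode`, or stripping the duplicate afterwards, works around it for now. A sketch of the latter (the helper name and the hard-coded BOS id are assumptions based on this report):

```python
def strip_duplicate_bos(ids, bos_id=2):
    """Drop one redundant leading BOS if the sequence starts with two.

    bos_id=2 matches the Gemma BOS id mentioned in the report; other
    models use different ids (tokenizer.bos_token_id in general).
    """
    if len(ids) >= 2 and ids[0] == bos_id and ids[1] == bos_id:
        return ids[1:]
    return ids
```

Alternatively, `tokenizer.encode(rendered, add_special_tokens=False)` should avoid adding a second BOS in the first place, since the rendered template already carries one.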