fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion #44914
Merged
ArthurZucker merged 1 commit into huggingface:main on Mar 23, 2026
…ersion The Llama3Converter hardcodes clean_up_tokenization_spaces=True, which applies BERT-era string replacements (` .` → `.`, ` !` → `!`, etc.) that silently corrupt decoded text from Llama 3's BPE tokenizer. Llama 2's LlamaTokenizer and Llama 4 both use False. The True was introduced in PR huggingface#30334 and hardcoded in PR huggingface#33778 for backward compat. Fixes huggingface#35175
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: llama
Collaborator
ArthurZucker
left a comment
LGTM for the conversion, these are removed from releases so it won't change much
What does this PR do?
The `Llama3Converter` in `convert_llama_weights_to_hf.py` hardcodes `clean_up_tokenization_spaces=True` (line 468). This causes `tokenizer.decode()` to silently strip spaces before punctuation for all converted Llama 3 models, producing incorrect decoded text.

`clean_up_tokenization_spaces` applies BERT-era string replacements (` .` → `.`, ` ?` → `?`, etc.) that are destructive for Llama 3's BPE tokenizer. Llama 2's `LlamaTokenizer` explicitly set this to `False`, and Llama 4 ships with `False`. The `True` was introduced in #30334 by inheriting the library default, then explicitly hardcoded in #33778 for backward compatibility.

Minimal reproduction
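The effect can be illustrated without downloading a model. The snippet below is a minimal sketch, not the PR's own reproduction: `clean_up` is a hypothetical stand-in for the decode-time cleanup, and its replacement list is an assumption based on the substitutions named in this PR (the ` ,` entry is added for illustration).

```python
# Hypothetical stand-in for the cleanup applied when
# clean_up_tokenization_spaces=True. The replacement list is an assumption
# modeled on the substitutions described in this PR, not the library source.
REPLACEMENTS = [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")]


def clean_up(text: str) -> str:
    """Apply each space-before-punctuation replacement in order."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


# Text that intentionally contains a space before punctuation.
original = "Is this right ? Yes , it is ."
decoded = clean_up(original)

# The round trip is lossy: the spaces the model actually produced are gone.
print(repr(original))
print(repr(decoded))
print("round trip preserved:", decoded == original)
```

Text that never has a space before punctuation passes through unchanged, which is why the corruption is easy to miss in casual testing.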
Tested across every version of `transformers` from 4.40.0 through 5.3.0; all produce incorrect decoded text.

Companion fix PRs have been opened on all 24 affected `meta-llama` model repos on the Hub: Llama-3.1-8B-Instruct discussion #356.

Fixes #35175
Fixes #31187
Before submitting
`transformers` version #31187, Llama3 Tokenizer Decode Removing Space Character #32575

Who can review?
@ArthurZucker @itazap (tokenizer maintainers). Arthur previously acknowledged this should be `False` in #35175.