
fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion#44914

Merged
ArthurZucker merged 1 commit into huggingface:main from maxsloef-goodfire:fix/llama3-clean-up-tokenization-spaces
Mar 23, 2026

Conversation

@maxsloef-goodfire (Contributor) commented Mar 21, 2026

What does this PR do?

The Llama3Converter in convert_llama_weights_to_hf.py hardcodes clean_up_tokenization_spaces=True (line 468). This causes tokenizer.decode() to silently strip spaces before punctuation for all converted Llama 3 models, producing incorrect decoded text.

clean_up_tokenization_spaces applies BERT-era string replacements (` .` → `.`, ` ?` → `?`, etc.) that are destructive for Llama 3's BPE tokenizer. Llama 2's LlamaTokenizer explicitly set this to False, and Llama 4 ships with False. The True was introduced in #30334 by inheriting the library default, then explicitly hardcoded in #33778 for backward compatibility.
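For context, the cleanup pass is roughly the following. This is a paraphrase of the clean_up_tokenization helper in transformers; the exact replacement list may differ slightly between versions, so treat it as an illustration rather than the canonical implementation:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Approximate sketch of transformers' BERT-era cleanup pass.

    These rules make sense for whitespace-split WordPiece output, but
    for a lossless BPE tokenizer they corrupt otherwise exact decodes.
    """
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("x != y and a.b == c"))  # -> "x!= y and a.b == c"
```

The ` !` → `!` rule is what eats the space inside `!=` in the reproduction below; any code-like text containing ` .`, ` ,`, ` ?`, or ` !` is affected the same way.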

Minimal reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)
print(repr(tokenizer.decode(ids)))
# 'x!= y and a.b == c'  ← space before != silently dropped
print(repr(tokenizer.decode(ids, clean_up_tokenization_spaces=False)))
# 'x != y and a.b == c'  ← correct

Tested across every version of transformers from 4.40.0 through 5.3.0 — all produce incorrect decoded text.

Companion fix PRs have been opened on all 24 affected meta-llama model repos on the Hub: Llama-3.1-8B-Instruct discussion #356.

Fixes #35175
Fixes #31187


Who can review?

@ArthurZucker @itazap — tokenizer maintainers. Arthur previously acknowledged this should be False in #35175.

fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion

The Llama3Converter hardcodes clean_up_tokenization_spaces=True, which
applies BERT-era string replacements (` .` → `.`, ` !` → `!`, etc.)
that silently corrupt decoded text from Llama 3's BPE tokenizer.

Llama 2's LlamaTokenizer and Llama 4 both use False. The True was
introduced in PR huggingface#30334 and hardcoded in PR huggingface#33778 for backward compat.

Fixes huggingface#35175
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama

@ArthurZucker (Collaborator) left a comment


LGTM for the conversion, these are removed from releases so it won't change much

@ArthurZucker ArthurZucker merged commit 55cc1a7 into huggingface:main Mar 23, 2026
16 checks passed



Development

Successfully merging this pull request may close these issues.

Detokenization discrepancy with Llama3.1
Original Llama-3 tokenizer behaves differently from transformers version

3 participants