
fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion#44914

Merged
ArthurZucker merged 1 commit into huggingface:main from maxsloef-goodfire:fix/llama3-clean-up-tokenization-spaces
Mar 23, 2026

Conversation

@maxsloef-goodfire (Contributor) commented Mar 21, 2026

What does this PR do?

The Llama3Converter in convert_llama_weights_to_hf.py hardcodes clean_up_tokenization_spaces=True (line 468). This causes tokenizer.decode() to silently strip spaces before punctuation for all converted Llama 3 models, producing incorrect decoded text.

clean_up_tokenization_spaces applies BERT-era string replacements (` .` → `.`, ` ?` → `?`, etc.) that are destructive for Llama 3's BPE tokenizer. Llama 2's LlamaTokenizer explicitly set this to False, and Llama 4 ships with False. The True was introduced in #30334 by inheriting the library default, then explicitly hardcoded in #33778 for backward compatibility.
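For context, the cleanup pass is roughly the following. This is a paraphrase of the clean_up_tokenization helper in transformers; the exact replacement list may differ slightly between versions, so treat it as an illustration rather than the canonical implementation:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Approximate sketch of transformers' BERT-era cleanup pass.

    These rules make sense for whitespace-split WordPiece output, but
    for a lossless BPE tokenizer they corrupt otherwise exact decodes.
    """
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("x != y and a.b == c"))  # -> "x!= y and a.b == c"
```

The ` !` → `!` rule is what eats the space inside `!=` in the reproduction below; any code-like text containing ` .`, ` ,`, ` ?`, or ` !` is affected the same way.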

Minimal reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)
print(repr(tokenizer.decode(ids)))
# 'x!= y and a.b == c'  ← space before != silently dropped
print(repr(tokenizer.decode(ids, clean_up_tokenization_spaces=False)))
# 'x != y and a.b == c'  ← correct

Tested across every version of transformers from 4.40.0 through 5.3.0 — all produce incorrect decoded text.

Companion fix PRs have been opened on all 24 affected meta-llama model repos on the Hub: Llama-3.1-8B-Instruct discussion #356.

Fixes #35175
Fixes #31187


Who can review?

@ArthurZucker @itazap — tokenizer maintainers. Arthur previously acknowledged this should be False in #35175.

fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion

The Llama3Converter hardcodes clean_up_tokenization_spaces=True, which
applies BERT-era string replacements (` .` → `.`, ` !` → `!`, etc.)
that silently corrupt decoded text from Llama 3's BPE tokenizer.

Llama 2's LlamaTokenizer and Llama 4 both use False. The True was
introduced in PR huggingface#30334 and hardcoded in PR huggingface#33778 for backward compat.

Fixes huggingface#35175
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama

@ArthurZucker (Collaborator) left a comment


LGTM for the conversion, these are removed from releases so it won't change much

@ArthurZucker ArthurZucker merged commit 55cc1a7 into huggingface:main Mar 23, 2026
16 checks passed



Development

Successfully merging this pull request may close these issues.

Detokenization discrepancy with Llama3.1
Original Llama-3 tokenizer behaves differently from transformers version

3 participants