Avoid adding space when decoding tokenization #37659

Open
cikay opened this issue Apr 21, 2025 · 0 comments
Labels
Feature request Request for a new feature

Comments

@cikay

cikay commented Apr 21, 2025

Feature request

Hi, I trained a tokenizer whose tokens contain spaces. When I decode, the `decode` method adds a space between tokens, which corrupts the output, so I need to avoid that. How can I do it?

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer", revision="v1.0")

test_text = """
Ez ê di vê gotarê da qala ên ku ez guhdar û temaşe dikim bikim
"""

tokens = tokenizer.tokenize(test_text)

print(f"Tokens: {tokens}")
```

output: Tokens: ['\n', 'Ez ê ', 'di vê ', 'got', 'arê ', 'da ', 'qala ', 'ên ku ', 'ez ', 'guh', 'dar û ', 'temaşe ', 'dikim ', 'bikim', '\n']

```python
ids = tokenizer.encode(test_text)
print(f"IDs: {ids}")
```

output: IDs: [6, 6271, 1323, 452, 462, 396, 2409, 566, 654, 1204, 3278, 4543, 7880, 7595, 6]

```python
text = tokenizer.decode(ids)

print(f"text: {text}")
```
output:
text:
Ez ê di vê got arê da qala ên ku ez guh dar û temaşe dikim bikim
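
The split words ("got arê", "guh dar") are consistent with the decoder joining tokens with a space separator, even though the tokens already carry their own trailing spaces. A minimal pure-Python illustration, using the token list printed above:

```python
# Tokens as printed by tokenizer.tokenize() above.
tokens = ['\n', 'Ez ê ', 'di vê ', 'got', 'arê ', 'da ', 'qala ',
          'ên ku ', 'ez ', 'guh', 'dar û ', 'temaşe ', 'dikim ', 'bikim', '\n']

# Joining with a space separator inserts a break even between
# subword pieces such as 'got' + 'arê ' -> 'got arê ':
joined_with_space = " ".join(tokens)

# Plain concatenation recovers the original text, because the
# tokens already contain their own spacing:
joined_plain = "".join(tokens)

print(repr(joined_plain))
```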

As you can see, it adds an extra space between tokens when decoding. I know I can do something like the snippet below, but I am curious whether transformers supports this built-in.

```python
individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]

text = "".join(individual_tokens)
```

Motivation

Avoid having to write custom code to prevent spaces from being added between tokens when decoding.

Your contribution

No
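
For reference, the tokenizers library (which backs `PreTrainedTokenizerFast`) allows replacing the backend decoder, and its `Fuse` decoder concatenates tokens without inserting a separator. A sketch under those assumptions; the tiny `WordLevel` vocabulary here is made up purely for illustration:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import WordLevel

# Made-up vocabulary whose tokens carry their own spaces.
vocab = {"Ez ê ": 0, "di vê ": 1, "got": 2, "arê ": 3, "[UNK]": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

print(tok.decode([0, 1, 2, 3]))  # default: tokens joined with spaces

# Fuse concatenates the tokens into one string with no separator.
tok.decoder = decoders.Fuse()
print(tok.decode([0, 1, 2, 3]))
```

If this applies to your tokenizer, the same idea on a loaded fast tokenizer would look like `tokenizer.backend_tokenizer.decoder = decoders.Fuse()`, though whether that suits your model is worth verifying.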

@cikay cikay added the Feature request Request for a new feature label Apr 21, 2025