Avoid adding space when decoding tokenization #37659

Open
cikay opened this issue Apr 21, 2025 · 0 comments
Labels
Feature request Request for a new feature

Comments

@cikay

cikay commented Apr 21, 2025

Feature request

Hi, I trained a tokenizer whose tokens contain spaces. When I decode, the `decode` method adds a space between tokens, which corrupts the output, so I need to avoid that. How can I do it?

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer", revision="v1.0")

test_text = """
Ez ê di vê gotarê da qala ên ku ez guhdar û temaşe dikim bikim
"""

tokens = tokenizer.tokenize(test_text)

print(f"Tokens: {tokens}")
```

output: Tokens: ['\n', 'Ez ê ', 'di vê ', 'got', 'arê ', 'da ', 'qala ', 'ên ku ', 'ez ', 'guh', 'dar û ', 'temaşe ', 'dikim ', 'bikim', '\n']

```python
ids = tokenizer.encode(test_text)
print(f"IDs: {ids}")
```

output: IDs: [6, 6271, 1323, 452, 462, 396, 2409, 566, 654, 1204, 3278, 4543, 7880, 7595, 6]

```python
text = tokenizer.decode(ids)

print(f"text: {text}")
```
output:
text:
Ez ê di vê got arê da qala ên ku ez guh dar û temaşe dikim bikim
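
The split words ("got arê", "guh dar") are consistent with the decoder joining tokens with a space separator, even though the tokens already carry their own trailing spaces. A minimal pure-Python illustration, using the token list printed above:

```python
# Tokens as printed by tokenizer.tokenize() above.
tokens = ['\n', 'Ez ê ', 'di vê ', 'got', 'arê ', 'da ', 'qala ',
          'ên ku ', 'ez ', 'guh', 'dar û ', 'temaşe ', 'dikim ', 'bikim', '\n']

# Joining with a space separator inserts a break even between
# subword pieces such as 'got' + 'arê ' -> 'got arê ':
joined_with_space = " ".join(tokens)

# Plain concatenation recovers the original text, because the
# tokens already contain their own spacing:
joined_plain = "".join(tokens)

print(repr(joined_plain))
```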

As you can see, it adds an extra space between tokens when decoding. I know I can do something like the snippet below, but I am curious whether transformers supports this built-in.

```python
individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]

text = "".join(individual_tokens)
```

Motivation

Avoid having to write custom code to prevent spaces from being added between tokens when decoding.

Your contribution

No
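
For reference, the tokenizers library (which backs `PreTrainedTokenizerFast`) allows replacing the backend decoder, and its `Fuse` decoder concatenates tokens without inserting a separator. A sketch under those assumptions; the tiny `WordLevel` vocabulary here is made up purely for illustration:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import WordLevel

# Made-up vocabulary whose tokens carry their own spaces.
vocab = {"Ez ê ": 0, "di vê ": 1, "got": 2, "arê ": 3, "[UNK]": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

print(tok.decode([0, 1, 2, 3]))  # default: tokens joined with spaces

# Fuse concatenates the tokens into one string with no separator.
tok.decoder = decoders.Fuse()
print(tok.decode([0, 1, 2, 3]))
```

If this applies to your tokenizer, the same idea on a loaded fast tokenizer would look like `tokenizer.backend_tokenizer.decoder = decoders.Fuse()`, though whether that suits your model is worth verifying.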

@cikay cikay added the Feature request Request for a new feature label Apr 21, 2025