Feature request
Hi, I trained a tokenizer whose tokens contain spaces as well. When I decode, the decode method adds a space between tokens, which makes the output wrong, and I need to avoid that. How can I do that?
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer", revision="v1.0")
test_text = """
Ez ê di vê gotarê da qala ên ku ez guhdar û temaşe dikim bikim
"""
tokens = tokenizer.tokenize(test_text)
print(f"Tokens: {tokens}")
output: Tokens: ['\n', 'Ez ê ', 'di vê ', 'got', 'arê ', 'da ', 'qala ', 'ên ku ', 'ez ', 'guh', 'dar û ', 'temaşe ', 'dikim ', 'bikim', '\n']
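Note that the spaces are part of the tokens themselves (for example, 'Ez ê ' ends with a trailing space), so plain concatenation of the token strings should reconstruct the input. A quick sanity check, reusing the tokens list from above (assuming no normalization alters the text):
# True if concatenating the tokens round-trips the input text exactly
print("".join(tokens) == test_text)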
ids = tokenizer.encode(test_text)
print(f"IDs: {ids}")
output: IDs: [6, 6271, 1323, 452, 462, 396, 2409, 566, 654, 1204, 3278, 4543, 7880, 7595, 6]
text = tokenizer.decode(ids)
print(f"text: {text}")
output:
text:
Ez ê di vê got arê da qala ên ku ez guh dar û temaşe dikim bikim
As you can see, it adds extra spaces between tokens when decoding. I know I can do something like the snippet below, but I am curious whether transformers supports something like this built-in:
individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]
text = "".join(individual_tokens)
Motivation
Avoid having to write custom code to prevent extra spaces being added between tokens when decoding.
Your contribution
No