The description of Unigram tokenization in the article seems to be incorrect. See this:
Here are the frequencies of all the possible subwords in the vocabulary:
("h", 15) ("u", 36) ("g", 20) ("hu", 15) ("ug", 20) ("p", 17) ("pu", 17) ("n", 16) ("un", 16) ("b", 4) ("bu", 4) ("s", 5) ("hug", 15) ("gs", 5) ("ugs", 5)
The tokenization probability of ["p", "u", "g"] for "pug" is 5/210 * 36/210 * 20/210
Shouldn't it be 17/210 * 36/210 * 20/210, since "p" has a frequency of 17 in the list above? I'm a beginner too, so I'm not sure which one is correct...
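To double-check the arithmetic, here is a quick sketch that recomputes the probability straight from the frequency list quoted above. It assumes, as the article's formula suggests, that the probability of a segmentation is the product of each token's frequency divided by the total of all frequencies (210). The names `freqs` and `segmentation_probability` are my own, not from the article.

```python
from fractions import Fraction

# Subword frequencies copied from the list quoted above.
freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20,
    "p": 17, "pu": 17, "n": 16, "un": 16, "b": 4,
    "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
total = sum(freqs.values())  # should match the 210 used in the article


def segmentation_probability(tokens):
    """Product of freq(token) / total over the tokens (assumed unigram model)."""
    prob = Fraction(1)
    for token in tokens:
        prob *= Fraction(freqs[token], total)
    return prob


# Value obtained from the table, where "p" has frequency 17.
p_table = segmentation_probability(["p", "u", "g"])

# Value as written in the article, which uses 5 for "p".
p_article = Fraction(5, total) * Fraction(36, total) * Fraction(20, total)

print(total)
print(float(p_table))
print(float(p_article))
```

With the table's value for "p" the result is about 0.00132, while the 5/210 version gives about 0.000389, so the two differ by a factor of 17/5.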