Open
Labels
python (Python binding-related)
Description
Creating a pre-tokenizer with projection="surface" specified does not override the projection configured on the dictionary.
In test:
sudachi.rs/python/tests/test_pretokenizers.py
Lines 105 to 119 in 232d9ee
def test_projection_surface_override(self):
    dictobj = sudachipy.Dictionary(config=sudachipy.config.Config(projection="reading"))
    pretok = dictobj.pre_tokenizer(sudachipy.SplitMode.A, projection="surface")
    vocab = {
        "[UNK]": 0,
        "サケ": 1,
        "ヒト": 2,
        "ノム": 3,
        "ヲ": 5,
        "外国人参政権": 4
    }
    tok = tokenizers.Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
    tok.pre_tokenizer = pretok
    res = tok.encode("酒を飲む人")
    self.assertEqual(res.ids, [1, 5, 3, 2])
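The expected ids [1, 5, 3, 2] correspond to the reading forms サケ, ヲ, ノム, ヒト, so the test asserts that the dictionary's "reading" projection is still applied even though the pre-tokenizer was created with projection="surface".
Is this intentional, or is it a bug?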
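To make the mismatch concrete, here is a minimal sketch that uses a plain dict lookup as a stand-in for the word-level vocabulary. It does not call Sudachi or tokenizers. The helper `word_level_ids` is hypothetical, and both token lists are assumed segmentations of the input, written by hand for illustration.

```python
# Vocabulary copied from the test above.
vocab = {
    "[UNK]": 0,
    "サケ": 1,
    "ヒト": 2,
    "ノム": 3,
    "ヲ": 5,
    "外国人参政権": 4,
}


def word_level_ids(tokens, vocab, unk_token="[UNK]"):
    """Hypothetical stand-in for a word-level lookup: unknown tokens get the unk id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


# Assumed segmentations of "酒を飲む人" (written by hand, not produced by Sudachi).
reading_tokens = ["サケ", "ヲ", "ノム", "ヒト"]  # what a "reading" projection would emit
surface_tokens = ["酒", "を", "飲む", "人"]      # what a "surface" projection would emit

reading_ids = word_level_ids(reading_tokens, vocab)
surface_ids = word_level_ids(surface_tokens, vocab)

print(reading_ids)  # [1, 5, 3, 2]
print(surface_ids)  # [0, 0, 0, 0]
```

Under these assumptions, only reading-form tokens can produce [1, 5, 3, 2]. Surface-form tokens are not in the vocabulary at all, so a working surface override would be expected to yield only the [UNK] id.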