create pre-tokenizer with surface-projection does not override dictionary-projection #259

@mh-northlander

Creating a pre-tokenizer with the surface projection specified does not override the dictionary's projection.

In test:

def test_projection_surface_override(self):
    dictobj = sudachipy.Dictionary(config=sudachipy.config.Config(projection="reading"))
    pretok = dictobj.pre_tokenizer(sudachipy.SplitMode.A, projection="surface")
    vocab = {
        "[UNK]": 0,
        "サケ": 1,
        "ヒト": 2,
        "ノム": 3,
        "ヲ": 5,
        "外国人参政権": 4
    }
    tok = tokenizers.Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
    tok.pre_tokenizer = pretok
    res = tok.encode("酒を飲む人")
    self.assertEqual(res.ids, [1, 5, 3, 2])

Is this intentional, or is it a bug?
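
For reference, the precedence one might expect can be sketched as below. This is only an illustration under an assumption: `resolve_projection` is a hypothetical helper, not part of sudachipy's API, and the sketch says nothing about how the library actually resolves the setting.

```python
from typing import Optional


def resolve_projection(dictionary_projection: str,
                       override: Optional[str] = None) -> str:
    """Hypothetical helper (not sudachipy API): pick the effective projection.

    An explicitly passed per-call projection takes precedence; otherwise
    the projection configured on the dictionary is used.
    """
    return override if override is not None else dictionary_projection


# The dictionary is configured with "reading", the pre-tokenizer asks for "surface".
effective = resolve_projection("reading", override="surface")
print(effective)
```

Under that rule the pre-tokenizer in the test would emit surface forms, whereas the IDs the test gets back correspond to the reading forms in the vocabulary.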

Labels: python (Python binding-related)
