Multidoc tokenize #1466

AngledLuffa · 2025-02-28T08:28:29Z

When a document is already tokenized and the TokenizeProcessor is set to pretokenized=True, there is no need to retokenize the document text. In fact, retokenizing might not even be possible if the complete document text isn't available.

Addresses #1464
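
A minimal sketch of the behavior described above, assuming a simplified document model. `SketchDocument`, `whitespace_tokenize`, `process_pretokenized`, and this `bulk_process` are hypothetical names for illustration only, not Stanza's actual classes or code: a document that already carries sentences is passed through untouched, and only a document with raw text gets whitespace-tokenized.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document; not Stanza's Document class."""
    text: Optional[str] = None  # the full text may not be available
    sentences: List[List[str]] = field(default_factory=list)


def whitespace_tokenize(text: str) -> List[List[str]]:
    """One sentence per non-empty line, tokens split on whitespace."""
    return [line.split() for line in text.splitlines() if line.strip()]


def process_pretokenized(doc: SketchDocument) -> SketchDocument:
    """Leave an already-segmented document alone; otherwise split its text."""
    if doc.sentences:
        # Already chopped into sentences: nothing to retokenize.
        return doc
    if doc.text is None:
        raise ValueError("document has neither sentences nor text")
    doc.sentences = whitespace_tokenize(doc.text)
    return doc


def bulk_process(docs: List[SketchDocument]) -> List[SketchDocument]:
    """Apply the pretokenized check to each document in a batch."""
    return [process_pretokenized(d) for d in docs]
```

In this sketch the check runs before the text is touched at all, so a document without its complete text can still pass through a batch unchanged.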

AngledLuffa force-pushed the multidoc_tokenize branch from e7797e5 to 07c3fe1 on February 28, 2025 08:39

AngledLuffa added 2 commits February 28, 2025 00:49

Put MISC, START_CHAR, END_CHAR, NER in a canonical order despite pote…

377f8ed

…ntially being added in different orders in the token / word maps. Many tests are updated because SpaceAfter etc. should now be at the start of a misc column

Check if a Document is already chopped into sentences before trying t…

07c3fe1

…o whitespace tokenize it when doing a bulk_process with a pretokenized TokenizeProcessor
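
The canonical ordering from commit 377f8ed can be illustrated with a small sketch. `misc_column` and `CANONICAL_TAIL` are hypothetical names, and the lowercase `key=value` rendering joined by `|` is an assumption about the output format rather than the code in this PR: whatever order the fields were added in, the existing misc value (such as SpaceAfter=No) comes first, followed by the start character, end character, and NER tag.

```python
from typing import Any, Dict

# Assumed canonical order after the misc value; taken from the commit title.
CANONICAL_TAIL = ("start_char", "end_char", "ner")


def misc_column(fields: Dict[str, Any]) -> str:
    """Render a misc column in a fixed order, regardless of insertion order."""
    parts = []
    misc = fields.get("misc")
    if misc:
        # Existing misc content (e.g. SpaceAfter=No) goes at the start.
        parts.append(misc)
    for key in CANONICAL_TAIL:
        if fields.get(key) is not None:
            parts.append(f"{key}={fields[key]}")
    # "_" marks an empty column in CoNLL-U.
    return "|".join(parts) if parts else "_"


# Two maps with the same fields inserted in different orders.
token_map = {"ner": "O", "start_char": 0, "misc": "SpaceAfter=No", "end_char": 4}
word_map = {"misc": "SpaceAfter=No", "start_char": 0, "end_char": 4, "ner": "O"}
print(misc_column(token_map) == misc_column(word_map))
```

A fixed order like this keeps serialized output stable, which is why expected strings in tests would need updating once the order changes.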

AngledLuffa merged commit a447b14 into dev Feb 28, 2025
1 check passed

AngledLuffa deleted the multidoc_tokenize branch February 28, 2025 15:59
