Add support for custom trained PunktTokenizer in PreProcessor #2773
@danielbichuetti sounds like a nice addition. I have a few questions:
Hi @tstadel. I've come into this idea based on the legal domain. What happens with PunktTokenizer default model ? It's trained on common news and some books. Since it uses an unsupervised algorithm when trying to get the split sentences on some legal documents, sentence get split on some dots that abbreviate articles, court names, and other specific domain ones. I did a test of the default models for NLTK, Spacy and Stanza. The best default model for portuguese (my scenario) was Stanza. But we could get lots of improvements on NLTK using PunktTrainer with a small corpus of legal documents with some abbreviations. I think these errors in split sentences probably occurs in other domains that make usage of lots of dots inside sentences. I didn't make a detailed study of the percentages regarding the full retrieval / QA results improvement in Haystack. But when making tests using GPT-3 which has a huge max token size we got questions not being answered which were present in the text, just because of the bad sentence split. If it happens breaking the law fundaments (article) of a judicial decision, models won't be able to correctly infer. Or when it breaks the judge name and so on. On the law domain, these abbreviations often carry a very important information. My first idea was a parameter on PreProcessor which would represent a directory where custom models could be stored using ISO like:
If a model for that specific language is present in this folder, PreProcessor would use it; if not, it would fall back to the default one. Since pre-processing is an NLP task closely tied to the domain of the text and to the specific use case, anyone who wants to could keep one folder for legal, another for medical, and so on, and point PreProcessor to the right one via a parameter. And of course, I could share models.
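A rough sketch of the fallback logic I have in mind (function and folder names here are made up, not an existing Haystack API):

```python
import pickle
from pathlib import Path

import nltk

# ISO code -> name of the corresponding model bundled with NLTK's punkt data
NLTK_MODEL_NAMES = {"en": "english", "pt": "portuguese", "de": "german"}


def load_sentence_tokenizer(model_folder: str, language: str = "en"):
    """Use a custom Punkt model for `language` if one exists in `model_folder`,
    otherwise fall back to NLTK's default pre-trained model."""
    custom_model = Path(model_folder) / f"{language}.pickle"
    if custom_model.exists():
        with open(custom_model, "rb") as f:
            return pickle.load(f)
    default_name = NLTK_MODEL_NAMES.get(language, "english")
    return nltk.data.load(f"tokenizers/punkt/{default_name}.pickle")
```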
It's now possible to use a custom trained PunktTokenizer: 3948b99.
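For anyone finding this later, usage should look roughly like the sketch below; I'm assuming the new parameters are called `language` and `tokenizer_model_folder`, so please double-check against the PreProcessor docs of your Haystack version:

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_by="sentence",
    split_length=3,
    split_respect_sentence_boundary=False,
    language="pt",                          # ISO code of the custom model
    tokenizer_model_folder="punkt_models",  # folder containing pt.pickle
)
docs = preprocessor.process([Document(content="Conforme o art. 5º da CF/88, ...")])
```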
Hi,
Today, the PreProcessor uses the NLTK PunktTokenizer. The default model is great, except for some specific domains, like the legal one, where the many abbreviations trip it up a little.
I would like to propose, and offer to implement, the possibility of setting a custom trained PunktTokenizer for any set of languages.
For example, the user defines a directory containing model files named by ISO language code; Haystack then searches it for the requested language and, if no model is found, falls back to the default one.
What do you think about this feature? It wouldn't interfere with anything and would just improve specific cases (of which there are many in the NLP domain).
Have a great day!