Add support for custom trained PunktTokenizer in PreProcessor #2773
@danielbichuetti sounds like a nice addition. I have a few questions:
Hi @tstadel. I've come into this idea based on the legal domain. What happens with PunktTokenizer default model ? It's trained on common news and some books. Since it uses an unsupervised algorithm when trying to get the split sentences on some legal documents, sentence get split on some dots that abbreviate articles, court names, and other specific domain ones. I did a test of the default models for NLTK, Spacy and Stanza. The best default model for portuguese (my scenario) was Stanza. But we could get lots of improvements on NLTK using PunktTrainer with a small corpus of legal documents with some abbreviations. I think these errors in split sentences probably occurs in other domains that make usage of lots of dots inside sentences. I didn't make a detailed study of the percentages regarding the full retrieval / QA results improvement in Haystack. But when making tests using GPT-3 which has a huge max token size we got questions not being answered which were present in the text, just because of the bad sentence split. If it happens breaking the law fundaments (article) of a judicial decision, models won't be able to correctly infer. Or when it breaks the judge name and so on. On the law domain, these abbreviations often carry a very important information. My first idea was a parameter on PreProcessor which would represent a directory where custom models could be stored using ISO like:
If a model for that specific language is present in this folder, PreProcessor would use it; if not, it would fall back to the default one. Since pre-processing is an NLP task closely tied to the domain of the text and to the specific use case, anyone who wants to could keep one folder for legal, another for medical, and so on, and point PreProcessor to the right one via a parameter. And of course, I could share models.
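A rough sketch of the fallback logic I have in mind (function and folder names here are made up, not an existing Haystack API):

```python
import pickle
from pathlib import Path

import nltk

# ISO code -> name of the corresponding model bundled with NLTK's punkt data
NLTK_MODEL_NAMES = {"en": "english", "pt": "portuguese", "de": "german"}


def load_sentence_tokenizer(model_folder: str, language: str = "en"):
    """Use a custom Punkt model for `language` if one exists in `model_folder`,
    otherwise fall back to NLTK's default pre-trained model."""
    custom_model = Path(model_folder) / f"{language}.pickle"
    if custom_model.exists():
        with open(custom_model, "rb") as f:
            return pickle.load(f)
    default_name = NLTK_MODEL_NAMES.get(language, "english")
    return nltk.data.load(f"tokenizers/punkt/{default_name}.pickle")
```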
It's now possible to use a custom trained PunktTokenizer: 3948b99.
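For anyone finding this later, usage should look roughly like the sketch below; I'm assuming the new parameters are called `language` and `tokenizer_model_folder`, so please double-check against the PreProcessor docs of your Haystack version:

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_by="sentence",
    split_length=3,
    split_respect_sentence_boundary=False,
    language="pt",                          # ISO code of the custom model
    tokenizer_model_folder="punkt_models",  # folder containing pt.pickle
)
docs = preprocessor.process([Document(content="Conforme o art. 5º da CF/88, ...")])
```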
Hi,
Today, the PreProcessor uses the NLTK PunktTokenizer. The default model is great, except for some specific domains, like the legal one, where the many abbreviations trip it up a little.
I would like to propose, and offer to implement, the possibility of setting a custom trained PunktTokenizer for any set of languages.
For example, the user defines a directory containing model files named by ISO language code; Haystack then searches it for the requested language and, if no model is found, falls back to the default one.
What do you think about this feature? It wouldn't interfere with anything and would just improve specific cases (of which there are many in the NLP domain).
Have a great day!