
Slovak lang support #84


Merged: 4 commits, Dec 1, 2020

Conversation

misotrnka
Contributor

We've added support for sentence boundary detection (SBD) in Slovak text.

Language specific improvements:

  • list of common Slovak abbreviations
  • list of prepositive abbreviations
  • list of number abbreviations
  • handling of Roman numerals
  • handling of „ text “ quotes, which are common in Slovak
  • handling of ordinal numerals in dates, such as 17. Apríl 2020
  • modified replacement of periods in abbreviations, so that multi-part Slovak abbreviations such as Company Name s. r. o. are handled consistently
  • disabled processing of alphabetical lists, because it conflicts with some common abbreviations
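
To make two of the items above concrete (periods in multi-part abbreviations and ordinal numerals in dates), here is a minimal, self-contained sketch of the general idea: periods that must not end a sentence are swapped for a placeholder, the text is split, and the periods are restored. This is not the code from this PR. The names `ABBREVIATIONS`, `MONTH_STEMS`, `PLACEHOLDER` and `segment` are hypothetical, and the lists are heavily shortened for illustration.

```python
import re

# Arbitrary stand-in character for a period that must not end a sentence.
PLACEHOLDER = "∯"

# Hypothetical, heavily shortened lists (illustration only).
ABBREVIATIONS = ["s. r. o.", "a. s.", "napr.", "tzv.", "atď."]
# Month stems, so that inflected forms such as "apríla" also match.
MONTH_STEMS = ["január", "február", "mar", "apríl", "máj", "jún",
               "júl", "august", "septemb", "októb", "novemb", "decemb"]

# Longest abbreviations first, so multi-part ones win over shorter ones.
_ABBR_RE = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")",
    re.IGNORECASE,
)
# A one- or two-digit number with a period, directly followed by a month name.
_DATE_RE = re.compile(
    r"\b(\d{1,2})\.(?=\s+(?:" + "|".join(MONTH_STEMS) + r"))",
    re.IGNORECASE,
)


def segment(text):
    """Split text into sentences, keeping abbreviations and dates intact."""
    # Hide every period inside a known abbreviation.
    protected = _ABBR_RE.sub(
        lambda m: m.group(1).replace(".", PLACEHOLDER), text
    )
    # Hide the period of an ordinal day number in a date.
    protected = _DATE_RE.sub(lambda m: m.group(1) + PLACEHOLDER, protected)
    # Split after the remaining sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    # Restore the hidden periods.
    return [p.replace(PLACEHOLDER, ".") for p in parts if p]


for sentence in segment(
    "Firma Example s. r. o. vznikla 17. apríla 2020. Sídli v Bratislave."
):
    print(sentence)
```

One known limitation of this sketch: an abbreviation that really does end a sentence is never split, because all of its periods are protected.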

The code has been tested for stability on a very large corpus of web text. There has been no rigorous testing of segmentation quality, but the team's subjective impression is very positive.

@nipunsadvilkar
Owner

@misotrnka That's great! ✨

Can you add some tests for Slovak language as well?

Refer:
https://github.com/nipunsadvilkar/pySBD/blob/master/tests/lang/test_polish.py
https://github.com/nipunsadvilkar/pySBD/blob/master/tests/lang/test_russian.py

These tests will also act as a Golden Rules Set for the Slovak language.
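
For readers unfamiliar with the idea, a golden rules set can be thought of as a table of (text, expected sentences) pairs that any segmenter for the language should reproduce. The sketch below is only an illustration of that shape, not the tests added in this PR. The cases, `GOLDEN_SK_CASES`, `failing_cases` and `naive_segment` are all hypothetical.

```python
import re

# Hypothetical golden cases: (input text, expected list of sentences).
GOLDEN_SK_CASES = [
    ("Ahoj svet. Ako sa máš?", ["Ahoj svet.", "Ako sa máš?"]),
    ("Firma Example s. r. o. sídli v Bratislave.",
     ["Firma Example s. r. o. sídli v Bratislave."]),
    ("Stretneme sa 17. apríla 2020.", ["Stretneme sa 17. apríla 2020."]),
]


def failing_cases(segment, cases):
    """Return (text, expected, actual) for each case the segmenter gets wrong."""
    failures = []
    for text, expected in cases:
        # Strip whitespace so segmenters that keep trailing spaces still compare.
        actual = [s.strip() for s in segment(text)]
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def naive_segment(text):
    """Baseline that splits after every '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


# The naive baseline passes the first case but breaks on the abbreviation
# and on the ordinal date, which is what language-specific rules must fix.
for text, expected, actual in failing_cases(naive_segment, GOLDEN_SK_CASES):
    print(f"FAIL: {text!r} -> {actual!r}")
```

With pySBD installed, the `segment` method of a `pysbd.Segmenter(language="sk", clean=False)` instance could be passed to `failing_cases` instead of the baseline. The `"sk"` language code is an assumption here and should be checked against the merged code.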

@nipunsadvilkar linked an issue on Nov 27, 2020 that may be closed by this pull request
@misotrnka
Contributor Author

@nipunsadvilkar Will try to add the tests soon. 👍

@misotrnka
Contributor Author

@nipunsadvilkar Ok, please check now, I've added a few tests.

@nipunsadvilkar
Owner

Great 💯

@misotrnka Can you please pull from master? I've fixed the failing CI issue.

Synced with upstream master.
@misotrnka
Contributor Author

Ok, looks good now 😎

@nipunsadvilkar merged commit 2010c4d into nipunsadvilkar:master on Dec 1, 2020
@nipunsadvilkar
Owner

@misotrnka Thanks for adding support for Slovak 👍

@misotrnka deleted the slovak-lang-support branch on December 1, 2020 07:45

Successfully merging this pull request may close these issues.

Adding Multiple Language Support
2 participants