Exception when clean=True in search_for_connected_sentences #91

@balazik

Describe the bug
Segmenter raises "exception: bad escape (end of pattern) at position" when it is initialized with clean=True and encounters text like "etc.Png,Jpg,.\" (a word/token that contains a backslash).

The exception is raised in:
module: cleaner.py
class: Cleaner
method: search_for_connected_sentences
line:

txt = re.sub(re.escape(word), new_word, txt)
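
The failure can be shown without pysbd, using only the standard `re` module. The sketch below is an illustration under assumptions: the values of `word`, `new_word`, and `txt` are hypothetical stand-ins (the real `new_word` is computed inside `Cleaner`), and `try_sub` is a helper invented here to capture the error message.

```python
import re

# Hypothetical stand-ins for the variables used in search_for_connected_sentences.
word = "etc.Png,Jpg,.\\"       # token ending in a single backslash
new_word = "etc. Png,Jpg,.\\"  # assumed replacement, also ending in a backslash
txt = word


def try_sub(pattern, repl, text):
    """Return the substituted text, or the re.error message if re.sub fails."""
    try:
        return re.sub(pattern, repl, text)
    except re.error as exc:
        return f"re.error: {exc}"


# re.escape() protects only the pattern. The replacement string is still
# parsed as a template, so its trailing backslash is an incomplete escape.
result = try_sub(re.escape(word), new_word, txt)
print(result)
```

The pattern itself is valid here; the error comes from the replacement argument.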

To Reproduce
Steps to reproduce the behavior:

import pysbd

# This is a simplified example; the original text contained names, so I changed them to image formats.
# The word is an abbreviation with a dot, followed by an upper-case letter and a backslash.
sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\"
sentences = sentencer.segment(txt)

Expected behavior
The text should be segmented as is, and no exception should be thrown.
A workaround to see the output is to escape the backslash.

sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\\\"
sentences = sentencer.segment(txt)

Expected output:

['etc.', 'Png,Jpg,.', '\\']

Possible solution
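Replace txt = re.sub(re.escape(word), new_word, txt)
with txt = txt.replace(word, new_word)
re.escape only protects the pattern; backslashes in the replacement string new_word are still interpreted as escape sequences by re.sub. A plain str.replace avoids the pitfalls of regular expressions (like escaping) and is generally faster.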
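
A minimal sketch of the proposed change, again with hypothetical values for `word`, `new_word`, and `txt`. It also shows one alternative that keeps `re.sub`: passing a function as the replacement, whose return value is inserted literally without escape processing. Neither variant is taken from the pysbd code base.

```python
import re

# Hypothetical stand-ins; the real values are computed inside Cleaner.
word = "etc.Png,Jpg,.\\"
new_word = "etc. Png,Jpg,.\\"
txt = "see " + word

# Proposed fix: literal substring replacement, no regex parsing involved.
fixed_replace = txt.replace(word, new_word)

# Alternative: keep re.sub but pass a callable, so the replacement
# text is used as-is and its backslash is not treated as an escape.
fixed_callable = re.sub(re.escape(word), lambda match: new_word, txt)

print(fixed_replace)
print(fixed_callable)
```

Both calls replace every occurrence of `word`, so they should behave the same for this use.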

Additional context
Originally, we parse small text files (in Slovak) without special treatment to build a huge sentence-segmented corpus. The example was crafted specifically to reproduce the behavior with the English parser. I know that this backslash combination is rare in English, but it does occur in Slovak articles when you process vast amounts of text.
