Description
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
import pysbd
input_str = """This is part 3 of MAMI-san's hair timelineThe previous hair timelines can be found hereOkay then, we'll be continuing from last time and starting off with MAMI-san's orange era, which was at this time2014/8 - Continuing the orange bob2014/09 - Used to seeing the orangeOrange fades easily, so we did a lot of maintenence2014/10 - Her hair grew long with seal extensionsA new song was released while being orange2014/11 - Completely orange☝︎A new album was also released while being orange2014/12 - The ends and bangs were cut straight acrossHeading into 2015JanuaryMAMI-san is super stylish, isn't sheFebruary2015/03 - The extensions got a little shorterMAMI had been orange for exactly one yearIn April, before their world tour, we took out the orange and put in a turquoise blue gradient2015/05 - While on break from tour, we took out the ash color and did maintenence2015/062015/072015/08 - Returning from tour and cutting it into a bobReleasing a new song while having two-toned hair2015/09 - We put in ash with a transparent feeling to it2015/11 - Her first image change in eight months2015/12 - An easy-going color so that the pudding color isn't conspicuousHer black hair has been missed by the fans, hasn't itWhen will the dyed-black MAMI-san be woken up by another desire to bleach her hairAlso, it's not that her lovely hair and color were worn out; at RISEL we have a 『hybrid bleach』 of an original super bleach dream, so please don't worryI will make her hair colors beautiful enough so that everyone will be surprised by any color, and I will support her on behalf of everyone so that MAMI-san can give her best performanceLooking back, even though we changed intense colors so many times, everything definitely looks good on MAMI-san, doesn't itWell then, please look forward to MAMI-san's hair timeline again next yearRISEL.xoxo.KAZU"""
segmenter = pysbd.Segmenter(language="en", clean=False)
segments = segmenter.segment(input_str)
Expected behavior
A list of one or more sentences.
Additional context
The text originates from the OpenWebText dataset.
I also found cases where the segmenter removes spaces from sentences, or adds spaces that were not in the original strings.
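The whitespace changes mentioned above could be located with a small check like the following. This is only a sketch: `first_mismatch` is a hypothetical helper, not part of pysbd, and it assumes that with `clean=False` the segments are meant to concatenate back to the original string, which this report implies but does not confirm.

```python
from typing import List, Optional


def first_mismatch(original: str, segments: List[str]) -> Optional[int]:
    """Return the index of the first character at which the concatenated
    segments diverge from the original text, or None if they are identical.

    Hypothetical helper for illustration; it is not a pysbd function.
    """
    joined = "".join(segments)
    if joined == original:
        return None
    # Walk both strings in parallel until a character differs.
    for i, (a, b) in enumerate(zip(original, joined)):
        if a != b:
            return i
    # One string is a prefix of the other: they diverge where the shorter ends.
    return min(len(original), len(joined))


# Assumed usage with the reproduction above:
#   idx = first_mismatch(input_str, segments)
#   if idx is not None:
#       print(repr(input_str[max(0, idx - 20):idx + 20]))
```

A result of `None` means no characters were added or dropped; any integer gives a position in `input_str` to inspect.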