
german conventional quote style causes incorrect segmentation #132


Open
joprice opened this issue Apr 16, 2025 · 1 comment

joprice commented Apr 16, 2025

German texts often use the pair „ and “ to delineate quoted text. These cause issues, for example in the text below:

Nach einem kurzen Zögern näherte sie sich Louis. „Darf ich mitspielen?“, fragte sie schüchtern.

The resulting segmentation is:

Nach einem kurzen Zögern näherte sie sich Louis. 
„Darf ich mitspielen?
“, fragte sie schüchtern.

For other languages, a similar sentence would keep the second sentence as a single segment:

Nach einem kurzen Zögern näherte sie sich Louis. 
„Darf ich mitspielen?“, fragte sie schüchtern.
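
Until the segmenter handles this itself, one possible workaround is to re-attach any segment that begins with a closing quote to the segment before it. The following is only a sketch: `merge_closing_quote_segments` is a hypothetical helper, not part of any library's API, and the set of closing quotes has to be chosen per language, since “ closes a quote in German but opens one in English.

```python
def merge_closing_quote_segments(segments, closing_quotes=("“",)):
    """Re-attach segments that start with a closing quote to the previous segment.

    Hypothetical post-processing helper. `closing_quotes` is language-specific:
    the default is the German closing mark; French text would use ("»",).
    """
    closing = tuple(closing_quotes)
    merged = []
    for segment in segments:
        # A segment that opens with a closing quote was split in the wrong place.
        if merged and segment.lstrip().startswith(closing):
            merged[-1] += segment
        else:
            merged.append(segment)
    return merged


segments = [
    "Nach einem kurzen Zögern näherte sie sich Louis. ",
    "„Darf ich mitspielen?",
    "“, fragte sie schüchtern.",
]
fixed = merge_closing_quote_segments(segments)
```

With the three segments from the example above, `fixed` contains two segments, and the second one is the full quoted sentence.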

joprice commented Apr 17, 2025

I found that the same problem exists for guillemets ("«" and "»"), which are commonly used in French texts.

Guillemets are also used with a space on the inside of each mark, as in « Bonjour ! ». I found that this spacing also causes incorrect segmentation if the guillemets are simply replaced with double quotes in a pre-processing step.
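
For reference, a pre-processing step that also removes the spacing inside the guillemets before swapping in double quotes could look roughly like this. It is a minimal sketch: `normalize_quotes` is a hypothetical name, I have not confirmed that it fixes the segmentation, and it changes character offsets, so segment boundaries no longer map directly back to the original text.

```python
import re

# Straight double quote replaces the German and French quotation marks.
_QUOTE_MAP = str.maketrans({"„": '"', "“": '"', "«": '"', "»": '"'})


def normalize_quotes(text):
    """Hypothetical pre-processing helper: normalize „ “ « » to straight quotes."""
    # Remove spaces (regular, no-break, narrow no-break) just inside guillemets,
    # e.g. "« Bonjour ! »" becomes "«Bonjour !»".
    text = re.sub(r"«[ \u00a0\u202f]+", "«", text)
    text = re.sub(r"[ \u00a0\u202f]+»", "»", text)
    return text.translate(_QUOTE_MAP)


french = normalize_quotes("« Bonjour ! »")
german = normalize_quotes("„Darf ich mitspielen?“, fragte sie schüchtern.")
```

Here `french` becomes `"Bonjour !"` with the quotes directly attached, and `german` gets plain double quotes around the question.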
