Wrong genders in Romanian #1449

Open
jonnyGitHub57 opened this issue Jan 19, 2025 · 5 comments
@jonnyGitHub57

Describe the bug
When analyzing Romanian text, neuter nouns are tagged as "Gender=Masc".

To Reproduce
Analyze a sentence such as "Sistemul este foarte bun". The neuter noun "sistem" appears as:

Input sentence: Sistemul este foarte bun
[
  [
    {
      "id": 1,
      "text": "Sistemul",
      "lemma": "sistem",
      "upos": "NOUN",
      "xpos": "Ncmsry",
      "feats": "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing",
      "head": 4,
      "deprel": "nsubj",
      "start_char": 0,
      "end_char": 8
    },
    {
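Output like the above can be produced with a short script. The following is a minimal sketch, not the exact code used for this report: it assumes the default Romanian Stanza pipeline is installed or can be downloaded, and the helper names `gender_of`, `analyze` and `report` are made up for illustration.

```python
from typing import Dict, List, Optional


def gender_of(word: Dict) -> Optional[str]:
    """Return the Gender value from a word dict's "feats" string, or None."""
    feats = word.get("feats") or ""
    for pair in feats.split("|"):
        key, _, value = pair.partition("=")
        if key == "Gender":
            return value
    return None


def analyze(sentence: str, lang: str = "ro") -> List[List[Dict]]:
    """Run the default Stanza pipeline and return one list of word dicts per sentence."""
    # Imported lazily so gender_of() can be used without Stanza installed.
    import stanza

    # Assumption: the default models for `lang` are available locally
    # or can be downloaded by Stanza on first use.
    nlp = stanza.Pipeline(lang)
    return nlp(sentence).to_dict()


def report(sentence: str) -> None:
    """Print text, UPOS and Gender for every word in the sentence."""
    for sent in analyze(sentence):
        for word in sent:
            print(word["text"], word.get("upos"), gender_of(word))
```

Calling `report("Sistemul este foarte bun")` prints one line per word, which makes the Gender value assigned to "Sistemul" easy to spot.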

@AngledLuffa
Collaborator

When I look for that word in the training data, it is labeled Gender=Masc in both of the bigger Romanian treebanks:

[john@localhost UD_Romanian-RRT]$ grep Sistemul  *conllu | grep -v "# text"
ro_rrt-ud-test.conllu:1 Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       2       nsubj   _       _
ro_rrt-ud-train.conllu:29       Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       27      nmod    _       _
ro_rrt-ud-train.conllu:15       Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       12      nmod    _       _
ro_rrt-ud-train.conllu:1        Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       7       nsubj:pass      _      _
ro_rrt-ud-train.conllu:1        Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       7       nsubj   _       _
[john@localhost UD_Romanian-SiMoNERo]$ grep Sistemul  *conllu | grep -v "# text"
ro_simonero-ud-train.conllu:1   Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   7       nsubj   _       _
ro_simonero-ud-train.conllu:11  Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   9       nmod    _       _
ro_simonero-ud-train.conllu:29  Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   27      obl:agent       _       _
ro_simonero-ud-train.conllu:40  Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   29      conj    _       _
ro_simonero-ud-train.conllu:1   Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   3       nsubj   _       _
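The same check can be done without grep, which helps when tallying labels for many forms. This is a rough sketch that assumes standard 10-column, tab-separated CoNLL-U files; the function name `gender_counts` is hypothetical and not part of Stanza or the UD tools.

```python
from collections import Counter
from typing import Iterable


def gender_counts(lines: Iterable[str], form: str) -> Counter:
    """Count Gender values for tokens whose FORM column equals `form`."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or cols[1] != form:
            continue
        # Column 6 (index 5) holds FEATS as "Key=Value|Key=Value", or "_" if empty.
        feats = dict(p.split("=", 1) for p in cols[5].split("|") if "=" in p)
        counts[feats.get("Gender", "(none)")] += 1
    return counts
```

For example, `with open("ro_rrt-ud-train.conllu", encoding="utf-8") as f: print(gender_counts(f, "Sistemul"))` tallies the labels for one file.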

@jonnyGitHub57
Author

jonnyGitHub57 commented Jan 20, 2025 via email

@jonnyGitHub57
Author

jonnyGitHub57 commented Jan 20, 2025 via email

@AngledLuffa
Collaborator

Is the conclusion that there's nothing to be done? Changing this would basically require an overhaul of the dataset or a special case of some kind for Romanian.
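As a purely illustrative sketch of what a "special case" could look like on the user side, the snippet below post-processes word dicts of the shape shown earlier in this issue. Nothing here is Stanza functionality: `override_gender` and the `neuter_lemmas` set are hypothetical, the lemma list would have to come from the user, and whether "Neut" is the right target value is an assumption.

```python
from typing import Dict, Iterable, List, Set


def override_gender(
    words: Iterable[Dict], neuter_lemmas: Set[str], new_value: str = "Neut"
) -> List[Dict]:
    """Return copies of `words` with Gender rewritten for nouns in `neuter_lemmas`."""
    result = []
    for word in words:
        word = dict(word)  # shallow copy so the caller's dicts stay untouched
        feats = word.get("feats")
        if feats and word.get("upos") == "NOUN" and word.get("lemma") in neuter_lemmas:
            # Replace only the Gender feature; keep every other feature as-is.
            parts = [
                f"Gender={new_value}" if p.startswith("Gender=") else p
                for p in feats.split("|")
            ]
            word["feats"] = "|".join(parts)
        result.append(word)
    return result
```

This leaves the model and the treebanks alone and only changes the output for lemmas the user explicitly lists, so it does nothing for words missing from the list.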

@jonnyGitHub57
Author

jonnyGitHub57 commented Mar 25, 2025 via email
