Wrong genders in Romanian #1449
Dear John,
Many thanks for your fast reply. I almost suspected something like that.
BR,
/Jonny
On 20 January 2025, 00:16 Central European Standard Time, John ***@***.***> wrote:
When I look for that word in the training data, it is labeled Gender=Masc in both of the bigger Romanian treebanks:
***@***.*** UD_Romanian-RRT]$ grep Sistemul *conllu | grep -v "# text"
ro_rrt-ud-test.conllu:1 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 2 nsubj _ _
ro_rrt-ud-train.conllu:29 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 27 nmod _ _
ro_rrt-ud-train.conllu:15 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 12 nmod _ _
ro_rrt-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 7 nsubj:pass _ _
ro_rrt-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 7 nsubj _ _
***@***.*** UD_Romanian-SiMoNERo]$ grep Sistemul *conllu | grep -v "# text"
ro_simonero-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 7 nsubj _ _
ro_simonero-ud-train.conllu:11 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 9 nmod _ _
ro_simonero-ud-train.conllu:29 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 27 obl:agent _ _
ro_simonero-ud-train.conllu:40 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 29 conj _ _
ro_simonero-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 3 nsubj _ _
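The same check can be scripted instead of grepped. The following is a minimal sketch, assuming standard 10-column, tab-separated CoNLL-U token lines; `gender_counts` is an illustrative helper name, and the sample rows are two of the grep hits above with the filename prefix removed and tabs restored.

```python
from collections import Counter


def gender_counts(conllu_lines, form):
    """Count the Gender values of every token whose FORM column equals `form`."""
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or cols[1] != form:
            continue  # not a well-formed token line, or a different word form
        feats = cols[5]
        if feats == "_":
            counts["(none)"] += 1
            continue
        # FEATS looks like "Case=Nom|Gender=Masc|Number=Sing"
        feat_map = dict(pair.split("=", 1) for pair in feats.split("|"))
        counts[feat_map.get("Gender", "(none)")] += 1
    return counts


sample = [
    "1\tSistemul\tsistem\tNOUN\tNcmsry\t"
    "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing\t2\tnsubj\t_\t_",
    "29\tSistemul\tsistem\tNOUN\tNcmsry\t"
    "Case=Nom|Definite=Def|Gender=Masc|Number=Sing\t27\tobl:agent\t_\t_",
]
print(gender_counts(sample, "Sistemul"))
```

To run it over a real treebank, pass an open file handle for one of the `.conllu` files instead of `sample`.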
|
Dear John,
I found on the Romanian UD page <https://universaldependencies.org/ro/index.html> that the neuter gender is deliberately not used for nouns:
Nominal Features
* Nominal words (NOUN <https://universaldependencies.org/u/pos/NOUN.html>, PROPN <https://universaldependencies.org/u/pos/PROPN.html> and PRON <https://universaldependencies.org/u/pos/PRON.html>) have an inherent Gender <https://universaldependencies.org/u/feat/Gender.html> feature with one of two values: Masc or Fem. The neuter is in Romanian classified as masculine singular and feminine plural.
* The following parts of speech inflect for Gender because they must agree with nouns: ADJ <https://universaldependencies.org/u/pos/ADJ.html>, DET <https://universaldependencies.org/u/pos/DET.html>, NUM <https://universaldependencies.org/u/pos/NUM.html>, VERB <https://universaldependencies.org/u/pos/VERB.html>, AUX <https://universaldependencies.org/u/pos/AUX_.html>. For verbs (including auxiliaries), only participles have gender.
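Given this convention, a downstream workaround could recover neuter from the two observed values. This is only a sketch under stated assumptions: it presumes you already have, per lemma, the Gender values seen on its singular and plural forms. The function `lexical_gender` and the `"Neut"` label are hypothetical and are not part of Stanza or of the UD Romanian treebanks.

```python
def lexical_gender(sing_gender, plur_gender):
    """Infer a lemma-level gender from the UD Gender values of its singular
    and plural forms.

    Follows the convention quoted above: a neuter noun surfaces as Masc in
    the singular and Fem in the plural. Returns None when a value is missing
    or the pair matches no expected pattern.
    """
    if sing_gender is None or plur_gender is None:
        return None  # cannot decide from one number alone
    if sing_gender == plur_gender:
        return sing_gender  # consistently Masc or consistently Fem
    if (sing_gender, plur_gender) == ("Masc", "Fem"):
        return "Neut"  # hypothetical label for the mixed pattern
    return None


print(lexical_gender("Masc", "Fem"))
print(lexical_gender("Masc", "Masc"))
```

Note the limitation: a single singular token such as "Sistemul" carries only Gender=Masc, so masculine and neuter cannot be told apart without also seeing a plural form or consulting a lexicon.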
|
Is the conclusion that there's nothing to be done? It would basically require an overhaul of the dataset or a special case of some kind for Romanian.
No, there is nothing to do, since it seems to be a conscious decision not to have neuter gender in UD.
Thanks for the reminder,
/Jonny
On Monday, 24 March 2025 at 22:00:38 +01:00, John Bauer (AngledLuffa) ***@***.***> wrote on stanfordnlp/stanza#1449:
Is the conclusion that there's nothing to be done? It would basically require an overhaul of the dataset or a special case of some kind for Romanian
|
Describe the bug
When Romanian neuter nouns are analyzed, they are tagged as "Gender=Masc".
To Reproduce
Analyze a sentence such as "Sistemul este foarte bun". The neuter noun "sistem" appears as:
Input sentence: Sistemul este foarte bun
[
[
{
"id": 1,
"text": "Sistemul",
"lemma": "sistem",
"upos": "NOUN",
"xpos": "Ncmsry",
"feats": "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing",
"head": 4,
"deprel": "nsubj",
"start_char": 0,
"end_char": 8
},
{
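To inspect the gender programmatically, the `feats` string can be split into feature/value pairs. The following is a minimal sketch in plain Python: the token dict is copied from the output above, and `parse_feats` is an illustrative helper, not a Stanza function.

```python
def parse_feats(feats):
    """Turn a UD FEATS string into a dict mapping each feature name to a list of values."""
    if not feats or feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        # A feature may carry several comma-separated values, e.g. Case=Acc,Nom
        parsed[name] = value.split(",")
    return parsed


# First token of the analysis shown above.
token = {
    "id": 1,
    "text": "Sistemul",
    "lemma": "sistem",
    "upos": "NOUN",
    "xpos": "Ncmsry",
    "feats": "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing",
    "head": 4,
    "deprel": "nsubj",
    "start_char": 0,
    "end_char": 8,
}

feats = parse_feats(token["feats"])
print(feats["Gender"])
print(feats["Case"])
```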