
Default combined CorefUD model has inconsistent outputs for English #1450


Open
amir-zeldes opened this issue Jan 24, 2025 · 28 comments

@amir-zeldes

I've been testing the CorefUD-trained Stanza model on English and seeing some inconsistent results, especially with regard to singletons. Since the model is trained on data that has singletons (but possibly also data that has no singletons? Is ParCorFull included for English? Or is the default a totally multilingual model?), it should produce predictions for most obvious noun phrases, and for the most part it does:

[Image: model output with most obvious noun phrases predicted as mentions]

However, at other times it ignores very obvious mentions, perhaps because figuring out an antecedent is non-trivial:

[Image: model output with obvious mentions missing]

Notice that the model misses even completely reliable mention spans such as the pronouns "I" or "their", which are virtually guaranteed to be mentions (even if we can find no antecedent, at least in a corpus with singletons they would still be annotated).

What I'm actually looking for is English GUM-like results, and I'm wondering whether this is the result of multi-dataset training/conflicting guidelines (esp. regarding singletons). Is there any chance of getting a GUM-only trained model for English?
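
For anyone who wants to poke at this themselves, a minimal sketch along these lines should reproduce the behavior (the doc.coref / CorefChain / CorefMention fields follow the documented Stanza coref output; the example sentence is just illustrative, and end_word is assumed to be exclusive):

import stanza

# default English pipeline, which currently ships the combined CorefUD coref model
nlp = stanza.Pipeline("en", processors="tokenize,coref")
doc = nlp("I asked the Refugee Council about their asylum decisions.")

# print every predicted mention span, chain by chain; spans that never show up
# here (e.g. "I" or "their") are the missing mentions discussed above
for chain in doc.coref:
    for mention in chain.mentions:
        words = doc.sentences[mention.sentence].words[mention.start_word:mention.end_word]
        print(chain.index, " ".join(word.text for word in words))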

@Jemoka
Member

Jemoka commented Jan 24, 2025

I think this is the one that's trained with the mixed multilingual backbone, and possibly with a mixture of data with/without singletons; we can ship a GUM-only model, or perhaps even an OntoNotes + singletons one. @amir-zeldes - do you have the OntoNotes augmented dataset somewhere? Would love to train a test model off of that.

@amir-zeldes
Author

@Jemoka that would be amazing! I think we'd actually want all of those different models if possible, since I think ON w/singletons + GUM would be great for mention detection, but they have rather different coref guidelines, so that could create a hodgepodge of inconsistent clustering predictions. It's an empirical question, but I could imagine that if you were scoring GUM-style coref incl. singletons, throwing in all predictions from both models might actually outperform either model by itself and prevent the low-recall issues with ON-only models. Then again, it might need some rule-based postprocessing...

@yilunzhu has put the ON singleton predictions up on GitHub; I think this is the latest (Yilun, please correct me if there's something newer)

For training with GUM it might also be worth waiting a little - we're close to ready to release GUM v11, with new data, probably in about 2 weeks. I can post to this thread when that happens if that's of interest.

@yilunzhu

Yes, this is the latest version.

@Jemoka
Member

Jemoka commented Jan 24, 2025

For training with GUM it might also be worth waiting a little - we're close to ready to release GUM v11, with new data, probably in about 2 weeks. I can post to this thread when that happens if that's of interest.

Sounds good; will hold off on that. In the meantime I will train an English OntoNotes + singletons model and report back on this thread.

@Jemoka
Member

Jemoka commented Jan 26, 2025

Update:
Great news! We have 0.812 head-match LEA on the dev set for this dataset using our approach + a RoBERTa backbone.
Bad news! Manually running the model by hand reveals no corefs; something is wrong with our client inference procedure. Stay tuned.

Update 2:
Looks like the model got biased to the length of OntoNotes documents; I arbitrarily updated my test to contain much longer inputs and it's doing better now. I'll run with an augmentation tomorrow/Monday where we repeat the training data a few times across varying lengths.

@Jemoka
Member

Jemoka commented Jan 29, 2025

Done! CC @amir-zeldes
For the English dev set:

span-match LEA: 72.19
head-word match CoNLL 2012: 82.90

Here's the weights: https://drive.google.com/drive/folders/14EwOVRSrdbp9cjARgTu-DNaCjryNePJW?usp=sharing

To use them:

nlp = stanza.Pipeline('en', processors='tokenize,coref', coref_model_path="./the_path_to/roberta_lora.pt")

@AngledLuffa
Collaborator

Thank you, @Jemoka !

To make it more convenient to get the model, I uploaded it to HuggingFace. You should be able to download it using Stanza version 1.10 with:

pipe = stanza.Pipeline("en", processors="tokenize,coref", package={"coref": "ontonotes-singletons_roberta-large-lora"})
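
As a quick sanity check that the package loaded, something like this should print one representative mention per predicted chain (field names per the documented doc.coref structure; adjust if your Stanza version exposes them differently):

import stanza

pipe = stanza.Pipeline("en", processors="tokenize,coref",
                       package={"coref": "ontonotes-singletons_roberta-large-lora"})
doc = pipe("Enver Solomon, CEO of the Refugee Council, said he welcomed the decision.")
for chain in doc.coref:
    # representative_text is the mention Stanza selects to represent the chain
    print(chain.index, chain.representative_text)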

@amir-zeldes
Author

OK, coref still has some issues for the sample text I was using above, but this is much, much better for mention detection:

[Image: ON+singletons model output with much better mention detection]

The only systematic concerns I have about it are direct results of the ON guidelines, for example the treatment of appositions (so we get [CEO of the Refugee Council [Enver Solomon]] as a nested mention) and the absence of compound modifiers (e.g. no [asylum] in [asylum decisions]). Coordination is also a bit odd with [Albania and [France]]: I'd expect the big coordination box to oscillate (not annotated unless referred back to), but Albania by itself being missing is surprising.

But either way, this is worlds better, thanks so much for making this model available!

We're getting close to wrapping up the latest GUM release, I'll post a link to the data as soon as it's ready.

@Jemoka
Member

Jemoka commented Feb 1, 2025

Sounds good; once the next GUM is released, I'll be glad to build a model for that. There's a chance that upping top-k in the initial filtering step will be better for things like coordination with a lot of nesting.

@amir-zeldes
Author

Hi @Jemoka, apologies this took a while - we are still adding some discourse annotations before officially releasing the data, but coref is all done, so I can share that already - you can get everything from here

Let me know if you run into any issues with the data, we are still putting the finishing touches on some other annotation layers which will be included in the next UD release.

@Jemoka
Member

Jemoka commented Feb 18, 2025

Great, thanks! I can try to build a model with this and see how it goes.

@amir-zeldes
Author

BTW, two quick usage questions:

  1. Is there a way to get linking probabilities for clustered mentions, meaning the probability with which the positive linking decision was reached when a mention was added to a cluster?
  2. Is there a way to set a different decision threshold to get either higher precision or recall, for example by changing, say, a binary classification boundary from 50% to 60% or 40% to force more or less conservative behavior from the system?

@Jemoka
Member

Jemoka commented Feb 20, 2025

CC @AngledLuffa for how we can expose this in the high-level API; in the low-level API, we can definitely change things to make this happen.

@AngledLuffa
Collaborator

We could probably just attach the scores to the coref chain etc. objects in stanza/models/coref/coref_chain.py
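
For illustration only, a rough sketch of what that could look like; the field list here is a guess at the current classes, and the link_score field (name and semantics) is purely hypothetical:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CorefMention:
    sentence: int
    start_word: int
    end_word: int
    # hypothetical: probability of the positive linking decision that added
    # this mention to its chain (None for the chain-initial mention)
    link_score: Optional[float] = None

@dataclass
class CorefChain:
    index: int
    mentions: List[CorefMention]
    representative_text: str
    representative_index: int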

@amir-zeldes
Author

Thanks for the quick response! It would be nice to have both features, since coref cluster decisions are transitive, so it's hard to emulate (2) just by having access to the individual decision scores from (1).

@QuantumStaticFR

I would love to get those linking probabilities -> #1463 (comment)

@amir-zeldes
Author

@Jemoka Just wanted to check - were you able to train that GUM model? I need a high-recall model for a project, so I'm looking for that single-corpus scheme consistency but with more coref returned than the ON model gives.

@AngledLuffa
Collaborator

He just let me know it's in the process of training. Hopefully we can post it later tonight or tomorrow

@amir-zeldes
Author

That's fantastic, thank you!

@Jemoka
Member

Jemoka commented Mar 4, 2025

Hi @amir-zeldes! Apologies for the delay on this. We built a model on the new GUM data; it seems to perform overall worse than the OntoNotes model, but I wonder whether there are specific behaviors you're looking for that this model provides?

https://drive.google.com/file/d/1qV-lD33_RtpHIrvRGLz3e9WBRlziw_QT/view?usp=sharing

Bakeoff score: 0.81509
Span match LEA: Sentence F1 0.66215 p 0.66092 r 0.66340

As before, please feel free to load it with:

nlp = stanza.Pipeline('en', processors='tokenize,coref', coref_model_path="./the_path_to/en_gum_roberta_lora.pt")

CC @AngledLuffa, we should also be able to upload it as an optional package for easy loading.

@AngledLuffa
Collaborator

I pushed this to HF, and I believe the package name should be gum_roberta-large-lora

@amir-zeldes
Author

it seems to perform overall worse than the OntoNotes model, but I wonder whether there are specific behaviors you're looking for that this model provides?

I wouldn't say it performs worse - OntoNotes just covers far fewer coreference phenomena, so it's an easier target to score on, and of course the corpus is also bigger, so the model is good at what it covers. The model you just uploaded actually does a great job of what I need it for, which is high recall, and doesn't have the inconsistent behavior of the CorefUD model, which mixes guidelines from different datasets. Here's a random CNN news article for comparison.

OntoNotes model:

[Image: OntoNotes model output on the CNN article]

GUM model:

[Image: GUM model output on the CNN article]

If you look closely you'll see the ON model's coref chains are much sparser (less color above), which has to do with the ON guidelines - no compound modifiers, no 'adjectival' use of "US" (notice how all "US" and "United States" etc. are clustered in the bottom image), the different behavior of 'apposition wrapping' (e.g. "TSMC", "VIX", "Thursday Feb 13", and, erroneously, what happens to "CNN" at the top), and more.

So bottom line, I definitely prefer the GUM model in practice - and it's working well for what I needed it for, too (it's actually only one input in an ensemble, and using all three models together turns out to be helpful, though the GUM one is best).
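
In case it's useful to anyone else, here is a minimal sketch of the kind of mention-level union one could do across these models (not exactly what my ensemble does; package names as uploaded in this thread, doc.coref fields per the Stanza docs; since both pipelines share the same tokenizer, the word indices are comparable):

import stanza

def mention_spans(doc):
    # collect (sentence index, start word, end word) for every predicted mention
    spans = set()
    for chain in doc.coref:
        for mention in chain.mentions:
            spans.add((mention.sentence, mention.start_word, mention.end_word))
    return spans

text = "The US economy grew last quarter, according to CNN."
packages = ["gum_roberta-large-lora", "ontonotes-singletons_roberta-large-lora"]
all_spans = set()
for package in packages:
    pipe = stanza.Pipeline("en", processors="tokenize,coref",
                           package={"coref": package})
    all_spans |= mention_spans(pipe(text))
print(len(all_spans), "mention spans in the union")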

@Jemoka
Member

Jemoka commented Mar 6, 2025

Great! Makes sense. Glad that we have both available, then!
For general usage, I'm curious which one you would recommend? Right now I think the default is still our general multilingual model, which we should probably move away from for languages where we have a better (and actually smaller) model, such as English.

@amir-zeldes
Author

It's probably subjective/application-dependent - the one thing I can say for sure is that after testing it a bit, I wouldn't recommend working with the CorefUD model trained on multiple datasets. I think CorefUD is a great initiative (if I may say so myself as a contributor ;) but the datasets are not yet harmonized in terms of guidelines, and the multi-dataset model clearly oscillates between definitions and practices, possibly based on which training corpus the sentences being processed are most similar to.

Other than that, I would imagine the OntoNotes model should have higher precision and lower recall, so if one of those matters more than the other, that could guide the choice. I wrote an overview of what is missing in OntoNotes-style coref here if you're interested in the details. But ultimately, which model to use as the Stanza default is maybe something to figure out empirically, for example by processing 10 documents from different sources/text types and asking a few different people which analysis they find better. I'll be rooting for GUM's notion of coref of course, but other people could have different ideas!

@Jemoka
Member

Jemoka commented Mar 7, 2025

Sounds good. CC @AngledLuffa for the final call on this one, but it does sound like for English we should definitely make something other than the CorefUD model the default.

Thanks again!

@AngledLuffa
Collaborator

I'm flexible and happy to pick whichever people trust the most

@amir-zeldes
Author

If someone has a sample of texts to use as a plausibility test, I'm happy to run the models and visualize the results as above, if that helps. Within-dataset scores may not be too helpful since the test sets each target different phenomena. If there's a student or someone who has some time to do an evaluation, I guess they could count the number of blunders or plainly wrong things that each model does on some texts that come from none of the datasets?


stale bot commented May 7, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 7, 2025
@AngledLuffa AngledLuffa removed the stale label May 8, 2025