
Suggested configuration for training NN models on highly imbalanced datasets #229

@henrykang7177

Description


Hi!

I have a binary classification dataset with a highly imbalanced label distribution (pos : neg ≈ 1 : 200).

I tried applying the BERT code from the Neural Network Quick Start Tutorial directly to this dataset, with the validation metric set to "Macro-F1", but the trained model mostly predicts all negatives.

I am wondering whether there are parameters or configurations in LibMultiLabel that I could tune for such an imbalanced dataset to improve the model's performance?
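(For context on what I mean by "tuning for imbalance": one common remedy, not specific to LibMultiLabel, is to up-weight the positive class in the training loss, e.g. setting a positive weight of roughly n_neg / n_pos in a binary cross-entropy loss. A minimal dependency-free sketch of the idea, with made-up numbers:)

```python
import math

def weighted_bce(p, y, pos_weight=1.0):
    """Binary cross-entropy with the positive term up-weighted."""
    eps = 1e-12  # guard against log(0)
    return -(pos_weight * y * math.log(p + eps)
             + (1 - y) * math.log(1 - p + eps))

# For a 1:200 pos:neg ratio, weight positives ~200x.
n_pos, n_neg = 1, 200
pos_weight = n_neg / n_pos  # 200.0

# A confident wrong prediction (p=0.1) on a positive sample (y=1)
# is penalized 200x more than with the unweighted loss.
loss_unweighted = weighted_bce(0.1, 1)
loss_weighted = weighted_bce(0.1, 1, pos_weight)
print(loss_weighted / loss_unweighted)  # → 200.0
```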

For your reference:

I also tried the linear methods, where using train_cost_sensitive instead of train_1vsrest noticeably improved this issue: with train_cost_sensitive, the model predicts about 4 times as many positive samples as with train_1vsrest. Both methods reach "Micro-F1" and "P@1" close to 0.99 (due to the dominating negative class), with Macro-F1 around 0.5.
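(The metric pattern above, Micro-F1 near 0.99 while Macro-F1 sits around 0.5, is roughly what an all-negative predictor yields under a 1:200 split, since the positive class contributes an F1 of 0 to the macro average. A quick dependency-free check, treating the binary task as two one-vs-rest classes:)

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# 1:200 imbalance, 201 samples; a model that predicts all-negative.
# Positive class: TP=0, FP=0, FN=1.  Negative class: TP=200, FP=1, FN=0.
f1_pos = f1(0, 0, 1)    # 0.0 — the positive class is never predicted
f1_neg = f1(200, 1, 0)  # ~0.9975

macro_f1 = (f1_pos + f1_neg) / 2      # averages per-class F1 scores
micro_f1 = f1(0 + 200, 0 + 1, 1 + 0)  # pools counts across classes

print(round(macro_f1, 3), round(micro_f1, 3))  # → 0.499 0.995
```

This is why Macro-F1 is the more informative validation metric here: it exposes the collapsed positive class that Micro-F1 and P@1 hide.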

Thanks!
