pip install -r requirements.txt
data exploration:
python explore_data.py
training:
python train_model.py
- roughly half of the items are non-scientific, the other labels appear equally often, only antibiotic appears less often
- ~15% of items have no helpful text
(see comments in explore_data.py)
Remove character sequences that do not help the model:
- Remove Wikimarkup: "\n" and "=="
- Remove falsy values from texts: "No text", "DisambiguationError"
- Strip blank spaces
Tokenization for Transformer (BERT): tokenize, adding tokens, padding, truncating
(see train_model > tokenize_function)
- transformer architecture as it is SOTA for NLP.
- it also enables to feed two parts of input with [SEP], attention mechanism learns to employ knowledge from both input parts
- BERT as an easy first implementation
- linear layer on top of [CLS]-token logits for the classification, using this token index as a central target where the transformer will be fine-tuned to propagate the required information to
(see train_model > TermClassifier)
- term is the main feature and will be fed as the first part to the transformer
- text could be empty, but important for the task as the term semantic alone would tell very little about its scientific domain; it is fed as a second optional part to the transformer (can be an empty string)
- data split: train, val, test; to make sure that one gets a realistic accuracy score for unseen data
- 20 epochs
- optimizer: AdamW, 5e-5 lr
- scheduler: linear with warmup
- plotting loss and accuracy to see progress/choose best model
On GPU Tesla V100-PCIE-32GB:
Epoch: 1
avg train loss: 0.45981357934108624
val loss: 0.38452588010268907, val accuracy: 86.5
test loss: 0.38804003052568686, test accuracy: 87.16666666666667
----------------------------------------
Epoch: 2
avg train loss: 0.32162566339674714
val loss: 0.4077319192502182, val accuracy: 87.66666666666667
test loss: 0.4174609709744497, test accuracy: 87.5
----------------------------------------
Epoch: 3
avg train loss: 0.2521410969590458
val loss: 0.33876629498379773, val accuracy: 88.58333333333333
test loss: 0.3609540869119034, test accuracy: 88.75
----------------------------------------
Epoch: 4
avg train loss: 0.20726745444039504
val loss: 0.3520806852184857, val accuracy: 89.32777777777777
test loss: 0.4039795075293902, test accuracy: 88.49444444444444
----------------------------------------
Epoch: 5
avg train loss: 0.18417148951635076
val loss: 0.36441500576833885, val accuracy: 88.07777777777777
test loss: 0.393384957042775, test accuracy: 88.33333333333333
----------------------------------------
Epoch: 6
avg train loss: 0.16998945107246982
val loss: 0.36415901160575836, val accuracy: 89.16666666666667
test loss: 0.3725149495061487, test accuracy: 89.41666666666667
----------------------------------------
Epoch: 7
avg train loss: 0.15714695367447953
val loss: 0.3906371349274074, val accuracy: 89.16666666666667
test loss: 0.3880362138935258, test accuracy: 90.25
----------------------------------------
Epoch: 8
avg train loss: 0.1507952446648172
val loss: 0.41255322317408477, val accuracy: 89.16666666666667
test loss: 0.38903524216618585, test accuracy: 89.08333333333333
----------------------------------------
Epoch: 9
avg train loss: 0.14529295997015046
val loss: 0.4172161076998115, val accuracy: 89.57777777777777
test loss: 0.40865574848908487, test accuracy: 90.08333333333333
----------------------------------------
Epoch: 10
avg train loss: 0.13440908850105188
val loss: 0.4193162186234743, val accuracy: 89.07777777777777
test loss: 0.43224049615319626, test accuracy: 89.41666666666667
----------------------------------------
I used the val-accuracy score to determine the best model, then the test-accuracy to check how well it performs within unseen data.
The accuracy on the test data was 90.08% (at epoch 9 with 89.58% val accuracy) which is probably not really suitable to use it yet.
- different transformer architectures could be tested
- hyperparameter optimization (was not enough time within this task)
- find different data sources to wikipedia articles as it is difficult to predict only from the semantics from a term (and ~15% have no matched wikipedia article)