Hello,
it seems that in lstm_imdb.py many words from valid_ds are missing from the vocabulary.
Those words are then embedded as 0 (<unk>). The network still runs, but this biases the evaluation of the net. You can reproduce the problem as follows:
Copy the first 98 lines of the original file and add this loop:
counter = 0
for batch in valid_iter:
    batch_text = batch.text[0]
    for a in batch_text:
        for b in a:
            if b == 0:
                counter += 1
                if counter % 1000 == 0:
                    print(f'{counter} words could not be translated')
print(f'{counter} words could not be translated')
It counts the words embedded as 0 in valid_iter, showing that there are many of them.
I resolved the issue by manually loading the whole dataset and passing it to TEXT.build_vocab as its first argument, but I am sure there is a nicer way to do it.
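To illustrate the workaround, here is a minimal self-contained sketch. The `build_vocab` function below is a toy stand-in for torchtext's `TEXT.build_vocab` (not the real API), and the tiny `train_ds`/`valid_ds` lists are made-up examples; the point is just that a vocabulary built from the training set alone maps validation-only words to index 0 (`<unk>`), while building it over both splits avoids that:

```python
from collections import Counter

def build_vocab(*datasets, specials=("<unk>", "<pad>")):
    # Toy stand-in for TEXT.build_vocab: count tokens across every
    # dataset passed in and assign each token an integer index.
    counter = Counter(tok for ds in datasets for example in ds for tok in example)
    itos = list(specials) + sorted(counter)
    return {tok: i for i, tok in enumerate(itos)}

# Hypothetical mini-datasets standing in for the IMDB splits.
train_ds = [["the", "movie", "was", "great"]]
valid_ds = [["the", "plot", "was", "dull"]]

vocab_train_only = build_vocab(train_ds)          # current behaviour
vocab_both = build_vocab(train_ds, valid_ds)      # the workaround

UNK = 0
missing = sum(1 for ex in valid_ds for tok in ex
              if vocab_train_only.get(tok, UNK) == UNK)
print(missing)  # "plot" and "dull" fall back to <unk> -> 2

missing_after = sum(1 for ex in valid_ds for tok in ex
                    if vocab_both.get(tok, UNK) == UNK)
print(missing_after)  # 0
```

Passing the validation split to the vocabulary builder is what my manual fix amounts to; whether that is desirable (it lets the embedding table "see" validation tokens) is a separate question.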
Hope I could help!