
lstm_imdb validation set is not included in vocabulary #1

@SlothWithCloth


Hello,

It seems that in lstm_imdb.py many words from valid_ds are missing from the vocabulary.
Those words are then mapped to index 0 (<unk>). The network still runs, but this biases its evaluation.
To reproduce the problem, copy the first 98 lines of the original file and add this loop:

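# Count the validation tokens whose index is 0, i.e. <unk>.
counter = 0
for batch in valid_iter:
    batch_text = batch.text[0]  # first element of batch.text: the token indices
    for sequence in batch_text:
        for token in sequence:
            if token == 0:
                counter += 1
                # Report progress every 1000 unknown tokens.
                if counter % 1000 == 0:
                    print(f'{counter} words could not be translated')
print(f'{counter} words could not be translated')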

It counts the words in valid_iter that are mapped to index 0, showing that there are many of them.
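
To put the raw count in perspective, it may help to report the share of unknown tokens as well. Below is a minimal sketch in plain Python, assuming each batch is a nested list of integer indices and that index 0 is <unk>. The function count_unknown and the toy batches are hypothetical and not part of lstm_imdb.py.

```python
def count_unknown(batches, unk_index=0):
    """Return (unknown, total) token counts over an iterable of batches.

    Each batch is assumed to be a list of sequences of integer token indices.
    """
    unknown = 0
    total = 0
    for batch in batches:
        for sequence in batch:
            total += len(sequence)
            unknown += sum(1 for index in sequence if index == unk_index)
    return unknown, total


# Toy data standing in for valid_iter; real batches would be tensors.
toy_batches = [
    [[5, 0, 7], [0, 0, 3]],
    [[2, 9, 0]],
]
unknown, total = count_unknown(toy_batches)
print(f'{unknown} of {total} tokens ({unknown / total:.1%}) are unknown')
```

On real batches, padding tokens would also count toward the total, so they would need to be excluded for an exact rate.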

I resolved the issue by manually loading the whole dataset and passing it to TEXT.build_vocab as the first argument, but I am sure there is a nicer way of doing it.
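
The effect of that workaround can be sketched on toy data without torchtext: a vocabulary built only from the training split leaves validation-only words at index 0, while one built from both splits does not. Here build_vocab is a hypothetical stand-in, not the torchtext method, and the example sentences are made up.

```python
from collections import Counter


def build_vocab(*splits, specials=('<unk>', '<pad>')):
    """Map every word in the given splits to an index; '<unk>' gets index 0."""
    counts = Counter(
        word for split in splits for text in split for word in text.split()
    )
    itos = list(specials) + sorted(counts)
    return {word: index for index, word in enumerate(itos)}


def count_oov(split, vocab):
    """Count the words in `split` that fall back to the '<unk>' index."""
    unk = vocab['<unk>']
    return sum(
        vocab.get(word, unk) == unk for text in split for word in text.split()
    )


train = ['a great movie', 'a dull plot']
valid = ['a brilliant movie', 'a tedious plot']

train_only = build_vocab(train)       # vocabulary from the training split only
combined = build_vocab(train, valid)  # vocabulary from both splits

print(count_oov(valid, train_only))
print(count_oov(valid, combined))
```

If lstm_imdb.py uses the legacy torchtext Field API, build_vocab accepts several datasets as positional arguments, so something like TEXT.build_vocab(train_ds, valid_ds) may be the nicer way. The name train_ds is an assumption here.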

Hope I could help!
