lstm_imdb validation set is not included in vocabulary



Hello,

it seems that in the file `lstm_imdb.py` many words from `valid_ds` are missing from the vocabulary.
Those words are then mapped to index 0 (`<unk>`). The network still runs, but the evaluation is biased. To reproduce the problem, copy the first 98 lines of the original file and add this loop:
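```
# Count the tokens in the validation iterator that map to index 0 (<unk>).
counter = 0
for batch in valid_iter:
    batch_text = batch.text[0]  # tensor of token indices for this batch
    for a in batch_text:
        for b in a:
            if b == 0:
                counter += 1
                # Progress report every 1000 unknown tokens
                if counter % 1000 == 0:
                    print(f'{counter} words could not be translated so far')
print(f'{counter} words could not be translated in total')
```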

It counts the words in `valid_iter` that are embedded as 0, which shows that there are many of them.

I resolved the issue by manually loading the whole dataset and passing it to `TEXT.build_vocab` as the first argument, but I am sure there is a nicer way of doing it.
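
To illustrate the effect without depending on the original script, here is a minimal sketch in plain Python. It is not the torchtext implementation: `build_vocab`, `count_unknown`, and the toy `train`/`valid` token lists are hypothetical names made up for this example. It only assumes, as in the loop above, that index 0 stands for `<unk>`.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # assumed special tokens; <unk> gets index 0


def build_vocab(token_lists, min_freq=1):
    """Hypothetical helper: map each token to an index, with <unk> at 0."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = [UNK, PAD] + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}


def count_unknown(token_lists, stoi):
    """Count tokens that would be looked up as index 0 (<unk>)."""
    return sum(stoi.get(tok, 0) == 0 for tokens in token_lists for tok in tokens)


# Toy data standing in for the training and validation splits.
train = [["a", "great", "movie"], ["a", "dull", "plot"]]
valid = [["a", "great", "plot", "twist"], ["dull", "ending"]]

vocab_train_only = build_vocab(train)      # vocabulary from the training split only
vocab_full = build_vocab(train + valid)    # vocabulary from the whole dataset

unk_train_only = count_unknown(valid, vocab_train_only)
unk_full = count_unknown(valid, vocab_full)
print(f"unknown with train-only vocab: {unk_train_only}")
print(f"unknown with full vocab:       {unk_full}")
```

With the training-only vocabulary, the validation words that never occur in the training split fall back to index 0. With the vocabulary built from the whole dataset, none of them do.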

Hope I could help!
