This is not necessarily wrong, but I want to point out that using a ReLU here is not a very common choice as far as I know. It might not hurt anything, but if something does go wrong, this is one place to check.
Also: you can tie the input/output embedding matrices (i.e. use a single parameter instead of separate `self.embedding.weight` and `self.out` weights). This roughly halves the number of embedding parameters and might help a bit with overfitting. Note that you would still need the bias, which is included in the `self.out` layer.
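To make the tying idea concrete, here is a minimal numpy sketch (not the actual model code): one shared matrix `E` is used both to embed input tokens and, transposed, to project hidden states back to vocabulary logits, with a separate output bias. All names here are illustrative assumptions, not taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4

# One shared parameter serves as both the input embedding table and the
# output projection (weight tying). The output bias stays a separate parameter.
E = rng.normal(size=(vocab_size, d_model))  # tied embedding/projection matrix
b = np.zeros(vocab_size)                    # output bias, kept untied

def embed(token_ids):
    # Input side: look up rows of the shared matrix.
    return E[np.asarray(token_ids)]

def logits(hidden):
    # Output side: project back to the vocabulary with E^T, plus the bias.
    return hidden @ E.T + b

h = embed([1, 2, 3])      # shape (3, d_model)
scores = logits(h)        # shape (3, vocab_size)
```

In a PyTorch module the same effect is usually achieved by assigning `self.out.weight = self.embedding.weight` after construction, so both layers share one tensor.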