
Test data leakage in step 05a (PyTorch and TensorFlow) #55

@jhauffa

Description


The two notebooks `05a - Deep Neural Networks (PyTorch).ipynb` and `05a - Deep Neural Networks (TensorFlow).ipynb` contain the following piece of code:

```python
# The dataset is too small to be useful for deep learning
# So we'll oversample it to increase its size
for i in range(1,3):
    penguins = penguins.append(penguins)
```

This creates a new dataframe that contains four copies of each row of the original dataframe (the loop runs twice, doubling the data each time). Since this happens before the training/test split, the probability that a given original row is present in both the training and the test set is approximately 0.75: with a 70/30 split, the chance that all four copies of a row land in the same set is roughly 0.7^4 + 0.3^4 ≈ 0.25. In other words, one can expect about 3/4 of the original rows to be present in both sets.
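The overlap is easy to verify empirically. Here is a minimal sketch, using a toy stand-in for the penguin dataframe and assuming a 70/30 split via scikit-learn's `train_test_split` (which matches the 0.75 figure above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 344  # roughly the size of the original penguins dataframe
penguins = pd.DataFrame({"orig_id": range(n_rows)})

# Same oversampling as in the notebooks (pd.concat instead of the
# deprecated DataFrame.append): two doublings -> four copies per row
for i in range(1, 3):
    penguins = pd.concat([penguins, penguins], ignore_index=True)

train, test = train_test_split(penguins, test_size=0.30, random_state=0)

# Fraction of original rows that end up in both the training and test set
overlap = set(train["orig_id"]) & set(test["orig_id"])
print(len(overlap) / n_rows)  # ~0.75, matching 1 - 0.7**4 - 0.3**4
```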

This constitutes a leakage of information from the test set into the training set, which renders the test set incapable of assessing the generalization capability of the trained model. In the case of the penguin toy dataset, this does not matter much: the three species appear to be well separated in feature space, so overfitting is not an immediate concern. Still, mixing training and test data is bad practice and should not be taught to ML beginners.

I therefore suggest removing the piece of code shown above. Since the model is then no longer exposed to multiple copies of each row within one training epoch, the number of epochs has to be increased to achieve the same test set accuracy; training for 100 instead of 50 epochs worked well in my tests.
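If more training samples are genuinely needed, the safe ordering is to split first and oversample the training portion only. A minimal sketch, assuming `penguins` is the notebook's dataframe and that the split uses scikit-learn's `train_test_split` with a 70/30 ratio (the notebooks' actual split call may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Split the original, un-duplicated dataframe first
train, test = train_test_split(penguins, test_size=0.30, random_state=0)

# Only then duplicate the training rows; the test set stays free of
# copies of training data, so it can still measure generalization
for i in range(1, 3):
    train = pd.concat([train, train], ignore_index=True)
```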
