What would happen if we used MSE for binary classification?
Some comments on the "Bonus" section of this article from AI Summer.
Edit: The article has been updated with my demonstration (link).
Screenshot of the original article
Review
"When $\hat{y}^{(i)} = 1$" and "When $\hat{y}^{(i)} = 0$" are inverted.
I have no idea why the authors replace $y^{(i)}$ with $\sigma(\theta^\intercal x)$ or $1-\sigma(\theta^\intercal x)$ when the class changes. If properly trained, the model weights should "push" the sigmoid to output 0 or 1 depending on the input $x$.
The proposed demonstration does not actually prove anything. A proper derivation follows:
Let's assume we have a simple neural network with weights $\theta$, such that $z=\theta^\intercal x$, which outputs $\hat{y}=\sigma(z)$ after a sigmoid activation.
The chain rule gives us the gradient of the loss $L$ with respect to the weights $\theta$:
$$\frac{\partial L}{\partial \theta}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial \theta}$$
The MSE loss is expressed as follows:
$$L(y, \hat{y}) = \frac{1}{2}(y-\hat{y})^2$$
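To make the setup above concrete, here is a minimal NumPy sketch (not from the original article) that compares the chain-rule product $\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial \theta}$ against a finite-difference estimate of $\partial L/\partial\theta$ for this MSE loss; the values of $x$, $\theta$ and $y$, and the helper names `sigmoid` and `mse`, are arbitrary choices made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(y, y_hat):
    # L(y, y_hat) = 1/2 * (y - y_hat)^2
    return 0.5 * (y - y_hat) ** 2

# Arbitrary example values, for illustration only
x = np.array([0.5, -1.2, 2.0])
theta = np.array([0.8, 0.3, -0.5])
y = 1.0

# Forward pass: z = theta^T x, y_hat = sigma(z)
z = theta @ x
y_hat = sigmoid(z)

# Chain rule: dL/dtheta = dL/dy_hat * dy_hat/dz * dz/dtheta
dL_dyhat = y_hat - y              # derivative of 1/2 (y - y_hat)^2 w.r.t. y_hat
dyhat_dz = y_hat * (1.0 - y_hat)  # derivative of the sigmoid
dz_dtheta = x                     # derivative of theta^T x w.r.t. theta
grad_chain = dL_dyhat * dyhat_dz * dz_dtheta

# Finite-difference estimate of the same gradient
eps = 1e-6
grad_fd = np.zeros_like(theta)
for i in range(len(theta)):
    theta_plus, theta_minus = theta.copy(), theta.copy()
    theta_plus[i] += eps
    theta_minus[i] -= eps
    grad_fd[i] = (mse(y, sigmoid(theta_plus @ x)) - mse(y, sigmoid(theta_minus @ x))) / (2.0 * eps)

print(grad_chain)  # approx. [-0.072  0.174 -0.290]
print(grad_fd)     # matches the chain-rule result up to numerical precision
```

Both printed gradients should agree to within numerical precision; note how the $\sigma(z)\,(1-\sigma(z))$ factor scales every component of the gradient.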