Hacker News

This really is a fantastic presentation for newcomers to the field. When I was taking these classes I found it difficult to keep all of the available algorithms organized in my mind. Here's an outline of his presentation:

Overview (5 slides)

General Concepts (9 slides)

K-nearest neighbors (6 slides)

Decision trees (6 slides)

K-means (4 slides)

Gradient descent (2 slides)

Linear regression (9 slides)

Perceptron (6 slides)

Principal component analysis (6 slides)

Support vector machine (6 slides)

Bias and variance (4 slides)

Neural networks (6 slides)

Deep learning (15 slides)

I especially like the nonlinear SVM example on slides 57 and 58. It provides a visual of projecting data into a higher dimensional space.




Thanks for the great slides.

Some questions:

I'm a bit confused trying to understand "error function" vs "loss function" (going from Linear regression to perceptron). Coming from a numerical background:

- Is the term 'error function' used for a specific function (like sine, cosine, etc.), or is it a generalized term?

- If it's a specific function (the one that looks like MSE [2]), then it's confusing, because the fixed/specific function called the 'error function' is erf [1], also known as the Gauss error function (and it looks completely different).

- Are we using 'loss function' as a generalized term whose special case is the 'error function'? E.g., in linear regression the loss function is the 'error function' (an MSE-like function), but in the perceptron the loss function is max(0, -xy)?

- Using the final output of the perceptron for the error function makes it a "hard problem", agreed. But what about using the function from linear regression (the MSE-like one) instead of a brand-new function, max(0, -xy)? (It's not very intuitive to reason about what's so special about max(0, -xy).)

- Also wondering why we don't use RMSE instead of MSE in linear regression. (This might have a known explanation in statistics texts, so it's somewhat off-topic.)

[1] https://en.wikipedia.org/wiki/Error_function

[2] https://en.wikipedia.org/wiki/Mean_squared_error


- You can think of error, loss, and cost functions as the same. In fact, two textbooks in front of me say that the loss function is a measure of error. If "loss" is a confusing word, think of it as the "information loss" of the model -- if your model is not perfect, you lose some of the information inherent in the data.

- There is no particular function used for error and loss. Different functions can be chosen based on the model, problem type, ease of theoretical analysis, etc. In practice, the final loss function is often experimentally determined by whatever yields the best accuracy.

- The perceptron uses a different loss function because it is a binary classifier, not a regressor. In this case, because there are only two classes (1 and -1), the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different. Then, the error function just sums these losses together. (Note this is quite similar to MSE.)

- RMSE is also valid -- adding the square root will not affect minimization. MSE is likely more common for minor reasons, such as slightly better efficiency and cleaner theoretical proofs.


"the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different"

Not exactly, because that would be optimizing the number of correctly classified elements. Instead, you minimize the sum of abs(WX) over the misclassified examples.


In the case of these slides, the loss function is max(0, -xy) and the error function is the sum of these. So, the error function is the number of incorrectly classified examples (if x and y are different, it adds 1 to the error), which is exactly what we hope to minimize.

x=1, y=1 => max(0, -(1 × 1)) = 0.

x=1, y=-1 => max(0, -(-1 × 1)) = 1.

x=-1, y=1 => max(0, -(1 × -1)) = 1.

x=-1, y=-1 => max(0, -(-1 × -1)) = 0.
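The four cases above can be checked directly with a small sketch of the per-example loss (here x is the prediction and y the true label, both in {-1, +1}):

```python
# Per-example perceptron loss as discussed above: zero when the
# prediction and the label agree, positive when they disagree.
def perceptron_loss(x, y):
    return max(0, -x * y)

# Reproduce the four cases from the comment above.
cases = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
losses = [perceptron_loss(x, y) for x, y in cases]
# losses == [0, 1, 1, 0]
```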


The transfer function is applied only at evaluation.

In the formulas of the slides (and in the code), for training I compute the loss of an example X and its expected target as L(XW, target). What you define is minimizing L(transfer(XW), target), which is not easily optimizable.


In the case of perceptrons, point taken -- I agree. However, my original statement still holds. The loss and error functions presented on the slides are still valid. Whether or not they are easily optimizable, they are still examples of loss and error functions.


- Is the term 'error function' used for a specific function (like sine, cosine, etc.), or is it a generalized term?

For backpropagation, I call an "error function" a function that takes the training data and the parameters of the model and returns an empirical approximation of how good the model is, in a form we can differentiate.

- Are we using 'loss function' as a generalized term whose special case is the 'error function'? E.g., in linear regression the loss function is the 'error function' (an MSE-like function), but in the perceptron the loss function is max(0, -xy)?

I used a squared error loss to define the error function of the linear regression model. This is arbitrary, and the choice of loss depends on the meaning we want to give to prediction errors. Here, a big error is penalized far more than a small one.
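The "big errors are penalized more" point is easy to see numerically: under a squared loss, one residual of 3 costs more than three residuals of 1 combined.

```python
# Squared error penalizes large residuals disproportionately.
def squared_loss(prediction, target):
    return (prediction - target) ** 2

one_big_error = squared_loss(3.0, 0.0)        # 9.0
three_small_errors = 3 * squared_loss(1.0, 0.0)  # 3.0
```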

- Are we using 'loss function' as a generalized term whose special case is the 'error function'? E.g., in linear regression the loss function is the 'error function' (an MSE-like function), but in the perceptron the loss function is max(0, -xy)?

The way I understand the difference between error and loss is through regularization methods (where you also add a penalty to the error that depends on the parameters of the model). In that case, error = loss + regularization. When applying gradient descent, you differentiate the error, not the loss. The error is an empirical value meant to estimate the performance of the model on unseen data, and is consequently not formally defined.
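The error = loss + regularization split can be sketched as follows for a ridge-style linear regression; the names and the penalty weight here are illustrative, not from the slides.

```python
# Plain-Python sketch: the "loss" measures data fit (MSE), the
# "error" adds an L2 penalty on the weights and is the quantity
# actually differentiated during gradient descent.
def loss(w, X, y):
    # Mean squared error of predictions X @ w against targets y.
    residuals = [sum(wi * xi for wi, xi in zip(w, row)) - yi
                 for row, yi in zip(X, y)]
    return sum(r * r for r in residuals) / len(y)

def error(w, X, y, lam=0.1):
    # error = loss + regularization
    return loss(w, X, y) + lam * sum(wi * wi for wi in w)
```

With a perfect fit the loss is zero, but the regularization term still contributes to the error, which is exactly why the two are worth distinguishing.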

- Using the final output of the perceptron for the error function makes it a "hard problem", agreed. But what about using the function from linear regression (the MSE-like one) instead of a brand-new function, max(0, -xy)? (It's not very intuitive to reason about what's so special about max(0, -xy).)

The loss function used is a hinge loss. If the point is correctly classified, no penalty is added; otherwise, the penalty is the distance to the hyperplane.
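A minimal sketch of that hinge-style behavior, taking the raw score w·x (so the penalty grows with the unnormalized distance to the hyperplane); the function and variable names are illustrative:

```python
# Hinge-style loss: zero for a correctly classified point,
# otherwise proportional to |w.x|, the margin violation.
def hinge_loss(score, label):
    # score = w . x (raw, pre-threshold); label in {-1, +1}
    return max(0.0, -score * label)

hinge_loss(2.5, 1)    # correctly classified -> 0.0
hinge_loss(-2.5, 1)   # misclassified -> 2.5 (distance-scaled penalty)
```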


The reason it's called an error function in perceptron learning is that it relates to how the perceptron is taught a correction for an error. "Loss function" is more general and is usually the word people use when talking about optimization problems.


RMSE and MSE are monotonic transformations of each other, so minimizing one is equivalent to minimizing the other. You can think of linear regression as minimizing RMSE if you like; it's just cleaner to do the math without the extra square root.
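This equivalence is easy to demonstrate: since sqrt is monotonic, any parameter minimizing MSE also minimizes RMSE. A tiny scan over candidate slopes (illustrative data, not from the slides) shows both pick the same minimizer.

```python
import math

# Toy 1-parameter regression: fit y = slope * x to data on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mse(slope):
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def rmse(slope):
    return math.sqrt(mse(slope))

candidates = [1.8, 1.9, 2.0, 2.1, 2.2]
best_by_mse = min(candidates, key=mse)
best_by_rmse = min(candidates, key=rmse)
# best_by_mse == best_by_rmse == 2.0
```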



