I had to make a copy to my Google account to keep the slides.
I think the comment was a play on "script kiddie" (people without real security chops, but who know enough to run an exploit "script" of some sort).
Overview (5 slides)
General Concepts (9 slides)
K-nearest neighbors (6 slides)
Decision trees (6 slides)
K-means (4 slides)
Gradient descent (2 slides)
Linear regression (9 slides)
Perceptron (6 slides)
Principal component analysis (6 slides)
Support vector machine (6 slides)
Bias and variance (4 slides)
Neural networks (6 slides)
Deep learning (15 slides)
I especially like the nonlinear SVM example on slides 57 and 58. It provides a visual of projecting data into a higher dimensional space.
I'm a bit confused trying to understand "error function" vs "loss function" (going from linear regression to the perceptron). Coming from a numerical background:
- Is the term 'error function' a specific, named function (like sin, cosine, etc.), or a generalized term?
- If it's a specific function (the one that looks like MSE), that's confusing, because the fixed/special function called the 'error function' is erf, also known as the Gauss error function (and it looks completely different).
- Are we using 'loss function' as the general term, with 'error function' as a special case? E.g., in linear regression the loss function is the 'error function' (an MSE-like function), but in the perceptron the loss function is max(0, -xy)?
- Using the final output of the perceptron for the error function makes it a "hard problem", agreed. But what about just using the function from linear regression (the MSE-like one) instead of a brand new function max(0, -xy)? (It's not very intuitive to reason about what's so special about max(0, -xy).)
- I'm also wondering why we don't use RMSE instead of MSE in linear regression. (But this might have a known explanation in statistics texts, so it's somewhat off-topic.)
- There is no particular function used for error and loss. Different functions can be chosen based on the model, problem type, ease of theoretical analysis, etc. In practice, the final loss function is often experimentally determined by whatever yields the best accuracy.
- The perceptron uses a different loss function because it is a binary classifier, not a regressor. In this case, because there are only two classes (1 and -1), the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different. Then, the error function just sums these losses together. (Note this is quite similar to MSE.)
- RMSE is also valid -- adding the square root will not affect minimization. MSE is likely more common for minor reasons, such as slightly better efficiency and cleaner theoretical proofs.
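A quick numeric check of that last point (a hypothetical sketch with made-up toy data, not anything from the slides): since the square root is monotonically increasing, the parameter that minimizes MSE also minimizes RMSE.

```python
# Sketch: MSE and RMSE are minimized by the same parameter,
# because sqrt is monotonically increasing.
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]  # hypothetical noisy data, roughly y = x

def mse(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def rmse(w):
    return math.sqrt(mse(w))

# Coarse grid search over the slope w: both criteria pick the same value.
grid = [i / 100 for i in range(0, 201)]  # w in [0, 2]
best_mse = min(grid, key=mse)
best_rmse = min(grid, key=rmse)
assert best_mse == best_rmse
```

So the choice between the two doesn't change the fitted model, only the reported number.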
Not exactly, because that would be optimizing the number of correctly classified elements.
Instead you minimize the sum of abs(WX) over the misclassified examples.
x=1,y=1 => max(0, -(1 x 1)) = 0.
x=1,y=-1 => max(0, -(-1 x 1)) = 1.
x=-1,y=1 => max(0, -(1 x -1)) = 1.
x=-1,y=-1 => max(0, -(-1 x -1)) = 0.
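The four cases above can be checked directly (a minimal sketch; x and y are the predicted and true labels, both in {1, -1}):

```python
# Perceptron-style loss max(0, -x*y) for labels in {1, -1}:
# 0 when the prediction x matches the target y, 1 otherwise.
def loss(x, y):
    return max(0, -x * y)

cases = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
print([loss(x, y) for x, y in cases])  # → [0, 1, 1, 0]
```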
In the formulas on the slides (and in the code), for training I compute the loss of an example X and its expected target as L(XW, target).
What you define is minimizing L(transfer(XW), target), which is not easily optimizable.
For backpropagation, I call "error function" a function that takes the training data and the parameters of the model and returns an empirical approximation of how good the model is, chosen so that it can be differentiated.
I used a squared error loss to define the error function of the linear regression model. This is arbitrary and the choice of the loss depends on the meaning we want to give to prediction errors. Here a big error is penalized way more than a small error.
The way I understand the difference between error and loss is by thinking about regularization methods (where you also add a penalty to the error depending on the parameters of the model). In that case, error = loss + regularization. When applying gradient descent you differentiate the error, not the loss.
The error is an empirical value that is supposed to estimate the performance of the model on unseen data. It is consequently not formally defined.
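The error = loss + regularization split can be made concrete (a hypothetical ridge-style sketch with toy data, not the code from the slides): gradient descent differentiates the whole error, so the regularization term contributes to the gradient and shrinks the parameter.

```python
# Sketch: error = loss + regularization; gradient descent
# differentiates the *error*, not just the loss.
def loss(w, xs, ys):             # squared-error loss
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def error(w, xs, ys, lam):       # loss plus an L2 penalty on w
    return loss(w, xs, ys) + lam * w ** 2

def grad_error(w, xs, ys, lam):  # derivative of the error w.r.t. w
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) + 2 * lam * w

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]  # toy data on the line y = x
w, lam, lr = 0.0, 1.0, 0.01
for _ in range(500):
    w -= lr * grad_error(w, xs, ys, lam)
# The penalty pulls w below the unregularized solution (w = 1 here).
```

Without the penalty the loop would converge to w = 1; with it, w settles at 14/15 ≈ 0.933, which is exactly why you differentiate the error and not the loss alone.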
The loss function used is a hinge loss. If the point is correctly classified no penalty is added, otherwise the penalty is the distance to the hyperplane.
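That penalty can be sketched as max(0, -y * f(x)), where f(x) is the signed distance to the hyperplane (an assumption on my part; the slides may scale or normalize it differently):

```python
# Perceptron-style hinge penalty: zero for correctly classified points,
# otherwise the distance to the hyperplane (hypothetical signed-distance f(x)).
def penalty(y, fx):  # y in {1, -1}, fx = signed distance to the hyperplane
    return max(0.0, -y * fx)

print(penalty(1, 2.5))   # correctly classified → 0.0
print(penalty(1, -0.5))  # misclassified → 0.5, its distance to the hyperplane
```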
Linear and logistic regression, gradient descent, clustering, support vector machines, bias and variance (one of the slides was taken from the course), neural networks, etc...
It's part of a Machine Learning Specialization on Coursera (5 courses + a capstone project) which goes deeper on some areas after the foundations course: https://www.coursera.org/specializations/machine-learning
I am taking this specialization and I have learned a lot so far. The material seems to be at exactly the right level of depth: it balances a high-level overview of the field with enough depth in specific areas to understand how things work and be able to apply them. Disclaimer: I work at Dato, and the CEO of Dato is also one of the instructors of this course.
Does anyone have any other resources?