An introduction to Machine Learning
302 points by antoineaugusti on Jan 17, 2016 | 36 comments

Please note that I'm not the author of the presentation. Made by Quentin de Laroussilhe: http://underflow.fr
I had to make a copy to my Google account to keep the slides.

Worth mentioning that a Stanford Statistical Learning course just started and, according to the lecturers, there is a lot of overlap between the two areas.

Does someone know how much it will cost?

It's free. They have run it a couple of years in a row now, starting in January.

The linked page says "Free".

Thanks!

If you are just starting out with applied machine learning, I would focus heavily on understanding bias and variance, as it will really help you succeed. It's, I think, what (largely) separates the sklearn kiddies from the pros.

What's wrong with sklearn?

Nothing wrong, quite the opposite: scikit-learn is awesome.
I think the comment was a word play on "script kiddie" (ppl w/o real "security chops" but who know enough to run an exploit "script" of some sort).

This really is a fantastic presentation for newcomers to the field. When I was taking these classes I found it difficult to keep all of the available algorithms organized in my mind. Here's an outline of his presentation:
- Overview (5 slides)
- General Concepts (9 slides)
- K nearest Neighbor (6 slides)
- Decision trees (6 slides)
- K means (4 slides)
- Gradient descent (2 slides)
- Linear regression (9 slides)
- Perceptron (6 slides)
- Principal component analysis (6 slides)
- Support vector machine (6 slides)
- Bias and variance (4 slides)
- Neural networks (6 slides)
- Deep learning (15 slides)
I especially like the nonlinear SVM example on slides 57 and 58. It provides a visual of projecting data into a higher dimensional space.

Thanks for the great slides.
Some questions:
I'm a bit confused trying to understand "error function" vs "loss function" (going from linear regression to perceptron).
Coming from a numerical background:
- Is the term 'error function' used as a special function (like sin, cosine, etc.) or is it a generalized term?
- If it's a special function (the one that looks like MSE), then it's confusing, because 'error function' as a fixed/special function is erf, also known as the Gauss error function (and it looks completely different).
- Are we using the term 'loss function' as a generalized term whose special case is 'error function'? E.g., in linear regression the loss function is the 'error function' (MSE-like function), but in the perceptron the loss function is max(0, -xy)?
- Using the final output of the perceptron for the error function makes it a "hard problem", agreed. But what about using just the function from linear regression (the MSE-like one) instead of a brand new function, max(0, -xy)? (It's not very intuitive to reason about what's so special about max(0, -xy).)
- Also wondering why we do not use RMSE instead of MSE in linear regression. (But it might have a known explanation in statistics texts, so somewhat off-topic.)

- You can think of error, loss, and cost functions as the same. In fact, two textbooks in front of me say that the loss function is a measure of error. If "loss" is a confusing word, think of it as the "information loss" of the model -- if your model is not perfect, you lose some of the information inherent in the data.
- There is no particular function used for error and loss. Different functions can be chosen based on the model, problem type, ease of theoretical analysis, etc. In practice, the final loss function is often experimentally determined by whatever yields the best accuracy.
- The perceptron uses a different loss function because it is a binary classifier, not a regressor. In this case, because there are only two classes (1 and -1), the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different. Then, the error function just sums these losses together.
(Note this is quite similar to MSE.)
- RMSE is also valid -- adding the square root will not affect minimization. MSE is likely more common for minor reasons, such as slightly better efficiency and cleaner theoretical proofs.

"the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different"
Not exactly, because this would be optimizing the number of correctly classified elements. Instead you minimize the sum of abs(WX) over the misclassified examples.

In the case of these slides, the loss function is max(0, -xy) and the error function is the sum of these. So, the error function is the number of incorrectly classified examples (if x and y are different, it adds 1 to the error), which is exactly what we hope to minimize.
x=1, y=1 => max(0, -(1 * 1)) = 0.
x=1, y=-1 => max(0, -(-1 * 1)) = 1.
x=-1, y=1 => max(0, -(1 * -1)) = 1.
x=-1, y=-1 => max(0, -(-1 * -1)) = 0.

The transfer function is applied only at evaluation.
In the formulas of the slides (and in the code), for training I compute the loss of an example X and its expected target as: L(XW, target). What you define is minimizing L(transfer(XW), target), which is not easily optimizable.

In the case of perceptrons, point taken -- I agree. However, my original statement still holds. The loss and error functions presented on the slides are still valid. Whether or not they are easily optimizable, they are still examples of loss and error functions.

- Is the term 'error function' used as a special function (like sin, cosine, etc) or is it a generalized term ?
For backpropagation I call "error function" a function which takes the training data and the parameters of the model and returns an empirical approximation of how good the model is, defined so that it can be differentiated.
- Are we using the term 'loss function' as a generalized term? whose special case is 'error function'?
e.g., in linear regression loss function is 'error function' (MSE like function) but in perceptron, loss function is max(0, -xy)?
I used a squared error loss to define the error function of the linear regression model. This is arbitrary, and the choice of the loss depends on the meaning we want to give to prediction errors. Here a big error is penalized way more than a small error.
- Are we using the term 'loss function' as a generalized term? whose special case is 'error function'? e.g., in linear regression loss function is 'error function' (MSE like function) but in perceptron, loss function is max(0, -xy)?
The way I understand the difference between error and loss is by thinking about regularization methods (where you also add a penalty to the error depending on the parameters of the model). In that case error = loss + regularization. When applying gradient descent you would differentiate the error and not the loss. The error is an empirical value that is supposed to estimate the performance of the model on unseen data. It is consequently not formally defined.
- Using final output of perceptron for error function makes it a "hard problem" agreed. But what about using just the function from linear regression (the MSE-like one) instead of a using a brand new function max(0, -xy). (It's not very intuitive to reason what's so special about max(0, -xy)).
The loss function used is a hinge loss. If the point is correctly classified no penalty is added; otherwise the penalty is the distance to the hyperplane.

The reason they call it the error function in perceptron learning is that it relates to how the perceptron is taught a correction for an error. Loss functions are more general, and "loss" is usually the word people use when they're talking about optimization problems.

RMSE and MSE are monotonic transformations of each other, so minimizing one is equivalent to minimizing the other.
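The monotonic-transformation point can be checked numerically. The following is only a minimal sketch under stated assumptions: the toy data, the one-parameter model y_hat = w * x, and the names `mse`, `rmse`, and `candidates` are all hypothetical and not taken from the slides. It searches a coarse grid of slopes and picks the best one under each criterion.

```python
import math

# Hypothetical toy data, made up for illustration (roughly y = 2x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.8]

def mse(w):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def rmse(w):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(w))

# Candidate slopes on a coarse grid: 0.00, 0.01, ..., 4.00.
candidates = [i / 100 for i in range(401)]

# Pick the best slope under each criterion.
best_mse = min(candidates, key=mse)
best_rmse = min(candidates, key=rmse)

print("best slope under MSE: ", best_mse)
print("best slope under RMSE:", best_rmse)
```

Both criteria select the same slope, because the square root preserves the ordering of the non-negative MSE values; only the reported error value changes, not the minimizer.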
You can think of linear regression as minimizing RMSE if you like; it's just cleaner to do the math without the extra square root.

Pretty good summary of what you learn in your first machine learning class in college.

Is there an online course for this I could take?

Andrew Ng's popular Machine Learning course goes over most of the topics in the slides: https://www.coursera.org/learn/machine-learning
Linear and logistic regression, gradient descent, clustering, support vector machines, bias and variance (one of the slides was taken from the course), neural networks, etc.

You know... I believe I started this class once and didn't finish it due to time constraints. I think it's time to try again...

This "Statistical Learning" course just started on Stanford's online platform last week:

I second this question. I couldn't find one on Coursera or Academic Earth.

I liked the UW Coursera class that gave a broad overview of these topics with some applications: https://www.coursera.org/learn/ml-foundations
It's part of a Machine Learning Specialization on Coursera (5 courses + a capstone project) which goes deeper into some areas after the foundations course: https://www.coursera.org/specializations/machine-learning
I am taking this specialization and I have learned a lot so far. The material seems like it's at exactly the right level of depth (it balances a high-level overview of the field with enough depth in specific areas to understand how things work and be able to apply them).

Disclaimer: I work at Dato, and the CEO of Dato is also one of the instructors of this course.

Nobody concerned about plagiarism here? I am pretty sure I've seen a number of the slides and graphics elsewhere. Correct attributions, however, seem to be missing.

I did those slides for a talk at school at the very last minute and I did not expect them to be republished. I requested the edit rights on the document and I'll fix this asap.

Yes, thank you.
I'm hoping to build an ANN this summer and don't have the luxury of taking an actual class.
Does anyone have any other resources?

You should look up MarI/O on YouTube; it may be a good starting point for you.

Thanks! I'll check it out.

That was a really good introduction :) Sort of like an executive summary - all the "why we care" and some of the words you might want to look at to actually learn the details.

Thanks for sharing this!! Is there a corresponding video where the slides are presented?

Nope, sorry. This presentation was given by Quentin de Laroussilhe in Paris at EPITA recently.

Thank you.
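To make the max(0, -xy) exchange earlier in the thread concrete, here is a minimal sketch of the two readings that were being argued over. It assumes targets in {-1, +1}. The helper names `perceptron_loss` and `sign` and the example scores are hypothetical, chosen only for illustration, and are not the slide author's code.

```python
def perceptron_loss(score, target):
    """max(0, -score * target), with target in {-1, +1}."""
    return max(0.0, -score * target)

def sign(score):
    """Hard threshold (the 'transfer function' in the thread)."""
    return 1 if score >= 0 else -1

# Hypothetical (raw score W.X, target) pairs, made up for illustration.
# The second and third examples are misclassified.
examples = [(2.5, 1), (-0.5, 1), (2.0, -1), (-4.0, -1)]

# Reading 1: loss on the raw score, L(XW, target).
# Each misclassified example contributes |W.X|; correct ones contribute 0.
raw_losses = [perceptron_loss(s, t) for s, t in examples]

# Reading 2: loss on the thresholded output, L(transfer(XW), target).
# Each misclassified example contributes exactly 1, so the sum is the
# number of mistakes.
thresholded_losses = [perceptron_loss(sign(s), t) for s, t in examples]

print("raw-score losses:  ", raw_losses, "sum =", sum(raw_losses))
print("thresholded losses:", thresholded_losses, "sum =", sum(thresholded_losses))
```

With these made-up scores the raw-score sum is 2.5 while the thresholded sum is 2, the count of mistakes. The raw-score version changes smoothly as the weights move a misclassified point toward the boundary, which is why it is the easier of the two to optimize with gradient steps.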