
An introduction to Machine Learning - antoineaugusti
https://docs.google.com/presentation/d/1O6ozzZHHxGzU-McpvEG09hl7K6oQDd2Taw0FOlnxJc8/edit?usp=docslist_api
======
antoineaugusti
Please note that I'm not the author of the presentation. It was made by Quentin
de Laroussilhe ([http://underflow.fr](http://underflow.fr)).

I had to make a copy to my Google account to keep the slides.

------
rafaquintanilha
Worth mentioning that Stanford's Statistical Learning course [1] just started,
and according to the lecturers there is a lot of overlap between the two areas.

[1]
[https://lagunita.stanford.edu/courses/HumanitiesSciences/Sta...](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)

~~~
hartator
Does anyone know how much it will cost?

~~~
nightski
It's free. They have run it a couple of years in a row now, starting in January.

------
compactmani
If you are just starting out with applied machine learning, I would focus
heavily on understanding bias and variance, as it will really help you succeed.
I think it's (largely) what separates the sklearn kiddies from the pros.
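
Not from the slides, just my own illustration: a minimal sketch of diagnosing
the two with scikit-learn's validation_curve. The synthetic dataset, the choice
of a decision tree, and the depth range are all arbitrary.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import validation_curve
    from sklearn.tree import DecisionTreeClassifier

    # Arbitrary synthetic dataset, purely for illustration.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Sweep model capacity: shallow trees underfit (high bias),
    # deep trees overfit (high variance).
    train_scores, val_scores = validation_curve(
        DecisionTreeClassifier(random_state=0), X, y,
        param_name="max_depth", param_range=range(1, 16), cv=5)

    # A big gap between train and validation accuracy signals high
    # variance; low scores on both signal high bias.
    for depth, tr, va in zip(range(1, 16), train_scores.mean(axis=1),
                             val_scores.mean(axis=1)):
        print(f"depth={depth:2d}  train={tr:.3f}  val={va:.3f}")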

~~~
johnmarinelli
what's wrong with sklearn?

~~~
ivan_ah
Nothing wrong; quite the opposite, scikit-learn is awesome.

I think the comment was a word play on "script kiddie" (people without real
"security chops" who know just enough to run an exploit "script" of some sort).

------
aabajian
This really is a fantastic presentation for newcomers to the field. When I was
taking these classes I found it difficult to keep all of the available
algorithms organized in my mind. Here's an outline of his presentation:

Overview (5 slides)

General Concepts (9 slides)

K-nearest neighbors (6 slides)

Decision trees (6 slides)

K-means (4 slides)

Gradient descent (2 slides)

Linear regression (9 slides)

Perceptron (6 slides)

Principal component analysis (6 slides)

Support vector machine (6 slides)

Bias and variance (4 slides)

Neural networks (6 slides)

Deep learning (15 slides)

I especially like the nonlinear SVM example on slides 57 and 58. It provides a
nice visual of projecting data into a higher-dimensional space.
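
If you want to play with that idea in code, here is a minimal sketch (my own,
not from the slides) using scikit-learn: the RBF kernel implicitly projects
the points into a higher-dimensional space where two concentric circles become
separable.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Two concentric circles: not linearly separable in the original 2D space.
    X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A linear SVM struggles, while the RBF kernel implicitly maps the
    # points into a higher-dimensional space where they become separable.
    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X_train, y_train)
        print(kernel, clf.score(X_test, y_test))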

~~~
fizixer
Thanks for the great slides.

Some questions:

I'm a bit confused trying to understand "error function" vs. "loss function"
(going from linear regression to the perceptron). Coming from a numerical
background:

- Is the term 'error function' used for a specific function (like sine, cosine,
etc.) or is it a generic term?

- If it's a specific function (the one that looks like MSE [2]), then it's
confusing, because the fixed/specific 'error function' is erf [1], also known
as the Gauss error function (which looks completely different).

- Are we using the term 'loss function' as a generic term whose special case is
the 'error function'? E.g., in linear regression the loss function is the
'error function' (the MSE-like function), but in the perceptron the loss
function is max(0, -xy)?

- Using the final output of the perceptron for the error function makes it a
"hard problem", agreed. But what about just using the function from linear
regression (the MSE-like one) instead of a brand-new function max(0, -xy)?
(It's not very intuitive to reason about what's so special about max(0, -xy).)

- Also wondering why we don't use RMSE instead of MSE in linear regression.
(But it might have a known explanation in statistics texts, so somewhat off-
topic.)

[1]
[https://en.wikipedia.org/wiki/Error_function](https://en.wikipedia.org/wiki/Error_function)

[2]
[https://en.wikipedia.org/wiki/Mean_squared_error](https://en.wikipedia.org/wiki/Mean_squared_error)

~~~
psyklic
- You can think of error, loss, and cost functions as the same. In fact, two
textbooks in front of me say that the loss function is a measure of error. If
"loss" is a confusing word, think of it as the "information loss" of the model
-- if your model is not perfect, you lose some of the information inherent in
the data.

- There is no particular function used for error and loss. Different
functions can be chosen based on the model, problem type, ease of theoretical
analysis, etc. In practice, the final loss function is often experimentally
determined by whatever yields the best accuracy.

- The perceptron uses a different loss function because it is a binary
classifier, not a regressor. In this case, because there are only two classes
(1 and -1), the loss function max(0, -xy) is 0 if x and y are the same class
and 1 if they are different. Then, the error function just sums these losses
together. (Note this is quite similar to MSE.)
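
To make that concrete, here is a tiny sketch (my own, using the convention
above that both the prediction x and the label y are in {-1, +1}):

    # Both the prediction x and the target y are in {-1, +1}, so
    # max(0, -x*y) is 0 when they agree and 1 when they disagree.
    def loss(prediction, target):
        return max(0, -prediction * target)

    predictions = [1, 1, -1, -1]
    targets     = [1, -1, 1, -1]

    # The error function sums the per-example losses.
    error = sum(loss(p, t) for p, t in zip(predictions, targets))
    print(error)  # 2: two of the four examples are misclassified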

- RMSE is also valid -- adding the square root will not affect minimization.
MSE is likely more common for minor reasons, such as slightly better
efficiency and cleaner theoretical proofs.
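
A quick numeric illustration of that point (mine, with made-up predictions):
the square root is monotonically increasing, so it preserves the ordering of
MSE values, and whichever parameters minimize MSE also minimize RMSE.

    import math

    targets = [1.0, 2.0, 3.0]

    def mse(predictions):
        return sum((t - p) ** 2
                   for t, p in zip(targets, predictions)) / len(targets)

    model_a = [1.1, 2.0, 2.9]  # closer fit
    model_b = [1.5, 2.5, 3.5]  # worse fit

    # sqrt preserves the ordering: whichever model has the lower MSE
    # also has the lower RMSE, so the same parameters minimize both.
    for name, preds in [("a", model_a), ("b", model_b)]:
        print(name, mse(preds), math.sqrt(mse(preds)))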

~~~
underflow
"the loss function max(0, -xy) is 0 if x and y are the same class and 1 if
they are different"

Not exactly, because that would be optimizing the number of correctly
classified elements. Instead you minimize the sum of abs(WX) over the
misclassified examples.
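
For reference, a minimal sketch of a training loop for that criterion (my own,
not the code from the slides): for a misclassified example the loss is
-y(W.X) = abs(W.X), its gradient with respect to W is -yX, and gradient
descent gives the classic update W += lr * y * X.

    # Perceptron training: for each misclassified example
    # (sign of w.x disagrees with y), nudge w by lr * y * x, the
    # negative gradient of the loss -y * (w.x).
    def train(examples, n_features, lr=0.1, epochs=20):
        w = [0.0] * n_features
        for _ in range(epochs):
            for x, y in examples:
                score = sum(wi * xi for wi, xi in zip(w, x))
                if y * score <= 0:  # misclassified (or on the boundary)
                    w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        return w

    # Toy AND-like data with a constant bias feature appended (x[2] == 1).
    data = [([0, 0, 1], -1), ([0, 1, 1], -1),
            ([1, 0, 1], -1), ([1, 1, 1], 1)]
    print(train(data, 3))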

~~~
psyklic
In the case of these slides, the loss function is max(0, -xy) and the error
function is the sum of these. So, the error function is the number of
incorrectly classified examples (if x and y are different, it adds 1 to the
error), which is exactly what we hope to minimize.

x=1, y=1 => max(0, -(1 * 1)) = 0.

x=1, y=-1 => max(0, -(1 * -1)) = 1.

x=-1, y=1 => max(0, -(-1 * 1)) = 1.

x=-1, y=-1 => max(0, -(-1 * -1)) = 0.

~~~
underflow
The transfer function is applied only at evaluation.

In the formulas of the slides (and in the code), for training I compute the
loss of an example X and its expected target as L(XW, target). What you
describe is minimizing L(transfer(XW), target), which is not easily optimizable.
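
In code terms, the distinction looks roughly like this (a sketch of the idea,
not the document's actual code):

    def transfer(score):  # the step/sign transfer function
        return 1 if score >= 0 else -1

    def raw_score(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    # Training: the loss is computed on the raw score XW, a continuous
    # (and optimizable) function of the weights.
    def training_loss(w, x, target):
        return max(0, -target * raw_score(w, x))

    # Evaluation: only here is the transfer function applied, to turn
    # the raw score into a class label.
    def predict(w, x):
        return transfer(raw_score(w, x))

    w = [0.5, -0.25]
    print(training_loss(w, [1.0, 1.0], -1), predict(w, [1.0, 1.0]))  # 0.25 1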

~~~
psyklic
In the case of perceptrons, point taken -- I agree. However, my original
statement holds: the loss and error functions presented on the slides are
valid. Whether or not they are easily optimizable, they are still examples of
loss and error functions.

------
yelnatz
Pretty good summary of what you learn in your first machine learning class in
college.

------
lectrick
Is there an online course for this I could take?

~~~
synotic
Andrew Ng's popular Machine Learning course goes over most of the topics in
the slides: [https://www.coursera.org/learn/machine-learning](https://www.coursera.org/learn/machine-learning)

Linear and logistic regression, gradient descent, clustering, support vector
machines, bias and variance (one of the slides was taken from the course),
neural networks, etc.

~~~
lectrick
You know... I believe I started this class once and didn't finish it due to
time constraints. I think it's time to try again...

------
fnl
Is nobody concerned about plagiarism here? I'm pretty sure I've seen a number
of these slides and graphics elsewhere, but correct attributions seem to be
missing.

~~~
underflow
I made those slides for a talk at school at the very last minute and did not
expect them to be republished. I've requested edit rights on the document and
will fix this ASAP.

------
kendallpark
Yes, thank you. I'm hoping to build an ANN this summer and don't have the
luxury of taking an actual class.

Does anyone have any other resources?

~~~
bl4ckdu5t
You should look up MarI/O on YouTube; it may be a good starting point for you.

~~~
kendallpark
Thanks! I'll check it out.

------
aerioux
That was a really good introduction :) Sort of like an executive summary: all
the "why we care", plus some of the terms you might want to look up to
actually learn the details.

------
max_
Thanks for sharing this!

------
Dowwie
Is there a corresponding video where the slides are presented?

~~~
antoineaugusti
Nope, sorry. This presentation was given by Quentin de Laroussilhe in Paris at
EPITA recently.

------
remriel
Thank you.

