Some sections may be less relevant, depending on what you want to do, but Section III is a very good introduction to machine learning methods.
Do the exercises as you're reading. Theory is one thing, but in ML my rule of thumb is that you don't really understand a model until you've coded it up. A collection of written exercises would be a good way to impress an interviewer, too.
⸰ Textbook homepage: http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=...
⸰ Online version homepage (should always contain link to latest revision): http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=...
⸰ Directory sorted by date: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/?C=M;O=D
 - http://www.computervisionmodels.com/
* fast.ai ML course: http://forums.fast.ai/t/another-treat-early-access-to-intro-...
* fast.ai DL course: part 1: http://course.fast.ai/ part 2: http://course.fast.ai/part2.html
The fast.ai courses spend very little time on theory, and you can follow the videos at your own pace.
* The best books on ML (excluding DL), in my view, are "An Introduction to Statistical Learning" by James, Witten, Hastie and Tibshirani, and "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman. The Elements arguably belongs on every ML practitioner's bookshelf -- it's a fantastic reference manual.[b]
* The only book on DL that I'm aware of is "Deep Learning," by Goodfellow, Bengio and Courville. It's a good book, but I suggest holding off on reading it until you've had a chance to experiment with a range of deep learning models. Otherwise, you will get very little use out of it.[c]
[a] Scroll down on this page for their bios: http://course.fast.ai/about.html
[b] Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/ The Elements of Statistical Learning: https://web.stanford.edu/~hastie/ElemStatLearn/
I'm using some of those for $(redacted, the plan is to make money). Don't follow the crowd.
I think with hindsight, it's great to have a broad spectrum of methods available to you, but if you focus too much on methods at the hard-math end of the spectrum just for the sake of an intellectual challenge, you can end up fixated on an exotic solution looking for a problem while the rest of the field moves on, rather than doing useful engineering people care about.
Maybe you find a niche where something exotic really helps, maybe you don't -- maybe for research this is a risk worth taking. But just something to keep in mind.
IMO: breadth is good. Mathematical maturity helps. If one sticks around, one eventually finds uses for interesting maths, but it's not worth trying to force it.
Another avenue for people who want to use some hardcore math: try to use it to build some good theory around why the things that work well, work well. Not an easy task either, by any means.
Just throw more GPUs at it bro. Gradient descent 4ever! xD
You may also enjoy graycat's comments on HN. He knows his math but is contrarian about machine learning; it's good to have that viewpoint.
For general machine learning, there are many, many books. A good intro is  and a more comprehensive, reference sort of book is . Frankly, by this point, even reading the documentation and user guide of scikit-learn has a fairly good mathematical presentation of many algorithms. Another good reference book is .
Finally, I would also recommend supplementing some of that stuff with Bayesian analysis, which can address many of the same problems, or be intermixed with machine learning algorithms, but which is important for a lot of other reasons too (MCMC sampling, hierarchical regression, small data problems). For that I would recommend  and .
Stay away from bootcamps or books or lectures that seem overly branded with “data science.” This usually means more focus on data pipeline tooling, data cleaning, shallow details about a specific software package, and side tasks like wrapping something in a webservice.
That stuff is extremely easy to learn on the job and usually needs to be tailored differently for every different project or employer, so it’s a relative waste of time unless it is the only way you can get a job.
: < https://www.amazon.com/Deep-Learning-Adaptive-Computation-Ma... >
: < https://www.amazon.com/Pattern-Classification-Pt-1-Richard-D... >
: < https://www.amazon.com/Pattern-Recognition-Learning-Informat... >
: < http://www.web.stanford.edu/~hastie/ElemStatLearn/ >
: < http://www.stat.columbia.edu/~gelman/book/ >
: < http://www.stat.columbia.edu/~gelman/arm/ >
I just read over his description of how to transform a uniform random variable into a variable with a desired distribution (p. 526). It's a fairly easy trick, but if I didn't already know it, I wouldn't understand his explanation.
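The trick being described is usually called inverse transform sampling: push a uniform sample through the inverse of the target CDF. A minimal sketch in Python (the choice of an exponential target distribution is mine, not the book's):

```python
import math
import random

random.seed(0)

def sample_exponential(rate):
    """Draw from an Exponential(rate) distribution by inverting its CDF.

    The CDF is F(x) = 1 - exp(-rate * x); solving F(x) = u for a
    uniform u in [0, 1) gives x = -ln(1 - u) / rate.
    """
    u = random.random()              # uniform sample on [0, 1)
    return -math.log(1.0 - u) / rate

# Sanity check: the empirical mean should be close to 1/rate = 0.5.
samples = [sample_exponential(2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)
```

The same recipe works for any distribution whose CDF you can invert, which is why it shows up so early in most textbooks.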
What are your thoughts/interests on analysis for ML, like the approximation theory that branched from Fourier and wavelet analysis, catching on with Cybenko, continuing with, among others, Mhaskar, and most recently added to by Bolcskei? And then there are other areas where analysis applies, like studying the optimization of such networks...
(And of course, this is just for NNs. There are other areas of research where analysis comes into ML. And of course, real analysis lays the foundations for probability and statistics, and is not abstracted away in many research areas in these fields.)
This is also a good resource for machine learning.
I didn't learn artificial neural network stuff from there. I knew those concepts, but I didn't know the matrix formalism applied to them. So this was really nice for understanding why GPUs are good for this. Math-wise, it was a really nice watch.
In some ways, it is a "rich-get-richer" effect. But creators like 3B1B expend a lot of time and resources to do what they do, and the word of mouth he gets is an acknowledgement that the work he does is worth the money and views we provide.
The web has allowed the sharing of high quality content across the world for little to no cost, but has also created so much noise for the average user that they have little to no hope of finding the high quality content on their own. Comment sections across the web fix this problem by promoting producers that offer a superior product. This encourages everyone to make better content.
So, some time ago I contracted out writing some code for fitting a logistic regression onto a given set of observations. There were some specific requirements, but I think I should have been able to piece something together myself using mainstream LA libraries; some of them even hint at "you could fit a logistic regression using these functions," but give no complete examples. But I didn't understand it well enough, so I contracted it out.
The woman who ended up writing the code used a 'Hessian' matrix to do so (she actually wrote two functions doing the same thing; one used this Hessian approach -- I think the idea was that it would be faster, but there wasn't a lot of time, and it never got tweaked enough to make a difference).
So my question - is there a layman's explanation for what a Hessian matrix is, and how it applies to fitting a logistic model? Also (with an eye to the future of my project), does it have applications for non-linear regressions?
Alternatively, are there any books where this is covered? I have most standard stats/applied stats/operations research books, as well as a few like the no bullshit guide to linear algebra, but none cover this specific issue - or even how to fit a logistic regression at all, on a practical level (so not just 'conceptually you do xyz, implementation is left as an exercise for the reader').
So regression is just a minimization problem. You're trying to find the parameter values that minimize some error function f(X, Y, Z), and at a minimum the gradient vanishes:
∇f(X, Y, Z) = 0
Let's see how this plays out when you have a function that you're trying to minimize that only has 1 parameter. This is then just a regular old function of one variable, and we can easily visualize it.
[Sketch: a plot of f(x) = x^2 + 2x and its derivative 2x + 2; the derivative crosses zero at x = -1, which is exactly where the original curve bottoms out.]
But in many cases you may not know the functional form, or there may not be a way to solve for the roots symbolically (or you don't know the tricks to do so), so you might want to try a strictly numerical approach. You say, "Look, I know how to compute the original function; can't I just use that directly?"
Well, one option would be to find two points on your function, one where it's positive, the other where it's negative, and then keep bisecting the interval until you zoom in on the point where it crosses the x-axis (with additional logic to handle multiple crossings). Of course, if your function is discontinuous, or looks discontinuous due to floating-point error, you're going to be hosed. This is the bisection method. However, it's not as fast as is desirable. Each extra decimal point is a factor-of-10 increase in precision, and you typically want several decimal points at least, so increasing your precision by a mere factor of 2 per step may not be good enough. This is especially true if you're doing this by hand. :)
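A minimal sketch of bisection in Python (the function names are my own):

```python
def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) differ in sign.

    Each iteration halves the bracket, so precision only grows by a
    factor of 2 per function evaluation -- hence the slowness noted above.
    """
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if f_lo * f_mid <= 0:        # sign change: root is in the left half
            hi = mid
        else:                        # otherwise it's in the right half
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2.0

# Root of the derivative 2x + 2 from the example earlier: x = -1.
root = bisect(lambda x: 2 * x + 2, -10.0, 10.0)
```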
So a faster method is to start at a point, draw a line tangent to the function at that point, and see where that line intersects the x-axis. Evaluate the function there, and normally it'll be closer to a zero than before. Repeat. This, in many cases, converges much faster than bisection.
This is known as the Newton-Raphson method. Now, in order to draw a tangent line, you have to know how the function is changing at that point, so that the line's slope matches the function's, and, well, that means you take the derivative. Since, however, the function we're trying to find the root of is itself a derivative, this is now a second derivative.
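A sketch of the one-dimensional version in code (my own illustration, not from any particular library):

```python
def newton_raphson(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Find a root of f starting from x0 by following tangent lines.

    Each step jumps to where the tangent at the current point crosses
    the x-axis: x_new = x - f(x) / f'(x).
    """
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:          # converged: the jumps have become tiny
            break
    return x

# Root of f(x) = x^2 - 2 starting near 1: converges to sqrt(2) in a
# handful of iterations, far fewer than bisection would need.
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```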
So it turns out Newton-Raphson generalizes upwards in dimension. When you start off with the error function you're trying to minimize, you take its derivative, but now you have to track how it changes in N dimensions, so the derivative object is now a vector.
Now, we're trying to find where this vector-valued function (the gradient) equals zero. So we take its derivative, which, since the function is vector-valued, will now be a matrix, since each component can vary in N directions. So we now have an N×N matrix that tracks how everything is changing. This second-order derivative is called the Hessian. And we can use it the same way (implementation left to the reader ;) ) as we did in one dimension.
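Or, ignoring the smiley and spelling the multidimensional version out with numpy (the example function is my own choice; a real implementation would add safeguards like line search):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimize a function of several variables with Newton's method.

    Each step solves H(x) @ step = -grad(x), i.e. it finds a root of
    the gradient exactly as in one dimension, with the Hessian playing
    the role of the second derivative.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), -grad(x))
        x += step
        if np.linalg.norm(step) < tol:
            break
    return x

# f(x, y) = cosh(x - 1) + cosh(y + 2) has its minimum at (1, -2).
grad = lambda v: np.array([np.sinh(v[0] - 1), np.sinh(v[1] + 2)])
hess = lambda v: np.diag([np.cosh(v[0] - 1), np.cosh(v[1] + 2)])
x_min = newton_minimize(grad, hess, x0=[4.0, 3.0])
```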
Say we have a curve that corresponds to how good of a fit your model is. We want to try to find the maximum on that curve. However, calculating every point of the curve is too expensive, so we want to minimize the number of points we have to check. So, we start with a guess as to the highest point on the curve and take the first and second derivatives of the curve at that point. This gives us enough to fit a parabola that approximates the curve in the neighborhood of our initial guess point. Then, it's pretty easy to solve for the highest point on the parabola. That's our new guess. Repeat that a few times until the guesses stop shifting much. If the curve is nicely shaped (i.e. smooth everywhere, only has one maximum), the guesses will converge on the highest point.
This is often faster than a similar method, gradient ascent, which relies on only the first derivative. That yields a line in the vicinity of our guess, and then we just move the guess a little bit uphill along that line. This is pretty slow, since it can't jump straight to a guess of the top, and if you move too fast, it'll blow right past the maximum.
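The two approaches are easy to compare in one dimension (a toy sketch; the names and example function are my own):

```python
def maximize_newton(f_prime, f_double_prime, x0, n_iter=10):
    """Repeatedly fit a parabola at the current guess and jump to its vertex.

    The vertex of the fitted parabola sits at x - f'(x) / f''(x).
    """
    x = x0
    for _ in range(n_iter):
        x -= f_prime(x) / f_double_prime(x)
    return x

def maximize_gradient_ascent(f_prime, x0, lr=0.1, n_iter=1000):
    """Move a small step uphill along the slope each iteration."""
    x = x0
    for _ in range(n_iter):
        x += lr * f_prime(x)
    return x

# Maximize f(x) = -(x - 3)^2. Both find x = 3, but the parabola-fitting
# method lands there immediately (f is itself a parabola), while
# gradient ascent creeps toward it over many small steps.
newton_x = maximize_newton(lambda x: -2 * (x - 3), lambda x: -2.0, x0=0.0)
ga_x = maximize_gradient_ascent(lambda x: -2 * (x - 3), x0=0.0)
```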
The Hessian matrix is the higher-dimensional equivalent of the second derivative there, and the gradient is the equivalent of the first derivative. For example, if we have a two-dimensional surface in 3D, then the Hessian will be 2x2, and it captures the curvature of the fitted paraboloid in the vicinity of the guess. As you go up in dimensionality, the fitted surfaces are called quadric hypersurfaces.
When you're fitting a logistic regression, your hypersurface is the logarithm of the likelihood that the data you have fits the logistic curve with parameters at that point. The logarithm makes the hypersurface better behaved and makes the calculus easier. You just need the gradient and the Hessian, evaluate those at your initial guess, fit a quadric hypersurface to the guess there, pop up to the top of that hypersurface, repeat a few times, and you've got your model.
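For the concrete question upthread, the whole recipe fits in a short numpy sketch (my own illustration of the standard Newton / iteratively reweighted least squares update, assuming 0/1 labels and a design matrix with an intercept column -- not production code, e.g. it has no check for separable data, where the MLE diverges):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton's method.

    X: (n, d) design matrix (include a column of ones for an intercept).
    y: (n,) array of 0/1 labels.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # predicted probabilities
        gradient = X.T @ (y - p)             # gradient of the log-likelihood
        W = p * (1 - p)                      # diagonal of the weight matrix
        hessian = -(X.T * W) @ X             # Hessian: -X^T W X
        w -= np.linalg.solve(hessian, gradient)  # Newton step
    return w

# Toy data: labels are a noisy threshold of x, so the problem is not
# perfectly separable and the maximum-likelihood weights are finite.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])
w = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ w))
```

At convergence the gradient X^T (y - p_hat) is essentially zero, which is exactly the "pop up to the top of the quadric and repeat" procedure described above.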
something of a loss of perspective.
You'd have to provide some alternative perspective or argument that goes beyond 'pretty awful sounds kinda mean'.
Remember what we have here: a free, actively maintained, accurate, comprehensive and advanced corpus of expository writing on mathematics.
We already have a few of those. As I mentioned, MathWorld is far better at this, and it's been around longer than Wikipedia.
See the third section here for an intuitive image of repeated parabola-fitting.
I'm not sure how you saw it used for fitting a logistic model, but in general knowing the concavity of a function is useful for minimizing or maximizing it (e.g. by repeatedly approximating it as a 2nd-order polynomial and minimizing that), so maybe that's what you saw.
Probably some 2nd derivative version of gradient descent or Newton's method.
At first order, you can approximate your function locally as a plane (the plane that goes through the point and has the same first derivatives), and to minimize that you take a small step (because your approximation is only valid locally) in the direction the plane slopes downward.
Alternatively, you can make a better approximation of your function using higher-order derivatives. So instead of approximating your function with a plane, you approximate it with a quadratic form (the multidimensional extension of the parabola). This quadratic form matches the first derivatives and also the Hessian (the second derivatives, in multiple dimensions) of your function at your current parameters.
Once you have the approximation, there is a closed-form formula for the minimum of the quadratic form, so you can jump directly closer to the result (though how well this performs depends on how closely your function resembles your approximation).
When to pick a first-order or second-order approximation is problem dependent, but a quick rule of thumb is that second order is faster when close to the solution but consumes memory quadratically in the number of dimensions, so it's only applicable when the dimension is low, or when you have problem-specific simplifications, like your problem being a sum of squares.
In practice, interesting problems are too big and first-order methods are all you can do. But you can also improve things a little by approximating the diagonal of the Hessian, or some low-rank approximation of it (another tractable kind of approximation of your function). You can also make a probabilistic approximation of your function (only considering a few examples instead of the whole training set), and from that you can derive all the "on-line" methods, but that's a story for another day.
The Hessian is a generalization of the second derivative of elementary calculus. Recall that the second derivative of a function f(x) lets you distinguish the convex (f'' > 0) and concave (f'' < 0) parts of the function. If you are at a local extremum (f' = 0), it lets you distinguish maxima from minima.
The Hessian does the same thing, but for functions of several variables. It says in which spatial directions your function is convex or concave. At a local maximum, it is concave in all directions, and at a local minimum it is convex in all directions. At a saddle point, there will be directions where it is concave and directions where it is convex. The eigendecomposition of the Hessian lets you find these directions.
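A small numpy sketch of that last step (the example function is mine):

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin.
# Its Hessian there (constant, since f is quadratic) is:
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

# eigh handles symmetric matrices and returns eigenvalues in
# ascending order, with eigenvectors as the columns of the 2nd result.
eigenvalues, eigenvectors = np.linalg.eigh(H)

# A positive eigenvalue means the function curves up (convex) along the
# corresponding eigenvector; a negative one means it curves down
# (concave). Mixed signs identify a saddle point.
is_saddle = (eigenvalues.min() < 0) and (eigenvalues.max() > 0)
```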
It is useful for function approximation because the 2nd-degree coefficients of the polynomial that best fits your function locally are given by the entries of the Hessian matrix (by Taylor's theorem).
You can find the maximum by (1) gradient ascent, or (2) the analogue of Newton's method in multiple dimensions [++], which involves computing the Hessian matrix.
So there's your Hessian.
For a binary classification task, one could simply calculate the mean squared error between predicted values and actual labels (as in linear regression) and then find the optimal weights iteratively using gradient descent. But combined with the sigmoid shape of the logistic function, squared error makes the loss non-convex, so gradient descent is a poor choice of optimization technique (there's no guarantee of finding a global optimum).
A surer way to find globally optimal weights is Newton's method of calculating weight updates. This is a numerical optimization technique that requires calculating the 1st- and 2nd-order derivatives of the error function. The vector of 1st-order derivatives is the gradient (the Jacobian, in the vector-valued case), and the matrix of 2nd-order derivatives is the Hessian...
(I'm approaching this from a graduate-level stats angle.) Just as the score vector is the derivative of the log-likelihood with respect to the parameters, the Hessian matrix is the derivative of the score vector. It's simply the second derivative of the log-likelihood with respect to the parameters.
It also builds some really simple Python code to create a simple (non-"deep", i.e. with just one hidden layer) neural network capable of recognizing human-drawn digits with good accuracy.