
Linear Regression - simonwardjones
https://www.simonwardjones.co.uk/posts/linear_regression/
======
quietbritishjim
This article had a brutal leap from being aimed at someone with barely any
understanding of maths (complete with friendly emojis) to using cost
functions (without any explanation) and associated code. It's a bit like the
"how to draw an owl" meme.

A good article on linear regression, in my opinion, would break it down into
three steps:

1. Spend a bit of time looking at cost functions. In principle linear
regression is finding the "best" line, where by "best" we consider all
possible lines (yes, all uncountably many of them), compute the cost
function for each one, and pick the one where the cost comes out lowest. You
want to show a few example lines on top of some example points and label their
costs. Start with absolute deviation (i.e. the l1 norm) - let's face it,
that's _really_ the most obvious cost function if you don't already know
what comes next - then contrast with least squares (i.e. the l2 norm). For
example, note that the least squares cost function "cares" more about points
that are a particularly long way away.

2. Admit that, OK, we do want an algorithm more sensible than "try every
possible line, laboriously computing the cost of each one". Now you can talk
about gradient descent - and I mean WHY you use gradient descent, not the
computation. And now you can mention that least squares is differentiable, so
it works nicely with gradient descent, which is the real reason we tend to
prefer it over absolute deviation.

3. Finally, after both of those, you can solve the gradient descent equations
and show some associated code.
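
To make that concrete, here's a rough numpy sketch of steps 1 and 2 - the
data, candidate lines, learning rate and iteration count are all invented for
illustration, not taken from the article:

    import numpy as np

    # Toy data: y is roughly 2x + 1 plus noise
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2 * x + 1 + rng.normal(scale=2, size=x.size)

    def l1_cost(m, b):
        # Mean absolute deviation of the line y = m*x + b
        return np.mean(np.abs(y - (m * x + b)))

    def l2_cost(m, b):
        # Mean squared error; "cares" more about far-away points
        return np.mean((y - (m * x + b)) ** 2)

    # Step 1: label a few candidate lines with their costs
    for m, b in [(1.0, 0.0), (2.0, 1.0), (3.0, -2.0)]:
        print(f"m={m}, b={b}: l1={l1_cost(m, b):.2f}, l2={l2_cost(m, b):.2f}")

    # Step 2: gradient descent on the differentiable l2 cost
    m, b = 0.0, 0.0
    learning_rate = 0.01
    for _ in range(5000):
        residual = y - (m * x + b)
        m -= learning_rate * (-2 * np.mean(residual * x))  # d(l2)/dm
        b -= learning_rate * (-2 * np.mean(residual))      # d(l2)/db
    print(f"gradient descent result: m={m:.2f}, b={b:.2f}")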

~~~
simonwardjones
haha - I love the use of the word brutal here! It is a brutal leap.

I wanted to give the ideas in the simplest way with friendly emojis and then
go on to the more complex derivation with the cost function etc. I haven't
explained the idea of the cost function enough. I think your idea of starting
with the cost function (as a concept - not a formula) and absolute error may
have been nicer, to be fair. I can see a nice d3 visual where you can slide
the gradient and see the total error change!

I could always add the line of best fit approach and more intuition for the
cost function after I introduce training data (and before the brutal leap)?

~~~
rmrfstar
I'm very much _not_ a fan of this introduction.

You use gradient descent, but do not introduce the normal equations. This is
problematic for at least two reasons.

Case 1: Design matrix has full rank

Omitting the normal equations obfuscates what is really going on. You are
inverting a matrix to solve the first order condition of a strictly convex
objective, which therefore has a unique optimum.

Case 2: Design matrix does not have full rank

Omitting the normal equations hides the fact that there are _multiple_
solutions to the first order condition. Gradient descent will find one of
them, but you need a principled method for selecting among them. The Moore-
Penrose pseudo-inverse method gives you the solution with the smallest L2
norm.
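
To illustrate, a minimal numpy sketch of the two cases (with invented data -
not the article's code):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))  # full-rank design matrix
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    # Case 1: full rank. The normal equations X'X beta = X'y (the first
    # order condition of the strictly convex objective) have a unique
    # solution.
    beta = np.linalg.solve(X.T @ X, X.T @ y)

    # Case 2: rank deficient. Duplicating a column makes X'X singular,
    # so np.linalg.solve would raise an error; the Moore-Penrose
    # pseudo-inverse picks the minimum-l2-norm solution among the
    # infinitely many optima.
    X_deficient = np.hstack([X, X[:, :1]])
    beta_min_norm = np.linalg.pinv(X_deficient) @ y

With full-rank X the two approaches agree; with X_deficient only the
pseudo-inverse gives a well-defined answer.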

Omitting these details is setting learners up for failure.

~~~
spekcular
I agree with this 100%.

You may be amused to learn that "How would you program a solver for a system
of linear equations?" was an informal interview question for a top machine
learning PhD program, and applicants were not looked upon favorably if they
mindlessly gave gradient descent as an answer.

------
ojnabieoot
Two comments here, and I am sorry if they come across as mean:

1) I know this article isn't aimed at me, but I do truly hate the excessive
emojis.

2) More substantively: not once does the word "statistics" appear here. There
is not a single illustration of the _idea_ behind linear regression, which is
staggeringly simple: it is just finding the best linear fit. Anyone who looks
at a 2D scatterplot can do an approximate linear regression in their head
by visualizing a line that sorta fits the data. While the gradient descent
machinery is useful for extending to more general machine learning
algorithms, introducing linear regression this way obscures the actual
concept - and indeed obscures what machine learning actually is.

In my experience there are a huge number of data scientists who are ignorant
of statistics and don't understand their models and algorithms. Although this
is unfair, it is my impression that the author themselves does not understand
linear regression, even if they are able to write formulas using vectors. This
article does a disservice to learners.

~~~
tluyben2
> I do truly hate the excessive emojis.

I'm more in the camp of always finding them excessive, especially in more
serious (and possibly interesting) posts.

~~~
grenoire
I would have loved to believe that they're a fad that will pass, but I
just found myself using them in the same way that I used to use phpBB
emoticons.

I think they're never going to go away, but at least we can try to eradicate
them from professional discourse and education.

~~~
the_af
The online banking login page from my bank greets me with "Hello :) How can we
help you today?" and it makes me cringe _every time_. This is a major bank, by
the way.

I do use smileys here on HN, on Facebook, and when chatting, to signal I'm
being friendly and non-confrontational, and to preempt and defuse situations
that could escalate into anger if someone misreads the tone.

------
stdbrouw
If you enjoy these kinds of explanations, "Data Science from Scratch" by Joel
Grus explains many machine learning algorithms and has you implement simple
versions of them in Python as you read along. It also covers linear
regression, and I wonder if that book is where the author got the idea for
this series of blog posts. Kudos anyway.

Something of a nitpick, but one thing that both Simon Ward-Jones and Joel Grus
miss is that linear regression is typically not implemented using gradient
descent at all: there's an analytical solution, and you can get the beta
coefficients with straightforward matrix algebra. It's much harder to explain
than gradient descent, so I get why they don't bother, but on the other hand,
without that background it's hard to see why everybody talks about _linear_
regression all the time, when with gradient descent or any other numerical
optimizer there's really no limit to what f(x) can look like.
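
For reference, the direct solve is a couple of lines in numpy (made-up data;
note that np.linalg.lstsq uses an SVD-based solver internally rather than
literally inverting X'X):

    import numpy as np

    # Made-up data: fit y ~ b0 + b1 * x
    x = np.arange(10.0)
    y = 3.0 + 2.0 * x + np.random.default_rng(1).normal(size=10)

    X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)  # approximately [3, 2]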

~~~
em500
I don't think they've missed the fact that for linear regression there's an
algebraic solution, but machine learners typically treat linear regression as
a simple special case, and as soon as you want to go a bit beyond it you have
to use an iterative numerical optimizer anyway - so why bother with the
special-case solution?

My main criticism of the article would be that the nitty-gritty section only
makes sense to a reader who has already done a linear algebra / multivariate
calculus course, in which case they've likely already covered least squares in
greater depth (including the exact solutions) than this article does. So I
don't really see the purpose of the math section, except maybe to signal that
the writer has a decent understanding of the algorithmic detail.

------
zwieback
The discussion here suggests that a lot of developers are jumping right into
the middle of data science from a machine learning perspective without a solid
understanding of what I would call basic math. It's a lot easier to succeed
and compete with a solid grasp of linear algebra, calculus and numerical
methods. My personal experience is that this curriculum takes a good couple of
years to really get your head around.

For my current project I had to really understand the closed-form solution for
linear regression, and even though it looks straightforward on paper, it
really took a while to sink in.

------
tryptophan
Why do people write, or even read, posts like this? You can crack open any
stats textbook and it will be explained there in much more detail by someone
likely far more qualified to be talking about it.

Feels like some mix of SEO farming / resume-building blogspam.

------
mnky9800n
Why does it go from zero to the gradient descent solution? I feel like going
from zero to the deterministic matrix-inversion solution would be more normal.

~~~
the_svd_doctor
Yes. In fact, an even better solution would be using QR factorization...
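
For example, a minimal sketch with numpy (invented data, not from the
article):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    # Factor X = QR, so minimizing ||X beta - y|| reduces to solving the
    # triangular system R beta = Q'y - no need to form X'X at all, which
    # is what makes this better conditioned than the normal equations.
    Q, R = np.linalg.qr(X)
    beta = np.linalg.solve(R, Q.T @ y)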

------
shoo
If you've got an interesting nonlinear cost function to minimise and can
compute the gradient, you can do a lot worse than plugging the thing into
L-BFGS-B and letting it optimise it.

e.g.

[https://en.m.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%...](https://en.m.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm)

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html)

What'll that give you compared to simple gradient descent? It'll attempt to
accelerate convergence by estimating an approximation of the Hessian matrix
(all the second-order partial derivatives), and it uses a line search
algorithm to figure out a good step size for each step instead of using an
arbitrary constant step size, aka "learning rate". The "L-" variation of the
algorithm will use a limited amount of memory when approximating the Hessian
matrix, which might help if your cost function has a large number of
parameters. The "-B" variation of the algorithm will also let you set upper
and/or lower bounds on each variable that will be respected during the search.
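
For example, a small usage sketch of scipy's generic minimize interface - the
Rosenbrock function and the bounds here are just stand-ins for your own cost,
gradient, and constraints:

    import numpy as np
    from scipy.optimize import minimize

    def cost(p):
        # Rosenbrock function: a classic nonconvex test problem
        x, y = p
        return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

    def grad(p):
        x, y = p
        return np.array([
            -2 * (1 - x) - 400 * x * (y - x ** 2),
            200 * (y - x ** 2),
        ])

    result = minimize(
        cost,
        x0=np.array([-1.0, 2.0]),
        jac=grad,
        method="L-BFGS-B",
        bounds=[(-2, 2), (-1, 3)],  # box constraints: the "-B" part
    )
    print(result.x)  # close to the optimum at (1, 1)

Swap in your own cost/grad pair; if you can't supply jac, minimize will fall
back to finite-difference gradient estimates.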

------
aazaa
> If you’re interested read on, if you’re not, see yourself out.

Huh? Doesn't this apply to literally every piece of writing ever produced? The
reader is always free to leave.

~~~
globular-toast
The author wanted to use the door emoji.

------
gbarboza
This is one of the most condescending things I've read in years.

~~~
mywittyname
Yeah, I get the impression this article is more about showing off how smart
the writer is than about effective instruction. If you can actually read the
article, you have no need for the content, because you've probably already
encountered gradient descent in your maths training (or could figure it out).

Also, there are no concrete examples of the algorithm in action for students
to work through and build intuition and a working understanding. Even the
densest math textbooks I've read have problems for the reader to work through.

------
ellisv
> Regression is any algorithm that takes a collection of inputs and predicts
> an output.

I get that the author is writing for a certain audience, but this is a gross
oversimplification.

~~~
bonoboTP
Depends on the tradition you are coming from. In machine learning it really is
just that: a synonym for continuous-valued function approximation based on
training data.

You may say it's only regression if it fits your favorite framework, like
giving confidence values or goodness of fit etc., but that's not true in
general. There are so many variants: Bayesian probabilistic regression,
nonparametrics, neural nets, random forests, etc.

~~~
rm999
It’s a nonstandard definition at best. While regression is a name given to
some algorithms (linear and logistic for example), this is an artifact of the
statistical foundations of ML. Its standard use in ML is to describe the
problem, not the solution: regression is predicting a continuous value and
classification is predicting discrete categories.

Either use is ok, but e.g. a tree model predicting if an animal is a dog or a
cat is not regression by any definition.

~~~
bonoboTP
I'd say arguing about this is a huge distraction. Some would say
classification is a special case of regression: the predicted continuous
values are interpreted as class probabilities.

You're right that regression is a type of task, not a type of solution. The
conceptual difference is important to understand if someone does not yet know
it. But it's not an all-or-nothing, set-in-stone thing. Treating it that way
implies a waterfall design to these ideas, as if some oracle posed these
tasks to us and then we started finding solutions to each of them separately.
But actually, in many cases one and the same algorithm, with small tweaks, can
tackle multiple tasks. Sometimes we have the hammer (the algorithm) first and
then the nails (the task).

How you build your taxonomy and how you categorize one approach or another is
not the same as learning and understanding. I always had an issue with this at
university, where some lecturers would confuse memorizing lists like "what are
the 3 areas of field X" or "what are the 4 principles of approach Y" with
actual understanding.

The world is not structured according to subjects, fields, subfields etc. It's
not a single hierarchy, but a big mess of similarities, like a graph or a
multidimensional space. The map is not the territory and so on.

Terminology is necessary of course for communication and structuring books
etc, but I like to see it merely as a utilitarian thing. The categories
provide a scaffolding so that learning can happen. Studying the vocabulary and
various outlines and nested hierarchies is a useful part of the journey but
should not be confused with actually learning the thing itself. You could in
principle learn all about regression without ever learning the word
"regression".

The name actually originates from statistics, named after the "regression
towards the mean" phenomenon, where it was observed that people's adult height
is closer to the average than their parents' height. So a tall person will
have tall children, but less extremely tall (in tendency - some will of course
be even taller). So, confusingly, regression literally means "going back".

------
Labo333
So now posts full of emojis that use gradient descent for a classical problem
with an efficient analytical solution
([https://en.m.wikipedia.org/wiki/Linear_least_squares](https://en.m.wikipedia.org/wiki/Linear_least_squares))
- one you learn about in any ML class - make it to the front page...

------
j7ake
Why go through all the steps of deriving gradient descent when there is an
analytical formula for the estimate of the parameters? To scale up in big data
contexts, I guess? But the analytical solution, in my opinion, gives more
insight into the problem than the gradient descent solution.

------
wolfi1
I would prefer an introduction via the Moore-Penrose pseudo-inverse. It's a
lot easier, imho.

~~~
xthestreams
Introducing linear regression with gradient descent can also be really
confusing for a newcomer. Gradient descent solves a problem (nonlinear
optimization) which does not exist here.

------
BrandoElFollito
This is one of the rare articles that applies linear regression to something
which is linear by nature (I am assuming good faith from the authors when they
use "intuitively" to mean "we know our model is linear because
<something>").

Whenever I look at generic information (news, newspapers, generic articles),
someone is taking a cloud of points and, bam!, drawing a straight line through
them. I actually reached out to some authors to ask why they did that.

Answers (if any) were varied; I think I had one person who actually said that
the data is expected to be linear. Many could not see the point, up to "if it
looks like a line, then we put a line".

------
globuous
Great article! But I must admit I didn't expect it to be using gradient
descent. I like the Wikipedia derivation as well [1]. It notes that the
minimization problem is convex, so you'll find the global minimum with
gradient descent. But more interestingly, at the cost of inverting a
potentially huge matrix, the Wikipedia article presents an elegant analytical
solution to the problem.

[1]
[https://en.wikipedia.org/wiki/Ordinary_least_squares#Matrix/...](https://en.wikipedia.org/wiki/Ordinary_least_squares#Matrix/vector_formulation)

------
reedwolf
The next level:

[https://en.wikipedia.org/wiki/Symbolic_regression](https://en.wikipedia.org/wiki/Symbolic_regression)

~~~
vasili111
Do you have personal experience with symbolic regression? If so, please share
it briefly. What are the advantages and disadvantages compared to other
predictive algorithms? Did it give you better results than other predictive
algorithms?

------
simonwardjones
Explanation, maths and code

~~~
cheschire
Well done. I look forward to digging into this more this weekend.

~~~
dunefox
You might prefer this: [https://faculty.marshall.usc.edu/gareth-james/ISL/](https://faculty.marshall.usc.edu/gareth-james/ISL/)

------
tmkadamcz
Why is he using gradient descent? There's a closed-form expression for the
coefficients.

------
Lazlo182
To me, the concept of "fitness functions" seems to touch on genetic
algorithms, and I guess I think of regression less when I hear the term.

------
fartzzz
This is a very good explanation! (And a great blog, btw.) I especially like
the explanation of the math symbols - that's something I always have a hard
time understanding/remembering (it's very hard to look up math symbols on
Google :P) - and it's great that it is followed by a Python example!

------
globular-toast
The emojis add nothing and make it horrible to read, so I stopped.

------
olegious
I consider myself above average intelligence, but my math skills are sorely
lacking (didn't make it beyond pre-calculus). The first section was simple to
follow but I was completely lost as soon as he hit the "Nitty Gritty."

Any advice on what I need to learn to even begin to understand what's written
in section 2?

~~~
bonoboTP
Machine learning math is basically calculus, linear algebra and probability
theory. Pick up intro textbooks on these. Enjoying the dopamine rush of
insight from blog posts can be more harmful than useful if overdone. You have
to sit down and digest these things without distractions, and that's not easy
for most people outside a school environment. Consciously understand this fact
and take an adult, planned approach to keep yourself on track if you truly
want to understand things.

------
redroot
Great read! I really enjoy that the article explains the problem in three
different media, and you walk away with some understanding even if you can
only access the first part. Reminds me of the Wired videos where a concept
would be explained at 5 different levels:
[https://www.youtube.com/watch?v=eRkgK4jfi6M](https://www.youtube.com/watch?v=eRkgK4jfi6M)

