
ML From Scratch, Part 1: Linear Regression - olooney
http://www.oranlooney.com/post/ml-from-scratch-part-1-linear-regression/
======
stared
I tried to do it from scratch AND hands-on in PyTorch:

[https://colab.research.google.com/github/stared/thinking-
in-...](https://colab.research.google.com/github/stared/thinking-in-tensors-
writing-in-pytorch/blob/master/3%20Linear%20regression.ipynb)

By all means, much more "from scratch".

It is a part of open-source "Thinking in Tensors, Writing in PyTorch":
[https://github.com/stared/thinking-in-tensors-writing-in-
pyt...](https://github.com/stared/thinking-in-tensors-writing-in-pytorch)

~~~
semiotagonal
I tried the popular Machine Learning course on Coursera, and found myself
spending tons of time debugging Octave code because to me it was such an
unfamiliar way of expressing things. A widely accepted alternative in a more
popular programming language would be nice to have.

~~~
unoti
It does exist! There are lots of answers to what you’re asking for. But here’s
the best answer:

[https://course.fast.ai/](https://course.fast.ai/)

This course uses Python and PyTorch and will have you building something that
can tell 30+ breeds of cats and dogs apart in the first lesson. While the old
course from Andrew Ng you mentioned starts with theory and by the end gets to
doing something useful, the fastai course starts with best practices and moves
into the underlying theory later.

Also, Andrew Ng has a newer deeplearning.ai course on Coursera that is all
Python and doesn't use Octave.

~~~
semiotagonal
Thanks.

------
Sedai05
It says it's from scratch, but it's using a lot of words and math symbols I
don't understand when describing Linear Regression.

~~~
applecrazy
If you want a gentler introduction, I recommend the first module of Andrew
Ng’s Machine Learning class on Coursera. He also explains things intuitively
before jumping into the math. Even when he uses math, he makes sure you
understand the notation and what each equation means.

------
z3phyr
At first I thought it was about implementing an ML-like language. Then the
realisation hit pretty fast..

~~~
uhryks
This. I wish people wrote "Machine Learning" out in their titles; it always
confuses me.

------
Sinidir
Understood everything up until this:

    
    
        ∇_Θ J = ∇_Θ (y − XΘ)^T (y − XΘ)

expanded into

        ∇_Θ J = ∇_Θ y^T y − (XΘ)^T y − y^T XΘ + Θ^T (X^T X) Θ

Can someone explain why ∇_Θ is only applied to the first term? Also I have
never seen gradient notation used like that before.

~~~
kxyvr
Well, personally I prefer a different kind of notation. That said, there are
several ways to get the same solution. In the following, apostrophe, ', means
transpose and two symbols next to each other means multiplication, Ax = A
(times) x. I can't figure out how to write a star on this forum without it
turning into italics.

1. Expansion. Let J(x) = norm(Ax-b)^2. Then,

J(x) = norm(Ax-b)^2 = (Ax-b)'(Ax-b) = (x'A'-b')(Ax-b) = x'A'Ax - b'Ax -x'A'b +
b'b = x'A'Ax - 2b'Ax + b'b

Then,

grad J(x) = 2A'Ax-2A'b

Except, how did we know that grad(x'A'Ax) = 2A'Ax or grad(2b'Ax)=2A'b? Well,
there are at least two ways to derive it. One, go back to the limit
definition. This is tedious. Two, cheat and use a Taylor series. Now, this
assumes that we believe Taylor's theorem and it's worthwhile to go through a
proof someday. Today's not that day. Taylor series assert that, under the
proper conditions:

J(x+dx) = J(x) + J'(x)dx + (1/2)J''(x)dxdx + R(x)

where J'(x) denotes the total derivative of J and R(x) is a remainder term.
The total derivative is a linear operator: if J:X->Y, then J'(x)\in L(X,Y) where L(X,Y)
denotes the space of linear operators that have domain X and codomain Y. This
means that J''(x)\in L(X,L(X,Y)) since we can still play this same trick. This
means that J''(x)dxdx \in Y. Anyway, why do we care? If we can write something
that looks like a Taylor series, then we can read off the derivatives. For
example, for K(x) = x'Bx, we have that:

K(x+dx) = (x+dx)'B(x+dx) = x'Bx+dx'Bx+x'Bdx+dx'Bdx = x'Bx+2x'Bdx+dx'Bdx =
K(x)+K'(x)dx+(1/2)K''(x)dxdx

(Here dx'Bx = x'B'dx since it's a scalar, so the two middle terms combine into
2x'Bdx when B is symmetric, as A'A is.)

Hence, if we just match the terms:

K(x) = x'Bx, K'(x)dx = 2x'Bdx, (1/2)K''(x)dxdx = dx'Bdx or K''(x)dxdx = 2dx'Bdx

Great, except grad K(x)=2B'x and not 2x'B. Why the transpose? Well, it's
because K'(x) and grad K(x) are not the same thing. Recall, from above, if
K:X->R where R means the real numbers, then K'(x)\in L(X,R). To be clear, this
is a function that maps X to R. Say X was something like an m dimensional
vector space, R(m). Then, K'(x) : R(m)->R. If we want to represent this as a
matrix, the matrix would have size R(1,m). This is a row vector. Normally, the
gradient is a column vector. Why? Well, most of the time, people view the
gradient as a matrix of partial derivatives and they order them in a magical
way. To me, a better way to visualize it is as what's called the Riesz
representative of the total derivative. Basically, there's a theorem that
states that under certain conditions any linear functional can be represented
as the inner product between an element in that space and the argument.
That is, if f is a linear function that maps to the real numbers, then, under
the right conditions, we can write it as f(x)=<a,x> where <.,.> denotes the
inner product. In R(m), this inner product is just the
transpose, so any linear functional in R(m) can be written as f(x)=a'x. Going
back to the above, K'(x)dx=2x'Bdx=(2B'x)'dx. This means that the linear
functional C(dx)=K'(x)dx can be written as C(dx)=K'(x)dx=(2B'x)'dx. That
element, 2B'x is the gradient, by definition. It's the Riesz representative of
the total derivative. And, again, you _can_ define the gradient differently. I
wouldn't, since this is clean and generalizes to Hilbert spaces, which are
complete inner product spaces, which means that, with some restrictions, this
carries over to infinite dimensions and function spaces.

OK, that was a lot, but then we can realize that we can just apply the Taylor
series trick to the whole expansion:

J(x+dx) = norm(A(x+dx)-b)^2 = norm(Ax-b+Adx)^2 = norm(Ax-b)^2 + 2(Ax-b)'Adx +
dx'A'Adx = J(x) + J'(x)dx + (1/2)J''(x)dxdx

After matching, we have that

J'(x)dx = 2(Ax-b)'Adx

or that

grad J(x) = 2A'(Ax-b) = 2A'Ax - 2A'b

2. But, wait, there's more. We could just use the chain rule. Recall, (f o
g)'(x) = f'(g(x))g'(x) where o denotes function composition. In the above
case, f(x) = norm(x)^2 and g(x) = Ax-b. Also playing Taylor series games, we'd
eventually find out that f'(x)dx = 2x'dx and g'(x) = A.
This means that

f'(g(x))g'(x) = 2(Ax-b)'A

which means the gradient is:

2A'(Ax-b)

3. If you really don't like this, you could pretend to be a physicist and do
everything component-wise with 1-D derivatives. That was a joke. This will
produce the correct result, but one that I find particularly painful to
derive. As such, I will punt and avoid that here.

Anyway, all teasing aside, there are lots of different ways to get the same
result. Really, just use what works well for you and potentially adapt if
you're working with an audience that views things differently. I don't claim
what I wrote above is the best notation or derivations for you, but they work
well for me and I wanted to provide some variety on a Saturday afternoon.
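
If it helps to see the result numerically: here is a minimal NumPy sketch (my
own, with made-up sizes and random placeholder data, not code from the
article) that checks grad J(x) = 2A'(Ax - b) for J(x) = norm(Ax - b)^2 against
a central finite-difference approximation.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 5))     # arbitrary small example
    b = rng.normal(size=50)
    x = rng.normal(size=5)

    def J(x):
        r = A @ x - b
        return r @ r                 # J(x) = norm(Ax - b)^2

    grad_analytic = 2 * A.T @ (A @ x - b)

    # Central finite differences, one coordinate at a time.
    eps = 1e-6
    grad_fd = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad_fd[i] = (J(x + e) - J(x - e)) / (2 * eps)

    print(np.max(np.abs(grad_analytic - grad_fd)))   # tiny, ~1e-7 or smaller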

~~~
olooney
Yes, using chain rule for the gradient of norm(Ax-b)^2 makes that derivation a
lot shorter. Or just citing a theorem for the gradient of the L2 norm would
have been fine. I used the FOIL approach because each of the 4 terms can be
tackled using very elementary theorems in matrix calculus - a constant term,
two linear terms, and a quadratic form. But in retrospect this was probably a
mistake seeing as several people have gotten hung up on it.

------
MidgetGourde
I could balance equations in chemistry and do some algebra and trigonometry at
GCSE level. But this blows my mind. Higher-tier GCSE Maths. I should have paid
more attention. I think I know what regression is in simplistic terms: it's
like taking a rolling graph and predicting the next point on the x / y axis,
using the previous data as a training model. I hope.

~~~
applecrazy
You’re exactly right. It’s fitting a line (or, in higher dimensions, a plane
or curve) to the data.
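
In the simplest one-variable case that picture is literally a straight line
through the points. A tiny sketch with made-up noisy data (np.polyfit is just
one convenient way to do the fit):

    import numpy as np

    # Made-up data: points scattered around the line y = 2x + 1.
    x = np.arange(10, dtype=float)
    y = 2 * x + 1 + np.random.default_rng(0).normal(scale=0.5, size=10)

    slope, intercept = np.polyfit(x, y, deg=1)   # fit a straight line
    next_point = slope * 10 + intercept          # predict at the next x
    print(slope, intercept, next_point)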

------
j7ake
I found the notation here non-standard and confusing.

In my opinion, the Wikipedia article is more concise and provides more
intuition (especially the multiple ways to derive the closed-form solution).

[https://en.wikipedia.org/wiki/Ordinary_least_squares#Alterna...](https://en.wikipedia.org/wiki/Ordinary_least_squares#Alternative_derivations)

~~~
graycat
The Wikipedia article has lots of nice material, but much easier than that or
the OP here is my

[https://news.ycombinator.com/item?id=21301117](https://news.ycombinator.com/item?id=21301117)

also in this thread.

------
adipginting
For a computer science PhD or master's student candidate, I think the article,
and the site overall, is very good.

~~~
graycat
For any student, middle school or later, who knows matrix multiplication but
no calculus or probability, just use my

[https://news.ycombinator.com/item?id=21301117](https://news.ycombinator.com/item?id=21301117)

in this thread. It's really simple!

------
dlphn___xyz
one of the best ‘from scratch’ articles I’ve read on LR

------
graycat
=== Introduction

Here I give some notes for a shorter, simpler, more general derivation of the
basics of regression analysis, with no calculus or derivatives.

=== The Given Data

Let R be the set of real numbers (we could use the set of rational numbers
instead if we wished).

We are given positive integers m and n.

In our typing, character '^' starts a superscript; character '_' starts a
subscript.

We regard R^m as the set of all m x 1 matrices (column vectors); that is, each
point in R^m is a matrix with m rows and 1 column with its components in set
R.

We are given m x 1 y in R^m.

We are given m x n matrix A with components in R.

We will need to know what matrix _transpose_ is: For an intuitive view, the
_transpose_ of A is A^T where each column of A becomes a row of A^T. Or to be
explicit, for i = 1 to m and j = 1 to n, the component in row i and column j
of A becomes the component in row j and column i of A^T.

We notice that each column of A is m x 1 and, thus, a point in R^m.

Suppose b is n x 1 with components in R.

Then the matrix product Ab is m x 1 in R^m and is a _linear combination_ (with
_coefficients_ from b) of the columns of A.

=== Real World

So we have data A and y. To get this data, maybe we got some medical record
data on m = 100,000 people.

For each person we got data on age, income, years of schooling, height,
weight, birth gender (0 for male, 1 for female), years of smoking, blood sugar
level, blood cholesterol level, race (1 to 5 from white, black, Hispanic,
Asian, other), that is n = 10 variables. We get to select these variables and
call them _independent_.

For person i = 1 to m = 100,000, we put the value of variable j = 1 to n = 10
in row i and column j of matrix A.

Or matrix A has one row for each person and one column for each variable.

Vector y has systolic blood pressure: For i = 1 to m = 100,000, row i of y,
that is, y_i, has the systolic blood pressure of person i. We regard systolic
blood pressure in y as the _dependent_ variable.

What we want to do is use our data to get a mathematical function that for any
person, in the m = 100,000 or not, takes in the values of the independent
variables, 1 x n v, and returns the corresponding value of the dependent
variable 1 x 1 z. To this end we want n x 1 b so that we have

z = vb

So for any person, we take their independent variable values v, apply
coefficients b, and get their dependent variable value z.

We seek the b we will use on all people.
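
As a shape-only sketch of that layout (random placeholder values standing in
for the records, which we obviously don't have here): one row of A per person,
one column per independent variable, and y holding the dependent variable.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100_000, 10

    # Placeholder values: row i of A would hold person i's ten independent
    # variables (age, income, ..., race), and y_i their systolic blood pressure.
    A = rng.normal(size=(m, n))
    y = rng.normal(size=m)

    print(A.shape, y.shape)   # (100000, 10) (100000,)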

=== The _Normal Equations_

So, we have our given m x n A and m x 1 y, and we seek our n x 1 b.

We let

S = { Au | n x 1 u }

That is, S is the set of all the _linear combinations_ of the columns of A, a
_hyperplane_ in R^m, a generalization of a plane in three dimensions or a line
in two dimensions, a vector _subspace_ of R^m, the vector subspace of R^m
_spanned_ by the columns of A.

We seek w in S to minimize the squared _length_

||y - w||^2

= \sum_{i = 1}^m (y_i - w_i)^2

(notation from D. Knuth's math typesetting software TeX), that is the sum
(capital letter Greek sigma) from i = 1 to m of

(y_i - w_i)^2

where y_i is component i of m x 1 y in R^m and similarly for w_i.

That is, we seek the w in the hyperplane S that is closest to our dependent
variable value y.

Well, from some simple geometry, the vector

y - w

has to be perpendicular to the hyperplane S.

Also from the geometry, w is unique, the only point in S that minimizes the
squared length

||y - w||^2

For the _geometry_, there is one classic theorem -- in W. Rudin, _Real and
Complex Analysis_ -- that will mostly do: In quite general situations, more
general than R^m, every non-empty, closed, convex set has a unique element of
minimum norm (length).

Now that we have the definitions and tools we need, we derive the _normal
equations_ in just two lines of matrix algebra:

Since (y - w) is perpendicular to each column of A, we have that

A^T (y - w) = 0

where the right side is an n x 1 matrix of 0s.

Then, since w is in S, we can write w = Ab for some n x 1 b, and

A^T y = A^T Ab

or with the usual writing order

(A^T A)b = A^T y

the _normal equations_ where we have A and y and solve for b.
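
As a minimal NumPy sketch of those two lines of algebra (placeholder data, and
assuming (A^T A) is invertible so np.linalg.solve applies): form the normal
equations, solve for b, and then predict with z = vb.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 1_000, 10
    A = rng.normal(size=(m, n))        # placeholder data matrix
    y = rng.normal(size=m)             # placeholder dependent variable

    # Normal equations: (A^T A) b = A^T y
    b = np.linalg.solve(A.T @ A, A.T @ y)

    w = A @ b                          # the point of S closest to y
    v = rng.normal(size=(1, n))        # a new person's variables, 1 x n
    z = v @ b                          # predicted dependent value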

=== Results

Since w is in S, the equations do have a solution. Moreover, from the
geometry, w is unique.

If the n x n square matrix (A^T A) has an inverse, then the solution b is also
unique.

Note: Vector w DOES exist and is unique; b DOES exist; if the inverse of (A^T
A) exists, then b is unique; otherwise b STILL exists but is not unique. STILL
w is unique. How 'bout that!

Since w is in S and since (y - w) is perpendicular to all the vectors in S, we
have that

(y - w)^T w = 0

y^T y = (y - w + w)^T (y - w + w)

= (y - w)^T (y - w) + (y - w)^T w + w^T (y - w) + w^T w

= (y - w)^T (y - w) + w^T w

or

[total sum of squares] = [regression sum of squares] + [error sum of squares]

or the Pythagorean theorem.
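
A quick numerical check of that decomposition, as a self-contained sketch with
the same kind of placeholder data as above:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(1_000, 10))
    y = rng.normal(size=1_000)

    b = np.linalg.solve(A.T @ A, A.T @ y)
    w = A @ b

    total = y @ y                      # y^T y
    error = (y - w) @ (y - w)          # (y - w)^T (y - w)
    regression = w @ w                 # w^T w

    # The cross terms vanish because (y - w) is perpendicular to S.
    print(np.isclose(total, error + regression))   # True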

So, now that we have found b, for any v, we can get our desired z from vb.

The b we found works _best_ in the least squares sense for the m = 100,000
data we had. If the 100,000 are an appropriate _sample_ of a billion, and if
our b works well on our sample, then maybe b will work well on the billion.
Ah, maybe we could prove some theorems here, theorems with meager or no more
assumptions than we have made so far!

"Look, Ma, no calculus! And no mention of the Gaussian distribution. No
mention of maximum likelihood.

"Except for regarding the m = 100,000 _observations_ as a _sample_ from a
billion, we have mentioned no probability concepts. We never asked for a
matrix inverse. And we never mentioned _multicollinearity_."

And, maybe the inverse of n x n (A^T A) does exist but is "numerically
unstable". Sooooo, we begin to suspect: The b we get may be inaccurate but the
w may still be fine and our

z = vb

may also still be accurate. Might want to look into this.

For a start, sure, the matrix (A^T A) being numerically unstable is
essentially the same as the ratio of the largest and smallest (they are all
nonnegative) eigenvalues being large. That is, roughly, the problem is
that the small eigenvalues are, in the _scale_ of the problem, close to 0.
Even if they are 0, we have shown that our w is still unique.
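
A sketch of that point (placeholder data with two nearly dependent columns):
the condition of (A^T A) can be terrible and b poorly determined, while the
fitted w = Ab from an SVD-based solver such as np.linalg.lstsq is still fine.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 200, 5
    A = rng.normal(size=(m, n))
    A[:, 4] = A[:, 3] + 1e-8 * rng.normal(size=m)   # nearly dependent columns
    y = rng.normal(size=m)

    # Ratio of largest to smallest eigenvalue of (A^T A) is enormous here.
    print(np.linalg.cond(A.T @ A))

    # lstsq works from A directly (via the SVD), so w = A b stays well behaved
    # even though b itself is barely determined.
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    w = A @ b
    print(np.linalg.norm(y - w))       # the minimal residual is still achieved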

------
vermooten
Hardly 'from scratch'; I was expecting an ELI5 article.

~~~
dang
The submitted title, "Linear Regression from Scratch", was misleading. We've
replaced it with the title on the page.

From-scratchness is relative I guess.

~~~
ChrisSD
First you need to create a universe...

~~~
craftinator
Easy Biscuits from Scratch:

Step 1) Plant a crop of wheat

------
smitty1e
The LaTeX was LaToast, which is LaShame.

~~~
bradleyankrom
Are you blocking one of the JS libs that the page uses? The notation was
screwed up for me until I temporarily unblocked them.

~~~
smitty1e
Ah.

