
Yeah, dergachev is pointing to one of the cool direct applications of linear algebra to machine learning.

Suppose you are given the data D = (r_1; r_2; ...; r_N), where each row r_i is an (n+1)-vector r_i = (a_i1, a_i2, ..., a_in, b_i). Each row consists of one observation. We want to predict a future b_j given the future \vec{a}_j, having seen {r_i}_{i=1...N} (the data set consists of N data rows r_i where both \vec{a}_i and b_i are known).

One simple model for b_j given \vec{a}_j = (a_j1, a_j2, ..., a_jn) is a linear model with n parameters m_1, m_2, ..., m_n:

   y_m(x_1,...,x_n) =  m_1x_1 + m_2x_2 + ... + m_nx_n  =  \vec{m} · \vec{x}
If the model is good then y_m(\vec{a}_i) approximates b_i well.
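In NumPy the model's prediction is literally a dot product. A minimal sketch (the parameter and input values here are made up for illustration):

```python
import numpy as np

# Hypothetical parameter vector m and input x, with n = 3
m = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 4.0, 2.0])

# y_m(x) = m_1 x_1 + m_2 x_2 + m_3 x_3 = m . x
y = m @ x  # 2*1 + (-1)*4 + 0.5*2 = -1.0
```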

But startup wisdom dictates that we should measure everything!

Enter error term:

   e_i(\vec{m}) = |y_m(\vec{a}_i)   - b_i|²,
the squared absolute-value of the difference between the model's prediction and the actual output---hence the name error term.

Our goal is to make the sum $S$ of all the error terms as small as possible.

   S(\vec{m}) = \sum_{i=1}^{N} e_i(\vec{m}) 
Note that the "total squared error" is a function of the model parameters \vec{m}.
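The total squared error is easy to compute once the data is in matrix form. A sketch with made-up toy data (N = 4 observations, n = 2 features), where `total_squared_error` is just S(\vec{m}) written in NumPy:

```python
import numpy as np

# Toy data: each row of A is an input vector a_i, b holds the outputs b_i
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 1.0]])
b = np.array([5.0, 4.0, 9.0, 6.0])

def total_squared_error(m):
    # S(m) = sum_i |y_m(a_i) - b_i|^2, with y_m(a_i) the i-th entry of A @ m
    residuals = A @ m - b
    return np.sum(residuals ** 2)
```

The toy data was chosen so that m = (1, 2) fits it exactly, giving S = 0, while any other choice of m gives a strictly positive error.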

At this point we have reached a level of complexity that becomes difficult to follow. Linear algebra to the rescue!

We can express the "vector prediction" of the model y_m in "one shot" in terms of the following matrix equation:

  A\vec{m} = \vec{b}

  where A is an N by n matrix  (contains the a_ij part of the data)
  \vec{m} is an n by 1 vector  (model parameters---the unknown)
  \vec{b} is an N by 1 vector  (contains the b_i part of the data)
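Building A and \vec{b} from the data rows is a matter of slicing off the last column. A sketch with hypothetical rows r_i = (a_i1, ..., a_in, b_i), here with N = 4 and n = 2:

```python
import numpy as np

# Each row is one observation r_i = (a_i1, a_i2, b_i); values are made up
rows = np.array([[1.0, 2.0, 5.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 3.0, 9.0],
                 [4.0, 1.0, 6.0]])

A = rows[:, :-1]   # N by n matrix of inputs (everything but the last column)
b = rows[:, -1]    # N-vector of outputs (the last column)
```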

To find \vec{m}, we must solve this matrix equation. However, A is not a square matrix: A is a tall, skinny matrix (N >> n), so there is no A^{-1}.

Okay, so we don't have an A^{-1} to throw at the equation A\vec{m} = \vec{b} to cancel the A, but what else could we throw at it? Let's throw A^T at it!

     A^T A \vec{m}  =  A^T \vec{b}

Define N = A^T A (an n by n matrix), so the equation becomes

        N \vec{m}  =  A^T \vec{b}
Now the thing to observe is that if N is invertible, then we can find \vec{m} using

        \vec{m}  = N^{-1} A^T \vec{b}
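In code, this is forming N = A^T A and solving the n by n system, rather than inverting N explicitly. A sketch using the same made-up toy data as above (chosen so the exact answer is m = (1, 2)):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 1.0]])
b = np.array([5.0, 4.0, 9.0, 6.0])

# Normal equations: N m = A^T b, with N = A^T A an n by n matrix
N = A.T @ A
m = np.linalg.solve(N, A.T @ b)
```

(`np.linalg.solve` is preferred over computing N^{-1} and multiplying: it is cheaper and numerically better behaved.)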

This solution to the problem is known as the "least squares fit" solution (i.e. choice of parameters for the model m_1, m_2, ..., m_n). This name comes from the fact that the vector \vec{m} is equal to the output of the following optimization problem

    \vec{m}   =  argmin_{\vec{m}} S(\vec{m}) 
Proof: http://en.wikipedia.org/wiki/Linear_least_squares_(mathemati...

Technical detail: the matrix N=A^T*A is invertible if and only if the columns of A are linearly independent.

TL;DR. When you have to do a "linear regression" model of data matrix X and labels \vec{y}, the best (in the sense of least squared error) linear model is \vec{m} = (X^T X)^{-1} X^T \vec{y}.
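The TL;DR formula translates directly to NumPy. A sketch on synthetic data (random X, noiseless labels so the fit recovers the true parameters exactly); in practice `np.linalg.lstsq` computes the same least-squares solution more stably than forming (X^T X)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))   # N = 100 rows, n = 3 features
true_m = np.array([1.5, -2.0, 0.7])
y = X @ true_m                      # noiseless labels, for checking the fit

# Closed-form least-squares fit: m = (X^T X)^{-1} X^T y
m_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Same solution via the numerically preferred library routine
m_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```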

That literally made more sense than anything I've learned this semester. :)
