Yeah, dergachev is pointing to one of the cool direct applications of linear algebra to machine learning.

Suppose you are given the data D = (r_1; r_2; ...; r_N), where each row r_i is an (n+1)-vector r_i = (a_i1, a_i2, ..., a_in, b_i). Each row consists of one observation. We want to predict a future b_j given the future \vec{a}_j, given that we have seen {r_i}_{i=1...N} (the data set consists of N data rows r_i where both \vec{a}_i and b_i are known).

One simple model for b_j given \vec{x} = \vec{a}_j = (a_j1, a_j2, ..., a_jn) is a linear model with n parameters m_1, m_2, ..., m_n:

  y_m(x_1, ..., x_n) = m_1 x_1 + m_2 x_2 + ... + m_n x_n = \vec{m} · \vec{x}

If the model is good, then y_m(\vec{a}_i) approximates b_i well. But startup wisdom dictates that we should measure everything! Enter the error term:

  e_i(\vec{m}) = |y_m(\vec{a}_i) - b_i|²,

the squared absolute value of the difference between the model's prediction and the actual output---hence the name error term. Our goal is to make the sum S of all the error terms as small as possible:

  S(\vec{m}) = \sum_{i=1}^{N} e_i(\vec{m})

Note that the "total squared error" is a function of the model parameters \vec{m}.

At this point we have reached a level of complexity that becomes difficult to follow. Linear algebra to the rescue! We can express the "vector prediction" of the model y_m in one shot in terms of the following matrix equation:

  A \vec{m} = \vec{b}

where A is an N-by-n matrix (contains the a_:: part of the data), \vec{m} is an n-by-1 vector (the model parameters---the unknown), and \vec{b} is an N-by-1 vector (contains the b_: part of the data).

To find \vec{m}, we must solve this matrix equation. However, A is not a square matrix: A is a tall skinny matrix (N >> n), so there is no A^{-1}. Okay, so we don't have an A^{-1} to throw at the equation A\vec{m} = \vec{b} to cancel the A, but what else could we throw at it? Let's throw A^T at it!
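The setup above can be sketched in a few lines of NumPy. This is a toy illustration with made-up synthetic data (the sizes N = 50, n = 3 and the "true" parameters are arbitrary choices, not anything from the comment): the prediction for every row is one matrix-vector product A @ m, and the total squared error S(m) is a sum over rows.

```python
import numpy as np

# Hypothetical synthetic data: N = 50 observations, n = 3 features.
rng = np.random.default_rng(0)
N, n = 50, 3
A = rng.normal(size=(N, n))          # the a_ij part of the data rows
true_m = np.array([2.0, -1.0, 0.5])  # made-up "ground truth" parameters
b = A @ true_m                       # the b_i part (noise-free here)

m = np.zeros(n)                      # some candidate parameters
predictions = A @ m                  # y_m(a_i) for every row in one shot
errors = (predictions - b) ** 2      # e_i(m), one per data row
S = errors.sum()                     # total squared error S(m)
```

The point of the matrix form is exactly this: no loop over rows is needed, `A @ m` evaluates the model on every observation at once.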
  A^T A \vec{m} = A^T \vec{b}

Define N = A^T A (an n-by-n matrix), so this reads

  N \vec{m} = A^T \vec{b}

Now the thing to observe is that if N is invertible, then we can find \vec{m} using

  \vec{m} = N^{-1} A^T \vec{b}

This solution to the problem is known as the "least squares fit" (i.e., the choice of parameters m_1, m_2, ..., m_n for the model). The name comes from the fact that the vector \vec{m} is equal to the output of the following optimization problem:

  \vec{m} = the minimizer of S(\vec{m}) over all \vec{m}

Proof: http://en.wikipedia.org/wiki/Linear_least_squares_(mathemati...

Technical detail: the matrix N = A^T A is invertible if and only if the columns of A are linearly independent.

TL;DR: When you have to fit a "linear regression" model to a data matrix X and labels \vec{y}, the best (in the sense of least squared error) linear model is \vec{m} = (X^T X)^{-1} X^T \vec{y}.
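The closed-form solution above can be checked numerically. A minimal sketch, again on made-up random data (sizes and noise level are arbitrary assumptions): compute \vec{m} from the normal equations and compare it against NumPy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 100, 4
A = rng.normal(size=(N, n))
b = A @ rng.normal(size=n) + 0.01 * rng.normal(size=N)  # slightly noisy labels

# Normal-equations solution: m = (A^T A)^{-1} A^T b.
# Solving the linear system beats forming the inverse explicitly.
m_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's dedicated least-squares routine solves the same
# minimization problem (and is more numerically stable than
# forming A^T A when A is ill-conditioned).
m_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Both routes give the same \vec{m}; in practice people reach for `lstsq` (or a QR/SVD factorization) rather than inverting A^T A.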

 That literally made more sense than anything I've learned this semester. :)
