For me what makes it really interesting is the realization that all of linear regression (think y = mx + b) is based on linear algebra: finding a best-fit line (in the case of a single input variable) amounts to projecting the N-dimensional vector of observations (where you have N data points) onto a 2-dimensional subspace (assuming you're fitting a slope and an intercept).
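If you want to see that picture in code, here's a minimal NumPy sketch (the data and variable names are made up for illustration): the vector of N observed y-values is projected onto the 2-dimensional subspace spanned by the all-ones vector and the vector of x-values, and the coordinates of that projection are the intercept and slope.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)                       # N = 50 data points
    y = 3.0 * x + 1.5 + rng.normal(0, 1.0, x.size)   # noisy samples around y = 3x + 1.5

    # Basis for the 2-dimensional subspace: the all-ones vector and the x-values.
    A = np.column_stack([np.ones_like(x), x])        # N x 2 matrix

    # Orthogonal projection of y onto the column space of A:
    # solve (A^T A) c = A^T y, then the projection itself is A c.
    coeffs = np.linalg.solve(A.T @ A, A.T @ y)       # c = [intercept, slope]
    y_hat = A @ coeffs                               # closest point to y inside the subspace

    print("intercept, slope:", coeffs)               # should be close to 1.5 and 3.0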
When thought of this way, a lot of linear algebra has geometric interpretations and for me this makes it a lot less abstract.
Nearly everything in linear algebra has important geometric interpretations. In addition, a lot of what works in the 3-space we live in (in the non-relativistic approximation) generalizes to linear algebra, to vector spaces more generally, and, in particular, to both Hilbert spaces and Banach spaces.
Yeah, dergachev is pointing to one of the cool direct applications of linear algebra to machine learning.
Suppose you are given the data D = (r_1; r_2; ...; r_N), where each row r_i is an (n+1)-vector r_i = (a_i1, a_i2, ..., a_in, b_i).
Each row consists of some observation data. We want to predict a future b_j from its \vec{a}_j, having seen {r_i}_{i=1...N} (the data set consists of N data rows r_i in which both \vec{a}_i and b_i are known).
One simple model for b_i given \vec{a}_i = (a_i1, a_i2, ..., a_in) is
a linear model with $n$ parameters m_1, m_2, ..., m_n:
y_m(\vec{a}_i) = m_1 a_i1 + m_2 a_i2 + ... + m_n a_in.
If the model is good then y_m(\vec{a}_i) approximates b_i well.
But startup wisdom dictates that we should measure everything!
Enter error term:
e_i(\vec{m}) = |y_m(\vec{a}_i) - b_i|²,
the squared absolute-value of the difference between the model's prediction and the actual output---hence the name error term.
Our goal is to make the sum $S$ of all the error terms as small as possible.
S(\vec{m}) = \sum_{i=1}^{N} e_i(\vec{m})
Note that the "total squared error" is a function of the model parameters \vec{m}.
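To make that concrete, here's a tiny sketch with made-up data (NumPy assumed; the names y_m and total_squared_error are mine) of the error terms e_i and the total squared error S as a function of \vec{m}:

    import numpy as np

    def y_m(m, a):
        """Linear model prediction m_1*a_1 + ... + m_n*a_n for one input vector a."""
        return np.dot(m, a)

    def total_squared_error(m, a_rows, b):
        """S(m) = sum over all N rows of |y_m(a_i) - b_i|^2."""
        return sum((y_m(m, a_i) - b_i) ** 2 for a_i, b_i in zip(a_rows, b))

    # N = 3 data rows with n = 2 input features each, plus the observed outputs b_i.
    a_rows = [np.array([1.0, 2.0]), np.array([2.0, 1.0]), np.array([3.0, 3.0])]
    b = [5.0, 4.0, 9.0]

    # Different parameter vectors give different total errors; we want the smallest one.
    print(total_squared_error(np.array([1.0, 2.0]), a_rows, b))  # 0.0 -- this m fits exactly
    print(total_squared_error(np.array([1.0, 1.0]), a_rows, b))  # 14.0 -- a worse choice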
At this point we have reached a level of complexity that is becoming difficult to follow.
Linear algebra to the rescue!
We can express the "vector prediction" of the model y_m in "one shot" in terms
of the following matrix equation:
A\vec{m} = \vec{b}
where A is an N by n matrix (it holds the a_{ij} part of the data),
\vec{m} is an n by 1 vector (the model parameters, i.e. the unknown),
\vec{b} is an N by 1 vector (it holds the b_i part of the data).
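A quick sketch of what that buys us, reusing the toy data from above (NumPy assumed): stacking the a-parts into A and the b-parts into \vec{b} lets you compute every prediction with a single matrix-vector product.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 3.0]])       # N = 3 rows, n = 2 features: the a-part of the data
    b = np.array([5.0, 4.0, 9.0])    # the b-part of the data
    m = np.array([1.0, 1.0])         # some candidate parameter vector

    predictions = A @ m              # all N predictions y_m(a_i) in one shot
    S = np.sum((predictions - b) ** 2)
    print(predictions, S)            # same total squared error as the row-by-row loop (14.0)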
To find \vec{m}, we must solve this matrix equation,
however A is not a square matrix: A is a tall skinny matrix N >> n,
so there is no A^{-1}.
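You can see that concretely in a couple of lines (NumPy assumed): asking for the inverse of a tall matrix just raises an error.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 3.0]])       # 3 x 2: tall and skinny, not square

    try:
        np.linalg.inv(A)             # inverses are only defined for square matrices
    except np.linalg.LinAlgError as err:
        print("no inverse:", err)    # NumPy rejects the non-square input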
Okay, so we don't have an A^{-1} to throw at the equation A\vec{m} = \vec{b} to cancel the A, but what else could we throw at it? Let's throw A^T at it!
A^T A \vec{m} = A^T \vec{b}

Writing N = A^T A (an n by n matrix), this becomes

N \vec{m} = A^T \vec{b}
Now the thing to observe is that if N is invertible,
then we can find \vec{m} using
\vec{m} = N^{-1} A^T \vec{b}
This solution to the problem is known as the "least squares fit" solution (i.e. the choice of parameters m_1, m_2, ..., m_n for the model). The name comes from the fact that the vector \vec{m} is the output of the optimization problem argmin_{\vec{m}} S(\vec{m}): it is the choice of parameters that makes the total squared error defined above as small as possible.
Technical detail: the matrix N = A^T A is invertible if and only if the columns of A are linearly independent.
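In NumPy the whole recipe is a few lines (toy data again; solving the system N\vec{m} = A^T\vec{b} directly is preferable to forming N^{-1} explicitly):

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 3.0]])
    b = np.array([5.0, 4.0, 9.0])

    N = A.T @ A                                      # n x n
    assert np.linalg.matrix_rank(A) == A.shape[1]    # columns independent => N invertible
    m = np.linalg.solve(N, A.T @ b)                  # solves N m = A^T b: the least squares fit
    print(m)                                         # [1. 2.] -- recovers the exact fit here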
TL;DR. When you have to fit a "linear regression" model to a data matrix X and labels \vec{y},
the best (in the sense of least squared error) linear model is \vec{m} = (X^T X)^{-1} X^T \vec{y}.
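If you want to sanity-check the TL;DR formula numerically, here's a sketch with synthetic data: it agrees with NumPy's built-in least-squares solver. In practice you'd call np.linalg.lstsq (or solve the normal equations) rather than explicitly forming (X^T X)^{-1}.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))                  # N = 100 rows, n = 3 features
    true_m = np.array([2.0, -1.0, 0.5])
    y = X @ true_m + rng.normal(0, 0.1, size=100)  # noisy labels

    m_formula = np.linalg.inv(X.T @ X) @ X.T @ y   # the TL;DR formula, written literally
    m_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(m_formula, m_lstsq))         # True: both give the least squares fit
    print(m_lstsq)                                 # close to [2.0, -1.0, 0.5]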