
Matrix Calculus - sytelus
http://www.matrixcalculus.org
======
lliiffee
For those who might not realize how amazing this is, take a look at how you
can do these things manually:

[https://tminka.github.io/papers/matrix/minka-matrix.pdf](https://tminka.github.io/papers/matrix/minka-matrix.pdf)

------
j7ake
I never learned matrix calculus, so I find this tool helpful for following
technical papers that involve it.

~~~
gajomi
I have done all kinds of work that required some kind of matrix calculus in
one form or another. There are of course all kinds of references (a sibling
comment links to my favorite), but I have found that more often than not the
best way to get the results you want is just to calculate them yourself. The
work involved is usually tedious but trivial, and working through it goes a
long way toward making sense of the various identities that come out. Looking
at an identity, one may see a transpose here, an inner product there, but not
be able to assign any importance or distinction to any particular term at
first glance. Working through a few calculations helps with this.

EDIT: I suppose I should also say that I never "learned" matrix calculus
either, in the sense of internalizing the various features unique to matrices
under derivatives and integrals. The calculations I refer to above are crude,
naive ones in scalar notation under whatever coordinate system seems
appropriate.

~~~
jey
When I was starting out in machine learning, as a programmer with only the
most rudimentary calculus background, it was easy to derive algorithms with
terms like "gradient w.r.t. X of log(det(inv(λI + A X A')))", which absolutely
stumped me when I tried to work out the gradient by hand via elementwise
partials.

However, thanks to Minka's notes and the Matrix Cookbook, I was eventually
able to get a handle on easy techniques for these derivations! They're
certainly no substitute for learning the theory first from a textbook, but
these pattern-matching shorthands are important practical techniques.
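
For that particular term, the identity d log det(S) = tr(inv(S) dS) gives the
gradient as -A' inv(S)' A with S = λI + A X A'. A quick NumPy finite-difference
check (just a sketch with arbitrary shapes; X is made positive semi-definite
here so the determinant stays positive):

    import numpy as np
    
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 4))
    M = rng.standard_normal((4, 4))
    X = M @ M.T   # positive semi-definite, keeps S well-behaved
    lam = 2.0
    
    def f(X):
        S = lam * np.eye(3) + A @ X @ A.T
        return np.log(np.linalg.det(np.linalg.inv(S)))
    
    # Analytic gradient: -A' inv(S)' A, from d log det(S) = tr(inv(S) dS).
    S = lam * np.eye(3) + A @ X @ A.T
    grad = -A.T @ np.linalg.inv(S).T @ A
    
    # Central finite differences over each entry of X.
    eps, fd = 1e-6, np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        dX = np.zeros_like(X)
        dX[idx] = eps
        fd[idx] = (f(X + dX) - f(X - dX)) / (2 * eps)
    
    print(np.allclose(grad, fd, atol=1e-5))  # True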

~~~
platz
How did you start in machine learning? Did you come from a top-10 school?

~~~
jey
No, I failed out of university and am self-taught. I spent many years
deliberately underemployed, working on self-study and coding up projects at
the limit of my abilities. I had decided to bet on investing in myself instead
of just carrying out some boss's wishes, which would have been optimized more
for the company's benefit than for my own growth and development.

Here’s a great resource if you’re starting out today:
[http://datasciencemasters.org](http://datasciencemasters.org)

~~~
asddddd
Thanks for the link! That's more or less the road I'm starting down right now,
albeit with a likely bumpier past (dropped out of HS in addition to college
and spent a few years doing almost nothing productive).

------
Klasiaster
I can see that x' means $x^T$ (x transposed) but it's not mentioned in the
documentation — where does this notation come from?

~~~
strawcomb
Matlab/Octave

~~~
ivan_ah
With the caveat that A' is the Hermitian transpose (transpose + complex
conjugate) when A has complex entries; you need A.' if you want the plain
transpose. For a matrix A with real entries, both operations are the same.
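
For comparison, NumPy makes the same distinction, just with different
spelling:

    import numpy as np
    
    A = np.array([[1 + 2j, 3 - 1j],
                  [4j, 5]])
    print(A.T)         # plain transpose, like Matlab's A.'
    print(A.conj().T)  # Hermitian transpose, like Matlab's A'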

------
thearn4
Awesome, pretty handy to have for engineering/scientific computing code. You
do eventually build the same intuition you have for scalar derivative
calculations, but it takes time.

------
cjbillington
The example has sine of a vector in it. I haven't had coffee yet today but
I've never heard of being able to compute sine of a vector. What does this
mean? Defining it by its Taylor series doesn't work like it does for square
matrices because vectors can't be multiplied, and if you assume the direct
product is what's meant then each term in the Taylor series is a different
sized matrix and can't be added. Surely it doesn't mean elementwise sine?

~~~
leecarraher
It is element-wise sin, which is not entirely correct. You can take the sin of
a matrix, or apply any analytic function to one, but you'd have to compute the
characteristic polynomial of the matrix and apply the function to that; this
is a result of the Cayley-Hamilton theorem. Here is a tutorial:
[http://web.mit.edu/2.151/www/Handouts/CayleyHamilton.pdf](http://web.mit.edu/2.151/www/Handouts/CayleyHamilton.pdf)

~~~
btilly
You don't need to know the characteristic polynomial to do it. Just sum as
many terms of the power series as you want.

What the characteristic polynomial lets you do is calculate it faster and more
accurately, because very large powers can be rewritten in terms of smaller
powers that you already computed.
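
A minimal NumPy sketch of this, comparing a truncated power series against an
eigendecomposition-based reference (using a symmetric matrix so the
eigendecomposition is easy):

    import numpy as np
    
    def sin_series(A, terms=20):
        """sin(A) for a square matrix, by truncating its Taylor series."""
        result = np.zeros_like(A)
        power = A.copy()        # holds A^(2k+1)
        sign, fact = 1.0, 1.0   # (-1)^k and (2k+1)!
        for k in range(terms):
            result = result + sign * power / fact
            power = power @ A @ A               # next odd power of A
            sign = -sign
            fact *= (2 * k + 2) * (2 * k + 3)   # (2k+1)! -> (2k+3)!
        return result
    
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    A = (A + A.T) / 2                             # symmetric
    w, V = np.linalg.eigh(A)
    reference = V @ np.diag(np.sin(w)) @ V.T      # sin(A) via eigendecomposition
    print(np.allclose(sin_series(A), reference))  # True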

------
goerz
Is the source code for this available?

------
mjfl
Along the same lines, does there exist an "algebra checker" that could, say,
take in two successive lines of LaTeX, with perhaps a hint of how to get from
one to the other, and confirm that there are no algebra errors?

~~~
chillydawg
Mathematica can probably do that

~~~
Smaug123
Correct:

    Implies[x^2 + 2 x + 1 == 0, x == -1] // FullSimplify

------
tony_cannistra
I cannot even begin to explain how intensely I've been desiring this.

------
SnowingXIV
I haven't touched calculus since college and this freaks me out. I don't even
know where to begin; all I remember is using mnemonics to recall rules for
dx/dy or something. How would someone go about understanding this? ML is super
interesting, but it seems daunting to do anything worthwhile without
understanding the math behind it.

------
ffriend
Some time ago I implemented a library [1] similar to this tool. The tricky
part is that derivatives quickly exceed 2 dimensions, e.g. the derivative of a
vector output w.r.t. an input matrix is a 3D tensor (if `y = f(X)`, you need
to find the derivative of each `y[i]` w.r.t. each `X[m,n]`), and we don't have
a notation for that. Also, such tensors are often very sparse, e.g. for
element-wise `log()` the derivative is a matrix in which only the main
diagonal has non-zero values, corresponding to the derivatives `dy[i]/dx[i]`
where `y = log(x)`.
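
For instance, for `y = log(x)` with a length-3 `x`, the full Jacobian is just:

    import numpy as np
    
    x = np.array([1.0, 2.0, 4.0])
    # J[i, j] = dy[i]/dx[j] for y = log(x); nonzero only when i == j.
    J = np.diag(1.0 / x)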

The way I dealt with it is to first translate the vectorized expression into
so-called Einstein notation [2]: an indexed expression with implicit sums over
repeated indices. E.g. the matrix product `Z = X * Y` may be written in it as:

    Z[i,j] = X[i,k] * Y[k,j]   # implicitly sum over k

It worked pretty well and I was able to get results in Einstein notation for
element-wise functions, matrix multiplication and even convolutions.
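
For reference, NumPy's `einsum` evaluates exactly this kind of indexed
expression:

    import numpy as np
    
    X = np.arange(6.0).reshape(2, 3)
    Y = np.arange(12.0).reshape(3, 4)
    Z = np.einsum('ik,kj->ij', X, Y)  # implicit sum over the repeated index k
    print(np.allclose(Z, X @ Y))      # True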

Unfortunately, the only way to calculate such expressions efficiently is to
convert them back to vectorized notation, which is not always possible (e.g.
because of the sparse structure) and is very error-prone.

The good news is that if the result of the whole expression is a scalar, all
the derivatives have the same number of dimensions as the corresponding
inputs. E.g. in:

    y = sum(W * X + b)

if `W` is a matrix, then `dy/dW` is also a matrix (without the sum it would be
a 3D tensor). This is the reason why the backpropagation algorithm (and
symbolic/automatic differentiation in general) works in machine learning. So I
finally ended up with another library [3], which can only deal with scalar
outputs, but is much more stable.
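
A quick numeric illustration of that shape property (a sketch with arbitrary
sizes, reading `*` above as the matrix product):

    import numpy as np
    
    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 4))
    X = rng.standard_normal((4, 5))
    b = rng.standard_normal((3, 1))
    
    y = lambda W: np.sum(W @ X + b)  # scalar output
    
    # Analytic gradient: dy/dW[m, n] = sum_j X[n, j] -- same shape as W.
    grad = np.tile(X.sum(axis=1), (W.shape[0], 1))
    
    # Central finite differences over each entry of W.
    eps, fd = 1e-6, np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        dW = np.zeros_like(W)
        dW[idx] = eps
        fd[idx] = (y(W + dW) - y(W - dW)) / (2 * eps)
    
    print(grad.shape == W.shape, np.allclose(grad, fd))  # True True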

A theoretical description of the method behind the first library can be found
in [4] (pages 1338-1343; caution: ~76 MB), while the set of rules I've derived
is in [5].

[1]: [https://github.com/dfdx/XDiff.jl](https://github.com/dfdx/XDiff.jl)

[2]:
[https://en.wikipedia.org/wiki/Einstein_notation](https://en.wikipedia.org/wiki/Einstein_notation)

[3]: [https://github.com/dfdx/XGrad.jl](https://github.com/dfdx/XGrad.jl)

[4]: [http://docs.mipro-proceedings.com/proceedings/mipro_2017_proceedings.pdf](http://docs.mipro-proceedings.com/proceedings/mipro_2017_proceedings.pdf)

[5]:
[https://github.com/dfdx/XDiff.jl/blob/master/src/trules.jl](https://github.com/dfdx/XDiff.jl/blob/master/src/trules.jl)

------
a1k0n
It would be nice if there were a "symmetric matrix" variable type, so that it
could simplify Ax + A'x to 2Ax when A is symmetric.
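
For what it's worth, SymPy's assumption system can express this kind of rule;
a minimal sketch (version-dependent, and unrelated to this tool):

    from sympy import MatrixSymbol, Q, refine
    
    A = MatrixSymbol('A', 3, 3)
    x = MatrixSymbol('x', 3, 1)
    
    expr = A * x + A.T * x
    # refine rewrites A.T as A under the symmetry assumption;
    # doit() then collects the like terms.
    print(refine(expr, Q.symmetric(A)).doit())  # 2*A*x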

~~~
jkam
I wonder why they don't do it. In the tool I had running, we handled this by
removing any transpose of a symmetric matrix (after propagating transposes
down to the leaves). Together with the simplification rule x + x -> 2*x for
any x, you get the expected result. I can only guess why they didn't include
it in the online matrix calculus tool; it was published after I left the
group.

~~~
SoerenL
It is not in the current online tool, but we will add it again soon. It is
still in there the way you describe it (passing transposes down to the leaves,
plus simplification rules). Btw: how are you doing, and where have you been?
It would be nice to also add a link to you and your current site. GENO is also
on its way.

~~~
jkam
I'm good. Looking at things from the data angle now. But unfortunately no
public page. You can link to the old one, if you want to. Have you compared
against TensorFlow XLA?

~~~
SoerenL
I did not compare to TensorFlow XLA, but I compared it to TensorFlow. Of
course, it depends on the problem. For instance, for evaluating the Hessian of
x'Ax, MC is a factor of 100 faster than TF. But MC and TF have different
objectives: TF focuses more on scalar-valued functions as needed for deep
learning, MC on the general case, especially vector- and matrix-valued
functions as needed for dealing with constraints. But I will give TF XLA a
try.

~~~
jkam
I think XLA is trying to reduce the overhead introduced by backprop, meaning
that when you optimize the computational graph you might end up with an
efficient calculation of the gradient (closer to the calculation you get with
MC). Regarding non-scalar-valued functions: don't you reduce a constrained
problem to a series of unconstrained problems (via a penalty or
augmented-Lagrangian method, or a barrier method)? Then you only need the
gradient of a scalar-valued function to solve constrained problems, as in the
sketch below. I imagine you can use the gradients of the constraints to solve
the KKT conditions directly, but this seems useful only if the system is not
too big. But for sure it opens new possibilities.
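
As a toy illustration of that reduction (a hypothetical two-variable problem
with a plain quadratic penalty):

    import numpy as np
    from scipy.optimize import minimize
    
    # Minimize x0^2 + x1^2  subject to  x0 + x1 = 1.
    f = lambda x: x[0] ** 2 + x[1] ** 2
    g = lambda x: x[0] + x[1] - 1.0
    
    x = np.zeros(2)
    for mu in [1.0, 10.0, 100.0, 1000.0]:  # increasing penalty weight
        penalized = lambda x, mu=mu: f(x) + mu * g(x) ** 2
        x = minimize(penalized, x).x       # unconstrained, scalar-valued subproblem
    print(x)  # approaches the constrained optimum [0.5, 0.5]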

~~~
SoerenL
XLA is good for the GPU only. On the CPU, MC is about 20-50% faster than TF on
scalar-valued functions. For the GPU I don't know yet. But it is true that for
the augmented Lagrangian you only need scalar-valued functions. This is really
efficient on large-scale problems. But on small-scale problems (up to 5000
variables) you really need interior point methods that solve the KKT
conditions directly, as you point out. This is sometimes really needed.
However, when you look at the algorithms, TF and MC do not differ too much. In
the end, there is a restricted number of ways of computing derivatives, and
basically most of them are the same (or boil down to two versions). Some of
the claims made in the early autodiff literature about the problems of
symbolic differentiation are just not true. In the end, they are fairly
similar. But let's see how XLA performs.

------
Jerry64545
Is it open source?

------
jmh530
Bookmarked!

