Matrix Calculus (matrixcalculus.org)
408 points by sytelus on Nov 18, 2017 | hide | past | favorite | 51 comments



For those who might not realize how amazing this is, take a look at how you can do these things manually:

https://tminka.github.io/papers/matrix/minka-matrix.pdf


I never learned matrix calculus, so I find this tool helpful for following technical papers that involve it.


I have done all kinds of work that required some kind of matrix calculus in one form or another. There are of course all kinds of references (the sibling links to my favorite), but I have found that more often than not the best way to get the results you want is just to calculate them yourself. The work involved is usually tedious but trivial, and working through it goes a long way toward making sense of the various identities that come out. Looking at an identity, one may see a transpose here, an inner product there, but not be able to assign any importance or distinction to any particular term at first glance. Working through a few calculations helps with this.

EDIT: I suppose I should also say that I never "learned" matrix calculus either, in the sense that I internalized the various features unique to matrices under derivatives and integrals. The calculations I refer to above are crude, naive ones in the scalar notation under whatever coordinate system seems appropriate.


When I was starting out in machine learning, as a programmer with the most rudimentary calculus background, it was easy to derive algorithms that had terms like "gradient w.r.t X of log(det(inv(λI + A X A')))", which absolutely stumped me when trying to derive the gradient by hand via elementwise partials.

However, thanks to Minka's notes and the Matrix Cookbook, I was able to eventually get a handle on easy techniques for these derivations! It's certainly no substitute for getting a handle on the theory first by studying a textbook, but these pattern-matching shorthands are important practical techniques.
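These pattern-matching identities can also be spot-checked numerically. Here is a minimal sketch (my own toy example, not the exact expression above) that verifies the standard identity d(log det X)/dX = inv(X)' with central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
X = B @ B.T + 3 * np.eye(3)          # well-conditioned SPD matrix

f = lambda M: np.log(np.linalg.det(M))
grad = np.linalg.inv(X).T            # identity: d(log det X)/dX = inv(X)^T

# verify element-wise with central finite differences
eps = 1e-6
fd = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(grad, fd, atol=1e-5)
```

The same recipe (perturb one entry, recompute, compare) works for any identity from those references.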


How did you start in machine learning? Did you come from a top-10 school?


No, I failed out of university and am self-taught. I spent many years deliberately underemployed working on self-study and coding up projects at the limit of my abilities. I had decided to bet on investing in myself instead of just carrying out some boss’s wishes, which would be more optimized to benefit the company than my own growth and development.

Here’s a great resource if you’re starting out today: http://datasciencemasters.org


Thanks for the link! That's more or less the road I'm starting down right now, albeit with likely a bumpier past (dropped out of HS in addition to college and spent a few years doing almost nothing productive.)


Inspiring story!!


You might also find the Matrix Cookbook [0] to be a useful reference.

[0]: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf


The best resource I have come across is the Wikipedia page. It's comprehensive and concise. https://en.wikipedia.org/wiki/Matrix_calculus


This kind of stuff just infuriates me. I can't see any good reason for having so many competing notational standards for the same thing:

The notation used here is commonly used in statistics and engineering, while the tensor index notation is preferred in physics.

Two competing notational conventions split the field of matrix calculus into two separate groups. The two groups can be distinguished by whether they write the derivative of a scalar with respect to a vector as a column vector or a row vector. Both of these conventions are possible even when the common assumption is made that vectors should be treated as column vectors when combined with matrices (rather than row vectors). A single convention can be somewhat standard throughout a single field that commonly uses matrix calculus (e.g. econometrics, statistics, estimation theory and machine learning). However, even within a given field different authors can be found using competing conventions. Authors of both groups often write as though their specific convention is standard.

Seriously? So if I want to read a paper that uses Matrix Calculus, it's not enough to just understand Matrix Calculus in general.. no, first I have to decipher which of a legion of possible notations the author used, and then keep that state in mind when thinking about that paper in relation to another, which might use yet another notation.

I understand that ultimately nobody is in a position to mandate the adoption of a universal standard, but part of me wishes there were one (this is, of course, not a problem limited to matrix calculus).
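To make the split concrete, here is the same derivative written in the two layouts (a standard textbook example, not from the quoted article):

```latex
% numerator layout: the derivative of a scalar by a column vector is a row vector
\frac{\partial\, x^{\top} A x}{\partial x} = x^{\top}\left(A + A^{\top}\right)

% denominator layout: the same derivative is a column vector
\frac{\partial\, x^{\top} A x}{\partial x} = \left(A + A^{\top}\right) x
```

The two results are transposes of each other, which is exactly what makes mixing papers from the two camps so error-prone.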


Some other examples:

sign conventions in thermodynamics

conventions in Fourier transforms

short billion vs long billion

calorie vs Calorie

English vs. metric units

SI vs cgs metric

esu vs Gaussian units in EM


I can see that x' means $x^T$ (x transposed) but it's not mentioned in the documentation — where does this notation come from?


From http://mathworld.wolfram.com/Transpose.html :

Unfortunately, several other notations are commonly used, as summarized in the following table. The notation Aᵀ is used in this work.

Aᵀ — This work; Golub and Van Loan (1996), Strang (1988)

à — Arfken (1985, p. 201), Griffiths (1987, p. 223)

A' — Ayres (1962, p. 11), Courant and Hilbert (1989, p. 9)


Matlab/octave


With the caveat that A' is the Hermitian transpose (transpose plus complex conjugate) when A has complex entries; you need A.' if you want the plain transpose. For a matrix A with real coefficients both operations are the same.
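For what it's worth, the same distinction exists in NumPy: `A.T` is the plain transpose (Matlab's `A.'`), and you need an explicit conjugate for the Hermitian transpose (Matlab's `A'`). A small sketch:

```python
import numpy as np

A = np.array([[1 + 2j, 3],
              [4j,     5]])

plain = A.T              # like Matlab's A.' (no conjugation)
hermitian = A.conj().T   # like Matlab's A' (conjugate transpose)

# the two differ exactly on the imaginary parts
assert plain[0, 1] == 4j
assert hermitian[0, 1] == -4j
```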


Awesome, pretty handy to have for engineering/scientific computing code. Over time, you do start to build the intuition just like you do for scalar derivative calculations, but it takes time.


The example has sine of a vector in it. I haven't had coffee yet today but I've never heard of being able to compute sine of a vector. What does this mean? Defining it by its Taylor series doesn't work like it does for square matrices because vectors can't be multiplied, and if you assume the direct product is what's meant then each term in the Taylor series is a different sized matrix and can't be added. Surely it doesn't mean elementwise sine?


It is element-wise sine, which is not entirely correct. You can take the sine of a matrix (or apply any analytic function to it), but you'd have to compute the characteristic polynomial of the matrix and apply the function through that. This is a consequence of the Cayley-Hamilton theorem. Here is a tutorial: http://web.mit.edu/2.151/www/Handouts/CayleyHamilton.pdf


You don't need to know the characteristic polynomial to do it. Just sum as many terms of the power series as you want.

What the characteristic polynomial lets you do is to calculate it faster and more accurately because very large powers can be rewritten in terms of smaller powers that you already computed.
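A minimal sketch of the power-series approach (`sinm_series` is an illustrative name of mine; assumes entries small enough that a short truncation converges):

```python
import numpy as np
from math import factorial

def sinm_series(A, terms=20):
    """sin(A) = A - A^3/3! + A^5/5! - ..., truncated after `terms` terms."""
    A = np.asarray(A, dtype=float)
    result = np.zeros_like(A)
    power = A.copy()                      # holds A^(2k+1)
    for k in range(terms):
        result += (-1) ** k / factorial(2 * k + 1) * power
        power = power @ A @ A             # step from A^(2k+1) to A^(2k+3)
    return result

# sanity check on a diagonal matrix, where matrix sine reduces to
# elementwise sine of the diagonal
D = np.diag([0.5, 1.0])
assert np.allclose(sinm_series(D), np.diag(np.sin([0.5, 1.0])))
```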


Yes it's element-wise, it's defined in the sidebar tab "Operators". In general the syntax seems closely aligned with Matlab, where all the math functions are elementwise.


Is the source code for this available?


Along the same lines, does there exist an "algebra checker" that could, say, take in two successive lines of latex, with perhaps a hint of how to get from one to the other, and confirm that there are no algebra errors?


The choice of LaTeX comes with two problems:

1) It's designed to describe graphical presentation, not semantics.

2) Parsing it properly is very difficult, especially considering how macros and other definitions can influence the syntax.


You could:

1) assume a standard interpretation of symbols (e.g. this site uses a-g for scalars, h-z for vectors and A-Z for matrices) and make those assumptions explicit and modifiable (e.g. WolframAlpha displays "assuming X is a Y, use as a Z instead").

2) Support only the subset that MathJax can handle, which seems to be enough for most purposes.

I definitely agree that LaTeX is not the optimal input format for that purpose, though.


See my comment above about my experiences with this: I tried, and a corner case killed me. There's a reason why Mathematica and other CAS packages have invented their own idiosyncratic notations. LaTeX and MathML are truly hopeless when it comes to communicating with algebra systems. It looks deceptively 'doable', but really it's a recipe for disaster.


I'm with cosmic_ape regarding the recommendation for SymPy. Write the two lines as symbolic expressions `expr1` and `expr2` then run `simplify(expr1-expr2)` to check if the answer is zero.

If you want to start with LaTeX then the parser from this library might be useful: https://github.com/Khan/KAS


SymPy does not take LaTeX as input, but it has symbolic equivalence checkers.

The methods are highly heuristic, and some languages (e.g. all expressions built from, say, \pi and the exp function) are in general undecidable.


To elaborate: Symbolic equivalence checking for two expressions f1 and f2 can be done by seeing if sympy.simplify(sympy.Eq(f1, f2)) is sympy.true.

  import sympy
  from sympy.abc import x, y, alpha, s
  quad = s ** 2 - alpha * s - 2
  # Let s1 and s2 be the two solutions to the quadratic equation 'quad == 0'
  s1, s2 = sympy.solve(quad, s)
  u = (x - s2) / (x - s1) * (y - s1) / (y - s2)
  f1 = (s2 - s1 * u) / (1 - u)
  f2 = (x * y - alpha * x - 2) / (y - x)
  # Claim: f1 is equal to f2
  print(sympy.simplify(sympy.Eq(f1, f2)))
  # Prints "True"


Is there something for implication or equivalence of equations? I was trying the example from Smaug123's sibling comment (https://news.ycombinator.com/item?id=15728598), but

  >>> import sympy as sp
  >>> x = sp.Symbol('x')
  >>> sp.simplify(sp.Implies(sp.Eq(x**2 + 2*x + 1, 0), sp.Eq(x, -1)))

  Eq(x, -1) | Ne(x**2 + 2*x + 1, 0)

  >>> sp.solve(sp.simplify(sp.Implies(sp.Eq(x**2 + 2*x + 1, 0), sp.Eq(x, -1))))

  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.5/site-packages/sympy/solvers/solvers.py", line 1065, in solve
    solution = _solve(f[0], *symbols, **flags)
  File "/home/user/.local/lib/python3.5/site-packages/sympy/solvers/solvers.py", line 1401, in _solve
    f_num, sol = solve_linear(f, symbols=symbols)
  File "/home/user/.local/lib/python3.5/site-packages/sympy/solvers/solvers.py", line 1971, in solve_linear
    eq = lhs - rhs
  TypeError: unsupported operand type(s) for -: 'Or' and 'int'

  >>> sp.solveset(sp.simplify(sp.Implies(sp.Eq(x**2 + 2*x + 1, 0), sp.Eq(x, -1))))

  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.5/site-packages/sympy/solvers/solveset.py", line 880, in solveset
    raise ValueError("%s is not a valid SymPy expression" % (f))
  ValueError: Eq(x, -1) | Ne(x**2 + 2*x + 1, 0) is not a valid SymPy expression
none of the obvious ways appear to work. Does Sympy not support this kind of equational reasoning?


I think there used to be an open-source tool called Maxima or something that would simplify equations. Haven't used it for the last 10 years and never used it much anyway.


I can recommend Maxima too. It's useful for checking for obvious errors and for simplifying expressions. It can be used for more advanced stuff, but I found it has a steep learning curve.


On a whim I programmed a symbolic integral-solving library in Python (I am not a professional programmer) and, as an ill-advised 'extension', bolted on the ability to receive and emit LaTeX expressions. Long story short: I was boasting about it on Nassim Nicholas Taleb's twitter feed, submitted an answer to a problem question he had posed that got mis-parsed by the front end, and gave the wrong answer (which was internally consistent, because the check stage verified the answer against the flawed, mis-parsed question). I got banned, flamed, and persecuted to Hell and back. True story.

From this I learnt the hard way that tokenisation and representation (in LaTeX or MathML) do not belong in the same place as a CAS (computer algebra system).


I believe that LaTeX/MathML absolutely do belong in a CAS, just not as the primary means of interaction. (I.e. you do your work in the CAS's own language and then export to LaTeX.)

But maybe your example can convince me otherwise. Could you show the specific LaTeX code in question and describe how your program mishandled it?


I had set myself the target of implementing the Risch Algorithm (https://en.wikipedia.org/wiki/Risch_algorithm) in SymPy (3) but the event was so embarrassing I deleted my code and never published it publicly.

I’m not a programmer. I erred in trying to do a programmer’s job. Also, I discovered why Nassim Nicholas Taleb has the reputation for being rather uncharitable (but deep down I feel like I deserved it, because hey, I stated a mathematical untruth).


As one of the authors of this tool I understand that you would like to have some way of trusting the output.

Internally we check it numerically, i.e., generate some random data for the given variables and check the derivative by comparing it to an approximation via finite differences. We will ship this code (hopefully soon) with one of the next versions. You can then check it yourself. Otherwise, as far as I know there does not exist any other tool that computes matrix derivatives so I understand it is hard to convince anyone of the correctness of the results. But I hope the numerical tests will be helpful.
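The check described above can be sketched as follows (my own toy example, using the well-known gradient of x'Ax rather than the tool's internal code):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))      # random data for the variables
x = rng.standard_normal(4)

f = lambda v: v @ A @ v              # f(x) = x'Ax
analytic = (A + A.T) @ x             # candidate derivative to be checked

# approximate the gradient via central finite differences
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])

assert np.allclose(analytic, fd, atol=1e-5)
```

If the symbolic result were wrong, the assertion would fail with overwhelming probability on random data.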


Mathematica can probably do that


Correct:

    Implies[x^2 + 2 x + 1 == 0, x == -1] // FullSimplify


  [In] FullSimplify[expr1 == expr2]

  [Out] True


I cannot even begin to explain how intensely I've been desiring this.


I haven't touched calculus since college and this freaks me out. I don't even know where to begin; all I remember is using mnemonics to recall rules for dx/dy or something. How would someone go about understanding this? ML is super interesting but seems daunting to do anything worthwhile without understanding the maths behind it.


Some time ago I implemented a library [1] similar to this tool. The tricky part is that derivatives quickly exceed 2 dimensions, e.g. derivative of a vector output w.r.t. input matrix is a 3D tensor (e.g. if `y = f(X)`, you need to find derivative of each `y[i]` w.r.t. each `X[m,n]`), and we don't have a notation for it. Also, often such tensors are very sparse (e.g. for element-wise `log()` derivative is a matrix where only the main diagonal has non-zero values corresponding to derivatives `dy[i]/dx[i]` where `y = log(x)`).

The way I dealt with it is to first translate vectorized expression to so-called Einstein notation [2] - indexed expression with implicit sums over repeated indices. E.g. matrix product `Z = X * Y` may be written in it as:

    Z[i,j] = X[i,k] * Y[k,j]   # implicitly sum over k
 
It worked pretty well and I was able to get results in Einstein notation for element-wise functions, matrix multiplication and even convolutions.

Unfortunately, the only way to calculate such expressions efficiently is to convert them back to vectorized notation, and it's not always possible (e.g. because of sparse structure) and very error-prone.

The good news is that if the result of the whole expression is a scalar, all the derivatives will have the same number of dimensions as corresponding inputs. E.g. in:

    y = sum(W * X + b)
if `W` is a matrix, then `dy/dW` is also a matrix (without the sum it would be a 3D tensor). This is the reason why the backpropagation algorithm (and symbolic/automatic differentiation in general) in machine learning works. So finally I ended up with another library [3], which can only deal with scalar outputs, but is much more stable.

A theoretical description of the method behind the first library can be found in [4] (pages 1338-1343, caution: 76M) while the set of rules I've derived is in [5].

[1]: https://github.com/dfdx/XDiff.jl

[2]: https://en.wikipedia.org/wiki/Einstein_notation

[3]: https://github.com/dfdx/XGrad.jl

[4]: http://docs.mipro-proceedings.com/proceedings/mipro_2017_pro...

[5]: https://github.com/dfdx/XDiff.jl/blob/master/src/trules.jl
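For what it's worth, NumPy's `einsum` expresses the Einstein-notation product above directly, and the shape claim for scalar outputs can be sketched in a few lines (toy example, assuming the `y = sum(W * X + b)` setup from the comment):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4))
Y = rng.standard_normal((4, 5))

# Z[i,j] = X[i,k] * Y[k,j], implicit sum over the repeated index k
Z = np.einsum('ik,kj->ij', X, Y)
assert np.allclose(Z, X @ Y)

# scalar output: y = sum(W @ X + b). Differentiating the sum gives
# dy/dW[a, b] = sum_j X[b, j], so the gradient has W's shape.
W = rng.standard_normal((2, 3))
grad = np.tile(X.sum(axis=1), (2, 1))   # every row equals X's row sums
assert grad.shape == W.shape
```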


Would be nice if there was a "symmetric matrix" variable type, so it could simplify Ax + A'x to 2Ax if A is symmetric.
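For what it's worth, SymPy's assumption machinery can already express part of this (a sketch; how far the resulting sum collects depends on the SymPy version):

```python
import sympy as sp

A = sp.MatrixSymbol('A', 3, 3)
x = sp.MatrixSymbol('x', 3, 1)

# under the assumption that A is symmetric, A.T rewrites to A ...
assert sp.refine(A.T, sp.Q.symmetric(A)) == A

# ... so refining A*x + A'*x removes the transpose
expr = A * x + A.T * x
print(sp.refine(expr, sp.Q.symmetric(A)))
```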


I wonder why they don't do it. In the tool I had running, we handled this by removing any transpose of a symmetric matrix (after propagating it down to the leaves). Together with the simplification rule x + x -> 2*x for any x, you get the expected result. I can only guess why they didn't include it in the online matrix calculus tool. It was published after I left the group.


It is not in the current online tool but we will add it again soon. It is still in there the way you describe it (passing transpose down to the leaves and simplification rules as well). Btw: How are you doing and where have you been? Would be nice to also add a link to you and your current site. GENO is also on its way.


I'm good. Looking at things from the data angle now. But unfortunately no public page. You can link to the old one, if you want to. Have you compared against TensorFlow XLA?


I did not compare to TensorFlow XLA but I compared it to TensorFlow. Of course, it depends on the problem. For instance, for evaluating the Hessian of x'Ax, MC is a factor of 100 faster than TF. But MC and TF have different objectives: TF focuses more on scalar-valued functions as needed for deep learning, MC on the general case, especially also vector- and matrix-valued functions as needed for dealing with constraints. But I will give TF XLA a try.


I think XLA is trying to reduce the overhead introduced by backprop, meaning when you optimize the computational graph you might end up with an efficient calculation of the gradient (closer to the calculation you get with MC). Regarding non-scalar valued functions: Don't you reduce a constrained problem to a series of unconstrained problems (via a penalty (or even augmented Lagrangian) or barrier method)? Then you only need the gradient of a scalar valued function to solve constrained problems. I imagine you can use the gradients of the constraints for solving the KKT conditions directly but this seems only useful if the system is not too big. But for sure it opens new possibilities.
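The penalty idea in one toy sketch (a made-up one-dimensional example with a closed-form minimizer): minimize (x - 2)^2 subject to x = 1. The penalized unconstrained minimizer approaches the constrained optimum x = 1 as the penalty weight mu grows:

```python
def penalty_minimizer(mu):
    # minimize (x - 2)^2 + mu * (x - 1)^2 over x; setting the derivative
    # 2*(x - 2) + 2*mu*(x - 1) to zero gives the closed-form minimizer:
    return (2 + mu) / (1 + mu)

# the unconstrained minimizers converge to the constrained optimum x = 1
for mu in (1, 10, 100, 1000):
    print(mu, penalty_minimizer(mu))
```

Only a scalar-valued objective is ever minimized, which is the point being made about gradients of scalar functions sufficing for constrained problems.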


XLA is good for the GPU only. On the CPU, MC is about 20-50% faster than TF on scalar-valued functions. For the GPU I don't know yet. But it is true that for the augmented Lagrangian you only need scalar-valued functions. This is really efficient on large-scale problems. But on small-scale problems (up to 5000 variables) you really need interior point methods that solve the KKT conditions directly, as you point out. This is sometimes really needed. However, when you look at the algorithms, TF and MC do not differ too much. In the end, there is a restricted number of ways of computing derivatives, and basically most of them are the same (or boil down to two versions). Some of the claims made in the early autodiff literature concerning symbolic differentiation are just not true. In the end, they are fairly similar. But let's see how XLA performs.


Is it open source?


Bookmarked!



