Einsum Is All You Need – Einstein Summation in Deep Learning (2018) | 147 points by aleyan 82 days ago | 46 comments

 The problem with einsum is that you have to explicitly specify the mapping between dimensions and indices every time, without any way to enforce consistency. It would be more ergonomic if each tensor had labeled dimensions. That would prevent the kind of silly mistake where you mix up the ordering of dimensions and only notice it when you later change the shape of the tensors so the different dimensions no longer match up.
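To make that failure mode concrete, a small NumPy sketch (variable names are mine): with square tensors, nothing ties the index string to the actual layout of the array, so a transposed operand silently contracts the wrong axis.

```python
import numpy as np

# Intended layout: x is (batch, feature), W is (feature, hidden).
x = np.arange(9.0).reshape(3, 3)
W = np.eye(3)

# The dimension -> index mapping must be restated at every call site.
h = np.einsum('bf,fh->bh', x, W)
assert np.allclose(h, x @ W)

# If x's axes are later swapped, nothing enforces consistency: with
# square tensors the same string contracts the wrong axis, no error raised.
h_wrong = np.einsum('bf,fh->bh', x.T, W)
assert not np.allclose(h_wrong, h)
```

With non-square shapes you at least get a shape mismatch; the silent case is exactly the one labeled dimensions would catch.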
 I think this is something the community is now slowly realizing; I must have read about it multiple times recently. Some packages that seem to provide tensors with labeled dimensions (I haven't personally worked with them): xarray (Python): https://github.com/pydata/xarray and NamedArrays.jl (Julia): https://github.com/davidavdav/NamedArrays.jl
 ITensor [1] has somewhat pioneered this concept in tensor networks for condensed-matter physics. If I understand correctly, Uni10 [2], a similar project, is even working on a graphical interface for such networks, so that you can "draw" the network and have the computer figure out the best contraction order. In my own code I've recently also implemented such named indices (I only had "einstein summation"-like contraction specifiers before) and they make tensor contractions much simpler to write, especially since you can simply overload operator * and have it figure out which legs need to go together. As a side effect, it also forces your algorithms to make sense, because you can no longer simply add two tensors living on different (but equal-dimensional) vector spaces.
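To illustrate the parent's operator-* idea, here is a minimal hypothetical sketch in NumPy (the Named class and its leg-matching rule are mine, not ITensor's): * contracts every leg name the two operands share.

```python
import numpy as np

class Named:
    """Tensor with named legs; * contracts all legs shared by name."""
    def __init__(self, data, legs):
        self.data, self.legs = np.asarray(data), tuple(legs)

    def __mul__(self, other):
        shared = [l for l in self.legs if l in other.legs]
        ax_a = [self.legs.index(l) for l in shared]
        ax_b = [other.legs.index(l) for l in shared]
        out = np.tensordot(self.data, other.data, axes=(ax_a, ax_b))
        legs = ([l for l in self.legs if l not in shared]
                + [l for l in other.legs if l not in shared])
        return Named(out, legs)

a = Named(np.random.rand(2, 3), ["tl", "p1"])
b = Named(np.random.rand(3, 4), ["p1", "tr"])
c = a * b                       # contracts the shared leg "p1"
assert c.legs == ("tl", "tr") and c.data.shape == (2, 4)
```

Because legs are looked up by name, permuting the storage order of a or b would not change what a * b computes, which is exactly the maintenance win described above.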
 I made a thing to audit that you are consistent with your indices [1], as an alternative to explicit named-tensor objects. And found, but have not used, another package in a similar spirit [2]. Although to be honest, simply writing A[n,μ,ν,c] etc. (using different letters for different spaces) makes it pretty easy to visually check that you are getting this right. This is one of the attractions of the notation, even on paper. Unfortunately np.einsum's string notation makes this harder to see, as the indices aren't adjacent to the variable name. [2] https://github.com/ofnote/tsalib (Python)
 > I made a thing to audit that you are consistent with your indices [1] as an alternative to explicit named-tensor objects. And found, but have not used, another package in a similar spirit [2].

 Cool, I will have to check this out!

 > Although to be honest, simply writing A[n,μ,ν,c] etc. (using different letters for different spaces) makes it pretty easy to visually check that you are getting this right. This is one of the attractions of the notation, even on paper. Unfortunately np.einsum's string notation makes this harder to see, as the indices aren't adjacent to the variable name.

 The problem is not so much the letter-space association but handling the ordering of the spaces inside the tensor. For example, in my code, to contract over two indices you could write prod<2>(a, b, "tlx,tr,p1,p2|tlx,p1,tl|tl,p2,tr"), where the result then has index order tl,p2,tr. The problem was that changing the index order in one place (e.g. for performance reasons) meant having to re-check every other place where it is used. If you want to contract a tensor network like (d) in [1], this quickly gets complicated. With named indices, the above becomes a * b, and if I change the index order in any place, it changes automatically everywhere else, too.
 Right, changing the order would be a pain. Although if the reason for doing so is memory layout (for speed), then a lazy permutedims(A) would decouple index order from this. Perhaps when creating a tensor for the first time there ought to be a way to specify the layout? I haven't thought much about it, but something like A[n^4, μ,ν, c^1] := .... would not be hard to do. In your prod<2>(a, b, ...) example, if p1 and p2 are in the same space, how would a*b know which one to contract? Or do they have different names from when a was created?
 > In your prod<2>(a, b, ...) example, if p1 and p2 are in the same space, how would a*b know which one to contract? Or do they have different names from when a was created?

 ITensor introduced this concept and I mostly just followed their lead: spaces have unique names, and tensor legs have a name label and a "prime level". So, for example, an operator O: A → A would have one leg labelled a[uuid]-prime0 and another leg a[uuid]-prime1, similar to how one might write that O: A → A has elements O_{a a'} on paper.
 See HarvardNLP's namedtensor, which is under active development! (Disclaimer: I'm a contributor.)
 Why is it that mathematicians are so comfortable with using ordering to encode semantics? Evidently, from this thread, many developers recognize the need for proper labels to simplify reasoning. Don't mathematicians have similar desires?
 Well, many different notations are convenient in different contexts. The compiler mathematicians target is a bit more flexible than developers', so they are free to just describe what they are doing on page 1. Human languages face the same trade-off: in some, word order is very important; others rely on it less, but compensate with some kind of case labels.
 I remember when I was learning matrix calculus and realized at some point that it was much simpler to convert everything to index notation, perform all operations, then convert everything back to standard notation at the end. It became almost comically simple, because you're "just" working with labeled scalars at that point. To be fair, it's convenient to memorize some of the more commonly used expressions (like ∂tr(AB)/∂B) rather than rederive them from scratch.
 Roger Penrose (the famous mathematician) has been saying the same thing for a couple of decades (going even further than that, check out Penrose diagrams).
 And Penrose diagrams (tensor networks) became a prominent example of string diagrams in the monoidal category of vector spaces over a field.
 I am interested: do you have a reference that shows this through examples? In my grad-school years, I was always dissatisfied with matrix calculus notation.
 Interesting, I almost always had the exact opposite experience. How do you calculate non-trivial matrix derivatives using index notation?
 You insert Kronecker deltas: ∂M_ab / ∂M_cd = δ_ac δ_bd, where a, b here remain contracted with whatever M was originally contracted with. Then you simplify: δ_ac Z_xya = Z_xyc, and so on.
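Worked through for the ∂tr(AB)/∂B example mentioned upthread, the delta trick looks like this (a sketch, indices all lowered for simplicity):

```latex
\operatorname{tr}(AB) = A_{ab} B_{ba},
\qquad
\frac{\partial}{\partial B_{cd}} \left( A_{ab} B_{ba} \right)
  = A_{ab}\, \delta_{bc}\, \delta_{ad}
  = A_{dc},
\quad\text{i.e.}\quad
\frac{\partial \operatorname{tr}(AB)}{\partial B} = A^{\mathsf{T}}.
```

Every step is just relabeling "labeled scalars", which is exactly why the method feels comically simple.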
 Right now I am working on tensor diagram notation for deep learning (for the project "thinking in tensors, writing in PyTorch"). To read more about it, see: https://medium.com/@pmigdal/in-the-topic-of-diagrams-i-did-w... (obviously, I refer to the post). And if you want to create some, here is a short demo: https://jsfiddle.net/stared/8huz5gy7/ In general, I want to expand that to tensor structure (e.g. n, channel, x, y) and, in addition, translate it to the Einstein summation convention.
 I've never really understood the point of Einstein notation, as a piece of mathematical notation. Is writing something like A[i, j] * B[j, k] really that much faster than writing something like Sum[j](A[i, j] * B[j, k])? Especially when you have to check the left hand side of the equality sign just to know which indices to sum over, it seems like making things less clear for a minuscule saving on ink.
 It's not just about the sum. The upper and lower indices are semantically crucial. E.g. vectors have upper indices and gradients have lower indices. It's nonsensical to add the two, so the core step in gradient descent doesn't make sense as written. To add them, you need the metric (Hessian) to raise the index of the gradient. And just like that, voila, you've discovered natural gradients!
 Gradients are vectors. You can certainly add them to other vectors. I think what you mean to say is that you cannot add vectors and covectors. The differential of a function is a 1-form; pointwise, it is a covector. This data is the natural starting point from which to compute gradients. Metrics can be used to convert covectors into vectors and forms into vector fields. The result is the thing you call the natural gradient. If you do this with the standard metric on R^n, then you get the vector field which you call the gradient.
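In symbols (a sketch; F stands for whatever metric one chooses, e.g. the Fisher information):

```latex
(\nabla f)^i = g^{ij}\, \partial_j f
\qquad\leadsto\qquad
\theta \leftarrow \theta - \eta\, F^{-1}\, \nabla_\theta L ,
```

with plain gradient descent recovered when g_{ij} is the standard (identity) metric on R^n.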
 Absolutely yes. Doing tensor magic without Einstein notation will make you shoot yourself, and even if you don't, someone else will if you publish it. The only real problem I have with it personally is the abstraction of upper and lower indices; I constantly forget the conventions as to which is which.
 This is interesting... I love the notation, but mainly because of how upper and lower indices make it easy to distinguish vectors and forms (not that it matters in ML, where everything is "euclidean").
 Upper and lower indices are great, but are still unrelated to the convention of dropping the big sigma at the front.
 > not that it matters in ML where everything is "euclidean"

 Non-euclidean spaces are actually quite common in ML, but many people don't realize the spaces they're working in are non-euclidean!
 Would be curious to hear what you have in mind here -- could you expand?
 A mnemonic: "co" -> "low", i.e. covariant indices go below; primes below, too.
 It's more succinct, and the thing is that in most physics, repeated indices really almost always are sums. When you're doing fluid mechanics or anything like that, it's tensors all the time. The terseness is nice once you get used to it.
 Yes, it's incredibly faster. You do a lot of tensor algebra in a relativity class.
 To the point where writing one large sigma and listing indices under it at the start of each line would significantly slow you down? I found I had to do this mentally during my own relativity class, just to figure out what each expression meant.
 > To the point where writing one large sigma and listing indices under it at the start of each line would significantly slow you down?

 Yes (in my relativity & QFT classes). But the time-saving aspect was not that important for me. The notation enabled the intuition of "zipping" together these somewhat unwieldy mathematical objects, and that was the clincher.
 Then you're doing it wrong. The point of notation (or learning any new language) is to operate in that language, not to translate back and forth.
 It's not actually a new notation, though; it's literally just deleting a part of the old notation and getting the reader to fill that part back in. While this is fine for calculations, I would rather that when equations are actually presented, that one extra piece of information be given explicitly. I guess a similar thing on the programming-language side is leaving out types when the compiler can infer them. This is great, and can make code more concise, but when reading an API I would really like the types written down explicitly, rather than having to play the part of the compiler.
 If types are omitted then there's actual missing information from the equation. If summations are omitted, there... isn't. They add nothing (at least in relativity where every sum is over the same 4 values).Anyway, index notation is more of a shorthand for symbolic tensor algebra computations than actual sums. The problem is that when dealing with 2, 3, and 4 index tensors, with co- and contra-variant indexes, it's quite cumbersome to notate which component of a tensor contracts with which component of another tensor, and anything you could come up with would end up looking like index notation. Rarely do you actually mean to sum anything.
 I don't have a problem with using indices for anything, they are fine. Distinguishing upper and lower indices is even better. The exact complaint I have is the implicit sums. For example, a matrix A would be written with one upper and one lower index, which I'll just write as A[i, j]. In einstein (implicit sum) notation, the trace of the matrix is written Tr A = A[i, i]. Is this really better than saying Tr A = Sum[i] A[i, i]?
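For what it's worth, both spellings from this comment map directly onto NumPy (names are mine); the implicit-sum version is the einsum string, the explicit one is the literal sum:

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)

# Implicit-sum spelling: a repeated index with no output index is summed.
tr = np.einsum('ii->', A)                    # Tr A = A[i, i]

# Explicit-sum spelling: Tr A = Sum[i] A[i, i]
tr_explicit = sum(A[i, i] for i in range(3))

assert tr == tr_explicit == np.trace(A)
```

So for a single term the two differ by exactly one summation sign, as the parent says; the savings only compound in multi-term expressions.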
 If you have just one term, then leaving out the summation sign saves little. But in larger expressions the rule applies sums per term, and then the savings in clutter can be quite large. For example, here v_i is not summed, and the term with z_b has two sums:

 A_ia x_a + v_i + x_a y_a ( w_i + B_ib z_b )

 You should not need to check the LHS; x_a y_a should always mean the dot product of these two. At least this is the convention among the heavy addicts; among lighter users you will sometimes find other setups.
 I still feel that the whole expression could be clarified quite easily by writing a \sum_{a, b} at the front?
 But that would sum everything, not just the parts which need it. You'd actually have to write this:

 \sum_a(A_ia x_a) + v_i + \sum_a(x_a y_a) ( w_i + \sum_b(B_ib z_b) )

 My example is actually still easier without indices (although you lose some information, like the fact that x and y are the vectors indexed by a, b). Here it's clear that Ax and x⋅y each involve a sum, which never includes v:

 Ax + v + (x⋅y)(w + Bz)
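For what it's worth, the per-term sums in this example map one-to-one onto np.einsum calls; a NumPy sketch (shapes and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((3, 4)), rng.random((3, 5))
x, y = rng.random(4), rng.random(4)
v, w = rng.random(3), rng.random(3)
z = rng.random(5)

# Vector form: Ax + v + (x.y)(w + Bz)
ref = A @ x + v + (x @ y) * (w + B @ z)

# Index form, with each per-term sum made explicit via einsum:
out = (np.einsum('ia,a->i', A, x) + v
       + np.einsum('a,a->', x, y) * (w + np.einsum('ib,b->i', B, z)))

assert np.allclose(out, ref)
```

Note how i survives in every term (it's free) while a and b are summed away within their own terms only, matching the rule described above.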
 As an analogy, why do you write A⋅B instead of Sum[i](A[i] * B[i])?
 This is a bit of a red herring - why would people using Einstein notation write A[i] B[i] instead of writing A⋅B?
 They wouldn't. My point was that A⋅B is also hiding a sum across all indices, just like Einstein notation. The sum never changes and is therefore visual/cognitive noise, once you've understood it the first time. The dot product is an abstraction that hides it and makes it implied.
 Awesome examples. However, einsum fails to express convolutions, as well as functions that are applied element-wise, such as the sigmoid and softmax. All three are crucial in deep learning.
 How do you einsum convolution? Arguably the single most important linear operation in deep learning?
 Convolution is a linear operator (with a certain special structure), so convolutions are represented just like any matrix-vector multiplication. Given the special structure of the convolution operator, it can be represented in a sparse manner, and what's more, the actual computation can also be implemented efficiently compared to a naive matrix-vector multiplication.

 TL;DR: It can be represented easily using Einstein notation. Einstein notation just does not capture the sparsity properties we like; it represents the transformation properties quite nicely.
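As a concrete (if naive) sketch of this point, assuming NumPy ≥ 1.20 for sliding_window_view: once the sliding windows are materialized, a 1-D "valid" convolution (cross-correlation convention) is a single einsum contraction. This deliberately ignores the efficiency caveat.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(8.0)
k = np.array([1.0, 0.0, -1.0])

# windows[n] = x[n : n + len(k)], shape (6, 3)
windows = sliding_window_view(x, k.size)

# Contract the kernel index: y[n] = sum_k windows[n, k] * k[k]
y = np.einsum('nk,k->n', windows, k)

# Agrees with np.convolve once the kernel flip is accounted for.
assert np.allclose(y, np.convolve(x, k[::-1], mode='valid'))
```

So the contraction itself is ordinary Einstein notation; what the notation does not express is the shift structure (that windows is a strided view of x rather than an arbitrary matrix).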
 The smart-aleck answer would be "with a Fourier transformation". But yeah, I think you have a point.
 Looks like this is a re-post of the same link from this somewhat recent post: https://news.ycombinator.com/item?id=16986759
