
Why GEMM is at the heart of deep learning - shepardrtc
http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
======
bmh100
Fortran can be a pain to write, but it can lead to very efficient programs,
especially when combined with advanced math kernels like ATLAS or MKL.
Of course, CUDA can lead to a 10x-40x boost in GEMM performance for certain
algorithms as well. If you have a very matrix-heavy hot loop, or could
possibly represent an algorithm as a matrix operation [1], it might be worth
your time to write some modern Fortran [2].

[1]: http://kukuruku.co/hub/algorithms/using-the-quick-raise-of-matrices-to-power-to-write-a-very-fast-interpreter-of-a-simple-programming-language?ModPagespeed=noscript

[2]: http://fortranwiki.org/fortran/show/Fortran+2008
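
For concreteness, here is a minimal sketch (illustrative, not from the comment) of handing a hot loop to a tuned GEMM through the CBLAS interface that OpenBLAS, MKL, and ATLAS all export; the sizes and row-major layout are arbitrary choices:

    /* C = alpha*A*B + beta*C via any CBLAS implementation
       (OpenBLAS, MKL, ATLAS). Compile with e.g. -lopenblas. */
    #include <cblas.h>

    void matmul(int m, int n, int k,
                const double *A,  /* m x k, row-major */
                const double *B,  /* k x n, row-major */
                double *C)        /* m x n, row-major */
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0, A, k,    /* alpha, A, leading dimension of A */
                    B, n,         /* B, leading dimension of B */
                    0.0, C, n);   /* beta, C, leading dimension of C */
    }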

~~~
jedbrown
1. ATLAS hasn't had great performance in ages because it explores the wrong
parameter space. OpenBLAS or MKL are much better choices and don't need
autotuning.

2. You can call these libraries from any language, so Fortran offers no
advantage there.

3. Fortran performance is similar to C and C++ (assuming use of restrict
where needed; see the sketch after this list).

4. CUDA will _not_ lead to a 10-40x boost unless your CPU implementation is
garbage. The realistic standard is about 2x when normalized by watts; see the
Green500, for example. The 2x also holds for price when using server-grade
parts (like Tesla), but GPUs come out further ahead if you buy consumer-grade.
Note that you still have to pay to buy and power a host, and that GPUs are a
lot less versatile in terms of problem size/turn-around time and workload.
They have their place, but evaluate your workload carefully before concluding
that the cost proposition favors GPUs.
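
To illustrate point 3 with a minimal sketch (illustrative, not jedbrown's code): restrict promises the compiler that pointers don't alias, which is what Fortran assumes for its dummy arguments by default, so both languages can vectorize the loop the same way.

    /* Without restrict, the compiler must assume out might alias a or b
       and reload them on every iteration; with restrict it can vectorize
       freely, matching Fortran's default no-aliasing assumption. */
    void axpy(int n, double alpha,
              const double *restrict a,
              const double *restrict b,
              double *restrict out)
    {
        for (int i = 0; i < n; i++)
            out[i] = alpha * a[i] + b[i];
    }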

~~~
bmh100
1. I certainly did not mean to imply that ATLAS or MKL were the best CPU
linear algebra libraries out there. Thanks for pointing out OpenBLAS and
offering that comparison.

2. If I were to run a highly iterative algorithm, say with 100,000 or more
iterations, would the memory marshaling not be significant? Or would it be
unnoticeable since I would be using these libraries anyway? I also personally
find the array notation and slicing to be quite nice.

3. The use of "restrict" is nice. I was not aware of that feature. Looking at
this reference [1], it seems that Fortran basically does the equivalent of
automatically using "restrict".

4. According to NVIDIA [2], cuBLAS outperforms MKL by 6x-17x (it seems MKL
has been catching up since I last checked). Is this comparison misleading on
NVIDIA's part? What is your expectation for a reasonable speedup? The
fixed-cost efficiency and variable-cost efficiency are good points to bring
into the discussion. Your point about carefully evaluating workloads is also
a good one. I don't want people to think that throwing an algorithm into CUDA
will magically speed up the overall program. Network I/O, disk I/O, and
device (GPU) I/O are all important bottlenecks to consider.

[1]: http://www-cs-students.stanford.edu/~blynn/c/ch05.html

[2]: https://developer.nvidia.com/cuBLAS

~~~
jedbrown
2. Most languages have a way to store raw arrays that can be passed directly
to the numerical routines. If you have to marshal, then it depends on the
problem size and amount of work done in the numerical routine.

3. Yes.

4. NVIDIA has a track record of misleading comparisons. They cleaned up their
act in these CUDA 7 benchmarks:
http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/
in response to this G+ discussion:
https://plus.google.com/+JeffHammondScience/posts/G1MzHqZaxyz
Thanks to Szilárd Páll for calling attention to the discussion and to Mark
Harris for updating the plots. Note that this should not be interpreted as
"MKL/Xeon is catching up" but rather that performance comparisons are
sensitive to the details of the experiment, and it can be hard to recognize
the consequences of biased configurations. Better normalization and standards
for comparison can help.

------
im3w1l
You should use FFTs for the convolution; see
http://arxiv.org/abs/1312.5851

Also, while the original BLAS was in Fortran, you probably want OpenBLAS,
which is in C. Or maybe Intel's MKL.
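
For the idea itself, a minimal single-channel sketch using FFTW (the function and sizes are illustrative, not from the paper): transform both signals, multiply pointwise in the frequency domain, transform back. Note this computes a circular convolution; zero-pad both inputs to length at least n_x + n_w - 1 to get linear convolution.

    /* Circular convolution via the convolution theorem:
       conv(x, w) = IFFT(FFT(x) .* FFT(w)). Link with -lfftw3 -lm. */
    #include <fftw3.h>
    #include <string.h>

    void fft_conv(const double *x, const double *w, double *out, int n) {
        int nc = n / 2 + 1;  /* complex bins for a real input of length n */
        fftw_complex *X = fftw_malloc(sizeof(fftw_complex) * nc);
        fftw_complex *W = fftw_malloc(sizeof(fftw_complex) * nc);
        double *tx = fftw_malloc(sizeof(double) * n);
        double *tw = fftw_malloc(sizeof(double) * n);
        memcpy(tx, x, sizeof(double) * n);
        memcpy(tw, w, sizeof(double) * n);

        fftw_plan px = fftw_plan_dft_r2c_1d(n, tx, X, FFTW_ESTIMATE);
        fftw_plan pw = fftw_plan_dft_r2c_1d(n, tw, W, FFTW_ESTIMATE);
        fftw_execute(px);
        fftw_execute(pw);

        for (int i = 0; i < nc; i++) {  /* pointwise complex multiply */
            double re = X[i][0] * W[i][0] - X[i][1] * W[i][1];
            double im = X[i][0] * W[i][1] + X[i][1] * W[i][0];
            X[i][0] = re;
            X[i][1] = im;
        }

        fftw_plan pi = fftw_plan_dft_c2r_1d(n, X, out, FFTW_ESTIMATE);
        fftw_execute(pi);
        for (int i = 0; i < n; i++)
            out[i] /= n;  /* FFTW's inverse transform is unnormalized */

        fftw_destroy_plan(px); fftw_destroy_plan(pw); fftw_destroy_plan(pi);
        fftw_free(X); fftw_free(W); fftw_free(tx); fftw_free(tw);
    }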

~~~
ajtulloch
FWIW, FFT-based convolutional methods aren't better for a lot of common sizes
(including a bunch of AlexNet/VGG/GoogLeNet sizes).
[http://arxiv.org/abs/1412.7580](http://arxiv.org/abs/1412.7580) has a pretty
thorough evaluation of im2col + gemm vs fused gemm (cudnn) vs fft.
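
For reference, the im2col half of that comparison is simple to sketch (single channel, stride 1, no padding; an illustration, not the paper's code): unroll every k x k patch into a column, so the whole convolution collapses into one GEMM against the filter matrix.

    /* Unroll k x k patches of an H x W image into a (k*k) x (out_h*out_w)
       matrix. A filter bank stored as an F x (k*k) matrix times this
       result is then a single GEMM producing F output feature maps. */
    void im2col(const float *img, int H, int W, int k, float *cols) {
        int out_h = H - k + 1, out_w = W - k + 1;
        for (int y = 0; y < out_h; y++)
            for (int x = 0; x < out_w; x++)
                for (int i = 0; i < k; i++)
                    for (int j = 0; j < k; j++)
                        /* row = patch element, column = output pixel */
                        cols[(i * k + j) * (out_h * out_w) + (y * out_w + x)]
                            = img[(y + i) * W + (x + j)];
    }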

~~~
im3w1l
Thanks for the article tip; I hadn't seen it. But I don't agree with your
conclusion. In figures 1-6, you should look at the last column, because the
other channel counts are quite low. Knowing that, the space-domain algorithm
only really has a chance at 3x3 filters, which is quite small.

And since they claim fbfft was even faster than cuFFT, the situation should
look even better for FFT.

EDIT: And they are about tied in the 3x3, 64 case.

~~~
gcr
The current trend in convolutional neural networks seems to be moving toward
_more_ convolutions with _smaller_ kernels. This year's second-place ILSVRC
winner (VGG) is essentially a giant stack of 19 convolution layers with super
tiny kernels, sandwiched between nonlinearities and pooling. "To reduce the
number of parameters in such very deep networks, we use very small 3×3 filters
in all convolutional layers"
[http://arxiv.org/pdf/1409.1556.pdf](http://arxiv.org/pdf/1409.1556.pdf)
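
The parameter arithmetic behind that choice, worked out for C channels in and out (a back-of-envelope sketch of the paper's argument): two stacked 3x3 layers cover the same 5x5 receptive field as a single 5x5 layer, with fewer weights and an extra nonlinearity in between.

    \underbrace{2 \cdot (3^2 C^2)}_{\text{two stacked } 3\times3 \text{ layers}} = 18C^2
    \quad\text{vs.}\quad
    \underbrace{5^2 C^2}_{\text{one } 5\times5 \text{ layer}} = 25C^2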

------
gcr
Oops! The "A x B = C" figure is exactly backwards. A matrix with shape (r, k)
times a matrix with shape (k, c) should return a matrix with shape (r, c).
The figure incorrectly claims that a matrix with shape (k, m) times a matrix
with shape (n, k) returns a matrix with shape (n, m), which is only true if
all of these numbers are equal (e.g. square matrices).

Kind of an embarrassing typo for an article that is essentially about matrix
multiplication.

Here is a corrected (transposed) version of that figure:
[http://i.imgur.com/Bo6CJCv.png](http://i.imgur.com/Bo6CJCv.png)

Edit: It looks like every figure in the article suffers from this problem. I
know FORTRAN matrices store each column contiguously in memory, but whether a
matrix is accessed like *(ptr + r * ncols + c) or *(ptr + c * nrows + r) is
(in my opinion) a low-level detail and shouldn't affect our thinking about how
multiplication works.
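
A tiny self-contained illustration of the two conventions (a sketch): the same 2 x 3 matrix stored both ways, with entry (r, c) read via each formula.

    #include <stdio.h>

    int main(void) {
        enum { m = 2, n = 3 };
        /* Row-major (C order): each row is contiguous. */
        double row_major[m * n] = {1, 2, 3,
                                   4, 5, 6};
        /* Column-major (FORTRAN/BLAS order): each column is contiguous. */
        double col_major[m * n] = {1, 4,
                                   2, 5,
                                   3, 6};
        int r = 1, c = 2;
        /* Both read entry (1, 2), i.e. 6, despite different layouts. */
        printf("%g %g\n", row_major[r * n + c], col_major[c * m + r]);
        return 0;
    }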

~~~
simonster
Either I'm missing something here or the article is correct. Generally, the
first dimension of a matrix corresponds to the rows and the second dimension
to the columns. So in the original figure, A is m x k, B is k x n, and C is m
x n, and this seems right. See
[http://en.wikipedia.org/wiki/Matrix_%28mathematics%29#Size](http://en.wikipedia.org/wiki/Matrix_%28mathematics%29#Size)

~~~
dplarson
I think the author revised the figure(s) between the time of the parent
comment (by gcr) and your comment. At least, the A * B = C figure's filename
seems to imply a revision [1].

EDIT: yep, the figures were revised. Compare the corrected version [1] vs the
original [2].

[1] https://petewarden.files.wordpress.com/2015/04/gemm_corrected.png

[2] https://petewarden.files.wordpress.com/2015/04/gemm.png

EDIT 2: I completely missed that the author put a notice (about having revised
the figures) at the bottom of the post.

~~~
petewarden
Yes, that was my screwup, sorry for any confusion! Despite multiple linear
algebra courses and a decade working in 3D graphics, I haven't internalized
the basics of matrix notation. It doesn't help that my usual sandbox (Eigen)
is column-major by default, which is the wrong in-memory order for my raster-
image-trained brain to visualize. Funnily enough, I find tensor notation a
bit easier, despite being less familiar with it.

~~~
dplarson
No worries. Also, I found the article very interesting and informative.
Thanks!

------
ntenenz
Being computationally limited by GEMM is both a benefit and a pitfall. The
operation itself is highly parallelizable and typically compute-bound (as
opposed to memory-bound). As a result, advances in GPU performance will
translate into improvements in GEMM. However, it also means that
algorithmically you're limited by the hardware. GEMM is an _extremely_ mature
operation, so algorithmic improvements are unlikely. Instead, any algorithmic
improvements would likely stem from replacing GEMM with a more tailored,
specialized operation.
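
One way to make "compute-bound" concrete (a back-of-envelope sketch, not from the comment): GEMM's flop count grows cubically while the data it touches grows quadratically, so arithmetic intensity rises with problem size until the kernel is limited by arithmetic rather than memory bandwidth.

    I_{\mathrm{GEMM}} = \frac{2mnk}{mk + kn + mn}
    \;\xrightarrow{\;m = n = k\;}\;
    \frac{2n^3}{3n^2} = \frac{2n}{3} \text{ flops per matrix element moved}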

~~~
gcr
Different parts of the computation are bound in different ways. The rule of
thumb for AlexNet-derived networks (with conv layers followed by fully
connected layers) is that 95% of the computation time is spent in the conv
layers, but 95% of the parameters are stored as the weights of the FC layers.
I imagine the latter stage could be memory-bound while the former is compute-
bound.

Not sure if this holds true for other NN structures like GoogLeNet or the
weirder recent fully convolutional networks.
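
That rule of thumb follows from weight reuse (a back-of-envelope sketch, not from the comment): each convolution weight is applied at every output location, while each fully connected weight is used exactly once per input.

    \text{conv: } \frac{\text{flops}}{\text{params}}
      \approx \frac{2\, H_{\text{out}} W_{\text{out}}\, k^2 C_{\text{in}} C_{\text{out}}}
                   {k^2 C_{\text{in}} C_{\text{out}}}
      = 2\, H_{\text{out}} W_{\text{out}},
    \qquad
    \text{FC: } \frac{2\, N_{\text{in}} N_{\text{out}}}{N_{\text{in}} N_{\text{out}}} = 2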

------
bra-ket
Great writeup!

In general, to make deep learning faster, avoid using backpropagation.

~~~
iskander
What else do you use to train the network?

~~~
bra-ket
random matrices + svd:
[http://www.ntu.edu.sg/home/egbhuang/](http://www.ntu.edu.sg/home/egbhuang/)
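
For context, that link is Guang-Bin Huang's extreme learning machine work. The gist, as a hedged sketch (illustrative code, not from the site): fix a random hidden layer and solve only for the output weights by least squares. LAPACK's dgels does this with a QR factorization; dgelss is the SVD-based variant the comment alludes to.

    /* ELM-style fit: hidden weights W stay random; only the output
       weights beta are solved for. Assumes n >= h. Link with -llapacke. */
    #include <lapacke.h>
    #include <math.h>
    #include <stdlib.h>

    /* X: n x d inputs, W: d x h random, Y: n x c targets (overwritten),
       beta: h x c output weights. Returns LAPACK's info code. */
    int elm_fit(int n, int d, int h, int c,
                const double *X, const double *W, double *Y, double *beta)
    {
        double *H = malloc(sizeof(double) * n * h);
        for (int i = 0; i < n; i++)        /* H = tanh(X * W), written */
            for (int j = 0; j < h; j++) {  /* naively; this too is a GEMM */
                double s = 0;
                for (int k = 0; k < d; k++)
                    s += X[i * d + k] * W[k * h + j];
                H[i * h + j] = tanh(s);
            }
        /* Solve min over beta of ||H*beta - Y||_2; the solution lands
           in the first h rows of Y. */
        int info = LAPACKE_dgels(LAPACK_ROW_MAJOR, 'N', n, h, c, H, h, Y, c);
        for (int i = 0; i < h; i++)
            for (int j = 0; j < c; j++)
                beta[i * c + j] = Y[i * c + j];
        free(H);
        return info;
    }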

~~~
gcr
I would be very interested to see if this approach outperforms more
conventional deep networks trained with backpropagation.

It's important to note that (from what I understand) this paper seems to focus
on building _single-layer_ NNs, which are quite different from the behemoth
AlexNet discussed in the article.

~~~
Fede_V
Very early on, people experimented heavily with non-gradient-based
optimization (genetic algorithms and the like) to optimize neural networks,
but the results were rather poor.

For very large parameter spaces, gradient-based optimization is really the
only game in town. There are also lots of clever tricks (gradient clipping,
momentum, AdaGrad, etc.) involved in improving convergence.
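
For instance, one of those tricks in a minimal sketch (coefficients are illustrative): classical momentum accumulates a velocity across iterations so the update keeps moving through flat or noisy regions of the loss.

    /* One SGD-with-momentum step over n parameters. Typical values
       might be lr = 0.01, mu = 0.9 (illustrative, not prescriptive). */
    void sgd_momentum_step(double *w, double *v, const double *grad,
                           int n, double lr, double mu)
    {
        for (int i = 0; i < n; i++) {
            v[i] = mu * v[i] - lr * grad[i];  /* decay old velocity,    */
            w[i] += v[i];                     /* add new gradient, step */
        }
    }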

