
Benchmarking State-Of-the-Art Deep Learning Software Tools - cerisier
http://arxiv.org/abs/1608.07249
======
dgax
It's not surprising that TF is the slowest in many cases. It has been widely,
sometimes harshly, criticized in the past for exactly that reason. On the other
hand, despite its lack of speed, TF appears to be the only tool that doesn't
have to sit out any of the tests due to incompatibilities or missing features.

Other tools like MXNet deserve a shoutout as well, and it would be interesting
to see how a wider group compares. MXNet also integrates seamlessly into R,
something of a rarity in deep learning tools (excepting the also excellent h2o
package).

~~~
taliesinb
Yes, unfortunate that MXNet wasn't covered. It's in the happy Venn place of
(fully cross-platform) ∩ (easy to embed) ∩ (flexible) ∩ (hackable).

* Cross platform: Windows, MacOS, Linux; CPU and CUDA. Though their CMake needs work.

* Easy to embed: straightforward C FFI, JSON for metadata and parameter serialization, no weird runtime.

* Flexible: not too specialized to vision. Static unrolling of RNNs is possible now (with mirroring this can still be very memory efficient [0]), and there is basic support for the fast new cuDNN 5 RNN layers [1] (contributed by a colleague of mine). Dynamic unrolling is on the horizon, I hear.

* Hackable: once you're familiar with the codebase, custom elementwise unary or binary ops = a few minutes, custom layers = 1+ hours (depending on complexity); a quick sketch follows after the references. And if you can leverage mshadow primitives for your layer implementation, you don't even have to touch CUDA. Also fairly active on GitHub, responsive to PRs, etc.

[0]
[https://arxiv.org/pdf/1606.03401.pdf](https://arxiv.org/pdf/1606.03401.pdf)

[1] [https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/](https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/)
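
To illustrate the "custom elementwise ops in a few minutes" point above, here is a minimal sketch in the style of MXNet's mshadow_op.h. It assumes the mshadow headers bundled with MXNet; the struct-with-a-static-Map pattern and MSHADOW_XINLINE come from mshadow, but the op name and the way the op gets wired into an operator vary by MXNet version, so treat this as illustrative rather than the exact upstream API.

    // Sketch of a custom elementwise unary op in the mshadow_op style.
    // Requires the mshadow headers that ship with MXNet.
    #include <mshadow/tensor.h>

    // MSHADOW_XINLINE marks Map() as inlinable on both host and device,
    // so the same definition is compiled into CPU loops and CUDA kernels
    // without any hand-written device code.
    struct softsign {
      template<typename DType>
      MSHADOW_XINLINE static DType Map(DType x) {
        // x / (1 + |x|)
        return x / (DType(1) + (x < DType(0) ? -x : x));
      }
    };

    // Inside an operator's Forward(), the op would be applied through
    // mshadow's expression templates, roughly:
    //
    //   Assign(out, req[0], F<softsign>(data));
    //
    // The expression is fused into a single elementwise kernel, which is
    // why simple custom ops take minutes rather than hours.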

------
gcr
When properly configured, most of these libraries use NVIDIA's cuDNN library
under the hood. The only thing you're really measuring here is framework
overhead, not the actual computation.

------
mbeissinger
No Theano comparison?

------
dave168
CNTK is great at scaling out beyond a single machine. The paper didn't
benchmark that; it only tested single-machine performance.

~~~
Eridrus
Realistically, most people barely get to multiple GPUs, let alone multiple
machines. You're more likely to do hyperparameter tuning across machines
before you do distributed training.

------
breezest
I wonder why Torch is so slow. But the authors did not provide the
configuration of each tool, either in the paper or on the web.

~~~
geezerjay
> I wonder why Torch is so slow.

What do you mean "so slow"? It's by far the fastest framework covered by the
paper in scenarios where threads don't outnumber CPU cores.

Taken from the article itself:

"However, Torch still achieves the best performance in our experiments in
which Torch has nearly 12x speed up compared with TensorFlow under 4-thread
setting."

~~~
breezest
Why can't Torch utilize more threads than there are CPU cores? Taken from the
article itself: "both of them cannot run normally when threads usage is set to
be bigger than the number of CPU cores on desktop CPU." Did the authors set up
the system correctly?

You're right that Torch is faster than TensorFlow on the RNN tests. But Torch is
slower than TensorFlow on AlexNet and ResNet. There is a set of benchmarks for
many DL frameworks at
[https://github.com/soumith/convnet-benchmarks](https://github.com/soumith/convnet-benchmarks)

~~~
T-A
Context-switching is expensive. You have to swap out the data being worked on
by thread #1 and swap in the data for thread #2. So you end up being
bottlenecked by memory bandwidth and latency rather than by raw compute.
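
A rough way to see the effect (a self-contained sketch, not something from the paper): time the same memory-bound reduction with one thread per hardware core and then with 4x as many threads. On most machines the oversubscribed run is no faster and often slower, because the extra threads only add scheduling and cache pressure to a workload already limited by memory bandwidth.

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Time a memory-bound sum over `data`, split across `nthreads` threads.
    double time_sum(const std::vector<float>& data, unsigned nthreads) {
        std::vector<std::thread> workers;
        std::vector<double> partial(nthreads, 0.0);
        const std::size_t chunk = data.size() / nthreads;

        auto start = std::chrono::steady_clock::now();
        for (unsigned t = 0; t < nthreads; ++t) {
            workers.emplace_back([&, t] {
                const std::size_t lo = t * chunk;
                const std::size_t hi = (t + 1 == nthreads) ? data.size() : lo + chunk;
                double s = 0.0;
                for (std::size_t i = lo; i < hi; ++i) s += data[i];
                partial[t] = s;  // keep the result so the loop isn't optimized away
            });
        }
        for (auto& w : workers) w.join();
        auto end = std::chrono::steady_clock::now();

        double total = 0.0;
        for (double s : partial) total += s;
        volatile double sink = total;  // prevent dead-code elimination
        (void)sink;
        return std::chrono::duration<double, std::milli>(end - start).count();
    }

    int main() {
        const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        std::vector<float> data(1u << 26, 1.0f);  // ~256 MB of floats: memory-bound

        std::cout << cores     << " threads: " << time_sum(data, cores)     << " ms\n";
        std::cout << 4 * cores << " threads: " << time_sum(data, 4 * cores) << " ms\n";
    }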

