
Evaluation of Deep Learning Toolkits - marcelsalathe
https://github.com/zer0n/deepframeworks/
======
Smerity
Given how many deep learning toolkits have popped up and the complexities
involved, these evaluations are quite useful. Code comparisons would be even
better but I feel that's asking too much given how quickly things move :)

One important note: if you're looking to actually do standard tasks, use a
higher-level library. My favourite is fchollet's Keras[1], given that it
supports both Theano and TensorFlow as backends. It will likely support more
in the future (e.g. Neon is likely the next contender), giving you better
performance and helping you avoid toolkit-specific legacy code.
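
For a sense of what that buys you, here's a rough sketch of a tiny classifier
in Keras (toy data throughout, and argument names like `nb_epoch` are from the
Keras releases of that era, so treat the details loosely). The point is that
none of this code mentions Theano or TensorFlow; the backend is chosen in
Keras's configuration, not in the script.

```python
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# Toy data: 1000 samples, 20 features, 10 classes (purely illustrative).
X = np.random.rand(1000, 20)
labels = np.random.randint(0, 10, size=(1000,))
Y = np.zeros((1000, 10))
Y[np.arange(1000), labels] = 1  # one-hot targets

model = Sequential()
model.add(Dense(64, input_dim=20))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))

# Keras lowers this to whichever backend (Theano or TensorFlow) is
# configured; the model definition itself is backend-agnostic.
model.compile(loss='categorical_crossentropy', optimizer='sgd')
model.fit(X, Y, nb_epoch=5, batch_size=32)
```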

I'm confident many of the issues with TensorFlow will be cleared up sooner
rather than later, especially given it was only open sourced a month and a
half ago. As an example, bidirectional RNNs are trivial to implement yourself
(~5 lines of code), and TensorFlow 0.6 already has working code for them[2];
the API just isn't publicly listed.
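
To make the "~5 lines" claim concrete, the idea is just: run one RNN over the
sequence, run another over the reversed sequence, flip those outputs back into
time order, and concatenate per timestep. A framework-agnostic sketch (the
`rnn` and `cell` names below are placeholders for whatever single-direction
RNN primitive your framework provides, not TensorFlow's actual API):

```python
import numpy as np

def rnn(cell, inputs, init_state):
    """Single-direction RNN: `cell` maps (x_t, state) -> (output_t, new_state)."""
    state, outputs = init_state, []
    for x_t in inputs:                  # inputs: list of per-timestep vectors
        out, state = cell(x_t, state)
        outputs.append(out)
    return outputs

def bidirectional_rnn(fw_cell, bw_cell, inputs, init_state):
    """The ~5-line trick: forward pass, backward pass over the reversed input,
    un-reverse the backward outputs, concatenate per timestep."""
    fw_out = rnn(fw_cell, inputs, init_state)
    bw_out = rnn(bw_cell, list(reversed(inputs)), init_state)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fw_out, bw_out)]

# Toy usage: a "cell" that just accumulates its inputs into its state.
toy_cell = lambda x, s: (x + s, x + s)
seq = [np.ones(3) * t for t in range(4)]
print(bidirectional_rnn(toy_cell, toy_cell, seq, np.zeros(3)))
```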

Most importantly for this evaluation, TensorFlow's single-device performance
is already at cuDNNv2 Torch levels as of the 0.6 update[3]. Neither soumith
nor this evaluation has updated the numbers, but Google replicated the
benchmarks and presented them at NIPS. They're working on adding cuDNNv3
support, which should be another speed jump.

[1]: [http://keras.io/](http://keras.io/)

[2]:
[https://github.com/tensorflow/tensorflow/blob/master/tensorf...](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn.py#L227)

[3]:
[https://twitter.com/deliprao/status/673888452736245760](https://twitter.com/deliprao/status/673888452736245760)

~~~
zer0nes
I'm a fan of Keras too. Including comments on Keras and other higher-level
libraries would have made my original review too long, so I dropped them.

Note that the performance review is very much incomplete. As mentioned in the
blog: Deep Learning is not just about feed-forward convnets, not just about
ImageNet, and certainly not just about a few passes over the network. However,
Soumith's benchmark is the only notable one as of today, so we base the
single-GPU performance rating on it.

For TF, I think a bigger single-node perf issue is memory allocation. At NIPS,
Jeff Dean didn't have a straightforward answer for why TF's memory performance
is so poor.

------
gcr
In my opinion, the author doesn't try very hard to be constructively critical
of all the systems.

Example: Caffe is given 3/5 in the "Architecture" section because its
layerwise design means you have to implement the forward/backward passes
yourself for both CPU and GPU. Torch uses exactly the same design and inherits
all of its downsides, yet is given 5/5 with no justification.
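
To be concrete about what "layerwise design" means here, every new layer boils
down to something like the sketch below (a hypothetical Python illustration,
not Caffe's or Torch's actual classes), and then the same logic again for the
GPU path, which is the duplication being criticised.

```python
import numpy as np

class ReLULayer:
    """Hypothetical layer: each layer supplies its own forward and backward."""

    def forward_cpu(self, x):
        self.mask = x > 0                # cache activations for backward
        return x * self.mask

    def backward_cpu(self, grad_out):
        return grad_out * self.mask      # gradient w.r.t. the layer input

    # forward_gpu / backward_gpu would repeat the same logic as CUDA kernels.

layer = ReLULayer()
x = np.array([-1.0, 2.0, -3.0, 4.0])
y = layer.forward_cpu(x)
dx = layer.backward_cpu(np.ones_like(x))
```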

I would also strongly mark Torch down because Lua is difficult for users who
are familiar with Python/C/C++. At least, it's difficult for me. I'm still
routinely bitten by off-by-one bugs due to Lua's lack of 0-indexed arrays.
That's hardly the kind of thing you should have to worry about in a scientific
computing framework.

This article is a good start, but you should familiarize yourself with each
framework and ignore the author's subjective ratings.

------
emcq
I would love to hear what people think of mxnet. I haven't had time to do
anything real with it, but from a high-level view it seems super fast and
flexible.

------
mjw
On the whole this is useful, although I think it's a little unfair to Theano
in places.

* Performance

I feel they should score separately here for compilation/startup time vs
runtime. Theano's compilation step can be slow the first time around. (In my
personal experience it's not enough to add significant friction at development
time, but YMMV -- I hear it can struggle with some more complex architectures
like deep stacked RNNs.)

Its compilation process gives some unique advantages though -- for example it
can generate and compile custom kernels for fused elementwise operations,
which can give speed advantages at runtime that aren't achievable via a simple
stacking of layers with pre-canned kernels. Some of its graph optimisations
are pretty useful too. In short, smarter compilation can save you from having
to implement your own kernels to achieve good performance on non-standard
architectures. If you're doing research, that can matter.
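
As a toy illustration of what I mean: in the snippet below, the chain of
elementwise ops gets fused by Theano's optimizer into a single kernel/loop at
`theano.function` time, rather than one launch per op -- you only write the
symbolic expression.

```python
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.matrix('y')

# A chain of elementwise ops: naively this would be several separate
# kernels, but Theano's graph optimizer fuses them into one.
z = T.tanh(x) * T.nnet.sigmoid(y) + T.exp(-x * y)

f = theano.function([x, y], z)   # compilation (and fusion) happens here

a = np.random.rand(256, 256).astype(theano.config.floatX)
b = np.random.rand(256, 256).astype(theano.config.floatX)
out = f(a, b)
```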

* Architecture

The architecture of Theano's main public API is clean and elegant IMO, which
is what matters most.

When it comes to extensibility, firstly you don't need to go implement custom
Ops very often, certainly not as often as you might implement a custom Layer
in Torch. That's because Theano ships with lots of fundamental tensor
operations that you can compose, _and_ a compiler that can optimise the
resulting graph well.

About the idea that it's hacky that "the whole code base is Python where
C/CUDA code is packaged as Python string": if you want to generate new CUDA
kernels programmatically then you're going to want to use some high-level
language to do it. As stated, Theano gets some unique advantages from being
able to do this. At some conceptual cost I'm sure it'd be possible to handle
this code generation in a slightly cleaner way, but I don't really see anyone
else in this area doing it significantly better, so given the constraints I
think it's a bit subjective and slightly unfair to call it "hacky".

I also think it's something that matters more for framework developers than
users. In my experience, in the relatively rare situations where you do need
to implement a custom Op, it's usually as a performance optimisation, and you
can get away with something relatively simple and problem-specific:
essentially a thin Python wrapper around some fixed kernel code.
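
For what that typically looks like, here's a minimal sketch of such a thin
wrapper Op, with the fixed "kernel" stubbed out as a NumPy call and `grad`/
`infer_shape` omitted for brevity (so treat it as the shape of the thing, not
a performance-relevant example).

```python
import numpy as np
import theano
import theano.tensor as T

class ScaleOp(theano.Op):
    """Toy custom Op: multiplies its input by a fixed constant.
    `perform` is a plain Python/NumPy call standing in for whatever
    fixed kernel or library routine you actually want to wrap."""
    __props__ = ('scale',)

    def __init__(self, scale):
        self.scale = scale

    def make_node(self, x):
        x = T.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        (x,) = inputs
        output_storage[0][0] = np.asarray(x * self.scale, dtype=x.dtype)

x = T.vector('x')
f = theano.function([x], ScaleOp(3.0)(x))
print(f(np.arange(4, dtype=theano.config.floatX)))
```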

The CGT project (which seems to be aiming for a better Theano) has some valid
and more detailed criticism of the architecture of the compiler, which I think
is fairer:
[http://rll.berkeley.edu/cgt/#whynottheano](http://rll.berkeley.edu/cgt/#whynottheano)

I'm also hoping that in due course TensorFlow will come closer to parity with
some of Theano's compiler smarts, at which point I'll be eager to switch, as
TensorFlow has some other advantages, multi-GPU support for one.

------
elliott34
Anyone using dl4j?

[http://deeplearning4j.org/](http://deeplearning4j.org/)

------
therobot24
These are great, I hope they'll update as the frameworks do.

------
blazespin
Wow, what a Torch puff piece. It's good, but who uses Lua...

~~~
gjm11
The comparison rates TensorFlow higher than Torch on two criteria and lower
than Torch on two, and explicitly agrees with you about Lua ("However, let's
face it, Lua is not yet a mainstream language.").

What, in your view, should it have looked like in order not to be a "Torch
puff piece"?

[EDITED to fix a trivial typo.]

