
TensorFlow Benchmarks - sherjilozair
https://github.com/soumith/convnet-benchmarks/issues/66
======
emcq
Is this surprising? I think it's expected, given that most published numbers from
Google suggest this is only 2x over DistBelief, which Adam Coates improved upon
6x
([http://www.cs.stanford.edu/~acoates/papers/CoatesHuvalWangWu...](http://www.cs.stanford.edu/~acoates/papers/CoatesHuvalWangWuNgCatanzaro_icml2013.pdf))
in 2013. Coming from an HPC background, I've always felt that DistBelief (and
sometimes MapReduce in general) is a square being forced into a round hole. Sure,
it "scales", but it's not how you would build a system from scratch to meet the
same computational needs. Moving data around is much slower than doing extra
math.

DNNs provide some interesting opportunities for domain-specific optimization
with lossy math, because empirically the solutions don't seem to require much
precision. A general-purpose tensor library holds you back from tricks like
synchronization-free Hogwild updates, since those wouldn't be appropriate in
other scenarios where more exact computation is necessary.
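
For anyone unfamiliar with Hogwild: it's SGD where several workers update a
shared weight vector with no locking at all, on the bet that the occasional
clobbered write is lost in the noise of SGD anyway. A toy sketch (names, sizes
and the threading setup are purely illustrative; real Hogwild runs across cores
on shared memory, and CPython's GIL limits the parallelism you'd actually see
here):

    import threading
    import numpy as np

    # Toy least-squares problem; sizes are arbitrary.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 50))
    true_w = rng.normal(size=50)
    y = X @ true_w

    w = np.zeros(50)          # shared parameter vector -- deliberately no lock
    lr, batch = 0.01, 32

    def worker(steps):
        local = np.random.default_rng()
        for _ in range(steps):
            idx = local.integers(0, len(X), size=batch)
            xb, yb = X[idx], y[idx]
            grad = xb.T @ (xb @ w - yb) / batch
            # Racy, unsynchronized update: this is the whole trick.
            w[:] -= lr * grad

    threads = [threading.Thread(target=worker, args=(2000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("distance from true weights:", np.linalg.norm(w - true_w))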

One interesting step in this direction is their implementation of lossy
compression with a custom 16-bit half-float representation. However, a lot of
people are dialing the precision down even further, including one of the
authors of TensorFlow ([http://petewarden.com/2015/05/23/why-are-eight-bits-
enough-f...](http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-
deep-neural-networks/)).
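
To make the precision trade concrete, here's a rough numpy sketch of what
dropping weights to 16-bit and to a linear 8-bit code looks like; this is a
generic scheme for illustration, not TensorFlow's actual codec or Warden's
exact method:

    import numpy as np

    w = np.random.randn(1000000).astype(np.float32)   # stand-in for a weight tensor

    # 16-bit: just store IEEE half precision -- half the bytes, ~3 decimal digits.
    w16 = w.astype(np.float16)

    # 8-bit: linear quantization over the observed range.
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 255.0
    q = np.round((w - lo) / scale).astype(np.uint8)    # 1 byte per value on the wire
    w8 = q.astype(np.float32) * scale + lo             # dequantize for the actual math

    print("float16 max abs error:", np.abs(w16.astype(np.float32) - w).max())
    print("8-bit   max abs error:", np.abs(w8 - w).max())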

Scott Gray, one of the principal engineers at Nervana Systems, has a very wise
reply in the GitHub thread: they aren't doing much to optimize for memory
locality, and won't see amazing speedups until that happens.

------
mark_l_watson
From Google's perspective it is probably more about how TensorFlow scales out
horizontally. If a researcher fires off a Borg run (or whatever they use now)
and the job takes a few thousand CPUs, no problem, at least for research.

They must have better optimizations for running in production, such as in-place
operations.

~~~
dave_sullivan
I'm going to say something unpopular, but horizontally-scaled deep learning is
overkill for most applications.

Can anyone here present a use case where they have personally needed
horizontal scaling because a Titan X couldn't fit what they were trying to do?
It's the thing I hear talked about the most and used the least.

The biggest misunderstanding I've heard is "I have petabytes of data so I need
multi-GPU", but NNs are trained on mini-batches of data, so dataset size doesn't
matter. It's the model (plus a batch of activations) that has to fit, and 12 GB
can fit a big model.

The whole "who can build the biggest net" is like building the fastest car:
it's for show, the slower ones are way more cost effective and get you there
just as well.

Keras is my present deep learning library recommendation: it's like Torch7, but
in Python on top of Theano. Multi-GPU is only on the roadmap because there's a
lot more going on in deep learning that acts as a bigger differentiator than
horizontal scaling.
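
For anyone who hasn't tried it, a model definition is only a few lines. A rough
sketch (layer sizes are arbitrary, X_train/y_train are assumed to already exist,
and the exact keyword arguments have shifted between Keras versions):

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(128, activation='relu', input_dim=784))
    model.add(Dense(10, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    model.fit(X_train, y_train, batch_size=32, nb_epoch=10)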

For the record, when I say scaling, I'm not talking about having multiple
machines serve multiple requests (that's a good reason to purchase more
machines), I'm talking about spreading your gradient computations and/or model
weights across machines or GPUs.

~~~
freerobby
Would image training for video recognition be a good example of this?

~~~
yjxiong
Our experiments might support this. With 4 or 8 Titan X GPUs, training time is
shortened from more than one week to one or two days. Code is available at
[https://github.com/yjxiong/caffe](https://github.com/yjxiong/caffe)

~~~
dave_sullivan
Note: data parallel vs model parallel.

Data parallelism is easy to implement and gives roughly linear training-time
speedups, as in the one-week-to-two-days result above with 4x the hardware. But
it doesn't give you bigger models.

Model parallelism is what gets you bigger models, and it's what I've been
referring to. It's overkill and doesn't work well anyway. Frameworks shouldn't
be expected to support it when there are more interesting topics in deep
learning they could support instead.
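
To make the distinction concrete, a tiny numpy sketch (a linear model standing
in for a network, and array slices standing in for GPUs):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 20))
    y = X @ rng.normal(size=20)
    w = np.zeros(20)                       # the model: one weight vector
    lr = 0.05

    def grad(w, xb, yb):                   # least-squares gradient for one shard
        return xb.T @ (xb @ w - yb) / len(xb)

    # Data parallelism: identical weights everywhere, different data shards,
    # gradients averaged. This is what turns one week into two days, but it
    # never lets the model itself get bigger.
    for step in range(200):
        shards = np.array_split(np.arange(len(X)), 4)      # pretend: 4 GPUs
        grads = [grad(w, X[s], y[s]) for s in shards]
        w -= lr * sum(grads) / len(grads)

    # Model parallelism would instead split w itself (say w[:10] on one card and
    # w[10:] on another) and ship partial results between devices on every step:
    # far more communication, for the privilege of a model that doesn't fit on
    # one card.
    print("residual:", np.linalg.norm(X @ w - y))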

Better research/methods will come out at some point, at which point this
calculus will change, but not yet! Today it is the very definition of
premature optimization in nearly all cases.

------
donthateyo
Until now, I've seen two responses to Google's TensorFlow from Facebook
employees. Yann LeCun seemed to really challenge Jeff Dean about TensorFlow's
scalability [1], and this benchmark puts TensorFlow near the bottom on all the
measures it tested. I can't ignore the possibility that this criticism of
TensorFlow from Facebook employees (while factually correct and constructive)
might be driven by some competition and jealousy.

[1]
[https://www.youtube.com/watch?v=90-S1M7Ny_o&t=39m](https://www.youtube.com/watch?v=90-S1M7Ny_o&t=39m)

~~~
kastnerkyle
This benchmark basically shows that releasing TensorFlow with only cuDNN v2
backend support hurts: v2 is quite a bit slower than v3 (current) and v4
(upcoming). TF has announced that they will add v4 support, which should help
quite a bit. But when many hobbyists and researchers are developing on one or
two GPUs, performance at that scale matters more (to them) than infinite
scalability.

It is not surprising that a tool developed for, and focused on, "Google scale"
work has some imperfections in a wildly different setting. The question is:
will they (or some dedicated contributor) speed up a use case the business
itself may not have a use for? My gut feeling is that they will, but these
things usually don't happen overnight. Torch, Theano, and Caffe have years of
work put into them and have largely been focused on the one-to-two-GPU case.

~~~
smhx
I've done an apples-to-apples comparison: TensorFlow + CuDNN R2 vs Torch + CuDNN R2.

~~~
kastnerkyle
My point is more that _even_ if they were on the same footing in the benchmark
timings, v2 is still far behind what is supported in Torch, Caffe, and Theano
right now (v3 in all three, IIRC). Your comparison is very fair, and it is good
insight!

------
iraphael
I don't know how much this matters, but an issue similar to this had been
raised here:
[https://github.com/tensorflow/tensorflow/issues/120](https://github.com/tensorflow/tensorflow/issues/120)

The responses seem to show that the way you implement things can make a big
difference in runtime. Perhaps the scripts used for benchmarking can be
further optimized?

That said, the lack of in-place operations might be surprising (although it
has been said that they are coming).
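
For readers wondering what "in-place" buys: in numpy terms (the planned TF ops
would act on its own tensors, not numpy arrays), the difference is whether each
update allocates a fresh result buffer or reuses the existing one:

    import numpy as np

    w = np.random.randn(4096, 4096).astype(np.float32)     # 64 MB of weights
    grad = np.random.randn(4096, 4096).astype(np.float32)
    lr = np.float32(0.01)

    # Out-of-place: "w - lr * grad" allocates a brand-new 64 MB buffer every step.
    w = w - lr * grad

    # In-place: results are written back into the existing buffers.
    np.multiply(grad, lr, out=grad)     # scaled gradient reuses grad's memory
    np.subtract(w, grad, out=w)         # update reuses w's memory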

------
IanCal
Interesting benchmarks. One hopefully constructive critique: if you say things
go out of memory, it'd be really useful to know what your setup is. Maybe
you've got a big array of massive GPUs, or maybe you're running it on a more
normal consumer GPU and box.

~~~
brianchu
According to [https://github.com/soumith/convnet-
benchmarks](https://github.com/soumith/convnet-benchmarks), it's an NVIDIA
Titan X (12 GB of GPU memory), which is pretty much the top-of-the-line GPU for
training neural nets. From my limited experience, most deep learning in
research, including much of the state of the art, is done on single GPUs. (My
guess is that if your model doesn't fit in 12 GB, it has way too many
parameters to train practically anyway.)
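
Rough arithmetic behind that guess:

    # How many float32 parameters even fit in 12 GB?
    budget_bytes = 12 * 1024**3                   # Titan X memory
    bytes_per_param = 4                           # float32
    print(budget_bytes / bytes_per_param / 1e9)   # ~3.2 billion parameters,
    # before activations, gradients and optimizer state, which typically cut
    # the usable budget by several-fold.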

~~~
IanCal
Ah, thank you for that; I hadn't clocked that this was opened by the person who
runs the repo. I should have checked the main readme.

------
vonnik
It takes 4x as long as CuDNN on AlexNet due to lack of in-place ops. What is
up with that?

------
programnature
In the announcement they called this the "reference implementation".

------
mtgx
It's almost like Google wanted everyone to use slow obsolete software and keep
the _really good stuff_ for itself, while still making it look like they're
doing a great thing for the community.

~~~
tfgg
I think it makes more sense that:

a) It's Google; they can throw hardware at a problem like running out of memory

b) TensorFlow makes methods development so much easier that it's worth the
loss of performance

c) It's early days, and the compute graph scheduler is designed for
optimization in a more flexible fashion than other frameworks, so there are
lots of opportunities still on the table.

When I work on developing methods for scientific code, I worry more about
whether the code is bug-free and easy to understand, and whether it's giving
the right answer. I don't usually worry about performance unless I'm actually
unable to run things. Especially since, when developing stuff, you waste way
more time on runs with bugs; I dread to think what my (published / unpublished)
CPU-hour ratio is. If the new approach allows less buggy implementations, then
that's a resource win.

~~~
IanCal
> b) TensorFlow makes methods development so much easier that it's worth the
> loss of performance

Indeed, if TensorFlow means I can try out an idea with 1 day of coding and 2
days of training rather than 3 days of coding and 1 day of training, then I can
spend a day drinking cocktails and reading books and still be finished sooner.

Based on the tutorials, it seems like I'd be able to pretty quickly build a
translation pipeline, and in fact there's an implementation of that I think
I'll try. If it takes a week or two to train, that's fine by me; I've got
other things to be getting on with.

