
Announcing TensorFlow 0.8 – now with distributed computing support - mrry
http://googleresearch.blogspot.com/2016/04/announcing-tensorflow-08-now-with.html
======
TrickedOut
Is OpenCL anywhere on the roadmap? I now make laptop and desktop purchasing
decisions almost entirely on the presence of an Nvidia card. It's one reason I
didn't get the latest MacBook Pro.

~~~
dave_sullivan
AMD is so woefully behind the curve in GPGPU, and especially deep learning,
that their management should be replaced. They think they're still competing
with Intel (they're not). Nvidia has a wide-open field for the foreseeable
future, and this will end up being a bad thing for consumers.

~~~
dharma1
Couldn't agree more. They need to diversify and really push GPU compute and
OpenCL. Where's the equivalent of cuDNN for OpenCL? How many AMD engineers
would it take to build that, and what would the impact be?

------
pvnick
Can anybody offer a TL;DR of how this works (or point me to one)? It seems
particularly well-suited for convolutional nets with many layers, if I
understand correctly, but I'm curious whether e.g. recurrent nets would
receive the same speed-ups from parallelization.

~~~
dgacmu
People at Google use multiple replicas to train RNNs and LSTMs to very good
effect.

At heart, the most common distributed training mechanism creates multiple
"replicas" of the model -- each replica holds a full copy of the model. It
splits the training data among the replicas, and at the end of every batch it
synchronizes the weight updates between them. (A simple way to think of it:
take the average of the gradients produced at each replica, and have all
replicas apply that average gradient. Equivalently, just reduce the
per-replica learning rate, apply all of the gradients, then propagate the new
state back to everyone.)
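
To make that concrete, here's a toy sketch of the synchronous-averaging scheme
in plain NumPy (purely illustrative - this is not how TensorFlow implements
it, and every name in it is made up):

    import numpy as np

    # Toy linear model: predict y = X @ w. Each "replica" holds a full
    # copy of w and computes gradients on its own shard of the data.
    def gradient(w, X, y):
        # Mean-squared-error gradient for the linear model.
        return 2.0 * X.T @ (X @ w - y) / len(y)

    def synchronous_step(w, shards, lr):
        # 1. Every replica computes a gradient on its own shard.
        grads = [gradient(w, X, y) for X, y in shards]
        # 2. Average the per-replica gradients and apply the result.
        #    (Equivalently: apply every gradient at lr / num_replicas.)
        return w - lr * np.mean(grads, axis=0)

    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -1.0])
    shards = []
    for _ in range(4):                  # four replicas, one shard each
        X = rng.normal(size=(64, 2))
        shards.append((X, X @ w_true))

    w = np.zeros(2)
    for _ in range(200):
        w = synchronous_step(w, shards, lr=0.1)
    print(w)                            # approaches [2.0, -1.0]

Running all four shards in one process obviously defeats the point; the
interesting engineering is doing step 2 efficiently across machines.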

~~~
pvnick
Ah, great, thanks for the explanation.

------
therobot24
Since one study found TensorFlow to be slower than most other frameworks
([http://arxiv.org/abs/1511.06435](http://arxiv.org/abs/1511.06435)), it'll be
nice to see how this release affects perceived performance.

~~~
dgacmu
That study's way out of date - it benchmarked the cuDNN v2 version. Soumith's
convnet-benchmarks is much more up-to-date:
[https://github.com/soumith/convnet-benchmarks](https://github.com/soumith/convnet-benchmarks)

but it hasn't yet been updated to reflect the latest performance improvements
in 0.8. We've continued to push on both single-machine and distributed
performance, and the next update to soumith's benchmarks should continue to
show that improvement.

~~~
therobot24
>> That study's way out of date

I don't know about 'way' out of date - it was first published just a few
months ago (November), and the authors pushed a revised version just a few
weeks ago (March 30th) - but I definitely agree that it's not using the most
current implementations.

>> Soumith's convnet-benchmarks is much more up-to-date

I'll definitely check these out, thanks for the link

~~~
vrv
And even those numbers on the front page are out of date :) (we're even faster
now: [https://github.com/soumith/convnet-benchmarks/pull/96](https://github.com/soumith/convnet-benchmarks/pull/96),
which is from a few weeks ago.)

The field is moving quickly enough that many published benchmarks are stale
within 3 months, and it's a lot of hard work to maintain up-to-date benchmarks
given how many frameworks there are. Also, there are
performance/memory/scalability/flexibility tradeoffs everywhere, so it's hard
to capture everything in one number without a tremendous number of caveats.

------
fudged71
I wonder how effective this would be on a fleet of Raspberry Pis. With things
like Resin.io, Weave, and Kubernetes, I wonder if it would be possible to
create something like SETI@home for crowdsourced machine learning across all
kinds of applications. Many of us have spare Raspberry Pis lying around that
could be put to use in a global network.

~~~
wyldfire
You'd probably have to scale to hundreds or thousands of Pis to achieve the
performance you could get from a single $100-200 GPU.

------
taliesinb
I'm getting 404s for some of the tutorial sections when selecting r0.8 (e.g.
[https://www.tensorflow.org/versions/r0.8/tutorials/mnist/tf/...](https://www.tensorflow.org/versions/r0.8/tutorials/mnist/tf/index.html#tensorflow-mechanics-101)).
master works. Seems like some of the documentation is only built for master
and for r0.7, not for r0.8.

~~~
vrv
(Do you have an example link that doesn't work? I clicked a bunch of links
there and they were all working. Feel free to file a bug at
github.com/tensorflow/tensorflow)

~~~
taliesinb
The link I gave, and others, repeatedly didn't work when I tried them, but now
they seem to work!

------
modeless
Very cool! Any progress on Windows support?

~~~
hebdo
I doubt it is a priority. But I can certainly recommend Amazon's GPU-enabled
instances (~$0.60/hour per GPU - not that much, actually).

~~~
babo
g2 instances have a GPU that is not compatible with stock TensorFlow; you must
rebuild it from source. Do you have a workaround for that?

~~~
vrv
I believe our published wheels now include the code for CUDA compute
capability 3.0, so it should work out of the box now.

(As long as the images have cuDNN v4 and CUDA 7.5 installed, I think :)
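
If you want to sanity-check an instance, a tiny smoke test along these lines
(sketched against the 0.x API) will log whether ops actually land on /gpu:0:

    import tensorflow as tf

    # Run a tiny matmul with device placement logging turned on. If the
    # GPU wheel plus the CUDA 7.5 / cuDNN v4 install is healthy, the ops
    # should show up on /gpu:0 in the log output.
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[0.0, 1.0], [1.0, 0.0]])
    c = tf.matmul(a, b)

    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(c))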

~~~
babo
Great news, I'll try today!

------
elcct
I predict that in 10 years we will see the rise of computer psychotherapists.

------
hiddencost
Nice; it only took them 7 months to catch up to Amazon:

[http://www.nikkostrom.com/publications/interspeech2015/strom...](http://www.nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf)

~~~
dgacmu
For others who may be interested in the details despite the uninformative tone
of this comment: The Amazon paper is about a specific tweak to learning rates
for better scalability when doing distributed training. The core principles of
distributed DNN training are much older - for example, Dean et al. 2012:
[https://papers.nips.cc/paper/4687-large-scale-distributed-de...](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)
trained a model for ImageNet on 2,000 cores, using the DistBelief framework
that is the predecessor to TensorFlow.

The question of how to improve the multiple-replica scaling of distributed DNN
training is very important, as is the question of creating usable, flexible,
and high-performance abstractions in which to implement such training. The two
are also fairly orthogonal: TensorFlow as an architecture focuses on the
latter, and one could imagine implementing the Amazon tweak within either TF
or any other framework.
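
To give a flavor of that abstraction, here's a minimal sketch using the
distributed API that 0.8 introduces (the hostnames and model variable are
hypothetical placeholders):

    import tensorflow as tf

    # A cluster spec names every process that participates in training.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222",
                   "worker1.example.com:2222"],
    })

    # Each process starts a server for its own job and task.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # replica_device_setter pins variables to the parameter server(s),
    # while each worker builds and runs its own replica of the graph.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        w = tf.Variable(tf.zeros([784, 10]))  # shared weights live on the ps
        # ... build the per-replica loss and optimizer here ...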

