
Horovod: Distributed Training Framework for TensorFlow, Keras, and PyTorch - axiomdata316
https://github.com/uber/horovod
======
metral
Uber engineers gave a great talk on distributed deep learning at the
Mesosphere conference back in Oct 2017, in which they discuss the problem
space and Horovod in detail. Highly recommend it:
https://www.youtube.com/watch?v=Ktc3GjshHcc

------
gaius
I'm a little confused by the architecture here. Keras already abstracts
TensorFlow and PyTorch - and CNTK. Don't at least TF and CNTK already handle
the distributed training use cases natively? (I am not familiar with PyTorch.)
It's certainly one of CNTK's main selling points. What are the use cases for
adding yet another layer to the stack? In what way is this simpler than
tweaking your code to use the distributed features of the underlying packages
natively, which is what they are designed for?
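
For context, the project's pitch is that distributing an existing single-GPU
script takes only a few added lines, versus restructuring the model for a
native parameter-server setup. A rough sketch of the TensorFlow pattern,
adapted from the Horovod README (untested, and the API may have changed):

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()  # one process per GPU, typically launched via mpirun

    # Pin each process to a single local GPU.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Toy model standing in for your usual single-GPU graph.
    x = tf.random_normal([32, 10])
    y = tf.random_normal([32, 1])
    w = tf.get_variable("w", [10, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

    # The Horovod-specific part: scale the learning rate by the worker
    # count and wrap the optimizer so gradients are averaged via allreduce.
    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=step)

    # Broadcast rank 0's initial variables so all workers start in sync.
    hooks = [hvd.BroadcastGlobalVariablesHook(0),
             tf.train.StopAtStepHook(last_step=1000)]
    with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)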

~~~
mbeex
> Keras already abstracts TensorFlow and PyTorch - and CNTK.

PyTorch? You mean Theano, right?

~~~
gaius
Yes, my mistake! I mainly use Keras and CNTK in R; I made a conscious
decision to ignore most of the others to avoid getting bogged down.

------
stephensonsco
Is anyone able to find any data on time-to-accuracy? I don't see it (even in
the video linked in another comment).

Sure, it's nice to "achieve 90% scaling efficiency" in images/sec, but
images/sec alone doesn't get you where you want to go. Increased accuracy per
unit wallclock time is what you want.
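
To make that concrete, a back-of-the-envelope illustration (all numbers
hypothetical, not taken from any Horovod results): if the larger effective
batch needs extra epochs to reach the same accuracy, time-to-accuracy
improves less than raw throughput does.

    # Illustrative arithmetic only; every number here is made up.
    workers = 16
    scaling_efficiency = 0.90                          # claimed images/sec efficiency
    throughput_speedup = workers * scaling_efficiency  # 14.4x more images/sec

    epochs_1gpu = 90          # hypothetical epochs to target accuracy on 1 GPU
    epochs_large_batch = 110  # hypothetical epochs needed at the larger batch

    tta_speedup = throughput_speedup * epochs_1gpu / epochs_large_batch
    print("throughput speedup:       %.1fx" % throughput_speedup)  # 14.4x
    print("time-to-accuracy speedup: %.1fx" % tta_speedup)         # ~11.8x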

~~~
vinn124
> Increased accuracy per unit wallclock time is what you want.

Yeah, especially for a framework for distributed learning!

------
Eridrus
I feel like before you need distributed training, you need distributed
hyperparameter tuning, so I'm disappointed I don't see anything about that
here.

