Uber engineers gave a great talk on distributed deep learning at the Mesosphere conference back in Oct 2017, where they discuss the problem space and Horovod in detail. Highly recommend it: https://www.youtube.com/watch?v=Ktc3GjshHcc
I'm a little confused by the architecture here. Keras already abstracts TensorFlow and PyTorch - and CNTK. Don't at least TF and CNTK already handle the distributed training use cases natively? (I'm not familiar with PyTorch.) It's certainly one of CNTK's main selling points. What are the use cases for adding yet another layer to the stack? In what way is this simpler than tweaking your code to use the distributed features of the underlying packages directly, which is what they were designed for?
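For concreteness, the Keras integration in Horovod's README amounts to a wrapped optimizer plus a broadcast callback on top of an otherwise ordinary Keras script; a rough, untested sketch (the model and data here are placeholders):

```python
import keras
import tensorflow as tf
import horovod.keras as hvd

hvd.init()  # one process per GPU, launched via mpirun

# Pin each process to its own GPU before Keras grabs a session.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

model = build_model()  # placeholder for an ordinary Keras model

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with ring allreduce.
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(1.0 * hvd.size()))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Rank 0 broadcasts its initial weights so every worker starts identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
```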
I don't know about CNTK, but TensorFlow and (I think) PyTorch don't have _good_ distributed training.
They use a distributed training model built around parameter servers, which scales nowhere near as well as Horovod's MPI-based solution.
Even for single-machine, multi-GPU training, only now in TensorFlow 1.8 is pure TensorFlow as fast as Horovod, with its Estimator MirroredStrategy. If you watch the TensorFlow Dev Summit 2018 talks, the devs say they're working on bringing something like Horovod to pure TensorFlow.
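For a sense of what that looks like, Horovod's TensorFlow integration is basically "wrap the optimizer, broadcast the initial weights"; this is a rough sketch along the lines of the Horovod README example (untested; `build_model` and the input pipeline are placeholders):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched via mpirun

# Pin each process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_model()  # placeholder for your model/loss

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged with ring allreduce.
opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.001 * hvd.size()))
train_op = opt.minimize(loss)

# Rank 0 broadcasts initial variables so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```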
> What are the use cases for adding yet another layer to the stack?
In my limited experience with Horovod, it's most useful when you're running large clusters of workers/ps. In those situations you typically have to find the right balance of workers to parameter servers by hand (otherwise you'd run into blocking or network saturation issues). Horovod addresses this with its ring-allreduce implementation; a rough sketch of the workers/ps wiring it replaces is below.
Having said all of that, I'm sticking with distributed TF for now.
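For contrast, this is the sort of cluster wiring plain distributed TF expects you to spell out, with the ps/worker ratio picked by hand (sketch only; hostnames, counts, and `build_model` are made up):

```python
import tensorflow as tf

# You choose the ps:worker ratio yourself; too few ps tasks and gradient
# pushes saturate them, too many and you waste machines.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker%d.example.com:2222" % i for i in range(8)],
})

# Each process is started with its own job_name/task_index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed round-robin on the ps tasks, compute on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    loss = build_model()  # placeholder
    train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
```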
Anyone able to find any data on time-to-accuracy? I don't see it (even in the video linked in another comment).
Sure, it's nice to "achieve 90% scaling efficiency" in images/sec, but images/sec alone doesn't get you where you want to go. Increased accuracy per unit of wall-clock time is what you actually want.
I feel like before you need distributed training, you need distributed hyperparameter tuning, so I'm disappointed I don't see anything about that here.