Uber engineers gave a great talk on distributed deep learning at the Mesosphere conference back in Oct 2017, where they discuss the problem space and Horovod in detail. Highly recommend it: https://www.youtube.com/watch?v=Ktc3GjshHcc
I'm a little confused by the architecture here. Keras already abstracts TensorFlow and PyTorch - and CNTK. Don't at least TF and CNTK already handle the distributed training use cases natively? (I'm not familiar with PyTorch.) It's certainly one of CNTK's main selling points. What are the use cases for adding yet another layer to the stack? In what way is this simpler than tweaking your code to use the distributed features of the underlying packages directly, which is what they were designed for?
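For concreteness, the Keras integration in Horovod's README amounts to a wrapped optimizer plus a broadcast callback on top of an otherwise ordinary Keras script; a rough, untested sketch (the model and data here are placeholders):

```python
import keras
import tensorflow as tf
import horovod.keras as hvd

hvd.init()  # one process per GPU, launched via mpirun

# Pin each process to its own GPU before Keras grabs a session.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

model = build_model()  # placeholder for an ordinary Keras model

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with ring allreduce.
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(1.0 * hvd.size()))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Rank 0 broadcasts its initial weights so every worker starts identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
```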
I don't know about CNTK, but TensorFlow and (I think) PyTorch don't have _good_ distributed training.
They use a distributed training model built around parameter servers, which scales nowhere near as well as Horovod's MPI-based solution.
Even for single-machine, multi-GPU training, only now in TensorFlow 1.8 is pure TensorFlow as fast as Horovod, with its Estimator MirroredStrategy. If you watch the TensorFlow Dev Summit 2018 talks, the devs say they're working on bringing something like Horovod to pure TensorFlow.
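For a sense of what that looks like, Horovod's TensorFlow integration is basically "wrap the optimizer, broadcast the initial weights"; this is a rough sketch along the lines of the Horovod README example (untested; `build_model` and the input pipeline are placeholders):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched via mpirun

# Pin each process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_model()  # placeholder for your model/loss

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged with ring allreduce.
opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.001 * hvd.size()))
train_op = opt.minimize(loss)

# Rank 0 broadcasts initial variables so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```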
> What are the use cases for adding yet another layer to the stack?
In my limited experience with Horovod, it's most useful when you're running large clusters of workers/ps. In those situations you typically have to find the right balance of workers to parameter servers by hand (otherwise you'd run into blocking or network saturation issues). Horovod addresses this with its ring-allreduce implementation; a rough sketch of the workers/ps wiring it replaces is below.
Having said all of that, I'm sticking with distributed TF for now.
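For contrast, this is the sort of cluster wiring plain distributed TF expects you to spell out, with the ps/worker ratio picked by hand (sketch only; hostnames, counts, and `build_model` are made up):

```python
import tensorflow as tf

# You choose the ps:worker ratio yourself; too few ps tasks and gradient
# pushes saturate them, too many and you waste machines.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker%d.example.com:2222" % i for i in range(8)],
})

# Each process is started with its own job_name/task_index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed round-robin on the ps tasks, compute on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    loss = build_model()  # placeholder
    train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
```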
Anyone able to find any data on time-to-accuracy? I don't see it (even in the video linked in another comment).
Sure, it's nice to "achieve 90% scaling efficiency" in images/sec, but images/sec alone doesn't get you where you want to go. Increased accuracy per unit of wall-clock time is what you actually want.
I feel like before you need distributed training, you need distributed hyperparameter tuning, so I'm disappointed I don't see anything about that here.