Sorry, I just saw this post. We're actually intending to implement allreduce as well; right now, the initial focus was on implementing something more robust. There are a lot of elements of our distributed training that aren't covered in the post (it was meant to be more of a high-level overview).
A few additional things:
1. First and foremost, fault tolerance and making Spark run well were bigger priorities for us. Spark and its ilk don't do well with GPU clusters. Our initial focus was on taking what we had, making it work well, and having it run everywhere with no code changes.
A common workflow that "just works" is being able to run model import on Spark and run things as-is. My colleague Max is behind Elephas (which we've since adopted as the Python interface for DL4J on Spark).
2. What we've seen is that many people don't have MPI. We have a hard constraint of running in strange on-prem environments. So instead, we focus on things like multicast UDP and compression/quantization to speed up the networking as much as we can.
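To give a flavor of what quantization buys you on the wire, here's a minimal sketch of 8-bit linear quantization of a gradient vector. This is an illustration of the general technique, not DL4J's actual compression scheme; the class and method names are hypothetical.

```java
// Sketch: compress a float gradient array to one byte per element using a
// per-array scale factor. Cuts network traffic to ~25% of float32 at the
// cost of some precision. Not DL4J's real implementation.
public class GradientQuantizer {

    // Quantize to signed bytes in [-127, 127]; writes the scale into scaleOut[0].
    public static byte[] quantize(float[] grads, float[] scaleOut) {
        float max = 0f;
        for (float g : grads) max = Math.max(max, Math.abs(g));
        float scale = (max == 0f) ? 1f : max / 127f;
        scaleOut[0] = scale;
        byte[] out = new byte[grads.length];
        for (int i = 0; i < grads.length; i++) {
            out[i] = (byte) Math.round(grads[i] / scale);
        }
        return out;
    }

    // Reconstruct approximate floats on the receiving side.
    public static float[] dequantize(byte[] q, float scale) {
        float[] out = new float[q.length];
        for (int i = 0; i < q.length; i++) out[i] = q[i] * scale;
        return out;
    }
}
```

The round trip is lossy but bounded: each element is off by at most half a quantization step, which is usually tolerable for gradient updates.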
3. When we go to implement allreduce, we'll focus on reusing as much of this as we can. We'll also try to figure out how to reuse our existing parts that work well, like our cyclic buffer reuse, called workspaces: https://deeplearning4j.org/docs/latest/deeplearning4j-config...
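For readers unfamiliar with the idea, cyclic buffer reuse means handing out preallocated scratch buffers round-robin instead of allocating per iteration, which avoids GC pressure in a hot training loop. A toy sketch (not the actual workspaces API; the class name is made up for illustration):

```java
// Sketch: a fixed pool of scratch arrays cycled round-robin. Callers must
// be done with a buffer before the cycle wraps back around to it.
// Illustrative only; DL4J's workspaces are more sophisticated.
public class CyclicWorkspace {
    private final float[][] buffers;
    private int cursor = 0;

    public CyclicWorkspace(int slots, int sizePerSlot) {
        buffers = new float[slots][sizePerSlot];
    }

    // Return the next buffer in the cycle, overwriting whatever was there.
    public float[] next() {
        float[] buf = buffers[cursor];
        cursor = (cursor + 1) % buffers.length;
        return buf;
    }
}
```

With two slots, the third call to `next()` returns the same array as the first, so memory use stays constant no matter how many iterations run.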
Any other feedback folks have would be appreciated.