As far as I can see, there's a lot of noise ("How do I run Caffe on OpenCL?") and not much actual progress.
The truth is that you are better off waiting for external NVidia GPUs to become widely available than waiting for a decent OpenCL implementation.
At heart, the most common distributed training mechanism creates multiple "replicas" of the model -- each replica has a full copy. It splits the training data among the replicas, and then at the end of every batch, synchronizes the updates to the model weights between the replicas. (A simple way to think of it is taking the average of the gradients produced at each replica and having all replicas apply that average gradient. Equivalently, just reduce the per-replica learning rate, apply all of the gradients, and then propagate the new state back to everyone.)
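To make that concrete, here's a rough numpy sketch of one synchronous step (this is illustration only, not TensorFlow's implementation; `sync_step`, `grad_fn`, and the toy loss are made-up names):

    # Rough sketch of one synchronous data-parallel step (illustration only,
    # not TensorFlow's implementation).
    import numpy as np

    def sync_step(weights, replica_batches, grad_fn, lr=0.01):
        # Each replica computes a gradient on its shard of the batch...
        grads = [grad_fn(weights, batch) for batch in replica_batches]
        # ...the gradients are averaged...
        avg_grad = np.mean(grads, axis=0)
        # ...and every replica applies the same averaged update.
        return weights - lr * avg_grad

    def sync_step_equivalent(weights, replica_batches, grad_fn, lr=0.01):
        # Equivalent view: compute all gradients at the same starting weights,
        # then apply each with the learning rate divided by the replica count.
        grads = [grad_fn(weights, batch) for batch in replica_batches]
        for g in grads:
            weights = weights - (lr / len(grads)) * g
        return weights

    # Toy usage: loss = ||w - mean(batch)||^2, so grad = 2 * (w - mean(batch)).
    grad_fn = lambda w, batch: 2.0 * (w - batch.mean(axis=0))
    w = np.zeros(3)
    batches = [np.random.randn(32, 3) for _ in range(4)]  # 4 replicas
    assert np.allclose(sync_step(w, batches, grad_fn),
                       sync_step_equivalent(w, batches, grad_fn))

The two functions produce identical weights because summing the per-replica gradients with the learning rate divided by the replica count is the same arithmetic as applying the averaged gradient once.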
The cool thing about distributed TensorFlow is that it supports efficient synchronous optimizers, so you can scale up the effective batch size across multiple GPUs and get increased throughput without losing accuracy.
but it hasn't yet been updated to reflect the latest performance improvements in 0.8. We've continued to push on both single-machine and distributed performance, and the next update to Soumith's benchmarks should continue to show that improvement.
I don't know about 'way' out of date; it was first published just a few months ago (November), and the authors pushed a revised version just a few weeks ago (March 30th), but I definitely agree that it's not using the most current implementations.
>> Soumith's convnet-benchmarks is much more up-to-date
I'll definitely check these out, thanks for the link
The field is moving quickly enough that many published benchmarks are stale within 3 months, and it's a lot of hard work to maintain up-to-date benchmarks given how many frameworks there are. Also, there are performance/memory/scalability/flexibility tradeoffs everywhere, so it's hard to capture everything in one number without a tremendous number of caveats.
My conclusion from this is that Soumith's approach of maintaining a living repository is the way to go. It's harder to call it a "publication", but it provides something of more lasting value than a static performance snapshot in a field where the engineering moves so quickly.
Maybe at some point it will be viable, but not with the hardware and software as it is at the moment.
I doubt your idea would prove efficient.
It would also be possible to distribute computation of batches across nodes. Each node would compute the gradients on its batch, and the master would combine gradients and distribute new weights.
High-speed interconnects (e.g. InfiniBand) are not needed in this scenario, and the bandwidth usage scales with the size of the weights and/or gradients, not the dataset size.
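For a rough sense of scale (the parameter count below is just an assumed example, not something from the parent comment):

    # Back-of-the-envelope estimate of per-batch sync traffic for one worker.
    # Assumes float32 parameters and no gradient compression.
    num_params = 25_000_000   # roughly ResNet-50-sized model (assumed example)
    bytes_per_param = 4       # float32

    grads_up     = num_params * bytes_per_param  # worker -> master: gradients
    weights_down = num_params * bytes_per_param  # master -> worker: new weights

    print((grads_up + weights_down) / 1e6, "MB per worker per batch")
    # ~200 MB per batch, independent of how large the training set is.

Whether that traffic is cheap or expensive depends on how long a batch takes to compute, not on how much training data you have.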
This could be interesting if ported to an FPGA, though. That could give you that power/performance tradeoff.
If you check out Apache Mahout, you can get an idea of what is possible and what is not.
(as long as the images have cuDNN v4 and CUDA 7.5 installed, I think :)
The question of how to improve the multiple-replica scaling of distributed DNN training is very important, as is the question of creating usable, flexible, and high-performance abstractions in which to implement it. They're also fairly orthogonal. TensorFlow as an architecture focuses on the latter. One could imagine implementing the Amazon tweak within either TF or any other framework.