I doubt your idea would prove efficient. Maybe it will become viable at some point, but not with the hardware and software as they are at the moment.
It would also be possible to distribute the computation of batches across nodes: each node computes the gradients on its own batch, and a master node combines the gradients and distributes the updated weights.
High-speed interconnects (e.g. InfiniBand) are not needed in this scenario, and bandwidth usage scales with the size of the weights and gradients, not with the dataset size.
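Here is a minimal sketch of that scheme, assuming a simple linear model with a squared-error loss; the function names and the single-process simulation are hypothetical stand-ins for whatever transport (MPI, gRPC, etc.) would actually carry the gradients and weights between nodes:

```python
import numpy as np

def node_gradient(weights, batch_x, batch_y):
    """Each node computes the gradient of a linear model's
    squared-error loss on its own local batch."""
    errors = batch_x @ weights - batch_y
    return batch_x.T @ errors / len(batch_y)

def master_step(weights, gradients, lr=0.01):
    """The master averages the per-node gradients and produces
    the new weights, which are broadcast back to every node."""
    avg_grad = np.mean(gradients, axis=0)
    return weights - lr * avg_grad

# Simulate one synchronous round with 4 nodes, each holding its own batch.
rng = np.random.default_rng(0)
weights = np.zeros(10)
batches = [(rng.normal(size=(32, 10)), rng.normal(size=32)) for _ in range(4)]
grads = [node_gradient(weights, x, y) for x, y in batches]
weights = master_step(weights, grads)
```

Note that the only things crossing the network per round are one gradient vector per node and one weight vector back, which is why the bandwidth cost depends on the model size rather than on how much data each node holds.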
This could be interesting if ported to an FPGA though. That could give you a better power/performance tradeoff.
If you check out Apache Mahout, you can get an idea of what is possible and what is not.