

Distributing a Fully Connected Neural Network Across a Cluster - iamtrask
http://iamtrask.github.io/2014/11/24/distributing-network/

======
ajtulloch
How is this on the front page? This is completely incoherent.

For anyone actually interested in some interesting techniques for multi-GPU
DNN training,
[http://arxiv.org/pdf/1404.5997v2.pdf](http://arxiv.org/pdf/1404.5997v2.pdf)
and references therein are probably a good start.

~~~
iamtrask
This also might help... here are some slides graphically showing how the
distribution works.
[http://prezi.com/hdctecihctdr/?utm_campaign=share&utm_medium...](http://prezi.com/hdctecihctdr/?utm_campaign=share&utm_medium=copy&rc=ex0share)

------
dhaivatpandya
The exposition is not very clear. What exactly do you mean when you say "No
edges will be communicated over the network, only half of the nodes."? I'm
puzzled, because a few sentences later, you claim "The only network IO that
would be required would be sending each edge value to its respective node in
Q."; so the edge values are actually communicated?

From what I've understood, what you're suggesting is that for every node in a
layer, you colocate the edge on the same machine?

~~~
iamtrask
Precisely! I highly encourage checking out the slide-deck for a graphical
representation.

For every node in every other layer, I colocate its edges on the same machine.
In this way, when a group of, say, 10 nodes in layer 1 are each sending a
weighted message to a single node in layer 2... they can pre-combine their
messages (a weighted sum) and send only that single value over the network.
This happens for every node in the second layer, reducing network I/O (this is
the first optimization).
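A minimal sketch of that pre-combination idea, assuming a NumPy toy setup where each worker holds a shard of the layer-1 activations together with the matching rows of the weight matrix (the variable names and the 2-worker/10-node sizes are illustrative, not from the original post):

    import numpy as np

    # Toy setup: 10 layer-1 nodes split across 2 workers, 4 layer-2 nodes.
    # Each worker stores its shard of layer-1 activations AND the matching
    # rows of the weight matrix, so edges live on the same machine as their
    # source nodes (the colocation described above).
    rng = np.random.default_rng(0)
    n_layer1, n_layer2, n_workers = 10, 4, 2

    activations = rng.standard_normal(n_layer1)          # layer-1 outputs
    weights = rng.standard_normal((n_layer1, n_layer2))  # edge weights

    shards = np.array_split(np.arange(n_layer1), n_workers)

    # Each worker pre-combines its contribution: one partial weighted sum
    # per layer-2 node, instead of one message per edge.
    partial_sums = [activations[idx] @ weights[idx] for idx in shards]

    # Simulated "network traffic": n_workers * n_layer2 floats are sent,
    # rather than n_layer1 * n_layer2 individual edge values.
    layer2_input = np.sum(partial_sums, axis=0)

    # Sanity check against the undistributed computation.
    assert np.allclose(layer2_input, activations @ weights)

Under these assumptions, per-layer network I/O scales with (number of workers x layer width) rather than with the number of edges, which is the saving described above.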

