Introduction to Distributed Training of Neural Networks (skymind.ai)
52 points by sytelus 44 days ago | 4 comments

The asynchronous, parameter-server-based approach to distributed training just doesn't make sense on a cluster of GPUs: every worker pushes gradients to and pulls parameters from the same few servers, so the communication overhead would be massive. Synchronous training with ring-allreduce, as in Horovod and similar frameworks, is the way to go.
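For anyone who hasn't seen ring-allreduce, here is a rough single-process sketch of the idea in plain Python. It is a simulation under simplifying assumptions (gradients as equal-length lists, length divisible by the worker count, a hypothetical `ring_allreduce` helper), not how Horovod itself is implemented. Each worker only ever talks to its right-hand neighbour, and sends about 2(N-1)/N times the gradient size in total, no matter how many workers there are.

```python
def ring_allreduce(worker_grads):
    """Simulate a sum ring-allreduce over one gradient list per worker."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "sketch assumes the gradient splits into n equal chunks"
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # each worker's local buffer

    def span(c):
        # Slice covering chunk index c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing data so all sends in a step happen "at once".
        sends = [((r - step) % n, bufs[r][span((r - step) % n)]) for r in range(n)]
        for r, (c, data) in enumerate(sends):
            dst = (r + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], data)]

    # Phase 2, allgather: pass the finished chunks around the ring so that
    # every worker ends up with every summed chunk.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, bufs[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, data) in enumerate(sends):
            bufs[(r + 1) % n][span(c)] = data

    return bufs


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
result = ring_allreduce(grads)
```

After the call, every entry of `result` holds the element-wise sum of the three input gradients; dividing by the worker count would give the average that synchronous SGD typically applies.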

I agree. Here's Horovod for anyone who hasn't heard of it: https://github.com/uber/horovod

I wish HN had a budget where you could upvote some people/posts 10 times.

Hey folks:

Sorry, I just saw this post. We actually intend to implement allreduce as well. Our initial focus was on implementing something more robust. There are a lot of elements of our distributed training that aren't covered in the post (it was meant to be more of a high-level overview).

A few additional things:

1. First and foremost, fault tolerance and making Spark run well were bigger priorities for us. Spark and its ilk don't do well with GPU clusters. Our initial focus was on taking what we had and making it work well and run everywhere with no code changes.

A common workflow that "just works" is being able to run model import on Spark and run things as is. My colleague Max is behind Elephas (which we've since adopted as our Python interface for DL4J on Spark).

2. What we've seen is that many people don't have MPI. We have a hard constraint of running in strange on-prem environments. So instead, we focus more on things like multicast UDP and compression/quantization to speed up the networking as much as we can.

3. When we go to implement allreduce, we'll focus on reusing as much of this as we can. We'll also try to figure out how to reuse our existing parts that work well, like our cyclic buffer reuse, called workspaces: https://deeplearning4j.org/docs/latest/deeplearning4j-config...
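To make the compression/quantization idea in point 2 concrete, here is a generic sketch of 8-bit gradient quantization in plain Python. It is only an illustration of the general technique and should not be read as the scheme DL4J actually uses; the `encode_gradient` and `decode_gradient` names and the wire layout (one float32 scale followed by one signed byte per value) are assumptions made for this example.

```python
import struct


def encode_gradient(grad, levels=127):
    """Pack a list of floats as a float32 scale plus one signed byte each."""
    peak = max((abs(x) for x in grad), default=0.0)
    scale = peak / levels if peak > 0 else 1.0
    # Round each value to the nearest quantization level, clamped to the byte range.
    codes = [max(-levels, min(levels, round(x / scale))) for x in grad]
    return struct.pack(f"<f{len(codes)}b", scale, *codes)


def decode_gradient(payload):
    """Invert encode_gradient, returning approximate float values."""
    count = len(payload) - 4  # the first 4 bytes hold the float32 scale
    scale, *codes = struct.unpack(f"<f{count}b", payload)
    return [c * scale for c in codes]


grad = [0.5, -1.0, 0.25, 0.0]
payload = encode_gradient(grad)
restored = decode_gradient(payload)
```

Here the payload is 8 bytes (4 for the scale, 1 per value) instead of the 16 bytes that four float32 values would take, and each restored value is off by at most about half a quantization step (`peak / 254`). For large gradients the fixed 4-byte header becomes negligible, so traffic drops to roughly a quarter at the cost of that rounding error.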

Any other feedback folks have would be appreciated.
