
Infrastructure for Deep Learning - yigitdemirag
https://openai.com/blog/infrastructure-for-deep-learning/
======
programnature
While it's useful to have this kind of info, IMHO it's still far from
'infrastructure for deep learning'. What about model versioning? What about
deployment environments? We need to address the whole lifecycle, not just the
'training' bit. This is a huge and underserved part of the problem because
people tend to be satisfied with having one model that's good enough to publish.

~~~
tlb
Indeed, deployment is a whole set of interesting issues. We haven't deployed
any learned models in production yet at OpenAI, so it's not at the top of our
list.

If the data and models were small and training was quick (on the order of
compilation time), I'd just keep the training data in git and train the model
from scratch every time I run make. But the data is huge, training requires
clusters of machines and can take days, so you need a pipeline.
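
As a toy sketch of that make-style workflow (the file names and the train()
stub are stand-ins, not anything we actually use):

    import os
    import pickle

    DATA = "train_data.csv"   # hypothetical training data kept in git
    MODEL = "model.pkl"       # trained-model artifact

    def stale(target, source):
        # make-style rule: rebuild if the target is missing or older than its source
        return (not os.path.exists(target)
                or os.path.getmtime(target) < os.path.getmtime(source))

    def train(data_path, model_path):
        # stand-in for a real training run; only sane if data and training are small/fast
        with open(model_path, "wb") as f:
            pickle.dump({"trained_on": data_path}, f)

    if stale(MODEL, DATA):
        train(DATA, MODEL)

Once the data no longer fits in git and a run takes days on a cluster, that
one rule grows into a multi-stage pipeline.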

An industrial strength system looks like this:
[https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/](https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/)

~~~
platypii
CTO of Algorithmia here. We've spent a lot of time thinking about the issues
of deploying deep learning models. There's a whole set of challenges that crop
up when trying to scale these kinds of deployments (not least of which is
managing GPU memory).

It would be interesting to compare notes since we have deployed a number of
models in production, and seem to focus on a related but different set of
challenges. kenny at company dot com.

------
thr0waway1239
I don't know much about deep learning. Just noticed that there are 40+ upvotes
and 0 comments. I propose the HN Bikeshedding effect theory. Take the number
of comments and divide it by the number of upvotes.

    <0.1    = Too technical for even HN audience
    0.1-1.0 = At the right level for the HN audience
    >1      = The topic is similar to painting the bike shed

~~~
minimaxir
The high number of upvotes and low ratio of comments to votes on deep
learning/big data posts is unfortunately accurate.

It's not a problem that HN has topics which are frequently upvoted; topics
such as employment and Rust are popular memes.

It _is_ a problem, however, if the upvote-for-the-title crowd upvote articles
which are _bad_ and would not get upvotes if they were about another topic.
That's a legit hard problem to solve (what makes a good submission?),
unfortunately, but one I've been looking into.

(For clarity, this submission is a good submission, but I've seen quite a few
top-ranking HN submissions that are just a poorly sourced bar chart on a
controversial topic. And linkbait about deep learning tends to get upvotes,
but flagged too.)

~~~
mamon
There's a problem with upvoting on HN: an upvote is also the only saving
mechanism available. If I run across some click-baity title but don't have
time to read it right now, I will click upvote, but what I really mean by that
is "save for later".

Maybe HN just needs to separate bookmarking and upvoting?

~~~
detaro
... HN has added favorites recently ("favorite" link, on the
submission/individual comments page) and thus has both. And apparently really
needs a more public changelog.

------
ymt123
It's great to see people talking about the infrastructure they use to manage
their deep learning workloads.

One area where we've had trouble with other orchestration tools (e.g. Docker
Swarm) is managing resources at anything finer-grained than whole boxes. They
are all good at managing CPU/RAM/disk, but we've had trouble with "give this
task GPU 2". We had planned to try Mesos (given that we already run it for
other things), but it sounds like maybe we should take a harder look at
Kubernetes first.
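
For reference, on a recent Kubernetes cluster with the NVIDIA device plugin
installed, a GPU is just another schedulable resource; a minimal sketch using
the official Python client (the image and names here are placeholders):

    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="train-job"),       # hypothetical name
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:latest-gpu",     # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2"},           # ask the scheduler for 2 GPUs
                ),
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)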

------
freyr
> _Like much of the deep learning community, we use Python 2.7_

It's unfortunate: so much effort has been spent bringing tools up to speed
with Python 3, yet some groups still insist on dragging their feet. I
understand the motivation when we're talking about an established company with
a huge legacy code base, but within the research community it's kind of
embarrassing.

~~~
daveguy
Python 2.7 is the present and future of scientific computing with Python. If
there is one field in which a print statement is a critical feature, it is
quick, interactive analysis and prototyping. That won't change no matter how
much Guido and co. want it to change.

~~~
nahumfarchi
Curious here, what difference does it make if it's a statement or a function?

~~~
daveguy
4 extra keystrokes. If you could add a statement I would add "p" and change my
muscle memory to use that. For me it doesn't even need the ">>" syntax, just
quick display. Rapid prototyping needs rapid feedback.

Edit: it's not just the quickness of a single statement. It's that the print
statements are about 30-50% of the code when you are working this way.

~~~
patcallier
To be fair we've ingested a fair amount of python 2.7 research code for our
Python 3 codebase and the print statements are the quickest of fixes. There
are rarer actual gotchas, but 2to3 catches a fair number of them. We only
switched for the machine learning project I'm working on right now as an
experiment, but it's surprising how smoothly it's gone.
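
As a trivial example of the kind of mechanical rewrite 2to3 does to those
print statements (illustrative snippet):

    import sys

    epoch, loss = 3, 0.42  # illustrative values

    # Python 2.7:  print "epoch", epoch, "loss", loss
    # after 2to3:
    print("epoch", epoch, "loss", loss)

    # Python 2.7:  print >> sys.stderr, "warning: diverging"
    # after 2to3:
    print("warning: diverging", file=sys.stderr)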

~~~
daveguy
I'm not talking about it being a bug that needs a fix. That is easy enough in
existing code. Im talking about when you are using Python like Matlab or
Mathematica. Analyzing data and quickly viewing the results or subsets of the
results.

~~~
ebalit
You should probably use Jupyter notebook if you don't use it already. It's
great for exploratory coding like data analysis. And as the last evaluated
expression of a code block is automatically printed, no need for a print
statement.
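
For example, in a notebook cell (made-up data):

    import numpy as np

    results = np.random.randn(1000)

    # as the last expression in the cell, this is displayed automatically -- no print needed
    results.mean(), results.std()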

------
vonnik
TensorFlow is actually pretty slow and problematic on large clusters outside
the Google Cloud, probably because that's not what it was designed for.

For Java/Scala people, Deeplearning4j has a pretty sophisticated Spark + GPUs
setup:

[http://deeplearning4j.org/gpu](http://deeplearning4j.org/gpu)

[http://deeplearning4j.org/spark](http://deeplearning4j.org/spark)

[http://deeplearning4j.org/spark-gpus](http://deeplearning4j.org/spark-gpus)

[Disclosure: I help create DL4J, and it's supported by my startup, Skymind.]

~~~
tlb
How does DL4J training scale across >8 GPUs?

~~~
dragandj
I don't know, but I am curious what percentage of people outside Google,
Facebook, and the like need to scale _their_ models to more than 8 GPUs.

~~~
tlb
Many of the models people are building here, such as generative image models,
take a few days to train (say, 100 hours) on our 4-GPU boxes. Research would
be faster if we could train on 400 GPUs in one hour (the same total GPU-hours,
assuming perfect scaling), but the communication bandwidth required makes it
hard to scale.

~~~
emcq
What is being shared between GPUs?

Training data is easy to duplicate in a share-nothing fashion.

Large models with shared weights get tricky, but less frequent asynchronous
updates with schemes like Hogwild seem to work with SGD. I believe TF has
support for this too. It won't scale linearly, but it might be good enough.
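
(A toy sketch of the Hogwild idea, just to show the shape of it: workers apply
gradient updates to shared weights with no locking and tolerate the occasional
lost update.)

    import threading
    import numpy as np

    w = np.zeros(1000)          # shared weights, updated without any lock
    lr = 0.01

    def worker(seed, steps=1000):
        rng = np.random.RandomState(seed)
        for _ in range(steps):
            i = rng.randint(len(w))       # pretend each minibatch touches one coordinate
            grad = w[i] - rng.randn()     # toy gradient of a quadratic toward a random target
            w[i] -= lr * grad             # racy update; Hogwild tolerates the occasional clobber

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()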

There's some excitement about synthetic gradients to allow less communication
and further parallelism.

The HPC community has certainly leveraged 400-GPU clusters.

It seems like a fun problem, and if you want to focus the resources, there
isn't anything insurmountable about utilizing 400 GPUs :)

~~~
dharma1
The DeepMind paper on synthetic gradients is super interesting for training in
clusters. I hope it gets built into TF soon.

Also really hope there will be more options for GPU hardware on public clouds.

Eventually the pieces will come together and it will be trivial to spin up 400
cloud GPUs for an hour to run the load your local 4-GPU workstation would
spend 100 hours on, but we are definitely not there yet.

We are working on a packaging format called snappy
([http://www.snapcraft.io](http://www.snapcraft.io)) and are starting to talk
to the TF guys about packaging TF with it (Kubernetes is already packaged as a
snap) - hopefully this will take some of the pain away once it's working.

------
josh_carterPDX
"Top performance thus requires top-of-the-line GPUs."

Would be curious to see the data around the economics of the different
options.

~~~
komali2
Especially metrics comparing cost of getting your own vs offloading to AWS or
whatever. When is the "break even" point for buying your own?

~~~
visarga
So, a single GTX 1080 deep learning box would come to around $1,500. If you
pay $0.70/hr for your cloud server, you should buy once you'd use more than
1500 / 0.7 ≈ 2,142 hours. In other words, if you need more than about 90 days
of GPU time, you should probably buy your own box. Of course, if the cloud
server is slower than a GTX 1080, then the benefit is multiplied.
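
(The same break-even arithmetic as a quick script, using those assumed prices:)

    BOX_COST = 1500.0    # assumed GTX 1080 box, USD
    CLOUD_RATE = 0.70    # assumed cloud GPU price, USD per hour

    break_even_hours = BOX_COST / CLOUD_RATE
    print("break-even: %.0f hours (~%.0f days of continuous use)"
          % (break_even_hours, break_even_hours / 24))
    # break-even: 2143 hours (~89 days of continuous use)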

But ... your own box would not be scalable. You'd still need AWS to speed up
training.

~~~
coredog64
AFAIK, AWS is stuck several architecture revisions behind Pascal (most things
I see say it's still a Kepler GK104). At a best guess, that 1080 is probably
2x faster than any single GPU AWS instance.

------
cs702
On a related note, I'm running a poll on deep learning frameworks:
[https://news.ycombinator.com/item?id=12391744](https://news.ycombinator.com/item?id=12391744)

------
mitbal
Very interesting article, but I guess the scale is not for everyone. 1,600 AWS
GPUs? I'll be lucky if my infra request for a g2.8xlarge is approved.

