
Building your own deep learning computer is 10x cheaper than AWS - walterbell
https://medium.com/the-mission/why-building-your-own-deep-learning-computer-is-10x-cheaper-than-aws-b1c91b55ce8c
======
dagw
You're forgetting the cost of fighting IT in a bureaucratic corporation to get
them to let you buy/run non-standard hardware

Much easier to spend huge amounts of money on Azure/AWS and politely tell them
it's their own fucking fault when they complain about the costs. (What, me? No,
I'm not bitter, why do you ask?)

~~~
VHRanger
They seriously can't buy a graphics card and slap it in the PCIe slot?

~~~
drewg123
When I worked at Google, I really missed the dual monitor setup I had had at
my previous job. I asked my manager how to get a 2nd monitor. Apparently,
since I had the larger monitor, I was not allowed to get a 2nd one without all
kinds of hassle. I asked if I was allowed to just buy one from Amazon and plug
it in, and I was told no. I finally just grabbed an older one that had been
sitting in the hallway of the cube-farm next to us for a few days, waiting to
be picked up and re-used. I'm sure somebody's inventory sheet finally made
sense 2 years later when I quit, and they collected that monitor along with
the rest of my stuff.

~~~
grogenaut
When I started at Amazon they gave us dual 22" monitors. I bought dual 27"
dells and a gfx card to back them and plugged it in. I also explained how much
I was swapping with 8GB and a VM (45 minutes of lost productivity a day), and
that RAM was $86. My manager happily expensed RAM for me and the rest of the
team. 2 years later that was the standard setup. Now I have 7 monitors and 3
PCs from multiple projects and OSes and hardware re-ups and interns. None
funded by me. The Dells are happily on my quad tree at home.

Amazon also allows you to BYOC and image it. Images are easily available. Bias
for action gets you far.

~~~
xfitm3
TIL Bias for action.

~~~
gebeeson
As did I and the excellent rabbit hole that it led to.

------
cameldrv
Great post. From someone who's done a few of these, I'll make a few
observations:

1. Even if your DL model is running on GPUs, you'll run into things that are
CPU bound. You'll even run into things that are not multithreaded and are CPU
bound. It's valuable to get a CPU that has good single-core performance.

2. For DL applications, NVMe is overkill. Your models are not going to be
able to saturate a SATA SSD, and with the money you save, you can get a bigger
one, and/or a spinning drive to go with it. You'll quickly find yourself
running out of space with a 1TB drive.

3. 64GB of RAM is overkill for a single-GPU server. RAM has gone up a lot in
price, and you can get by with 32 without issue, especially if you have less
than 4 GPUs.

4. The case, power supply, motherboard, and RAM are all a lot more expensive
for a properly configured 4-GPU system. It makes no sense to buy all
of this supporting hardware and then only buy one GPU. Buy a smaller PSU, less
RAM, a smaller case, and buy two GPUs from the outset.

5. Get a fast internet connection. You'll be downloading big datasets, and it
is frustrating to wait half a day for something to download before you can get
started.

6. Don't underestimate the time it will take to get all of this working.
Between the physical assembly, getting Linux installed, and the numerous
inevitable problems you'll run into, budget several days to a week to be
training a model.

~~~
sabalaba
Regarding number 6: the Linux deep learning framework installation can happen
in one line with Lambda Stack:

[https://lambdalabs.com/lambda-stack-deep-learning-software](https://lambdalabs.com/lambda-stack-deep-learning-software)
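
Whichever way you install the stack, a quick sanity check along these lines
(plain PyTorch, just a generic example, not part of Lambda Stack itself)
catches most driver/CUDA problems before you lose a day to them:

    # Generic post-install sanity check: is the GPU visible and usable?
    import torch

    print(torch.__version__)              # framework version
    print(torch.cuda.is_available())      # True only if driver + CUDA work
    print(torch.cuda.get_device_name(0))  # e.g. "GeForce GTX 1080 Ti"

    # Tiny matmul on the GPU to confirm kernels actually launch.
    x = torch.randn(1024, 1024, device="cuda")
    print((x @ x).sum().item())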

------
leon_sbt
I ran into this exact issue about 2 years ago about build vs rent. Ultimately
I chose build.

Here's my thoughts/background:

Background: Doing small scale training/fine tuning on datasets. Small time
commercial applications. I find renting top shelf VM/GPU combos on the cloud
to be psychologically draining. Did I forget to shut off my $5-an-hour
VM during my weekend camping trip? I hate it when I ask myself questions like
that.

I would rather spend the $2k upfront and ride the depreciation curve, than
have the "constant" VM stress. Keep in mind, this is for a single instance,
personal/commercial use rig.

I feel that DL compute decisions aren't black/white and should be approached
in stages.

Stage 0: If you do full time computer work at a constant location, you should
try to own a fast computing rig. DL or not. Having a brutally quick computer
makes doing work much less fatiguing. Plus it opens up the window to
experimenting with CAD/CAE/VFX/Photogrammetry/video editing. (4.5ghz i7 8700k
+32gb ram +SSD)

Stage 1: Get a single 11/12 gb GPU. 1080TI or TitanX (Some models straight up
won't fit on smaller cards). Now you can go on Github and play with random
models and not feel guilty about spending money on a VM for it.

Stage 2: Get a 2nd GPU. Makes writing/debugging multi-GPU code much
easier/smoother (see the sketch after this list).

Stage 3: If you need more than 2 GPUs for compute, write/debug the code
locally on your 2-GPU rig. Then beam it up to the cloud for 2+ GPU training.
Use preemptible instances if possible for cost reasons.

Stage 4: You notice your cloud bill is getting pretty high ($1k+/month) and you
never need more than 8x GPUs for anything that you're doing. Start the build
for your DL runbox #2. SSH/container workloads only. No GUI, no local dev.
Basically server-grade hardware with 8x GPUs.

Stage 5: I'm not sure, don't listen to me :)
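
For Stage 2, a minimal PyTorch data-parallel sketch (toy model and fake data,
just to show the shape of the code) is enough to start debugging locally; the
same script runs unchanged on 1 or 2 GPUs:

    # Toy data-parallel training step: splits each batch across visible GPUs.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # wraps the model for multi-GPU batches
    model = model.cuda()

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(256, 512).cuda()          # fake batch
    y = torch.randint(0, 10, (256,)).cuda()   # fake labels

    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()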

~~~
kwillets
I'm thinking of starting a prepaid cloud service -- once your $20 is gone it
shuts everything off.

------
anaxag0ras
You can also get GPU instances at much cheaper rates from Hetzner and OVH.

[https://www.hetzner.de/dedicated-rootserver/ex51-ssd-gpu](https://www.hetzner.de/dedicated-rootserver/ex51-ssd-gpu)

[https://www.ovh.com/world/public-cloud/instances/prices/](https://www.ovh.com/world/public-cloud/instances/prices/)

~~~
ovi256
Only in non-North American datacenters. In NA, Nvidia can enforce their driver
license, which prohibits use of consumer GPUs in datacenters.

A nice advantage of non-consumer GPUs is their bigger RAM size. Consumer GPUs,
even the newest 2080 Ti, have only 11 GB. Datacenter GPUs have 16 GB or 32 GB
(V100). This is important for very big models. Even if the model itself fits,
small memory size forces you to reduce batch size. Small batch size forces you
to use a smaller learning rate and acts as a regularizer.
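
One common workaround (not a full substitute for more VRAM) is gradient
accumulation: run several small forward/backward passes before each optimizer
step so the effective batch size is larger. A rough PyTorch sketch with a toy
model and fake data:

    # Gradient accumulation: simulate a batch of accum_steps * 16 examples on
    # a GPU that can only hold 16 at a time.
    import torch
    import torch.nn as nn

    model = nn.Linear(512, 10).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Fake micro-batches standing in for a real data loader.
    loader = [(torch.randn(16, 512), torch.randint(0, 10, (16,)))
              for _ in range(8)]

    accum_steps = 4  # effective batch size = 4 * 16 = 64
    opt.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = loss_fn(model(x.cuda()), y.cuda())
        (loss / accum_steps).backward()  # scale so accumulated grads average
        if (i + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad()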

~~~
tyingq
_" Only in non-North American datacenters"_

OVH offers these in their Canadian data center.

------
psergeant
> Nvidia contractually prohibits the use of GeForce and Titan cards in
> datacenters. So Amazon and other providers have to use the $8,500 datacenter
> version of the GPUs, and they have to charge a lot for renting it.

~~~
tfolbrecht
Curious about the MTBF (mean time between failures) of a GeForce/Titan
series GPU under continuous utilization in datacenter conditions vs a desktop
computer with intermittent usage. I don't want to believe Nvidia is just out
to stiff cloud providers. Maybe it's to protect themselves from warranty
abuse?

~~~
mastax
It's market segmentation. It doesn't need to be more than that.

------
dkobran
NVIDIA is attempting to separate enterprise/datacenter and consumer chips to
justify the cost disparity. Specifically, they're introducing memory,
precision, etc. limits which have major performance implications for GeForce,
and there's also the EULA which has been mentioned here. That said, everything
on AWS comes at a premium, as they're making the case that on-demand scale
outweighs the pain of management/CapEx. This premium is especially noticeable
with more expensive gear like GPUs. At Paperspace
([https://paperspace.com](https://paperspace.com)), we're doing everything we
can to bring down the cost of cloud and, in particular, the cost of delivering
a GPU. Not all cloud providers are the same :)

Disclosure: I work on Paperspace

~~~
Zero2Ten
Paperspace is great for the price (especially for storage) and easy for
deployment. However, the customer support is horrible. Turnaround time is
around a week, and sometimes they seem to have billing issues. One time I got
charged around $100 for trying out a GPU machine for 2 hours (which is the
number of hours shown on the invoice), and it took them three weeks to
eventually issue a refund. That made me decide to switch. Hope you guys can
hire more people for customer support or have a better solution for issues
like this.

------
julvo
For me, one of the main reasons for building a personal deep learning box was
to have fixed upfront cost instead of variable cost for each experiment. Not
an entirely rational argument, but I find having fixed cost instead of
variable cost promotes experimentation.

~~~
syphilis2
This is how I feel as well, especially in a work context. It's liberating to
have hardware freely available for use. I'm dreading the day when our
computing grid is made more efficient (smaller and tolled).

------
oneshot908
It's been this way since day 1. NVLINK remains the only real Tesla
differentiator (although mini NVLINK is available on the new Turing consumer
GPUs, so WTFever). But because none of the DL frameworks support intra-layer
model parallelism, all the networks we see tend to run efficiently in data
parallel: doing anything else would make them communication-limited, and they
aren't communication-limited because data scientists end up building networks
that aren't, chicken-and-egg style.

I continue to be boggled that Alex Krizhevsky's One Weird Trick never made it
to TensorFlow or anywhere else:

[https://arxiv.org/abs/1404.5997](https://arxiv.org/abs/1404.5997)

I also suspect that's why so many thought leaders consider ImageNet to be
solved, when what's really _solved_ is ImageNet-1K. That leaves ~21K more
outputs on the softmax of the output layer for ImageNet-22K, which to my
knowledge, is still not solved. A 22,000-wide output sourced by a 4096-wide
embedding is 90M+ parameters (which is almost 4x as many parameters as in the
entire ResNet-50 network).

All that said, while it will always be cheaper to buy your ML peeps $10K quad-
GPU workstations and upgrade their consumer GPUs whenever a brand new shiny
becomes available, be aware NVIDIA is very passive-aggressive about this,
following some strange magical thinking that this is OK for academics but not
OK for business. My own biased take is that it's the right solution for anyone
doing research, and the cloud is the right solution for scaling it up for
production. Silly me.

~~~
cameldrv
I think that the reason no one implements Krizhevsky's OWT (at least in normal
training scripts, there's nothing stopping you from doing this in TensorFlow)
is that the model parallelism in OWT is only useful where you have more
weights than inputs/outputs to a layer. This was true for the FC layers in
AlexNet, but hardly anyone uses large FC layers anymore.

~~~
p1esk
Model parallelism is also useful in situations where your model (and/or your
inputs) is so large that even with batch_size=1 it does not fit in GPU memory
(especially if you're still using a 1080 Ti). However, other techniques might
help here (e.g. gradient checkpointing, or dropping parts of your graph to
INT8).
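
For anyone unfamiliar with gradient checkpointing: it trades extra compute for
memory by not storing intermediate activations and recomputing them during the
backward pass. A minimal toy sketch of what that looks like in PyTorch:

    # Activations inside each checkpointed block are recomputed on backward
    # instead of being kept in memory.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
    block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

    x = torch.randn(8, 1024, requires_grad=True)
    h = checkpoint(block1, x)
    out = checkpoint(block2, h)
    out.sum().backward()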

------
LeonM
Own hardware is always cheaper to _buy_ than using a cloud service, but
keeping it running 24/7 involves substantial costs. Sure, if you run a solo
operation, you can just get up during the night to nurse your server, but at
some point that no longer makes sense to do.

Somewhere along the way we forgot about this and it's now perfectly normal to
run a blog on a 3-VM GKE Kubernetes cluster, costing 140 EUR/month.

~~~
vidarh
I used to manage hardware in several datacentres, and I'd usually visit the
data centres a couple of times _a year_. Other than that we used a couple of
hours of "remote hands" services from the datacentre operator. Overall our
hosting costs were about 30% of what the same capacity would have cost on AWS.
Once a year I'd get a "why aren't we using AWS" e-mail from my boss, update
our costing spreadsheets and tell him I'd happily move if he was ok with the
costs, as it would have been more _convenient_ for me personally, and every
year the gap came out just too ridiculously huge to justify.

In the end we started migrating to Hetzner, as they finally got servers that
got close enough to be worth offloading that work to someone else. Notably
Hetzner reached _cost parity_ for us; AWS was still just as ridiculously
overpriced.

There are certainly scenarios where using AWS is worth it for the convenience
or functionality. I use AWS for my current work for the convenience, for
example. And AWS can often be cheaper than buying hardware. But I've never
seen a case where AWS was the cheapest option, or even one of the cheapest,
even when factoring in everything, unless you can use the free tier.

AWS is great if you can justify the cost, though.

~~~
mlrtime
When you create the spreadsheet, do you price in running servers 24x7 or using
elastic capacity?

~~~
vidarh
We had too little variation in load over the day to make elastic usage cost
effective for the most part, so it made very little difference. Indeed, one of
the most cost effective ways of using AWS is to not use it, but be ready to
use it to handle traffic spikes. Do that and you can load your dedicated
servers much closer to capacity, while almost never having to spin any
instances up.

------
syntaxing
There's also a great series on STH where they build a couple of different
configurations, from a budget version (~$800) to a more expensive version
(~$16,000):

[1] [https://www.servethehome.com/deeplearning02-sub-800-ai-machine-learning-cuda-starter-build/](https://www.servethehome.com/deeplearning02-sub-800-ai-machine-learning-cuda-starter-build/)

[2] [https://www.servethehome.com/deeplearning11-10x-nvidia-gtx-1080-ti-single-root-deep-learning-server-part-1/](https://www.servethehome.com/deeplearning11-10x-nvidia-gtx-1080-ti-single-root-deep-learning-server-part-1/)

------
scosman
Missing 1 important point: ML workflows are super chunky. Some days we want to
train 10 models in parallel, each on a server with 8 or 16 GPUs. Most days
we're building data sets or evaluating work, and need zero.

When it comes to inference, sometimes you wanna ramp up thousands of boxes for
a backfill, sometimes you need a few to keep up with streaming load.

Trying to do either of these on in-house hardware would require buying way too
much hardware which would sit idle most of the time, or seriously hamper our
workflow/productivity.

~~~
Jack000
On the other hand, this comparison accounts for the full cost of the rig,
while a realistic comparison should consider the marginal cost. Most of us
need a PC anyway, and if you're a gamer the marginal cost is pretty close to
zero.

------
bogomipz
>"Nvidia contractually prohibits the use of GeForce and Titan cards in
datacenters. So Amazon and other providers have to use the $8,500 datacenter
version of the GPUs, and they have to charge a lot for renting it."

I wonder if someone might provide some clarification on this. Is this to say
that resellers are only bound if they buy directly from Nvidia and sign some
agreement with them? How else would it be legal for Nvidia to dictate how and
where someone is allowed to use their product? Thanks.

~~~
uryga
Another comment in this thread said that it's due to the license on Nvidia's
drivers. So technically you _can_ use the hardware in a datacenter, just not
with the official drivers. Unfortunately it seems that the open-source drivers
aren't usable for most datacenter purposes, so this effectively limits how you
can use the hardware (at least in North America, where they can enforce it).

~~~
bogomipz
Wow, so the drivers serve as a form of license key. I wonder how long before
AWS develops its own GPUs.

------
maxehmookau
While this post is probably correct in sheer dollar amount, it doesn't really
scale.

At scale, you need more than just hardware. It's maintenance, racks, cooling,
security, fire suppression etc. Oh, and the cost of replacing the GPUs when
they die.

At full price, yes, cloud GPUs on AWS aren't cheap. But spot instances, which
let you bid on unused capacity at potentially a 90% saving in some regions/AZs,
make cloud servers a much more attractive prospect for ML tasks that can be
split over multiple machines.

I think this post is conflating one physical machine with a fleet of
virtualised ones, and that's not really a fair comparison.

Also, the post refers to cloud storage at $0.10/GB/month, which is incorrect.
AWS HDD storage is $0.025/GB/month and S3 storage is $0.023/GB/month, which is
arguably more suited to storing large data sets.

~~~
kakwa_
And pretty much the same can be said of any AWS service.

The equivalent of an i3.metal is probably around $30,000 to $40,000 with Dell
or HP, and probably half that if self-assembled (like a Supermicro server). An
AWS i3.metal will cost $43,000 annually, so even more than the acquisition
cost of the server, and the server will probably last around 5 years.

But if you start taking into account all the logistics, additional skills,
people and processes needed to maintain a rack in a DC, plus the additional
equipment (network gear, KVMs, etc.), the cost win is far less evident, and it
also generally adds delays when product requirements change.

Fronting the capital can be an issue for many companies, especially the
smaller ones, and for the bigger ones, repurposing the hardware bought for a
failed project/experiment is not always straightforward.

~~~
mmt
> But if you start taking into account all the logistics, additional skills,
> people and processes needed to maintain a rack in a DC,

You've mostly described what one pays a datacenter provider, plus hiring
someone who has experience working with one (and other own-hardware vendors,
such as ISPs and VARs), which doesn't cost any more (and maybe less) than
hiring someone with equivalent cloud vendor expertise.

> plus the additional equipment (network gear, KVMs, etc.).

Although these are non-zero, they're a few hundred dollars (if that) per
server at scale, negligible compared to $20k.

> The cost win is far less evident

It still is: the extra costs usually brought up are rarely quantified, and,
when they are, they turn out to be minor (nowhere near even doubling the cost
of hardware plus electricity), whereas AWS can multiply it by 10 (as in the
very rough pricing example you provided).

> generally adds delays when product requirements change.

This is cloud's biggest advantage, but it's not directly related to cost. This
advantage can easily be mitigated by merely having spare hardware sitting
idle, which is, essentially, part of what one is paying for at a cloud
provider.

------
montenegrohugo
I really want to believe this. Of course the numbers given depend on _very_
frequent use of your machine, but still. One would imagine that renting from a
datacenter built at scale, only when you are actually training a model, would
be _much_ cheaper, but the reality appears to be otherwise.

So where does the money go?

Three places:

- AWS/Google/Whoever-you're-renting-from obviously get a cut

- Inefficiencies in the process (there are lots of engineers and DB
administrators and technicians and other people who have to get paid in the
middle.)

- Thirdly, and this is what most surprised me, NVIDIA takes a big cut.
Apparently the 1080 Ti and similar cards are consumer only, whilst datacenters
& cloud providers _have_ to buy their Tesla line of cards, with corresponding
B-to-B support and price tag ($3k-8k per card). [1]

So, given these three money-gobbling middlemen, it does seem to kinda make
sense to shell out $3,000 for your own machine, if you are serious about ML.

Some small additional upsides are that you get a blazing fast personal PC and
can probably run Crysis 3 on it.

[1] [https://www.cnbc.com/2017/12/27/nvidia-limits-data-center-uses-for-geforce-titan-gpus.html](https://www.cnbc.com/2017/12/27/nvidia-limits-data-center-uses-for-geforce-titan-gpus.html)

~~~
aranw
I'm guessing the costs around NVIDIA cards are why Google is pushing their own
TPUs and building custom chips and hardware?

------
gnufx
By coincidence I just posted
[https://news.ycombinator.com/item?id=18066472](https://news.ycombinator.com/item?id=18066472)
about the expense of AWS et al for NASA's HPC work. (Deep learning, "big data"
et al are, or should be, basically using HPC and general research computing
techniques, although the NASA load seems mostly engineering simulation.)

------
gnur
> Even when you shut your machine down, you still have to pay storage for the
> machine at $0.10 per GB per month, so I got charged a couple hundred dollars
> / month just to keep my data around.

Curious how that relates to sticking only a single 1 TB SSD in the machine, as
a couple hundred dollars per month should correspond to a couple of terabytes.

~~~
rcarmo
Yeah, that didn't add up for me either. Storage for a single machine shouldn't
add up to hundreds of dollars a month unless you're not really trying to
manage your data (by culling datasets, using compression, etc.).

------
w8rbt
I have seen sysadmins stand up a bunch of EC2 instances in AWS and install
Postgres and Docker on them (because the devs said they need a DB and a Docker
server). They don't get the services model (use RDS and ECS). Sysadmins have
to change. Orgs can't afford this cost, nor can they be slowed down by this
1990s mindset.

Standing up a bunch of EC2 instances in AWS is just a horrible idea and an
expensive one as well. It also moves all of the on-prem problems (patching,
backups, access, sysadmins as gatekeepers, etc.) to the cloud. It's absolutely
the wrong way to use AWS.

So stop sysadmins from doing that as soon as you notice. Teach them about the
services and how, when used properly, the services are a real multiplier that
frees everyone up to do other, more important things rather than babysitting
hundreds of servers.

~~~
malux85
Yeah that's nice and all, but then you have total and complete vendor lock-in.

I decided to do the EC2 thing when I built one of my products, knowing that I
couldn't have vendor lock-in, and that decision was _critical_ to the
survival of my company when:

1) A customer wanted to run on Azure. We would have lost a 2.5 million pound
contract if we couldn't do that.

2) Another customer wanted an on-prem solution. We would have lost a 55
million USD contract if we were vendor-locked to AWS.

So sometimes it makes sense.

~~~
w8rbt
The DB schema in RDS is identical to the DB schema on a Postgres server in
your data center and the RDS data dumps can be loaded into your very own
personal DB wherever you like.

The dockerfile used to build the containers in ECS can be pushed to any
registry and the resulting containers can run on any docker service.

What vendor lock in? I just don't buy that argument.

Use the services, Luke ;) It will cost much less and you can focus on other
things!

~~~
malux85
I thought you meant use all of the services, like Lambda, SQS instead of Kafka
or Rabbit, S3 for blobs instead of HDFS or a filesystem.

Sure RDS can be swapped out easily, but what if I'm using 10 or 20 services?

------
brownbat
Reminds me of this post about how much of a gap there is between crypto mining
and renting a box out to researchers:

[https://news.ycombinator.com/item?id=16663689](https://news.ycombinator.com/item?id=16663689)

We're way overdue for a p2p marketplace for cycles.

------
maaark
> There’s one 1080 Ti GPU to start (you can just as easily use the new 2080 Ti
> for Machine Learning at $500 more — just be careful to get one with a blower
> fan design)

I don't believe there are any blower-style 20-series cards. The reference
cards use a dual-fan design.

~~~
_Wintermute
The ASUS TURBO models are blower style, no idea how hot they'll run though.

------
mullen
Try spot instances, you'll save a ton of money.

------
skywhopper
So this is definitely an interesting article with some good ideas. But the
main thrust isn't particularly interesting. It's always going to be true that
building and operating your own is cheaper than using "the cloud"... so long
as you are making use of it much of the time; and if you have the resources
and facilities to build, operate, troubleshoot, and replace the hardware; and
if you are sure about your long-term needs. But the decision isn't always that
straightforward for most potential users.

Also, look into using S3 for long-term storage, instead of leaving your stuff
cold in EBS volumes. It's quite a bit cheaper.

~~~
candiodari
Except in this case, if you use it once a week, you're clearly coming out
ahead.

That's getting pretty extreme.

------
code4tee
Purely on hardware, yes, it's no secret that AWS costs more, quite a bit more,
than just buying/building the machine and plugging it in. That's true for any
server, not just "deep learning" servers.

Of course you’re also paying for everything else AWS brings and the ability to
spin up/down on demand with nearly unlimited scalability which is hardly
“free.” AWS is also a very profitable business for Amazon so they’re making
good margin too on most of their pricing.

~~~
Const-me
> AWS is more, quite a bit more, than just buying/building the machine and
> plugging it in.

Last time I looked, for mid-range AWS instances, purchase price was about 6-12
months of the rent. That’s assuming you buy comparable servers, i.e. Xeon, ECC
RAM, etc…

For GPGPU servers, however, the purchase price is only 1-2 months of Amazon
rent. Huge difference, even though the 1080 Ti is very comparable to the P100:
the 1080 Ti is slightly faster (10.6 TFLOPS versus 9.3), while the P100 has
slightly more VRAM (12/16 GB versus 11).

------
pimmen
You know, all the negatives of building your own machine assume you run it
24/7, and we sure as hell don't. We run these models maybe a couple of times a
week, but we expect fast results when we do, and that's unchanged.

I will bring this up with the rest of my data engineering team, this might be
a good idea.

~~~
flukus
> Assuming your 1 GPU machine depreciates to $0 in 3 years (very
> conservative), the chart below shows that if you use it for up to 1 year,
> it’ll be 10x cheaper, including costs for electricity

You can invert that to 3 years of use at 33% utilization, which would come out
cheaper (or is my maths broken? See the quick check below). It still doesn't
sound like it'd be a good match for your usage, though.

That 3 years is extremely conservative, though; in reality it would probably
be much longer, and there are some potential upgrade paths to factor in as
well.
Not to mention the potential to use it for other purposes.
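
Quick check of the inversion, using the rough $3k build cost and $3/hr cloud
rate quoted elsewhere in this thread (both are assumptions, not exact prices):

    # 1 year at 100% utilization vs 3 years at ~33%: same GPU-hours, so the
    # same hypothetical cloud bill against the same fixed build cost.
    build_cost = 3000.0        # rough cost of the rig, USD
    cloud_rate = 3.0           # rough USD per GPU-hour, comparable instance
    hours_per_year = 24 * 365

    for years, util in [(1, 1.0), (3, 1 / 3)]:
        gpu_hours = years * hours_per_year * util
        print(f"{years} yr @ {util:.0%}: {gpu_hours:.0f} GPU-hours, "
              f"cloud ~${gpu_hours * cloud_rate:,.0f} vs build ${build_cost:,.0f}")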

------
Walkman
Any hardware can be cheaper than AWS if you build it yourself; that's not the
point. The ongoing cost of maintaining and improving it, and the necessary
knowledge, might not be. Also, instead of building your own hardware, you can
build the app instead. I thought these were obvious to anyone.

------
wimbledon
Hardware is the easy part. Setting up and running the software stacks,
uptime, etc. is the tricky part.

~~~
michaelanckaert
Exactly. I recently had a discussion on the costs of an RDS setup on AWS. Yes,
renting a dedicated server, or buying a server and setting up replicated
MySQL, is much cheaper than the monthly cost of two RDS instances. Until you
add the labour cost of setting up, maintaining and monitoring it.

------
paradite
This works if your machine learning is just for data analytics purposes.

The article somehow completely ignores integration with other services to
query or update the model, which requires some API to be hosted on a static IP
or domain, as well as the DevOps process.

~~~
thisisjeffchen
Yeah the main use case here is training, which is much more of an offline
process. For inference, you will probably want a cloud provider. Though the
cost difference still holds, so maybe a startup idea.

~~~
mr_toad
Inference can usually be done on much cheaper hardware, or using Lambda.

------
flatb
Can anybody fill me in on why people these days are spending almost as much
money on a motherboard as on a last-gen, medium-high-end, 12-core, 24-thread
CPU? What features would an almost $400 motherboard have that could justify
that price?

~~~
paulmd
Threadripper is actually not even that great here: it's NUMA (half of the
lanes are attached to each die), and if you just want NUMA you can do a dual-
socket E5-2650 system for like half the price. The same logic applies to TR
here:

[https://www.servethehome.com/how-intel-skylake-sp-changes-impacted-single-root-pcie-deep-learning-servers/](https://www.servethehome.com/how-intel-skylake-sp-changes-impacted-single-root-pcie-deep-learning-servers/)

You're not using the CPU for anything; there is no reason to have a beefy
12-core CPU there. If you really want Threadripper, the 8-core version is
fine.

Again, personally I'd look at X79 boards, since they are pretty cheap and you
can do up to 4 GPUs off a single root, depending on the board. There are new-
production boards available from China on eBay, see ex: "Runing X79Z". Figure
about $150 for the mobo, about $100 for the processor, and then you can stack
in up to 128 GB of RAM, including ECC RDIMMs, which runs about $50 per 16 GB.

There are some Z170/Z270 boards like the "Supercarrier" from ASRock that
include PLX chips to multiplex your x16 lanes up to four x16 slots (this does
not increase bandwidth, it just allows more GPUs to share it at full speed!).
They also will not support ECC (you need a C-series chipset for that, which
runs >$200). So far most OEMs have been avoiding putting out high-end boards
for Z370 because they know Z390 is right around the corner, so there is
currently no SuperCarrier available for the 8000 series.

------
peter303
Did the Stanford grad include that his and other grad students' time is worth
at least $100 an hour on the open market, i.e. what a Stanford CS grad would
make in Silicon Valley (c'est moi)?

~~~
mmt
This is only applicable if they can _actually_ sell very small increments of
their time for that amount of money, which I strongly suspect they can't.
There have been numerous threads here on HN about the difficulties of finding
full-price work that isn't full time.

Even then, the $2k difference between the cheapest pre-built the article
references and the DIY version would be 20 hours at $100/hr. Half a workweek
to build one PC seems excessive.

------
justifier
seems logical.. i would argue the inflection point in the ubiquity of pooled
compute services came as bandwidth needs increased due to more businesses
interacting directly, remotely and in real time, with customers, be they
consumers or other businesses

if compute needs were all internal, as they are with desktop apps.. namely,
compilation.., and with the majority of computationally expensive machine
learning demands.. namely, training.. then i'd argue the pooled compute model
would have remained niche

------
jl2718
Does anybody know how Cloud ML on TPUs compares to BYO 2080?

------
TeeWEE
You only need such a machine for short time periods most of the time, so
renting is much easier and cheaper.

~~~
thisisjeffchen
Hi! Thank you for the comment, I'm the writer of the article. When we did our
training we actually needed the computer for months at a time. It takes 1-2
months to tune a model and we were running experiments almost 24/7. So one
project put us at the break-even point for building.

~~~
tlear
Was it possible to parallelize the process? It seems like training on a few
dozen cloud GPUs and finishing much faster would save a lot of money in terms
of engineer time.

------
payne92
This analysis assumes 100% utilization: keeping the machine busy 24/7/365. It
also ignores AWS price drops and the value of being able to switch to new AWS
GPU instance types as they become available.

TL;DR: The proposed machine costs $3k plus roughly $100-200/month for
electricity; a comparable AWS instance is currently about $3/hr.

My conclusion: if you're going to do more than 1,000-2,000 hours of GPU
computing in the next few years, start thinking about building a machine.
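
As a very rough break-even sketch using those numbers (my own toy model;
electricity treated as a flat monthly cost, AWS price drops ignored):

    # Hours of GPU time needed before owning beats renting, for a few spans.
    build_cost = 3000.0       # machine, USD
    power_per_month = 150.0   # electricity, USD (midpoint of ~$100-200 guess)
    aws_rate = 3.0            # USD/hour for a comparable AWS instance

    for months in (6, 12, 24):
        breakeven_hours = (build_cost + power_per_month * months) / aws_rate
        print(f"over {months} months: ~{breakeven_hours:.0f} GPU-hours to break even")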

~~~
Nimitz14
If you just run something for 3 days every 2 weeks, using your numbers that
comes out to roughly $432 per month. Just one month! And that's still not
including storage costs etc. So after about 6 months you're even. And you can
still sell those GPUs in 1-2 years for a decent chunk of money when you want
to buy the next generation.

The cloud only really makes sense for the on-demand burst capability IMO.

------
treve
I feel this probably also assumes that your time is worthless.

------
sytelus
TL;DR: the two-month rental cost of a GPU machine in AWS (and elsewhere) is
the same as the purchase price of similarly powered hardware. Why? Even in a
fiercely competitive market, no cloud provider wants to capture share here.
One thing is that the author is comparing against a high-end GPU machine with
V100s, which cost way too much for the privilege of being put in a data
center, compared to a consumer-grade 1080 Ti.

------
maltalex
I don't see what's surprising here. This is the ski rental problem [0].

Would it surprise anyone to learn that renting a car is more expensive in the
long run compared to buying one? This is the same thing, only the time scales
are different.

[0]:
[https://en.wikipedia.org/wiki/Ski_rental_problem](https://en.wikipedia.org/wiki/Ski_rental_problem)

~~~
yazr
Because the cloud has economies of scale and benefits from specialization
(cheaper power, bulk buying prices, remote land prices, etc.).

Hence, some of these savings should be passed on to consumers and it should be
cheaper.

The counter-argument is that the cloud is over-engineered (redundancy,
backups, better software) for enterprise customers. You are paying for all
these benefits whether you want to or not.

~~~
swebs
I've seen just the opposite. A VPS from a cloud provider is often 2-3x the
cost of an equivalent Linode.

~~~
desdiv
I think your pricing information might be slightly out of date. AWS
Lightsail's $5 plan is equivalent or better than Linode's $5 plan. Plus AWS
Lightsail has an even more barebones plan at $3.5 per month.

Two years ago you would be perfectly correct. The cheapest AWS option back
then would be a micro EC2 instance at around $20 per month IIRC.

~~~
_hyn3
Lightsail is basically a toy to compete with Vultr/DO/OVH/etc. You can't even
rename a server or change its metadata. You can't resize a server (except for
taking a snapshot and building a new server). Essentially, you can't make any
changes to a server and there are zero tools for organizing, tagging, or even
adding a description field. Of course, you cannot manage Lightsail instances
using the AWS console (or even see them). All of the powerful AWS tools are
invisible to you as well, and your entire Lightsail infrastructure is
invisible from within AWS itself (except for the bill).

Lightsail is really great for personal sandbox servers, or for moving to AWS
on the cheap, but it's not even as capable as its worst competitor (and
certainly cuts out a _lot_ of features compared to real AWS.)

The worst part is probably the surprising and seemingly minor factor of not
being able to simply rename or resize a server. This makes using it in a team
or production environment surprisingly difficult.

With that all said, I still find Lightsail very useful for small prototyping
and other fast jobs (or a Lambda replacement -- it's _way_ cheaper than
running any reasonably busy Lambda function), but I can't see using it for
real production usage. Your best bet for real production usage, or something
that will someday move into production, would probably be Vultr/DO/Linode.
(OVH really wants to be great, and it should be because the hardware is really
great, but the dashboard and billing are so bizarrely bad.)

AWS is good when you have a virtually unlimited budget or don't mind running
up the bill while you're getting rolling, but getting locked into AWS will
mean that you are stuck with it forever. It's very hard to migrate away if you
start using a vendor-specific thing like DynamoDB (to be fair, Google Firebase
has exactly the same issue; instances are basically fungible, but proprietary
data stores are not).

~~~
snaky
> getting locked into AWS will mean that you are stuck with it forever

I wonder why that's almost never discussed, as if that obvious downside were
non-existent.

------
d--b
Of course building is cheaper than AWS...

It also means you need to build the computer, maintain it, upgrade it, store
it somewhere, connect it to the internet, and so on.

And you can't just scale it if you need to.

I mean, the cloud's here for a reason...

~~~
amelius
> Of course building is cheaper than AWS...

Why "of course"? Last time I checked baking my own cookies is more expensive
than just buying a bag at Walmart.

~~~
raverbashing
> Last time I checked baking my own cookies is more expensive than just buying
> a bag at Walmart.

Yes, but most likely they aren't of comparable quality

~~~
TeMPOraL
Quite likely, the Walmart ones are better, at least when it comes to taste and
looks. Unless you're an expert baker. And even then your cookies probably
still aren't better; you will just claim they are, and it's polite to agree
with that assessment. In fact, it's _polite_ in our society to agree that any
half-good DIY attempt has better quality than the commercial equivalent, even
though it doesn't stand up to scrutiny.

~~~
snaky
Only with taste and looks. Maybe.

> Even Walmart’s “Great Value 100% Whole Wheat Bread” contains seven
> ingredients that Whole Foods considers “unacceptable”: high fructose corn
> syrup, sodium stearoyl lactylate, ethoxylated diglycerides, DATEM,
> azodicarbonamide, ammonium chloride, and calcium propionate.

[http://www.slate.com/articles/life/culturebox/2014/02/whole_...](http://www.slate.com/articles/life/culturebox/2014/02/whole_foods_and_walmart_how_many_groceries_sold_at_walmart_would_be_banned.html)

------
mishurov
A GPU accelerates training considerably. For development it's very convenient
to use a Linux laptop with Nvidia Optimus. I was able to train a quite complex
model on the laptop's Nvidia GPU (as an optirun process) while continuing to
work on the Intel SoC's GPU. However, GPU memory is crucial; it's better to
choose a laptop with 4 GB or more of video RAM.

~~~
sgt101
I've found that many models require 8GB or more; the T1080 cards are "just
about" good enough

------
quantumhobbit
Does this factor in the cost of lost productivity from having an awesome
gaming PC always available?

------
jcwayne
I have a hard time taking seriously any analysis that states x is n-times
smaller than y, with neither being a negative number. That's not how math
works.

~~~
quickthrower2
It is if you define cheapness as 1/price. So $10 is 0.1 cheap, and $100 is
0.01 cheap. So $10 is 10x cheaper than $100.

------
known
But not commercially viable due to
[https://en.wikipedia.org/wiki/Economies_of_scale](https://en.wikipedia.org/wiki/Economies_of_scale)

