
Mega Benchmark: Cost of AWS vs. GCE vs. IBM vs. Hetzner for Machine Learning - Radim
https://rare-technologies.com/machine-learning-hardware-benchmarks/?from=hn
======
OkGoDoIt
Here’s a cache:
[http://webcache.googleusercontent.com/search?q=cache:pxXMUXY...](http://webcache.googleusercontent.com/search?q=cache:pxXMUXY7iKcJ:https://rare-technologies.com/machine-learning-hardware-benchmarks/&num=1&client=safari&hl=en&gl=us&prmd=ivn&strip=1&vwsrc=0)

Looks like the instances they use are not GPU instances, so I’m not sure what
the point of this is. Also, they do not take into account EC2 spot pricing,
which saves a ton of money and is the main reason I use AWS over alternative
clouds.

~~~
thesandlord
Have you tried GCP's Preemptible VMs? They are like Spot but without the
bidding. If you have, what do you like better about Spot?

[https://cloud.google.com/preemptible-vms/](https://cloud.google.com/preemptible-vms/)

(I work for GCP)

~~~
OkGoDoIt
Thanks for the heads-up, I had not heard of that. Then again, GCP is not an
option for me until they support Windows and C#.

~~~
5ersi
Google Compute Engine instances are virtual machines, so you can put any OS on them.

In fact they have preconfigured Windows instances:
[https://cloud.google.com/compute/docs/quickstart-windows](https://cloud.google.com/compute/docs/quickstart-windows)

~~~
OkGoDoIt
Thanks, this is news to me. But then again, I haven’t looked at Google Cloud
Platform in years. I guess I should check it out again and see how it’s
evolved.

------
Radim
Hi guys, sorry for the unresponsive site. I had to add crap to the end of the
URL so I could resubmit here. But now without caching, the HN traffic is
melting our server :( EDIT: fixed.

__________

Answering the comments here:

GPUs: they make no sense for word2vec (and many other ML algos).

Azure: no support from Microsoft for our program. The Docker image and all
code are 100% public though, so please run it yourself & report the results.
We'll be happy to update.

AWS spot: might still try that.

Thanks for the feedback!

~~~
JulianRaphael
Hi Radim, I understand that it is hard to stay on top of things in this fast-
evolving market, but you omitted a few important things.

1\. "Softlayer is also the only platform that allows the provisioning of bare
metal [...]": so does AWS: [https://aws.amazon.com/about-aws/whats-new/2017/11/announcin...](https://aws.amazon.com/about-aws/whats-new/2017/11/announcing-amazon-ec2-bare-metal-instances-preview/)

2\. "Softlayer and AWS charge in hourly increments." AWS offers per-second
billing: [https://aws.amazon.com/blogs/aws/new-per-second-billing-for-...](https://aws.amazon.com/blogs/aws/new-per-second-billing-for-ec2-instances-and-ebs-volumes/)

3\. Echoing @tedivm's comment, for a real comparison you'd really have to
include GPUs. Do you plan to write a follow-up? (Disclaimer: I work for AWS)
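To illustrate why the per-second billing in point 2 matters, here is a small back-of-the-envelope sketch. The hourly rate used is a made-up example figure, not a price from the article or AWS, and the 60-second minimum reflects AWS's stated per-second billing policy:

```python
# Back-of-the-envelope: whole-hour vs. per-second billing for a short job.
# HOURLY_RATE is a hypothetical on-demand price, purely for illustration.
import math

HOURLY_RATE = 0.796  # USD/hour, assumed example rate

def cost_hourly_increments(seconds):
    """Bill in whole-hour increments (round up to the next full hour)."""
    hours_billed = math.ceil(seconds / 3600)
    return hours_billed * HOURLY_RATE

def cost_per_second(seconds):
    """Bill per second of actual use (AWS applies a 60-second minimum)."""
    return max(seconds, 60) / 3600 * HOURLY_RATE

# A 65-minute job: hourly increments charge 2 full hours,
# per-second billing charges only the 65 minutes actually used.
job = 65 * 60
print(round(cost_hourly_increments(job), 3))  # 1.592
print(round(cost_per_second(job), 3))         # 0.862
```

For short benchmark runs like the ones in the article, the gap between the two billing models can approach a full hour's cost per run.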

~~~
Radim
Hi Julian, happy to get feedback!

1\. AWS launched this last week, _after_ the benchmarks were finished &
published -- can't take the blame for that :) Exciting news though.

2\. Launched _during_ the benchmarks -- sorry we missed this!

3\. I disagree that CPUs are "not real", but yes, we already ran some GPU
benchmarks (alluded to in this blog post, too). Stay tuned.

~~~
JulianRaphael
Not blaming you at all, Radim; it's not easy to stay on top of the latest
developments in this fast-moving space. This was great work and I hope you
will continue running these benchmark tests. Also, feel free to ping me via
sameting@amazon.com if you need any help with anything AWS.

------
rdtsc
> We found IBM’s Softlayer to be highly “new-user” friendly, which greatly
> expedited the process of configuring and ordering an instance

That's surprising. Usually the complaint is that IBM's stuff is complicated to
figure out while Google and others are "user-friendly".

> Softlayer is also the only platform that allows the provisioning of bare
> metal servers amongst the 3 cloud service providers,

I wonder if there is an opportunity there to deploy Kubernetes on it and
manage it on your own. Cloud orchestration used to be the super-secret
money-making sauce, but lately, with some solutions being open source, maybe
that's changing and bare metal servers will become popular again. I also
wonder, in general, whether Google talking about Borg and sponsoring
Kubernetes was aimed at weakening AWS's position.

~~~
ktta
>> Softlayer is also the only platform that allows the provisioning of bare
metal servers amongst the 3 cloud service providers

Amazon recently announced bare-metal servers

[https://aws.amazon.com/blogs/aws/new-amazon-ec2-bare-metal-i...](https://aws.amazon.com/blogs/aws/new-amazon-ec2-bare-metal-instances-with-direct-access-to-hardware/)

------
code4tee
They seem to have ignored the spot market on AWS, which is actually a huge
reason people do sizeable ML work on AWS.

------
isoprophlex
(about GCE) "An unsatisfactory customer support experience only compounds
their problems"

... sounds about right

------
likelynew
> With regards to stability, I had no issues with either AWS, Softlayer or
> Hetzner. However, when running a long task (the benchmark with Tensorflow-
> GPU) on GCE, the job abruptly stopped after a few days without showing any
> errors, and I couldn’t figure out why. A repeated attempt also resulted in
> the same failure.

I encountered the same issue with Google Cloud ML Engine. No errors or
warnings when shutting down. Any idea why?

------
cva_1
This direct link works for me -- [https://rare-technologies.com/machine-learning-hardware-benc...](https://rare-technologies.com/machine-learning-hardware-benchmarks/)

------
raverbashing
Funny how Word2Vec on Tensorflow performs much slower than most
implementations.

I'm skeptical about using it; I've had better performance with Keras/Theano
on CPU-only loads.

------
hodgesrm
As a matter of curiosity, is anyone out there using dedicated hosts in a big
way for ML? The price/performance is pretty good based on these stats.

(I've used Hetzner in the past--in my experience their hosts are not very
reliable but are super cheap.)

------
dkobran
People are always talking about spot pricing when GPU compute comes up, which
is valid, but not for the ML use case (which this article is discussing). Spot
is 100% incompatible with ML training. You need a guarantee that your
many-hour/multi-day job is going to complete, which is antithetical to spot.
At Paperspace, our goal is to provide a steady-state low price that you can
depend on, because our primary audience is ML. Spot would be more ideal from
an efficiency perspective, but it's just not viable in this context.

~~~
rayuela
Now that is patently false. Maybe you should start with a disclosure that you
work for a competitor before spreading such misinformation. Hourly
checkpointing is a thing, and it's built into frameworks such as Tensorflow,
making it very easy to save your work just before interruption (that is, IF
you are interrupted because your spot price is too low) and then resume
whenever you'd like, i.e. when prices fall back below your max threshold. I've
compared cost with Paperspace and it's not even close. The cost savings on AWS
are huge, especially on the most expensive GPU instances, which are what we
machine learners use.

Look, as a consumer of gpu compute, I want there to be as much competition in
this space as possible, so I root for you guys, but as it stands you're
nowhere even close.
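The checkpoint-and-resume pattern described above isn't TensorFlow-specific. A minimal stdlib sketch of the idea, with illustrative names and a pickle-based format (real frameworks provide their own mechanisms, e.g. TensorFlow's `tf.train.Checkpoint` or Keras's `ModelCheckpoint` callback):

```python
# Sketch of spot-friendly training: periodically persist state so an
# interrupted job resumes from the last checkpoint instead of restarting.
# Names, the dict-based "state", and the pickle format are illustrative,
# not any framework's actual API.
import os
import pickle

CKPT = "checkpoint.pkl"

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "loss": None}

def save_checkpoint(state):
    """Write to a temp file, then atomically rename, so an interruption
    mid-write can't leave a corrupt checkpoint behind."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def train(total_epochs=5):
    state = load_checkpoint()
    for epoch in range(state["epoch"], total_epochs):
        state["loss"] = 1.0 / (epoch + 1)  # stand-in for a real training step
        state["epoch"] = epoch + 1
        save_checkpoint(state)             # survive a spot interruption here
    return state
```

If the process is killed after epoch 3 of 5, the next invocation of `train()` picks up at epoch 3 rather than epoch 0, which is the property that makes spot/preemptible instances usable for long jobs.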

------
outside1234
No Azure???

------
mrep
anyone got a cache? I'm getting 502's

~~~
ploggingdev
[https://archive.fo/B98Ft](https://archive.fo/B98Ft)

~~~
aliljet
Definitely the best mirror.

------
penteston
Also, if you're searching for servers near Asia and Europe, use dedicer.com;
they have good options for dedicated servers.

~~~
jensv
???

dedicer.com has been connecting our visitors with providers of Craps, Dice,
Dice Games and many other related services for nearly 10 years. Join thousands
of satisfied visitors who found Discount Toys, Fun Games, Game Magazines,
Games, and Games And Puzzles.

------
Thaxll
Where is the C5 in here? It's cheaper and faster than the C4 instance.

~~~
devonkim
The C5 instances only went GA a couple of weeks ago. It takes a lot of lead
time to get results like this up, and there have been a few slight snags users
have reported with the C5 instances that may impact getting fair benchmarks
written.

