Mega Benchmark: Cost of AWS vs. GCE vs. IBM vs. Hetzner for Machine Learning (rare-technologies.com)
95 points by Radim 11 months ago | 40 comments

Here’s a cache: http://webcache.googleusercontent.com/search?q=cache:pxXMUXY...

Looks like the instances they use are not GPU instances, so I'm not sure what the point of this is. Also they do not take into account EC2 spot pricing, which saves a ton of money and is the main reason I use AWS over alternate clouds.

Have you tried GCP's Preemptible VMs? They are like Spot but without the bidding. If you have, what do you like better about Spot?


(I work for GCP)

FYI you no longer have to bid for AWS spot.


Oh good to know! Looks like a similar model to GCE.

Thanks for the heads-up, I had not heard of that. Then again GCP is not an option for me until they support Windows and C#.

Google Compute Engine instances are virtual machines, so you can put any OS on them.

In fact they have preconfigured Windows instances: https://cloud.google.com/compute/docs/quickstart-windows

Thanks, this is news to me. But then again I haven't looked at Google Cloud Platform in years. I guess I should check it out again and see how it's evolved.

Here is the C# support: https://cloud.google.com/dotnet/

Since you work for GCP, can you give us any insight into whether Google will be offering any of the Volta line of GPUs?

I misread your first line as "here's the catch." :). I feel like your second paragraph delivered on my misreading.

Do you have any numbers on that take? Or just a general feel that it works out?

The specific projects I'm working on basically process data from an SQS queue. I have a bunch of spot instances set to launch at the minimum prices. If the prices jump (which they very rarely do) I simply have less throughput until the prices come back down. I also have a t2 instance as the web frontend. I'm significantly more budget-constrained than time-constrained; these are basically side projects.

For specific numbers, the AllThePeople.net web scraping and profile compilation runs on Windows m1.small instances. The normal price is 8 cents per hour each, while a spot instance costs me only a couple cents per hour. I set the majority of my spot requests to the minimum possible price. I leave a couple at a higher price point of 8 cents per hour, so at least I have some work getting done if the prices go up. Those numbers are mostly from memory; it's been a while since I fiddled with it much.
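The pattern described above, interchangeable workers draining a shared queue so that losing a worker only reduces throughput, can be sketched with stdlib stand-ins (queue.Queue in place of SQS; all names here are illustrative, not the actual AllThePeople.net code):

```python
import queue
import threading

# Stand-in for an SQS queue; in the real setup each worker is a spot instance.
work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker(stop_event):
    """A spot-instance-like worker: drains the queue until told to stop.
    If a worker disappears (spot price spike), the rest keep going."""
    while not stop_event.is_set():
        try:
            item = work_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        with results_lock:
            results.append(item * 2)  # placeholder for real processing
        work_queue.task_done()

stop = threading.Event()
workers = [threading.Thread(target=worker, args=(stop,)) for _ in range(3)]
for t in workers:
    t.start()

for n in range(10):
    work_queue.put(n)
work_queue.join()   # blocks until every item is processed, however many workers remain
stop.set()
for t in workers:
    t.join()

print(sorted(results))  # every queued item was handled exactly once
```

The key property is that throughput, not correctness, depends on the worker count, which is what makes the workload tolerant of spot interruptions.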

Here are your numbers for spot instance pricing[1].

Machine learning stuff will typically happen in batches; OpenAI even went through the effort to publish a k8s library for the task[2] of autoscaling clusters to minimize cost and manage it.

[1] https://aws.amazon.com/ec2/spot/instance-advisor/
[2] https://github.com/openai/kubernetes-ec2-autoscaler
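The autoscaler linked above is cluster-specific, but the core scale-to-queue-depth idea is simple. A hypothetical sketch (function name and parameters invented for illustration):

```python
import math

def desired_workers(queue_depth, items_per_worker_hour, target_drain_hours,
                    min_workers=0, max_workers=100):
    """How many workers to run so the backlog drains within the target window.

    queue_depth: number of pending batch jobs/items
    items_per_worker_hour: measured throughput of one worker
    target_drain_hours: how soon the backlog should be cleared
    """
    if queue_depth == 0:
        return min_workers  # scale to (near) zero between batches
    needed = queue_depth / (items_per_worker_hour * target_drain_hours)
    return max(min_workers, min(max_workers, math.ceil(needed)))

# e.g. 1000 pending items, 50 items/worker-hour, drain within 2 hours:
print(desired_workers(1000, 50, 2))  # 10
```

Real autoscalers layer interruption handling and cooldowns on top, but this captures why batch ML workloads pair well with cheap, transient capacity.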

Thanks! I meant more if you had numbers on how you managed your workflow to be able to take advantage of this pricing. I'm assuming there is no magic "this is the cheaper option" for most folks. Instead, you have to have a pipeline built up that can take advantage of this style of pricing.

It is ironic that sometimes those pipelines are themselves cheaper to build up than to build on fixed-cost alternatives. But I don't think it is a given. (I would be surprised, but not shocked, to find I am wrong on that.)

Hi guys, sorry for the unresponsive site. I had to add crap to the end of the URL so I could resubmit here. But now without caching, the HN traffic is melting our server :( EDIT: fixed.


Answering the comments here:

GPUs: makes no sense for word2vec (and many other ML algos).

Azure: no support from Microsoft for our program. The Docker image and all code is 100% public though, so please run yourself & report the results. We'll be happy to update.

AWS spot: might still try that.

Thanks for the feedback!

Hi Radim, I understand that it is hard to stay on top of things in this fast-evolving market, but you omitted a few important things.

1. "Softlayer is also the only platform that allows the provisioning of bare metal[...]": so does AWS: https://aws.amazon.com/about-aws/whats-new/2017/11/announcin...

2. "Softlayer and AWS charge in hourly increments." AWS offers per-second billing: https://aws.amazon.com/blogs/aws/new-per-second-billing-for-...

3. Echoing @tedivm's comment, for a real comparison you'd really have to include GPUs. Do you plan to write a follow-up? (Disclaimer: I work for AWS)

Hi Julian, happy to get feedback!

1. AWS launched this last week, after the benchmarks were finished & published -- can't take the blame for that :) Exciting news though.

2. Launched during the benchmarks -- sorry we missed this!

3. I disagree that CPUs are "not real", but yes, we already ran some GPU benchmarks (alluded to in this blog post, too). Stay tuned.

Not blaming you at all, Radim - it's not easy to stay on top of the latest developments in this fast-moving space. This was great work and I hope you will continue running these benchmark tests. Also, feel free to ping me via sameting@amazon.com if you need any help with anything AWS.

It may not make sense for the model you chose, but realistically speaking, if you want a real comparison of machine learning platforms you need to do so on GPUs. The major competitive advantage that AWS has over GCE is the Volta GPUs, which are significantly faster than what any other provider is offering.

> We found IBM’s Softlayer to be highly “new-user” friendly, which greatly expedited the process of configuring and ordering an instance

That's surprising. Usually the complaint is that IBM's stuff is complicated to figure out while Google and others are "user-friendly".

> Softlayer is also the only platform that allows the provisioning of bare metal servers amongst the 3 cloud service providers,

Wonder if there is an opportunity there to deploy Kubernetes on it and manage it on your own. Cloud orchestration used to be the super-secret money-making sauce, but lately, with some solutions being open source, maybe that's changing and bare-metal servers will become popular again. Wonder in general if Google talking about Borg and sponsoring Kubernetes was aimed at weakening AWS's position.

>> Softlayer is also the only platform that allows the provisioning of bare metal servers amongst the 3 cloud service providers

Amazon recently announced bare-metal servers


> ... deploy Kubernetes on it and manage it on your own.

Yep. IBM Cloud Private, a self-managed K8s-based orchestration platform, can be deployed on bare metal.

From the article: "We gratefully acknowledge support by AI Grant and IBM Global Entrepreneurship program, which made these (sometimes costly) experiments possible."

They seem to have ignored the spot market on AWS, which is actually a huge reason people do sizeable ML work on AWS.

(about GCE) "An unsatisfactory customer support experience only compounds their problems"

... sounds about right

> With regards to stability, I had no issues with either AWS, Softlayer or Hetzner. However, when running a long task (the benchmark with Tensorflow-GPU) on GCE, the job abruptly stopped after a few days without showing any errors, and I couldn’t figure out why. A repeated attempt also resulted in the same failure.

I encountered the same error with Google Cloud ML Engine. No errors or warnings when shutting down. Any idea why?

Funny how Word2Vec on TensorFlow performs much slower than most implementations.

I am skeptical about using it; I've had better performance with Keras/Theano on CPU-only workloads.

As a matter of curiosity, is anyone out there using dedicated hosts in a big way for ML? The price/performance is pretty good based on these stats.

(I've used Hetzner in the past--in my experience their hosts are not very reliable but are super cheap.)

People are always talking about spot pricing when GPU compute comes up which is valid but not for the ML use-case (which this article is discussing). Spot is 100% incompatible with ML training. You need to be guaranteed that your many hour/multiple day job is going to complete which is antithetical to spot. At Paperspace, our goal is to provide a steady state low price that you can depend on because our primary audience is ML. Spot would be more ideal from an efficiency perspective but it’s just not viable in this context.

Now that is patently false. Maybe you should start with a disclosure that you work for a competitor before spreading such misinformation. Hourly checkpointing is a thing and is built into frameworks such as TensorFlow, making it very easy to save your work just before interruption (that is, IF you are interrupted because your spot price is too low) and then resume whenever you'd like, i.e. when prices fall back down below your max threshold. I've compared costs with Paperspace and it's not even close. The cost savings on AWS are huge, especially on the most expensive GPU instances, which is what we machine learners use.
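The checkpoint-and-resume pattern described here can be shown framework-agnostically (TensorFlow has its own checkpoint APIs; this sketch uses invented names and pickle purely to illustrate the idea):

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_state.pkl")  # illustrative path
if os.path.exists(CKPT):
    os.remove(CKPT)  # start the demo from scratch

def save_checkpoint(state, path=CKPT):
    # Write to a temp file and rename, so an interruption mid-save
    # can never leave a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": 0.0}  # fresh start

def train(total_epochs, interrupt_after=None):
    state = load_checkpoint()
    for epoch in range(state["epoch"], total_epochs):
        state["weights"] += 0.1           # placeholder for a real training step
        state["epoch"] = epoch + 1
        save_checkpoint(state)            # hourly in practice; every epoch here
        if interrupt_after is not None and state["epoch"] >= interrupt_after:
            return state                  # simulated spot interruption
    return state

train(10, interrupt_after=4)   # spot instance reclaimed after 4 epochs
final = train(10)              # a new instance resumes from the saved state
print(final["epoch"])          # 10
```

With this shape of loop, a reclaimed spot instance costs at most one checkpoint interval of work, which is the basis of the cost argument above.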

Look, as a consumer of gpu compute, I want there to be as much competition in this space as possible, so I root for you guys, but as it stands you're nowhere even close.

No Azure???

anyone got a cache? I'm getting 502's

Definitely the best mirror.


Where is the C5 in here? It's cheaper and faster than the C4 instance.

The C5 instances only went GA a couple weeks ago. It takes a lot of lead time to get results like this up, and there have been a few slight snags users have reported with the C5 instances that may impact getting fair benchmarks written.

I wish these were available in all regions.
