Hacker News
Benchmarking TensorFlow on Cloud CPUs: Cheaper Deep Learning Than Cloud GPUs (minimaxir.com)
258 points by myth_drannon 8 months ago | 104 comments

Disclosure: I work on Google Cloud (and launched Preemptible VMs).

Thanks for the write-up, Max! I want to clarify something though: how do you handle and account for preemption? As we document online we've oscillated between 5 and 15% preemption rates (on average, varying from zone to zone and day to day) but those are also going to be higher for the largest instances (like highcpu-64). But if you need training longer than our 24-hour limit, or you're getting preempted too much, that's a real drawback (Note: I'm all for using preemptible for development and/or all batch-ey things but only if you're ready for the trade-off).

While we don't support preemptible with GPUs yet, it's mostly because the team wanted to see some usage history. We didn't launch Preemptible until about 18 months after GCE itself went GA, and even then it involved a lot of handwringing over cannibalization and economics. We've looked at it on and off, but the first priority for the team is to get K80s to General Availability.

Again, Disclosure: I work on Google Cloud (and love when people love preemptible).

> how do you handle and account for preemption?

I do most of my experiments with Jupyter Notebooks and Keras on top of TensorFlow. Keras has a ModelCheckpoint callback (https://keras.io/callbacks/#modelcheckpoint) which saves a model to disk after each epoch and is super easy to implement (1 LOC), and a good idea even if I weren't training on a preemptible instance. In the event of an unexpected preemption, I can just retransform the data (easy with a Jupyter-organized workflow), load the last-saved model (1 LOC) and resume training.

The drawback there is if the epochs are long, which risks losing more progress than wanted to a preemption.
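A minimal sketch of the checkpoint-and-resume pattern described above, in plain Python rather than Keras. The `train` function, the `state` dictionary and the JSON file are hypothetical stand-ins for a real model and for whatever `ModelCheckpoint` writes; the point is only that a restarted process picks up at the last completed epoch.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weights": 0.0}


def save_state(path, state):
    """Write to a temp file, then rename, so a preemption mid-write
    leaves the previous checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path, total_epochs, stop_after=None):
    """Run epochs, checkpointing after each one.

    stop_after simulates a preemption after that many epochs in this call.
    """
    state = load_state(path)
    done_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated preemption
        state["weights"] += 1.0  # placeholder for one epoch of real training
        state["epoch"] += 1
        save_state(path, state)
        done_this_run += 1
    return state
```

Calling `train(path, 5, stop_after=2)` and then `train(path, 5)` finishes the remaining three epochs without redoing the first two. At most one epoch of work is lost per preemption, which is exactly why long epochs hurt.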

It's really odd that that Keras API's interval is measured in epochs (which is a different wall-clock interval for every model/dataset/hardware configuration). It's much more common to checkpoint based on a time interval.
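For comparison, a time-interval trigger could look roughly like the sketch below. `TimedCheckpointer`, `save_fn` and the injectable `clock` are illustrative names, not part of Keras or TensorFlow; the class only decides when to save and leaves the actual saving to the caller.

```python
import time


class TimedCheckpointer:
    """Calls save_fn at most once per interval_s seconds of wall-clock time."""

    def __init__(self, save_fn, interval_s, clock=time.monotonic):
        self.save_fn = save_fn
        self.interval_s = interval_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_save = clock()

    def maybe_save(self, step):
        """Call after every training step; returns True if a checkpoint was written."""
        now = self.clock()
        if now - self.last_save >= self.interval_s:
            self.save_fn(step)
            self.last_save = now
            return True
        return False
```

With `interval_s=600`, the work lost to a preemption is bounded by roughly ten minutes plus one step, regardless of how long an epoch takes on a given model or machine.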

Oh interesting, I've never seen checkpointing on a time interval. Most Torch examples just dump the model to disk after the epoch finishes.

One reason to use epoch checkpointing is that it ensures that all samples of the training data have been seen the same number of times. If your data is large and diverse, with heavy enough augmentation, it might not matter very much.

Shoutout for Hetzner's 99 euro/month server with a GTX 1080, much better than the pseudo-K80s that Google Cloud provides for $520/month. The Google K80s are half or quarter the speed of a real K80, part of the reason they show so badly in the comparison.


Just to reiterate barrus's (the Product Manager for K80s) point, K80s come with two dies per board, so we're giving you the granularity. We struggled with wording, but as both NVIDIA and AMD ping pong between GPUs with two dies per board as the best part versus one we didn't want to make the minimum granularity "a part sold by a vendor". So there's no conspiracy or half or quarter speed nonsense, just that it's probably not as clear as it should be that this is half a K80 board.

Disclosure: I work on Google Cloud.

You could survey your customers to see how widely this is understood. Likewise "cores"/hyperthreads.

Meanwhile the 10x price performance difference is the main point. Really eager to see the TPUs rolled out broadly, please do price them to take market share from NVIDIA

GTX 1080 GPU, i7-6700 Skylake, 64 GB DDR4 RAM, 2x500 GB 6 Gb/s SSDs for 99EUR/month with a one-time 99EUR setup fee.

My lord. HPC resources are incredibly affordable. Hetzner and some of the other dedicated server companies in Europe/Canada have some amazing deals (we've used OVH in the past with great success, and right now we use Paperspace for CPU intensive stuff we want to share expensive licensing on, like Visual3D).

Wow, to go off topic; I've been using Versaweb for the past 3-4 years, but became very unhappy after they forced a "server management" fee down our throats.

We've been looking at setting up a small cluster of servers at work (budget of about $500), and I was still going to go with Versaweb. After seeing Hetzner, I'm going to reassess, and likely move everything there.

I'm paying €150 for what it seems I could pay €100 for. There was something that made me decide against Hetzner a few years ago, but I'll research and see if their TOS are now different.

Thanks again!

EDIT: My numbers are wrong, I'm going to pay less for 4x the RAM (256GB)

Why are you comparing Hetzner and Versaweb? They exist in completely different markets.

How so? I understand that there are some differences between the two, but the fundamentals are the same; which is that I want a physical server that I can manage.

Their pricing structures are slightly different, Versaweb gives me a bit more flexibility when configuring, a wider IP subnet bundled (instead of 1 usable IP), and a few other things which I'm investigating.

I also have to consider laws and network latency as these are in different regions.

In the end, I am paying $180 for a Haswell Xeon with lots of disk space and IO. I could pay the same amount for more RAM on the same CPU, albeit with slightly less space.

If I keep the same setup at a fraction of the cost, I could end up getting the 1080 GPU on the same datacenter. It somehow feels like the same or similar market to my needs ...

>How so? I understand that there are some differences between the two, but the fundamentals are the same; which is that I want a physical server that I can manage.

They're on different continents, which is a pretty fundamental difference.

"incredibly affordable"

It could go a lot lower. Hetzner's profit margin on renting a server like this for 99€/month is formidable.

I'm sure they are, but there's turnover, customer service, attrition, obsolescence, etc that is all baked in, though clearly GTX 1080s will be valuable for some time, as will the i7 Skylake architecture.

Relative to the market, that price is very, very good.

How do you figure?

If I were to buy such a system it would be over €2000, and that's not including cooling or a case for it either. Granted I live in Sweden so taxes are a bit on the high side.

Regardless, that will be at least 20 months before they make a dime (assuming they can rent it 100% of the time). And in that time it will rack up costs for rack space, electricity, bandwidth (2 gbits and 50 TB per month) and a dedicated IP.
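The 20-month figure is just hardware cost divided by monthly rent. A small sketch of that arithmetic follows; the function name and the `monthly_running_cost` and `utilization` parameters are illustrative, and the one-time setup fee is ignored.

```python
def breakeven_months(hardware_cost, monthly_rent,
                     monthly_running_cost=0.0, utilization=1.0):
    """Months until rent collected covers the hardware purchase.

    monthly_running_cost (power, rack space, bandwidth) and utilization
    (fraction of time the box is rented) are optional refinements.
    """
    margin = monthly_rent * utilization - monthly_running_cost
    if margin <= 0:
        raise ValueError("rent never covers running costs at this utilization")
    return hardware_cost / margin


# Figures from the comment: a EUR 2000 machine rented at EUR 99/month.
months = breakeven_months(2000.0, 99.0)  # about 20.2 months
```

Any nonzero running cost or idle time pushes the break-even point further out, which is the commenter's argument.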

And after all that time that computer is not that hot anymore, but still draws just as much electricity regardless.

Just signed up and ported my model + data:

- It's indeed noticeably faster than the Google VMs. As usual, I compiled TensorFlow for this GPU vs the K80 (compute capability 6.1 vs 3.7).

- Ubuntu 16 minimal is indeed "minimal"! But it worked...

- The GTX 1080 (7.92GB) has less GPU RAM than the K80 (11.17GiB), which required me to reduce the model design slightly.

For my model/data, Hetzner runs 1 training epoch in 1 hr vs 1.75 hr for Google. I'm moving the rest of my work over tomorrow. When Google has TPUs available, I'll look at it again.
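To turn those epoch times into a rough cost per epoch, one can multiply by an hourly rate. The sketch below assumes a 730-hour month, a Hetzner box that is kept busy the whole time, the $0.70/hour single-die K80 price quoted elsewhere in this thread (GPU charge only), and no currency conversion between EUR and USD; the helper names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def monthly_to_hourly(monthly_price, hours_per_month=HOURS_PER_MONTH):
    """Effective hourly rate of a flat monthly price under continuous use."""
    return monthly_price / hours_per_month


def cost_per_epoch(hourly_rate, hours_per_epoch):
    """Cost of one training epoch at a given hourly rate."""
    return hourly_rate * hours_per_epoch


hetzner = cost_per_epoch(monthly_to_hourly(99.0), 1.0)  # EUR per epoch
google = cost_per_epoch(0.70, 1.75)                     # USD per epoch, GPU only
```

The flat-rate box only wins by this margin when it is actually used around the clock; at low utilization the effective hourly rate rises accordingly.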

thanks!! for the tip.

This is presumably just the full board versus half nomenclature noted above. But yes, consumer GPUs are way more cost competitive than Tesla class parts. Being able to train bigger models is valuable to some folks, but not everyone, so I don't begrudge using the GTX line.

Disclosure: I work on Google Cloud.

Unless I am mistaken, Tesla cards have ECC memory and consumer cards don't.

Should do some marketing on the disastrous effects it can have on the training.

Or.. just call it random dropout and market it as a feature.

(Also, I can no longer edit, but a colleague pointed out that I should have read more carefully. A GTX 1080 is a Pascal part, so compared to the poor old Kepler in K80s, it'll really shine. Volta all the more so in the next year).

Yeah, I also have a 30€/month Hetzner dedicated server with 2x TB HDD and 32 GB RAM. At the same time, at my company we sometimes pay up to $1000/month for a really weak AWS machine because of the costs for traffic and storage. Ridiculous, but.. Yeah... it's not my money.

Looked into this a bit more. The GTX 1080 is based on the Pascal architecture and so will be faster than any Kepler-based K80 on any cloud - even faster than a K80 card with 2 GPUs. The GTX is a consumer board and is less expensive than the datacenter equivalent P100 PCIe card. The P100 has 16 GB ram and HBM2 memory (twice the memory and more than twice the memory bandwidth) and supports ECC if you care about detecting memory corruption. The P100 will be faster than the GTX 1080 once it is available. As I said before, GCP offers K80 GPUs in passthrough mode and you can use a single K80 die ($0.70 / hour billed by the minute) or you can attach up to 8 K80 GPUs to a single VM. Disclaimer: I am a product manager for GPUs in Google Cloud.

The P100 is about 10x the price of the 1080 ($6000-9000 vs $500 for the 1080 and $700 for the 1080TI).

I've talked with several second-tier cloud providers, and the GTX 1080TI is what their large-deployment customers use. At the NVIDIA conference they were all promoting the P100 (NVIDIA insisted), but all admitted that nobody asked them to deploy P100s at scale.

The Hetzner box is about 0.15 an hour. That means more GPUs per developer.

> Google K80s are half or quarter the speed of a real K80

do you have any evidence for this?

Each Google K80 is one GPU or 1/2 of a K80 board, so technically you are correct that a Google K80 GPU is half of a K80 board. However, they are offered in passthrough mode and achieve full performance. If you want a whole K80 board, attach 2 K80 GPUs to a single VM. You can have 1, 2, 4 or 8 K80 GPUs attached to each VM in GCP. (I'm one of the GPU product managers at Google Cloud).

It would be polite to indicate that on the price list (understatement).

While you're here: the other reason we switched to Hetzner is reliability. Sure we can continue training from the last checkpoint but we still lost half a day on average for the many surprise reboots. We suspect that you've overbooked the GPUs and someone has to lose when too many connect.

The price list indicates that 1 instance is 1/2 a device, and so on: https://cloud.google.com/compute/pricing#gpus

Although I agree it is somewhat confusing in terms of performance.

thanks for the info. On a side note, do you know if GCP will ever support preemptible GPU instances?

According to this[1] comment it's on their mind but doesn't seem like a priority.

[1] https://news.ycombinator.com/item?id=14728476

They clearly state this, the K80 contains two "GPUs", when you get one K80 instance you only get one of those two GPUs, so you get half of a K80.

Wish HN had a 'save' feature so I can remember this comment when I need a GPU box.

Your browser has a dedicated feature for that. It's called a bookmark. /s

click on the timestamp, click "favorite"

New one for me also. Thanks!

Add it as a favourite? Click the timestamp.

Awesome, thanks!

If you upvote it, you can find it again through your profile. I'm on mobile but I think you can favourite comments too via the time stamp link, also shows on your profile

I do this too, but does anyone have a good way of searching through your own upvoted comments/stories?

A lot of times I can't find a comment/story I know I saved because I've upvoted pages upon pages more stuff since.

What I do is for really important comments, the timestamp link -> "favourite" functionality gets used. Much smaller list :)

You can hit the favorite link at the top of the thread so it's in your favorites.

individual comments have "favorite" as well -> just click on the timestamp to get to the single-comment view.


You can favorite the post. You'll just have to find the specific comment later.

Though I never got to use Hetzner, so I can't comment on how good they are, I ran into issues with them because of their convoluted process.

I was trying to calculate the total cost as their list price excludes VAT. It turns out they just booked the server for me and started sending invoices. Of course they allow you to cancel within 14 days but I was handling a personal issue so didn't check my emails for almost a month. It turned messy.

If Hetzner support are listening, please improve the process and if possible take credit card/payment details upfront so that the person is aware that you are spinning up the server for them.

That's not how you use cloud; with bare metal you pay 99e/month regardless of usage.

Being easily burstable doesn't matter nearly as much when the price is 10x higher.

If you need to run a heavy job on the spot, a bare metal company won't be able to deliver what you need. Only those cloud services have enough servers.

10x more seems a lot but it really depends on how you use it, it's no secret that cloud is more expensive.

That's surprisingly affordable. Commenting to save

One of the interesting variables in calculating ML training costs is developer time. The cost of a Data Scientist (or similar role) on an hourly basis will far outweigh the most expensive compute resource by several orders of magnitude. When you factor in time, the GPU immediately becomes more attractive. Other industries with heavy/time consuming computational workloads like CGI rendering have understood this for decades. It's difficult to attach a dollar sign to the value of speeding something up because it's not only about simply saving time itself but also about the way we work: Waiting around for results limits our ability to work iteratively, scheduling jobs becomes a project of its own, the process becomes less predictable etc.

Disclaimer: Paperspace team.

For training, that's likely to be true. For large scale inference it's not possible to beat CPUs right now if cost is a factor. You might be able to beat them once you can buy TPU access in cloud, depending on how steep a premium Google attaches to it.

While the author's article is relevant if you are stuck on GCP, on AWS you will not reach the same conclusion. This is because AWS has GPU spot instances (P2) which can be found for ~80% cheaper depending on your region [1]. Hopefully one day soon GCP will support preemptible GPU instances.

[1] https://aws.amazon.com/ec2/spot/pricing/

When I started it was even cheaper(~10% of reserved cost), but even now, it's pretty cheap.

I would love to see these results put up against Google's new TPUs[1]. While TPUs are still in Alpha, my guess is that customized hardware that understands TensorFlow's APIs would be a lot more cost effective.

[1] - https://cloud.google.com/tpu/

I've been amazed that more people don't make use of Google's preemptibles. Not only are they great for background batch compute, you can also use them to cut your stateless webserver compute costs. I've seen some people use k8s with a cluster of preemptibles and non-preemptibles.

something I've always been curious about (and if a Google Cloud Engineer could clear up - that would be great), is why we should not (as in, why does everyone not) use preemptible nodes (apart from maybe the 3 / 5 master nodes).

My question specifically being: if I configure a k8s cluster to have all my slaves as preemptible nodes...would GCP automatically add new nodes as my old nodes are deleted (from what I understand preemptible nodes are assigned to you for a max of 24 hrs)?

Considering the pricing of preemptible nodes + the discounts that GCP assigns to you for sustained use, it makes cloud insanely cheap for an early stage startup.

Google Cloud Developer Advocate here.

Go for it as long as you understand the downside. It's possible that all instances get preempted at once (especially at the 24hr mark), that there isn't capacity to spin up new preemptible nodes in the selected zone once the old instance is deleted, etc. New VMs also take time to boot and join the cluster.

If you are just doing dev/test stuff, I'd recommend using a namespace in your production cluster or spinning up and down test clusters on demand (which can be preemptible).

If you have long running tasks (like a database) or are serving production traffic, using 100% preemptible nodes is not a good idea.

Preemptible can be great for burst traffic and batch jobs, or you can do a mix of preemptible and standard to get the right mix of stability and cost.

If you don't mind me asking, what exactly is the role of a developer advocate?


What about spreading your K8s load across multiple instance types (given it is unlikely google runs out of all types at the same time). That plus historical modeling was the trick of a startup that Amazon acquired that promised to dramatically reduce compute cost, by using mostly spot instances.

Would those types of mitigations work similarly with Google's preemptible VMs?

Compute Engine doesn't really have "Instance Types" or "Instance Families" per se, just Core/Memory combinations. Larger machines have a higher chance of preemption though (according to the PM of PVMs who stated that on this thread).

There are a few interesting projects out there that do the kind of automation you are speaking of like these:

https://github.com/binary-com/gce-manager https://github.com/skelterjohn/prevmtable

Spreading multiple smaller machines over a multi-zone k8s deployment might help mitigate, but it will never solve all the issues.

Preemptibles are not available for GPU.

For research and experimentation what you need is your own DL box. It will pay for itself in a few months. You will feel better having your own reliable hardware that you don't share or have to pay by the minute, and that will impact the kind of ideas you are going to try.

Then you scale up to the cloud to do hyperparameter search.

Do you have any advice on getting your own box set up in a data center - constantly traveling...

Excellent write-up, kudos on going through all of that Max. Too bad Google will deprecate the preemptible instances as a result :P.

There is a notable CPU-specific TensorFlow behavior; if you install from pip (as the official instructions and tutorials recommend) and begin training a model in TensorFlow, you’ll see these warnings in the console:

FWIW I get the console warnings with the Tensorflow-GPU installation from pip, and I verified that it was actually using the GPU.

A question for those who've used TensorFlow on NVIDIA GPUs:

What range of GPU performance do you see? As in, if the card does 10 TFLOPS peak, does TensorFlow manage to reach that peak, or is it at 5% or 20% or some other percent of peak typically?
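For anyone wanting to measure this themselves, the percentage is achieved FLOP/s divided by the card's peak. The sketch below uses the standard count of about 2·m·k·n floating-point operations for an m×k by k×n matrix multiply; the function names are illustrative and the timing has to come from your own benchmark.

```python
def matmul_flops(m, k, n):
    """Approximate floating-point operations for an (m x k) @ (k x n) product:
    one multiply and one add per inner-product term."""
    return 2 * m * k * n


def percent_of_peak(flop_count, elapsed_s, peak_flops):
    """Achieved throughput as a percentage of the hardware's peak FLOP/s."""
    if elapsed_s <= 0 or peak_flops <= 0:
        raise ValueError("elapsed_s and peak_flops must be positive")
    return 100.0 * (flop_count / elapsed_s) / peak_flops
```

For example, a workload that performs 2e12 operations in one second on a card with a 10 TFLOPS peak is running at 20% of peak.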

And are there expectations for Google's new generation TPU? What range of peak performance do people expect to get?

Thank you for the benchmarks! It would be interesting to include Inception-v4 and Inception-ResNet in your research, and to try computing with Nvidia 1080 / 1080TI cards.

Our benchmarks for processing 1,000,000 images with ResNet-50:

- 8x Tesla K80: 43m 3 sec.

- 8x Nvidia 1080: 17m 32 sec ( 0.09 euro / minute ).
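Converting the two timings above into throughput and cost makes the comparison easier to read. This is only arithmetic on the quoted numbers; it assumes the 0.09 euro/minute rate applies to the 8x 1080 configuration, as the line suggests.

```python
IMAGES = 1_000_000


def to_seconds(minutes, seconds):
    """Convert a minutes/seconds timing to total seconds."""
    return minutes * 60 + seconds


k80_s = to_seconds(43, 3)    # 8x Tesla K80
gtx_s = to_seconds(17, 32)   # 8x Nvidia 1080

k80_throughput = IMAGES / k80_s   # images per second
gtx_throughput = IMAGES / gtx_s   # images per second
speedup = k80_s / gtx_s           # how many times faster the 1080s are

gtx_cost_eur = (gtx_s / 60) * 0.09  # cost of the run at 0.09 euro per minute
```

That works out to roughly 387 vs 951 images per second, a speedup of about 2.5x, and about 1.58 euro for the 1080 run.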

We can provide you resources for free for research.

Disclosure: I'm founder of LeaderGPU.

Paperspace has dedicated GPU instances for $0.40/hr, I'll have to compare with Hetzner...

Been very impressed with their customer service and their bandwidth availability. I'm getting 500/500 on speed tests to the west coast servers of theirs. I don't use their GPU instances but I do use their high power CPU instances for video rendering and V3D work.

Neat article. I think it's worth pointing out that this guy is an active commenter in the Hackathon Hackers facebook group, if you want to see more of his content. He can be pretty pretentious sometimes, but good content nonetheless.

You don't need to make a throwaway to call me pretentious. :P

I've seen him link some funny/ridiculous conversations that take place in that group (e.g. can people who develop Wix websites be considered web developers?), but unfortunately I can't see more content since I'm not on FB. If there's an archive of all the funny conversations, let me know.

Fascinating. Wish he could have shown benchmarks on a larger image database (ImageNet or CIFAR-100), as MNIST is extremely easy to train on. Great to know, especially the LSTM benchmarking.

As far as the 64 vCPU finding, that's quite possibly because it's crossing NUMA nodes. GCE's virtualization hides NUMA information unfortunately (at least as far as I've ever seen), so there's no way to handle this in software even.

Would be interesting to see these benchmarks on Haswell/Broadwell vs Skylake.

Quick question to those with a deep understanding of these things... I have not been able to get GPU TensorFlow (on AWS) to run faster for the networks I'm using.

This is with a small(ish) network of perhaps a few hundred nodes... should I see a speedup for this case, or are GPUs only relevant for large CNNs, etc.?

In theory, there's no reason that a GPU shouldn't be faster.

In practice, there's a multitude of reasons why CPUs are more efficient (or at least faster) for smaller networks.

It really depends on the type of things that you do - if your network is deep and has a lot of matrix multiplication, GPUs definitely do speed things up. Libraries like cuDNN have built in optimized convolution ops that will also make convolutions a lot faster.

In my experience (not tf related, I mainly work on my own library now: https://github.com/chewxy/gorgonia) even with a cgo penalty, deep networks do improve with GPU training. Never dabbled much in CNNs (convolutions tend to do my head in) so can't say much.

Correct. GPUs are not efficient with very small networks.

Any useful rules of thumb on when to use GPU?

> [slower on 32 and 64 core systems]

The library doesn't handle NUMA hardware?

would be interesting to see benefit of MKL optimizations on the same examples


No spot instances?

I kept the analysis to GCE only for simplicity (both because the costs of spot instances are highly variable, and because costs are not prorated on Amazon, meaning you have to pay for the full hour; an additional concern if you just want to run a small ad-hoc training).

If a spot instance terminates in the first hour you're not charged for it. You can grab spot blocks as well for specific duration workloads.

This entire article was about spot instances. On GCE they are called preemptible.

Didn't catch this thanks, spot instances have a bit better rate, but as he mentioned they are billed by hour :)

Yes, preemptibles are a much better deal with minute billing.
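The difference between the two billing schemes discussed above is easy to sketch. The functions below are a simplification: they ignore minimum charges and the rule mentioned earlier that a spot instance terminated in its first hour is not charged, and the rate in the example is hypothetical.

```python
import math


def billed_hourly(minutes_used, hourly_rate):
    """Cost when every started hour is charged in full."""
    return math.ceil(minutes_used / 60) * hourly_rate


def billed_per_minute(minutes_used, hourly_rate):
    """Cost when usage is rounded up to the minute and prorated."""
    return math.ceil(minutes_used) * hourly_rate / 60
```

For a 70-minute job at a hypothetical rate of 1.00 per hour, hourly billing charges 2.00 while per-minute billing charges about 1.17. The gap matters most for short ad-hoc runs and shrinks as jobs get longer.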

For shits and giggles I recently compared spot instance CPU miners of Monero vs spot instance GPU miners. I don't have the exact numbers on hand but IIRC the CPU miner was ~50% the cost in terms of $/hash.

FYI, y'all: cloud "cores" are actually hyperthreads. Cloud GPUs are single dies on multi-die card. If you use GPUs 24x7, just buy a few 1080 Ti cards and forego the cloud entirely. If you must use TF in cloud with CPU, compile it yourself with AVX2 and FMA support. Stock TF is compiled for the lowest common denominator.
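Before rebuilding with AVX2 and FMA it is worth checking that the VM's CPU actually reports them. A minimal sketch, assuming an x86 Linux guest where `/proc/cpuinfo` contains a `flags` line; the helper names are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, wanted=("avx2", "fma")):
    """List the wanted instruction-set flags that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]
```

On a Linux machine, passing the contents of `/proc/cpuinfo` to `missing_features` returns an empty list when both extensions are available. A binary compiled for instructions the CPU lacks will typically crash with an illegal instruction error, so this check is cheap insurance when instances can land on different CPU generations.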

This is very important if you're running any cpu intensive workload at scale. We had custom compiled x264 then custom compiled that into ffmpeg to get everything out of our CPUs for an encoding cluster. AMD cpus seem to really shine here.

You'd be surprised the difference it makes. It was one of the reasons I liked Gentoo, emerge would always build from source for your target CPU flags, instead of using the package managed "one size fits all" build. Those 5-10%s really compound when you add them up along all dependencies.

Kudos to tutorials and guides that instruct how to build from source.

The same is every bit as true today for your containers, assuming you have a homogeneous target to run them (yes I know, containers are supposed to be supremely portable, but private ones can be purpose built)

>We had custom compiled x264 then custom compiled that into ffmpeg to get everything out of our CPUs for an encoding cluster. AMD cpus seem to really shine here.

Can you tell me more about this? I wanted to switch to Ryzen architecture with my video transcoding project that handles large volume, but because we lean heavily on x264/ffmpeg, it didn't seem like a good idea given the AVX issues, keeping me on i7-based architecture. (Previous comments of mine will show the history of this particular thread.)

Would love to hear it here or via my throwaway: mike.anon@hotmail.com. Thank you so much.

This is especially important if most of your workload is matrix multiplication. Those workloads heavily benefit from vectorization. It might also help to enable Intel MKL, because Eigen, which TF uses by default is not the fastest thing out there, just the most convenient to work with cross platform.

Would hyperthreading be helpful or harmful?

Hyper threading is not harmful per se. It lets your CPU make forward progress when it would otherwise be stalled waiting for something. My issue is that they call hyperthreads "vCPU" which makes it seem like you're getting a full core, while in reality you're getting 60% of a core at most.

Hyper threading often is harmful when you use it, because while it does let your CPU make forward progress, it does that at the expense of e.g. cache that is evicted.

Obviously depends on your workload, but on my highly parallel "standard" workloads, my experience is that you can get at most 15% more with hyperthreading on (e.g. 4 cores/8 threads) compared to off (4 cores/4 threads), whereas on the cache intensive loads, I get 20-30% LESS with hyperthreading on.

I have never encountered such an abnormal workload. This is also less likely to happen in Broadwell Xeon and up, where last level cache can be partitioned. And this is also less likely to happen on Google Cloud in particular, because Google uses high end CPUs with tons of cache.

If both core threads are memory (and cache) intensive, then you get effectively half the cache size and half the memory bandwidth. Partitioning may make eviction less random, but the cache size is still halved, regardless of how much "tons of cache" you start with.

Increasing cache has the net effect of increasing hit ratio, sometimes substantially. With 20MB per die this may change the calculation of where things drop off. I have found that I can't reliably predict how a chip will perform, so I just wrote a bunch of benchmarks and it takes me about half an hour to see if the chip performs better or worse than I thought it would. Google's Broadwell VMs perform very well.

vCPU is a different concept than hyperthreading logical cores, though. They're decoupled. (vCPU comes from virtualization software like Xen.)

They are, but what you are buying is a HT cpu core on aws.

>FYI, y'all: cloud "cores" are actually hyperthreads

Depends on the provider. Azure, for instance, has hyperthreading disabled on most of their configurations. They're starting to offer new configurations with hyperthreading though.

Yep. But they compensate for that by charging a lot more and using lower end CPU SKUs with less cache. And GPUs are still per die.

Also to add to the article: I have also discovered that for our deep learning workloads 8 core VMs are the sweet spot in terms of cost/perf. This is on Google Cloud, which in the particular zone I tested uses $5k apiece high end Broadwell Xeons with tons of cache. Our stuff is quite a bit faster than general purpose frameworks like TF though. 8 cores is not as fast per core as the smaller number of cores, but latency is lower, and the penalty per core is not that bad. After 8 cores perf per core drops off pretty steeply due to memory bandwidth constraints. I imagine PPCle would be pretty awesome with its 250GB/s of memory bandwidth. I wish I had a machine to try out.

Hadn't heard about compiling yourself improving performance for cloud CPU usage - thanks!
