
FYI, y'all: cloud "cores" are actually hyperthreads. Cloud GPUs are single dies on a multi-die card. If you use GPUs 24x7, just buy a few 1080 Ti cards and forgo the cloud entirely. If you must use TF in the cloud on CPU, compile it yourself with AVX2 and FMA support. Stock TF is compiled for the lowest common denominator.
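Before rebuilding, it may be worth confirming that the VM's CPU actually advertises those instruction sets. Here is a minimal sketch, assuming a Linux x86 host where `/proc/cpuinfo` has a `flags` line. The function name `missing_simd_flags` and the sample text are made up for illustration.

```python
def missing_simd_flags(cpuinfo_text, wanted=("avx2", "fma")):
    """Return the wanted CPU flags that the first 'flags' line does not list."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return [flag for flag in wanted if flag not in present]
    # No 'flags' line found (e.g. non-x86 or non-Linux input):
    # report everything as missing rather than guess.
    return list(wanted)


# Hypothetical, shortened cpuinfo text; a real file lists many more flags.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 fma\n"
print(missing_simd_flags(sample))
```

On a real machine you would pass `open("/proc/cpuinfo").read()` instead of `sample`. If the list comes back empty, a build with compiler options such as GCC's `-mavx2 -mfma` should run on that host.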

This is very important if you're running any CPU-intensive workload at scale. We had custom-compiled x264, then custom-compiled that into ffmpeg, to get everything out of our CPUs for an encoding cluster. AMD CPUs seem to really shine here.

You'd be surprised at the difference it makes. It was one of the reasons I liked Gentoo: emerge would always build from source for your target CPU flags, instead of using the package-managed "one size fits all" build. Those 5-10%s really compound when you add them up across all dependencies.

Kudos to tutorials and guides that instruct how to build from source.

The same is every bit as true today for your containers, assuming you have a homogeneous target to run them on (yes I know, containers are supposed to be supremely portable, but private ones can be purpose-built).

>We had custom-compiled x264, then custom-compiled that into ffmpeg, to get everything out of our CPUs for an encoding cluster. AMD CPUs seem to really shine here.

Can you tell me more about this? I wanted to switch to Ryzen architecture with my video transcoding project that handles large volume, but because we lean heavily on x264/ffmpeg, it didn't seem like a good idea given the AVX issues, keeping me on i7-based architecture. (Previous comments of mine will show the history of this particular thread.)

Would love to hear it here or via my throwaway: mike.anon@hotmail.com. Thank you so much.

This is especially important if most of your workload is matrix multiplication. Those workloads benefit heavily from vectorization. It might also help to enable Intel MKL, because Eigen, which TF uses by default, is not the fastest thing out there, just the most convenient to work with cross-platform.

Would hyperthreading be helpful or harmful?

Hyperthreading is not harmful per se. It lets your CPU make forward progress when it would otherwise be stalled waiting for something. My issue is that they call hyperthreads "vCPUs", which makes it seem like you're getting a full core, while in reality you're getting 60% of a core at most.

Hyperthreading often is harmful when you use it, because while it does let your CPU make forward progress, it does so at the expense of shared resources, e.g. cache lines that get evicted by the sibling thread.

Obviously depends on your workload, but on my highly parallel "standard" workloads, my experience is that you can get at most 15% more with hyperthreading on (e.g. 4 cores/8 threads) compared to off (4 cores/4 threads), whereas on the cache intensive loads, I get 20-30% LESS with hyperthreading on.
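To put those percentages in per-thread terms, here is a back-of-the-envelope sketch. The function name is made up, and the 0.75 figure is just the midpoint of the 20-30% loss quoted above, not a measurement.

```python
def per_thread_share(cores, threads_per_core, speedup_with_ht):
    """Throughput of one hardware thread as a fraction of one full core.

    speedup_with_ht is total throughput with hyperthreading on, relative
    to the same chip with hyperthreading off (1.0 means no change).
    """
    total_core_equivalents = cores * speedup_with_ht
    return total_core_equivalents / (cores * threads_per_core)


# Best case from the comment above: 4 cores/8 threads, 15% more throughput.
best_case = per_thread_share(4, 2, 1.15)

# Cache-intensive case: assuming a 25% loss (midpoint of the quoted 20-30%).
cache_bound = per_thread_share(4, 2, 0.75)

print(f"best case: {best_case:.1%} of a core per thread")
print(f"cache bound: {cache_bound:.1%} of a core per thread")
```

So even in the good case each hardware thread is worth roughly 57.5% of a full core, which lines up with the "60% of a core at most" figure mentioned earlier in the thread.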

I have never encountered such an abnormal workload. This is also less likely to happen in Broadwell Xeon and up, where last level cache can be partitioned. And this is also less likely to happen on Google Cloud in particular, because Google uses high end CPUs with tons of cache.

If both core threads are memory (and cache) intensive, then you get effectively half the cache size and half the memory bandwidth. Partitioning may make eviction less random, but the cache size is still halved, regardless of how much "tons of cache" you start with.

Increasing cache has the net effect of increasing hit ratio, sometimes substantially. With 20MB per die this may change the calculation of where things drop off. I have found that I can't reliably predict how a chip will perform, so I just wrote a bunch of benchmarks and it takes me about half an hour to see if the chip performs better or worse than I thought it would. Google's Broadwell VMs perform very well.

vCPU is a different concept than hyperthreading logical cores, though. They're decoupled. (vCPU comes from virtualization software like Xen.)

They are, but what you are buying on AWS is an HT CPU core.

>FYI, y'all: cloud "cores" are actually hyperthreads

Depends on the provider. Azure, for instance, has hyperthreading disabled on most of their configurations. They're starting to offer new configurations with hyperthreading though.

Yep. But they compensate for that by charging a lot more and using lower end CPU SKUs with less cache. And GPUs are still per die.

Also, to add to the article: I have discovered that for our deep learning workloads 8-core VMs are the sweet spot in terms of cost/perf. This is on Google Cloud, which in the particular zone I tested uses high-end Broadwell Xeons ($5k apiece) with tons of cache. Our stuff is quite a bit faster than general-purpose frameworks like TF, though. An 8-core VM is not as fast per core as VMs with fewer cores, but latency is lower, and the penalty per core is not that bad. Beyond 8 cores, perf per core drops off pretty steeply due to memory bandwidth constraints. I imagine PPCle would be pretty awesome with its 250GB/s of memory bandwidth. I wish I had a machine to try out.
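One way to make "sweet spot" concrete is to pick the largest VM size whose per-core throughput stays within some fraction of the smallest size's. This is only a sketch: the function name, the 0.8 threshold, and every number in `hypothetical` are invented for illustration and are not measurements from the setup described above.

```python
def sweet_spot(throughput_by_cores, min_efficiency=0.8):
    """Largest core count whose per-core throughput is still at least
    min_efficiency times that of the smallest measured size."""
    sizes = sorted(throughput_by_cores)
    base = sizes[0]
    base_per_core = throughput_by_cores[base] / base
    best = base
    for cores in sizes:
        efficiency = (throughput_by_cores[cores] / cores) / base_per_core
        if efficiency >= min_efficiency:
            best = cores
    return best


# Made-up throughput numbers (e.g. samples/sec) keyed by VM core count.
hypothetical = {2: 100, 4: 190, 8: 350, 16: 480, 32: 560}
print(sweet_spot(hypothetical))
```

With real benchmark numbers in place of `hypothetical`, the threshold is the knob: a stricter value favors smaller VMs, a looser one tolerates more of the memory-bandwidth drop-off in exchange for lower latency.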

Hadn't heard about compiling yourself improving performance for cloud CPU usage - thanks!
