This is very important if you're running any cpu intensive workload at scale. We...

icelancer · on July 9, 2017

>We had custom compiled x264 then custom compiled that into ffmpeg to get everything out of our CPUs for an encoding cluster. AMD cpus seem to really shine here.

Can you tell me more about this? I wanted to switch to Ryzen architecture with my video transcoding project that handles large volume, but because we lean heavily on x264/ffmpeg, it didn't seem like a good idea given the AVX issues, keeping me on i7-based architecture. (Previous comments of mine will show the history of this particular thread.)

Would love to hear it here or via my throwaway: mike.anon@hotmail.com. Thank you so much.

0xbear · on July 9, 2017

This is especially important if most of your workload is matrix multiplication. Those workloads benefit heavily from vectorization. It might also help to enable Intel MKL, because Eigen, which TF uses by default, is not the fastest thing out there, just the most convenient to work with cross-platform.
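Before rebuilding anything with vector extensions turned on, it helps to confirm what the machine actually advertises. A minimal sketch, assuming a Linux host where `/proc/cpuinfo` is readable; the function name `simd_support` and the default list of flags are illustrative choices, not anything from TF or MKL:

```python
def simd_support(cpuinfo_text, wanted=("sse4_2", "avx", "avx2", "fma", "avx512f")):
    """Report which of the wanted instruction-set flags appear in cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels label the list "flags"; some ARM kernels use "Features".
        if key.strip() in ("flags", "Features"):
            flags.update(value.split())
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(simd_support(f.read()))
    except OSError:
        print("/proc/cpuinfo not available on this platform")
```

On a cloud VM this only shows what the hypervisor exposes to the guest, which may be a subset of what the physical chip supports.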

edwinksl · on July 9, 2017

Would hyperthreading be helpful or harmful?

0xbear · on July 9, 2017

Hyperthreading is not harmful per se. It lets your CPU make forward progress when it would otherwise be stalled waiting for something. My issue is that they call hyperthreads "vCPU", which makes it seem like you're getting a full core, while in reality you're getting 60% of a core at most.

beagle3 · on July 9, 2017

Hyperthreading is often harmful when you use it, because while it does let your CPU make forward progress, it does so at the expense of, e.g., cache contents that get evicted.

Obviously it depends on your workload, but on my highly parallel "standard" workloads, my experience is that you can get at most 15% more throughput with hyperthreading on (e.g. 4 cores/8 threads) compared to off (4 cores/4 threads), whereas on cache-intensive loads, I get 20-30% LESS with hyperthreading on.
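Those percentages can be put next to the "60% of a core" figure upthread with a bit of arithmetic. A rough sketch, assuming total throughput is split evenly across hardware threads; `per_thread_share` is a made-up helper name, and 0.75 stands in for the middle of the 20-30% loss range:

```python
def per_thread_share(physical_cores, threads_per_core, speedup):
    """Fraction of one full core that each hardware thread delivers.

    speedup is total throughput with hyperthreading on, relative to
    running one thread per physical core (1.15 means 15% more).
    """
    total_core_equivalents = physical_cores * speedup
    hardware_threads = physical_cores * threads_per_core
    return total_core_equivalents / hardware_threads


# Best case from the comment: 4 cores / 8 threads, 15% more throughput.
print(per_thread_share(4, 2, 1.15))  # 0.575

# Cache-intensive case: 25% less throughput with hyperthreading on.
print(per_thread_share(4, 2, 0.75))  # 0.375
```

So under that even-split assumption, a 15% gain works out to a bit under 60% of a core per hardware thread, and the cache-bound case to well under half.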

0xbear · on July 9, 2017

I have never encountered such an abnormal workload. This is also less likely to happen in Broadwell Xeon and up, where last level cache can be partitioned. And this is also less likely to happen on Google Cloud in particular, because Google uses high end CPUs with tons of cache.

beagle3 · on July 9, 2017

If both threads on a core are memory (and cache) intensive, then each effectively gets half the cache size and half the memory bandwidth. Partitioning may make eviction less random, but the cache size is still halved, regardless of how much "tons of cache" you start with.

0xbear · on July 9, 2017

Increasing cache has the net effect of increasing hit ratio, sometimes substantially. With 20MB per die this may change the calculation of where things drop off. I have found that I can't reliably predict how a chip will perform, so I just wrote a bunch of benchmarks and it takes me about half an hour to see if the chip performs better or worse than I thought it would. Google's Broadwell VMs perform very well.
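The "just benchmark it" approach can be sketched as a tiny working-set sweep. This is only an illustration under loud assumptions: the helper names (`best_of`, `stride_sum`, `sweep`), the 64-byte line size, and the buffer sizes are all made up here, and CPython overhead blurs cache effects, so a real test should run the actual workload:

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def stride_sum(buf, stride=64):
    """Read one byte per assumed 64-byte cache line and sum the values."""
    return sum(buf[::stride])


def sweep(sizes_kib=(256, 4096, 32768)):
    """Map working-set size in KiB to nanoseconds per touched line."""
    results = {}
    for kib in sizes_kib:
        buf = bytearray(kib * 1024)
        seconds = best_of(lambda: stride_sum(buf))
        results[kib] = seconds / (len(buf) // 64) * 1e9
    return results


if __name__ == "__main__":
    for kib, ns in sweep().items():
        print(f"{kib:>6} KiB: {ns:.2f} ns per line")
```

If the cost per line jumps once the buffer outgrows the last-level cache, that is roughly where things "drop off"; running two copies pinned to sibling hyperthreads would show how the point moves when the cache is shared.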

jsmthrowaway · on July 9, 2017

vCPU is a different concept from hyperthreaded logical cores, though. They're decoupled. (vCPU comes from virtualization software like Xen.)

shaklee3 · on July 9, 2017

They are, but what you are buying is an HT CPU core on AWS.
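One way to see what a given instance actually hands you is to compare logical processors with distinct physical cores. A minimal sketch, assuming Linux-style `/proc/cpuinfo` text with `physical id` and `core id` fields; `count_cores` is a hypothetical helper, and inside a VM the result reflects whatever topology the hypervisor chooses to report:

```python
def count_cores(cpuinfo_text):
    """Return (logical_processors, distinct_physical_cores) from cpuinfo text."""
    logical = 0
    physical = set()
    package = core = None
    # The trailing "" makes sure the last processor block is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            package = value
        elif key == "core id":
            core = value
        elif not key:
            # A blank line ends one processor block.
            if package is not None and core is not None:
                physical.add((package, core))
            package = core = None
    return logical, len(physical)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(count_cores(f.read()))
    except OSError:
        print("/proc/cpuinfo not available on this platform")
```

If the first number is twice the second, each "CPU" the guest sees is a hyperthread rather than a full core.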