This is very important if you're running any CPU-intensive workload at scale. We had custom compiled x264 then custom compiled that into ffmpeg to get everything out of our CPUs for an encoding cluster. AMD CPUs seem to really shine here.
You'd be surprised at the difference it makes. It was one of the reasons I liked Gentoo: emerge would always build from source for your target CPU flags, instead of using the package-managed "one size fits all" build. Those 5-10% gains really compound when you add them up across all dependencies.
Kudos to tutorials and guides that show how to build from source.
The same is every bit as true today for your containers, assuming you have a homogeneous target to run them on (yes I know, containers are supposed to be supremely portable, but private ones can be purpose-built).
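If you do ship a purpose-built binary or container, it may be worth checking at startup that the host actually has the instruction sets you compiled for. A minimal sketch in Python, assuming an x86 Linux host where `/proc/cpuinfo` has a `flags` line; the `REQUIRED_FLAGS` set and the helper names are made up for illustration and would need to match whatever your build really targets:

```python
from pathlib import Path

# Hypothetical set of features a tuned build was compiled for.
REQUIRED_FLAGS = {"avx2", "fma", "sse4_2"}


def parse_cpu_flags(cpuinfo_text):
    """Return the CPU feature flags from /proc/cpuinfo-style text.

    Assumes the x86 Linux layout, where each processor entry has a
    line of the form "flags : fpu vme ...". Only the first such line
    is read, which is enough on a homogeneous machine.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that the host does not advertise."""
    return set(required) - parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    missing = missing_flags(text)
    if missing:
        print("host lacks:", ", ".join(sorted(missing)))
    else:
        print("host supports every required flag")
```

Failing fast like this is cheaper than finding out through an illegal-instruction crash after the image lands on an older node.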
>We had custom compiled x264 then custom compiled that into ffmpeg to get everything out of our CPUs for an encoding cluster. AMD CPUs seem to really shine here.
Can you tell me more about this? I wanted to switch my high-volume video transcoding project to Ryzen, but because we lean heavily on x264/ffmpeg, it didn't seem like a good idea given the AVX issues, which has kept me on i7-based hardware. (Previous comments of mine will show the history of this particular thread.)
Would love to hear it here or via my throwaway: mike.anon@hotmail.com. Thank you so much.
This is especially important if most of your workload is matrix multiplication. Those workloads benefit heavily from vectorization. It might also help to enable Intel MKL, because Eigen, which TF uses by default, is not the fastest thing out there, just the most convenient to work with cross-platform.
Hyper threading is not harmful per se. It lets your CPU make forward progress when it would otherwise be stalled waiting for something. My issue is that they call hyperthreads "vCPU" which makes it seem like you're getting a full core, while in reality you're getting 60% of a core at most.
Hyper threading often is harmful when you use it, because while it does let your CPU make forward progress, it does that at the expense of e.g. cache that is evicted.
Obviously depends on your workload, but on my highly parallel "standard" workloads, my experience is that you can get at most 15% more with hyperthreading on (e.g. 4 cores/8 threads) compared to off (4 cores/4 threads), whereas on the cache intensive loads, I get 20-30% LESS with hyperthreading on.
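To put those percentages in per-thread terms, here is a rough back-of-the-envelope sketch. It assumes two hardware threads per core and that work is split evenly between them; the `per_thread_share` helper is made up for illustration, and the inputs are just the figures quoted in this thread, not new measurements:

```python
def per_thread_share(ht_speedup, threads_per_core=2):
    """Fraction of a full core that each hardware thread delivers.

    ht_speedup is total throughput with hyperthreading on divided by
    throughput with it off (1.15 means 15% more work done overall).
    Assumes the work is spread evenly across the sibling threads.
    """
    return ht_speedup / threads_per_core


cases = [
    ("parallel workload, +15%", 1.15),
    ("cache-intensive, -20%", 0.80),
    ("cache-intensive, -30%", 0.70),
]

for label, speedup in cases:
    share = per_thread_share(speedup)
    print(f"{label}: each thread is worth about {share:.1%} of a core")
```

Under those assumptions a +15% gain works out to roughly 57.5% of a core per thread, which lines up with the "60% of a core at most" figure for a vCPU, and the cache-intensive cases fall to 35-40%.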
I have never encountered such an abnormal workload. This is also less likely to happen in Broadwell Xeon and up, where last level cache can be partitioned. And this is also less likely to happen on Google Cloud in particular, because Google uses high end CPUs with tons of cache.
If both core threads are memory (and cache) intensive, then you get effectively half the cache size and half the memory bandwidth. Partitioning may make eviction less random, but the cache size is still halved, regardless of how much "tons of cache" you start with.
Increasing cache has the net effect of increasing hit ratio, sometimes substantially. With 20MB per die this may change the calculation of where things drop off. I have found that I can't reliably predict how a chip will perform, so I just wrote a bunch of benchmarks and it takes me about half an hour to see if the chip performs better or worse than I thought it would. Google's Broadwell VMs perform very well.