Qualcomm Centriq 2434 [https://www.nextplatform.com/2017/11/08/qualcomms-amberwing-...]:
- 40 cores (no SMT)
- 2.5 GHz peak
- 4 uops/instructions per cycle [https://www.qualcomm.com/media/documents/files/qualcomm-cent...]
- 110 W TDP
- 10 Guops/s/core
- 0.011 Guops/s/core/$
- 400 Guops/s
- 0.45 Guops/s/$
- 3.63 Guops/s/W
AMD Epyc 7401P [https://en.wikipedia.org/wiki/Epyc]:
- 24 cores (2x SMT)
- 2.8 GHz all-core boost
- 6 uops per cycle [http://www.agner.org/optimize/microarchitecture.pdf]
- 170 W TDP
- 16.8 Guops/s/core
- 0.016 Guops/s/core/$
- 403 Guops/s
- 0.37 Guops/s/$
- 2.37 Guops/s/W
So based on this, the AMD processor has about 170% of the Qualcomm's per-core performance, equal total throughput, 83% of Qualcomm's total throughput per $, and 65% of Qualcomm's total throughput per W.
Note that the AMD CPU has SMT, which improves utilization, while the Qualcomm doesn't, and AMD's components are probably faster (due to higher TDP and more experience making CPUs), so it looks like the AMD CPUs are likely to be strictly better in practice, except possibly on performance/watt.
Also, with AMD/Intel, albeit at much lower performance/$, you can have 32-core instead of 24-core CPUs and there is up to 4/8-way SMP support that Qualcomm doesn't mention.
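For anyone who wants to check the arithmetic behind the two lists above, here is a minimal sketch. The core counts, clocks, issue widths and TDPs are the ones quoted in the post; the prices are not stated there, so the list prices used here ($888 and $1,075) are an assumption that happens to reproduce the per-$ figures. The `Cpu` class and `ratio` helper are just illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    name: str
    cores: int
    ghz: float            # clock quoted in the post (peak / all-core boost)
    uops_per_cycle: int   # peak issue width quoted in the post
    tdp_w: float
    price_usd: float      # ASSUMPTION: list price, not stated in the post

    @property
    def guops_per_core(self) -> float:
        # Theoretical peak: clock (GHz) times uops issued per cycle.
        return self.ghz * self.uops_per_cycle

    @property
    def guops_total(self) -> float:
        return self.guops_per_core * self.cores

    @property
    def guops_per_dollar(self) -> float:
        return self.guops_total / self.price_usd

    @property
    def guops_per_watt(self) -> float:
        return self.guops_total / self.tdp_w


centriq = Cpu("Qualcomm Centriq 2434", 40, 2.5, 4, 110, 888)
epyc = Cpu("AMD Epyc 7401P", 24, 2.8, 6, 170, 1075)


def ratio(metric: str) -> float:
    """AMD value divided by Qualcomm value for the given metric."""
    return getattr(epyc, metric) / getattr(centriq, metric)


for metric in ("guops_per_core", "guops_total",
               "guops_per_dollar", "guops_per_watt"):
    print(f"{metric}: AMD is {ratio(metric):.0%} of Qualcomm")
```

The per-core ratio comes out at 168%, which the post rounds to 170%. These are theoretical peaks only, not measured performance.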
In reality things don't work like that. First of all, some of the uops can only be loads, others stores, and others branches. Second, factors like branch prediction, cache latency, and the branch misprediction penalty play a huge role in performance.
I have yet to see a workload that can saturate 6 execution ports, even with SMT.
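To illustrate the point about uop classes, here is a rough toy model, not a description of any real core. It assumes each uop class can only issue on its own dedicated ports, so the sustained rate is capped by whichever class runs out of ports first. The port counts and instruction mixes are made up for the example, and the model ignores cache misses and branch mispredictions entirely, which would only lower the number further.

```python
def max_uops_per_cycle(width, ports, mix):
    """Upper bound on sustained uops/cycle.

    width: peak issue width (uops/cycle)
    ports: number of ports that can execute each uop class (hypothetical)
    mix:   fraction of the uop stream belonging to each class (sums to 1)
    """
    bound = float(width)
    for kind, fraction in mix.items():
        if fraction > 0:
            # At a total rate R, this class needs R * fraction uops/cycle
            # and cannot exceed ports[kind], so R <= ports[kind] / fraction.
            bound = min(bound, ports[kind] / fraction)
    return bound


# Hypothetical 6-wide core: these port counts are illustrative only.
ports = {"alu": 4, "load": 2, "store": 1, "branch": 1}

# Made-up load-heavy mix: the two load ports become the bottleneck.
load_heavy = {"alu": 0.35, "load": 0.40, "store": 0.10, "branch": 0.15}

# Made-up mix that happens to fit the ports: the peak width is the limit.
balanced = {"alu": 0.50, "load": 0.25, "store": 0.10, "branch": 0.15}

print(max_uops_per_cycle(6, ports, load_heavy))
print(max_uops_per_cycle(6, ports, balanced))
```

Even in this optimistic model, a load-heavy mix cannot reach the headline 6 uops per cycle.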
In the real world, Centriq is better in performance/watt, and is even better in performance/thread than an SMT thread on an Intel Skylake.
Is there such a thing as "in the real world"? I mean, isn't it use-case dependent, and if you want a build machine, a web server, or a database server you'll get different results out of your benchmarks?
That's also debatable. I've read HPC papers that show Opterons outperforming Xeons on heavy fp workloads due to the higher throughput and larger cache. Baseless claims regarding "real world performance" are only good for marketeers.
Will I ever do it again? I have no idea. At the time, I got a very nice bottle of wine for my bet.
It's not an edge case when we're talking about basic BLAS kernels.
> I once made a semi-retired 5 year-old server crush a brand new one on a specific workload just because I noticed the working dataset did fit entirely in its L2 cache.
You seem to be oblivious to the fact that cache access has been the main bottleneck in HPC applications for a long time, and although parallel programming gets all the attention, the bulk of the research in the field goes into figuring out ways to minimize cache misses while pumping data to the ever-growing number of registers. Opterons outperformed Xeons because researchers figured out how to harness the Opteron's larger cache and throughput to avoid the performance penalties imposed by cache misses and thrashing, and it showed. That's also one of the reasons why the old Bulldozer architecture showed linear per-core performance even when each pair of cores shared a floating point unit.
Cloud scale providers don't care about raw performance (within reason).
TCO wins the day, so if Qualcomm CPUs offer higher performance per watt than Intel or AMD, I can definitely see them buying these like hotcakes.
It is too bad someone hasn't made a graph of these two trajectories, Intel/AMD performance and ARM performance, over time. I bet it would let us see if there is going to be an intercept between the two and when it would happen. We have like 7 years of data now in this race, so a graph should be possible.
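If someone does collect the numbers, finding the intercept is simple. Here is a minimal sketch that assumes both trajectories are roughly exponential, fits a straight line to the log of each series, and solves for where the lines meet. The function names are illustrative, and the two series below are synthetic placeholders generated from a formula, not real benchmark results.

```python
import math


def fit_log_linear(years, scores):
    """Least-squares fit of ln(score) = a + b * year; returns (a, b)."""
    n = len(years)
    logs = [math.log(s) for s in scores]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def crossover_year(years, incumbent, challenger):
    """Year at which the two fitted trends intersect, or None if parallel."""
    a1, b1 = fit_log_linear(years, incumbent)
    a2, b2 = fit_log_linear(years, challenger)
    if math.isclose(b1, b2):
        return None  # same growth rate: the trend lines never meet
    # Solve a1 + b1 * t = a2 + b2 * t for t.
    return (a1 - a2) / (b2 - b1)


# SYNTHETIC placeholder data: 7 yearly points per series, with year 0 as
# the first data point. Replace with real benchmark scores before drawing
# any conclusions.
years = list(range(7))
x86_scores = [100 * 1.05 ** t for t in years]  # made up: 5% growth per year
arm_scores = [20 * 1.25 ** t for t in years]   # made up: 25% growth per year

print(crossover_year(years, x86_scores, arm_scores))
```

A result later than the last data point is an extrapolation and should be read with the usual caution. A result earlier than the first data point means the trends crossed before the series began.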