
> In the real world Centriq is better in performance/watt, and is even better in performance/thread than an SMT thread on an Intel Skylake.

Is there any such thing as "in the real world"? I mean, isn't it use-case dependent? If you want a build machine, a web server, or a database server, you'll get different results out of your benchmarks.






I am sorry, you are 100% correct. There are workloads where you can't beat Intel. Pure number crunching for example, where you can utilize AVX-512. I meant in the real world of (most) web servers in this case.

> There are workloads where you can't beat Intel. Pure number crunching for example, where you can utilize AVX-512.

That's also debatable. I've read HPC papers that show Opterons outperforming Xeons on heavy FP workloads due to their higher throughput and larger cache. Baseless claims regarding "real world performance" are only good for marketeers.


Edge cases are edge cases. I once made a semi-retired 5-year-old server crush a brand new one on a specific workload just because I noticed the working dataset fit entirely in its L2 cache.

Will I ever do it again? I have no idea. At the time, I got a very nice bottle of wine for my bet.


> Edge cases are edge cases.

It's not an edge case when we're talking about basic BLAS kernels.

> I once made a semi-retired 5-year-old server crush a brand new one on a specific workload just because I noticed the working dataset fit entirely in its L2 cache.

You seem to be oblivious to the fact that cache access has long been the main bottleneck in HPC applications. Although parallel programming gets all the attention, the bulk of the research in the field goes into figuring out ways to minimize cache misses while pumping data to the ever-growing number of registers. Opterons outperformed Xeons because researchers figured out how to harness the Opteron's larger cache and throughput to avoid the performance penalties imposed by cache misses and thrashing, and it showed. That's also one of the reasons why the old Bulldozer architecture showed linear per-core performance even when each pair of cores shared a floating point unit.
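
To make the cache point concrete, here is a rough sketch of loop blocking (tiling) on a matrix multiply, the kind of restructuring BLAS kernels rely on. The function names and the `block` parameter are made up for illustration, and this is not taken from any of the papers mentioned. Pure Python won't show the speedup; it only shows the access pattern, where each tile of the operands is reused while it is still likely to be in cache.

```python
def matmul_naive(a, b):
    """Plain triple loop: walks a column of b for every output element."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=32):
    """Same product, computed tile by tile.

    `block` is a hypothetical tuning knob: in a compiled kernel it would be
    chosen so that the tiles being worked on fit in the target cache level.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):          # tile rows of a and c
        for kk in range(0, m, block):      # tile the shared dimension
            for jj in range(0, p, block):  # tile columns of b and c
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

Both functions return the same matrix; only the order in which memory is touched changes, which is the whole game when the working set is larger than the cache.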



