
There have been many previous attempts at "throughput computing", meaning many slow cores or threads for server workloads: 3 generations of Niagara (UltraSPARC T1 and descendants), SeaMicro's box full of Atoms, etc. It doesn't mean this attempt is doomed, but is there a compelling answer to "What's different this time?"



The difference here is that the individual cores on this are very fast. Unfortunately I cannot publish benchmarks because of NDAs, so you'll have to believe me on this until the hardware is more widely available.

https://rwmj.wordpress.com/2017/11/20/make-j46-kernel-builds...


Cloudflare did a quick review:

https://blog.cloudflare.com/arm-takes-wing/


They are decent, but don't seem quite as good as the cores on AMD's (or, even more so, Intel's) processors.

Qualcomm Centriq 2434 [https://www.nextplatform.com/2017/11/08/qualcomms-amberwing-...]:

- 40 cores (no SMT)

- 2.5 GHz peak

- 4 uops/instructions per cycle [https://www.qualcomm.com/media/documents/files/qualcomm-cent...]

- 110 W TDP

- $888

- 10 Guops/s/core

- 0.011 Guops/s/core/$

- 400 Guops/s

- 0.45 Guops/s/$

- 3.63 Guops/s/W

AMD Epyc 7401P [https://en.wikipedia.org/wiki/Epyc]:

- 24 cores (2x SMT)

- 2.8 GHz all-core boost

- 6 uops per cycle [http://www.agner.org/optimize/microarchitecture.pdf]

- 170 W TDP

- $1075

- 16.8 Guops/s/core

- 0.016 Guops/s/core/$

- 403 Guops/s

- 0.37 Guops/s/$

- 2.37 Guops/s/W

So based on this, the AMD processor has 170% of the Qualcomm's per-core performance, equal total throughput, 83% of the Qualcomm's throughput per $, and 65% of its throughput per W.
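For anyone who wants to check the arithmetic, here's a quick sketch that reproduces the derived numbers from the specs above (peak figures only, with all the caveats that implies):

    # Back-of-envelope script; inputs are the list prices, TDPs and
    # peak uop rates quoted above, nothing measured.
    chips = {
        "Centriq 2434": dict(cores=40, ghz=2.5, uops=4, tdp=110, price=888),
        "Epyc 7401P":   dict(cores=24, ghz=2.8, uops=6, tdp=170, price=1075),
    }
    for name, c in chips.items():
        per_core = c["ghz"] * c["uops"]    # Guops/s/core: 10.0 vs 16.8
        total = per_core * c["cores"]      # Guops/s: 400 vs ~403
        print(f"{name}: {per_core:.1f} Guops/s/core, {total:.0f} Guops/s, "
              f"{total / c['price']:.2f} Guops/s/$, {total / c['tdp']:.2f} Guops/s/W")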

Note that the AMD CPU has SMT while the Qualcomm doesn't, which improves utilization, and its components are probably faster (due to higher TDP and more experience making CPUs). So it looks like the AMD CPUs are likely to be strictly better in practice, except possibly on performance/watt.

Also, with AMD/Intel, albeit at much lower performance/$, you can get 32-core instead of 24-core CPUs, and there is up to 4/8-way SMP support that Qualcomm doesn't mention.


This is the most meaningless comparison you could possibly make. By this logic the Pentium 4 was also better: it had a 3.8 GHz peak clock, SMT, and could do 4 uops/cycle.

In reality things don't work like that. First of all, some of the uops can only be loads, others stores, and others branches. Second, factors like branch prediction, cache latency, and the branch misprediction penalty play a huge role in performance.

I have yet to see a workload that can saturate 6 execution ports, even with SMT.
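To make that concrete, here's a toy bottleneck model (the port capacities and uop mix are made-up illustrative numbers, not any real core): even with 6 issue ports, a single over-subscribed port class caps effective throughput below the peak.

    # Toy model: sustainable uops/cycle is limited by whichever
    # port class is most over-subscribed by the workload's uop mix.
    width = 6                                               # peak issue width
    ports = {"alu": 4, "load": 2, "store": 1, "branch": 1}  # assumed capacities
    mix = {"alu": 0.45, "load": 0.25, "store": 0.20, "branch": 0.10}  # assumed mix
    ipc = min(width, min(ports[k] / mix[k] for k in mix))
    print(f"effective uops/cycle: {ipc:.1f}")  # 5.0; the store port is the bottleneck
    # ...and that's before branch mispredictions and cache misses eat into it further.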

In the real world the Centriq is better in performance/watt, and is even better in performance/thread than an SMT thread on an Intel Skylake.


> In the real world the Centriq is better in performance/watt, and is even better in performance/thread than an SMT thread on an Intel Skylake.

Is there such a thing as "in the real world"? I mean, isn't it use-case dependent? If you want a build machine, a web server, or a database server, you'll get different results out of your benchmarks.


I am sorry, you are 100% correct. There are workloads where you can't beat Intel: pure number crunching, for example, where you can utilize AVX-512. I meant the real world of (most) web servers in this case.


> There are workloads where you can't beat Intel. Pure number crunching for example, where you can utilize AVX-512.

That's also debatable. I've read HPC papers that show Opterons outperforming Xeons on heavy FP workloads due to higher throughput and larger caches. Baseless claims regarding "real world performance" are only good for marketeers.


Edge cases are edge cases. I once made a semi-retired five-year-old server crush a brand-new one on a specific workload, just because I noticed the working dataset fit entirely in its L2 cache.

Will I ever do it again? I have no idea. At the time, I got a very nice bottle of wine for my bet.


> Edge cases are edge cases.

It's not an edge case when we're talking about basic BLAS kernels.

> I once made a semi-retired 5 year-old server crush a brand new one on a specific workload just because I noticed the working dataset did fit entirely in its L2 cache.

You seem to be oblivious to the fact that cache access has long been the main bottleneck in HPC applications. Although parallel programming gets all the attention, the bulk of the research in the field goes into figuring out ways to minimize cache misses while feeding the ever-growing number of registers. Opterons outperformed Xeons because researchers figured out how to harness the Opteron's larger cache and throughput to avoid the penalties of cache misses and thrashing, and it showed. That's also one of the reasons why the old Bulldozer architecture showed linear per-core scaling even when each pair of cores shared a floating-point unit.


> so it looks like the AMD CPUs are likely to be strictly better in practice except possibly on performance/watt.

Cloud scale providers don't care about raw performance (within reason).

TCO wins the day, so if Qualcomm CPUs offer higher performance per watt than Intel or AMD, I can definitely see them buying these up like hotcakes.


They do seem surprisingly decent. Not yet clearly better, but surprisingly decent.

It is too bad someone hasn't made a graph of these two trajectories, Intel/AMD performance and ARM performance, over time. I bet it would let us see whether there is going to be a crossover between the two and when it would happen. We have like 7 years of data in this race now; a graph should be possible.


Hold up. Comparing uops/sec between CPU architectures is not meaningful.


Yes. Furthermore, the assumption that a core capable of a maximum of x uops/cycle actually executes x uops/cycle when running a real workload seems really far-fetched.


Some public benchmark data: https://blog.cloudflare.com/arm-takes-wing/


Re: "What's different this time?"

1. Tier-1 server suppliers like HPE plan to make Centriq servers. For customers to stick their necks out and start porting critical software to a new architecture, they have to believe there's going to be a refresh next year, and the year after that. Some adventurous customers will be ready to explore this territory, but the critical mass won't move until they see momentum shift that way.

HPE and Qualcomm each have a popular brand and big budgets that can sustain a slow ramp of a couple generations of these products before they start to see major adoption.

2. What else is different this time is the ever-increasing popularity of open source software, Linux, containerization, Python, Go, Node/JS, Java, C#, etc. Red Hat announced at SC this year that they will offer a supported ARM release. That means all of the above will Just Work.


We can roughly estimate generalized per-core performance via their performance-per-dollar comparisons with Intel.

Their 48-core Centriq 2400 is listed at $2000. The 28-core Skylake they compare it to is $10,000. They claim 4x better performance per dollar, which puts the whole chip at 0.8x the Intel's performance. That pegs each core at roughly 0.5x an Intel core.

It won't win a single-thread contest, but that puts it well within the "very fast" category.
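Spelling out that arithmetic (the prices and the 4x claim are as quoted above):

    # Rough per-core estimate from the numbers in this thread.
    centriq_price, centriq_cores = 2000, 48
    xeon_price, xeon_cores = 10000, 28
    perf_per_dollar_ratio = 4.0                  # Qualcomm's claimed advantage
    chip_perf = perf_per_dollar_ratio * centriq_price / xeon_price
    core_perf = chip_perf * xeon_cores / centriq_cores
    print(f"chip: {chip_perf:.1f}x the Xeon, per core: {core_perf:.2f}x")  # 0.8x, ~0.47x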


These go way back; Thinking Machines Corporation was doing stuff like this back in the early 80s.

https://en.wikipedia.org/wiki/Connection_Machine

It didn't pan out that time either, not even on machines with 64k processors.


Intel's Knights Landing / Xeon Phi architecture is a bit like that.


Eventually we'll face the fact that cores won't get any faster and that even phones will have a dozen of them.

Porting our code to run well on such machines is a bet people should start making.


... or a bet that vendors should start making.

Intel could sell bottom-bin Xeon Phi chips as development systems. I don't care if half the cores failed testing and are masked off, or if it won't run at full speed.


I would love if they did that.

Developers always have the computers of the future on their desks. If Intel wants Phi to be part of that future, they'd better put them on developers' desks.


Considering that even my far-from-flagship phone has an eight-core processor, I wouldn't be surprised to see a 12-core in the near future.


Sure, but that's more like dual quad-core (since no one uses asymmetric multiprocessing).


Which is also another bet people should start making.

Most of the time, my laptop is doing workloads that would leave an Atom bored to death, but when I need them, I sure love those i7 cores.

I would gladly sacrifice one i7 core for 4 Atom ones, provided the OS knew what to do with them.


Yeah, it's a latency vs power vs parallelism issue.


This is one reason why WinRT initially only had async APIs.


Wasn't Phi more of an HPC/machine learning solution rather than a server solution?

I don't think they target the same kind of applications anyway.


pushed by a bigger company using a better established architecture and a more reasonable core count.



