
ARM details its new high-end CPU core, Cortex A72 - aroch
http://arstechnica.com/gadgets/2015/04/arm-details-its-new-high-end-cpu-core-cortex-a72/
======
ChuckMcM
Ok, this was a fun read. Two things stood out for me: one was the partially
out-of-order, 8-issue pipeline, and the other was the watts per instruction.

For a long time Intel has been top dog in the "instructions per clock" or IPC
space. So if you wanted performance you used their chips, except you paid for
that by consuming a lot of power. ARM, on the other hand, has always tried to
be the 'low power' chip which you could embed and run on batteries, but was
ultimately slower than the Intel architecture.

But into this a couple of interesting market realities intruded, the most
obvious being that at some point computers were "fast enough" for enough
people to make a durable market for lower-power machines. That took off, of
course, in the smartphone and tablet market, but it is also inching into the
"low end" server market.

If ARM can get better at performance faster than Intel can get better at "low
power," they can really put a dent in Intel's market dominance in the larger
computer space. And as a competitor with a completely different ISA, they limit
Intel's ability to compete with lawsuits and/or changes to the "standard."

~~~
stephencanon
"In more detailed terms, the Cortex A72 CPU pairs a three-wide, in-order front
end with a five-wide, out-of-order back end (i.e. 8-issue)."

... That's not how it works. You don't add front-end and back-end width to get
issue width. I know that Ars' architectural chops took a hit when Stokes left,
but this is ridiculous.

You take the minimum of the two[1], except that (per AnandTech's far better
article [2]) the decoder is actually three _fused macro-ops_ wide, not three
instructions, so the truth is actually somewhere between three and five
instructions wide, depending on workload.

[1] or maybe just the back-end width, since most modern CPUs can short-circuit
decode in tight loops.

[2] [http://www.anandtech.com/show/9184/arm-reveals-cortex-a72-architecture-details](http://www.anandtech.com/show/9184/arm-reveals-cortex-a72-architecture-details)
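To put rough numbers on that claim, here is a toy calculation. The 3-wide decode and 5-wide back end come from the articles above; the fusion ratios are illustrative assumptions, not measured A72 figures:

```python
def effective_issue_width(decode_macro_ops, instrs_per_macro_op, backend_width):
    """Sustained instructions/cycle is capped by whichever end is narrower:
    the decoder (counted in macro-ops, scaled by how many instructions fuse
    into each one) or the back-end issue width."""
    return min(decode_macro_ops * instrs_per_macro_op, backend_width)

# 3-wide decode (macro-ops) paired with a 5-wide back end:
print(effective_issue_width(3, 1.0, 5))  # no fusion: 3.0 instructions/cycle
print(effective_issue_width(3, 1.5, 5))  # some fusion: 4.5
print(effective_issue_width(3, 2.0, 5))  # heavy fusion: capped at 5
```

So "8-issue" never appears anywhere; the workload-dependent answer lands between 3 and 5.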

~~~
DannyBee
I don't think you understand how this works at all.

It's like when you add a few cortex m0's to your phone, and now you have a
quad-core phone.

(It's 8 issue. You can issue 8 instructions to it, it just doesn't process
them all in a single cycle :P)

~~~
sliverstorm
8-issue means you can issue 8 things all together, all at once, all in a
single cycle.

The front end feeds the back end.

3-issue FE means you can give the BE 3 things per cycle.

5-issue BE means you can crunch 5 things from the FE per cycle.

FE = branch prediction, decode, instruction cache

BE = schedule, execute, load/store, data cache

~~~
ChuckMcM
The question for me is whether you can retire the FE issue with a BE dispatch,
and if so pull in another FE issue. So can you get the execution units to the
point where all 5 of the BE units are subscribed and the three instructions in
the front end have been pre-cracked, waiting for a BE slot to free up?

So ...

    
    
       Ins 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ...
    
       FE: [ ] [ ] [Ins0]
       BE: [ ]  [ ]  [ ]  [ ]  [ ]
    
       FE: [ ] [ ] [Ins1]
       BE: [ ] [ ] [ ] [ ] [Ins0]
    
       FE: [ ] [Ins1] [Ins2]
       BE: [ ] [ ] [ ] [ ] [Ins0]
    
       FE: [Ins1] [Ins5] [Ins6]
       BE: [ ] [Ins3] [Ins4] [Ins2] [Ins0]
    
       FE: [Ins1] [Ins5] [Ins7]
       BE: [Ins6] [Ins3] [Ins4] [Ins2] [Ins0]
    

And then maybe Ins0 retires, which lets Ins1 proceed; it retires along with 2,
3, and 4, which lets 5 proceed, that lets 6 retire, and then 7 proceeds.

But until I get a closer look at the TRM this is all speculation.

~~~
sliverstorm
_The question for me is whether you can retire the FE issue with a BE
dispatch, and if so pull in another FE issue._

Almost certainly yes. It's supposed to be fully out-of-order, which means it
should have a fully functional scheduler in between the FE & BE.

Not to mention, given modern memory latencies (vs. clock speed), letting the
FE run ahead is important for performance.
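A toy cycle-by-cycle sketch of that decoupling. The 3-wide FE and 5-wide BE match the thread's numbers, but the queue depth and everything else here is an invented illustration, not the A72's real microarchitecture:

```python
from collections import deque

FE_WIDTH, BE_WIDTH, QUEUE_CAP = 3, 5, 8  # illustrative, not A72's real sizes

def simulate(n_instructions, cycles):
    """Front end decodes up to FE_WIDTH instructions/cycle into a queue;
    the back end drains up to BE_WIDTH per cycle. The queue is what lets
    the front end run ahead whenever the back end has spare capacity."""
    queue = deque()
    fetched = completed = 0
    for _ in range(cycles):
        # Back end: issue up to BE_WIDTH queued instructions this cycle.
        for _ in range(min(BE_WIDTH, len(queue))):
            queue.popleft()
            completed += 1
        # Front end: decode up to FE_WIDTH more, if the queue has room.
        grab = min(FE_WIDTH, QUEUE_CAP - len(queue), n_instructions - fetched)
        for i in range(grab):
            queue.append(fetched + i)
        fetched += grab
    return completed

# Sustained throughput settles at the narrower stage: ~3/cycle here.
print(simulate(n_instructions=300, cycles=100))
```

With no stalls modeled, the wider back end simply drains whatever the front end supplies, so the 3-wide decode is the long-run limit.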

------
M8
How soon will we have 16/32-core mobile processors? It would be nice if
something forced the industry to investigate alternative parallel programming
paradigms.

~~~
Scaevolus
Companies aren't going to sacrifice millions of dollars to encourage
development of new parallel programming techniques.

Developers barely know what to do with 4 cores, especially on mobile. The
usual outcome is that foreground apps use a core, maybe two if they have async
rendering, and background apps can run as well.

Mobile OpenCL is the closest we'll come to massively parallel programming in
people's pockets for the foreseeable future.

~~~
madez
It seems like you are unaware of Servo. "We are still evaluating plans to
ship Servo as a standalone product, and are focusing on the mobile and
embedded spaces rather than a full desktop browser experience in the next two
years."
[https://github.com/servo/servo/wiki/Roadmap](https://github.com/servo/servo/wiki/Roadmap)

~~~
wmf
Imagine how much it cost to develop Servo. Now imagine rewriting every mobile
app from scratch using the lessons learned from Servo. The cost would be
astronomical, and for what? Incrementally better battery life?

~~~
gue5t
Most mobile apps are native GUI wrappers around ffmpeg, some http library, and
maybe WebKit or Blink. The high-level coördination will remain relatively
unchanged as the underlying GUI, media, network, and web libraries push up
against Amdahl's law.

~~~
pcwalton
And don't forget the king of multicore-scalable APIs: OpenGL. The reason why
GPU manufacturers have been able to scale so well by adding more cores is that
GLSL provides a programming model that scales broadly to (more or less) any
number of cores, and applications are written to that model. This allows
applications to run unmodified on new hardware with more cores and see
speedups.
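That model, a pure per-element kernel that the runtime may spread over any number of cores, can be sketched in plain Python with `multiprocessing` (a stand-in for illustration, not actual GLSL):

```python
from multiprocessing import Pool, cpu_count

def shade(pixel):
    """A pure 'kernel': the output depends only on this pixel's input.
    Because there is no cross-pixel state, the runtime is free to run it
    on 1 core or 1000 without the program changing."""
    r, g, b = pixel
    return (min(r * 2, 255), g, b)  # e.g. brighten the red channel

if __name__ == "__main__":
    pixels = [(10, 20, 30)] * 1_000
    with Pool(cpu_count()) as pool:  # core count chosen by the runtime
        out = pool.map(shade, pixels)
    print(out[0])  # (20, 20, 30)
```

The application code never names a core count, which is exactly why more cores translate into speedups without a rewrite.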

------
StillBored
A72 is nice, but to really compete in the server market I think they need
another product. One that drops the performance/watt metric for the "as fast
as it goes" metric. That is because while a lot of people are going to want
efficient chips, if ARM really takes off there will definitely be people who
want to share a binary with a CPU that is screaming fast.

~~~
tw04
They will literally NEVER beat Intel at the speed game in servers, and they
know it. If you want blazing fast, you've already got a market leader that has
held the crown for almost their entire existence.

~~~
StillBored
? I'm not sure that is accurate. Before the PPro, there were the RISC vendors,
which were overwhelmingly faster. In the early 2000s there was AMD, which for
about 5 years was faster in nearly every regard. Still today, you have things
like the POWER8, which, while not necessarily faster for every workload, can
definitely beat Intel's offerings for certain workloads. So at the moment it
seems similar to the early '80s, when it wasn't clear cut who had the fastest
CPU because it depended on workload. Hence my expectation that ARM will get
there, if they put in the effort.

------
faragon
Issuing up to 8 instructions per clock (3 in order, 5 OoO) is serious
single-thread performance. It looks like it is going to be much better than
the NVIDIA Denver CPU (also 8 instructions/cycle, but all of them in-order).
It will be scary comparing $20-70 ARM CPUs performing similarly to Intel's in
the $100-200 price range.

------
chisleu
"There's also a reworked 3-way L1 cache that's "almost as powerful as direct-
mapped cache," and a much smaller (~10 percent) and reorganized dispatch
unit."

What? I'm guessing it is a typo for a 2/4-way set-associative cache, or
something, I know not what.

