
Apple's Cyclone Microarchitecture Detailed - pinaceae
http://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed
======
fidotron
Apple almost certainly have a MacBook Air based on one of these chips at the
prototype stage.

There are many reasons in favour, including keeping price negotiations with
Intel interesting, but the main thing preventing them from running with it is
probably an inability to manufacture the chips fast enough, which is a major
concern in mobile land. This is why many Android devices exist in variants
based on different chips: they are hedges by the device manufacturer against
long-term availability problems.

~~~
mx12
I don't think that Apple will completely move away from x86 for a long time.
Attempting to emulate x86 on an ARM would be terribly slow.

I do, however, think that they will eventually include both ARM and x86
processors in the MacBook Air. That way, backwards compatibility is preserved
and low-power apps can run on the ARM. The current MacBook Pros have dynamic
switching of GPUs; there's no reason they couldn't use the ARM as a
coprocessor or even run the full OS on it.

Here are a few technical points:

* LLVM - You can compile once to LLVM IR for both architectures, and then the JIT takes over, compiling for the specific architecture (take a look at LLVM's use in the OpenGL pipeline in OS X); see the sketch after this list.

* Full-screen apps - When an app is full screen (if it's a low-power app), the
x86 could sleep and the ARM could take over running the app.

* App Nap - If all x86 apps are asleep, switch over to running on the ARM
processor exclusively.

* Saving state - It's possible to save the state of apps in OS X; a similar mechanism could be used to migrate seamlessly between processors (sketched at the end of this comment).
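
A minimal sketch of the LLVM point above, assuming apps ship as LLVM bitcode;
the file names and commands in the comment are illustrative, not an actual
OS X mechanism:

    /* square.c - compiled once to bitcode by the developer, then
     * lowered per architecture on the user's machine, e.g.:
     *
     *   clang -O2 -emit-llvm -c square.c -o square.bc   (once, upstream)
     *   llc -march=x86-64 square.bc -o square_x86.s     (on an Intel Mac)
     *   llc -march=arm64  square.bc -o square_arm64.s   (on an ARM Mac)
     *
     * Caveat: the frontend bakes pointer sizes and ABI details into
     * the bitcode, so IR is less target-neutral than this implies. */
    long square(long x) {
        return x * x;
    }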

This is pure speculation, but it is feasible. There would be many technical
challenges for Apple to solve, but they are capable. The advantage Apple has
is absolute control over both platforms.
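
A sketch of the saving-state point, with entirely hypothetical state fields:
if the persisted form uses fixed-width, explicitly laid-out fields (x86-64
and ARMv8 are both little-endian), either CPU can restore what the other
saved:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical app state, kept architecture-neutral: fixed-width
     * fields, no pointers, no padding (4 + 4 + 8 bytes). */
    typedef struct {
        uint32_t version;     /* format version of the saved state */
        uint32_t scroll_pos;  /* example piece of UI state */
        uint64_t doc_cursor;  /* example piece of document state */
    } app_state_t;

    static int save_state(const app_state_t *s, const char *path) {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        int ok = fwrite(s, sizeof *s, 1, f) == 1;  /* both ISAs little-endian */
        fclose(f);
        return ok ? 0 : -1;
    }

    static int load_state(app_state_t *s, const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(void) {
        app_state_t s = { 1, 120, 4096 };
        save_state(&s, "state.bin");         /* written on one CPU...    */
        return load_state(&s, "state.bin");  /* ...restored on the other */
    }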

~~~
noahl
I'm skeptical about moving existing apps seamlessly between x86 and ARM
processors, because you'd need guarantees about process memory layout that I
don't think any current compiler makes. Imagine the memory image of a process
running on the ARM chip. It has some instructions and some data:

    
    
        |        Data          |         ARM instructions       |
    

You could certainly remove the ARM instructions and replace them with x86
instructions. However, the ARM instructions will have hard-coded certain
offsets in the data buffer, like where to look for global variables. You would
have to be sure that the x86 instructions had exactly the same offsets. For
another issue, if the data buffer contains any function pointers, then the x86
and ARM functions had better start at exactly the same offsets. And if there
are any alignment requirements that differ between x86 and ARM (I don't know
if there are), then the data had better be aligned to the less permissive
standard on both chips.
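
To make the function-pointer problem concrete, a small sketch (all names
hypothetical):

    #include <stdio.h>

    /* `handler` lives in the data segment, and its value is a concrete
     * code address chosen by the ARM code generator. Swap the code
     * segment for x86 instructions and this stored address is wrong
     * unless both compilers placed every function identically. */
    typedef void (*handler_t)(void);

    static void on_event(void) { puts("event"); }

    static handler_t handler = on_event;  /* code address baked into data */

    int main(void) {
        handler();  /* jumps to whatever address the data holds */
        return 0;
    }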

None of these problems are impossible to solve. They could be solved easily by
adding a layer of indirection, at the cost of some speed, and then Apple could
go back and do the real difficult-but-fast implementation later.
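
One form that indirection could take, again as a hypothetical sketch:
migratable data stores a stable index rather than a raw code address, and
each architecture's binary supplies its own table, rebuilt at load time and
never migrated:

    #include <stdio.h>

    typedef void (*handler_t)(void);

    static void on_event(void) { puts("event"); }

    /* Per-ISA table: each architecture's binary provides its own. */
    static handler_t handler_table[] = { on_event };

    /* Migratable data holds an index, valid on either architecture. */
    static int handler_idx = 0;

    int main(void) {
        handler_table[handler_idx]();  /* one extra load per call */
        return 0;
    }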

However, why would it? Once its ARM cores are essentially desktop-class,
there's no need for an x86 chip other than compatibility with legacy code.
Looking at Apple's history, it seems pretty clear that it likes to have full
control of its own destiny, and designing its own chips is a logical part of
that, so having its own architecture could be considered a strategic move too.

So given the difficulty of implementing it well, and assuming that Apple
eventually wants to have exclusively Apple-designed ARM chips in all of its
products, if I were in their shoes, I wouldn't bother to make switching work.
I might have a product with both kinds of chips, but I would just have the x86
chip turn on for x86 apps, and off when there were no x86 apps running, and
know that eventually those apps would go away. (And because I'm Apple, I have
no problem pushing vendors to switch to ARM faster than they want to, so this
won't be a long transition.)

However, an even cooler move would be to make LLVM IR the official binary
representation of OS X, and compile it as part of the install step of a new
program. That gives Apple several neat capabilities:

1) They can optimize code for the specific microarchitecture of your computer.
Maybe not a huge deal, but nice. (A sketch of the runtime-dispatch status quo
this would replace follows the list.)

2) They can iterate on their microarchitecture without having to care about
the ISA, because the ISA is an implementation detail. This is the technically
correct thing that everyone should have done years ago (yes, I'm annoyed).

3) They can keep more secrets about their chips. It's obnoxious, but Apple
would probably care about that.
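
For contrast, a sketch of how point 1 is approximated today without
install-time compilation: ship one binary with several code paths and
dispatch at runtime (the builtins below are real GCC/clang x86 facilities;
the kernels are placeholders):

    #include <stdio.h>

    static void kernel_avx2(void)    { puts("AVX2 path"); }
    static void kernel_generic(void) { puts("generic path"); }

    int main(void) {
        __builtin_cpu_init();  /* populate CPU feature info first */
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2();     /* hand-picked fast path */
        else
            kernel_generic();  /* fallback for older cores */
        return 0;
    }

Compiling from IR at install time would let the compiler tune every function
for the exact core, instead of a few hand-chosen hot spots.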

So, there's my transition plan for Apple to move to its own chips. It probably
has many holes, but the biggest one is still the question of what Apple gains
from this. Intel still has the best fabs, and as long as that's true, there
will be some advantage in sticking with them. Whether the advantage is big
enough, I don't know. (And when it ends in a few years, then who knows?)

~~~
jws
Sufficiently old programmers will remember the DEC VAX-to-Alpha binary
translators. When DEC produced the Alpha, you could take existing VAX
binaries, run them through a tool, and have a shiny new Alpha binary ready to
go.¹

Given such a tool, which existed in 1992, it seems simple enough to do the
recompile once on the first launch and cache it. Executable code is a
vanishingly small bit of the disk use of an OS X machine.

Going forward, Apple has long experience with fat binaries for architecture
changes: 68k→PPC, PPC→IA32, IA32→x86-64. I don't think x86-64→ARMv8 is
anything more than a small bump in the road.
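
The existing fat container shows how little machinery that bump needs; a
sketch of the on-disk layout, following the structures in <mach-o/fat.h>:

    #include <stdint.h>

    /* A fat (universal) binary is a header plus one fat_arch record
     * per architecture slice; all fields are big-endian on disk.
     * Adding an ARM64 slice next to x86-64 fits this container as-is. */
    #define FAT_MAGIC 0xcafebabe

    struct fat_header {
        uint32_t magic;      /* FAT_MAGIC */
        uint32_t nfat_arch;  /* number of fat_arch records that follow */
    };

    struct fat_arch {
        int32_t  cputype;    /* e.g. CPU_TYPE_X86_64, CPU_TYPE_ARM64 */
        int32_t  cpusubtype; /* machine specifier within the family */
        uint32_t offset;     /* file offset of this slice */
        uint32_t size;       /* slice size in bytes */
        uint32_t align;      /* slice alignment as a power of 2 */
    };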

As for shipping LLVM IR and letting the machines do the last step, that should
make software developers uncomfortable. Recall that one of the reasons OpenBSD
needs so much money² for their build farm is that they keep a lot of
architectures going, because bugs show up in the different backends. I know I
want to have tested the exact stream of opcodes my customer is going to get.


¹ I think there was also a MIPS to Alpha tool for people coming from that
side.

² In the sense that some people think $20k/yr for electricity is a lot.

~~~
jey
Yep, and Apple has already done dynamic binary translation once before, during
the PPC to x86 switch.

~~~
BruceM
And 68k to PPC.

------
glasshead969
Here is the LLVM commit the article refers to.

[https://github.com/llvm-mirror/llvm/blob/7b837d8c75f78fe55c9...](https://github.com/llvm-mirror/llvm/blob/7b837d8c75f78fe55c9b348b9ec2281169a48d2a/lib/Target/ARM64/ARM64SchedCyclone.td)

~~~
doe88
And for those interested, the whole commit and the corresponding discussion
thread on the llvm-dev mailing list:

[http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/thread...](http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/thread.html#71574)

[https://github.com/llvm-mirror/llvm/commit/7b837d8c75f78fe55...](https://github.com/llvm-mirror/llvm/commit/7b837d8c75f78fe55c9b348b9ec2281169a48d2a)

------
sanxiyn
"Cyclone is a wide machine. It can decode, issue, execute and retire up to 6
instructions/micro-ops per clock."

This is truly amazing. In comparison, the Cortex-A15 can issue 3 instructions
per cycle.

~~~
higherpurpose
Nvidia's Denver is said to be 7-wide. It's also rumored to have 2x the
performance of the Cortex-A15. If that's true, it could reach Sandy Bridge
performance either in its first or second generation (on 16nm FinFET), and
Haswell/Broadwell level by its third generation (most people don't realize
that the performance gain of Haswell/Broadwell over Sandy Bridge is only
around 15-20 percent, since Intel stopped focusing on performance). I also
think that in the 2nd-gen Denver SoC (two process nodes from Tegra K1), Nvidia
will have at least a 1 TFLOPS, possibly 1.2 TFLOPS, GPU (Xbox One level).

~~~
ChuckMcM
It is interesting that in the 'chip wars' everyone went to Intel (or Motorola)
for their chips, which allowed a lot of R&D to be recouped. But with Apple
baking their own chips, what is the market for Nvidia's Denver? It has to
compete with Broadcom, Samsung, and others for a fraction of the Android
device pool? It's challenging to justify the levels of R&D that Apple is
applying without a predictable return.

That said, I'm going to be watching for the Denver chip; it will be an
interesting counterpoint to the A7.

~~~
glasshead969
By the time Denver-based devices are on the market, Apple will be releasing
A8-based devices, so Denver will probably be competing with the A8 rather than
the A7.

------
watersb
Anand also taught us that an iPad draws about as much power as an 11-inch
MacBook Air.

Configure an ARM platform with 8 GB of DDR3 DRAM and a PCIe-based SSD, and you
may well blow the power budget versus the current Intel platform design.

Intel has turned its attention to total platform power: moving the VRMs on-
die, and also identifying third-party motherboard components which draw silly
amounts of power.

I think that Intel is on track here. Apple will of course push their design
talent in order to deliver the best mobile devices they possibly can. But this
does not necessarily mean a move away from Intel for a laptop or desktop.

------
tdicola
Very cool analysis of the new chip. If I were Intel I would be more than a
little concerned about what Apple is up to here. Would be very interesting to
see what Cyclone can do in a proper laptop/desktop with more RAM.

~~~
gatehouse
I'm impressed with Apple's work, and it reaches all the way down to the user
experience (I can play GTA San Andreas on my 5s), but the pace of Intel's
research over the last 10 years has been absolutely demonic. I think for Apple
to compete on that front would require an extreme investment.

For example, see the second chart here: [http://preshing.com/20120208/a-look-back-at-single-threaded-...](http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance/)

~~~
sanxiyn
I guess Apple is plucking low-hanging fruit, but the speedup from the Apple A6
to the Apple A7 (spaced almost exactly a year apart) is about 50% on average,
which corresponds to _more than 2 years_ of speedup in recent Intel cores. I
also note that this speedup was achieved _without increasing the clock_, which
is not the case for Intel.
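
A quick sanity check of that comparison, assuming (as a rough figure) that
recent Intel cores gain about 15-20% in single-threaded performance per year:

    1.15^2 \approx 1.32, \qquad 1.20^2 = 1.44, \qquad \text{both} < 1.50

So a 50% jump in a single year does beat two years of Intel's recent gains,
even compounded.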

~~~
gatehouse
Intel released a P4 @ 3.73GHz in 2004, and the i7-4771 is "only" 3.9GHz in
single-core mode (i.e. turbo), so that's only a slight increase in clock over
the last decade.

Sooner or later, mobile processors are going to hit the power wall. I haven't
looked into this lately, but one thing that could provide a fixed and
permanent benefit to mobile processors is the more modern instruction set.

~~~
sanxiyn
Well, the P4's clock was achieved with a 31-stage pipeline, which was a bad
idea. If you compare with what came after the P4, there was more than a slight
increase in clock.

~~~
zokier
Willamette and Northwood (i.e. the original P4) cores had a 20-stage pipeline,
compared to Nehalem with 24 stages and Sandy, Ivy, and Haswell with 19 stages.

And those Northwood cores reached 3.4GHz, compared to the 3.8GHz of the
infamous Prescott with its 31 stages. That is a fairly meager clock speed
increase, and based on that I'd argue that the extreme pipeline depth was not
the major contributing factor in pushing the clock speed of the P4 higher.

------
jobu
It really seems like Apple is ahead of the curve on mobile R&D. I wonder if
they'll ever consider reselling some of these chips. It's unlikely Steve Jobs
would've ever done it, but I'm not so sure about Tim Cook.

~~~
rsynnott
People mightn't want to buy them if they did. These chips are _big_, which
means that they're expensive to make. Apple can get away with it because their
only costs are manufacture and licensing, and because they have high margins
which can absorb a bit of a hit. If they were selling them, though, they'd
presumably want to make a profit, and that profit would make the end product
almost certainly the most expensive mobile chip on the market.

From a marketing point of view, too, it'd be a hard sell in Android-land.
Apple has been very careful to steer clear of spec-oriented marketing, but can
you imagine the less-sophisticated enthusiast market's response to, say, the
Galaxy S6 using a dual-core 1.3GHz chip instead of a quad-core 2GHz?

~~~
jobu
What about something for the server market? I know some companies have been
switching to ARM chips as a low-power alternative in large server farms. It
seems a 64-bit chip might be a good fit in that market.

------
chucknelson
Pretty impressive that Apple is capable of such CPU disruption with just a few
small acquisitions. Is PA Semi the main reason for this, or could Apple have
been building up a CPU design team for years before that acquisition?

~~~
bryanlarsen
Is a $270 million acquisition a small acquisition these days? Not to mention
the fact that they acquired a team responsible for two large CPU disruptions:
DEC Alpha and StrongARM. A third disruption is perhaps not so surprising.

~~~
chucknelson
Just seems like _billion_ is the new million where acquisitions are concerned,
but yes, I guess it wasn't that small overall. I did not realize PA Semi was
founded by a lead designer at DEC. Thanks for the info; Wikipedia filled in
the rest for me :)

------
stonemetal
_There's little benefit in going substantially wider than Cyclone, but
there's still a ton of room to improve performance._

Didn't AMD and Intel more or less say the same thing about 3-wide, that
there's no real benefit from going wider? Is that because of differences in
microarchitecture, or is it more about getting a little more performance
without having to ramp the clock speed? What makes 6-wide good for ARM but not
x86 or x64?

~~~
fournm
This is probably completely and utterly wrong as it's just a guess, but
potentially the Thumb [1] instructions (a small, limited subset of shorter
instructions in ARM) might allow for a wider setup. Thumb instructions make a
bunch of simplifying assumptions that might remove some of the issues with
going wider. Not that I have any idea whether Cyclone even supports them in
the first place.

[1]
[http://en.wikipedia.org/wiki/ARM_architecture#Thumb](http://en.wikipedia.org/wiki/ARM_architecture#Thumb)

~~~
wtallis
The biggest wart on the ARM instruction set was the predicate attached to
basically every instruction. That's gone in the 64-bit ARM ISA, so there's no
need for a 64-bit Thumb (and there isn't one).

------
leejoramo
On balance, I think that Apple would use most of the energy savings from the
move to 20nm for other things: longer battery life and lower battery weight,
improvements in the camera and wireless, and increased RAM.

~~~
antimagic
In the iPhone, definitely. The iPad already has more than adequate battery
life (most people get several days' use out of one before needing to
recharge), so the extra clock speed could be used to handle multiple apps
on-screen simultaneously (for example).

------
JTenerife
Good on Apple! That's the best that can happen to us customers. Even for an
Android / Windows guy like me :-).

------
szatkus
I wonder if Nvidia's Denver could match this chip.

------
NextUserName
Congrats, Apple, on all your achievements. It is amazing what can be
accomplished when you employ Chinese engineers and manufacturers whose tech
espionage (stealing trade secrets) is one of their best-known traits/assets.

~~~
axman6
Proof? Even informed speculation?

Apple have made several acquisitions in the last few years to give them the
technology they need to make these sorts of developments. Even if some of the
tech had been stolen, it would require a lot of work to put it into practice.

I know I'm feeding the troll, but I couldn't help it; this is just too
ridiculous.

------
towski
What is a tock?

~~~
stephencanon
LMGTFY: [http://en.wikipedia.org/wiki/Intel_Tick-Tock](http://en.wikipedia.org/wiki/Intel_Tick-Tock) =)

~~~
CodeWithCoffee
Apple also uses a similar strategy with the iPhone. The iPhone 4 and 5 both
made major form changes (retina display and screen shape), whereas the 4S and
5S were focused on refinement (a faster CPU and a 64-bit CPU, respectively).
It has also been noted (possibly on the Accidental Tech Podcast, or somewhere
similar) that a 'tick' year for the phone tends to be a 'tock' year for iOS
and vice versa; iOS 5 and iOS 7 (released with the 4S and 5S respectively)
were bigger changes than iOS 4 and 6.

------
raverbashing
I am thinking this is shaping up to be what the G5 aimed to be.

Unfortunately, the ARM architecture, even with these optimizations, is
probably slower "clock for clock" than x86/PPC.

But yes, I think this is something Apple is probably testing (desktop Mac OS X
on ARM).

