
ARM64 and You - zdw
http://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-you.html
======
Tuna-Fish
This is a reasonable, short overview of the programmer-visible side of changes
in A64.

For those interested, there's another side. A64 drops all the features of
the ISA (inline variable shifts, conditional execution, variable-width
instructions) that are hard to implement in a fast, high-power CPU. If a CPU
does not need any 32-bit ARM compatibility, there's no reason one couldn't
make a 4GHz 4-wide superscalar one based on A64.

~~~
ajross
This is true, but that last sentence is a doozy. The AArch64 _ISA_ drops that
stuff. The A7 CPU, which must remain compatible with legacy code, _must not_.
So yes, a theoretical CPU would have a much easier design task, but in the
real world we'll never see one.

And in any case: there's another reasonably well-known company out there with
an even cruftier ISA making 4+GHz 6-wide superscalar cores with backwards 32
(and 16!) bit compatibility modes. Instruction decode is the sort of thing
programmers understand, so we tend to get hung up on it when talking about
CPUs. It's really not a meaningful design limitation to a CPU core implemented
with hundreds of millions of transistors.

~~~
plorkyeran
I would not be surprised if an iPhone without support for 32-bit apps came out
within a few years. Apple has been pretty willing to break existing apps with
iOS updates, so there's no real expectation that something which works today
will continue to work indefinitely with no changes.

~~~
christoph
This.

Apple's track record proves that they are more than happy to drop _anything_
more than a couple of years old if it means general forward progression.

My real bugbear is that to some extent they've created the modern day IE6 by
not updating Safari on the old iPad.

Also, if you look at some legacy iOS apps on the App Store (designed pre 6.x),
they are pretty much 100% broken on 6.x+.

------
mullr
It never occurred to me that they'd be using tagged pointers for Objective-C
runtime stuff. Of course it's obviously a good idea, but only after hearing it
does it become so. Objective-C is always more dynamic than you think it is, so
taking implementation cues from other dynamic language runtimes makes perfect
sense.

It appears that they've been using tagged pointers on the desktop since 10.7,
which I never realized:
[http://objectivistc.tumblr.com/post/7872364181/tagged-
pointe...](http://objectivistc.tumblr.com/post/7872364181/tagged-pointers-and-
fast-pathed-cfnumber-integers-in)
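
For anyone who hasn't met the technique, here is a minimal sketch of pointer tagging, modeled in Python with plain integers standing in for 64-bit words. The tag position, the payload layout, and the function names are assumptions made for illustration, not Apple's actual encoding.

```python
# Toy model of tagged pointers: a 64-bit word is either a real (aligned)
# address or a small integer stored directly in the word itself.
# ASSUMPTION: the low bit is the tag; real runtimes choose their own layout.

TAG_BIT = 0x1             # set => the word holds an inline value, not an address
PAYLOAD_SHIFT = 1         # the payload sits just above the tag bit
WORD_MASK = (1 << 64) - 1


def make_tagged_int(value: int) -> int:
    """Pack a small non-negative integer into a tagged word."""
    if not 0 <= value < (1 << 63):
        raise ValueError("value does not fit in the inline payload")
    return ((value << PAYLOAD_SHIFT) | TAG_BIT) & WORD_MASK


def is_tagged(word: int) -> bool:
    """Aligned heap addresses have a clear low bit, so the tag is unambiguous."""
    return bool(word & TAG_BIT)


def tagged_int_value(word: int) -> int:
    """Recover the integer without dereferencing anything."""
    if not is_tagged(word):
        raise ValueError("word is an ordinary pointer")
    return word >> PAYLOAD_SHIFT


word = make_tagged_int(42)
heap_address = 0x1_0000_4010   # hypothetical 16-byte-aligned allocation
```

The saving comes from the value living in the pointer itself: no allocation, and no memory read to get the number back.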

~~~
roskilli
This was one of the most interesting things the post detailed for me, and it
makes so much sense. We do a substantial number of allocs and reads for
NSDecimalNumbers, etc. in our payment-based app, and I can imagine the heap
savings and memory read/write savings we will get as a result would be of some
significance. Pretty interesting innovation.

~~~
mritun
Tagged pointers have been in use since the late 60s in most garbage collected
languages of the day.

------
simscitizen
One of the biggest impacts of moving to 64-bit is increased memory pressure.
While all of Apple's apps and daemons are running 64-bit, most users will be
actively using third party apps that are 32-bit-only for a while. This means
that on average there is less memory available in the system, because the
amount of RAM is unchanged in the 5s, and there will now be code from both
64-bit and 32-bit binaries resident, rather than just 32-bit binaries.

Apple has done some work to alleviate this extra memory pressure at the kernel
level. grep for WKdm in the xnu sources if you're interested.

------
StephenFalken
It is interesting to watch ARM finally adopting many of the great
architectural solutions that MIPS used 22 years ago, back in 1991, when it
launched the MIPS R4000 family of 64 bit processors. [1]

[1]
[http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_boo...](http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_book_Ed2.pdf)

~~~
bodyfour
The original ARM ISA felt very VAX-inspired to me, such as the elegant (but
ultimately inefficient) use of a general-purpose register for the program
counter.

I've only just started looking at AArch64 but I agree that it feels a lot more
like MIPS though. I think that's a good thing.

------
mistercow
>This allows compiling if statements and similar without requiring branching.
Intended to increase performance, it must have been causing more trouble than
it was worth, as ARM64 eliminates conditional execution.

Probably because so many projects use Thumb (the default for iOS projects in
Xcode, for example) which doesn't include most instructions for conditional
execution. From what I can tell, it also sounds like compilers weren't making
very effective use of those instructions anyway.

Also, these were originally meant to compensate for a lack of branch
prediction, which as I understand it, has changed drastically in recent years.

------
w-m
> With ARM64, there are 32 integer registers, with a dedicated zero register,
> link register, and frame pointer register. One further register is reserved
> for the platform, leaving 28 general purpose integer registers.

but
[http://www.arm.com/files/downloads/ARMv8_Architecture.pdf](http://www.arm.com/files/downloads/ARMv8_Architecture.pdf)
says:

31 general purpose registers accessible at all times

* Improved performance and energy

* General purpose registers are 64-bits wide

* No banking of general purpose registers

* Stack pointer is not a general purpose register

* PC is not a general purpose register

* Additional dedicated zero register available for most instructions

Which one is it?

By the way, the ARMv8 resources are quite interesting overall and a bit more
in-depth than the article.
[http://www.arm.com/products/processors/armv8-architecture.ph...](http://www.arm.com/products/processors/armv8-architecture.php?tab=ARMv8+Resources)

~~~
mikeash
I'm not seeing the conflict between what I wrote and your quote from the
architecture docs. Is it just confusion because the dedicated link register,
frame pointer, and platform-reserved register are part of the ABI rather than
the ISA?

~~~
Scaevolus
There is no dedicated frame pointer-- that would be part of the ABI. The stack
pointer and zero register are both encoded as register 31, and the meaning
depends on context.
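
One way to reconcile the two counts is plain bookkeeping. The sketch below assumes the conventions mentioned in this thread (x29 as frame pointer, x30 as link register, x18 reserved by the platform); the variable names are made up for illustration.

```python
# Register bookkeeping for the AArch64 integer register file.
# ISA level: the register field is 5 bits wide, and encoding 31 means
# "zero register" or "stack pointer" depending on the instruction.
# ABI level (as described in this thread): some of x0..x30 are spoken for.

ENCODINGS = 1 << 5                       # 32 possible register numbers
isa_general_purpose = ENCODINGS - 1      # x0..x30; encoding 31 is ZR/SP

abi_reserved = {
    "x29": "frame pointer",
    "x30": "link register",
    "x18": "platform-reserved register",
}

freely_usable = isa_general_purpose - len(abi_reserved)

print(f"ISA general-purpose registers: {isa_general_purpose}")
print(f"Left over after ABI reservations: {freely_usable}")
```

So "31" counts what the instruction set can name, and "28" counts what is left once the ABI has taken its share.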

~~~
mikeash
My statement encompasses both the ISA and the ABI, which I hope was implied by
my previous comment.... From the perspective of a userland software writer,
there's not much point in trying to distinguish between the two.

~~~
saurik
Is the frame pointer register really "dedicated" by the compiler? (I haven't
had the time to upgrade everything I need to upgrade yet to install the new
Xcode and check this directly.) With 32-bit ARM, there is a register denoted
"fp", but with iOS 2.0 Apple started compiling without a dedicated frame
pointer (instead using that register temporarily for some kind of thread-local
storage variable, before moving that elsewhere and freeing up the register
entirely). I was under the impression that dedicated frame pointers are only
used when you have less-than-awesome compilers that are unable to keep track
of the moving stack pointer target as it performs optimizations.

~~~
brigade
It's "required" by Apple's ABI that the frame pointer always point to a valid
stack frame; just the same as on armv7. It's for debugging and ensuring
backtraces are always valid. Not modifying the current frame pointer complies
with this, which is what -fomit-leaf-frame-pointer does. But it's an ABI
violation to use it as a general-purpose register.

It sounds like you're partially confusing this with r9, which is a completely
different story - r9 was globally reserved by the system until iOS 3.0, then
allowed. The equivalent in arm64 is x18, which again is globally reserved by
the system.

~~~
saurik
Yeah, I got the backstory mixed up with r9, I think because it coincided with
fp being largely renamed to r11 due to iOS using r7 as the frame pointer
instead (which I now remember the patch I had to merge for). Sorry :(.

------
skylan_q
Thank you for the breakdown of how performance is affected with the new
architecture.

I've had a few quibbles about where performance gains would be, and all too
often I was told that the performance increases would be solely realized in
the larger memory addressing space. That just didn't seem right to me.

I really like the use of the otherwise unused space in the 64-bit pointers.

------
thepumpkin1979
"On ARM64, 19 bits of the isa field go to holding the object's reference count
inline." That's really awesome.
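
A rough model of what an inline count looks like at the bit level. This is only a sketch: the 19-bit width comes from the article, but placing the count in the top bits, the sample isa value, and the function names are assumptions.

```python
# Toy model of a retain count packed into a 64-bit isa word.
# ASSUMPTION: the count occupies the top 19 bits; everything below it
# (class pointer, flags) is treated as opaque here.

RC_BITS = 19
RC_SHIFT = 64 - RC_BITS                      # count lives in bits 45..63
RC_ONE = 1 << RC_SHIFT                       # adding this bumps the count by one
RC_MASK = ((1 << RC_BITS) - 1) << RC_SHIFT


def retain_count(isa: int) -> int:
    """Extract the inline count with one mask and one shift."""
    return (isa & RC_MASK) >> RC_SHIFT


def retain(isa: int) -> int:
    """Return the new isa word; refuse to overflow into a wrapped count."""
    if (isa & RC_MASK) == RC_MASK:
        raise OverflowError("inline count is full; a fallback store is needed")
    return isa + RC_ONE


def release(isa: int) -> int:
    """Return the new isa word with the inline count decremented."""
    if (isa & RC_MASK) == 0:
        raise ValueError("inline count is already zero")
    return isa - RC_ONE


isa = 0x0000_0001_0000_4010                  # hypothetical low bits, count = 0
isa = retain(retain(isa))
```

Incrementing is a single add of a constant to the whole word, and the low bits are untouched as long as the count does not overflow.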

------
Scaevolus
I hope by "Perform an atomic store of the new isa value." he means "Perform an
atomic compare-and-set of the new isa value."

A64 doesn't eliminate conditional execution completely. It just pares it down
to the basics: branch (obviously), add/sub, select, compare (for flattening
conditionals like `a && b && c`).

Another thing removed from A32 was the optional shift on operand 2-- which was
taking up 7/32 bits for most instructions.

This has a few more that were missed:
[http://nominolo.blogspot.com/2012/07/arms-new-64-bit-
instruc...](http://nominolo.blogspot.com/2012/07/arms-new-64-bit-instruction-
set.html)

~~~
mikeash
It's not a compare-and-set. Rather, it uses ARM's atomic instructions where
the load creates a reservation on the memory address, and the store succeeds
only if the reservation is still present, with any other stores to that
address (or nearby addresses) breaking the reservation.

You can use this pattern to implement compare-and-set, but you don't need
compare-and-set to use that pattern directly.

Edit: I wasn't sure how to encode this into the steps in the article, so it's
a bit vague on that part. Suggestions welcome.
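
To make the reservation idea concrete, here is a toy, single-threaded model of the pattern. It is not how the hardware or the Objective-C runtime is written: the `Cell` class and its method names are invented for this sketch, and real hardware tracks reservations per core and per address range.

```python
# Toy model of load-linked / store-conditional on a single memory cell.
# ASSUMPTION: one cell, one reservation, no real concurrency.


class Cell:
    def __init__(self, value: int) -> None:
        self.value = value
        self._reserved = False

    def load_linked(self) -> int:
        """Read the value and take out a reservation on the cell."""
        self._reserved = True
        return self.value

    def store_conditional(self, new_value: int) -> bool:
        """Store only if the reservation survived; report success."""
        if not self._reserved:
            return False
        self.value = new_value
        self._reserved = False
        return True

    def plain_store(self, new_value: int) -> None:
        """Any other store to the cell breaks an outstanding reservation."""
        self.value = new_value
        self._reserved = False


def atomic_update(cell: Cell, fn) -> int:
    """Retry loop: reload and recompute until the conditional store lands."""
    while True:
        old = cell.load_linked()
        if cell.store_conditional(fn(old)):
            return old


def compare_and_set(cell: Cell, expected: int, new_value: int) -> bool:
    """Compare-and-set built on top of the same two primitives."""
    while True:
        if cell.load_linked() != expected:
            return False
        if cell.store_conditional(new_value):
            return True
```

`atomic_update` uses the pattern directly, with no comparison against an expected value, while `compare_and_set` shows that the comparison can be layered on top when it is wanted.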

~~~
scott_s
Also commonly called _load-link, store-conditional_ :
[http://en.wikipedia.org/wiki/Load-link/store-
conditional](http://en.wikipedia.org/wiki/Load-link/store-conditional)

------
matthewmacleod
Great write-up, thanks!

I expect we'll see ARMv8 architectures in the next round of flagship phones.
Apple's a little ahead of the curve, but it won't be long till competitors
catch up.

In the context of Apple, it's interesting to think about how they're going to
take this next. ARM process and architecture improvements are likely to lead
to chips with high-enough performance to be used in mainstream desktop
applications – Is it possible we're going to see something like an ARM/x86
dual-processor MacBook platform that allows ARM's low power consumption to
supplement Intel's performance?

~~~
glasshead969
Macs with ARM processors don't seem like a possibility. Intel Haswell
processors are shown to have comparable performance per watt, which is expected
to get better with Broadwell.

ARM64 Apple chips are a play for iPads. Current iPads are lagging on
performance when we compare them to something like a Bay Trail Intel tablet or
a Haswell-equipped Surface tablet. Intel is heading toward a convergence point:
a tablet with Haswell-level performance, a fanless chassis, and a $500 price.
Apple needs to converge there to compete.

------
Pxtl
The bit about memory-mapped files, considering the fact that these devices
aren't using magnetic discs, is something interesting. The conventional file
API of seeking and streams suddenly feels a bit anachronistic. Of course,
flash memory is often optimized for sequential reads, but still - it's far
more amenable to the memory-mapped model than magnetic media ever was.
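
A small sketch of the contrast between the two models, using Python's standard `mmap` module on a throwaway temporary file. The helper names and the file contents are made up for illustration.

```python
import mmap
import os
import tempfile


def read_with_seek(path: str, offset: int, length: int) -> bytes:
    """Stream-style access: position a cursor, then read from it."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_with_mmap(path: str, offset: int, length: int) -> bytes:
    """Mapped access: the file looks like a byte array, so just slice it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


# Write a small demo file, read the same range both ways, then clean up.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header--payload--trailer")
    demo_path = tmp.name

via_seek = read_with_seek(demo_path, 8, 7)
via_mmap = read_with_mmap(demo_path, 8, 7)
os.remove(demo_path)
```

Both calls return the same bytes. The mapped version has no cursor to manage, which is the part that fits random-access storage well.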

~~~
Peaker
Memory mapped files have a problem (that may be less relevant for iOS) with
error reporting.

Explicit APIs can have explicit error codes. Memory accesses don't have much
opportunity to report errors, so have to resort to awful signals and such
(that nobody handles properly).

~~~
mikeash
Apple's NSData API even has a flag just for this, NSDataReadingMappedIfSafe.
Basically, it uses memory mapping if the file is on the root filesystem, and
otherwise just reads it all in conventionally. This is because if you end up
memory mapping a file on a USB stick and the user yanks it, you'll segfault,
and nobody likes a crashing app.

On the subject of memory mapping and magnetic disks, one amusing bit of
history is that GNU's Hurd kernel originally implemented filesystems by memory
mapping _the entire hard drive_ and working from there. This worked fine at
first, but started to cause major trouble when HDs grew beyond 4GB and Hurd
was still running on 32-bit CPUs. I believe they ended up redoing it all
without memory mapping so they could grow beyond that limit.

------
devx
Why didn't ARM call it ARM64? It's hard to believe that it didn't cross their
minds and that they simply decided AArch64 was the better name, so there could
be another reason.

~~~
duskwuff
The "ARM123" naming scheme was used to refer to specific ARM cores prior to
the "Cortex" naming scheme. While "ARM64" isn't ambiguous in and of itself,
it's troublingly close to ARM60, the first ARM CPU with a 32-bit address
space.

------
denim_chicken
I still wonder why in the world Apple went with just 1GB of RAM on the 5s.
Even the Nexus 4 that I bought contract-free for $200 comes with 2GB of RAM.

~~~
runjake
1) Because the iPhone doesn't need 2 GB RAM.

2) They took the money they saved and devoted it elsewhere (perhaps the
Sapphire home button? :)

When it comes down to the bill of materials, every cent really does count when
it scales across several million units sold.

~~~
denim_chicken
The iPhone doesn't need a 64-bit desktop-class processor, either.

I personally think Apple strategically held off the RAM upgrade 'til next
year's iPhone, so that they could have a "killer feature" to lean on if they
don't manage to figure out a more novel^Winnovative one in time.

~~~
runjake
_> The iPhone doesn't need a 64-bit desktop-class processor, either._

It doesn't need it, but it certainly helps, as benchmarks and my own
development of an app that does live video effects has shown.

 _so that they could have a "killer feature" to lean on if they don't manage
to figure out a more novel_

Apple has _never_ marketed, nor revealed (to my knowledge) what amount of RAM
is in an iOS device model. This is always determined later by a 3rd party.

~~~
denim_chicken
The extra RAM also helps. Any good OS will not fail to find a use for extra
RAM.

------
jswanson
From the article:

    
    
      The biggest change is an inline retain count, which eliminates the need to perform a costly hash table lookup for retain and release operations in the common case. Since those operations are so common in most Objective-C code, this is a big win.

------
bnolsen
Only using 33 bits for memory addressing is troublesome. 33 bits is 8GB of RAM,
which is small potatoes for a desktop. Why couldn't they have left it at 38 or
even 40 bits? Or is this limitation only part of the Objective-C runtime?

~~~
mikeash
It comes down to the OS. Basically, when a new process is created, the OS sets
up its address space and decides where it will allow new memory to be mapped.
For whatever reason (I'm not entirely clear just yet), iOS 7 in 64-bit mode
goes for an 8GB address space.

As far as I know, there's nothing preventing that from being increased on
future hardware or even on the 5S with future OS updates. I believe the CPU
itself supports a 48-bit virtual address space.
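
For scale, the arithmetic behind those figures, as a quick sketch. The helper name is made up; the bit widths are the ones mentioned in this thread.

```python
# How much address space a given number of address bits can cover.

GIB = 1 << 30   # bytes in one gibibyte
TIB = 1 << 40   # bytes in one tebibyte


def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses expressible in `bits` bits."""
    return 1 << bits


for bits in (32, 33, 40, 48):
    size = address_space_bytes(bits)
    if size >= TIB:
        print(f"{bits} bits -> {size // TIB} TiB")
    else:
        print(f"{bits} bits -> {size // GIB} GiB")
```

Each extra address bit doubles the space, so the gap between 33 and 48 bits is a factor of 2^15.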

------
revelation
CPython has reference counts as a part of the object in memory. The claims of
"large memory consumption" are nonsense, especially since small integer
objects and strings are aggressively interned.

And increasing just one aligned integer is certainly cheaper than the bit
masking the solution here entails (all of which is neatly hidden away in the
'increment of the _correct_ portion' part).

~~~
mikeash
Remember that this decision was made back when the entire system might have
32MB of RAM. Does CPython even fit in that, as a single process, let alone a
full multitasking UNIX?

Additional RAM consumption has costs of its own, in terms of cache usage.
Adding an extra 8 bytes for every object in the system is not insignificant.
Masking and shifting is extremely cheap.

If you've run the benchmarks and can show your approach is better, by all
means, please share.

------
nickhalfasleep
Anybody know when 64-bit ARM processors might be released as a blade or
mini-server to work with?

~~~
Moto7451
Not sure what the pricing will be like, but AMD is aiming to release their
Cortex A57 based CPUs in 2014[1].

[1][http://techreport.com/news/25338/new-amd-embedded-roadmap-
sh...](http://techreport.com/news/25338/new-amd-embedded-roadmap-shows-64-bit-
arm-cortex-a57-chip)

------
corresation
_First, a note on the name: the official name from ARM is "AArch64", but this
is a silly name that pains me to type. Apple calls it ARM64, and that's what I
will call it too._

What ARM calls ARM related periphery is canonical, whether you think it's
silly or not.

However the overarching entity is called ARMv8, with the 64-bit state called
AArch64 (which can be contrasted with the AArch32 state, which is also a part
of ARMv8) and the instruction set is actually called A64.

~~~
Someone
Not all official names survive a confrontation with reality, where 'easier to
remember' and 'easier to pronounce' have value, too.

Do you use the terms IA-32e and EM64T, too (both are/were Intel's official
names for what people now typically call x64 or x86-64)?

~~~
maaku
EM64T & x86_64 only exist because Intel has too much pride to call it what it
is: AMD64.

~~~
scott_karana
The name "x86_64" existed before AMD fully released their CPUs, and was
latched onto by some of the open-source communities, so that specific one
isn't exactly Intel's. But you're right about EM64T. :)

~~~
ChuckMcM
I do not believe this is true. Do you have a reference that used that name
x86_64 before Intel shipped its first AMD64 compatible CPU?

~~~
scott_karana
Here's a source to back up maggit's Wikipedia citations:

[http://www.amd.com/us/press-
releases/Pages/Press_Release_715...](http://www.amd.com/us/press-
releases/Pages/Press_Release_715.aspx)

Dated 8/10/2000.

~~~
ChuckMcM
Awesome thanks. I stand corrected.

