ARM Goes 64-bit (realworldtech.com)
123 points by enos_feedler on Aug 14, 2012 | 60 comments

It's interesting to see a very x86_64-like attempt to shake off the weirdness of the ancestral architecture here. The PC is no longer an addressable register. Thumb has been dropped. The predication bits are no more. The weird register aliasing thing done by NEON is gone too. The register banking (and it seems most of the interrupt architecture) is entirely different.

And, just like in the Intel world, market pressures have introduced all new CISC quirks: AES and SHA256 instructions, for example.

But of course an architecture document does not a circuit make. All the weirdness (old and new) needs to be supported for compatibility (OK, maybe they can drop Jazelle), so the fact that they no longer talk about some things doesn't really save them any transistors in practice.

Honestly, this is sounding more like an Intel or AMD part, not less.

Is having hardware acceleration for AES and SHA256 really a "CISC quirk", or just a really specialized set of arithmetic instructions? The classic RISC idea of making the core simple and fast doesn't really apply here; internally, it's all simple micro-operations driving special-purpose hardware. It seems similar to having a fused multiply-accumulate operation: they've figured out how to accelerate the core of a common task, and this is the API they've decided to give it.
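To make the "fused operation" analogy concrete, here is a hedged sketch using x86's AES-NI (shown because it is the best-documented example; ARMv8's AESE/AESMC pair is the analogous construct). A single instruction performs a whole AES round that would otherwise take dozens of simple operations:

```asm
; One hardware-accelerated AES round (x86 AES-NI, illustrative syntax).
; aesenc performs ShiftRows, SubBytes, MixColumns, and AddRoundKey at once:
aesenc xmm0, xmm1        ; xmm0 = one AES round of state xmm0 with round key xmm1

; The same round built from "simple" instructions needs table lookups,
; shifts, and XORs -- dozens of instructions per 16-byte block, plus
; cache-timing side channels from the table accesses that the dedicated
; hardware avoids.
```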

That's actually an almost reasonable definition of CISC for laypeople. Take a look at the definition on Wikipedia:

  A complex instruction set computer (CISC, play /ˈsɪsk/)
  is a computer where single instructions can execute several 
  low-level operations (such as a load from memory, an arithmetic 
  operation, and a memory store) and/or are capable of multi-step
  operations or addressing modes within single instructions. 
Now the reality is that, on the whole, things are not quite so cut and dried. In this case they're doing it to give access to dedicated hardware, most likely for power gains, which is why something that's typically close to RISC would add something like that. As time has gone on, both CISC and RISC systems have moved toward a blend of the two in order to get the best of both worlds. From what I've heard, internally most x86 chips actually work like a RISC chip; they just translate between the two in the instruction decoder.

> A complex instruction set computer (CISC, play /ˈsɪsk/) is a computer where single instructions ... and/or are capable of multi-step operations ... within single instructions.

What's a "multi-step operation"?

I ask because I worked on the microarchitecture (read "implementation") of a microprocessor that had what was generally regarded as a very RISC instruction set.

Yet, almost every instruction had multiple steps. Yes, including integer add.

Were we doing something wrong?

And no, "one cycle fundamental operations" doesn't change things. Dividing things into cycles is a design choice. For example, one might reasonably do integer adds in two steps.

It's a fuzzy distinction, but it becomes more clear if you look at the x86 instruction set and its extensions.

A very RISC chip usually just has ADD, OR, AND, LOAD, STORE, etc. But in x86 (CISC) we have things like these:

UNPCKLPS: (sse1) Unpack and Interleave Low Packed Single-FP Values

MOVSHDUP: (sse3) Move Packed Single-FP High and Duplicate

AAM: ASCII Adjust AX After Multiply

If those ops are register-register, how are they necessarily not-RISC?

Yes, division is inherently more complex than bitwise NAND, but it's not obvious to me where the line is that you find so clear.

FWIW, I've seen a very serious architecture proposal that used two instructions for memory-reads. (It had one instruction for memory writes.) Along those lines, register-value fetch can be moved into a separate instruction....

The sse1 instructions provide the option of register-register, but also support register-memory. I didn't realize they supported register-register mode, so now I see why it would be less obvious to you.

Why is copying a value from register to memory (or memory to register) "RISC" while performing some logical operation on the value as it moves "not RISC"?

I'd agree that memory to memory is "not risc", but given the amount of work necessary to do a register access, it's unclear why doing work on a value is "not risc".

Datapaths are NOT the complex part of a microprocessor.

Mmm, I think RISC has to be a relative term. (It does, after all, have "reduced" in its name, which implies a comparison with a less-reduced alternative.) So every time processor A has an instruction that can only be done with a sequence of several instructions on processor B, that is evidence that B is more RISCy than A.

One definition of RISC is that every instruction should take one cycle and thus any instruction that takes longer is CISC. This led to MIPS not having multiply, for example.

A different definition is that RISC should not have any instructions that could be just as efficiently broken into multiple simpler general-purpose instructions. For example, a memory-register architecture can do a load-and-add in one instruction but RISC prefers separate load and add instructions that take the same time. In this view AES instructions are justified as RISC because implementing an AES round with multiple simple instructions is much slower (6x in Intel's case).
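That memory-register vs. load/store distinction can be sketched in (illustrative, not exact) assembly:

```asm
; Memory-register form (CISC-style): one instruction loads and adds
add eax, [rbx]           ; read memory at rbx, add the value to eax

; Load/store form (RISC-style): the same work as two single-purpose instructions
ldr r1, [r2]             ; load from memory into a register
add r0, r0, r1           ; register-register add
```

On a pipelined implementation the two-instruction sequence can take the same time, which is why this definition treats the fused version as unnecessary complexity.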

> One definition of RISC is that every instruction should take one cycle and thus any instruction that takes longer is CISC. This led to MIPS not having multiply, for example.

Err, if that was ever really a "RISC" thing, it got dropped quickly. I'm not even sure it's possible to create a sane architecture that runs one cycle per instruction: you need two clock edges just to load and store data from registers, let alone operating on the data. However, optimizing the pipeline so instructions are effectively one cycle makes sense; only one memory cycle per instruction makes sense.

Making AES and SHA instructions doesn't really cohere with any definition of RISC I've ever seen: mostly, you want as few instructions as possible because you don't have many opcodes to work with in fixed-size instructions. However, I'm also not opposed to these instructions out of some dogmatic belief: I think encryption is important enough these days to be optimized to the greatest possible extent without sacrificing general-purpose functionality.

The RISC vs CISC debate has been dead for years. Doubly so ever since we found the limits of scaling clock frequencies ever higher. After all, the RISC movement started as a reaction to the difficulties of scaling the architectures of the day to faster clock frequencies. Now (for decades, really) CPU designers have been concentrating on doing more work per clock cycle, which is rather anti-RISC. So the only questions that matter are "can we implement this feature efficiently?" and "does this feature provide enough performance or power gain for the implementation cost?"

I don't think it's quite dead yet; the performance/power hit for decoding x86-64 instructions is significant, just to decode to a RISC-like microcode anyway. However, that may be more of a statement about x86-64 than it is about CISC in general. Certainly, the days when CISC made any sense at all, mainly to ease assembly programming, are long gone; remember the 8080's string instructions? Yeah, neither does anyone else.

However - I think that x86 is so deeply entrenched, and x86 processors are so refined these days, that the value of the architecture is in the software and the investment in the chip design, not in the architecture itself. I think if the PC industry were to start over again, it would go with some kind of POWER variant.

Regardless of CISC vs RISC, I do agree - SIMD and many-core/stream multiprocessing will make far more difference than the instruction and register flavor used on each core.

Well, the fact that x86 encoding is suboptimal is also a dead debate. If AMD had had the resources of Intel, or if Intel hadn't botched IA-64 so badly and actually licensed it to AMD, x86-64 would have better instruction encoding, no question. (seriously, like all of the unused/slow instructions have 1 byte opcodes)

Anyway, my point is that pure CISC designs (as much as that means anything) obviously lost ages ago. Pure RISC also lost as frequencies plateaued, or perhaps more accurately never really won; CPU designers care about what makes CPUs more performant, not abstract ideology. So we get stuff that runs counter to RISC ideals: SIMD, VLIW, out-of-order execution, and highly specialized instructions like AES and conditionals.

Yes, I agree wholeheartedly. I still think RISC and CISC have value as terms, however vague, because they succinctly summarize trade-offs. I fully realize that today's processors are hybrids of many techniques, and that's a good thing.

>the performance/power hit for decoding x86-64 instructions is significant, just to decode to a RISC-like microcode anyway. However, that may be more of a statement about x86-64 than it is about CISC in general. Certainly, the days when CISC made any sense at all, mainly to ease assembly programming, is long gone

CISC still has an advantage in that it effectively compresses your instruction stream, meaning you can fit more in cache.
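A rough sketch of the density argument (the byte counts are illustrative, for one particularly favorable case):

```asm
; x86 memory-operand add: 2 bytes of machine code
add eax, [rbx]           ; opcode + ModRM byte = 2 bytes

; Equivalent fixed-width 32-bit RISC sequence: 8 bytes
ldr r1, [r2]             ; 4 bytes
add r0, r0, r1           ; 4 bytes
; Same work, up to several times the instruction-cache footprint.
```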

Small question: did the 8080 have string instructions? I know the Z-80 did, not sure about the 8080.

You just described a CISC op.

Most likely for energy-efficiency reasons.

Yes, it's too bad cleaning up the architecture doesn't necessarily clean up the physical design. As GPUs have been more recent entrants to the general-purpose space, it is clear they are trying to avoid the same mistakes. The only place you will find a true GPU binary is buried deep in the memory of the runtime stack (for NVIDIA at least; not sure about AMD).

Right. My understanding is that NVIDIA has mucked around with their low level instructions at every iteration. I remember reading somewhere that with Kepler the hardware doesn't even have dependency interlocks -- the compiler is responsible for scheduling instructions such that they don't use results that aren't ready yet.

But at the same time the lack of a clear specification and backwards compatibility means that the software stack needs to deal with all new bugs (both hardware and software) at every iteration. That puts an, IMHO, pretty firm cap on the "asymptotic quality" of the stack -- you're constantly chasing bugs until the new version comes out. So you'll never see a GPU toolchain of the quality we expect from gcc (or LLVM, though that isn't quite as mature).

> The one surprise in ARMv8, is the omission of any explicit support for multi-threading. Nearly every other major architecture, x86, MIPS, SPARC, and Power has support for multi-threading and at least one or two multi-threaded implementations.

What does this even mean? Are they talking about atomic operations? Hyperthreading?

They're talking about stuff like the MIPS MT-ASE for instance.


I don't see how it could be about hyperthreading, since that's a CPU implementation detail and mostly unrelated to the instruction set. Maybe it's referring to specifying memory consistency behavior and support.

I think they are talking in the hardware sense: hyperthreading/SMT

How much does that actually help? In my extremely fuzzy memory, it only worked out to around a 30% increase in ideal situations. I'd rather see them work on features that can be exploited with less voodoo.... like hardware 64-bit support, or SIMD support, or HTM, or hell, clock rate.

Intel HT [1] originally was like that (if your code runs in 1.0s single-threaded, ideally it will run in ~0.77s multi-threaded).

The main problem with hyperthreading is that each CPU generation has been so different and software's only decision is in binding to unique cores and hoping the performance is better. AMD's Bulldozer hasn't helped either.

On the other hand, most of Intel's big markets all tend to use pretty inefficient code (very low IPC), and that's where HT makes a lot of sense. ARM cores are typically running a pretty tight ship. So it makes me laugh when I see Atom includes HT.

Intel, clearly, would dispute my claims.

[1] http://en.wikipedia.org/wiki/Hyper-threading

I figured Atom had hyperthreading because it was Intel's first in-order x86 core in over a decade, so compilers had forgotten how to schedule x86 code, so there were lots of stalls in the ALUs that a second thread could make good use of. Plus scheduling for Atom is pretty hard in part due to the lack of registers in x86.

Additionally, Ars argues [1] that from a performance per watt perspective, hyperthreading makes more sense with x86 and two cores makes more sense with ARM

[1] http://arstechnica.com/gadgets/2008/05/risc-vs-cisc-mobile-e...

> Intel, clearly, would dispute my claims.

Remember that it's peoples' perception of products, not reality, which makes money.

This is all very interesting. I'm going to have to break out my Hennessy & Patterson and get back into hardware.

You want to read Agner Fog's article, How good is hyperthreading?, http://www.agner.org/optimize/blog/read.php?i=6. As an aside, Agner is one of those rare people who only writes when he has extremely valuable things to say. His entire website is worth a read.

Does he actually answer that question? I read the main post, which seemed to conclude if it's good, it's good, and if it's bad, it's bad. Then there's some replies, and finally a single (negative) number presented for the Rybka chess engine. What about programs that aren't chess engines?

But it's 30% you get for basically free. I kind of thought HT was mostly a gimmick (look, now with 256 virtual CPUs), but changed my mind since it doesn't cost anything (in terms of die space) to add it to a chip. 30% more performance for 1% more cost is a better deal than 100% more performance for 100% more cost, assuming you can live with only 30% more performance.

I should add I think what AMD is doing with Bulldozer (claiming two virtual cores are actually full cores) is bullshit.

> I should add I think what AMD is doing with Bulldozer (claiming two virtual cores are actually full cores) is bullshit.

I think AMD is doing whatever it can to get people to buy its CPUs. If it weren't for their ATI purchase, I think they'd be basically dead by now. It still amazes me how far they've fallen: I built my first computer with an AMD X2 when I was 15 (6 years ago now) - they looked like they were going to upset Intel as deciding the future of x86 chips. They did for a while - we got a sane 64-bit architecture out of it. I'm not sure where they went wrong: was it marketing, was it manufacturing tech, was it profit margins, was it Apple? I don't even know if their current processors are competitive or not in the performance market - things like "Bulldozer" make me think not.

Anyway, could SMT be implemented on top of ARM v8? My knowledge of hardware doesn't include multithreading. However, from my limited understanding of it, I don't see SMT making much difference in tight RISC code, which is designed to have a high instruction throughput per cycle, leaving little for instruction reordering to optimize.

One way to think of SMT is context switches for free, and lots of them. What happens when you run two processes on one core? Every 10ms the kernel copies out all the registers from one process to memory, copies in the regs for the other, and switches. What happens when you use SMT? Every "2" instructions the CPU switches from one process to the other, transparently, without hitting memory. After 20ms, the same amount of work is done, possibly a little more, and if process two only had 1ms of work to do, it doesn't have to wait the full 10ms timeslice of process one.

SMT is not about instruction reordering at all (within one process). Just like the OS switches between processes whenever you wait for disk, now the CPU switches processes whenever you wait for memory. It just happens that virtual cores are the way the OS programs the CPU scheduler.
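A very rough pseudocode sketch of what a core does under SMT (greatly simplified; real designs differ in fetch policy and in which resources are shared):

```
each cycle:
    for each hardware thread T:          # both threads' registers live on-chip
        if T is not stalled (e.g. waiting on memory):
            fetch the next instructions from T
            issue them to the shared execution units

# "Switching" between threads costs nothing, unlike an OS context
# switch, which must save and restore registers through memory.
```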

SMT generally doesn't require any ISA support, which is why it's confusing that Kanter would mention it.

Also, a lot of code (think pointer-chasing) can never be made "tight".

It's true that AMD is being disingenuous with Bulldozer, but on the other hand their SMT threads share fewer resources than other implementations (they have separate integer execution units, for example, which makes them much closer to "full cores").

The thing is I don't think it adds 30%, but more like 10%.

The tests I've seen only showed a 10% improvement on average with hyperthreading. I don't think it was worth it for ARM.

A question about how HN works. I'd submitted the same article to HN at a much less opportune time, so it fell off the "new" page before it got its first upvotes. [1]

Normally when somebody then resubmits the same article at a better time I thought they had to add a '~' at the end of the URL or something, but I don't see anything like that in this case. So how'd they do it?

(And I should say I'm glad that you all get to see this article, so thank you enos_feedler).


Note that the article you submitted has a trailing /, this submission doesn't and so was not detected by HN as a duplicate story - or so it seems.

I wonder if it is possible to make the CPU 64-bit only, and how much die space/power it would save by not including the 32-bit cruft.

I'd also be curious whether it's possible to translate 32-bit ARM binaries into 64-bit ones in software while retaining comparable performance.

Honest question since I'm confused about terminology: Why is ARMv8 described in places as "backwards compatibility for existing 32-bit software" when some existing instructions will be removed in AArch64?

Because the ARM front end is so simple, it isn't hard to have multiple ones. One of these will be able to run existing 32-bit software.

It looks like an interesting article, so it is a shame that it was split into 5 pages with no way to view everything on one page. I have no recourse but to not read the article at all.

> I have no recourse but to not read the article at all

You could click the next button 4 times and read the full article. There's lots of content on each page. It would've taken 100% less typing than this complaint, and you would've spent that time learning instead of grumbling.

It's a real shame you can't read books either. Whole libraries of documents split into pages with no "view all" button.

There's a big difference between turning a page the size of your hand when you are already holding the book and trying to click a micro-button the size of a word when you're used to scrolling with the arrow keys.

I'll just wait until the exact same information appears on a single page. I was expressing sincere regret because I liked the first page, but I absolutely will not read paginated articles.

Also: you're obviously irritated by my grumbling, but grumbling about it is just a massive load of hypocrisy, so please realize I'm not going to be taking any of your comments all that seriously.

I agree with your sentiment. Hence:

  ~ $ curl -s 'http://www.realworldtech.com/arm64/'{1..5}'/' --compressed > a.html; open a.html
("open" is OS X-specific; "nautilus-open" might have a similar function on Linux or something.) Interestingly, that website seems to deliver gzip-compressed output no matter what you request.

I'm ashamed I didn't think of this :) Thanks, it worked exactly as expected.

xdg-open for any FreeDesktop-compliant system (basically any Linux since about 2000).

That is clever. What is the {1..5} syntax called? I am trying to figure out what the zsh equivalent is.

Relevant terms are "brace expansion" and "range". And, um, at least for me, the command I wrote works verbatim in zsh. (I think zsh is supposed to be bash-compatible like that.) Brace expansion works like this (in zsh and bash):

  % echo {1..5}
  1 2 3 4 5
  % echo meh{1..5}
  meh1 meh2 meh3 meh4 meh5
  % echo {1..5}{1..5}
  11 12 13 14 15 21 22 23 24 25 31 32 33 34 35 41 42 43 44 45 51 52 53 54 55
  % echo {1,2,4}{1,3,9}
  11 13 19 21 23 29 41 43 49
I have observed one difference in brace expansion: {a..f} -> "a b c d e f" in bash, but "{a..f}" in zsh. Curious. Oh well.

In Chrome: typing Ctrl+F <space>2<space> <Esc> <Enter> will get you to the next page, no need to use the mouse if you're concerned about moving your hands.

Although I ended up reading the 'curl'ed version, I wanted to say this is also a decent solution. Cheers

For a book there's a physical need to separate it into pages; it would be worse to read one long paper scroll.

But webpages are easiest to read as one long article; they just split them up to inflate their page views.

The article is hyperthreaded so you can read all five pages at once if your brain supports that kind of instruction set.


too bad there's a 'fence' missing between each page

Or you'll just reread every page until everything makes sense since you can read things very fast for marketing purposes

If only :/ I don't even have an FPU.


It scrapes down and combines multi-page articles like this with a click for on or offline reading. Great interface and mobile apps too, I use it all the time.

I didn't know it did multiple pages. That's a great feature.
