The ARM, the PPC, the x86, and the iPad (neosmart.net)
62 points by ComputerGuru on April 3, 2010 | 40 comments



Article:

    In fact, it’s now a universally accepted truth that RISC
    is better than CISC! Actually, because of how much more
    efficient RISC machines are than their CISC counterparts,
    most CISC CPUs convert their CISC instructions into RISC
    instructions internally, then run them!
Most RISC CPUs convert their RISC instructions into RISC internally, then run them! Look at the G5 (IBM PowerPC 970): it does register renaming, and splits and groups its instructions for out-of-order execution, just like any out-of-order CISC processor.

The truth is that instruction sets are always outdated compared to the number of transistors that Moore's law allows us to put on a chip. At one time it seemed that 32 registers would make renaming unnecessary. At another it seemed VLIW would make possible a level of performance not allowed by older instruction sets. Ask Intel and HP how that transition worked out for them.

RISC principles are only superior in the sense that 70's crazy haircuts are superior to 60's crazy haircuts. They are dated too, just a little less so. And in these days of memory-bound computations, higher density in CISC instruction sets seems to give them a slight advantage, if anything.


> Most RISC CPUs convert their RISC instructions into RISC internally, then run them!

A RISC processor is a processor with no microcode virtual machine level. These processors aren't RISC in anyone's view but marketing's. Likely, there are very few true RISC processors still being designed (for speed, that is; embedded processors are a different story).


I think one thing that ought to be dealt with first before any RISC-CISC debate is what exactly defines either one. Some attributes commonly identified (in my experience) with RISC machines include:

- Fixed-width instructions (generally 32 bits)

- Explicit load and store instructions with arithmetic being performed only between registers (hence RISC architectures sometimes being called load-store)

- Large register files with few or no special-purpose registers (MIPS HI/LO registers being an exception to this one, for example)

Other common (though perhaps less "defining") traits: three-operand instructions, relatively few/simple addressing modes, procedure calls often done via "branch-and-link" instructions, etc.
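
To make the fixed-width, three-operand, load-store combination above concrete, here's a toy sketch. The field layout and opcode numbers are made up (loosely MIPS/RISC-V flavored), not any real machine's encoding:

    # Toy encoder/decoder for a fixed-width, three-operand, load-store style ISA.
    # The bit layout and opcode numbers below are hypothetical, for illustration only.

    OPCODES = {"add": 0x01, "sub": 0x02, "lw": 0x10, "sw": 0x11}   # made-up opcodes

    def encode(op, rd, rs1, rs2):
        """Pack into exactly 32 bits: 6-bit opcode + three 5-bit register fields."""
        assert all(0 <= r < 32 for r in (rd, rs1, rs2)), "32 general-purpose registers"
        return (OPCODES[op] << 26) | (rd << 21) | (rs1 << 16) | (rs2 << 11)

    def decode(word):
        """Every instruction decodes the same way: no variable-length parsing."""
        op = {v: k for k, v in OPCODES.items()}[word >> 26]
        return op, (word >> 21) & 31, (word >> 16) & 31, (word >> 11) & 31

    # Only lw/sw touch memory (offsets omitted for brevity); arithmetic is register-only.
    prog = [
        encode("lw",  1, 10, 0),    # r1 <- mem[r10]
        encode("lw",  2, 11, 0),    # r2 <- mem[r11]
        encode("add", 3,  1, 2),    # r3 <- r1 + r2
        encode("sw",  3, 12, 0),    # mem[r12] <- r3
    ]
    assert decode(encode("add", 3, 1, 2)) == ("add", 3, 1, 2)
    print([f"{w:08x}" for w in prog])    # each instruction is exactly 4 bytes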

But...

> A RISC processor is a processor with no microcode virtual machine level.

Huh? I've certainly never heard that before. In fact, I'd say the RISC/CISC dichotomy is primarily (if not entirely) an attribute of the ISA, not the microarchitecture implementing it. It's generally pretty easy to look at e.g. a PPC or x86 instruction set and classify it one way or another; in terms of internal implementations, though, the lines have gotten so blurred (in both directions) in modern machines that I don't think it's really meaningful to talk about RISC vs CISC anymore at the microarchitectural level, frankly.


> I've certainly never heard that before.

It's because we've adopted (or rather, co-opted) the terms to refer to things that shared phenotypical traits with their progenitors, but no longer held true to the original definitions. In reality, RISC originally just meant "exposes its microarchitecture as its instruction-set architecture." All the other well-known properties of RISC machines were effects of this decision. But these days,

> the lines have gotten so blurred (in both directions) in modern machines

...that, like I said, there are very few RISC processors under the original, theoretical definition of the term (and it's probably alright to just use "RISC" under the new definition, since only the embedded programmers will complain.)


The article is a little out of date. The RISC vs CISC distinction is very blurry these days. Most RISC architectures have been made more CISCy, and vice versa.

The article mentions CISC CPUs using RISC instructions internally, which isn't the whole story anymore.

Ars Technica did a classic article on this debate (in 1999!): http://arstechnica.com/cpu/4q99/risc-cisc/rvc-1.html and a follow up here: http://arstechnica.com/hardware/news/2009/09/retrospect-and-...

A particularly interesting part of the follow up: "But a funny thing happened with the Pentium M: processor designers discovered that processors of all types are actually more power-efficient if their internal instruction format is more complex, compound, and varied than it is with simple, atomic RISC operations ... ... The end result is that even RISC processors needed to get more CISC-y on the inside if they wanted to juggle the largest number of in-flight instructions using the least amount of power."
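
A schematic sketch of the "fewer things to juggle" argument in that quote. The numbers and the idea of a fixed-size tracking window are illustrative, not modeled on any particular core:

    # Count the in-flight tracking entries needed for the same work when ops are
    # tracked as separate atomic micro-ops vs. as fused compound ops. All numbers
    # here are invented for illustration.

    ROB_ENTRIES = 64                                         # hypothetical capacity

    atomic_body = ["load", "add", "store", "cmp", "branch"]  # 5 tracked entries/iteration
    fused_body  = ["load-add-store", "cmp-branch"]           # 2 tracked entries/iteration

    print("iterations in flight (atomic):", ROB_ENTRIES // len(atomic_body))  # 12
    print("iterations in flight (fused): ", ROB_ENTRIES // len(fused_body))   # 32
    # More useful work per tracked entry means either more parallelism from the same
    # structures, or the same parallelism from smaller, lower-power ones.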


Thanks for the link, it was an interesting read!


The RISC/CISC thing is a little oversimplified. One reason RISC has never caught on in the desktop is that memory speed hasn't kept up with CPU speed (and can't, given the laws of physics). So if a RISC cpu takes 10 instructions to do what a CISC can do in 1, it loses any speed advantage if it takes 10x as long to get the next instruction from memory.

The principal reason people use ARM is low power. Part of its low power comes from the RISC design, but it's not as simple as that. To reach the same overall performance as an x86, a RISC chip may have to use more power, simply because power increases faster than clock frequency.
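
A back-of-the-envelope sketch of that last point: dynamic power scales roughly as C*V^2*f, and reaching a higher frequency usually also needs a higher supply voltage. The numbers below are made up for illustration, not measurements of any real chip:

    # Why "power increases faster than clock frequency": dynamic power ~ C * V^2 * f,
    # and hitting a higher f typically requires raising V as well. Illustrative numbers.

    def dynamic_power(capacitance, voltage, frequency_ghz):
        return capacitance * voltage**2 * frequency_ghz

    C = 1.0                                                  # arbitrary units
    low  = dynamic_power(C, voltage=0.9, frequency_ghz=1.0)
    high = dynamic_power(C, voltage=1.2, frequency_ghz=2.0)

    print(f"frequency ratio: 2.0x, power ratio: {high/low:.1f}x")
    # -> power ratio ~3.6x for a 2x clock increase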


The difference in RISC/CISC instruction count is closer to 2:1 than 10:1. (Unless you are using a VAX polynomial evaluation opcode, but that is an extreme.)

ARM ameliorates this by having multiple instruction sets. The Thumb instructions are a denser encoding, if somewhat slower. The 90/10 rule applies.


Thumb instructions are the same speed, but can only perform a subset of what the ARM instruction set can; each instruction takes half the space. Thumb can be faster if it keeps code from overflowing the instruction cache, but can also be a lot slower if faster ARM instructions have to be emulated with Thumb equivalents.
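
Rough arithmetic for that trade-off. The instruction counts are invented; only the encoding widths (4 bytes for ARM, 2 bytes for classic Thumb) reflect the actual ISAs:

    # Illustrative code-size arithmetic for ARM vs. Thumb. Counts are hypothetical.

    arm_instructions   = 1000
    thumb_instructions = 1150      # assumed: some ops need extra Thumb instructions

    arm_bytes   = arm_instructions * 4     # fixed 32-bit ARM encoding
    thumb_bytes = thumb_instructions * 2   # fixed 16-bit classic Thumb encoding

    print(f"ARM: {arm_bytes} bytes, Thumb: {thumb_bytes} bytes "
          f"({100 * (1 - thumb_bytes / arm_bytes):.0f}% smaller)")
    # Smaller code fits the I-cache better (the win), but the extra instructions cost
    # cycles where Thumb has to synthesize what ARM does directly.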


> it loses any speed advantage if it takes 10x as long to get the next instruction from memory.

I'm just thinking out loud... but what if instructions in memory were simply compressed, and the control unit's decode step were a decompression algorithm, rather than lots of opcode-specific lookups? It would still be a RISC processor, basically, just with a decompression coprocessor.


Compression wouldn't be much use if it were applied one opcode at a time, so I suppose you'd have to either read the code one block at a time, which could make jumps very slow, or the compiler and instruction decoder would have to do somewhat crazy stuff to turn code paths into compressed blocks.


Main memory is already read a block at a time anyway, to get the gains we all expect for space locality. I'm imagining the blocks (probably equivalent to memory pages, in practice) would be kept uncompressed in L1/2 cache memory, with an additional layer of cache added on top for compressed blocks. Then, a near jump would be a read on a low-cache hit, and a decode on a high-cache hit, while a long jump would be a page-fault+decode as usual.
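
For what it's worth, the thought experiment is easy to mock up in software, with zlib standing in for whatever the hypothetical hardware decompressor would do; the page size and cache structure are arbitrary choices for illustration:

    # Software mock-up of the "compressed code pages + decompressed-block cache" idea.
    import zlib

    PAGE_SIZE = 4096
    decompressed_cache = {}            # page number -> decompressed bytes ("high" cache)

    def store_program(code: bytes):
        """Split code into pages and keep them compressed, as they'd sit in memory."""
        pages = [code[i:i + PAGE_SIZE] for i in range(0, len(code), PAGE_SIZE)]
        return [zlib.compress(p) for p in pages]

    def fetch(compressed_pages, address):
        """Near jumps hit the decompressed cache; far jumps pay for a decompression."""
        page_no, offset = divmod(address, PAGE_SIZE)
        if page_no not in decompressed_cache:          # miss: decompress the whole block
            decompressed_cache[page_no] = zlib.decompress(compressed_pages[page_no])
        return decompressed_cache[page_no][offset]

    program = bytes(range(256)) * 64                   # 16 KiB of fake "code"
    pages = store_program(program)
    print("compressed size:", sum(len(p) for p in pages), "of", len(program), "bytes")
    print("fetched byte:", fetch(pages, 5000))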


Kind of, but not exactly.

The amount you'd need to increase the clock of a RISC CPU to get similar performance to desktop CISC CPUs is a lot less than the 2.4GHz you're currently running.

The shorter pipelines and simpler cycles mean that you don't need to make the CPU crazy-fast to boost performance as much as you would with CISC. Intel has improved on that starting with the Core CPUs, but it's still not as good as a RISC design. The really deep pipeline in the P4 series was a killer: the cores were churning at 3.6GHz and still not getting much work done. The "density of work" in a RISC cycle is (was) much higher than in CISC, and that really does help keep power consumption down to a minimum.


Hence the advantages of a CISC frontend and a RISC backend, as x86 has evolved to. An x86 might pull one instruction from memory, translate it into 10 backend operations, and get the best of both worlds.
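
A rough sketch of what that frontend split looks like. The micro-op names and the 1:3 cracking are hypothetical; real decoders typically emit far fewer micro-ops for common instructions, and their internal formats aren't public in this form:

    # Hypothetical sketch of a CISC frontend cracking a memory-operand instruction
    # into load-store-style backend micro-ops.

    def decode(instruction: str):
        """Crack 'add [mem], reg' style instructions into simple backend operations."""
        op, dst, src = instruction.replace(",", "").split()
        if dst.startswith("["):                       # destination is a memory operand
            addr = dst.strip("[]")
            return [
                f"uop.load   tmp0, {addr}",           # read the memory operand
                f"uop.{op}   tmp0, tmp0, {src}",      # do the arithmetic on registers
                f"uop.store  {addr}, tmp0",           # write the result back
            ]
        return [f"uop.{op}   {dst}, {dst}, {src}"]    # register form needs no cracking

    for line in decode("add [rbx], rax"):
        print(line)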


… at the expense of extra computation and power consumption. And I think the current CISC backends would be better called VLIW.

There's more than one way to skin a cat, and people can talk all day about what to call each one.


Absolutely. It's a real shame AMD beat Intel to the x64 punch, or we'd all be running Itanium today. VLIW nevermore....


No, the problem with Itanium is/was that no one knew how to write good compilers for it. Also, VLIW is incredibly close to the hardware and thus bound to be outdated very quickly. It didn't help that IA64 uses about 40 bits per instruction and thus has the lowest instructions-per-byte of all the mainstream architectures.
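
The density arithmetic, for reference. The IA-64 bundle layout (three 41-bit slots plus a 5-bit template in a 128-bit bundle) is real; the x86 average instruction length below is a rough ballpark assumption, not a measurement:

    # Code-density arithmetic behind the "fewest instructions per byte" point.
    ia64_bytes_per_insn  = (3 * 41 + 5) / 8 / 3     # 16-byte bundle holds 3 instructions
    arm_bytes_per_insn   = 4                        # fixed 32-bit encoding
    thumb_bytes_per_insn = 2                        # fixed 16-bit encoding (classic Thumb)
    x86_bytes_per_insn   = 3.5                      # assumed typical average, varies by code

    for name, size in [("IA-64", ia64_bytes_per_insn), ("x86 (approx.)", x86_bytes_per_insn),
                       ("ARM", arm_bytes_per_insn), ("Thumb", thumb_bytes_per_insn)]:
        print(f"{name:14s} ~{size:.2f} bytes/instruction")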

Transmeta had the right idea: they did the x86-to-VLIW translation dynamically, at runtime and in software, coupled with proper code caches. But it seems they were too early -- the market wasn't ready for them.


> So if a RISC cpu takes 10 instructions to do what a CISC can do in 1, it loses any speed advantage if it takes 10x as long to get the next instruction from memory.

99% of the time instructions come from the instruction cache, not from memory. And as jws said, it's not 10:1.
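
To put rough numbers on that. The latencies and the 99% hit rate here are assumed round numbers for illustration, not measurements:

    # Back-of-the-envelope effective fetch cost, showing why I-cache hit rates dominate.
    hit_rate       = 0.99
    cache_latency  = 2       # cycles, assumed
    memory_latency = 200     # cycles, assumed

    effective = hit_rate * cache_latency + (1 - hit_rate) * memory_latency
    print(f"average fetch cost: {effective:.1f} cycles")   # ~4 cycles, nowhere near 200
    # So an instruction-count difference matters far more through cache footprint
    # than through raw trips to main memory.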


Yes, and in practice ARM isn't really RISC and x86 isn't really CISC; with VLIW, pipelines, and caches it's all more complex.

But the original RISC research was done at a time when neither CPU clocks nor memory bandwidth was anywhere near physical limits.

It's not the "RISC (Apple) is clever and CISC (Intel) is a dumb dinosaur" message the article is aiming at.


One point of view is that we are pretty close to optimal. Here is another perspective from:

    http://queue.acm.org/detail.cfm?id=1039523

Kay says:

    Just as an aside, to give you an interesting benchmark—on roughly the same system, 
    roughly optimized the same way, 
    a benchmark from 1979 at Xerox PARC runs only 50 times faster today. 
    Moore’s law has given us somewhere between 
    40,000 and 60,000 times improvement in that time. 
    So there’s approximately a factor of 1,000 in efficiency 
    that has been lost by bad CPU architectures.

    The myth that it doesn’t matter what your processor architecture is — 
    that Moore’s law will take care of you—is totally false. 
From my point of view, garbage collection, JIT compilation, and late binding are valuable, and the hardware is leaving too much to the VMs.
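
Kay's arithmetic, spelled out using his own figures (not independent measurements):

    # The arithmetic behind the quoted "factor of 1,000".
    moores_law_gain = (40_000, 60_000)   # hardware improvement he cites
    observed_gain   = 50                 # measured speedup of the 1979 benchmark

    lost = [g / observed_gain for g in moores_law_gain]
    print(f"efficiency lost: {lost[0]:.0f}x to {lost[1]:.0f}x (~1,000x)")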


Alan Kay might not be correct in this case. The real factor between 1979 and today's computers is not 1000x; it's something between 10x and 50x, and with the right compilers you get a factor of 3x-10x [1].

[1]http://lists.canonical.org/pipermail/kragen-tol/2007-March/0...


It's building the business case and the revenues that make this whole processor-design discussion interesting.

Not the microprocessor technology itself.

Competitive microprocessor designs (in terms of speed, power, cost, volume and particularly the applications the users care about) are feasible but are Not Cheap, and to get the costs down (and the revenues up) you need to build production scale. And building scale means leapfrogging the existing players in one or more dimensions, sufficiently to draw over applications that the end-users really care about.

You need to be significantly better here or sufficiently profligate with the application vendors and the resellers, or sufficiently compatible to overcome the inherent application inertia; you need to have enough "pull" (speed, power, cost, particularly applications) to build up a base. Without this, you're running a furnace with the contents of the corporate coffers.

You can't be just a little better with your processor designs, either. That won't draw enough folks over. Both the Alpha and MIPS microprocessors had Microsoft Windows NT, and that (even with porting tools such as FX!32) wasn't enough to build up an installed base against the x86 designs.

Apple is playing the long game here.


CPUs were/are one of the last pieces of computer-related hardware that I still kind of considered "magic." This article definitely helped to alleviate that...


This is one of the most interesting articles I've read here in a long time.

However, with the average use of mobile devices, does the CPU architecture really matter that much?

As the focus of mobile devices is more the consumption of media (games, video, etc.), isn't the GPU the larger differentiator in this space?

Does either RISC or CISC have a benefit in the GPU space? Or am I completely wrong with that previous statement?


A GPU may help with media. GPUs tend to be SIMD / Stream processors. So some problems map well to that architecture, some less so.

Lots of GPUs tend to work on four-pixel quads of RGBA data with very specialised instruction sets, which are not really CISC or RISC but are maybe more RISC-like.

This is good for graphics rendering, both 2D and 3D. Video and audio encoders and decoders are trickier: some parts, like motion estimation, can be implemented on a GPU's SIMD architecture, but the bitstream processing is very serial by design, so it fits badly on a SIMD architecture and is not great on a CPU either; a custom piece of hardware or an FPGA may be a better solution.
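
A small sketch of why the two workloads differ. NumPy's elementwise array ops stand in for SIMD lanes over RGBA pixels; the bit-by-bit loop is a deliberately serial toy, not a real codec:

    # Data-parallel vs. serial contrast, schematically.
    import numpy as np

    # SIMD-friendly: the same operation applied independently to every RGBA pixel.
    pixels = np.random.randint(0, 256, size=(1080, 1920, 4), dtype=np.uint16)
    brightened = np.clip(pixels * 1.2, 0, 255).astype(np.uint8)   # one vectorized pass

    # SIMD-hostile: each step depends on state produced by the previous step.
    def decode_bits(bits):
        value, out = 0, []
        for b in bits:                     # inherently sequential dependency chain
            value = (value << 1) | b
            if value >= 8:                 # toy "symbol complete" condition
                out.append(value)
                value = 0
        return out

    print(brightened.shape, decode_bits([1, 0, 1, 1, 0, 0, 1, 1]))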


http://www.youtube.com/watch?v=g2j3AGeeLg4 A Google tech talk on 'Graphics Processors, Graphics APIs, and Computation on GPUs' gives a bit more detail on the design of a modern(ish) GPU


> Today, with the optimizations and internal RISC conversions that take place, CISC vs RISC isn’t really about the performance any more. It’s about the business, the politics… and the power consumption.


Anybody remember Transmeta?


> In fact, it’s now a universally accepted truth that RISC is better than CISC!

Is that actually true? I'd like to see some actual numbers comparing RISC vs. CISC performance/power consumption.


Author here.

You'd be comparing apples and oranges. CPUs are optimized for specific use cases. There's always a tradeoff between the performance of each component, and current x86 CPUs are designed/built for the desktop while current ARM/MIPS/etc. are made for embedded and mobile devices.

Atom comes close to being comparable to certain embedded devices, but not really, because it's actually a desktop architecture (it resembles older in-order-execution x86 desktop CPUs) scaled down and simplified to reduce power.

Within the field though, it's taken for fact. To me though, nothing says it clearer than that Intel internally converts CISC instructions to a series of RISC instructions, then enters them into the pipeline. If it weren't for compatibility issues, Intel would be RISC today.


> To me though, nothing says it clearer than that Intel internally converts CISC instructions to a series of RISC instructions, then enters them into the pipeline.

One problem with that argument is that RISC machines do the same thing.

Another problem is that there are different amounts of available/potential bandwidth at different points. For example, L2 accesses are far more expensive than what a decoder can produce.

There's always some benefit to reducing the number of control bits, but it is especially pronounced the further that you get from functional units.

BTW - "the pipeline" is misleading. Many implementations use multiple pipelines even for a single instruction stream.

> Within the field though, it's taken for fact.

There may be a theoretical advantage, but if so, it is small relative to many other factors.

I've actually helped define a commercial ISA (RISC FWIW).


Microcode isn't the same as RISC. With an actual instruction set you are bound to an architecture; microcode changes with the microarchitecture, so it can contain optimizations that would break compatibility if used in a general way.

Intel tried to move away from CISC with EPIC (Itanium), which could be considered a kind of RISC, but it obviously didn't work out.


Didn't it? The Itanium is a great performer. Like I mention below in another comment, if not for AMD, which beat Intel to the x64 punch with the hybrid x86_64 architecture, we'd all be on Intel's vision of true 64-bit computing: the Itanium.

All computer architecture PhDs I've spoken to have referred to the Itanium with an air of awe. It was built from the ground up with all the optimizations and bottlenecks in mind, and can be statically optimized to do magic... except no one is ever going to use it, since it requires all applications to be recompiled, and with x86_64 offering an easy way out, that's not going to happen.


If AMD hadn't introduced x86-64, we'd all still be using PowerPC and 32-bit x86. IA64 suffers from the same problem as all the other RISCs out there that died: it has the potential to do magic, but there just aren't that many magic compilers, and the output tends to look sort of mundane.

The whole RISC vs. CISC debate is beside the point; that was decided a long while ago. x86 "CISC", which is quite a bit more RISCy than, say, VAX or System/370, happens to be a very compelling blend.

ARM does have an interesting position in the ultra-low-power field, though. I suspect that has more to do with ARM being designed for it from the start than with the instruction set.


I'm also a fan of Itanium. The instruction encoding is a bit weird (41 bits, really?), but there are many cool ideas for giving the compiler more control.

My point was that if microcode were the same as simply expanding CISC instructions into multiple RISC instructions, Intel could have gotten away with designing a RISC chip that was performant on x86 by simply expanding each x86 instruction into the new architecture.

The performance of x86 on Itanium wasn't too great (and was one of the reasons it didn't do so well).


> All computer architecture PhDs I've spoken to have referred to the Itanium with an air of awe.

Really? I've had pretty much exactly the opposite experience. In computer architecture circles I find it's actually more often referred to as "Itanic" than "Itanium", if that's any indicator. I think what's probably most sad about the Itanium debacle is that it brought about the end of Alpha, PA-RISC, and (non-embedded) MIPS, and all apparently for naught.

A slight aside: it doesn't really say much about its technical merits one way or another, but Wikipedia has a pretty amazing chart of predicted vs. actual sales figures for Itani{c,um} over time: http://en.wikipedia.org/wiki/File:Itanium_Sales_Forecasts_ed... (you can just see it falling on its face over the course of a decade).


> In fact, it’s now a universally accepted truth that RISC is better than CISC!

> Is that actually true? I'd like to see some actual numbers comparing RISC vs. CISC performance/power consumption.

While I realize you intended your question about the latter part ("that RISC is better than CISC"), I'd say the prior part ("universally accepted truth") is absolutely, unequivocally not true. Just last weekend [1] I heard Yale Patt (a relatively Big Name in computer architecture) refer to RISC as something like "a hiccup that lasted 20 years" -- a slightly less than flattering description. Not saying I necessarily agree with him, but agreement is far from universal. There are plenty of other examples as well; see http://portal.acm.org/citation.cfm?id=1506661.1506667 for one I came across recently.

[1] Incidentally, at the same event I also met an Intrinsity chip-design engineer who said his company was in the process of being acquired, though at the time he didn't mention it was by Apple...


Sparc vs Intel? Which is your desktop running?

It might be a slightly different answer at your web host!


> z = x + y

AFAIK, x86 doesn't have 3-operand instructions yet.


Generally true, though I think some of the proposed (future) SSE5 instructions may be three-operand -- or even four in the case of FMA ops (perhaps this is what you were subtly referring to by saying "yet").

Perhaps more importantly though, (again, AFAIK) x86 does not have any instructions that operate memory-to-memory in the way the article indicates. There are plenty of memory-to-register and register-to-memory (and register-to-register) ops, but the ALU ops generally don't combine a load and a store into a single instruction (some instructions like cmpxchg do for the purpose of providing atomicity, but those are a special case).
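
To make the contrast concrete, here's a toy "lowering" of z = x + y for the two styles. The mnemonics are schematic sketches rather than exact assembler syntax for any target:

    # Toy lowering of "z = x + y" for the two instruction-set styles discussed above.

    def lower_load_store(z, x, y):
        """RISC/load-store style: memory is only touched by explicit loads and stores."""
        return [f"load  r1, {x}",
                f"load  r2, {y}",
                f"add   r3, r1, r2",       # three-operand, register-only arithmetic
                f"store {z}, r3"]

    def lower_x86_style(z, x, y):
        """x86 style: two-operand ALU ops may take ONE memory operand, but no single
        instruction reads both x and y from memory and writes z back."""
        return [f"mov  eax, [{x}]",        # load x into a register
                f"add  eax, [{y}]",        # add with a memory source operand
                f"mov  [{z}], eax"]        # separate store of the result

    print(*lower_load_store("z", "x", "y"), sep="\n")
    print(*lower_x86_style("z", "x", "y"), sep="\n")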



