They both borrowed so many ideas from each other that the architectures are nearly identical these days.
Neither modern ARM nor modern x86 deserves to be called RISC or CISC.
Still, there are a few areas where the A64 architecture is theoretically "better" than x64, due to the lack of legacy.
The first is instruction decoding. x86 has to deal with a whole bunch of weird instruction length modifiers, prefix bytes, weird ModRM encodings and years of extensions. To decode the 4 or 5 instructions per cycle that modern out-of-order microarchitectures demand, Intel CPUs have to attempt decoding instructions at every single byte offset in a 16-byte buffer, throwing away unwanted decodings. That's got to waste transistor and power budgets.
In comparison, when an ARM CPU is in A64 mode, all instructions are the same length, making decoding multiple instructions per cycle trivial.
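A rough sketch of why fixed-width decoding is easier to parallelise. The variable-length encoding here is entirely made up (`toy_length` is a hypothetical rule, far simpler than real x86), and none of this reflects how any actual decoder is built. It only shows that with a fixed width every start offset is known up front, while with variable lengths you guess at every byte and then discard most of the guesses.

```python
# Toy model only: the variable-length encoding below is made up and is far
# simpler than real x86, where prefixes and ModRM bytes also affect length.

def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """Fixed-width ISA: every instruction start is known up front."""
    return list(range(0, len(code), width))


def toy_length(first_byte: int) -> int:
    """Hypothetical rule: the low 2 bits of the first byte give length 1..4."""
    return (first_byte & 0b11) + 1


def variable_width_starts(code: bytes) -> tuple[list[int], int]:
    """Guess a length at every byte offset, then keep only the real starts.

    Returns the real start offsets and the number of discarded guesses.
    """
    # Step 1: could be done in parallel, one length guess per byte offset.
    guesses = [toy_length(b) for b in code]
    # Step 2: inherently serial, each start depends on the previous length.
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        pos += guesses[pos]
    return starts, len(code) - len(starts)


fixed = fixed_width_starts(bytes(16))
stream = bytes([0x03, 0xAA, 0xBB, 0xCC, 0x00, 0x01, 0xDD, 0x02, 0xEE, 0xFF])
starts, wasted = variable_width_starts(stream)
print(fixed)
print(starts, wasted)
```

In this toy stream, more than half of the per-byte length guesses are thrown away, which is the kind of wasted work the comment above is pointing at.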
The second area is memory consistency guarantees. x64 has relatively strong guarantees, allowing simpler assembly code but at the cost of a more complex memory subsystem between cores.
A64 has much weaker ordering guarantees, which saves on hardware complexity but requires the programmer (and/or compiler) to insert memory fences whenever stricter ordering is required.
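To make the ordering point concrete, here is a toy "message passing" litmus test. It is not the real x64 or A64 memory model. The strong variant keeps every thread's operations in program order (sequential consistency, which is stricter than what x64 actually promises), and the weak variant simply lets each thread reorder its two independent accesses, as if no fence were present. All names are made up for illustration.

```python
from itertools import permutations

# Thread 0 publishes data then sets a flag; thread 1 reads flag then data.
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r1"), ("load", "data", "r2")]


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(trace):
    """Execute one global order of operations against a single shared memory."""
    mem, regs = {"data": 0, "flag": 0}, {}
    for kind, addr, arg in trace:
        if kind == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]


def outcomes(allow_reordering):
    """Collect all (r1, r2) results; optionally let each thread reorder its ops."""
    orders = permutations if allow_reordering else (lambda ops: [tuple(ops)])
    return {
        run(trace)
        for w in orders(WRITER)
        for r in orders(READER)
        for trace in interleavings(list(w), list(r))
    }


strong = outcomes(allow_reordering=False)
weak = outcomes(allow_reordering=True)
print(sorted(strong))
print(sorted(weak))
```

The interesting result is `(1, 0)`: the reader sees the flag set but still reads stale data. The in-order variant never produces it; the reordering variant does, which is the case a fence (or acquire/release access) is there to rule out.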
This is all theoretical, I have no idea what difference this makes at the practical level.
Every extant A64 processor needs to deal with ARMv8 instructions, Thumb instructions, Thumb2 instructions, shift encoding, microcoded multiple load/store instructions... It's true that the subset of the ISA that makes up the overwhelming bulk of instructions actually executed is very simple. But frankly that's true of x86 as well.
> when an ARM CPU is in A64 mode
That's... not the way hardware works. Those transistors are still there, it's not like you can make them go away by switching "modes". They still are ready to switch every cycle, and in any case having transistors that "aren't needed" doesn't make anything faster, because they all execute in parallel (or at worst in an extra pipeline stage or two) anyway.
Either you have a simple architecture or you don't. A64 is a simple ISA. Real CPUs have complicated legacy architectures.
A typical ARMv8-A processor needs to decode the A64, A32 and T32 instruction sets, but that doesn't strike me as a significant burden. A32 and T32 are essentially just different encodings of the same instruction set - there are very few A32 or T32 instructions that don't have an equal in the other instruction set. A64 has more and wider registers, but is otherwise broadly similar in capabilities. I would expect that most ARMv8-A implementations unify the three instruction sets very early in the decoding process.
Retaining the system level aspects of AArch32 strikes me as more expensive, especially support for short page tables, the subtly different system register layout, the more complex relationship between PSTATE.M and the security state, and the banked system registers between the secure and non-secure states. I'm surprised ARM-designed cores haven't pushed harder to eliminate AArch32 at the higher exception levels (although I'm aware of some cores designed by ARM Architecture licensees that do so). Perhaps that ARM has retained AArch32 all the way up to EL3 is evidence that they believe doing so isn't very expensive.
It seems like they are very slowly phasing it out and heading to 64-bit only. A32 ISA is only supported in userspace mode (EL0) on the Cortex-A76.
First, even in T32 mode, there are only two instruction lengths, 2 and 4 bytes. That requires far fewer transistors than modern Intel and AMD CPUs, which support single-cycle decoding of instruction lengths from 1 to 15 bytes.
Second, due to prefix bytes, the bits which control the x64 instruction length are spread throughout the instruction, and they interact with each other. There are instructions where every single byte modifies the length. Decoding variable-length T32 is easy: only a few bits in the first halfword select between 16 and 32 bits.
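For comparison, the whole T32 length decision fits in a couple of lines. This is a sketch of the rule from the ARM architecture manual (the top five bits of the first halfword), with `t32_length` being a name chosen here for illustration:

```python
def t32_length(first_halfword: int) -> int:
    """Return the length in bytes of a T32 (Thumb2) instruction.

    A first halfword whose bits [15:11] are 0b11101, 0b11110 or 0b11111
    starts a 32-bit instruction; anything else is a 16-bit instruction.
    """
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


print(t32_length(0x4608))  # 16-bit encoding (MOV r0, r1)
print(t32_length(0xF000))  # first halfword of a 32-bit encoding
```

No later bytes need to be inspected, so the length of each instruction is known from a fixed position in its first halfword.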
Third, T32 support is technically optional. Apple has required all apps to be recompiled in A64 mode for years; I'm not even sure if their latest CPUs still support switching to T32 mode.
Fourth, mode switching does use fewer transistors than a design that has to choose on a per-instruction basis. It takes a lot of extra transistors to dynamically detect the instruction type, and those would have to be replicated 4 or 8 times; with mode switching they are unnecessary. All you need is a single-bit register to store the current state and a special instruction to switch between modes.
Wouldn't surprise me if there are other A64-only chips as well.
Regarding transistor / area budget, sure.
Regarding power budget, that absolutely could be the way hardware works. Modern chips can and do power off pieces of silicon when they are not needed. Whether that would be worth it for instruction decoding, I don't know, but it could be.
That is only possible when you deal with isolated parts. You cannot, for example, power down an instruction decoder's ability to understand different syntaxes; you can only power down the entire instruction decoder. Trying to design it so that sub-features of a block like that can be powered down would not be productive.
A realistic clock-gating would be something like powering down the actual execution units ("We don't need AVX-512, so let's not waste power on the execution units"), but that doesn't help in saving power wasted on legacy.
You can design anything. The question is whether the added design complexity (which for silicon directly translates to increased power consumption) outweighs the benefits.
>> Whether that would be worth it for instruction decoding, I don't know
There would be significant overhead to design a decoder such that it could switch between legacy and aarch64 only, but it could conceivably be done.
fyi clock gating isn't the same as power gating.
What you'd do then is to split the decoder into several blocks, so that there's a fan-out from a main decoder into the different sub-decoders, and then power down the sub-decoders. It's still entire blocks you power down.
Plus, I think the increased power consumption from this design (especially considering that the decoder now needs to stall on powered down sub-decoders) will outweigh the savings of powering down any sub-decoders.
> fyi clock gating isn't the same as power gating.
Of course not. Both clock gating and power gating are power saving designs. Clock gating and power gating both eliminate switching current entirely, while power gating also removes leakage current at the cost of larger architectural changes than those required by clock gating.
I'm out on a limb here, but I don't think power gating makes much sense outside extreme low-power devices.
It seems like if it's doable, that Apple would do it - especially since they killed off 32-bit apps in iOS 11 already, and therefore were able to remove the 32-bit code from the iOS codebase too.
Fair enough. States it is AArch64 only. I spent some time trying to find the official answer to AArch32 being optional; unfortunately nothing concrete has come up.
On the other hand, that wastes cache memory and fetch bandwidth. Instruction density is very important especially since caches are big and consume a lot of power too. I believe that if it weren't for the brief period in the 80s where memory speeds were higher than core speeds, what we know as "RISC" today would've never appeared.
A significant amount of space is wasted on old rarely used instructions, and the x86-64 encoding was chosen mostly based on similarity to the earlier 32-bit ISA, which uses decades-old instruction frequencies. The addition of many ISA extensions has become progressively more difficult: the newest AVX-512 instructions which use the EVEX encoding have 4 (!!) prefix bytes before the instruction even starts. So just the prefix is as long as any AArch64 instruction.
The net result is that AArch64 binary sizes are largely comparable with x86, and there is still a bit of juice to squeeze out as well, as the ARM compiler backends haven't had the same decades of heavy optimization compared to x86.
This. One of the reasons why modern x86 processors have such strong performance is their external facing CISC interface effectively acts as memory compression while they are RISC on the inside. In many ways it's a best of both worlds that was achieved through incremental evolution.
That's only if your code happens to be particularly "64-bit-heavy", or the compiler isn't doing a good job at selecting registers; the original designers (at AMD, not Intel) decided on the prefixes (and defaulting to 32-bit for most ops) instead of defaulting all operations to 64-bit in 64-bit mode precisely because it would be better for size and performance --- using their carefully optimised compilers. Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.
I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen. I even recall a remark in that study about how it was generating mostly RISC-like instructions, so in other words they were comparing binaries generated for a RISC CPU using a RISC-oriented compiler with ones generated for a CISC CPU using a RISC-oriented compiler, failing to exploit the full capabilities of a CISC.
I've written x86 Asm for several decades (started with 16-bit --- dating myself here...), and done some occasional MIPS and ARM, and it's very difficult for me to believe that the RISCs have any intrinsic advantage in code density other than the fact that compilers for x86 aren't that great at it; you can write a Fibonacci calculator for the latter in 5 bytes and pushes and pops are single-byte instructions, while on the former even a register-register move is 4 bytes.
No, that's true for basically all code. 6 or 7 registers isn't enough for basically anything interesting, so you end up pretty much always hitting the high registers.
> Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.
The only real difference is that you have memory load addressing modes in x86, while for load-store architectures like AArch64 you don't. But:
* On x86-64 you have two-address instructions, not three-address instructions. This means that AArch64 "sub x9,x10,x11", or "49 01 0b cb", becomes x86-64 "mov r9,r10; sub r9,r11", or "4d 89 d1 4d 29 d9": 4 bytes vs. 6, thanks to the doubled REX prefix.
* On x86-64 immediates are very inefficiently encoded, while they tend to be compressed on RISCs to fit in the 32-bit instruction word. The end result is that AArch64 "sub x9,x10,#1234", or "49 49 13 d1" in 64-bit mode becomes x86-64 "lea r9,[r10-1234]", which is "4d 8d 8a 2e fb ff ff": 4 bytes vs. 7.
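The byte counts in the two bullets above can be reproduced with a few hand-rolled encoders. This is a minimal sketch that covers only the exact instruction forms quoted (register numbers below 16, no SIB byte, no shifts), and the function names are made up here; it is not a general assembler.

```python
def a64_sub_reg(rd: int, rn: int, rm: int) -> bytes:
    """A64 SUB Xd, Xn, Xm (shifted register form, no shift), little-endian."""
    word = 0xCB000000 | (rm << 16) | (rn << 5) | rd
    return word.to_bytes(4, "little")


def a64_sub_imm(rd: int, rn: int, imm12: int) -> bytes:
    """A64 SUB Xd, Xn, #imm12 (unshifted 12-bit immediate), little-endian."""
    word = 0xD1000000 | (imm12 << 10) | (rn << 5) | rd
    return word.to_bytes(4, "little")


def rex_w(reg: int, rm: int) -> int:
    """REX prefix with W=1; R and B carry bit 3 of the reg and rm numbers."""
    return 0x48 | ((reg >> 3) << 2) | (rm >> 3)


def modrm(mod: int, reg: int, rm: int) -> int:
    """ModRM byte built from the low 3 bits of each register number."""
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)


def x64_rr(opcode: int, dst: int, src: int) -> bytes:
    """Two-operand 'op r/m64, r64' forms: 0x89 is MOV, 0x29 is SUB."""
    return bytes([rex_w(src, dst), opcode, modrm(0b11, src, dst)])


def x64_lea_disp32(dst: int, base: int, disp: int) -> bytes:
    """LEA dst, [base + disp32]; no SIB byte, so base must not be rsp/r12."""
    head = bytes([rex_w(dst, base), 0x8D, modrm(0b10, dst, base)])
    return head + disp.to_bytes(4, "little", signed=True)


arm_sub = a64_sub_reg(9, 10, 11)                      # sub x9, x10, x11
x64_sub = x64_rr(0x89, 9, 10) + x64_rr(0x29, 9, 11)   # mov r9,r10; sub r9,r11
arm_imm = a64_sub_imm(9, 10, 1234)                    # sub x9, x10, #1234
x64_imm = x64_lea_disp32(9, 10, -1234)                # lea r9, [r10-1234]

for name, enc in [("arm_sub", arm_sub), ("x64_sub", x64_sub),
                  ("arm_imm", arm_imm), ("x64_imm", x64_imm)]:
    print(name, len(enc), enc.hex(" "))
```

The output matches the hex quoted in the bullets: 4 bytes against 6 for the three-operand subtract, and 4 against 7 for the immediate form.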
> I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen.
LLVM is doing pretty well at x86-64 size optimization: for example, it prefers to select lower registers to reduce size. As I recall, Dan Gohman told me the code size win was something on the order of 2%. It really doesn't make a big difference: AArch64 and x86-64 have about the same code size.
> you can write a Fibonacci calculator for the latter in 5 bytes
But real code, again, hits the high registers.
> pushes and pops are single-byte instructions
Pushes and pops aren't used by most compilers, except in function prologs and epilogs. This is actually an example of inefficiency in the design of x86-64. The opcode space shouldn't go to functions that are only used to set up and tear down functions.
> on the former even a register-register move is 4 bytes
"mov r11,r12" is 3 bytes on x86-64. Not a big difference…
There is a very good reason for GCC's x86 backend to do this. Intel/AMD optimisation manuals provide a subset of x86 instructions that are worth using. Instructions that are actually fast in modern designs, that don't fall back to legacy microcode.
This subset looks very RISC-like.
Sure, x86 has CISC instructions that sometimes allow very dense code, but if you want your code to actually run fast you need to do it the RISC way.
Was basically Thumb2 but with more flexibility, and it still felt like RISC.
Not only do many different features contribute to performance, but performance also depends so much on the use case. Two CPUs implementing the same arch might each win a benchmark that has a different instruction mix or memory access pattern, etc.
Which hurts x86 a lot, because x86-64 is very space inefficient for a variable length ISA. The REX prefixes add up to make x86-64 just as space-inefficient as AArch64.
That said, ForwardCom blurs the boundary between CISC and RISC.
Notably, ARM64 is one RISC ISA that doesn't have that sort of extension.
Has anyone made an RV64GC out-of-order core yet?
Open source, and taped out.
I'd guess the taped out versions were prior to this improvement.
Section 2.2 from here goes into some details on the challenges of using compressed instructions and how they deal with them.
There is no magic - down at the back-end, x86s and ARMs are doing the same thing and getting the same performance will cost about the same chip area and power. Where they differ, however, is on the front-end, the pieces that decode instructions, issue micro operations and schedule them through to the execution units. In that space ARM64 seems more promising.
Unless, of course, Intel decides to throw away a lot of the backwards compatibility and goes very low-transistor-budget for legacy instructions (maybe software traps), freeing a ton of chip area to implement the ones they care about in a fast and power efficient way. That x86 would be unable to run DOS, but I don't think many of us would care.
I think it's safe to say that RISC architecture has already changed everything.
But honestly, the way forward is quite obvious. Cores with self managed caches and ideally without coherence.
I'm not a fan at all of weak memory models. Especially when it results (is it caused by that? probably?) in atomics being crazy slow compared to those of state of the art x86. Especially since multicore is crucially important and will continue to become even more, and there is no SW solution to a HW providing slow atomics.
This nonsense keeps coming up. No, it's not irrelevant. It matters. A lot.
A CISC design is complex, but it doesn't stop there. This complexity spreads down the chain. Implementations get complex, bugs happen. Making formal proofs of an implementation's correctness becomes impossible. Writing a compiler back end will be complex. Debugging it will be complex. Writing a proof that the machine code meets both the ISA specification and implements the same thing the higher level language does is also complex.
Now, where's the advantage of CISC to justify this complexity? Yeah, right.
We don't need to reduce the number of instructions to fit the whole processor on a single piece of silicon any more. With the decoupling of the processor and memory clocks and the introduction of caches load-store is a less pressing matter than it was. And the complexity of a processor is dominated by the fiendish complexity of executing operations out of order while still appearing in order to all outward appearances, even in the face of interrupts.
The legacy of these styles is still with us in the ISAs that were defined back then and the complexity of decoding an ISA can still make a small but noticeable difference. It can even have security implications when a sequence of bytes could be read as one of two valid instruction streams depending on where you start reading. But most of the architectural complexity of a modern processor has very little to do with the ISA and whether the architecture it was originally written for was RISC or CISC.
I'd guess load-store with sufficient architectural registers is still an advantage if you're doing an in-order design, as that allows the compiler to schedule the load as early as possible? Sure, for an OoO design which splits a load-op into separate micro-ops this doesn't matter.
AESE is in particular a very "CISC" instruction, because it is usually macro-op fused with an AESMC instruction. The ARM decoder will look for AESE + AESMC instruction pairs and execute them as a single macro-op (kind of like how x86 joins "cmp" and "jnz" instructions together into a singular op).
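As an illustration of what "look for AESE + AESMC pairs" means, here is a toy fusion pass over a list of decoded instructions. The tuple format and the condition that the AESMC must read and write the AESE destination register are assumptions made for this sketch; real cores document their own fusion rules in their optimisation guides.

```python
def fuse(instrs):
    """Fuse an AESE immediately followed by an AESMC on the same register.

    Each instruction is a (mnemonic, dest, src) tuple. The same-register
    requirement is an assumption of this sketch, not a documented rule.
    """
    out, i = [], 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt and cur[0] == "aese" and nxt[0] == "aesmc"
                and nxt[1] == nxt[2] == cur[1]):
            out.append(("aese+aesmc", cur[1], cur[2]))
            i += 2  # both instructions consumed as one macro-op
        else:
            out.append(cur)
            i += 1
    return out


stream = [
    ("aese", "v0", "v1"),
    ("aesmc", "v0", "v0"),   # adjacent to its AESE: fused
    ("add", "x2", "x3"),
    ("aese", "v4", "v5"),
    ("add", "x6", "x7"),     # separates the pair: no fusion
    ("aesmc", "v4", "v4"),
]
ops = fuse(stream)
print(len(stream), "instructions ->", len(ops), "ops")
```

Only the adjacent pair is merged, which is why instruction scheduling around these pairs matters.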
ARM's "CISC" roots go deeper than that. For a long time, ARM machines had a "Jazelle" mode which directly executed Java bytecode. When Java-for-phones became less popular, Jazelle support was dropped.
But in any case, the "CISC" advantage is that you get instructions designed for the applications that run on your system. And let's be frank here: AES Acceleration just makes sense these days. Everyone uses a web browser or a web server.
CISC vs RISC is stupid. The purported advantages of RISC have been ported to CISC, and vice versa... now as ARM and Power9 support CISC-like instructions like AESE (ARM) and vcipher (Power9). All processors will have CISC-instructions to accelerate their most common tasks these days: there's a lot of extra die space (especially because large portions of the die have to be kept 'off' to help distribute the heat. So rarely used specialized instructions are very useful for heat-distribution purposes).
The biggest advantage to "RISC-V" is the ability to add application-specific instructions to the core. That's innately a CISC-design: custom instructions to support everyone's favorite optimizations.
The original design of "RISC", that is REDUCED instruction set, is incompatible with today's cheap transistors. You can fit billions of transistors on modern systems, so there's almost no reason to reduce your instruction sets.
> You might want to look up the ARM instructions "FJCVTZS" and "AESE", and "SHA256H".
The AESE instruction is a perfect example of that.
The AESE instruction is a complex instruction that does not execute in the RISC pipeline, but it doesn't add major complexity to the decoder. That makes it very similar to the multiply and divide instructions on the original MIPS CPU: those operated outside of the canonical pipeline as well, with results stored in the HI and LO registers instead of the general purpose register file.
Yet I don't think anyone will argue that the 1985 MIPS was not RISC. ;-)
AESE + AESMC is seen as a singular 8-byte instruction from the decoder. The two instructions are decoded as one operation to maximize the throughput of AES. Yeah, web browsers + web server workloads demand an incredibly fast AES, to the point that ARM is willing to complicate the decoder just to make this one operation faster.
> Yet I don't think anyone will argue that the 1985 MIPS was not RISC. ;-)
1985 MIPS wasn't macro-op fusing together instructions for performance gains. The AESE + AESMC instruction pair is straight-up a CISC design (borrowed from Intel's cmp + jnz fusion), complicating the decoder severely but adding huge performance increases in practice.
> That makes it very similar to the multiply and divide instructions on the original MIPS CPU: those operated outside of the canonical pipeline as well, with results stored in the HI and LO registers instead of the general purpose register file.
AESE and AESMC operate on NEON registers. They coincide with the NEON Pipeline and NEON Registers. They're not "outside" the core by any stretch of the imagination.
Maybe you're surprised that ARM has a HUGE series of 128-bit vector instructions. But yeah... ARM's instruction set is very complicated these days. It's basically a CISC processor. https://community.arm.com/android-community/b/android/posts/...
At least, the above is true on A75: https://static.docs.arm.com/101398/0200/arm_cortex_a75_softw...
As you can see in the diagram, the FP0 and FP1 pipelines are all that exist in A75. The AESE / AESMC instruction pair is decoded as one instruction. Its execution is in the FP0 pipeline, just like any other NEON instruction. These are all very CISC-like design decisions.
Anyway, I would argue that ARM stopped being "RISC" as soon as it implemented the "Branch to Java" Jazelle instruction: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....
ARM was always a "pragmatic" architecture. It did what it had to to achieve high performance. Be it a RISC-design choice or a CISC-design choice.
But that's an implementation decision and not inherent to the ISA itself: AESE and AESMC can be implemented as individual instructions without fusing them.
The fact that RISC-V advocates fusing certain pure RISC opcode patterns for improved performance doesn't make it any more CISC either.
> AESE and AESMC operate on NEON registers. They coincide with the NEON Pipeline and NEON Registers. They're not "outside" the core by any stretch of the imagination.
The key part here is "NEON Pipeline and NEON Register". If you can make the split between the traditional pipeline and the additional pipelines, the complexity penalty is contained.
The issue with x86 CISCness has always been with having to deal with byte-aligned variable width instructions.
On the contrary: the compiler and programmers have performance expectations. Just as modern compilers/programmers expect macro-op fusion between x86 cmp/jmp pairs, modern ARM compilers/programmers expect fusion between AESE and AESMC.
Especially when we're talking about articles like "ARM processors like A12X are nearing performance parity...", we have to understand the architectural decisions ARM has made to get to this point.
Some CISC-techniques are really, really good for performance. As such, ARM takes those CISC techniques. AESE + AESMC is an excellent example.
You can't write a compiler optimizer unless you have an idea of what the CPU core is actually doing. The expectation for any cell-phone ARM chip is to have fusion between AESE and AESMC. That's just how it works these days. Modern compilers are going to work very hard to put AESE + AESMC instructions next to each other to maximize the potential for fusion.
> The key part here is "NEON Pipeline and NEON Register". If you can make the split between the traditional pipeline and the additional pipelines, the complexity penalty is contained.
Ehhh? AMD Zen has a FP Pipeline and FP Register pools to implement x86. Basically the 64-bit pipeline+registers for normal instructions, and then 128-bit pipeline+registers for SSE / AVX / etc. etc. Does that make AMD Zen's implementation of AMD64 a RISC machine?
> The issue with x86 CISCness has always been with having to deal with byte-aligned variable width instructions.
ARM kind of has this instruction set called "Thumb2" you probably should get familiar with. Which by the way, all ARMv8 chips still support if you put them into AArch32 mode.
So yeah, all modern ARM chips have a variable-length decoder implemented. It's part of their design. It turns out that a variable-length decoder is really, really good for code density. You can compress more data to fit into the limited L1 cache when it's variable length.
EDIT: It's really, really difficult for me to consider ARM a real RISC machine. I'm sorry. It supports all of the PDP-11 addressing modes for goodness sake. (post-decrement, pre-increment, etc. etc.).
Or is "LDR R0, [R1, r2, LSL #2]" really a RISC-style instruction?
RISC vs CISC debate is dead, and has been for decades. ARM just takes good designs, just as all the other CPU designers do. If it's a good design, it goes into the chip. ARM even copies the CISC-style "shadow registers" approach to data-dependencies. Intel Skylake has over 200 "shadow registers" internally that it assigns to RAX / RBX / etc., and ARM cores do the same. The ACTUAL hardware registers no longer match the ISA on modern ARM designs.
I understand that compilers and programmers have expectations. I just don't see how that determines the CISC-ness of an ISA specification. (Again: see my RISC-V example.) That said, it's a bit of a side argument.
> Does that make AMD Zen's implementation of AMD64 a RISC machine?
No, it does not. The fact that you have different pipelines for different parts of the ISA enables maintaining a RISC-like architecture. That doesn't mean that it is RISC. For me, the complexity of the decoder is the determining factor.
> ARM kind of has this instruction set called "Thumb2" you probably should get familiar with. Which by the way, all ARMv8 chips still support if you put them into AArch32 mode.
I thought Thumb2 instructions can either be 16-bit or 32-bit wide, like RISC-V compressed instructions.
x86 allows a free mix of 8-bit, 16-bit, 24-bit, 32-bit and 40-bit instructions, and longer ones up to 15 bytes.
As a hardware designer, the former is a minor decoder implementation nuisance. The latter is a nightmare.
I was not aware that THUMB2 allowed byte-aligned instructions of variable length.
> EDIT: It's really, really difficult for me to consider ARM a real RISC machine. I'm sorry. It supports all of the PDP-11 addressing modes for goodness sake. (post-decrement, pre-increment, etc. etc.).
Nothing is ever black and white. If somebody considers mixing 16-bit and 32-bit instructions CISC, then CISC it is. My bar is just at a different level.
> Or is "LDR R0, [R1, r2, LSL #2]" really a RISC-style instruction?
Why would it not be?
A more useful distinction is perhaps "death to microcode". If you have an AES instruction that is implemented with microcode running on the "normal" ALUs, then yes, CISCy. OTOH, if your chip has hardware for doing AES, adding AES instructions for using that hw doesn't sound particularly "un-RISCy" to me. Then again, many ARM chips apparently microcode some instructions, so meh.
ARM-SVE seems very non-RISC to me.
The only thing that seems common to all RISC designs is the load/store architecture, which x86 implements under-the-hood with microops now.
And? That's an implementation detail rather than a feature of the ISA. Then again, to some extent so is microcode, so I'm contradicting myself. Ugh. Well anyway, although these days there seems little common ground in what makes a design RISC, CISC, or whatnot, I think you'll be hard pressed to find much support for the idea that reg renaming or micro-ops would be such a defining feature.
> In fact, the SVE extension to ARM provides variable-length vectors, so that the "inner loop" of SIMD compute is stored entirely in micro-code space and independent of the ISA.
Huh? I thought the point was just that the vector width is not a compile-time constant, but rather there are some instructions like "vlmax foo", which calculates min(foo, implementation vector length), and then you use that as the loop increment rather than a compile-time constant.
Not that the "inner loop" is stored in micro-code space (what does that even mean?).
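A small sketch of that vector-length-agnostic loop shape, written as plain Python rather than SVE. `hw_vector_len` stands in for whatever width the implementation reports at run time, and `vla_add` is a made-up name; nothing here models predication or real SVE instructions.

```python
def vla_add(a, b, hw_vector_len):
    """Add two equal-length lists in chunks sized by a runtime vector length."""
    out, i = [], 0
    while i < len(a):
        # Active lanes this iteration: never more than the hardware provides,
        # and never more than the elements that remain.
        active = min(len(a) - i, hw_vector_len)
        out.extend(x + y for x, y in zip(a[i:i + active], b[i:i + active]))
        i += active  # loop increment is a runtime value, not a constant
    return out


xs = list(range(10))
ys = [10 * x for x in xs]
results = [vla_add(xs, ys, vl) for vl in (2, 4, 16)]
print(results[0])
```

The same code gives the same answer whether the "hardware" handles 2, 4 or 16 elements per step; only the number of iterations changes.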
> ARM-SVE seems very non-RISC to me.
To an extent I agree, but I'd say the most un-RISCy thing of SVE is not the variable length but rather the presence of scatter/gather instructions. I mean, in many RISC definitions there's the limit of one memory op per load/store instr., which makes sense as it makes e.g. exception handling much easier. But here with scatter/gather we have memory instructions which not only load/store multiple consecutive values, but potentially load/store a bunch of values all from different pages! If that isn't non-RISC, then what is.
But then again, scatter/gather is awesome for some problems such as sparse matrix calculations. Practicality trumps ideological purity.
If you're talking about embedded CPUs or whatnot, such as picorv32, the frontend is a relatively bigger piece of the "pie" so having a simpler ISA is nice. But really, verifying any CPU design is complex as hell and the decoders/frontends are not the biggest problem (even in RISC-V a source of bugs I've seen float around and even fell into myself, for my own emulator, is the semantics of JALR clearing low bits, for instance -- but that's not a decoding problem, it's a semantics one, flat out. Also, let's be real here, the whole memory-model approach in RISC-V where there's a spectrum of extensions adding TSO on top of weak memory ordering etc is nice and cool but I'd hardly call it fitting in the form of what people think of as a "simple" RISC CPU. That's not even getting into the array of other extensions that will crop up -- bitmanip, a crypto extension will almost certainly pop up, the vector extension, etc etc...)
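The JALR detail mentioned above is easy to state in code. A minimal sketch, assuming the immediate has already been sign-extended and the instruction uses the normal 4-byte encoding (`jalr` here is just an illustrative helper, not taken from any emulator):

```python
def jalr(pc: int, rs1_value: int, imm: int, xlen: int = 64):
    """Return (target, link) for a RISC-V JALR with a 4-byte encoding."""
    mask = (1 << xlen) - 1
    # The spec clears the least significant bit of the computed address.
    target = ((rs1_value + imm) & mask) & ~1
    link = (pc + 4) & mask  # value written to rd
    return target, link


print([hex(v) for v in jalr(0x1000, 0x2001, 2)])
print([hex(v) for v in jalr(0x1000, 0x2000, -3)])
```

Forgetting the `& ~1` leaves an odd target address in both calls, which is exactly the semantics bug (not a decoding bug) described above.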
Programmers love speculating and talking about the ISA and attaching words like "RISC" and "CISC" to everything. Everyone says x86 is a RISC not a CISC because "microcode" or whatever, but honestly who cares? Maybe it is, maybe it isn't, but ultimately it's just a superficial moniker for what you're fundamentally interacting with, and nobody who designs CPUs thinks this way anymore. It's a descriptive moniker from a bygone era, when today most systems have converged very closely in many of their core design decisions, and most of the differentiating features are things like various security extensions, interconnect support (because the interconnect is king), and software support. For things like designing CPUs or formal verification it's a small part of the overall job and you have about 9,000,000 other problems on your plate.
What Intel (and IBM's Z9) has demonstrated is that with enough engineering and silicon, you can still make complexity go fast. What RISC (like RISC-V and AArch64) enables is the same kind of microarchitecture but with far far lower complexity (= time to market, design team, etc). Intel is still doing really well because the physical design (custom cells) is hard and takes teams that are hard to come by.
Things are very interesting right now and for the next few years.
You can't just extrapolate and assume your 2W (or whatever) phone CPU will be like a 95W desktop if you just stick on a heatsink & fan and feed higher voltage & clocks to it.
If it were that simple, the CPU manufacturers could fire a whole lot of engineers.
It's like, you know, my Honda Accord is reaching performance parity with your Bugatti. (If I extrapolate based on how much I think I'd get performance by sticking in a big turbo and new exhaust, intercooler, and higher RPM redline. It's that simple, right?! No, actually it's not..)
We can equally say that you can't extrapolate and assume that your 45W x86, with lowered clock and voltage, will fit in the power budget of a mobile phone and still give good performance. But all these companies are throwing a lot of engineering effort at it and making solid progress. There's not some reason why x86, the architecture, should be faster than ARM. Intel has been enjoying a process advantage for many years now, and an enviable R&D budget, but that R&D budget is spent fighting against diminishing returns and in the meantime, competitors are catching up. As long as everyone can buy roughly the same node (a big "if", but looks like we're close enough), diminishing returns will bring everyone closer to parity.
But, benchmarks show this a bit better. Geekbench (I'm looking at Single Core):
iPad Pro 11-inch (iPad8,1) has the A12X at 2.5 GHz, score ~5000.
iMac 27-inch retina (iMac18,3) has the i7-7700K at 4.2 GHz, score ~5700.
This is not a cherry-picked comparison... this is just a comparison of whatever happens to be top of the line in both categories. Note that multi-core benchmarks will paint a slightly different picture, but that's very natural, since you can get a Mac with 18 core Xeons. Presumably, adding more cores to a mobile processor when you switch to desktop and can handle proportionally higher TDP, while not trivial, is not especially difficult either.
I assume that different benchmarks give different results as well. This is just one benchmark I know has results for both platforms.
Multi-core scaling past a relatively low number is extremely difficult. AMD & Intel both have complex interconnect fabrics to handle this. You can see an example of this with Intel's shift from a ring bus to a router mesh design: https://www.anandtech.com/show/11550/the-intel-skylakex-revi...
Or AMD's new chiplet design with Epyc 2 of the IO die + core dies connected via InfinityFabric.
> iMac 27-inch retina (iMac18,3) has the i7-7700K at 4.2 GHz, score ~5700.
Be careful with those clock speeds because the Intel Core i7-8700B @ 3.2 GHz shows a nearly identical score.
And just below that, supposedly, the Intel Core i9-8950HK @ 2.9 GHz comes in just a bit lower at 5348.
And the Intel Xeon W-2170B @ 2.5 GHz scored 5100. Higher than the A12X at the same clocks. That'd suggest Intel has an IPC advantage.
Of course the reality here is that the clocks listed on that page are complete nonsense. They're not the clocks the CPU was running at during the benchmark, which is a hugely important data point to have here given the variety in boost frequencies across different thermal/power situations. The i7-7700K wasn't running at 4.2 GHz, it was probably running at 4.5 GHz. And the i7-8700B wasn't running at 3.2 GHz, it was probably running at 4.6 GHz.
So depending on the actual clocks during the run the resulting scores may be more or less impressive. Similarly we don't know the actual power draw during those runs, which since we're talking about power efficiency here matters quite a bit.
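To put rough numbers on how much the clock assumption matters, here is a small sketch using only the scores and clocks quoted in this thread. The 4.5 GHz figure for the i7-7700K is the guess from the comment above, not a measurement, and the A12X's listed 2.5 GHz may be just as unreliable.

```python
# Geekbench single-core scores and *listed* clocks, as quoted in this thread.
LISTED = {
    "A12X": (5000, 2.5),
    "i7-7700K": (5700, 4.2),
    "Xeon W-2170B": (5100, 2.5),
}

# Guessed clock during the run (assumption from the comment above, not measured).
ASSUMED_BOOST_GHZ = {"i7-7700K": 4.5}


def points_per_ghz(score, ghz):
    """Crude per-clock throughput proxy: benchmark points per GHz."""
    return score / ghz


if __name__ == "__main__":
    for name, (score, ghz) in LISTED.items():
        line = f"{name}: {points_per_ghz(score, ghz):.0f} pts/GHz at listed {ghz} GHz"
        if name in ASSUMED_BOOST_GHZ:
            boost = ASSUMED_BOOST_GHZ[name]
            line += f", {points_per_ghz(score, boost):.0f} pts/GHz at assumed {boost} GHz"
        print(line)
```

The same score looks noticeably better or worse per clock depending on which frequency you divide by, which is why the listed clocks can't settle the IPC question either way.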
Based on that (and yes, both devices are now "old"), I'd trade my Mac Mini performance for the performance of my iPad without hesitation. Bring it, ARM!
Personally, if someone put an ARM laptop in front of me with a workable, non-spyware OS, I would take it in a heartbeat.
Meanwhile your Mac Mini was using some form of Intel's integrated graphics, which is notoriously trash tier. The very newest ones are sort of OK, but when push came to shove even Intel opted to ship AMD's Vega graphics in their own NUC instead.
Honest question, just wondering.
Sorry: answered it myself. See the instructions:
Five days a week you drive fifteen miles to work over roads constrained by traffic, getting up to 60MPH for two minutes and spending the rest of the trip at 0-30. On weekends you sometimes drive two hours to visit friends or go to a special event.
The fact is that the Honda is much more appropriate for your life than a Bugatti; 5 days a week you don't do anything that the Accord can't do, and on weekends you could take the money you saved by not buying the Bugatti and rent a Corvette or a Ferrari, and still come out ahead.
Most desktops and laptops spend 5 days a week being idle and waiting on RAM, SSDs, spinning disks, or network data... or worse, waiting for human input. During that time, you might as well have a cheap ARM. When you ask for peak power, mostly you could rely on an outboard GPU.
A simple example: You have designed an ATM using a lower-powered ARM CPU. 99.9% of the use cases never tax the CPU beyond 30%. However, in specific cases where the ATM must simultaneously access the bill dispenser, printer, and note acceptor, the hardware interrupts overwhelm the CPU and lead to 2-3 seconds of "CPU lock". After those 3 seconds, everything returns to normal.
The problem is that the bill dispenser protocol only allows a 1000 ms delay in acknowledging messages, so when this situation occurs, the lag is longer than the protocol allows, and it errors out and FUBARs the entire transaction.
In this case, even though this only occurs 0.1% of the time, the existence of this edge case will mandate that you not use this specific CPU.
Put another way: the ongoing cost of certain edge cases will be much higher in the long term than the extra $50 for a different CPU.
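The point about averages versus worst cases can be sketched in a few lines. The 1000 ms timeout and the 2-3 second stall come from the example above; the 5 ms "normal" acknowledgement delay and the sample of 1000 transactions are made-up illustration values.

```python
from statistics import mean

ACK_TIMEOUT_MS = 1000  # protocol limit from the example above


def missed_deadlines(ack_delays_ms, timeout_ms=ACK_TIMEOUT_MS):
    """Return every acknowledgement delay that exceeds the protocol timeout."""
    return [d for d in ack_delays_ms if d > timeout_ms]


# Hypothetical sample: 999 normal transactions, one hit by a 2.5 s stall.
delays = [5] * 999 + [2500]

if __name__ == "__main__":
    print(f"mean delay: {mean(delays):.3f} ms")
    print(f"failed transactions: {len(missed_deadlines(delays))}")
```

The mean looks perfectly healthy while one transaction in a thousand still blows through the deadline, which is the whole problem: a hard timeout is judged on the worst case, not the average.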
Network (TLS encryption), bill stacker over MDB (which requires continuous polling otherwise the devices shut down), credit card swipe/NFC, keyboard, display, receipt printer (USB) and dispenser.
Plenty of horse power left in both ARM Cortex-A8 and MIPS 24Kc configurations running Linux.
However, to expand on the requirements: these days many people throw around the term "ATM" sort of indiscriminately. So, they call everything from a simple 3rd party cash dispensing kiosk to full-on bank automation centers "ATMs".
The problem is that for a simple cash dispensing kiosk, you may be totally correct. Linux + ARM may work fine. However, the more functionality this device is supposed to have (i.e. do 90% of the functions of a real bank teller: deposit checks, cash checks, etc.), the more these edge cases become an issue.
Because of this tight integration with backend banking systems, and regulatory requirements, these edge cases become magnified (i.e. who says you can use Linux?). Additionally, since all other players in this market also have to deal with all these issues, the $50 difference in CPU cost is totally absorbed into the rather high-dollar price tag associated with this automated teller.
The Bugatti has higher performance but for 90% of people isn't the better car.
How long before Intel paint themselves into a corner by going for performance to justify their prices, when everyone just needs a bog standard ARM?
The challenges that ARM faces while competing with x86 are software maturity, and moving away from low-cost low-power designs towards larger and more performant designs (which will consume more power and cost more than traditional ARM designs).
x86's biggest weakness is its dependence on Wintel, which precludes the much-needed major ISA revisions.
If you look at Atom dies, the overcomplicated decoder and other x86 vestiges take more area than the rest of the core.
The A12X is remarkable in that it gets close to 15 watt Intel CPUs with LESS die area and lower power consumption. And if you remove the useless things like NPU, DRM stuff, security coprocessor, and other useless peripherals from the calculation, the comparison will really begin to look dire for Intel.
Adding to that, even if we take into consideration that Intel is still on 14nm and the A12X is a 7nm part, the A12X still wins even if Intel does a die shrink to 7nm. And you also have to consider that Intel has squeezed everything it can in terms of power efficiency from the 14nm node after 5 years of active development on it, while Apple really went for the very first baseline revision of TSMC 7nm.
Moreover, what I hear from the scene here in Shenzhen is that in the A12X Apple did not really put much into power saving: the A12X has nothing comparable to Intel's complex runtime power management, power and clock gating, separate power domains, and on-package smart DC-DC converters. If Apple commits itself to squeezing more power efficiency from their chips with equal zeal, I believe they could add an additional 25-35% to their power advantage.
The Atom is a bit of an edge-case since it has barely any cache (and the performance is exactly what you'd expect from that), yet it still takes up a significant amount of the die; in all other CPUs, the caches are far bigger.
The second biggest difference is that, in a big OoO core, the cost of decoding the instruction stream is trivial on ARM as opposed to merely small on x86. Something like 5% of the core's power budget on a modern x86, I think? One upshot of that is that on Intel chips the decode stage is balanced with the overall execution width so that they are only seldom decode limited. Whereas the attitude on high end ARM chips is that you might as well just throw in more decoders than you think you need so you can stop worrying about it.
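Why fixed-width decode is "trivial" can be shown with a toy model. The variable-length encoding below is invented for illustration (the low two bits of the first byte give the length); it is not how x86 encodes instruction length, which is considerably messier.

```python
def fixed_boundaries(code: bytes, width: int = 4) -> list[int]:
    """Fixed-width ISA: every start offset is known without reading any bytes,
    so N decoders can each be handed their own slot in parallel."""
    return list(range(0, len(code), width))


def variable_boundaries(code: bytes) -> list[int]:
    """Toy variable-length ISA (hypothetical encoding): the low 2 bits of the
    first byte, plus one, give the instruction length (1 to 4 bytes).
    Each start offset depends on the length of the previous instruction,
    so finding boundaries is inherently a serial walk."""
    offsets = []
    pc = 0
    while pc < len(code):
        offsets.append(pc)
        pc += (code[pc] & 0b11) + 1
    return offsets


if __name__ == "__main__":
    stream = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF])
    print(fixed_boundaries(bytes(16)))
    print(variable_boundaries(stream))
```

Hardware gets around the serial dependency by speculatively decoding at many offsets at once and discarding the wrong guesses, which is where the extra area and power go.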
But overall ARM has traditionally been the CISCiest of the RISC architectures and x86 has been the RISCiest of the CISC architectures, making me think they might have both ended up successful by being at a sort of happy medium.
I agree completely if you mean "RISCiest" as in the complexity of decoding instructions. Look at the opcode map of a VAX, for example --- instructions were simply added where they fit, so there's no easily discernable pattern in the bits of the first opcode byte. x86 has an octal structure to its instructions and the first quarter of the opcode map (000 through 077 octal) contains nearly all the commonly used ALU operations.
The 8086 along with the Z80 and its predecessors had to be implemented on a single chip, which put constraints on how complex its instruction encoding could be, which could explain why a regular structure (but not fixed length) was adopted instead of a more "true CISC" way of making every instruction opcode arbitrarily increasing and microcoding everything.
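The octal structure mentioned above is easy to see in code. This is a sketch covering only the first quarter of the one-byte opcode map (000 through 077 octal); the operand-form labels follow the usual opcode-map shorthand, and the function name is just for illustration.

```python
# Middle octal digit selects the ALU operation.
ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]
# Low octal digit selects the operand form (values 0 to 5 only).
FORMS = ["Eb,Gb", "Ev,Gv", "Gb,Eb", "Gv,Ev", "AL,Ib", "eAX,Iv"]


def decode_alu(opcode: int):
    """Decode a first-quarter x86 opcode byte into (operation, operand form).

    Returns None for bytes outside 000-077 octal, and for low digits 6 and 7,
    which are used for other things (segment push/pop, prefixes, BCD adjust).
    """
    top = (opcode >> 6) & 0o3
    mid = (opcode >> 3) & 0o7
    low = opcode & 0o7
    if top != 0 or low > 5:
        return None
    return ALU_OPS[mid], FORMS[low]


if __name__ == "__main__":
    for byte in (0x01, 0x29, 0x31, 0x3C):
        print(f"{byte:#04x} = {byte:03o} octal -> {decode_alu(byte)}")
```

Written in hex the pattern is hard to spot; written in octal, the middle digit is the operation and the last digit is the operand form.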
1. L1, L2, L3 cache sizes. Since these are implemented as SRAM, they can drastically increase chip area, and hence cost.
2. ISA Acceleration of certain types of algorithms - like encryption, codecs and AI.
There's absolutely nothing inherently lacking in the ARM ISA or architecture to impact #1, and ARM has consistently been adding instructions for #2, including SIMD support.
Apple chips perform so well because Apple can afford more cache for the same $ cost, as they make their own chips. Android phone manufacturers need to pay more so that Qualcomm can make its profits.
The A12X advantage comes not only from gigantic caches, but from how efficient the core is in using them. Anybody well versed in microarchitecture design will tell you that at some point increasing cache size will actually begin to slow you down.
Cache topology, address lookup logic, prefetch, and dark magic like smart cache invalidation circuits all matter a lot. The goal is to flush as few bits on a cache miss as possible, find stored values in cache faster, and do prefetch efficiently. Only once you can do that does the cache size/core complexity tradeoff begin to work.
Any normal CPU benchmark will be measuring a non-accelerated workload. So, measurements showing the A12X with a true lead thanks to a wider pipeline, bigger caches, and smarter cache management are very much expected.
Performance commoditization is truly upon us. ARM vs. x86 is quite literally a battle of existing software binary support. Things like WASM and JIT will further obviate the need to worry about ISAs when making product design choices. As of 2019, you can't go wrong picking an ARM-based design, provided you pick an SoC with sufficient cache, etc. for what your use case demands.
On your position, I don't agree. Purpose-built cores are here to stay. There is certainly no definite optimal cache configuration. And even inside the ARM space, approaches vary dramatically (in part because cache algos are a patent minefield second only to wireless): Samsung went largely with non-deterministic algos for OoO machinery, cache, and branch prediction; Qualcomm more or less went the Intel way by making fewer, but faster, execution units and adapting the cache for that; ARM always had size and cost in mind; and Apple did as aforementioned.
Even very minimal changes to performance profile of execution side may turn things upside down for people engineering caches.
Keep in mind that the cost of non-performance is irrelevance in the market, and that's a non-option for the likes of Qualcomm, Samsung, etc.
It could be a competitor's, but it is much more likely a patent troll's; there are as many of those in this area as in wireless, as I said. Cache algos today are said to be designed very specifically to work around known traps, even if it means going for suboptimal solutions.
Are you stationed in China by any chance?
Why do you ask? Where are you stationed?
I'm asking because I am very curious about the industry.
The larger the cache the higher the lookup latency becomes. That's why L1 and L2 rarely increase in size.
Athlon XP: 64+64 kB L1I/D (ok, this one had bigger caches than the competition, 16/32kB was more usual in that time frame); 256 kB L2
Zen: 64+32 kB L1I/D; 512 kB L2
Pentium III Tualatin: 16+16 kB L1I/D; 256-512 kB L2
Skylake: 32+32 kB L1I/D; 256 kB L2
The performance gains certainly plateau, but mobile caches weren't even trying to approach this limit until now, which explains a big part of the performance gap between desktop and mobile.
A cache is a latency-hiding device over a working set of unknown size. The latency of the cache depends on its size. So picking a cache size embodies a huge set of unknowns (the programs that will be run, their working set sizes, latency and size of main and swap memories etc.). In any case, for a given workload you could draw up a graph of cache size (and implied cache latency) vs throughput / latency and see that there is a sweet spot (larger cache does not substantially increase performance, but rather degrades it, due to increased cache latency).
Furthermore, when you do a shrink, you could put the implied performance gains into making a bigger cache of the same (absolute) latency, or making a cache of the same size with lower latency; the latter might be implicitly required since you now probably have a faster core that wants a cache with lower absolute latency.
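The "sweet spot" graph described above can be sketched with a toy average-memory-access-time model. Every constant here is an assumption picked for illustration: hit latency growing with log2 of the size, a miss rate of 10% at 16 kB that falls with the square root of the size (a common rule of thumb), and a 200-cycle miss penalty. Real curves depend entirely on the workload.

```python
import math


def amat_cycles(size_kb, miss_penalty=200.0):
    """Average memory access time = hit latency + miss rate * miss penalty.

    All constants are made up for illustration:
      - hit latency grows by half a cycle per doubling of cache size
      - miss rate is 10% at 16 kB and scales as 1/sqrt(size)
    """
    hit_latency = 1.0 + 0.5 * math.log2(size_kb)
    miss_rate = 0.10 / math.sqrt(size_kb / 16)
    return hit_latency + miss_rate * miss_penalty


def sweet_spot(sizes_kb):
    """Cache size with the lowest average access time under this model."""
    return min(sizes_kb, key=amat_cycles)


if __name__ == "__main__":
    sizes = [2 ** n for n in range(4, 17)]  # 16 kB .. 64 MB
    for s in sizes:
        print(f"{s:>6} kB: {amat_cycles(s):6.2f} cycles")
    print("sweet spot:", sweet_spot(sizes), "kB")
```

With these particular made-up parameters the minimum lands at 4096 kB; past that point, the extra hit latency costs more than the remaining misses save. Change the miss penalty or the working set and the sweet spot moves, which is the "huge set of unknowns" part.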
Still no real benchmarks from the Qualcomm ARM boards Cloudflare is using, or availability to anyone else except Cloudflare. (Unlikely to ever happen now with the CPU being discontinued; it seems pretty damn unlikely it was ever really very competitive.)
Maybe you don't consider those "real" benchmarks, but they look like benchmarks to me.
However if your data centers are capacity constrained by total power supply, then you can double the capacity of your data center by going for a CPU that is twice as efficient (2x is the approximate difference that they measured between AMD and Intel).
“Every request that comes in to Cloudflare is independent of every other request, so what we really need is as many cores per Watt as we can possibly get,” Prince explained. “The only metric we spend time thinking about is cores per Watt and requests per Watt.” The ARM-based Qualcomm Centriq processors perform very well by that measure. “They've got very high core counts at very lower power utilization in Gen 1, and in Gen 2 they're just going to widen their lead.”
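As a back-of-the-envelope sketch of the power-capped argument: if the building's power budget is the binding constraint, throughput scales with requests per watt. The 1 MW budget, per-server wattages, and per-server request rate below are all hypothetical; only the "twice as efficient" ratio comes from the comment above.

```python
def servers_within_budget(power_budget_w, watts_per_server):
    """How many servers fit under a fixed facility power budget."""
    return int(power_budget_w // watts_per_server)


def total_throughput(power_budget_w, watts_per_server, requests_per_sec_per_server):
    """Aggregate requests/sec when power, not floor space or cost, is the limit."""
    return servers_within_budget(power_budget_w, watts_per_server) * requests_per_sec_per_server


if __name__ == "__main__":
    budget_w = 1_000_000  # hypothetical 1 MW facility
    baseline = total_throughput(budget_w, watts_per_server=400, requests_per_sec_per_server=10_000)
    efficient = total_throughput(budget_w, watts_per_server=200, requests_per_sec_per_server=10_000)
    print(f"baseline:  {baseline:,} req/s")
    print(f"efficient: {efficient:,} req/s ({efficient / baseline:.1f}x)")
```

This only holds for workloads like the one Prince describes, where requests are independent and per-core speed matters less than the number of cores you can power.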
If the race is for requests per watt alone, then you're probably right, but there's always real-world grittiness that needs to be addressed. That's how Google succeeded, right? Leveraging common platforms and hardware for their highly specialized software combo.
Using the average German electricity price (which is among the most expensive you can have) of 0.33 EUR / kWh, even if you are playing six hours a day with an average power consumption of 400 W (which you can only reach using a high-end PC and a big, bright monitor), you would still pay less than one Euro per day on electricity for that. The cost of the hardware makes that operating cost pretty much irrelevant.
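The arithmetic behind that claim, using exactly the figures quoted above (400 W, six hours a day, 0.33 EUR per kWh); the function name is just for illustration.

```python
def daily_cost_eur(power_w, hours_per_day, price_eur_per_kwh):
    """Electricity cost per day: kW * hours * price per kWh."""
    return power_w / 1000 * hours_per_day * price_eur_per_kwh


if __name__ == "__main__":
    cost = daily_cost_eur(power_w=400, hours_per_day=6, price_eur_per_kwh=0.33)
    print(f"{cost:.2f} EUR per day, {cost * 365:.0f} EUR per year")
```

That works out to about 0.79 EUR per day, or roughly 290 EUR over a full year of playing every single day.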
However, I'm not sure how it would perform compared to my somewhat obsolete AMD Athlon II X2 260, and other posts have convinced me the power savings wouldn't necessarily be significant enough to pay for it.
> ARM processors like the A12X Bionic are nearing performance parity with high-end desktop processors
the reality is
> ARM processors from Apple like the A12X Bionic are nearing performance parity with high-end desktop processors
There are no other ARM CPUs that are this fast. Not even close. Yes, Geekbench is not the best, but still, if you look at https://browser.geekbench.com/android-benchmarks vs https://browser.geekbench.com/ios-benchmarks the difference is staggering; in multicore the difference at the top is close to 100%. And no, I am not an Apple fanboy, couldn't be further from the truth, see http://drupal4hu.com/future/freedom my post from almost a decade ago.
So, it is getting closer than ever before, but we're still quite far from closing the gap.
The T2 security coprocessor in the Mac -- a minimalist ARM implementation to manage a few very specific parts of the Mac -- does HEVC encoding (i.e. dramatically more complex than decoding) thirty times faster than the Intel chip it sits beside. I can't say it's a 30x faster chip, however.
High TDP Intel chips definitely are much more powerful than Apple's A-series chips right now. It will be interesting to see what Apple can do with a larger power profile and active cooling, however. That's theoretical, but they should have an enormous amount of headroom to exploit.
It happens to have hardware-accelerated HEVC encoding.
I did once throw a photo of a mac motherboard into an image editor and estimated the T2 package dimensions. They were a very close match for the A10 Fusion, within the margin of error.
Clearly Apple copy-pasted their core, but the T2 serves such a novel purpose in the Mac, and has some unique performance requirements, that it seems very unlikely that they sourced it from the A10 reject bin. Much more likely they simply used one of their ARMv8 cores with some custom IP particular to the Mac as they slowly moved the line to ARM. This is all just speculation though.
Taping out a new SoC design costs Apple hundreds of millions of dollars and years of engineering time. Reusing those pieces of A10 silicon that had defects affecting only part of the chip is essentially free.
Every other SoC/CPU manufacturer does this: sells the same piece of silicon at different clock speeds or with half the cores disabled.
The fact that Apple typically doesn't do this binning actually puts Apple at a disadvantage cost-wise.
> but the T2 serves such a novel purpose in the Mac, and has some unique performance requirements
Sure, but the performance requirements are all a subset of what the A10 can already do. There is no need for a large GPU like the A10 has, the T2 only drives a tiny, low animation display. Yet the GPU takes up ~35% of the A10 die. There is also no need for such powerful CPU cores.
If Apple were designing a custom SoC for inclusion in Macs to meet those requirements, it should logically be much smaller than the A10.
Yet we have this T2 chip which is basically the same size as the A10.
When Apple moved to the T-chips they also moved several system management chips to it as well (e.g. copy-pasting the design alongside the A10).
Software on a general purpose CPU is great, but it simply isn't nearly fast enough for many system management functions. For >4GB/second DMA to go through it (for security oversight, encryption/decryption, etc), for instance -- that simply eliminates it from being a binned A10 at the outset, which was never designed around such a high performance need. The specialized display controller for the touchbar is absolutely nothing like the very purpose-developed display controller in the A10, either. These are all differences that would make it a terrible hack for them to use an A10.
Until we have imaging of the T2 innards we can't say, but I'd say with 99.999%+ certainty it is simply not possible for it to be a binned A10. An A10 single core integrated on some new IP (with purpose-suited blocks that perfectly fulfill their roles concurrent with the general processor), sure, but not an A10.
I have wondered whether they should be using the GPU in the A10/T2 too, instead of the Intel integrated graphics. Would probably perform rather well!
My only point is that assuming Apple makes the best low power CPUs should not necessarily imply they can make good high power CPUs. They may have to build up the competency over time just as they did in the initial iterations of the A series.
I disagree. SIMD can be used in numerous cases, and not only in games/video. And to the end user, it does not matter which part of the CPU/GPU is used.
SIMD instructions aren't the difficult part of matching Intel's CPU architecture. Fast cache hierarchies, prefetching, branch prediction and out of order instructions etc. are the parts that have to be matched.
Plus a sane ISA extension strategy that actually gets people to use your SIMD instructions helps a lot (Intel hasn't done well here, but so far neither has ARM e.g., with almost non-existent support for SVE).
SIMD performance is one place that Intel is still fairly far ahead of both AMD and the ARM competition (with Apple far out ahead in that group).
I happen to agree with the GP that comparing SIMD accelerated to not can give a misleading picture, and so deserves to be noted - especially when one party has relevant SIMD instructions but they weren't used for some reason.
I remember people whining in the 90s x86/Alpha/POWER comparisons that some compilers were aware of fused Multiply-Add operations, too, but everyone who was trying to make a purchasing decision just tuned them out since they wanted to know realistic FLOPS/$ ratios.
However it is still impressive that the A12X could hit 80% of a fairly aggressive Intel design at 60% of the clock.
Certainly active cooling would help ARM chips sustain the performance they can get for short periods of time without active cooling.
So ARM is closing the gap and the A12X is a pretty impressive chip. Certainly plenty for many use cases met today by Intel desktops/laptops.
But to hit the same clock speeds Intel's using might well require Apple to add an extra stage in the pipeline and/or run the cache at a lower fraction of the CPU clock. Not to mention that increasing clock speeds without decreasing memory latency will also hurt IPC. Any of these changes would hurt IPC and make it that much harder to reach 100% parity with Intel.
He is saying that despite using a small fraction of the power, and running at a significantly lower frequency, the Apple chips have only a small deficit on Spec2006 (and, not noted, but for several benchmarks, no deficit at all - a lot depends on whether SIMD plays a big role).
Under that scenario, it is reasonable to assume that if you, say, tripled the power budget and adjusted the frequency to match the larger power budget and "full size" cooling solution, there would be a jump in performance. They are not claiming it would be 3 (power) x 1.5 = 4.5x faster, i.e., that A-series chips would be many times faster than Intel chips.
I think it is entirely reasonable to assume it would be in the range of 20% or more, however. Certainly Intel chips scale up and down based on exactly those factors.
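The IPC point follows from performance being roughly IPC times clock. A tiny sketch, reading the "80%" and "60%" figures above as relative performance and relative clock speed (that reading is an assumption, though it matches the 2.5 GHz vs 4.2 GHz listed clocks quoted elsewhere in this thread):

```python
def relative_ipc(relative_performance, relative_clock):
    """performance ~ IPC * clock, so IPC ratio = performance ratio / clock ratio."""
    return relative_performance / relative_clock


if __name__ == "__main__":
    ratio = relative_ipc(relative_performance=0.8, relative_clock=0.6)
    print(f"implied per-clock advantage: {ratio:.2f}x")
```

That is an implied per-clock advantage of about 1.33x, and it is exactly the margin that a deeper pipeline or slower relative cache would eat into when chasing higher clocks.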
In my opinion, in the case where apples-to-oranges comparisons are possible (low TDPs), Apple's newest chips are already faster than Intel chips. In a high TDP scenario, the same would be true with basic re-targeting of voltages, frequencies, etc (no uarch changes). We don't have the latter yet, and by the time we do we will probably see Ice Lake and Sunny Cove from Intel, so the pendulum may swing back the other way.
s/ARM/Apple/ really. They have a great chip development team, but they are focusing, obviously, on mobile and do not seem to be interested yet in desktop, and even less in servers, except possibly as a byproduct of their mobile development.
Isn't it pretty much taken as a given that Apple is going to produce an ARM powered laptop in the next couple of years?
Running x86 code faster in an emulator than on a real chip might not ever happen.
And in not too long the x86-64 patents will have expired all the way through SSE4... I think Apple making their own x86 chips is just as likely as switching to ARM.
I kind of hate the way you throw this out there as if it invalidates the whole point of the article. You sort of walk it back with the rest of your comments, but the damage is done within the first line, overemphasizing nitpicks at the expense of the greater picture. I guess such things should be expected on a forum filled with pedantic engineers... I'm guilty of this myself.
I hope people still take the time to read what is an otherwise interesting, fair minded view that proves the central point it sets out -- that ARM is not an inherently inferior architecture and that recent designs from Apple prove this handily.
I think the logic is that with better cooling, you could have the same ARM chip running at higher voltage and clock speeds, under sustained load, and the result would compare favorably for the ARM chip both on 'burst performance' and 'sustained performance' metrics.
What makes you think this is not true? Is there anything in fast ARM chip designs that makes them only optimized for burst loads and hence inherently unusable for sustained loads? In a sense, you could make the same argument for x86 desktop CPU's, seeing they are also not able to maintain boost clocks very long for sustained loads either.
The article specifically addresses this point: current ARM chips are mostly held back by the passive cooling of phones and tablets, which is a property of the device itself and not of the possible performance you could theoretically get from the CPU.
so it's netburst all over again, see how that turned out.
higher voltage could, cooling being a consequence. at best cooling could stave off throttling.
but increasing voltage at 7nm gets you massive leakage currents real fast so you hit a wall in scaling that's not just about cooling it better.
so yes, basically all the misconceptions that netburst had about power, density and scaling are re-proposing themselves as "but no! it's about the cooling"
basically everyone defending this is assuming power, cooling, clock speed and transistor density are independent variables, which they aren't.
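The coupling between voltage, clock and power can be sketched with the textbook first-order model for dynamic (switching) power, P = activity * C * V^2 * f. This ignores leakage entirely, and the 1.2x voltage bump needed for a 1.5x clock is a made-up illustration value, not a figure for any real chip.

```python
def dynamic_power(c_eff_farads, voltage, freq_hz, activity=1.0):
    """First-order switching power: P = activity * C * V^2 * f (leakage ignored)."""
    return activity * c_eff_farads * voltage ** 2 * freq_hz


def relative_power(freq_scale, voltage_scale):
    """Power multiplier when frequency and voltage are both scaled."""
    return voltage_scale ** 2 * freq_scale


if __name__ == "__main__":
    # Hypothetical: 1.5x clock that needs 1.2x voltage to stay stable.
    print(f"same voltage:   {relative_power(1.5, 1.0):.2f}x power")
    print(f"+20% voltage:   {relative_power(1.5, 1.2):.2f}x power")
```

Under that assumption a 1.5x clock costs 2.16x the switching power before leakage is even counted, which is why performance per watt falls off as chips are pushed up the frequency curve.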
> "the fact"
In the context of building desktop-grade ARM CPUs we're not talking about 5 GHz overclocks, but about going from passive cooling in a cramped space without any airflow, to something more like a laptop or workstation with active cooling. You also don't need to go all the way up to 5 GHz: the A12X, for example, runs at a maximum clock speed that is 40% lower than the base clock of the i7 in the benchmarks the article refers to, yet it already manages to be quite close in terms of performance.
they have less cache on board, less cores, different integrated gpu; they don't support higher frequency ddr and they have limits on the memory bandwidth.
are you sure you don't want to check in with any of those facts of yours before pursuing further conversation?
modern cpu architectures are built to fit their own constraints maximally. once you start changing voltage and make some part of the cpu hotter than the original spec you might very well find out that whole parts of the chip need to be shifted around or redesigned to spread the load differently.
of course a slower chip is better performing watt for watt, the whole point is that the relationship is not linear! that doesn't mean you can just upclock the chip by adding cooling, nor that upclocking a chip won't require a significant redesign.
those parts are already pushing their envelope, or are you implying Apple is specifically wasting money on their chips?
this seems a good time to remind how clock speed and chip features are intertwined with yield and impurities https://en.wikipedia.org/wiki/Overclocking#Factors_allowing_...
it's not like "just add cooling to the cpu and it'll tolerate a higher voltage" - not at all.
I would say the envelope Apple is pushing with their designs is currently almost exclusively bound by the working environment their SoC's run in: limited cooling and limited battery. They probably spend more time optimizing their software to make more efficient use of their chips than optimizing their cooling solution, because there simply is no room for fans and airflow.
That does not say anything about how suitable these chips could be with better cooling though. The fact that they are optimized for low power does not mean they cannot run, or be minimally redesigned to run, at higher clock speeds. In fact, that's exactly what Apple is already doing, by using virtually identical variations of their SoCs across iPhone, iPad and AppleTV, running at different clock speeds.
>> are you sure you don't want to check in with any of those facts of yours before pursuing further conversation?
>> this seems a good time to remind how clock speed and chip features are intertwined with yield and impurities
I don't know why you need to be so dismissive and aggressive in your comments, especially since so far you have not brought up anything countering any of the arguments made by anyone else in this thread.
Maybe you can address the observation already made by dotaro about Intel making literally 20 different variations of the same CPUs, scaling from the ULV end with low clock speeds and limited cooling options, all the way up to the HPC end where clock speeds, TDP, etc. are large multiples of what goes into the ULV parts? What makes you think this is only possible with x86 chips and not with the ARM-based designs Apple uses? Do you think each variation of an Intel x86 chip from the same generation is a completely different design that was built from the ground up to fit that particular use case?
You seem to be stuck on equating having the option of a better cooling solution so you can push the design of e.g. an A12 chip to higher clock speeds and close the already small gap with x86 chips, with going full-scale Netburst architecture with low IPC compensated by crazy clock speeds and ultra-deep pipelines, by means of nothing more than pushing an imaginary turbo button and call it a day. Nobody suggested that but yourself.
Just because you can scale down an HEDT- or HPC-focused design to the point of making it run as a ULV chip under very challenging thermals, doesn't mean that the resulting chip will perform very well. We've seen this time and time again with x86 vendors trying and failing to enter the lucrative "mobile" segment. And it's not clear why we should expect a different outcome when mobile-focused vendors try the reverse play, by attempting to "scale up" their existing designs. One size very much doesn't fit all in the semiconductor industry.
And while I bet you could overclock an Apple core somewhat if you used liquid nitrogen or whatever it will still have more logic between clock latches than an Intel processor does. That deeper pipelining means that Intel will be able to clock higher than you can for any given process/voltage/temperature combination. Apple has some very talented CPU architects and I'm sure they could design a high performance chip. But it won't be the same one that runs in iPhones.
it's addressed. it's literally on the first line of the comment you're replying to.
> limited cooling options
and again, you can't just 'cool away' leakage currents, that's not how it works, that's not how any of this works.
> you can't just 'cool away' leakage currents
You certainly can: the leakage current is going to be proportional to (1 - e^(-qV/kT)), so leakage will tend to go down exponentially at lower temperatures. Yes, I know what you actually meant and you're still wrong there because if you have better cooling you can accept more leakage, meaning you can use a process with a lower threshold voltage and accept the larger amount of leakage current that results.
again, no, better cooling staves off the thermal runaway effect. you can think of it as generating less leakage per voltage unit, which is quite the opposite of accepting more leakage.
higher operating voltage allows for accepting more leakage.
and there's still no indication that cooling a specific chip design (because we're not talking about chips in general here) would prevent thermal issues within the chip, because you can only work on reducing the average surface temperature opposite to the pins - or, you can't just pick a single part of the argument and disprove it in isolation, because the whole thermal issue is not a premise, but ties into the whole claim of "we could just make this chip run with cooling and it'll beat down intel's" - which it cannot, because you can't just cool away leakage currents, the chip has to be meant to be cooled and to be run with higher voltage at design phase.
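For reference, the usual textbook first-order model of subthreshold leakage, which is what both sides of this exchange seem to be gesturing at, looks roughly like this. The symbols are the standard ones (I_0 a process-dependent prefactor, n the subthreshold slope factor, V_th the threshold voltage); none of them come from the comments above, and real 7nm leakage has other components (gate and junction leakage) that this leaves out.

```latex
I_{\mathrm{sub}} \;=\; I_0 \, \exp\!\left(\frac{V_{GS} - V_{th}}{n\,V_T}\right)
\left(1 - \exp\!\left(-\frac{V_{DS}}{V_T}\right)\right),
\qquad V_T = \frac{kT}{q}
```

Two things fall out of it. Leakage is exponentially sensitive to temperature (through V_T, and because V_th itself drops as the chip heats up), so better cooling does reduce it. It is also exponentially sensitive to the threshold voltage, which is a design and process choice made long before the heatsink is bolted on.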
One thing he would constantly avoid though, was the fact that for short 10 minute sessions, that may hold true, but for extended play sessions the CPU and GPU would likely be throttled without any form of active cooling.
Otherwise that's a lot of heat applied directly to the back of a very expensive display and none of the components will last very long.
Correct, without active cooling basically every CPU or GPU out there will throttle after extended play sessions. The point isn't to compare an iPad to a workstation but rather an (Apple) ARM SoC to an (Intel) x86 CPU. The chip itself and the implementation in a device are different things. The intrinsic performance of the chip is what you see in those short bursts. The implementation performance is what you see in the long runs. Saying a chip is no good because it doesn't get enough cooling makes no sense.
Anecdata: I have a laptop and a desktop both with Intel i7 3770 CPUs. The laptop CPU always throttles even without extended use while the desktop never does it. What's the conclusion, that the i7 3770 is much better than the i7 3770 because the i7 3770 won't throttle after long play sessions while the i7 3770 will?
It's not as easy as it sounds, both technically and business wise (does Apple have enough economy of scale just on the high-end desktop compared to Intel? Doubtful.), but it's entirely feasible.
I would like to understand this statement better. In 2012, I read this article saying ARM chips were matching x86 chips and that we'd be seeing ARM desktops, ARM servers, and ARM laptops within 2-3 years. https://liliputing.com/2012/02/fastest-arm-chips-are-compara...
But it is now 2019. Aside from my phone, access point, and tablet, which are ARM based, everything else is still x86-64. Am I an outlier? What data leads you to believe that "entirely likely that their fabled next Mac Pro is entirely ARM-based"?
> does Apple have enough economy of scale just on the high-end desktop compared to Intel? Doubtful
I don't understand this statement. Is the volume of chips "manufactured" by Apple significantly lower than the volume of chips manufactured by Intel?
Apple is affected the least by this Intel lock-in, since they are the biggest seller of high-performance hardware to consumers and are capable of doing their own support infrastructure. Moreover, they have been heavily investing in and building up a very successful processor unit of their own. Finally, Apple has famously transitioned their entire hardware platform multiple times already when it felt like their current hardware platform didn't suit them strategically. They've shown to be capable of supporting their own oddball hardware platforms before Intel, when they were a lot smaller still. Given Apple's level of control over their own technology and Intel's recent stagnation, I think it's very likely Apple wants to move in this direction if they are capable of doing it from a business perspective.
The Mac Pro might be a bold product to begin with since it's so focused on professional users and Apple is really peculiar about having a bold vision on the high-end desktop. It would also be a clear signal of their strategy to transition the Mac Pro to this new architecture. The reason I threw in that sentence about economy of scale is that the high-end desktop is a small market for Apple, and it might not be worth it for them to make large workstation class chips for such a market. Especially considering that Intel's high-end Xeon W-line is based on the exact same silicon as Intel's high end servers, and therefore Intel just makes a lot more high end chips than Apple would.
Of course, Apple would be able to work around this with a chiplet architecture like AMD is doing, but I feel that we're already too far in speculation territory.
The even bigger problem is ARM's closed nature, with no support for the open off-the-shelf culture that has made the PC industry what it is today. ARM is about closed SoCs, closed drivers and closed vendors and this in effect closes up the drivers and software ecosystem.
Something like Linux and the open source movement would not have happened with this hardware model, and it is a paradox of our times that it is Linux, developed because of the open culture of x86, that is used to support this closed model. There is something ironic, even parasitic, about this.
There are large forces of centralization and control currently in play and getting excited about a closed ecosystem becoming mainstream seems shortsighted for the tech ecosystem and consumers who have benefited from widespread choice, competition and the open source movement in x86.
I don’t know if Apple’s ARM is fully-custom. It wouldn’t surprise me if the fast-path parts are. Standard cell designs can be pretty fast due to constant advances in synthesis. They’ll always be behind full-custom on the same process node just because the latter puts more optimization effort in. Most choose standard cell since it’s faster to develop (time to market) and cheaper. Those wanting max performance or lowest energy will be using full-custom if they can afford it. Also worth noting that the Apple A12 is 7nm vs Core i7’s 14nm per Intel’s site. Apples to apples would compare that design on 14nm or a node with similar performance to it.
Btw, there’s detailed analysis below of the A12 with specs, parts breakdown, and die shot.
BTW, the article uses the acronym IPC without explaining it. It stands for Instructions Per Cycle. CPUs can and do execute multiple instructions per clock cycle so this is just a measure of how many.
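To make the definition concrete, here is a small sketch of how IPC relates to clock speed. The function names and all counter values and clock frequencies are made up for illustration; they are not measurements of any real Apple or Intel core.

```python
def ipc(instructions_retired, cycles):
    """Instructions Per Cycle: retired instructions divided by elapsed clock cycles."""
    return instructions_retired / cycles


def instructions_per_second(ipc_value, clock_hz):
    """Throughput is IPC times clock frequency."""
    return ipc_value * clock_hz


if __name__ == "__main__":
    # Hypothetical counter readings for two hypothetical cores.
    wide_core_ipc = ipc(8_000_000, 2_000_000)     # wide core at a lower clock
    narrow_core_ipc = ipc(5_000_000, 2_000_000)   # narrower core at a higher clock
    print(wide_core_ipc, instructions_per_second(wide_core_ipc, 2.5e9))
    print(narrow_core_ipc, instructions_per_second(narrow_core_ipc, 4.0e9))
```

The point of the sketch is that a core with higher IPC at a lower clock can deliver the same instruction throughput as a core with lower IPC at a higher clock, which is why IPC alone, or clock speed alone, does not settle a comparison.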
I would buy such a laptop with zero qualms about them bungling the migration. If it had a decent keyboard and got rid of the touch bar. I'm much more concerned about having my experience ruined by those.
I'll readily admit the thing is more cool than genuinely useful. It's just not such a gimmick as to actually "ruin" an experience. And the gain from TouchID offsets my pain from not having a physical escape key.
In this respect, the MacBook Air gets things right.
WRT the keyboard... your mileage will vary. I hated it at first. Pretty used to it now.
Regarding the touch bar: Indeed the Escape key is what breaks the deal. Honestly if it began after ESC I'd tolerate it - expensive, close to useless, but tolerable.
Apple’s position would then be: great battery life unless you need to run old Intel software.
The i7-6700K is a 3 billion transistor 14 nm 95W chip.
If you assume linear scaling on all three metrics (bad assumption, but rough rule of thumb) you get 10/3 * 14/7 * 12/95 ≈ 0.84, i.e. about 84%, roughly in line with benchmark results.
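The rule-of-thumb arithmetic above can be written out as a short sketch. The Intel-side figures are the ones stated in the comment; the Apple-side figures (10 billion transistors, 7 nm, 12 W) are simply read off the formula, and the function name is hypothetical. The linear-scaling assumption is the commenter's own and is admittedly crude.

```python
def linear_scaling_ratio(transistors_a, transistors_b,
                         node_a_nm, node_b_nm,
                         watts_a, watts_b):
    """Naive estimate of chip A relative to chip B, assuming performance
    scales linearly with transistor count, inverse feature size and power."""
    transistor_factor = transistors_a / transistors_b
    process_factor = node_b_nm / node_a_nm   # smaller node counts in A's favour
    power_factor = watts_a / watts_b
    return transistor_factor * process_factor * power_factor


if __name__ == "__main__":
    # A = Apple chip (figures taken from the formula in the comment),
    # B = Intel chip (figures stated in the comment).
    ratio = linear_scaling_ratio(10e9, 3e9, 7.0, 14.0, 12.0, 95.0)
    print(f"estimated relative performance: {ratio:.2f}")
```

With these inputs the three factors are roughly 3.33, 2 and 0.126, so the large transistor and process advantages are mostly cancelled by the much smaller power budget.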
I believe the GPUs are very different as well and that could play a large role given how GPUs love to eat up mm²
The community never received an explanation from the authors on these discrepancies, so everybody should be cautious of these promises that these biases are eliminated in GB4.
So what benchmark would be fair to you? Even Cinebench doesn't closely approach a daily workload for the typical Cinema 4D user, it's just a different way to tax the system at full.
Also, throttling is less of (or even completely not?) an issue on iPads than on iPhones. You see that on benchmarks that typically tend to reach the throttling limits on iPhones like AnTuTu. The performance gap is much wider while the SoC isn't that much more powerful.
A desktop system with an ARM processor should closely match the synthetic results of GeekBench since GeekBench measures maximum performance not persistent performance.
But I don't mind more competition on the market as long as it doesn't cost me headaches like 'this package doesn't compile on arm' or whatever.
Integer math, on the other hand, is what most of the programs you use every day are made of. And their complex, branch-rich code is nearly impossible to feed to any specialised DSP.
Modern CPUs must be compared on integer math and logical operations performance, followed by their IO performance.
What an average user understands as performance today is really (integer perf + logic op perf) * effective I/O throughput.
Is this correct? Is the final output of the silicon from each architecture fundamentally different?
Can you put 32+ GB of dual-channel RAM in there?
It's not only the CPU performance that matters...
This made me chuckle.
Make it happen, Apple!
You really expect Apple to release something more open?
from the article:
> The next issue I want to address are fallacies I've seen permeate discussions around ARM performance: that a benchmark is not at all useful because it is flawed or not objective in some way. I understand the reaction to a degree (I once wrote an article criticising camera benchmarks that reduce complex data into single scalar numbers), but I also believe that it’s possible to understand the shortcomings of a benchmark, not to approach it objectively, and to properly consider what it might indicate, rather than dismissing it out of hand.
Where Intel still has a big lead is in multi-threaded performance and heavy SIMD use.