Modern Microprocessors – A 90-Minute Guide (2001-2016) (lighterra.com)
443 points by projectileboy 4 months ago | 87 comments



This is no doubt obvious to hardware folks, but one enlightening moment is when I came to understand register renaming.

Previously I had the (wrong) idea that rdi, rsi, etc corresponded to physical bits. Register renaming involved some exotic notion where these registers might be saved and restored.

Now I understand that rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph. Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards.

Alas we have finite bits in our ISA, so we must re-use registers. Register renaming is the technique of reconstructing an approximate infinite-register interference graph from its compressed form, and then re-mapping that onto a finite (but larger) set of physical registers.

Mostly "register renaming" is a bad name. "Dynamic physical register allocation" is better.
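To make the idea concrete, here is a minimal Python sketch. It assumes a toy three-operand instruction format `(dest, src1, src2)` and an unbounded pool of physical registers; the function name `rename` and the names `p0`, `p1`, ... are made up for illustration and do not correspond to any real CPU structure.

```python
from itertools import count

def rename(program, arch_regs=("rax", "rbx", "rcx")):
    """Map architectural register names onto fresh physical registers.

    program: list of (dest, src1, src2) tuples using architectural names.
    Every destination gets a brand-new physical register, so only true
    (read-after-write) dependences remain in the renamed program.
    """
    fresh = (f"p{i}" for i in count())
    # Each architectural register starts out in its own physical register.
    table = {reg: next(fresh) for reg in arch_regs}
    renamed = []
    for dest, src1, src2 in program:
        # Sources read whichever physical register currently holds the name.
        s1, s2 = table[src1], table[src2]
        # The destination always gets a new physical register.
        table[dest] = next(fresh)
        renamed.append((table[dest], s1, s2))
    return renamed

program = [
    ("rax", "rbx", "rcx"),  # rax = rbx op rcx
    ("rbx", "rax", "rax"),  # rbx = rax op rax (writes rbx after insn 1 read it)
    ("rax", "rcx", "rcx"),  # rax = rcx op rcx (writes rax again)
]
renamed = rename(program)
for insn in renamed:
    print(insn)
```

After renaming, no physical register is written twice, so the third instruction no longer has to wait for the first two even though all three mention rax.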


"Now I understand that rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph. Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards"

I'm under the impression that we don't see the causality flowing in the same direction.

To me we first had a limited set of registers (imposed by the ISA), then to get better perf through out of order execution, cpus had to infer a deps graph and use register renaming.

Ironically all this silicon is spent to recover information that was known to the compiler (eg through SSA) and lost during codegen (register allocation).


The space in an instruction is very limited. (If the representation of an instruction needs more bits, you need more bandwidth, more cache space, etc.) So it can be beneficial to address only 8 registers but then have the hardware detect spilling to RAM, etc. (It can even be beneficial to specify only 2 registers (a = a + b instead of a = b + c) and replace a copy between registers with a rename.)


Exactly. In principle even memory can be renamed, although I'm not sure any current CPU actually does it (there are rumors). It would be great if the actual SSA graph could be directly passed from the compiler to the CPU, but what's saved by getting rid of renaming would probably be used to handle the much harder decoding. It would probably have implications for context switching overhead.


And essentially that's the idea that the VLIWs with no interlocks have - one I worked on made the output registers of the execution stages architecturally visible


You’re really reconstructing a data dependence graph, not an interference graph. In an interference graph, there is an edge between nodes that are alive at the same time. That information remains implicit in the OOO machine (because a physical register stays allocated to a value through retirement). In a data dependence graph there are edges from operations to their operands. You need to reconstruct the data dependence links explicitly: during renaming you need to map the input operands to the correct renamed physical registers. This is where “renaming” comes from. You “rename” the architectural register operands in each instruction with the corresponding physical register or tag: https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml. After renaming, the data dependence graph will be explicit in the reorder buffer.


A couple of more points. The architectural registers are real places. Say you write to RDX, don’t do anything with it for a million instructions, then read from it. Where does the data come from? There is an architectural register file and RDX is a specific register in it. The renaming is only temporary. An instruction doesn’t hang on to a ROB entry and rename register until every instruction that may use it has executed. When the instruction is retired in order, the value is written back to the architectural register file, and the rename register and ROB entry are freed for some other instruction. So the renaming is ephemeral, only lasting until retirement.


> A couple of more points. The architectural registers are real places. Say you write to RDX, don’t do anything with it for a million instructions, then read from it. Where does the data come from? There is an architectural register file and RDX is a specific register in it.

This was true on all the P6-derived cores, but is no longer true on any modern CPU that uses a PRF.

On PRF cpus (SNB and everything after it), the final backing store of register data is the PRF, and the architectural registers are just pointers to this file. Every instruction gets renamed in the frontend, and there simply is no architectural register file.

Also, ROB does a very different task on those CPUs, to the point that still calling it ROB is misleading. The current architectural register names are held in the frontend, as part of the register rename hardware, not in the ROB. ROB holds on to instruction status, and the PRF register names for all the sources. So a used PRF entry can be garbage collected when it no longer exists in either the architectural name file or the ROB.

For a more detailed explanation, read: https://www.realworldtech.com/sandy-bridge/5/


That's correct, but a bit in the weeds. I was addressing OP's statement: "rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph." In a modern CPU, this is true only for speculative state. If I have "ADD RAX, 1" and RAX was last written by an in-flight operation, I don't care about RAX qua RAX. It's just a name that connects an operation that produces a value to an operation that consumes it. However, if the instruction that last wrote RAX was committed long ago, what we care about is the architectural state of RAX. That distinguishes modern OOO CPUs from pure data flow machines.

This fact is easiest to see in a ROB design: https://courses.cs.washington.edu/courses/cse378/10au/lectur... (page 7). The front-end rename table may point either to a ROB entry, or the architectural register file. Note, however, that the basic distinction exists in a PRF design too--some physical registers will hold architectural values (pointed to by the RRAT or similar structure) while others will hold speculative values. We can quibble about terminology, but there will be a subset of the PRF, that is pointed to by the RRAT, that contains the machine's non-speculative, architectural state.


> However, if the instruction that last wrote RAX was committed long ago, what we care about is the architectural state of RAX.

What you care about is which PRF entry will contain the valid value of RAX for the instruction going through rename. Unlike in earlier designs, post-SNB ones always do exactly the same thing during rename, that is, allocate a clean PRF register as destination and read the PRF number currently stored under the name RAX in the RAT and store that as source. Whether that entry was written to by the previous instruction, or has been stale for a thousand instructions is entirely irrelevant.

The ROB design document you posted is explicitly of an older system, that is different from modern ones. On post-SNB designs, the rename table entry and ROB entries can only point to the PRF (the ROB no longer ever contains register state!), and there is no architectural register file.

> Note, however, that the basic distinction exists in a PRF design too--some physical registers will hold architectural values

While technically true, this is extremely misleading. The architectural values reside in whatever rename register was last assigned to them. For the entire pipeline, there is no distinction between architectural registers and the rest of them.

My original complaint is of this sentence in your OP:

> When the instruction is retired in order, the value is written back to the architectural register file

Which is explicitly false of every Intel CPU released since 2011 (except maybe some atoms?). PRF machines never move values from PRF. It is the final source of truth for all registers. When "ADD RAX, 1" is executed, a new PRF entry is allocated for the result, and once that is written, it remains where RAX lives until another instruction that stores to RAX is executed, no matter how far away that is.


> What you care about is which PRF entry will contain the valid value of RAX for the instruction going through rename... The ROB design document you posted is explicitly of an older system, that is different from modern ones.

You're getting caught up in where the data is stored while I'm talking about the nature of the data conceptually, vis-a-vis OP's analogy between register renaming and graphs. I didn't post the ROB example to disagree with you about how post-SNB CPUs work re: PRF (your description is correct). I posted it because ROBs are a good illustration of the difference between speculative values and committed values, since they store speculative values in one place (the ROB) and committed values in the other (the ARF). The same distinction exists with PRFs (see below), but it's harder to see.

> The architectural values reside in whatever rename register was last assigned to them. For the entire pipeline, there is no distinction between architectural registers and the rest of them.

Architectural registers are different because they will have an RRAT entry pointing to them. And there certainly is a distinction between PRF entries that hold architectural state and PRF entries that hold speculative state, e.g. when there is an exception. That's why the RRAT exists in the first place--so the CPU can identify the subset of the PRF that holds the architectural state. If there was "no distinction" you wouldn't have an RRAT.

Concrete example:

MOV EAX, 15

-------------- < committed

MOV ECX, [EBX]

IMUL EAX, 5

ADD EAX, 1

-------------- < renamed

Say the first instruction has committed, the second has issued, and the others are waiting to issue. Say EAX in the first instruction is allocated to PR0, in the third it's PR1, and in the fourth it's PR2. The RRAT has an entry for EAX that points to PR0. The front-end RAT has an entry for EAX that points to PR2. PR0 holds committed state. PR1 and PR2 (will) hold speculative state. The reservation stations and ROB encode a data-flow dependency between the fourth and third instructions. (PR1 is an operand to the fourth instruction, which produces PR2). There is no data dependence graph for anything that's committed, however. It's just values in architectural registers.

If the load throws a page fault, what happens? The machine grabs the architectural state from the RRAT. Execution restarts with PR0 in the front-end RAT entry for EAX, and the page-fault handler sees EAX == 15. If the third and fourth instructions issued in the meantime, their results are thrown away. That is the difference between architectural and speculative state.
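The example above can be sketched as a toy Python model. Everything here (the class `RenameState`, its methods, the dictionaries standing in for the PRF, RAT and RRAT) is a hypothetical simplification, not how any particular CPU implements it; the load into ECX is left out so that the physical register numbers match the ones used in the example.

```python
class RenameState:
    """Toy model of a front-end RAT plus a retirement RAT (RRAT) over a PRF."""

    def __init__(self):
        self.prf = {}      # physical register name -> value
        self.rat = {}      # speculative mapping: architectural -> physical
        self.rrat = {}     # committed mapping:   architectural -> physical
        self.next_pr = 0

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a write to arch_reg."""
        pr = f"PR{self.next_pr}"
        self.next_pr += 1
        self.rat[arch_reg] = pr
        return pr

    def commit(self, arch_reg, pr):
        """Retire an instruction: update the RRAT pointer, copy no data."""
        self.rrat[arch_reg] = pr

    def recover(self):
        """On an exception, throw away all speculative mappings."""
        self.rat = dict(self.rrat)

    def arch_value(self, arch_reg):
        """Value of arch_reg as seen in committed (architectural) state."""
        return self.prf[self.rrat[arch_reg]]


cpu = RenameState()

pr0 = cpu.rename_dest("EAX")          # MOV EAX, 15
cpu.prf[pr0] = 15
cpu.commit("EAX", pr0)                # first instruction has committed

pr1 = cpu.rename_dest("EAX")          # IMUL EAX, 5 (speculative)
cpu.prf[pr1] = cpu.prf[pr0] * 5
pr2 = cpu.rename_dest("EAX")          # ADD EAX, 1 (speculative)
cpu.prf[pr2] = cpu.prf[pr1] + 1

before_fault = (cpu.rat["EAX"], cpu.rrat["EAX"])
cpu.recover()                         # the load throws a page fault
print(before_fault, cpu.rat["EAX"], cpu.arch_value("EAX"))
```

Before the fault the two tables disagree (PR2 versus PR0); after recovery the front-end table points at PR0 again and the handler sees EAX == 15, while the value 76 sitting in PR2 is simply never referenced again.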

> Whether that entry was written to by the previous instruction, or has been stale for a thousand instructions is entirely irrelevant.

It's relevant to the analogy that OP was making between register renaming and graphs. In a data dependence graph, all data flow can be represented with edges from the operations that produce operands to the operations that consume those operands. In an out-of-order machine, that graph exists (after renaming) only for instructions that are still in flight. For instructions that were committed long ago, the graph is collapsed to the result, denoted by an architectural register.

> Which is explicitly false of every Intel CPU released since 2011

The point (for purposes of responding to the OP's analogy) is that you need to update the architectural state when an instruction retires. The fact that "post-2011 Intel CPUs" do this by flipping a pointer rather than copying the data is an implementation detail that is educational, but not really relevant.


Interesting. I always figured there was just a lookup table mapping ISA registers to physical registers that was used by instruction decode, a bit vector (one bit per physical register) indicating which physical registers were in use, and each micro-op would carry with it which physical register(s) would be freed up upon instruction retirement.
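The scheme described in this comment can be sketched in a few lines of Python. This is only an illustration of the mental model, with a hypothetical `Renamer` class; a real core would stall rather than raise when no physical register is free, and would also have to handle misprediction recovery.

```python
class Renamer:
    """Lookup table + in-use bit vector + 'free on retirement' bookkeeping."""

    def __init__(self, arch_regs, num_phys):
        assert num_phys > len(arch_regs)
        self.in_use = [False] * num_phys   # one bit per physical register
        self.table = {}                    # ISA register -> physical index
        for i, reg in enumerate(arch_regs):
            self.table[reg] = i
            self.in_use[i] = True

    def rename(self, dest):
        """Return (new_phys, phys_to_free) for an instruction writing dest.

        phys_to_free is the previous mapping of dest; the micro-op carries
        it along and releases it when the micro-op retires.
        """
        new = self.in_use.index(False)     # ValueError if nothing is free
        self.in_use[new] = True
        old = self.table[dest]
        self.table[dest] = new
        return new, old

    def retire(self, phys_to_free):
        """The overwriting instruction retired; the old value is dead."""
        self.in_use[phys_to_free] = False


r = Renamer(["rax", "rbx"], num_phys=4)
first = r.rename("rax")    # rax moves from physical 0 to physical 2
r.retire(first[1])         # physical 0 becomes free again
second = r.rename("rbx")   # rbx can now reuse physical 0
print(first, second, r.table)
```

Freeing the old register only at retirement is what makes this safe: until the overwriting instruction retires, an older in-flight instruction (or an exception handler) may still need the previous value.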

If a register might or might not be renamed, what's the advantage of usually storing a given register in a given location? Is the main reason power savings, so checking a single bit to check if a register has been renamed, before paying the power and time to perform a lookup in the renaming table? Is access to a renamed register then slower than access to a non-renamed register?


You need to be able to recover architectural register state (e.g. if there is a branch misprediction, or an instruction throws an exception). Many implementations store speculative results in a reorder buffer, and write to an architectural register file on commit. The architectural register file always reflects the committed state of the processor. In such an implementation, your renaming table (the "lookup table mapping ISA registers to physical registers") might point to either a ROB entry, if the most recent write to an architectural register was an instruction that is still in flight, or the architectural register file.

Another way to do it is to use a physical register file to store data, as Tuna-Fish explains above. In that case, your front-end table always points into the PRF, but it may be a speculative value (written by an instruction that's still in the ROB and not yet committed) or a committed value. But that doesn't let you recover the architectural state on an exception. So you keep a second table (in the Pentium 4, it's called the RRAT), and whenever an instruction commits, you update that table to indicate that the most recent committed version of a particular architectural register can be found in a particular physical register.

There is at least a third way to do it, which doesn't use an RRAT. MIPS R10k rolls back the ROB to recover the last committed architectural state.


You are more or less correct, he was wrong. Architectural registers are just pointers to the PRF these days.


Are you sure about that? The HDL I've seen doesn't appear to do it that way. If I'm reading BOOM's source correctly for instance, there's not a separate architectural file.


As was pointed out in another thread once which I don't recall, almost all aspects of assembly programming in x86 aren't really low level anymore. It's just another abstracted high-level language, although one that still mirrors/is tied to the fundamental idea of an x86 machine.

As detailed in the article, x86 opcodes are translated/compiled on the fly to micro-ops.

The registers are renamed as needed to squeeze data.

Memory locations are actually fulfilled by any of three to five levels of cache, RAM, or swap.

Cores are variable in their number of issues, and can share the same instruction pipelines in hyperthreading.


While not quite infinite, what you want sounds awfully close to a belt machine;

Each instruction reads data from positions on the belt and then puts a few values at the start of the belt. This causes values on the end to "fall off".

The MILL CPU is an example of this architecture.
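A rough Python sketch of the belt idea, assuming a belt of four positions where position 0 is the newest value. The `Belt` class and its `add` operation are invented for illustration and are not the Mill's actual instruction set or encoding.

```python
from collections import deque

class Belt:
    """Fixed-length belt: new results enter at the front, old ones fall off."""

    def __init__(self, length=4):
        # A deque with maxlen drops items from the far end when full.
        self.slots = deque(maxlen=length)

    def push(self, value):
        self.slots.appendleft(value)

    def __getitem__(self, pos):
        return self.slots[pos]

    def add(self, a, b):
        """Read two belt positions and drop the sum at the front.

        Note that no destination is encoded, only the two source positions.
        """
        self.push(self[a] + self[b])


belt = Belt(length=4)
for v in (1, 2, 3):
    belt.push(v)          # belt is now 3, 2, 1
belt.add(0, 1)            # 3 + 2 -> belt is 5, 3, 2, 1
belt.add(0, 3)            # 5 + 1 -> belt is 6, 5, 3, 2 (the 1 fell off)
print(list(belt.slots))
```

Operands are named by position rather than by register, so every push shifts the position of every older value.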


The belt approach is sort of hybrid between registers and stack architectures. A disadvantage is that values on the belt may be at unknown locations after branches:

    var x = something();
    if (somethingElse()) { stuff(); }
    print(x); // where is x??
Here x has to be spilled and reloaded. This is worse than a register machine where x can be assigned a register for its entire lifetime.

I think this style of architecture is at its best for straight-line code, but suffers with branchy code, loops, etc. which will require a lot of spilling.


The main reason to adopt a belt is that you don't need to encode the destination (or destinations on an instruction that returns multiple values). Instruction cache pressure is typically a critical design constraint on VLIW machines which is presumably why they went this route.

Also, that simplifies the compilers job and ties into the spiller design, etc.


> Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards.

Crazy thought but would it be possible to build a decentralized and distributed CPU on top of the blockchain?

Edit: to the downvotes, care to explain? I am genuinely curious to know the answer, and curiosity should be a driving factor in this community


If the blockchain in question is Turing-complete, then it would be possible (by trivial consequence of definition) to build an emulator for any other Turing-complete (or even non-Turing-complete) deterministic processor on the blockchain. However, the latency per operation would be prohibitively high for most purposes, unless each "CPU Instruction" is complex enough to greatly stretch definitions and lose meaningful distinctions.

So, the answer to the most straightforward interpretation of your question is basically "Yes, it's obviously possible and likely trivial, but probably not practical"


Latency.



There are some missing I/O things involving DMA.

In the old days, DMA from (say) a PCI device would go directly to and from DRAM. This incurs a high latency if the CPU needs to access this data.

Network processors found a simple solution: DMA goes to the cache, not the DRAM. This reduces the I/O latency to the processor and simplifies I/O coherency. I know Cavium's NPs rely on this.

Intel picked this up for server and desktop processors once both the memory controller and PCIe were integrated on the same die. They called it DDIO:

https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...

You can support 100 G Ethernet with Intel Xeon processors these days due to this.

Another story is how DMA in the x86 world is cache coherent (no need to use uncached memory or flush before starting an I/O operation, which I have to do on ARM). This is awesome from a device driver writer's point of view and is the result of having to support old operating systems from the pre-cache days.

I think the future will involve better control of how cache is shared. For example, if you know a program is going to access a lot of memory, but does not need to keep it around for a long time, it will as a side effect, eject useful data from the cache. Better would be to declare that a thread should only be able to use some fraction of the cache so that it does not interfere with other threads so much.
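The effect can be illustrated with a small LRU simulation in Python. The workload (a thread reusing 6 lines interleaved with a thread streaming through fresh addresses), the cache sizes and the names `LRUCache` and `run` are all assumptions chosen to make the point visible; real caches are set-associative and real partitioning mechanisms differ.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def run(hot_cache, stream_cache, rounds=100, hot_set=6, burst=8):
    """Interleave a thread that reuses a small hot set with a streaming thread.

    Passing the same cache object twice models a fully shared cache.
    Returns the hit count of the cache used by the hot thread.
    """
    stream_addr = 0
    for _ in range(rounds):
        for a in range(hot_set):
            hot_cache.access(("hot", a))
        for _ in range(burst):
            stream_cache.access(("stream", stream_addr))
            stream_addr += 1
    return hot_cache.hits


shared = LRUCache(8)
shared_hits = run(shared, shared)

hot_part, stream_part = LRUCache(6), LRUCache(2)
partitioned_hits = run(hot_part, stream_part)

print(shared_hits, partitioned_hits)   # 0 594
```

With the same total capacity of 8 lines, the streaming thread evicts the entire hot set on every round in the shared case, while confining it to 2 lines lets the other thread hit on every access after warm-up.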


> Another story is how DMA in the x86 world is cache coherent (no need to use uncached memory or flush before starting an I/O operation- which I have to do it in ARM). This is awesome from a device driver writer's point of view and is the result of having to support old operating systems from the pre-cache days.

Nitpick: You mean "Sequentially consistent".

ARM is cache coherent, but NOT sequentially consistent. x86 is almost sequentially consistent (only a few obscure instructions here and there violate it).


x86 is not sequentially consistent. Its consistency model is Total Store Order (same as most SPARCs). The store buffer is architecturally visible and newer loads can be reordered above older stores. More formally, all CPUs agree on the order of remote stores but might see their own stores in a different order.

For example, Dekker's algorithm fails on x86 without explicit fences or explicitly sequentially consistent stores (all atomic RMW operations are sequentially consistent on x86).
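A toy Python model of the store-buffer effect, replaying one particular interleaving of the entry protocol of a Dekker-style lock. The `TSOCore` class and `dekker_entry` function are hypothetical; the model only captures "stores wait in a per-core FIFO, loads go to memory unless the core's own buffer has the address", which is a simplification of TSO.

```python
class TSOCore:
    """One core with a FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []   # (addr, value) pairs not yet globally visible

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A core sees its own buffered stores first (store-to-load forwarding).
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def fence(self):
        # Drain the buffer in order, making earlier stores globally visible.
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()


def dekker_entry(use_fence):
    memory = {"flag0": 0, "flag1": 0}
    c0, c1 = TSOCore(memory), TSOCore(memory)
    # Both cores raise their own flag...
    c0.store("flag0", 1)
    c1.store("flag1", 1)
    if use_fence:
        c0.fence()
        c1.fence()
    # ...then check the other core's flag.
    return c0.load("flag1"), c1.load("flag0")


print(dekker_entry(use_fence=False))   # (0, 0): both cores think they are alone
print(dekker_entry(use_fence=True))    # (1, 1): each core sees the other's flag
```

Without the fence, each load is effectively reordered ahead of the older store still sitting in the buffer, which is exactly the outcome mutual exclusion must rule out.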

edit: I think the OP really meant cache coherency; while all ARM CPUs in a system are in the same coherency domain, the IO space might be outside of it.


> edit: I think the OP really meant cache coherency; while all ARM CPUs in a system are in the same coherency domain, the IO space might be outside of it.

If that's the case, then I stand corrected. My understanding was that ARM was fully cache coherent, but it makes sense that I/O would be a different case altogether.


to be clear: I do not know whether IO on ARM is cache coherent or not, I'm just pointing out that just because all CPUs are cache coherent it doesn't imply that peripherals on external buses must be as well.


I read coherent i/o begins with cortex-A9, interesting.


I wasn't thinking of the I/O case when I typed up my post earlier. I was thinking CPU-to-CPU coherency. In which case, you are probably right on this front.


> I think the future will involve better control of how cache is shared.

More fine grained cache control is definitely in the near future. x86 CPUs already allow partitioning L3 cache regions to different cores as desired. AMD CPUs provide the CLZERO instruction to quickly acquire a cacheline in exclusive mode and drop its content without waiting for any remotely modified data, which is great to implement message queues.


Small nit: Intel desktop and workstation processors (Core i5/i7, Xeon E3, Xeon W) do not have DDIO.


indeed. dpdk uses this (and other techniques) to achieve 10gbps line-rate (at least) packet forwarding on minimally sized packets per (x86) core.


So, in @rygorous's excellent Twitch streams about CPU architecture (first one here: https://www.youtube.com/watch?v=oDrorJar0kM), he said that it was basically a myth that x86 architectures dynamically decoded into internal RISC instructions. I am thus a little skeptical of the article in general, since I don't know enough myself to verify each thing.


x86 really does decode CISC into RISC-like instructions. They're called micro-ops. Some of the instruction cache stores these translated instructions. People research the details of this. See https://www.agner.org/optimize/blog/read.php?i=142&v=t

The article looked about right to me.

I didn't watch the (3 hour!) video you linked to. Can you give the time offset where the myth you refer to is explained?


Intel uops aren't really RISCy at all, at least since after P4: if you look at Agner's tables, you'll see that even complex load-op operations still map to 1 (fused) micro-op in the fused domain and they are only broken down when dispatched to execution units (instruction breakdown was performed even in early CPUs, before the CISC/RISC nomenclature was a thing). IIRC decoded uops are not even fixed size in the post-decode cache: large constants take an additional slot.

Separate load and op instructions and fixed size instructions are pretty much the only things left differentiating RISC and CISC architectures (there is nothing reduced about modern RISCs), so I do not think the claim that x86 CPUs are RISC inside does hold.

I think that Agner, who knows what he is talking about, is just being loose with terminology.

In the grand scheme of things it just doesn't matter, it is simply a name. I just dislike it when the x86-is-a-RISC meme gets repeated, as if being a RISC somehow is a virtue in itself.


I bow in deference to your superior knowledge.

Back in the late 80s, reducing your instruction set was a good idea because it meant you could spend the transistor budget on other things, like pipelines and caches. RISC came to be seen as a virtue in itself.

When x86 was the 80286 and CISC, and MIPS and ARM were RISC, x86 was just bad and wrong. Nowadays x86 is fast and good.

As you kinda said, almost everything about the 1980s definition of RISC has ceased to be true. The only thing left of Patterson and Hennessy's RISC ideas is that they encouraged proper analysis of how real programs use the instruction set (and cache etc), rather than just adding a bunch more instructions to please some assembly writing customers and aiming for a better Dhrystone score. If we define RISC to mean "doing proper analysis", then x86-is-a-RISC-machine is true :-)


> As you kinda said, almost everything about the 1980s definition of RISC has ceased to be true.

A central difference that still exists is that RISC processors are typically load/store architectures. That means that before an operand that exists in memory can be used, it has to be transferred to a register.

This means that an instruction like

add eax, [ecx]

does not work, say, under ARM. Under ARM, you have to use

  ldr r1, [r1]
  add r0, r0, r1
Intel found out that using memory addresses both as source and target turned out to be a bad idea

  add [ecx], eax
(since it needs 3 phases: load value from memory, do instruction, store back). No such instructions thus exist in MMX, SSE..., AVX..., ... On the other hand, Intel still believes that using a memory operand as source only is quite a good idea on x86 (look at the encoding of SSE..., AVX..., AVX-512). Nevertheless: having the capability to do such a complicated instruction atomically is very useful for multithreading; consider for example

  lock add [ecx], eax
which adds eax to the value at the memory address in ecx atomically.

Also, a very typical distinction (that Intel only dropped with AVX on) is that CISC CPUs typically use 2 operands per instruction (of which one may be memory) and RISC CPUs have 3-operand instructions. So

  add r0, r1, r2
works on ARM, but under x86, only instructions that were introduced from AVX on (i.e. use a VEX (VEX2 or VEX3) or EVEX prefix (AVX-512); I have to look up whether something like that is also possible with a XOP prefix) have this capability.

Also very often, CISC instruction sets offer complicated addressing modes, such as in x86

mov edx, [ecx+4*eax]

It is not completely clear whether this is worth the complexity or not. On one hand, such instructions are hard to use for a compiler (which is the central reason why they were abolished in RISC architectures). On the other hand, skilled programmers can use them to write quite elegant and fast code.

TLDR: A central difference that still exists is that

- RISC architectures are load-store architectures

- on CISC architectures 2 operands (1 can be memory address) are typically used and "feel more natural"

- on RISC architectures, instructions typically have 3 operands.

- CISC architectures often support many more, and more complicated, addressing modes than RISC.


The main point of RISC architectures is that they are trivially pipelineable to the extent that making a non-pipelined implementation does not make much sense. All the architecture-visible differences from CISC are motivated by that. Load-store gets you a well-defined subset of instructions that access memory and have to be handled specially, 3-operand arithmetics and zero register simplifies hazard detection and result forwarding logic, and so on.


> The main point of RISC architectures is that they are trivially pipelineable

This was the idea behind the original MIPS (the textbook example of a RISC processor - both literally and metaphorically). Unluckily, this led to the problem that details of the internal implementation leaked into the instruction set. Just google for 'MIPS "delay slot"'. When this delay slot was no longer necessary in later implementations of MIPS, you still had to pay attention to this obsolete detail when writing assembly code.

The lesson that was learned is that implementation details should not leak into the instruction set.

Next: What kind of pipeline are we even talking about? It is often very convenient to offer multiple kinds of pipelines depending on the intended usage of the processor. For example for low-power or realtime applications, an in-order pipeline is better suited. On the other hand, for high-performance applications, an out-of-order pipeline is better suited. For example ARM offers multiple different IP cores for the same instruction set with different pipelines.

Finally, pay attention to the fact that the more regular and easier-to-decode instruction sets of typical RISC CPUs (ARM is explicitly not a typical one in this sense, in particular considering T32) often lead to bigger code than, say, x86. This turned into a problem when CPUs became much faster than the memory (indeed some people say this was an important reason why people today think much more critically about RISC). This is also the reason why RISC-V additionally provides the optional "“C” Standard Extension for Compressed Instructions" (RVC). Take a look at

> https://riscv.org/specifications/

The authors claim in the beginning of chapter 12 of "User-Level ISA Specification": "Typically, 50%–60% of the RISC-V instructions in a program can be replaced with RVC instructions, resulting in a 25%–30% code-size reduction.".
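The two percentages in that quote are consistent with simple arithmetic: base RISC-V instructions are 32 bits and RVC instructions are 16 bits, so each replaced instruction saves half its size. A quick Python check (the function name `code_size_reduction` is just for illustration, and it assumes every uncompressed instruction is exactly 32 bits):

```python
def code_size_reduction(fraction_compressed, full_bits=32, compressed_bits=16):
    """Fractional code-size saving when a share of instructions shrinks."""
    saved_per_insn = (full_bits - compressed_bits) / full_bits
    return fraction_compressed * saved_per_insn

for f in (0.5, 0.6):
    print(f"{f:.0%} compressed -> {code_size_reduction(f):.0%} smaller")
```

So replacing 50% to 60% of the instructions gives a 25% to 30% reduction, matching the quoted figures.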

> 3-operand arithmetics and zero register simplifies hazard detection

Despite the 3-operand format of ARM, at least the A32 and T32 instruction sets offer 2 additional parts for many instructions:

1. conditional execution: for example ADDNE is only executed when the Z(ero) flag is not set. There are 15 variants for conditional execution, including "always".

2. "S" suffix for many instructions: causes the instruction to update the flags. For example SUBS causes the processor to update the flags while SUB does not.

The conditional execution was to my knowledge dropped in ARM64 because branch predictors got good enough.
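The two features can be illustrated with a tiny Python interpreter sketch. It models only the Z flag and only the AL, EQ and NE conditions; the function names and the dictionary-based register file are assumptions made for the example, not ARM's actual semantics in full (the real flags also include N, C and V).

```python
class Flags:
    def __init__(self):
        self.z = False   # Z(ero) flag

def sub(regs, flags, rd, rn, rm, set_flags=False):
    """SUB rd, rn, rm; with set_flags=True it behaves like SUBS."""
    regs[rd] = regs[rn] - regs[rm]
    if set_flags:
        flags.z = regs[rd] == 0

def add(regs, flags, rd, rn, rm, cond="AL"):
    """ADD rd, rn, rm with an optional condition: AL (always), EQ or NE."""
    if cond == "EQ" and not flags.z:
        return           # condition failed: the instruction acts as a no-op
    if cond == "NE" and flags.z:
        return
    regs[rd] = regs[rn] + regs[rm]


regs = {"r0": 5, "r1": 5, "r2": 10, "r3": 0}
flags = Flags()

sub(regs, flags, "r3", "r0", "r1", set_flags=True)   # SUBS r3, r0, r1 -> Z set
add(regs, flags, "r2", "r2", "r0", cond="NE")        # ADDNE: skipped, Z is set
after_addne = regs["r2"]
add(regs, flags, "r2", "r2", "r0", cond="EQ")        # ADDEQ: executed
print(after_addne, regs["r2"])                       # 10 15
```

The conditional instruction replaces a short branch over a single instruction, which is why it mattered more before branch predictors became good.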

So: ARM has other things in the instruction set to avoid pipeline stalling. 3-operand instructions are not among them. Rather, the reason for 3-operand instructions is that this instruction format allows the compiler to generate efficient code much more easily.


The stall detection logic remark was meant in the context of a traditional MIPS-style in-order single-issue pipeline executing a regularly encoded instruction set, where the mentioned features lead to both a smaller implementation of the detection logic itself (which for the traditional MIPS is the bulk of the control logic) and simpler routing of the signals involved.

On the other hand I completely agree that MIPS-style delay slots are simply a bad idea. But for me ARM's conditional execution and singular flags register are a similarly bad idea that stems from essentially the same underlying thought.


Damn, you are right. I thought the load-store architecture was no more in ARM thumb 2. I was wrong. Thanks for the info.


It's only a superficial analogy. Micro-ops were RISC-like in the sense that they used to do one / few things. But their implementation is unlike RISC, micro-ops typically being very large (100+ bits wide) and not even necessarily of fixed length; you may imagine a specific bit in a micro-op more or less directly controlling a certain control line somewhere in an execution unit. Conversely micro-ops also can do a whole bunch of things at the same time.


You have to distinguish between micro instructions in the sense of "line in microcode store", which for horizontally microcoded CPUs contain bits that more or less directly map onto datapath control signals, and micro operations in the superscalar x86 sense, which typically are more or less a reformulation of x86 instructions into something that is both easier to execute in parallel (which involves breaking instructions into their constituent suboperations, which are RISC-like in the load-store sense, not in the other RISC characteristics) and maps better to the actual execution units (which may involve combining instructions).


This is a point of much contention; micro-ops are NOT RISC-like (some are even variable length). However, the one argument that does have some merit is that RISC is supposed to be load/store, and internally x86 CPUs are load/store.


I'd say a good thing to add would be that the lion's share of progress in the last 5 years was made around cache architectures.

Everything described in the article, like superscalar and out-of-order execution, was squeezed to its practical maximum around the early Core 2 Duo era, with later advances mostly coming without qualitative architectural improvements.

In that regard, Apple's recent chips got quite far. They got to near desktop level performance without super complex predictors, on chip ops reordering, or gigantic pipelines.

Yes, their latest chip has quite a sizeable pipeline, and total on-chip cache comparable to low-end server CPUs, but their distinction is that they managed to improve cache usage efficiency immensely. A big cache wouldn't do much for performance if you have to flush it frequently. In fact, the average cache flush frequency is what determines where diminishing returns start with regard to cache size.


Apple CPUs are quite sophisticated, wide and deep OoO brainiac designs with state-of-the-art branch predictors.

There is nothing simple about them. The only reason they are not desktop level performance is because the architecture has been optimized for a lower frequency target for power consumption.

A desktop-optimized design would probably be slightly narrower (so that decoding is feasible with a smaller time budget) and possibly deeper to accommodate the higher memory latency. Having said that, the last generation is not very far from reasonable desktop frequencies and might work as-is.


Compare die shots of the two. Even after you correct for the density provided by the 7nm process, the A12 predictor is a few times smaller than that of recent Intel Core i chips.


5 minutes of Googling didn't return any image of either Skylake or A12 die shots with labelled predictors. Do you have any pointers?

Also I know nothing about the details, but I expect that most of the predictor consists of CAM memory used to store the historical information. I doubt that, without internal knowledge, it is possible to distinguish it reliably from other internal memories.


CAM is expensive and requires some kind of replacement scheduling logic. I believe that branch predictors are still implemented as straight one-way associative (i.e. direct-mapped) RAM, often even without any kind of tagging, and that the only true CAM in the CPU core is the TLB.
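
To make the "untagged, direct-mapped RAM" idea concrete, here is a minimal sketch of a bimodal predictor: a table of 2-bit saturating counters indexed by low PC bits, with no tags, so branches that alias to the same index simply share an entry. The table size, the initial counter value, the 4-byte instruction alignment and the class name are assumptions for illustration, not a description of any real core.

```python
class BimodalPredictor:
    """Untagged, direct-mapped table of 2-bit saturating counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Counters range over 0..3; start at 1 ("weakly not taken").
        self.table = [1] * (1 << index_bits)

    def _index(self, pc):
        # Drop the two low bits (assumes 4-byte aligned instructions),
        # then keep only as many bits as the table has entries for.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # Predict "taken" when the counter is in the upper half.
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        # No tag check and no replacement policy: just nudge the counter.
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

Because there is no tag compare, a lookup is a plain RAM read, and two unrelated branches that map to the same index train the same counter.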


Interesting. Is the improvement in performance mostly due to improvement in size, number, and speed, did finetuning the cache parameters (number of caches, cache size, cache line size) help, or are there more fundamental architectural improvements? Do you have links to more information?


Yes, fine-tuned and fast caches are what Apple has been going for over the last few generations. AMD Ryzen also got much faster caches than their previous-gen chips. Most importantly, fine-tuned caches don't come as a power/performance trade-off - they are simply better.

Moreover, when litho generations progress to 10nm, the difference in power consumption between working and non-working transistors gets so small that the traditional convention that "a slow chip is also a low power chip" does not hold true. You are better off with regard to power consumption if you get your IPC up and IO-wait down.

The best analysis of A12 cache performance I know of that is written in popular language is this piece: https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-re...


IBM was doing essentially all that stuff before 1990 or so except possibly for multiple threads per processor core. So, there was pipelining, branch prediction, speculative execution, vector instructions, etc.

Then I was in an AI group at the Watson Lab, and two guys down the hall had some special hardware attached to the processors and were collecting and analyzing performance data based on those design features.


IBM was doing all sorts of amazing stuff before the 1990's. They had VMs, containers, etc.

Personally, I'd say that I don't care. They didn't want to make that technology available to the masses, we barely even got the PC architecture because they made several strategic blunders.

If the tech exists but it's not reachable by common folks, in my eyes it's as bad if not worse than it not existing at all.


They tried with https://en.wikipedia.org/wiki/PC-based_IBM-compatible_mainfr...:

”The XT/370 was an IBM Personal Computer XT (System Unit 5160) with three custom 8-bit cards.

The processor card (370PC-P) contained two modified Motorola 68000 chips (which could emulate most S/370 fixed-point instructions and non-floating-point instructions), and an Intel 8087 coprocessor modified to emulate the S/370 floating point instructions.

The second card (370PC-M), which connected to the first with a unique card back connector contained 512 KiB of memory.

The third card (PC3277-EM), was a 3270 terminal emulator required to download system software from the host mainframe.”

I don’t know whether that indicates that this was a monstrosity (not only did they combine a pair of 68000’s with a 8087, but both were modified. How did they ever license that?) or that it just was very hard to ship something that could run System/360 code at somewhat decent speed at the time.

Edit: more info at http://www.cpushack.com/2013/03/22/cpu-of-the-day-ibm-micro-.... Apparently, this was a step towards having a ‘real’ 360 on a chip.


"a pair of 68000’s with a 8087"

Reminds me of the TRS-80 model 16 which had a 68000 + a Z80.


> they didn't want to make that technology available to the masses

Do you have a source for this? It makes far more sense that it simply was not feasible to make a System/360 available to the masses than it was intentionally kept back. I would assume IBM would have been thrilled to sell one to everyone on earth, but it was expensive and physically large.

> if the tech exists but it's not reachable by common folks, in my eyes it's as bad if not worse than it not existing at all

I disagree strongly. If developing technology that isn't available to the public is worse than not developing technology at all, there goes research. No new technology makes it to the public in its original form.


Largely unrelated anecdote:

About 1992-3, I worked at IBM Austin on the IBM Microkernel and Workplace OS[1]. The development machines we were using were "Sandalfeet"[2][3] which ran AIX and (IIRC) a weird 64-bit port of Windows NT (and the in-development Workplace OS). I once expressed the opinion that these were pretty neat little machines, and a co-worker (a real IBMer who had been there for a while, not a contractor like me) told me that they had a warehouse full of them somewhere since they were never going to be released.

[1] https://en.wikipedia.org/wiki/Workplace_OS

[2] http://www.os2museum.com/wp/os2-for-powerpc-tidbits/a-look-i...

[3] http://www.os2museum.com/wp/ibm-power-series-exotica/ I don't recall the boxes looking like the image of the Power Series 600; they were black, desktop machines, too tall to be pizza-boxes.


I should have rephrased that: if its creators prefer to keep it under lock and key for decades (or forever).

Regarding the tech: ok for 360 in the 60's, 70's, 80's even. But they couldn't do it even after 30 years? We had to wait for VMWare and Docker...


You keep attributing some kind of unfounded malice to IBM.

VMWare released their first hypervisor for x86 in 1999. That's 30 years after the first IBM virtual machines. If you are asking: "why didn't IBM release a hypervisor for x86?" I would respond "why would you expect them to make a virtual machine on a platform that isn't theirs?"

This 30 years is also the same timeline for Intel to make their microprocessors OoO and superscalar. That didn't have anything to do with IBM keeping their technology locked up.


Yes, their start in virtual machine was their CP67/CMS. The CP67 abbreviated "control program 360/67" and was written at their Cambridge Scientific Center in Boston as an interactive computing tool for development of operating systems. The CMS abbreviated "conversational monitor system" or, if you will, the command line interface to CP67. So CMS had a command IPL or "initial program load" which was software for the big red button on the front of an actual machine. The 360/67 was a 360/65 'tweaked' to have virtual memory. At one point for a demonstration they ran CP67 on CP67 on ... CP67 on the 'bare metal' seven levels deep -- it still actually ran. Eventually the 'system administration' advantages of virtual machine were noticed, and CP67 became VM/370. Some years later I got a dial up terminal and used CP67/CMS to write PL/I software to schedule the fleet at FedEx -- pleased the BoD and saved the company.

IBM did try with several projects and products cheaper than their 'big iron' mainframes: They had small versions of System 360, e.g., 360/40 with 64 KB of main memory and some still smaller. They had a rugged minicomputer sometimes used as a communications node/router. They had their System/38 that was 'friendly' and made relational database easy for some small/medium businesses; the S/38s were popular with hospitals that shared their applications software freely. Then of course they did the IBM/PC with PC/DOS they got from Microsoft and 8088 they got from Intel.

Yes, at that point and for some years later IBM had essentially everything in computing, e.g., even ran all of the Internet under a contract from the NSF. IBM had an amplifier that could wrap around an optical fiber and amplify the signal without detecting it and re-transmitting it.

But, right, IBM blew it. They had annual technology appraisals that gave them excellent projections about the future. They did take microprocessor lithography very seriously, in a sense more seriously than anyone else, still, in that they got a cyclotron as an X-ray source for making fine lines on silicon; the rest of microelectronics is still working with extreme ultra-violet and not as short as X-rays. They were into AI (a group I was in), wearable computing, etc. They did some very high level work on disk subsystems with some tricky optimization for placement of physical records based on access data.

But IBM didn't take personal computing or TCP/IP anywhere nearly seriously enough. In a sense, they missed what Gates saw -- put a PC on each desk of the developed world.

One reason Gates was right IBM should have seen quickly enough -- the PC blew away the old typewriters, including IBM's best efforts at typewriters.

So, IBM wanted to continue selling to their usual customers their usual way, suits to suits from the IBM branch offices. They knew plenty well enough, essentially from the first Intel chips, that microprocessors would challenge the existing approach to 'big iron' System/370 processors, but they neglected to see the implications. Or, now one can buy an AMD FX-8350 processor, 64 bit addressing, 8 cores, 4.0 GHz standard clock speed, quantity one, for less than $100, and that puppy is, in historical terms, one HECK of a processor for one astoundingly low price per computer instruction per second.

So, along rushed Intel, Microsoft, Gateway, Dell, etc. The Intel 386, etc. architecture was serious, no toy, and by Windows NT Microsoft was well into operating systems essentially as serious as IBM's 'big iron' MVS (multiple virtual storage).

Then along came, right, TCP/IP and Cisco. IBM should have seen that the future of digital communications was end to end reliability of TCP instead of hop by hop reliability of IBM's SNA (Systems Network Architecture); indeed, major parts of IBM did very clearly and explicitly see this point. Still, for the product line, IBM was painfully slow to embrace TCP/IP, even when IBM was running ALL of the Internet.

And for 'workstations', e.g., Sun, IBM tried with Power; there for Unix they tried with AIX.

At times IBM did begin to see the future, e.g., had an early Web browser, had Prodigy with Sears for on-line shopping and 'social computing' along with, again, the whole Internet. So, those efforts should have given IBM Netscape, Amazon, and Facebook -- if IBM had been wide awake and trying hard. They should also have had Intel, Microsoft, and Apple. For Apple, did I mention wearable computing? Uh, don't forget Cisco; for a long time IBM was making the core chips for both the Cisco and Juniper routers; IBM had the catbird seat over digital communications and the Internet but still didn't see them clearly.

By 1994, the word in the meeting room across the hall from the CEO's office in Armonk, as IBM was no longer making their revenue projections, was "God ceased to smile on IBM". In three years IBM lost $16 billion and went from 407,000 employees down to 209,000, and has had lots more shrinkage since.

IBM often wanted to regard itself not as an electronics company or a computer company but a marketing company. Well, as the market for computing changed, a good shot at the biggest change in a market in all of history, IBM continued to do well understanding the electronics and computing but not the market or marketing. Net, IBM really blew it, really fumbled the ball, dropped the ball, tripped over the ball, fell on the ball, and ended up face down. IIRC one IBM CEO those days got some stock options and left worth about $125 million -- 'chump change' for what IBM missed from Netscape, Intel, Microsoft, Apple, Cisco, Facebook, Amazon, etc. Chump change.


Ah, bit rot. Both of the links to "interesting articles" at the bottom of the page are gone ("Designing an Alpha Microprocessor" 404s and the video appears to be gone from "Things CPU Architects Need To Think About"). Anyone know where these might have moved to?

(Anyway, great post!)


So, trying to hunt these down.

Designing an Alpha Microprocessor first appeared in a magazine called 'Computer', Volume 32, Issue 7, July 1999. It was on pages 27-34, and written by Matt Reilly.

It has a few citations [0]. (And though I owned a lot of them, I don't think I read this particular issue.)

Members can buy it from the IEEE [0]. That appears to be the only recourse.

---

Things CPU Architects Need To Think About has a cover page here [1]. Unfortunately, the video isn't attached. It was part of the class EE380, which has a YouTube playlist [2], but although a lot of the talks are good, they don't include our video. Even worse, I found a fairly recent comment from another HNer [3], which suggests all online copies are gone. By persisting, I found the original asx via the Wayback Machine [4], which is utterly useless without the server.

Alas, I cannot find any working copy.

[0] https://ieeexplore.ieee.org/document/774915

[1] https://web.stanford.edu/class/ee380/Abstracts/040218.html

[2] https://www.youtube.com/playlist?list=PLoROMvodv4rMWw6rRoeSp...

[3] https://news.ycombinator.com/item?id=15900610

[4] https://web.archive.org/web/20130325010756/http://stanford-o...


> Designing an Alpha Microprocessor first appeared in a magazine called 'Computer', Volume 32, Issue 7, July 1999. It was on pages 27-34, and written by Matt Reilly.

> It has a few citations [0]. (And though I owned a lot of them, I don't think I read this particular issue.)

> Members can buy it from the IEEE [0]. That appears to be the only recourse.

That appears to be the only legal recourse. If you do not care about legality, there is sci-hub.

EDIT: Under https://news.ycombinator.com/item?id=18246996 you can find a legally less doubtful way to obtain this text.


> That appears to be the only legal recourse. If you do not care about legality, there is sci-hub.

Actually I didn't manage to find it there, which was disappointing. I'm happy someone managed to get Google Scholar to spit out a link though.


> > If you do not care about legality, there is sci-hub.

> Actually I didn't manage to find it there, which was disappointing.

Just copy link [0] of https://news.ycombinator.com/item?id=18246834 into sci-hub.


I think you mean link rot.


Looks like IEEE, ACM and ResearchGate all have a copy of Designing an Alpha Microprocessor, but the first two are paywalled and the last requires you to request the text (possibly for pay too).

Sadly, couldn't find anything on the other one though.


In those cases, Google Scholar usually gets PDF links:

Designing an Alpha microprocessor: https://pdfs.semanticscholar.org/0155/d9a203497acf81c90e82a9...


I could use an overview that includes an update to the Computer Architecture class I took in the early 90's. This is good - for "general purpose" microprocessors.

At that time, nothing at all was said about GPUs - they basically didn't count at all. I don't really recall anything about DSPs either. And FPGAs were considered neat and exotic, but a little useless, particularly compared to their cost and more of a topic for EE majors.

Now I've seen a great update (posted to HN) about how FPGAs are basically.. no longer FPGAs and include discrete microprocessors, GPUs and DSPs.. often many (low powered) of each!

This statement: "The programmable shaders in graphics processors (GPUs) are sometimes VLIW designs, as are many digital signal processors (DSPs),"

is about as far as it goes. Can someone point me to a 90-minute guide that expands on that?

* What about the GPUs and DSPs that are not VLIW designs?
* What is the architecture of some of the more common GPUs and DSPs in general use today? (as they cover common Intel, AMD and ARM designs in this article). eg: Differences between current AMD and NVIDIA designs? I don't even know what "common 2018 DSPs" might be!
* How does anything change in FPGAs now, and where is that heading? (the FPGAs-aren't-FPGAs article was a few years old)


A few questions I've had for a while:

First, if a reasonably high performance processor is going to use register renaming anyway, why not have split register files be an implementation detail? Tiny embedded processors can do without register renaming and have a single register file. Higher performance implementations can use split register files dedicated to functional units. Very few pieces of code both need large numbers of integers and large numbers of floating point numbers.

Second, on architectures designed with 4-operand fused multiply-add (FMA4) from the start, and a zero-register (like the Alpha's r31, SPARC's g0, MIPS's r0, etc.), why not make the zero-register instead an identity-element-register that acts as a zero when adding/subtracting and a one when multiplying/dividing? An architecture could optimize an FMA to a simple add, a simple multiply, or simply a move (depending on identity-element-register usage) in the decode stage, or a minimal area FPGA implementation could just run the full FMA. This avoids using up valuable opcodes for these 3 operations that can just be viewed as special cases of FMA. Move: rA = 1.0 * rC + 0.0. Add: rA = 1.0 * rC + rD. Multiply: rA = rB * rC + 0.0. FMA: rA = rB * rC + rD.
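
A rough sketch of the identity-element-register idea, in case it helps: one FMA4 operation, where register 0 reads as 1.0 in a multiplicand slot and as 0.0 in the addend slot, plus a decode-time classifier. The register number, the function names and the dict-based register file are all hypothetical, purely for illustration.

```python
IDENT = 0  # hypothetical identity-element register number


def read_operand(regs, r, role):
    """Read register r. The identity register reads as 1.0 when used as a
    multiplicand ("mul") and as 0.0 when used as the addend ("add")."""
    if r == IDENT:
        return 1.0 if role == "mul" else 0.0
    return regs[r]


def fma4(regs, ra, rb, rc, rd):
    """rA = rB * rC + rD, the only arithmetic opcode in this sketch."""
    product = read_operand(regs, rb, "mul") * read_operand(regs, rc, "mul")
    regs[ra] = product + read_operand(regs, rd, "add")


def classify(rb, rc, rd):
    """What a decoder could reduce the FMA to, based on identity usage."""
    one_mul = rb == IDENT or rc == IDENT
    no_add = rd == IDENT
    if one_mul and no_add:
        return "move"
    if one_mul:
        return "add"
    if no_add:
        return "multiply"
    return "fma"


regs = {1: 0.0, 2: 3.0, 3: 4.0, 4: 5.0}
fma4(regs, 1, IDENT, 3, IDENT)   # move:     r1 = 1.0 * r3 + 0.0
fma4(regs, 1, 2, 3, 4)           # full FMA: r1 = r2 * r3 + r4
```

A small core could ignore `classify` and always run the full FMA; a larger one could use it in decode to steer the operation to a cheaper unit.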


A processor tiny enough that co-locating the integer and floating point computation units closely enough to share a register bank is a good idea will be too small to use register renaming. Having separate clusters with their own banks and their own bypass networks is a really big win.

For the second, if you have a variable length instruction encoding scheme adding an extra argument is going to increase i-cache pressure. If not then you might as well if you do FMA4 but I think most fixed encoding ISAs use FMA3.


In a three-address machine, separating the integer and floating point registers basically saves you three bits per instruction word compared to a unified register file of the same aggregate size. Also, on a 32-bit machine, you save a few transistors by making the integer rename registers 32 bits instead of all 64 bits to accommodate a double float. (And if you have vectors, it really makes no sense to throw away 128 or 256 or 512 bits to store a 32-bit or 64-bit integer).
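
The three-bit figure is just specifier arithmetic. A quick check, assuming 32 integer plus 32 floating point registers versus one unified 64-entry file (those sizes are an assumed example, not tied to a specific ISA):

```python
def operand_field_bits(num_registers, operands=3):
    """Bits spent on register specifiers in one instruction word."""
    bits_per_specifier = (num_registers - 1).bit_length()
    return operands * bits_per_specifier


split = operand_field_bits(32)    # file is implied by the opcode: 3 * 5 bits
unified = operand_field_bits(64)  # one 64-entry architectural file: 3 * 6 bits
saved = unified - split           # 3 bits per three-address instruction
```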


As I mentioned, though, there aren't many functions that use both the full complement of integer and fp registers, so I think the aggregate register file size is rarely a factor. Aggregate register file size is also a detriment to fast context switches.

As long as you defined consistent semantics for switching among integer, floating point, and vector use of the same logical register, there's nothing stopping one from using a 32-bit-wide integer register file, a 64-bit-wide fp register file, and a 512-bit-wide vector register file. From an ISA level, you could (for instance) define all operations as if they worked on vector registers. Your imul could always compute a 32-bit result and sign-extend it to 64 bits as the first vector element, and zero out all but the first element of the vector. You wouldn't actually store it that way, since the top 33 bits would always be identical for the results of integer operations (and subsequent vector elements would always be zero). So, from the outside, it would look like all operations worked on very wide registers, just that the vast majority of operations did very trivial things with most of the output bits in those wide registers. The sign extension and zeroing operations would actually only happen when moving values between the internal register files.

Presumably, you'd use the same tricks used in other processors for actually tracking the amount of vector state that needs to be saved on context switches. You might re-use some of the same techniques for economizing the amount of vector state saved across function boundaries. Or, perhaps you'd define an ABI such that system calls preserve vector state, but all vector state beyond the first double is caller-saved state across function boundaries.


Integer and fp is indeed separate in many modern processors.


Yes. My question wasn't if processors have split register files. The question was why the split is exposed at the instruction level instead of being hidden away as an implementation detail. Register renaming hardware is very common in modern processor designs and would make it very easy to make split physical register files look like a unified architectural register file.

I did a bit more reading, and both the IBM Cell and the Adapteva Epiphany processors expose unified register sets at the instruction level (architectural registers).

Many processors these days already contain the hardware to hide this away as an implementation detail, giving more design flexibility to the hardware designers. Furthermore, the processors that don't have register renaming hardware are likely to be small embedded processors that would benefit from not having split register files.


By exposing it at the ISA level you save bits in every instruction through having the register addressing implicit in the instruction type.


That's only true if you replace, say, 32 integer and 32 fp registers with 64 registers. As I mentioned, very few functions require both a large number of fp and a large number of integer registers.


You were talking about high end processors with register renaming though, right? At that point you have stuff like L2 caches which take up way more transistors than the register file. And with one register file the space near it is going to be at a premium as you try to squeeze both the integer and floating point execution units near to it. But with separate clusters you can surround your integer register file with the integer bypass network and the integer execution resources, and you can surround your floating point register file with your floating point execution units and bypass network. It's the same reason, mostly, that processors have split data/instruction L1 caches - you want to put the cache near the structures that use it.


I guess this is a good opportunity...

It irks me a bit that scoreboarding is not considered "out of order execution" in modern classification. If I have a long latency memory read followed by an independent short latency instruction, the short second instruction will execute before the first has finished executing in a processor with dynamic scheduling via scoreboarding, but this doesn't "count" as OoO. I mostly get it, it just bothers me.


score-boarding to me represents an in-between, because:

1. they stall on the first RAW conflict.
2. they initiate execution in-order, although they may complete execution out-of-order.

I wish there was better nomenclature so people don't get confused, because clearly it doesn't fit into the dichotomy of in-order vs. out-of-order execution.
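
A toy model of the two points above, for anyone who wants to see the in-between behaviour: single issue in program order, stalling while any register the instruction touches has a write in flight, with completion times that depend only on latency. The instruction tuples, latencies and register names are made up, and this is a deliberately simplified sketch rather than a model of any real scoreboard.

```python
def simulate(program):
    """program: list of (name, dest, srcs, latency) in program order.
    Returns {name: (issue_cycle, complete_cycle)}."""
    ready_at = {}  # register -> cycle at which its pending write completes
    result = {}
    cycle = 0
    for name, dest, srcs, latency in program:
        # In-order issue: wait until no register this instruction touches
        # has a write in flight (RAW on sources, WAW on the destination).
        for reg in list(srcs) + [dest]:
            cycle = max(cycle, ready_at.get(reg, 0))
        done = cycle + latency
        ready_at[dest] = done
        result[name] = (cycle, done)
        cycle += 1  # single issue: the next instruction issues later
    return result


program = [
    ("load", "r1", ["r9"], 10),        # long-latency memory read
    ("add",  "r2", ["r3", "r4"], 1),   # independent, short latency
    ("sub",  "r5", ["r1", "r2"], 1),   # RAW on r1: stalls issue
    ("mul",  "r6", ["r7", "r8"], 3),   # independent, but stuck behind sub
]
timeline = simulate(program)
# load: (0, 10), add: (1, 2), sub: (10, 11), mul: (11, 14)
```

The add completes long before the load (out-of-order completion), yet the independent mul cannot start until the stalled sub has issued (in-order issue), which is exactly why it doesn't fit either label cleanly.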


A next chapter is in need of being written, as the Mill is a radical departure (non-OOO), and a wildly more efficient and secure architecture, first published since 2014 (see https://millcomputing.com/ videos).


Quite possibly. EDGE is another contender.

https://en.wikipedia.org/wiki/Explicit_data_graph_execution


> While the original Pentium, a superscalar x86, was an amazing piece of engineering, it was clear the big problem was the complex and messy x86 instruction set. Complex addressing modes and a minimal number of registers meant few instructions could be executed in parallel due to potential dependencies. For the x86 camp to compete with the RISC architectures, they needed to find a way to "get around" the x86 instruction set.

I've always struggled to understand why they didn't simply retire the x86 instruction set by the early 90s.

The best reason I've been given is an existing body of x86 software, but that's obviously nonsense as demonstrated by the Transmeta Crusoe and Apple's move from 68k to PPC to x86.


Modern Processor Design by Shen is a great book if you want to read more on this stuff.


Great summary of (recent) modern computer architecture. Fun exercise: Try to spot how Spectre-style attacks surface as a result.


very good write-up - thanks.



