Hacker News new | past | comments | ask | show | jobs | submit login
Why Registers Are Fast and RAM Is Slow (mikeash.com)
225 points by anandabits on Oct 11, 2013 | hide | past | web | favorite | 90 comments

For a way more detailed look at memory architectures and implementation, check out Ulrich Drepper's classic paper "What Every Programmer Should Know About Memory"[1]

[1] http://www.akkadia.org/drepper/cpumemory.pdf

Or on a more light-hearted note: http://folklore.org/StoryView.py?project=Macintosh&story=Sou...

Which just goes to show, hitting memory is a Bad Thing(tm) even when you're running on a slow(from today's perspective) processor like a 68000.

Very impressive

Doing 22kHz generation on a Macintosh is very close to the limit

It wasn't always thus: On the 6502, which the early Apple II machines were built around, it was possible to access RAM at only a one- or two-cycle penalty compared to doing everything in registers and immediate values. This was only the case if you used zero-page memory without indexing, however, so you couldn't have a lot of stuff in RAM without incurring more speed penalties.


(Zero-page memory on the 6502 was the memory accessed via addresses with a high byte of 0x00. Since 6502 had sixteen-bit RAM addressing, this meant each page was 256 bytes large, so the zero-page was almost as good as having 256 single-byte registers.)

This, ladies and gentlemen, is a particularly detailed and good read. Please do give it a glance if you haven't already.

If you want to buy good book on the topic: "Memory Systems: Cache, DRAM, Disk" by Bruce Jacob, Spencer Ng, David Wang

A few days late, but thanks

A simple thought experiment suffices here. What is the shape which holds the most physical bits while minimizing the overall latency for random access? It's a sphere. Each bit occupies a space packed within that sphere. The radius of the sphere is the distance that light must traverse, and thus corresponds to latency.

"slow" elements of the memory hierarchy are on the outside of the sphere, while faster elements (cache, registers, etc) are layered on the inside, like an onion. Since those spheres are smaller they must, by definition, hold fewer bits, but they are, by definition, faster.

The total number of bits you can store is a function of the volume of the sphere. For a given latency level, it's a function of the surface area of the sphere at a given radius.

The volume of a sphere is 4/3pir^3. Because latency is a function of the radius (how far it takes light to bounce to the edge of the sphere and back) that means that latency must rise as at least the cube root of the number of bits you want to store. That is the best possible bound.

This implies that no algorithm is ever O(1) time for an asymptotically large number of elements accessed randomly--not even hash tables or pointer dereferences. They're at best O(n^1/3).

This is right for theoretical limits, but modern chips are fabricated as stacked 2D layers, forming planes rather than spheres. This changes information density gain per distance from the core from cubic to quadratic-- in Nehalem, the 64KB of L1 cache has 4 cycle latency, while 256KB of L2 cache (4x more) has 10 cycle latency (~2x slower).

> This implies that no algorithm is ever O(1) for an asymptotically large number of elements--not even hash tables or pointer dereferences.

O(1) is about number of operations required by algorithm to finish for given data size, not about the time. So latency doesn't matter.

Also: if the amount of information that can be kept in universe is finite (most probably it is) - then you can make algorithm that takes the same amount of operations no matter data size (just always add dummy data to fill up the data to the physical limit). Thus every algorithm is technically O(1).

Proof: let N be the number of bits that we can keep in memory. Every deterministic algorithm either does infinite loop, or finishes the execution after at most 2^N changes of state (otherways it is 2 times in the same state with different follow-up, and he can't, cause it's deterministic). So if we design an algorithm, that for every data fitting into memory calculates the result and then does busy loop for the remaining steps until the step 2^N - this algorithm is O(1) no matter what it does.

There's probably a hole in my understanding somewhere, cause algorithmic complexity would be a really useless definition if that was true :)

I think the hole in your understanding is assuming that math (in this case big-O) actually maps to reality. Big-O (and algorithms themselves) is defined entirely in mathematical terms. This model can allow input to be arbitrary large, and can allow operation to take a constant time. If you want to, you can talk about the algorithmic complexity of an algorithm assuming prime factorization in constant time. Maybe not useful, but no reason we cannot talk about it.

Usually the implicit assumption with O notation is that n may go to infinity.

Time and the number of operations are equivalent here: as proof, just define the operation as "move an information-carrying photon a tiny distance episilon". That must take a finite amount of time, as the speed of light is finite, and the number of those operations must increase with the number of randomly accessed elements you're working with, as they're necessary simply to retrieve the element from memory.

Algorithms have the same complexity no matter the machine: bubble sort is O(n^2) no matter if you use C64 or a new PC. That's why it uses operations instead of time - to be able to compare algorithms independently of machines it runs on.

Operations are usually defined as addition or multiplication or comparison. Moving a photon by epsilon isn't a valid operation in any architecture I'm aware of. Even if we use moving an electron by epsilon - you can't tell pentium to move one electron by exactly epsilon, it will move many at once, and it will move them by whatever it need to perform it's actual operations.

As for infinity - for all physically possible inputs the algorithm modified as described above will produce the same output as the algorithms that are considered correct by most people. If we care about infinities: any algorithm I've seen ever implemented was incorrect - most use integers or floats or doubles so their input space is very limited, and even the ones that use arbitrary length math - are run on machines with finite amount of memory.

Algorithmic complexity is determined by the complexity of the primitive operations. Most computers have primitive operations that are constant time, and can emulate the primitive operations of other computers in constant time. A notable exception to this is quantum computers, which have some operation that can be done faster than classical computers. Another exception is the Turing Machine, which take O(n) time to look up a random value from memory, whereas RAM based machines can do that in O(1) time.

> That's why it uses operations instead of time - to be able to compare algorithms independently of machines it runs on.

Splitting hairs here. You can talk about operations or time. Same thing as operations are sequential. One operation follows another. You can count them when you are done to get to total and the total is also referred to as the "time" in this context.

> I've seen ever implemented was incorrect - most use integers or floats or doubles so their input space is very limited

So are we talking about specific hardware here or not. I thought we weren't. There ambiguity and discussion point is there because one can define what they consider is a "constant" operation. You can say it is a hypothetical von neuman architecture machine and these operations (op1, op2, ....) take a constant time. Now we compare two algorithms and see how they do.

Moving data from one place to another (like in a sort) is also an operation.

Has anyone done a study on the optimal number of registers to have?

The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers? Why not have more? After all, x86-64 and ARM64 decided that having more suited them.

In the end it must come down to a compromise, with the downsides of having more registers possibly being some of the following:

* Increased instruction set size (having to encode a larger register space in the bit patterns of each instruction)

* Increased latency for interrupts? e.g. if your CPU has 1000 registers and an interrupt occurs, you're going to end up having to save all those 1000 registers somewhere. There could be some HW-assist but you'll pay the price somewhere.

* Extra cost for saving registers in functions. Sure, depends upon the ABI as some registers will be 'scratch' and not preserved between function calls, but if you've got more registers you'll end up wanting to save more of them.

* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? etc. At some point, those extra registers could be unused.

* Registers still need to be 'spilled' to memory. In an extreme case, you could imagine compiling a small program where every variable maps to a unique register. Ultimate speed! But asides from that optimal case, you'll end up still having to write registers back to memory. It makes no difference having 100 registers if you store the results of every computation...

Anyway, that's all speculation. I was wondering if someone had done a study. You could construct a virtual, bespoke CPU with n registers, then make gcc compile some SPEC benchmarks using the ISA and model it to see how efficient having an extra register makes it. You could graph registers vs simulated runtime and see where the sweet spot is.

Yes, it's been studied. You rapidly run into diminishing returns.


Here's a good thread discussing this: https://groups.google.com/forum/#!searchin/comp.arch/number$...

Awesome! Thank you for the link.

The studies would vary over time because CPU design and bottle necks have changed. Early designs were of course limited by transistor count, now we have OoOe and physical registers are limited by muxers and latency (see the presentations by the mill CPU guy [1]

Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Saving on context switches (interrupts alone aren't a big deal) was indeed a problem back when AltiVec was designed, thus it has a special register to keep track of which registers need to be saved. In modern designs this is less of a problem, between higher frequencies, multiple cores, and the other effects of a context switch dominating (effective flush of l1 cache and predictors).

The interesting bits nowadays are that load/store is expensive power-wise, which was what ARM identified as the major motivation behind having 32 registers (fewer spills in functions) and OoOe designs.

[1] http://m.youtube.com/watch?v=QGw-cy0ylCc&desktop_uri=%2Fwatc...

Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Ah, but I'm sure that if you have more registers available, you'd use more registers. Up to a certain point. But what point? Just how many registers?

No one uses more registers just to use more registers - in OoOe designs the main reason to use more registers is to reduce spilling and reloading. So in effect a compiler isn't going to use a register it has to save, unless in doing so it saves a spill+reload, which would result in the same number of load/store as without the additional register.

In-order designs have more reasons to use more registers, but again they aren't going to use more registers unless they gain something.

> The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers?

TFA gives at least one reason:

> Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.

Registers use up a lot of silicon, and consume a lot of energy to power it. They also need to stay physically close to computing circuits, otherwise you end up with an L1 cache more than a register.

Furthermore, although ISA expose a number of registers A, OOO architectures (and their friends parallel and speculative executions) pretty much require the CPU to have > A registers and do register renaming, which lowers the number of registers the ISA can define. For instance the Alpha ISA defines 32 integer registers, but the Alpha 21264 had 80 physical integer registers.

That's definitely another factor. Again though, I doubt it's the limiting one. No-one (as far as I know) has produced a power-hungry CPU with (say) 5000 registers on it.

I've heard that modern Intel processors have 100 < x < 200 physical registers. I'm not sure they actually document the exact number.

Itanium has at least 256. (128 Integer + 128 Float + 128 predicate (1 bit), which are essentially flip-flops.)

Register windows are a way to put 1000 registers in a CPU. See the SPARC and Itanium instruction sets for how this can be done. There are also plenty of studies about both.

Vector registers are another way to use 1000 registers.

But directly coding 1000 registers into each instruction does not seem to be such a good idea. You might as well use a 1st level cache. The difference between the cache and the register file ist mostly how the instruction set architecture references it. Registers are usually easier to acces because each one has a single name and the CPU can detect dependencies and conflicts easily. Memory accesses and caches are more complex because you need to calculate the addresses before you can detect dependencies/conflicts.

PD: Yet another way to use 1000 registers is massive multi-threading like the Tera MTA.

It's complicated, but modern processors actually do have many more registers than you can name in the instructions. They use things like "register renaming" to avoid false conflicts between instructions.

Registers that you name in assembly != physical registers. And when you use a register in two different instructions, you won't necessarily get the same physical register each time.

I thought this was an interesting insight in to that: http://ootbcomp.com/docs/belt/index.html

Note that the actual number of registers is considerably different than the number of registers you can access through instruction set. They are used via register renaming and optimizations of complex instructions.

Yes. As other commentors have said, if you are doing out-of-order execution well, the CPU will have many more 'hidden' registers and do register renaming to use them. But this has an interesting interaction with compilers.

Say you have a simple function that is going to add 1 to a bunch of variables. In an ARM-like assembly code, this could be written as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r1, [r0, #4]
  ADD r1, r1, #1
  STR r1, [r0, #4]
  LDR r1, [r0, #8]
  ADD r1, r1, #1
  STR r1, [r0, #8]
Now, if your CPU can do OoOE, it can spot that register r1 is used for three independent loads, adds and stores, and can internally use three different registers for them, allowing the operations to be done in parallel. But, equally, the compiler could have written the code as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r2, [r0, #4]
  ADD r2, r2, #1
  STR r2, [r0, #4]
  LDR r3, [r0, #8]
  ADD r3, r3, #1
  STR r3, [r0, #8]
Compilers and register renaming are fighting each other. In traditional compiler writing, you try to minimise the register usage and output the first code listing. But if you have plenty of registers, you could output the second code instead, and let the CPU do parallel execution without the need for register renaming.

In other words, once you have enough 'real' registers does it get rid of the need for register renaming? Intel added it to their pentiums to improve existing x86 code, but I wonder if it has that much of a benefit with newer ISAs that have 'enough' registers and properly tuned compilers?

You still need OoOe to execute your second example optimally since you didn't schedule the instructions, which points to why OoOe isn't going away - there are going to be code sequences that the compiler cannot schedule optimally, particularly around branches. Additionally, cache misses are impossible to predict statically, and OoOe helps hide those.

And no one does OoOe without register renaming.

Yeah, I avoided any other changes to avoid confusing the issue. But any reordering I could have done, the compiler could have done too. Your point about branches is fair though, as the 'active' renamed registers after a branch can only be known at runtime.

Still, I wonder whether some of the features of modern CPUs could be dropped if it wasn't for legacy code. On the other hand, Itanium tried to push the parallelism work onto the compiler and look where that ended up!

Most high performance CPUs will have ~100 physical registers or so, possibly divided up in multiple segments.

But abstracting those you have your architectural registers that are presented by your ISA, and the CPU uses register renaming to map those onto the physical registers.

The tradeoffs involving ISA registers are more intense. You have to load and store all of them on thread swaps, but that's pretty tivial. More importantly the bits you have to use to specify which register you're using are bits that you're paying in every single instruction you have, increasing the size of your executable and the pressure on your caches.

Different sorts of architectures have their sweet spots at different places. In order processors doing lots of matrix math and such benefit from lots of architectural registers, the Itanium had 128 integer and 128 floating point registers and that was the right amount for a VLIW architecture with it's features. Modern GPUs are similar.

On the other hand, your typical OoO CPU will have either 16 or 32 registers you can address at a time, and that seems to be close to optimal. It's hard to say since instructions come in discrete chunks and your number of registers has to be a power of 2 as a practical matter.

Fundamentally, having more registers increases the speed of light delays in accessing the register. If it did not, we would just operate on main memory itself. However, two few registers and you lose the ability to perform complex computations efficiently. So I believe it is, indeed, a compromise between speed and a need to maintain scratch state. I would be surprised if Intel and AMD didn't constantly run simulations of common computations in an effort to find the optimal size of all on-chip structures.

That's definitely another factor but I suspect it isn't the limiting factor. Sure, design a chip with a million registers and you'll end up constructing them like RAM. But with orders-of-magnitude fewer registers, 16 or 32 or whatever, the size of the register banks on the CPU can't be that significant to incur speed-of-light style delays, surely?

WITH 16x fewer registers, that equates to about 1 chip's worth of registers (remember a stick of DRAM often has 8-16 individual chips on it). While this is already clearly a huge problem, consider additionally that DRAM is made with trench capacitors, unlike SRAM. DRAM is dramatically slower and more dense than SRAM. So we either sacrifice speed, or bloat our one-chip's-worth of area by a few factors, say x4-8.

Then there's practicalities like sense amp design. Large register arrays are not read in a digital fashion, and current L2 and L3 sizes already press their sense amps to their limits. DRAM also uses sense amps, but the amps are again slower and larger.


Probably not, but there are definitely delay effects at play or L2 and L3 cache would be unnecessary, you could just have humongous L1s.

Perhaps it would clarify things with analogy:

Let's say Bubba's watching the Super Bowl. The table in front of him are his registers, the fridge is cache, and the corner shop a quick walk away is memory.

He looks and see he doesn't have any beer on the table. So he goes to the fridge and gets what he wants, and comes back to the couch. Later, Bubba runs out of beer (useful data). This is a cache miss, so he has to go down to the corner store and get some beer. Instead of just getting what he wants, maybe he gets some Hungry-Man frozen dinners, in case he'll want some later. He goes back, puts the beer and TV dinner in the fridge, and brings some beers with him to the table. Next time he runs out of beer, he goes to the corner store, but they're all out of beer. So he buys some seed, tills the fields, and grows his own barley. This is disk access.


Hmph. You forgot the part where Bubba's friends are watching him drink beer and eat Hungry-Mans, and if they want some, they can force Bubba to throw out all his food and pour all his beer down the drain, and everyone has to go back to the store.

Theres something in between, which you will find on microcontrollers: SRAM. If you use simple architectures, like AVR, you also get completely deterministic timings for a load from SRAM (e.g. 2 cycles for AVR).

Edit: Chill, everyone. Yes, it's "implementation detail of the substrate", but it is a very important implementation detail given that it is directly exposed to the programmer as memory, not in some automagically managed cache.

SRAM is used in every CPU, not just microcontrollers. Registers and cache are usually implemented as SRAM. The false distinction this article makes between registers and RAM is misleading and indicative of the author's general ignorance of computer architecture.

It's not misleading in the least unless you're a pedantic smartass who wants something to complain about. TFA uses terminology which "Reader Daniel Hooper" will understand, and in which RAM is a synonym for "main memory". Which is the colloquial meaning of RAM outside of hardware design labs and pedantic smartassery.

That's the implementation detail of the substrate, TFA uses "RAM" in the sense of "main memory" which is the colloquial meaning of the acronym. Registers can be implemented in SRAM. So can CPU-level caches or various hardware buffers.

That was my thought when I was reading the article. On-chip SRAM on microcontrollers feels different because on general purpose CPUs the generic programming model has registers and RAM with the cache managed for us by others. On MCUs you almost always end up being aware of on-chip SRAM and off-chip SRAM or DRAM. The lines are blurry for larger MCUs but for lower end stuff like Cortex M, AVR or MSP430 it's definitely a good idea to look over instruction timing for all the different flavors of storage.

Most ARM SoCs have a few hundred kilobytes of "internal RAM" (which is obviously SRAM) used mainly by the ROM and bootloader before the memory controller is initialized and can usually be accessed with the same latency as the L2 cache.

It's usually unused once the kernel has started but it can be mapped by the kernel later on if there's a use for it.

Modern x86 chips generally allow the onboard cache to be used as RAM during early boot for the same reason, too.

This is great stuff to know. Not relevant to my audience, I think, but it's something I wasn't quite aware of before, and I'm happy you pointed it out.

So, I'm a bit confused. Are registers SRAM? Or are they faster than SRAM?

Any of these computer architecture concepts: register file, L1/L2/L3 cache, main memory

Can be implemented with any of these components: DRAM, SRAM, D-FF (flip-flops)

It's common for main memory (in embedded systems) and register files to use SRAM. But you can also implement the registers with flip-flop banks, and get something bulkier but faster. I'm not sure what Intel/AMD does.

That's an awesome explanation. Thank you. [Making obvious reference to how relevant your username is]

Do any current ARM implementations do register renaming over a physical register set larger than the architected set?

Obviously Intel has been doing this for a while: Haswell has something like 168 integer registers, while the x86-64 ISA only exposes 16.

EDIT: Some Googling tells me that at least the Cortex-A9 mapped 32 architectural registers to 56 physical: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

Basically anybody doing out of order execution these days is going to be be doing register remapping at some level.

That article did a lot of simplifying, but probably simplifying that was needed for the person who asked that question.

An interesting thing about Apples take on AArch64 in particular that some people have been speculating about is about how Apple's Cyclone core's memory subsystem works. ARM cores usually use the virtual (post-MMU) address of data to determine where in the cache data lives, but if you stick with page size as big or bigger than the L1 size you can start your L1 lookup at the same time you do your TLB lookup, and save a lot of latency. Apple's control of the OS is what lets them force 64K page sizes.

This part stood out : "The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions."

And a bit later : "3.Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code."

This may potentially explain why a smaller executable isn't necessarily faster when executing. I guess a lot of compiler gymnastics are devoted to breaking down complex instructions to take advantage of this.

In some ways, the actual execution of code is opaque to compilers. Modern x86 processors further divide their instructions into op-codes in the instruction translation units. AMD and Intel both have their approaches to this internal instruction set deeply ingrained into every CPU since perhaps K7 for AMD and Pentium Pro for Intel. Pentium M and later the Core architecture contained op-code fusing where instead of just rearranging op-codes, the op-codes were combined into composite op-codes that could be executed in one step. The opcode fusing + out-of-order execution basically makes the CPU act like a compiler internally for binary. It's a like a JIT run-time for binary that's implemented in hardware.

As far as executable size and performance, compiling with -Os in GCC will occasionally yield a performance increase that might even change across CPU's and architectures as the memory sub-systems hit a good rhythm or there are less misses overall. Usually smaller is better for this. -O3 will occasionally unroll gigantic loops, while using compiler directed optimization to analyze which parts of a binary can benefit overall execution from unrolling vs less misses with smaller executable size can yield even better agreement between memory subsystem performance and execution speed.

Microarchitectures like MIPS have further blind alleys such as branch-delay slots that will finish execution even if a branch instruction -before- the slots is taken. This is an out-of-order program, but putting the burden on the compiler instead of implementing the reordering in hardware actually became a nuisance because the architecture couldn't change how it expected instructions without breaking binary compatibility and the compiler wouldn't have been able to tweak for different CPU's without a fat-binary approach.

Depends on the app, and your use case, and the CPU, etc. YMMV.

For a long time though, the Linux kernel has been compiled to optimize code-size rather than 'performance' (according to GCC). Why? Because the kernel gets involved in every syscall the OS makes, so the kernel code gets paged in and out very frequently. Loading a little less code from RAM means everything goes faster.

Well, it's going to be faster if the smaller executable can keep its entire text segment in memory.

I've done the instruction scheduling stuff by hand on paper; it's pretty interesting. We did Tomasulo scheduling, which is hardly modern, being developed in 1967, but it'll execute your instructions all sorts of ways.

Great explanation for folks without a hardware background. I also enjoyed his previous article about ARM64. Thanks for sharing.

As a kid I had a TI 99/4a. The TMS9900 processor didn't have any registers, it had a "workspace pointer" which let you treat a block of RAM as your registers. This was slow, but in theory allowed for convenient context switches as you'd just load a new workspace pointer.

Do any modern CPUs still use an approach like this?

Not if the CPU runs faster than 10 MHz or so. Fundamentally the speed of CPUs has gone up much, much faster than the speed of RAM for the reasons listed in the article. Some micro-controllers can still do things like you describe, but anything you'd think of as a modern CPU uses some form of caching that makes things more complicated than that.

It's funny reading this and then remembering that on top of all this, there's paging (i.e. fetching from hard drive).

It's like registers are refrigerators, RAM is like the grocery store around the corner, and Page faults are like the grocery stores in a neighboring town

woooooo memory!

Don't forget DMA which is like drop shipping with a guaranteed delivery date, like 3 days later, but they just shove it directly into the shelf

While I personally love this answer, I have to admit a basic physical metaphor works. If you remember an answer, it is practically immediate. The further back in your records you have to go to find something, the slower it will be.

We have faster ways of recalling notes today than we did in the past, you might say? Well, yeah. In many respects our ram is faster than registers of early computers, too. That all things have gotten faster doesn't change that things which were faster are still faster. (I'd be delighted to know examples where this radically changed somehow.)

Jebus christ, it's because they're close. Like IN the cpu. Not nuzzled not ON, not NEXT TO.

Hell, if you know and optimize for registers and don't know why they're fast, you should be shot. Otherwise you're using a language that doesn't really give you control over registers why do you care?

Okay okay, I like reading about the blackbird and I know that I know nothing about how it really works other than lots of fuel. Still. Okay, I'm a Hypokrite.

While the CPU is waiting for data to load from RAM, is the operating system smart enough to give it a different task to execute?

The overhead of task switching is too great for that to be useful, plus the OS would probably need to talk to RAM as part of the whole process anyway.

However, this is part of what hyperthreading accomplishes. The OS gives the CPU two tasks ahead of time, then when one task stalls, the CPU can switch over to the other one and work on it for a while.

This is actually what hyperthreading is all about: cache misses. I missed that in the article. There are more things missing actually, but I guess it would be too much to explain it all in a single article. Things like caches, coherence protocols, prefetching, memory disambiguation. Registers are also much more complex because you have things like register renaming, result forwarding etc. In the end there are simply much less registers than memory locations, that's why you can build faster registers than memory.

I thought hyperthreading was able to go beyond this, and e.g. execute the two streams in parallel if one is hitting the FPU and the other is doing integer work, even if neither one is stalled.

And you're right, it's missing a lot because I'm writing an article, not a book. It is fun to explore details, but ultimately you have to stop somewhere.

That was the impression I had too, but if so I can see how "this is actually what hyperthreading is all about" would make sense. Two streams of code are unlikely to have long segments of just-FPU and just-integer respectively, and even more unlikely that those streams will happen to align during execution. It happens, sure, but the gains would be smallish.

On the other hand, long periods of no cache misses followed by long periods of waiting after a cache miss are exactly what you expect from real code (especially optimized code). So I'd think that you'd have much bigger gains from that. The same goes for branch misprediction.

Well, the gains are smallish. Real-world gains from hyperthreading are on the order of 10-20% when you load up a CPU with two threads.

Yeah, but when I said "smallish" I was thinking more on the order of 1%. I would consider 10% actual gains to be quite large given the craziness of what Hyperthreading tries to accomplish.

It may also be a matter of more fully utilizing multiple integer/floating-point units. Say, if the CPU has two integer units but the current code is only using up one of them, then it could run the second hyperthread on the other. I really don't know the details though.

Yes, hyperthreading (aka SMT), as implemented in Intels processors, can execute instructions from several threads in the same clock cycle. Other processors, like Sun's Niagara, switch threads only on certain events like cache misses (this is known as SoEMT). Workloads with a lot of cache misses is where both really shine.

Of course it's hard to write about a complex topic, choose the right details, and make it all seem simple. Thumbs up for trying!

Glad you asked ;-) It's called Hyper-threading [1] and works best when the scheduler is aware. Provided in some Intel's CPUs (2 threads per core) and in Sun's (later Oracle's) UltraSPARC T1...T5 (8 threads per core).

[1] http://en.wikipedia.org/wiki/Hyper-threading

[2] for example, CONFIG_SCHED_SMT in linux

In addition to what others said about the overhead of context switching just for a DRAM access stall (at today's DRAM latencies, which are ~200 to 400 cycles), there's an architectural issue with the idea, too. Consider that from software's point of view, missing the cache and going to DRAM is "invisible": it happens as part of executing a single instruction. Software doesn't know the cache miss happened; architecturally, the result of the memory load is the same whether it came from cache or DRAM. So to allow the OS to do something clever, the processor would have to define a way of notifying the software that a cache miss occurred, probably by raising an exception and aborting the instruction, to be resumed later (like a page fault). So it would take a nontrivial amount of effort by CPU architects to enable such an OS feature.

Interestingly, there is at least one academic proposal to do something like this [1], but I'm not aware of any real implementations.

[1] http://dl.acm.org/citation.cfm?id=891494

The OS does not. That's the job of the out of order architecture (load prediction and reordering of later non-dependent instructions).

And when that fails, it's exactly the use case for hyperthreading.

But electrical signals don't propogate at the speed of light in a vacuum ( I didn't read past that point). The signals travel at about 2/3 the speed of light. This is very significant when you look at path lenghts

So what's the state of research in breaking out of the Von Neumann approach and going with a RAM-free architecture where the CPU has m(b)illions of registers you just do everything in? Of course it's expensive, but let's say you have effectively infinite dollars, is this a good idea?

Where would your processor fetch its program code from, if not RAM?

Assuming you place code also into registers...

If you squint hard enough registers are also a form of RAM, just closer to the processor and faster. A machine with only instruction execute and registers would still have a Harvard/Von Neumann architecture.

The reason why processors don't have more registers is because they are quite power hungry and they are not very dense. For a given chip area, D-RAM gives you more than 6x the capacity for less than half the power use. And no, you can't make registers with the same technology as D-RAM.

Right, registers are a kind of very small working memory, the only place where "work" operations can happen. Most program code eventually has to go through the register bank anyway, except it all has to be MOV in and out of the registers, eating up unbelievable amount of time.

I've always viewed RAM as a kind of register cache, necessitated because registers are expensive to build and RAM, though expensive, is cheaper. I've heard registers these days are just a small bit of SRAM, but reaching into my way back machine in college, I seem to remember them being a different kind of memory element.

But RAM and all the caches these days leading up to registers are all require fetch from somewhere, store in the register, do the work, then write back the result somewhere (even if the instruction set obfuscates that). If you had enough registers, the fetch and store parts of that work are pretty much gone, turning something like

mov 0xaddressh-1 RegA mov 0xaddressh-2 RegB add RegA RegB RegC mov RegC 0xaddressh-3


add 0xReg-1 0xReg-2 0xReg-3

where each mov we do today introduces a cascade down the cache and memory stack (perhaps even dipping into on-disk VM) just to copy a few bytes into a register. And we have to do that 3 times here. The number of adds we could do in the time it takes to do a mov is probably pretty high, but we simply can't do them because we're waiting on bits moving from one place to another.

So suppose money, power etc. weren't considered issues and engineering effort was put into a register-only approach, how much faster would that be? (one the reasons that the Von Neumann architecture became "the" way to do things was that registers were considered expensive to build, but what if we didn't care about money?)

I'd bet a general purpose system built this way would be an order of magnitude faster than anything we have today. But you're right, it would be an enormous resource hog and be expensive as a medium-sized mega yacht.

I think you don't understand how physics work.

The problem with modern chips is not money, or the price of energy. The poor things would simply fry (or explode) if we were to make them like you suggest and clock them at competitive frequency.


Cites distance and cost/power as reasons why "RAM is slow, registers are fast", not a mention of differences between SRAM and (S)DRAM.

Not worth reading.

He clearly describes SRAM in the following paragraph, then contrasts it with DRAM in the rest:

Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.

Yeah, I did not even recognize this as description of SRAM, thanks for pointing this out.

from http://www.differencebetween.net/technology/difference-betwe...

Because of its lower price, DRAM has become the mainstream in computer main memory despite being slower and more power hungry compared to SRAM. SRAM memory is still used in a lot of devices where speed is more crucial than capacity. The most prominent use of SRAM is in the cache memory of processors where speed is very essential, and the low power consumption translates to less heat that needs to be dissipated.

In any case, SRAM can be more power hungry that DRAM (per bit) but it can also be vastly less. SRAM power consumption is not at all driven by the fact that registers are "continuously powered". Accessing SRAM is the power hungry operation, but powering requirements otherwise are negligible. If anything, it's the DRAM that requires constant powering (refreshing).

I appreciate the discussion around this. I'm not too knowledgeable about hardware, and wasn't sure about this part in particular. I did pass it by a friend who did hardware professionally for a long time, and he ultimately agreed with my assessment. However, I think you've convinced me that I was wrong, and that it's purely about cost, not power consumption. I'll have to see about editing the article accordingly.

Agreed. Completely misses on-die caches for instance. Article might have been accurate in the 6502 days, but not today.

> Agreed. Completely misses on-die caches for instance.

Absolutely, except for the part where he mentions them

> If you're really lucky and the value is in L1 cache, it'll only take a few cycles.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact