
Why Registers Are Fast and RAM Is Slow - anandabits
http://www.mikeash.com/pyblog/friday-qa-2013-10-11-why-registers-are-fast-and-ram-is-slow.html
======
yan
For a way more detailed look at memory architectures and implementation, check
out Ulrich Drepper's classic paper "What Every Programmer Should Know About
Memory"[1]

[1]
[http://www.akkadia.org/drepper/cpumemory.pdf](http://www.akkadia.org/drepper/cpumemory.pdf)

~~~
demallien
Or on a more light-hearted note:
[http://folklore.org/StoryView.py?project=Macintosh&story=Sou...](http://folklore.org/StoryView.py?project=Macintosh&story=Sound_By_Monday.txt&sortOrder=Sort%20by%20Date&detail=medium)

Which just goes to show, hitting memory is a Bad Thing(tm) even when you're
running on a slow (from today's perspective) processor like a 68000.

~~~
raverbashing
Very impressive

Doing 22 kHz sound generation on a Macintosh is very close to the limit.

------
codex
A simple thought experiment suffices here. What is the shape which holds the
most physical bits while minimizing the overall latency for random access?
It's a sphere. Each bit occupies a space packed within that sphere. The radius
of the sphere is the distance that light must traverse, and thus corresponds
to latency.

"slow" elements of the memory hierarchy are on the outside of the sphere,
while faster elements (cache, registers, etc) are layered on the inside, like
an onion. Since those spheres are smaller they must, by definition, hold fewer
bits, but they are, by definition, faster.

The total number of bits you can store is a function of the volume of the
sphere. For a given latency level, it's a function of the surface area of the
sphere at a given radius.

The volume of a sphere is (4/3) _pi_ r^3. Because latency is a function of the
radius (how long it takes light to travel to the edge of the sphere and back),
latency _must_ rise as at least the cube root of the number of bits you want
to store. That is the best possible bound.
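
Spelled out, a sketch of the same argument:

    % bits scale with volume, latency with radius
    n \;\propto\; \frac{4}{3}\pi r^3
    \quad\Longrightarrow\quad r \;\propto\; n^{1/3}
    \quad\Longrightarrow\quad \text{latency} \;\ge\; \frac{2r}{c} \;=\; \Omega\!\left(n^{1/3}\right)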

This implies that no algorithm is ever O(1) time for an asymptotically large
number of elements accessed randomly, not even hash tables or pointer
dereferences. They're at best O(n^(1/3)).

~~~
ajuc
> This implies that no algorithm is ever O(1) for an asymptotically large
> number of elements--not even hash tables or pointer dereferences.

O(1) is about the number of operations the algorithm needs to finish for a
given data size, not about the time. So latency doesn't matter.

Also: if the amount of information that can be kept in the universe is finite
(it most probably is), then you can make an algorithm that takes the same
number of operations no matter the data size (just pad the data with dummy
bits up to the physical limit). Thus every algorithm is technically O(1).

Proof: let N be the number of bits that we can keep in memory. Every
deterministic algorithm either loops forever or finishes execution after at
most 2^N changes of state (otherwise it would pass through the same state
twice with different follow-ups, which it can't, because it's deterministic).
So if we design an algorithm that, for every input fitting into memory,
calculates the result and then busy-loops until step 2^N, this algorithm is
O(1) no matter what it does.
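
Made explicit, the counting step is a pigeonhole argument:

    % an N-bit deterministic machine has at most 2^N configurations;
    % a longer run must repeat one, and determinism then loops forever
    \text{steps}(x) \;\le\; 2^N \quad \text{for every halting run on input } x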

There's probably a hole in my understanding somewhere, because algorithmic
complexity would be a really useless definition if that were true :)

~~~
codex
Usually the implicit assumption with O notation is that n may go to infinity.

Time and the number of operations are equivalent here: as proof, just define
the operation as "move an information-carrying photon a tiny distance
epsilon". That must take a finite amount of time, as the speed of light is
finite, and the number of those operations must increase with the number of
randomly accessed elements you're working with, as they're necessary simply to
retrieve an element from memory.

~~~
ajuc
Algorithms have the same complexity no matter the machine: bubble sort is
O(n^2) whether you run it on a C64 or a new PC. That's why complexity counts
operations instead of time: so we can compare algorithms independently of the
machine they run on.

Operations are usually defined as addition or multiplication or comparison.
Moving a photon by epsilon isn't a valid operation in any architecture I'm
aware of. Even if we substitute moving an electron by epsilon, you can't tell
a Pentium to move one electron by exactly epsilon; it will move many at once,
and it will move them by whatever it needs to perform its actual operations.

As for infinity: for all physically possible inputs, the algorithm modified as
described above will produce the same output as the algorithms most people
consider correct. And if we do care about infinities, then every algorithm
I've ever seen implemented was incorrect: most use integers or floats or
doubles, so their input space is very limited, and even the ones that use
arbitrary-length math run on machines with a finite amount of memory.

~~~
gizmo686
Algorithmic complexity is determined by the complexity of the primitive
operations. Most computers have primitive operations that take constant time,
and can emulate the primitive operations of other computers in constant time.
A notable exception is quantum computers, which have some operations that can
be done faster than on classical computers. Another exception is the Turing
machine, which takes O(n) time to look up a random value from memory, whereas
RAM-based machines can do that in O(1) time.

------
joosters
Has anyone done a study on the optimal number of registers to have?

The website answers the register question well, but leads to a further
question: If registers are so great, why stick with just 16/32/64/n registers?
Why not have more? After all, x86-64 and ARM64 decided that having more suited
them.

In the end it must come down to a compromise, with the downsides of having
more registers possibly being some of the following:

* Increased instruction size (having to encode a larger register space in the bit patterns of each instruction)

* Increased latency for interrupts? e.g. if your CPU has 1000 registers and an interrupt occurs, you're going to end up having to save all those 1000 registers somewhere. There could be some HW-assist but you'll pay the price somewhere.

* Extra cost for saving registers in functions. Sure, depends upon the ABI as some registers will be 'scratch' and not preserved between function calls, but if you've got more registers you'll end up wanting to save more of them.

* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? etc. At some point, those extra registers could be unused.

* Registers still need to be 'spilled' to memory. In an extreme case, you could imagine compiling a small program where every variable maps to a unique register. Ultimate speed! But aside from that optimal case, you'll still end up having to write registers back to memory. It makes no difference having 100 registers if you store the results of every computation...
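
One low-effort way to poke at that last point without building a bespoke CPU
(a sketch of mine, not from the thread): keep more values live than the
machine has registers, then read the generated assembly. On x86-64, with its
16 integer registers, the 20 accumulators below should produce visible spill
stores to the stack under gcc -O2 -S.

    #include <stdio.h>

    int main(void) {
        /* 20 accumulators stay live across the whole loop -- more than
           x86-64's 16 general-purpose registers, so some must spill */
        long a0=0,a1=0,a2=0,a3=0,a4=0,a5=0,a6=0,a7=0,a8=0,a9=0;
        long a10=0,a11=0,a12=0,a13=0,a14=0,a15=0,a16=0,a17=0,a18=0,a19=0;
        for (long i = 0; i < 1000000; i++) {
            a0+=i;     a1+=i^1;   a2+=i^2;   a3+=i^3;   a4+=i^4;
            a5+=i^5;   a6+=i^6;   a7+=i^7;   a8+=i^8;   a9+=i^9;
            a10+=i^10; a11+=i^11; a12+=i^12; a13+=i^13; a14+=i^14;
            a15+=i^15; a16+=i^16; a17+=i^17; a18+=i^18; a19+=i^19;
        }
        /* use every accumulator so the compiler can't delete any of them */
        printf("%ld\n", a0+a1+a2+a3+a4+a5+a6+a7+a8+a9
                      +a10+a11+a12+a13+a14+a15+a16+a17+a18+a19);
        return 0;
    }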

Anyway, that's all speculation. I was wondering if someone had done a study.
You could construct a virtual, bespoke CPU with n registers, have gcc compile
some SPEC benchmarks to that ISA, and simulate them to see how much efficiency
each extra register buys. You could graph registers vs. simulated runtime and
see where the sweet spot is.

~~~
Scaevolus
Yes, it's been studied. You rapidly run into diminishing returns.

[http://arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf](http://arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf)

Here's a good thread discussing this:
[https://groups.google.com/forum/#!searchin/comp.arch/number$...](https://groups.google.com/forum/#!searchin/comp.arch/number$20of$20registers/comp.arch/BClo97LG3mo/DhEfQXKaRM4J)

~~~
joosters
Awesome! Thank you for the link.

------
Theodores
Perhaps an analogy would clarify things:

Let's say Bubba's watching the Super Bowl. The table in front of him is his
registers, the fridge is cache, and the corner shop a quick walk away is
memory.

He looks and sees he doesn't have any beer on the table. So he goes to the
fridge, gets what he wants, and comes back to the couch. Later, Bubba runs
out of beer (useful data). This is a cache miss, so he has to go down to the
corner store and get some beer. Instead of just getting what he wants, maybe
he gets some Hungry-Man frozen dinners too, in case he wants some later. He
goes back, puts the beer and TV dinners in the fridge, and brings some beers
with him to the table. Next time he runs out of beer, he goes to the corner
store, but they're all out of beer. So he buys some seed, tills the fields,
and grows his own barley. This is disk access.

[http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-...](http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-cache-metaphor)

~~~
terrymah
Hmph. You forgot the part where Bubba's friends are watching him drink beer
and eat Hungry-Mans, and if they want some, they can force Bubba to throw out
all his food and pour all his beer down the drain, and everyone has to go back
to the store.

------
revelation
There's something in between, which you'll find on microcontrollers: SRAM. If
you use a simple architecture, like AVR, you also get completely deterministic
timing for a load from SRAM (e.g. 2 cycles on AVR).

Edit: Chill, everyone. Yes, it's an "implementation detail of the substrate",
but it is a _very important implementation detail_ given that it is directly
exposed to the programmer as memory, not hidden in some automagically managed
cache.

~~~
georgecmu
SRAM is used in every CPU, not just microcontrollers. Registers and cache are
usually implemented as SRAM. The false distinction this article makes between
registers and RAM is misleading and indicative of the author's general
ignorance of computer architecture.

~~~
masklinn
It's not misleading in the least unless you're a pedantic smartass who wants
something to complain about. TFA uses terminology which "Reader Daniel Hooper"
will understand, and in which RAM is a synonym for "main memory". Which is the
colloquial meaning of RAM outside of hardware design labs and pedantic
smartassery.

------
twoodfin
Do any current ARM implementations do register renaming over a physical
register set larger than the architected set?

Obviously Intel has been doing this for a while: Haswell has something like
168 integer registers, while the x86-64 ISA only exposes 16.

EDIT: Some Googling tells me that at least the Cortex-A9 mapped 32
architectural registers to 56 physical:
[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/CHDIBEGC.html)

~~~
Symmetry
Basically anybody doing out of order execution these days is going to be be
doing register remapping at some level.

------
Symmetry
That article did a lot of simplifying, but probably the simplifying that was
needed for the person who asked the question.

An interesting thing about Apple's take on AArch64 in particular, one that
some people have been speculating about, is how Apple's Cyclone core's memory
subsystem works. ARM cores usually use the virtual (post-MMU) address of data
to determine where in the cache the data lives, but if you stick with a page
size as big as or bigger than the L1 size, you can start your L1 lookup at the
same time as your TLB lookup and save a lot of latency. Apple's control of the
OS is what lets them force 64K page sizes.
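
The arithmetic behind that overlap, spelled out (standard cache-indexing
reasoning, with my numbers rather than the comment's): a virtually indexed,
physically tagged L1 can safely overlap the TLB lookup when all of its index
bits come from the page offset, i.e.

    \frac{\text{L1 size}}{\text{associativity}} \;\le\; \text{page size}

With 4 KB pages, an 8-way cache caps out at a 32 KB L1; with 64 KB pages the
same 8-way design could index up to 512 KB, which is why big pages and a big,
low-latency L1 go together.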

------
eksith
This part stood out: "The ideal of 3-4 instructions per clock cycle can only
be achieved in code that has a lot of independent instructions."

And a bit later: "3. Modern CPUs do crazy things internally and will happily
execute your instruction stream in an order that's wildly different from how
it appears in the code."

This may explain why a smaller executable isn't necessarily faster when
_executing_. I guess a lot of compiler gymnastics are devoted to breaking
down complex instructions to take advantage of this.
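
A sketch of what "independent instructions" buys (my example, not the
article's): both functions below compute the same sum, but the second keeps
four dependency chains in flight, which is the kind of code that can approach
several instructions per cycle. Compile with something like gcc -O2
-fno-tree-vectorize so the auto-vectorizer doesn't muddy the comparison.

    #include <stddef.h>

    /* serial: every add must wait for the previous one to finish */
    long sum1(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* four independent chains: the core can issue these adds in parallel */
    long sum4(const long *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];    /* leftover elements */
        return s0 + s1 + s2 + s3;
    }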

~~~
knappador
In some ways, the actual execution of code is opaque to compilers. Modern x86
processors further divide their instructions into micro-ops in the instruction
decode units. AMD and Intel both have their approaches to this internal
instruction set deeply ingrained into every CPU since perhaps the K7 for AMD
and the Pentium Pro for Intel. The Pentium M, and later the Core architecture,
added micro-op fusion, where instead of just rearranging micro-ops, the CPU
combines them into composite micro-ops that can be executed in one step.
Micro-op fusion plus out-of-order execution basically makes the CPU act like a
compiler internally. It's like a JIT runtime for machine code, implemented in
hardware.

As far as executable size and performance go, compiling with -Os in GCC will
occasionally yield a performance increase, one that might even vary across
CPUs and architectures as the memory subsystems hit a good rhythm or there are
fewer misses overall. Usually smaller is better here. -O3 will occasionally
unroll gigantic loops; using compiler-directed optimization to analyze which
parts of a binary gain more from unrolling than they lose to extra misses from
a larger executable can yield even better agreement between memory subsystem
performance and execution speed.

Architectures like MIPS have further blind alleys, such as branch-delay slots:
the instruction in the slot finishes execution even if the branch instruction
_before_ the slot is taken. This is an out-of-order program with the burden
put on the compiler instead of on reordering hardware, and it became a
nuisance because the architecture couldn't change how it expected instructions
to be scheduled without breaking binary compatibility, and the compiler
couldn't tweak for different CPUs without a fat-binary approach.

------
eliteraspberrie
Great explanation for folks without a hardware background. I also enjoyed his
previous article about ARM64. Thanks for sharing.

------
ams6110
As a kid I had a TI 99/4a. The TMS9900 processor didn't have any on-chip
general-purpose registers; it had a "workspace pointer" that let you treat a
block of RAM as your registers. This was slow, but in theory it allowed for
convenient context switches, as you'd just load a new workspace pointer.
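
A toy model of the idea (my sketch; not TMS9900-accurate, and the names are
made up): the "register file" is just a window into RAM, so a context switch
is a single pointer assignment.

    #include <stdint.h>
    #include <stdio.h>

    #define NREGS 16

    static uint16_t ram[4096];        /* all machine state lives in RAM */
    static uint16_t *wp = &ram[0];    /* workspace pointer: the current "registers" */

    static void context_switch(uint16_t *workspace) {
        wp = workspace;               /* nothing to save or restore */
    }

    int main(void) {
        wp[0] = 7;                    /* task A's "R0" */
        context_switch(&ram[NREGS]);  /* point at task B's workspace */
        wp[0] = 42;                   /* task B's "R0" */
        context_switch(&ram[0]);      /* back to task A */
        printf("task A R0 = %d\n", wp[0]);  /* prints 7 */
        return 0;
    }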

Do any modern CPUs still use an approach like this?

~~~
Symmetry
Not if the CPU runs faster than 10 MHz or so. Fundamentally, the speed of CPUs
has gone up much, much faster than the speed of RAM, for the reasons listed in
the article. Some microcontrollers can still do things like what you describe,
but anything you'd think of as a modern CPU uses some form of caching that
makes things more complicated than that.

------
rlu
It's funny reading this and then remembering that on top of all this, there's
paging (i.e. fetching from hard drive).

It's like registers are refrigerators, RAM is the grocery store around the
corner, and page faults are the grocery stores in a neighboring town.

woooooo memory!

~~~
grogenaut
Don't forget DMA, which is like drop shipping with a guaranteed delivery date,
like 3 days later, except they shove it directly onto the shelf.

------
taeric
While I personally love this answer, I have to admit a basic physical metaphor
works: if you remember an answer, it is practically immediate; the further
back in your records you have to go to find something, the slower it will be.

We have faster ways of recalling notes today than we did in the past, you
might say? Well, yeah. In many respects our RAM is faster than the registers
of early computers, too. But that all things have gotten faster doesn't change
the fact that the things which were faster then are still faster now. (I'd be
delighted to hear of examples where this radically changed somehow.)

------
grogenaut
Jebus christ, it's because they're close. Like IN the CPU. Not nuzzled up
against it, not ON it, not NEXT TO it.

Hell, if you know about and optimize for registers but don't know why they're
fast, you should be shot. And otherwise, if you're using a language that
doesn't really give you control over registers, why do you care?

Okay okay, I like reading about the Blackbird and I know that I know nothing
about how it really works other than "lots of fuel". Still. Okay, I'm a
hypocrite.

------
ars
While the CPU is waiting for data to load from RAM, is the operating system
smart enough to give it a different task to execute?

~~~
mikeash
The overhead of task switching is too great for that to be useful, plus the OS
would probably need to talk to RAM as part of the whole process anyway.

However, this is part of what hyperthreading accomplishes. The OS gives the
CPU two tasks ahead of time, then when one task stalls, the CPU can switch
over to the other one and work on it for a while.

~~~
solarexplorer
This is actually what hyperthreading is all about: cache misses. I missed that
in the article. There are more things missing actually, but I guess it would
be too much to explain it all in a single article. Things like caches,
coherence protocols, prefetching, memory disambiguation. Registers are also
much more complex than described, because you have things like register
renaming, result forwarding, etc. In the end there are simply far fewer
registers than memory locations; that's why you can build registers that are
faster than memory.

~~~
mikeash
I thought hyperthreading was able to go beyond this, and e.g. execute the two
streams in parallel if one is hitting the FPU and the other is doing integer
work, even if neither one is stalled.

And you're right, it's missing a lot because I'm writing an article, not a
book. It is fun to explore details, but ultimately you have to stop somewhere.

~~~
mistercow
That was the impression I had too, but if so, I can see how "this is actually
what hyperthreading is all about" would make sense. Two streams of code are
unlikely to have long segments of just-FPU and just-integer work respectively,
and it's even more unlikely that those segments will happen to align during
execution. It happens, sure, but the gains would be smallish.

On the other hand, long periods of no cache misses followed by long periods of
waiting after a cache miss are exactly what you expect from real code
(especially optimized code). So I'd think that you'd have much bigger gains
from that. The same goes for branch misprediction.

~~~
mikeash
Well, the gains are smallish. Real-world gains from hyperthreading are on the
order of 10-20% when you load up a CPU with two threads.

~~~
mistercow
Yeah, but when I said "smallish" I was thinking more on the order of 1%. I
would consider 10% actual gains to be quite large given the craziness of what
Hyperthreading tries to accomplish.

~~~
mikeash
It may also be a matter of more fully utilizing multiple integer and
floating-point units. Say the CPU has two integer units but the current code
is only using one of them; then it could run the second hyperthread on the
other. I really don't know the details, though.

------
markb139
But electrical signals don't propagate at the speed of light in a vacuum (I
didn't read past that point). The signals travel at about 2/3 the speed of
light. This is very significant when you look at path lengths.

------
gonzo
I'll just leave this here.
[https://www.usenix.org/system/files/1309_14-17_mickens.pdf](https://www.usenix.org/system/files/1309_14-17_mickens.pdf)

------
bane
So what's the state of research on breaking out of the Von Neumann approach
and going with a RAM-free architecture, where the CPU has m(b)illions of
registers and you just do everything in them? Of course it's expensive, but
let's say you have effectively infinite dollars: is this a good idea?

~~~
knz42
Where would your processor fetch its program code from, if not RAM?

Assuming you place code into registers as well...

If you squint hard enough, registers are also a form of RAM, just closer to
the processor and faster. A machine with only instruction execution and
registers would still have a Harvard/Von Neumann architecture.

The reason processors don't have more registers is that registers are quite
power hungry and not very dense. For a given chip area, DRAM gives you more
than 6x the capacity for less than half the power use. And no, you can't make
registers with the same technology as DRAM.

~~~
bane
Right, registers are a kind of very small working memory, the only place where
"work" operations can happen. Most program code eventually has to go through
the register bank anyway, except it all has to be MOVed in and out of the
registers, eating up an unbelievable amount of time.

I've always viewed RAM as a kind of register cache, necessitated because
registers are expensive to build and RAM, though expensive, is cheaper. I've
heard registers these days are just a small bit of SRAM, but reaching into my
wayback machine from college, I seem to remember them being a different kind
of memory element.

But RAM and all the caches these days leading up to registers all require a
fetch from somewhere, a store into the register, doing the work, then writing
the result back somewhere (even if the instruction set obfuscates that). If
you had enough registers, the fetch and store parts of that work would pretty
much go away, turning something like

    mov 0xaddressh-1 RegA
    mov 0xaddressh-2 RegB
    add RegA RegB RegC
    mov RegC 0xaddressh-3

into

    add 0xReg-1 0xReg-2 0xReg-3

where each mov we do today introduces a cascade down the cache and memory
stack (perhaps even dipping into on-disk VM) just to copy a few bytes into a
register. And we have to do that 3 times here. The number of adds we could do
in the time it takes to do a mov is probably pretty high, but we simply can't
do them because we're waiting on bits moving from one place to another.

So suppose money, power, etc. weren't considered issues and engineering effort
was put into a register-only approach: how much faster would that be? (One of
the reasons the Von Neumann architecture became "the" way to do things was
that registers were considered expensive to build, but what if we didn't care
about money?)

I'd bet a general-purpose system built this way would be an order of magnitude
faster than anything we have today. But you're right, it would be an enormous
resource hog and as expensive as a medium-sized mega yacht.

~~~
knz42
I think you don't understand how physics works.

The problem with modern chips is not money, or the price of energy. The poor
things would simply fry (or explode) if we made them the way you suggest and
clocked them at a competitive frequency.

------
georgecmu
Cites distance and cost/power as reasons why "RAM is slow, registers are
fast", not a mention of differences between SRAM and (S)DRAM.

Not worth reading.

~~~
jff
He clearly describes SRAM in the following paragraph, then contrasts it with
DRAM in the rest:

 _Registers use an expensive and power-hungry active design. They're
continuously powered, and when reading them, this means that their value can
be read quickly. Reading a register bit is a matter of activating the right
transistor and then waiting a short time for the powerful register hardware to
push the read line to the appropriate state._

~~~
georgecmu
Yeah, I didn't even recognize this as a description of SRAM; thanks for
pointing it out.

from [http://www.differencebetween.net/technology/difference-betwe...](http://www.differencebetween.net/technology/difference-between-sram-and-dram/)

 _Because of its lower price, DRAM has become the mainstream in computer main
memory despite being slower and more power hungry compared to SRAM. SRAM
memory is still used in a lot of devices where speed is more crucial than
capacity. The most prominent use of SRAM is in the cache memory of processors
where speed is very essential, and the low power consumption translates to
less heat that needs to be dissipated._

In any case, SRAM can be more power hungry than DRAM (per bit), but it can
also be vastly less so. SRAM power consumption is not at all driven by the
fact that registers are "continuously powered": accessing SRAM is the
power-hungry operation, and the power required otherwise is negligible. If
anything, it's DRAM that requires constant powering (refreshing).

~~~
mikeash
I appreciate the discussion around this. I'm not too knowledgeable about
hardware, and wasn't sure about this part in particular. I did pass it by a
friend who did hardware professionally for a long time, and he ultimately
agreed with my assessment. However, I think you've convinced me that I was
wrong, and that it's purely about cost, not power consumption. I'll have to
see about editing the article accordingly.

