Which just goes to show, hitting memory is a Bad Thing(tm) even when you're running on a slow(from today's perspective) processor like a 68000.
Doing 22kHz generation on a Macintosh is very close to the limit
(Zero-page memory on the 6502 was the memory accessed via addresses with a high byte of 0x00. Since 6502 had sixteen-bit RAM addressing, this meant each page was 256 bytes large, so the zero-page was almost as good as having 256 single-byte registers.)
"slow" elements of the memory hierarchy are on the outside of the sphere, while faster elements (cache, registers, etc) are layered on the inside, like an onion. Since those spheres are smaller they must, by definition, hold fewer bits, but they are, by definition, faster.
The total number of bits you can store is a function of the volume of the sphere. For a given latency level, it's a function of the surface area of the sphere at a given radius.
The volume of a sphere is 4/3pir^3. Because latency is a function of the radius (how far it takes light to bounce to the edge of the sphere and back) that means that latency must rise as at least the cube root of the number of bits you want to store. That is the best possible bound.
This implies that no algorithm is ever O(1) time for an asymptotically large number of elements accessed randomly--not even hash tables or pointer dereferences. They're at best O(n^1/3).
O(1) is about number of operations required by algorithm to finish for given data size, not about the time. So latency doesn't matter.
Also: if the amount of information that can be kept in universe is finite (most probably it is) - then you can make algorithm that takes the same amount of operations no matter data size (just always add dummy data to fill up the data to the physical limit). Thus every algorithm is technically O(1).
Proof: let N be the number of bits that we can keep in memory. Every deterministic algorithm either does infinite loop, or finishes the execution after at most 2^N changes of state (otherways it is 2 times in the same state with different follow-up, and he can't, cause it's deterministic). So if we design an algorithm, that for every data fitting into memory calculates the result and then does busy loop for the remaining steps until the step 2^N - this algorithm is O(1) no matter what it does.
There's probably a hole in my understanding somewhere, cause algorithmic complexity would be a really useless definition if that was true :)
Time and the number of operations are equivalent here: as proof, just define the operation as "move an information-carrying photon a tiny distance episilon". That must take a finite amount of time, as the speed of light is finite, and the number of those operations must increase with the number of randomly accessed elements you're working with, as they're necessary simply to retrieve the element from memory.
Operations are usually defined as addition or multiplication or comparison. Moving a photon by epsilon isn't a valid operation in any architecture I'm aware of. Even if we use moving an electron by epsilon - you can't tell pentium to move one electron by exactly epsilon, it will move many at once, and it will move them by whatever it need to perform it's actual operations.
As for infinity - for all physically possible inputs the algorithm modified as described above will produce the same output as the algorithms that are considered correct by most people. If we care about infinities: any algorithm I've seen ever implemented was incorrect - most use integers or floats or doubles so their input space is very limited, and even the ones that use arbitrary length math - are run on machines with finite amount of memory.
Splitting hairs here. You can talk about operations or time. Same thing as operations are sequential. One operation follows another. You can count them when you are done to get to total and the total is also referred to as the "time" in this context.
> I've seen ever implemented was incorrect - most use integers or floats or doubles so their input space is very limited
So are we talking about specific hardware here or not. I thought we weren't. There ambiguity and discussion point is there because one can define what they consider is a "constant" operation. You can say it is a hypothetical von neuman architecture machine and these operations (op1, op2, ....) take a constant time. Now we compare two algorithms and see how they do.
The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers? Why not have more? After all, x86-64 and ARM64 decided that having more suited them.
In the end it must come down to a compromise, with the downsides of having more registers possibly being some of the following:
* Increased instruction set size (having to encode a larger register space in the bit patterns of each instruction)
* Increased latency for interrupts? e.g. if your CPU has 1000 registers and an interrupt occurs, you're going to end up having to save all those 1000 registers somewhere. There could be some HW-assist but you'll pay the price somewhere.
* Extra cost for saving registers in functions. Sure, depends upon the ABI as some registers will be 'scratch' and not preserved between function calls, but if you've got more registers you'll end up wanting to save more of them.
* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? etc. At some point, those extra registers could be unused.
* Registers still need to be 'spilled' to memory. In an extreme case, you could imagine compiling a small program where every variable maps to a unique register. Ultimate speed! But asides from that optimal case, you'll end up still having to write registers back to memory. It makes no difference having 100 registers if you store the results of every computation...
Anyway, that's all speculation. I was wondering if someone had done a study. You could construct a virtual, bespoke CPU with n registers, then make gcc compile some SPEC benchmarks using the ISA and model it to see how efficient having an extra register makes it. You could graph registers vs simulated runtime and see where the sweet spot is.
Here's a good thread discussing this:
Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.
Saving on context switches (interrupts alone aren't a big deal) was indeed a problem back when AltiVec was designed, thus it has a special register to keep track of which registers need to be saved. In modern designs this is less of a problem, between higher frequencies, multiple cores, and the other effects of a context switch dominating (effective flush of l1 cache and predictors).
The interesting bits nowadays are that load/store is expensive power-wise, which was what ARM identified as the major motivation behind having 32 registers (fewer spills in functions) and OoOe designs.
Ah, but I'm sure that if you have more registers available, you'd use more registers. Up to a certain point. But what point? Just how many registers?
In-order designs have more reasons to use more registers, but again they aren't going to use more registers unless they gain something.
TFA gives at least one reason:
> Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.
Registers use up a lot of silicon, and consume a lot of energy to power it. They also need to stay physically close to computing circuits, otherwise you end up with an L1 cache more than a register.
Furthermore, although ISA expose a number of registers A, OOO architectures (and their friends parallel and speculative executions) pretty much require the CPU to have > A registers and do register renaming, which lowers the number of registers the ISA can define. For instance the Alpha ISA defines 32 integer registers, but the Alpha 21264 had 80 physical integer registers.
Vector registers are another way to use 1000 registers.
But directly coding 1000 registers into each instruction does not seem to be such a good idea. You might as well use a 1st level cache. The difference between the cache and the register file ist mostly how the instruction set architecture references it. Registers are usually easier to acces because each one has a single name and the CPU can detect dependencies and conflicts easily. Memory accesses and caches are more complex because you need to calculate the addresses before you can detect dependencies/conflicts.
PD: Yet another way to use 1000 registers is massive multi-threading like the Tera MTA.
Registers that you name in assembly != physical registers. And when you use a register in two different instructions, you won't necessarily get the same physical register each time.
Say you have a simple function that is going to add 1 to a bunch of variables. In an ARM-like assembly code, this could be written as:
LDR r1, [r0, #0]
ADD r1, r1, #1
STR r1, [r0, #0]
LDR r1, [r0, #4]
ADD r1, r1, #1
STR r1, [r0, #4]
LDR r1, [r0, #8]
ADD r1, r1, #1
STR r1, [r0, #8]
LDR r1, [r0, #0]
ADD r1, r1, #1
STR r1, [r0, #0]
LDR r2, [r0, #4]
ADD r2, r2, #1
STR r2, [r0, #4]
LDR r3, [r0, #8]
ADD r3, r3, #1
STR r3, [r0, #8]
In other words, once you have enough 'real' registers does it get rid of the need for register renaming? Intel added it to their pentiums to improve existing x86 code, but I wonder if it has that much of a benefit with newer ISAs that have 'enough' registers and properly tuned compilers?
And no one does OoOe without register renaming.
Still, I wonder whether some of the features of modern CPUs could be dropped if it wasn't for legacy code. On the other hand, Itanium tried to push the parallelism work onto the compiler and look where that ended up!
But abstracting those you have your architectural registers that are presented by your ISA, and the CPU uses register renaming to map those onto the physical registers.
The tradeoffs involving ISA registers are more intense. You have to load and store all of them on thread swaps, but that's pretty tivial. More importantly the bits you have to use to specify which register you're using are bits that you're paying in every single instruction you have, increasing the size of your executable and the pressure on your caches.
Different sorts of architectures have their sweet spots at different places. In order processors doing lots of matrix math and such benefit from lots of architectural registers, the Itanium had 128 integer and 128 floating point registers and that was the right amount for a VLIW architecture with it's features. Modern GPUs are similar.
On the other hand, your typical OoO CPU will have either 16 or 32 registers you can address at a time, and that seems to be close to optimal. It's hard to say since instructions come in discrete chunks and your number of registers has to be a power of 2 as a practical matter.
Then there's practicalities like sense amp design. Large register arrays are not read in a digital fashion, and current L2 and L3 sizes already press their sense amps to their limits. DRAM also uses sense amps, but the amps are again slower and larger.
Let's say Bubba's watching the Super Bowl. The table in front of him are his registers, the fridge is cache, and the corner shop a quick walk away is memory.
He looks and see he doesn't have any beer on the table. So he goes to the fridge and gets what he wants, and comes back to the couch. Later, Bubba runs out of beer (useful data). This is a cache miss, so he has to go down to the corner store and get some beer. Instead of just getting what he wants, maybe he gets some Hungry-Man frozen dinners, in case he'll want some later. He goes back, puts the beer and TV dinner in the fridge, and brings some beers with him to the table. Next time he runs out of beer, he goes to the corner store, but they're all out of beer.
So he buys some seed, tills the fields, and grows his own barley. This is disk access.
Edit: Chill, everyone. Yes, it's "implementation detail of the substrate", but it is a very important implementation detail given that it is directly exposed to the programmer as memory, not in some automagically managed cache.
It's usually unused once the kernel has started but it can be mapped by the kernel later on if there's a use for it.
Can be implemented with any of these components: DRAM, SRAM, D-FF (flip-flops)
It's common for main memory (in embedded systems) and register files to use SRAM. But you can also implement the registers with flip-flop banks, and get something bulkier but faster. I'm not sure what Intel/AMD does.
Obviously Intel has been doing this for a while: Haswell has something like 168 integer registers, while the x86-64 ISA only exposes 16.
EDIT: Some Googling tells me that at least the Cortex-A9 mapped 32 architectural registers to 56 physical: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....
An interesting thing about Apples take on AArch64 in particular that some people have been speculating about is about how Apple's Cyclone core's memory subsystem works. ARM cores usually use the virtual (post-MMU) address of data to determine where in the cache data lives, but if you stick with page size as big or bigger than the L1 size you can start your L1 lookup at the same time you do your TLB lookup, and save a lot of latency. Apple's control of the OS is what lets them force 64K page sizes.
And a bit later : "3.Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code."
This may potentially explain why a smaller executable isn't necessarily faster when executing. I guess a lot of compiler gymnastics are devoted to breaking down complex instructions to take advantage of this.
As far as executable size and performance, compiling with -Os in GCC will occasionally yield a performance increase that might even change across CPU's and architectures as the memory sub-systems hit a good rhythm or there are less misses overall. Usually smaller is better for this. -O3 will occasionally unroll gigantic loops, while using compiler directed optimization to analyze which parts of a binary can benefit overall execution from unrolling vs less misses with smaller executable size can yield even better agreement between memory subsystem performance and execution speed.
Microarchitectures like MIPS have further blind alleys such as branch-delay slots that will finish execution even if a branch instruction -before- the slots is taken. This is an out-of-order program, but putting the burden on the compiler instead of implementing the reordering in hardware actually became a nuisance because the architecture couldn't change how it expected instructions without breaking binary compatibility and the compiler wouldn't have been able to tweak for different CPU's without a fat-binary approach.
For a long time though, the Linux kernel has been compiled to optimize code-size rather than 'performance' (according to GCC). Why? Because the kernel gets involved in every syscall the OS makes, so the kernel code gets paged in and out very frequently. Loading a little less code from RAM means everything goes faster.
I've done the instruction scheduling stuff by hand on paper; it's pretty interesting. We did Tomasulo scheduling, which is hardly modern, being developed in 1967, but it'll execute your instructions all sorts of ways.
Do any modern CPUs still use an approach like this?
It's like registers are refrigerators, RAM is like the grocery store around the corner, and Page faults are like the grocery stores in a neighboring town
We have faster ways of recalling notes today than we did in the past, you might say? Well, yeah. In many respects our ram is faster than registers of early computers, too. That all things have gotten faster doesn't change that things which were faster are still faster. (I'd be delighted to know examples where this radically changed somehow.)
Hell, if you know and optimize for registers and don't know why they're fast, you should be shot. Otherwise you're using a language that doesn't really give you control over registers why do you care?
Okay okay, I like reading about the blackbird and I know that I know nothing about how it really works other than lots of fuel. Still. Okay, I'm a Hypokrite.
However, this is part of what hyperthreading accomplishes. The OS gives the CPU two tasks ahead of time, then when one task stalls, the CPU can switch over to the other one and work on it for a while.
And you're right, it's missing a lot because I'm writing an article, not a book. It is fun to explore details, but ultimately you have to stop somewhere.
On the other hand, long periods of no cache misses followed by long periods of waiting after a cache miss are exactly what you expect from real code (especially optimized code). So I'd think that you'd have much bigger gains from that. The same goes for branch misprediction.
Of course it's hard to write about a complex topic, choose the right details, and make it all seem simple. Thumbs up for trying!
 for example, CONFIG_SCHED_SMT in linux
Interestingly, there is at least one academic proposal to do something like this , but I'm not aware of any real implementations.
And when that fails, it's exactly the use case for hyperthreading.
Assuming you place code also into registers...
If you squint hard enough registers are also a form of RAM, just closer to the processor and faster. A machine with only instruction execute and registers would still have a Harvard/Von Neumann architecture.
The reason why processors don't have more registers is because they are quite power hungry and they are not very dense. For a given chip area, D-RAM gives you more than 6x the capacity for less than half the power use. And no, you can't make registers with the same technology as D-RAM.
I've always viewed RAM as a kind of register cache, necessitated because registers are expensive to build and RAM, though expensive, is cheaper. I've heard registers these days are just a small bit of SRAM, but reaching into my way back machine in college, I seem to remember them being a different kind of memory element.
But RAM and all the caches these days leading up to registers are all require fetch from somewhere, store in the register, do the work, then write back the result somewhere (even if the instruction set obfuscates that). If you had enough registers, the fetch and store parts of that work are pretty much gone, turning something like
mov 0xaddressh-1 RegA
mov 0xaddressh-2 RegB
add RegA RegB RegC
mov RegC 0xaddressh-3
add 0xReg-1 0xReg-2 0xReg-3
where each mov we do today introduces a cascade down the cache and memory stack (perhaps even dipping into on-disk VM) just to copy a few bytes into a register. And we have to do that 3 times here. The number of adds we could do in the time it takes to do a mov is probably pretty high, but we simply can't do them because we're waiting on bits moving from one place to another.
So suppose money, power etc. weren't considered issues and engineering effort was put into a register-only approach, how much faster would that be? (one the reasons that the Von Neumann architecture became "the" way to do things was that registers were considered expensive to build, but what if we didn't care about money?)
I'd bet a general purpose system built this way would be an order of magnitude faster than anything we have today. But you're right, it would be an enormous resource hog and be expensive as a medium-sized mega yacht.
The problem with modern chips is not money, or the price of energy. The poor things would simply fry (or explode) if we were to make them like you suggest and clock them at competitive frequency.
Not worth reading.
Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.
Because of its lower price, DRAM has become the mainstream in computer main memory despite being slower and more power hungry compared to SRAM. SRAM memory is still used in a lot of devices where speed is more crucial than capacity. The most prominent use of SRAM is in the cache memory of processors where speed is very essential, and the low power consumption translates to less heat that needs to be dissipated.
In any case, SRAM can be more power hungry that DRAM (per bit) but it can also be vastly less. SRAM power consumption is not at all driven by the fact that registers are "continuously powered". Accessing SRAM is the power hungry operation, but powering requirements otherwise are negligible. If anything, it's the DRAM that requires constant powering (refreshing).
Absolutely, except for the part where he mentions them
> If you're really lucky and the value is in L1 cache, it'll only take a few cycles.