It operates by pushing data back and forth across that bus. In his paper "Can Programming Be Liberated from the von Neumann Style?" (http://www.stanford.edu/class/cs242/readings/backus.pdf), Backus argued that this "Von Neumann bottleneck", the inability to access program and data at the same time, ought to be done away with somehow.
I... wish I understood. I'm just a lowly Blub programmer.
An executable is often less than 1MB, and rarely more than 100MB. In contrast, a 14GB video game still loads pretty quickly, and the data that goes across the bus per frame is often a couple of orders of magnitude larger than the program executable itself.
Modern computers are extremely fast, and can still do tremendous amounts of computation despite the limitations of the Von Neumann architecture. But make no mistake, the Von Neumann bottleneck is a serious and fundamental problem. The CPU has to spend a lot of effort shuttling data back and forth. Worse yet, it has to spend a lot of time waiting on data (swapping to/from disk, for example). Even when your CPU is at 100% utilization, a huge fraction of its cycles are spent doing nothing but waiting. That has huge ramifications, affecting everything from performance to power efficiency.
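To make that waiting concrete, here's a rough C sketch (mine, purely illustrative, not a rigorous benchmark): both loops perform the same 64M additions over the same array, but one walks it in an order the hardware prefetcher can predict and the other jumps around pseudo-randomly. The runtime difference is time the CPU spends stalled waiting on memory.

    /* Illustrative sketch: identical work, different memory access order.
     * The sequential loop streams at memory bandwidth; the pseudo-random
     * loop stalls on a cache miss for almost every access. Numbers vary
     * by machine. Uses POSIX clock_gettime; compile with e.g. `cc -O2`. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 26)  /* 64M ints, ~256 MB: far bigger than any cache */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        int *a = malloc(N * sizeof *a);
        if (!a) return 1;
        for (long i = 0; i < N; i++) a[i] = (int)i;

        /* Predictable order: the prefetcher hides most of the latency. */
        double t0 = now_sec();
        long long seq = 0;
        for (long i = 0; i < N; i++) seq += a[i];
        double t1 = now_sec();

        /* Pseudo-random order (a full-period LCG over the same indices, so
         * both loops sum exactly the same elements), but now nearly every
         * access is a cache miss the CPU must wait out. */
        long long rnd = 0;
        unsigned long idx = 1;
        for (long i = 0; i < N; i++) {
            rnd += a[idx];
            idx = (idx * 6364136223846793005UL + 1442695040888963407UL) & (N - 1);
        }
        double t2 = now_sec();

        printf("sequential: %.2fs  random: %.2fs  (sums %lld, %lld)\n",
               t1 - t0, t2 - t1, seq, rnd);
        free(a);
        return 0;
    }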
Consider a typical snippet of a CPU's life. The next instruction is read from memory; it tells the CPU to move a value from memory into a register. The next instruction after that is read from memory; it tells the CPU to move a different value from memory into a different register. The next instruction is read from memory; it tells the CPU to do some operation with the values in those two registers. The next instruction is read from memory; it tells the CPU to test whether the result of the previous instruction is 0, and if it was, to jump to a specific address. Since it was, the CPU fetches the next instruction from that location in memory. And so on. You only have to follow this process for a little while to see how tedious it is. We've managed to improve it significantly by adding fast local memory caches to the CPU, but even if the memory operated at the speed of the CPU it would still be inefficient.
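Spelled out in code, that little slice of life looks like the snippet below (my own rendering; the register names and per-step traffic in the comments are generic, not any particular ISA). Note that every single step begins with its own trip to memory just to fetch the instruction.

    /* Each comment line is one instruction on a typical load/store machine,
     * and each one starts with an instruction fetch from memory. Register
     * names (r0, r1, r2) are illustrative, not a specific ISA. */
    #include <stdio.h>

    int x = 2, y = -2, z;

    int main(void) {
        z = x + y;      /* fetch insn; load x into r0      (insn read + data read)
                           fetch insn; load y into r1      (insn read + data read)
                           fetch insn; r2 = r0 + r1        (insn read)
                           fetch insn; store r2 into z     (insn read + data write) */
        if (z == 0)     /* fetch insn; test r2 against 0 and branch if equal:
                           the next instruction fetch comes from the branch target */
            printf("took the branch\n");
        return 0;
    }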
Now, imagine if instead of megabytes of low-latency cache you had gigabytes. Better yet, imagine if instead of having a cache at all, the processor were directly wired to the RAM, as if the RAM were just a large collection of registers. Instead of "fetch me X, fetch me Y, add X + Y, put the result back in Z", all of that could be a single CPU instruction. Moreover, it would be far, far rarer for the CPU to be waiting on data merely due to local latency. This would improve the effective computing power and power efficiency of CPUs by several orders of magnitude. The impact it would have on computing is truly mind-boggling.
Let me put it another way. Imagine if your cell phone had the same raw computing power as a top-of-the-line GPU does today, with the same battery life and with the same transistor count and clock speed on the CPU, just with a different architecture and different RAM.
I think... Maybe... I'm getting it. Kind of. Probably not.
By wiring the CPU directly to the RAM, to use your metaphor, we can entirely bypass the ASM stage of "a program" (but then what is a program, if not a sequence of instructions?), and therefore we may better predict which data our program needs at runtime? Thereby caching that data more effectively than the random access patterns of Von Neumann?
Basically, instead of "accessing a pointer causes its data to be cached into L1", it would be... Well, I have no idea. Something else?
Here are my points of confusion, sorry:
1) in this non-Neumann paradigm, there will still be "data", in the traditional sense, right? (Or is "everything a program"?)
2) then... There will surely still be "caches" for that data, yeah? (Or is that what I'm missing? But without caches, I don't understand how it could be faster.)
But yeah, I don't want to waste anyone's time... certainly not anyone of your guys' caliber. Don't feel compelled/obligated to reply or anything. :)
When you wire the RAM directly to the CPU, you don't need a cache. Imagine you have a billion or even a trillion registers, or more. That's a non-Von Neumann architecture. You're not shuffling data around on buses; the data is directly connected to the CPU.
Look at the example I gave again. Consider a simple addition. The first CPU instruction says "take the word at this memory address and move it to a register"; the second does the same with a different address; the third adds the two values in the registers; the fourth puts the result back in some other memory location. But what if there were no difference between memory and registers? Then you'd have a single instruction that says: add the values at these two locations and put the result at this other location. Now you've replaced four clock ticks with one. More than that, you save however many clock ticks it would have taken, on average, for the data to get to/from main memory (sometimes cached, sometimes not). Such an architecture would mean you only have to wait on things you really have to wait on, like network and device latency.
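Here's that comparison as a toy cost model in C (entirely my own sketch; the cycle costs are made up and only the ratio matters): the load/store version pays an instruction fetch per step plus a transfer per operand, while the hypothetical memory-as-registers version does the whole add in a single step.

    /* Toy cost model for the add example above. FETCH is the cost of
     * reading one instruction, MEM the cost of moving one word between
     * memory and a register. All numbers are hypothetical. */
    #include <stdio.h>

    enum { FETCH = 1, MEM = 100, ALU = 1 };

    /* Von Neumann load/store version: four instructions, three transfers. */
    static void add_load_store(int mem[], int xa, int ya, int za, long *cycles) {
        int r0, r1, r2;
        *cycles += FETCH + MEM; r0 = mem[xa];   /* load X into a register */
        *cycles += FETCH + MEM; r1 = mem[ya];   /* load Y into a register */
        *cycles += FETCH + ALU; r2 = r0 + r1;   /* add the registers      */
        *cycles += FETCH + MEM; mem[za] = r2;   /* store the result at Z  */
    }

    /* Hypothetical memory-as-registers version: one instruction, no moves. */
    static void add_direct(int mem[], int xa, int ya, int za, long *cycles) {
        *cycles += FETCH + ALU; mem[za] = mem[xa] + mem[ya];
    }

    int main(void) {
        int mem[8] = { 2, 40 };     /* X at address 0, Y at address 1 */
        long c1 = 0, c2 = 0;
        add_load_store(mem, 0, 1, 2, &c1);
        add_direct(mem, 0, 1, 3, &c2);
        printf("load/store: %ld cycles, direct: %ld cycles (results %d, %d)\n",
               c1, c2, mem[2], mem[3]);
        return 0;
    }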
The structure of programs need not be terribly different per se; they can still be a sequence of instructions in memory. There are other non-Von Neumann architectures that would work differently (such as neural networks), but those are even more complicated.
Except addressing that amount of memory is still going to need a bus; it doesn't matter if the memory is sitting right on top of the CPU core or in the next room. It simply isn't going to be possible to provide direct access to every single memory cell when there are billions of them: every directly wired cell would need its own path into the execution units, so at some point you're back to address decoding, and address decoding is a bus. This is still going to be a von Neumann (actually Modified Harvard) architecture; it's just going to be blazingly fast.
Now, once we start applying memristor implication logic to data processing, we will have truly left the confines of the von Neumann architecture.
Don't need a cache? The larger your memory is, the greater the access latency will be, even if it's directly on the CPU die. That's why L1 and L2 caches tend to be around 2x32 KB (separate instruction and data caches) and 256 KB, respectively. Most of the cache access time comes from the wire delays of sending signals around, and the larger the cache is, the longer the wire delays will be.
If you have a bunch of memory directly on the CPU, caching will still give significant speedups.
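That size/latency relationship is easy to observe yourself. Here's a rough pointer-chasing sketch in C (mine; the exact numbers will differ per machine): each load's address comes from the previous load, so nothing can be overlapped, and the measured latency per access steps up as the working set outgrows each cache level.

    /* Rough sketch of access latency vs. working-set size. Each load's
     * address depends on the previous load, so the latency can't be hidden.
     * Expect a step up roughly at each cache boundary. Uses POSIX
     * clock_gettime; compile with e.g. `cc -O2`. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static unsigned long rng_state = 88172645463325252UL;

    static size_t rnd(size_t bound) {   /* small LCG, fine for shuffling */
        rng_state = rng_state * 6364136223846793005UL + 1442695040888963407UL;
        return (size_t)((rng_state >> 33) % bound);
    }

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        const long steps = 1L << 24;
        for (size_t kb = 16; kb <= 64 * 1024; kb *= 4) {
            size_t n = kb * 1024 / sizeof(size_t);
            size_t *next = malloc(n * sizeof *next);
            if (!next) return 1;

            /* Sattolo's algorithm: one random cycle through all n slots, so
             * the chase touches everything and the prefetcher can't guess. */
            for (size_t i = 0; i < n; i++) next[i] = i;
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = rnd(i);
                size_t t = next[i]; next[i] = next[j]; next[j] = t;
            }

            double t0 = now_sec();
            size_t p = 0;
            for (long s = 0; s < steps; s++) p = next[p];
            double dt = now_sec() - t0;

            printf("%6zu KB: %5.2f ns/access (p=%zu)\n", kb, dt / steps * 1e9, p);
            free(next);
        }
        return 0;
    }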
The Von Neumann architecture refers to the idea of a computer that has a CPU with a separate memory which stores both programs and data (as pretty much all computers do today). In this type of system the bus between memory and CPU becomes a bottleneck. A non-Von-Neumann architecture might look more like the brain, which doesn't have a CPU at all, but instead colocates processing with memory, eliminating the "memory bus" bottleneck and enabling massive parallelism.