The really exciting part here, to me, is the idea of fabricating large amounts of nonvolatile memory on top of a CPU. Modern processors already spend a huge amount of their time waiting on memory, and a great deal of power trying to hide that memory latency. If these guys can lower memory latency dramatically -- and it looks like they can -- computers would get a lot faster.
As the number of cores per socket steadily increases, it is becoming more difficult to keep all of the cores on a single computer operating efficiently. A higher clock rate per core, or more cores per socket, is not very useful if they are all waiting on memory.
Now imagine a large amount of memory connected to the cores through tiny metal wires on the chip itself. Now imagine that it's split into a bunch of small independent memories with enormous bandwidth, with data automatically migrating between them to cut down on wire delays. Give it a few years, and this could be reality.
I watched the whole 47 minutes of that talk on YouTube, and it is actually really good. I have no qualifications in this field, but the talk is full of challenging ideas. You can skip reading the article.
All of those years spent trying to educate my family and non-technical friends on the difference between RAM and HDD and explaining how it is confusing when they tell me they couldn't install a program because their laptop was "out of memory".... wasted!
I think the greatest impact of the memristor is in the consumer space. When you have a 1000-fold (or million-fold) quantitative change you are bound to have a qualitative change as well...
What would a computer with petabytes of non-volatile RAM look like? How would you package software for it? Will it reboot from time to time? What does a reboot mean when your memory is non-volatile? I have my hunches, but the effort to find answers to my few questions is dwarfed by the effort to find out what the relevant questions will be in this brave new world and if we can answer them. Or even understand them.
It operates by pushing data back and forth across a single bus. Backus argued that this "Von Neumann bottleneck" of not being able to access program and data at the same time ought to be done away with somehow in his "Can Programming Be Liberated from the von Neumann Style?": http://www.stanford.edu/class/cs242/readings/backus.pdf
I... wish I understood. I'm just a lowly Blub programmer.
An executable is often less than 1MB, and almost always less than 100MB. In contrast, a 14GB video game still loads pretty quickly, and the data that goes across the bus per frame is often a couple of orders of magnitude larger than the executable itself.
The Von Neumann architecture refers to the idea of a computer that has a CPU with a separate memory which stores both programs and data (as pretty much all computers do today). In this type of system the bus between memory and CPU becomes a bottleneck. A non-Von-Neumann architecture might look more like the brain, which doesn't have a CPU at all, but instead colocates processing with memory, eliminating the "memory bus" bottleneck and enabling massive parallelism.
Modern computers are extremely fast, and can still do tremendous amounts of computation despite the limitations of the Von Neumann architecture. But make no mistake, the Von Neumann bottleneck is a serious and fundamental problem. The CPU has to spend a lot of effort shuttling data back and forth. Worse yet, it has to spend a lot of time waiting on data (swapping to/from disk, for example). Even when your CPU is at 100% utilization, the vast majority of cycles are spent doing nothing but waiting. That has huge ramifications for everything from performance to power efficiency.
Consider a typical snippet of a CPU's life. The next instruction is read from memory; it tells the CPU to move a value from memory into a register. The next instruction after that is read from memory; it tells the CPU to move a different value from memory into a different register. The next instruction is read from memory; it tells the CPU to do some operation on the values in those two registers. The next instruction is read from memory; it tells the CPU to test whether the result of the previous instruction is 0 and, if so, jump to a specific address. Since it was, the CPU fetches the next instruction from that location in memory. And so on. You only have to follow this process for a little while to see how tedious it is. We've managed to improve it significantly by adding fast local memory caches to the CPU, but even if the memory operated at the speed of the CPU it would still be inefficient.
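That fetch-decode-execute grind is easy to see in a toy simulator. Everything below -- the four-instruction "ISA" (LOAD/ADD/JZ/HALT), the memory layout, the addresses -- is made up purely for illustration, but notice that every single step, including fetching the instruction itself, is a trip through the one shared `memory`:

```python
# Toy model of the fetch-decode-execute cycle described above.
# Instruction set and memory layout are invented for illustration.

def run(memory):
    regs = {"r0": 0, "r1": 0}
    pc = 0
    fetches = 0  # count how many instruction fetches hit memory
    while True:
        instr = memory[pc]        # fetching the instruction is itself a memory read
        fetches += 1
        op = instr[0]
        if op == "LOAD":          # LOAD reg, addr -> yet another memory read
            _, reg, addr = instr
            regs[reg] = memory[addr]
        elif op == "ADD":         # ADD dst, src  -> register-only work
            _, dst, src = instr
            regs[dst] += regs[src]
        elif op == "JZ":          # jump to target if r0 == 0
            _, target = instr
            if regs["r0"] == 0:
                pc = target
                continue
        elif op == "HALT":
            return regs, fetches
        pc += 1

program = [
    ("LOAD", "r0", 8),    # fetch X from address 8
    ("LOAD", "r1", 9),    # fetch Y from address 9
    ("ADD", "r0", "r1"),  # r0 = X + Y
    ("JZ", 0),            # test result and (maybe) branch
    ("HALT",),
    None, None, None,     # padding
    5, 7,                 # the data lives in the same memory as the code
]
regs, fetches = run(program)
print(regs["r0"], fetches)  # prints: 12 5
```

One addition cost five instruction fetches plus two data reads, all over the same memory -- which is the bottleneck in a nutshell.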
Now, imagine if instead of megabytes of low-latency cache you had gigabytes. Now, imagine if instead of having a low-latency cache at all, the processor were directly wired to the RAM, as if the RAM were just a large collection of registers. Instead of "fetch me X, fetch me Y, add X + Y, put the result back in Z", all of that could be a single CPU instruction. Moreover, it would be far, far rarer for the CPU to be waiting on data merely due to local latency. This would improve the effective computing power and power efficiency of CPUs by several orders of magnitude. The impact it would have on computing is truly mind-boggling.
Let me express it in a different way. Imagine if your cell phone had the same raw computing power as a top of the line GPU does today, with the same battery life and with the same transistor count and clock speed on the CPU, just with a different architecture and different RAM.
I think... Maybe... I'm getting it. Kind of. Probably not.
By wiring the CPU directly to the RAM, to use your metaphor, then we can entirely bypass the ASM stage of "a program" (but then what is a program if not a sequence of instructions?) and therefore we may better predict which data our program needs at runtime? Thereby caching that data more effectively than the random access patterns of Von Neumann?
Basically, instead of "accessing a pointer causes its data to be cached into L1", it would be... Well, I have no idea. Something else?
Here are my points of confusion, sorry:
1) in this non-Neumann paradigm, there will still be "data", in the traditional sense, right? (Or is "everything a program"?)
2) then... There will surely still be "caches" for that data, yeah? (Or is that what I'm missing? But without caches, I don't understand how it could be faster.)
But yeah, I don't want to waste anyone's time... certainly not anyone of you guys' caliber. Don't feel compelled/obligated to reply or anything. :)
When you wire the RAM to the CPU you don't need a cache. Imagine you have a billion or even a trillion registers, or more. That's a non-Von Neumann architecture. You're not shuffling data around on buses, the data is directly connected to the CPU.
Look at the example I gave again. Consider a simple addition command. The first CPU instruction says "take the word at this memory address, and move it to a register", the second does the same with a different address, the third adds the two values in the registers, the fourth then puts the result back in some other memory location. But what if there's no difference between the memory and registers? Instead you just have one instruction that says: add the values at these two locations, put the result at this other location. Now you've replaced 4 clock ticks with one clock tick. More than that, you save however many clock ticks it would have taken on average for the data to get to / from main memory (sometimes cached, sometimes not). Such an architecture would mean that you only have to wait on things you really have to wait on, like network and device latency, etc.
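Here is the same comparison as a rough cost model. The cycle counts are invented round numbers for the sake of the argument (1 cycle per instruction, a hypothetical 100 extra cycles per trip to main memory), not measurements of any real chip:

```python
# Rough cost model for the addition example above.
# All cycle counts are illustrative assumptions, not hardware specs.

MEM_LATENCY = 100  # assumed extra cycles per main-memory access

def von_neumann_add():
    # LOAD X; LOAD Y; ADD; STORE Z -> 4 instructions, 3 memory round trips
    return 4 * 1 + 3 * MEM_LATENCY

def memory_as_registers_add():
    # one instruction: add [A], [B] -> [C]; memory is as fast as registers,
    # so there are no separate loads/stores to pay for
    return 1

print(von_neumann_add(), memory_as_registers_add())  # prints: 304 1
```

The exact numbers don't matter; the point is that collapsing the load/operate/store dance into one memory-speed instruction removes both the extra instructions and the waiting.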
The structure of programs need not be terribly different, per se; a program can still be a sequence of instructions in memory. There are other non-Von-Neumann architectures that would work differently (such as neural networks), but those are even more complicated.
Except addressing that amount of memory is still going to need a bus, it doesn't matter if the memory is sitting right on top of the CPU core or in the next room. It simply isn't going to be possible to provide direct access to every single memory cell when there are billions of them. This is still going to be a von Neumann (actually Modified Harvard) architecture, it's just going to be blazingly fast.
Now, once we start applying memristor implicational logic to data processing we will have truly left the confines of the von Neumann architecture.
Don't need a cache? The larger your memory is, the greater the access latency will be, even if it's directly on the CPU die. That's why L1 and L2 caches tend to be around 2x32 KB (instruction + data) and 256 KB, respectively. Most of the cache access time comes from the wire delays of sending signals around, and the larger the cache is, the longer the wire delays will be.
If you have a bunch of memory directly on the CPU, caching will still give significant speedups.
It's a good question. Ideally, in a perfect world, your entire computer would be nothing but monstrously fast CPU registers.
But because building those (and the associated support architecture) is much more expensive than slower solutions, we've ended up with a stack of slower, cheaper memories to handle datasets that are too large for the register space, then too large for RAM, and so on.
Right now data has to be moved up that stack of faster, but smaller, memories before it can be worked on (nonvolatile storage like a HD->volatile storage like RAM->CPU register) and then back down that stack to store the result (register->RAM->HD).
That movement has to happen across the system bus, which moves data one word at a time -- familiar sizes like 8 bits, 16 bits, 32 bits, 64 bits, etc., though some older systems had oddball word lengths like 7 bits or 36 bits. In other words, you don't move 2GB of data in one shot from your hard drive into RAM. You have to do it in chunks of "words" -- 32 or 64 bits at a time -- over and over again until you've moved all the data into memory.
However, RAM is hideously slow compared to CPU registers. I don't know the numbers off the top of my head, but let's say moving data from one part of RAM to another takes 160 CPU cycles (80 to read, 80 to write). That sounds impressively fast on a modern 3.6GHz computer, but it means you can only move 22.5 million "words" around per second.
By comparison, moving data from one register to another might take 1 cpu cycle.
Furthermore, you can't operate on data in RAM; you have to move it into a register anyway, do the operation, then move the result back out into RAM. So we might be restricted to slightly less than 22.5 million operations per second -- which is pathetically slow.
Clever compilers (and ASM coders) will try to keep things in register space as long as possible to avoid this, getting closer to 1 operation per CPU cycle. And modern CPUs have a number of enhancements that also help with this (pipelining, instruction optimization, etc.).
But the most important bit is caches. Caches are designed to take fewer cycles to store/retrieve data than RAM; they sit transparently between register space and RAM, holding working sets that are too large to fit into register space but are still being worked on, so shouldn't end up back in RAM yet. For the sake of argument, let's say it takes 20 CPU cycles to read/write from a cache. If we're operating on data that fits in the cache, then we can do 180 million operations per second instead of 22.5 million. But if the working data is too big to fit into the cache, we get a cache miss and end up having to go back to RAM -- and now we've added 20 CPU cycles to the data round trip, giving us 180 cycles to read that data and write it back out.
Because the speed difference between RAM and CPUs is so great, multi-level caches -- each slightly larger, but taking slightly more cycles to retrieve data than the cache above it -- sit below L1 in the cache stack. Even if it takes 30 cycles to read/write some data from an L2 cache (which would only happen on a miss in the L1 cache), 20+30 is still far faster than 180 for RAM. And so on. Today we have L1, L2 and L3 caches, all designed to keep the CPU from waiting on RAM.
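Working through the example numbers used above (all of them illustrative round figures from this comment, not real hardware specs):

```python
# Ops/sec implied by the for-the-sake-of-argument cycle counts above.
CLOCK_HZ = 3.6e9  # the 3.6GHz machine in the example

def ops_per_sec(cycles_per_op):
    return CLOCK_HZ / cycles_per_op

ram_only       = ops_per_sec(160)       # 80 to read + 80 to write
l1_hit         = ops_per_sec(20)        # data fits in cache
l1_miss_l2_hit = ops_per_sec(20 + 30)   # miss L1, served by L2
l1_miss_to_ram = ops_per_sec(20 + 160)  # the "180 cycles" worst case

print(ram_only / 1e6, l1_hit / 1e6)  # prints: 22.5 180.0
```

Which is where the 22.5 million and 180 million figures come from, and why even a slow second-level cache (72 million ops/sec here) beats going to RAM.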
In other words, all designed to try to overcome the delays of moving data across the system bus and into register space that the von Neumann architecture introduces.
The system stack these days is something more like HD->RAM->L3->L2->L1->registers, with the caveat that moving from one level of the stack to another probably requires moving across the system bus, which again is likely 32 or 64 bits wide these days. That's 4 or 8 bytes at a time.
With today's multi-gigabyte datasets, that's a ton of data moving across the bus, little of which will fit into any number of caches, slowly eking its way into register space so the CPU can do something like adding two numbers together. At 3+GHz, the CPU is mostly just sitting around waiting on data to make its way up or down this ridiculous stack that's all been designed to accommodate the von Neumann design.
If we had only one kind of memory, and it was fast, and the CPU could directly operate on any part of that memory like it was register space -- eliminating the difference between slow-but-large non-volatile memory (HD) and faster-but-smaller volatile memory (RAM) -- computers wouldn't do 180 million operations per second on a good day, they'd do closer to the 3.6 billion they're capable of.
Great explanation, but we should distinguish between memory bandwidth -- how many bytes we can read or write per second -- and memory latency, how long it takes to load or store some memory location. Bandwidth is actually pretty fast these days; the latency is what sucks.
So, for example, if you want to add two large streams of numbers (e.g. dense matrices) together, a CPU can do this pretty quickly, because it can fetch the memory in bulk and not need to incur much latency penalty. (It can also avoid polluting the cache, with the correct hint instructions.)
On more typical workloads, though, where you've got a lot of harder-to-predict memory access, what would really come in handy is lower memory latency. And if this type of RAM works, it will give us both: huge bandwidth, and dramatically lower latency.
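The latency-vs-bandwidth distinction can be made concrete with Little's law: the bandwidth a single stream of dependent accesses can achieve is (outstanding requests x bytes per request) / latency. The figures below are illustrative assumptions, not measurements:

```python
# Little's law applied to memory: achievable bandwidth is limited by how
# many requests you can keep in flight per unit of latency.
# All numbers here are assumed round figures for illustration.

line_bytes = 64        # typical cache line size
latency_s = 80e-9      # assumed ~80 ns round trip to DRAM
outstanding = 10       # assumed in-flight misses one core can sustain

bandwidth = outstanding * line_bytes / latency_s
print(bandwidth / 1e9)  # GB/s a latency-bound core can extract
```

With hard-to-predict accesses you can't keep many requests in flight, so effective bandwidth collapses to a small fraction of what the memory system can stream -- which is why cutting latency helps exactly the workloads that bulk prefetching can't.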
The system stack, instead of being HD->RAM->L3->L2->L1->registers, could look more like "Hard drive -> Giant shared non-uniform nonvolatile L3 cache -> shared or core-local L2 cache -> core-local L1 cache -> registers".
What are the implications for back end development? Will this greatly reduce server complexity (need for redundancy)? Could AWS just give you a machine in the cloud, and you wouldn't need to worry about databases and database backups and so on, you could just keep all your data locally in natural data structures?
This would not necessarily affect back end development as you would still want high availability for the application by using an active-active setup across two availability zones in EC2.
With regard to your question on data structures, yes - they would have to change. MySQL and most other databases use b-trees since they were meant to live on disk. Just running MySQL on memristor-backed storage would result in a considerable waste of CPU and storage capacity (b-trees are not very compact).
Running a database like MemSQL on memristors would make the most sense since it uses data structures meant for DRAM.
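The disk orientation of b-trees shows up in a quick back-of-the-envelope: a node is sized to a storage page, so one I/O pulls in hundreds of keys and the tree stays very shallow. The page and key sizes below are typical round numbers, not MySQL's actual defaults:

```python
# Why b-trees suit disk: fanout = keys per page = keys per I/O,
# so even a billion keys is only a handful of page reads deep.
# Page/key sizes are assumed round numbers for illustration.
import math

def btree_height(n_keys, page_bytes=16384, key_bytes=64):
    fanout = page_bytes // key_bytes  # ~256 keys per node here
    return math.ceil(math.log(n_keys, fanout))

print(btree_height(1_000_000_000))  # prints: 4
```

Four page reads to find one of a billion keys is great when each read costs milliseconds; it's wasted structure (and wasted space in half-full nodes) when every byte is already at memory speed, which is why memory-oriented stores reach for skip lists, hash indexes, and similar structures instead.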
Ok. So your computer is in lock-screen mode requiring you to enter a password before resuming your interactive session. Someone with physical access to your computer certainly can find a way to divorce the memory from the rest of the system without letting the OS do its thing ... your live program memory is compromised. This memory often will have more sensitive info than your (possibly encrypted) mass storage.
Firstly, the fabrication process everyone is excited about puts the memristor cells directly atop the CMOS logic gates, as just another layer of the die. No external memory bus to tap into. So you'd need the sort of equipment used to remove layers from dies to expose the cells for reverse engineering, testing, etc. If someone with those resources is after you, you're already fucked regardless of the exact technological vector.
Secondly, a system designer could trivially add some amount of volatile storage for holding security-sensitive data. Various schemes of encrypted pages that are decrypted using a key stored in volatile SRAM within the CMOS layer could be used. In other words, we could do the same things we currently do with hard drives, but between the memristor array and SRAM acting as a decrypted cache.
Thirdly, you're applying an expectation to all memristors that is not applied to existing storage technologies. It might be a fair criticism of a concrete product that operates in an insecure way, but it's absurd to apply this expectation to an entire technology.
This is already possible with existing DRAM memory; simply cool the RAM chips sufficiently and you increase data retention enough to extract the RAM sticks and read back the data. However, if someone has physical access to your PC, there is almost certainly an easier way for them to gain access to your data. The best protection is restricting physical access.
It's an estimate, not an exact length of time. 18 months takes us to April 2013, so maybe they rounded April into "summer", or maybe they rounded 20 months to the nearest half year. Or maybe they class April as summer anyway.