"So, I mean, we're all of a sudden talking petabits of memory in a square centimeter device. What can you do with that? Interesting to think about."
Also, application-specific processing... writing instruction sets for each application will be really interesting. The Cell processor et al. will be a thing of the past.
I'm especially curious how Rich Hickey's approach to state, time, and identity in a functional language would fit in.
I hope Meg Whitman does better than Leo Apotheker. Shouldn't be hard.
What would a computer with petabytes of non-volatile RAM look like? How would you package software for it? Will it reboot from time to time? What does a reboot mean when your memory is non-volatile? I have my hunches, but the effort to find answers to my few questions is dwarfed by the effort to find out what the relevant questions will be in this brave new world and if we can answer them. Or even understand them.
An executable is often less than 1MB, and certainly always less than 100MB. In contrast, a 14GB video game still loads pretty quickly, and the data that goes across the bus per frame is often a couple orders of magnitude larger than the program executable itself.
I know I'm missing something obvious...
Consider a typical snippet of a CPU's life. The next instruction is read from memory; it tells the CPU to move a value from memory into a register. The next instruction after that is read from memory; it tells the CPU to move a different value from memory into a different register. The next instruction is read from memory; it tells the CPU to do some operation with the values in those two registers. The next instruction is read from memory; it tells the CPU to test whether the result of the previous instruction is 0, and if it was, to jump to a specific address. Since it was, the CPU fetches the next instruction from that location in memory. And so on. It only takes following this process for a little while to see how tedious it is. We've managed to improve it significantly by adding fast local memory caches to the CPU, but even if the memory operated at the speed of the CPU it would still be inefficient.
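That fetch/decode/execute loop can be sketched as a toy interpreter. The instruction set here (LOAD/ADD/JZ/HALT) is invented purely for illustration, not any real ISA; the point is that every single step starts with yet another trip to the same memory that holds the data:

```python
# Toy von Neumann machine: instructions and data share one memory,
# and every step begins with an instruction fetch from that memory.
def run(memory, registers):
    pc = 0  # program counter
    while True:
        op = memory[pc]                      # fetch (one memory access)
        if op == "LOAD":                     # LOAD reg, addr
            _, reg, addr = memory[pc:pc + 3]
            registers[reg] = memory[addr]    # a second memory access
            pc += 3
        elif op == "ADD":                    # ADD dst, a, b (registers only)
            _, dst, a, b = memory[pc:pc + 4]
            registers[dst] = registers[a] + registers[b]
            pc += 4
        elif op == "JZ":                     # JZ reg, addr: jump if zero
            _, reg, addr = memory[pc:pc + 3]
            pc = addr if registers[reg] == 0 else pc + 3
        elif op == "HALT":
            return registers

# Program: load the words at addresses 100 and 101, add them, halt.
prog = ["LOAD", "r0", 100, "LOAD", "r1", 101, "ADD", "r2", "r0", "r1", "HALT"]
mem = prog + [0] * (100 - len(prog)) + [2, 3]   # data lives at 100 and 101
regs = run(mem, {})
print(regs["r2"])  # prints 5
```

Even this tiny program costs three instruction fetches plus two data fetches before any arithmetic happens.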
Now, imagine if instead of megabytes of low-latency cache you had gigabytes. Now, imagine if instead of having a low-latency cache at all, the processor were directly wired to the RAM, as if the RAM were just a large collection of registers. Instead of "fetch me X, fetch me Y, add X + Y, put the result back at Z", all of that could be a single CPU instruction. Moreover, it would be far, far rarer for the CPU to be waiting on data merely due to local latency. This would improve the effective computing power and power efficiency of CPUs by several orders of magnitude. The impact it would have on computing is truly mind-boggling.
Let me express it in a different way. Imagine if your cell phone had the same raw computing power as a top of the line GPU does today, with the same battery life and with the same transistor count and clock speed on the CPU, just with a different architecture and different RAM.
By wiring the CPU directly to the RAM, to use your metaphor, we can entirely bypass the ASM stage of "a program" (but then what is a program, if not a sequence of instructions?) and therefore better predict which data our program needs at runtime? Thereby caching that data more effectively than the random access patterns of von Neumann?
Basically, instead of "accessing a pointer causes its data to be cached into L1", it would be... Well, I have no idea. Something else?
Here are my points of confusion, sorry:
1) in this non-Neumann paradigm, there will still be "data", in the traditional sense, right? (Or is "everything a program"?)
2) then... There will surely still be "caches" for that data, yeah? (Or is that what I'm missing? But without caches, I don't understand how it could be faster.)
But yeah, I don't want to waste anyone's time... certainly not anyone of your guys' caliber. Don't feel compelled/obligated to reply or anything. :)
When you wire the RAM to the CPU you don't need a cache. Imagine you have a billion or even a trillion registers, or more. That's a non-Von Neumann architecture. You're not shuffling data around on buses, the data is directly connected to the CPU.
Look at the example I gave again. Consider a simple addition. The first CPU instruction says "take the word at this memory address and move it to a register", the second does the same with a different address, the third adds the two values in the registers, and the fourth puts the result back in some other memory location. But what if there's no difference between memory and registers? Instead you just have one instruction that says: add the values at these two locations and put the result at this other location. Now you've replaced 4 clock ticks with one. More than that, you save however many clock ticks it would have taken, on average, for the data to get to/from main memory (sometimes cached, sometimes not). Such an architecture would mean you only have to wait on things you really have to wait on, like network and device latency.
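A rough tally makes the difference concrete. The costs below are illustrative (not from any real ISA); the key observation is that on a load/store machine each of the four instructions is itself fetched from memory, on top of the data traffic:

```python
# Tally for "z = x + y" on two hypothetical machines (illustrative costs).

# Load/store machine: LOAD r0,x ; LOAD r1,y ; ADD r2,r0,r1 ; STORE z,r2
load_store_instructions = 4
load_store_memory_trips = 4 + 2 + 1   # 4 instruction fetches + 2 reads + 1 write

# Memory-as-registers machine: a single "ADD z, x, y" instruction,
# with operands directly wired to the CPU.
mem_to_mem_instructions = 1
mem_to_mem_memory_trips = 1           # just the one instruction fetch

print(load_store_instructions, load_store_memory_trips)  # 4 instructions, 7 trips
print(mem_to_mem_instructions, mem_to_mem_memory_trips)  # 1 instruction, 1 trip
```

Four clock ticks become one, and seven memory round trips become one, before you even count the cycles spent waiting on each trip.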
The structure of programs need not be terribly different per se; it can still be a sequence of instructions in memory. There are other non-Von Neumann architectures which would work differently (such as neural networks), but those are even more complicated.
Now, once we start applying memristor implication logic to data processing, we will have truly left the confines of the von Neumann architecture.
If you have a bunch of memory directly on the CPU, caching will still give significant speedups.
But because building those (and the associated support architecture) is much more expensive than slower solutions, we've ended up with a stack of slower, cheaper memories to handle datasets that are too large for the register space, then too large for RAM, etc.
Right now data has to be moved up that stack of faster, but smaller, memories before it can be worked on (nonvolatile storage like a HD->volatile storage like RAM->CPU register) and then back down that stack to store the result (register->RAM->HD).
That movement has to happen across the system bus, which moves one word at a time -- familiar word lengths like 8 bits, 16 bits, 32 bits, 64 bits, etc., though some older systems had oddball word lengths like 7 bits or 36 bits. In other words, you don't move 2GB of data in one shot from your hard drive into RAM. You have to do it in chunks of "words" -- 32 or 64 bits at a time -- over and over again until you've moved the data into memory.
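To put a number on that word-at-a-time movement (assuming a 64-bit bus, purely for illustration):

```python
# Copying 2 GB over a 64-bit bus, one word per transfer.
WORD_BYTES = 8                        # 64-bit bus width
data_bytes = 2 * 1024**3              # 2 GB
transfers = data_bytes // WORD_BYTES
print(transfers)                      # over 268 million separate bus transfers
```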
However, RAM is hideously slow compared to CPU registers. I don't know the numbers off the top of my head, but let's say moving data from one part of RAM to another takes 160 CPU cycles (80 to read, 80 to write). This sounds impressively fast on a modern 3.6GHz computer, but it means you can only move 22.5 million "words" around per second.
By comparison, moving data from one register to another might take 1 cpu cycle.
Furthermore, you can't operate on data in RAM; you have to move it into a register anyway, do the operation, then move the result back out into RAM. So we might be restricted to slightly less than 22.5 million operations per second -- which is pathetically slow.
Clever compilers (and ASM coders) will try to keep things in register space as long as possible to avoid this, and to reach closer to 1 operation per CPU cycle. And modern CPUs have a number of enhancements that also help with this (pipelining, instruction optimization, etc.).
But the most important bit is caches. Cache memory is designed to take fewer cycles to store/retrieve data than RAM; it transparently sandwiches in between register space and RAM, holding working sets of data that are too large to fit into register space but are still being worked on, so they shouldn't end up back in RAM yet. For the sake of argument, let's say it takes 20 CPU cycles to read/write from a cache. If we're operating on data that can fit in the cache, then we can do 180 million instructions per second instead of 22.5 million. But if the working data is too big to fit into the cache, we get a cache miss and end up having to go back to RAM -- and now we have to add those 20 cycles to the RAM round trip, giving us 180 cycles to read the data and write it back out.
Because the speed difference between RAM and CPUs is so great, multi-level caches -- slightly larger, but taking slightly more cycles to retrieve data than the main cache (L1) -- sit below L1 in the cache stack. Even if it takes 30 cycles to read/write some data from an L2 cache (which only happens on a miss in the L1 cache), 20+30 is still much faster than going all the way to RAM. And so on. Today we have L1, L2 and L3 caches, all designed to keep the CPU from waiting on RAM.
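Plugging the illustrative cycle counts above into a 3.6GHz clock shows where the orders of magnitude go:

```python
# Throughput under each scenario, using the made-up cycle counts above.
clock_hz = 3.6e9            # 3.6 GHz

ram_only = 160              # 80 cycles to read + 80 to write main memory
l1_hit = 20                 # data fits in L1
l1_miss = 20 + 160          # pay the cache lookup, then go to RAM anyway
registers_only = 1          # if all memory were register-speed

for name, cycles in [("RAM only", ram_only), ("L1 hit", l1_hit),
                     ("L1 miss", l1_miss), ("registers", registers_only)]:
    print(f"{name}: {clock_hz / cycles / 1e6:,.1f} million ops/sec")
```

That prints 22.5 million for RAM, 180 million for an L1 hit, 20 million for a miss, and 3.6 billion if everything ran at register speed -- the same spread the paragraphs above walk through.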
In other words, all designed to try to overcome the delays of moving data across the system bus and into register space introduced by the von Neumann architecture.
The system stack these days is something more like HD->RAM->L3->L2->L1->registers, with the caveat that moving from one level of the stack to the next probably requires crossing a bus which, again, is likely 32 or 64 bits wide these days. That's 4 or 8 bytes at a time.
With today's multi-gigabyte datasets, that's a ton of data moving across the bus, little of which will fit into any number of caches, slowly eking its way into register space so the CPU can do something like add two numbers together. At 3+GHz, the CPU is mostly just sitting around waiting on data to make its way up or down this ridiculous stack that's all been designed to accommodate the von Neumann design.
If we had only one kind of memory, and it was fast, and the CPU could directly operate on any part of that memory like it was register space, and we could eliminate the difference between slow-but-large non-volatile memory (HD) and faster-but-smaller volatile memory (RAM), computers wouldn't do 180 million instructions per second on a good day -- they'd do closer to the 3.6 billion they're capable of.
So, for example, if you want to add two large streams of numbers (e.g. dense matrices) together, a CPU can do this pretty quickly, because it can fetch the memory in bulk and not need to incur much latency penalty. (It can also avoid polluting the cache, with the correct hint instructions.)
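The access-pattern effect is easy to glimpse even from a high-level language: summing the same array in sequential vs. shuffled order does identical arithmetic, but the shuffled pass defeats caching and prefetch. (In Python the effect is muted by interpreter overhead, so treat the timings as directional only.)

```python
# Same work, different memory access order: sequential vs. shuffled.
import array
import random
import time

n = 2_000_000
data = array.array("q", range(n))      # a flat buffer of 64-bit ints
seq_order = list(range(n))
rnd_order = seq_order[:]
random.shuffle(rnd_order)

def total(order):
    s = 0
    for i in order:
        s += data[i]
    return s

t0 = time.perf_counter()
seq_sum = total(seq_order)
t1 = time.perf_counter()
rnd_sum = total(rnd_order)
t2 = time.perf_counter()

assert seq_sum == rnd_sum              # identical arithmetic either way
print(f"sequential {t1 - t0:.3f}s  shuffled {t2 - t1:.3f}s")
```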
On more typical workloads, though, where you've got a lot of harder-to-predict memory access, what would really come in handy is lower memory latency. And if this type of RAM works, it will give us both: huge bandwidth, and dramatically lower latency.
The system stack, instead of being HD->RAM->L3->L2->L1->registers, could look more like "Hard drive -> Giant shared non-uniform nonvolatile L3 cache -> shared or core-local L2 cache -> core-local L1 cache -> registers".
With regard to your question on data structures: yes, they would have to change. MySQL and most other databases use B-trees, since they were meant to live on disk. Just running MySQL on memristor-backed storage would result in a considerable waste of CPU and storage capacity (B-trees are not very compact).
Running a database like MemSQL on memristors would make the most sense, since it uses data structures meant for DRAM.
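The page-orientation of B-trees is easy to quantify: a node is sized to one disk page, trading space and CPU for a huge fanout per page read. A quick calculation with illustrative sizes (4KB pages, 8-byte keys and child pointers -- not any particular database's actual layout):

```python
# B-tree fanout and lookup cost for block storage (illustrative sizes).
import math

page_bytes, key_bytes, ptr_bytes = 4096, 8, 8
fanout = page_bytes // (key_bytes + ptr_bytes)     # children per node

n_rows = 100_000_000
page_reads = math.ceil(math.log(n_rows, fanout))   # page reads per lookup
print(fanout, page_reads)  # 256-way fanout, ~4 page reads for 100M rows
```

Four page reads is a great deal when each read costs milliseconds of seek time, but when storage is byte-addressable at near-DRAM latency, that page-sized node structure is pure overhead compared to the pointer-based in-memory structures a DRAM-oriented engine uses.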
As a consumer, I'm indifferent to who does it. As long as it gets out.
I wonder how these theoretical projects will survive in an HP that is only concerned with how much profit each project makes to appease shareholders...
Firstly, the fabrication process everyone is excited about puts the memristor cells directly atop the CMOS logic gates, as just another layer of the die. There's no external memory bus to tap into, so you'd need the sort of equipment used to remove layers from dies to expose the cells for reverse engineering, testing, etc. If someone with those resources is after you, you're already fucked regardless of the exact technological vector.
Secondly, a system designer could trivially add some amount of volatile storage for holding security-sensitive data. Various schemes of encrypted pages, decrypted using a key held in volatile SRAM on the CMOS layer, could be used. In other words, we could do the same things we currently do with hard drives, between the memristor array and SRAM that acts as a decrypted cache.
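A toy sketch of that scheme: pages sit encrypted in the non-volatile array, while the key lives only in volatile SRAM and so evaporates on power loss. The SHA-256-in-counter-mode keystream here is just a stand-in for a real cipher -- do NOT use this for actual security:

```python
# Toy encrypted-page scheme: key is volatile, pages are non-volatile.
import hashlib
import os

volatile_key = os.urandom(32)   # lives in SRAM; gone when power is cut

def keystream(key, page_no, length):
    """Derive a pseudorandom keystream per page (toy counter mode)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + page_no.to_bytes(8, "big")
                              + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(data, ks):
    return bytes(a ^ b for a, b in zip(data, ks))

page = b"secret data" + b"\x00" * 4085           # one 4KB page
stored = xor(page, keystream(volatile_key, 7, len(page)))      # what hits the array
recovered = xor(stored, keystream(volatile_key, 7, len(stored)))
assert recovered == page   # readable only while the key survives
```

Pull the power and `volatile_key` is gone, leaving only ciphertext in the memristor cells -- the same property an encrypted hard drive has today.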
Thirdly, you're applying an expectation to all memristors that is not applied to existing storage technologies. It might be a fair criticism of a concrete product that operates in an insecure way, but it's absurd to apply that expectation to an entire technology.
and "We have a lot of big plans for it and we're working with Hynix Semiconductor to launch a replacement for flash in the summer of 2013 and also to address the solid-state drive market"
June 2013 is twenty months away, so "a year and a half" is a very reasonable approximation, which it looks like the article writers then approximated again as "18 months"
It's not like those two times are way different.
Perhaps they could do away with the stupid idea altogether and say 2nd quarter 2013 if they can't be more specific.