HP plans to release first memristor, alternative to flash and SSDs in 18 months (nextbigfuture.com)
140 points by peritpatrio on Oct 8, 2011 | 54 comments

This article is mostly made of quotes from an EE Times article[1], adds no information or value to it, and is (to me at least) less readable.


EE Times article submitted here: http://news.ycombinator.com/item?id=3080963

The really exciting part here, to me, is the idea of fabricating large amounts of nonvolatile memory on top of a CPU. Modern processors already spend a huge amount of their time waiting on memory, and a great amount of power trying to hide that memory latency. If these guys can lower memory latency dramatically -- and it looks like they can -- computers would get a lot faster.

This would be a huge boon for scientific computing. The data is often huge, but all of it must be constantly pushed through the CPU every iteration.

As the number of cores per socket is steadily increasing it is becoming more difficult to keep all of the cores on a single computer operating efficiently. A higher clock rate per core, or more cores per socket, is not very useful if they are all waiting on memory.

Now imagine a large amount of memory connected to the cores through tiny metal wires on the chip itself. Now imagine that it's split into a bunch of small independent memories with enormous bandwidth, with data automatically migrating between them to cut down on wire delays. Give it a few years, and this could be reality.

Unless we are considering coupling lots of memory to the cores themselves, bypassing slow external buses.

Watch HP research's Stanley Williams describe the memristor and what they are working towards in more detail on YouTube:


From the video:

"So, I mean, we're all of a sudden talking petabits of memory in a square centimeter device. What can you do with that? Interesting to think about."

I watched the whole 47 minutes of that talk on youtube, and it is actually really good. I have no qualifications in this field, but the talk is full of challenging ideas. You can skip reading the article.

The no-more-hard-drive, persistent-RAM thing should be really interesting. It could completely change how we do things.

Also, application specific processing... writing instruction sets for each application will be really interesting. Cell processing et al will be a thing of the past.

All of those years spent trying to educate my family and non-technical friends on the difference between RAM and HDD and explaining how it is confusing when they tell me they couldn't install a program because their laptop was "out of memory".... wasted!

Great to see some positive press on HP. Hope HP can innovate their way out of the current slump.

So how would this affect what languages we use? How different data structures perform? etc. Will this amplify the end of the free lunch?

I'm especially curious how Rich Hickey's approach to state, time, and identity in a functional language fits in.

Here we have HP on the verge of profoundly transforming our industry and a CEO (who should be aware of this) wanting to turn the company into an SAP, because that's the future...

I hope Meg Whitman does better than Leo Apotheker. Shouldn't be hard.

But will Meg take it in the right direction either? Her experience is consumer-facing... I hope there is an empowered VP or 2 who will make that initiative a success.

I think the greatest impact of the memristor is in the consumer space. When you have a 1000-fold (or million-fold) quantitative change you are bound to have a qualitative change as well...

What would a computer with petabytes of non-volatile RAM look like? How would you package software for it? Will it reboot from time to time? What does a reboot mean when your memory is non-volatile? I have my hunches, but the effort to find answers to my few questions is dwarfed by the effort to find out what the relevant questions will be in this brave new world and if we can answer them. Or even understand them.

Enlarge the data bus and...are we finally finding a way out of the von Neumann architecture?

Is the Von Neumann architecture related to the data bus?

It operates by pushing data back and forth across it. Backus argued that this "von Neumann bottleneck" of not being able to access program and data at the same time ought to be done away with somehow, in his paper "Can Programming Be Liberated from the von Neumann Style?" http://www.stanford.edu/class/cs242/readings/backus.pdf

I... wish I understood. I'm just a lowly Blub programmer.

An executable is often less than 1MB, and certainly always less than 100MB. In contrast, a 14GB video game still loads pretty quickly, and the data that goes across the bus per frame is often a couple orders of magnitude larger than the program executable itself.

I know I'm missing something obvious...

Modern computers are extremely fast, and can still do tremendous amounts of computation despite the limitations of the Von Neumann architecture. But make no mistake, the Von Neumann bottleneck is a serious and fundamental problem. The CPU has to spend a lot of effort shuttling data back and forth. Worse yet, it has to spend a lot of time waiting on data (swapping to/from disk, for example). Even when your CPU is at 100% utilization, the vast majority of cycles are spent doing nothing but waiting. That has huge ramifications, affecting everything from performance to power efficiency.

Consider a typical snippet of a CPU's life. The next instruction is read from memory; it tells the CPU to move a value from memory into a register. The next instruction after that is read from memory; it tells the CPU to move a different value from memory into a different register. The next instruction is read from memory; it tells the CPU to do some operation with the values in those two registers. The next instruction is read from memory; it tells the CPU to test whether the result from the previous instruction is 0 and, if so, jump to a specific address. Since it was, the CPU fetches the next instruction from that location in memory. And so on. It only takes following this process for a little while to see how tedious it is. We've managed to significantly improve it by adding fast local memory caches to the CPU, but even if the memory operated at the speed of the CPU it would still be inefficient.
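That fetch-execute loop can be sketched as a toy interpreter (the instruction set here is made up purely for illustration; real ISAs look nothing like this):

```python
# Toy von Neumann machine: program and data share one memory, and
# every step begins by fetching the next instruction from that memory.
def run(memory):
    regs = {"r0": 0, "r1": 0}
    pc = 0  # program counter
    while True:
        op, *args = memory[pc]          # fetch: one memory access per instruction
        pc += 1
        if op == "load":                # move a value from memory into a register
            reg, addr = args
            regs[reg] = memory[addr]
        elif op == "add":               # operate on two registers
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "store":             # move a register back out to memory
            addr, reg = args
            memory[addr] = regs[reg]
        elif op == "halt":
            return memory

# memory[0..4] holds the program; memory[5..7] holds the data
mem = [("load", "r0", 5), ("load", "r1", 6), ("add", "r0", "r0", "r1"),
       ("store", 7, "r0"), ("halt",), 2, 3, 0]
run(mem)
print(mem[7])  # → 5
```

Note that computing one sum took five instruction fetches plus three data accesses, all through the same memory: that serialization is the bottleneck being discussed.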

Now, imagine if instead of megabytes of low-latency cache you had gigabytes. Now, imagine if, instead of having a cache at all, the processor were directly wired to the RAM as if the RAM were just a large collection of registers. Instead of "fetch me X, fetch me Y, add X + Y, put the result back in Z", all of that could be a single CPU instruction. Moreover, it would be far, far rarer for the CPU to be waiting for data merely due to local latency. This would improve the effective computing power and power efficiency of CPUs by several orders of magnitude. The impact it would have on computing is truly mind boggling.

Let me express it in a different way. Imagine if your cell phone had the same raw computing power as a top of the line GPU does today, with the same battery life and with the same transistor count and clock speed on the CPU, just with a different architecture and different RAM.

I think... Maybe... I'm getting it. Kind of. Probably not.

By wiring the CPU directly to the RAM, to use your metaphor, then we can entirely bypass the ASM stage of "a program" (but then what is a program if not a sequence of instructions?) and therefore we may better predict which data our program needs at runtime? Thereby caching that data more effectively than the random access patterns of Von Neumann?

Basically, instead of "accessing a pointer causes its data to be cached into L1", it would be... Well, I have no idea. Something else?

Here are my points of confusion, sorry:

1) in this non-Neumann paradigm, there will still be "data", in the traditional sense, right? (Or is "everything a program"?)

2) then... There will surely still be "caches" for that data, yeah? (Or is that what I'm missing? But without caches, I don't understand how it could be faster.)

But yeah, I don't want to waste anyone's time... certainly not anyone of your guys' caliber. Don't feel compelled/obligated to reply or anything. :)

Nope, still missing it.

When you wire the RAM to the CPU you don't need a cache. Imagine you have a billion or even a trillion registers, or more. That's a non-Von Neumann architecture. You're not shuffling data around on buses, the data is directly connected to the CPU.

Look at the example I gave again. Consider a simple addition command. The first CPU instruction says "take the word at this memory address, and move it to a register", the second does the same with a different address, the third adds the two values in the registers, the fourth then puts the result back in some other memory location. But what if there's no difference between the memory and registers? Instead you just have one instruction that says: add the values at these two locations, put the result at this other location. Now you've replaced 4 clock ticks with one clock tick. More than that, you save however many clock ticks it would have taken on average for the data to get to / from main memory (sometimes cached, sometimes not). Such an architecture would mean that you only have to wait on things you really have to wait on, like network and device latency, etc.
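A back-of-envelope comparison of the two styles (all cycle counts here are invented for the sketch, not taken from real hardware):

```python
# Rough cycle cost of "z = x + y" under the two models described above.
MEM_ACCESS = 80   # hypothetical cycles per main-memory transfer

# Von Neumann style: load x, load y, add, store z
# = 4 instruction ticks + 3 data transfers to/from main memory
von_neumann = 4 * 1 + 3 * MEM_ACCESS

# Memory-as-registers: one instruction addressing all three operands directly
direct = 1

print(von_neumann, direct)  # → 244 1
```

Even ignoring instruction fetches, the data transfers alone dominate; collapsing them is where the claimed orders-of-magnitude win comes from.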

The structure of programs need not be terribly different per se; it can still be a sequence of instructions in memory. There are other non-Von Neumann architectures which would work differently (such as neural networks), but those are even more complicated.

Except addressing that amount of memory is still going to need a bus, it doesn't matter if the memory is sitting right on top of the CPU core or in the next room. It simply isn't going to be possible to provide direct access to every single memory cell when there are billions of them. This is still going to be a von Neumann (actually Modified Harvard) architecture, it's just going to be blazingly fast.

Now, once we start applying memristor implicational logic data processing we will have truly left the confines of the von Neumann architecture.

Don't need a cache? The larger your memory is, the greater the access latency will be, even if it's directly on the CPU die. That's why L1 and L2 caches tend to be around 32 KB (often 32 KB each for instructions and data) and 256 KB, respectively. Most of the cache access time comes from the wire delays of sending signals around, and the larger the cache is, the longer the wire delays will be.

If you have a bunch of memory directly on the CPU, caching will still give significant speedups.

The Von Neumann architecture refers to the idea of a computer that has a CPU with a separate memory which stores both programs and data (as pretty much all computers do today). In this type of system the bus between memory and CPU becomes a bottleneck. A non-Von-Neumann architecture might look more like the brain, which doesn't have a CPU at all, but instead colocates processing with memory, eliminating the "memory bus" bottleneck and enabling massive parallelism.

It's a good question. Ideally, in a perfect world, your entire computer would be nothing but monstrously fast CPU registers.

But because building those (and the associated support architecture) are much more expensive than slower solutions, we've ended up with a stack of slower and cheaper memories to handle datasets that are too large for the register space, then too large for RAM, etc.

Right now data has to be moved up that stack of faster, but smaller, memories before it can be worked on (nonvolatile storage like a HD->volatile storage like RAM->CPU register) and then back down that stack to store the result (register->RAM->HD).

That movement has to happen across the system bus, which moves one word at a time for that system -- familiar widths like 8-bits, 16-bits, 32-bits, 64-bits, etc., though some older systems had oddball word lengths like 7-bits or 36-bits. In other words, you don't move 2GB of data in one shot from your hard drive into RAM. You have to do it in chunks of "words", which means chunks of 32-bits or 64-bits at a time -- over and over again until you've moved the data into memory.

However, RAM is hideously slow compared to CPU registers. I don't know the numbers off the top of my head, but let's say moving data from one part of RAM to another takes 160 CPU cycles (80 to read, 80 to write). This sounds impressively fast on a modern 3.6GHz computer, but it means you can only move 22.5 million "words" per second.

By comparison, moving data from one register to another might take 1 cpu cycle.

Furthermore, you can't operate on data in RAM; you have to move it into a register anyway, do the operation, then move the result back out into RAM. So we might be restricted to slightly less than 22.5 million operations per second -- which is pathetically slow.

Clever compilers (and ASM coders) will try and keep things in register space as long as possible to avoid this and try and reach closer to 1 operation per CPU cycle. And modern CPUs have a number of enhancements that also help with this (pipelining, instruction optimization) etc.

But the most important bit is caches. Cache memory is designed to take fewer cycles to store/retrieve data than RAM; it sits transparently between register space and RAM, holding working sets that are too large to fit into register space but are still being worked on, so they shouldn't end up back in RAM yet. For the sake of argument, let's say it takes 20 CPU cycles to read/write from a cache. If we're operating on data that fits in the cache, then we can do 180 million operations per second instead of 22.5 million. But if the working data is too big to fit into the cache, we get a cache miss and end up having to go back to RAM. Now we have to add those 20 CPU cycles to the round trip, giving us 180 cycles to read that data and write it back out.

Because the speed difference between RAM and CPUs is so great, multi-level caches -- slightly larger, but taking slightly more cycles to retrieve data than the main (L1) cache -- sit below L1 in the cache stack. Even if it takes 30 cycles to read/write some data from an L2 cache (which would only happen on a cache miss in L1), 20+30 is still much faster than 180 for RAM. And so on. Today we have L1, L2 and L3 caches, all designed to keep the CPU from waiting on RAM.
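Plugging the cycle counts above into the 3.6GHz clock (these are the illustrative numbers from this comment, not real hardware specs):

```python
# Back-of-envelope throughput for the hypothetical latencies above.
CLOCK_HZ = 3.6e9
RAM_ROUND_TRIP = 160   # cycles: 80 to read + 80 to write
CACHE_ACCESS = 20      # cycles for a cache hit

ram_mops = CLOCK_HZ / RAM_ROUND_TRIP / 1e6    # millions of ops/sec via RAM
cache_mops = CLOCK_HZ / CACHE_ACCESS / 1e6    # millions of ops/sec via cache

print(ram_mops)    # → 22.5
print(cache_mops)  # → 180.0
```

That 8x gap between the two lines is the entire reason the cache hierarchy exists.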

In other words, all designed to try and overcome the delays of moving data across the system bus and into register space, as introduced by the von Neumann architecture.

The system stack these days is something more like HD->RAM->L3->L2->L1->registers with the caveat that moving from one level of the stack to the other probably requires it to move across the system bus which again is likely 32-bits or 64-bits these days. That's 4 or 8 bytes at a time.

With today's multi-gigabyte datasets, that's a ton of data moving across the bus, little of which will end up fitting into any of the caches, slowly eking its way into register space so the CPU can do something like adding two numbers together. At 3+GHz, the CPU is mostly just sitting around waiting on data to make its way up or down this ridiculous stack that's all been designed to accommodate the von Neumann design.

If we had only one kind of memory, and it was fast, and the CPU could directly operate on any part of it like it was register space -- eliminating the difference between slow-but-large non-volatile memory (HD) and faster-but-smaller volatile memory (RAM) -- computers wouldn't do 180 million instructions per second on a good day, they'd do closer to the 3.6 billion they're capable of.

Great explanation, but we should distinguish between memory bandwidth -- how many bytes we can read or write per second -- and memory latency, how long it takes to load or store some memory location. Bandwidth is actually pretty fast these days; the latency is what sucks.

So, for example, if you want to add two large streams of numbers (e.g. dense matrices) together, a CPU can do this pretty quickly, because it can fetch the memory in bulk and not need to incur much latency penalty. (It can also avoid polluting the cache, with the correct hint instructions.)

On more typical workloads, though, where you've got a lot of harder-to-predict memory access, what would really come in handy is lower memory latency. And if this type of RAM works, it will give us both: huge bandwidth, and dramatically lower latency.
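The latency-vs-bandwidth distinction can be seen even from Python by touching the same data in predictable vs. shuffled order (pure-Python overhead dilutes the hardware effect considerably, so treat this as a sketch of the idea, not a benchmark):

```python
# Same total bytes touched, very different access pattern: the shuffled
# walk defeats the prefetcher and turns every step into a latency-bound
# access, which is what "harder-to-predict memory access" means above.
import random
import time

n = 1_000_000
data = list(range(n))
shuffled = list(range(n))
random.shuffle(shuffled)

t0 = time.perf_counter()
seq_sum = sum(data[i] for i in range(n))     # predictable, stride-1 walk
t1 = time.perf_counter()
rnd_sum = sum(data[i] for i in shuffled)     # pointer-chasing-style walk
t2 = time.perf_counter()

print(f"sequential: {t1 - t0:.3f}s  shuffled: {t2 - t1:.3f}s")
assert seq_sum == rnd_sum  # identical work, different order
```

A streaming workload like the dense-matrix add above lives on the first line's behavior; typical application code lives on the second's.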

The system stack, instead of being HD->RAM->L3->L2->L1->registers, could look more like "Hard drive -> Giant shared non-uniform nonvolatile L3 cache -> shared or core-local L2 cache -> core-local L1 cache -> registers".

What are the implications for back end development? Will this greatly reduce server complexity (need for redundancy)? Could AWS just give you a machine in the cloud, and you wouldn't need to worry about databases and database backups and so on, you could just keep all your data locally in natural data structures?

This would not necessarily affect back end development as you would still want high availability for the application by using an active-active setup across two availability zones in EC2.

With regard to your question on data structures, yes -- they would have to change. MySQL and most other databases use b-trees since they were meant to live on disk. Just running MySQL on memristor-backed storage would result in a considerable waste of CPU and storage capacity (b-trees are not very compact).

Running a database like MemSQL on memristors would make the most sense since it uses data structures meant for DRAM.

Not if they continue cutting their hardware divisions, it won't.

> Asked about the competition, Williams said: "Samsung has an even larger group working on this than we do."

As a consumer, I'm indifferent to who does it. As long as it gets out.

Whatever happened with Nantero?

A memristor TouchPad would've been really hot.

I wonder how these theoretical projects will survive in an HP that is only concerned with how much profit each project makes to appease shareholders...

The bad side-effect of this memory technology: you can't just power off your computer to hide your current activity; decrypted passwords in memory will still be readable after shutoff.

This is silly. Just because memory is non-volatile storage doesn't mean the OS can't do reasonable things like clearing out sensitive state as it goes to sleep.

Ok. So your computer is in lock-screen mode requiring you to enter a password before resuming your interactive session. Someone with physical access to your computer certainly can find a way to divorce the memory from the rest of the system without letting the OS do its thing ... your live program memory is compromised. This memory often will have more sensitive info than your (possibly encrypted) mass storage.

I disagree.

Firstly, the fabrication process everyone is excited about puts the memristor cells directly atop the CMOS logic gates as just another layer of the die. There's no external memory bus to tap into. So you'd need the sort of equipment used to remove layers from dies to expose the cells for reverse engineering, testing, etc. If someone with those resources is after you, you're already fucked regardless of the exact technological vector.

Secondly, a system designer could trivially add some amount of volatile storage for holding security-sensitive data. Various schemes of encrypted pages, decrypted using a key stored in volatile SRAM within the CMOS, could be used. In other words, we could do the same things between the memristor array and an SRAM decrypted cache that we currently do with hard drives.

Thirdly, you're applying an expectation to all memristors that is not applied to existing storage technologies. It might be a fair criticism of a concrete product that operates in an insecure way, but it's absurd to apply this expectation to an entire technology.

This is already possible with existing DRAM memory; simply cool the RAM chips sufficiently and you increase data retention enough to extract the RAM sticks and read back the data. However, if someone has physical access to your PC, there is almost certainly an easier way for them to gain access to your data. The best protection is restricting physical access.

The article mentions both 18 months and summer 2013. That's not entirely consistent.

The specific quote was "We’re planning to put a replacement chip on the market to go up against flash within a year and a half"

and "We have a lot of big plans for it and we're working with Hynix Semiconductor to launch a replacement for flash in the summer of 2013 and also to address the solid-state drive market"

June 2013 is twenty months away, so "a year and a half" is a very reasonable approximation, which it looks like the article writers then approximated again as "18 months"

It's an estimate, not an exact length of time. 18 months takes us to April 2013, so maybe they rounded April into "summer", or maybe they rounded 20 months to the nearest half-year. Or maybe they class April as summer anyway.
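The month arithmetic here checks out (the story ran in October 2011):

```python
# Months from the article's date (Oct 2011) to the two quoted targets.
def months_between(y0, m0, y1, m1):
    return (y1 - y0) * 12 + (m1 - m0)

print(months_between(2011, 10, 2013, 4))  # → 18 (the "18 months" figure)
print(months_between(2011, 10, 2013, 6))  # → 20 ("summer 2013", roughly
                                          #   "a year and a half" rounded)
```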

It's not like those two times are way different.

Well summer is December to February, so April is a fair way off...

Perhaps they could do away with the stupid idea altogether and say 2nd quarter 2013 if they can't be more specific.

I doubt HP itself will exist in 18 months.

It won't disappear. It could be split, close some sites, change leadership again, etc. but it won't simply cease to exist. The existing infrastructure itself is worth too much.

What would you say are HP's top 5 most valuable pieces of infrastructure?

Just guessing but: lab machines, buildings, purpose-built networks/processing centres, etc. I don't know how the production is organised, but assembly lines may be included.

Even if HP doesn't survive, the technology for this probably will.

Hyperbole much?

All day, every day!

Eight days a wee-eeek!

I don't think that's possible, Jersey Shore doesn't have as much drama as HP.
