Intel Announces Optane DIMMs (anandtech.com)
446 points by p1esk 9 months ago | 183 comments

These are going to change things for a lot of reasons, not the least of which is significantly higher transaction rates in systems. In particular, being past a memory barrier means that your cache is consistent and persistent across reboots in nanoseconds rather than milliseconds or even microseconds, which means large transactional systems can do more operations, up to 6 orders of magnitude faster. In file systems, and especially so-called 'IRON' or error-corrected systems, this will significantly boost the speed of operations, as more writes can be delayed safely to ensure that dense spinning media is writing sequentially rather than randomly. Using the old NetApp cluster architecture I expect you could get a couple of gigabytes per second of random (from the clients) R/W performance on a very reasonable number of drives.

Of course it also makes it possible to rapidly reboot into an OS if all you need in the boot loader is to copy a gigabyte of 'ram' from one place to another and then jump to it. What that enables is powering down servers that aren't being used with the ability to power them up in milliseconds rather than seconds. That has been a goal of Google and other cloud providers: to have 'power proportionate computing' clusters.

Very cool stuff indeed.

> if all you need in the boot loader is to copy a gigabyte of 'ram' from one place to another and then jump to it

I suppose you don't even have to copy it before you jump to it. You could map the pages and then lazily copy when you get page faults.

> powering down servers that aren't being used with the ability to power them up in milliseconds

Still, if this is your goal, can't you get 90% of the way there by keeping some DRAM powered up while everything else is off? Refreshing DRAM requires some power, but not that much from what I understand (as a software guy).

On keeping DRAM alive, at NetApp where the company built custom motherboards specifically for filers, this continued to be an issue for the memory controllers in the processor chips. Neither AMD nor Intel had any support for staying clear of the DRAM configuration on power up of their server chips. There was an interesting config in one of their laptop chips but no ECC on that chip so it was a non starter. At Google the motherboards were less customized and even less likely to incur a Google specific cost of something like holding up memory. The benefit of this coming from Intel is that they do the work to qualify it and everyone else gets to reap the benefits.

The memory bus on these processors is 3 - 30 GBs so copying a full gig would be a 3 - 30mS sort of deal. A pretty short start time. And yes, you tell the BIOS to skip POST (it's a common start up option in even off the shelf BIOS packages these days) in order to get a faster boot time.

Netapp can also get rid of their NVRAM architecture. More importantly any journaling file system can have in memory logging now and replication means just transfer the fs transaction logs.

> The memory bus on these processors is 3 - 30 GBs so copying a full gig would be a 3 - 30mS sort of deal

This should say 30 - 300ms I think
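To make the correction above concrete, here is a small sketch of the arithmetic. It assumes "a gig" means 10^9 bytes and that the copy runs at the full quoted bus bandwidth with no setup cost; `copy_time_ms` is just an illustrative name.

```cpp
#include <cassert>

// Time in milliseconds to copy `bytes` at `bytes_per_second`,
// assuming the copy is purely bandwidth-bound (no setup cost).
double copy_time_ms(double bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second * 1000.0;
}
```

Under those assumptions, `copy_time_ms(1e9, 3e9)` comes out at roughly 333 ms and `copy_time_ms(1e9, 30e9)` at roughly 33 ms, which agrees with the 30 - 300ms figure.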

> I suppose you don't even have to copy it before you jump to it. You could map the pages and then lazily copy when you get page faults.

You could also just map it and not even fault on access. For data that's mostly read and seldom written, and not too much of a hot spot, there would be no reason to move it to DRAM. Depending on the read latency of these DIMMs, that might include a lot of executable code.

> keeping some DRAM powered up while everything else is off?

STR (suspend to RAM), available in off the shelf computers since ~1995

Novice in these contexts but I think it would be less of a saving as DRAM must be refreshed, at a pretty high clockrate too, typically.

DRAM has a self-refresh mode powered by an internal clock that you use for low-power standby, so this is not a problem. The DRAM just needs to be powered.

At the current rating of 30 drive writes per day if you try to use it the way people use RAM now it will maybe last a week if you're lucky.

It is important to remember that this isn't FLASH. It is more like FRAM with its non-volatility. When Intel and Micron announced it three years ago[1] they made some pretty bold claims, like 1000x faster than NAND, unlimited write endurance, etc. For a long time it was just vaporware and then it came out as "NVME" type SSDs and now they have figured out how to put it right on the frontside bus with the memory controller.

[1] https://newsroom.intel.com/news-releases/intel-and-micron-pr...

not fsb, memory controller is now integrated into the processor die.

Ah good point, they are still north of the cache and MMUs though (as opposed to being on the PCI bus and behind the IOMMU at best)

Of course, it comes at a cost: These dimms have several orders of magnitude higher latency and a fraction of the throughput of DRAM.

Once you introduce volatile memory as a high-speed buffer, the coolness factor drops quite a bit as you end up in the same volatile/non-volatile tiered storage split that we have always had. It's just back on a parallel bus like in the good ol' days of PATA.

How do you know? Other comments have pointed out that Intel was aiming for Optane DIMMs with a read latency half of the current RAM.

I appear to have exaggerated slightly when I said several orders of magnitude, but the original 3D XPoint latency numbers put it at about 25 times higher latencies than contemporary DRAM (~7000ns vs. ~200ns).

Likewise, quad-channel DDR4 setups can push around 60GB/s, while Optane PCIe SSDs have only been able to push around 2GB/s. The PCIe x4 interfaces that they use should have been able to push 4GB/s, and if they started hitting a wall there, they could just have used a full PCIe x16 interface which could push around 15GB/s. This indicates that the throughput was not interface bound.

Nothing seems to signal that Intel has anything up their sleeve to provide the 25x latency improvement and 30x throughput improvement needed to just be even with DRAM, although placing the chips on the memory bus will of course provide some speedup.

Note that transactional costs on PCIe are much higher than they are for memory, and you are going through a disk driver (if you're an SSD, you can avoid that using the NVMe architecture extensions). There is no driver needed for non-volatile storage on the memory bus. You just access it like memory.

I would not be surprised if early users of these devices were seeing 10GB/s+ of data throughput.

PCIe 3.0 does not have a very high transaction cost. The encoding overhead is just 1.5%, and the protocol is quite simple for the primary communication (there's a bunch of "extras" like retransmits and stuff, but that's not important here).

I can say that the PCIe x16 cards we develop at work have no problem reaching the theoretical maximum, transmitting around 15GB/s worth of payload data (e.g. our 2x100Gb/s NIC with one port pushing 100Gb/s in both directions, and another a few tens of Gb/s in both directions). We don't make cards smaller than x8, so I can't give measured numbers there, but Intel should have no problem transmitting ~4GB/s on that x4.

NVMe surely has an overhead, but this overhead is far from interesting, unless you are implying that the overhead eats a whopping 35% of the available bandwidth. Likewise, if Intel had hit a PCIe/NVMe bottle-neck, going up to x8 would not have been difficult in any way (x16 is more annoying, though).

The performance has either been restricted by the flash itself, or by the controller—not by the interface. The numbers are also small enough that the software side shouldn't be a problem yet. Attaching to the memory bus minimizes penalties that are unlikely to be a bottle-neck.
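For anyone wanting to check the PCIe figures in this subthread, a rough sketch of the raw link arithmetic follows. It uses the PCIe 3.0 rate of 8 GT/s per lane with 128b/130b encoding (the ~1.5% overhead mentioned above) and deliberately ignores packet headers and flow control, so measured payload throughput will land a bit lower. The function name is made up for illustration.

```cpp
#include <cassert>

// Raw PCIe 3.0 bandwidth in bytes per second, per direction.
// 8 GT/s per lane with 128b/130b line encoding (about 1.5% overhead).
// Ignores packet (TLP) headers and flow control, so real payload
// throughput is somewhat lower than this figure.
double pcie3_bytes_per_second(int lanes) {
    assert(lanes > 0);
    const double transfers_per_second = 8e9;           // per lane
    const double encoding_efficiency = 128.0 / 130.0;  // 128b/130b
    return lanes * transfers_per_second * encoding_efficiency / 8.0;
}
```

That gives about 3.9 GB/s for x4 and about 15.8 GB/s for x16, so an SSD topping out around 2GB/s is using roughly half of an x4 link.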

for rebooting, you'd have to have super-fast POST and device initialization after power cycle, so i wouldn't say milliseconds.

Definitely. Most of my Dell machines take 2-5 minutes to get to the "booting OS" part of the process, but a couple of them have at times taken 3 hours.

The latter is probably a firmware bug, but the best Dell ProSupport has been able to do to solve it is "flea drain" (unplug, hold the power button for 60 seconds, plug back in).

Dell servers are notoriously slow to boot, but this is due to the remote management subsystem being extremely slow.

It's quite a pain in the arse, and one of the reasons we prefer to use Supermicro machines for development at work, using only Dell for testing.

The Aerospike demo Intel did at the event showed 17 seconds vs. over 2000 seconds for a restart. The difference is rebuilding a DRAM index on a many-terabyte data set. Remember that Aerospike has consistency guarantees, so you can't have the index be slightly wrong.

He doesn't mean rebooting will take microseconds/nanoseconds, he means the transaction commit will take that long. Before, a transaction commit meant flushing disk, at least as far as the disk controller. Now, commit = memory barrier.

Actually, "restarting" as well. Technically it isn't "booting" because the OS is already in memory and initialized, it just isn't running.

Given your typical motherboard which you tell to bypass POST, it can go from power applied to memory controllers alive in just a few microseconds. The code then to go from there to running is also quite small, especially if there aren't things like graphics controllers on the PCI bus that have to be initialized first.

And even those can probably be lazy-loaded: take a page fault in the IOMMU and handle it as soon as the rest is done, or once the rest is idle and something is therefore blocking on the GPU initialisation.

"What that enables is powering down servers that aren't being used with the ability to power them up in milliseconds rather than seconds." (great-grandparent)

At least we've been moving glacially in that direction, thanks to UEFI & faster booting. POST, device initialization, etc has begun to stick out, and is slowly being improved.

You can write your own custom firmware and skip the unnecessary bits.

close your laptop lid, wait 10 seconds, open it up, did it resume quickly?

2017 MacBook Pro 15". No, it doesn't. Open the lid, time passes before screen is no longer black. Even longer amount of time before the TouchID figures out my finger is the accepted finger. So, when did it resume? When the screen lit up, or when I could actually use it? Obviously, Apple's TouchID is preventing me from resuming, but to the user, that's all they care about.

You made the choice buying Apple.

What I can't work out from both the article and the comments here: from an application point of view, do I use this like I use memory, or do I use it like I use a disk?

No matter how fast a disk is, using it means either some expensive serialization/deserialization step (and also the associated memory access to create the 'working' object that my logic actually works on) or writing my algorithms to forego in memory objects (and the associated features offered by my programming language, e.g. classes / objects or whatever) and working from the raw byte values.

What I really want, and what would be a game changer as to how we use things, is for my programming language's heap to be made persistent (or at least a part of it). In this case instead of:

  var mything = new Thing();

I might have:

  persistent var mything = new Thing();

Done. However this also introduces more questions, like transactional commits to memory etc (as few apps are coded to ensure consistency of memory across reboots).

However I can't help thinking that some way to harness persistent fast memory without needing some complex disk->logic mapping would be a game changer.

Edited: spelling and wording

Disclaimer: I work at Intel on PMDK (pmem.io)

It is the game changer that you wish for, since the marshaling logic that you mention is gone. Persistent Memory can be accessed directly through a memory-mapped file, bypassing the traditional read()/write() I/O paths. Recent file systems have also been modified to a) skip the page cache layer and b) forgo the msync() call that would otherwise be required to synchronize the modified pages. This is what's called DAX (Direct Access [0]). In place of msync() you can now just use CPU cache flush instructions. These two file system changes entirely eliminate kernel code from the I/O path (apart from the initial page faults).

The Persistent Memory Development Kit contains libpmemobj [1], which is almost exactly what you are imagining ;) It's a persistent heap, with transactions for durability. It's not as nice (yet) as your code snippet, but here's a C++ example [2] of a persistent queue push:

  obj::transaction::exec_tx(pool, [this, &value] {
    auto n = obj::make_persistent<Node>(value, nullptr);

    if (head == nullptr) {
      head = tail = n;
    } else {
      tail->next = n;
      tail = n;
    }
  });

`make_persistent` is, akin to `make_unique`, a memory allocation of a "Node" class. Once allocated, we can just assign the newly allocated object to a different persistent variable. No kernel code executing, no serialization ;)

[0]- https://www.kernel.org/doc/Documentation/filesystems/dax.txt

[1] - https://github.com/pmem/pmdk

[2] - https://github.com/pmem/pmdk/blob/master/src/examples/libpme...
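As a rough illustration of the "access it through a memory-mapped file" idea, here is a minimal sketch using only plain POSIX calls. It is not PMDK code and does not need persistent memory to run: it keeps the portable msync() call, which, per the comment above, is the step a DAX-mounted file system would let you replace with CPU cache flush instructions. The function names and the single 64-bit value layout are assumptions made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

constexpr std::size_t kSize = sizeof(std::uint64_t);

// Store one 64-bit value in a memory-mapped file and make it durable.
// Portable sketch: msync() stands in for the cache flush that a DAX
// mapping on persistent memory would use instead.
bool store_value(const char* path, std::uint64_t value) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return false;
    if (ftruncate(fd, static_cast<off_t>(kSize)) != 0) {
        close(fd);
        return false;
    }
    void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return false;
    *static_cast<std::uint64_t*>(p) = value;  // plain store, no write() call
    bool ok = msync(p, kSize, MS_SYNC) == 0;
    munmap(p, kSize);
    return ok;
}

// Map the same file again and read the value back with a plain load.
bool load_value(const char* path, std::uint64_t* out) {
    assert(out != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < static_cast<off_t>(kSize)) {
        close(fd);
        return false;
    }
    void* p = mmap(nullptr, kSize, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return false;
    *out = *static_cast<const std::uint64_t*>(p);
    munmap(p, kSize);
    return true;
}
```

The data is written and read with ordinary loads and stores; the only serialization-like step left is deciding where in the mapping each object lives, which is what a persistent heap such as libpmemobj handles for you.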

If your data contains pointers, then for it to be round-tripped through persistence correctly, i would imagine you'd need to map it at the same virtual memory address every time. Which isn't possible. Have i got that wrong?

That's an excellent observation. You are indeed correct that pointers in memory mapped files are quite tricky to get right. When you think about shared memory in general, this isn't a new problem [0], and the solution is almost exactly the same [1]. Instead of dealing with raw pointers, the library provides an encapsulated fat pointer which contains an offset from the beginning of the mapping. And when the file is opened, we simply register the new virtual address, and calculate the real address when needed.

[0] - https://www.boost.org/doc/libs/1_63_0/doc/html/interprocess/...

[1] - http://pmem.io/pmdk/cpp_obj/master/cpp_html/classpmem_1_1obj...
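The offset-based fat pointer idea described above can be sketched in a few lines. This is only an illustration of the technique, not how PMDK's or Boost's pointer classes are actually implemented; `OffsetPtr`, `from` and `resolve` are hypothetical names, and the sketch assumes offset 0 can be reserved to mean null.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A position-independent "pointer": stores a byte offset from the start
// of the mapping instead of a virtual address. Offset 0 means null here,
// which assumes no object lives at the very start of the region.
template <typename T>
struct OffsetPtr {
    std::uint64_t offset = 0;

    // Build an offset pointer from a real address inside the mapping.
    static OffsetPtr from(const void* base, const T* target) {
        assert(base != nullptr);
        OffsetPtr p;
        if (target != nullptr) {
            p.offset = static_cast<std::uint64_t>(
                reinterpret_cast<const char*>(target) -
                static_cast<const char*>(base));
        }
        return p;
    }

    // Turn the offset back into a real address for the current mapping.
    T* resolve(void* base) const {
        if (offset == 0) return nullptr;
        return reinterpret_cast<T*>(static_cast<char*>(base) + offset);
    }
};

// A list node that stays valid wherever the region gets mapped.
struct Node {
    int value;
    OffsetPtr<Node> next;
};
```

Because only offsets are stored, the same bytes can be mapped at a different virtual address on the next run and every `next` link still resolves correctly once the new base address is known.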

When you mmap() a file you can specify the virtual address so it will be the same every time.

Yes, but to accomplish that you would have to use the MAP_FIXED flag, which is quite dangerous because it can replace previous mappings. That can lead to problems with dynamic memory allocation since almost all malloc() implementations use anonymous mmap.

Yes but this is a trivial problem to fix on 64 bit machines. There's so much address space the kernel can just be told to never pick certain address ranges for unfixed mmaps, leaving the rest of the address space free for persistent heaps.

The actual hard part of persistent heaps isn't the persistence part. It's transactionality and upgrade management.

It is of course important to point out that this is akin to casting a struct to a void pointer and writing it to a file (it's just faster), which works extremely well but requires the data structures to have a stable memory representation. If one changes the structs in any way, old persistent data will look like garbage. One should therefore still have an extremely well-defined and versioned system for managing persistent data, rather than just arbitrarily allocating objects on the persistent heap.

It's still neat, though.
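One common way to get the "well-defined and versioned" part is a small header at the start of the persistent region that is checked before anything else is touched. The sketch below is an assumption about how that might look, not something PMDK prescribes; the header fields, the magic value and the version number are all invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical on-media header placed at the start of a persistent region.
struct RegionHeader {
    std::uint32_t magic;           // identifies the data as ours
    std::uint32_t layout_version;  // bumped whenever a struct layout changes
};

constexpr std::uint32_t kMagic = 0x504D454D;  // arbitrary tag value
constexpr std::uint32_t kCurrentVersion = 2;  // illustrative value

enum class OpenResult { Ok, NotOurs, NeedsMigration, TooNew };

// Decide what to do with a region before interpreting any of its structs.
OpenResult check_header(const RegionHeader* h) {
    assert(h != nullptr);
    if (h->magic != kMagic) return OpenResult::NotOurs;
    if (h->layout_version < kCurrentVersion) return OpenResult::NeedsMigration;
    if (h->layout_version > kCurrentVersion) return OpenResult::TooNew;
    return OpenResult::Ok;
}
```

Old data then shows up as `NeedsMigration` instead of silently being read as garbage, which is the failure mode described above.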

Yup. In this aspect libpmemobj could be compared to how Cap'N Proto [0] works. And of course, this has some trade-offs that users need to be aware of.

[0] - https://capnproto.org/index.html

It would appear to be treated as a type of object store that you access with a special library, or through the OS as a filesystem: https://software.intel.com/en-us/articles/introduction-to-pr...

That's very very interesting. Many thanks for sharing.

So it's still a serialize/deserialize cycle, but the access libs built on top of the persistent memory look interesting.

I see it as a disk drive where memory mapping is a NOP and its performance is close to RAM. So an .exe starting from this disk already has all of its contents mapped to RAM, and you can create and memory-map a 512 GB file on it and use it essentially as a RAM block in your process's memory space. But you can then close and reopen it, meaning it's guaranteed to be persistent.

Simon Peter, at UT Austin, has been working on a file system that works better for systems with NVM. It looks like existing file systems (e.g. EXT4) are going to suck in this new paradigm (DRAM -> NVM -> SSD) . It’s a pretty interesting read and the benchmarks are damn impressive.


Disclaimer: I work at Intel on PMDK (pmem.io)

There's been a lot of interesting research around file systems for persistent memory.

One that shows a lot of promise is NOVA [0]. Its focus is on making full use of this new type of memory. And it's not just pure research, they are attempting to get NOVA included into the Linux kernel [1, 2].

And while talking about file systems, we shouldn't forget about the effort that was put into modifying the existing ones to support DAX (Direct Access) [3, 4].

[0] - http://nvsl.ucsd.edu/index.php?path=projects/nova

[1] - https://lwn.net/Articles/729812/

[2] - https://lkml.org/lkml/2017/8/5/188

[3] - https://lwn.net/Articles/717953/

[4] - https://lwn.net/Articles/731706/

Interesting read, thanks for sharing.

It didn’t even occur to me that the file systems will need to change to fully take advantage of NVRAM. I wonder at what point the abstraction will stop leaking and require another higher layer to account for differences in performance. I’m sure the OS will need tuning, but applications might not unless they’re pretty bare metal.

One of the reasons why Apple's new file system APFS was developed was optimization for flash storage. HFS+ was designed with floppy and spinning disks in mind.

What notable changes did APFS implement to accommodate flash storage?

Anybody want to give a brief comparison of how these compare in practice to DDR4 and SSD storage? I assume it's "slower than the former, faster than the latter", but having an idea of the magnitudes would be useful.

The specs on these chips aren't released, but previously Intel has quoted latencies around 1/2 DRAM speed (many orders of magnitude faster than SSDs).

These are byte addressable - they will look like RAM to the OS. (If you have a motherboard that supports them, there is a slight change to the spec).

To the application developer, the interface will be http://pmem.io/pmdk/ . Last I looked, there were several ways to do things, but the most commonly used would basically be allocating a chunk of memory and assigning it a file name so you could re-open it next time.

This is exciting because it could open up exciting possibilities like zero-CPU IO with DMA straight to persistent memory, "sleep" mode that is essentially free, re-thinking paging in modern operating systems, and generally re-thinking everything we've assumed since core memory.

On the same day we are talking about WASM microkernels in another thread. Things are getting fun again :)

> On the same day we are talking about WASM microkernels in another thread

That would be https://news.ycombinator.com/item?id=17187384

> but the most commonly used would basically be allocating a chunk of memory and assigning it a file name so you could re-open it next time.

So the PC platform finally caught up with Amiga's RAM disk ;-)

I think you meant RAD: disk... RAM disks don't survive a warm boot, but RAD did (IIRC). But, even RAD: didn't survive a cold boot (where this Optane memory does)

Still, I do fondly remember my Amiga days!

Intel's trying to keep the performance numbers under wraps for at least several more months. My predictions: write performance for a single Optane DIMM will be less than an order of magnitude better than an Optane NVMe SSD, maybe only twice as fast. Read latency might get close to that of slow DRAM, but you'll still notice that Optane is slower.

I wonder if Intel is trying to keep the endurance number under wraps. If they can't beat the current drive writes per day spec it is like having a 10 gigabit internet connection but a usage cap of 1 gigabyte.

Endurance should be fine, orders of magnitude better than flash. Just some temperature wear.

But the speed and density seems to be a problem. E.g. it's still 2x slower than DRAM, but theoretically should be faster. It could also have 3 bits, not just 2.

Linus did a video about Optane I found helpful.


This video is about the Optane devices you place in M2/nvme slots to be used as cache.

It will be interesting to see the real benchmarks on these DIMMs. We all want to know if they're comparable to real DDR4 memory.

The current market is terribly overpriced (there's some debate on if there's price fixing with the big three or if it's a genuine shortage/supply problem with the Note recalls and new phone releases). DDR4 is nearly double the price it was the last time I did a build over a year ago. :-/

EDIT: Looks like these chips will be specialized for certain server boards/CPUs and only share the DIMM interface and not protocol.

Thanks for the clarification, I'm watching the video now and it's interesting (Optane is way cheaper than I expected) and still seems useful as a lower bound for what we can expect from the DIMM version.

Thanks for the link! The Blender benchmark is really telling. Using Optane M2 modules as a swap disk allowed running a task that required 12GB on a system with 4 GB of RAM, and it took just 1.8x more time than the native RAM case. So I guess I can put those Optane modules into a laptop with 16GB of RAM and run a calculation that requires 64GB of memory and it will be much cheaper than a laptop with 64GB of RAM.

But what is the price difference between an Optane module and the equivalent size of RAM?

This is not the Linus I was expecting.

hehe .. indeed, the Linus you were expecting would have been a mailing list post.

What saves the page table between reboots...the OS must have specific support for this surely...?

You're still going to have some DRAM attached to your CPU, too. The non-volatile portion of the address range is probably going to be managed with a filesystem that supports DAX.

Wait...so is this about having an ssd on the memory bus...or non-volatile paged memory?

I'm so confused. I can't find a good explanation of how applications and/or the OS "see" these DIMMs.

How does my OS/app see this? Is it accessed like regular DRAM memory... except slower and persistent?

Or would my OS see it as a "normal" drive... except one that's really fast and happens to be connected via a DIMM slot instead of PCIe/SATA/whatever?

Your operating system will see them as storage, not regular DRAM. Intel has released SDKs for developers to use if their applications need to be able to interact with persistent memory in any kind of detailed way (say, if you develop a database product, or a filesystem).


It is very likely that generic kernel support will come for use in Linux and Windows directly, building on top of the existing DAX systems in those operating systems (DAX - direct access - APIs being used for IO to memory-like devices, bypassing cache layers which are useful for more traditional storage types). This would allow a user to create a regular old storage volume in their NVDIMM for general use.


Do note that NVDIMMs aren't a drop-in replacement for regular DRAM DIMMS, despite using the same bus and electrical subsystem. You'll need proper hardware support on your motherboard and CPU, since memory controllers are on CPU these days.

> Your operating system will see them as storage

Do you mean it will present them to userland as storage, rather than see them as storage?

Seeing them as storage implies to me that the DIMM emulates an AHCI, which i don't think is the case.

Correct yeah, I probably oversimplified that part a bit. The kernel will be fully aware that your NVDIMM is NVDIMM, none of the technical details available so far suggest there will be any kind of emulation of legacy storage protocols.

That's a fantastic summary. Thank you!

I don't really understand how this is supposed to work. These are DIMMs, meaning they are ram alternatives? Will they be cheaper and slower than DDR4, but persistent and getting benefits from that? Which advantages exactly? Wouldn't the OS need to be aware of that, defeating the point of them being DIMMs?

128GB, 256GB and 512GB per module is sadly too much for consumer motherboards. Why not a 16GB version, didn't Intel even launch Optane with those small sizes?

There's a new slot called "NVDIMM", where "non-volatile" memory can be plugged into the motherboard so that the CPU can interact with it.

It's cheaper and slower than DDR4, but persistent and likely to be way denser.

> 128GB, 256GB and 512GB per module is sadly too much for consumer motherboards.

NVDIMMs are for database applications. If you were running 1TB in-memory databases, but are willing to lose a bit of performance to severely reduce costs, you're in the market for an NVDIMM.

The memory bus is pretty much the fastest place you can attach things in your system; for example, looking at some random Xeon CPU, it has 6 channels of DDR4-2666, each capable of transferring about 21 GB/s of data, for a total of 128 GB/s. Compared to that, even 16-lane PCIe 3.0 is relatively slow at 16 GB/s.

While I don't have the figures to back it up, I believe the differences in latency (and by extension random access perf) are even more dramatic, and where the real performance advantages come from.
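The DDR4 numbers quoted above follow from a simple product, sketched here. It assumes the usual 64-bit (8 byte) data bus per channel and gives a theoretical peak only; the function name is illustrative.

```cpp
#include <cassert>

// Peak theoretical DDR bandwidth in bytes per second:
// channels x transfers per second x 8 bytes (64-bit data bus per channel).
double ddr_peak_bytes_per_second(int channels, double megatransfers_per_second) {
    assert(channels > 0 && megatransfers_per_second > 0.0);
    return channels * megatransfers_per_second * 1e6 * 8.0;
}
```

For DDR4-2666 that is about 21.3 GB/s for one channel and about 128 GB/s for six, matching the figures in the comment.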

The bus is fast but Optane is slow. It doesn't even saturate 4x PCIe with 512GB Capacity. So the final Optane DIMM on 6 Channel Memory may only be about 21 - 24GB/s.

But it's only going to get better from then on; it's a first revision of a product with a novel tech inside it. The first SLC SSDs had abysmal performance numbers and density by today's standards.

> Why not a 16GB version, didn't Intel even launch Optane with those small sizes?

16GB is a single 3D XPoint memory die. The DIMMs need to use more than a few dies to support the throughput that people expect from their memory bus. The same is actually true of DRAM; if your current memory modules only had one DRAM die each you not only would have 1/8th the memory capacity, but your memory bandwidth would be annoyingly small as well.

They could pull an i5 and just disable half the modules.

My understanding is that all these have in common with RAM is sharing the DIMM connector. There are a number of technologies that use DIMM connectors that are not RAM, for example, the Raspberry Pi Compute Module[0] uses a SO-DIMM connector to save space and money by not having to break out IO nor provide any sort of connectors.

[0] https://www.raspberrypi.org/products/compute-module-3/

It's not just the connector. The electrical interface is DDR4, and it's accessed through the CPU's DRAM controller as part of the existing memory hierarchy. It's just that the memory addresses corresponding to the Optane DIMMs will be slower than the addresses corresponding to DRAM.

New CPUs will be required, and if there's a technical justification for that it will probably be that accessing an Optane DIMM requires timings that are far outside the normal range for DRAM modules that the existing memory controllers were designed to accommodate.

It's also integrated with new instructions to flush things and wait for them to be made persistent.

Intel actually backed off on those plans, and now any regular cache flush will suffice. There are still some new cache flushing instructions, but they merely offer performance enhancements, not stronger memory safety guarantees.


Does this mean that, with a DIMM-to-LAN interface and cloud storage backing, one could literally download more RAM from the internet?

With the current marketing pitch, this seems to target enterprises/cloud providers for in-house DB solutions. It might be too much for regular consumers to understand its benefits.

I can already imagine storage systems where the blocks are just written to memory before acknowledging as a synchronous commit, and reads occurring completely from this cache or for one direct hop from this cache. It's going to be amazing in the future as this tech matures. A persistent RAM will change a lot of our architectures for the better.

This is already done in enterprise storage systems, they just use a non-Optane persistent journal technology.

NVRAM solutions exist today that are just DRAM with a supercap and/or battery plus a flash SSD. Or PCIe cards. Or ordinary RAM plus specialized firmware that kicks in on brown out and copies RAM out to SSD.

Optane looks neat, but it's kind of expensive and has relatively low write endurance compared to ordinary DRAM (from an enterprise filesystem journal perspective).

That's pretty much how SANs used to work.

The idea goes back a long time. There used to be a product called Legato Prestoserve that used battery-backed SRAM. It was available in the early 1990s, maybe late 1980s.

Here's some documentation about the version of it that Sun Microsystems licensed and sold:


It says:

"Each NVSIMM contains memory, a battery, and power controller circuitry, which ensure that the memory is not lost when the system is shut-down or halts because of an abnormal condition."

"Synchronous write requests to disk are intercepted, and the data is stored in non-volatile memory"

ah ... ye olde NVSIMM!

If you're referring to NetApp's use of NVRAM, to take in writes, and acknowledging the IO before actually writing to disk, it is not that simple. They do NVRAM mirroring to save those writes to another partner's NVRAM for HA. This is a network tax on core IO path. If there is a mature Optane DIMM, it may not be necessary to do the network dance in IO path.

I sincerely hope no storage company acknowledges a write after just writing to volatile main memory. That's a recipe for disaster when a node goes down.

> I sincerely hope no storage company acknowledges a write after just writing to volatile main memory. That's a recipe for disaster when a node goes down.

Sorry to break this to you, but this is exactly what every major storage product does. Data will be mirrored to separate DDR behind and then good status is given to the host. The data will be destaged at some point later when cache space is required for some other operation.

The data itself is safe as long as the battery backup (or capacitor for smaller systems) is charged enough to handle a power outage. The storage system knows the battery levels and may not allow a write cache if there isn't enough supplemental power to destage the full write cache in the event of a power loss.

This is just mixing up the issues, same as saying "what if the HDD that acknowledged the write suddenly catastrophically failed"?

I don't know if it's SAN specific, but while waiting for XPoint, several vendors went to battery-backed DRAM. They confirm writes when in RAM, then whenever power is lost all RAM is dumped to flash, then loaded back up on boot:


Fortunately XPoint should solve this.

We never really worried about write endurance with dynamic RAM, but it would seem to be a lot busier than disk I/O at least in some server applications.

We need reviewers to report a new metric: time to failure at continuous max write throughput.

For some SSDs, it's under a week.

The write to failure is usually reported in the drive specs. https://www.intel.com/content/www/us/en/products/memory-stor... See Endurance Rating (Lifetime Writes)

"Endurance Rating (Lifetime Writes) 41.0 PBW"

"Sequential Write (up to) 2200 MB/s"

41e15 / 2.2e9 /60/60/24 = 215 days sequential write to failure

"Mean Time Between Failures (MTBF) 2 million hours"

2e6 / 24/365.25 = 228 years MTBF

So it seems the MTBF is being stated at 0.26% average write utilization.

[corrected math]

0.26% * 2200 MB/s = 5.72 MB/s for 24/7/365 which seems about right for most users.
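For anyone who wants to re-run the numbers, here is a small Python sketch of the same arithmetic. The inputs are the three spec values quoted above; the variable names are mine, and reading "utilization" as wear-out time divided by MTBF is the same simplification the comment makes.

```python
SECONDS_PER_DAY = 60 * 60 * 24
HOURS_PER_YEAR = 24 * 365.25

endurance_bytes = 41e15   # "41.0 PBW" endurance rating
seq_write_bps = 2.2e9     # "2200 MB/s" sequential write
mtbf_hours = 2e6          # "2 million hours" MTBF

# Time to use up the rated endurance at full sequential write speed.
days_to_wear_out = endurance_bytes / seq_write_bps / SECONDS_PER_DAY

# The MTBF figure expressed in years.
mtbf_years = mtbf_hours / HOURS_PER_YEAR

# Fraction of full write speed that would spread the endurance over the MTBF.
utilization = days_to_wear_out / (mtbf_years * 365.25)
avg_write_mb_s = utilization * seq_write_bps / 1e6

print(f"{days_to_wear_out:.1f} days to wear out at full speed")
print(f"{mtbf_years:.1f} years MTBF")
print(f"{utilization:.2%} utilization, about {avg_write_mb_s:.1f} MB/s average")
```

Without the intermediate rounding to 0.26% the average comes out slightly under 5.72 MB/s, but it is the same ballpark.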

A 750 GB drive assuming you want to store the data for 48 hours can only write (750 GB/48/60/60*1000) = 4.34 MB/s on average. Dropping that to even 1 hour still gives reasonable lifetimes.

That's not how MTBF works. It's not how long you expect the drive to last. The number is how many failures on average you get for the number of service hours in use. It's really only useful in aggregate.

For a MTBF of 2 million hours, that means that on average, if you have one thousand drives, then you should expect one drive failure every 2000 hours, or 83 days (1k * 2k hours = 2M hours)
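The fleet arithmetic as a short Python sketch. It assumes a constant, independent failure rate per drive, which is the assumption baked into a single MTBF number; the function names are just illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365.25


def hours_between_failures(mtbf_hours: float, drives: int) -> float:
    """Expected time between failures across a fleet, assuming a constant failure rate."""
    return mtbf_hours / drives


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year under the same assumption."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


fleet_hours = hours_between_failures(2e6, 1000)
print(f"one failure every {fleet_hours:.0f} hours ({fleet_hours / 24:.0f} days)")
print(f"per-drive annualized failure rate: {annualized_failure_rate(2e6):.2%}")
```

Seen this way, a 2 million hour MTBF is a per-drive failure chance of well under 1% per year, not a promise of centuries of service.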

MTBF is what you need to take a look at for a mechanical system, if you are concerned about service intervals (in a HA cluster style setup, where you can handle fixing after break down, without downtime), and for e.g. continuous operation of optical disc and magnetic tape drives, as those wear out over time (though even laptop BD-R drives reach 1 year active spindle MTBF). There it determines how often you have to go to the system and swap faulty drives, to have at least x% working, and over how much time you can budget the CAPEX.

Of course this breaks down at an MTBF of over 50 years, as those rarely mention exotic failure modes, and don't actually have an MTBF of 50+ years over the life, but an annualized failure rate corresponding to 50+ years MTBF, measured over the first couple of years or even the first year. For non-wet-electrolytic-capacitor-using computing, one can calculate a temperature-dependent MTBF in the 5-25 years range, mostly depending on how badly the chips are hit by electromigration and similar aging in the semiconductors. This is incidentally a reason why I miss clock speeds for different processors as reported by overclockers to at least in some cases extrapolate the life due to electromigration, as there is a formula with like iirc 2 parameters, which gives a temperature (and maybe voltage) dependent lifetime/MTBF for this semiconductor device. I'd likely strive for about 5 years MTBF on the processor, if speed is of concern and reliability/uptime not in the foreground.

> For a MTBF of 2 million hours, that means that on average, if you have one thousand drives, then you should expect one drive failure every 2000 hours

That assumes your drives fail with a constant independent probability, like nuclear decay events (Poisson distribution). The reality is more like https://en.wikipedia.org/wiki/Bathtub_curve .

MTBF is not a good metric for complex systems used under wildly-varying load conditions, but ... it's a metric.

For some applications that might be worth it.

You're not taking into consideration write amplification. Assuming 3x write amplification, the actual userspace writes would be 1/3rd.

Are you sure? A quick back of a napkin calculation seems to suggest that a fully saturated SATA3 connection would still be a measly 36 terabytes in a week.

It seems quite low for any practical purpose. I don't doubt that there probably are some tiny shitty drives that will conk out after a week like that but are there any reasonably popular drives like that?

The 500GB 970 Pro is rated for 2300 MB/s sequential writes and 600 TB write endurance. That's about three days to exhaust the write endurance. Latest high end SSD model from the leading manufacturer. Not that it could actually come anywhere near sustaining that throughput for three days straight.
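A quick Python check of the three-day figure, using only the two ratings quoted above (decimal units assumed, so 1 TB = 1e12 bytes and 1 MB = 1e6 bytes; the function name is illustrative):

```python
def days_to_exhaust(endurance_tb: float, write_mb_s: float) -> float:
    """Days of continuous writing needed to use up a rated write endurance."""
    seconds = (endurance_tb * 1e12) / (write_mb_s * 1e6)
    return seconds / (60 * 60 * 24)


days_970_pro = days_to_exhaust(600, 2300)
print(f"{days_970_pro:.1f} days")
```

As the comment says, this is a lower bound in practice, since the drive is unlikely to hold its rated sequential speed for that long.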


The perf isn’t really that good for extended periods.


This was a drive by a top brand NAND manufacturer.

What about nvme/pcie drives? They can push considerably more data.

I am guessing that this now means that we also have to encrypt our devices in DIMM, especially if it is persistent.

There are lots of papers out there discussing the possibilities of non-volatile memory, e.g. Let’s Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems (Arulraj, Pavlo, Dulloor, 2015)[1]. However, they all seem to work with "simulated" NVM. It will be interesting to see how these real units compare to the simulations.

[1] https://www.cs.cmu.edu/~jarulraj/papers/2015.storage.sigmod....

So with 512GB DIMMs we could fit up to 8 TB of Optane, and use it like an in-memory DB with persistence? I wonder what performance improvement I would get on a Postgres DB compared to RAM + NVMe SSD.

A lot of the overhead there might come from Postgres (e.g. building temporary in-memory indices for hash-joins, because it's assuming random access to the tablespace is slow.) Ideally you have a DBMS that already understands that it's running on nonvolatile memory, and so doesn't have separate "in memory" and "on disk" formats for its data.

Maybe something like Aerospike? (don't know much about it but I've heard it's "for" that)

Or some LMDB-backed database, sure. With LMDB you'd essentially map the entire address space of the disk, then just persist pointers. Writes would still not be as performant as you would hope, but random read performance would be in the small handfuls of nanoseconds per.

Just running Postgres on an NVDIMM, or any other database for that matter, will give you a large speedup compared to normal storage, especially when you have strong durability requirements for a transaction (fsync() will be faster).

But to fully benefit from persistent memory, the DBMS will need to be modified. To see how that might look, read this [0] post by Microsoft about their efforts in SQL Server.

There's also an interesting research paper from CMU [1] that talks about challenges associated with pmem in the context of databases.

[0] - https://blogs.msdn.microsoft.com/bobsql/2016/11/08/how-it-wo...

[1] - https://www.cs.cmu.edu/~jarulraj/papers/2015.storage.sigmod....

Running Redis would be pretty fun on this.

How would Optane-on-DIMM affect the performance of relational databases as compared to the NVMe variant of Optane?

NVDIMMs allow for a significantly different storage layer compared to a traditional block-based device [0]. This is mostly because it's on the memory bus, thus significantly reducing latency. See slide 8 in this presentation [1].

[0] - https://news.ycombinator.com/item?id=17195018

[1] - https://www.openfabrics.org/images/2018workshop/presentation...

As far as I know nvme is exclusively block based. So you get to write in 512 or 4K blocks. I believe the DIMM versions are byte addressable or at least they will be at some point in the future.

I'm pretty sure the smallest addressable unit of RAM in modern computers is a whole cache line (64 bytes).

Actually 64 bytes is the false sharing boundary - on modern CPUs addressing a single byte not in cache will pull down 128 bytes.

While being technically correct, that is still much smaller than 512 or 4K blocks.

AFAIK, the Optane disks are _byte_ addressable.

Not even DDR4 RAM is byte-addressable. DDR4 is typically burst-length 8 for 64-bytes per burst (although BL2 exists, I'm fairly certain all modern processors have settled on BL8)

I think you can still do non burst transactions and even set masks on byte granularity for writes. Classic processors probably don't do that, though.

I don't really know much about memory controllers, but being able to mask at the byte level seems like an important optimization. Without that, many writes will have to do a read first to then merge the read bytes with the dirty bytes.
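The merge in question is simple to state in code. This Python sketch is only a software illustration of what a byte mask expresses (the function name and the bit-per-byte mask layout are assumptions, not a description of any real controller): with a mask the memory can apply the merge itself, while without one the controller first has to read the old line to produce the merged result.

```python
def masked_write(old_line: bytes, new_data: bytes, mask: int) -> bytes:
    """Replace byte i of old_line with new_data[i] wherever bit i of mask is set."""
    if len(old_line) != len(new_data):
        raise ValueError("line and data must be the same length")
    return bytes(
        new if (mask >> i) & 1 else old
        for i, (old, new) in enumerate(zip(old_line, new_data))
    )


merged = masked_write(b"AAAA", b"BBBB", 0b0101)  # only bytes 0 and 2 are dirty
```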

The capacitors inside of DRAM cells are so small, that the very act of reading the DRAM cell obliterates the data. I'm not kidding.

The "Full procedure" of reading a DRAM cell is:

1. Row-Address -- Load a "row" (usually 1k to 8k. DDR4 is 8k IIRC) to the sense amplifiers. Sense-amplifiers can indefinitely hold data, but there's relatively few of them.

2. Column Address -- Once loaded, you talk to the sense-amplifiers.

3. Precharge -- You begin to move the data from the sense-amplifiers back to the DRAM cells. Again, step #1 obliterated the data, you have to write it back regardless.

4. Row-Address -- After the old data is loaded, you send it back.

So regardless, you have to Read-then-write EVERY time. In fact, DDR4 has faster write-speeds because you don't have to do the read step if you are only writing.
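Here is a toy Python model of that sequence. It follows the simplified description in this comment rather than a datasheet (the class and method names are invented, and real DRAM timing and restore behaviour differ in detail), but it shows why opening a row forces a write-back and why a pure write can skip the column read.

```python
class DramBank:
    """Toy DRAM bank: opening a row destroys the cell contents until written back."""

    def __init__(self, rows: int, cols: int):
        self.cells = [[0] * cols for _ in range(rows)]
        self.sense_amps = None   # holds the currently open row, if any
        self.open_row = None

    def activate(self, row: int) -> None:
        """Row address: move a whole row into the sense amplifiers."""
        if self.open_row is not None:
            raise RuntimeError("precharge before opening another row")
        self.sense_amps = self.cells[row][:]
        self.cells[row] = [None] * len(self.sense_amps)  # the read wiped the cells
        self.open_row = row

    def read(self, col: int) -> int:
        """Column address: talk to the sense amplifiers, not the cells."""
        return self.sense_amps[col]

    def write(self, col: int, value: int) -> None:
        """A write just overwrites the sense amplifier contents."""
        self.sense_amps[col] = value

    def precharge(self) -> None:
        """Write the (possibly modified) row back and close it."""
        self.cells[self.open_row] = self.sense_amps
        self.sense_amps = None
        self.open_row = None


bank = DramBank(rows=2, cols=4)
bank.activate(0)
wiped = bank.cells[0][:]      # cells hold nothing while the row is open
bank.write(1, 7)              # no column read needed for a pure write
bank.precharge()
restored = bank.cells[0][:]
```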

While this is true, the point was that it's an important optimization to avoid doing RMW at the memory controller level. If you did it there, it would cost tens of nanoseconds. Doing an on-die refresh in parallel with the write is almost free.

I think with both DDR3 and DDR4, the number of bits in the row address depends on the DRAM density.

That was the goal yes but currently via nvme you can only write at a block level. Presumably the ram interface will be byte addressable and driverless.

I'm curious about this, could you point at the documentation?

Which part? The NVMe spec is at https://nvmexpress.org/

Intel doesn't publicly share the full specifications documents for their SSDs any more, just the 2-page product briefs. And the news article contains all the official information that's public so far about the Optane DIMMs.

So it's like putting an SSD in your RAM slots? But what about speed, and especially amount of write cycles?

I believe the whole idea, beyond pure speed of RAM vs SSDs, is they are promoting a programming 'model' where you bypass the operating system I/O (transiting via pages/blocks) and instead directly read/write data (via bytes) to the memory from the application. Which could be useful for particular usecases / subsets of data.

They mention write cycles in the article...

> The existing enterprise Optane SSD DC P4800X initially launched with a write endurance rating of 30 drive writes per day (DWPD) for three years, and when it hit widespread availability Intel extended that to 30 DWPD for five years. Intel is now preparing to introduce new Optane SSDs with a 60 DWPD rating
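To translate DWPD into total bytes written, here is a small Python sketch. The 750 GB capacity is an assumption taken from the drive size discussed earlier in the thread, and the function name is illustrative.

```python
def lifetime_writes_pb(capacity_gb: float, dwpd: float, years: float) -> float:
    """Total rated writes in petabytes for a drive-writes-per-day rating."""
    total_bytes = capacity_gb * 1e9 * dwpd * 365 * years
    return total_bytes / 1e15


three_year = lifetime_writes_pb(750, 30, 3)
five_year = lifetime_writes_pb(750, 30, 5)
print(f"30 DWPD for 3 years: {three_year:.1f} PB")
print(f"30 DWPD for 5 years: {five_year:.1f} PB")
```

If the spec sheet linked earlier is for the 750 GB model, the five-year figure lines up with its "41.0 PBW" endurance rating.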

The idea is to drive CPU utilization as well.

For example, in the x86 world, pairing NVMe drives with moving portions of the application's writes to NVDIMM drives core utilization, say, in SQL, from 40% to 100%.

Even a single 8Gb DIMM can dramatically increase utilization and performance.

And literally half the bandwidth. NVMe means that you have to write the data, flush the caches over those ranges to ensure that it's actually in DRAM, then instruct the NVMe controller to read it back out of DRAM. With these drives you just write (and maybe flush), and it's done. And they probably have their own dedicated DRAM protocol controller.

And it might even be slightly better than halving the bandwidth needs, since swapping DRAM banks isn't free, so you might be saving on mildly thrashing your DRAM controller when you're using DRAM and the NVMe drive is trying to read at the same time.

Modern DMA is cache coherent (maybe except if you opt-out of it? I'm not even sure you can). It is still costly though.

That level of endurance is great if you've got wear leveling routines like a SSD has but not if writes to the same logical cache line hit the same physical block each time.

Does 3D xpoint have a need for wear leveling? It's a fundamentally different technology than NAND flash. The numbers stated are very close to what you get out of spinning rust in the enterprise space.

It does use wear leveling, especially since the quoted endurance is far lower than promised.

Nope, because the whole thing about Optane is that it is super low latency, like RAM, so yes, it’s slower to actually perform read-writes, but way faster than even an SSD disk for random reads, making it reasonably suitable as a RAM replacement.

It should also be noted for those saying "well the Optane ssd's are fast but not THAT fast" A large majority of the latency from those drives is the PCIe layer.

Yeah, I am really curious as to the actual read/write performance as well.

The write cycles are covered in the article, though without much detail as to how wear leveling works with this kind of setup.

Please measure the latency too.

So will regular RAM become "level 4 cache" and Optane exist on the other side of that?

Basically you run everything you can in memory and then just mmap() in the files you want to use?

But that's how we do it now. It's just that Optane persistent storage is attached to a different bus with much better latency and throughput. Probably categorically and disruptively better.

How will user-mode applications refer to their persistent data if not by a filesystem path? You gotta put access permissions on some kind of object that humans can copy-and-paste into their backup scripts.

I would say that it basically already is.

I found this PDF to be very helpful, explaining the general use case of persistent memory programming.


What's the API for these Optane DIMMs? Does the program decide what gets placed in them? Does the OS? (if so, how?)

To the OS, it's just memory and you can just mmap it. There are some filesystem abstractions built on top though, e.g. https://lwn.net/Articles/729770/
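Roughly what "just mmap it" looks like from an application, sketched in Python. A regular temp file stands in for a file on a persistent-memory-backed filesystem, so this only shows the programming model, not the performance; the helper name and file name are made up. Real persistent-memory code would more likely be C using a library such as PMDK and CPU cache flushes rather than the `msync`-style `flush()` used here.

```python
import mmap
import os
import tempfile


def open_region(path: str, size: int) -> mmap.mmap:
    """Map `size` bytes of the file at `path` as shared, writable memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)        # make sure the file is large enough to map
        return mmap.mmap(fd, size)    # changes are carried through to the file
    finally:
        os.close(fd)                  # the mapping keeps its own reference


path = os.path.join(tempfile.mkdtemp(), "counter.bin")

# "First run": store a value with plain memory writes, then make it durable.
region = open_region(path, 4096)
region[0:8] = (42).to_bytes(8, "little")
region.flush()
region.close()

# "After restart": map the same file again and read the value straight back.
region = open_region(path, 4096)
value = int.from_bytes(region[0:8], "little")
region.close()
```

The data is still named by a filesystem path, so permissions and backup tooling keep working; only the read/write path changes from I/O calls to loads and stores.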

What about multithreading? An OS can read from an SSD then context switch to do useful other stuff until the data comes back. Once the OS issues a request to (relatively) slow memory, the only thing that can use that core are other instructions in the pipeline and other hyperthreads; i.e no other OS threads allowed.

The fastest NVMe SSDs are already right around the threshold where there isn't time to complete a pair of context switches before you get the data back, and the difference in latency between polling or waiting for an interrupt is significant. These Optane DIMMs should be fast enough that only hardware-managed context switches like hyperthreading/SMT are usable without performance loss.

Yes. But the Optane chips are fast enough for that to be only a minor nuisance.

See also the timeline of those PCM, based on chalcogenide glass, which were invented in the 1960s by Stanford R. Ovshinsky. He always insisted that it's better than DRAM. Finally we are getting there.


Theoretically it should be 1000x more durable than flash, and also 1000x faster, but they are not there yet. But it looks like they solved the packing problem. And Micron insists that it is chalcogenide based, but not "phase-change memory", the one they started in 2012 and took back in 2014.

PCM to Micron is an existing product/design that is NOR-Flash like but utilizes PC material as the storage medium.

3D XPoint is said to be able to stack the storage medium on the die and requires no access transistor -- which "PCM" has. So it is based on PC material but has some new kind of selector.

The 3D/stacking part is important for scaling/density -- even NAND-Flash has hit 2d limits and gone vertical.

Reading this thread, it seems like there are doubts about whether this particular product will really work like non-volatile RAM. Either way, it's an indicator that the real thing will be here soon. I can't wait to see how fast persistent memory will change the architecture of our computing environments. I think it will have huge effects that we can't foresee yet.

How does it compare to SanDisk ones? Is this going to enable HP's The Machine without memristors?

Now hibernation will be fast and usable. Stand-by will not be needed anymore...

I was on the outside vendor panel at the event (along with Oracle and redislabs) if anyone would like an engineering view of the technology.

Everything we want to know is exactly what you're not allowed to talk about.

No need for paging for the most part with these. Current virtual memory systems won’t be able to deal...can’t wait to see what’s next.

What databases are built with this kind of storage in mind already?

VoltDB comes to mind.

This interview from 2013 is still worth a listen imo!


Persistent Memory age is coming. Can we install NVDIMM into a desktop?

The product being announced is for data centers.

This is ideal for something like LMDB. Can't wait.

> Optane DC Persistent Memory DIMMs have twice the error correction overhead of ECC DRAM modules.

That is pretty impressive.

"640K ought to be enough for anybody." (apparently 4 people in HN hated this comment)

Oh, the security implications of this...

e.g. malloc owner PID x is now PID y

Throwaway account for obvious reasons. Was closely involved in the development of it.

To summarize, this product is shot, and is just hype.

I'll check the technical questions/gaps and answer or fill them in tomorrow.

Lol, looks like Linus actually inspired them. I mean with the whole DDR industry acting corruptly to keep prices sky high, this seems like a reasonable way to add some competition.

I assume you mean Linus from YouTube's "Linus Tech Tips", and not Linus Torvalds. Surprised you wouldn't clarify that considering what website this is.

Considering Intel's history, I wouldn't bet they direct their efforts based on a YouTube reviewer.

Also, considering Intel's very anti-competitive behavior (e.g. [0], [1], [2]), I am wary of stating Intel entering the DDR industry will make it any less corrupt.

[0]: https://www.theverge.com/2014/6/12/5803442/intel-nearly-1-an... [1]: https://www.wired.com/2009/12/ftc-sues-intel-for-anti-compet... [2]: https://www.youtube.com/watch?v=osSMJRyxG0k

> Intel entering the DDR industry

Eh. https://en.wikipedia.org/wiki/Intel_1103

RAM was their 1xxx product line, EPROM was 2xxx, microprocessors 4xxx (later 8xxx, what with 8 glorious bits of data bus ...)

Are you talking about the LinusTechTips guy? I watched the recent video where he was using Optane and the results didn't seem like a major competitor to RAM unless I was misjudging it. Of course, it could be in that 1:1 RAM wins out, but Optane can put more memory within the space so that 1:1 becomes 16:1 or such at which point it wins out, not sure.

I think the main point is that optane can put more memory within your budget, and for some use cases it's a semi-suitable replacement for ram

There's no reason to believe this won't be sky-higher than current DDR4 RAM prices.

Intel is saying it will be cheaper but they've raised prices before.

This has been on Intel's roadmaps for years. Nothing to do with LTT.

> Lol, looks like Linus actually inspired them

How so? Point to the quote?

LinusTechTips made a video recently talking about this.


It is quite unlikely that Intel was actually inspired by Linus of LinusTechTips. He's exaggerating for dramatic effect, just like the fake drama of not being able to find his RAM.

Persistent memory is something that Intel has been working on for at least 5 years (https://github.com/pmem), and given that that's the public face of the software side of it, they were likely starting to develop the hardware even earlier.

Totally thought GGP was referring to Linus Torvalds

No fucking way. The idea of persistent main memory to replace DRAM is older than Sebastian, and this kind of thing takes at least a decade to develop if you're lucky.
