Intel’s first Optane SSD: 375GB that you can also use as RAM (arstechnica.com)
325 points by xbmcuser 65 days ago | 144 comments



Somebody linked an "Intel to mislead press on Xpoint next week" article from SemiAccurate to another thread on the same topic. Interesting read and adds some context to the announcement. http://semiaccurate.com/2017/03/10/intel-mislead-press-xpoin...


By "broken" SemiAccurate seems to mean "only twice as fast, 10x lower latency, and 3x the endurance of NAND." So who is misleading whom?


Probably just marketing hype.

Some numbers:

Modern quad-channel DDR4 RAM: 50,000 MB/s

Optane SSD: 2,000MB/s

Also, nothing on the PCIe bus is going to have the latency of RAM as seen by the OS, unless it's mapped as RAM.


The initial marketing blitz was all about the performance of the memory itself. Putting it in an NVMe SSD hanging off the PCIe bus, instead of directly on the CPU's memory controller, takes away most of the performance benefit. Intel has hit its goals for density and missed its goals for endurance.


Using memtest86 on dual-core Opterons with DDR1 PC3200 ECC RAM, I measured about 2200 MB/s, I think.

I think you could get a lot of work done with 750GB of RAM (2 of these, or one coming later) and a dual-core Opteron setup.


DDR1 PC3200 single channel should give you 25.6 GB/s in theory.


Isn't the PC rating an upper limit on the performance of a DIMM? PC3200 means 3200 MB/s, while a single DDR3-1600 PC12800 DIMM would max out at a theoretical 12.8 GB/s.

Expand the tables here: http://www.crucial.com/usa/en/support-memory-speeds-compatab...


Err, I made a mistake; you are right. PC-x means a limit of x MB/s per channel. My brain read PC3200 as "DDR-3200", i.e. 3200 MT/s.

High-end server CPUs such as Intel Broadwell support quad-channel DDR4-2400 (PC4-19200), which is 76.8 GB/s, a lot more bandwidth than the Optane.


Well the Ars article says they're still promising 1000x lower latency so that seems pretty misleading. And that 3x endurance is vs. MLC flash; it still handily loses to SLC.

The SemiAccurate article looks very fair to me.


If you read the details of the article, there are some pretty open questions about whether XPoint really has 3x the endurance of NAND, or any higher endurance than NAND at all.

And the latency improvement seems to be 2x, not the 10x you claim, or the 1000x Intel originally claimed. (Intel's own numbers compare 20 microseconds for flash versus 10 microseconds for XPoint...) Which isn't nothing, to be sure, but...


It's weird that the article doesn't mention the actual, average latency of this device, after writing that Intel has promised a 1000x improvement in this area.

It mentions that latency is below 100 microseconds 99.999% of the time, but that's not more than 100x faster than a rotational HDD, albeit with less variance.


I'm skeptical. It must be hard for a one-time innovator like Intel to revert to the mean.


This is a big deal.

We've always made a distinction between memory and disk. Much of computer science is about algorithms that recognize the difference between slow, persistent disk and fast, volatile RAM and account for it somehow.

What happens if the distinction goes away? What if all data is persistent? What if we can perform calculations directly on persisted data without pulling it into RAM first?

My guess is that we'll start writing software very differently. It's hard for me to predict how, though.


To some extent, that change has already happened.

The RAM is no longer fast: unless cached, it takes around 150 CPU cycles to access the RAM.

The RAM is no longer byte addressable. It’s closer to a block device now, the block size being 16 bytes for dual-channel DDR, 32 bytes for quad channel.

Too bad many computer scientists who write books about those algorithms prefer to view RAM in an old-fashioned way, as fast and byte-addressable.


> the block size being 16 bytes for dual-channel DDR, 32 bytes for quad channel.

For most practical purposes on x86 computers, I believe the block size to consider should be at least a cache line, so 64 bytes.
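
A quick way to see it is a crude microbenchmark like the one below (my own sketch; the buffer size and stride are just assumptions for illustration). Touching one byte per 64-byte line is nowhere near 64x faster than touching every byte, because memory traffic is counted in whole cache lines:

    /* Crude sketch: strided access vs. full access over a buffer much
     * larger than the caches. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SIZE (256 * 1024 * 1024)  /* 256 MiB, larger than any cache */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        volatile unsigned char *buf = malloc(SIZE);
        unsigned long sum = 0;
        for (size_t i = 0; i < SIZE; i++) buf[i] = (unsigned char)i;  /* fault in all pages */

        double t0 = now_sec();
        for (size_t i = 0; i < SIZE; i++) sum += buf[i];        /* read every byte */
        double t1 = now_sec();
        for (size_t i = 0; i < SIZE; i += 64) sum += buf[i];    /* one byte per cache line */
        double t2 = now_sec();

        printf("all bytes: %.3fs, one byte per 64B line: %.3fs (sum=%lu)\n",
               t1 - t0, t2 - t1, sum);
        return 0;
    }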


You need a clean abstraction when describing elementary algorithms at an undergraduate level. A lot of work has been done on cache-oblivious algorithms that respect the memory hierarchy, including from the authors of the classic CLRS book.


Remember the "3M" computer ideal? Megahertz, megabyte, megapixel, for speed, memory, and screen, to make up a new class of workstation?

https://en.wikipedia.org/wiki/3M_computer

I now think the next target ought to be latency and power defined, in particular with IOT requirements.

Microseconds, Milliwatts, Millions of endpoints.

or, for storage, Microseconds, Millions IOPS, Multi-Parity

or, for compute efficiency and IoT goals:

Milliwatts as a constraint for a compute benchmark; millions of endpoints, processes, or cores on a network topology; microseconds as a generic measure of access to resources, so that whether the memory sits on another node or in local store, the delay stays within the same order of magnitude as a target.

My purpose in all of that is: should there be, in fact can there be, consideration of architectures and topologies that avoid... rather, can we aim for linear cost in addressing complexity at all, as a goal, or has that been lost already?

I mean "that" and "lost" very vaguely, being non expert, but my question is really should I imagine that there are no effective gains to be had, designing for a IOT style or massively networked future, from the way memory is addressed, or has the complexity that we have, been introduced out of necessity, so will be here to stay, for practical reasons, so the idea of low latency, low power, IOT "grid computing" on a ad hoc basis, is smoke in my pipe?


> It’s closer to a block device now, the block size being 16 bytes for dual-channel DDR, 32 bytes for quad channel.

Don't forget the 8-deep prefetch buffer. The real block size is more like 128 bytes for dual channel.


Differently is right. Every buffer over/underrun can now potentially overwrite your "persisted" data - now we're all writing file/storage/db systems .... ;-)

It'll be wonderful if we actually can get some simplifications in our byzantine architectures, although I wonder if it doesn't all boil down to our "RAM" (fastest storage) becoming "smarter", as CPUs manage the three levels of cache, and we "only" lose the "slow disk/SSD" on the far end. CPU cache becomes the new RAM, and cache management microcode becomes our new memory allocator...?


A buffer overrun can't overwrite random memory that isn't mapped into your process. We solved this issue with virtual memory like 35 years ago.


Kind of... We solved this issue like 35 years ago and then broke it again in the last decade. First, CPU timing attacks allowed peeking at secret key generation in shared environments. Then we got rowhammer, which ignores your mapping. Now we have memory-access timing attacks which allow probing the kernel memory mapping and bypassing ASLR.

While you can't easily write into other processes anymore, the barrier is more of a suggestion again.


"ignores your mapping" is pretty generous considering rowhammer doesn't work on ECC RAM.


From https://en.m.wikipedia.org/wiki/Row_hammer

> Tests show that simple ECC solutions, providing single-error correction and double-error detection (SECDED) capabilities, are not able to correct or detect all observed disturbance errors because some of them include more than two flipped bits per memory word

But anyway - even if ECC was perfect at preventing the issue, lots of hosts do not use ECC. That includes practically every desktop.


So you're saying there will be a web assembly version in a few years that will claim to be a huge advancement?


Yeah, just like that time microsoft invented lambda functions


Correct, but in practice memory has permissions on a page level and most pages today are 4K. So a page containing a buffer is also likely to contain other data as well. Unless you are going to isolate each buffer by placing it in its own page.
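
One way around that, at a cost in memory, is to give a sensitive buffer its own page(s) plus an inaccessible guard page. A minimal sketch of the idea (my own illustration; sizes are arbitrary):

    /* Place a buffer at the end of its own mapping, followed by a
     * PROT_NONE guard page, so an overrun faults instead of silently
     * scribbling over neighbouring (persistent) data. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);           /* typically 4096 */
        size_t want = 100;                           /* buffer we actually need */
        size_t rounded = ((want + page - 1) / page) * page;

        unsigned char *base = mmap(NULL, rounded + page, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return 1;
        mprotect(base + rounded, page, PROT_NONE);   /* guard page */

        unsigned char *buf = base + rounded - want;  /* ends right at the guard */
        memset(buf, 0, want);                        /* fine */
        /* buf[want] = 1;   <- would hit the guard page and SIGSEGV */

        printf("page size %ld, buffer at %p, guard at %p\n",
               page, (void *)buf, (void *)(base + rounded));
        munmap(base, rounded + page);
        return 0;
    }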


That doesn't make sense to me. So an OS will assign a 4K page of memory to multiple processes, with some granularity that each process "should" only be able to write/read some of it?


So your 1TB linked list/hashtable that holds all the mail server's email, and is conveniently persisted because RAM is the disk and everything on the heap is magically permanent, isn't at risk?


> Differently is right. Every buffer over/underrun can now potentially overwrite your "persisted" data - now we're all writing file/storage/db systems .... ;-)

Sounds like a great use-case for lmdb :)


Well, many of us are ready today for the new era, writing software differently: not optimizing for memory, throwing ACID databases out the window in favor of quickly hacked-together in-memory ones, not knowing any disk-based algorithms and data structures, etc.

/s


Aha, I found what I first came to say in this thread, here is the memory "driver / middleware" origins: http://www.scalemp.com/media-hub-item/scalemp-software-defin...

But ComodoHacker's comment made me think yet again: why was ACID compliance ever an issue?

I was reliving some nostalgia at the weekend, explaining to a friend how excited I got when Microsoft Transaction Server came bundled with NT.

That was my "free" ACID transactions.

It was cross-platform, then, too (or at least multi-arch, and advertised to play nice with things like IBM CICS, which may have been much of the point of the bundling: to win deals, even in checkbox tallies).

MTS is one of the few OS-level dependencies SQL Server has. But I think the cross-platform origin I recall was borne out when SQL Server for Linux appeared.

I have all this time been confused.

I get it, if you don't need ACID, and have other design reasons, go ahead.

But, well, I just always saw it as a problem solved, or plug-and-play solvable, thanks to MTS. Getting a book on MTS is what convinced me NT was serious and our business should take heed. If you're a small shop, have margins, can work with fat servers, or, as we did, fundamentally scale through transaction routing in the first place, NT (plus the variations of services for Unix, now for Linux) can be a happy place.

Incidentally, I believe the middleware driver that Intel is shipping with Optane is the work of ScaleMP.

This is their Flash Expansion description: http://www.scalemp.com/products/flx/

We've used their product very happily. It may be a good fit especially if you are staging oversize DBs you intend to shard, but need them up behind a connection while you test, which was our need.

N.B. Scale MP has a free tier which may also fit your needs.

Not meaning to shill, but I never know why I don't hear of them more; our experience was absolutely satisfying.

edit: typos; "free"/free edit: removed "very" from "may be a very good fit" about ScaleMP - felt so for us, but can't say why anything that works to meet specific need is qualitatively better when doing the job is a binary y/n..

And to add: if products like Optane eliminate the performance cost that has been used to justify discarding ACID transaction compliance, then will that not upset a few conceptual applecarts? I mean, I think some fast-and-loose arguments have been made around non-ACID compliance and data and databases generally over recent years, and certainly the performance trade-off argument has appeared to me to mask logic elsewhere that needs attention.


Those old enough will remember that the AS/400 (now called iSeries) computers map all storage into a single address space. You had no disk; you had just an address space that encompassed everything and an OS that dealt with that.


Just in case anyone else is wondering:

https://en.wikipedia.org/wiki/Single-level_store

I must admit that until looking into it more closely I figured it was no different from the memory map of a micro. Silly me.


What were the pros and cons of that design?


I never programmed for one myself, but I imagine that it's a very alien world if you see everything as a single vast address space of persistent objects. I imagine layers of software would make it less strange in order to facilitate porting programs to the platform - I remember IBM ported Node.js to it, so, there must be a way to see it as a more traditional machine.


Bugs become a lot more persistent, for one. "Reset it and see if it goes away" stops working nearly as well.


The end is near! Only Haskell and purely functional programming (persistent shared data structures) will save us from persistent bugs.


Considering how many odd state issues I see these days (more often in Angular 1.x apps than anything else), don't underestimate the potential for problems there.


Keep in mind that Smalltalk also persisted everything in what I think was called images, and it managed fine.


I already learned the joy of waiting for a Surface to drain in order to force a reboot. Forbidding shutdown/reboot without authentication on the network was the proximate cause of the wait. But not being able to remove power from a device that was not trusted, and could in theory be doing anything, was a sobering thought.

Should devices like tablets, with enclosed batteries, be required to have physical circuit breakers for power, as a security measure, if compromised?

We realized that we had to have a planned action to block MACs from the network, in response to any device having questionable integrity.

Even a small Faraday cage was considered; WiFi isn't the only radio on portables: laptops without removable batteries can hop across LTE via VPN... so that was another policy, to set a script to shut that down.

Persistence, in this case of processes, or just not being able to remove a battery, is a threat that shocked us with good reason, because it is so ubiquitous. I believe the moment an unscheduled reboot or shutdown occurs, at the very least all LAN/WAN access needs to be automatically cut.

edit:typo


Won't the OS just have a service to zero out all the RAM, certainly on shutdown for security reasons, with a performance hit in time and write cycles on your SSD? Same mechanism at boot?


That assumes that the OS didn't crash. It also assumes that the data isn't valuable, in which case why not just use faster, cheaper ram?

There are workarounds, like adding a layer of faster, cheaper ram, but then this starts to look like a big perf improvement in a rather traditional system.


And on startup, if anyone could have physical access to your hardware.


>What happens if the distinction goes away? What if all data is persistent?

A while ago (a few years?) there was some buzz about a new kind of machine from HP (maybe based on memristors? which were also in the news around then, IIRC), that was supposed to do away with that distinction. They called it The Machine or something like that (:-). I did not follow it at the time, after the initial read, so don't know what happened to it.


The results of that research are being made into a product now, managed by a new consortium named Gen-Z.

"BUD17-503 The HPE Machine and Gen Z" at Linaro Connect 2017

https://www.youtube.com/watch?v=1BVtChDQVyQ


Thanks, will check that out.


Keith Packard also did a number of talks at linux.conf.au over the years about various bits and pieces of The Machine.


Interesting, thanks. I recognized his name as one who worked [1] on the X Window system, from a book in an X/Xlib/Motif course [2] I attended much earlier.

[1] https://en.wikipedia.org/wiki/Keith_Packard

Googled his name and found this:

A look at The Machine [LWN.net]:

https://lwn.net/Articles/655437/

[2] On a side note, Motif was pretty powerful. I remember a colleague of mine who was also in that course, creating a rudimentary app like MS Paint, in the class, in just half an hour or so - as the instructor was teaching, in fact. (Maybe he had studied it some before, of course.) As he demoed it to us, he went "Foo!" :)


CMU has been conducting research in this area for a few years now.

http://www.pdl.cmu.edu/NVM/index.shtml


I don't know. I feel like instead of reducing memory tiers, Optane will only add another layer to the stack. It is still 10x slower than DRAM, which sounds quite significant. Besides, I'm not sure persistence is really the most important aspect of storage anyway, especially with modern always-on clustered systems.


Hear, hear. 12-factor has taken us out of the age where this should matter. Don't drag us back and tell me disk is as good as memory now.


Many of us have always made the distinction between cache lines and system memory. While this will probably be a welcome improvement it's not like all of your latency problems magically disappear.


I've thought about this, and I don't think that making all data persistent is necessarily a good idea. Programs tend to crash and corrupt their volatile memory. If you want some data to be reliable and permanent, you have to treat it as such.

The main obvious advantage of using non-volatile RAM across the board is that it won't consume power all the time, like DRAM.


"Persist all data" shouldn't be taken as a literal notion. It's better to think about decisions of persistence eventually moving from strongly hardware influenced to purely software- and problem- defined. Just as we've developed abstractions and tools at the persistence ("filesystem") layer now, like snapshotting, similar tools will evolve to take advantage of the new hardware constraints.


I'm not sure that the hardware has been that much of a constraint lately. Evolution, not revolution.


I mean, how long ago did we have 32 MB of RAM or less? Now we have 32 GB of RAM in personal computers. This is more of the same progression, right?


And roughly the same period of time further back (give or take), we had 32kB of RAM.


I think the title may be a bit far-fetched. They are talking about 60-100 microseconds of latency. DDR3 is usually around 9 nanoseconds of latency.


L2 cache is several nanoseconds. DDR is much slower.

This page says 100ns: https://gist.github.com/jboner/2841832

This one says 60ns: http://stackoverflow.com/q/4087280/126995


Got these 9 ns from Wikipedia [1]. But 60-100 ns would still be 3 orders of magnitude faster than this SSD.

I think the difference comes from the latency of the operation from the CPU's point of view vs. the actual memory/drive latency. In the article they seem to be talking about the drive latency.

1. https://en.wikipedia.org/wiki/DDR3_SDRAM#Latencies


That text is talking about the CAS latency, not the full read/write latency. CAS latency is the latency for reading from an open row. Rows are small enough that the only way you can persistently hit an open row is if you are either reading sequentially (in which case the CPU prefetch logic makes the memory latency go away completely) or working on such a small working set that you actually hit the cache all the time anyway.

On DDR3, the full latency of a miss to ram is, in a low-bandwidth scenario:

      latency of the full cache system, typically expressed as the L3 latency
    + latency of opening a row (Trcd)
    + latency for reading from an open row (Tcas)
    + latency of passing data from memory controller to L1 cache
In a situation with some but not anywhere near full load on the memory subsystem:

      caches
    + latency of closing a row (Trp)
    + latency of opening a row (Trcd)
    + latency of reading from an open row (Tcas)
    + latency of passing data to L1
and when doing full tilt random access:

      caches
    + time remaining waiting out the minimum allowed row active time of the previous memory access (Tras)
    + latency of closing a row (Trp)
    + latency of opening a row (Trcd)
    + latency of reading from an open row (Tcas)
    + latency of passing data to L1
The total latency is typically between 50 and 150 ns depending on the CPU and RAM in use.
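
For a feel of the numbers, here is a small sketch (my own; the DDR3-1600, 11-11-11 timings are assumed for illustration) converting those cycle counts into nanoseconds:

    /* Convert DDR3 timing parameters (in memory-clock cycles) into ns,
     * for the three cases described above. */
    #include <stdio.h>

    int main(void) {
        double data_rate_mt_s = 1600.0;             /* DDR3-1600 */
        double clock_mhz = data_rate_mt_s / 2.0;    /* DDR clock is half the transfer rate */
        double ns_per_cycle = 1000.0 / clock_mhz;   /* 1.25 ns at 800 MHz */

        int tCAS = 11, tRCD = 11, tRP = 11;         /* e.g. an 11-11-11 DIMM */

        printf("open row (tCAS):                  %.2f ns\n", tCAS * ns_per_cycle);
        printf("closed row (tRCD + tCAS):         %.2f ns\n", (tRCD + tCAS) * ns_per_cycle);
        printf("row conflict (tRP + tRCD + tCAS): %.2f ns\n",
               (tRP + tRCD + tCAS) * ns_per_cycle);
        /* The cache hierarchy and controller latencies above come on top,
         * which is how you end up in the 50-150 ns range. */
        return 0;
    }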


Or maybe 3x9 if you are doing wild random accesses over a big area, or even worse over a huge area where your TLB and then your caches stop holding enough and you effectively need multiple accesses for each single one. But you are right: DRAM is faster. However, caching Optane with real DRAM seems like a really good idea (or maybe using Optane as swap could be useful too).


Not shockingly, we've already moved beyond caring about local ram for many problems. Any system that is running distributed should be taking into account how long it takes to get data from across the network.

Similarly, the difficulty in large data problems is often more of getting the data in the first place. This is often about coordinating the data creation of many machines and systems.

So, for most of us, this shift will not be huge. Because it is mostly irrelevant. Even systems that do have this much data often benefit from algorithms that don't rely on directly accessing and processing all of it in one shot. If only for resilience. (That is, if a system failing doesn't mean reprocessing all of the data, restarting and recovery is often faster and easier to deal with.)


"As above, so below" comes to mind.

Sometimes it seems that the diff between a CPU and a cluster is the suffix put on the latency times.


I thought I read somewhere that the fastest networking technology actually gets down to a processing window of only a couple hundred instructions per packet.

http://netoptimizer.blogspot.com/2014/05/the-calculations-10...

Hm, 67 ns per packet.

So even this local storage would be the drag in many scenarios...


The mapping layer to treat Optane as RAM comes from ScaleMP. They have a free-tier product. For heavily read-intensive databases, a cheap read-optimized SSD can work pretty well with that. Now that I don't have to also buy a product license with Optane to do that, the price for the Optane drive looks much better to me.

Not sure drive is quite the right word, any more. Has a better nomenclature settled yet?


Computer science is not computer engineering. If anything, much of CS is pure theory, relying instead on an abstraction of the underlying computing device, the Turing machine.


RAM technology hasn't stood still. There's multiple levels of cache, wider channels, etc. There will be different levels of performance for a while now.


>My guess is that we'll start writing software very differently. It's hard for me to predict how, though.

This is a huge deal. SSDs started it. It always seemed wrong to me to address memory through a disk interface.

The main issue is that the re-design of software for permanent non-volatile storage is huge. It goes back decades, and there are many design decisions that need rethinking.


I may be wrong, but I am guessing that database software could be the first class of software to really take advantage of this shift.


There have been some outliers, though. I seem to recall Palm used something called "run in place" on their early generations. But then perhaps the storage on those devices was battery-backed RAM.


Well that should obsolete ORMs! There's no impedance mismatch when there's no RAM/Disk. We'd need more Disk (DB) components running for Persistent-RAM (indexing, transactions.)


You'd be surprised at the number of devs I know who have a hard time thinking in terms of memory mapping... the other issue is that upgrading structures to add/remove fields becomes much more difficult that way.


> Well that should obsolete ORMs!

Why? ORMs are essentially systems to convert data structures. Adding a new layer of memory does nothing to address the problem.


wow, and here I thought that SSDs were disks made of RAM


Originally Intel claimed that this new technology would offer 1000x shorter latencies and 1000x better endurance than NAND flash, and 10x better density than DRAM. The figures they're quoting now are more like 10x shorter latencies and 3x better endurance (compared with flash), and 2.5x better density (compared with RAM).

The article linked here says "3D XPoint has about one thousandth the latency of NAND flash" but I don't see any actual evidence for that. The paragraph that says it is followed by a link to actual specs for a "3D XPoint" device, saying: "the Intel flash SSD has a 20-microsecond latency for any read or write operation, whereas the 3D XPoint drive cuts this to below 10 microseconds." which sounds to me more like a 2x latency improvement than a 1000x improvement.

So I ask the following extremely cynical question. Is there any evidence available to us that's inconsistent with the hypothesis that actually there is no genuinely new technology in Optane? In other words, have they demonstrated anything that couldn't be achieved by taking existing flash technology and, say, adding some redundancy and a lot more DRAM cache to it?

[EDITED to add:] I am hoping the answer to my question is yes: I'd love to see genuine technological progress in this area. And it genuinely is a question, not an accusation; I have no sort of inside knowledge here.


> The article linked here says "3D XPoint has about one thousandth the latency of NAND flash" but I don't see any actual evidence for that. The paragraph that says it is followed by a link to actual specs for a "3D XPoint" device, saying: "the Intel flash SSD has a 20-microsecond latency for any read or write operation, whereas the 3D XPoint drive cuts this to below 10 microseconds." which sounds to me more like a 2x latency improvement than a 1000x improvement.

Keep in mind that this first Optane product is an NVMe SSD. The latency overhead of PCIe and NVMe is usually about 4 microseconds minimum, as measured by reading from an SSD that has just been secure-erased and thus doesn't have to actually touch the non-volatile memory in order to return a block full of zeros. This Optane SSD has a best-case latency that is only a few times better than the best case for NAND flash SSDs. This does not mean that the underlying 3D XPoint memory doesn't have a far bigger latency advantage when accessed directly by a capable memory controller, but 3D XPoint DIMMs won't arrive until next year.


The key here seems to be that performance does not drop off very much under load. DRAM cache hides latency, but only up to a point. Hiding latency is like a magic trick, and you can only hide so much before the trick starts to break down.


I'm from the NAND flash industry. There seem to be a few fundamental improvements in XPoint. For one, achieving a 3x endurance improvement while keeping the same process size (dimensions of the memory cell) is new.

That XPoint is byte-addressable is rather impressive, as the circuitry and metal layers (wires) needed for this are a lot more than for page-addressable NAND.

The true test is when they connect it directly via DIMMs rather than the PCIe bus. Latency numbers there may further prove fundamental improvements in the technology.


I think the byte addressability is a software layer from ScaleMP.

I have no idea, but I have some concern about how that might affect latency.


The performance consistency that Intel claims cannot be provided by caching. The characteristics of XPoint sound vaguely similar to SLC NOR flash, but it seems like Intel would be opening themselves up to liability by outright lying.


Good to know. Have they actually said anything about Optane that would be an outright lie if Optane devices were using already-existing hardware technology?


Yes, Intel has said that it's a resistive non-flash technology.


They've said that "3D XPoint" is a resistive non-flash technology. But one of the things that makes me suspicious is that they seem to be going out of their way not to say quite explicitly, in so many words, that these Optane devices are "3D XPoint" devices.

For instance, take a look here: http://www.intel.co.uk/content/www/uk/en/architecture-and-te... -- lots of stuff about Optane, always just called Optane. There's a link at the bottom to info about "3D XPoint" but no explicit statement of the relationship between the two.

Also at the bottom of that page, a link to a video called "Revolutionizing the Storage Media Pyramid with 3D XPoint Technology". OK then. What does it say? It talks about DRAM, flash and spinning rust; then it says Intel is introducing new things; first, "DIMMs based on 3D XPoint technology" (OK, but that isn't what they're releasing right now), and then -- these are the exact words -- "Intel Optane SSDs, based on 3D XPoint technology and other Intel storage innovations" (emphasis mine). Hmmmmm.

I really hope my cynicism is misguided. But so far, everything I've seen seems to be consistent with the following story: Intel begin by announcing a new hardware technology called "3D XPoint", which works quite differently from existing flash memory and has amazing performance characteristics, and saying they're going to release products based on it under the "Optane" brand. They work on this technology but can't actually get it to work. But they need to release something. So they make the highest-performance thing they can based on existing technologies, release it under the Optane brand, and tread super-carefully to make sure they never quite say, in so many words, that this thing they're releasing actually uses the new technology they talked about before.

Now, mtdewcmu and you both say that existing tech can't actually deliver the performance Intel say this new product has, in which case it must after all be based on something genuinely new. Again, I really hope you're right. Has this performance profile -- whatever features it has that are impossible to replicate with existing technologies -- actually been demonstrated, or only claimed?

[EDITED to add:] Aha, no, looks like I'm either too cynical or not cynical enough. I found an Intel webpage -- http://www.intel.com/content/www/us/en/solid-state-drives/op... -- that actually does say, in so many words, that the P4800X uses 3D XPoint. So the story two paragraphs up isn't consistent with their current marketing materials, and I now think the story is more likely "3D XPoint doesn't work nearly as well as predicted" than "3D XPoint doesn't work at all yet and they're fudging".


It's possible the technology actually is 1000x, but it would be really dumb of Intel to release it as such, instead of milking it for years slowly creeping to 1000x.


I think, like all tech, they were stating the best theoretical possibility, but it will take a few years for them to actually get there.


At a previous employer, we built a system using Druid as the primary store of reporting data. The setup worked amazingly well with the size/cardinality of the data we had, but was constantly bottlenecked on paging segments in and out of RAM. Economically, we just couldn't justify a system with enough RAM to hold the primary dataset. As a result, we had to prioritize data aggressively, focusing on the more recent transactions and locating them on the few servers with very high RAM that we did have. Historic data segments had to go through a lot of paging in/out of RAM. User experience on YTD (year-to-date) or YOY (year-over-year) reports really suffered as a result.

I don't have access to the original planning calculations anymore, but 375GB at $1520 would definitely have been a game changer in terms of performance/$, and I suspect it would have been good enough to make the end user feel like the entire dataset was in memory.


Make sure you're looking at updated prices for ram too. 16x16GB of registered ECC DDR3 is about the same price and enormously faster.


Sure, but I believe we were limited by the available chassis to a lot fewer than 16 slots.


Well the first google result for "1u 16 dimms" is a refurbished chassis+motherboard+PSU for a hundred bucks. Brand new costs more but not terribly so; the main cost is the ram whether you go 8 slots or 16.

These SSDs have situational uses but unless you want 10+ TB in one server you can get a system with >50% as much actual RAM for the same price.


It's not the cost. We ran standardized chassis, so whatever our ops had is what they had...


Would you choose to run with druid again?


For that use case, absolutely! We made do with the version that could not even support label appends (limited joins). The current version would let us avoid a lot of those workarounds.

The probabilistic hyperloglog data type is also a game changer compared to say redshift, but again it's only viable if you are dealing with counting (estimating) unique entities across billions of rows and super-wide dimension sets.

If you are doing a general purpose analytics store, Redshift is hard to beat because of reliability and ease of implementation.

Druid is a purpose-built race car. Redshift is a good crossover: far less headache, and it can do almost any job well enough, but you won't have the tuning or the performance (when tuned right) at scale. Although, I'm continuously impressed with what Redshift actually can do, despite the humble feature set.

Druid's main weakness is lack of SQL support, so it's not a great analyst datastore. You pretty much have to wrap it into a reporting app.


Hi sologoub - can you elaborate a bit on the tuning for Redshift you're referring to? What's the pain there? Asking because we're building a performance management product for Redshift, I'd love your input! lars at intermix dot io


What do you think of ClickHouse vis-a-vis Druid and Redshift?


Don't have any experience with that tech, but from reading the marketing landing page it sounds more akin to memSQL than Redshift, in that it seems to include options for streaming ingestion.

If I'm going to take on a similar project, I may POC memSQL or Citus DB, and possibly Big Query (if the project is built on Google Cloud as opposed to AWS or raw iron).


Yay! Nice to see these things becoming more real. The choice of U.2 is interesting; it might force wider adoption of that form factor.

This is definitely going to change the way you build your computational fabric. Putting that much data that close to the CPU (closeness here measured in ns) makes for some really interesting database and distributed data applications. For example, a common limitation in MMO infrastructure is managing everyone getting loot and changing state (appearance, stats, etc.). The faster you can do those updates consistently, the more people you can have in a (virtual) room at the same time.


The U.2 form factor has had no trouble catching on in the datacenter, which is where these first Optane SSDs are intended to be used. Intel is also offering standard half-height half-length add-in cards, and M.2 is simply too small to accommodate much 3D XPoint memory.


With drives like this, the approach of "throw more hardware at it" continues working for databases to the point where most database loads in the world can be handled on a single machine.


As someone who's worked with PostgreSQL for a decade now, I've very definitely seen cases where memory bandwidth is the bottleneck, not disk bandwidth. I'm pretty sure the IO subsystem would still be a narrower pipe than main memory, even if you stuck Optane devices in every available connector at the same time and striped across them.

EDIT: Don't mistake me, I'm very excited about the potential of Optane devices in the workloads my databases handle (though, since my postgres machines are all in the cloud right now, that remains a purely theoretical question). It's just not a panacea.

Then again, nothing is.


Scaling beyond what fits inside one computer is a reasonable concern that this addresses, but ultimately that one computer can be sucked up by a tornado at an inconvenient time, so distributed systems will always be necessary for availability and durability.


Resiliency, yes, but maybe not performance. Still a great win :)


"where most database loads in the world can be handled on a single machine"

I don't think it works like that. There is always some point when, at a certain number of users, avoiding downtime becomes more important than handling the load, and hardware performance stops mattering as much.


High-availability systems can counter this, and if properly implemented you could share certain workloads between the primary and HA systems; basically anything which doesn't change data.

Its memory and storage would be a boon to dynamic indexing and common table expressions.

Maybe this new technology will bring single-level storage, as IBM has employed on some machines, to the public, where all storage is treated as one resource and only the machine knows the difference.


The more powerful the computers, the more data we want to aggregate and process. All that tracking that companies are gathering has to be used somewhere.

Also SaaS means that one provider holds data of thousands of companies. Salesforce, for example, has more than 100,000 customers. So the scale is still too big for one machine (or two or three for redundancy).


Sure but Salesforce has natural sharding. Customers never need to see each other's data. So can a single DB hold the largest customer? That is more the question here...


> Customers never need to see each other's data.

Correct and incorrect at the same time. Separate orgs may not need direct access to each other's rows in the multi-tenant database, but there are plenty of use cases where data does need to be shared between them (companies with multiple orgs, like ours; business partners; etc.), and Salesforce has tooling to handle this (Salesforce to Salesforce, Lightning Connect).


There are faster drives. Load is just one factor in database design.

For example, if you want fast query returns, the speed of light will limit you if you're using one machine and your query comes from around the world. Then there's the whole high availability thing.

One single machine just doesn't fit every scenario.


I wonder what the performance of these is with PostgreSQL's pg_test_fsync, which is one of the proper tools to benchmark an SSD. I get 4000 IOPS with my Intel 320 SSD and 9000 IOPS with an Intel S3700. For comparison, the Intel 600p maxes out around 1500 IOPS, and the Samsung 750 Evo at 400 IOPS.
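
If you don't have a PostgreSQL install handy, a much cruder sketch of the same kind of measurement (my own, not pg_test_fsync; the path and iteration count are arbitrary) is to rewrite one small block and fsync() after every write:

    /* Count how many small synchronous writes per second a device
     * sustains: rewrite an 8 kB block and fsync() each time. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "/tmp/fsync_test.dat";  /* put this on the drive under test */
        const int iters = 2000;
        char block[8192];
        memset(block, 'x', sizeof block);

        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            if (pwrite(fd, block, sizeof block, 0) != (ssize_t)sizeof block) { perror("pwrite"); return 1; }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d fsync'd writes in %.2fs = %.0f IOPS\n", iters, secs, iters / secs);
        close(fd);
        unlink(path);
        return 0;
    }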


"Why IOPS Suck and Everything You Know About Them is Probably Wrong!" - https://www.youtube.com/watch?v=cEb270L5Q1Y


The smart thing that Intel is doing is making stuff that they know big cloud providers like AWS will pay crazy amounts for and buy in huge volumes. The "use it as RAM" angle is incredibly valuable, especially to bring down costs for databases. For example, running a database with an allocated 32GB of RAM is pretty expensive per month. And if somebody like AWS sold a cheaper DB instance version that ran from this drive as its memory (or was smart-paged), then that could bring down the cost of allocating huge databases to memory, with a performance hit that many people would be willing to take to save the money.


64-bit machines (given the right controllers, etc.) should be able to map the whole 375GB / 750GB directly into the memory space. You'd almost certainly need a kernel driver, similar to the way a graphics card is mapped into the address space but isn't treated as regular RAM. With the right driver you could just mmap() the address space.
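
From userspace that would look something like the sketch below (my own illustration; the device path is hypothetical, and a real setup needs a driver exposing the device as described above):

    /* Map a device directly into the process address space and access
     * it with ordinary loads and stores. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        const char *dev = "/dev/exampledev0";   /* hypothetical device node */
        size_t length = 1UL << 30;              /* map 1 GiB of it */

        int fd = open(dev, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        unsigned char *mem = mmap(NULL, length, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }

        mem[0] = 42;                            /* loads/stores go to the device */
        printf("first byte: %d\n", mem[0]);

        munmap(mem, length);
        close(fd);
        return 0;
    }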


Intel's strategy here is actually to continue using the standard NVMe interface for Optane SSDs, and to offer a hypervisor that does the job of presenting to the guest OS a pool of memory that is backed by a combination of DRAM and 3D XPoint.


The big limitation of this device is wear. "Optane SSDs can safely be written 30 times per day", says the article. That implies a need for wear monitoring and leveling. Although you can modify one byte at a time, the need to monitor wear implies that just memory-mapping the thing isn't going to work.

Wear management could be moved to hardware, though, using a special MMU/wear monitor/remapping device. If you're using this thing as a level of memory below DRAM, viewing DRAM as another level of cache, something like that would be necessary. That's one application.

This device would make a good key/value store. MongoDB in hardware. Then you don't care where the data is physically placed, and it can be moved for wear management.


From the article:

> This gives the drives much greater endurance than NAND of a comparable density, with Intel saying that Optane SSDs can safely be written 30 times per day, compared to a typical 0.5-10 whole drive writes per day.

I understand this as: to get a normal lifespan out of traditional SSDs, you can get away with writing enough data per day to completely rewrite the device 0.5 to 10 times, whereas with the new device it's 30 times.

The new technology is 3 times better than SSDs with regards to write durability.

I think the article would be better worded if it didn't leave out the most important number: 10 rewrites per day for how long before either the SSD or the new thing fails?
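
The arithmetic behind such a rating looks roughly like this sketch (my own; the warranty period is exactly the missing number, so it is left as an assumed parameter):

    /* Drive-writes-per-day (DWPD) back-of-the-envelope. */
    #include <stdio.h>

    int main(void) {
        double capacity_gb = 375.0;    /* the P4800X in the article */
        double dwpd = 30.0;            /* Intel's claim quoted above */
        double warranty_years = 5.0;   /* ASSUMED: whatever Intel actually specifies */

        double tb_per_day = capacity_gb * dwpd / 1000.0;
        double total_tbw  = tb_per_day * 365.0 * warranty_years;

        printf("%.0f GB at %.0f DWPD = %.2f TB written per day\n",
               capacity_gb, dwpd, tb_per_day);
        printf("over %.0f years that is about %.0f TB of total writes\n",
               warranty_years, total_tbw);
        return 0;
    }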


It doesn't say that the wear leveling is not present inside the box already. It could mean rewriting the whole storage 30 times a day effectively.


If writes can be done at the byte level, the bookkeeping and indirection for byte level wear leveling would take more space than the data. If wear leveling is required, it has to be done on blocks of some reasonable size. They might be smaller than for flash, though.
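
A rough sketch of that size argument (my own illustration; block size and device size are assumptions): a remap entry plus a write counter per wear-leveling unit is negligible at 4 kB granularity and absurd at byte granularity.

    /* Compare metadata overhead for block-level vs. byte-level wear
     * leveling, using one remap entry + write counter per unit. */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096u              /* assumed wear-leveling unit */
    #define NUM_BLOCKS (1024u * 1024u)    /* 4 GiB device for the example */

    struct wl_entry {
        uint32_t physical_block;          /* where the logical unit lives now */
        uint32_t write_count;             /* how many times it has been written */
    };

    int main(void) {
        double device_bytes   = (double)NUM_BLOCKS * BLOCK_SIZE;
        double per_block_meta = (double)NUM_BLOCKS * sizeof(struct wl_entry);
        double per_byte_meta  = device_bytes * sizeof(struct wl_entry);

        printf("device size:        %.0f MiB\n", device_bytes / (1024 * 1024));
        printf("per-block metadata: %.1f MiB (%.3f%% of the device)\n",
               per_block_meta / (1024 * 1024), 100.0 * per_block_meta / device_bytes);
        printf("per-byte metadata:  %.0f MiB (%.0fx the data itself)\n",
               per_byte_meta / (1024 * 1024), per_byte_meta / device_bytes);
        return 0;
    }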


What's the endurance like if you actually use it as ram? How many times can you do a "label1: inc [rax]; jmp label1" loop having rax pointing to a particular byte address on the SSD? (With GHz CPUs, wouldn't that mean giga-writes per second? Isn't NAND rated for 10k-100k writes total, and if this is rated for 1000x more than that, you'd still hit a 10m-100m total writes in like a second?)


I wonder why no one has mentioned applications of this technology in mobile phones. The most obvious case: it would be possible to bring the entire system out of hibernation while the user is pulling the phone from their pocket and pressing the wake button. The power and computing cost of hibernating/restoring the system would be only slightly north of zero, which would lead to a dramatic increase in battery life.

Not to mention the size factor and lower power consumption of the chip itself.


Meanwhile, the Raspberry Pi is using a garbage microSD controller that corrupts cards… the divide between cheap and high end stuff is just mindblowing these days.

Also, normal (NAND) NVMe M.2 SSDs are still TWICE as expensive as good old SATA ones, at least in my country… And they want to push Optane into the consumer space later this year… who even needs that much performance at home?


> who even needs that much performance at home

In the next ten years (closer to ten than not) we'll need it for massive and/or hyper-intricate 4K VR worlds, whose assets you won't want to download every time you load up the world(s). You'll want to hold as many of the assets as possible locally in something extremely fast; if not RAM, then the next best thing. That is, until we commonly have 10 Gbps+ to the home.


RAM and VRAM these days are huge, and SATA is definitely fast enough to stream assets from. Heck, I load most of my games from an old 2.5" spinning HDD that I pulled from a dead laptop.

I think more improvement is needed on the VRAM side than on the disk side, so HBM2 hype > Optane hype :)


Nothing new. Just look at a PDP-11 tape drive vs the cassette used with early micros.


It would be useful to move some SQLd indices to an Optane drive with DB rows on arrays of m.2 flash drives. SQL consolidation/redundancy over network out of band, slowly. This to scale somewhat cheap OSS SQL clusters.

This starts to make sense in a single server node with enough x4 PCIe interfaces. AMD Naples?


So finally we're going to get from Intel what HP promised years ago as "memristors".


Memristors theoretically go further than mere storage. They enable the idea of data storage and processors eventually becoming the same thing. I hope they continue with the work because some of what I was reading about it was fascinating - rather than have caches, you could, in theory, write temporary routines that ran in physical locality to the data they had to process.


You can already do that, it's just that dumb storage can be packed more densely than smart storage.

Memristors can make things more dense, but that applies to both options.


I tried to look up modern memory latency numbers but could not find any, because the "numbers every programmer should know" are about 10 years old, as I understand it.


It feels like Intel are putting this out in lieu of actual hardware. As always remain skeptical until real silicon appears on the market.


Isn't the article about real silicon appearing on the market?

"Initial limited availability starts today, for $1520"


The "limited availability" is almost certainly for those with per-arranged deals with Intel and 2H is almost always Q4. It's not to say this isn't an interesting product but I've a feeling Optane won't live up to the hype. But I'm reserving judgment until it gets into the wider market/ has more real world exposure.


If by market you mean someone not Intel. They probably sold all their capacity for the year to certain key customers, and possibly even with the parts not meeting spec (some companies won't care if they still see a boost), and the small guy will not see it until next year.


Seems like this would be ideal in cases where you are waiting for file sync in multiple locations which I assume a lot of banks/corporations do.

Interesting it seems to be marketed as cheaper memory. You'd think at first they'd try and rip super high margins out of banks/corps by selling it as "persistent" memory.

Although I guess if you're waiting for file writes in multiple locations, the network overhead makes the actual write sort of irrelevant...


I would opt for a more conventional solution: you can get 256GB of RAM for under $2k and then enable write caching.


/dev/null is also pretty fast

But for use cases where data matters, fsync seems like a reasonable idea.


You can also attach batteries to the RAM. This is already a solution people ship in production.


Any references? I tried to find something but I failed. I'm assuming we are talking about something different than UPS + going to sleep? It seems very tricky to just keep RAM powered.


They are called NVDIMMs. One example http://www.vikingtechnology.com/arxcis-nv

I think some of them are protected by both a battery and flash memory for long term backup.


Storage Search is the best rabbit hole on the subject of RAM SSD that comes to mind. http://www.storagesearch.com/ssd-ram.html

I mean, how else would you get 80 MB/sec? http://web.archive.org/web/20010613164621/http://www.mncurti...


Intel has a reference design for just keeping RAM powered but it's only used in storage controllers, not in any servers you can buy.


Check out Adaptec RAID controllers. Notice how these guys have an extra battery pack so that during a critical power failure the large (2GB?) cache is preserved.


Is this based on memristors? That's pretty amazing. I thought HP owned that.


Ramdisks are back. This time persistent.


I was under the impression that this is the opposite. It's a 'disk' (RAM disks are just storage that can store files, so I'm pretending disk means it stores files) like an SSD that's fast enough to act like RAM?



