The Intel Optane SSD DC P4800X (375GB) Review: Testing 3D XPoint Performance (anandtech.com)
134 points by randta on Apr 20, 2017 | 89 comments

Aerospike CTO here. We did a lot of testing with this drive, and the interesting bit is that performance doesn't degrade under write pressure. With a NAND drive, when you push writes high ( usually needed in front-edge / microservice apps ), the read latencies take a real hit, and often get into millisecond-average at the device. Optane simply doesn't behave that way.

All of that means you have to code to it differently. It isn't DRAM, it's not Flash. We've done the work to integrate and test at Aerospike ( semi-plug sorry ), so the system works.

The XPoint tech gets really interesting later in the year, when three things happen: the higher density drives show up, Dell ships their high-NVMe chassis, and more clouds support it.

Regarding cloud - IBM's BlueMix has announced Optane, and the other providers have a wide variety of plans. I can't comment more.

Finally, Intel has been clear about putting this kind of tech on the memory bus, and that really opens the doors to some interesting choices, some data structure changes. That requires new coding, we're on it.

Here's an interesting ComputerWorld article about our experience with Optane: http://www.computerworld.com/article/3188338/data-storage/wh...

> All of that means you have to code to it differently.

> That requires new coding, we're on it.

Can you give a rough example to guide my thinking? I understand OSes are considering ways that this has both traditional disk and RAM properties, and are sorting out storage subsystems of drivers, but I assume you're talking about something closer to user-level structures and algorithms?

Yes, I think there are clear user-level approaches you can ( and must ) take.

There are some interesting talks out there about NAND, how it works, and how to optimize - I saw something here on HN a few days ago about writing your own time-series database, which got a variety of the facts wrong but was an example of how to choose data structures that are NAND-reasonable. You can look up some of my YouTube talks and slideshare, for example - I've been talking about this for a while.

At a high level, NAND has more IOPs than god, because it doesn't seek. An old enterprise spindle literally does 200 to 250 seeks per second, while Flash can read from 500,000 different random locations per second. That's so far apart that different user-level approaches are called for.
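To put rough numbers on that gap, here is a back-of-the-envelope sketch using only the rates quoted above. The workload (one million random lookups) and the function name are illustrative assumptions, not a benchmark.

```python
# Back-of-the-envelope comparison of random-read capability.
# The rates come from the comment above; the one-million-lookup
# workload is a hypothetical example.

SPINDLE_SEEKS_PER_SEC = (200, 250)   # old enterprise spindle, low/high figure
FLASH_READS_PER_SEC = 500_000        # random reads per second quoted for flash


def seconds_for_random_reads(n_reads: int, reads_per_sec: float) -> float:
    """Time to serve n_reads random reads on one device, with no caching."""
    return n_reads / reads_per_sec


n = 1_000_000
for rate in SPINDLE_SEEKS_PER_SEC:
    print(f"spindle @ {rate}/s: {seconds_for_random_reads(n, rate):.0f} s")
print(f"flash @ {FLASH_READS_PER_SEC}/s: "
      f"{seconds_for_random_reads(n, FLASH_READS_PER_SEC):.0f} s")

# How many times more random reads per second flash serves than a spindle.
ratios = [FLASH_READS_PER_SEC / r for r in SPINDLE_SEEKS_PER_SEC]
print("flash/spindle ratio:", ratios)
```

With these figures the spindle needs 4,000 to 5,000 seconds for what flash does in about 2, a gap of 2,000x to 2,500x, which is why layouts built around avoiding seeks stop being the obvious choice.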

In terms of XPoint, let me give you one detail. What does a "commit" look like in XPoint? What do the different kinds of memory barriers look like? What's the best way to validate this kind of persistence on restart, which you don't have to do with DRAM? Does that change your "malloc" free list structure, because you need to validate? Is it a good idea to chop up all the space available, so you can validate different parts independently, or does that mean you end up with the multi-record transaction problem? These are the kinds of things we consider in database design on new hardware ( obligatory: we are hiring ).
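As a rough illustration of the "chop up the space and validate parts independently" idea, here is a minimal sketch. It is not Aerospike's design: the region size, header layout, and function names are all hypothetical, and a plain bytearray stands in for a persistent memory mapping.

```python
import struct
import zlib
from typing import List

REGION_SIZE = 64                 # bytes per region; tiny, for illustration only
HEADER = struct.Struct("<II")    # (payload length, CRC32 of payload)


def write_region(buf: bytearray, index: int, payload: bytes) -> None:
    """Store payload in region `index`; the header write acts as the commit."""
    if len(payload) > REGION_SIZE - HEADER.size:
        raise ValueError("payload too large for region")
    start = index * REGION_SIZE
    body = start + HEADER.size
    buf[body:body + len(payload)] = payload
    # On real persistent memory, a flush plus barrier would be needed here so
    # the payload is durable before the header declares it valid.
    HEADER.pack_into(buf, start, len(payload), zlib.crc32(payload))


def validate_region(buf: bytearray, index: int) -> bool:
    """Check one region on restart without looking at any other region."""
    start = index * REGION_SIZE
    length, crc = HEADER.unpack_from(buf, start)
    if length > REGION_SIZE - HEADER.size:
        return False
    body = start + HEADER.size
    return zlib.crc32(bytes(buf[body:body + length])) == crc


def validate_all(buf: bytearray) -> List[bool]:
    return [validate_region(buf, i) for i in range(len(buf) // REGION_SIZE)]


buf = bytearray(4 * REGION_SIZE)
write_region(buf, 0, b"record-a")
write_region(buf, 2, b"record-b")
buf[2 * REGION_SIZE + HEADER.size] ^= 0xFF   # simulate a corrupted byte in region 2
status = validate_all(buf)
print(status)
```

Only region 2 fails validation, so the rest of the space stays usable. Note that this says nothing about a transaction spanning several regions, which is exactly the multi-record problem raised above.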

Databases have traditionally gone to great lengths to bypass whatever heuristics the operating system is using for I/O scheduling and caching so that they can implement their own optimizations while maintaining their data consistency guarantees. Any changes to OS behavior you can imagine are also potentially applicable to databases that take this approach.

What's a "high-NVMe chassis" and what large cloud provider uses Dell gear these days? They long ago killed the C-series line as everyone switched to Quanta (FB et al) or Supermicro (smaller players). The only people buying Dell are enterprises, and they don't buy in C-series volume nor are they price conscious.

The truly humongous buyers, like Amazon / Google / FB, certainly don't buy Dell anymore, and also don't buy software from people like Aerospike.

Medium-size companies ( say, box, or AppLovin, or PayPal ) buy software, have some cost consciousness, and buy Dell.

I expect Dell will come out with some chassis closer to what Supermicro has been offering to the market, I also see some defection just like you do. However, that's all speculation.


On a semi-related note, the fact that cloud providers offer managed Postgres databases is great, but things like this keep pushing me to think about bare metal in colo. A $20,000 server/backup combo with a couple of these will give me 5x the performance of a server that costs me $70k/year on AWS before provisioned IOPS. That's a huge gap to play with for taking care of your own maintenance costs.

It has always been the case that owned metal is much cheaper than cloud. Cloud wins for things like very small machines (you can't buy 20% of a metal box); temporary deployments; and startups where the investors don't want to be on the hook for big $$ capex that they end up needing to shift on eBay post flame-out. Also where you just don't have the ability to run hardware (increasingly the case) and where for whatever reason the cost to deploy and run metal would add up to more than your $70k delta (easy if it needs one more person on staff).

Also remember that funky storage card may have a firmware bug that costs you hundreds of hours to track down and get resolved. That time could have been burned by Jeff Bezos' guys instead..

I never understood this. Just rent a dedicated box or ten. There's no capex, it's way cheaper to have 3-5 times the capability you will need just sitting there instead of renting stuff from Amazon with all the I/O unpredictability, complexity, w/e. People talk as if colo and cloud are the only solutions -- and I am like, renting dedicated is the only solution. I do not readily know how huge your site needs to be for colo to be worth it, but Examiner was ~75th on Quantcast and we ran it on a dozen dedicated boxes, and colo would've been ridiculously more expensive.

I would've agreed with you a couple years ago, but now things have changed. Clouds like AWS and GCE are increasingly offering services that one can't replicate on their own dedicated box. ML, Managed Databases, etc. can't really be replaced.

Another thing is the skills one needs to efficiently manage and deploy everything. The sysadmin, devops positions in a startup have vanished because now that's AWS and GCE. Sure, AWS/GCE are expensive, but the developers don't need to worry about uptime, and if something happens to their instances, they just hit 'delete instance' and start up a new one.

The new theme seems to be developer productivity rather than cost effectiveness. With the new automatic deploy tools, and automatic load balancing, there's not really a need for anyone to worry about managing systems. It might seem like premature optimization, but you're saving the paycheck that needs to be paid to the sysadmin who has to be there for 24/7 monitoring.

This is all assuming one of the existing developers on the team isn't also a rockstar sysadmin, security guru and an insomniac who doesn't have a family.

With public cloud, you're paying for the control plane. Period.

Yes the raw compute cost is much higher (and the performance often less) than bare metal, but software development is really, really expensive. With the public cloud you get the result of literally millions of programmer-hours "for free". To many that's worth it (at least at first, below a certain scale).

This is also one reason why Kubernetes is really exciting, BTW. It's a control plane that you don't have to rent.

I really badly do not understand what EC2 gives you that a dedicated box doesn't. Yes, if you use all the other services, then there's a value but then again for example SQS is a particularly shitty queue.

It's not ec2, it's the ecosystem.

DDOS, multi-path, multi-az, dr, snapshots, hardware redundancy, elasticity, freedom from datacenter contracts, fiber contracts, driving to rack stuff, jammed fingers, staff costs, firmware bugs, bad hardware batches, cdn, compliance costs. I have more but I'll stop there.

It doesn't make sense at very small or large scales but it captures a hell of a lot of the middle.

Most of what you described is colo and not dedicated boxes...

And honestly, even colo doesn't take that much time unless you're scaling up like crazy. Once we racked our machines, we almost never had to touch them. You can get a LOT of resources in a single box for $10k/server these days.

> SQS is a particularly shitty queue.

Do you have any other recommendations for queue systems that can handle a million messages per minute (our use case)?

A million messages per minute ? How's that a problem? Please see http://bravenewgeek.com/dissecting-message-queues/

Almost all of the systems tested hit 17k / s (aside from NSQ) -- on a laptop. NATS (gnatsd) hits 0.2M per second, which is an order of magnitude above what you need, and that's a brokered system; if your task fits a brokerless one then nanomsg is yet another order of magnitude. I do not particularly see where the problem really is?
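The arithmetic behind that claim, as a quick sketch. The two throughput figures are the ones quoted above; the labels are mine, and message size is ignored, which matters if payloads are much larger than in the linked benchmark.

```python
# Convert the stated requirement to messages per second and compare it
# with the throughput figures quoted in the comment above.

MESSAGES_PER_MINUTE = 1_000_000
required_per_sec = MESSAGES_PER_MINUTE / 60   # about 16,667 msg/s

quoted_rates = {
    "most tested brokers (laptop)": 17_000,
    "NATS (gnatsd)": 200_000,
}

for name, rate in quoted_rates.items():
    headroom = rate / required_per_sec
    print(f"{name}: {rate:,} msg/s, {headroom:.2f}x the required rate")
```

So 17k/s covers a million per minute with almost no margin (about 1.02x), while 200k/s leaves roughly 12x headroom.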

Our messages range from a few KB to a few MB (we store a pointer to S3 for the large ones) but that looks pretty interesting.


It gives you dev ops out of the box. For small projects it has its merits. For most developers tinkering with the server isn't high in their priority list. I'm with you on dedicated but if I were to run an asynchronous crawler I'd rather go with AWS and save me the trouble of setting up the proper environment.

Also keep in mind that literally every hardware vendor has no problem giving great lease terms on actual hardware that you take possession of. Certainly more overhead, dev ops work, etc but yet another way to go about a desired technical approach while also being able to (largely) satisfy any policy/strategy/whatever with regard to capex.

Renting a dedicated box doesn't have that cool "we can scale INFINITELY!" vibe. ;)


Exactly. The only time the cloud services really pay off are when you need to scale up or down massive capacity overnight. The odds of that happening at most companies are very, very slim, even most of the web companies.

If you're really concerned about that though, I'd say go for the cloud option - just make it so you can start small on a single physical server, but can scale onto the cloud, then migrate that to physical hardware as needed. With this set up, you save massive cash and avoid vendor lock-in.

Or if you are a team like mine that develops/manages 200 servers, 15 RDS database instances, 500 TB of compressed S3 data, and provides accessibility to all of that through APIs with only 5 developers because of how easy AWS makes it for us.

While our server costs are probably higher than going bare metal, how many developers would it take you to manage all of that on bare metal hosts?

The expense of one's outlay to Amazon is not a persuasive statistic. It only shows that you have a lot of incentive to keep confirmation bias in gear.

AWS is specifically designed to maximize instance count (that is, cost). The reason "devops" has exploded since EC2 hit critical mass is that before EC2, people were reasonable with the number of servers they needed. With EC2, it's all water, and it's so easy to press "create instance" that people just do it without thinking, and wonder how they ended up paying Amazon $100k/mo for something that used to cost them $5k/mo.

I'm engaged on a project now that has similar figures to what you've quoted here. It could be done with less than two dozen bare-metal servers. I know because it was done with less than two dozen bare-metal servers before someone in the C suite felt left out and suggested some "modernization" via the cloud.

I think that cloud zealots are mostly people who were either deathly afraid of sysadmin or people who were cutting their teeth as cloud got hyped, because there's no way any competent person who was doing this type of stuff before 2008-2009 can pretend that Amazon is not laughing all the way to the bank.

This is not to say that cloud has no advantages or that its use is always inappropriate, but what you're describing is pure fantasy. Yes, a small team of coders should be more than capable of managing the servers that they need, especially if they are renting dedicated boxes from a professional datacenter facility that handles hardware swaps and similar failures for them.

Our EC2 costs are about the same as S3. RDS costs the most (I forgot to mention the backup servers we have in each region, but we are trying to move all of that data to S3), and each of these costs more than SQS.

I'm pretty new to the backend design cost/benefit (have only used AWS since I graduated college) but I'm curious to hear about bare metal storage solutions for 500 TB of compressed JSON, as I have not read any compelling solutions for that data requirement (outside of MongoDB, but I have heard good and bad things about that).

Grab small servers with one or two big disks (Hetzner will give you two 4TB disks for 39.00 EUR, OVH one 6TB for about 30 USD w/ an ARM CPU, less than 50 USD for two 4TB under the soyoustart brand), serve them over iSCSI, and use RAID-Z3 to create a nice huge ZFS filesystem. Do not forget to have hot spares. Done. Costs about half or even a third of what S3 costs and it's a filesystem instead of an object storage. Also has transparent compression. ZFS grows online. Both OVH and Hetzner, and I bet other such providers, have an API so you can just script the whole thing to grow as soon as you cross a threshold.
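A rough sizing sketch for that layout, using the Hetzner figure quoted above (two 4TB disks for 39.00 EUR per month). The vdev width and the hot-spare count are hypothetical choices of mine; the only fixed fact is that RAID-Z3 spends three disks per vdev on parity. ZFS overhead, TB versus TiB, and compression are all ignored.

```python
import math

TARGET_TB = 500                 # usable capacity wanted
DISK_TB = 4                     # per-disk size from the quoted offer
DISKS_PER_SERVER = 2
EUR_PER_SERVER_MONTH = 39.00    # quoted price, per server per month

VDEV_WIDTH = 11                 # hypothetical: disks per RAID-Z3 vdev
PARITY_PER_VDEV = 3             # RAID-Z3 is triple parity
HOT_SPARES = 4                  # hypothetical spare count

data_disks_per_vdev = VDEV_WIDTH - PARITY_PER_VDEV
usable_tb_per_vdev = data_disks_per_vdev * DISK_TB
vdevs = math.ceil(TARGET_TB / usable_tb_per_vdev)
total_disks = vdevs * VDEV_WIDTH + HOT_SPARES
servers = math.ceil(total_disks / DISKS_PER_SERVER)
monthly_eur = servers * EUR_PER_SERVER_MONTH

print(f"{vdevs} vdevs, {total_disks} disks, {servers} servers, "
      f"{monthly_eur:.2f} EUR/month for {vdevs * usable_tb_per_vdev} TB usable")
```

Under these assumptions that is 16 vdevs, 180 disks and 90 servers, about 3,510 EUR per month. One design caveat: with two disks per server, a single server failure removes two disks at once, so it is worth spreading each server's disks across different vdevs.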

In 2017, 500TB is easily within the realm of colocated bare metal.

There are many build it yourself options, which would involve things similar to this SuperMicro JBOD [0] + some other servers/external RAID controller to handle distributions over the disks. You could also go really barebones and do a couple of home-built "RAIDZilla"-style devices. [1]

For the cost of one month's S3 storage, you can buy a pre-built Backblaze Pod from a third-party supplier [2]. I've found this to be the case with Amazon; the monthly cost is about 50%-100% of the permanent cost for the actual hardware (which will usually last at least 3-5 years). Even if you have to hire a couple of your own hardware jockeys, you're going to be saving 6x.

For a less DIY route, any SAN provider will be able to accommodate 1PB (for redundancy) without breaking a sweat. Of course, this will be a large upfront expenditure, but it should still easily be cheaper than S3 over the long run. There are tons of options and storage engineering is a big field. Look around and I'm sure you'll find something acceptable.


All this said, I really have to be skeptical that you need 500TB of (compressed!) JSON data. I would look into how much of that data you really need to keep, set up some retention policies, and seriously consider reworking your storage format to make these numbers more reasonable (JSON is obviously not space-efficient), which will not only greatly reduce infrastructure costs but also make the project much easier to handle.

You may also wish to look into modern compression codecs if you haven't already. LZMA provides the best ratio but the compression cycle is slow (decompression is fast). Brotli and zstd are new compression options that are at least comparable to gzip in ratio and much faster. Data deduplication should also help.
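If you want to check ratios on your own data, a minimal sketch with the Python standard library looks like this. Brotli and zstd need third-party packages, so this only compares DEFLATE (the algorithm gzip uses, via zlib) with LZMA; the synthetic records are a made-up stand-in and their ratios will not match real data.

```python
import json
import lzma
import zlib

# Synthetic, highly repetitive JSON; replace with a sample of real records.
records = [
    {"id": i, "status": "ok", "tags": ["a", "b"], "value": i * 0.5}
    for i in range(5000)
]
raw = json.dumps(records).encode("utf-8")


def ratio(compressed: bytes) -> float:
    """Compression ratio: original size divided by compressed size."""
    return len(raw) / len(compressed)


deflated = zlib.compress(raw, 6)   # DEFLATE at the default-ish level 6
xz = lzma.compress(raw)            # LZMA2 in an .xz container, default preset

print(f"raw:     {len(raw):>8} bytes")
print(f"deflate: {len(deflated):>8} bytes, ratio {ratio(deflated):.1f}x")
print(f"lzma:    {len(xz):>8} bytes, ratio {ratio(xz):.1f}x")

# Both codecs are lossless: the round trip returns the original bytes.
assert zlib.decompress(deflated) == raw
assert lzma.decompress(xz) == raw
```

Timing the compress and decompress calls on a representative sample is the other half of the comparison, since ratio alone does not tell you what the CPU cost will be.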

[0] https://www.cdw.com/shop/products/Supermicro-SC417-BE1C-R1K2...

[1] https://www.glaver.org/raidzilla25/

[2] https://www.backblaze.com/blog/open-source-data-storage-serv... (estimates cost of third-party pod at $12,849.40; S3 cost calculator says 500TB of storage in us-east-1 is $12,407.17 without any bandwidth, request costs)
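A quick break-even sketch using the two dollar figures in [2]. Everything else is an assumption for illustration: that the pod price covers the capacity you need, that you buy two pods for redundancy, and the placeholder for monthly operating costs (colo, power, staff), which you should replace with your own number.

```python
from typing import Optional

S3_MONTHLY_USD = 12_407.17   # from [2]: 500TB in us-east-1, storage only
POD_COST_USD = 12_849.40     # from [2]: estimated third-party pod cost

PODS = 2                     # hypothetical: a second pod for redundancy
OPS_MONTHLY_USD = 0.0        # hypothetical placeholder for colo/staff costs


def cumulative_s3(months: int) -> float:
    return S3_MONTHLY_USD * months


def cumulative_owned(months: int) -> float:
    return POD_COST_USD * PODS + OPS_MONTHLY_USD * months


def break_even_month(max_months: int = 60) -> Optional[int]:
    """First month in which cumulative S3 spend reaches the owned-hardware spend."""
    for month in range(1, max_months + 1):
        if cumulative_s3(month) >= cumulative_owned(month):
            return month
    return None


print("break-even month:", break_even_month())
print(f"36-month S3 spend:    ${cumulative_s3(36):,.2f}")
print(f"36-month owned spend: ${cumulative_owned(36):,.2f}")
```

With zero operating cost the two-pod purchase is paid off in month 3. Plugging a realistic OPS_MONTHLY_USD into the sketch shows how much staff and colo cost the gap can absorb before S3 wins.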

>where you just don't have the ability to run hardware (increasingly the case)

Softlayer has 'on demand' bare metal plans. I'm not sure how cost effective it would be compared to running your own box, but the simplicity is nice and they had me up and running in under 2 hours (assembled with customizations).


As a customer, I can tell you we're very, very happy (Singapore SoftLayer, our Asian DC). No upfront capex, I can add or subtract hardware as necessary (I can say the same, they usually take about 2 hours to have it up and configured as requested), and my favorite benefit is when new hardware is available, we just spin up new servers and retire the old ones; bill ends up being the ~same.

It runs substantially cheaper than our colo here in Irvine, but that's also not apples-to-apples. That said, I don't plan on continuing with colos unless there's a pressing reason, after almost 15 years of them being my default. Bare Metal is the sweet spot in my eyes, the benefits of cloud without the downside of shared/inconsistent resources.

How custom can you get with the hourly machines? The order page (without logging in) looks like you have to pick a ram size (and they don't go beyond 256GB), and you can only go up to 4 disks. That said, I know the order page for monthly has maximum ram sizes listed below some of the monthly machines I manage, so maybe the hourly page is a simplified version of the truth too.

I'm not really sure about hourly plans. When I was configuring my machine (~1-2 years ago) I was able to select hourly or monthly billing (hourly would've been outrageously expensive).

Yep but the FTEs you need are 100k+ each, managing baremetal servers is almost certainly not your core business, and you're running the risk of gitlab style fat fingering dinking around with low level stuff that you did to save money.

You don't even need to co-lo. Just build a system for your home on a static IP address with a business account from your ISP.

It's so much cheaper than AWS, and you get unlimited bandwidth.

Some people like to claim that old storage technologies go away. But in reality old storage technologies live on along side of new... all that happens is that we end up having more tiers to deal with.

Drum magnetic memory has been replaced but we still have spinning rust, tape, optical, DRAM, SRAM, SSD...

There are only 3 forms of bit storage, historically.

1) Poking holes in things ( tape, DVD, etc ) 2) Magnets ( core memory, drum memory, tape, disks ) 3) Circuits ( DRAM, NAND, etc )

Examples of all three of these still exist.

What's interesting about XPoint is it is literally a fourth form that has never been commercially available: melting a substance and cooling it quickly or slowly, forming either a crystal or amorphous solid, which then has different properties. We don't know what the substance is, but it's cool that we now have this 4th thing.

> melting a substance and cooling it quickly or slowly, forming either a crystal or amorphous solid, which then has different properties

This is exactly how rewritable optical media works.

There are only 3 forms that really survive to this day. There are a couple of notable obsolete ones:

4) Delay lines. I would definitely not categorize these as circuits. The bits are stored as pressure waves in motion.

5) Electrostatic charge. Talking about the Williams tube. The bits are stored as residual charge on a phosphor surface.

... and magnetic bubble memory, which is a very odd hybrid somewhere between tape and delay lines.

Also, phase-change optical (as in rewritable CDs/DVDs).

Also also, printing stuff out at high bit-density with good ECC, and scanning it back in again.

Also also also, chemically encoded bits (as in the DNA-based storage that was demonstrated recently).

There are more, not in common use or very scalable (e.g., mechanical switches).

I think it's a bit conspicuous how you mention good ECC for printing stuff out, since good ECC is a staple of many forms of storage (especially hard drives, optical drives, tape) and we don't mention e.g. "storing bits on spinning rust with good ECC and reading it back in again".

I think the problem with paper is that it's not really part of the computer any more. You need a human to take the paper out of the printer, store it, and put it back in the scanner. At least with tape and optical, we have robots to do that for us.

I hadn't thought of that. Your average "crappy thumb drive" probably has more ECC than you'd need on a physical piece of paper. You're right.

Although when crappy thumb drives fail the whole thing tends to be unreadable or doesn't even show up as a USB device.

Generally no amount of ECC will help the default failure mode of these.

Is #5 that much different from dynamic RAM, with bits stored as capacitive charge?

Memristors would technically be another form of storage but after the hype HP have been a bit quiet on this for the last 3 years.

MO drives heated the medium to the Curie point. 3D XPoint isn't all that revolutionary.

Drum memory was still spinning rust, the only difference is the axis of rotation.

My floppies would like to have a word with you.

It lives on in the save icon. But in all seriousness you're right, floppy as a medium for data storage is dead.

Unless you're the US gov. According to Wikipedia: "In May 2016 the United States Government Accountability Office released a report that covered the need to upgrade or replace legacy computer systems within Federal Agencies. According to this document, old IBM Series/1 minicomputers running on 8-inch floppy disks are still used to coordinate "the operational functions of the United States’ nuclear forces..." The government plans to update some of the technology by the end of the 2017 fiscal year." [1]

1. https://en.wikipedia.org/wiki/Floppy_disk#Use_in_the_early_2...

Optical discs are relegated to back shelves too. Hard to beat cheap rewritable nand flash.

ps: I'd be curious about a tiny, 2017 laptop friendly optical disc storage. Say a 2.5" RW BD layer in cartridge. A modern mini disc.

There are people using optical media for long term storage. So I am not counting it out as a storage medium. In my classification I would put it in cold storage tier.

Surely. Although I'd have thought for long time storage magnetic tapes were cheaper.

I doubt anything could be compact enough to be laptop friendly in the sense of being a physical part of the machine - but external USB 3 BD burners are quite small (powered over usb, too), I have a Samsung which works fine.

I don't know, a small radius "may" help remove mechanical constraints, and with a non-stacked lens design the device could be about the height of an SD card reader.

I was mostly wondering about the psychological aspect of form factor. Naked optical disks are non-personal, mini discs weren't. You interacted with them freely, carried them as is in your pocket. A large enough yet tiny enough reincarnation might be "fun".

On one hand, people complain about government spending. On the other, they complain about government's old equipment.

Floppies are pretty much dead unless you use industrial computers that can't be upgraded.

Floppies are a type of spinning rust.

spinningrust.io, the new web 2.0 tech startup that performs data science statistical arbitrage techniques on your customer data.

Floppy drives themselves can generally be replaced by emulators, at least from a technical perspective. Like with a FlexiDrive or HxC.

That's not to say it'd always be possible, I presume that there may be regulatory issues in some cases.

I'm the technical lead for the Storage Performance Development Kit (http://spdk.io). I have one of these in my development system and we're hoping to post benchmarks fairly soon. SPDK further reduces the latency at QD 1 by several microseconds. It's an impressive device!

Any word on real world endurance?

There's some speculation these devices are massively overprovisioned due to the er... cells (or equivalent) wearing out much faster than early hype/info/etc.

So, it'd be interesting to get a real world idea if these things are likely to explode badly ;) 6 months after purchase or not (etc). :)

Jeebus, that's one performant drive. http://www.anandtech.com/show/11209/intel-optane-ssd-dc-p480... -- I can't believe those numbers. The QD1 numbers are impressive to say the least.

The random access performance is IMPRESSIVE! It is going to bring a storm to the DB landscape, since now we probably aren't forced to use data structures tailored to sequential writes, and can use all the other data structures as well. Very exciting indeed!

At $5 per gb, I would instead buy RAM.

Edit: why the downvotes? I'm currently doing image processing and using a ram drive with ~200GBs.

The reason you would use this over RAM is persistence, it doesn't need to stay powered to keep the data.

If all you need is a massive amount of temporary storage for some algorithm, you'll still need RAM, but if you want a stupidly fast backing store for a huge amount of source and then output data, this is pretty incredible.

When your computation fits in RAM, what's the point of a lower latency backing store that does fewer gbps per dollar?

Many of these systems have write caching enabled, which means that on critical power failure, some data will be corrupted. And often such systems have backup power supplies.

Let's be clear here. The real reason people used this stuff was that it was 3x/4x cheaper than RAM.

Have you considered write-back cache scenarios? Can't use RAM there. Stuff like this would be fantastic for those applications.

That sounds interesting, can you tell us more about the system?

Better-than-NAND performance, if ever so slightly, for a first-generation memory product, but I still feel like it would not be able to scale to DRAM speeds. The search for a universal memory goes on...

They're not advertising it as universal memory or a DRAM replacement though, so I'm not sure why you would make that comparison. They're advertising it as sitting between RAM and an SSD on the storage hierarchy, and it seems to mostly deliver on that promise.

One of the proposed uses for 3D XPoint has been replacing system memory, owing to its extremely low latency (which it has already delivered), high throughput, where it's early, and supposed almost unimaginable reliability/rewrites. In DIMM form it wouldn't have the overhead of a storage controller, could hypothetically be very parallel, etc.

It's extremely early in the technology, and I imagine we will get there. The first SSDs were terrible compared to SSDs now, and 3D XPoint as a technology is extremely scalable and refinable.

> extremely low latency (which it has already delivered)

Are we looking at the same numbers? "probably under 10 microseconds" is pretty terrible compared to DRAM.

3D XPoint has extremely low latency (e.g. 7 usec), and has already been demonstrated as such. Putting it through a PCI slot and a storage controller is not the same. The discussion is about running it as DIMM memory through a normal memory controller.

So you remove the storage controller and PCIe bus and you go from 10us to 7us. 7us is still a hundred times slower than DRAM, is it not?

You're ignoring a lot of context being created by the parent comment. While the latencies may never directly compete with DRAM, planned densities are already more than favourable. Having a 2TB chunk of persistent memory mapped into your address space with single-thread performance exceeding 100,000 random reads/sec with extremely consistent latency is a game-changer in for example database applications.
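The "100,000 random reads/sec" single-thread figure lines up with the latencies discussed above: at queue depth 1 a thread can only issue the next read once the previous one returns, so throughput is just the inverse of latency (Little's law). A small sketch, where the function name is mine and the latencies are the 10us and 7us figures from this thread:

```python
def iops(latency_us: float, queue_depth: int = 1) -> float:
    """Little's law: throughput = outstanding requests / latency."""
    return queue_depth * 1_000_000 / latency_us


for latency_us in (10, 7):
    print(f"{latency_us} us at QD1 -> {iops(latency_us):,.0f} reads/s")
```

So roughly 10us per read gives 100,000 reads/s from a single thread, and 7us gives about 143,000.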

What context am I missing? While Analemma_ was talking about using it to augment DRAM, like you are, endorphone was very specifically talking about replacing DRAM in that comment. And that comment claims that it already has good enough latency for the job.

It's turtles all the way down, and I think this is needlessly splitting hairs.


3D XPoint is being considered as the DIMM-module, byte-addressed "memory" before storage. Saying that it "augments" DRAM is almost meaningless because we already have multiple levels of DRAM.

If I am not mistaken, we have only one level of DRAM. Caches are SRAM. (Also, DRAM on DIMM is not byte-addressable)

The PS2, Xbox 360, Wii, and some Intel chips with Iris graphics provide on-die DRAM caches.

It's about 30 times slower than the real-world latency of RAM, which is obviously a problem but it's one that caching might resolve (e.g. an L4 of 32GB of GDDR5)
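To see how much a cache in front could hide, here is a simplified average-latency sketch. The 7us XPoint figure is from this thread; the DRAM figure of about 230ns is my assumption, derived from the "30 times slower" estimate above, and the hit rates are hypothetical. A miss is modelled as costing only the backing-store latency.

```python
def effective_latency_ns(hit_rate: float, cache_ns: float, backing_ns: float) -> float:
    """Average access time for a cache in front of a slower backing store."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * backing_ns


DRAM_NS = 230        # assumption: roughly 7us / 30
XPOINT_NS = 7_000    # 7us, as quoted in this thread

for hit_rate in (0.90, 0.99, 0.999):
    avg = effective_latency_ns(hit_rate, DRAM_NS, XPOINT_NS)
    print(f"hit rate {hit_rate:.1%}: {avg:.1f} ns average "
          f"({avg / DRAM_NS:.2f}x DRAM)")
```

At a 99% hit rate the average is about 298ns, roughly 1.3x DRAM, but at 90% it is over 900ns, so whether caching "resolves" the gap depends entirely on how well the working set fits.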

They had a demo at Cebit showing that it outperforms RAM access "to the wrong CPU's RAM" via QPI on a dual socket board. For real world use cases I'd say that's not too bad.

Not sure this looks like "ever so slightly". They're measuring idle random I/O latencies of less than 10us! Not DRAM, no. But still close to an order of magnitude better than what you can get from flash.

Good thing computers don't run on feelings.

Which are the use cases for an SSD like that? Would I see improvements if I were to load my db on it or is it for working with bigger files?

These are going to be absolute beasts for databases.

Databases are best on drives that have reduced latencies. Data access is random and keeps jumping around the drive for joins. Your SELECT statements are going to speed up in proportion to latency improvements.

First thing I think of is a Cache drive for a SSD array.

I am rather hoping this makes large-capacity DRAM prices drop somewhat.

I'll settle for NAND pricing to drop. There's a bit of a large-scale NAND shortage right now. Companies are bearing the brunt of it for now, but they'll pass it on to consumers if it continues much longer.
