Switch Your Databases To Flash Storage (highscalability.com)
187 points by jpmc on Dec 10, 2012 | 77 comments

Wear patterns and flash are an issue, although rotational drives fail too. There are several answers. When a flash drive fails, you can still read the data. With a clustered database and multiple copies of the data, you gain reliability, a server-level form of RAID. As drives fail, you replace them.

Unlike magnetic disks, SSDs have a tendency to fail at a really predictable rate. So predictably that if you've got two drives of the same model, put them into commission at the same time, and subject them to the same usage patterns, they will probably fail at about the same time. That's a real problem if you're using SSDs in a RAID array, since RAID's increased reliability relies on the assumption that it's very unlikely for two drives to fail at about the same time.

With an SSD, though, once one drive goes there's a decent (perhaps small, but far from negligible) chance that a second drive will go out before you've had a chance to replace the first one. Which makes things complicated, but is much better than the similarly likely scenario that a second SSD fails shortly after you replace the first one. Because then it's possibly happening during the rebuild, and if that happens then it really will bring down the whole RAID array.

That said, if you're careful then that predictability should be a good thing. A good SSD will keep track of wear for you. So all you've got to do is monitor the status of the drives, and replace them before they get too close to their rated lifespan. If you add that extra step you're probably actually improving your RAID's reliability. But if you treat your RAID as if SSDs are just fast HDDs, you're asking for trouble.
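A minimal sketch of the "monitor and replace early" step, assuming you already collect a wear percentage per drive from whatever counters your SSDs expose. The function name, the 80% threshold, and the sample numbers are hypothetical, not something a particular vendor recommends.

```python
from typing import Dict, List


def drives_needing_replacement(wear_pct: Dict[str, float],
                               threshold_pct: float = 80.0) -> List[str]:
    """Return drives whose reported wear is at or above the threshold, most worn first."""
    worn = [(pct, name) for name, pct in wear_pct.items() if pct >= threshold_pct]
    return [name for pct, name in sorted(worn, reverse=True)]


# Hypothetical fleet: drive name -> percentage of rated lifespan consumed.
fleet = {"sda": 42.0, "sdb": 91.5, "sdc": 80.0, "sdd": 12.3}
print(drives_needing_replacement(fleet))
```

The threshold is the knob here: set it low enough that a replacement can arrive and the rebuild can finish before the next drive in the array reaches its limit.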

> That said, if you're careful then that predictability should be a good thing.

Yes, it's a very good thing. In a high end SSD storage system, you predict early enough based on a calculation of how many drives there are, and what their current wear is, what type they are (SLC, eMLC, cMLC), etc. Then you phone home and have a drive delivered before the user even sees a disk failure.

With HDDs, the failure rate is so random that the disk completely failing is the signal that gets a replacement drive into the enclosure. S.M.A.R.T.-type alert systems have been epic failures (too little info, too late). The difference is that, because the RAID rebuild has a lower probability of failure (let's set aside multiple UREs for a second), you can count on the MTTF of the next drive being longer than the time it takes to get a replacement drive out there.

However this is not much of a guarantee, so most people crazy over provision their storage.

SSDs let you predict this, thus provision correctly, and choose how to replace the drives to least impact the customer. It's win win to have predictable failure. I don't understand people who say otherwise.

It's hard to predict without SMART features to measure the current wear state. Write amplification from the file system, and from the drive itself if you're not using large block writes, means you can't just calculate - you have to measure.
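To make the write-amplification point concrete, here is a rough sketch. It assumes a simplified model with perfect wear leveling, where total endurance is capacity times rated P/E cycles. All names and numbers are illustrative assumptions, not figures for any real drive.

```python
def measured_write_amplification(host_bytes_written: float,
                                 nand_bytes_written: float) -> float:
    """Ratio of bytes physically written to flash over bytes the host asked to write."""
    return nand_bytes_written / host_bytes_written


def estimated_lifetime_days(capacity_gb: float, pe_cycles: float,
                            host_gb_per_day: float,
                            write_amplification: float) -> float:
    """Days until the rated P/E budget is used up, under the simplified model above."""
    total_nand_writes_gb = capacity_gb * pe_cycles
    nand_gb_per_day = host_gb_per_day * write_amplification
    return total_nand_writes_gb / nand_gb_per_day


# Same hypothetical drive and workload, two different write-amplification factors.
print(estimated_lifetime_days(200, 3000, 500, 1.0))
print(estimated_lifetime_days(200, 3000, 500, 4.0))
```

The same drive under the same host workload lasts a quarter as long at a factor of 4, which is why the factor has to come from measured counters rather than from a guess.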

To extend on your comment: It's not possible to predict individual drive failure with reasonable accuracy. It's disconcerting to see the parent comment suggesting this still sits at the top of the thread.

You can roughly predict the longest possible lifespan for a SSD under a given workload. Regardless, a significant percentage of drives will still die earlier than that.

I'm only saying this from experience developing storage systems that are yet unreleased. You can predict the lifespan of the SSD in the storage system if you give up many of the functions of the SSD controller and put them in software RAID.

If you're talking about most incredibly naive SSD storage systems available today (excluding violin memory and maybe xtreme/pure), then I agree with you.

Depending on the SSD vendor, many drives expose performance counters to help you estimate wear level.

I'm sorry, I didn't mean you'd calculate some value once for all the drives. It's definitely something you measure over the lifespan of the SSD itself with the rest of your QoS subsystem.

Reads, program/erases, controller ecc/ read disturb management, the g/p list mapping of the blocks... This all has to be taken into account in a dynamic way. And yes, some people are doing this at a higher level than the SSD controller.

I bet you could set up your drives to fail in a set pattern. Let's say you had 4 drives in a RAID-10. If they were all fresh, swap out the first 2 drives when they are at 50% wear... from then on you could alternate, swapping each pair out as it approaches 100% wear.

Since the lifespan of some of the drives (Intel and Samsung, I believe) is reported in the SMART data, you could easily do this.
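A toy simulation of that staggering idea, assuming every drive wears at the same constant rate per month (real drives will not be this tidy). The group names "early" and "late" and the 5% per month rate are made up for illustration.

```python
from typing import List, Tuple


def replacement_events(wear_per_month: float,
                       first_swap_at: float = 50.0,
                       horizon_months: int = 48) -> List[Tuple[int, str]]:
    """Return (month, group) pairs showing when each group of drives gets replaced.

    The "early" group is swapped once at first_swap_at percent wear; after that
    both groups are swapped whenever they reach 100 percent.
    """
    wear = {"early": 0.0, "late": 0.0}
    limit = {"early": first_swap_at, "late": 100.0}
    events = []
    for month in range(1, horizon_months + 1):
        for group in wear:
            wear[group] += wear_per_month
            if wear[group] >= limit[group]:
                events.append((month, group))
                wear[group] = 0.0      # fresh drive installed
                limit[group] = 100.0   # from now on, run to full rated wear
    return events


print(replacement_events(5.0))
```

After the first early swap the two groups stay half a lifespan apart, so no two groups reach end of life in the same month.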

SSD drives with SLC memory (enterprise SSD) have 100,000 P/E cycles, so they should last a while unless you are writing just a massive amount of data. Anandtech had a nice little writeup about SLC vs MLC vs TLC memory a little bit ago:



That's talking about hard drives. The comment you quoted is talking about SSDs.

Assuming this predictability is not a good idea, in my experience. SSDs fail in various ways; some are predictable and some are completely unpredictable. It is also not true that an SSD failure means the drive simply goes into read-only mode. I've seen plenty of SSDs fail unexpectedly and become unreadable, returning sense key 0x4 (HARDWARE ERROR), and the only recourse is to ship them out.

The risk of correlated failures is indeed non-trivial in SSDs, and plain RAID is riskier; be sure to keep a watchful eye on your arrays.

We've currently got about 3500 SSDs in production across our clusters. I worry about them deciding to all fail at once; so far the failures have been sporadic (about 1/2 of which leave the drive unusable).

Are they all the same model? How long have they been running? Is it a relatively similar load on all of them? Can you share SMART attributes for them? (in private if needed)

On a very low fire I'm trying to create a disk survey project (http://disksurvey.org ) and such information is of great interest to me.

Good point. I was only addressing the predictability that comes from the flash memory simply wearing out. There are plenty of other ways that drives can fail. But I don't think they introduce any new worries for RAID users the way that flash memory wearing out after so many writes does.

With that I agree.

> So all you've got to do is monitor the status of the drives, and replace them before they get too close to their rated lifespan

Sorry, that is terrible advice. Do not do that.

In availability planning two is one and one is zero.

If you love your data you run your databases in pairs. If you really love your data you run them in triplets. This applies no matter what disk technology you're using.

Speculations about failure rate or -prediction don't belong here. Your server can go up in flames at any time for a dozen reasons, the disks being only one of them.

Not only that - you make sure that critical components, like hard drives (rotational or SSD, doesn't matter), are from different manufacturers or at least not the same production, or at the very least not put them to use at the same time. Basic design flaws (intentional or not), that result in non functional hardware, tend to hit at the same time - so you'd rather not want to have all 12 drives, and the hard drives of the 3 replicas, fail within the same week.

I used to run a free webmail service back in '99-2000. It was the first system of that kind of scale I'd worked on (about 1.5 million user accounts - large by the standards of the day - today I have more storage in my home fileserver; heck, I've got almost as much storage in my laptop), and though I wasn't in charge of ordering hardware I was equally oblivious to this problem as the guy who did.

I learned in a way that ensures this advice is burned into my memory forever:

It was when IBM had one of their worst ever manufacturing problems for one of their drive ranges.

While the IBM distributor we dealt with was very fast at turning around replacement drives, we had some nerve-racking weeks when the second drive in one of our arrays failed only something like 6-9 months after we went live, and we found out about the problem.

They all failed one after the other within a week or two of each other. Every drive in our main user mailbox storage array...

Thankfully for us, the gap was long enough between each failure that the array was rebuilt and then some in between each failure, but we spent a disproportionate amount of time babysitting backups and working on contingency plans because we'd made that stupid mistake.

(And I'll never ever build out a single large array, or any number of other things - it made me spend a lot of time thinking about and reading up on redundancy and disaster recovery strategies, as it scared the hell out of me; it was mostly luck that prevented us from losing a substantial amount of data)

You're responding as if I had suggested that this is a replacement for all the other practices one should already be doing.

If I had, yes I would agree with you 100%. However, far from suggesting anything remotely like that, I made sure to work in the phrase "add that extra step." It's not a panacea, it's an additional thing that needs to be done to account for one new quirk that a particular technology throws into the mix.

> With an SSD, though, once one drive goes there's a decent (perhaps small, but far from negligible) chance that a second drive will go out before you've had a chance to replace the first one. Which makes things complicated, but is much better than the similarly likely scenario that a second SSD fails shortly after you replace the first one. Because then it's possibly happening during the rebuild, and if that happens then it really will bring down the whole RAID array.

Not disputed that there's a slightly increased chance of concurrent disk failures with SSD, but on what basis is a second failure before rebuild any better than during it?

Also, I'm guessing you're referring to RAID5, as RAID6 / RAID DP is immune to double-disk failure, and RAID 10 and 0+1 are more tolerant of it.
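To put numbers on the RAID-level comparison, a small sketch that enumerates every possible two-drive failure in a 4-drive array and counts how many the array survives. The layout (two mirror pairs for RAID 10) and the helper names are assumptions for illustration. It ignores rebuild windows and unrecoverable read errors.

```python
from itertools import combinations
from typing import Callable, Iterable, Sequence, Tuple


def raid10_survives(failed: Iterable[int],
                    mirror_pairs: Sequence[Tuple[int, int]]) -> bool:
    """RAID 10 survives as long as no mirror pair has lost both of its members."""
    failed_set = set(failed)
    return not any(set(pair) <= failed_set for pair in mirror_pairs)


def double_failure_survival(n_drives: int,
                            survives: Callable[[Tuple[int, ...]], bool]) -> float:
    """Fraction of all two-drive failure combinations that the array survives."""
    cases = list(combinations(range(n_drives), 2))
    return sum(1 for failed in cases if survives(failed)) / len(cases)


mirrors = [(0, 1), (2, 3)]  # hypothetical 4-drive RAID 10 layout
print("RAID 5 :", double_failure_survival(4, lambda f: len(f) <= 1))
print("RAID 6 :", double_failure_survival(4, lambda f: len(f) <= 2))
print("RAID 10:", double_failure_survival(4, lambda f: raid10_survives(f, mirrors)))
```

With four drives, RAID 10 survives four of the six possible double failures; the two fatal cases are the ones where both halves of a mirror go.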

I wasn't aware of that... anyone know how that's handled when you're buying an SSD instance from say, Amazon? Do they predict failure and replace the drives before they go bad, or do you have to bake this into your deployment logic somehow?

Just curious, which software do you recommend to monitor SSDs life span/upcoming death in production?

Can something like Munin do this out of the box?

Munin cannot predict a SSD's lifespan.

Most sysadmins watch counters and set their calendars to replace the drives before they fail since it is so predictable.

But with current SSD speed/size ratio this vulnerability window can be only a few minutes, also this can be minimized with mixing batches and vendors of drives.

Hopefully. . . though speaking of speed, there's another pitfall to be aware of: At present, very few RAID controllers support the TRIM command. On one that doesn't, any SSDs plugged into it will slow down over time, perhaps becoming slower than magnetic disks.

Or just mixing intervals of memory installation; so you install the next SSD at half the life-span of the previous one.

I'm surprised that the author didn't capture what I consider to be the most important component of HDD/Flash/Memory Balancing - frequency of access.

The rule of thumb that I've heard thrown about is, "If you touch it more than once a day, move to flash. If you touch it more than once an hour, move to memory."

While we can debate where that actual line falls based on both the price and performance of the various media (and, as the price of flash drops, it may be more like "once every couple of days") - it's important to note that frequency of access is critical when determining which media to put your data on.

We have some 50 TB+ Data Sets that are queried weekly for analytics, that don't make a heckuva lot of sense on flash storage. Contra-wise, our core device files are queried multiple times a second, and so we make certain those database servers always have enough memory to keep the dataset in memory cache, even if that means dropping 256 GB onto those database servers for larger customers.
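The rule of thumb above can be written down as a tiny classifier. This is only a sketch of the heuristic as quoted; the thresholds are the commenter's rule of thumb, and the function name is made up.

```python
def suggest_tier(accesses_per_day: float) -> str:
    """Map access frequency to a storage tier using the once-a-day / once-an-hour rule."""
    if accesses_per_day > 24:   # more than once an hour
        return "memory"
    if accesses_per_day > 1:    # more than once a day
        return "flash"
    return "disk"


print(suggest_tier(1 / 7))       # queried weekly, like the analytics data sets
print(suggest_tier(5))           # a few times a day
print(suggest_tier(2 * 86400))   # multiple times a second, like the core device files
```

As flash gets cheaper the middle threshold would drop below 1, which is the "once every couple of days" adjustment mentioned above.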

Jim Gray wrote a classic paper about this. "The 5 Minute Rule for Trading Memory for Disc Accesses and the 5 Byte Rule for Trading Memory for CPU Time". http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf

There is an updated version that also talks about SSDs. http://cacm.acm.org/magazines/2009/7/32091-the-five-minute-r...
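For anyone who wants the arithmetic without reading the papers, here is a sketch of the break-even calculation in the form I recall from the follow-up versions of the rule: a page is worth keeping in RAM if it is re-referenced more often than this interval. The input values below are illustrative assumptions, not current prices and not figures from this thread.

```python
def break_even_interval_seconds(pages_per_mb_ram: float,
                                accesses_per_sec_per_disk: float,
                                price_per_disk: float,
                                price_per_mb_ram: float) -> float:
    """Reference interval at which caching a page in RAM costs the same as re-reading it."""
    technology_ratio = pages_per_mb_ram / accesses_per_sec_per_disk
    economic_ratio = price_per_disk / price_per_mb_ram
    return technology_ratio * economic_ratio


# Illustrative inputs: 8 KB pages (128 per MB), 64 random accesses/s per disk,
# a 2000-dollar drive, RAM at 15 dollars per MB.
seconds = break_even_interval_seconds(128, 64, 2000, 15)
print(round(seconds / 60, 1), "minutes")
```

Plugging in flash prices and IOPS instead of disk numbers shrinks the interval considerably, which is the adjustment the updated version of the paper discusses.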

The PostgreSQL mailing list is having a conversation right now about using SSDs. This seems like a very important comment for anyone considering them:

Basically, you need to make sure that you buy SSDs with a capacitor that allows the drive to flush what it needs in event of abrupt power loss.

EDIT: Looks like the list archives didn't preserve the thread very well, so here is the original question for anyone interested:


Disk drives with battery backed write caches are a requirement for any system where data corruption is considered a system failure. We use these at my job for things other than just our databases. It doesn't need to be just SSDs either; traditional platter-based hard drives have this feature as well.

Thanks; I wondered why this would be unique to SSDs.

Normally you use an HDD without any write cache and if you do use a write cache along the way (on the HDD or on the server) you make sure you are battery backed.

SSD is a bit different since even if you do not use write caching on the SSD there is a non-trivial amount of metadata that is kept in RAM and needs to be written safely to the media. You need quite a bit of juice to do all this work, which entails keeping most of the hardware working.

Even the HDD has enough capacitor power to park the head back and lock it safely. But I believe that you don't really need all hardware operating at full capacity, only enough spin to rotate and to pull the head back and the spin continues even if you don't power it for the parking time.

Neither HDD nor SSD guarantees much about an IO that started writing but was never acknowledged. The SCSI standard, from which all disks derive their requirements, doesn't require anything in such a case and leaves it undefined.

Is this because PostgreSQL's file format will become hopelessly confused and unable to restart if flush doesn't work?

A lot of applications can lose the last 100ms of writes, especially if it's rare because of a k-safe cluster design, as long as you don't have a corrupted file format. A good transaction log based system will recover - as the author's should.

It's not only a case of file formats.. a few years ago one of the Linux kernel developers (Theodore Tso IIRC) made a post regarding drive behaviour under power loss, and the results were pretty insane.

For example, when a rotating drive fails, you might lose +/- 4 KB around the previous sector under write, whereas with particular SSDs he witnessed 1 MB chunks zeroed out every N MB across the entire drive. That kind of thing you simply can't work around in software.

> A good transaction log based system will recover - as the author's should.

Unless the drive lies. Tells filthy dirty ugly lies.

And lots of them do.

OS: "Did you persist that data I flushed?"

Disk controller (hiding single copy in RAM cache behind its back): "Oh absolutely, bro. It's tight. Solid. You got nothing to worr-"BZZZT
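For reference, this is roughly what the software side of that conversation looks like. It is a minimal sketch of a write that asks the OS to push data to the device before returning; the function name and file name are hypothetical. The point of the comments above is that even when this call returns successfully, a drive with a volatile cache and no power-loss protection may not have persisted the data.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to flush it to the storage device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache to the device


# Usage: write a record into a scratch directory and read it back.
scratch_dir = tempfile.mkdtemp()
record_path = os.path.join(scratch_dir, "journal.bin")
durable_write(record_path, b"commit 42\n")
with open(record_path, "rb") as f:
    print(f.read())
```

Nothing in this code can detect a drive that acknowledges the flush without performing it, which is why the capacitor question matters.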

I love my consumer SSD backed database, but don't get visions of 380,000 IOPS on a real workload quite yet. Like any radical performance increase on just one component it's more likely to just reveal a non-disk latency bottleneck somewhere else in your system.

Be aware that the performance characteristics of flash are very unlike spinning disks, and vary widely between models. You will see things like weird stalls, wide latency variance, and write performance being all over the place during sustained operations and depending on disk fullness. I chose Intel 520s because they performed better on MySqlPerformanceBlog benchmarks than the then-current Samsung offering [1] and because of OCZ's awful rep [2]. I hit about 5K write IOPS spread across two SSDs before my load becomes CPU-bound, which is nowhere near benchmark numbers but pretty sweet for a sub-$1k disk investment.

It's also my understanding that non-server flash drives like those recommended by the article do not obey fsync and are suspect from an ACID standpoint. RAID mirroring does not fix this--if integrity across sudden power loss is critical you might not be able to use these at all and will have to find a more expensive server SSD.

[1] http://www.mysqlperformanceblog.com/2012/04/25/testing-samsu...

[2] http://www.behardware.com/articles/881-7/components-returns-...

p.s. the RAM benefits the article mentions are real and potentially huge. My query and insert performance has gone from having heavy RAM scalability issues to it hardly mattering at all. This is all on MariaDB on a non-virtualized server; I'm looking forward to better SSD-tuned databases in the future doing even better.

The "weird stalls" can be attributed to GC on lower end SSDs:


There are several background operations that happen on ssds, low or high end doesn't matter they all need them. I've seen quite a few supposedly high-end ssds that show abysmal behavior with wide variations in performance across time.

SSD qualification is a tedious job.

It's not even just about performance of the SSD; the other points to care about are non-trivial if you intend to use lots of SSDs for important tasks. I collected some questions to consider at http://disksurvey.org/blog/2012/11/26/considerations-when-ch...

AnandTech's recent article on SSD IOPS consistency touches on this topic, and has some links for further info:


My favorite quote:

"Flash is 10x more expensive than rotational disk. However, you’ll make up the few thousand dollars you’re spending simply by saving the cost of the meetings to discuss the schema optimizations you’ll need to try to keep your database together."

Lots of great technical details presented in a commonsense style, well worth a read.

Sounds great in theory, but in practice you'll be having that conversation about your database schema anyway.

Nah, just throw it all into a NoSQL store and let the developers figure it out.

NoSQL, making devs into DBAs since 2010: "But hey, look, we never hired any DBAs, that's a win, right?"

Object databases and ORM were trying to do that from at least the 90's.

And look how well that's worked out.

My point exactly.

The developers will just tell you to virtualize it.

Is that a joke?


"Switch Your Databases To Flash Storage. Now. Or You're Doing It Wrong."

Unless, you know, you're storing a lot of stuff and are quite happy with your current level of performance and don't want to shell out a load on new hardware that will fail quicker.

Is anyone ever actually happy with database performance? I have never met a customer that wouldn't welcome better performance for so little outlay.

The failure rate of drives shouldn't be a huge concern; data is kept on redundant drives and replacing them is just a matter of routine maintenance. The data is typically worth considerably more than the drives it sits on, by several orders of magnitude.

> Is anyone ever actually happy with database performance?

It depends on your access patterns and how large your active data set grows. Just because your entire DB is 5 TB doesn't mean anything. You could be running a forum where only the most recent 2 GB of posts are read by humans, most people are reading and not contributing, and the rest is trawled through by indexing bots.

I'm perfectly happy with all DB performance when the write load is reasonable, indexes are doing the right thing, and the working data fits into memory (which these days can be multi hundred GBs -- just pray to the gods of uptime you don't have to failover to a cold secondary server).

Good architecture eliminates the need for prayer.

Exactly. The (stealthy) thing we're doing doesn't have much data at all since we're not doing anything with images or video.

Data for 1M users would be ~40GB, of which only 8GB of that would be relevant at any single point in time. Well within the realms of in memory caching.

DB writes (our writes are batchy based on external events that we're not in control of) would be our biggest bottleneck but we offset them with MEMORY based mysql tables to hold the 'in progress' data and then update the disk based tables for all users in off-peak times.

We can speed up DB writes with more spindles, and then by an order of magnitude by moving to SSD(s). A single DB server at ~$5000 to handle the data for 10M users and, if necessary, a number of mysql slave servers. Subsequent scaling is just sharding.

Compared to serving that data via HTTP[s], the DB side of our application is easy.

Multiple hundreds of gigabytes of RAM still costs a non-trivial amount of money; certainly it costs a lot more than SSD. You may see more bang for the buck with SSD in many applications.

The article acknowledged up-front that HDD may make sense if your working set fits into memory.

Shameless plug alert. At Uptano[1], this is one of the neatest things we've seen with our very inexpensive SSD machines. It's amazing what you can do with 8GB RAM + 100 GB RAID1 SSD. It's probably the best price:performance DB you can run, and is sufficient for ~95% of projects.

1. https://uptano.com

Hey, nice relooking of your site. I prefer this colour palette better.

I would love if cloud providers offered SSD options for their full range of boxes.

For example, to be able to get a Linode at only a fraction more of the cost (say, a 10% premium) with the disk being SSD (and obviously reduced capacity compared to HDD).

I have seen the current offerings but found them to either be too costly (AWS, only one of the largest instances), or too onerous (ssdnodes.com whose base products aren't aligned with the costs elsewhere, and to move all of your hosts to be near your SSD powered database is a big task when I only seek a little task).

I was even considering co-locating as the most cost-effective way to get SSDs when providers still massively overprice them. It all feels a bit like the RAM scam a decade ago when they'd charge you near the cost of the RAM every 2 months. Again though... co-location fell into the onerous class of actions.

Right now, pragmatically I stay with HDD and Linode.

But Linode should look at my $500 per month account and be well aware that as soon as I see a competitor offer SSD nodes at a cost-competitive point that offsets the burden to move... I'll be gone.

Since NAND flash is being price fixed, and likely will be for several more years to come (we're just now starting to see LCD prices drop to reasonable levels after years and years of price fixing, I expect it'll take a similar amount of time for the NAND fixing to get busted and the market to respond), I don't think a 10% premium will be at all possible for a very long time. NAND storage SHOULD be significantly cheaper than rotational storage since it is cheaper to produce, requires fewer exotic materials, has a far wider market, etc. And eventually it will be. But for now, there is a huge premium on NAND storage. I am sure Amazon, EMC, and the others have done the math and simply don't think there is a significant market of people willing to pay for such a service, at least not at the steep rates they would have to charge to meet their growth projections.

Note he didn't say 10% premium for the same capacity. It's possible for cloud providers to replace $50 hard disks with $64 SSDs today.

Does a $64 SSD have enough charge to flush pending writes after a power failure? If not, then be prepared for widespread corruption.

See comment by pjungwir.

You're already on a cloud system, so you have to have a plan in place for your instance to up and disappear without warning. If your instance has an unplanned outage of any kind, you kill it and spawn a new one. You may as well use libeatmydata and reap the performance benefit.

You are describing EC2. Most other providers do provide proper persistence.

I hosted at vps.net for a while and they were offering Fusion IO drives as an addon. I can't easily find information on their site about it anymore, but I know it was an option at one point.

(Their management changed a couple years ago and made a bunch of changes that pissed me off, so I left and moved to Linode).

EDIT: vps.net charges $25/month for 1GB of storage on FusionIO drives, and they sell in 2GB increments. 2GB is $50/month. 36GB is $900/month

> I would love if cloud providers offered SSD options for their full range of boxes.

Both CloudSigma and ElasticHosts have SSD options.

On CloudSigma you can even grab a small SSD to use as a L2ARC or ZFS log device :)

In Australia, OrionVM offer fully SSD-backed VMs.

It's cost-effective because in Australia, the major determinant of monthly hosting bills is our ludicrously overpriced bandwidth fees.

Those caught in the middle on DB size needs and performance would be well off to take a look at Bcache. http://bcache.evilpiepirate.org/ It's a block write-back cache and seems to perform really nicely. Here's some benchmarks. http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bc...

Does anyone else find the section "Don’t use someone else’s file system" a bit confusing? It starts off by convincingly saying O_DIRECT shouldn't be used and then goes on to say O_DIRECT works very well.

Linus Torvalds said O_DIRECT shouldn't be used because Linux already implements a page cache and application developers should not bother to reinvent the wheel.

However page caching algorithms are pretty generic and database people think performance could be improved by using customized db caching routines instead of the generic OS one.

There are two ways to bypass OS page cache: 1) directly access the disk as a block device; or 2) use the O_DIRECT flag to disable page cache on a per-file/directory basis.

Direct access to disk as a block device would be ideal from the performance and flexibility point of view, but then you lose all the benefits and tools to manage databases as files. The O_DIRECT flag seems to strike a sweet spot and that's what ended up being used most in the real world.

Then nobody is really interested in improving the F_ADVISE interface, which is supposed to be better than O_DIRECT from Linus's perspective. You know, the “worse is better” thing.

I'm not a db or kernel dev, but I've been watching this argument for a while, and so far that's my understanding. Very interesting.
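For the curious, here is a sketch of the advisory approach, assuming the "F_ADVISE" interface mentioned above refers to `posix_fadvise` (my reading, not something the comment states). Instead of bypassing the page cache with O_DIRECT, the application reads through the cache and then hints that the pages will not be needed again. The function and file names are hypothetical, and the hints are skipped on platforms where Python does not expose the call.

```python
import os
import tempfile


def read_with_cache_hints(path: str, chunk_size: int = 1 << 20) -> int:
    """Read a file sequentially, then advise the kernel it may drop the cached pages."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # A length of 0 means "from offset to end of file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


# Usage: create a small scratch file and read it back with hints applied.
hint_demo_path = os.path.join(tempfile.mkdtemp(), "table.dat")
with open(hint_demo_path, "wb") as f:
    f.write(b"x" * 3000)
print(read_with_cache_hints(hint_demo_path))
```

These are only hints; the kernel is free to ignore them, which is part of why database developers tend to prefer the explicit control O_DIRECT gives them.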

Clearly, there are different usage regimes where different solutions will make sense. Nonetheless, there's a really strong case to be made that SSDs have entered a sweet spot in terms of price/performance for databases, and this trend is only accelerating. Here's one discussion of the rationale: http://www.foundationdb.com/#SSDs.

"Use large block writes and small block reads"

Yep. Write amplification is a big deal on SSDs and gets worse due to their internal garbage collection, if you give them a high-entropy write pattern. This is not a problem though, with TokuDB. See our "advantage 3" here: http://www.tokutek.com/2012/09/three-ways-that-fractal-tree-...

Funny, and it's a no brainer really.. There was a thread about SSD's about 2 years back, regarding good ways to use them. My conclusion was pretty much the same when it came to DB's, yet nobody agreed with me back then and I received 3 downvotes. Odd!!

Good article!

I don't disagree with the conclusions, but don't you have to short stroke those ssds pretty significantly in a high transaction environment to avoid write amplification?

It's too bad longevity worries are keeping them out of the no commitment market.

Not really. "Short stroking" is called "overprovisioning" with SSDs, and you'll see different effects with different drives. Most consumer SSDs (the mentioned Intel and Samsung drives) do best with about 20% overprovisioning. The "enterprise class" drives don't require this - they bake in the overprovisioning. The new Intel S3700 works extraordinarily well with no overprovisioning.
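A quick sketch of what that costs in capacity, assuming "20% overprovisioning" means leaving 20% of the advertised capacity unpartitioned (vendors sometimes define the percentage relative to usable space instead). The function name and the 480 GB drive are hypothetical.

```python
def usable_capacity_gb(advertised_gb: float,
                       overprovision_fraction: float = 0.20) -> float:
    """Capacity left for data after reserving a fraction of the drive as spare area."""
    if not 0.0 <= overprovision_fraction < 1.0:
        raise ValueError("overprovision_fraction must be in [0, 1)")
    return advertised_gb * (1.0 - overprovision_fraction)


# Hypothetical 480 GB consumer drive with 20% left unpartitioned.
print(usable_capacity_gb(480))
```

In practice this just means partitioning less than the full drive and never writing to the remainder, so the controller always has free blocks to work with.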

Anyone have any idea when Amazon will start providing SSD-backed RDS?

They already do, although it is quite new. The AWS provisioned IOPS layer ("pIOPS") is SSD-backed, and can be used for RDS:


There isn't currently an easy way to migrate to pIOPS from traditional RDS, but the performance is fantastic and works as advertised.

Only that append-only journal (transaction log).
