Unlike magnetic disks, SSDs tend to wear out at a really predictable rate. So predictable that if you've got two drives of the same model, put them into commission at the same time, and subject them to the same usage patterns, they will probably fail at about the same time. That's a real problem if you're using SSDs in a RAID array, since RAID's increased reliability rests on the assumption that it's very unlikely for two drives to fail at about the same time.
With an SSD, though, once one drive goes there's a decent (perhaps small, but far from negligible) chance that a second drive will go out before you've had a chance to replace the first one. That makes things complicated, but it's much better than the similarly likely scenario in which a second SSD fails shortly after you replace the first one, because then the failure may happen during the rebuild, and if that happens it really will bring down the whole RAID array.
That said, if you're careful then that predictability should be a good thing. A good SSD will keep track of wear for you. So all you've got to do is monitor the status of the drives, and replace them before they get too close to their rated lifespan. If you add that extra step you're probably actually improving your RAID's reliability. But if you treat your RAID as if SSDs are just fast HDDs, you're asking for trouble.
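To put a number on that RAID assumption, here's a back-of-envelope sketch in Python (every figure in it is an illustrative assumption, not data): with independent failures, a second death during a short rebuild window is rare, and correlated wear-out is exactly what invalidates this calculation.

    # Chance of a second drive dying inside the rebuild window, assuming
    # INDEPENDENT failures -- the assumption RAID leans on. All numbers
    # are illustrative.
    annual_failure_rate = 0.03   # assumed 3% AFR per drive
    drives_remaining = 7         # drives left after the first loss
    rebuild_days = 2

    p_one = 1 - (1 - annual_failure_rate) ** (rebuild_days / 365)
    p_any = 1 - (1 - p_one) ** drives_remaining
    print(f"P(second failure during rebuild) ~ {p_any:.4%}")
    # Correlated wear-out makes the real number far higher than this.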
Yes, it's a very good thing. In a high-end SSD storage system, you can predict failures early based on how many drives there are, their current wear, their type (SLC, eMLC, cMLC), and so on. Then the system phones home and has a drive delivered before the user even sees a disk failure.
With HDDs, the failure rate is so random that the disk completely failing is the signal that gets a replacement drive into the enclosure. S.M.A.R.T.-type alert systems have been epic failures (too little info, too late). The difference is that, because the RAID rebuild has a low probability of failure (let's set aside multiple UREs for a second), you can count on the MTTF of the next drive failing being longer than the time it takes to get a drive out there.
However, this is not much of a guarantee, so most people wildly over-provision their storage.
SSDs let you predict this, and thus provision correctly and choose how to replace drives with the least impact on the customer. It's a win-win to have predictable failure; I don't understand people who say otherwise.
You can roughly predict the longest possible lifespan of an SSD under a given workload. Regardless, a significant percentage of drives will still die earlier than that.
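For what it's worth, that upper-bound prediction is simple arithmetic. A sketch with made-up numbers (the capacity, P/E rating, write amplification, and workload below are all assumptions):

    capacity_gb = 200         # usable capacity
    pe_cycles = 100_000       # rated program/erase cycles (SLC-class)
    write_amp = 2.0           # assumed write amplification
    host_gb_per_day = 500     # your workload's daily write volume

    total_host_gb = capacity_gb * pe_cycles / write_amp
    years = total_host_gb / host_gb_per_day / 365
    print(f"~{years:.0f} years at this workload, at best")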
If you're talking about most of the incredibly naive SSD storage systems available today (excluding Violin Memory and maybe xtreme/pure), then I agree with you.
Reads, program/erases, controller ECC / read-disturb management, the G/P-list mapping of the blocks... this all has to be taken into account in a dynamic way. And yes, some people are doing this at a higher level than the SSD controller.
Since the lifespan of some of the drives (Intel and Samsung, I believe) is reported in the SMART data, you could easily do this.
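A minimal sketch of what that monitoring could look like, using smartctl from smartmontools. The attribute name below is what Intel drives expose; other vendors use different attributes, and the 20% threshold is just an assumption:

    import subprocess

    def wearout_pct(device="/dev/sda"):
        # Parse `smartctl -A` for the normalized wear value (100 = new).
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "Media_Wearout_Indicator" in line:   # Intel attribute 233
                return int(line.split()[3])
        return None

    w = wearout_pct()
    if w is not None and w < 20:    # replacement threshold: an assumption
        print("order a replacement before the array forces the issue")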
SSD drives with SLC memory (enterprise SSDs) are rated for 100,000 P/E cycles, so they should last a while unless you are writing a truly massive amount of data. AnandTech had a nice little writeup about SLC vs MLC vs TLC memory a little while ago:
The risk of correlated failures is indeed non-trivial with SSDs, and plain RAID is riskier, so be sure to keep a watchful eye on your arrays.
As a slow-burning side project I'm trying to create a disk survey project (http://disksurvey.org), and such information is of great interest to me.
Sorry, that is terrible advice. Do not do that.
In availability planning, two is one and one is zero.
If you love your data you run your databases in pairs. If you really love your data you run them in triplets. This applies no matter what disk technology you're using.
Speculation about failure rates or failure prediction doesn't belong here. Your server can go up in flames at any time for a dozen reasons, the disks being only one of them.
I learned this in a way that ensures the advice is burned into my memory forever:
It was when IBM had one of their worst ever manufacturing problems for one of their drive ranges.
While the IBM distributor we dealt with was very fast at turning around replacement drives, we had some nerve-wracking weeks when the second drive in one of our arrays failed only something like 6-9 months after we went live, and that's when we found out about the problem.
They all failed one after the other within a week or two of each other. Every drive in our main user mailbox storage array...
Thankfully for us, the gap between failures was long enough that the array was rebuilt (and then some) each time, but we spent a disproportionate amount of time babysitting backups and working on contingency plans because we'd made that stupid mistake.
(And I'll never ever build out a single large array, or do any number of other things - it made me spend a lot of time thinking about and reading up on redundancy and disaster recovery strategies, as it scared the hell out of me; it was mostly luck that prevented us from losing a substantial amount of data.)
If I had, yes I would agree with you 100%. However, far from suggesting anything remotely like that, I made sure to work in the phrase "add that extra step." It's not a panacea, it's an additional thing that needs to be done to account for one new quirk that a particular technology throws into the mix.
It's not disputed that there's a slightly increased chance of concurrent disk failures with SSDs, but on what basis is a second failure before the rebuild any better than one during it?
Also, I'm guessing you're referring to RAID 5, as RAID 6 / RAID-DP is immune to double-disk failure, and RAID 10 and 0+1 are more tolerant of it.
Can something like Munin do this out of the box?
The rule of thumb that I've heard thrown about is, "If you touch it more than once a day, move to flash. If you touch it more than once an hour, move to memory."
While we can debate where that actual line falls based on both the price and performance of the various media (and, as the price of flash drops, it may be more like "once every couple of days"), it's important to note that frequency of access is critical when determining which media to put your data on.
We have some 50 TB+ data sets that are queried weekly for analytics, which don't make a heckuva lot of sense on flash storage. Contrariwise, our core device files are queried multiple times a second, so we make certain those database servers always have enough memory to keep the dataset in the memory cache, even if that means dropping 256 GB into those database servers for larger customers.
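That rule of thumb reduces to a trivial decision function. A sketch, with the thresholds taken straight from the wording above:

    def suggest_tier(touches_per_day: float) -> str:
        if touches_per_day > 24:   # more than once an hour
            return "memory"
        if touches_per_day > 1:    # more than once a day
            return "flash"
        return "spinning disk"

    print(suggest_tier(1 / 7))    # weekly analytics scan -> spinning disk
    print(suggest_tier(86400))    # multiple times a second -> memory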
There is an updated version that also talks about SSDs.
EDIT: Looks like the list archives didn't preserve the thread very well, so here is the original question for anyone interested:
SSDs are a bit different: even if you do not use write caching on the SSD, there is a non-trivial amount of metadata kept in RAM that needs to be written safely to the media on power loss. You need quite a bit of juice to do all this work, which entails keeping most of the hardware running.
Even an HDD has enough capacitor power to park the head back and lock it safely. But I believe you don't really need all the hardware operating at full capacity for that, only enough power to pull the head back; the platters keep spinning on their own momentum during the parking time even if you stop powering them.
Neither HDDs nor SSDs guarantee much about an IO whose write was started but never acknowledged. The SCSI standard, from which all disks derive their requirements, mandates nothing in such a case and leaves the behavior undefined.
A lot of applications can lose the last 100 ms of writes, especially if it's rare thanks to a k-safe cluster design, as long as you don't end up with a corrupted file format. A good transaction-log-based system will recover, as the author's should.
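For anyone unfamiliar with the pattern, here's a toy sketch of transaction-log recovery (not the author's actual system): fsync before acknowledging, and on restart drop any torn tail left behind by a crash.

    import json, os

    def wal_append(path, record):
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # durable only if the drive honors fsync

    def wal_replay(path):
        records = []
        if not os.path.exists(path):
            return records
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    break          # partial record from a crash: stop here
        return records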
For example, when a rotating drive fails, you might lose +/- 4 KB around the sector previously under write, whereas with particular SSDs he witnessed 1 MB chunks zeroed out every N MB across the entire drive. That kind of thing you simply can't work around in software.
Unless the drive lies. Tells filthy dirty ugly lies.
And lots of them do.
OS: "Did you persist that data I flushed?"
Disk controller (hiding single copy in RAM cache behind its back): "Oh absolutely, bro. It's tight. Solid. You got nothing to worr-"BZZZT
Be aware that the performance characteristics of flash are very unlike spinning disks, and vary widely between models. You will see things like weird stalls, wide latency variance, and write performance that is all over the place during sustained operations and depending on disk fullness. I chose Intel 520s because they performed better on MySQL Performance Blog benchmarks than the then-current Samsung offering, and because of OCZ's awful rep. I hit about 5K write IOPS spread across two SSDs before my load becomes CPU-bound, which is nowhere near benchmark numbers but pretty sweet for a sub-$1k disk investment.
It's also my understanding that non-server flash drives like the ones recommended by the article do not obey fsync and are suspect from an ACID standpoint. RAID mirroring does not fix this: if integrity across sudden power loss is critical, you might not be able to use these at all and will have to find a more expensive server SSD.
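You can get a weak signal about this without cutting power: time repeated write+fsync cycles. Suspiciously fast fsyncs suggest the drive is acknowledging from volatile cache. This is only a crude probe (the temp file name is arbitrary); a real pull-the-plug test is the only conclusive check.

    import os, time

    fd = os.open("fsync_probe.tmp", os.O_WRONLY | os.O_CREAT, 0o600)
    cycles = 100
    t0 = time.perf_counter()
    for _ in range(cycles):
        os.write(fd, b"\0" * 4096)
        os.fsync(fd)        # should cost a device round-trip each time
    elapsed = time.perf_counter() - t0
    os.close(fd)
    os.unlink("fsync_probe.tmp")
    print(f"avg write+fsync: {elapsed / cycles * 1e3:.2f} ms")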
p.s. The RAM benefits the article mentions are real and potentially huge. My query and insert performance has gone from having heavy RAM scalability issues to RAM hardly mattering at all. This is all on MariaDB on a non-virtualized server; I'm looking forward to better SSD-tuned databases in the future doing even better.
SSD qualification is a tedious job.
It's not even just about the performance of the SSD; the other points to care about are non-trivial if you intend to use lots of SSDs for important tasks. I collected some questions to consider at http://disksurvey.org/blog/2012/11/26/considerations-when-ch...
"Flash is 10x more expensive than rotational disk. However, you’ll make up the few thousand dollars you’re spending simply by saving the cost of the meetings to discuss the schema optimizations you’ll need to try to keep your database together."
Lots of great technical details presented in a commonsense style, well worth a read.
Unless, you know, you're storing a lot of stuff, are quite happy with your current level of performance, and don't want to shell out a load on new hardware that will fail quicker.
The failure rate of drives shouldn't be a huge concern; data is kept on redundant drives, and replacing them is just a matter of routine maintenance. The data is typically worth considerably more than the drives it sits on, by several orders of magnitude.
It depends on your access patterns and how large your active data set grows. Just because your entire DB is 5 TB doesn't tell you much by itself. You could be running a forum where only the most recent 2 GB of posts are read by humans, most people are reading rather than contributing, and the rest is trawled through by indexing bots.
I'm perfectly happy with all DB performance when the write load is reasonable, indexes are doing the right thing, and the working data fits into memory (which these days can be multiple hundreds of GB; just pray to the gods of uptime you don't have to fail over to a cold secondary server).
Data for 1M users would be ~40 GB, of which only 8 GB would be relevant at any single point in time. Well within the realm of in-memory caching.
DB writes (ours are batchy, driven by external events that we're not in control of) would be our biggest bottleneck, but we offset them with MEMORY-based MySQL tables that hold the 'in progress' data and then update the disk-based tables for all users in off-peak times.
We can speed up DB writes with more spindles, and then by an order of magnitude by moving to SSD(s). A single DB server at ~$5000 to handle the data for 10M users and, if necessary, a number of mysql slave servers. Subsequent scaling is just sharding.
Compared to serving that data via HTTP[S], the DB side of our application is easy.
The article acknowledged up-front that HDD may make sense if your working set fits into memory.
For example, to be able to get a Linode at only a fraction more cost (say, a 10% premium) with the disk being SSD (and obviously reduced capacity compared to HDD).
I have seen the current offerings but found them to be either too costly (AWS, and only on the largest instances) or too onerous (ssdnodes.com, whose base products aren't aligned with costs elsewhere; moving all of your hosts to be near your SSD-powered database is a big task when I'm only after a small change).
I was even considering co-locating as the most cost-effective way to get SSDs while providers still massively overprice them. It all feels a bit like the RAM scam a decade ago, when they'd charge you nearly the full cost of the RAM every 2 months. Again, though... co-location fell into the onerous class of actions.
Right now, pragmatically I stay with HDD and Linode.
But Linode should look at my $500 per month account and be well aware that as soon as I see a competitor offer SSD nodes at a cost-competitive point that offsets the burden to move... I'll be gone.
See comment by pjungwir.
(Their management changed a couple of years ago and made a bunch of changes that pissed me off, so I left and moved to Linode.)
EDIT: vps.net charges $25/month for 1GB of storage on FusionIO drives, and they sell in 2GB increments. 2GB is $50/month. 36GB is $900/month
Both CloudSigma and ElasticHosts have SSD options.
On CloudSigma you can even grab a small SSD to use as a L2ARC or ZFS log device :)
It's cost-effective because in Australia, the major determinant of monthly hosting bills is our ludicrously overpriced bandwidth fees.
However, page caching algorithms are pretty generic, and database people think performance could be improved by using customized DB caching routines instead of the generic OS ones.
There are two ways to bypass OS page cache: 1) directly access the disk as a block device; or 2) use the O_DIRECT flag to disable page cache on a per-file/directory basis.
Direct access to the disk as a block device would be ideal from a performance and flexibility point of view, but then you lose all the benefits and tools of managing databases as files. The O_DIRECT flag seems to strike a sweet spot, and that's what ended up being used most in the real world.
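A minimal Linux-only sketch of option 2: O_DIRECT demands that the buffer, offset, and length all be aligned to the logical block size (misalignment fails with EINVAL), and an anonymous mmap gives a page-aligned buffer for free. The file name is just a placeholder.

    import mmap, os

    BLOCK = 4096
    with open("testfile", "wb") as f:       # set up something to read
        f.write(b"x" * BLOCK)

    fd = os.open("testfile", os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)              # page-aligned buffer
    n = os.readv(fd, [buf])                 # bypasses the page cache
    os.close(fd)
    print(f"read {n} bytes without touching the page cache")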
Beyond that, nobody is really interested in improving the fadvise() interface, which is supposed to be better than O_DIRECT from Linus's perspective. You know, the “worse is better” thing.
I'm not a db or kernel dev, but I've been watching this argument for a while, and so far that's my understanding. Very interesting.
Yep. Write amplification is a big deal on SSDs, and it gets worse due to their internal garbage collection if you give them a high-entropy write pattern. This is not a problem with TokuDB, though. See our "advantage 3" here: http://www.tokutek.com/2012/09/three-ways-that-fractal-tree-...
It's too bad longevity worries are keeping them out of the no-commitment market.
There isn't currently an easy way to migrate to pIOPS from traditional RDS, but the performance is fantastic and works as advertised.