
Switch Your Databases To Flash Storage - jpmc
http://highscalability.com/blog/2012/12/10/switch-your-databases-to-flash-storage-now-or-youre-doing-it.html
======
bunderbunder
_Wear patterns and flash are an issue, although rotational drives fail too.
There are several answers. When a flash drive fails, you can still read the
data. A clustered database and multiple copies of the data, you gain
reliability – a server level of RAID. As drives fail, you replace them._

Unlike magnetic disks, SSDs have a tendency to fail at a really predictable
rate. So predictably that if you've got two drives of the same model, put them
into commission at the same time, and subject them to the same usage patterns,
they will probably fail at about the same time. That's a real problem if
you're using SSDs in a RAID array, since RAID's increased reliability relies
on the assumption that it's very unlikely for two drives to fail at about the
same time.

With an SSD, though, once one drive goes there's a decent (perhaps small, but
far from negligible) chance that a second drive will go out before you've had
a chance to replace the first one. That makes things complicated, but it's
still much better than the similarly likely scenario where a second SSD fails
shortly after you replace the first one, because then the failure may well
happen during the rebuild, and if that happens it really will bring down the
whole RAID array.

That said, if you're careful then that predictability should be a good thing.
A good SSD will keep track of wear for you. So all you've got to do is monitor
the status of the drives, and replace them before they get too close to their
rated lifespan. If you add that extra step you're probably actually improving
your RAID's reliability. But if you treat your RAID as if SSDs are just fast
HDDs, you're asking for trouble.
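
For what it's worth, monitoring that wear is easy to script. A rough sketch
using smartmontools (this assumes smartctl is installed and that the drive
exposes a vendor wear attribute such as Media_Wearout_Indicator or
Wear_Leveling_Count; names vary by vendor):

    # Rough sketch: poll SSD wear via smartmontools and flag drives that are
    # approaching their rated write endurance.
    import subprocess

    WEAR_ATTRS = ("Media_Wearout_Indicator", "Wear_Leveling_Count")

    def wear_value(device):
        """Normalized wear value (100 = new, near 0 = worn out), or None."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        for line in out.splitlines():
            if any(attr in line for attr in WEAR_ATTRS):
                # smartctl -A columns: ID# NAME FLAG VALUE WORST THRESH ...
                return int(line.split()[3])
        return None

    if __name__ == "__main__":
        value = wear_value("/dev/sda")
        if value is not None and value < 20:
            print("replace this drive soon; wear value is", value)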

~~~
baruch
In my experience, assuming this predictability is not a good idea. SSDs fail
in various ways; some are predictable and some are completely unpredictable.
It is also not true that an SSD failure simply means the drive goes into
read-only mode. I've seen plenty of SSDs fail unexpectedly and become
unreadable, returning sense key 0x4 (HARDWARE ERROR), and the only recourse is
to ship them out.

The risk of correlated failures is indeed non-trivial with SSDs, and plain
RAID is riskier as a result; be sure to keep a watchful eye on your arrays.

~~~
ChuckMcM
We currently have about 3,500 SSDs in production across our clusters. I worry
about them all deciding to fail at once, but so far the failures have been
sporadic (about half of which leave the drive unusable).

~~~
baruch
Are they all the same model? How long have they been running? Is the load
relatively similar on all of them? Can you share SMART attributes for them?
(In private if needed.)

As a slow-burning side project I'm trying to create a disk survey project
(<http://disksurvey.org>), and such information is of great interest to me.

------
ghshephard
I'm surprised that the author didn't capture what I consider to be the most
important component of HDD/Flash/Memory Balancing - frequency of access.

The rule of thumb that I've heard thrown about is, "If you touch it more than
once a day, move to flash. If you touch it more than once an hour, move to
memory."

While we can debate where that actual line falls based on both the price and
performance of the various media (and, as the price of flash drops, it may be
more like "once every couple of days"), it's important to note that frequency
of access is critical when determining which media to put your data on.

We have some 50 TB+ data sets that are queried weekly for analytics and that
don't make a heckuva lot of sense on flash storage. Contrariwise, our core
device files are queried multiple times a second, so we make certain those
database servers always have enough memory to keep the dataset in the memory
cache, even if that means dropping 256 GB into those database servers for
larger customers.

~~~
rxin
Jim Gray wrote a classic paper about this: "The 5 Minute Rule for Trading
Memory for Disc Accesses and the 10 Byte Rule for Trading Memory for CPU Time".
<http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf>

There is an updated version that also talks about SSDs.
[http://cacm.acm.org/magazines/2009/7/32091-the-five-
minute-r...](http://cacm.acm.org/magazines/2009/7/32091-the-five-minute-
rule-20-years-later/fulltext)
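
The rule itself is just a break-even calculation: keep a page in RAM if it is
re-read more often than the interval at which the RAM holding it costs as much
as the I/O capacity needed to re-fetch it. A rough sketch of that arithmetic
(the prices and IOPS figures below are made-up placeholders, not quotes):

    # Break-even caching interval, in the spirit of Gray's five-minute rule.
    # A page is worth keeping in RAM if it is re-accessed more often than this.
    def break_even_seconds(page_size_bytes, device_price, device_iops,
                           ram_price_per_gb):
        pages_per_gb = (1 << 30) / page_size_bytes
        price_per_iops = device_price / device_iops        # $ per (access/sec)
        price_per_page_of_ram = ram_price_per_gb / pages_per_gb
        return price_per_iops / price_per_page_of_ram

    # Placeholder numbers: a 7200rpm disk vs. a consumer SSD, 4 KB pages.
    print(break_even_seconds(4096, device_price=80, device_iops=120,
                             ram_price_per_gb=10))    # disk: hours
    print(break_even_seconds(4096, device_price=200, device_iops=20000,
                             ram_price_per_gb=10))    # SSD: a few minutes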

------
pjungwir
The PostgreSQL mailing list is having a conversation right now about using
SSDs. This seems like a very important comment for anyone considering them:

    http://archives.postgresql.org/pgsql-general/2012-12/msg00202.php

Basically, you need to make sure that you buy SSDs with a capacitor that
allows the drive to flush what it needs in the event of abrupt power loss.

EDIT: Looks like the list archives didn't preserve the thread very well, so
here is the original question for anyone interested:

    http://archives.postgresql.org/pgsql-general/2012-11/msg00427.php

~~~
bbulkow
Is this because PostgreSQL's file format will become hopelessly confused and
unable to restart if flush doesn't work?

A lot of applications can tolerate losing the last 100ms of writes, especially
if it's rare thanks to a k-safe cluster design, as long as you don't end up
with a corrupted file format. A good transaction-log-based system will recover
- as the author's should.
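
The recovery guarantee hinges on the log append actually reaching stable media
before the commit is acknowledged - roughly this pattern (a minimal sketch, not
any particular database's implementation):

    # Minimal write-ahead-log append: acknowledge the commit only after fsync()
    # returns. If the drive acks the flush without a capacitor-backed cache
    # actually persisting the data, this guarantee is void on power loss.
    import os

    class WAL:
        def __init__(self, path):
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

        def commit(self, record: bytes):
            os.write(self.fd, record + b"\n")
            os.fsync(self.fd)   # durability point

    wal = WAL("/tmp/example.wal")
    wal.commit(b"INSERT ...")   # only now is it safe to report "committed"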

~~~
hosay123
It's not only a case of file formats... a few years ago one of the Linux
kernel developers (Theodore Ts'o, IIRC) made a post regarding drive behaviour
under power loss, and the results were pretty insane.

For example, when a rotating drive loses power you might lose +/- 4 KB around
the sector under write, whereas with particular SSDs he witnessed 1 MB chunks
zeroed out every N MB across the entire drive. That kind of thing you simply
can't work around in software.

------
bcoates
I love my consumer SSD backed database, but don't get visions of 380,000 IOPS
on a real workload quite yet. Like any radical performance increase on just
one component it's more likely to just reveal a non-disk latency bottleneck
somewhere else in your system.

Be aware that the performance characteristics of flash are very unlike
spinning disks, and vary widely between models. You will see things like weird
stalls, wide latency variance, and write performance being all over the place
during sustained operations and depending on disk fullness. I chose Intel 520s
because they performed better in MySQL Performance Blog benchmarks than the
then-current Samsung offering [1], and because of OCZ's awful rep [2]. I hit
about 5K write IOPS spread across two SSDs before my load becomes CPU-bound,
which is nowhere near benchmark numbers but pretty sweet for a sub-$1k disk
investment.
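
If you want to see those stalls yourself, a crude sustained random-write probe
makes the latency spread obvious on many drives. This is only a sketch, not a
substitute for a real benchmark tool:

    # Write fsync'd 4 KiB blocks at random offsets and report the latency
    # spread; stalls show up as a big gap between the median and the tail.
    import os, random, time

    PATH, BLOCK, FILE_SIZE, WRITES = "testfile.bin", 4096, 1 << 30, 2000

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    os.ftruncate(fd, FILE_SIZE)
    block = os.urandom(BLOCK)                 # incompressible data
    latencies = []
    for _ in range(WRITES):
        offset = random.randrange(0, FILE_SIZE - BLOCK, BLOCK)
        start = time.perf_counter()
        os.pwrite(fd, block, offset)
        os.fsync(fd)
        latencies.append(time.perf_counter() - start)
    os.close(fd)

    latencies.sort()
    for pct in (50, 95, 99, 100):
        idx = min(len(latencies) - 1, int(len(latencies) * pct / 100))
        print("p%d: %.2f ms" % (pct, latencies[idx] * 1000))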

It's also my understanding that non-server flash drives like those recommended
by the article do not obey fsync and are suspect from an ACID standpoint. RAID
mirroring does not fix this--if integrity across sudden power loss is critical
you might not be able to use these at all and will have to find a more
expensive server SSD.

[1] [http://www.mysqlperformanceblog.com/2012/04/25/testing-
samsu...](http://www.mysqlperformanceblog.com/2012/04/25/testing-samsung-ssd-
sata-256gb-830-not-all-ssd-created-equal/)

[2] [http://www.behardware.com/articles/881-7/components-
returns-...](http://www.behardware.com/articles/881-7/components-returns-
rates-7.html)

p.s. the RAM benefits the article mentions are real and potentially huge. My
query and insert performance has gone from having heavy RAM scalability issues
to it hardly mattering at all. This is all on MariaDB on a non-virtualized
server; I'm looking forward to better SSD-tuned databases in the future doing
even better.

~~~
jdcryans
The "weird stalls" can be attributed to GC on lower end SSDs:

[http://en.wikipedia.org/wiki/Garbage_collection_(SSD)#Garbag...](http://en.wikipedia.org/wiki/Garbage_collection_\(SSD\)#Garbage_collection)

~~~
baruch
There are several background operations that happen on SSDs; low end or high
end doesn't matter, they all need them. I've seen quite a few supposedly
high-end SSDs that show abysmal behavior, with wide variations in performance
over time.

SSD qualification is a tedious job.

It's not even just about the performance of the SSD; the other points to care
about are non-trivial if you intend to use lots of SSDs for important tasks. I
collected some questions to consider at
[http://disksurvey.org/blog/2012/11/26/considerations-when-
ch...](http://disksurvey.org/blog/2012/11/26/considerations-when-choosing-ssd-
storage/)

------
paulsutter
My favorite quote:

"Flash is 10x more expensive than rotational disk. However, you’ll make up the
few thousand dollars you’re spending simply by saving the cost of the meetings
to discuss the schema optimizations you’ll need to try to keep your database
together."

Lots of great technical details presented in a commonsense style, well worth a
read.

~~~
pdog
Sounds great in theory, but in practice you'll be having that conversation
about your database schema anyway.

~~~
IheartApplesDix
Nah, just throw it all into a NoSQL store and let the developers figure it
out.

~~~
ovi256
NoSQL, making devs into DBAs since 2010: "But hey, look, we never hired any
DBAs, that's a win, right?"

~~~
stcredzero
Object databases and ORM were trying to do that since at least the '90s.

~~~
jfb
And look how well that's worked out.

~~~
stcredzero
My point exactly.

------
jiggy2011
"Switch Your Databases To Flash Storage. Now. Or You're Doing It Wrong."

Unless, you know, you're storing a lot of stuff, are quite happy with your
current level of performance, and don't want to shell out a load on new
hardware that will fail more quickly.

~~~
hnriot
Is anyone ever actually happy with database performance? I have never met a
customer that wouldn't welcome better performance for so little outlay.

The failure rate of drives shouldn't be a huge concern; data is kept on
redundant drives, and replacing them is just a matter of routine maintenance.
The data is typically worth considerably more than the drives it sits on - by
several orders of magnitude.

~~~
seiji
_Is anyone ever actually happy with database performance?_

It depends on your access patterns and how large your active data set grows.
The fact that your entire DB is 5 TB doesn't mean anything by itself. You
could be running a forum where only the most recent 2 GB of posts are read by
humans, most people are reading rather than contributing, and the rest is
trawled through by indexing bots.

I'm perfectly happy with DB performance when the write load is reasonable,
indexes are doing the right thing, and the working data fits into memory
(which these days can be many hundreds of GB -- just pray to the gods of
uptime that you don't have to fail over to a cold secondary server).

~~~
hnriot
Good architecture eliminates the need for prayer.

------
staunch
Shameless plug alert. At Uptano[1], this is one of the neatest things we've
seen with our very inexpensive SSD machines. It's amazing what you can do with
8GB RAM + 100 GB RAID1 SSD. It's probably the best price:performance DB you
can run, and is sufficient for ~95% of projects.

1\. <https://uptano.com>

~~~
lasonrisa
Hey, nice redesign of your site. I like this colour palette better.

------
buro9
I would _love_ it if cloud providers offered SSD options for their full range
of boxes.

For example, to be able to get a Linode at only a fraction more of the cost
(say, a 10% premium) with the disk being SSD (and obviously reduced capacity
compared to HDD).

I have seen the current offerings but found them to be either too costly (AWS,
and only on one of the largest instances) or too onerous (ssdnodes.com, whose
base products aren't aligned with costs elsewhere; moving all of your hosts to
sit near your SSD-powered database is a big task when I'm only after a small
change).

I was even considering co-locating as the most cost-effective way to get SSDs
when providers still massively overprice them. It all feels a bit like the RAM
scam a decade ago when they'd charge you near the cost of the RAM every 2
months. Again though... co-location fell into the onerous class of actions.

Right now, pragmatically I stay with HDD and Linode.

But Linode should look at my $500 per month account and be well aware that as
soon as I see a competitor offer SSD nodes at a cost-competitive point that
offsets the burden to move... I'll be gone.

~~~
otakucode
Since NAND flash is being price-fixed, and likely will be for several more
years to come (we're just now starting to see LCD prices drop to reasonable
levels after years and years of price fixing; I expect it'll take a similar
amount of time for the NAND fixing to get busted and the market to respond), I
don't think a 10% premium will be at all possible for a very long time. NAND
storage SHOULD be significantly cheaper than rotational storage since it is
cheaper to produce, requires fewer exotic materials, has a far wider market,
etc. And eventually it will be. But for now, there is a huge premium on NAND
storage. I am sure Amazon, EMC, and the others have done the math and simply
don't think there is a significant market of people willing to pay for such a
service, at least not at the steep rates they would have to charge to meet
their growth projections.

~~~
wmf
Note he didn't say 10% premium for the same capacity. It's possible for cloud
providers to replace $50 hard disks with $64 SSDs today.

~~~
jeffdavis
Does a $64 SSD have enough charge to flush pending writes after a power
failure? If not, then be prepared for widespread corruption.

See comment by pjungwir.

~~~
bcoates
You're already on a cloud system, so you have to have a plan in place for your
instance to up and disappear without warning. If your instance has an
unplanned outage of any kind, you kill it and spawn a new one. You may as well
use libeatmydata and reap the performance benefit.

~~~
j-kidd
You are describing EC2. Most other providers do provide proper persistence.

------
knappador
Those caught in the middle between DB size needs and performance would do well
to take a look at Bcache. <http://bcache.evilpiepirate.org/> It's a block-level
write-back cache and seems to perform really nicely. Here are some benchmarks.
[http://www.accelcloud.com/2012/04/18/linux-flashcache-and-
bc...](http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-
performance-testing/)

------
cioc
Does anyone else find the section "Don’t use someone else’s file system" a bit
confusing? It starts off by convincingly saying O_DIRECT shouldn't be used and
then goes on to say O_DIRECT works very well.

~~~
riobard
Linus Torvalds said O_DIRECT shouldn't be used because Linux already
implements a page cache and application developers should not bother to
reinvent the wheel.

However, page caching algorithms are pretty generic, and database people think
performance could be improved by using customized DB caching routines instead
of the generic OS ones.

There are two ways to bypass OS page cache: 1) directly access the disk as a
block device; or 2) use the O_DIRECT flag to disable page cache on a per-
file/directory basis.

Direct access to the disk as a block device would be ideal from the
performance and flexibility point of view, but then you lose all the benefits
and tools that come with managing databases as files. The O_DIRECT flag seems
to strike a sweet spot, and that's what ended up being used most in the real
world.

Then again, nobody is really interested in improving the fadvise interface,
which is supposed to be better than O_DIRECT from Linus's perspective. You
know, the “worse is better” thing.

I'm not a db or kernel dev, but I've been watching this argument for a while,
and so far that's my understanding. Very interesting.

------
stephenpiment
Clearly, there are different usage regimes where different solutions will make
sense. Nonetheless, there's a really strong case to be made that SSDs have
entered a sweet spot in terms of price/performance for databases, and this
trend is only accelerating. Here's one discussion of the rationale:
<http://www.foundationdb.com/#SSDs>.

------
leif
"Use large block writes and small block reads"

Yep. Write amplification is a big deal on SSDs, and it gets worse, thanks to
their internal garbage collection, if you give them a high-entropy write
pattern. This is not a problem with TokuDB, though. See our "advantage 3" here:
[http://www.tokutek.com/2012/09/three-ways-that-fractal-
tree-...](http://www.tokutek.com/2012/09/three-ways-that-fractal-tree-indexes-
improve-ssd-for-mysql/)
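
In practice "large block writes" means buffering small updates and flushing
them as one big sequential append instead of scattering 4 KB writes across the
device - the same idea log-structured and fractal-tree engines are built on. A
toy sketch of the batching (not TokuDB's actual code, obviously):

    # Toy write batcher: accumulate small records and flush them as one large
    # append, so the SSD sees a few big writes instead of many scattered ones.
    import os

    class BatchingWriter:
        def __init__(self, path, flush_bytes=1 << 20):   # ~1 MB flush units
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
            self.flush_bytes = flush_bytes
            self.pending, self.pending_size = [], 0

        def write(self, record: bytes):
            self.pending.append(record)
            self.pending_size += len(record)
            if self.pending_size >= self.flush_bytes:
                self.flush()

        def flush(self):
            if self.pending:
                os.write(self.fd, b"".join(self.pending))  # one large write
                os.fsync(self.fd)
                self.pending, self.pending_size = [], 0

    w = BatchingWriter("append.log")
    for i in range(100000):
        w.write(("row %d\n" % i).encode())
    w.flush()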

------
dutchbrit
Funny, and it's a no-brainer really... There was a thread about SSDs about two
years back regarding good ways to use them. My conclusion was pretty much the
same when it came to DBs, yet nobody agreed with me back then and I received
three downvotes. Odd!

Good article!

------
trotsky
I don't disagree with the conclusions, but don't you have to short-stroke
those SSDs pretty significantly in a high-transaction environment to avoid
write amplification?

It's too bad longevity worries are keeping them out of the no-commitment
market.

~~~
bbulkow
Not really. "short stroking" is called "overprovisioning" with SSDs, and
you'll see different effects with different drives. The magic number with most
consumer SSDs (the mentioned Intel and Samsung drives) do best with about 20%
overprovisioning. The "enterprise class" drives don't require this - they bake
in the overprovisioning. The new Intel s3700 works extraordinarily well with
no overprovisioning.
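
If a drive doesn't bake it in, overprovisioning is usually done by leaving
part of the drive unpartitioned (after a secure erase, so the controller knows
the space is free). The arithmetic is just spare space divided by usable
space - a quick sketch with example figures:

    # Overprovisioning: OP = spare / usable. To hit a target OP level, only
    # partition drive_size / (1 + OP) of the drive and leave the rest empty.
    def partition_size_for_op(drive_gb, op_fraction):
        return drive_gb / (1 + op_fraction)

    print(partition_size_for_op(256, 0.20))   # ~213 GB of a 256 GB drive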

------
crazygringo
Anyone have any idea when Amazon will start providing SSD-backed RDS?

~~~
oasisbob
They already do, although it is quite new. The AWS provisioned IOPS layer
("pIOPS") is SSD-backed, and can be used for RDS:

<http://aws.amazon.com/rds/#PIOPS>

There isn't currently an easy way to migrate to pIOPS from traditional RDS,
but the performance is fantastic and it works as advertised.

------
dschiptsov
Only that _append-only_ journal (transaction log).

