
What we learned about SSDs in 2015 - NN88
http://www.zdnet.com/article/what-we-learned-about-ssds-in-2015/
======
velox_io
The most exciting recent development in SSDs (until 3D XPoint is released) is
bypassing the SATA interface and connecting drives straight to the PCIe bus
(no more expensive RAID controllers). It's just a shame hardly any server
motherboards come with M.2 slots right now.

The 4x speed increase and lower CPU overhead mean it is now possible to move
RAM-only applications (for instance, in-memory databases) to SSDs, keeping
only the indexes in memory. We've been going that way for a while; it just
seems we've come a long way from the expensive Sun E6500s I was working with
just over a decade ago.
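
As a minimal sketch of that pattern (hypothetical Python, not any particular
database): values live in an append-only file on the SSD, and only a
key-to-offset index stays in RAM.

    import json, os

    class DiskStore:
        """Append-only store: data on the SSD, index (key -> offset) in RAM."""

        def __init__(self, path):
            self.f = open(path, "a+b")
            self.index = {}                        # the in-RAM index

        def put(self, key, value):
            self.f.seek(0, os.SEEK_END)
            self.index[key] = self.f.tell()        # remember where it landed
            self.f.write(json.dumps(value).encode() + b"\n")
            self.f.flush()

        def get(self, key):
            self.f.seek(self.index[key])           # one SSD read per lookup
            return json.loads(self.f.readline())

    store = DiskStore("/tmp/store.db")
    store.put("user:1", {"name": "alice"})
    print(store.get("user:1"))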

~~~
wtallis
M.2 slots don't make much sense for servers, at least if you're trying to take
advantage of the performance benefits possible with a PCIe interface. Current
M.2 drives aren't even close to saturating the PCIe 3.0 x4 link but they're
severely thermally limited for sustained use and they're restricted in
capacity due to lack of PCB area. Server SSDs should stick with the
traditional half-height add-in card form factor with nice big heatsinks.

~~~
lsc
Most of the NVMe backplanes I've seen give full enterprise 2.5" drive
clearance, so if the drives are actually only as thick as consumer SSDs (as
most current SATA 'enterprise' SSDs are), there's plenty of room for a
heatsink without expanding the slot. The Supermicro chassis (and I've only
explored the NVMe backplanes from Supermicro) usually put a lot of effort
into drawing air through the drives, so assuming you put in blanks and so on,
the airflow should be there, if the SSDs are set up to take advantage of it.

~~~
wtallis
You need to be more precise with your terminology. There is no such thing as
an NVMe backplane. NVMe is a software protocol. The backplanes for 2.5" SSDs to
which you refer would be PCIe backplanes using the SFF-8639 aka U.2 connector.
None of the above is synonymous with the M.2 connector/form factor standard,
which is what I was talking about.

~~~
lsc
Edit: okay, I re-read what you said, and yes, these won't support M.2 drives,
if I understand what's going on here, and it's possible I still don't. (I
have yet to buy any non-SATA SSD, though I will soon be making experimental
purchases.)

I was talking about these:

[http://www.supermicro.com/products/nfo/NVMe.cfm](http://www.supermicro.com/products/nfo/NVMe.cfm)

Note, though, it looks like if you are willing to pay for a U.2-connected
drive, you can get 'em with the giant heatsinks you want:

[http://www.pcper.com/news/Storage/Connector-Formerly-Known-S...](http://www.pcper.com/news/Storage/Connector-Formerly-Known-SFF-8639-Now-Called-U2)

further edit:

[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-750-spec.pdf)

It's available; not super cheap, but I'm not sure you'd want the super-cheap
consumer-grade stuff in a server, anyhow.

further edit:

But I object to the idea of putting SSDs on PCIe cards for any but disposable
"cloud"-type servers (unless they are massively more reliable than any I've
seen, which I don't think is the case here). With a U.2-connected drive in a
U.2 backplane, I can swap a bad drive like I would swap a bad SATA drive: an
alert goes off, I head to the co-lo as soon as convenient, and I can swap the
drive without disturbing users. With a low-profile PCIe card, I pretty much
have to shut down the server, de-rack it, and then make the swap, which
causes downtime that must be scheduled, even if I have enough redundancy that
there isn't any data loss.

~~~
wtallis
Take a look at how much stricter the temperature and airflow requirements are
for Intel's 2.5" U.2 drives compared to their add-in card counterparts. (And
note that the U.2 drives are twice as thick as most SATA drives.)

M.2 has almost no place in the server market. U.2 does and will for the
foreseeable future, but I'm not sure that it can serve the high-performance
segment for long. It's not clear whether it will reach the limits on capacity,
heat, or link speed first, but all of those limits are clearly much closer
than for add-in cards.

~~~
lsc
>M.2 has almost no place in the server market. U.2 does and will for the
foreseeable future, but I'm not sure that it can serve the high-performance
segment for long. It's not clear whether it will reach the limits on capacity,
heat, or link speed first, but all of those limits are clearly much closer
than for add-in cards.

No argument on M.2 - it's consumer-grade technology. No doubt someone in the
"cloud" space will try it... I mean, if you rely on "ephemeral disk" - well,
this is just "ephemeral disk" that goes funny sooner than spinning disk.

But the problem remains: if your servers aren't disposable, if your servers
can't just go away at a moment's notice, the form factor of add-in cards is
going to be a problem for you, unless the add-in cards are massively more
reliable than I think they are. Taking down a whole server to replace a
failed disk is a no-go for most non-cloud applications...

~~~
kijiki
You're probably tired of hearing this from me, but if you distribute the
storage, you can evacuate all the VMs off a host, down it, do whatever, bring
it back up, and then unevacuate.

~~~
wtallis
And if uptime is that important, you can just buy a server that supports PCIe
hotswap.

~~~
lsc
Can you point me at a chassis designed for that?

------
japaw
"2015 was the beginning of the end for SSDs in the data center." is quit a
bold statement especial when not discussing any alternative. I do not see us
going back to magnetic disk, and most new storage technology are some kind of
ssd...

~~~
scurvy
My thoughts exactly. The article is quite inflammatory and tosses out some
bold statements without really diving deep into them. My favorite:

"Finally, the unpredictable latency of SSD-based arrays - often called all-
flash arrays - is gaining mind share. The problem: if there are too many
writes for an SSD to keep up with, reads have to wait for writes to complete -
which can be many milliseconds. Reads taking as long as writes? That's not the
performance customers think they are buying."

This is completely false in a properly designed server system. Use the
deadline scheduler with SSDs so that reads aren't starved by bulk I/O
operations. This is fairly common knowledge. Also, if you're throwing too much
I/O load at any storage system, things are going to slow down. This should not
be a surprise. SSDs are sorta magical (Artur), but they're not pure magic.
They can't fix everything.
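
For what it's worth, selecting the scheduler is a one-line sysfs write on
Linux - a minimal sketch (device name hypothetical; requires root):

    # Inspect, then set, the I/O scheduler for a block device on Linux.
    dev = "sda"  # hypothetical device name; adjust for your system
    path = "/sys/block/%s/queue/scheduler" % dev

    with open(path) as f:
        print(f.read().strip())    # e.g. "noop [deadline] cfq"

    with open(path, "w") as f:
        f.write("deadline")        # requires root; takes effect immediately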

While Facebook started out with Fusion-io, they very quickly transitioned to
their own home-designed and home-grown flash storage. I'd be wary of using any
of their facts or findings and applying them to _all_ flash storage. In short,
these things could just be Facebook problems because they decided to go build
their own.

He also talks about the "unpredictability of all flash arrays" like the fault
is 100% due to the flash. In my experience, it's usually the RAID/proprietary
controller doing something unpredictable and wonky. Sometimes the drive and
controller do something dumb in concert, but it's usually the controller.

EDIT: It was 2-3 years ago that flash controller designers started to focus on
uniform latency and performance rather than concentrating on peak performance.
You can see this in the maturation of I/O latency graphs from the various
Anandtech reviews.

~~~
fleitz
There is unpredictability in SSDs; however, it's more a question of whether
an IOP will take 1 ns or 1 ms, instead of 10 ms or 100 ms with an HDD.

The variability is an order of magnitude greater, but the worst case is
several orders of magnitude better. Quite simply, no one cares whether you
might get 10,000 IOPS or 200,000 IOPS from an SSD when all you're going to
get from a 15K drive is 500 IOPS.

~~~
wtallis
Best-case for an SSD is more like 10µs, and the worst-case is still tens of
milliseconds. Average case and 90th percentile are the kinds of measures
responsible for the most important improvements.

And the difference between a fast SSD and a slow SSD is pretty big: for the
same workload a fast PCIe SSD can show an average latency of 208µs with 846µs
standard deviation, while a low-end SATA drive shows average latency of 1782µs
and standard deviation of 4155µs (both are recent consumer drives).
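
A quick sketch of summarizing a latency trace along those lines (plain
Python; the sample data is synthetic, not from any particular drive):

    import random, statistics

    # Synthetic stand-in for a measured trace of read latencies, in µs.
    latencies_us = [random.expovariate(1 / 200) for _ in range(100000)]

    def percentile(sorted_xs, p):
        return sorted_xs[min(len(sorted_xs) - 1, int(p / 100 * len(sorted_xs)))]

    s = sorted(latencies_us)
    print("mean  %8.1f us" % statistics.mean(s))
    print("stdev %8.1f us" % statistics.stdev(s))
    print("p90   %8.1f us" % percentile(s, 90))
    print("p99   %8.1f us" % percentile(s, 99))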

~~~
baruch
Where does one find 10µs reads? NAND usually has a Tread of 50 to 100 µs, so
the NAND operation alone takes more than 10 µs.

Tprog is around 1 ms and Terase can be upwards of 2 ms.

All in all, this means large variability in read performance, depending on
what other actions are being performed on the SSD and how well the SSD
manages the write and erase operations in the background.

This doesn't even change with the interface (SAS/SATA/PCIe); those add their
own queues and link errors, and thus variability.

Then you have differences in over-provisioning, which allow high-OP drives
to better mask the program and erase processes.
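
Back-of-the-envelope, using the timings above (purely illustrative: a read
that queues behind background work on the same die):

    # Typical NAND timings from above, in microseconds.
    T_READ, T_PROG, T_ERASE = 75, 1000, 2000

    best = T_READ                      # idle die: just the page read
    worst = T_ERASE + T_PROG + T_READ  # queued behind an erase and a program

    print("best  ~%d us" % best)       # ~75 us
    print("worst ~%d us" % worst)      # ~3075 us - a ~40x spread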

------
_ak
At one of my previous employers, they built a massive "cloud" storage system.
The underlying file system was ZFS, which was configured to put its write logs
onto an SSD. With the write load on the system, the servers burnt through an
SSD in about a year, i.e. most SSDs started failing after about a year. The
hardware vendors knew how far you could push SSDs, and thus refused to give
any warranty. All the major SSD vendors told us SSDs are strictly considered
wearing parts. That was back in 2012 or 2013.

~~~
rsync
Just a note ... we use SSDs as write cache in ZFS at rsync.net and although
you _should_ be able to withstand a SLOG failure, we don't want to deal with
it so we mirror them.

My personal insight, and I think this should be a best practice, is that if
you mirror something like an SLOG, you should source two _entirely different_
SSD models - either the newest Intel and the newest Samsung, or perhaps
previous-generation Intel and current-generation Intel.

The point is, if you put the two SSDs into operation at the exact same time,
they will experience the exact same lifecycle and (in my opinion) could
potentially fail exactly simultaneously. There's no "jitter" - they're not
failing for physical reasons, they are failing for logical reasons ... and
the logic could be identical for both members of the mirror...
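
For reference, attaching the mirrored log in ZFS looks something like this
sketch (pool name and device paths are hypothetical; the point is that the
two devices should be different SSD models):

    import subprocess

    pool = "tank"                              # hypothetical pool name
    dev_a = "/dev/disk/by-id/ata-intel-ssd"    # hypothetical: the Intel drive
    dev_b = "/dev/disk/by-id/ata-samsung-ssd"  # hypothetical: the Samsung drive

    # "zpool add <pool> log mirror <devA> <devB>" adds a mirrored SLOG.
    subprocess.run(["zpool", "add", pool, "log", "mirror", dev_a, dev_b],
                   check=True)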

~~~
leonroy
We ship voice recording and conferencing appliances based on Supermicro
hardware, a RAID controller and 4x disks on RAID 10.

We tried to mitigate correlated drive failures by mixing brands. Our
Supermicro distributor tried hard to dissuade us from using mixed batches and
brands of SAS drives in our servers. We really had to dig in our heels to get
them to listen.

Even when you buy a fully loaded NAS like a Synology, it comes with the same
brand, model, and batch of drives. In one case we saw 6 drive failures in two
months on the same Synology NAS.

I wonder whether NetApp or EMC mix brands, or at least batches, on the
appliances they ship?

~~~
baruch
I can tell you that EMC and IBM both use the same drives from the same batch
in an entire system of tens to hundreds of drives. While I can't speak to
every case, I did oversee a large number of systems and drives, and we never
had a double disk failure that completely took out two drives. With a proper
background media scan procedure you also reduce the risk of a media problem
appearing on two different drives.

Of course, the SSDs we use are properly vetted for design issues, and bugs in
the firmware actually get fixed for us in a relatively timely manner. You get
that level of service with the associated large volume.

------
ak217
This article has a lot of good information, but its weirdly sensationalistic
tone detracts from it. I appreciate learning more about 3D XPoint and Nantero,
but SSDs are not a "transitional technology" in any real sense of the word,
and they won't be displaced by anything in 2016, if nothing else because it
takes multiple years from volume manufacturing capability to stand up a
product pipeline on a new memory technology, and more years to convince the
enterprise market to start deploying it. The most solid point the article
makes is that the workload-specific performance of SSD-based storage is still
being explored, and we need better tools for it.

~~~
nostrademons
I got the sense that it was a PR hit for Nantero, bought and paid for. Notice
the arc of the article: it says "[Popular hot technology] is dead. [Big
vendors] have recently introduced [exciting new product], but [here are
problems and doubts about those]. There's also [small startup you've never
heard of] which has [alternative product] featuring [this list of features
straight from their landing page]."

Usually these types of articles are designed to lead people directly to the
product that's paying for the article. Sensationalistic is good for this; it
gets people to click, it gets people to disagree, and then the controversy
spreads the article across the net. Seems like it's working, in this case.

~~~
Laforet
Robin Harris has been advocating the abolition of the block abstraction
layer for a couple of years now, and this piece is consistent with his usual
rhetoric.

------
Hoff
FWIW, the top-end HPE SSD models are rated for up to 25 writes of the entire
drive per day, for five years.

The entry-level SSDs are rated for ~two whole-drive writes per week.

Wear gauge, et al.

[http://www8.hp.com/h20195/v2/GetPDF%2Easpx%2F4AA4%2D7186ENW%...](http://www8.hp.com/h20195/v2/GetPDF%2Easpx%2F4AA4%2D7186ENW%2Epdf)
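
As rough arithmetic on what a rating like that implies (drive capacity is
hypothetical):

    # Lifetime writes implied by a DWPD (drive writes per day) rating.
    capacity_tb = 0.8   # hypothetical 800 GB drive
    dwpd = 25           # the top-end HPE rating above
    years = 5

    tbw = capacity_tb * dwpd * 365 * years
    print("~%.0f TB written over the rated life" % tbw)   # ~36,500 TB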

~~~
tim333
Also maybe of interest: TechReport's SSD Endurance Experiment. Their assorted
drives lasted about 2,000 - 10,000 whole-disk writes.

[http://techreport.com/review/27909/the-ssd-endurance-experim...](http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead)

------
ComputerGuru
Kind of stupid to end with "Since CPUs aren't getting faster, making storage
faster is a big help."

CPUs and storage exist for completely disjoint purposes, and the fastest CPU
in the world can't make up for a slow disk (or vice versa). Anyway, CPUs are
still "faster" than SSDs, whatever that means, if you wish to somehow compare
apples to oranges. That's why, even with NVMe, if you are dealing with
compressible data, enabling block compression in your FS can speed up your
I/O workflow.
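
On ZFS, for example, that's a single property (the dataset name here is
hypothetical; btrfs has an equivalent 'compress' mount option):

    import subprocess

    # Enable transparent LZ4 compression on a ZFS dataset.
    subprocess.run(["zfs", "set", "compression=lz4", "tank/data"], check=True)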

~~~
TeMPOraL
Ever tried to play a modern computer game? You never have enough RAM for
everything; a lot of content gets dumped onto the hard drive sooner or later
(virtual memory), or is streamed from the drive in the first place. Having
faster access helps tremendously.

From my observation, most personal and business machines are actually
I/O-bound - it often takes just the web browser itself (with webdevs pumping
out sites filled with superfluous bullshit) to fill your RAM completely, and
then you have swapping back and forth.

~~~
Dylan16807
I don't think I've touched a game on PC where you can't fit all the levels
into RAM, let alone just the current level. Sometimes you can't fit the music
and videos into RAM, but you can stream those off the slowest clunker in the
world. A game that preloads assets will do just fine on a bad drive with a
moderate amount of RAM. Loading time might be higher, but the in-game
experience shouldn't be affected.

As far as swapping, you do want a fast swap device, but it has nothing to do
with "Since CPUs aren't getting faster". You're right that it's IO-bound. It's
so IO-bound that you could underclock your CPU to 1/4 speed and not even
notice.

So in short: Games in theory could use a faster drive to better saturate the
CPU, but they're not bigger than RAM so they don't. Swapping is so utterly IO-
bound that no matter what you do you cannot help it saturate the CPU.

The statement "Since CPUs aren't getting faster, making storage faster is a
big help." is not true. A is not a contributing factor to B.

~~~
simoncion
> I don't think I've touched a game on PC where you can't fit all the levels
> into RAM, let alone just the current level.

I _know_, right?! I would _rather_ like it if more game devs could get the
time required to detect that they're running on a machine with 16+ GB of RAM
and (in the background, with low CPU and I/O priority) decode and load all of
the game into RAM, rather than just the selected level plus incidental
data. :)

------
otakucode
Can't wait for the inevitable discovery that NAND chips are being price-fixed.
You would think that after the exact same companies price-fixed RAM and LCD
panels, people's radars would go off faster. You expect me to believe that
drives full of metals and neodymium magnets - and now helium - are cheaper to
manufacture than an array of identical NAND chips? NAND chips which are
present in a significant percentage of all electronic products purchased by
anyone, anywhere? When a component is that widely used, it becomes
commoditized, which means its manufacturing cost drops exponentially, not
linearly as NAND prices have done. This happened with RAM and LCDs as well.
When you can look around and see a technology being utilized in dozens of
products within your eyesight no matter where you are, and those products are
still more expensive than the traditional technology they replace,
price-fixing is afoot.

I am open to being wrong on this, but I don't think I am. Can anyone give a
plausible explanation why 4TB of NAND storage should cost more to manufacture
than a 4TB mechanical hard drive, given the materials, widespread demand for
the component, etc.?

~~~
pjc50
"Apples do not cost the same as oranges, therefore oranges are being price-
fixed" is not a convincing line of reasoning. The two technologies are very
different and NAND storage is much newer, and has _always_ been much more
expensive than disk storage.

The correct thing to compare NAND prices to is other chips that are being
fabbed at the same process node, by die area.

------
vbezhenar
Maybe SSDs should add something like a "raw mode", where the controller just
reports everything it knows about the disk and the operating system takes
control of it, so the firmware won't cause unexpected pauses. After all, the
operating system knows more: which files are unlikely to be touched, which
files change often, and so on.

~~~
wtallis
The industry is moving toward a mid-point of having the flash translation
layer still implemented on the drive so that it can present a normal block
device interface, but exposing enough details that the OS can have a better
idea of whether garbage collection is urgently needed:
[http://anandtech.com/show/9720/ocz-announces-first-sata-host...](http://anandtech.com/show/9720/ocz-announces-first-sata-host-managed-ssd-saber-1000-hms)

Moving the FTL entirely onto the CPU throws compatibility out the window; you
can no longer access the drive from more than one operating system, and UEFI
counts. You'll also need to frequently re-write the FTL to support new flash
interfaces.

~~~
nqzero
OS compatibility is important for laptops/desktops, but not in at least some
database/server applications, and those are the applications that would
benefit most from raw access.

------
transfire
Not so sure. There are plenty of benefits to SSDs, too. I suspect system
designers will just add more RAM to act as a cache to offset some of these
performance issues, not to mention further improving temperature control.

~~~
stefantalpalaru
More RAM means more reserve power is needed to flush it to permanent storage
when the main power is cut.

What's more likely to happen is exposing the low-level storage to kernel
drivers and software.

~~~
scurvy
I think transfire was referring to RAM in the system to act as a pagefile read
cache, not RAM on the SSD to act as a cache there. There's no power risk to an
OS-level read cache.

------
rodionos
>log-structured I/O management built into SSDs is seriously sub-optimal for
databases and apps that use log-structured I/O as well

This assert piqued my interest given that my hands-on experience with HBase
speaks to the contrary. The paper by SanDisk they refer to
[https://www.usenix.org/system/files/conference/inflow14/infl...](https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf)
seems to suggest that most of the issues are related to sub-optimal
degragmentation by the disk driver itself. More specifically, the fact that
some of the defragmentation is unnecessary. Hardly a reason to blame the
databases and can be addressed down the road. After all, GC in Java is still
an evolving subject.

------
ilaksh
Article is garbage. Basically "I told you so" from someone who never got up
to date after the first SSDs came out and found some numbers to cherry-pick
that seemed to support his false beliefs.

------
bravura
I need an external disk for my laptop that I leave plugged in all the time.

What is the most reliable external hard drive type? I thought SSDs were more
reliable than spinning disks, especially when left plugged in constantly, but
now I'm not so sure.

~~~
ScottBurson
I still don't trust SSDs as much as I do spinning disks. While neither kind of
drive should be trusted with the only copy of important data, I would say that
drives used for backup, or for access to large amounts of data that can be
recovered or recreated if lost and where the performance requirements do not
demand an SSD, might as well be HDDs -- they're cheaper, and arguably still
more reliable. If the workload is write-heavy, HDDs are definitely preferred
as they will last much longer.

While all disks can fail, HDDs are less likely to fail completely; usually
they just start to develop bad sectors, so you may still be able to recover
much of their contents. When an SSD goes, it generally goes completely (at
least, so I've read).

So it depends on your needs and how you plan to use the drive. For light use,
it probably doesn't matter much either way. For important data, you need to
keep it backed up in either case. SSDs use less power, particularly when idle,
so if you're running on battery a lot, that would be a consideration as well.

~~~
cnvogel
Anecdotal evidence: a 2.5" SATA HDD failed on me suddenly just last Tuesday.
SMART was fine beforehand, both the attributes and a long (complete surface
scan) self-test I ran a few weeks ago, after I got this lightly used notebook
from a colleague (I only needed it for tests).

I think what people experience is that sudden death doesn't occur more often
on SSDs than on HDDs. But with the mechanical issues and the slow
accumulation of bad sectors gone, sudden death is probably the only visible
failure mode left.

(Just my personal opinion.)

------
exabrial
Does NVMe solve the garbage collection problem?

~~~
pkaye
NVMe is just the communication protocol between the host and the device.
Garbage collection exists to hide the non-ideal properties of the NAND from
the host, primary among them endurance limits and the size difference between
write and erase units. You could, for example, move the handling of some of
these GC details to the OS filesystem level, but then that becomes more
complicated and has to deal with the differences in each NAND technology
generation. You couldn't just copy a filesystem image from one drive to
another, for example.
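
A toy sketch of why that size mismatch forces garbage collection: pages are
the write unit, whole blocks are the erase unit (structure and numbers purely
illustrative):

    PAGES_PER_BLOCK = 4  # real NAND erase blocks hold hundreds of pages

    class ToyFTL:
        def __init__(self, n_blocks):
            self.flash = [[None] * PAGES_PER_BLOCK for _ in range(n_blocks)]
            self.map = {}                  # logical page -> (block, page)
            self.block, self.page = 0, 0   # next free physical page

        def write(self, lpn, data):
            # No in-place overwrite: append to a fresh page; any previous
            # copy of this logical page is left behind as stale garbage.
            self.flash[self.block][self.page] = (lpn, data)
            self.map[lpn] = (self.block, self.page)
            self.page += 1
            if self.page == PAGES_PER_BLOCK:
                self.block, self.page = self.block + 1, 0

        def collect(self, blk):
            # GC: save the block's live pages, erase the whole block (the
            # erase unit), and re-append the live data somewhere fresh.
            # A real FTL copies to another block *before* erasing.
            live = [s for pg, s in enumerate(self.flash[blk])
                    if s is not None and self.map.get(s[0]) == (blk, pg)]
            self.flash[blk] = [None] * PAGES_PER_BLOCK
            for lpn, data in live:
                self.write(lpn, data)

    ftl = ToyFTL(n_blocks=8)
    for i in range(3):
        ftl.write(0, "v%d" % i)   # 3 writes to one logical page use 3 pages
    ftl.collect(0)                # reclaims the two stale copies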

~~~
nly
> You could for example move the handling of some of these GC details to the
> OS filesystem level

Isn't this basically the Linux 'discard' mount flag? Most distros seem to
recommend periodic fstrim for consumer use; what's the best practice in the
data center?

~~~
wtallis
Discard operations only tell the drive that an LBA is eligible for GC. It's
basically a delete operation that explicitly can be deferred. It does not
give the OS any input into when or how GC is done, and it doesn't give the OS
any way to observe any details about the GC process.

I think the recommendations for periodic fstrim of free space are due to
filesystems usually not taking the time to issue a large number of discard
operations when you delete a bunch of data. Even though discards should be
faster than a synchronous erase command, not issuing any command to the drive
is faster still.
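
In practice that tends to mean a scheduled fstrim run instead of the
'discard' mount option - something like this sketch (mount point
hypothetical; requires root):

    import subprocess

    # Batch-discard free space on a mounted filesystem (util-linux fstrim).
    # Run from cron or a systemd timer rather than mounting with 'discard'.
    subprocess.run(["fstrim", "--verbose", "/data"], check=True)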

~~~
pkaye
Until recently, SATA drives didn't have queued TRIM, so if you did an
occasional trim between reads/writes you would have to flush the queue.
Queued TRIM was added later on, but has been slow to be adopted because it
can be difficult to get it working fast, efficiently, and correctly when
intermingled with reads and writes. I know at least one recent drive with
queued TRIM had some bugs in the implementation.

~~~
wtallis
Yeah, SATA/AHCI's limitations and bugs have affected filesystem design and
defaults, to the detriment of other storage technologies. NVMe for example
requires the host to take care of ordering requirements, so basically every
command sent to the drive can be queued.

------
fleitz
Given the number of IOPS SSDs produce, they are a win even if you have to
chuck them every 6 months.

------
acquacow
...unless you are Fusion-io, in which case, most of these problems don't
affect you.

~~~
ddorian43
Why? Isn't Fusion-io based on SSDs?

~~~
jjtheblunt
Fusion-io (I believe, but please verify online) uses a spinning drive for
frequent writes and an SSD for frequent reads, with software deciding what
goes where, lessening the write traffic to the SSD and thus the wear on it.

~~~
acquacow
No, Fusion-io has nothing to do with spinning drives. They make PCIe flash
drives with an FPGA as the "controller" for the flash. There are multiple
communication channels, so you can simultaneously read from and write to the
drives, and there are various tunings available to control how garbage
collection works. They are the only "SSD" maker that doesn't hide all the
flash behind some kind of legacy disk controller or group protocol like NVMe.

------
unixhero
Great thread.

