
Testing disks: Lessons from our odyssey selecting replacement SSDs - yavor-atanasov
http://www.bbc.co.uk/blogs/internet/entries/ce3eff16-228f-49a0-8d4e-d1a013e4895f
======
wtallis
The biggest lesson to take away from this is probably that they _thought_ they
knew how to test an SSD, but were quite obviously clueless:

> _we run a fairly comprehensive set of block-level tests using fio,
> consisting of both sequential and random asynchronous reads and writes
> straight to the disk. Then we throw a few timed runs of the venerable dd
> program at it._

Running dd as a benchmark is a major red flag. It shows that they didn't know
what they were doing with fio, and didn't trust its results. They later
started using IOzone and a custom-written tool to accomplish things they should
have done with fio in their initial testing.

They also did not mention pre-conditioning the drives or ensuring that their
tests run long enough to reach a steady state. This is one of the most
important aspects of enterprise SSD testing and they would have known that if
they'd consulted any outside resources on the subject instead of making up
their own testing guidelines from a position of extreme ignorance about the
fundamentals of the hardware they were using and the details of their own
workload.

They really should stop calling any of their tests "comprehensive".

~~~
fjsolwmv
BBC published a technical (but really PR) article written by amateurs posing
as pros, instead of consulting reputable experts?

~~~
tomcart
To be clear, this isn't a news article written by our journalists - it is a
piece written by the team themselves that we felt may be of interest to
others, and that might help us do things better in the future. While I enjoyed
reading it, I can assure you that SSD performance testing doesn't move the BBC
PR needle compared to the identity of the new Doctor Who.

We're acutely aware that we've still got much to learn in this space, so if
there are thoughts you have on how we could do better we're all ears.

Finally, while I've assured you it wasn't a PR piece, we're always looking for
engineers in this area (and across the whole BBC), so if you'd be interested in
helping us improve, get in touch.

~~~
mrguyorama
Do you accept American Engineers? /s

I found the piece to be wonderful. I don't do large-scale storage work, so I'm
not very knowledgeable in the area, but it's great to see someone else's
struggles other than Amazon's or a backup service's. And it is yet another
indicator that the BBC _cares_ about quality content instead of just pushing
up some stock price.

Thanks for your write-up.

------
nickcw
This is the problem IMHO

> We also looked up whether our HBA used TRIM in its current configuration. It
> turns out, in RAID mode, the HBA did not support TRIM. We did do some trim-
> enabled testing with a different machine, but these results are hard to
> compare fairly. In any case, we can't currently enable TRIM on our
> production systems.

In our experience SSD write performance goes to sh*t if you don't regularly
TRIM them.

Running fstrim once a day is enough to keep them healthy.

RAID cards not passing TRIM is a big problem for us too...

(Experience from day job at Hosting Provider)

~~~
masklinn
> In our experience SSD write performance goes to sh _t if you don 't
> regularly TRIM them.

Interesting, is that because of the load? It seems "modern" SSDs have garbage
collection good enough that TRIM isn't really necessary anymore to ensure good
performance under consumer loads.

> RAID cards not passing TRIM is a big problem for us too...

Are there NVMe RAID cards? I assume they'd necessarily pass the command along,
considering _deallocate_ is just one parameter/option of the DATA SET
MANAGEMENT command, or do RAID cards just drop the entire command?

~~~
takeda
> Interesting, is that because of the load? It seems "modern" SSDs have
> garbage collection good enough that TRIM isn't really necessary anymore to
> ensure good performance under consumer loads.

A drive has no way to tell whether the filesystem is using a given block or
not. TRIM is the way for the filesystem to tell it. So I would imagine the GC
that you're referring to is working on the blocks marked with TRIM.

BTW, besides running fstrim from cron on Linux, you can also mount the
filesystem with the discard flag, so that the filesystem sends a TRIM command
when files are deleted.
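To sketch what those two options look like on a typical Linux box (the mount
point /data and device /dev/sda1 below are placeholders of mine, not anything
from the article):

```shell
# Two common ways to keep TRIM flowing on Linux; usually you pick one, not both.

# 1. Periodic trim: run fstrim from root's crontab (or enable the systemd
#    fstrim.timer unit). Example crontab entry trimming /data nightly at 03:00:
#
#      0 3 * * *  /sbin/fstrim /data

# 2. Continuous trim: mount with the discard option, so the filesystem issues
#    TRIM as files are deleted. Example /etc/fstab line:
#
#      /dev/sda1  /data  ext4  defaults,discard  0  2
```

Periodic fstrim is often preferred over discard, since continuous TRIM can add
latency to deletes on some drives.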

~~~
wtallis
> So I would imagine the GC that you're referring to is working on the blocks
> marked with TRIM.

Not necessarily. Since flash doesn't support in-place modification of data,
any change to a portion of a file (or other FS data structure) that writes
less than a contiguous 16MB (depending on the flash) will create a need for GC
on the drive with or without TRIM. You can put a drive into a state of needing
to do a lot of GC even without changing the quantity of live data.

------
linsomniac
This reminds me of testing I did years ago on ... CD-ROMs. Funny how lessons
from old technology can apply to new technology.

Around 15 years ago my company did a Linux distribution on CDs: KRUD. It was
updated monthly, and we had something like 400 subscribers. For various
reasons we burned these CDs in house on a cluster I built.

We would burn, eject, read and checksum, and if the read test succeeded we
would ship it out. We found some users with some discs had problems reading
them. We contacted these users and paid them to return the CDs and did further
testing on them.

Our initial test used dd, and we found that the discs that were not obviously
damaged in shipping would tend to pass tests on some of our CD-ROM drives, but
fail on others. But when they did succeed, they would tend to take longer than
normal.

I wrote a new test program that, instead of using dd, issued SCSI read
commands directly and timed every one. It would then count the number of reads
that were "slow" (like 2x normal) and those that were "really slow" (like 5x),
and if these got over a certain threshold we would throw away the disc.

Being able to time the raw operations was incredibly useful, and it seems like
it could have shown the authors of this article problems before they deployed
to production.
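As a rough illustration, here's a minimal sketch of that timing approach in
shell, using dd for the per-chunk reads rather than raw SCSI commands. The
chunk size, thresholds, and demo filename are my own placeholder choices, not
the original tool's:

```shell
#!/bin/sh
# Sketch of a per-chunk timed-read test: read the target in fixed-size chunks,
# time each read, and count reads slower than 2x ("slow") and 5x ("really
# slow") the median. A hypothetical stand-in for the SCSI-level original.
set -eu

file="${1:-timed-read-demo.img}"       # device or file to test (placeholder)
[ -e "$file" ] || dd if=/dev/zero of="$file" bs=64k count=16 2>/dev/null

bs=$((64 * 1024))                      # chunk size per read
size=$(wc -c < "$file")
chunks=$(( (size + bs - 1) / bs ))

times=$(mktemp)
i=0
while [ "$i" -lt "$chunks" ]; do
    t0=$(date +%s%N)                   # nanosecond timestamp (GNU date)
    dd if="$file" of=/dev/null bs="$bs" skip="$i" count=1 2>/dev/null
    t1=$(date +%s%N)
    echo $(( t1 - t0 )) >> "$times"
    i=$(( i + 1 ))
done

# Summarize: median read time plus counts of slow and really-slow reads.
summary=$(sort -n "$times" | awk '
    { t[NR] = $1 }
    END {
        median = t[int((NR + 1) / 2)]
        slow = 0; very_slow = 0
        for (i = 1; i <= NR; i++) {
            if      (t[i] > 5 * median) very_slow++
            else if (t[i] > 2 * median) slow++
        }
        printf "chunks=%d median_ns=%d slow=%d very_slow=%d", NR, median, slow, very_slow
    }')
rm -f "$times"
echo "$summary"
```

Against a real disc you'd point it at the drive device and calibrate the
thresholds from known-good media.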

Except, they didn't really seem to do very thorough testing of the drives.
Running a stress test on a 1TB drive for an hour seems pretty short.

Also in my above job we did hosting. We found that if we burned in disks by
reading/writing to them 10 times ("badblocks -svw -p 10"), we would almost
never experience drive failures on the Hitachi drives we were using. If we
didn't do this, the drives would have a fairly high chance of falling out of
the RAID array in production.

As drive sizes increased from 20GB to 200GB to 1TB, these tests started taking
weeks to complete. But, they were totally worth it.

------
HarryHirsch
Flash memory has three operations: read, write, and erase, the last two
destructive. If you pretend SSDs are hard disks with only two operations, read
and write, you go through all sorts of contortions. Sometimes you fall flat on
your face, as seen here.

Why don't operating systems treat SSDs more like flash memory, and why doesn't
the file system cooperate with the underlying hardware instead of pretending
it's a disk? For home use the pretence may even work, but in a demanding
environment the extra complexity will invariably fail.

This is a genuine question, I'm an amateur here.

~~~
wtallis
There is some work on Open-Channel SSDs, which move most of the flash
translation layer (FTL) to the host system. There are two major problems with
this approach:

1. Each OS that wants to use the drive needs a compatible implementation of
the FTL. Consumer systems always have at least two operating systems in play
(UEFI counts for these purposes). Enterprise systems are where you will
actually find non-boot, data-only drives.

2. Flash memory changes. The FTL needs very different parameters depending on
whether you're using Toshiba flash or Samsung flash, and even depending on
whether you're using last year's Toshiba flash or the stuff they're
manufacturing today.

These aren't insurmountable problems, but they're enough to keep such products
confined to a small niche. Instead, we're seeing a trend of SSDs accepting
optional hints that allow them to perform the kinds of optimizations you'd
expect from a fully host-managed SSD. The ATA TRIM command was just the tip of
this iceberg.

~~~
jorangreef
Could you provide more details on these hints? Are they ioctl calls? Assuming
one is using the disk as a raw block device, without a filesystem.

~~~
wtallis
I was referring to extensions to the command set the OS uses to interact with
the drive itself. Some of these are quite like a madvise() call, but at a
lower layer. Others permit the drive to expose a bit more information to the
OS so that it can better optimize its IO patterns. I summarized the most
recently standardized changes at [1], but there are several other features in
the NVMe spec [2] that fall into this category. The extension for IO
determinism has been approved for the next standard but the official spec for
it hasn't been published. (I'm referring here mostly to NVMe stuff, but there
are SCSI/SAS analogs to many of these features.)

[1]
[https://www.anandtech.com/show/11436/nvme-13-specification-p...](https://www.anandtech.com/show/11436/nvme-13-specification-published-new-features)

[2]
[http://www.nvmexpress.org/resources/specifications/](http://www.nvmexpress.org/resources/specifications/)

------
pxlfkr
Plugging SATA drives into a SAS HBA may not be optimal: "SAS/SATA expanders
combined with high loads of ZFS activity have proven conclusively to be highly
toxic"
[http://garrett.damore.org/2010/08/why-sas-sata-is-not-such-g...](http://garrett.damore.org/2010/08/why-sas-sata-is-not-such-great-idea.html)

~~~
equalunique
Interesting. This may explain a strange incident that I once encountered. One
day I came home to my CSE-847 machine with SATA drives hooked to SAS expanders
on an mpt device. The whole system was unresponsive and the drives were all as
hot as a fresh pot of coffee. I immediately shut down the system and let the
drives cool off on the concrete floor. Everything seemed to work later, but it
was quite a scare. It was 12 2TB drives set up as 6 raidz2 mirrors.

------
mjw1007
One lesson here is that when reusing a previous test setup you ought to look
for assumptions you made which are no longer valid.

If they'd been starting from scratch, while thinking about modern SSDs, it's
quite likely they wouldn't have built an application load tester using files
containing only dots.

But as it was an existing system, it didn't get the same amount of attention.

------
barrkel
I built my home system early this year using the Samsung 960 Evo 1TB M.2.
Actual speeds were nowhere near advertised speeds until I enabled the
write-back cache on the drive, which gave me some concern about data
persistence reliability. AFAIK the Samsung drivers (as opposed to the MS
drivers I originally used) just turn this on without needing to be twiddled in
settings.

Just to confirm, I have seen the behaviour described herein, with write-back
caching making an enormous difference with the Samsung EVO products in
particular.

~~~
olavgg
I also have a Samsung 960 Evo. Its performance is what I consider a joke, fio
and pg_test_fsync make it almost look as slow as spinning SAS drives.

For example, on a 4kB sync write test with 16 threads, the 960 Evo cannot do
more than 1,000 IOPS. In comparison, the Intel P4800X (Optane) does a friggin'
500,000 IOPS on the same test. That is a 500x difference.

[https://forums.servethehome.com/index.php?threads/did-some-w...](https://forums.servethehome.com/index.php?threads/did-some-write-benchmarks-of-a-few-ssds.15231/)
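For reference, a fio job file approximating that kind of 16-thread 4kB
sync-write test might look like the sketch below. The filename and size are
placeholders, and the linked thread's exact settings may differ:

```ini
; sync-4k.fio -- hypothetical job approximating a 16-thread 4kB sync write test.
; Run with: fio sync-4k.fio (point filename at a disposable file or device).
[sync4k]
filename=/tmp/fio-sync-test.dat
size=256m
rw=randwrite
bs=4k
ioengine=sync
; issue fdatasync after every write, forcing truly synchronous writes
fdatasync=1
numjobs=16
group_reporting
runtime=60
time_based
```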

~~~
wtallis
The 960 EVO is a consumer grade SSD with firmware tuned for bursts of I/O
(through eg. the use of SLC write caching) at the expense of sustained write
throughput. It doesn't have power loss protection capacitors, so it can't
perform safe write caching when you're issuing synchronous writes. 4kB is much
smaller than the underlying page size of its NAND flash, so performance is
going to suck without write combining. You're testing it in shackles, with a
workload that doesn't at all match its intended use case. That doesn't make it
a joke, it just makes it the wrong kind of drive to use for stereotypical
enterprise applications.

~~~
olavgg
So what is the use case for this drive? The 960 Evo/Pro are supposed to be
premium models, but a better investment would be a cheaper SSD drive with more
storage. And if you rarely write that much, more ram will increase the read
speed significantly.

~~~
basch
a consumer pc.

the pro does not have the buffer the evo does. the evo is not a premium model,
it is entry level cutting edge

~~~
olavgg
That is not very specific, a consumer PC would be fine with a 750 Evo also,
maybe two of them in raid 0 for twice the sequential read & write speed. I
believe for most consumers, having more SSD storage per $ is more important.

~~~
basch
a 750 evo is a sata drive, a 950 is an nvme drive. completely different
technology. if my laptop has an nvme slot why would i buy a sata drive. 2 750s
is not faster than 1 950

------
pcfe
You could give blkreplay a go next time you decide which disks to buy. I find
the additional effort is worth it, but YMMV. Use one of the shipped loads for
a quick test, but you really want to run blktrace against your current setup
and feed that data to blkreplay.

------
have_faith
Great article, easy to follow considering it's far away from my normal domain.

I noticed they didn't mention any brands by name though, why is that?

~~~
tankenmate
The BBC has a very strong product prominence policy[0] (i.e. avoid naming
brands when possible); being government funded is a large driver of this
policy.

[0]
[http://www.bbc.co.uk/editorialguidelines/guidelines/editoria...](http://www.bbc.co.uk/editorialguidelines/guidelines/editorial-integrity/product-prominence)

EDIT: fixed policy name and added link

~~~
anoother
It's a shame this is so selectively applied.

See, for example:

- The constant mention of speaking to people 'over Skype' on the News

- Publicization of Twitter hashtags on Question Time and other programmes

- Hours' worth of Top Gear footage (and the entire Arctic Special) that were
effectively Toyota Hilux advertisements

~~~
gaius
The constant favourable coverage of Google and Apple, always talking about
hipster-friendly Flickr when boring old Photobucket was doing 20x the
volume... they apply their rules very selectively....

~~~
fjsolwmv
Photobucket was for link sharing to other sites. , no? While flicker was a
destination for publishing albums and browsing, with much higher quality
photos.

~~~
gaius
That's imgur you're thinking of

------
noir_lord
God damn that was well written, excellent post!

------
fulafel
Sounds like they are observing transparent data compression in the SSD
controller and FTL. SandForce controllers even made a marketing point of it
back in the day. It manifests as faster IO with repetitive data, along with
reduced flash wear.
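One way to check for this is sketched below as a fio job file (the filename is
a placeholder): fio's `buffer_compress_percentage` option controls how
compressible the written data is, so a large throughput gap between the two
jobs suggests the controller is compressing.

```ini
; compressibility-check.fio -- hypothetical jobs; if the controller compresses,
; the "compressible" job should be noticeably faster than "incompressible".
[global]
filename=/tmp/fio-compress-test.dat
size=256m
rw=write
bs=128k
direct=1
refill_buffers

[compressible]
buffer_compress_percentage=90

[incompressible]
; stonewall: wait for the previous job to finish before starting
stonewall
buffer_compress_percentage=0
```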

------
pricechild
One of my favourite parts of this article is how Elliot Thomas describes
himself as a "Software Engineer".

We may be writing software, but without a working knowledge of hardware it's
not worth much!

~~~
blowski
I have barely any knowledge of hardware, but I have built plenty of pieces of
software that have helped people.

~~~
ape4
Civil engineers don't understand chemistry or quantum mechanics - yet somehow
they build bridges.

~~~
pbhjpbhj
They must know some chemistry, surely: weathering effects on concrete, effects
of potential chemical spills (e.g. on roadways), metal-concrete-surfacing
interactions?

