
Deduplicating Devices Considered Harmful   - sciurus
http://queue.acm.org/detail.cfm?id=1985003
======
jws
Since it's come up, let's talk about SSD failure modes.

Motive: Even absent deduplication, how do I know my three metadata writes
didn't end up in the same ECC block, where they would fail simultaneously?

Open questions:

• In the real world, how common is an SSD unrecoverable read error?

• Are they confined to a single sector? Or a larger grouping?

The Wikipedia article on TRIM claims that SSDs write in 4k pages. It's
reasonable to assume that if I write 4 copies of a 1k block of metadata, the
SSD's block remapping could drop them all into the same page. Is the ECC
computed on sectors or on pages? Will a failure take out all 4 copies? My
Intel SSD 310 specification is mute on the matter.
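
To make the worry concrete, here is a toy sketch (the packing behavior is
purely my assumption; real FTLs are opaque) of a controller that coalesces
sub-page writes:

    # Toy FTL model -- my assumption about how a drive *might* behave;
    # real controllers are opaque. Sub-page writes get packed into 4 KiB
    # flash pages in arrival order.
    PAGE_SIZE = 4096

    def pack_writes(writes):
        """Map (lba, buffer) writes to (flash_page, offset) placements."""
        placements = {}
        page, offset = 0, 0
        for lba, buf in writes:
            if offset + len(buf) > PAGE_SIZE:   # page full; start the next one
                page, offset = page + 1, 0
            placements[lba] = (page, offset)
            offset += len(buf)
        return placements

    # four copies of a 1 KiB metadata block at LBAs the filesystem
    # believes are far apart
    copies = [(lba, b"M" * 1024) for lba in (8, 1000008, 2000008, 3000008)]
    print(pack_writes(copies))
    # {8: (0, 0), 1000008: (0, 1024), ...} -- all four share flash page 0;
    # if ECC is computed per page, one uncorrectable error kills every copy.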

Intel specifies a mean time between failure of 137 years for the drive and a
"useful life" of 5 years. From that, I think we can assume Intel's reliability
numbers are fiction meant to serve some other purpose than to inform, but
since they are all we have we'll give it a go…

Intel says I can look forward to one unrecoverable read error at 1
sector/10^16 bits read. Which, if I read at maximum speed, works out to about
14 months. (But does reading fast really make it fail sooner? If they come
from botched writes or from degradation over time, then specifying the rate
in "bits read" is silly.)
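
Back-of-envelope version of that arithmetic; only the 1-per-10^16-bits figure
comes from the spec, the sustained read rates are assumptions:

    # Expected interval between unrecoverable read errors when reading
    # continuously, for Intel's quoted 1-sector-per-10^16-bits-read rate.
    # The sustained read rates below are assumptions, not spec figures.
    def days_between_ures(read_rate_mb_s):
        bits_per_day = read_rate_mb_s * 1e6 * 8 * 86400
        return 1e16 / bits_per_day

    for rate in (35, 100, 200):  # MB/s
        print(rate, "MB/s ->", round(days_between_ures(rate)), "days")
    # the interval scales inversely with the assumed rate: ~35 MB/s gives
    # roughly 14 months, while ~200 MB/s would give about 72 days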

Conclusion:

• There might be a danger.

• The spec sheets do not illuminate.

• Someone on HN may have had an SSD develop an unrecoverable read error and
know what happens.

~~~
Symmetry
"Intel specifies a mean time between failure of 137 years for the drive and a
"useful life" of 5 years."

Remember, almost all products have some sort of bathtub curve[1], and the MTBF
is the inverse of the observed failure rate at the bottom of the curve, while
the useful life (or whatever the manufacturer calls it) takes the shape of the
curve into account. Many products, including magnetic hard drives, have
similar discrepancies between their MTBF and their useful lifespans.

[1]<http://en.wikipedia.org/wiki/Bathtub_curve>
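
A quick worked number to make the distinction concrete:

    # MTBF quotes the (roughly constant) failure rate at the bottom of the
    # bathtub, so it converts to an annualized failure rate, not a lifespan.
    mtbf_years = 137
    print(f"{1 / mtbf_years:.2%} of drives expected to fail per year")
    # ~0.73%/year: a 137-year MTBF and a 5-year useful life answer two
    # different questions, so they can coexist without contradiction.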

~~~
jws
A combined "infant mortality" and "wear out" curve making a bathtub curve
seems plausible for SSDs. A "useful life" of 5 years implies that the
wear-out curve has risen significantly by 5 years.

I suppose if there are a few units that will make it into the thousands of
years, they can balance it out. Perhaps they take 10% of the production, pack
them in inert gas shielded inside a meter of lead, and bury them deep in salt
mines to keep the MTBF up.

------
tlb
Sometimes papers considering something harmful are harmful. De-duping is an
obvious optimization that preserves the normal semantics of disks. ZFS and FFS
assume particular properties of rotating media which aren't true of flash or
other formats. The answer is not to make flash behave more like rotating
disks.

In this case, they assume that blocks written to very different sector numbers
fail independently. FFS writes its superblock every million sectors or so
across the whole disk, assuming they will end up on different radial positions
or platters so that a head crash won't get both. That hasn't been a valid
assumption in a long time.

~~~
iwwr
So the solution is not to rely on a single disk for filesystem redundancy
(when was _that_ ever considered a good idea, ZFS or not?). Use RAID in a
typical setup.

------
jws
Given there is no way to know what your hardware manufacturer puts behind the
SATA interface of your SSD, ZFS will need to just not assume identical blocks
are stored independently.

• a random salt is not required in the metadata; a sequential serial number
would do, and is easier (you only need to know the last one, not all previous
numbers).

• with the right checksum algorithm you can remove the previous serial and
update to the next serial with only minor computation on the checksum (see
the sketch after this list).

• you do not need to copy the metadata in RAM if you can tolerate doing the
writes sequentially. (This will be a speed loss, since the drive can't get
all of them at once and schedule them optimally. It might be a big loss if the
drive is also receiving a lot of other write requests at the same time.)
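
To make the checksum bullet concrete, a minimal sketch (my own construction,
not ZFS's actual fletcher4 code):

    # Minimal sketch of an incrementally updatable Fletcher-style checksum
    # (my construction, not ZFS's actual fletcher4). a = plain sum,
    # b = position-weighted sum; both patch in O(1) when one word changes,
    # e.g. swapping serial n for serial n+1.
    MOD = 2**64

    def fletcher(words):
        a = b = 0
        for w in words:
            a = (a + w) % MOD
            b = (b + a) % MOD   # word i ends up counted (n - i) times in b
        return a, b

    def bump_word(a, b, n, i, old, new):
        """Patch (a, b) after words[i] changes from old to new; n = len(words)."""
        delta = (new - old) % MOD
        return (a + delta) % MOD, (b + delta * (n - i)) % MOD

    block = [7, 42, 1000]          # pretend the last word is the serial number
    a, b = fletcher(block)
    a2, b2 = bump_word(a, b, len(block), 2, old=1000, new=1001)
    block[2] = 1001
    assert (a2, b2) == fletcher(block)   # the O(1) patch matches recomputation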

------
ChuckMcM
Cool, I'll toot my own horn a bit here and say that when I was at Network
Appliance we designed a deduplication device which did block-level de-
duplication (patent '603 :-)), and one of the keys of that system was
splitting the metadata and the content data for just that reason. (De-
duplicating metadata made the file system more fragile.)

There was a conference where Western Digital (a drive maker) demonstrated
Linux running on the processor of the drive controller of one of their
drives. NetApp (and Google) used to bemoan the fact that there were 2 million
lines of source code between the SATA connector and the spinning rust that
held the data, written by people who worked for a hardware company. You
really don't know what is going on behind your back in your disk 'drive'.

From a historical perspective the old DSSI drives that DEC used to sell had a
mode where you could 'log in' to the disk and get a shell prompt, then run
various drive diagnostic commands right from the drive. That was why they
called them "Intelligent Storage Elements" instead of disk drives.

To Dave's point (he's the author of the article), the only reliable flash-
based archive device would apparently be one which uses a data integrity
algorithm across multiple devices. Probably a product idea in there
somewhere.

------
jrockway
If you care about your data so much that you need three copies of it, just get
three disks. I do this for my home directory; three 1TB disks in RAID-1. If I
can afford it for my pr0n collection, you can afford it for your critical
production database.

In the meantime, consumer SSDs are optimized for booting Windows again and
again. That's all. (apt-get upgrade is also pretty fast.)

------
gojomo
Next up, the firmware will use rolling-window shingleprints to find non-
aligned duplication, and a simple salt/serial-number won't be enough to ensure
redundant storage. Then you'll have to encrypt your data to hide it from your
own SSD's optimizations.
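
For the curious, the fingerprinting trick would look roughly like this
Rabin-Karp-style sketch (parameters are illustrative assumptions, not
anything a real controller is known to use):

    # Rabin-Karp-style rolling hash: fingerprint every WINDOW-byte window
    # in O(1) per byte, so duplicated data is found at any byte alignment.
    BASE, MOD, WINDOW = 257, (1 << 61) - 1, 64

    def window_fingerprints(data):
        top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte sliding out
        h, seen = 0, {}
        for i, byte in enumerate(data):
            if i >= WINDOW:                # slide: drop data[i-WINDOW]
                h = (h - data[i - WINDOW] * top) % MOD
            h = (h * BASE + byte) % MOD    # shift in the new byte
            if i >= WINDOW - 1:
                seen.setdefault(h, []).append(i - WINDOW + 1)
        return seen

    # two identical runs at different, unaligned offsets collide even
    # though no two block-aligned blocks match
    buf = b"\x00" * 13 + b"A" * 100 + b"\x00" * 29 + b"A" * 100
    dupes = {h: starts for h, starts in window_fingerprints(buf).items()
             if len(starts) > 1}
    print(len(dupes), "fingerprints seen at more than one offset")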

------
panic
Why can't we do all this NAND controller magic in our filesystems? Two layers
of abstraction trying to cleverly allocate the same resource seems like a
recipe for pain in general (cf. TCP over TCP).

~~~
limmeau
And while we're at it, who took our cylinders, heads and sectors (and tables
of bad blocks) and gave us stupid LBA addresses?

~~~
wmf
Zoned recording, and more recently adaptive formatting.

------
stcredzero
There's a severe terminology problem that keeps coming up every time there's
a headline about SSDs. Physical/logical terminology is now overloaded. There
are physical/logical blocks from the POV of the OS. However, those "physical"
block addresses are actually "logical" blocks from the POV of the SSD device.

(Naturally, pedantic commenters will automatically assume the POV which makes
you out to be clueless.)

Is there an augmented terminology for dealing with this?

------
qjz
Was deduplication ever used to conserve system memory (RAM)? If it's
considered a bad idea there, I don't see how it's a good idea on SSDs, since
the ideal is for storage and memory to eventually converge into one entity (at
least as far as performance and physical infrastructure are concerned).

~~~
JoshTriplett
Yes, Linux has code to do deduplication in RAM; it exists primarily for the
benefit of virtual machine host systems, so they can deduplicate common pages
across virtual machines running the same software. (If you run a dozen VMs
from the same base image, you'll end up with a dozen copies of pages from
libc.so.6.)
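
Concretely, that's KSM (kernel samepage merging); a process opts its memory
in via madvise. A minimal sketch, assuming Linux with CONFIG_KSM, ksmd
switched on, and Python 3.8+ for mmap.madvise:

    # Opting an anonymous mapping into Linux's KSM (kernel samepage
    # merging). Assumes Linux with CONFIG_KSM, ksmd enabled
    # (echo 1 > /sys/kernel/mm/ksm/run), and Python 3.8+ for madvise.
    import mmap

    region = mmap.mmap(-1, 16 * 4096)       # anonymous, page-aligned memory
    region.write(b"\x2a" * len(region))     # populate with identical pages
    region.madvise(mmap.MADV_MERGEABLE)     # let ksmd scan and merge them
    # /sys/kernel/mm/ksm/pages_sharing rises as duplicates merge; QEMU
    # issues the same madvise on guest RAM so identical VMs share pages.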

------
pinko
Interestingly, my bias has always been that hardware de-dup (if available)
would be preferable to software de-dup like ZFS's own internal
implementation, for performance reasons. Not so, apparently.

~~~
bdonlan
The problem is that the OS isn't aware of it, so it can't take advantage of
the space savings; and moreover _because_ the OS isn't aware of it, the OS
might make poor assumptions about data integrity.

If there were a way for the OS to a) disable dedup when integrity is desired
and b) exploit the space saved by using dedup, then it would be a good idea.

------
wccrawford
What is this 'considered harmful' meme lately? The way it's used, it applies
to everything and thus fails to mean anything. It's now one of my watch
phrases for FUD.

~~~
Turing_Machine
"Lately"? :-)

<http://www.catb.org/jargon/html/C/considered-harmful.html>

~~~
michael_dorfman
Plus, the CACM was the first to use it, so I'd say that they are entitled.

------
JoeAltmaier
Hm. Enterprise file systems use RAID, which recovers from this failure. So
you lose one redundant level of safety. Not as scary as it's made out to be.

------
andrewcooke
does anyone know if this _affects_ sandforce drives? the article only says
(imho) that it is possible - whether it actually happens depends on details of
the de-dup. for example, there is probably a minimum size for the duplicated
data.

[edit: and i believe sandforce also compresses data, so it seems likely that
it's some chunk of data, compressed, that must match some other chunk]

------
derleth
Thus we see the gross and bizarre interaction of "Leaky Abstractions" and
"Optimizations (read: Lying)".

Another wonderful example of this is memory overcommitting and the OOM-killer
on Linux, or the entire drama of disk caches vs power failures in just about
every modern desktop and server OS, or how compiler optimizations interact
with both undefined behavior and each other (especially when writing
multithreaded code). My point is that this is neither new nor all that
surprising.

The general solution is to provide a switch to turn off the worst of the
lying. I wonder how long it will take to standardize such a switch for this
behavior.

------
sc68cal
_At least one flash controller, the SandForce SF-1200, was by default doing
block-level deduplication of data written to it. ... Based on discussions with
Kirk McKusick and the ZFS team, the following is a detailed explanation of why
this is a problem for ZFS. For critical metadata (and optionally for user
data) ZFS stores up to three copies of each block._

rut-roh.

~~~
jfoutz
Is that bad? It seems like one or two dedupes more than cover the cost of
redundant metadata.

One great example is a document sent as an email attachment to employees.
That crap takes up tons of spool space.

Just one win like that more than makes up for triple storage of metadata.

~~~
sc68cal
(Why was I voted down!?!??!)

No - what they're talking about is that the controller LIES to ZFS about
which blocks are being stored. That throws a huge wrench into ZFS's COW and
safety features. If ZFS writes two extra copies of a block, and the SSD
deduplicates those two extra copies without bothering to tell ZFS, then
you've got a serious problem.

Imagine if the SSD deduplicated all the ditto blocks?

(Ditto Blocks)
<http://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape>

~~~
jrockway
So? All disks lie to the controller. Have you ever had a spinning disk with a
bad sector? When the disk controller notices this, it remaps the sector to
some spare space elsewhere on the drive. It doesn't ask the OS if this is OK,
it Just Fucking Does It.

Abstraction means that things lie to you. It also means that spinning disks
can last longer than a month.

~~~
Symmetry
And HDDs have had RAM caches for a while now, further complicating things.

The SATA interface defines an abstraction over the storage system that the
hard drive has to live up to. If the OS makes assumptions that aren't part of
that contract and the drive doesn't honor them, that's the OS's fault.
Separation of concerns is a vital engineering practice, and if some software
violates abstraction barriers and things break, that's the software's own
fault.

