
How Controllers Maximize SSD Life (2012) - rajnathani
https://thessdguy.com/how-controllers-maximize-ssd-life/
======
wtallis
Given the age of these posts, many of the numbers used are outdated, but
most of the general ideas are still relevant. There are a few other concepts
that would be worth mentioning these days even in this kind of high-level
overview: Error handling and recovery has gotten really complex; drives will
adjust how they perform read and program operations as the flash ages. Some
drives do proactive scanning for data degradation to catch errors before they
become uncorrectable. There are new ways for the host OS to provide hints
about data lifetime and preferred data placement, which can be very helpful in
avoiding unnecessary write amplification. In the absence of such hints, some
drives have heuristics that try to infer data lifetime information based on IO
patterns.
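
One concrete example of such host-provided hints on Linux is the per-file write-lifetime hint API (`fcntl` with `F_SET_RW_HINT`, kernel 4.13+). A minimal sketch, with the constants hard-coded because Python's `fcntl` module doesn't export them everywhere; whether the hint reaches the drive depends on the kernel, filesystem, and SSD:

```python
import fcntl
import struct
import tempfile

# Linux write-lifetime hints (kernel 4.13+); values taken from
# include/uapi/linux/fcntl.h since the fcntl module may not export them.
F_SET_RW_HINT = 1036
RWH_WRITE_LIFE_SHORT = 2    # data likely to be overwritten soon
RWH_WRITE_LIFE_EXTREME = 5  # data expected to stay put a long time

def set_write_hint(fd, hint):
    """Advise the kernel (and, potentially, the SSD) how long data
    written through this fd is expected to live. Returns False if
    the kernel or filesystem doesn't support write hints."""
    try:
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("=Q", hint))
        return True
    except OSError:
        return False

# Example: mark a log file's data as short-lived so the drive can
# group it away from cold data, reducing write amplification.
with tempfile.NamedTemporaryFile() as f:
    supported = set_write_hint(f.fileno(), RWH_WRITE_LIFE_SHORT)
```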

~~~
Heronymus_Anon
I don't have enough knowledge in any of the areas involved, but my intuition
makes me doubt the long-term security of these complexities, although I really
appreciate the results. Or is there no need to trust the firmware if all the
I/O is encrypted? I am fantasizing that all these data patterns, combined with
other information, could lead to the data-storage/-encryption equivalent of
branch-prediction exploits.. ..but maybe it's just half-knowledge combined
with fear of what you don't understand.. ..and of course this wouldn't be the
first layer of attack.

------
sprash
Could you make money by producing a cheap "dumb" flash memory where you have
raw access and leave all the wear leveling to software by using one of the
many available log structured file systems?

Some people would rather have full transparent overview and control over the
state and health of their storage and make their own decisions about handling
problems in software instead of talking to a black box.
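
To make the idea concrete, here's a toy sketch of doing the wear leveling in software (all names are illustrative; a real log-structured filesystem or FTL is vastly more involved): a log-structured store over raw erase blocks that always appends to the least-worn block with free pages.

```python
# Toy model of "dumb" raw flash plus software wear leveling.
# Purely illustrative -- not a real FTL or filesystem.

class RawFlash:
    """Raw NAND-like device: pages are append-only within a block,
    and a block must be erased (wearing it) before reuse."""
    def __init__(self, num_blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        self.blocks = [[] for _ in range(num_blocks)]
        self.erase_counts = [0] * num_blocks

    def erase(self, b):
        self.blocks[b] = []
        self.erase_counts[b] += 1

    def program(self, b, data):
        assert len(self.blocks[b]) < self.pages_per_block, "block full"
        self.blocks[b].append(data)

class LogStore:
    """Log-structured store: every write appends to the least-erased
    block that still has free pages; the index maps each key to its
    newest copy, leaving stale pages for a (not shown) GC pass."""
    def __init__(self, flash):
        self.flash = flash
        self.index = {}  # key -> (block, page)

    def put(self, key, data):
        free = [b for b in range(len(self.flash.blocks))
                if len(self.flash.blocks[b]) < self.flash.pages_per_block]
        if not free:
            raise RuntimeError("device full: garbage collection needed")
        b = min(free, key=lambda b: self.flash.erase_counts[b])
        self.flash.program(b, data)
        self.index[key] = (b, len(self.flash.blocks[b]) - 1)

    def get(self, key):
        b, p = self.index[key]
        return self.flash.blocks[b][p]
```

The host-side view this gives you is exactly the transparency argued for here: erase counts and block state are plain data structures you can inspect, rather than numbers hidden behind a controller.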

~~~
duskwuff
Unlikely.

The cost savings probably wouldn't be as dramatic as you hope. You'd still
need some sort of controller for serdes -- NAND flash chips tend to use big
parallel busses which aren't suited for off-board connectors -- and, at that
point, it's not that much of a jump to make that controller handle wear
leveling and error correction.

(Having the storage device handle those tasks is pretty convenient, anyways.
It means you don't have to perform error correction on the main CPU -- which
is a nontrivial ask -- and it means the SSD can behave as a bootable device.)

~~~
londons_explore
Wear levelling and error correction are pretty different problems.

Error correction is well suited to hardware, whereas wear levelling in my
opinion is not.

My ideal design would have basic hardware error detection and correction
(hopefully configurable for how many data bits vs. ECC bits), and maybe with a
few hierarchical levels. Wear levelling and management of data layouts would
all be done in software.

You could imagine a "bulk read" API which is given a block number to start at,
with a bunch of ECC parameters, which would DMA a large block of error-
corrected data to RAM, together with information about the error rate in each
word/block read.
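
A toy version of that API (purely illustrative: Hamming(7,4) stands in for the much larger BCH/LDPC codewords real NAND uses), returning corrected data together with a per-word count of corrected bits:

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits, correcting
# any single-bit error per codeword. Bit i of the integer codeword
# is position i+1 in the classic 1-based Hamming layout.

def hamming74_encode(nibble):
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]   # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(word):
    bits = [(word >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)  # 1-based flipped position, 0 if clean
    corrected = 0
    if syndrome:
        bits[syndrome - 1] ^= 1
        corrected = 1
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, corrected

def bulk_read(codewords):
    """The proposed API in miniature: hand back corrected words plus
    per-word error statistics for the host to act on."""
    data, errs = [], []
    for w in codewords:
        n, c = hamming74_decode(w)
        data.append(n)
        errs.append(c)
    return data, errs
```

The error-rate feedback is the interesting part: software could use rising per-block correction counts to decide when to migrate or refresh data, exactly the kind of policy decision the parent wants out of the black box.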

~~~
st_goliath
> You could imagine a "bulk read" API which is given a block number to start
> at, with a bunch of ECC parameters, which would DMA a large block of error-
> corrected data to RAM, together with information about the error rate in
> each word/block read.

There are a lot of existing SoCs that have built in flash controller hardware
that can interface with an external flash chip and, from a driver perspective,
are controlled pretty much the way you describe it.

------
mar77i
There was this long Ars Technica article from the same year:
[https://arstechnica.com/information-technology/2012/06/inside-the-ssd-revolution-how-solid-state-disks-really-work/](https://arstechnica.com/information-technology/2012/06/inside-the-ssd-revolution-how-solid-state-disks-really-work/)

I assume the complexity of what goes into these drives hasn't been
significantly reduced since then, or has it?

~~~
metalliqaz
SSD controllers have grown much more complex since 2012, especially with the
deployment of QLC flash cells.

------
fomine3
Fun story for me: SSDs' AES encryption feature is sometimes implemented for
the purpose of data randomization instead of an LFSR (and of course also for
encryption).

[https://www.researchgate.net/publication/274369001_Data_secu...](https://www.researchgate.net/publication/274369001_Data_security_concurrent_with_homogeneous_by_AES_algorithm_in_SSD_controller)

------
ara24
Given the rate at which we are progressing on SSD endurance, these methods
will become irrelevant. Even 5 years back, the old Tech Report article showed
that, for an average user, drives could handle a lot of writes.

[https://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead/](https://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead/)

~~~
tobz1000
The winner of that test, the 840 Pro, uses 2-bit (MLC) cells.

Nowadays, 3-bit (TLC) and 4-bit (QLC) are much more common (not sure QLC even
existed back then). So whilst each of these technologies has matured, the
drive a consumer may consider today is more likely to be bigger, replacing
their HDD entirely, and will probably be capable of fewer writes than an MLC
drive from 5 years ago, iiuc.

~~~
fomine3
The MLC->TLC transition decreases endurance, but the 2D->3D transition
increases it substantially. Current 3D TLC drives handle more TBW than old
consumer 2D MLC drives, which is why the industry is moving on to QLC chips.
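
The back-of-the-envelope arithmetic behind such comparisons is roughly: rated endurance ≈ capacity × P/E cycles ÷ write amplification. All figures below are invented for illustration (and note that part of any TBW gain also comes simply from today's drives being bigger):

```python
# Rough endurance estimate: TBW ~= capacity * P/E cycles / WAF.
# Every number here is illustrative, not a vendor spec.

def tbw_estimate(capacity_gb, pe_cycles, waf):
    """Terabytes of host writes before the flash is worn out."""
    return capacity_gb * pe_cycles / waf / 1000

old_2d_mlc = tbw_estimate(capacity_gb=256, pe_cycles=3000, waf=2.0)   # 384 TBW
new_3d_tlc = tbw_estimate(capacity_gb=1000, pe_cycles=1500, waf=2.0)  # 750 TBW
```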

------
mehrdadn
A bit off-topic, but speaking of SSDs, how do you compare them when shopping
around? I see different brands and models advertising similar features at
different price points, and I know enough to look at the throughput, IOPS, but
beyond that it's hard to tell how to evaluate differences. e.g. should
consumers be wary of QLC? Do the models and brands matter a lot? Can you
somewhat trust listed lifetimes? Are there other specs worth paying attention
to? How should they be evaluated? Anyone have thoughts/links on comparing
modern SSDs?

~~~
wtallis
The advertised performance specifications for consumer SSDs are mostly
useless. They pick the metrics that produce the biggest numbers, without
regard to whether those numbers have any relevance to real-world use. Those
numbers are mostly useful for determining what class of controller and flash
memory are used, in case the spec sheet doesn't list that information.
Likewise for warranty period and write endurance ratings; those are more about
signalling product segmentation than about actual expected lifetime.

For consumer SSDs, there's no need to worry about write endurance or QLC NAND
unless you know you have a very atypical usage pattern and have actually
measured your workload by e.g. tracking SMART indicators on your current
drive(s) for several days.

Brand matters very little unless you really care about the experience of
getting a warranty replacement in the rare event that your drive fails before
the warranty expires. I'm not aware of any solid information indicating that
certain SSD brands have consistently lower premature failure rates. My
anecdotal experience from reviewing SSDs for several years is that all the
top-tier brands have at some point sent me review samples that were either DOA
or died during testing that shouldn't have killed a drive.

Outside the top tier brands belonging to the NAND flash memory manufacturers,
everyone is buying NAND on the open market from the same 2-3 manufacturers and
buying SSD controller solutions from the same 2-5 vendors. There are literally
dozens of retail models all using the same combination of Phison E12
controller and Toshiba/Kioxia 3D NAND, and the differences between these are
almost entirely cosmetic. All the PCIe 4.0 consumer SSDs that have been
released so far are functionally identical.

~~~
mehrdadn
This is great info, thank you! Regarding SMART indicators, do you know if
they're generally reliable? I seem to recall there have been drives (at least
HDD, not sure about SSD) that didn't report correct numbers. In fact, I just
checked and my current SSD doesn't report SMART data at all. Have they gotten
better over the years?

~~~
wtallis
I don't make a habit out of doing sanity checks on the SMART reporting of
drives I test. My tests log that data, but I haven't written anything to parse
that into useful information. I can't recall noticing obviously wrong or
entirely missing SMART data in any drive I've recently tested.

I'm curious what your SSD is and just how old it is, if it isn't giving you
any SMART information. (Maybe you need to run `smartctl -s on` before it'll
show you the stats it has probably been tracking all along?)

~~~
mehrdadn
Ah I see. It's a PM981, from late 2017. Interestingly, the GSmartControl GUI
says it doesn't support SMART. But now that I try `smartctl -a`, I see a bit
of
information, though not much. The only things regarding failures seem to be
"Media and Data Integrity Errors" and "Error Information Log Entries" (which
I'm not really sure how to interpret in terms of what's too high and what's
too low). Other stuff is just temperature and other statistics. Not sure if
I'm supposed to be seeing more, but I feel like hard disks used to report more
than this!

~~~
wtallis
NVMe SSDs report a somewhat different set of indicators than SATA drives.

If you want to estimate your own personal write endurance requirements, then
you want to keep track of the line that looks like:

    
    
        Data Units Written:                 67,509,873 [34.5 TB]
    

Figure out how much that goes up in a typical day/week/month, and you'll know
roughly how much endurance you need.
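
For reference, the NVMe spec defines one data unit as 1000 512-byte sectors (512,000 bytes), so the conversion and a rate estimate look like this (the two-snapshot numbers in `daily_writes_tb` are hypothetical):

```python
# NVMe "Data Units Written" counts units of 1000 * 512 bytes.
DATA_UNIT_BYTES = 512_000

def data_units_to_tb(units):
    """Convert the SMART counter to decimal terabytes, as smartctl does."""
    return units * DATA_UNIT_BYTES / 1e12

def daily_writes_tb(units_then, units_now, days):
    """Estimate TB written per day from two SMART snapshots."""
    return data_units_to_tb(units_now - units_then) / days

tb_written = data_units_to_tb(67_509_873)  # the quoted drive: ~34.57 TB
```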

When your SSD is old and you want to monitor it for signs of impending
failure, keep an eye on the lines:

    
    
        Available Spare:                    100%
        Available Spare Threshold:          10%
        [...]
        Media and Data Integrity Errors:    0
    

(Quoted stats taken from the 512GB PM981 I borrowed to test. It was barely
used at the time.)

~~~
mehrdadn
Ah I see, thank you, that makes sense. Sadly it seems this model doesn't
publish an endurance specification at all, so it's hard to tell how close I am
to its end of life. Interestingly mine actually already has some errors:

    
    
      Available Spare:                    100%
      Available Spare Threshold:          10%
      Media and Data Integrity Errors:    10
      Error Information Log Entries:      735
    

but I guess that's probably normal? At least unless it fails
suddenly/catastrophically.

