
Random Write Considered Harmful in SSDs (2012) [pdf] - mpweiher
https://www.usenix.org/legacy/event/fast12/tech/full_papers/Min.pdf
======
smartbit
> _SFS drastically reduces the block erase count inside the SSD by up to 7.5
> times._

Did Apple implement algorithms with similar reductions in APFS?

~~~
smartbit
> _What APFS does, however, is simply write in patterns known to be more
> easily handled by NAND. It's a file system with flash-aware
> characteristics_

from https://arstechnica.com/gadgets/2016/06/a-zfs-developers-analysis-of-the-good-and-bad-in-apples-new-apfs-file-system/2/#h3

------
drzaiusapelord
> However, there remain two serious problems limiting wider deployment of
> SSDs: limited lifespan and relatively poor random write performance.

This is a 5-year-old paper, which is an eternity in SSD engineering. Does
anyone know if SSD firmware today handles this better? Last-gen consumer
SSDs were hitting nearly 100K IOPS on random writes, which is 3-4x better
than a spinning disk. On random reads you're looking at at least a 50-100x
difference.

As far as limited lifespan goes, a fairly popular study from 2014 showed
that even a consumer Samsung SSD could take 2 petabytes of writes without
failing.

A 2016 study by Google showed that drive failure correlates with age, not
with the amount of data written. So these drives are aging out rather than
exhausting their writable NAND. That's on top of a lower replacement rate
than spinning disks.

http://www.techradar.com/news/computing-components/storage/think-your-ssd-will-last-forever-google-has-some-answers-on-that-1316031

Not sure what the current generation is doing, but I imagine it's even
better. There seems to have been a lot of work lately on wear leveling and
extending endurance, and in our own data center we're seeing a lower
failure rate for SSDs than we did with spinning disks. Not sure of all the
black magic involved here, but from both a desktop and a server
perspective we never experienced the SSD doomsday scenarios often listed
in the early 2010s when SSDs began to dominate. Curious to know if modern
SSDs have worked around the worst issues with flash-based media. It seems
they have.

~~~
baruch
I've worked on SSD-based systems, and while we over-engineered our system
due to such fears, we never saw any device get anywhere close to its
limits. Only once did I see an SSD have a media problem (i.e., a failure
to read previously written data; after the data was rewritten from RAID,
no more issues came from that SSD).

This was on enterprise SSDs, but I generally believe the fear of SSD media
is somewhat overblown. You do need to think about the wear you will put on
the SSD and engineer things so that you stay within bounds, but SSD makers
essentially assume and spec their firmware for these random writes.

In fact, unless an application can know in advance which data will have
the same lifetime (and is therefore better written together) and which
data should be placed separately, I doubt there is much it can do to make
things significantly better.

When you do know the lifetimes, you can use new features like the stream
support going into newer SSDs to help both sides work better. You get
longer endurance as well as better performance (fewer random erase cycles
slowing your reads and writes down).
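
On Linux, this is exposed to applications as per-file write-lifetime hints
(fcntl F_SET_RW_HINT, kernel 4.13+). A minimal sketch; the numeric
constants are copied from the kernel's UAPI fcntl.h, since Python's fcntl
module doesn't necessarily export them:

    import fcntl, os, struct

    # Linux >= 4.13 per-fd write-lifetime hint; constants are from
    # include/uapi/linux/fcntl.h.
    F_SET_RW_HINT = 1036
    RWH_WRITE_LIFE_SHORT = 2   # "this data will be overwritten soon"

    fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT, 0o644)
    # The hint lets the FTL group short-lived writes together (e.g. via
    # NVMe streams), so whole erase blocks tend to die at the same time.
    fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", RWH_WRITE_LIFE_SHORT))
    os.write(fd, b"short-lived journal entry\n")
    os.close(fd)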

~~~
baruch
An issue with applying their work to current SSDs is that the write groups
they use are about 16MB to 32MB; current SSDs need far larger write sizes
to get the optimization they were after (essentially, they tried to fill
an entire erase group at once to gain control over locality). This will no
longer work at the level it did before.

Their estimate of improved endurance is based on a disk simulator with an
FTL, not on measuring the actual wear-level change before and after their
test, which to me is fairly suspect. They are not testing against a real
FTL and the full real-world behavior of an SSD; instead they "invented" a
test which is very limited and doesn't say much. Admittedly, testing the
way I suggest would take far longer, but it is the only thing that really
tells you whether you are improving things in real life.

SSDs are also affected by the workload, and especially by any idle time
you give them. If you let an SSD sit with no queued commands, it will
flush buffers and do maintenance work that improves its subsequent
behavior, which also affects the results. To get true real-world behavior
you need to factor that in as well, which further increases test time.

------
caf
I believe this research fed into the design of the F2FS filesystem (also
from Samsung), which is in the upstream Linux kernel.

~~~
koverstreet
No, this issue isn't something the filesystem can work around.

The issue is that random writes leave you with all of your erase units
partially overwritten and none of them fully overwritten. If the writes
really are random, then no data layout will avoid this: it is a
fundamental issue whenever the write size is smaller than the erase unit
size.
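
A toy illustration (the geometry is made up): overwrite half of a drive's
pages uniformly at random, and essentially no erase block ever ends up
fully dead:

    import random

    # 4096 erase blocks of 512 pages each; overwrite 50% of all pages,
    # chosen uniformly at random.
    BLOCKS, PAGES = 4096, 512
    dead = [[False] * PAGES for _ in range(BLOCKS)]
    for _ in range(BLOCKS * PAGES // 2):
        dead[random.randrange(BLOCKS)][random.randrange(PAGES)] = True
    # ~0 blocks are fully dead: every erase must first copy live data out.
    print(sum(all(blk) for blk in dead))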

~~~
Dylan16807
That problem is actually simple to solve. You _append_ new versions until
you hit some threshold, at which point you compact/GC. You don't have a
1:1 mapping. Getting the best performance requires careful tuning, but
reducing the erase count by a large factor is quite straightforward.
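
A sketch of the idea (segment size and threshold are arbitrary numbers):

    # Toy log-structured store: writes only ever append; a segment is
    # compacted once its live fraction falls below a threshold.
    SEG_SLOTS, LIVE_THRESHOLD = 512, 0.25

    class LogStore:
        def __init__(self):
            self.segments = [[]]   # append-only segments of (key, value)
            self.where = {}        # key -> (segment, slot) of newest copy

        def write(self, key, value):
            if len(self.segments[-1]) == SEG_SLOTS:
                self.segments.append([])        # open a fresh segment
            seg = self.segments[-1]
            self.where[key] = (len(self.segments) - 1, len(seg))
            seg.append((key, value))

        def compact(self):
            # Re-append the live entries of mostly-dead segments; the old
            # segment can then be erased as one unit.
            for i, seg in enumerate(self.segments[:-1]):
                live = [(k, v) for s, (k, v) in enumerate(seg)
                        if self.where.get(k) == (i, s)]
                if seg and len(live) < LIVE_THRESHOLD * len(seg):
                    for k, v in live:
                        self.write(k, v)
                    self.segments[i] = []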

~~~
wtallis
FYI, the guy you're replying to is the author of bcache and bcachefs. He's
pretty familiar with the tradeoffs of the "simple" solution you refer to.

Fragmenting when you have a partial block invalidation lets you avoid an
immediate and costly read-modify-write cycle (costly both in performance
and in write endurance). But storing the metadata to keep track of those
fragments adds some write amplification of its own, and the eventual
garbage collection is a pretty big source of further write amplification.
It also requires a non-trivial spare area to keep write amplification
under control and performance high. And doing all of this on top of an
existing flash translation layer that's already doing the same thing makes
it even more complicated, and it means you might not be saving anywhere
near as many media writes as a naive analysis would suggest.
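
To put rough numbers on the spare-area point, here's a toy greedy-GC FTL
under uniform random overwrites (all parameters invented); more spare area
drives write amplification down sharply:

    import random

    def write_amp(spare, nblocks=128, ppb=128, nwrites=200_000):
        nlogical = int(nblocks * ppb * (1 - spare))  # user-visible pages
        where = {}                            # logical page -> (block, slot)
        slots = [[] for _ in range(nblocks)]  # physical slots; None = dead
        live = [0] * nblocks                  # live-page count per block
        free = list(range(1, nblocks))
        active, media = 0, 0

        def append(lp):                       # program one physical page
            nonlocal active, media
            if len(slots[active]) == ppb:
                active = free.pop()
            where[lp] = (active, len(slots[active]))
            slots[active].append(lp)
            live[active] += 1
            media += 1

        for _ in range(nwrites):
            lp = random.randrange(nlogical)
            if lp in where:                   # invalidate the old copy
                b, s = where[lp]
                slots[b][s] = None
                live[b] -= 1
            append(lp)
            while len(free) < 2:              # GC the emptiest full block
                victim = min((b for b in range(nblocks)
                              if b != active and len(slots[b]) == ppb),
                             key=live.__getitem__)
                for old in slots[victim]:
                    if old is not None:
                        append(old)           # copy-forward live pages
                slots[victim], live[victim] = [], 0
                free.append(victim)
        return media / nwrites

    print(write_amp(0.07), write_amp(0.28))  # more spare -> much lower WA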

New SSDs are starting to accept application-layer hints about data
lifetime, so that the SSD's flash translation layer can do a sort of
generational garbage collection. This really does provide a big reduction
in write amplification (and a performance improvement) beyond the baseline
you get from a traditional SSD, with or without a filesystem that tries to
be flash-friendly.
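
For a rough feel of why lifetime hints help, here's the same kind of toy
model (numbers invented) with a skewed workload and one open block per
stream, where the stream id stands in for the application's hint:

    import random

    def write_amp(nstreams, nblocks=128, ppb=128, nwrites=200_000):
        # 90% of writes hit the hottest 10% of pages; 10% spare area.
        nlogical = int(nblocks * ppb * 0.9)
        hot = nlogical // 10
        where, media = {}, 0
        slots = [[] for _ in range(nblocks)]
        live = [0] * nblocks
        free = list(range(nstreams, nblocks))
        active = list(range(nstreams))        # one open block per stream

        def append(lp, st):
            nonlocal media
            if len(slots[active[st]]) == ppb:
                active[st] = free.pop()
            where[lp] = (active[st], len(slots[active[st]]))
            slots[active[st]].append(lp)
            live[active[st]] += 1
            media += 1

        for _ in range(nwrites):
            if random.random() < 0.9:
                lp = random.randrange(hot)               # hot page
            else:
                lp = random.randrange(hot, nlogical)     # cold page
            st = 0 if nstreams == 1 or lp < hot else 1
            if lp in where:
                b, s = where[lp]
                slots[b][s] = None
                live[b] -= 1
            append(lp, st)
            while len(free) < nstreams + 1:
                victim = min((b for b in range(nblocks)
                              if b not in active and len(slots[b]) == ppb),
                             key=live.__getitem__)
                for old in slots[victim]:
                    if old is not None:
                        append(old, nstreams - 1)  # GC output -> cold stream
                slots[victim], live[victim] = [], 0
                free.append(victim)
        return media / nwrites

    print(write_amp(1), write_amp(2))   # separated streams -> lower WA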

~~~
Dylan16807
> FYI, the guy you're replying to is the author of bcache and bcachefs. He's
> pretty familiar with the tradeoffs of the "simple" solution you refer to.

Then I don't know why he oversimplified so much. It's not a silver bullet, but
it's _absolutely_ a workaround the filesystem can use to mitigate the issue.

> And doing all this on top of an existing flash translation layer that's
> already doing the same thing makes it even more complicated, and it means
> you might not be saving anywhere near as many media writes as a naive
> analysis would suggest.

Sure, but "some drives already have that mitigation" is a far cry from
"filesystems can't apply that mitigation". And the end result is still a
reduction in write amplification, no matter who does it.

~~~
wtallis
> Sure, but "some drives already have that mitigation" is a far cry from
> "filesystems can't apply that mitigation".

_ALL_ flash-based SSDs have FTLs that work more or less as described above.
They wouldn't survive even short-term light usage if they did a full erase
block R/M/W cycle on every 512B sector modification. If you don't even
understand that most basic principle of what SSDs are doing under the hood,
you're in no position to judge what a filesystem might be able to do to help.
It's as naive as suggesting zipping a file twice.

~~~
Dylan16807
But it's equally naive to state that it's a 'fundamental issue' that writing a
bunch of text to disk requires a lot of I/O, and that compression doesn't
help.

He didn't just say filesystem smarts were redundant with controller smarts, he
said _no data layout_ could avoid the problem. That's a big misstatement of
the problem.

~~~
wtallis
> But it's equally naive to state that it's a 'fundamental issue' that writing
> a bunch of text to disk requires a lot of I/O, and that compression doesn't
> help.

That wasn't stated anywhere in this thread.

> He didn't just say filesystem smarts were redundant with controller smarts,
> he said no data layout could avoid the problem. That's a big misstatement of
> the problem.

You're imagining the conversation to be about a different problem than the
one that was explicitly referred to. The fundamental problem under
discussion is that random writes will thrash any data structure. You might
be able to convert a stream of random writes into sequential writes on the
first pass of filling the disk, but once you have a data set on disk and
are _updating_ it randomly (which is what almost all real-world random
writes are), nothing can prevent the fragmentation and the need for
garbage collection. Doing it in the filesystem layer won't magically be
far more effective than what's already being done at the drive level.

~~~
Dylan16807
> That wasn't stated anywhere in this thread.

It's an extension to your "double zipping" analogy.

> nothing can prevent the fragmentation and need for garbage collection

Nothing prevents fragmentation entirely, but these tactics can decrease
write amplification by an enormous amount. If the initial comment was only
saying that filesystems can't improve on what drives already do, it was
accidentally worded in a very misleading way.

And are you sure filesystems can't arrange writes to make it more effective?

------
reacweb
Maybe filesystems do not have enough metadata to perform sensible
optimisation. Do you want to treat a video file, a shared library, a
temporary file, and a log file the same way?

~~~
amelius
> Do you want to treat the same way a video file, a shared library, a
> temporary file and a log file ?

But none of those require random writes.

------
kutkloon7
I have skimmed the paper. It seems they assume that the writes are skewed
(e.g. some parts of memory get written to more often). This seems to be a
reasonable assumption for real-life situations, but I am not sure 'random
writes' is an appropriate name, since people usually take random to mean
uniformly random.

(Not 100% sure I understood this correctly, though.)

