Random Write Considered Harmful in SSDs (2012) [pdf] (usenix.org)
88 points by mpweiher on Sept 17, 2017 | 22 comments



> SFS drastically reduces the block erase count inside the SSD by up to 7.5 times.

Did Apple implement algorithms with similar reductions in APFS?


> What APFS does, however, is simply write in patterns known to be more easily handled by NAND. It's a file system with flash-aware characteristics

from https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...


> However, there remain two serious problems limiting wider deployment of SSDs: limited lifespan and relatively poor random write performance.

This is a 5-year-old paper, which is an eternity in SSD engineering. Anyone know if SSD firmware today handles this better? Last-gen consumer SSDs were hitting nearly 100K IOPS on random writes, which is 3-4x better than a spinning disk. On random reads you're looking at at least a 50-100x difference.

As far as limited lifespan goes, a fairly popular endurance study from 2014 showed that even a consumer Samsung SSD could take 2 petabytes of writes without failing.

A 2016 study by Google showed that a drive's failure rate correlates with its age rather than with how much has been written to it. So these drives are aging out more than they are wearing out their writable NAND. That's on top of a lower replacement rate than spinning disks.

http://www.techradar.com/news/computing-components/storage/t...

Not sure what the current generation is doing, but I imagine it's even better. There seems to have been a lot of work on wear leveling and extending endurance lately, and in our own data center we're seeing a lower failure rate for SSDs than we did with spinning disks. Not sure of all the black magic involved, but from both a desktop and a server perspective we never experienced the SSD doomsday scenarios so often predicted in the early 2010s, when SSDs began to dominate. Curious to know whether modern SSDs have worked around the worst issues with flash-based media. It seems they have.


The biggest thing missing from the paper seems to be SLC caching, which is now nearly universal on client SSDs but usually not used on enterprise SSDs. Modern SSDs also use much more robust error correction schemes than they did 5 years ago; almost everything has LDPC soft-decode or similar as a fallback, which gives an effective increase in the write endurance of the drive.

There are also trends underway to dismantle some of the abstractions between the file system and the flash. Even when an SSD presents an interface of 512B or 4kB logical blocks for compatibility reasons, it can inform the host of the optimal block size to use for I/O. The host can also tag I/O commands with a stream ID, allowing the SSD to store related data together, separate from unrelated streams. Used properly, this approximates generational garbage collection, improving performance and reducing write amplification.
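
As a concrete host-side example of that kind of lifetime/stream tagging: Linux (4.13+) exposes per-file write-lifetime hints via fcntl(F_SET_RW_HINT), which the kernel can use to group data with similar lifetimes and, on kernel/drive combinations that support NVMe streams, map to stream IDs. A minimal sketch (the filename and hint choice are purely illustrative, and the constants fall back to the <linux/fcntl.h> values if the libc headers are old):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Fallbacks for older libc headers; values match <linux/fcntl.h>. */
    #ifndef F_SET_RW_HINT
    #define F_SET_RW_HINT        1036   /* F_LINUX_SPECIFIC_BASE + 12 */
    #define RWH_WRITE_LIFE_SHORT 2
    #endif

    int main(void)
    {
        int fd = open("journal.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Hint that this file's data is short-lived (e.g. a frequently
         * rotated log) so it can be segregated from long-lived data.
         * Older kernels simply fail the fcntl with EINVAL. */
        uint64_t hint = RWH_WRITE_LIFE_SHORT;
        if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
            perror("F_SET_RW_HINT");

        const char line[] = "short-lived record\n";
        if (write(fd, line, sizeof line - 1) < 0)
            perror("write");

        return close(fd);
    }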

There's also a recently standardized feature for partitioning an SSD in a way that makes hard guarantees about the sharing (or lack thereof) of memory channels and flash chips. This is intended primarily for multi-tenant use cases, to ensure that one process/VM's I/O won't inflict garbage collection pauses on another process/VM. In a similar vein, there are now mechanisms to request that the SSD defer all background processing for a while in order to service a burst of I/O with minimal and predictable latency, and controls over whether an SSD is permitted to do non-critical background processing during idle time (you may not want that happening while running on battery).

Beyond that, there's some work on open-channel SSDs that move large portions of the flash translation layer to the host system, allowing all the above controls and more to be managed by the host. With application-layer tuning this can be extremely powerful.


I've worked on SSD-based systems, and while we over-engineered our system due to such fears, we never saw any device get anywhere close to its limits. Only once did I see an SSD have a media problem (i.e. a failure to read previously written data; after rewriting the data from RAID, no more issues came from that SSD).

This was on enterprise SSDs, but I generally believe the fear around SSD media is somewhat overblown. You do need to think about the wear you will put on the SSD and engineer things so that you stay within bounds, but the SSD makers and their firmware essentially assume, and spec for, these random writes.

In fact, unless you can, as an application, know in advance which data will have the same lifetime and is better written together, and which data needs to be placed separately, I doubt there is much you can do to make things far better.

When you do know the lifetimes, you can use new features like the stream support going into newer SSDs to help both sides work better. You will get longer endurance and better performance (fewer random erase cycles slowing your reads and writes down).


An issue with applying their work to current SSDs is that the write groups they use are about 16MB to 32MB; current SSDs need far larger write sizes to get the optimization they achieved (essentially, they tried to fill an entire erase group at once to gain control of locality). This will no longer work as well as it did before.

Their endurance-improvement estimate is based on a disk simulator with an FTL rather than on checking the actual wear-level change before and after their test, which to me makes it fairly suspect. They are not testing against a real FTL and the real-world behavior of the SSD; rather, they are "inventing" a test which is very limited and doesn't say much. Admittedly, testing the way I suggest would take a lot longer, but it is the only thing that really tells you whether you are making things better in real life.

SSDs are also affected by the workload, and especially by any idle time you give them. If you let an SSD sit with no queued commands, it will flush buffers and do maintenance work that improves its behavior, which also affects the result; to measure real "real-world" behavior you need to factor that in as well, which further increases your test time.


I believe this research fed into the design of the F2FS filesystem, which is in the upstream Linux kernel (also from Samsung).


No, this issue isn't something the filesystem can work around.

The issue is that random writes leave you with all your erase units partially overwritten, and none of them fully overwritten. If the writes really are random, then there's no data layout that avoids it; it's fundamental whenever the write size is smaller than the erase unit size.


That problem is actually simple to solve: you append new versions until you hit some threshold, at which point you compact/GC. You don't have a 1:1 mapping. Getting the best performance requires careful tuning, but reducing the erase count by a large factor is quite straightforward.
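
A toy in-memory sketch of that append-then-compact shape (illustrative only: the sizes are made up, and it ignores persistence, metadata, and the drive's own FTL, which is exactly what the replies below get into):

    /* Toy log-structured store: overwrites become appends, and live
     * records are rewritten only when the dead fraction of the log
     * crosses GC_RATIO. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NKEYS    1024           /* distinct logical records */
    #define LOG_CAP  (16 * 1024)    /* log slots; spare space keeps WA low */
    #define GC_RATIO 0.5            /* compact when >50% of the log is dead */

    struct rec { int key; int val; };

    static struct rec logbuf[LOG_CAP];
    static long head;               /* next append slot */
    static long where[NKEYS];       /* key -> latest slot, -1 if unset */
    static long live;               /* keys currently present */
    static long gc_rewrites;        /* extra writes caused by compaction */

    static void compact(void)
    {
        struct rec keep[NKEYS];
        long n = 0;
        for (long k = 0; k < NKEYS; k++)
            if (where[k] >= 0)
                keep[n++] = logbuf[where[k]];
        head = 0;
        for (long i = 0; i < n; i++) {      /* rewrite only live records */
            logbuf[head] = keep[i];
            where[keep[i].key] = head++;
            gc_rewrites++;
        }
    }

    static void put(int key, int val)
    {
        if (head == LOG_CAP || (head - live) > (long)(GC_RATIO * LOG_CAP))
            compact();
        if (where[key] < 0)
            live++;
        logbuf[head] = (struct rec){ key, val };
        where[key] = head++;
    }

    int main(void)
    {
        for (long k = 0; k < NKEYS; k++)
            where[k] = -1;
        srand(42);

        long host_writes = 200000;
        for (long i = 0; i < host_writes; i++)
            put(rand() % NKEYS, (int)i);

        printf("host writes: %ld\n", host_writes);
        printf("gc rewrites: %ld\n", gc_rewrites);
        printf("write amp:   %.2f\n",
               (double)(host_writes + gc_rewrites) / host_writes);
        return 0;
    }

With the log sized roughly 16x the live data set, the reported amplification comes out around 1.1; shrink LOG_CAP toward NKEYS and it climbs quickly, which is the spare-area tradeoff mentioned downthread.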


FYI, the guy you're replying to is the author of bcache and bcachefs. He's pretty familiar with the tradeoffs of the "simple" solution you refer to.

Fragmenting when you have a partial block invalidation lets you avoid an immediate and costly read-modify-write cycle (both in terms of performance and write endurance). But storing the metadata to keep track of those fragments adds some write amplification of its own, and the eventual garbage collection is a pretty big source of further write amplification. It also requires a non-trivial spare area to keep write amplification under control and performance high. And doing all this on top of an existing flash translation layer that's already doing the same thing makes it even more complicated, and means you might not be saving anywhere near as many media writes as a naive analysis would suggest.

New SSDs are starting to accept application-layer hints about data lifetime, so that the SSD's flash translation layer can do a sort of generational garbage collection. This actually does provide a big reduction in write amplification (and improves performance) beyond the baseline you get with a traditional SSD, with or without a filesystem that tries to be flash-friendly.


> FYI, the guy you're replying to is the author of bcache and bcachefs. He's pretty familiar with the tradeoffs of the "simple" solution you refer to.

Then I don't know why he oversimplified so much. It's not a silver bullet, but it's absolutely a workaround the filesystem can use to mitigate the issue.

> And doing all this on top of an existing flash translation layer that's already doing the same thing makes it even more complicated, and it means you might not be saving anywhere near as many media writes as a naive analysis would suggest.

Sure, but "some drives already have that mitigation" is a far cry from "filesystems can't apply that mitigation". And the end result is still a reduction in write amplification, no matter who does it.


> Sure, but "some drives already have that mitigation" is a far cry from "filesystems can't apply that mitigation".

ALL flash-based SSDs have FTLs that work more or less as described above. They wouldn't survive even short-term light usage if they did a full erase block R/M/W cycle on every 512B sector modification. If you don't even understand that most basic principle of what SSDs are doing under the hood, you're in no position to judge what a filesystem might be able to do to help. It's as naive as suggesting zipping a file twice.


But it's equally naive to state that it's a 'fundamental issue' that writing a bunch of text to disk requires a lot of I/O, and that compression doesn't help.

He didn't just say filesystem smarts were redundant with controller smarts; he said no data layout could avoid the problem. That's a big misstatement of the problem.


> But it's equally naive to state that it's a 'fundamental issue' that writing a bunch of text to disk requires a lot of I/O, and that compression doesn't help.

That wasn't stated anywhere in this thread.

> He didn't just say filesystem smarts were redundant with controller smarts, he said no data layout could avoid the problem. That's a big misstatement of the problem.

You're imagining the conversation to be about a different problem than the one that was explicitly referred to. The fundamental problem under discussion is that random writes will thrash any data structure. You might be able to convert a stream of random writes to sequential writes on the first pass of filling the disk, but once you have a data set on disk and are updating it randomly (which is what almost all real-world random writes are), nothing can prevent the fragmentation and need for garbage collection. Doing it in the filesystem layer won't magically be far more effective than what's already being done at the drive level.


> That wasn't stated anywhere in this thread.

It's an extension to your "double zipping" analogy.

> nothing can prevent the fragmentation and need for garbage collection

Nothing prevents fragmentation entirely but these tactics can decrease the write amplification by an enormous amount. If the initial comment was only saying that filesystems can't improve on what drives already do, it was accidentally worded in a very misleading way.

And are you sure filesystems can't arrange writes to make it more effective?


If you only append, you do not get the effect you want; you actually need to send the writes as one large consecutive write for the SSD to do what the paper intends. If you send many smaller writes, the SSD will almost certainly split them across different NAND dies to get maximum write parallelization. They will be grouped for maximum write efficiency, which is not necessarily the same as erase efficiency.


The linked research paper describes a log-structured filesystem called SFS, which they built from NILFS2 in order to implement and test the random-write mitigations they propose.

Samsung then submitted a clean-sheet implementation of a flash-friendly log-structured filesystem called F2FS (https://lkml.org/lkml/2012/10/5/205).


The key thing they are saying in that paper is that they make the filesystem issue its writes at about the size of the erase group, which would in fact help.

A problem I can see with using this on large SSDs is that their erase group size is now fairly large (at one point a vendor told me the erase size is 1GB on an 800GB, 3 DWPD SSD).


I really don't think you read the paper...


Maybe filesystems do not have enough metadata to perform sensible optimisation. Do you want to treat a video file, a shared library, a temporary file, and a log file the same way?


> Do you want to treat a video file, a shared library, a temporary file, and a log file the same way?

But none of those require random writes.


I have skimmed the paper. It seems they assume that the writes are skewed (e.g. some parts of the storage get written to more often than others). This seems to be a reasonable assumption for real-life situations, but I am not sure "random writes" is an appropriate name (since people usually take random to mean uniformly random).

(Not 100% sure if I understood this correctly, though)



