If you want to go fast & save NAND lifetime, use append-only log structures.
If you want to go even faster & save even more NAND lifetime, batch your writes in software (e.g. a ring buffer with a natural back-pressure mechanism) and then serialize them with a single writer into an append-only log structure. Many newer devices have something like this at the hardware level, but your block size is still a constraint when working in hardware. If you batch in software, you can hypothetically write multiple logical business transactions per block I/O. When your physical block size is 4k and your logical transactions average 512 bytes of data, you're leaving a lot of throughput on the table.
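(For illustration, a rough sketch of that software-side batching: pack several ~512-byte logical records into one 4 KiB block, then issue a single write for the whole block. The record layout, sizes, and the log fd are all made up here.)

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    struct batch {
        uint8_t buf[BLOCK_SIZE];
        size_t  used;
    };

    /* Returns 0 if the record fit, -1 if the batch is full and must be flushed first. */
    static int batch_append(struct batch *b, const void *rec, size_t len)
    {
        if (b->used + len > BLOCK_SIZE)
            return -1;
        memcpy(b->buf + b->used, rec, len);
        b->used += len;
        return 0;
    }

    /* The single writer thread drains the ring buffer and calls this: one
     * block-sized append per batch instead of one tiny write per record. */
    static int batch_flush(int log_fd, struct batch *b)
    {
        ssize_t n = write(log_fd, b->buf, BLOCK_SIZE);  /* log_fd opened with O_APPEND */
        b->used = 0;
        return n == BLOCK_SIZE ? 0 : -1;
    }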
Going down 1 level of abstraction seems important if you want to extract the most performance from an SSD. Unsurprisingly, the above ideas also make ordinary magnetic disk drives more performant & potentially last longer.
I used to think the same thing, but now that I work on SSD-based storage systems, I'm not sure this holds up in today's storage stacks. Log structuring really helped with HDDs since it meant fewer seeks.
In particular, the filesystem tends to undo a lot of the benefits you get from log-structuring unless you are using a filesystem designed to keep your files log-structured. Using huge writes definitely still helps, though.
Edit: I had originally said "designed for flash" instead of "designed to keep your files log-structured." F2FS is designed for flash, but in my testing does relatively poorly with log-structured files because of how it works internally.
Edit 2: de-googled the link. Thank you for pointing that out.
Achieving cutting-edge storage performance tends to require bypassing the filesystem anyways. Traditionally, that meant using SPDK. Nowadays, opening /dev/nvme* with O_DIRECT and operating on it with io_uring will get you most of the way there.
In either case, the advice given in the article and by the OP is filesystem agnostic.
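A minimal sketch of the O_DIRECT + io_uring approach mentioned above, assuming liburing is available. The device path, offset, and 4 KiB size are illustrative; in practice you'd also check the device's alignment requirements and permissions.

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT requires aligned buffers and sizes; 4 KiB is a common safe choice. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) return 1;
        memset(buf, 0xAB, 4096);

        int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);  /* illustrative device */
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(32, &ring, 0)) return 1;

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, 4096, 0);         /* write 4 KiB at offset 0 */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("write result: %d\n", cqe->res);             /* bytes written or -errno */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }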
Will an end user downloading a video editing app (or similar) have an NVMe drive, know how to give your app direct access to that NVMe drive, and will your app avoid corrupting the rest of the files on the drive?
Extreme performance requires extreme tradeoffs. As with anything else, you have to evaluate your use cases and determine for yourself whether the tradeoffs are worth it. For a mass-market application that has to play nice with other applications and work with a wide variety of commodity hardware, it's probably not worthwhile. For a state-of-the-art high performance data store that expects low latencies and high throughput (à la ScyllaDB), it may very well be.
Would high-performance data storage be easier to implement on commodity hardware if operating systems supplied an API to get a blob of bytes segmented out of an entire disk (e.g. a file) that presented low-level semantics like a full-fledged SSD partition or drive?
I feel that operating systems need to provide self-contained reliable APIs designed for atomically overwriting configuration files, without losing permissions or overwriting symlinks or such. Or perhaps supply more powerful primitives, like a faster/weaker fsync that serves as an ordering barrier rather than flushing to disk, or an API to replace a file without altering permissions. One issue I've heard is:
> I even had an issue with atomic writes over ssh that created the temp file but where not able to rename it, so the old one stayed.
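As a sketch of what such a primitive has to do by hand today (temp file in the same directory, copy the permissions, fsync, rename, fsync the directory); error handling is trimmed and the temp-file name is made up. Symlink handling, ownership, and xattrs are deliberately ignored.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Atomically replace `path` with `data`, keeping the old file's mode. */
    int atomic_replace(const char *dir, const char *path, const void *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s/.tmp-replace", dir);

        struct stat st;
        if (stat(path, &st) < 0) return -1;

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) return -1;
        if (fchmod(fd, st.st_mode) < 0) goto fail;       /* preserve permissions */
        if (write(fd, data, len) != (ssize_t)len) goto fail;
        if (fsync(fd) < 0) goto fail;                    /* data durable before rename */
        close(fd);

        if (rename(tmp, path) < 0) return -1;            /* atomic swap */

        int dfd = open(dir, O_DIRECTORY | O_RDONLY);     /* make the rename itself durable */
        if (dfd >= 0) { fsync(dfd); close(dfd); }
        return 0;

    fail:
        close(fd);
        unlink(tmp);
        return -1;
    }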
at that point just use a RAM disk and periodically write that data to physical disk or SSD. no extreme tradeoff required, because RAM disks are WAY faster than SSDs.
manhandling /dev/nvme0 seems equally likely to corrupt data in the event of a power failure.
> manhandling /dev/nvme0 seems equally likely to corrupt data in the event of a power failure.
If we make the reasonable assumption that this subthread is discussing a server use case, then we can assume that the SSD is tolerant of power failures and has the capacitors necessary to finish any cached writes it has reported as complete. Thus, having fewer layers between the hardware and the application means there are fewer opportunities for some layer to lie to those above it about whether the data has made it to persistent storage.
Whether or not you're bypassing large parts of the operating system's IO stack, the application needs to have a clear idea of what data needs to be flushed to persistent storage at what times in order to properly survive unexpected power loss without unnecessary data loss or corruption.
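A sketch of that flush discipline, whether or not the filesystem is in the path: write, then flush, and only then acknowledge the commit. The log path and helper name are illustrative; on a raw device opened with O_DIRECT you'd issue your own flush/FUA writes instead, but the ordering requirement is the same.

    #include <fcntl.h>
    #include <unistd.h>

    /* Append a record and only report success once the flush has completed. */
    int durable_append(int log_fd, const void *rec, size_t len)
    {
        if (write(log_fd, rec, len) != (ssize_t)len)
            return -1;
        if (fsync(log_fd) < 0)   /* do not acknowledge the commit before this returns */
            return -1;
        return 0;                /* now it's safe to tell the caller "committed" */
    }

    /* e.g. int log_fd = open("/var/lib/mydb/wal", O_WRONLY | O_APPEND | O_CREAT, 0600); */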
> at that point just use a RAM disk and periodically write that data to physical disk or SSD. no extreme tradeoff required, because RAM disks are WAY faster than SSDs.
A storage application that needs to bypass the filesystem will already be implementing its own caching system anyway. The idea is to persist the data to maintain durability without sacrificing latency.
> manhandling /dev/nvme0 seems equally likely to corrupt data in the event of a power failure.
Given enough RAM on a Linux machine one may use tmpfs, which maintains a RAM disk and at any moment only uses the amount of RAM needed, with a pre-defined limit.
On PostgreSQL, create an adequately-capped tmpfs, create a TABLESPACE on it, then store temporary tables in this TABLESPACE. No SSD (I have access to) beats this. Hint: before shutting PG down you may DROP this TABLESPACE.
It also is useful for a blockchain, amazingly fast (and a relief for HDDs), in most cases alleviating the need for an SSD. Place the blockchain file(s) on the tmpfs mount. Before machine shutdown, stop any blockchain-using software, then store a compressed copy of the blockchain file(s) on permanent storage (I use "zstd -T0 --fast"...), and upon reboot restore it to the tmpfs mount. If anything fails, the blockchain-writing software will re-download any missing blocks.
While tmpfs can be very useful even as it is, users must beware that copying a file from another Linux file system to tmpfs can lose a part of the file metadata, without giving any warnings or errors.
The main problem is that copying a file to tmpfs will drop extended attributes. Old versions of tmpfs dropped all extended attributes, modern versions of tmpfs keep some security-related extended attributes, but they still drop any user-defined extended attributes.
Old versions of tmpfs truncated some high-resolution timestamps, e.g. those coming from xfs, but I do not know if this still happens on modern versions of tmpfs.
Before learning these facts, I could not understand why some file copies lost parts of their metadata after being copied via /tmp between 2 different users, on a multi-user computer where /tmp was mounted on tmpfs.
Now that I know, when I have to copy a file via tmpfs, I make a pax archive, which preserves file metadata. Older tar archive formats may have the same problems as tmpfs.
Isn't this extremely dangerous? Disk write caches aren't used most of the time, except on battery-backed HBAs. And databases are typically configured to use O_DIRECT for a reason: COMMITs are supposed to be durable. We had this fight at a previous company, when an engineer based a database server hardware recommendation on a dangerously misconfigured database server and did not consider the effect of caches. As soon as a safe configuration was used in production, performance dropped off a cliff, particularly on random IO. So the question we had to ask was: do you want to trade durability for performance? Or do you now have to carve up your databases into shards that fit the IO performance characteristics of the badly chosen servers you purchased, and waste rack space and CPU power?
Parent is talking about temporary tables. Those are normally only live for the duration of a transaction (well, session, but in practice if you're using temporary tables across multiple transactions you have a logical application-level transaction which needs to be able to handle failure part-way through). After your transaction the writes to non-temporary tables should be persistent.
Postgres temp tables on ramdisk are a problem for a different reason, the WAL, as pointed to by a sibling comment.
Gotcha, somehow missed that. Yeah, tmp tables on disk are painful, and I've made the same optimization on MySQL whenever it wasn't possible to eliminate the need for tmp tables by refactoring SQL.
Why would you want to bypass the filesystem by talking to the block device directly? Doesn't O_DIRECT on a preallocated regular file accomplish the same thing with less management complexity and special OS permissions? Granted, the file extents might be fragmented a bit, but that can be fixed.
A "regular file" might reside in multiple locations on disk for redundancy, or might have a checksum that needs to be maintained alongside it for integrity. Or, as you say, its contents might not reside in contiguous sectors - or you might be writing to a hole in a sparse file. There's a lot of "magic" that could go on behind the scenes when operating on "regular files", depending on what filesystem you're using with what options. Directly operating on the block device makes it easier to reason about the performance guarantees, since your reads and writes map more cleanly to the underlying SCSI/ATA/NVME commands issued.
If you understand your workload and the hardware well enough to understand how doing direct I/O on a file will help - then you’re going to generally do better against a direct block device because there are fewer intermediate layers doing the wrong optimizations or otherwise messing you up. From a pure performance perspective anyway. Extents are one part of the issue, flushes to disk (and how/when they happen), caching, etc.
Doesn’t mean it isn’t easier to deal with as a file from an administration perspective (and you can do snapshots, or whatever!), but LVM can do that too for a block device, and many other things.
With O_DIRECT though you're opting out of the filesystem's caching (well, VFS's), forced flushes, and most FS level optimizations, so I'd expect it to perform on par with direct partition access.
Do you have numbers showing an advantage of going directly to the block device? Personally, I'd consider the management advantages of a filesystem compelling absent specific performance numbers showing the benefit of direct partition access.
You do, when the filesystem actually does that / respects it, which isn't always the case. The point is that you have more layers. If you're trying to be as direct as possible, more layers are unhelpful.
Since you get most of the same advantages management-wise with LVM while using the block interface (including snapshots, resizing, and all the other management goodies), you’re not exactly getting much extra functionality either.
Your concerns are all theoretical and the management disadvantages of direct partition access are real with or without LVM (which itself is exactly the sort of middle layer you claim to be worried about.)
Since most of what we’re talking about is unnecessary complexity for no real gain, what concrete metric do you think would be useful, exactly? I just pointed out that you can get the same management advantages without it (say for a dev environment or rollbacks or whatever). And you get a simpler, cleaner story without extra layers if you don’t want to use LVM (such as in production), which you can’t get from O_DIRECT.
The problem is some of the alternatives seem to be suggested by way of "if we had any support for this it would be better than O_DIRECT". So don't use O_DIRECT, use the alternative which doesn't exist, is still too slow, only covers part of what you need, etc.
I'm wondering if it's really necessary to get at the block device directly.
I'm able to saturate a PCIe 3.0 x4 link doing direct IO to an NVMe drive with a single 1.7 GHz PowerPC core without breaking a sweat. This is through ext4.
My accesses are sequential though. Maybe there's more of a penalty with random IO.
In my testing of these ideas, I've been able to push over 2 million transactions per second (~1 KB per transaction) to a Samsung 960 Pro. For reference, it's rated for 2.1 GB/s sequential writes, so I've got it pretty much 100% saturated.
The implementation for something like this is actually really underwhelming when you figure out how to put all the pieces together. I assembled this prototype (also a key-value store) using .NET 5, LMAX Disruptor, and a splay tree implementation I copied from Google somewhere. The hardest part was figuring out how to wait for write completion on the caller side (multiple calling threads are ultimately serialized into a single worker thread via the Disruptor). Turns out, a busy wait for a few thousand cycles followed by a yield to the OS is a pretty good trick. You just do a while(true) over a completion flag on the transaction object, which is set en masse by the handling thread after the write goes to disk. Batch sizes are determined dynamically based on how long the previous batch took to write. In practice, I never observed a batch that took longer than 2-3 milliseconds on my 960 Pro. Max batch size is 4096, and it is permanently full when 100% loaded. A full batch = a nice big IO to disk.
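(A rough C rendition of the caller-side wait described above; the original was C# with the Disruptor. The spin threshold and the `done` flag are illustrative.)

    #include <sched.h>
    #include <stdatomic.h>

    struct txn {
        atomic_bool done;   /* set by the single writer thread after the batch hits disk */
    };

    /* Spin for a few thousand iterations, then start yielding to the OS. */
    static void wait_for_commit(struct txn *t)
    {
        unsigned spins = 0;
        while (!atomic_load_explicit(&t->done, memory_order_acquire)) {
            if (++spins < 4000)
                continue;      /* cheap busy-wait while the batch is in flight */
            sched_yield();     /* batches take ~2-3 ms to write, so back off politely */
        }
    }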
LMDB has similar write characteristics where its b-tree is append-only. This gives LMDB amazing performance and very robust ACID transaction support as immutability is baked in.
This is quite common in traditional DBs too. E.g. PostgreSQL has its write-ahead log. Both LMDB and PostgreSQL then occasionally need to do some kind of compaction, checkpoint or garbage collection (whatever it's called in various systems), where the append-only log is reset and any live data in it is imported into the main DB data.
I only have cursory knowledge of LMDB (from listening to a podcast while biking). Anyway, LMDB has no transaction log nor write-ahead log. There's no overwrite during update. Data page updates are copy-on-write and b+tree index updates are append-only. The update on the b+tree pages is performed from the bottom of the tree to the root, linking newly appended pages to higher-level pages. The transaction is committed when the new root page is appended. When there's a crash, the incomplete appended index pages have not been linked up to the root page yet and are not reachable from the previous valid root page. They can just be thrown away. Recovery just means searching for the last valid root index page. There's no need for a WAL and undo/redo of the transaction log.
Deleted pages and obsolete pages are actively put back into a free list (tracked by another b+tree), which will be reused for new page allocation. This avoids the long garbage collection phase to walk all the live pages for compaction (no vacuum is needed).
Also: parallelize your writes. This is the biggest difference between SSDs and HDDs: internal parallelism. You’ll have a hard time saturating I/O bandwidth even with huge sequential writes if you don’t introduce some parallelism. Fortunately, io_uring makes this easy from a single thread.
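To illustrate, keeping several writes in flight from a single thread might look roughly like this with liburing; the queue depth, 1 MiB sizes, and fd are illustrative, and the buffers are assumed to be allocated (and aligned if O_DIRECT is used).

    #include <liburing.h>

    /* Submit `count` 1 MiB writes at once so the SSD always has work queued,
     * then reap all the completions. */
    static int write_parallel(int fd, void **bufs, int count)
    {
        struct io_uring ring;
        if (io_uring_queue_init(count, &ring, 0)) return -1;

        for (int i = 0; i < count; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_write(sqe, fd, bufs[i], 1 << 20, (__u64)i << 20);
        }
        io_uring_submit(&ring);            /* all writes now in flight together */

        int err = 0;
        for (int i = 0; i < count; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            if (cqe->res < 0) err = -1;    /* -errno on failure */
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return err;
    }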
Buffering writes is fine if you're ok with losing your data. For some applications that's acceptable, but when I'm writing to disk, it's because I want persistence. "It'll get flushed to disk at some point as long as power doesn't go out" is hardly that.
This page tells me a lot about SSDs, but it doesn't tell me why I need to know these things. It doesn't really give me any indication about how I should change my behavior if I know that I'll be running on SSD vs spinning disk.
I've always been told, "just treat SSDs like slow, permanent memory".
For instance, when reading this, SQLite came immediately to my mind, and how much a loop of 10,000 inserts without BEGIN/COMMIT or some pragmas would wreck an SSD... (it forces a full sync between every two inserts)
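For reference, the fix is just wrapping the loop in an explicit transaction so SQLite syncs once per batch rather than once per statement. A sketch using the C API; the table and values are made up.

    #include <sqlite3.h>
    #include <stdio.h>

    int insert_batch(sqlite3 *db)
    {
        char *err = NULL;
        sqlite3_stmt *stmt;

        sqlite3_exec(db, "BEGIN", NULL, NULL, &err);       /* one journal sync per batch */
        sqlite3_prepare_v2(db, "INSERT INTO t(x) VALUES (?)", -1, &stmt, NULL);

        for (int i = 0; i < 10000; i++) {
            sqlite3_bind_int(stmt, 1, i);
            if (sqlite3_step(stmt) != SQLITE_DONE)
                fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
            sqlite3_reset(stmt);
        }

        sqlite3_finalize(stmt);
        sqlite3_exec(db, "COMMIT", NULL, NULL, &err);      /* instead of 10,000 syncs */
        return 0;
    }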
Yes but you can configure the kernel to ignore that, and by default it does.
For example, way back in the day, to get more life out of my laptop during college, I configured the kernel to only write to disk once an hour or when the buffer filled up. That effectively meant I was only writing to disk once per hour when I shut down to change classes.
The modern linux kernel doesn't actually write to disk when fsync is called. It buffers the writes in a cache. Also, the SSD itself has a cache.
There are lots of abstractions between SQLite and the disk.
> fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
>I configured the kernel to only write to disk once an hour or when the buffer filled up. That effectively meant I was only writing to disk once per hour when I shut down to change classes.
Sounds great until you get a kernel panic or random shutdown, in which case you potentially get file corruption and/or data loss.
> The modern linux kernel doesn't actually write to disk when fsync is called. It buffers the writes in a cache.
Do you have a reference for this? That would break every ACID database that I'm aware of, including sqlite and postgresql. There has been a lot of work in the last few years to fix data durability issues with fsync (e.g. https://lwn.net/Articles/752063/), so I would be very surprised to hear that fsync is now a no-op.
> you can configure the kernel to ignore that, and by default it does.
> The modern linux kernel doesn't actually write to disk when fsync is called.
This is false.
Almost all open source databases' durability guarantees are based upon fsync (including SQLite, Postgres, MySQL, and so on). fsync will result in the corresponding underlying storage flush commands. You can configure Linux to ignore fsync, but this is not the default on any Linux distribution I'm aware of. It would not make any sense.
Fortunately most people aren't running OLTP workloads on client SSDs. That's mostly done on enterprise SSDs that have much higher endurance. That said even on client SSDs you can probably get away with running such workloads as long as you're not doing them 24/7.
More important than the higher rated endurance (and perhaps contributing a bit to that rating) is the fact that the typical enterprise SSD has power loss protection capacitors for its RAM, so it can cache and combine writes in RAM safely.
Indeed. The summary talks about what you need to do to saturate an SSD's read and write bandwidth. I guess the post would find its audience better if the title were "What a programmer should know about SSDs when optimizing IO".
I'd be more interested in what the trends in SSD behaviour are. It seems SSDs have bigger and bigger DRAM caches and wear ceased to be an issue many years ago, so there's not much payoff in the write-side advice of the article.
I have found TRIM is not sufficient, at least on Windows; from what I can tell we still occasionally need to defragment SSDs.
On a Windows server we were having SSD performance issues where sequential reads were often down to 100 MB/s. It was kind of confusing, and we tried all sorts of ways to copy the data with the same result. I eventually tested the drive with a fragmentation tool: fragmentation was really high at 80%, but most importantly the problem files had so many fragments that they were tending towards 4K IO reads.
What I did was remove all the files to another drive, force trimmed the drive and gave it several hours to sort itself out and then copied them back and performance was restored to 550MB/s as would be expected.
I wrote a quick Go program to test the sequential read speed of all files across all the drives, and I found plenty of files where performance was degraded. This was across a range of SSDs I had, SATA and NVMe from differing vendors. I suspect this is a bigger problem than most people realise: normal use absolutely can get the drive into a badly performing state, and TRIM won't fix it. Very few people expect that the drive will degrade down to its 4K IO speed on a sequential copy, but it apparently can.
> Log-structured applications and file systems have been used to achieve high write throughput by sequentializing writes. Flash-based storage systems, due to flash memory’s out-of-place update characteristic, have also relied on log-structured approaches. Our work investigates the impacts to performance and endurance in flash when multiple layers of log-structured applications and file systems are layered on top of a log-structured flash device. We show that multiple log layers affects sequentiality and increases write pressure to flash devices through randomization of workloads, unaligned segment sizes, and uncoordinated multi-log garbage collection. All of these effects can combine to negate the intended positive affects of using a log. In this paper we characterize the interactions between multiple levels of independent logs, identify issues that must be considered, and describe design choices to mitigate negative behaviors in multi-log configurations.
My opinion is probably... not technically correct... until you have to deal with drive reliability and write guarantees, but I don't think programmers actually have to know anything about SSDs in the same way that developers had to know particular things about HDDs.
This is out of pure speculation, but there had to be a period of time during the mass transition to SSDs when engineers asked: OK, how do we make the hardware compatible with software that is, for the most part, expecting hard disk drives, and just have it behave like a really fast HDD?
So, there's almost certainly some non-zero amount of code out there in the wild that is or was doing some very specific write optimized routine that one day was just performing 10 to 100 times faster, and maybe just because of the nature of software is still out there today doing that same routine.
I don't know what that would look like, but my guess would be that it would have something to do with average sized write caches, and those caches look entirely different today or something.
And today, there's probably some SSD specific code doing something out there now, too.
Games used to spend a lot of time optimizing CD/DVD layout, because reading from that is REALLY slow. Optimizing mostly meant keeping data contiguous. But sometimes it meant duplicating data to avoid seeks.
The canonical case is minimize time to load a level. Keep that level’s assets contiguous. And maybe duplicate data that is shared across levels. It’s a trade off between disc space and load time.
I’m not familiar with major tricks for improving load times after a disc is installed to the drive. (PS4 games always streamed data from HDD, not disc.)
Even consoles use different HDD manufacturers. So it’d be pretty difficult to safely optimize for that. I’m sure a few games do. But it’s rare enough I’ve never heard of it.
While reading through the Quake 3 source code, I noticed that whenever the FS functions were reading from a CD, they were doing so in a loop, because the fread/fopen functions instead of hanging and waiting for the CD to spin up sometimes just returned an error. It wasn't just slow, it was also hilarious at times.
This reminds me of Valve’s GCF (grid cache file, officially, or game cache file, commonly). The benefits must have purely occurred on consoles for the reasons you outlined, because cracked Valve games that had GCF files extracted ran faster than the official retail releases on PCs!
Stream loading is another technique that's used to reduce load time. You start loading data for the next level as the player approaches a boundary, and you let them enter the next level before all of the assets (usually textures) have finished loading.
Consoles also do this with HDDs. That's been one of the talking points around the PS5 from the beginning, with Sony saying that games would get more storage space efficient because they don't need redundancy for faster loading anymore.
This is very very true. The PS5 does hardware decompression, so games by default are now going to be compressed. For a real world reference of how big a difference that makes, see fortnite turning on compression [0] (disclaimer: I worked for epic on fortnite at the time)
> The PS5 does hardware decompression, so games by default are now going to be compressed.
If that really is cause and effect, that's a bit disappointing. For any game that isn't assuming you have an ultra-fast SSD, normal CPU decompression can handle things quite well. Such a hard nudge shouldn't have been necessary.
With few exceptions, video games have been keeping their assets on disk in compressed form for a long time. It's a major embarrassment when someone ships a game with uncompressed audio, and impractical to ship with uncompressed image, texture or video assets (though these can be shipped in compressed form with unnecessarily high resolution).
The hardware decompression acceleration in new consoles doesn't exactly make it easier to use compression for the game assets. Rather, it makes it practical to load compressed assets on-demand instead of reading and decompressing into RAM during the loading screen.
> With few exceptions, video games have been keeping their assets on disk in compressed form for a long time.
Well, we can point to fortnite up there, but also a very large fraction of the games I have on steam can be shrunk by a third just by applying filesystem-level compression, despite it using weak algorithms and small blocks. I'm sure there's compression involved, but it's not even meeting a minimum bar of competency.
Interesting, and fun to read and think about! And, as a professional programmer for 17 years now, not once have I done anything where this would have been important for me to know (even if I had been running my code on a system with SSD's). So, I'm not convinced the title is at all accurate.
I think the key is hidden in
> which can help creating software that is capable of exploiting them
Unless you're writing desktop software or your application behaves in a way where you have actually selected the particular hardware components (most of us in cloud hosting don't do this), you probably don't [need to] care.
It is really puzzling why "every programmer" should burden their already overloaded brains with this. If they're reading/writing some config/data files, this knowledge would not help one bit. If they're using a database, then it falls to the database vendor to optimize for this scenario.
So I think that unless this "every programmer" is a database storage engine developer (not too many of them, I guess), their only concern would mostly be: how close is my SSD to that magical point where it has to be cloned and replaced before shit hits the fan.
A little off topic, but I bought a new Macbook Pro with the M1 chip with 8GB of RAM, and I'm worried about the swap usage of this machine wearing out the SSD too quickly. Is this an actual concern, as my swap has been in the multiple GB range with my use?
Generally speaking macOS is extremely write-heavy for all sorts of reasons, even before the switch to ARM. But in the majority of cases it should last 4-5 years without problems.
The heavy-write bug, Apple said, was due to misreporting and was fixed (so they say).
I do think you should pay attention to it from time to time. iCloud sync, Spotlight, and Safari with heavy tabs are all known to cause heavy paging in some corner cases. You might end up having a TB of data written for no apparent reason. Apple used to ship their MacBooks with MLC; on a 512GB MLC drive you could do 500TBW without problems, which is ~13 years of usage if you write 100GB per day. Not sure about the M1 machines.
If you are doing a lot of dev staging or video and photo editing, these drives will fail quite quickly, in the space of 2-3 years. Although some would argue MacBook Airs are not made for those tasks. And it's especially true if you have 8GB of RAM and 256GB of NAND.
Well, they no longer need to think about hard disks, but there are a lot of assumptions from the world of hard disks that play out very differently in the SSD world.
I don't think there's any optimization for hard drives that is going to hurt on SSDs, and unoptimized workloads are always going to work better on SSDs. I'm inclined to agree with GP that SSDs are quite close to random-access storage and so there is little to worry about.
Sure there are. If nothing else, hard disks have much more consistent latency characteristics for reads and writes. So, for example, you might trade some extra write IOs to ensure data is organized efficiently on disk, reducing the number of read IOs you will subsequently have. With an SSD it's largely a waste of time, because the random reads are so much cheaper and the "contiguous" blocks you think you are seeing are mapped all over the drive anyway. You want to organize things reasonably efficiently when you write, and then rewrite as little as possible, ideally never. LSM's tend to fit the SSD paradigm so much better than say... balanced trees for this reason. Similar story with clustered indexes in databases. If you use a clustered index on an SSD, usually it's for an index on something like time where new records are invariably going to go near the end of the index; anything else will have bad write performance on a hard disk, but it might be worth it for the read performance... with the SSD, it is just an unmitigated disaster.
There was a time where people thought of hard drives as "just random access storage" and consequently "there is little to worry about" and "unoptimized workloads are always going to work better on SSDs". Yup, SSDs are way faster than what came before them, but that if anything tends to mean that data structures & algorithms that used to make sense might not make much sense any more.
I've turned on plenty of cell phones that hadn't been charged or powered on for a couple of years and everything worked normally. Same with thumb drives I've picked up after years.
I mean, anything can fail after three months. Your statement doesn't really add anything without stating the failure rates. For all I know the failure rate could be less than that of physical hard drives.
Thanks, now I understand where this is coming from.
And the linked article makes clear it's not a worry at all. Key part:
> All in all, there is absolutely zero reason to worry about SSD data retention in typical client environment. Remember that the figures presented here are for a drive that has already passed its endurance rating, so for new drives the data retention is considerably higher, typically over ten years for MLC NAND based SSDs...
Average users virtually never pass the endurance rating, so @teddyh's claim seems awfully sensationalistic.
If that is true, disks should come with a very visible note stating this... Seriously, 3 months would be nothing. I doubt it is true, because a 3-month timeframe would be exceeded quite often, which would make this more widely known.
Three months is the minimum standard for data retention from an enterprise SSD that has used up its entire write endurance and reached end of life, but is still being stored in a hot chassis.
Outside of that narrow scenario, the three months figure is wildly wrong and should not be repeated. Lower temperatures, a consumer drive, and not having used up 100% of the write endurance will all drastically lengthen data retention.
(However, under no circumstances should you trust a cheap USB thumb drive to retain your data. Those tend to use lower-grade flash memory and lower-quality controllers. If you need an external device to reliably cart around data, shop for a "portable SSD", not a "USB flash drive".)
That's a document from nine and a half years ago, and it states:
> It depends on the how much the flash has been used (P/E cycle used), type of flash, and storage temperature. In MLC and SLC, this can be as low as 3 months and best case can be more than 10 years. The retention is highly dependent on temperature and workload.
Are there any modern sources provide more accurate stats? "3 months to 10 years" is so vague as to be useless.
It occurs to me now that the key word here may be “unpowered”. As in, if you unplug an SSD and leave it on the shelf, it may lose (some) data in as little as three months. There might not be many people who do that, and those who do might not notice the occasional corruption.
What's the flash translation layer made of? Is the flash technology used for that more durable than the rest of the SSD itself? (like say MLC vs. QLC?)
The FTL is like a virtual memory manager. It is firmware/hardware to manage things like the logical to physical mapping table, garbage collection, error correction, bad block management. Yes there will be a lot of FTL data structures stored on the flash. It can be made durable by redundant copies, writing in SLC mode or having recovery algorithms. I used to develop SSD firmware in the past if you have further questions.
Hey that's very interesting! How much of the FTL logic is done with regular MCU code vs custom hardware? Is there any open source SSD firmware out there that one could look at to start experimenting in this field, or at least something pointing in that direction, be it open or affordable software, firmware, FPGA gateware or even IC IP? I believe there is value in integrating that part of the stack with the higher-level software, but it seems quite difficult to experiment unless one is in the right circles / close to the right companies. Thanks!
Typically the host and NAND interfaces have custom hardware. When the host issues a command, the hardware might validate it and queue up data to a buffer. On the NAND interface there might be a similar queue for NAND commands. You might have multiple queues for different priorities of operation. The error correction will also be in hardware: when you issue NAND reads and writes, the ECC will be checked or encoded. The rest of the FTL might all be in firmware. Perhaps a single core does everything, or maybe it's partitioned between two cores, one for the host-related code and the other for the FTL-related code. Some companies have tried lots of cores, each with a dedicated state machine to handle some part of the operation. These can be complex to coordinate and to debug. Some companies convert some of these state machines into custom hardware.
The only open SSD platform I've read about is http://openssd.io/ but I've never played with it. One of the challenges is that the NAND manufacturers keep a lot of the critical documentation under NDA these days. You really need that information to make a reliable SSD. When you learn how the internals of an SSD work, it's a wonder that it retains data at all!
In terms of integrating the SSD with the higher software levels, I believe FusionIO was doing this in the past. They put the whole logical-to-physical mapping into the host memory.
Do you know if Apple's M1 also does something similar to what FusionIO did? I read something about it on Twitter, but didn't think to follow up with the poster at the time.
Thank you, note taken, that is very valuable information! OpenSSD is at least a good starting point to research and prototype, even if manufacturer help is needed later.
You're right that the FTL has some durability concerns which, in addition to performance, is why it's typically cached in DRAM. Older DRAM-less SSDs were unreliable in the long-term but that's been improving with the adoption of HMB, which lets the SSD controller carve out some system RAM to store FTL data.
> A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible and used by the SSD controller.
Does the controller read the partition table to decide that the space beyond the logical partition is safe to use as scrap?
The SSD maintains a translation table for all the virtual addresses exposed by the drive, which maps them to the underlying flash physical addresses. Any physical address not in that table is unallocated, and the drive can use it freely.
With most SSDs, there's no special explicit step necessary to overprovision a device. Just trim/unmap/discard a range of logical block addresses, and then never touch them again. The drive won't have any live data to preserve for those LBAs after they've been wiped by the trim operation, and the total amount of live data it is tracking will stay well below the advertised capacity of the drive.
The easiest way to achieve this is to create a partition with no filesystem, and use blkdiscard or similar to trim the LBAs corresponding to that partition.
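A sketch of trimming such a range directly with the BLKDISCARD ioctl; the partition path, offset, and length are illustrative, and this destroys any data in that range, so be careful what you point it at.

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1p2", O_WRONLY);   /* the spare, never-touched partition */
        if (fd < 0) { perror("open"); return 1; }

        /* {offset, length} in bytes; here, discard 10 GiB from the start of the partition. */
        uint64_t range[2] = { 0, 10ULL << 30 };
        if (ioctl(fd, BLKDISCARD, &range) < 0) {
            perror("BLKDISCARD");
            return 1;
        }
        close(fd);
        return 0;
    }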
If I partition the entire drive, eventually all blocks will be used, depending on how the filesystem allocates, right? So to guarantee some free space, it's better to over-provision by under-partitioning. Now how do I ensure that on a used drive?
You could use some sort of disk quota system to make the filesystem artificially smaller than it actually is (trim after applying this change). Or simply ensure that you don't exceed 80%-90% used space.
It is also worth noting that many SSDs are over-provisioned by the manufacturer anyway; on those drives, manual over-provisioning might achieve very little.
This reminds me of a recent interview[0] by Digital Foundry with the Core Technology Director of Ratchet and Clank: Rift Apart.
Near the beginning they talk about how targeting the PlayStation 5, which has an SSD, drastically changed how they went about making the game.
In short, the quick data transfer meant they were CPU bound rather than disk bound and could afford to have a lot of uncompressed data streamed directly into memory with no extra processing before use.
A lot of talk about pages, but no mention about how big these pages are. From a quick look on Google, most SSDs have 4kB pages, with some reaching 8kB or even 16kB.
SSDs mostly tell the host system that they have 512-byte sectors or sometimes 4kB sectors, and the typical flash translation layer works in 4kB sectors because that's a good fit for the kind of workloads coming from a host system that usually prefers to do things (eg. virtual memory) in 4kB chunks. But the underlying NAND flash page size has been 16kB for years.
Emulating 4kB or 512B sectors when the underlying media has a 16kB native page size really doesn't add much more complexity on top of the stuff that was already required to handle the fact that erase blocks are multiple megabytes.
The complexity doesn't come from the emulation. It comes from trying to do the emulation efficiently based on assumptions about the behaviour of the other moving parts... which are also doing the same thing.
So, you've got firmware that is pretending you've got 512B/4kB chunks when really you have 16kB, and anticipating how the other layers might be doing things in order to maximize performance.
Then you have a filesystem/VFS layer, which tries to optimize its access patterns anticipating how the underlying solid state storage might really be doing things in 16kB sizes and how it might be optimizing 512B & 4kB accesses to fit that.
Both those layers are dealing with filesystem journaling and how that might impact performance.
Then you might have a database, which is now trying to anticipate how the filesystem and the underlying firmware might be optimizing access patterns, and so it's trying to optimize to fit all that.
You also potentially have application logic that is trying to anticipate how the database might do things...
What you tend to end up with are many layers of redundant caching that are all working against each other in a very inefficient manner.
A number of high-level techniques help rationalize data management and transfer, but the mileage of practical implementations may vary a lot. Generally speaking, only a small number of applications really need to take care and add a further layer of abstraction, because the best practices already codified into any widespread language do an acceptable job.
How big is the write cache usually and how does it work? Typically I've seen the write caches be something like 32MB in size, but the "top speed" seems to be sustained for files much bigger than 32MB, which doesn't make sense to me if that top speed is supposedly from writing to the cache. How does that work?
Getting full throughput from the SSD is less about file size and more about how much work is in the SSD's queue at any given moment. If the host system only issues commands one at a time (as would often result from using synchronous IO APIs), then the SSD will experience some idle time between finishing one command and receiving the next from the host system. If the host ensures there are 2+ commands in the SSD's queue, it won't have that idle time.
Then there's the matter of how much data is in the queue, rather than how many commands are queued. Imagine a 4 TB SSD using 512Gbit TLC dies, and an 8-channel controller. That's 64 dies with 2 or 4 planes per die. A single page is 16kB for current NAND, so we need 2 or 4 MB of data to write if we want to light up the whole drive at once, and that much again waiting in the queue to ensure the drive can begin the next write as soon as the first batch completes. But you can often hit a bottleneck elsewhere (either the PCIe link, or the channels between the controller and NAND) before you have every plane of every die 100% busy.
If you're working with small files, then your filesystem will be producing several small IOs for each chunk of file contents you read or write from the application layer, and many of those small metadata/fs IOs will be in the critical path, blocking your data IOs. So even though you can absolutely hit speeds in excess of 3 GB/s by issuing 2MB write commands one at a time to a suitably high-end SSD, you may have more difficulty hitting 3 GB/s by writing 2MB files one at a time.
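Working through the figures in that example:

    4 TB drive / 64 GB per 512 Gbit die        ≈ 64 dies
    64 dies x 2-4 planes x 16 kB per page      = 2-4 MB to keep every plane busy at once
    (and roughly that much again queued so the drive never sits idle between batches)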
It varies quite a bit. There are two different types of caches: SLC and DRAM. Most drives use SLC caching, higher end drives often use both.
Typically the SSDs with DRAM have a ratio of 1GB DRAM per TB of flash.
SLC caching is using a portion of the flash in SLC mode, where it stores 1 bit per cell rather than the typical 2-4 (2 for MLC, 3 for TLC, 4 for QLC) in exchange for higher performance. SLC cache size varies wildly. Some SSDs allocate a fixed size cache, some allocate it dynamically based on how much free space is available. It can potentially be 10s of GBs on larger SSDs.
The 1 GB DRAM per 1 TB Flash is to store the Flash Translation Layer mapping from logical addresses of the host system to the physical address in Flash. The write cache is separate and much more limited in size.
The DRAM you're referring to is for the most part not a write cache for user data. Most of that DRAM is a read cache for the FTL's logical to physical address mapping table. When the FTL is working with the typical granularity of 4kB, you get a requirement of approximately 1GB of DRAM per 1TB of NAND.
Drives that include less than this amount of DRAM show reduced performance, usually in the form of lower random read performance because the physical address of the requested data cannot be quickly found by consulting a table in DRAM and must be located by first performing at least one slow NAND read.
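Roughly where the 1 GB per 1 TB ratio comes from, using the 4 kB mapping granularity mentioned above:

    1 TB / 4 kB per mapping entry      ≈ 268 million entries
    268 million entries x ~4 bytes     ≈ 1 GB of DRAM for the mapping table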
If you leave un-partitioned space on the SSD, how the heck does the SSD know it is ok to erase it? Wouldn't it be safer to partition it as an extra drive letter, format it, and then leave that drive alone? That would allow the OS to trim all the "empty" blocks.
Not 100% sure what you are replying to, and not sure what you meant by "safer", but this may help:
The actual physical address on the storage chip and the physical address from the operating system's perspective don't have much to do with one another. For hard drives, "un-partitioned space" means that there is a physical "chunk of metal" that is unused.
However, that's not the case for SSDs. SSDs dynamically remap "OS-physical" block numbers to whatever they want. (Preferably addresses that have never been used before or that have been discarded/trimmed. If there aren't any available, perhaps to the address that was previously used for the same block number.)
>Not 100% sure what you are replying to, and not sure what you meant by "safer", but this may help:
I'm replying to the whole of the comments on this article. The write amplification problem goes up as the number of "free" sectors/blocks goes down. Many solutions have been presented that don't allocate X% of the drive... but I'm not sure that any of them let the SSD's controller know those blocks aren't allocated.
For that to happen, the OS has to have TRIM support, AND the block in question has to be on a volume that the OS is managing.
My worry is that if you have a blank partition, it's not being actively managed by anything, and thus isn't going to be TRIMed, and thus the SSD doesn't know the blocks are free for use.
Thus, leaving an unpartitioned area isn't going to help.
How could the drive know? TRIM commands are the only way to "free" a sector/block for writing. The drive might have been tested during setup, so the block might not be empty any more.
If sequential and random reads are mostly the same on SSDs, does that make the distinction between columnar and row-based databases/data storage less important?
Nope, unless your columns are all several kB wide. If you force the hardware to perform a multi-kB read for each 64-bit value you need, you're still going to waste a lot of potential performance.
The claim about parallelism isn't true. Most benchmarks and my own experience show that sequential reads are still significantly faster than random reads on most NVME drives.
However, on an SSD random read performance is only somewhere between a third and half as fast as sequential, compared to a magnetic disk where it's often a tenth as fast.
At what kind of queue depth do you test the read performance? Sequential reads can be made fast at low queue depth by the SSD controller doing prefetch reads internally. I've worked on such algorithms myself.
Show me a benchmark at any queue depth where random reads are as fast as the fastest sequential rate for that drive. It's simply not true.
I suspect it has something to do with prediction on the controller but I'm also not confidently spewing a bunch of bullshit about drive architecture unlike this article.
There's nothing whatsoever I should need to know about SSDs as a Javascript programmer and if there is then the programmers on the lower levels haven't done their jobs right and are wasting my time
So... interesting topic. Last year I experimented with some C# + Samsung 970 Evo Plus NVMe + MessagePack (with compression) + ZFS... to benchmark how fast I could dump objects from .NET memory to disk.
The numbers involved were insane, and I played with various scenarios: with/without compression (MessagePack feature), with/without the typeless serializer (MessagePack feature), with/without async, and then the difference between using sync vs async and forcing disk flushes. I also weighed the difference between writing 1 fat file (append only) or millions of small files. I also checked the difference between using .NET streams versus using File.WriteAllBytes (C# feature, an all-in-memory operation, good for small writes, bad for bigger files or async serialization + writing). I also played with the number of objects involved (100K, 1M, 10M, 50M).
I cannot remember all the numbers involved, but I still have the code for all of it somewhere, so maybe I can write a blog post about it. But I do remember being utterly stunned at how fast it actually was to freeze my application state to disk and to thaw it again (the class name was Freezer :p).
The whole reason was, I started using ZFS and read up a bit about how it works. I also have some idea of how SSDs work, some idea of how serialization and writing to disk work (streams etc.), and a rough idea of how MySQL, Postgres, and SQL Server save their data files to disk and what kind of compromises they make. So one day I was just sitting, frustrated with my data access layers, and it dawned on me to try and build my own storage engine for fun. I started by generating millions of objects in memory, which I then serialized with MessagePack using a Parallel.ForEach (C# feature) to a Samsung 970 Evo Plus to see how fast it would be. It blew my mind, and I still don't trust that code enough to use it in production, but it does work. Another reason I tried it out was that at work we have some Postgres tables with 60M+ rows that are getting slow, and I'm convinced we have a bad data model + too many indexes and that 60M rows is not too much (since then we've partitioned the hell out of it in multiple ways, but that is a nightmare on its own, since I still think we sliced the data the wrong way, against my intuition of where the data has natural boundaries; time will tell who was right).
So I do believe there is a space in the industry where SSDs, paired with certain file systems, using certain file sizes and chunking, will completely leave SQL databases in the dust, purely because of how each of those things works together. I haven't put my code out in public yet and have only told one other dev about it, mostly because it is basically sacrilege to go against the grain in our community, and saying "I'm going to write my own database engine" sounds nuts even to me.
I encourage anyone to go write their own little storage engine for fun. It will force you think about IO, Parallelization, Serialization, Streams, and backwards compatibility.
It is really fun (and not even that hard), and even if it works I still recommend against using it in production, but it will help take some of the magic away from how databases work and reveal the real challenges. The really difficult part for me comes from building a query language, parser and optimizer (like SQL) and handling concurrent writes properly. It is still difficult for me to comprehend how something like an SQL query string gets converted into instructions that pull data out of a single file (say an SQLite file), where that file's structure on disk can be messy and unknowable upfront when SQLite gets compiled. You essentially have a dynamic data structure, and you are able to slice & order the data however you want; it is not known at compile time with hard-coded rules. So I think in that regard SQL adds a ton of value, and SQLite is still my go-to for most flat-file scenarios.