Is sequential IO dead in the era of the NVMe drive? (jack-vanlightly.com)
163 points by eatonphil on May 9, 2023 | 174 comments



Moving from spinning rust to solid-state storage dramatically improved the performance of random reads, but random writes still carry a penalty compared to sequential writes. So sequential I/O is perhaps half-dead; there's no need to optimize layouts for sequential reads any more.


Yes. And the Flash Translation Layers inside modern drives are quite sophisticated. This means you could start out with a write pattern that causes a lot of copying and reorganization of pages in the background, typically because it's doing a lot of unaligned random writes smaller than the erase block size, and initially the performance will be fine. But as the drive wears and the capacity fills up, the FTL has less spare space to work with, forcing this garbage collection activity to become increasingly common. So you get performance degradation and premature wearing of the drive in a way that's opaque.


Short stroking SSDs is a commonly used trick in the toolbag of wise sysadmins for dealing with this kind of workload.


Umm sorry, no. I'm a storage engineer with one of the leading enterprise vendors. SSDs are already "short-stroked." They have much more space available internally than you can see through the controller externally. The more "enterprisey" the drive, the more of that hidden space there is.

A wise sysadmin buys a drive with more of that hidden internal space, instead of getting a larger drive with less hidden space. The logic inside the drive is much better at zeroing out and background trim on that internal hidden space than on host-addressable blocks that are currently unused. That's because the drive has no idea if you're about to use them, while the hidden space is guaranteed to be always free.

In fact, fun fact: flash has been used as a write buffer on storage arrays for a couple of decades.


Most of this hides the problem. It still exists though.

I have a piece of logic in my indexing code that essentially transposes an ~100 GB multi-value dictionary on disk. If you do this the naive way with random writes, the write amplification makes it a complete non-starter. All the caching layers and buffers fill up with completely disjointed 8 byte writes and it takes ages to write.

What I've ended up doing is, in an intermediate stage, writing the data into a series of files containing pairs of offsets and data (up to like 100 MB each), and then going over the files one by one and essentially evaluating them like assembly instructions.

Both passes have relatively good data locality, and despite essentially writing 2.5X as much shit to disk, it takes hours rather than weeks to do this operation.
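
A minimal sketch of that kind of two-pass approach — the record format, chunk span, and file names are made up for illustration, not taken from the actual code:

  import os, struct

  PAIR = struct.Struct("<QQ")         # (target offset, 8-byte value)
  VAL = struct.Struct("<Q")
  CHUNK_SPAN = 100 * 1024 * 1024      # offset range covered by one intermediate file

  # Pass 1: instead of seeking all over the output, append each (offset, value)
  # pair to the intermediate file covering that offset range -- sequential writes.
  def emit(pairs, tmpdir="tmp"):
      os.makedirs(tmpdir, exist_ok=True)
      files = {}
      for offset, value in pairs:
          bucket = offset // CHUNK_SPAN
          f = files.setdefault(bucket, open(f"{tmpdir}/chunk-{bucket:06d}.bin", "ab"))
          f.write(PAIR.pack(offset, value))
      for f in files.values():
          f.close()

  # Pass 2: replay each intermediate file like a list of "store" instructions.
  # Every write from one file lands inside the same ~100 MB window of the output,
  # so the page cache absorbs them. out_path must already exist at full size.
  def apply(out_path, tmpdir="tmp"):
      with open(out_path, "r+b") as out:
          for name in sorted(os.listdir(tmpdir)):
              with open(os.path.join(tmpdir, name), "rb") as chunk:
                  for offset, value in PAIR.iter_unpack(chunk.read()):
                      out.seek(offset)
                      out.write(VAL.pack(value))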


Short stroking in the SSD era is more about trying to keep your data within a page of the SSD. Depending on the flash drive you are using, the controller may or may not do some of it for you. One of the easiest perf gains I can sometimes get from a program that is write intensive is to put a small amount of write buffer into the mix. Depending on your OS that can help a lot too. Basically keep my code out of the kernel and off the bus, and keep the write block to something that closely resembles what the drive considers a block. If you are doing a bunch of small 8 byte writes randomly in your file you probably are having a bad time, as you may quickly exceed the amount of buffer the drive has for that sort of thing. It will start backing you off, which bleeds up into the kernel space and then very quickly into your program. Keeping them together can help, as you found out.

I see this sort of issue in a lot of programs, as it is a dead easy problem to create: you need to write something out, so you just splat it out somewhere, with dozens of 1/4/8/16 byte writes instead of one big write. Basically not thinking about how that data is getting into your files. Most of the time that is just fine and not that big of a deal. But as your data set grows or you want better perf you have to worry about it. I usually use something like filemon and can see what is going on. You can see the pattern where there will be a large stack of I/O with hundreds of very small reads and writes. While SSDs are an order of magnitude faster than the older drives, they still have their command structure and kernel context switching you have to deal with. You in some cases want to minimize that, as it can become a large portion of writing and reading data. As with most optimizations (in this case the drive is faster) we just moved where the bottleneck is (to the kernel and bus typically).
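
A rough sketch of that kind of write buffering; 4096 is only a guess at the drive's preferred write size, so check what your device actually reports:

  import os

  class BlockBufferedWriter:
      # Coalesce many tiny writes into block-sized writes so only one
      # syscall (and one drive-sized write) goes out per block.
      def __init__(self, path, block_size=4096):
          self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
          self.block_size = block_size
          self.buf = bytearray()

      def write(self, data: bytes):
          self.buf += data
          while len(self.buf) >= self.block_size:
              os.write(self.fd, bytes(self.buf[:self.block_size]))
              del self.buf[:self.block_size]

      def close(self):
          if self.buf:
              os.write(self.fd, bytes(self.buf))   # flush the partial tail
          os.close(self.fd)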


Short stroking doesn't keep writes within a page. It gives the FTL more time to shuffle data around in pages when garbage collecting and erasing. Erasing tends to take milliseconds vs microseconds for reads, and tens to hundreds of microseconds for writes. NAND flash is typically written in smaller pages (512 bytes in older products vs 4KB-16KB now) but erased in larger blocks (32KB+). It's only NOR flash that supports smaller byte-granularity write sizes. NOR flash is not as dense as NAND, so it doesn't get used much beyond embedded applications.


Exactly. Also, my point was more about keeping yourself from calling into the kernel and paying the price for the context switches, which can add up quickly.


Context switches between processes are bad, but going into the kernel for an irq using things like aio and io_uring which can populate events into a result ring buffer doesn't add a ridiculous amount of overhead. Optane DIMMs avoided that, but the consequences were such that the hardware had a lot more complexity. Doing something like an FTL at DDR4 speeds and latency is very difficult. Plus you don't really want to expose hardware that requires wear leveling directly to applications, as a buggy app can wear out (damage) the hardware prematurely.


> A wise sysadmin buys a drive with more of that hidden internal space, instead of getting a larger drive with less hidden space.

The pricing of enterprise SSDs is... not exactly fair for the performance you get.


that's like complaining a bentley is too expensive and not good at moving your piano.

when you get fined $1mil/min by the SEC for downtime, and data loss or corruption can cost you hundreds of millions, or someone dying at a hospital, enterprise gear is cheap. reliability is the key, and that's what you are paying for. now yes, there may be a specific consumer drive more reliable than a specific enterprise drive. so which do you buy? well, the enterprise array vendor tested the crap out of everything in every combination and workload and environment, and picked one for you, and it comes with full support and SLAs that you can use to meet gov regulations.

performance for an enterprise drive is not something you usually consider, at all. in fact, did you know that when I quote a storage array, I can't even specify the drive type or vendor, and who knows what will get shipped? I only specify drive size.

your performance comes from all your workloads clumped together, spread over a thousand of these drives connected with infiniband, with dedupe and compression on the backend, and hundreds of terabytes of RAM. the perf of an individual drive is not relevant.

but yes, sticking it into an AMD server you bought on newegg when they had a sale is not its purpose, and is a very bad deal.

now when we talk about wise sysadmins, we're talking about guys who know their stuff, and do "important big stuff." not a guy at a small business ordering from newegg. and that guy - he shouldn't be coming up with any storage policies because he lacks the needed large-scale experience.

I sold a 1PB usable-effective (after 3x dedupe/compression) storage array last year. It was $2mil, after a 65% discount. If one time in its 5-year lifecycle the "enterprisey" stuff on that array prevents about 10 seconds of downtime, it paid for itself.


Sure, and judging by System Z sales there's still a demand for earthquake-resistant machines with hot-swappable CPUs.

But for most of us, infrequent downtime is acceptable, and most single-machine downtimes are either automatically mitigated or have very limited impact. In that scenario, getting a prosumer SSD and over-provisioning it can be a sensible choice.


* Hetzner joined the chat


There are many other ways to get reliability at the software level, usually via redundancy. Even enterprise-level SSDs don't guarantee freedom from data loss and downtime; you have to back them up regularly and preferably set up high availability for the database.


cool. so I have a few thousand VMs of various OSs and hypervisors, a few hundred home-built applications with an average of 10 components each, some mainframe, some ibm-i, a bunch of linux, solaris, aix, and HP-UX servers, and I'm using mysql, sql server, postgres, oracle, and a couple of DB2. there may be an informix DB or two somewhere - I don't know, I manage storage. about ten different volume managers, about 30 different filesystems - and some raw disks of course. 10PB total space, it needs to be metro cluster replicated ten miles away for active/active, and async in a different state can be about a minute behind max.

So, write me some software that's going to make all that work and make the storage highly available. make sure it works for everything.

you're exposed to solutions that make that work, daily. when you swipe your credit card buying condoms - you have no idea how much happens so your transaction doesn't get lost, corrupted, or errored out.


> you have no idea how much happens so your transaction doesn't get lost, corrupted, or errored out.

Maybe he doesn't, maybe he does - you don't know nor do I.

I'm pretty sure this is how IBM salesmen used to respond when confronted with those newfangled Unix systems which were starting to appear here and there, nibbling first, then taking larger bytes out of their market share. Instead of the litany of diverse systems they'd have thrown LPARs, SYSPlexs and ESMs around but in the end it still came down to the same thing: this stuff is too complicated to be left to amateurs. They were right, in a way... until those amateurs grew their wisdom teeth and took a large part of their market away from them.

Yes, "enterprise" stuff is complicated - often overly so [1] - and it has its place. This does not make it the only viable solution to these problems, something will eventually come up to eat your lunch just like IBM saw its herd of dinosaurs being overtaken by those upstart critters from the undergrowth. Maybe some smart software system which "guarantees" data reliability and availability without the need for "enterprise" storage devices? It wouldn't be the first time after all.

[1] https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...


It depends on what kind of performance you're looking for and how you value that performance.

Under sustained, long-duration (i.e. hours to days) continuous full load performing small block writes, even the highest rated consumer SSD drives will begin to show significantly reduced throughput in comparison to pretty much any enterprise SSD drives.

Additionally, it's totally normal to see >=5 drive writes per day and 5 year warranties on enterprise drives. Consumer drives usually are rated significantly below 1 drive write per day and rarely for more than 3 years of warranty. If you're performing a lot of writes, you're going to need to replace worn out SSDs and that's not free in a business (remote hands, downtime, etc). So buying less durable storage has costs which don't show up on the original purchase invoice but do need to be factored in within a business setting. The warranty period is the best indication of how confident the drive vendor is about drive durability.
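
To put rough numbers on it, a DWPD rating plus the warranty length implies a total write budget. The figures below are hypothetical, just to show the arithmetic:

  def rated_writes_tb(capacity_tb, dwpd, warranty_years):
      # A drive-writes-per-day (DWPD) rating implies a write budget of
      # capacity * DWPD * days in the warranty period.
      return capacity_tb * dwpd * 365 * warranty_years

  print(rated_writes_tb(1, 5, 5))    # enterprise-ish: ~9125 TB (~9 PB)
  print(rated_writes_tb(1, 0.3, 3))  # consumer-ish:   ~329 TB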


More total space for the same price changes that equation.

Spending 2x+ per GB means that, for the same budget, instead of 5 drive writes per day on capacity X you get 5 drive writes per day on half of X or less total space. And that's before you consider write amplification issues with having dramatically less total SSD space.

Enterprise SSDs have a few benefits, but they shouldn't be your default choice for all servers.


> Consumer drives ... and rarely for more than 3 years of warranty.

Looking at my list of consumer SSDs and their warranties:

Samsung EVO 850: 2y

Samsung EVO 960: 3y

Samsung EVO 860: 5y

Samsung EVO 970: 5y

Samsung PRO 970: 5y

For comparison with spinning rust HDDs:

Seagate Ironwolf: 3y

WD Red (both EFAX & EFRX): 3y


Consumer drives usually exclude wear-out in their warranties, making the number of years pretty much irrelevant. What it does show is that vendors are more confident in their FTLs and hardware nowadays compared to the first high-volume SSDs on the market more than a decade ago.


But take a wider sample of consumer SSDs, a large majority are <=3 years. And look at enterprise SSDs.

Seagate Ironwolf and WD Red are not enterprise spinny hard disk drives, instead look at Seagate X18 or WD HC560.


Yes, I know, I was comparing warranty on consumer SSDs and consumer HDDs.


I always complained to our IT group about storage - until they showed me an invoice for the drives they are buying.


wait till you see your Cisco bill


I am well aware that flash is overprovisioned. On consumer SSDs the overprovisioning is quite small (maybe 5-10%), so short stroking the SSD will have a much greater effect than on enterprise SSDs. It also has the nice side effect of giving the drive more free space to work with, which allows garbage collection to be more gradual and efficient, which can improve performance for more write-intensive workloads.

Actually, many storage arrays use battery backed RAM, not flash as a write buffer. Flash does not have the endurance needed to serve as the write buffer for a large storage array. Some products that use battery backed RAM for this purpose will dump the contents of that RAM onto flash. I worked on a messaging appliance that used that approach, albeit with supercaps to provide the hardware time to dump 4GB of DRAM onto a compact flash card. Supercaps were easier to monitor and maintain. There are also plenty of hardware RAID cards that use batteries for their write buffer as well.

Edit: there are also persistent memories like MRAM that don't need power to retain their contents which are used in this space as well.


yes, big arrays (symmetrix, shark) have battery backed everything - enough to take all the cache and dump it to disk when power is lost. terabytes of RAM. they also often use a flash cache behind the RAM - that was the primary use for Optane. The reason for this is sustained writes instead of peaks. If you let more and more sit in cache, eventually you have write folding. That results in less write IO on the backend, and you're able to have a higher sustained peak.

smaller cheaper arrays (unity, pure, isilon) and HCI (nutanix, vxrail) that don't have terabytes of RAM - they have gigabytes, and pretty much always use flash as a write cache. In fact, I cannot name one that doesn't.

No one in enterprise storage cares about the endurance of flash. All that means is that twice a year, a vendor engineer comes out to replace a few flash drives under your support contract.

And flash has been used as a write cache behind a smaller RAM cache, for two decades, by all major storage vendors.


I wish that it was easier to trim unused unpartitioned space. I find myself fumbling on BSD and Linux trying to remember how to get the exact offsets for that. You could write zeroes but there isn't a guarantee that the FTL will interpret that as unused.


Just make another partition for the unused space, and run blkdiscard on it?


Right, and you can use fdisk and gdisk to see the partition table and determine the offsets


I might put together a script for that because of the danger of making a mistake on a live system.
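
Something like this sketch, perhaps. The device naming, sysfs layout, and GPT margin are assumptions, and it only prints the blkdiscard command so the offsets can be sanity-checked before touching a live system:

  #!/usr/bin/env python3
  import glob, os, sys

  SECTOR = 512                       # sysfs 'start'/'size' are in 512-byte sectors
  disk = sys.argv[1] if len(sys.argv) > 1 else "nvme0n1"   # hypothetical device

  disk_sectors = int(open(f"/sys/block/{disk}/size").read())

  # Find where the last partition ends, in sectors.
  last_end = 0
  for part in glob.glob(f"/sys/block/{disk}/{disk}p*"):    # e.g. nvme0n1p1, p2, ...
      start = int(open(os.path.join(part, "start")).read())
      size = int(open(os.path.join(part, "size")).read())
      last_end = max(last_end, start + size)

  # Leave a margin at the end of the disk for the GPT backup header.
  gpt_backup_sectors = 34
  free_sectors = disk_sectors - gpt_backup_sectors - last_end
  if free_sectors <= 0:
      sys.exit("no unpartitioned tail space found")

  offset = last_end * SECTOR
  length = free_sectors * SECTOR
  print(f"blkdiscard --offset {offset} --length {length} /dev/{disk}")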


I’m guessing that means pretending they’re smaller than they really are?


Yes. The term dates back to hard drives, where using only a fraction of their capacity would minimize the worst-case distance the heads needed to travel, reducing the portion of seek latency due to head movements (but not helping with rotational latency).

On most hard drives, the beginning of the logical block address space corresponds to the outer edges of the platters where the bits are going past heads with the highest linear velocity, so sequential throughput is higher than elsewhere on the disk.


Looks like that's the case based on a cursory google search. Seems analogous to only charging batteries up to 80%. I wish companies would factor these issues into their products and leave invisible buffers, but I guess making them cheaper is more important than being consistent and reliable.


Most consumers make purchasing decisions against these buffers.

If you have two identical electric cars for the same price, one with a range of 350 miles and one with a range of 280 miles, which one would you buy?

If you have two identical phones, one with a 4300 mAh rated battery and another with a 3400 mAh battery, which one would you buy?

If you have two equally priced SSDs, one with 500GB capacity and one with 480GB, which one would you buy?

To enterprise customers the sales rep will explain "look, this device has a bit less space, but the write endurance number quoted here is much better, over 5 years and 5000 drives this will lower your TCO by X". In the consumer space you build your entire brand around one price-reliability tradeoff and stick to it. Usually the cheaper one with the bigger headline stat sells the most, and everyone else has to make up the lost volume with margin.


They do leave invisible buffers, commonly referred to as over provisioning. The discrepancy between GB and GiB is ~7% hiding in plain sight, and then more expensive enterprise drives are commonly sold in (usable) capacities like 960GB and 7.68TB rather than 1TB and 8TB.


1 TB = 931 GiB

The value difference is not because of invisible buffers! Marketing material and usable OS space are measured in different units, but the usable number of bytes in each value is exactly the same.


No, there really are invisible buffers resulting from this discrepancy. The raw flash chips have nominal capacities that align with power of two sizes, but the drives use decimal-based units. So a drive advertised as 1TB will have a usable capacity of approximately 1TB, but is assembled from flash chips with a total capacity of at least 1 TiB.


Yes, there are invisible buffers, but the discrepancy doesn't come from this. Drive makers count the space in 1000-multiple units (i.e. 1GB = 1000*1000*1000 bytes) whereas in computing, units are usually counted in 1024-multiples (i.e. 1GiB = 1024*1024*1024 bytes).

So 1TB = 1000*1000*1000*1000 bytes/1024/1024/1024 = 931GiB (or 0.9TiB)


Did you notice the part where I explained how drive makers use the decimal units but the chips they source to build those drives are built in binary capacities?

I don't give a shit whether your operating system likes to show you disk usage in binary or decimal units, because I'm not talking about software at all. I'm explaining how the hardware is built. I especially don't need another person to try to explain what the binary and decimal units are, after I've repeatedly used both correctly.


Some flash based drives will even report this value in SMART, the amount of the over-provisioned space that has been used.

Modern QLC has very low write endurance so those drives need to have spare space to use when it starts to wear out.


I think there are two things you may be referring to. First, some drives support thin provisioning of namespaces, and can report the current amount of storage actually being used to back the namespace(s), relative to the maximum user-accessible capacity.

Second, in the SMART/drive health data, there's usually a counter tracking how many reserve/spare blocks remain usable as in not worn out and retired. That's a one-way counter; deleting data won't un-retire defective or worn-out blocks.

It's almost unheard-of for a drive to directly expose a realtime counter of how many blocks are currently unallocated, not being used to store data, and in the erased state making them available to accept new writes.


Sales and marketing will never leave anything on the table that might make their job easier.


Why not just buy SLC?


SLC still has a minimum block size that gets used during writes.


Does that even exist anymore?


Only in very small capacity raw flash chips, generally less than 1GB in size. This is still somewhat popular in embedded systems because it's often inexpensive. But it's basically impossible to purchase an SLC flash-based drive with a modern managed interface (NVMe, SATA, eMMC, SD, etc) today.


The discard operation mitigates this situation by marking blocks freed by the filesystem as empty so the GC doesn't have to copy them. Thus, free filesystem space will eventually be consolidated.


> So sequential I/O is perhaps half-dead; there's no need to optimize layouts for sequential reads any more.

Yes and no. Large contiguous IOs are still faster to read than a bunch of random small sectors. You generally want your frequently read files/blobs to be split into as few operations as possible.

https://dl.acm.org/doi/abs/10.1145/3477132.3483593


This bears a strong resemblance to RAM and caches where cache lines are relatively large so you want to optimize your algorithms to touch as few "cache lines" as possible. So even in RAM random IO is slower than sequential IO.


Yeah. Causing additional page accesses can also be expensive due to the TLB impact.


There's still a ton of performance benefit that can be gained by keeping your read pipeline full, and you can most easily keep your read pipeline full by making it guessable by doing readahead. It doesn't need to be strictly disk-layout sequential, file-layout sequential will usually do.
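
For file-layout-sequential reads, one small way to make the pipeline guessable is to tell the kernel about the access pattern up front and read in large chunks. A Linux-specific sketch with a made-up path:

  import os

  fd = os.open("blobs/segment-0001.dat", os.O_RDONLY)   # hypothetical file
  os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)  # length 0 = whole file; primes readahead

  CHUNK = 1 << 20   # 1 MiB reads keep the request pipeline reasonably full
  while True:
      buf = os.read(fd, CHUNK)
      if not buf:
          break
      # ... process buf ...
  os.close(fd)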


Sequential reads will still generally be faster at least for small ones, just that NVMe are fast enough that this rarely matters


To add to that, the penalty for random writes is paid not only with performance but also in durability of the drive.


Sort of. Random writes are fine as long as they are as large as the physical write block size (not the virtual 512b or 4k) and there's a queue depth high enough to keep the drive busy. Unless you are lucky enough to have Intel's (no longer made) Optanes that give nearly 100% of performance with low queue depths.

This avoids the ugly read/modify/write, as well as much of the behind the scenes magic for minimizing write amplification and shuffling blocks around.


Flash controllers provide an abstraction layer that should turn any write workload into a sequential one, freeing software developers from having to care about it.


That abstraction layer cannot always avoid creating write amplification when you send it writes that are too small.


It can but it involves additional complexity in the controller. Some definitely do a much better job than others.

Most controllers handle the block to page mapping problem well. They just mark the old block in the map as dead and write a new block on a new page. GC will later erase pages, hopefully prioritizing pages with the most dead blocks (subject to wear-leveling concerns).

But that same concept can be extended to partial block writes. It complicates the read process but there is no reason the controller can't coalesce multiple partial block writes into special partial update blocks and update the map accordingly. Basically write a record saying byte range (X,Y) was overwritten with just those bytes - no need to read the old block, merge the change, and write the whole block back out. Obviously you need to handle things like read chain thresholds (too many partially overlapping updates).

Let the GC handle merging the partial updates into the full block during idle periods. Again prioritize doing this for blocks on pages that are going to be erased - you had to rewrite those blocks anyway, turning what would be write amplification into a "free" write.

Like I said a lot of controller firmware doesn't bother but that doesn't mean it is impossible or unprofitable to do.
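
A toy model of that partial-update idea, nothing like real firmware, just the bookkeeping: partial overwrites are logged as byte-range records, reads replay them over the base block, and a GC step folds them back into a full block when it's rewriting anyway:

  BLOCK = 4096

  class ToyFTL:
      def __init__(self):
          self.base = {}     # lba -> last fully written 4 KiB block
          self.deltas = {}   # lba -> list of (offset_in_block, data) partial updates

      def write_full(self, lba, data):
          assert len(data) == BLOCK
          self.base[lba] = data
          self.deltas.pop(lba, None)

      def write_partial(self, lba, offset, data):
          # No read-modify-write: just record which byte range changed.
          self.deltas.setdefault(lba, []).append((offset, data))

      def read(self, lba):
          buf = bytearray(self.base.get(lba, bytes(BLOCK)))
          for offset, data in self.deltas.get(lba, []):   # oldest first, newest wins
              buf[offset:offset + len(data)] = data
          return bytes(buf)

      def gc(self, lba):
          # Merge deltas back into a full block, e.g. when the read chain gets
          # too long or the page holding the block is about to be erased anyway.
          self.write_full(lba, self.read(lba))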


I didn't say it was impossible, I said it is not always possible. Even the most advanced SSD firmware will cause write amplification for some pathological workloads.


That's fair. For many systems you can come up with a pathological workload that defies any existing optimizations.


Which is why having a nice chunk of RAM on the disk itself is useful. 2x write queue size x block size is nice.


It was the Nautilus file manager in Ubuntu that used to drag if you opened a folder with a few thousand files. It was essentially doing at least one request for each file when you opened a folder, and this was limited by the spin rate of the HDD. So if you had 5,400 files and a 5400RPM drive, yep, that would take a full minute to resolve.

Eventually the solution really just became: get an SSD, because you could throw thousands of requests at it and the result was fast enough.


You'll still find that the gtk file managers choke hard when you have hundreds of thousands of files in one directory (such as screenshots), taking minutes to load or even hanging and crashing. Both pcmanfm-qt and dolphin handle them fine and load within 30 seconds. I last tested this a year or two ago and every gtk file manager was roughly equally bad. Made me switch from pcmanfm to pcmanfm-qt (I didn't like dolphin as much).


While I do not doubt that Nautilus did that, I don't think it's a given that a 5400 RPM drive will take exactly a minute to scan through 5400 files.


I should not have said that so literally, but it was more an example of how the physical drive can limit these things.


a sensible conclusion


Far from it, sequential IO optimization isn't dead.

Sequential reads need to be optimized to produce optimal performance for the selected workload. That usually means applying a lower level of inline compression.

In some cases deduplication works better before, in other after, compression. Sometimes post-process dedupe is more suitable than inline.

Then there's erasure coding and data protection methods that are still being optimized for NVMe with sequential workloads, including random workloads which are being sequentialized to work better with latest flash media.

I would even say developments in sequential IO are becoming more important than random IO.


I always worry that hardware "locks in" software. SSDs showed up after everyone wrote their databases to be optimal on spinning rust, so SSDs were built to look at the flow of requests, assume that the application was written for spinning rust, and optimize the data layout accordingly. Whether or not this is as good as brand-new software that managed the layout itself is debatable, and nobody will write that software anyway because they won't have users that have SSDs that let the database manage the disk in full. So the software has to tune it for one manufacturer's reorganization algorithm and hope for the best. The path to a "global maximum" remains unclear because of this.

CPUs are similar. People started writing programs in C, so CPUs started optimizing for the outputs of C compilers. If someone were to invent CPUs and compilers today, the list of optimizations would probably be different, and the performance characteristics of real-world software would probably be different. (It's not just C; people wanted virtual address spaces, and that was slow in software, so now there is a TLB, etc.)

Actually, it goes even deeper. Half the startups I see on HN related to software operations have a quickstart like "don't worry, you don't have to rewrite your code, we'll magically do everything for you". Why not just refactor the code? It would take about an hour, and then you don't need a crazy Rube Goldberg machine to achieve your desired results. (If you want specifics, think service meshes, and how they now detect what language your app is written in so they can rewrite your HTTP handling functions to pass around a trace ID. Back in the day, you just set outgoing.Headers["x-trace-id"] to incoming.Headers["x-trace-id"]. Now we have layers and layers of OS-level machinery that still don't do as good a job as spending half a day adjusting your codebase. Billions of dollars invested in saving half a developer day! Wow!)


Probably the biggest ground-up rethink of SSDs that I've seen is the Samsung SSDs with a key-value store mode.

Like, think about it, a filesystem is really a database, right? Copy-on-write even makes the "transactionality" explicit. And a lot of high-performance databases will go farther and skip the filesystem and treat the device as block storage... which it is. The filesystem is a leaky abstraction with filesystem blocks and flash pages and flash block erasure, etc.

Well, what if the SSD was just an object store? In that model you slice away a bunch of layers of abstraction and just let the SSD worry about all that. SSD says it's committed? Alright then, guess it is.

Obviously you are very much at the mercy of the SSD to implement ACID correctly however...


Optane was also a contender for the biggest fundamental shift - byte addressable persistent memory with insane endurance is simply in a different class to every other SSD technology out there. No need for a FTL. No need for a DRAM / SLC cache. Unfortunately no software really took advantage of it, OSes still provisioned it as 512/4096 byte sectors, etc. so in standard benchmarks it never really looked that compelling over a regular NVME drive. But if the access patterns align, nothing can come close. Truly a technology ahead of its time.


It always surprised me that Intel didn't market it better, such as forking Linux and writing some custom additions that could allow the Optane drives to demonstrate real-world 10x improvements.


Intel already has Clear Linux actually.

I don't think there's really a ton you can do at an OS level. ZFS L2ARC, system swap, etc, but a lot of the benefits would come from tuning DB configs/etc to suit the new hardware. If it's a fast SSD... ok are you the DBA?

Similarly, PDIMM is really best utilized by a ground-up rethink of applications operating in a persistent context (think of like, javacard multicore 1TB or whatever). If you treat it as slow RAM it's gonna be slow RAM. The point would be building systems and applications that exploit the "writing the memory is now writing the disk" idiom. The ideal ecosystem for PDIMM isn't Linux at all, it's JVM, or LISP/Haskell. Or android I suppose lol.


Optane was either the fastest SSD you ever had or the slowest RAM. Incorporating it into the memory hierarchy might not be a win if you are replacing fast RAM with slow Optane.


I assume for marketing figures, Optane would replace the SSD at a much higher cost.


Key value FTLs have been around for about as long as FTLs. The logical block address -> NAND page forward map is just a key value store after all. It can potentially take one layer of indirection out of applications but you still have one there (the FTL). Or alternatively exposing the raw NAND and having the software manage it is another common way to go (indirection is in software now instead of hardware/firmware).

When I worked on FTLs it was always a thing executives and product managers liked to talk about but didn't seem to provide incredible results and people generally didn't like using them because it was just harder / different to manage. Has that changed?


> Key value FTLs have been around for about as long as FTLs.

Is this similar to what IBM had since the 60s already on their hard disks (which they call "DASD"), namely "CKD" (count-key-data) storage?


fwiw Seagate also experimented with this for hard drives, it was pretty neat--ethernet-connected KV-addressed hard drives. I don't think it made it to an actual shipped product, but it did make it to "partner field testing phase".


Typical SSDs all assume your IO will be in 4kB-aligned chunks even though the storage is denominated in 512-byte sectors for compatibility reasons. So we have managed to move past the legacy of early hard drives to some degree, and are now being held back by assumptions that derive more from x86 page sizes.


Deeper than that! Such fundamentals as 'the stack is accessible by code' go back to Fortran. C doesn't need that. The call stack and the 'display' stack could be different things and no C programmer need care.

This is the root of attacks that rewrite the return address of a kernel call. If the return address were managed/protected outside normal data operation this would not exist.


Could you explain this in more detail, please? Fortran standards didn't support recursion until Fortran '90 and it wasn't easy to get stack allocation in F'77 compilers from vendors until the mid-80's or so, both long after Algol-like languages had them.


Fortran programmers abused the stack in astonishing ways. They used to calculate offsets on the stack to variables in the previous frame you already returned from. The details of stack layout were coded into important applications. Intel designers tried to do better, but the Fortran people pushed back, hard.


Wondering if they meant Forth rather than Fortran.


It's even less true for Forth, given that it has separate data and return address stacks, and neither is addressable. There is some dedicated Forth hardware that relies on this fact to separate the stacks physically in hardware and e.g. use SRAM for them.


> I always worry that hardware "locks in" software. SSDs showed up after everyone wrote their databases to be optimal on spinning rust, so SSDs were built to look at the flow of requests, assume that the application was written for spinning rust, and optimize the data layout accordingly.

It's not really that so much as contiguous or nearby access is faster on NAND, just like it's faster on hard disks, so software optimized layout and access patterns can look similar for both.

> CPUs are similar. People started writing programs in C, so CPUs started optimizing for the outputs of C compilers.

This didn't start as a one-way street. Actually CPUs were around first, so the first C compiler optimized for the CPU it generated code for. And from that point onward, C compilers optimize for the CPU.

> If someone were to invent CPUs and compilers today, the list of optimizations would probably be different, and the performance characteristics of real-world software would probably be different. (It's not just C; people wanted virtual address spaces, and that was slow in software, so now there is a TLB, etc.)

Unlikely. At the periphery yes, but there are fundamentally difficult things for CPUs and compilers to do which shape the solutions. Before somebody says "those fundamental difficulties might be different if we invented things differently" - the story of CPUs and of compiler optimization is about inventing ways to make those fundamental difficulties less difficult.

There wasn't one day somebody invented the CPU and that was that, millions of people around the world have worked countless hours inventing and improving parts of CPUs and compilers for the past 70 years or so. There is inertia, but new ideas that are good enough can catch on even if it means entirely new models and ecosystems have to be invented. That's how we got multiprocessors, clusters, SIMD, GPUs.


> It's not really that so much as contiguous or nearby access is faster on NAND, just like it's faster on hard disks, so software optimized layout and access patterns can look similar for both.

I worry about the pathological cases. Imagine you have an append-only log, and you write and fsync() one byte at a time. Each time you write a byte, the entire flash block (are they still 4KB these days?) has to be erased. So you end up chewing through 4000 durability cycles on your SSD, whereas if you had waited to write an exact 4KB block, then you'd use only 1.

A hard drive is fine with this access pattern, though you'd probably be doing a lot of seeking to update fs metadata and it would be so slow that you'd come up with some other way. So maybe this pathology only exists in my mind.
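
One common way around that pathology is group commit: buffer appends and pay one fsync per block-sized batch rather than per record, at the cost that the unflushed tail isn't durable yet. A minimal sketch with an arbitrary file name and block size:

  import os

  class BatchedLog:
      BLOCK = 4096

      def __init__(self, path="app.log"):
          self.f = open(path, "ab", buffering=0)
          self.buf = bytearray()

      def append(self, record: bytes):
          self.buf += record
          if len(self.buf) >= self.BLOCK:
              self.flush()

      def flush(self):
          if self.buf:
              self.f.write(bytes(self.buf))
              os.fsync(self.f.fileno())   # one durability point per batch
              self.buf.clear()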


> I worry about the pathological cases. Imagine you have an append-only log, and you write and fsync() one byte at a time. Each time you write a byte, the entire flash block (are they still 4KB these days?) has to be erased. So you end up chewing through 4000 durability cycles on your SSD, whereas if you had waited to write an exact 4KB block, then you'd use only 1.

NAND pages are what, something around 16-64kB these days, and block sizes are 10s of MB. In SSDs the logical size of those things may end up being larger at the FTL level if they are ganged together, but that's about your minimum.

Those are program and erase units respectively. With NAND, you do not erase a block to write. You can program pages in a block incrementally, and then you have to erase the entire block before reprogramming any.

Flash translation layers have to make this look like a disk. To do that they will do something like gather writes into a page size chunk in a small cache that is non-volatile or can persist itself on power failure. Then that chunk is programmed out to a free page. A mapping structure records the new NAND location of the logical block addresses you wrote. And a garbage collector comes along behind and compacts and frees data in block size units, erases them, and puts them on the free list.

It's a log structured filesystem with one file (the block device), if you've read any of those papers.

If you write+fsync to sector 0 of your disk 500 times, that data will get stored at 500 different places on the NAND (ignoring larger persistent caches in front of the NAND that some devices have). Selecting what pages to use, what to garbage collect, etc. is all part of the wear leveling that is intended to prolong the life of the drive. That's why endurance ratings tend to be in total writes to the drive, not writes to any particular sector.

To handwave the numbers, if you have a 100GB SSD that might be implemented with 105GB of NAND. Then if you had a program/erase endurance of 1,000 cycles you will be able to write that block 25.6 billion times. Your drive can do about 20-30 thousand QD1 writes per second, so about 12 days of that write+fsync block workload. Again assuming no front end cache on it.
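
Same handwave with the units spelled out (all numbers assumed, as above):

  nand_bytes = 105e9        # 105 GB of raw NAND behind a 100 GB drive
  pe_cycles = 1_000         # assumed program/erase endurance
  block = 4096              # the 4 KiB logical block being rewritten
  writes_per_sec = 25_000   # rough QD1 write+fsync rate

  total_writes = nand_bytes * pe_cycles / block
  print(f"{total_writes:.1e} block writes")                   # ~2.6e10, i.e. ~25.6 billion
  print(f"{total_writes / writes_per_sec / 86400:.1f} days")  # ~11.9 days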

You definitely can wear out NAND drives if you write a lot to them (large streaming writes would be easier to do it with), but regardless of what you do in the software layer, the drive will (should) last for its rated endurance no matter what kind of write patterns your application does. A block is a block as far as the NAND sees.

For consumer stuff you generally have to be pretty extreme or have malfunctioning software to wear them out as far as I've seen.

> A hard drive is fine with this access pattern, though you'd probably be doing a lot of seeking to update fs metadata and it would be so slow that you'd come up with some other way. So maybe this pathology only exists in my mind.


> Whether or not this is as good as brand-new software that managed the layout itself is debatable, and nobody will write that software anyway because they won't have users that have SSDs that let the database manage the disk in full.

Maybe it's not relevant to databases, but in my experience everything written with SSDs in mind has been horrifying. Compare two machines with nice NVMe drives running Windows 10/11 and Windows 7. Your user experience will be virtually identical. Swap the SSDs out for hard drives. The Windows 7 machine will get slower and have the many split-second delays we all remember from back in the day. The Windows 11 machine, however, will completely shit the bed. Massive multi-second delays crop up trying to do the most simple tasks.

So far developers, instead of taking advantage of SSDs, are just using them as a crutch.


Many companies refuse to test their software on low-end hardware, because effort. I've seen testers test software, get annoyed at how slow it was, and instead of letting that be a test result they just switched to a faster machine.


Didn't some CPUs have, or do some still have, instructions that can aid the JVM?


"aid" is very broad, so i'd say "yes"

but recently there was an article on hn that talked about an arm-instruction for some js stuff.


Jazelle is generally deprecated.


Only since 2005 ;)


Sadly I’ve never gotten my hands on Jazelle hardware :(


It's weird to me that there are so many promising forms of relief that seem near at hand, but which have simply never been delivered. Zoned Storage has existed for a while, for Shingled Magnetic Recording & tape, and there was much hubbub in 2018-2020 about NVMe getting Zoned Namespaces, which happened. https://zonedstorage.io/

And there's now a bunch of fs implementations. f2fs and btrfs both have some support.

But there are still no products actually available to buy. We could be getting so much better at NVMe, so systematically making things better. But we've kind of been stalled out for a while, after a bunch of ceremony figuring out what we wanted to do.


SMR and zoned storage offer complete and total utter crap performance in the real world. They have all the glass jaws of early flash firmware (garbage collection that took seconds to complete, during which time no additional I/Os completed) while running on a medium that has many orders of magnitude higher latency than flash. Take 2008-era USB flash drives and that's about what you'll get out of an SMR HDD when using a write workload that isn't 100% large sequential writes.

SMR might have become more relevant if it had provided a useful increase in density, but that never materialized as was originally promised. Retail prices for SMR drives were never much better than CMR. The largest generally available HDDs are available in CMR flavours because that's what's needed in real world servers. There's no 10x density improvement (or even 2x). Meanwhile flash is marching relentlessly down the cost reduction path afforded by Moore's law + 3D layer stacking. Really fast 1TB NVMe SSDs are under $100 now, and it doesn't look like flash cost reduction is going to slow down any time soon.


> SMR and zoned storage offer complete and total utter crap performance in the real world

Yes, that's because we're in the drive-managed transition phase of zoned storage (not clear when -if ever- we'll leave that phase). The first zoned disks needed to present themselves as a dumb unit to the OS, because none of the operating systems had any support for it.

Now that at least some OS'es have native support for zoned storage, we might finally see host-managed zoned storage on the market, but as you say -- the technology has not delivered on the projected storage benefits, so the value-add of CMR disks is very much an open question.

And there's also the consumer confusion angle: how to market a device that can physically connect to their computer, but might logically not work at all? So I'd expect to see the first host-managed drives in enterprise datacenters (think Amazon Glacier) -- but AFAIK they're nowhere to be found.


> the value-add of CMR disks

Argh. SMR, that is.


I wasn't really talking about SMR though, just some protocols they had that are now being re-used. I'm talking about Zoned Namespace SSDs (ZNS).

Let's go back in time almost a decade. A bunch of smart people had figured out that the cumbersome Flash Translation Layer (FTL) on SSDs made it really hard to get expectable & consistent performance. They were building possible specs to try to directly access each flash block, to be able to fill them up as they saw fit & clear them out as they saw fit. They wanted direct access, with far less work juggling complex data-mappings done on the SSD itself. Two examples of Open Channel SSD works: https://openchannelssd.readthedocs.io/en/latest/ http://lightnvm.io/

Note the huge banner on the second: "Zoned Namespace (ZNS) SSDs have replaced the work on OCSSDs, and is now a standardized interface." Everyone realized that the protocols we had for Zoned Storage were basically adequate to get us what we wanted. We'd just make a lot of 128kB or whatever sized zones on the SSD, and let people manage them themselves.

Currently this means opening blocks & then appending to them. There's been outstanding hope, in the future, that we might go further and allow more random write patterns, but still with the same contract of no over-writes (just clears).

That work seemed like it was ready to go in 2020 & 2021. SSDs were supposedly sampling/becoming available. https://blog.westerndigital.com/zns-ssd-ultrastar-dc-zn540-s... https://semiconductor.samsung.com/newsroom/news/samsung-intr... And that was reaffirmed again a year latter. https://www.tomshardware.com/news/samsung-and-western-digita...

But here we are in 2023 & there's still no Zoned Namespace SSDs (ZNS) one can purchase.


You can purchase them, just not from retailers. These drives break compatibility in annoying ways (eg. cannot boot a system off one), so it's totally reasonable that drive vendors would not be offering them through channels where unsuspecting customers could so easily buy something they cannot use.


They didn't need to completely remove compatibility modes.

And I'm not asking to get one off a shelf but not only can I not buy one on amazon or newegg I can't even find a price when I look for a specific model!

I think it's fair to say that to a first approximation I, as a person, can't get one.

Edit: I found one site with a price on a WD drive, but they only have refurbished drives and the price is 25% off of $2200 for 1TB, so I'm going to ignore that site.


I really wouldn't mind buying special purpose hardware. In fact, I think I prefer it.

The whole story here is that modern SSD controllers tend to embed a bunch of very fast very expensive data-processing cores to quickly do a ton of fancy mapping. We don't hear it quite as clearly these days, but for example for a while higher end Samsung SSD controllers were announced as penta-core ARM Cortex-R[1] systems, and I think those Cortex-Rs are what most folks do. (It was groundbreaking news that WD released an open source "swerv" RISC-V core which they'd experimented with using instead, https://blog.westerndigital.com/risc-v-swerv-core-open-sourc... .)

Ideally, the fantasy is, we can build much cheaper & faster controllers that do far less. Zoned Namespace drives should ideally have enterprise grade, fast, dual-port access, but in many ways, I feel like they should ideally be far cheaper than DRAM-less drives. They should leave even more up to the host. They should be incredibly dumb & simple drives. They should be lower power, by far. They should eschew DRAM. Drop all the legacy baggage & expose what you are, un-intermediated, so we can be fast & use it well. Which you, the drive, juggling general-purpose tasks, could never do.

It's not worth it, to me, to make a drive that still keeps the ancillary old baggage of convention. If you want to make a good ZNS drive, make a good ZNS drive.

[1] https://en.wikipedia.org/wiki/ARM_Cortex-R


> I can't even find a price when I look for a specific model!

That's true of most enterprise/server components. At most you'll find a laughably inflated list price that hardly anyone actually pays.


Google, Facebook, Microsoft, Amazon and so forth buy most of the spinning rust, and have for years. Each of them have their own extremely proprietary storage layer that knows how to manage Shingled Magnetic Recording.

So when SMR got released into the open market, it was exclusively drive managed, and mainly on the low-end, as a cost reduction strategy. With all the performance downsides you'd expect.

It is a real shame that host managed SMR isn't available outside the mega-scale corps. It would be nice to have access to an intelligent storage stack mixing flash and SMR drives without having to go work for one of the big N companies.


The thing that still bothers me is that the read-write cycles and lifetime are sort of unpredictable to me personally.

For an analogy, I only buy LED lightbulbs, love that tech. When they first became widely-available, my house had a few incandescent bulbs on the porch that I probably just left on for 5 years.

The supposedly superior led bulbs in that adoption period often died surprisingly fast, compared to the claims.

To bring it back to storage: I use as much NVMe as I can, but I'm still a bit uncertain about how much of a risk I'm taking with my filesystem choice and access patterns.

My answer: buy more SSDs when they go on sale! I don’t have intuition about lifetime even now.

I know, HDDs have equivalent classes of worse problems. I just learned to expect them to reliably fail, like a few days after a build ;)


  $ sudo smartctl -A /dev/nvme0n1

  Available Spare:                    100%
  Available Spare Threshold:          10%
  Percentage Used:                    0%
  Data Units Read:                    626,936 [320 GB]
  Data Units Written:                 22,262,139 [11.3 TB]

This is on a few-months-old PC, which is why Percentage Used is still 0%. The number does slowly go up over time, especially if you do heavy writing.

As with LED lightbulbs, the consumer industry is focused on cost-cutting to deliver shitty products at low prices. Any high quality SSD is more than adequate for consumer use. The main thing to look at is how many bits per cell the flash stores. SLC is pretty much not available in consumer products. Triple-level cell (TLC) is very adequate. All the cheap products are quad-level cell (QLC), which you don't want unless you know you're not going to be writing often.

Firmware bugs or bad controllers are still a potential issue. I have a 1.5 year old drive which is at something like 65% percentage used, it's a big drive with a high TBW rating and only a moderate amount of data written to it, but the excessive wear is known issue with that model. It'll get replaced under warranty but that's only kicking the can down the road since the replacement will have the same issue.


That was a very well-thought out comment, and I learned new things! Thanks!

The strange thing to me that gives me weird feels despite the science is that I've never had an SSD die that I installed. And I'm no Linus of LTT here.

Plenty of Apple machines etc. have needed a catastrophic replacement. That just throws off my ability to make rational decisions about it at times.

“Dude! That SSD was premium priced, what the heck? Thermal failure?”

(Seems likely, but ugh. Note: Opinion was formed a few years back, so it’s not something I’m actively diagnosing.)


The recent issues with Samsung's SSDs don't help with those feelings, I'm sure. I have a WD Black 1TB SSD in the system that I built in 2018; I've used it heavily all day long and have not suffered any issues from the SSD. I also used a MacBook Pro from early 2013 and only recently sold it, still going strong with its SSD (proprietary). I read recently that some HDDs were starting to have reliability problems after a few years.


I think append only type logs like Kafka are more important from a semantic point of view than from an IO point of view. The importance of having an immutable, append only data store is that it is immutable. That gives you some nice properties with consistency in especially distributed setups and synchronizing state between different nodes.

The IO write speed is a bit of a secondary concern. And since the data is immutable, you don't actually overwrite it all the time. And of course, SSDs have been around for quite a long time, and spinning disk was already on the way out over a decade ago on high-end servers. So things like Kafka and event sourcing mostly became popular after that started happening, not before. This was never really about SSDs vs. spinning disks. Instead it is about the guarantees that come with having immutable data in a distributed setup. You see that in the Elasticsearch world as well (Lucene uses append-only data structures). Elasticsearch emerged around 2010. Almost from the beginning the consensus was that SSDs were vastly preferable to spinning disks for scaling. Not because of the writes but for reads. Of course lots of people still used spinning disk around that time as well.

So, sequential reads and writes are two things. Elasticsearch writes sequentially; but reading is random access. Which is why SSDs are nice to have for Elasticsearch clusters even though it uses append only storage.


Not to harp too much on the title (I read the article and found it very interesting), but are there any industries storing their data on all-flash? My intuition is that most institutions still have a giant array of spinning rust somewhere underpinning everything they do.


Outside of backups I would expect many businesses to be all SSD. Maybe not the majority, but for many use cases I see no good reason to go HDD for most small businesses.

I know I went pretty much all SSD ages ago. I do run backups to a few 14TB externals but nothing outside of that is spinning rust.


I ditched my final spinning rust (a WD Black 2.5” 7200RPM drive I had jammed into my home server) last night, actually. It's all SSDs in my server, and it's all NVMe in my gaming desktop. My Macs and my partner's iPad are all flash too, obviously.


I do not think it is feasible to store all of YouTube videos on SSD. Google must be using spinning rust for the long tail of videos.


Gobs. This is Pure Storage's entire market.


Our datacenter at $dayjob retired our last spinning rust drive way back in 2017. Backups are done over the wire to dedupe-enabled targets 600km away; dedupe restore times are shit unless you use all-SSD there as well.


I see a fair number of all SSD clusters, but most have a mix of drive types. Interestingly, the ratio of drive types is aligned with optimal performance rather than cost.


Completely tangential, but Yangtze Memory/YMTC 232-layer NAND flash that was supposed to be used in the iPhone 14 is on fire sale due to US import restrictions, to the point that some speculate that YMTC will be gone soon.

The point is, DRAM-less 2TB M.2 NVMe disks are at $80 right now in select non-US markets. This is a mere ~5x HDD pricing, and as slow as 4Gbps on writes under the worst conditions.


DRAM-less 2 TB m.2 NVMe drives are <$80 in the US market as well https://www.amazon.com/dp/B08CDM2HSS. There are a couple such drives in this price range, even from companies like Intel with their 670p.


Minor detail, Intel sold their SSD business to SK Hynix for $7b. SK Hynix created a new company Solidigm out of the acquisition. So it is now a Solidigm 670p. https://www.solidigm.com/products/client/d6/670p.html

Personally it really felt a bit rotten to me that Intel had created drives like the 670p that would simply stop working after a designated amount of writes. On the one hand, a predictable life-span is kind of a good thing, but sending drives to the landfill before they're actually done seems monstrous. Anyways though, yeah, DRAM-less drives are available for incredible prices.


I'm still of the opinion that even with SSD/NVMe, the random reads/writes are still very slow.


In my experience, depends on the SSD. Consumer grade ones appear really fast because they have a DRAM cache, and as soon as you start having a bunch of cache-misses you realize how slow they actually are.

We got a bunch of enterprise grade Samsung SSDs and there's a large difference in sustained I/O. It's not "instant" by any means, but at that point there are other things slowing I/O.


Only the very best drives hit >200 MB/s in 4k write (even with DRAM as cache) and for 4k mixed not even Optane drives get that high.


Did you forget to stipulate that you're only talking about workloads that issue IO requests serially with zero parallelism?


No, I just didn't pick QD32T8 to pump the numbers either. Even in such workloads the gap to sequential speed is a decent ratio, just not painfully so as in most other mid to singular workloads.


This obviously depends substantially on the implementation, but (especially for standalone devices) there tends to be metadata access associated with translating a given LBA to a physical page. This brings cache locality of reference into the picture, and any workload that is "random enough" such that the metadata working set exceeds the cache will suffer...


I don't understand why so much emphasis is put on optimizing for sequential reads and writes. Sequential IO is not something that happens in the real world except in rare circumstances. In most applications IO is random, even if it's a log file.


> In most applications IO is random, even if it's a log file.

I am all ears... how are log writes random?


I think that I see what thelastparadise is getting at. At base, most logfile output is a read-modify-write of usually no more than the final block of the file, whereas the sequential writes that are generally optimized for by filesystem and disc drivers are multiple-block writes sent as single output requests.

The twain are similar, but aren't exactly the same. Optimize for the cases where you are writing multiple blocks in a chunk, with contiguous free block allocation policies and write-behind and whatnot, and you haven't in reality quite optimized for the case where what you're actually doing is overwriting the tail block of the file repeatedly, oftentimes changing mere tens of bytes in that block with each read-modify-write request.

(And yes, logging mechanisms often do like to ensure that logs are synchronously flushed to disc line by line, effectively and intentionally defeating write caching and write behind optimizations.)
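
To make the contrast concrete, here's a minimal sketch in Python (file names and the batch size are made up) of the two patterns: fsync-per-line, which forces that repeated tail-block rewrite, versus batching lines and syncing once per chunk:

    import os

    def log_per_line(path, lines):
        # Each line triggers a write + fsync, so the tail block of the file
        # is read-modified-written over and over and write-behind is defeated.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            for line in lines:
                os.write(fd, line.encode() + b"\n")
                os.fsync(fd)          # per-line durability, at a heavy cost
        finally:
            os.close(fd)

    def log_batched(path, lines, batch=1024):
        # Accumulate lines and sync once per batch; the kernel can now issue
        # larger, mostly sequential writes.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            buf = []
            for line in lines:
                buf.append(line.encode() + b"\n")
                if len(buf) >= batch:
                    os.write(fd, b"".join(buf))
                    os.fsync(fd)
                    buf.clear()
            if buf:
                os.write(fd, b"".join(buf))
                os.fsync(fd)
        finally:
            os.close(fd)

The trade-off, of course, is that the batched version can lose up to a batch of log lines on a crash, which is exactly why so many loggers flush per line.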


I would add that it is very rare to have 100 MiB+ writes to a log file, and anything less would be concurrent with other operations, which makes 'sequential' here a moot point.

> At base, most logfile output is a read-modify-write

It's worse, actually, as you can never tell whether this will be an RMW of the same block or of another free block. In the former case you waste writes for whole blocks. In the latter you can hit write amplification despite never having written any meaningful amount of data.


It's worse for an even simpler reason, which I thought about mentioning but it probably wasn't what thelastparadise was talking about. The line by line synchronization and flushing also causes i-node updates.


When writing to a log file, isn't that sequential write?


With SSDs it's more like a discount on sequential/adjacent reads within a physical block, whose size varies with model and manufacturer.


Relative to what?


That's why all of my minecraft servers are on RAM drives.


Seems SSDs are reaching the end game with ~8TB in a 2.5" form factor.

Now you just need to nail the sweet spot of $/GB. I'm looking at both the 870 EVO and QVO; they have 4TB versions for $300.

For "write once" or append only data these could work well, think user registry or sequential log.

But you need to offload the active data onto SLC drives, or accept the risk of wear (and how the drive firmware handles it) in the exact devices you buy/patch.

The OS needs to be on SLC anyhow.


> There are different techniques for implementing OP (Over provisioning) yourself. You can simply leave a portion of the drive unpartitioned

This doesn't sound right. Does this mean that the drive arbitrarily moves blocks between your partition and the unpartitioned space to manage GC?


Unlike hard drives of yore, where ordering the drive to write a bit at a given location results in exactly that, SSDs abstract the physical location away as a matter of their design.

An SSD still presents a virtual block map, so file systems can order it to write a bit at a given address like ye olde days, but the SSD controller ultimately decides where to actually write that bit, which more than likely has no relation to the specified location.

Incidentally, yes: This also means defragmenting an SSD is not only harmful, it is outright misleading because the state of the file system is not related to the state of the bits on the drive.
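
A toy model of that indirection in Python (purely illustrative; no real controller works this simply): the host addresses logical blocks, while the controller appends data to whatever physical page it likes and just updates a mapping table.

    class ToyFTL:
        def __init__(self):
            self.mapping = {}      # logical block address -> physical page
            self.next_free = 0     # pretend flash is an append-only page array
            self.flash = {}

        def write(self, lba, data):
            phys = self.next_free          # new data goes to a fresh page...
            self.next_free += 1
            self.flash[phys] = data
            old = self.mapping.get(lba)
            self.mapping[lba] = phys       # ...and the old page becomes garbage
            return old                     # for GC to reclaim later

        def read(self, lba):
            return self.flash[self.mapping[lba]]

    ftl = ToyFTL()
    ftl.write(42, b"hello")
    ftl.write(42, b"world")    # "overwrite" lands on a different physical page
    assert ftl.read(42) == b"world"

Defragmenting on top of this just shuffles logical addresses around while generating more writes; the physical layout is whatever the controller decided.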


Mechanical HDDs have the same mechanism available; it's just used for defect handling: https://www.dataclinic.co.uk/hard-drive-defects-table/


Indeed! HDDs remap sectors to reserve sectors if their controller detects the original sectors have become unwritable. The result is quite similar to what SSDs do more broadly; the file system remains as-is, but the workings behind the scenes differ from it.


We had a use case at one of my former jobs where we had to look up entries in hundreds of megs of static key/value data (vendored with the app), but it didn't have to be super fast and we didn't want to waste any RAM on it. Since everything is on SSDs, I wrote a lib[0] that just seeks through sorted data in files with pread and binary search. Worked perfectly for our needs.

[0]: https://github.com/maxim/wordmap
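
Not the linked library's actual code, but the general idea looks roughly like this Python sketch, assuming fixed-width records ("key\tvalue\n", padded) so any record can be addressed by index with a single pread:

    import os

    RECORD = 64  # hypothetical fixed record width in bytes

    def lookup(fd, nrecords, key: bytes):
        lo, hi = 0, nrecords - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            rec = os.pread(fd, RECORD, mid * RECORD)  # one random read per probe
            k, _, v = rec.rstrip(b"\x00\n").partition(b"\t")
            if k == key:
                return v
            if k < key:
                lo = mid + 1
            else:
                hi = mid - 1
        return None

    fd = os.open("data.sorted", os.O_RDONLY)       # hypothetical file
    nrecords = os.fstat(fd).st_size // RECORD
    value = lookup(fd, nrecords, b"some_key")

On an SSD each probe is a cheap ~tens-of-microseconds random read, so a few dozen probes over hundreds of megabytes is barely noticeable.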


Can't you just map the flash into the host's PCIe memory space, at least for reading, so you can have random access to it? Though there'd be some speed penalty when accessing different rows of the flash in rapid succession.

You could just treat it as memory and mmap() it from userspace?

The SSD controller will handle all the bad block remapping, ECC, etc. transparently in that case.

It should be possible to make a PCIe x16 card for this, so the CPU will not be tied up as much waiting for the flash.
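
You can get part of that programming model today by mmap()ing a file that lives on the SSD; the faults are served through the page cache rather than by truly byte-addressable flash, but from userspace it looks like random-access memory. A minimal Python sketch (the path is hypothetical):

    import mmap, os

    fd = os.open("/data/huge_readonly.bin", os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)

    # Random access; the first touch of a page may trigger a 4 KiB read
    # from the drive, after which it's served from the page cache.
    x = buf[12_345_678]
    chunk = buf[1_000_000:1_004_096]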


I find it strange to try to make the application deal with this kind of issue. It's the OS that's responsible to deal with the physical aspects of IO, not the application.

Now, as for a write-optimized log, it's important from a durability perspective that data is persisted to media as soon as possible after the OS orders a flush. In the storage hierarchy, even the increasingly rare battery-backed DRAM drive has its place.


Think of it like this: the universe doesn't care what we want, or what we find to be clean abstraction. When our programs are run on real, physical hardware they will be faced with the physical limits of those systems.

If that means sequential writes are significantly faster than random writes then no amount of OS abstraction will make an application that is structured to require random writes as fast as one that uses sequential ones.

If you don't care about performance, then the abstraction layers we have let you ignore this level of detail already.


> If you don't care about performance, then the abstraction layers we have let you ignore this level of detail already.

I really care more about correctness, portability, and long-term maintainability. Optimizing for the idiosyncrasies of a given OS at a given point in time goes against them all. Performance can often be increased by adding hardware (or waiting for the next Moore iteration).

There are a few cases where performance is critical, in the sense that, if you don't achieve it, the system is not fit for purpose, but they are vanishingly rare.


Even setting aside internal GC, there is a read-side benefit to having fewer, larger contiguous sections: a large read request needs to be split into fewer IOs. This matters if you're reading the output of this workload. https://dl.acm.org/doi/abs/10.1145/3477132.3483593 discusses it in a sciencey way.


Even on the most overpriced and insane modern NVMe consumer SSDs you'd see numbers like this: sequential writes 7000 MiB/s; random 4k writes 44 MiB/s. And the last number hasn't really improved over the years.

Yeah, it's better compared to the (1 MiB/s writes; 8 MiB/s reads) you'd get on HDDs, but it's not that good.


The interesting question (I’d appreciate any links) is how to change our programming patterns to get the most performance out of NVMe SSDs. I’ve seen before that it takes a lot of parallel requests to saturate a fast NVMe drive, but how do we change our application architectures to generate such a workload?
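
One common pattern (a sketch, not a benchmark) is simply to keep many reads in flight at once, e.g. with a thread pool issuing positional reads; io_uring or libaio are the lower-level equivalents on Linux. A hypothetical Python example:

    import os
    from concurrent.futures import ThreadPoolExecutor
    from random import randrange

    BLOCK = 4096
    fd = os.open("/data/big.file", os.O_RDONLY)   # hypothetical path
    nblocks = os.fstat(fd).st_size // BLOCK

    def read_block(i):
        return os.pread(fd, BLOCK, i * BLOCK)      # GIL is released during I/O

    # 64 concurrent requests keeps far more of the drive's internal
    # parallelism busy than a single synchronous read loop would.
    with ThreadPoolExecutor(max_workers=64) as pool:
        for data in pool.map(read_block,
                             (randrange(nblocks) for _ in range(100_000))):
            pass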


"Writes to the WAL are purely sequential and writes to long-term storage are purely sequential."

Is that simply a typo or am I missing something? Shouldn't it be mostly random for long-term storage?


No. In WAL-based systems new writes are appended to the write-ahead log, and the dirty database pages are held in RAM. Periodically, a checkpoint process writes out all those dirty pages to long-term storage, then truncates the write-ahead log. So both of these writes can be sequential. This is one of the reasons databases have the performance they do versus naively using lots of little files.
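
A toy sketch of that pattern in Python (not how any particular database implements it; the page size and record format are made up): appends go to the WAL, dirty pages sit in memory, and a checkpoint writes them out in ascending page order and truncates the log.

    import os

    PAGE = 4096

    class ToyWAL:
        def __init__(self, wal_path, db_path):
            self.wal = open(wal_path, "ab", buffering=0)
            self.db_path = db_path                 # assumed to already exist
            self.dirty = {}                        # page number -> bytes

        def write(self, page_no, data: bytes):
            # Sequential append + fsync gives durability cheaply.
            self.wal.write(page_no.to_bytes(8, "little") + data)
            os.fsync(self.wal.fileno())
            self.dirty[page_no] = data

        def checkpoint(self):
            with open(self.db_path, "r+b") as db:
                for page_no in sorted(self.dirty):   # ascending offsets,
                    db.seek(page_no * PAGE)          # i.e. mostly sequential
                    db.write(self.dirty[page_no])
                db.flush()
                os.fsync(db.fileno())
            self.dirty.clear()
            self.wal.truncate(0)                     # log space can be reused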


> So both these writes can be sequential.

In the worst case, though, the long-term storage can still see random writes, no (for B-trees anyway)? I figure LSM tree writes can always be sequential, even during compaction. But B-trees can only do purely sequential writes if the entire tree is rewritten, and you don't want to always do that if only parts of the tree need to be updated?


I would take it as a simplification to really mean "sequential enough" when writing out the reorganized copy to the long-term storage.

With all the different layers in modern storage, I think you can get asymptotically close to the sequential rate once your writes start hitting some of these other multi-block granularities, i.e. around the block sizes for flash erasure, encryption, redundancy coding, etc. These don't have to be that big, e.g. several megabytes.


The dirty pages can be written out in physical address order no matter what indexing structure is used. What real databases do involves a bunch of complicated choices, but basically there are ways to make checkpoints incremental such that they are nearly always large sequential writes.


Sure, it can be written in order, but it can still involve non-consecutive writes, which I assumed are closer to random writes from the disk's perspective.


No, because flash memory (and so-called RAM) still do sequential reads and writes faster, but for sure I'll tune a default Postgres config to give less of a penalty to random reads.


Definitely not; random access on PCIe 5 is not too different (in terms of user experience) from PCIe 4, and not that stellar compared to sequential throughput at various block sizes.


What a great and clear write-up. Thanks to the author.


If you were going to design a filesystem for solid-state storage (instead of linear or rotational media), could you implement wear-leveling by determining where to place the next segment of a file by hashing it, using the hash as its address, and updating the linked-list/tree/table/what-have-you that tracks where files exist?
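
A toy Python sketch of the placement idea in the question (purely speculative, like the question itself): hash the segment, map the hash onto the pool of erase blocks, and probe forward on collisions. Real flash filesystems track per-block erase counts instead, since a content hash can't react to how worn a block already is.

    import hashlib

    NUM_BLOCKS = 1 << 16

    def place(segment: bytes, allocated: set) -> int:
        h = int.from_bytes(hashlib.blake2b(segment, digest_size=8).digest(),
                           "big")
        block = h % NUM_BLOCKS
        while block in allocated:          # linear probe past occupied blocks
            block = (block + 1) % NUM_BLOCKS
        allocated.add(block)
        return block                       # record in the file's extent map

    allocated = set()
    addr = place(b"first 4 KiB of some file", allocated)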


Sequential IO is not dead for the simple reason that it is the simplest way of solving many problems. Often it will be good enough.


Amazing how much faster f2fs is than btrfs. Now I have some buyer's remorse over using btrfs...


It was always slow-ish compared to traditional filesystems, same with ZFS in some workloads.


It's because of the CoW, I think, right?


Yeah, and in general nice features cost IOPS. Not that it matters all that much on modern hardware, unless you're really trying to get all there is out of it


Not sure about ZFS, but going off a graph in the link, btrfs is like ~3x-4x slower than the other filesystems, which IMO is really substantial.

Of course it's workload-specific, but that kind of performance hit is going to affect a lot of workloads.


If you like f2fs, you might also be excited about ssdfs: https://news.ycombinator.com/item?id=34939248


Or bcachefs...


What are you citing or talking about? Neither f2fs nor btrfs show up in the article.


See the reproduced Figure 9


COW file systems are pretty amazing, especially on flash memory-based storage.


For some reason, reading this title made me remember Betteridge's Law, so I first answered "no" and then clicked to read the article.

> Betteridge's law of headlines "Any headline that ends in a question mark can be answered by the word no."

https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...


You'd need something byte-addressable for the access pattern not to matter.


Tldr: No – “the benefits of sequential IO are alive and well, even in this new era of NAND flash.”

(Betteridge's law of headlines strikes again.)


Huh?

PCIE is serial.


That's true(ish) but it has nothing to do with sequential access patterns in LBA space.


NVMe goes through PCIe; OK, there are some parallel lanes.


At least you could have said there are several lanes, a bit of parallelism.


Everything is serial.



