
Now I wonder: Does increased page size have any negative impacts on I/O performance or flash lifetime, e.g. for writebacks of dirty pages of memory-mapped files where only a small part was changed?

Or is the write granularity of modern managed flash devices (such as eMMCs as used in Android smartphones) much larger than either 4 or 16 kB anyway?



Flash controllers expose logical blocks of 512 B or 4 kB, but the actual NAND chips operate in terms of "erase blocks", which range from roughly 1 MB to 8 MB (or really anything). Within an erase block, an individual bit can only be flipped from "1" to "0" by a program operation; flipping any bit back to "1" requires erasing the entire block, which resets every bit in it to "1" [0].
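
To make that program/erase asymmetry concrete, here's a toy model in Python; the block and page sizes are purely illustrative, not tied to any particular part:

    # Toy model of NAND program/erase asymmetry; sizes are illustrative.
    ERASE_BLOCK_BYTES = 4 * 1024 * 1024   # one erase block, e.g. 4 MB
    PAGE_BYTES = 16 * 1024                # one program page, e.g. 16 kB

    block = bytearray(b"\xff" * ERASE_BLOCK_BYTES)   # erased state: all bits 1

    def program(page_index: int, data: bytes) -> None:
        """Programming can only clear bits (1 -> 0); it can never set them."""
        assert len(data) == PAGE_BYTES
        start = page_index * PAGE_BYTES
        for i, b in enumerate(data):
            old = block[start + i]
            if b & ~old & 0xFF:
                raise ValueError("needs a 0 -> 1 transition; erase the block first")
            block[start + i] = old & b

    def erase() -> None:
        """Erase resets the whole block, and only the whole block, to 1s."""
        block[:] = b"\xff" * ERASE_BLOCK_BYTES

    program(0, bytes(PAGE_BYTES))         # clearing bits: fine
    # program(0, b"\xff" * PAGE_BYTES)    # would raise: needs an erase first
    erase()                               # whole block back to all 1s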

All of this is hidden from the host by the NAND controller, and SSDs employ many strategies (including DRAM caching, heterogeneous NAND dies, wear-leveling and garbage-collection algorithms) to avoid wearing out the NAND. Effectively, you must treat flash storage devices as block devices of their advertised block size: you have no idea where your data physically ends up on the device, so any host-side placement scheme is fairly worthless.
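
For illustration, a stripped-down flash translation layer shows why host-side placement buys little: every overwrite of a logical block lands on whatever physical page the controller picks next, and only the mapping table knows where. This toy skips wear leveling and GC entirely:

    # Toy flash translation layer: overwrites never go back to the same
    # physical page; the old copy is just marked stale for later cleanup.
    l2p = {}        # logical block address -> physical page
    stale = set()   # physical pages holding superseded data
    next_free = 0   # next never-programmed physical page

    def write(lba: int) -> int:
        global next_free
        if lba in l2p:
            stale.add(l2p[lba])   # the old physical copy becomes garbage
        l2p[lba] = next_free
        next_free += 1
        return l2p[lba]

    # The same logical block ends up in three different physical places:
    print([write(7) for _ in range(3)])   # [0, 1, 2]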

[0]: https://spdk.io/doc/ssd_internals.html


Writes (programs) on NAND happen at the NAND page level, though, which is larger than the OS page. I believe the ratio between the two is usually something like 1:8 or so.

Even NAND pages might still be larger than 4 kB, but if they're not, presumably a NAND controller could accept such smaller writes to avoid write amplification?

The mapping between physical and logical block addresses is complex anyway because of wear leveling and bad-block management, so I don't think there's a need for the host write granularity to match the erase block or even the NAND page size.
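
Back-of-the-envelope, the worst case for a lone small write looks like this, assuming (hypothetically) a 16 kB NAND program page and a 4 kB host page; real controllers buffer and pack small writes to dodge this:

    # Worst-case write amplification for one small, unbuffered host write,
    # assuming a 16 kB NAND program page and a 4 kB host page (both assumptions).
    NAND_PAGE = 16 * 1024
    HOST_WRITE = 4 * 1024

    # The controller must program a whole NAND page to persist a 4 kB change,
    # so 16 kB gets written to flash for 4 kB of host data.
    write_amp = NAND_PAGE / HOST_WRITE
    print(write_amp)   # 4.0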


> Even NAND pages might still be larger than 4 kB, but if they're not, presumably a NAND controller could accept such smaller writes to avoid write amplification?

Look at what SandForce was doing a decade-plus ago. They had hardware compression to lower write amplification and some sort of 'battery backup' to ensure in-flight operations completed. Various bits of this sort of tech are in most decent drives now.
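
Roughly, the compression trick helps because only the compressed bytes get programmed to flash; the numbers below are made up purely to show the arithmetic:

    # Effect of transparent compression on how much actually hits the NAND.
    host_bytes = 100 * 1024**3     # 100 GiB written by the host (made up)
    compression_ratio = 2.0        # assumed; entirely workload dependent

    nand_bytes = host_bytes / compression_ratio
    print(nand_bytes / 1024**3)    # 50.0 GiB programmed -> roughly half the wear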

> The mapping between physical and logical block addresses is complex anyway because of wear leveling and bad-block management, so I don't think there's a need for the host write granularity to match the erase block or even the NAND page size.

The controller needs to know which blocks can take a clean write versus which ones need an erase first; that's part of the TRIM/garbage-collection process drives do in the background.
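
A toy GC step along those lines; picking the victim by stale-page count is just one simple policy, not what any particular controller actually does:

    # Toy garbage-collection step: pick the erase block with the most stale
    # pages, relocate its still-valid pages, then erase it for clean writes.
    def gc_step(blocks: dict, copy_out) -> int:
        victim = max(blocks, key=lambda b: blocks[b]["stale"])
        for page in blocks[victim]["valid"]:
            copy_out(page)                          # move live data first
        blocks[victim] = {"valid": [], "stale": 0}  # block is now erased/clean
        return victim

    blocks = {
        0: {"valid": ["a"], "stale": 3},
        1: {"valid": ["b", "c"], "stale": 1},
    }
    print(gc_step(blocks, copy_out=lambda page: None))   # picks block 0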

Assuming you have sufficient space, it works kinda like this:

- Writes are done to a truly free area, i.e. parts of the flash the controller can treat like SLC for faster access and less wear. If you have less than roughly 25% of the drive free, this becomes a problem. The controller tracks all of this state.

- When it has nothing better to do for a bit, the controller works out which old blocks to 'rewrite', folding data from the SLC-treated flash into 'longer lived' but denser (TLC/QLC/whatever-level-cell) storage; a toy version is sketched below. I'm guessing (hoping?) there's a lot of fanciness going on there, e.g. frequently touched data taking longer to get a full rewrite.
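
Here's that two-tier write path as a sketch; the cache limit and the pressure-triggered fold are placeholders (real drives size the SLC cache off free space and fold during idle time):

    # Toy two-tier write path: writes land in an SLC-mode cache, and a
    # background fold migrates them into denser (TLC/QLC) storage.
    slc_cache = {}       # lba -> data: fast, low-wear staging area
    qlc_store = {}       # lba -> data: dense long-term storage
    SLC_CACHE_LIMIT = 4  # placeholder; real caches scale with free space

    def fold():
        """Background job: move staged data into denser cells."""
        while slc_cache:
            lba, data = slc_cache.popitem()
            qlc_store[lba] = data

    def host_write(lba, data):
        slc_cache[lba] = data            # absorb the write in SLC mode
        if len(slc_cache) > SLC_CACHE_LIMIT:
            fold()                       # here, cache pressure forces the fold

    for lba in range(6):
        host_write(lba, b"x")
    print(len(slc_cache), len(qlc_store))   # 1 5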

TBH sounds like a fun thing to research more



