How does Linux handle writes? (cyberdemon.org)
150 points by dmazin on June 30, 2023 | 83 comments

In ye olde unixe days, I learned a habit.

That habit was:

# sync

# sync

# sync

# halt

When powering down ye olde unixe maschine. It was so far back I just vaguely remember I'd read that somewhere; possibly in a ye olde unixe manual, but I can't remember exactly. I think back then it made sense, due to older spinning rust technology and there might have been some possibility of the OS not flushing any waiting cache to disk before powering down.

Even though it's probably not needed today, I still do throw a sync command in when powering down, say, my NAS with spinning rust drives, and my PCs/laptop with SSDs.

There is one thing where I 100% perform a couple of sync commands, and that's right before unmounting a USB memory stick - I usually witness linux Very Quickly Copying a file from PC to USB stick. "Aha! No, you're just caching that operation you sly dog you!" - the sync command on a CLI tells me when linux has finally actually copied the file onto the stick.

telling you to type in "sync" multiple times was the old-timers' way of ensuring that the first sync had completed (maybe about a second) before you turned the power off, even if you were a newbie SA.

It's a point in time write barrier no? Everything before the sync has to get written.

But writes are still getting added after the sync. Maybe I'm wrong but it seems like sync has to not try to sync those latecomers, because otherwise it might run forever, with new writes showing up each time it waits for current writes to sync.

Assuming the above is true, it makes sense to sync multiple times. You converge towards being fully synced. The first sync might take extra long (say 30s) because there could be tons of stuff unsynced. But the next sync should only have those 30s worth of writes to sync. Let's say that takes 5s. Now the third sync only has 5s worth of writes to sync, and that takes 1s...

This is all built on some assumptions about how sync works. But it feels like sync needs to be able to terminate & these assumptions are the only thing I can imagine that makes sure sync can ever finish, without never ending changes keeping it open.
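The convergence argument above can be put into a few lines of arithmetic. This is only a sketch of that mental model, not of how Linux actually behaves: it assumes each sync flushes just the data that was dirty when it started, and `fill_ratio` (the rate new dirty data arrives divided by the rate the disk can flush it) is a made-up illustrative parameter.

```python
def sync_durations(first_sync_s, fill_ratio, rounds):
    """Model how long successive sync calls take.

    Assumption (unverified for Linux): a sync only flushes data that was
    dirty when it started, and new dirty data arrives at `fill_ratio`
    times the speed at which the disk can flush it.
    """
    durations = []
    current = first_sync_s
    for _ in range(rounds):
        durations.append(current)
        # Whatever arrived during this sync is all the next one has to flush.
        current *= fill_ratio
    return durations


durations = sync_durations(first_sync_s=30.0, fill_ratio=1 / 6, rounds=3)
print([round(d, 2) for d in durations])
```

With a `fill_ratio` below 1 the durations shrink geometrically, which matches the 30s, 5s, 1s intuition; at 1 or above they would never shrink, which is the "might run forever" case.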

Apparently it is not, or at least, it wasn't, once, if you believe this research:


The post says BSD tried to do what I said. And it failed to actually do it right, as basically a bug. And needed to be updated to do what I said. Which took a couple decades to find out & make happen but was a bit of an issue.

Still no confirm or deny on Linux.

I was at +2 and now 0 after this post rose to the top. Am I bitter? No...

> It's a point in time write barrier no

That's the only way it makes sense. The caller controls when they start the sync, so they can make sure that all writes they care about have been issued by then.

But it's impossible to define a temporal ordering between writes and the time the sync completes. That's not a well-defined point in time. There's always a non-zero delay between the time the sync proper completes and the caller gets back control.

You're supposed to close applications (at least the ones that you care about), then sync. Then by definition nothing important falls into the queue.

Also, I'm around 70% sure that remounting the system read-only does the equivalent of a sync and doesn't allow any more writes.

> The first sync might take extra long (say 30s) because there could be tons of stuff unsynced.

that would have to be an awful lot of data, or a very slow drive, perhaps a floppy?

your other assumptions are sensible, to an extent. but way back when, we typically had a script that sent a message to all ttys saying "system going down in 30 seconds, please log off" - in fact ALL the other systems i used back then (Dec10, VAX, IBM 4381) did this before the sync (or whatever they did).

I’ve seen extreme sync times when writing large files to an old slow USB thumb drive. Often you will see the first part of the copy run at full USB speed but then it slows to a crawl. Then the unmount (which does a sync) pays back that initial burst of speed. I’ve seen it take more than 5 minutes (a kernel warning about the slow sync appears in dmesg after 300 seconds IIRC).

On any production machine, I would strive very hard to avoid this situation, yes. Better to tune your system: either allow less dirty data in the first place (dirty_background_ratio defaults to 10% of memory... even if you have 2TB of RAM) or, if this is happening, lower dirty_writeback_centisecs so that fewer seconds of dirty writes get buffered. Pretty please.
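If you want to see what your own machine is set to, here is a small sketch that reads the writeback tunables from `/proc/sys/vm`. The two names from the comment above are real sysctls; `dirty_ratio` and `dirty_expire_centisecs` are added here because they are closely related. On a non-Linux system the files won't exist, and the helper (a hypothetical name) just reports `None`.

```python
from pathlib import Path

# Writeback tunables named in the comment above, plus two closely related ones.
TUNABLES = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_writeback_centisecs",
    "dirty_expire_centisecs",
)


def read_vm_tunables(base="/proc/sys/vm"):
    """Return {name: int or None}; None means the file could not be read."""
    values = {}
    for name in TUNABLES:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # not Linux, or no permission to read it
    return values


tunables = read_vm_tunables()
for name, value in tunables.items():
    print(f"{name} = {value}")
```

Changing the values needs root (for example via `sysctl -w`), so the sketch only reads them.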

My example used more illustrative than real numbers. Still, 10s seems not totally uncommon, which is only 3x off.

Don't you have any shitty ass USB sticks that only do ~6MB/s writes? 180MB worth of dirty pages is nothing.

Only 6MB/sec? That would be lightning fast when I first started to use USB flash drives. I have used USB 1.1 drives which are rated a mind boggling ~900kB/sec sustained transfer speeds.

Good Kingstons did 5MB/sec, top of the line ones did 11MB/sec.

I have become the old-timer ;)

it happens to us all!

It beats the alternative.

A coworker once asked me for shell tips. I gave them a few that came to mind, and also invited them to peruse my .bash_history, because I figured there might be other tricks hiding in there.

They later asked why I sync all the time, almost always in threes.

"Huh?" Looking through history, I see that I do. Apparently it's been ingrained into my subconscious, like a sort of stutter, and I type it sometimes out of habit when I'm deep in thought.

Old habits really do die hard.

The big one is if you dd an image to a USB stick, and then try to pull it out and boot from the stick. You can use conv=fsync, but maybe you forgot to do that when you ran dd.
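For anyone scripting this instead of using dd, here is a rough Python equivalent of the idea behind `conv=fsync`: copy, then refuse to return until the kernel says the data has been flushed. It is a sketch, not a dd replacement; `copy_and_sync` is a hypothetical helper, and the demo uses temporary files standing in for the image and the stick.

```python
import os
import shutil
import tempfile


def copy_and_sync(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src to dst and return only after the kernel reports dst flushed."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
        dst.flush()             # Python's userspace buffer -> kernel page cache
        os.fsync(dst.fileno())  # kernel page cache -> storage device


# Demo with temporary files standing in for the image and the USB stick.
with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "image.bin")
    target = os.path.join(tmp, "stick.bin")
    with open(image, "wb") as f:
        f.write(os.urandom(4096))
    copy_and_sync(image, target)
    copied_size = os.path.getsize(target)
    print(copied_size)
```

Whether the drive's own cache has been written out after `fsync` returns still depends on the device, as discussed elsewhere in this thread.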

I made it a habit to run `eject /dev/sda` (as root) before removing a USB stick. That flushes pending writes and makes it impossible for new writes to occur before the stick is removed and re-inserted.

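Plain `sync` also works. POSIX allows it to return before the writes have actually finished, so you can't always know how long to wait until they are flushed to disk (especially for USB sticks, which can be really slow when writing), although on Linux sync(2) does wait for the I/O to complete.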

I still do this. There was another command which was used to park the heads, too.

Ah yes! And park the heads before opening up the server or moving it. Absolute must.

FXPARK.EXE! I'm getting flashbacks of Tandon DOS! Aaargh!

I've always been wary of USB sticks. I have some that write slowly, and the system says it's still writing. I have others that, however it works, seem to return immediately as if they were super fast, and do the writes in the background.

Anymore, I just buy the sticks with light indicators. It's not done if it's not pulsing evenly...

umount will block until the writes are finished, though

I almost always run sync after unmount on USB sticks or SD cards. Old habits and superstitions, I guess.

Recently I saw a USB stick's activity LED stay on long after unmount. I'm not sure if it was writing or something else.


The little microcontroller that does the actual write to pages of flash might actually be busy moving things around and doing other housekeeping after the disk is unmounted but while it still has power.

I wrote the firmware for a flash-based system, and used powered-yet-unmounted time to save a few stats to the uC's internal flash.

I have seen that too. It generally means the drive is doing wear leveling and/or other tasks. If I have time, I wait for that, too.

However, LEDs become rarer and rarer these days.

Triple sync is at the end of my backup-to-usb disk script

Indeed it will but I just like to make /sure/ ;)

Not always.

Not always, as in it'll terminate earlier, or as in it'll continue blocking indefinitely after the writes are finished? Because I think I've experienced the second case a few times.

Not always, as in it depends on the flags you pass to the umount2 syscall.

If you were an old timer, I would love to hear, how significant was Evi Nemeth's UNIX administration handbook to you?

I myself learned a ton from it, but as I understand, it used to be the bible.

there's old timers and there's ye olde thymers. The big shift was the web/.com boom. In the period before the mid 90s when high end workstations were why groups of people ran unix, sysadmins were frequently more experienced systems programmers who knew a lot about the internals and why Chesterton's various fences had been built, and they did sysadmin because nobody else could be trusted.

With the interwebs there became a massive need for many many servers, and an attendant need for many many sysadmins, and their training/experience would be purely administration, with more need to glean more information from guides.

In that same time period, use of computers shifted to more personal computers, so the demand for datacenters was from a lot of Windows9x boxes (that needed administration and didn't get any). And then the shift to WindowsNT and then AMD64, so who was "coming up" and how much they knew of systems programming kept declining.

Now we've started to administer administration, with a lot of files in /etc that say "automatically generated, don't edit this".

Evi Nemeth's books were feeding the internet growth, not the real old-time sysadmins.

Zero. I've never read it to this day. Eric Foxley's Unix for Superusers, on the other hand ...

Foxley's book was published in 1985. By the middle 1990s there were tonnes of books being produced, none of which was really "the bible", unless you counted ones that had "bible" in the title, like the Waite Group's "Unix System V Bible". (-:

> how significant was Evi Nemeth's UNIX administration handbook to you?

Never encountered it. I presume it was available in the UK, but my Unix courses in college didn't use it, and after college I started off as a CAD operator, using Tektronix SVR4 workstations with 4109 terminals running Teknicad, which is where I became quite intimate with SVR4 as I was a bit of a computer nerd ;)

This isn't specific to Linux

Raymond Chen has a tangentially amusing story that I just found while trying to google the win95 screen: https://devblogs.microsoft.com/oldnewthing/20160419-00/?p=93...

This article had an extremely positive tone, compared to what I expected from the URL. It's refreshing to hear the author's excitement about filesystems, and I learned a couple of things. Thanks.

It helps me to expand the concept of what is happening. Instead of "asking the computer to write a file," it can be seen as "asking the computer to coordinate some information from keyboard, to memory, to disk." You can then expand that to constituent parts. "From memory, to the disk controller, to disk." In this way "disk controller" can easily expand to S3 or some such. And you can begin to enumerate all of the things that have to happen for that to work. (And yes, you could even expand that some, as there is a memory controller and many other things you may want to think about.)

And buffers for efficiency aren't new to computers. Consider returning a book to a library. You drop it off in the return box. Eventually, a worker moves it from returns to processing, where they will then put it on a shelving queue. Where it will eventually be carted around and reshelved. Odds are high that you have some form of buffering before it gets to the return box, even.

> In this way "disk controller" can easily expand to S3 or some such.

Disk semantics are subtly different from S3-style blob semantics in important ways. Crucially the immutability of blobs makes life _much_ easier for the blob storage than the ability to write into the middle of a disk file.

Given the latency problem, I think it would be interesting and useful for there to be devices which expose a "blob" API to a bunch of Flash on a PCIe bus, and for there to be a suitable OS API for this.

Flash itself is only weakly mutable! You can only write in one direction (usually 1->0) to a Flash cell. To go 0->1 you have to do a block erase on a whole bunch of bits, which takes much longer.

> I think it would be interesting and useful for there to be devices which expose a "blob" API to a bunch of Flash on a PCIe bus, and for there to be a suitable OS API for this.

We're partly there. NVMe has standardized a key-value interface for SSDs, but as with any compatibility-breaking new storage interface it's pretty much only used by and available to large cloud computing companies that can afford to rewrite their entire storage stack to take advantage of it.

Oh, for sure! And rotational disk semantics are different from SSD ones.

It isn't hard to start to realize that there are some things you can ask the controller to do, that they just can't do. They can pretend to for you, though.

Until we start talking about SMR magnetic drives; then they go back to being similar to "need to read the sector to rewrite it and write it back", although that's mostly hidden by the disk's controller.

Only some drive-managed SMR drives hide the need to do writes in large blocks. Of course, they hide it the only way they know how: buffer writes as much as possible, but eventually do the slow thing and hope nobody notices the performance fell off a cliff.

As I understand it, host-managed SMR drives for enterprise require the host software to do writes in the correct sequence.

It's certainly good to understand that there is caching, journaling, etc. though I think that most people realize it by now. It's very easy to notice at least part of this by removing a USB stick without unmounting it on Linux :)

However, if you're one of today's lucky 10,000, welcome to the curse of knowledge. On that note, I can only imagine the war stories that PostgreSQL developers could tell about disk syncing.

> How slow? An SSD is 1,000 times slower than memory. A spinning hard drive is one MILLION times slower. Disks are multiple orders of magnitude slower than memory!

Latency? Throughput? I don't think SSDs are 1,000 times slower than RAM in either category, assuming fairly typical consumer machines, but I could be wrong.

> Latency? Throughput? I don't think SSDs are 1,000 times slower than RAM in either category, assuming fairly typical consumer machines, but I could be wrong.

Everything in this business is like a great big onion of caching and buffering layers. Disks have them too. This is why you often see consumer SSDs have very fast writes only as long as you don't write continuously. Continuous writes cause their internal buffers to fill up and the real write performance to be revealed.

There's also factors like write amplification and overprovisioning that enter into this set of equations. But if there's any takeaway from this, it's that I/O benchmarking is hell.

I actually want to explore this further sometime. It's not like it's a secret that buffers are everywhere in computers, but I am still surprised at how many buffers one comes across.

Like, let's say you are doing some writing using the C standard library...

* the library itself will buffer your writes, to reduce the number of syscalls

* I think your bio objects might get buffered, via plugging, to reduce the number of block requests

* your block requests will get buffered, so that they can be canceled or merged (I'm actually not sure if this happens in the "software staging queue" part of blk-mq, or prior...)

* as you mentioned, the disk controller may very well buffer the actual write operations
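The first layer in that list is easy to observe from userspace. The sketch below uses Python's buffered file objects rather than C stdio (an assumption made for brevity; the idea is the same): a small write sits in the library's buffer, and the file on disk stays empty until the buffer is flushed to the kernel.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:  # buffered by default
        f.write(b"hello")
        # The 5 bytes are still in the library's buffer, so no write
        # syscall has happened yet and the kernel sees an empty file.
        size_before_flush = os.path.getsize(path)
        f.flush()                # userspace buffer -> kernel page cache
        size_after_flush = os.path.getsize(path)
        os.fsync(f.fileno())     # ask the kernel to push it to the device

print(size_before_flush, size_after_flush)
```

Everything below the page cache (bio plugging, blk-mq queues, the drive's own cache) is not visible this way; you would need tracing tools to watch those layers.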

John Carmack's lecture on systems engineering / reducing latency for VR mentions buffers repeatedly: https://youtu.be/lHLpKzUxjGk

> I actually want to explore this further sometime. It's not like it's a secret that buffers are everywhere in computers, but I am still surprised at how many buffers one comes across.

Well, there's a reason a click or key press takes about as long to register as it takes to send a packet halfway across the world (or a few times around the world, depending on what software you're running).

It must have been a real blast back in the 8-bit days, when you'd have some peripheral's port mapped to a specific address in the memory map, and setting a byte to that address actually slammed the electrons into that peripheral's register. No wasteful 15 layers of OS abstractions waiting for buffers to fill and copying bytes from buffer to buffer all the way down.

Oh yeah, that I can't deny. The peak throughput numbers used in marketing are definitely questionable at times. Kinda makes sense considering SSDs and HDDs are basically computers of their own that give you some abstraction over the actual raw storage media.

You could compare memcpy to sequential write speed or something, but it'd definitely not tell the full story.

It can be a pretty real problem. I encountered a problem where I needed to construct a file with out of order writes 64 bits long. This works fine for small files, but this file was dozens of gigabytes. Turns out if you do this the naive way on a consumer disk, it will take weeks and it will wear out the disk incredibly quickly, because all the buffers will fill up in the worst possible way and the disk ends up with worst case write amplification for every write.

I ended up creating a bunch of smaller files containing essentially assembly instructions, of the form "put X at offset Y", and grouping adjacent write instructions in ~100Mb chunks. This is just sequential writes, which the OS will gladly help buffer and the disk likes. Then I go over the files one by one and evaluate the instructions. Now since the "random" writes are concentrated to a relatively small area on disk, the buffering works well.

Despite writing 2.5x as much data (and reading the data an additional time), this makes the write operation take like an hour instead of several weeks.
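A toy version of that bucketing trick, to make the idea concrete. This is not the original code: the real job spilled the "put X at offset Y" instructions to intermediate files because they didn't fit in memory, while this sketch keeps them in a dict, and `apply_bucketed` and `bucket_size` are names invented here.

```python
import os
import tempfile
from collections import defaultdict


def apply_bucketed(fd, writes, bucket_size):
    """Apply (offset, data) writes grouped by the file region they touch."""
    buckets = defaultdict(list)
    for offset, data in writes:
        buckets[offset // bucket_size].append((offset, data))
    # Visit regions in ascending order; keep arrival order inside a region
    # so a later write to the same offset still wins.
    for region in sorted(buckets):
        for offset, data in buckets[region]:
            os.pwrite(fd, data, offset)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.bin")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, 64)  # pre-size the file; unwritten bytes read as zero
        writes = [(48, b"DDDDDDDD"), (0, b"AAAAAAAA"),
                  (40, b"CCCCCCCC"), (8, b"BBBBBBBB")]
        apply_bucketed(fd, writes, bucket_size=32)
        result = os.pread(fd, 64, 0)
    finally:
        os.close(fd)

print(result)
```

The point is only the access pattern: all writes landing in one region happen together, so the page cache and the drive see localized I/O instead of writes scattered across the whole file.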

> ...but this file was dozens of gigabytes. Turns out if you do this the naive way on a consumer disk, it will take weeks and it will wear out the disk incredibly quickly

This is what C's "fallocate(2)" with "FALLOC_FL_ZERO_RANGE" or "FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE" flags are for, along with "pwrite()"

It's mostly an SSD problem. They really do not like sustained loads of random small IO, since their write cycles have much larger granularity.

Thank you for engaging.

Also, thank you for finding that imprecise language. An original draft had stated that I'm talking about I/O duration, but I lost that language. I will re-add it.

My numbers are for "main memory access", "SSD I/O" and "rotational disk I/O" from Gregg's Systems Performance (chapter 2). In retrospect, with all respect to Gregg, that wording is itself pretty imprecise.

I think what really matters here is I/O duration rather than throughput. That is, the reason we use the cache so much is because a single I/O is really slow against disks...

Yeah, that's true. Raw speeds don't really mean a whole lot without more context, and in practice writing to disk is going to be a lot slower even if the peak throughputs are not quite that far off. The performance characteristics are quite different in practice. I assume at one point in history though, these numbers were closer to true even for raw throughput... PCIe and NVMe have probably helped close the gap a bit, I would guess.

> the war stories that PostgreSQL developers could tell about disk syncing.

The man page for glibc open(2) used to say, "Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace", but now it notes that "before Linux 2.6.33" it wasn't implemented correctly.

I remember a rather eye-opening discussion on either the postgresql-hackers or lkml about the semantics of O_SYNC not actually being synchronous. In addition to the implementation details, which depended on the version of the kernel and the underlying filesystem type, it was also noted that underlying disk hardware also had caches which did not necessarily behave synchronously.
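For reference, this is what asking for O_SYNC looks like from userspace, as a minimal sketch. It only shows the flag being passed; whether the data is truly on stable media when `write` returns depends on the kernel version, the filesystem and the drive's cache, which is exactly the caveat above. The file name is a placeholder.

```python
import os
import tempfile

# O_SYNC is not defined on every platform; fall back to 0 (no flag) if absent.
O_SYNC = getattr(os, "O_SYNC", 0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "journal.bin")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | O_SYNC, 0o600)
    try:
        # With O_SYNC, each write is meant to return only after the data
        # (and the metadata needed to read it back) has been written out.
        written = os.write(fd, b"commit record\n")
    finally:
        os.close(fd)

print(written)
```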

Some places I can find that cover these details include




RAID cards back in the day, and some enterprise SSDs today, have batteries or capacitors so that they can finish the write even if the power is turned off.

They tell the OS that the write has been synced even when it has not, because they are confident that it will be.

Not only are they 1000x slower than memory, if you bump into thermal management, much slower than that. Worse, if you run into wear leveling write amplification, orders of magnitude slower.

Yes, you can use caching to hide this, in main memory and on the device (safely and unsafely, depending on the design and whether it is an enterprise drive with sufficient capacitors), but you can do the same with SRAM vs. DRAM and it doesn't make DRAM as fast as SRAM, or registers-vs-SRAM, likewise.

SSDs, even NVMe ones, do indeed have about 1000x the latency of RAM

normal RAM only has like 10x the bandwidth of a high end SSD, though

As a corollary, it's not possible to read a single byte from a file (with regular usermode I/O). When you call e.g. `read(fd, buf, 1)` the Linux kernel will read an entire 4096-byte page into the page cache, and then copy 1 byte from that into your buffer.

This might sound inefficient but in practice consecutive reads are so much more efficient than random reads that the difference between reading 1 byte and 4096 bytes is negligible when the data is stored consecutively.

Consequently, when designing on-disk data structures that are accessed randomly (like Btree pages in a database, for example), it usually doesn't make sense to make them smaller than 4 KB.

(It is possible to bypass the page cache using O_DIRECT, at least for some filesystems, but that has its own size and alignment restrictions; typically you can only read in multiples of 512-byte blocks.)
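A small sketch of the single-byte case, in Python for brevity. You can't see the page cache fill from this code (that would need something like mincore or tracing), so it only shows the request and works out which page the kernel must have cached to serve it; offset 5000 and the file contents are arbitrary.

```python
import mmap
import os
import tempfile

page_size = mmap.PAGESIZE  # commonly 4096, but platform-dependent

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(bytes(range(256)) * 64)   # 16 KiB of sample data
    fd = os.open(path, os.O_RDONLY)
    try:
        one_byte = os.pread(fd, 1, 5000)  # ask for a single byte
    finally:
        os.close(fd)

# Index of the page that has to be in the page cache to satisfy the request.
page_index = 5000 // page_size
print(one_byte, page_size, page_index)
```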

This is true even when talking about reading a byte from memory.

It's way faster to fill a cache line with 64 bytes in one go than it is to make 64 1-byte requests to main memory.

Yes, good point!

I don’t remember the last time I actually lost any data when a machine loses power. Am I just lucky or is there more to the story for this comment:

> What happens if you unplug your computer before the operating system writes the data to disk? Well, you will lose the data. It’s as simple as that.

It absolutely happens. My guess is that your machines just don't actually lose power very often. Any normal desk- or laptop won't. When you hit the power button, it doesn't just cut the power; the system is told to shut down so that it can neatly terminate processes and unmount filesystems before halting. Takes a few seconds.

At work I'm writing software for industrial machines. It's essentially a Linux system in kiosk mode running on semi custom hardware. We actually do have a power button that simply shuts off power (much to my dismay), and we have data loss all the time. We just make sure that whenever there's actually important data that needs to be stored (very seldom) we make it blatantly obvious to the user that now is a very bad time to push the power button, and sync before those warnings are removed.

But we've had boxes that were rendered unbootable because someone decided to push the power button the moment an apt-get upgrade completed and a lot of updated files were still living in memory only.

There are several reasons why it's actually really unlikely to lose important data when you lose power on modern systems.

Firstly, modern filesystems are designed to be robust to random power loss, so it's unlikely that you will corrupt your filesystem in a way that cannot be recovered from. It does mean that some unflushed data would not be written, but it will only affect files that were opened for writing at or shortly before your system lost power. Some filesystems (including ext4) can offer quite strong consistency guarantees if you configure them correctly, though usually stronger consistency means lower performance in some scenarios.

Secondly, Linux regularly flushes dirty pages to disk anyway (15 seconds on my system, see `cat /proc/sys/vm/dirty_expire_centisecs` for yours) so you cannot lose more than a few seconds of data in the worst case. This is particularly important for the case of "oh, I forgot to unmount the USB stick before pulling it out of my laptop".

Thirdly, most applications guard against data loss by using some form of write-ahead logging and explicit syncing. For example, if you edit a file in vim and then exit with `:wq!` it will call fsync() before exiting, which causes the contents of the file to be flushed to disk immediately. In theory it's possible that you lose power in the middle of the fsync() call, but it's extremely unlikely. (Of course it would be just as likely that you lose power right before issuing the write-and-quit command.)
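One widely used way for an application to do this kind of explicit syncing is "write a temp file, fsync it, rename it over the original, fsync the directory". The sketch below shows that pattern; it is not a claim about what vim does internally, `durable_save` is a hypothetical helper, and opening a directory for fsync assumes a POSIX system.

```python
import os
import tempfile


def durable_save(path, data):
    """Replace `path` with `data` so a crash leaves either old or new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # file contents reach the device
        os.replace(tmp_path, path)   # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a power cut.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "notes.txt")
    durable_save(target, b"first draft\n")
    durable_save(target, b"second draft\n")
    with open(target, "rb") as f:
        saved = f.read()
    leftovers = os.listdir(tmp)

print(saved, leftovers)
```

The temp file is created in the same directory as the target so that the rename never crosses filesystems.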

The sort of data that is not explicitly synced is where it's not really important to keep the latest information. Think about log files that are appended to continuously, or for example downloads in browsers. Do you really care if you have to restart an incomplete download after a sudden power loss? Probably not.

We learnt that at school back in 1994. Don’t they teach kids about how computers work anymore?

Surprisingly, no.

These foundational things are pretty frequently skipped or glossed over. There's a lot of stuff above and beyond that needs to be covered.

For example, in 1994, you probably didn't spend a whole lot of time talking about concurrency or parallelism. Most software didn't need to think about those things.

But beyond that, kids these days have a very different relationship with computers than we did. They very VERY rarely interact with the filesystem (in my day, that was everything). Instead, it's a slick UX with apps to tell you everything to do.

Back in my day, even fairly novice computer users were familiar with the idea of mounting/unmounting stuff because they'd pop in a 3 1/2 inch floppy, write stuff there, and pop it out. Formatting a floppy was a fairly common occurrence (for some reason).

Things are just different. On the one hand information about this stuff is WAY more available than it ever was. On the other, there's a lot more information and things are constantly changing. You can see that in old games like starcraft. There, the authors had things like their own custom linked list implementation. You'd pretty much never see something like that in modern software. You'd get funny looks with people saying "Why aren't you using a library for that? There's `ac`"

All this talk about why writing files is lazy. But why isn't booting lazy? Why do I have to wait for the network interface to boot even if I don't really intend to use it right now?

systemd does this with socket-activated units.

It doesn't work very well, though. Booting into a tty login prompt should be possible in well under 1 second. On a modern Linux system that uses systemd, however, it still takes on the order of a minute or more.

There's a lot of seemingly purely local infrastructure that nowadays depends on networking. login calls into a PAM module, which does database lookups, which goes through nsswitch and under the covers tries to consult nscd or systemd-resolved or some such, which either have network timeouts to tick down or aren't started until boot network interfaces are configured ....

Mind you, there was the same issue with NIS.

Yes, I guess the problem is that laziness shouldn't stop at service boundaries.

> This is called non-blocking IO

No, that's not quite right.

It happened not to block this time, but the write was not "non-blocking". It could have blocked.

This behavior is up to the file system or virtual file system implementation. Not all writes on Linux use a page cache.

It's misleading to tell people that there is always a page cache involved.

Originally, I was going to explain that I was discussing ext4 specifically but, honestly, can you name a common file system where the page cache isn't involved by default?


As interesting as the article is I am really distracted by the fact that the headers have no spacing before the next paragraph begins.

Thanks for pointing this out. I'll fix it!

And specifically, when you tell the shell to redirect stdout to a file, the shell is opening a file descriptor to handle the request. A file descriptor also sits in the middle when piping stdout from one command to stdin of another. It seems like it's doing magic, but it's not.

Are we continuously told in our courses to flush() the file so that it gets written to disk?

If you're talking about fflush() from the C standard library, fflush() doesn't write to disk. It pushes writes from the FILE's buffer to the kernel and drops any read data that has been cached. The C standard library doesn't really provide you with all the tools needed to commit writes to disk fully.

Hopefully not, because that murders performance. The caching exists for a reason, let it do its thing.

If you care about data integrity, you don’t trust the disk anyway, so flushing to the disk isn’t enough.

There's fsync. Unfortunately, if you want atomicity but not durability you have ~zero options on common filesystems.

(On ZFS and bcachefs you can get that by disabling syncs completely; they never reorder writes in a visible manner.)
