> When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on a Mac, which is also a surprise to me.
> Without that, there is no way to ensure durable writes, and you might get data loss or data corruption.
No, not without that. Even with that, you can't have durable writes; not on a Mac, or Linux, or anywhere else. If you are worried about fsync()/fcntl+F_FULLFSYNC, they do nothing to protect against hardware failure: the only thing that does is shipping the data someplace else (and, depending on the criticality of the data, possibly quite far).
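For reference, the macOS call being argued about is just an fcntl on the file descriptor. Here's a minimal sketch in Python; the fallback to plain os.fsync() on non-macOS platforms is my assumption for illustration, not something anyone in this thread specified:

```python
import fcntl
import os

def flush_to_media(fd):
    """Push buffered writes as far toward stable media as the OS allows."""
    if hasattr(fcntl, "F_FULLFSYNC"):
        # macOS: fsync() only hands data to the drive; F_FULLFSYNC also
        # asks the drive to drain its own write cache.
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        # Linux and others: fsync() is expected to issue the cache
        # flush itself (assuming the drive honors it).
        os.fsync(fd)

def durable_write(path, data):
    """Write data to path and flush it before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        flush_to_media(fd)
    finally:
        os.close(fd)
```

Even this, per the argument above, only covers the path down to the device; it says nothing about the device itself surviving.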
As soon as you have two database servers, you're in much better shape, and many databases like to try and use fsync() as a barrier to that replication, but this is a waste of time because your chances of a single hardware failure remain the same -- the only thing that really matters is that 1/2 is smaller than 1/1: losing both of two copies is less likely than losing your only one.
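The "1/2 is smaller than 1/1" point is just independent-failure arithmetic. A back-of-the-envelope sketch, with made-up failure rates (nothing here is measured):

```python
def p_all_copies_lost(p_one, n):
    """Chance that all n independent copies fail in the same window."""
    return p_one ** n

# Made-up number: each server loses its disk 1% of the time per window.
p = 0.01
single = p_all_copies_lost(p, 1)    # 1% chance your only copy is gone
mirrored = p_all_copies_lost(p, 2)  # ~0.01% chance both copies are gone
assert mirrored < single
```

The independence assumption is doing all the work here, which is also why the comment suggests shipping the second copy far away: correlated failures (same rack, same power feed) break it.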
So okay, maybe you're not trying to protect against all hardware failure, or even just flash failure (it will fail when it fails! better to have two NVMe boards than one!), but maybe just some failure -- like a power failure. But guess what: we just need to put a big beefy capacitor on the board, or a battery someplace, to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability, because that's not the failure you're trying to protect against.
What does fsync() actually protect against? Well, sometimes that battery fails, or that capacitor blows. The hardware needed to write data to a spinning platter of metal and rust used to have a lot more failure points than today's solid state, and in those days maybe it made some sense to add a system call instead of adding more hardware. But modern systems aren't like that: it is almost always cheaper in the long run to just buy two than to try and squeeze a little more edge out of one. Maybe, if there's a case where fsync() helps today, it's a situation where that isn't true -- but even that is a long way from "you need fsync() to have durable writes and avoid data loss or corruption".
> No, not without that. Even with that, you can't have durable writes; not on a Mac, or Linux, or anywhere else. If you are worried about fsync()/fcntl+F_FULLFSYNC, they do nothing to protect against hardware failure: the only thing that does is shipping the data someplace else (and, depending on the criticality of the data, possibly quite far).
"The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.
Of course fsync ensures durable writes on systems like Linux with drives that honor FUA. The reliability of the device and stack in question is implied in this, and anybody who talks about data integrity understands that. This is how you can calculate and manage the error rates of your system.
> "The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.
I think most people understand that there is a huge difference between the sun exploding and a single hardware failure.
If you really don't understand that, I have no idea what to say.
> Of course fsync ensures durable writes on systems like Linux with drives that honor FUA
No it does not. The drive can still fail after you write() and nobody will care how often you called fsync(). The only thing that can help is writing it more than once.
What is the difference in the context of your comment? The likelihood of the risk, and nothing else. So what is the exact magic amount of risk that makes one thing durable and another not, and who made you the arbiter of this?
> No it does not. The drive can still fail after you write() and nobody will care how often you called fsync(). The only thing that can help is writing it more than once.
It does to anybody who actually understands these definitions. It is durable according to the design (i.e., UBER rates) of your system. That's what it means, that's always what it meant. If you really don't understand that, I have no idea what to say.
> The only thing that can help is writing it more than once.
This just shows a fundamental misunderstanding. You achieve a desired uncorrected error rate by looking at the risks and designing parts and redundancy and error correction appropriately. The reliability of one drive/system might be greater than two less reliable ones, so "writing it more than once" is not only not the only thing that can help, it doesn't necessarily achieve the required durability.
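To make the "two less reliable ones" point concrete: with made-up failure probabilities and an independence assumption, two cheap mirrors can come out worse than one well-engineered drive:

```python
def p_loss(p_drive, copies):
    # Data is lost only if every copy fails; failures assumed independent.
    return p_drive ** copies

P_GOOD = 0.001   # hypothetical well-engineered drive: 0.1% loss per window
P_CHEAP = 0.05   # hypothetical cheap drive: 5% loss per window

# Two cheap mirrors: 0.05 ** 2 = 0.25% loss -- worse than the single
# good drive's 0.1%. A third cheap mirror finally beats it.
assert p_loss(P_CHEAP, 2) > p_loss(P_GOOD, 1)
assert p_loss(P_CHEAP, 3) < p_loss(P_GOOD, 1)
```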
> What is the difference in the context of your comment? The likelihood of the risk, and nothing else. So what is the exact magic amount of risk that makes one thing durable and another not, and who made you the arbiter of this?
What's the difference between the sun exploding and a single machine failing?
I have no idea how to answer that. Maybe it's because many people have seen a single machine fail, but nobody has seen the sun explode? I guess I've never had a need to give it more thought than that.
> It does to anybody who actually understands these definitions. It is durable according to the design (i.e., UBER rates) of your system.
You are wrong about that: nobody cares if something is "designed to be durable according to the definition in the design". That's just more weasel words. They care about what the risks are, how you actually protect against them, and what it costs to do so. That's it.
I was asking about the context of the conversation. And I answered it for you. It's the likelihood of the risk. Two computers in two different locations can and do fail.
> You are wrong about that: Nobody cares if something is "designed to be durable according to the definition in the design".
No I'm not, that's what the word means and that's how it's used. That's how it's defined in operating systems, that's how it's defined by disk manufacturers, that's how it's used by people who write databases.
> That's just more weasel words.
No it's not, it's the only sane definition, because all hardware and software is different, and so is everybody's appetite for risk and cost. And you don't know what any of those things are in any given situation.
> They care about what the risks are, how you actually protect against them, and what it costs to do so. That's it.
You seem to be arguing against yourself here. Lots of people (e.g., personal users) store a lot of their data on a single device for significant periods of time, because that's reasonably durable for their use.
There is a point at which a redundant array of inexpensive and unreliable replicas is more durable than a single drive. Even N in-memory databases spread across the world is more durable than a single one with fsync.
Unfortunately few databases besides maybe blockchains have been engineered with that in mind.
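How many unreliable replicas it takes to beat one reliable drive falls out of the same independence assumption. All the rates below are made up for illustration:

```python
def replicas_needed(p_replica, p_target):
    """Smallest n such that the chance of losing all n independent
    replicas drops to at most p_target."""
    n, p = 1, p_replica
    while p > p_target:
        n += 1
        p *= p_replica
    return n

# Made-up rates: each in-memory node loses its data 10% of the time per
# window; suppose a single fsync()ing drive loses data 0.05% of the time.
print(replicas_needed(0.10, 0.0005))  # 4 -- four flaky nodes match the drive
```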
> There is a point at which a redundant array of inexpensive and unreliable replicas is more durable than a single drive. Even N in-memory databases spread across the world is more durable than a single one with fsync.
Unless the failure modes you are concerned about include being cut off from the internet, or your system isn't network-connected in the first place, in which case maybe not, eh?
Anyway, surely the point is clear. "Durable" doesn't mean "durable according to the whims of some anonymous denizen on the other side of the internet who is imagining a scenario which is completely irrelevant to what I'm actually doing with my data".
It means that the data is flushed to what your system considers to be durable storage.
Also hardware failures and software bugs can exist. You can talk about durable storage without being some kind of cosmic-ray-denier or anti-backup cultist.
Say you have mirrored devices. Or RAID-5, whatever. Say the devices don't lie about flushing caches. And you fsync(), and then power fails, and on the way back up you find data loss or worse, data corruption. The devices didn't fail. The OS did.
One need not even assume no device failure, since that's the point of RAID: to make up for some not-insignificant device failure rate. We need only assume that not too many devices fail at the same time. A pretty reasonable assumption. One relied upon all over the world, across many data centers.
"but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability"
I believe drives that do have capacitors are aware of it and return immediately from fsync() without writing to flash. That's the point of this API.
Since neither Macs nor any other laptops have SSDs with capacitors, this point is kind of moot.
I have at various points replaced or upgraded 15 NVMe SSDs in desktops and laptops, and I have not seen a single one -- could you please let me know where I can find a non-server SSD with capacitors large enough for it to flush data in case of a sudden power loss?
Laptop batteries are irrelevant: battery failure, freezing, or cutting power to the circuit board by holding the power button are the failure modes you have to protect against.