
A way to do atomic writes - Tomte
https://lwn.net/SubscriberLink/789600/101b40d06e0dfb80/
======
ww520
An append-only, log-based file system has a lot of appeal. A lot of hard
problems, like atomic writes, become trivial. A log-based FS shares similar
properties with the transaction log in an RDBMS, making consistency and
recovery easy. Write performance is fantastic. With enough cache memory, read
performance should be good, too. The only downside is the performance hit when
doing garbage collection. More research in this area would be helpful.
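
As a rough illustration of why atomicity falls out of the append-only design
(a minimal sketch of a hypothetical record format; the file name, record types,
and layout are all made up, not how any real log-structured FS works): a change
only becomes visible once its commit record reaches the log, so a crash
mid-write just leaves an uncommitted tail that recovery skips.

```c
/* Minimal sketch of a hypothetical append-only log: a change is atomic
 * because it only "exists" once its commit record is in the log; a torn
 * write leaves an uncommitted tail that recovery ignores. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define REC_DATA   1u
#define REC_COMMIT 2u

struct rec { uint32_t type; uint32_t len; /* payload follows */ };

static int append_rec(int fd, uint32_t type, const void *p, uint32_t len)
{
    struct rec r = { type, len };
    if (write(fd, &r, sizeof r) != (ssize_t)sizeof r) return -1;
    if (len && write(fd, p, len) != (ssize_t)len)     return -1;
    return 0;
}

int main(void)
{
    int fd = open("demo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;

    const char *payload = "new contents of block 42";

    append_rec(fd, REC_DATA, payload, (uint32_t)strlen(payload)); /* not visible yet */
    fsync(fd);                            /* data records are durable...       */
    append_rec(fd, REC_COMMIT, NULL, 0);  /* ...and the commit record makes    */
    fsync(fd);                            /* them visible, all or nothing      */

    close(fd);
    puts("committed");
    return 0;
}
```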

~~~
Scriptor
Haven't log-based file systems been a thing for a while now?
[https://en.wikipedia.org/wiki/Log-structured_file_system](https://en.wikipedia.org/wiki/Log-structured_file_system)

~~~
ww520
Was hoping there would be more research attempts at the log-structured FS. For
example:

- Borrow some ideas from generational garbage collection: keep the young
generation on SSD (or mirrored in RAM) and use a copying GC to get rid of old
versions of fast-changing data blocks.

- Utilize deduplication techniques with content-based signatures.

~~~
kjeetgill
I think elements of that generationality are the foundations of the Log
Structured Merge Trees used by KV stores like LevelDB and RocksDB. At least I
think that's the same concept; I'm not well read on filesystems.

~~~
ww520
Yes, the LSM tree is a good example of pushing the idea of a hybrid append log
and in-memory data structure further.

However, the LSM tree is meant for relatively smaller data, i.e. ordered
key-value pairs, and it has worse write amplification than a simple append
log. Appending the change to the write-ahead log counts as one write. Flushing
the memtable to the level-0 sorted files counts as a second write. Merging
(compacting) the sorted files counts as a third. That's 2-3 writes per change.
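
Roughly, those writes look like this (a minimal sketch; the file names and
formats are made up and nothing like LevelDB's or RocksDB's actual layout):

```c
/* Sketch of the 2-3 disk writes a single key-value change costs in an
 * LSM-tree-style store; names and formats are purely illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void append_file(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror(path); return; }
    if (write(fd, data, strlen(data)) < 0) perror("write");
    fsync(fd);
    close(fd);
}

int main(void)
{
    const char *change = "k=42 v=hello\n";

    /* write #1: append the change to the write-ahead log (it also goes
     * into the in-memory memtable, which costs no disk I/O by itself)  */
    append_file("wal.log", change);

    /* write #2: when the memtable fills, it is flushed to a sorted
     * on-disk table, so the same bytes hit the disk a second time      */
    append_file("sstable-level0.tmp", change);

    /* write #3: compaction later merges tables into the next level,
     * rewriting the data yet again (and once more per extra level)     */
    append_file("sstable-level1.tmp", change);

    return 0;
}
```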

Also, it doesn't help with the frequently-updated-block problem: all versions
of a data change are written to disk, and a merge is needed to get rid of the
old versions.

But it has a number of good implementation ideas that can be borrowed.

------
kabdib
Note that the SQL Server team at Microsoft had the Windows folks add write-
through and cache control to NTFS so that the database could have stability
guarantees.

It's mind boggling that this is still an issue, given the number of _decades_
we've had applications (databases, and other things) that need to efficiently
guarantee bits are on stable storage.

~~~
zzzcpan
It shouldn't be mind boggling, given that this guarantee is impossible to
achieve. A little bit better or worse is not that big of a deal; either way it
is still very far from 100%.

~~~
kabdib
Er, no. I'm using a storage system _right now_ that makes this guarantee; when
the store responds to a block write, it's going to be stable. The
implementation in question uses battery-backed RAM for performance, but that's
just an implementation detail.

This is definitely not impossible. It's not even that difficult (and yes, I've
written a number of transactional storage systems over the past 30 years).

~~~
zzzcpan
> when the store responds to a block write, it's going to be stable.

That's the thing: you are communicating with your storage controller, and
these are just promises from your controller, not guarantees. Once you try to
read the data back sometime in the future, there is no guarantee that you will
succeed. A lot can happen between your controller reporting a successful write
and you retrieving the data: software bugs and false promises, hardware
problems, operational problems, disasters, etc. There is a limit to how high a
probability of data retention a typical single server in a typical server room
can achieve.

~~~
theamk
There are always implied assumptions, like assuming that there are no bugs in
the kernel drivers themselves, and assuming that memory is not damaged.

Saying "we cannot achieve anything because there might be firmware bugs" is
technically true, but completely counter-productive. Adding "A little bit
better or worse is not that big of a deal" is just bad engineering -- can you
imagine doctor saying, "you might get hit by a car at any time, so I decided
it is not worth it to heal you"?

~~~
zzzcpan
We can achieve a lot on top of unreliable components, buggy kernels, and buggy
firmware; we can make unbelievably reliable systems. But we can't if we assume
we can rely on unreliable components. That is what bad engineering is.

------
jchrisa
One way modern distributed databases handle this is by assuming they can't
trust the disks. When you are writing to multiple datacenters it makes
recovery from disk errors just another kind of chaos resilience. The Calvin
protocol takes this to its logical conclusion by moving the write-ahead-log to
a Raft-like distributed consensus algorithm, where commits to the log span the
cluster:
[https://fauna.com/blog/consistency-without-clocks-faunadb-tr...](https://fauna.com/blog/consistency-without-clocks-faunadb-transaction-protocol)

I will add that the different flavors of fsync, and what they do (and don't
do), is always a source of entertainment in database engineering chat rooms.

------
cafxx
I agree with some of the comments on the article: there are good reasons to
keep durability and atomicity separated, and to allow apps to specify what
they need.

Also, a while back there was another paper that had a much better API as it
allowed atomicity across files:
[https://github.com/ut-osa/txfs](https://github.com/ut-osa/txfs) (better, but
obviously not perfect:
[https://twitter.com/CAFxX/status/1017204557003173888](https://twitter.com/CAFxX/status/1017204557003173888))

------
tarruda
When you are in full control of the file format, it is hard to find a
cross-platform user-space solution that beats SQLite.

------
chris_wot
_In that system, users can write as much data as they want to a file, but
nothing will be visible until they do an explicit commit operation. Once that
commit is done, all of the changes become active._

Isn't this Multiversion concurrency control[1]?

1. [https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)

------
caf
_Hellwig said that originally he was using an open() flag, but got reminded
again that unused flags are not checked by open() so using a flag for data
integrity is not really a good idea._

This problem will be solved if, as looks likely, the openat2() syscall is
merged as part of the path-resolution restriction patchset. As a new syscall,
it can do what open() should have always done and return EINVAL on flags it
doesn't understand.
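
A sketch of what that would look like, assuming the syscall lands roughly in
its proposed form (struct open_how passed with its size; an unknown flag bit
is rejected with EINVAL instead of silently ignored):

```c
/* Sketch of openat2() rejecting unknown flags; assumes the syscall and the
 * headers below are available (it landed in Linux 5.6). */
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>   /* struct open_how */
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct open_how how;
    memset(&how, 0, sizeof(how));
    how.flags = O_RDONLY | (1ULL << 63);   /* deliberately bogus flag bit */

    long fd = syscall(SYS_openat2, AT_FDCWD, "/etc/hostname",
                      &how, sizeof(how));
    if (fd < 0)
        printf("openat2: %s\n", strerror(errno));  /* EINVAL, as desired */
    else
        close((int)fd);

    /* plain open() would have silently ignored the unknown bit instead */
    return 0;
}
```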

------
ape4
Dumb question: is close() the same as fsync() for getting data written?

~~~
nwallin
No. (and it's only a dumb question if you don't ask)

close() flushes the application buffer into the vfs buffer, but it does _not_
guarantee the vfs buffer is flushed to the disk buffer, nor does it guarantee
the disk buffer flushes to disk. A successful fsync() call guarantees these
things.\*

\* except if the disk drive controller lies about flushing its buffer to
disk, but there's nothing you can do about that besides buying a better disk.

~~~
zAy0LfpBZLC8mAC
> close() flushes the application buffer into the vfs buffer

No, close() only closes the file descriptor. fclose() potentially also flushes
a userspace buffer.
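
To make the layers in this exchange concrete, here is a minimal sketch of the
path from stdio buffer to kernel page cache to disk on POSIX (the file name is
just an example):

```c
/* Sketch of the three buffering layers:
 * stdio buffer -> kernel page cache -> physical disk. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *f = fopen("out.txt", "w");
    if (!f) return 1;

    fprintf(f, "hello\n");  /* may still sit in the stdio (userspace) buffer */

    fflush(f);              /* stdio buffer -> kernel page cache             */
    fsync(fileno(f));       /* page cache -> disk (unless the drive lies)    */
    fclose(f);              /* fclose = fflush + close(fd); a bare close(fd)
                               would do neither the fflush nor the fsync     */
    return 0;
}
```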

------
13415
When I got fed up with the way the OS handled writes 15 years ago in one of my
(ordinary) applications, I wrote to a temp file in the same directory, closed
the file, did the atomic filename swap, and then verified the integrity of the
new file under the new name by opening it again and comparing it byte for byte
to the content to be stored. Only after that did the operation succeed. On
failure, the file names were swapped back and an error was signaled to the
user. The temp file remained on disk to avoid further data loss.

I suppose that's not fast enough for a DB.
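
For reference, a minimal sketch of that temp-file-and-rename pattern on POSIX,
with fsync() calls added so the data and the rename are actually pushed toward
stable storage rather than just the page cache (the byte-for-byte verification
and swap-back steps are left out, and the file names are just examples):

```c
/* Sketch of the write-temp-file-then-rename pattern, with fsync() on the
 * file and on the containing directory added for durability. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int atomic_replace(const char *path, const char *tmp,
                          const char *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;            /* the temp file is left behind on purpose */
    }
    close(fd);

    if (rename(tmp, path) != 0)   /* atomic swap of the visible name */
        return -1;

    /* persist the rename itself by syncing the containing directory */
    int dirfd = open(".", O_RDONLY);
    if (dirfd >= 0) { fsync(dirfd); close(dirfd); }
    return 0;
}

int main(void)
{
    const char *data = "new file contents\n";
    return atomic_replace("config.txt", "config.txt.tmp",
                          data, strlen(data)) ? 1 : 0;
}
```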

~~~
jjnoakes
Are you guaranteed that the data you read back didn't come from kernel caches
of the file system blocks? How do you know the data actually came back from
physical storage?

~~~
alain94040
Yes, but why can't we add an API, all the way down to the lowest levels of the
hardware, that says "once I say this write is done, it will be readable again
even after power-cycle".

Achieving that guarantee is not impossible, but it needs to be explicit. It
can't be inferred from other API calls.

~~~
charleslmunger
The behavior of fsync is a battle of OS developers and OEMs versus application
and database developers. Gaming the implementation of fsync to do less than
fully fsync makes benchmarks look amazing, improves user perceived latency,
and reduces flash wear. On the other hand, it corrupts data - but that's rare,
"the hardware is failing anyway", etc.

That's how you end up with stuff like OS X's "no really, fsync" param [1], or
Motorola shipping nobarrier on their phones. [2]

[1]
[https://github.com/google/leveldb/issues/203#issuecomment-55...](https://github.com/google/leveldb/issues/203#issuecomment-55007291)

[2] [http://taras.glek.net/post/Followup-on-Pixel-vs-Moto:fsync/](http://taras.glek.net/post/Followup-on-Pixel-vs-Moto:fsync/)
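
The macOS case in [1] is visible right at the API level: plain fsync() only
pushes data toward the drive, and flushing the drive's own cache takes a
separate, Darwin-specific fcntl. A hedged sketch (the file name is made up,
and the fcntl branch only compiles where F_FULLFSYNC is defined):

```c
/* Sketch of the "no really, fsync" knob from [1]: on Darwin, fsync() only
 * pushes data to the drive, while the F_FULLFSYNC fcntl also asks the
 * drive to flush its own write cache. */
#include <fcntl.h>
#include <unistd.h>

static int really_sync(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)  /* flush all the way to the medium */
        return 0;
    /* F_FULLFSYNC can fail on some filesystems; fall back to fsync()   */
#endif
    return fsync(fd);
}

int main(void)
{
    int fd = open("journal.db", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    if (write(fd, "x", 1) != 1) { close(fd); return 1; }
    really_sync(fd);
    close(fd);
    return 0;
}
```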

~~~
layoutIfNeeded
My CPU can't isolate processes due to Intel cutting corners with speculative
execution.

My memory is vulnerable to row hammer due to vendors cutting corners while
pushing for increased DRAM density.

And my - supposedly - non-volatile storage is broken due to vendors gaming
benchmarks with fsync.

Is there any component in a modern consumer computer that isn't fundamentally
broken in one way or another?

~~~
syn0byte
What you call "broken" others call "performance choices". Except SSD
controllers, they are big fat phonies.

Look at how much slower CPUs are without those speculative execution tricks. I
can buy gigabytes of RAM with a 20 I found in an old jacket. You can explore
entire worlds consisting of gigabytes of high res textures and mesh data in
near real-time, _while_ downloading 4 new albums off the internet.

Broken?

~~~
zAy0LfpBZLC8mAC
> Broken?

Yes, it is breaking the promise of what it is supposed to do. If fsync() was
defined as "will ensure your data is on disk, unless that's kinda slow, then
who knows", then the behaviour would not be broken, just potentially useless
for many applications. But if you promise to ensure something is stored on
disk, and then don't, that's the definition of being broken.

~~~
layoutIfNeeded
This. It's like those counterfeit memory cards you could get on eBay that had
a 128 MB chip but reported themselves as 8 GB or so. "Works perfectly if you
stay below 128 MB!"

------
wyldfire
> This feature would work for mmap() regions as well, not just traditional
> write() calls.

I always wondered about writable disk-backed pages -- back when I last looked
it up, the munmap() manpage made no suggestion about whether it would flush
the data or whether you need an explicit msync() to make that happen. Since
pages can also get written back without an explicit msync(), it's tricky to
infer what the behavior should be from observation alone.
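
For what it's worth, the portable reading is that munmap() alone promises
nothing about when dirty pages reach the disk, so an explicit msync(MS_SYNC)
before unmapping is the only way to ask for it (a minimal sketch; the file
name is just an example):

```c
/* Sketch: flush a writable file-backed mapping explicitly instead of
 * relying on whatever munmap() may or may not do. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, 4096) != 0) return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    memcpy(p, "hello", 5);       /* dirty the mapped page               */

    msync(p, 4096, MS_SYNC);     /* write the dirty page back and wait  */
    munmap(p, 4096);
    fsync(fd);                   /* belt and braces for file metadata   */
    close(fd);
    return 0;
}
```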

~~~
namibj
Read the code. It's the only way to know for sure. Then use something, maybe
based on clang-query, to check whether the path that leads to flushing based
on msync() and/or munmap() has changed -- i.e., whether any code between your
syscall and the kernel-internal write-this-page-to-disk-now-and-wait call
changed. You should be able to convince clang-query to spit out a normalized
AST of the code and still be able to easily map it back to the source-file
location, so you can check the source diff for the specifics once you've
deemed the AST diff non-trivial (some things should be easier to ignore in an
AST diff).

In case you end up creating such a tool, please share it with us.

