So MIT/BSD-licensed projects are more likely to be created by professional developers, some of them very highly skilled and passionate, since the run-of-the-mill enterprise rarely open-sources its software ;-)
People who reviewed the leaked MS source code remarked that it was unexpectedly clean, with almost no profanity or inside jokes. I think this is mostly true of leaks from software companies.
Modeling code from academia, whether it has to do with weather prediction or COVID prediction, however, appears to be (to put it lightly) subpar -- the COVID-19 model that prompted the original lockdown was apparently using uninitialized variables and was non-deterministic. (And, independently, it was also completely wrong because of wrong assumptions -- e.g., on 1 Apr 2020 it predicted 95,000 dead in Sweden by 1 July 2020 if a super-strict lockdown was not enforced. The actual death toll on 1 July 2020 was 4,800.)
This seems like a common problem in general.
Abstraction to a common interface is fine: if you don't care what the abstracted thing is, you don't need to know. Great. Read a basic file off spinning disk, SSD, CD-ROM, NFS -- all the same, all good.
But if you do care, you can't find out what it really is, because the underlying thing has been abstracted away by whatever layer(s) sit on top.
Shouldn't it be a question we consider when writing libraries, or any other abstraction: how does someone penetrate your abstraction if they need to? Are you really, completely, 100% sure they don't and shouldn't need to?
There's a kind of arrogance we fall into when making decisions for others in API and library interface design, where we know what we're exposing and users don't know what's good for them. Maybe that's right sometimes, I don't know.
This sqlite paper casts some pretty strong doubt on that as an approach, imho.
What do you think? Have we all been missing something like this in api design?
They are one of the _first_ abstractions, and they are quite fundamental. That's why they all still exist in the way they are.
Just look at the name: "file". And "directory". Where do these terms come from? Right, from organizing things, mainly offices. And offices are organized like that because this is what is practical under the constraints of humans working with paper in a three dimensional space.
But looking at it semantically, almost no one really wants to use only files or directories. We mostly want to use semantic tags. And one kind of tag can be a hierarchical "directory/file" tag -- fair enough. But that's only a small part of the picture, because we also want to tag our objects/entities by date, creator, department, security clearance level, how long they have to be archived, and so on. A filesystem alone does not provide this, so we have to build on top of it.
And you can see that this is changing. Look at phones nowadays. They don't use this anymore. Pictures are stored in collections/groups, can be tagged and sorted in various ways. They can be seen on a geographical map. And much more. No one has a directory/file structure in their head.
So the question is: does the OS still need to support that? Actually, maybe not. Maybe SSDs today could work much more efficiently if we didn't assume everything must be a file.
Just wait, we will get there.
The file/directory abstraction survived for a long time but nowadays it's often a limitation more than a convenience.
Usually it's right, and at least you're putting thought into it. API design is not easy, hence why it gets fucked up so often.
I just went through explaining to someone why they shouldn't go pawing through /proc to find details on a running process. How there were commands like 'strace' and 'lsof', or even running a debugger that could better achieve their goals. Often people just aren't aware.
The option to bypass layers of abstraction should be there, but the option should be off by default, and come with plenty of warnings akin to "if you're removing the cylinder head, we assume you know what you're doing, and you assume all responsibility."
Other databases already bypass the kernel routinely; I'm reminded of Oracle which could use a partition or even whole disk directly. And yes, I know SQLite is not like other databases; that's why it gets tricky.
Honestly, the stuff they lay out is the kind of customer requests I dream of getting. I would gladly implement these, where they don't break backwards compatibility, and where they are off by default but can be turned on, in a fashion where it doesn't upset/surprise/break currently majority usage.
Yes that's a leading question.
Are you looking for experience?
[ see Wikipedia for contextual meaning of law here, https://en.wikipedia.org/wiki/Law_(principle) ]
This pops up a lot when libraries are designed with functions that are non-public. I only want to access the underlying function, not use your abstraction on top of it. You can't anticipate every possible use case of your library!
On the flip side, I consider a successful software tool to be one that is used to do something undreamt of by its author.
One should not try to imagine every possible use to which their software will be put, especially when designing abstractions or interfaces. But these abstractions should be flexible enough that they can be adapted for any use.
There might be very good reasons that the underlying function is not public (I can think of private methods in OO where end users absolutely should not be calling them). To me, this seems like the public interface is lacking or perhaps too rigid, or maybe not well-understood (either by designer or user).
In any case APIs are hard.
On the other hand, the OS filesystem APIs are detailed enough for most people (and even for most of the OS internals too). I’ve never wanted to know sector sizes or power loss characteristics for any reason other than curiosity. The needs of SQLite aren’t very common. Most developers writing code to use the filesystem don’t need (or even want) to know if a file was renamed or what happened to it before last reboot or whether it can be mmap()ed with other processes. I can barely manage to use rsync, and if I was burdened with these things whenever I need to open or write a file, I’d use SQLite as my API instead of fopen(), which puts me right back at using a high level API (https://www.sqlite.org/whentouse.html).
All APIs present an interface over the resources they manage, so an abstraction is inherent, even if the API is low level. But usually part of the goal of API design is to make the interface easier to use than the resources it manages. If we expose an API that is too low level, then we compromise on usability and safety for everyone in favor of performance and detailed control for a very small minority of users.
So personally, I think I agree with your general concern in the sense that API designers should be aware of the needs of who they’re designing for, but I also think SQLite and the Linux filesystem aren’t a good example.
For example, I’m pretty sure I would like every app to automatically handle cases where the underlying file gets renamed or removed. (AFAIK apps that use the Cocoa API handle these automatically.) However, there really isn’t any POSIX API that allows handling these kinds of situations -- to do so now, you need to do all of the nitty-gritty things that SQLite does.
In fact, I think the reason SQLite needs all of these nitty-gritty details is an insufficient abstraction. If the file APIs were better and didn’t require everyone to consider whether to call fsync or not, SQLite wouldn’t have needed any of this. But, well, POSIX has failed us here, and here we are.
I don’t think the POSIX API is perfect by any stretch, but there is a real downside to going lower level, in terms of productivity, in terms of mis-use and accidents, in terms of security, and in terms of ability to support old hardware, heterogeneous storage devices, networked storage devices, etc. If there was a huge and unanimous demand for the low level features SQLite uses, the API would already expose those features.
APIs sometimes are too high level, and sometimes are too low level. Getting it wrong in either direction has negative consequences. And getting it exactly right is hard. In any case, I’m unconvinced that the Linux filesystem API is abstracted at the wrong level or represents a good example of an API that is too high level.
But to be honest, persistence and durability of data are the goal of storage, and thus what the filesystem API needs to be designed for.
The API needs to clearly be designed for atomicity, durability & to allow coordination across multi-file structures.
This can be achieved by either 1) indicating all required synchronizations to callers in metadata, or 2) fully encapsulating durability by making the API atomically commit every operation.
I don't agree these factors should be hidden, since losing or corrupting data is fundamentally at odds with the goals of a storage API.
It's not just SQLite, any application needing to maintain a datastore needs reliable storage.
No. It's hard to design, maybe harder still to design an API, and perhaps there's a misunderstanding when APIs are designed for others, or when an API exists just for the sake of being an API.
As others have already commented, leaky abstractions (the Law of Leaky Abstractions) are IMHO a good description of how the boundary turns out not to be a boundary, even when that was never intended.
The same idea shows up in Hyrum's Law, which is IMHO a different point of view, but it likewise reminds us that the concept of "API for API's sake" is conceptually broken.
> Sqlite is all like, "yeah but we need to know what you're abstracting or everything is terrible."
I'm wondering where you got "terrible" from; I haven't found that in the article. And it's absolutely not "all like" that. The paper describes common problems with filesystems and what SQLite does about them, which is standard programming -- at least for the SQLite authors.
You don't seem to be very aware of these kinds of problems, and with all due respect, this creates the impression that you haven't experienced them much yet. That aligns with your framing of it as something "missing in the design", where I'd like to offer a different view: it's missing practice. To keep the colloquial tone:
Whatever has been missed in API design, we have all been missing practice at writing software.
So when someone who creates an Open-Source library creates an API that you don't like, they are arrogant? Interesting take.
A typical Android phone has hundreds of SQLite databases and does more than 5 gigabytes of database I/O per day....
F2FS is a log-structured filesystem for Linux that has (limited) atomic write capabilities. SQLite is able to use the atomic write capabilities of F2FS to bypass the rollback journal. Anecdotal reports are that an Android phone that is reformatted to use F2FS is noticeably faster.
F2FS makes an old, tired phone feel like a perky new phone.
Has the situation changed?
As a PostgreSQL hacker I have the same question, and I know the MySQL people do too. I wrote some more specific questions and background over here: https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
I don't agree these factors should be hidden, unless the API guarantees that every single operation is atomic and durable.
Full details as to atomicity and durability need to be available to callers (not just SQLite, but any application that wants to provide a reliable datastore) so that they can write information durably, including across coordinated multi-file data structures.
Then they could at least say "we're doing everything we can; next time you buy a storage device look for feature X".
And I barf.
>Pages are sorted so that they are written in increasing order. → Is this helpful on Linux? Or can we just as well skip the sort and write the pages in any arbitrary order?
Does anyone know if there are responses to these questions available anywhere?
Edit: The recorded talk is available here: https://www.youtube.com/watch?v=-oP2BOsMpdo&t=5541 but it's a lightning talk and the discussion was very brief.
For a database, this is not the question you want to be asking. It's a rabbit hole. The right answer (which is not always the answer we want, so we keep asking): no one knows.
It's better design to err on the side of assuming disk sector writes are never atomic. This would be taking the security stance, e.g. assume an attacker is actively trying to break your protocol, and use cryptographic guarantees to prevent or at least detect them.
Kernel developers also can't answer these kinds of hardware questions. There's too much hardware out there.
Disks may give atomicity guarantees and then you discover edge cases where these degrade, or disks may emulate Advanced Format 4096 byte logical sectors but have smaller 512 byte physical sectors underneath. Disks are too complicated (and firmware too buggy) to trust any kind of guarantee. There are too many layers of abstraction.
Instead of trying to figure out the safety of the layer beneath you, assume the worst and bring the design back to using end-to-end guarantees.
If you assume and plan for the worst, then you don't need to ask the question, and you can handle the worst of hardware, without surprises when those guarantees are broken.
In fact, with file systems in general, the situation is even worse than for disks, because of disk corruption (3% per 36-month period per disk) and misdirected writes, which most file systems unfortunately don't handle. You should never trust any metadata provided by the file system about your journal file (e.g. file size) and use it as part of your recovery protocol, because the file system itself stores that metadata on the same disk you're writing to. Again, something as important as the size of your journal file should be protected by your own end-to-end protocol.
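To make this concrete, here's a minimal sketch (my own illustration, not SQLite's or any real DB's journal format) of journal records that carry their own length and checksum, so recovery never has to trust filesystem metadata like st_size:

```python
import struct
import zlib

MAGIC = 0x4A524E4C  # arbitrary record marker chosen for this sketch

def encode_record(payload: bytes) -> bytes:
    # Header: magic, payload length, CRC32 of payload (12 bytes total).
    header = struct.pack("<III", MAGIC, len(payload), zlib.crc32(payload))
    return header + payload

def decode_record(buf: bytes, offset: int):
    """Return (payload, next_offset), or None if the record is torn/corrupt."""
    if offset + 12 > len(buf):
        return None  # truncated header: journal ends here
    magic, length, crc = struct.unpack_from("<III", buf, offset)
    payload = buf[offset + 12 : offset + 12 + length]
    if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
        return None  # torn write or bit rot: treat the journal as ending here
    return payload, offset + 12 + length

def recover(buf: bytes):
    """Scan records until the first invalid one, ignoring st_size entirely."""
    out, offset = [], 0
    while (r := decode_record(buf, offset)) is not None:
        payload, offset = r
        out.append(payload)
    return out
```

A torn final record is simply dropped: `recover(encode_record(b"txn1") + encode_record(b"txn2")[:5])` yields only `[b"txn1"]`, regardless of what the filesystem claims the file size is.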
Sometimes it's possible for the filesystem API to guarantee that "sector"-size writes are atomic, even though the underlying storage doesn't do that.
For example a filesystem which uses copy-on-write when overwriting data, by appending each written block to a journal and checksumming or checkpointing the journal, is likely to offer that property.
Sometimes a block device knows that sector writes are atomic too. For example some battery-backed RAID controllers can reasonably guarantee this. Of course it fails if the battery runs out, but the abstraction is intended to assume you never allow that to happen.
It can go the other way as well. When underlying storage provides guaranteed atomic sector writes, the overlaid storage might not. RAID-5 does not provide atomic sector writes even when the underlying storage units do.
It would be useful for them to report when they have that property, because it would allow SQLite to use fewer writes. In effect, if the filesystem already uses a mechanism to ensure atomicity at some performance cost, there's no need for SQLite to use a second mechanism on top at a second performance cost.
It's actually worse than non-atomic. Writing to a sector on RAID-5 (without battery backup) corrupts other sectors too, during the time window until they are all made consistent. I call this the "radius of destruction", and I'm pretty sure SQLite and other software ignores this problem because there's no API for finding out about it.
So filesystems could report:
- "Don't know" for when writes go direct to the underlying block device. Or better, report whatever the block device reports for this query, which should be "don't know" in most cases.
- "Yes it's atomic" for when the filesystem layer (or block translation layer, flash translation layer or whatever there is) knows that it provides a reliable atomic-block-write abstraction on top of storage which doesn't provide that.
- "Watch out, we don't even guarantee your sector write won't corrupt related sectors in a geometry group..."
If they are going to ask kernel filesystem authors to provide a better API, I have long felt there is one obvious API missing from Linux that would be better for filesystem databases than other approaches: a ranged durability sync, call it fdatasync_rangev, which durably commits only the specified byte ranges of a file.
That would allow the WAL or journal to be efficiently part of the database file itself. No need for extra files, file creation and deletion, directory operations or syncing directories. That's a bunch of durable metadata writes that can be removed, and filesystem implementation details that can be avoided.
It would also allow atomic and durable writes without the need for writing twice in some cases, depending how the data is structured. (First write to the WAL or rollback journal, then to the database main file). That's a bunch of data writes avoided.
In cases where it can avoid two writes, that would also remove the intermediate barrier sync, doubling the commit-to-sync throughput.
Ideally, like other I/O (or at least reads), that should be available in a sensible async manner too, with completion when the covered data, and only that data, is committed durably. I'm not sure what system API would provide good async I/O for something like SQLite, though, if it can't use AIO or io_uring because the kernel API is poorly suited to a self-contained library.
Finally, a couple of variants (probably via flags):
"No rush" flag, to let the caller wait until those ranges are committed durably, but not force them to be written faster than normal. That would allow ordinary fast buffered-write speed, while at the same time providing the usual ACID guarantees that a DB returns success from COMMIT when the data is durably committed. For some workloads that would be fine.
"Barrier" flag, make the call return immediately but delay all subsequent writes on the same filedescriptor until the ranges are synced durably. This is similar to what's mentioned in the article, but more versatile by attaching to ranges. It's also not strictly necessary if you have the "no rush" flag.
My argument is if there were to be some work on potential enhancements that require kernel filesystem changes and syscall changes, fdatasync_rangev would be a more useful choice of enhancement, and probably easier to implement than fbarrier.
Better for performance, user convenience (no more separate files), and more kinds of applications (e.g. VMs and other kinds of in-file data store).
It's not as big as it sounds. fsync_range is already in FreeBSD, NetBSD and AIX, and Linux nearly got it too (there was a patch). (Linux got sync_file_range instead, which is not useful for this unfortunately despite the similar name.)
With a network share, you already configured authentication. SQLite would be easier and maybe even more secure since it does not increase the attack surface?
However, SQLite is not really meant to be used efficiently under concurrent access.
I haven't played with shares for a long time, but if I remember correctly, only one node can write to a file at a time. It's also questionable whether file read/write performance would be any good for this use case.
NFS or Samba servers have their own issues with attack surface.
Please rethink your idea.
Like using a front-end program to control traffic flow for read and write queries.
SQLite basically runs on everything and doesn't need a server to work with it.
If you use MySQL or Postgres, you need a lot more dependencies.
- local speed = 2 seconds
- NFS to another VM = 31 seconds
- NFS via localhost = 17 seconds
NFS is just about impossible (except in narrowly restricted situations)
I've always been suspicious that fdatasync() does not promise to durably commit the size metadata (and perhaps indirection blocks) on some OSes/versions, just the data blocks themselves -- such that it's only safe to use fdatasync() on already-allocated regions of the file. That was my understanding of fdatasync() when I first read about it, and I vaguely recall some DB handled this by rarely appending to files: occasionally it would pause writes (to that file), append a large block of zeros, fsync() to commit them, then resume writes.
It's not just me thinking this. From a quick Google:
> "The fsync system is not the only system call that persists data to disk. There is also the fdatasync call. fdatasync persists the data to disk but does not update the metadata information like the file size and last update time. Said otherwise, it performs one write operation instead of two."
fdatasync() is similar to fsync(), but does not flush modified metadata *unless that metadata is needed* in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (*st_size, as made by say ftruncate(2)*) *would require a metadata flush*.
FreeBSD says the opposite at https://www.freebsd.org/cgi/man.cgi?query=fdatasync&sektion=...:
The fdatasync() system call causes all modified data of fd to be moved to a permanent storage device. Unlike fsync(), the system call does not guarantee that file attributes or metadata necessary to access the file are committed to the permanent storage.
Thus "given SQLite is written to be so portable and cautious and has been around a long time" I'm surprised.
I'm not sure if we should believe the FreeBSD documentation, as current OpenBSD, NetBSD and Solaris have statements that work out similar to current Linux. However, if the documentation is to be believed, SQLite's durability using fdatasync() is broken on FreeBSD after a file append.
My point really is that if I recall correctly, what FreeBSD's man page says isn't an accident, it's indicative of what used to be the understood guarantee provided by most if not all implementations. Actual guarantees were not exactly specified, and I think there may have been a vague transition in the meaning of fdatasync() from "commits the data blocks" to "commits data blocks and the metadata necessary to access them" over time because the latter is obviously much more useful.
But that transition over time is why it seems a risk to assume it in highly portable software.
That said, back in the Linux 2.6 days even fsync() wasn't durable on consumer HDDs; it was on SCSI HDDs, though. Then for a while O_DSYNC (which is like fdatasync() after every write) was unreliable for data retrieval (see the Percona link in the GP post) but O_SYNC worked (which is like fsync() after every write).
** We do not trust systems to provide a working fdatasync(). Some do.
** Others do no. To be safe, we will stick with the (slightly slower)
** fsync(). If you know that your system does support fdatasync() correctly,
** then simply compile with -Dfdatasync=fdatasync or -DHAVE_FDATASYNC
#if !defined(fdatasync) && !HAVE_FDATASYNC
# define fdatasync fsync
AC_CHECK_FUNCS([fdatasync usleep fullfsync localtime_r gmtime_r])
SQLite's os_unix.c actually confirms my suspicions about fdatasync:
/* fdatasync() on HFS+ doesn't yet flush the file size if it changed correctly
That's to commit the directory change, which updates directory entries pointing to the file inode. File creation, renaming and deletion are directory changes, and are unconnected with the file size.
As the article says: "open and fsync() the containing directory in order to ensure that the file will still be there following reboot from a power loss".
Without the fsync() on the directory, after reboot a newly created file name may not exist at all in the target directory. That's separate from whether the inode size is updated. If you just create the file at its target location, fsync() and then abruptly reboot, it may not be found, yet fsck may find it and put it into /lost+found with the data and size intact, because data and size were committed by the file fsync() while the directory entry was not.
Another way to show this is by creating the file elsewhere and using link() or rename(). Then you can see the committed file data at the old name, while the target of link() or rename() may be missing after an abrupt reboot. This is why some atomic+durable file write patterns look like this: open a file under a temporary alternate name in the same directory, write(), fsync() on the file to commit data and metadata to the inode, then rename() to atomically replace the target filename, then fsync() on the directory to commit the directory change. (Some particularly careful programs go further than that, because rename() is not atomic on some network filesystems.)
(In the article, there are what appear to be typos where it says fdatasync() on the containing directory in some places and fsync() in others. It should always be fsync() on a directory.)
Can SQLite databases still attack the host that opens them?
To improve performance, many OSes also provided kernel extensions to provide a DB-specific alternative block-device API for faster "raw" I/O and something like "async" I/O, rather than going through the regular unix block layer. These extensions were probably only ever used by databases.
Why? Well, for benchmarking for example, or if the application you are using has a bad multi-user implementation and you don't want to migrate to another database...
$ ssh -t dbserver sqlite3 /path/to/database.db
"Bedrock [..] is a networking and distributed transaction layer built atop SQLite"
"... That query can be any SQLite-compatible query – including schema changes, foreign key constraints, partial indexes, native JSON expressions, or any of the tremendous amount of functionality SQLite offers."
write+sync is a speed limiter in general (even on SSDs), because of waiting for the syncs which are needed for commit durability. On HDD, serial transactions can't go faster than roughly twice the rotation time of the HDD per write, plus seeks, and on network filesystems they can't go faster than four times the network latency (two syncs per commit) plus whatever the backing store needs.
The trick to getting high commit throughput is to allow the durability part of multiple independent transactions to be committed concurrently. On a filesystem-backed DB this means writing multiple transactions per fdatasync. It would be tricky for SQLite to do this because of its architecture as a library and synchronous API, but not impossible.
It is a Redis module that embeds SQLite.
The idea is to run a medium-complexity project on top of a single runtime dependency, Redis.
You will find Redis in most applications anyway, and for a simple CRUD application SQLite is more than enough.
In this way you get rid of a lot of systems complexity. And it works both in memory (crazy fast) and with standard files.
Don't know if this is possible, or just how horribly it would perform, but it could be legitimately useful if it can solve e.g. the remote-fs conundrum mentioned in other comments.