Hacker News new | past | comments | ask | show | jobs | submit login
SQLite briefing for Linux kernel hackers (2019) (sqlite.org)
301 points by symisc_devel 11 days ago | hide | past | favorite | 88 comments

The simple, detailed and extensive documentation is one of the reason why I love SQLite and probably a major reason for its success as an open source project. There are so many open source projects that don't even provide developers with the architecture of the project or even a simple compile guide with the list of required dependencies. (In general, I've observed that BSD-licensed open source projects tend to be better documented. Wonder why?)

Besides the historical aspect of old BSD projects coming out of academia, I'm quite sure that the vast majority of Open Source projects created or backed by companies are either MIT or BSD licensed.

So MIT/BSD licensed projects are more likely to be created by professional developers. Some of them very high skilled and passionate, since the run of the mill enterprise rarely open sources software ;-)

I have yet to see in-house code made by "professional developers" that is not a flaming pile of dung when compared to the average open-source code. Most of the horrors stories I have are from proprietary codebase by a large margin. So your statement is very confusing to me :)

You’re taking two words in isolation and picking at them, without considering what the comment means beyond those two words. I think you need to modify what you think the commented means by those two words in the context of the whole comment. After all developers at “run of the mill enterprise” are still professionals, right, but this isn’t the only statement that relevant to what the commenter thinks is different about these efforts. The comment could have been clearer, sure, but I think it’s making a decent point.

I have not looked myself, but ...

people who reviewed the leaked MS source code remarked that it was unexpectedly clean with almost no profanity or inside jokes. I think this was mostly true for leaks from software companies.

Modeling code from academia, whether it has to do with weather prediction or covid prediction, however, appears to be (to put it lightly) subpar -- the covid 19 model that prompted the original lockdown was apparently using uninitialized variables and was non-deterministic. (And ... independently, was also completely wrong because of wrong assumptions - e.g., on 1-apr-2020 it predicted 95,000 dead in Sweden by 1-july-2020 if super strict lockdown is not enforced. The death toll on 1-july-2020 was 4,800)

Microsoft actually runs a program (policheck) that scans source for controversial comments. It automatically files bugs against the introduction of such comments.


You're comparing code written by software engineers to code written by scientists. There's a very well known "gap" in the quality of code produced by those two professions.

Can you please link to the code you're describing?

Most open source software today is written by professional developers. You've completely misunderstood my comment.

So the os is providing an abstraction so you don't need to know the underlying hardware, filesystems etc. Sqlite is all like, "yeah but we need to know what you're abstracting or everything is terrible."

This seems like a common problem in general.

Abstraction to a common interface is fine so if you don't care what the abstracted thing is you don't need to. Great. Read a basic file off spinning disk, ssd, cdrom, nfs - all the same, all good.

But if you do care you can't know what it really is because they've abstracted the underlying away with whatever layer(s) sit on top.

Should it be a question we need to consider when writing libraries, or any other abstraction, how does one penetrate your abstraction if they need to? Are you really, completely 100% sure they don't and shouldn't?

There's a kind of arrogance that we fall into making decisions for others when we design APIs, library interfaces, etc where we know what we're exposing and users don't know what's good for them. Maybe that's right sometimes, i don't know?

This sqlite paper casts some pretty strong doubt on that as an approach, imho.

What do you think? Have we all been missing something like this in api design?

The filesystem, and the file itself, are perhaps the best abstractions ever designed in computing. A database sometimes uses the file API when actually wants to access the disk in a lower-level fashion, for which APIs are fragmented, non stable, non portable. So it's not a fault of the abstraction, but the fact that a different low-level abstraction to access the storage in a way databases need to do is missing from POSIX.

Not really...

They are one of the _first_ abstractions, and they are quite fundamental. That's why they all still exist in the way they are.

Just look at the name: "file". And "directory". Where do these terms come from? Right, from organizing things, mainly offices. And offices are organized like that because this is what is practical under the constraints of humans working with paper in a three dimensional space.

But looking at it semantically, almost no one really wants to only use files or directories. We mostly want to use semantical tags. And one kind of tag can be a hierarchical "directory/file" tag - fair enough. But that's only a small part of the picture, because we also want to tag our objects/entities by date, creator, department, security clearance level, how long they have to be archived and so on. Using just a filesystem does not provide this, so we have to build on top of that.

And you can see that this is changing. Look at phones nowadays. They don't use this anymore. Pictures are stored in collections/groups, can be tagged and sorted in various ways. They can be seen on a geographical map. And much more. No one has a directory/file structure in their head.

So the question is: does the OS still need to support that? Actually, maybe not. Maybe SSDs today can work much more efficient if we don't assume everything must be a file.

Just wait, we will get there.

Completely agreed. Every serious project I've been in has been striving to invent storage that's as close to their problem as possible. And I do the same in my hobby projects.

The file/directory abstraction survived for a long time but nowadays it's often a limitation more than a convenience.

I think Ceph has the right model. The only abstraction is physical disk to addressable binary/object storage. Everything else is implemented on top of that so you can do S3, filesystems, databases all close to the metal.

> There's a kind of arrogance that we fall into making decisions for others when we design APIs, library interfaces, etc where we know what we're exposing and users don't know what's good for them. Maybe that's right sometimes, i don't know?

Usually it's right, and at least you're putting thought into it. API design is not easy, hence why it gets fucked up so often.

I just went through explaining to someone why they shouldn't go pawing through /proc to find details on a running process. How there were commands like 'strace' and 'lsof', or even running a debugger that could better achieve their goals. Often people just aren't aware.

The option to bypass layers of abstraction should be there, but the option should be off by default, and come with plenty of warnings akin to "if you're removing the cylinder head, we assume you know what you're doing, and you assume all responsibility."

Other databases already bypass the kernel routinely; I'm reminded of Oracle which could use a partition or even whole disk directly. And yes, I know SQLite is not like other databases; that's why it gets tricky.

Honestly, the stuff they lay out is the kind of customer requests I dream of getting. I would gladly implement these, where they don't break backwards compatibility, and where they are off by default but can be turned on, in a fashion where it doesn't upset/surprise/break currently majority usage.

Any abstraction leaks. Good abstractions do not leak for common use cases and provide necessary access to the internals for uncommon use cases.

Which os is a good abstraction by this metric?

Yes that's a leading question.

I consider the libraries of the languages that allow the programmers to do everything (e.g. Python or JavaScript) are good at this: The library authors signal that these functions should not be called and these can be, but ultimately if you’re monkeypatching or something you can really do this. Obj-C is also a good language in doing this too…

Which metric? Leaky abstractions is a "Law", not a metric.

Are you looking for experience?

[ see Wikipedia for contextual meaning of law here, https://en.wikipedia.org/wiki/Law_(principle) ]

I think by "metric" he means your definition of a good abstraction, not the fact that all abstractions leak.

>There's a kind of arrogance that we fall into making decisions for others when we design APIs, library interfaces, etc where we know what we're exposing and users don't know what's good for them

This pops up a lot when libraries are designed with functions that are non-public. I only want to access the underlying function, not use your abstraction on top of it. You can't anticipate every possible use case of your library!

> You can't anticipate every possible use case of your library!

On the flipside, I consider a successful software tool is one that is used to do something undreamt of by its author.

One should not try to imagine every possible use to which their software will be put, especially when designing abstractions or interfaces. But these abstractions should be flexible enough that they can be adapted for any use.

There might be very good reasons that the underlying function is not public (I can think of private methods in OO where end users absolutely should not be calling them). To me, this seems like the public interface is lacking or perhaps too rigid, or maybe not well-understood (either by designer or user).

In any case APIs are hard.

This is a great point, that APIs often abstract over things that some people need finer grained control over. I work on an API that was too high level, and was recently overhauled in a backward incompatible way due to the limits inherent in the abstractions, limits and abstractions that most of the time are a good idea, but were getting in the way of professionals. So I’m inclined to agree on one hand.

On the other hand, the OS filesystem APIs are detailed enough for most people (and even for most of the OS internals too). I’ve never wanted to know sector sizes or power loss characteristics for any reason other than curiosity. The needs of SQLite aren’t very common. Most developers writing code to use the filesystem don’t need (or even want) to know if a file was renamed or what happened to it before last reboot or whether it can be mmap()ed with other processes. I can barely manage to use rsync, and if I was burdened with these things whenever I need to open or write a file, I’d use SQLite as my API instead of fopen(), which puts me right back at using a high level API (https://www.sqlite.org/whentouse.html).

All APIs present an interface over the resources they manage, so an abstraction is inherent, even if the API is low level. But usually part of the goal of API design is to make the interface easier to use than the resources it manages. If we expose an API that is too low level, then we compromise on usability and safety for everyone in favor of performance and detailed control for a very small minority of users.

So personally, I think I agree with your general concern in the sense that API designers should be aware of the needs of who they’re designing for, but I also think SQLite and the Linux filesystem aren’t a good example.

I’m not sure if I would call SQLite’s needs incommon — in a perfect world, every app should have SQLite’s properties like preserving user data on power loss, etc… It’s just that other apps just don’t care enough about these issues as much as SQLite.

For example, I’m pretty sure I would like every app to automatically handle cases where the underlying file gets renamed or removed. (AFAIK apps that use the Cocoa API automatically does these.) However there really isn’t any POSIX API that allows handling these kind of situations — and to do these now you need to do all of these nitty-gritty things that SQLite does.

In fact, I think that the reason why SQLite needs all of these nitty-gritty details that does everything is because of an insufficient abstraction. If the file APIs were better and didn’t require everyone to consider whether you have to call fsync or not, SQLite wouldn’t have needed any of these. But well POSIX have flied and here we are.

The API doesn’t need to expose whether data is preserved on power loss. Normally these days, it is, and it’s implemented underneath the API.

I don’t think the POSIX API is perfect by any stretch, but there is a real downside to going lower level, in terms of productivity, in terms of mis-use and accidents, in terms of security, and in terms of ability to support old hardware, heterogeneous storage devices, networked storage devices, etc. If there was a huge and unanimous demand for the low level features SQLite uses, the API would already expose those features.

APIs sometimes are too high level, and sometimes are too low level. Getting it wrong in either direction has negative consequences. And getting it exactly right is hard. In any case, I’m unconvinced that the Linux filesystem API is abstracted at the wrong level or represents a good example of an API that is too high level.

Files are a good abstraction to start from.

But to be honest, persistence & durability of data are the goal of Storage and thus what the filesystem API needs to be designed for.

The API needs to clearly be designed for atomicity, durability & to allow coordination across multi-file structures.

This can be achieved by either 1) indicating all required synchronizations to callers in metadata, or 2) fully encapsulating durability by making the API atomically commit every operation.

I don't agree these factors should be hidden, since a losing/ corrupting data are fundamentally at odds with the goals of a Storage API.

It's not just SQLite, any application needing to maintain a datastore needs reliable storage.

> What do you think? Have we all been missing something like this in api design?

No. It's hard to design, maybe harder to design an API and perhaps a misunderstanding when APIs are designed for others or an API just for an API.

As others have already commented about the leaking abstractions (The Law of Leaky Abstractions) which is a good description IMHO to make the non-boundary clear even when it was (non-)intended.

This can be found as well in Hyrum's Law, which IMHO is from a different point of view but reminds as well that the concept to just API for API is conceptually broken.

> Sqlite is all like, "yeah but we need to know what you're abstracting or everything is terrible."

I'm wondering where you pick the terrible from, I have not found that in the article. And it's absolutely not "all like". The paper describes common problems with file-systems on what Sqlite does which is standard programming - at least for Sqlite authors.

You seem not to be much aware of these kind of problems and with all due respect, this creates the impression that you have not experienced this much yet. This aligns well with the localization by yours as something "missing in the design" where I'd like to offer a different view: It's missing practice. To keep the colloquial tone:

Despite what has been missed in API design, we all haven been missing practising writing software.

https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a... https://www.hyrumslaw.com/

> There's a kind of arrogance that we fall into making decisions for others when we design APIs, library interfaces, etc where we know what we're exposing and users don't know what's good for them. Maybe that's right sometimes, i don't know?

So when someone who creates an Open-Source library creates an API that you don't like, they are arrogant? Interesting take.

Today I learned:

A typical Android phone has hundreds of SQLite databases and does more than 5 gigabytes of database I/O per day....

F2FS is a log-structured filesystem for Linux that has (limited) atomic write capabilities. SQLite is able to use the atomic write capabilities of F2FS to bypass the rollback journal. Anecdotal reports are that an Android phone that is reformatted to use F2FS is noticeably faster.

F2FS make an old tired phone feel like a perky new phone.

Yet 3 years ago Samsung (who invented F2FS) wasn't using it https://www.reddit.com/r/Android/comments/6txvqx/why_arent_a...

Has the situation changed?

interesting, thank you …

> Are sector writes atomic? And if so, for what size sectors?

As a PostgreSQL hacker I have the same question, and I know the MySQL people do too, and I wrote some more specific questions and backgound, over here: https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

I think the answer to both questions is "sometimes", and most kernels go out of their way to avoid ever answering this question. They just don't want responsibility for it.

Because kernel devs are not able to answer that. It is beyond kernel control as SSDs are backed by nested operating systems under the hood with own bugs and non-determinism.

But to be honest, persistence & durability of data are the primary goal of a Storage API.

I don't agree these factors should be hidden, unless the API guarantees every single operation is atomic & durable.

Full details as to atomicity & durability need be available to callers (not just SQLite, but any application that wants to provide a reliable datastore) so that they can write information durably including across coordinated multi-file data structures.

To be fair, the solution here would be for the kernel devs to ask the "SCSI council" (or whoever picks the ATA opcodes these days) and the NVMe sig to create a mechanism by which the device can communicate its durability guarantees, if any. If they decline to create one then the kernel devs can just declare one (even if nobody implements it). Failure to implement the API is, of course, equivalent to reporting "no guarantee" for all queries.

Then they could at least say "we're doing everything we can; next time you buy a storage device look for feature X".

I’m really impressed with how... humble this briefing is. It has no prelude, no author introduction, no ego whatsoever—-just short descriptions of SQLite behavior and a proposal to make it faster. I’m feel like this format would be amazing if combined with hyperlinks to places I could find detailed info or code.

Every time you give a talk whose first slide is a "bio slide", Baby Jesus cries.

And I barf.

Because context is always meaningless, right?

To Baby Jesus?

This document is from 2019 and contains a number of questions posed to the kernel hackers/filesystem developers, for example:

>Pages are sorted so that they are written in increasing order. → Is this helpful on Linux? Or can we just as well skip the sort and write the pages in any arbitrary order?

Does anyone know if there are responses to these questions available anywhere?

Edit: The recorded talk is available here: https://www.youtube.com/watch?v=-oP2BOsMpdo&t=5541 but it's a lightning talk and the discussion was very brief.

They must have gotten answers because to many of the questions the only good answer is it depends, e.g. "can I tell the os that a section of the file is unused" sure fcntl(F_FREESP) and you have a sparse file, but it is not obvious it works nor help with performance.

fcntl(F_FREESP) only works on old UNIXen. The Linux API equivalent is fallocate(FALLOC_FL_PUNCH_HOLE)

"Are sector writes atomic? And if so, for what size sectors?"

For a database, this is not the question you want to be asking. It's a rabbit hole. The right answer (which is not always the answer we want, so we keep asking): no one knows.

It's better design to err on the side of assuming disk sector writes are never atomic. This would be taking the security stance, e.g. assume an attacker is actively trying to break your protocol, and use cryptographic guarantees to prevent or at least detect them.

Kernel developers also can't answer these kinds of hardware questions. There's too much hardware out there.

Disks may give atomicity guarantees and then you discover edge cases where these degrade, or disks may emulate Advanced Format 4096 byte logical sectors but have smaller 512 byte physical sectors underneath. Disks are too complicated (and firmware too buggy) to trust any kind of guarantee. There are too many layers of abstraction.

Instead of trying to figure out the safety of the layer beneath you, assume the worst and bring the design back to using end-to-end guarantees.

If you assume and plan for the worst, then you don't need to ask the question, and you can handle the worst of hardware, without surprises when those guarantees are broken.

In fact, with file systems in general, the situation is even worse than for disks, because of disk corruption (3% per 36 month period per disk) and misdirected writes, which most file systems unfortunately don't handle. You should never trust any metadata provided by the file system about your journal file (e.g. file size) and use that as part of your recovery protocol, because the file system itself is storing this kind of metadata on the same disk you're writing too. Again, something as important as the metadata size of your journal file should be protected by your own end-to-end protocol.

Filesystem writes don't always correspond directly to storage sector writes.

Sometimes it's possible for the filesystem API to guarantee that "sector"-size writes are atomic, even though the underlying storage doesn't do that.

For example a filesystem which uses copy-on-write when overwriting data, by appending each written block to a journal and checksumming or checkpointing the journal, is likely to offer that property.

Sometimes a block device knows that sector writes are atomic too. For example some battery-backed RAID controllers can reasonably guarantee this. Of course it fails if the battery runs out, but the abstraction is intended to assume you never allow that to happen.

It can go the other way as well. When underlying storage provides guaranteed atomic sector writes, the overlaid storage might not. RAID-5 does not provide atomic sector writes even when the underlying storage units do.

It would be useful for them to report when they have that property, because it would allow SQLite to use fewer writes. In effect, if the filesystem already uses a mechanism to ensure atomicity at some performance cost, there's no need for SQLite to use a second mechanism on top at a second performance cost.

It's actually worse than non-atomic. Writing to a sector on RAID-5 (without battery backup) corrupts other sectors too, during the time window until they are all made consistent. I call this the "radius of destruction", and I'm pretty sure SQLite and other software ignores this problem because there's no API for finding out about it.

So filesystems could report:

- "Don't know" for when writes go direct to the underlying block device. Or better, report whatever the block device reports for this query, which should be "don't know" in most cases.

- "Yes it's atomic" for when the filesystem layer (or block translation layer, flash translation layer or whatever there is) knows that it provides a reliable atomic-block-write abstraction on top of storage which doesn't provide that.

- "Watch out, we don't even guarantee your sector write won't corrupt related sectors in a geometry group..."

This is a lovely document, typical of SQLite authors and why it's such a great piece of software.

If they are going to ask kernel filesystem authors to provide a better API, I have long felt there is one obvious API type which is missing from Linux and would be better for filesystem databases than other approaches:

That is, like fdatasync() (including committing size & filesystem structure blocks needed to retrieve committed data later), but only committing the data within a set of byte ranges, returning when that's done. So that it doesn't have to wait for all the rest of the data to be written.

That would allow the WAL or journal to be efficiently part of the database file itself. No need for extra files, file creation and deletion, directory operations or syncing directories. That's a bunch of durable metadata writes that can be removed, and filesystem implementation details that can be avoided.

It would also allow atomic and durable writes without the need for writing twice in some cases, depending how the data is structured. (First write to the WAL or rollback journal, then to the database main file). That's a bunch of data writes avoided.

In cases where it can avoid two writes, that would also remove the intermediate barrier sync, doubling the commit-to-sync throughput.

Ideally like other I/O (or at least reads) that should be available in a sensible async manner too, with completion when the covered data is committed durably, and only that data. I'm not sure what system API would provide good async I/O for something like SQLite though, if it's not able to use AIO or ui_uring due to kernel API being poorly suited to a self-contained library.

Finally, a couple of variant (probably via flags).

"No rush" flag, to let the caller wait until those ranges are committed durably, but not force them to be written faster than normal. That would allow ordinary fast buffered-write speed, while at the same time providing the usual ACID guarantees that a DB returns success from COMMIT when the data is durably committed. For some workloads that would be fine.

"Barrier" flag, make the call return immediately but delay all subsequent writes on the same filedescriptor until the ranges are synced durably. This is similar to what's mentioned in the article, but more versatile by attaching to ranges. It's also not strictly necessary if you have the "no rush" flag.

The api needs existing filesystems to have fdatasync_rangev functionality implemented. It is likely it would not avoid any complexity, just move it elsewhere.

I agree with you, but the article puts in a polite request for filesystem changes already under "potential enhancements".

My argument is if there were to be some work on potential enhancements that require kernel filesystem changes and syscall changes, fdatasync_rangev would be a more useful choice of enhancement, and probably easier to implement than fbarrier.

Better for performance, user convenience (no more separate files), and more kinds of applications (e.g. VMs and other kinds of in-file data store).

It's not as big as it sounds. fsync_range is already in FreeBSD, NetBSD and AIX, and Linux nearly got it too (there was a patch). (Linux got sync_file_range instead, which is not useful for this unfortunately despite the similar name.)

I’d really love SQLite to work on the nfs and cifs. Is there anything that can be done to make that safe at a kernel level?

Wouldn't it just be easier to use MySQL or Postgres in that situation? I mean if you are connecting to a network share, why not use a database over the network instead?

because sqlite is way simpler to deploy. MySQL and Postgres require being set up, allow outside connection but you need to be careful not to open to much, you need to create a database, a user, set a password, grant privileges to the user on a database, configure the clients to use this user and this database.

With a network share, you already configured authentication. SQLite would be easier and maybe even more secure since it does not increase the attack surface?

however, SQLite is not really meant to be used concurrently efficiently.

To be honest, you are replacing one set of problems with another. The rationale about configuration does not make a lot of sense either, better you learn how to do things right than what seems easier.

I haven't played with shares for a long time. But if I remember only one node can write to a file at once. That's also to say if file read/write performance will be any good for this use case.

NFS or Samba servers have their own issues with attack surface.

Please rethink your idea please.

Can you simulate concurrency with SQLite somehow?

Like, using a front program to control traffic flow for queries to read and write.

You could add a queue and execute the queries one at a time. Probably SQLite or at least popular wrappers for it already do this? But it will be slower than a DBMS that allows you to make hundreds of queries in parallel.

Didn’t SQLite say that their lock/unlock process is really fast? Such that it doesn’t really pose a significant drawback.

Depends how portable you need it to be.

SQLite basically runs on everything and doesn't need a server to work with it.

If you use MySQL or Postgres, you need a lot more dependencies.

Sqlite is in loads of things, Firefox, VSCode etc., if you have NFS home directories you can have issues. KDE’s akonadi uses MySQL via localhost, though it can’t handle you being logged in more than one place. My view is that a JSON or XML file would work a lot better.

You could also look into Firebird , there is an embedded server which provides you with more oomph , is easy to use and has a small footprint.

The SQLite docs seem to caution against some broken NFS implementations and mentions Apple explicitly, but otherwise imply that NFS should work.

It’s very slow at writing over NFS. I wrote a simple python script writing 1000 entries, committing after each one, in WAL mode. In Fedora 22 VMs, NFS4.2 default settings, no security, SSD on host machine.

  - local speed = 2 seconds
  - NFS to another VM = 31 seconds
  - NFS via localhost = 17 seconds

Interesting. 17-31 ms per write is about what a write and sync used to cost on spinning disks. Could still be ok for some use cases.

I tried the same test on a different system with spinning drives, it took 1 minute 43 seconds. The resulting file is 12kb (all inserts, no deletes or updates). It’s sync writes that are slow.

CIFS is do-able, but you would need the database to have a special mode for it, and for some extra API to be exposed by the kernel (byte-level locking)

NFS is just about impossible (except in narrowly restricted situations)

I'm very surprised that SQLite trusts fdatasync() when appending to a file, rather than fsync(), given SQLite is written to be so portable and cautious and has been around a long time.

I've always been suspicious that fdatasync() does not promise to durably commit the size metadata (and perhaps indirection blocks) on some OSes/versions, just the data blocks itself. Such that it's only safe to use fdatasync() on already allocated regions of the file. That was my understanding of fdatasync() when I first read about it, and I vaguely recall some DB handled this by rarely appending to files, instead doing it occasionally by pausing writes (to that file), appending a large block of zeros to the file then fsync() to commit, then resuming writes.

It's not just me thinking this. From a quick Google:

> https://www.percona.com/blog/2018/02/08/fsync-performance-st...

> "The fsync system is not the only system call that persists data to disk. There is also the fdatasync call. fdatasync persists the data to disk but does not update the metadata information like the file size and last update time. Said otherwise, it performs one write operation instead of two."

For what it's worth, `man fdatasync` on linux says (emphasis mine)

  fdatasync() is similar to fsync(), but does not flush modified metadata *unless that meta‐
  data  is  needed*  in  order to allow a subsequent data retrieval to be correctly handled.
  For example, changes to st_atime or st_mtime (respectively, time of last access and  time
  of  last  modification; see inode(7)) do not require flushing because they are not neces‐
  sary for a subsequent data read to be handled correctly.  On the other hand, a change  to
  the file size (*st_size, as made by say ftruncate(2)*), *would require a metadata flush*.

It does say that for Linux now, but it didn't used to, and it doesn't on other OSes.

FreeBSD says the opposite at https://www.freebsd.org/cgi/man.cgi?query=fdatasync&sektion=...:

  The fdatasync() system call causes all modified data of fd to be moved to
  a permanent storage device.  Unlike fsync(), the system call does not
  guarantee that file attributes or metadata necessary to access the file
  are committed to the permanent storage.
I.e. "does not guarantee that ... metadata necessary to access the file are committed".

Thus "given SQLite is written to be so portable and cautious and has been around a long time" I'm surprised.

I'm not sure if we should believe the FreeBSD documentation, as current OpenBSD, NetBSD and Solaris have statements that work out similar to current Linux. However, if the documentation is to be believed, SQLite's durability using fdatasync() is broken on FreeBSD after a file append.

My point really is that if I recall correctly, what FreeBSD's man page says isn't an accident, it's indicative of what used to be the understood guarantee provided by most if not all implementations. Actual guarantees were not exactly specified, and I think there may have been a vague transition in the meaning of fdatasync() from "commits the data blocks" to "commits data blocks and the metadata necessary to access them" over time because the latter is obviously much more useful.

But that transition over time is why it seems a risk to assume it in highly portable software.

That said, back in the 2.6 days Linux even fsync() wasn't durable on consumer HDDs, it was on SCSI HDDs though. Then for a while O_DSYNC (which is like fdatasync() after every write) was unreliable for data retrieval (see the Persona link in GP post) but O_SYNC worked (which is like fsync() after every write).

From SQLite's unix.c:

    ** We do not trust systems to provide a working fdatasync().  Some do.
    ** Others do no.  To be safe, we will stick with the (slightly slower)
    ** fsync(). If you know that your system does support fdatasync() correctly,
    ** then simply compile with -Dfdatasync=fdatasync or -DHAVE_FDATASYNC
    #if !defined(fdatasync) && !HAVE_FDATASYNC
    # define fdatasync fsync

From SQLite's autoconf/configure.ac:

    AC_CHECK_FUNCS([fdatasync usleep fullfsync localtime_r gmtime_r])
In other words, it automatically uses fdatasync if that's available on any target OS.

SQLite's os_unix.c actually confirms my suspicions about fdatasync:

    /* fdatasync() on HFS+ doesn't yet flush the file size if it changed correctly
Given its portability, and the folklore around fdatasync not flushing the file size on other OSes at some time (and the current FreeBSD man page), I'm still surprised the code as written assumes the size issue only exists on OSX.

Note that they talk about opening and calling fdatasync on the parent folder? That is to handle that change in size

No it isn't.

That's to commit the directory change, which updates directory entries pointing to the file inode. File creation, renaming and deletion are directory changes, and are unconnected with the file size.

As the article says: "open and fsync() the containing directory in or to ensure that the file will still be there following reboot from a power loss".

Without the fsync() on the directory, after reboot a newly created file name may not exist at all in the target directory. That's separate from whether the inode size is updated. If you just create the file at its target location, fsync() and then abruptly reboot, it may not be found, yet fsck may find it and put it into /lost+found with the data and size intact, because data and size were committed by the file fsync() while the directory entry was not.

Another way to show this is by creating the file elsewhere and using link() or rename(). Then you can see the committed file data at the old name, while the target of link() or rename() may be missing after abrupt reboot. This is why some atomic+durable file write patterns: Open a file under a temporary alternate name in the same directory, write(), fsync() on the file to commit data and metadata to the inode, then rename() to atomically replace the target filename, then fsync() on the directory after rename() to commit the directory change. (Some particularly careful programs go further than that because rename() is not atomic on some network filesystems.)

(In the article, there are what appear to be typos where it says fdatasync() on the containing directory in some places and fsync() in others. It should always be fsync() on a directory.)

> SQLite databases are commonly used as storage containers for sending structured content across the internet.

Can SQLite databases still attack the host that opens them?


Is there any database that just defines their own partition type? It always seemed strange to me that databases are so optimized, but still operate on top of the filesystem abstraction.

Quite a few databases run on raw partitions. Often the "big, old" ones. Oracle, DB2, Sybase come to mind.

To improve performance, many OSes also provided kernel extensions to provide a DB-specific alternative block-device API for faster "raw" I/O and something like "async" I/O, rather than going through the regular unix block layer. These extensions were probably only ever used by databases.

now virtual machines systems are happy to find them.

Some DBs use direct disk access

Does someone know if there is something like an SQLite server, so that it could be used like a 'normal' database server?

Why? Well, for benchmarking for example, or if the application you are using has a bad multi-user implementation and you don't want to migrate to another database...

My Datasette project is leaning in that direction: it's a web application that provides a JSON API (or GraphQL with the datasette-graphql plugin) to execute read queries against your database, and I've been adding write functionality to it as well.


Does someone know if there is something like an SQLite server, so that it could be used like a 'normal' database server?

    $ ssh dbserver -c sqlite3 /path/to/database.db
Is as close as you’re going to get. You can have multiple users doing this but all transactions are serialised so don’t expect concurrent writes to be performant - concurrent SELECTs will be fine.

Possibly Bedrock fits your needs: https://bedrockdb.com

"Bedrock [..] is a networking and distributed transaction layer built atop SQLite"

"... That query can be any SQLite-compatible query – including schema changes, foreign key constraints, partial indexes, native JSON expressions, or any of the tremendous amount of functionality SQLite offers."

Perhaps dqlite? https://dqlite.io/

AFAIK SQLite can do only one transaction commit at the same time even if the commits do not clash. Thus, it is not useful for multiple writing users use cases.

But those commits are FAST - tens of thousands of writes a second, easily. So if you want to handle multiple writing users at once, put the writes in an in-memory queue and send them to SQLite as fast as it will accept them.

If you mean tens of thousands of commits a second, it can only go that fast on SSD or battery-backed HDD.

write+sync is a speed limiter in general (even on SSDs), because of waiting for the syncs which are needed for commit durability. On HDD, serial transactions can't go faster than roughly twice the rotation time of the HDD per write, plus seeks, and on network filesystems they can't go faster than four times the network latency (two syncs per commit) plus whatever the backing store needs.

The trick to getting high commit throughput is to allow the durability part of multiple independent transactions to be committed concurrently. On a filesystem-backed DB this means writing multiple transactions per fdatasync. It would be tricky for SQLite to do this because of its architecture as a library and synchronous API, but not impossible.

I guess with the introduction of BEGIN CONCURRENT resolves this.


I am the one behind https://RediSQL.com

It is a Redis module that embed SQLite.

The idea is to run medium complexity project on top of a single runtime dependencies, Redis.

You will find Redis in most applications anyway, and for a simple crud application sqlite is more than enough.

In this way you get rid of a lot of systems complexity. And I works both in memory (crazy fast) or with standard files.

What could be really useful would be that can be exposed as a unix socket in a way that is transparent to sqlite, ie it still believes it's working with a file.

Don't know if this is possible and just how horribly it would perform, but it could be legit useful if it can solve e.g. the remote fs conondrum mentioned in other comments.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact