For people interested in this topic I can recommend "Can Applications Recover from fsync Failures?" by A. Rebello et al. [1]. The paper analyzes how file systems behave when fsync fails and how PostgreSQL, LMDB, LevelDB, SQLite, and Redis react. It shows that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption.
A nice outcome here for distributed SQL databases that store multiple copies of the data by default is that they can just fail over on an fsync failure and not try to "handle it". If fsync starts failing with EIO, there is a good chance the disk is going to die soon anyway.
That depends. Replicated state machines need to distinguish between a crash and corruption. Most systems don't do that. It can be disastrous to truncate the journal when encountering a checksum mismatch for instance.
See "Protocol-Aware Recovery for Consensus-Based Storage" for more research on that topic. [1][2]
I was so glad to see you cite not only the Rebello paper but also Protocol-Aware Recovery for Consensus-Based Storage. When I read your first comment, I was about to reply to mention PAR, and then saw you had saved me the trouble!
UW-Madison are truly the vanguard where consensus hits the disk.
We implemented Protocol-Aware Recovery for TigerBeetle [1], and I did a talk recently at the Recurse Center diving into PAR, talking about the intersection of global consensus protocol and local storage engine. It's called Let's Remix Distributed Database Design! [2] and owes the big ideas to UW-Madison.
An fsync() failing doesn't necessarily mean there is disk corruption. I agree all logging and recovery protocols should handle corruption differently from, say, a torn tail of the log, but I view that as mostly orthogonal.
I'm talking about exiting the process if fsync() fails and letting the distributed database's normal failover processing do its thing. This is a normal scenario for a failover (i.e., it's the same as the process crashing or getting OOM-killed by Linux, etc.).
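To make that concrete, here's a minimal sketch of the "don't handle it, just die" approach (Python, with a made-up function name); the key point is that the failed fsync is never retried:

```python
import os
import sys

def append_durably(fd: int, record: bytes) -> None:
    """Append a record and force it to stable storage, or die trying."""
    os.write(fd, record)
    try:
        os.fsync(fd)
    except OSError as err:
        # Don't retry: post-fsyncgate, a retried fsync can falsely
        # succeed because the kernel may have already dropped the
        # dirty pages. Treat this like a crash and let failover run.
        sys.stderr.write(f"fsync failed: {err}; exiting for failover\n")
        os._exit(1)  # no cleanup; the replicas own recovery now
```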
You're probably aware of this, but for the sake of others reading:
Crashing after an fsync failure isn't sufficient if you're using buffered IO either. Dirty pages in the page cache can cause your consensus implementation to, e.g., vote for two different leaders in the same term: if the machine crashes at some future point and the dirty page contained the log record for that vote, your process restarts after the machine restarts, no longer has access to the dirty page, and may vote again if asked.
Edit: Thought I'd add the series of steps for this to happen to make it clearer:
1. Process X receives an RPC from Process Y to vote for Y as leader in term 1.
2. Process X writes log record containing vote information to page cache and calls fsync.
3. Process X receives EIO in response to fsync (if it is lucky...) and crashes.
4. Process X restarts and receives a retry of the same RPC from Process Y for term 1 and this time it responds affirmatively because it can see it already voted in term 1 for Process Y, which _should_ be safe.
5. Process X crashes because the machine hosting X experiences a temporary hardware failure.
6. Process X restarts after the hardware failure and receives a vote request from Process Z for term 1 and responds affirmatively, because the dirty page that was visible back at step 4 never actually made it to disk.
Consensus is now (potentially) broken from electing two leaders for the same term.
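A rough sketch of that read path in Python (the vote.json path and record format are made up); the comments map to the numbered steps above:

```python
import json
import os
from typing import Optional

VOTE_PATH = "vote.json"  # hypothetical location of the durable vote record

def record_vote(term: int, candidate: str) -> None:
    # Step 2: the record lands in the page cache first.
    with open(VOTE_PATH, "w") as f:
        json.dump({"term": term, "candidate": candidate}, f)
        f.flush()
        os.fsync(f.fileno())  # Step 3: may raise OSError (EIO); we crash.

def current_vote() -> Optional[dict]:
    # Steps 4 and 6: this read is the trap. After a *process* restart it
    # can be served from the still-dirty page, so we correctly refuse to
    # vote twice. After a *machine* restart the dirty page is gone, the
    # file looks empty or stale, and we may vote for Z in the same term.
    try:
        with open(VOTE_PATH) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```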
And yet many non-Byzantine consensus protocols are equipped to handle the network fault model, which could be seen as equally Byzantine under this definition.
The problem is really that many formal proofs of consensus have focused only on the network fault model, and neglected the storage fault model.
Both network/storage fault models require practical engineering efforts to get right. I think a better term for this is “near-Byzantine” fault tolerance. It's what non-Byzantine fault tolerance looks like when implemented correctly in the real world—the GP comment is a great example of how to approach and think about this from an engineering perspective.
"near-Byzantine" is not a very clear term you can reason about. A system is either Byzantine-fault-tolerant, in which case it handles all Bizantine faults, or it is not. A system that is tolerant to some faults (that you may want to call "Byzantine") is not BFT.
You don't call plaintext SMS "tamper-resistant" because it resists some simple tampering. You don't call your house "FBI-resistant" because you managed to convince them once to turn around.
A Byzantine fault is clearly defined as a case where a specific node may be doing anything, including not knowing it has failed, including malicious behavior. It is important that people know what class of faults their system is designed to resist; for Raft/Paxos, it is NOT Byzantine faults. Those systems are pretty great, but pretending they aim at BFT is dangerous misinformation...
What then would you specify as the clearly defined storage fault model for non-Byzantine protocols such as Paxos/Raft that rely on stable storage for correctness?
Anything is possible with Byzantine faults on the specific failed node: it will not remember voting, it will not remember to vote, it will not remember its identity, etc. Paxos/Raft are not tolerant to a minority of nodes exhibiting those kinds of faults, only to a minority of nodes being unreachable or partitioned.
Remember that the Byzantine generals had traitors among them, not merely communication issues.
To be clear, if fsync() on Linux had well-defined behavior on errors (no matter the file system used), I wouldn't suggest failing over as a reasonable solution for a distributed database. It's mentioned in the parent video, but fsyncgate [1] was a recent reminder of this.
Don't get me wrong, I run SQLite in production with millions of records, but it seems people have only discovered it very recently.
All of a sudden, there's always a post about it with hundreds of comments.
Author here. I agree there's been a resurgence lately. My working theory is that the performance of the underlying hardware has increased so much in the last decade that the simplicity of SQLite has become a more important tradeoff than the extra performance boost you may see with Postgres/MySQL. Also, network overhead tends to dominate small queries so SQLite can actually end up a lot faster since it's in-process.
“the performance of the underlying hardware has increased so much in the last decade that the simplicity of SQLite has become a more important tradeoff”
Thanks for articulating this so clearly! It's also going to be interesting to see how much simplicity is realized from new thread-per-core designs leveraging io_uring to get rid of the multithreading, locks, and memory barriers scattered throughout the control plane. The performance dynamics are all changing, and it's easier than ever to write simple, fast systems that hit the 90% sweet spot.
You've probably also seen McSherry et al.'s "Scalability! But at what COST?" [1]
It seems to come in waves; there was another round of SQLite posts hitting the front page around 6 months ago, as I recall.
I'm not sure why, SQLite is the most widely deployed database engine on the planet, so it's not as if the technology is really "gaining" in popularity ... it's already there.
I was always curious if this is due to the huge number of new developers, many front-end, who have joined IT over the past decade or so, now venturing out and beginning to learn about relational databases and their benefits.
That's true as far as it goes, but also Ben works here, exclusively on the open-source Litestream project, so his blog posts end up on our blog sort of naturally.
The new edge computing fad. These companies need a way to bring the database to the edge too, so they are heavily promoting SQLite as the best thing since sliced bread, to sell their services as the one solution to all your problems :)
Thanks! The WAL mode is really interesting because it changes the design just a little bit but you get a ton of additional benefits. I have that post in the works right now.
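In the meantime, for anyone curious, turning it on is a one-line pragma. A minimal sketch (the synchronous pairing is a common choice, not a requirement):

```python
import sqlite3

conn = sqlite3.connect("app.db")
# Persistent setting: the journal mode is recorded in the database header.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal"

# With WAL, readers and the single writer stop blocking each other:
# readers keep a stable snapshot while the writer appends to app.db-wal.
conn.execute("PRAGMA synchronous=NORMAL")  # common pairing with WAL mode
```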
Whenever I read articles or releases by Fly.io, it makes me want to work with them. As a newbie who's interested in databases and networking, I reckon it would be an amazing experience to be on the team.
Unfortunately, they do not hire interns or the like (according to their jobs page), and I assume, from the amazing work they do, they aren't looking for beginners. =)
EDIT: Thanks to people encouraging me to apply. I'll definitely try!
We do hire entry level people, in fact we're wrapping up our first cohort of interns right now. Exclusively hiring experienced dudes from the tech bubble is a failure mode in the long run. We're invested in hiring folks with diverse backgrounds and experience levels, and we need to talk about that more for sure.
Our jobs page is sparse right now because we're focusing on hiring EMs to help grow a healthy eng org. We'll have more openings before long, and you should absolutely apply when something sounds right. Feel free to hunt me down online in the meantime!
Don't just apply; definitely try to network with people on the team too. That works well if they're active on Twitter. I got my first internship by building a rapport with someone on Twitter, and then they worked with their org to find an internship opening for me (I still had to interview and such) that wasn't on their careers site.
> After that, 510 bytes are used for the SHARED lock. A byte range is used here to accommodate older Windows versions with mandatory locks.
I was curious how old, and... wow, that code is for Windows versions that predate the NT kernel (Win95/98/ME). I'm surprised that it's still around, but the comment does a great job of explaining it.
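For the curious, here's roughly what that looks like through POSIX advisory locks in Python. The offset constants are from SQLite's source, but this is a simplified illustration (it assumes an existing app.db file), not what the unix VFS does verbatim:

```python
import fcntl
import os

# Offsets from SQLite's source; the pending byte sits 1 GiB into the
# file, in a page SQLite never uses for actual data.
PENDING_BYTE = 0x40000000
SHARED_FIRST = PENDING_BYTE + 2
SHARED_SIZE = 510  # the 510 bytes mentioned above

fd = os.open("app.db", os.O_RDWR)
# Take a SHARED lock over that range. POSIX locks are advisory, so
# locking out at 1 GiB costs nothing; on old Windows the locks were
# mandatory, and this range keeps them off real data pages.
fcntl.lockf(fd, fcntl.LOCK_SH, SHARED_SIZE, SHARED_FIRST, os.SEEK_SET)
# ... read pages ...
fcntl.lockf(fd, fcntl.LOCK_UN, SHARED_SIZE, SHARED_FIRST, os.SEEK_SET)
os.close(fd)
```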
Question for people who actually use SQLite in production: when do you start seeing performance degrade to the extent that you have to migrate to a "proper" DB? Can a transactional table scale to, say, 10 million+ records with a couple of indexes before it becomes a problem?
A company I work for runs MySQL on an IoT base station product (Beaglebone-like hardware). Tables seem to get corrupted every so often; the corruption is typically repairable with a "mysqlcheck --auto-repair", which they run as part of the boot sequence, but not always. These corruptions seem to be due to bad batteries or the field team holding down the power button too long.
I run about 100 computers that regularly receive power cuts and have not had databases go corrupt. However, we also write to the database only every minute or so; all data is packed into compressed chunks that each contain a minute of data.
The problem is in the application layer and the MySQL settings.
You can try switching from the MyISAM engine to the InnoDB engine, tuning the InnoDB settings, writing inserts in batches rather than continuously, and wrapping the inserts in a transaction, along the lines of the sketch below.
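For example, a rough sketch of the batching pattern with mysql-connector-python; the table, columns, and collect_readings() helper are all hypothetical:

```python
import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(
    host="localhost", user="iot", password="secret", database="telemetry"
)
cur = conn.cursor()

# Buffer readings in memory and flush once a minute in one transaction,
# instead of a continuous stream of single-row autocommitted INSERTs.
batch = collect_readings()  # hypothetical: returns [(ts, value), ...]
cur.executemany("INSERT INTO readings (ts, value) VALUES (%s, %s)", batch)
conn.commit()  # one commit (one flush to the log) per batch, not per row
```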
I've done research on verification of distributed programs and worked for IBM as an architect of distributed file-systems. It was all so long ago that I'm not up to date on the literature so I really enjoy the discussion here. This is the best thing about HN, it has such an interesting and knowledgeable community. Thanks HN.
Maybe I'm misinterpreting your comment but SQLite operates with serializable isolation[1] so I don't think write skew is possible. For both the rollback journal and the WAL, SQLite only allows a single write transaction at a time so you can't have two transactions updating with different snapshots.
Also, from that Wikipedia article:
> Were the system serializable, such an anomaly would be impossible
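You can see the single-writer behavior directly from Python's sqlite3 module. A small sketch (file name arbitrary; isolation_level=None just lets us issue BEGIN ourselves):

```python
import sqlite3

# timeout=0 so the second writer fails immediately instead of waiting.
a = sqlite3.connect("app.db", timeout=0, isolation_level=None)
b = sqlite3.connect("app.db", timeout=0, isolation_level=None)
a.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")

a.execute("BEGIN IMMEDIATE")          # first writer takes the write lock
a.execute("INSERT INTO t VALUES (1)")
try:
    b.execute("BEGIN IMMEDIATE")      # second writer can't start one
except sqlite3.OperationalError as err:
    print(err)                        # "database is locked"
a.execute("COMMIT")
```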
For someone reading the article fresh, without other articles or outside knowledge, the quoted text pretty much states that snapshot isolation is sufficient for isolation in general, and it's a common enough incorrect belief that it's worth pointing out as untrue.
I'm glad to hear that SQLite has chosen serializable isolation as the one true way, even with the newer BEGIN CONCURRENT functionality.
The blog post isn't about Fly.io, it's about SQLite. Ben works here, but exclusively on Litestream (we sort of "acquired" Litestream some months ago). He's not writing about our product; he's writing about his project, and about SQLite.
We have Postgres as part of the platform (`fly postgres`), and not SQLite. But SQLite is super interesting to us, especially with Litestream, because it plays nicely with apps that are deployed close to their users. A good starting point: https://fly.io/blog/globally-distributed-postgres/ --- now, substitute an incarnation of SQLite with good read replicas, and in-process performance.
In this case fly.io (and others) are trying to change that by doing interesting things with storage of the database and the transaction logs. It works by hooking into various SQLite features (like virtual file systems and post-transaction hooks) so your application doesn't need to care about that layer and you just write your app with standard SQLite code.
The theory is you get the best of both worlds - it could be pretty cool if it works out well.
[1] https://www.usenix.org/conference/atc20/presentation/rebello