We put a distributed database in the browser and made a game of it

petrosagg · on July 11, 2023

> Sure, we’re not yet injecting storage faults, but then formal proofs for protocols like Raft and Paxos assume that disks are perfect, and depend on this for correctness? After all, you can always run your database over RAID, right? Right? > If your distributed database was designed before 2018, you probably couldn’t have done much. The research didn’t exist.

I'm trying to understand this part but something seems off. It seems to imply that those proofs do not apply to the real world because disks are not perfect, but neither is RAM. The hardware for both RAM and disks has an inherent error rate which can be brought down to arbitrary levels by using error correction codes (e.g ECC RAM).

I'm assuming that the TigerBeetle correctness proofs are predicated on perfect RAM even though in reality there is a small probability of errors. This tells me that there is an error rate which they consider negligible. If that's the case what is the difference between:

* TigerBeetle's storage approach

* Paxos or Raft with enough error correction on disk writes that the probability of errors equals that of ECC RAM which is considered negligible

I've probably made a logical error in my reasoning but I can't see it. Can someone enlighten me?

chubot · on July 11, 2023

Not an expert in this area, but I think disks have correlated failure modes whereas CPUs and memory generally don't.

Especially spinning platter disks, not sure about SSDs.

The difference in failure rates could be orders of magnitude ... Memory will have random bit flips but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event)

But disks will have non-random manufacturing issues. I'd be interested in more info too, but my impression is that the data on these issues is pretty thin. Foundation DB mentioned it ~10 years ago and Google has published data >10 years ago, but hardware has changed a lot since then

Software redundancy will take care of non-correlated failures, but it fails precisely when there are correlated ones

planckscnst · on July 11, 2023

I work on a large distributed database system. SSDs absolutely have correlated failures. Also CMOS batteries. Also CPU and memory (think a manufacturing defect or a storage climate issue on specific batches that made it through QA). Pretty much nothing is 100% guaranteed to have no correlated failures. It comes down to probabilities. You can add flexibility, variation, vendor/sourcing diversity to reduce risks.

vidarh · on July 12, 2023

Went through a rather large batch of OCZ SSDs that all failed within a two week window years ago. Thankfully the IBM Death Star had long before made me allergic to putting devices of the same model in the same RAID array if I can help it, so it was a nuisance rather than a disaster.

imtringued · on July 11, 2023

SSDs tend to have highly correlated failure modes because you either run into a bug in the firmware which is the same on every SSD or you have the same wear on every SSD, which locks both into read-only mode within a short period of time. You might argue that read-only is not a failure, but read-only means downtime and replacing hardware.

yencabulator · on July 12, 2023

> Memory will have random bit flips but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event)

Cosmic rays cause random bit flips, sure.

However, DIMMs going bad results in the same areas corrupting over and over. If you've run memtest or such with bad DIMMs, you'll see it telling you exactly which DIMM is bad, etc.

Now, dynamic memory management and virtual memory mapped onto physical memory complicate that picture... but you could easily end up with a single buffer used for e.g. TCP receive that lives in the same physical RAM region for the lifetime of the process.

Similarly, firmware bugs have resulted in very deterministic corruptions.

jorangreef · on July 11, 2023

Thanks for the question! Joran from TigerBeetle here.

The research in question is the 2018 paper from UW-Madison, “Protocol-Aware Recovery for Consensus-Based Storage” (PAR) [0] by Ram Alagappan, Aishwarya Ganesan, as well as Remzi and Andrea Arpaci-Dusseau (who you may recognize as authors of OSTEP).

PAR won best paper at FAST '18 for showing that a single disk sector fault, in the write-ahead log (WAL) of a single replica, could propagate through the distributed RAFT or MultiPaxos consensus protocol, to cause global cluster data loss.

This was counter-intuitive at the time, because PAR showed that the redundancy of these consensus and replication protocols did not in fact always imply fault-tolerance, as had previously been assumed.

The reason is, and we cover this in depth in our recent QCon London talk [1], but it was assumed that checksums alone would be sufficient to detect and recover from storage faults.

However, while checksums can be used under the “Crash Consistency Model” to solve consistency through power loss, PAR showed that checksums are not sufficient to be able to distinguish between a torn write at the end of the (uncommitted) WAL caused by power loss, and a torn write in the middle of the (committed) WAL caused by bitrot.

What you tend to find is that the WALs for many of these protocols will truncate the WAL at the first sign of a checksum mismatch, conflating the mismatch with power loss when it might be bitort, and thereby truncating committed transactions, and undermining quorum votes in the Raft or MultiPaxos implementations.

RAID solutions don't always help here, either. See "Parity Lost and Parity Regained" [2] for more details. ZRAID is better here, and ZFS is a huge inspiration, but with local redundancy under ZFS you're still not leveraging the global redundancy of the consensus protocol as well as you could be.

To summarize PAR:

There are fundamental design changes to both the global consensus protocol and the local storage engine that would need to be made, if the storage fault model of PAR (and TigerBeetle) is to be solved correctly.

Furthermore, few simulators even test for these kinds of storage faults. For example, misdirected I/O, where the disk writes or reads to or from the wrong location of disk, which may yet have a valid checksum.

However, this is important, because disks fail in the real world. A single disk has on the order of a 0.5-1% chance of corruption in a 2 year period [3]. For example, a 5 node cluster has a 2.5-5% chance of a single disk sector fault, which again in terms of PAR can lead to global cluster data loss.

On the other hand, memory (or even CPU) faults, assuming ECC are not in the same order of magnitude probability, and therefore TigerBeetle's memory fault model is to require ECC memory.

But, again, to be crystal clear, checksums alone are not sufficient to solve the consensus corruption issue. The fix requires protocol changes at the design level, for the consensus protocol to be made storage fault-aware.

Thanks for the question and happy to answer more!

[0] “Protocol-Aware Recovery for Consensus-Based Storage” https://www.usenix.org/conference/fast18/presentation/alagap...

[1] “A New Era for Database Design” (we also dive into the research surrounding Fsyncgate, looking into the latent correctness issues that remain) https://www.youtube.com/watch?v=_jfOk4L7CiY

[2] “Parity Lost and Parity Regained” https://www.usenix.org/conference/fast-08/parity-lost-and-pa...

[3] “An Analysis of Data Corruption in the Storage Stack” https://www.cs.toronto.edu/~bianca/papers/fast08.pdf

petrosagg · on July 11, 2023

Thank you for the detailed response!

> However, while checksums can be used under the “Crash Consistency Model” to solve consistency through power loss, PAR showed that checksums are not sufficient to be able to distinguish between a torn write at the end of the (uncommitted) WAL caused by power loss, and a torn write in the middle of the (committed) WAL caused by bitrot.

The PAR paper states that "although Crash preserves safety, it suffers from severe unavailability". I assume that when TigerBeetle loads state from RAM into a CPU cache/register it operates under the NoDetection consistency model or the Crash consistency model if ECC RAM automatically resets the CPU on read errors. At the same time it doesn't suffer from severe unavailability so what gives?

The answer is probably that ECC RAM is just reliable enough that the NoDetection/Crash models are fine in practice.

I can believe that off-the-shelf checksum and redundancy options offered by filesystems like ext4 and ZFS or systems like RAID don't hit the required error probabilities but why does the argument stop there? Couldn't a distributed database generate error correcting data on every write in the application layer so that the probability becomes low enough such that NoDetection/Crash become a non-issue for storage, just like RAM? Is there some other fundamental difference between reading and write data from RAM versus a disk?

jorangreef · on July 11, 2023

Huge pleasure, thanks again for the question!

The crux of the problem: How do you solve misdirected read/write I/O? Where the firmware writes/reads to/from the wrong disk sector (but with a valid checksum)?

PAR shows how both global consensus protocol and local storage engine need to be modified for this, with foundational design changes at the protocol-level, if a distributed system is to not only preserve correctness, but also optimize for high availability.

Bear in mind that PAR is not only actually correct, but it's also more efficient than simply dialing up local redundancy, because it lets you recover from the global redundancy that you have via replication in the consensus protocol.

The paper is great, but will especially reward a few passes of reading. The examples they give take time, but are great to work through slowly to gain a deeper understanding.

And/or, you can read the Zig code of PAR in TB! :)

Here's a great place to start, one of our favorite pieces of code in TigerBeetle: https://github.com/tigerbeetle/tigerbeetle/blob/4aca8a22627b...

yencabulator · on July 12, 2023

> The crux of the problem: How do you solve misdirected read/write I/O? Where the firmware writes/reads to/from the wrong disk sector (but with a valid checksum)?

Can't you make the expected location of the data part of the checksum?

Concretely,

- switch from checksums to hashes

- use something like Blake3 as keyed hash with the WAL offset as key.

Now, you can't accidentally read WAL block #5 instead of #7, as it's recorded hash won't match H(data, key=7).

Similar more old school technique: storing the expected role & id of a block inside the block can make storage more robust.

jorangreef · on July 18, 2023

> Can't you make the expected location of the data part of the checksum?

Yes, and in fact we do this already in TigerBeetle (specifically towards solving misdirected I/O, along with hash chaining). Coincidentally, we used to use Blake3 but have since moved to AEGIS for hardware acceleration.

However, and this begins to hint at the problem, but redundancy alone is not sufficient. For misdirected I/O, we are already encoding more into the checksum...

And, PAR goes beyond this. For example, how do you disentangle corruption in the middle of the committed write ahead log, from a torn write at the end of the WAL due to power loss? For this, to solve this correctly (to decide whether to repair a committed operation or truncate an uncommitted operation respectively, for correctness and high availability), you really do need two WALs... and integration with (or awareness of) the invariants of the global consensus protocol—as the paper motivates.

This is a foundational design change.

convolvatron · on July 11, 2023

you just bury these and play games with p. just like distributed consensus, there is no perfect storage medium. 1 bit flip a week too much for you? add secded. secded on memory. interleaving to spread out correlated errors. etc.

at some point you run in the probability of the earth being coincident with the sun and you call it good.

none of viewstamp replication, paxos, or raft deal with storage errors.

where it does get interesting is that in the low p spectrum can you can subvert the correctness of these protocols by fuzzing them. and then you get into byzantine protocols.

jorangreef · on July 11, 2023

We tried to emphasize “Protocol-Aware Recovery for Consensus-Based Storage” in the blog post, because it's how TigerBeetle solves the storage fault model, and because PAR shows how you can actually solve this using the redundancy you already have in the global consensus protocol.

https://www.usenix.org/conference/fast18/presentation/alagap...

pohl · on July 11, 2023

This is a sweet idea and a nice, game-esque implementation.

It could definitely use some onboarding.

There's nothing to give the "player" a hint as to what they should do. What is my goal? Am I trying to defeat the communication of the nodes, or help them? If it's the former, why did I seem to win the first level after doing nothing? I started by trying to mash some keys. Eventually I saw that there were tools in the upper-right, but it wasn't clear what to do with them. It's a bit frustrating that they disappear after you grab them if your click is too far off the target. After successfully applying one of them to one of the targets, it's not clear what one should do next. Was it a good move? Who knows, because there's no feedback.

jorangreef · on July 11, 2023

Thanks for the feedback! This is the first version of our educational SimTigerBeetle frontend.

The goal is to see and interact with a distributed database running visually under Easy/Medium/Hard levels of fault injection, giving you the opportunity to explore how the consensus handles failures, poke the otherwise deterministically simulated world, or inspect progress of the replicas.

So, it's a game in the sim genre, and we've got more to come, but the significance is hopefully what we tried to explain in the post:

You're seeing a new design of distributed database survive cosmic levels of fault injection (8% storage corruption on the read path across all replicas in the Radioactive level!), in your browser... and you can smash things with a hammer. ;)

You might also appreciate the video [0] at the end of our post, where we dive into the three insights behind deterministic simulation testing, and how DST moves beyond chaos engineering.

[0] SimTigerBeetle (Director's Cut!) https://www.youtube.com/watch?v=Vch4BWUVzMM

kristoff_it · on July 11, 2023

Well the point is that the replication protocol is going to fix all issues no matter what, so you could say that the only winning move is to enjoy watching TigerBeetle move forward impervious of scripted nor human-inputted issues.

In other words, you're going to 'win' no matter what :^)

pohl · on July 11, 2023

That makes sense: it's not a game at all, but a simulation pretending to be one.

kristoff_it · on July 11, 2023

never played "walking simulators" then? plenty of games don't feature traditional win-lose mechanics.

pohl · on July 11, 2023

No, I hadn't even heard of that genre. They sound potentially fun, though, since at least they have an environment to explore.

TeMPOraL · on July 12, 2023

> environment to explore

Only if the setting is something other than present-day reality - after all, the genre is called "walking simulator", not "trespassing simulator".

schmichael · on July 11, 2023

This is incredible! I am curious about this paragraph though:

> You’re going to see view changes when the primary crashes or is partitioned, and VSR’s telltale round robin rotation of the new primary among replicas, until a new primary is established. This is in contrast to Raft, which elects a primary at random, but then suffers from the risk of (or increased latency to mitigate) dueling leaders.

It seems like regardless of primary/leader selection mechanism used (deterministic/roundrobin vs voting), you still need a quorum of nodes to agree (and for the minority side to know they can't proceed). Surely in a round robin selection mechanism some sort of vote or liveness check must be performed before it is safe for the #2 node to promote itself to #1/primary? Otherwise if the link between #2 and #1/primary is partitioned, #2 could unilaterally assume it was the primary/leader, even if the rest of the nodes could still communicate with #1 (the original primary). I don't understand how round robin solves the agreement aspect that leader election does.

The simulation seems to only partition nodes and not links so I'm not sure it exercises asymmetric connectivity between members. Although it does mention flaky links, so perhaps they do cover this case.

Edit: I have been informed there is still a vote. :)

jorangreef · on July 11, 2023

Thanks, great to hear you enjoyed it!

The round robin "view change" in VSR is still consensus, and uses quorums to do fault isolation of the old primary, and to preserve the intersection property, to ensure that the committed log survives into the new view.

What's cool about VSR's consensus though, is that the dice is also preloaded, ahead of time, so that there's more information baked into the protocol than with Raft or MultiPaxos, which means that you can neatly sidestep their dueling leader problem, or the latency padding that is often added to mitigate it.

You can read more about VSR's intuitive view change consensus here: http://pmg.csail.mit.edu/papers/vr-revisited.pdf

The round robin view chance can also make VSR (slightly) more resilient to weird network faults, like you describe. For example, variants of VSR using stable storage as hinted at by the '12 paper, are in fact able to survive most of the OmniPaxos liveness scenarios (we submitted some errata to the paper's authors for this).

We haven't yet explored visualizing asymmetric connectivity in SimTigerBeetle (we tried to start with the big things!), however we recently wrote about how we test for this in our VOPR simulator here: https://tigerbeetle.com/blog/2023-07-06-simulation-testing-f...

schmichael · on July 11, 2023

Thanks for the details. A coworker pointed me at Heidi Howard (of Flexible Paxos, etc) and Diego Ongaro (Raft) discussing this very question! https://groups.google.com/g/raft-dev/c/cBNLTZT2q8o?pli=1

jorangreef · on July 12, 2023

Yes, I think Heidi Howard gets it spot on there, in that the main difference is the view change.

(TB implements Flexible Paxos and we're fans of Heidi Howard's work!)

eatonphil · on July 11, 2023

Direct link to the game: https://sim.tigerbeetle.com/.

While TigerBeetle is open-source, the game isn't yet. But that may change. :)

Digikid13 · on July 11, 2023

It's very confusing that y'all are calling this a "game". There's nothing to play, it's just a simulation to watch.

cstrahan · on July 11, 2023

"Simulation" doesn't necessarily entail stimulating interactivity and a pleasing aesthetic. As of writing, I don't know if there's a better term than "game" for what this is. (Maybe we can coin a new term, like "game-lite" -- somewhat akin to what "rogue-lite" is to the "rogue" genre).

See "walking simulators":

https://en.wikipedia.org/wiki/What_Remains_of_Edith_Finch

https://en.wikipedia.org/wiki/The_Stanley_Parable

https://en.wikipedia.org/wiki/Thirty_Flights_of_Loving

Plenty argue that these aren't games either (usual complaints involve lack of problem(s) to resolve, and no win-lose dynamic); but, then, what are these? The closest category I can think of would be "computer-animated film", but... these are interactive, and you can navigate and look in any direction you want, which yields a very different experience than watching a film like "Toy Story".

Apes · on July 11, 2023

I think "demo" or "toy" or "exhibit" would be a better term than "game", since there's not really any story, or rules, or objectives, or anything resembling a normal gameplay loop. The level of interactivity is well below what you would typically expect from a game, and there's effectively no agency in affecting the outcome of what happens. Even walking simulators have at least some of those things.

shard · on July 11, 2023

I remember reading an article which talks about the difference between games and these interactive demos: games have objectives and goals, and for the demos without such, they're better to be called as "toys".

zerocrates · on July 11, 2023

"Walking simulator" is a little tricky to extrapolate from since it was originally a perjorative.

captainhorst · on July 11, 2023

There are tools in the upper right corner. You can pick one and use it on a beetle. ;) This introduces various faults in the simulation that the database has to recover from.

Minor spoiler: there's another tiny game hidden at the end of the third level.

reidjs · on July 11, 2023

That adds some interactivity, but I still wouldn't call this a game yet. Games generally have some sort of challenge you must overcome. This is more like a highly polished interactive simulation of a distributed db. Still really cool and well done.

VectorLock · on July 12, 2023

Out of curiosity what did you use to write the graphical part of this game?

captainhorst · on July 12, 2023

I used my Zig port of NanoVG: https://github.com/fabioarnold/nanovg-zig which ultimately uses WebGL for rendering in the browser.

cdchn · on July 11, 2023

This is pretty cool from a database perspective but I'm really interested in what they used to write this 'game.'

eatonphil · on July 11, 2023

It's covered a little bit in the Evolution section (add #evolution to the URL and hit enter).

If that doesn't answer everything (it's not a long section, granted) Fabio (captainhorst) is on HN answering questions in this thread already so feel free to ask!

cdchn · on July 12, 2023

I went to https://sim.tigerbeetle.com/#evolution but nothing happened.

eatonphil · on July 12, 2023

Of the blog post, sorry, not sim.tigerbeetle.com.

muhaaa · on July 11, 2023

I thought tigerbeetle distributes the db on web clients? I skimmed through the docs and cannot find how to design a web app with tigerbeetle like the game https://sim.tigerbeetle.com/

eatonphil · on July 11, 2023

Hey! I'm not completely sure I follow the question. But the game is built a little differently from how you'd typically run TigerBeetle in production. In production you'd run a cluster of replicas (see https://docs.tigerbeetle.com/deploy/hardware#cluster-of-repl...) and then you'd have your application connect to the cluster using one of our client libraries.

The game is more a reflection of how we do simulation testing (as the post mentioned). You can find the code for that here: https://github.com/tigerbeetledb/tigerbeetle/blob/main/src/s....

muhaaa · on July 11, 2023

I have a very interactive app with limited data use <10MB. Right now its a lots of different API calls to sync state between server and multiple clients. But if I have a distributed sync protocol (that works), I can replicate the state on each cient and server by tunneling the sync protocol to the server and back. I have reduced a lots of api complexity. My react UI just subscribes to the data changes in the synced DB. Similar concept impelmented pouchdb & couchdb but its a bit outdated.

cdchn · on July 11, 2023

What did you use to make the graphics? Thats very cool.

eatonphil · on July 11, 2023

They're handmade by Joy Machs! We credit him in the post. :) https://twitter.com/joymachs

cdchn · on July 12, 2023

Oh the sprites are great, don't get me wrong, but I mean how the graphical game system was made.

captainhorst · on July 12, 2023

We use NanoVG: https://github.com/fabioarnold/nanovg-zig

cdchn · on July 12, 2023

Nice! Looks pretty rad. I've never used Zig but it sure looks like an interesting language. Did NanoVG help with the animation stuff like the little text that pops out of the beetles and the lines that zip between them?

captainhorst · on July 17, 2023

Sorry, totally missed your reply! At first we had normal text and then we switched them for hand-drawn sprites. The text animation consists of 2D transformations (translation, rotation, scale) interpolated over time. It's all animated in code. NanoVG helps by having a transformation stack like OpenGL. In Zig I can open a block like this:

{ vg.save(); defer vg.restore(); // undo all transformations at the end of the block

  vg.translate(...);
  vg.rotate(...);
  vg.scale(...);
  // draw sprite

} // continue to draw the rest of the scene

moralestapia · on July 11, 2023

Nice work @eatonphil and TB team. Thanks for sharing.

I'm eagerly awaiting the production ready (or an RC or something) of TB, I've tried it and it truly excels in its domain.

I have a few clients that could make good use of it but I have to hold my horses a little bit.

jorangreef · on July 11, 2023

Thanks, Alex! Awesome to hear that. SimTigerBeetle has been a skunkworks project, on the side, the past 12 months. We're looking forward to sharing the production release of TigerBeetle with you soon when it's ready.

brimstedt · on July 11, 2023

Is this something similar to couch/pouchdb?

eatonphil · on July 11, 2023

TigerBeetle is more domain-specific, i.e. focused on financial transactions and high-performance, high-availability. There are just two entity types in the database: accounts and transfers between accounts. More details here https://docs.tigerbeetle.com/design/data-modeling if you're interested!

jazzyjackson · on July 11, 2023

In comparison to {C,P}ouchDB, I think the question is around offline-first availability. Can a mobile client interact with a local database without a degraded experience and expect changes to be synchronized once a connection is re-established. I would suppose in the financial transaction space this is not A Thing.

eatonphil · on July 11, 2023

> In comparison to {C,P}ouchDB, I think the question is around offline-first availability.

Got it, thanks! Yeah that is indeed not how TigerBeetle works. If you ever cannot connect to the cluster, you keep retry messages (idempotently) until you connect and the message succeeds.

jlokier · on July 11, 2023

That sounds similar to the account state storage used in blockchains. In some of those, high performance and high space consumption are technical challenges, and the tendancy to have a lot of random keys (hashes) adds another.

I wonder if TigerBeetle would be suited to those storage performance challenges, and conversely if the low-level storage optimisations in certain implementations of blockchain account history would be more broadly applicable to the financial applications targeted by TigerBeetle.

eatonphil · on July 11, 2023

Yes it is similar, but simpler! We do chat with a few companies using blockchain, looking at TigerBeetle for better throughput.

> if the low-level storage optimisations in certain implementations of blockchain account history would be more broadly applicable to the financial applications targeted by TigerBeetle.

Maybe! Any examples you're thinking of?

jlokier · on July 12, 2023

> Any examples you're thinking of?

The three I know about are all Ethereum account/state history stores.

Besu does something interesting they call Bonsai Trees. It reportedly uses a lot less storage (about 1/10, 1.1TB instead of 12TB) than what came before in certain modes that store all the account history ("archive node"), and a bit less storage in other modes. I haven't studied how Bonsai trees work, so I don't know if the method is useful for non-blockchain styles of account history.

Erigon is well known as using perhaps the smallest storage when storing all the account history, and this is what it's optimised for. Erigon uses a fairly classic B-tree, for performance reasons (lower IOPS than LSM-trees for relevant operations). Their tricks are in how the data is organised in the B-tree, being better than the classic Ethereum methods. Erigon 2/3/4 (not sure, they merged plans) uses a new storage format that is supposed to do the same thing better. I don't think Erigon's tricks would be applicable to TigerBeetle, because TB doesn't have the storage problem of keeping everythng in Merkle trees in the first place. But it's a place to look as Ethereum's leader for archive node storage.

Last but not least, I have a work in progress storage format that is more efficient than either of the above, in that it uses much less storage for all the account history (due to efficient queryable and updatable adaptive compression), and fewer IOPS to look up individual account states (because random access is important). It uses a novel storage tree structure that is neither a B-tree nor an LSM-tree, but has some aspects of both, in order to reach for lower IOPS in various scenarios that each of those trees ar better or worse at, and it is also compresses the data in a novel way, which is important to storing histories and patterned data efficiently. It was originally motivated by Ethereum state history storage (since there's a clear benchmark), but I'm thinking it could be useful for time series data.

zelda-mazzy · on July 11, 2023

This is so cool!

nlavezzo · on July 11, 2023

This is awesome!

jorangreef · on July 12, 2023

Thanks, Nick! All credit to FoundationDB.

charbuff · on July 11, 2023

This is a really cool sales prop

jedisct1 · on July 11, 2023

This is insanely cool!

jorangreef · on July 12, 2023

Thanks, Frank!