We actually measured latency and throughput to find the efficient frontier: the number of logical transactions per physical DBMS query that optimizes both.
What you find is that latency, plotted against batch size, looks more like a U-shaped curve than a straight trade-off against throughput.
If you process only 1 debit/credit at a time, you get worse throughput but also worse latency: things like networking and fsync have a fixed cost component, so your system can't process incoming work fast enough, queues start to build up, and that queueing shows up as latency.
Whereas, as you process more debit/credits per batch, you get better throughput but also better latency, because for the same fixed costs your system is able to do more work, and so keeps queueing times short.
At some point, which for TigerBeetle tends to be around 8k debit/credits per batch, you get the best of both, and thereafter latency starts increasing.
You can think of this like the Eiffel Tower. If you only let 1 person in the elevator at a time, you're not prioritizing latency, because queues are going to build up. What you want to do rather is find the sweet spot of the lift capacity, and then let that many people in at a time (or let 1 person in immediately and send them up if there's no queue, then let the queue build and batch when the lift comes back!).
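To make the shape of that curve concrete, here is a toy queueing model in plain Python (not our benchmark code, and every cost and rate in it is a made-up assumption): a fixed cost per physical query plus a small per-transaction cost, with latency approximated as batch-fill time plus queueing delay.

    # Toy queueing model (not TigerBeetle's benchmark code) to illustrate why
    # latency vs. batch size is U-shaped. All costs and rates are assumptions
    # picked for illustration; the real frontier comes from measurement.

    FIXED_COST_S = 0.002        # assumed fixed cost per physical query (RTT + fsync)
    PER_TX_COST_S = 0.0000005   # assumed CPU cost per debit/credit
    ARRIVAL_RATE = 1_000_000    # assumed incoming debit/credits per second

    def mean_latency(batch_size: int) -> float:
        """Approximate mean latency: batch-fill wait + queueing delay + service."""
        service_time = FIXED_COST_S + PER_TX_COST_S * batch_size
        utilization = ARRIVAL_RATE * service_time / batch_size
        if utilization >= 1.0:
            return float("inf")  # overloaded: queues grow without bound
        fill_time = batch_size / (2 * ARRIVAL_RATE)    # average wait while the batch fills
        queueing = service_time / (1.0 - utilization)  # crude M/M/1-style inflation
        return fill_time + queueing

    for batch in (64, 512, 2048, 4096, 8192, 16384, 65536):
        latency = mean_latency(batch)
        label = "overloaded" if latency == float("inf") else f"{latency * 1000:7.1f} ms"
        print(f"batch={batch:>6}  latency={label}")

With these made-up numbers the minimum happens to land in the thousands; the real ~8k sweet spot comes from measuring TigerBeetle itself, not from this model.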
We had been following (and sponsoring) Zig since 2018, for 2 years before I chose Zig for TigerBeetle in 2020.
At TigerBeetle, we have a technical value called "Edge", which refers to "the ability to make quality technical decisions from first principles without regard to popularity". We don't adopt technology on the basis of popularity or quantity (in fact, for us that's an anti-pattern and a logical fallacy), but rather on the basis of quality (does the technology merit the choice?).
In other words, we see choosing a technology a lot like surfing. If technology moves in waves, then the time to paddle out is not when you see thousands of surfers on a wave, but rather before: you need to spot quality swell before it breaks. If you can do that, then you tend to meet surfers at the backline who can also spot good swell, and counter-intuitively, that has the 2nd order effect of also being good for hiring.
I think there's an assumption (that I used to have) that open source is somehow inversely related to business model.
However, before TigerBeetle was founded, I came around to the understanding that, in fact, "open source and business model are orthogonal".
What I mean by this is that startups might say that open source is too expensive (who will manage it for them?) and that enterprise might say that open source is too cheap (who will scale and support it and connect it to all their other systems for them?).
So the business model is what solves both these problems: provide a valuable product around the open source that serves a real need, honorably and at a profit, one that customers want to pay for because the alternative is either too expensive or too cheap.
TLDR: Stored procedures (or an extension, or an in-process embedded DBMS) would help with the external row locks but wouldn't solve the internal locks (that couple I/O and CPU), and wouldn't approach the same degree of group commit, or network request amortization that TigerBeetle does.
i.e. You're still using the anatomy of a general-purpose DBMS design, whereas TigerBeetle has a completely different anatomy in the storage engine and consensus protocol—all specialized for transaction processing. Again, the anatomy is different (think duck for OLAP, elephant for OLGP, or beetle for OLTP).
> Is there any chance you could add a mode with a regular database so I can get the enjoyment of learning how to break it? I can't appreciate how hard it is to crash TigerBeetle since I don't have a frame of reference.
This would be awesome to do.
We're running real TigerBeetle code in each of the "beetles" that you see. While we couldn't bring in real code from a regular database (most were not designed for Deterministic Simulation Testing), we could definitely make TigerBeetle emulate one.
For example, we could disable TB's corruption detection/recovery code paths that provide storage fault-tolerance. And then you could zap "a regular database" with a cosmic ray and see what happens with a regular consensus protocol like Raft that doesn't implement UW-Madison's “Protocol-Aware Recovery for Consensus-Based Storage”, how the fault propagates through the cluster.
> More in-game explanations on what's happening would be useful as well.
Yes, for sure. What would you like us specifically to explain more concretely?
> It looks cool but is confusing as it's impossible for me to actually "win".
(There is "a game within a game" that you can play after Radioactive that you can indeed win! Also an OLAP easter egg in there somewhere if you can find it!)
> For example, we could disable TB's corruption detection/recovery code paths that provide storage fault-tolerance. And then you could zap "a regular database" with a cosmic ray and see what happens with a regular consensus protocol like Raft that doesn't implement UW-Madison's “Protocol-Aware Recovery for Consensus-Based Storage”, how the fault propagates through the cluster.
This would be amazing! I'd suggest that TigerBeetle should be hard mode. I could learn common failure modes by breaking regular database clusters, then get crushed when TigerBeetle defeats the strategy I just learned. Could be useful for the non-technical people you want to sell to.
> Yes, for sure. What would you like us specifically to explain more concretely?
More or less the explanation you gave above. In Scenario 3, I see you're zapping random beetles causing the network to automatically recover. Perhaps the sim should freeze right before the first zap, and step through the following events with annotations:
1. You're about to corrupt the storage
2. TigerBeetle will detect the data corruption
3. It won't propagate through the cluster
This would let me see what animations correspond to what events, so when it gets faster, I still understand what's going on.
There are also a lot of metrics on the screen. The simulator should highlight important metrics for the scenario I'm watching, like read/write corruption rate in Scenario 3.
Also, it'd be cool to see metrics that evaluate the system. e.g. How long does a user have to wait before they get a response? Does TigerBeetle degrade better under load than competitors?
Right now, it's hard for me to judge the performance of TigerBeetle in your scenarios beyond "it works!", even though your blog post claims large gains in speed.
Those metrics would also be an entertaining way of keeping score (maybe add a leaderboard?). Making it extremely difficult (but still possible) to score against TigerBeetle would give me an intuitive sense of why your database is so good.
> (There is "a game within a game" that you can play after Radioactive that you can indeed win! Also an OLAP easter egg in there somewhere if you can find it!)
And so TB exposes a set of primitives designed around debit/credit (not only a record of debit/credits).
For example, you can express something like:
- Do a transfer between accounts, but only on condition that the resulting debits would not exceed credits (or vice versa, or some threshold amount compared to this net balance), while also taking into account, in that net balance calculation, other authorized (but not yet captured) inflight transfers (i.e. to other third-party systems using a two-phase commit protocol).
- Then, if this transfer succeeds, immediately also do another 7000 transfers, all linked atomically together. But if any of these would fail, then the whole chain, including the net debit cap transfer, should not be executed.
- But only do all the above, if both volume and velocity limits (expressed not in a currency but in a quantity) in an isolated risk ledger would also be satisfied.
- And finally, if all this does succeed, then do another set of transfers in a separate USD reporting ledger.
You can do the above, atomically, in a single round-trip to TigerBeetle. It would be a nightmare in SQL.
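To make that concrete, here is a rough sketch of what such a chain looks like as data, in plain Python dicts rather than a real TigerBeetle client (the field and flag names loosely follow TigerBeetle's documented transfer schema, but the helper, account IDs and amounts are all made up for illustration):

    USD_LEDGER, RISK_LEDGER, REPORTING_LEDGER = 1, 2, 3
    ALICE, BOB, RISK_BUDGET, RISK_SPENT, REPORT_SRC, REPORT_DST = range(100, 106)

    def transfer(debit, credit, amount, ledger, linked):
        # 'linked=True' means this transfer and the next commit or fail together,
        # so a run of linked transfers forms one atomic chain.
        return {"debit_account_id": debit, "credit_account_id": credit,
                "amount": amount, "ledger": ledger, "linked": linked}

    batch = []

    # 1. The net-debit-cap transfer: the account itself is configured so that
    #    debits must not exceed credits, so this fails if it would breach the cap.
    batch.append(transfer(ALICE, BOB, 100_00, USD_LEDGER, linked=True))

    # 2. ~7000 downstream transfers, all chained to the one above.
    for i in range(7000):
        batch.append(transfer(BOB, 200 + i, 1_00, USD_LEDGER, linked=True))

    # 3. Volume and velocity limits as quantities on an isolated risk ledger.
    batch.append(transfer(RISK_BUDGET, RISK_SPENT, 1, RISK_LEDGER, linked=True))

    # 4. Mirror into a separate USD reporting ledger; the final transfer leaves
    #    'linked' unset, which closes the atomic chain.
    batch.append(transfer(REPORT_SRC, REPORT_DST, 100_00, REPORTING_LEDGER, linked=False))

    print(f"{len(batch)} transfers in one request")  # one round-trip, all-or-nothing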
But the other advantage to having the transaction DBMS speak debit/credit accounting primitives is that your multiple products can then build around a transaction processing core with strict controls, while keeping the accounting policy in the product layer above. Every business/product may have a different accounting policy, but the primitives are the same. This keeps the core of your transaction DBMS simple and in control, while allowing flexibility for product evolution.
It was pretty surreal to sit next to someone at a dinner in NYC two months ago, be introduced, and realize that they're someone you had an HN exchange with 5 years ago.
To be clear, TB moves the code to the data, rather than the data to the code, and precisely so that you don't have "race conditions outside the ledger".
Instead, all kinds of complicated debit/credit contracts (up to 8k financial transactions at a time, linked together atomically) can be expressed in a single request to the database, composed in terms of a rich set of debit/credit primitives (e.g. two-phase debit/credit with rollback after a timeout), to enforce financial consistency directly in the database.
On the other hand, moving the data to the code, to make decisions outside the OLTP database, was exactly the anti-pattern we were wanting to fix in the central bank switch, as it tried to implement debit/credit primitives over a general-purpose DBMS. It's really hard to get these things right on top of Postgres.
And even if you get the primitives right, the performance is fundamentally limited by row locks interacting with RTTs and contention. Again, these row locks are not only external, but also internal (i.e. how I/O interacts with CPU inside the DBMS), which is why stored procedures or extensions aren't enough to fix the performance.
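As an aside, the two-phase primitive mentioned above (reserve, then post or void, with automatic rollback on timeout) has roughly this shape. Again, a plain-Python sketch of the documented transfer model with a made-up helper, not real client code:

    def transfer(**fields):
        return fields  # stand-in for a real client's Transfer type

    PENDING_ID = 1001

    # Phase 1: reserve the funds. The pending amount is held against the
    # accounts' balances but not yet posted. If neither a post nor a void
    # arrives within 'timeout' seconds, the reservation expires and rolls back.
    reserve = transfer(id=PENDING_ID, debit_account_id=1, credit_account_id=2,
                       amount=500, flags="pending", timeout=60)

    # Phase 2a: capture (post) the reservation.
    post = transfer(id=1002, pending_id=PENDING_ID, amount=500,
                    flags="post_pending_transfer")

    # Phase 2b (alternative): cancel (void) the reservation explicitly.
    void = transfer(id=1003, pending_id=PENDING_ID,
                    flags="void_pending_transfer")

This is what lets an authorization against a third-party system be held inflight and then rolled back cleanly, without the application re-implementing that state machine outside the ledger.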
Can you expand on why a sproc isn't a good solution (e.g. send a set of requests, process those that are still in a valid state, error those that aren't, return responses)?
Maybe knowing the volumes you are dealing with would also help.
> I honestly don't understand why this isn't a Postgres extension.
We considered a Postgres extension at the time (as well as stored procedures or even an embedded in-process DBMS).
However, this wouldn't have moved the needle to where we needed it to be. Our internal design requirements (TB started as an internal project at Coil, contracting on a central bank switch) were literally a three-order-of-magnitude increase in performance, to keep up with where transaction workloads were going.
While an extension or stored procedures would reduce external locking, the general-purpose DBMS design implementing them still tends to do far too much internal locking, interleaving disk I/O with CPU and coupling resources. In contrast, TigerBeetle explicitly decouples disk I/O and CPU to amortize internal locking and so “pipeline in bulk” for mechanical sympathy. Think SIMD vectorization but applied to state machine execution.
For example, before TB’s state machine executes 1 request of 8k transactions, all data dependencies are prefetched in advance (typically from L1/2/3 cache) so that the CPU becomes like a sprinter running the 100 meters. This suits extreme OLTP workloads where a few million debit/credit transactions need to be pushed through less than 10 accounts/rows (e.g. for a small central bank switch with 10 banks around the table). This is pathological for a general-purpose DBMS design, but easy for TB because hot accounts are hot in cache, and all locking (whether external or internal) is amortized across 8k transactions.
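A caricature of that decoupling, in plain Python (this is just the shape of the idea, not TigerBeetle's internals): one pass to prefetch every account the batch touches, one pure-CPU pass over the whole batch, one pass to write back.

    def execute_batch(transfers, accounts_on_disk):
        # Caricature of "prefetch, then execute"; not TigerBeetle internals.
        # I/O phase: figure out every account the batch touches and read each once.
        needed = set()
        for t in transfers:
            needed.add(t["debit_account_id"])
            needed.add(t["credit_account_id"])
        cache = {account_id: dict(accounts_on_disk[account_id]) for account_id in needed}

        # CPU phase: no I/O and no per-row locks. A hot account (say, one of 10
        # banks around the table) is read once, then hit thousands of times
        # straight out of cache.
        for t in transfers:
            cache[t["debit_account_id"]]["debits_posted"] += t["amount"]
            cache[t["credit_account_id"]]["credits_posted"] += t["amount"]

        # Write-back phase: one pass to persist the updated accounts.
        accounts_on_disk.update(cache)

    # Example: one hot account (1) debited by two transfers, crediting account (2).
    disk = {1: {"debits_posted": 0, "credits_posted": 0},
            2: {"debits_posted": 0, "credits_posted": 0}}
    execute_batch([{"debit_account_id": 1, "credit_account_id": 2, "amount": 10},
                   {"debit_account_id": 1, "credit_account_id": 2, "amount": 5}], disk)
    print(disk[1]["debits_posted"], disk[2]["credits_posted"])  # 15 15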
But the big problem with extensions or stored procedures is that they still tend to have a “one transaction at a time” mindset at the network layer. In other words, they don’t typically amortize network requests beyond a 1:1 ratio of logical transaction to physical SQL transaction; they’re not ergonomic if you want to pack a few thousand logical transactions in one physical query.
On the other hand, TB’s design is like “stored procedures meets group commit on steroids”, packing up to 8k logical transactions in 1 physical query, and amortizing the costs not only of state machine execution (as described above) but also syscalls, networking and fsync (it’s something roughly like 4 syscalls, 4 memcopies and 4 network messages to execute 8k transactions—really hard for Postgres to match that).
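Back of the envelope, and treating every number here as an order-of-magnitude assumption rather than a benchmark result, the amortization works out roughly like this:

    BATCH = 8_000       # logical transactions per physical query
    FSYNC_S = 0.001     # assumed cost of one fsync
    RTT_S = 0.0005      # assumed client/server round-trip

    one_at_a_time = FSYNC_S + RTT_S      # 1:1 mapping: every transaction pays full freight
    batched = (FSYNC_S + RTT_S) / BATCH  # fixed costs shared across the whole batch

    print(f"per-transaction overhead, 1:1 mapping : {one_at_a_time * 1e6:8.1f} us")
    print(f"per-transaction overhead, 8k per query: {batched * 1e6:8.1f} us")
    print(f"amortization factor: {one_at_a_time / batched:,.0f}x")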
Postgres is also nearly 30 years old. It's an awesome database, but hardware, software, and research into how you would design a transaction processing database today have advanced significantly since then. For example, we wanted more safety around things like Fsyncgate by having an explicit storage fault model. We also wanted deterministic simulation testing and static memory allocation, and to follow NASA's Power of Ten Rules for Safety-Critical Code.
A Postgres extension would have been a showstopper for these things, but these were the technical contributions that needed to be made.
I also think that some of the most interesting performance innovations (static memory allocation, zero-deserialization, zero-context switches, zero-syscalls etc.) are coming out of HFT these days. For example, Martin Thompson’s Evolution of Financial Exchange Architectures: https://www.youtube.com/watch?v=qDhTjE0XmkE
HFT is a great precursor to see where OLTP is going, because the major contention problem of OLTP is mostly solved by HFT architectures, and because the arbitrage and volume of HFT is now moving into other sectors—as the world becomes more transactional.
> In what case is it better to have two databases?
Finally, regarding two databases, this was something we wanted to make explicit in the architecture. Not to "mix cash and customer records" in one general-purpose mutable filing cabinet, but rather to have "separation of concerns": the variable-length customer records live in the general-purpose DBMS (the filing cabinet) in the control plane, and the cash lives in the immutable financial transactions database (the bank vault) in the data plane.
It's the same reason you would want Postgres + S3, or Postgres + Redpanda. Postgres is perfect as a general-purpose or OLGP database, but it's not specialized for OLAP like DuckDB, or specialized for OLTP like TigerBeetle.
Again, appreciate the question and happy to answer more!
Thanks for taking the time for the explanation and the rundown on the architecture. Sounds a bit like an LMAX Disruptor for a DB, which honestly is quite a natural design for performance. Kudos for the Zig implementation as well; I've never seen a project this serious written in it.
Personally, I still see challenges in developing on top of a system with data in two places unless there's a nice way to sync between them, and I would have framed the mutable/immutable classification as more of an unlogged vs. fully-logged distinction within the DB, but I'm just doing armchair analysis here.
Exactly, the Martin Thompson talk I linked above is about the LMAX architecture. He gave this at QCon London I think in May 2020 and we were designing TigerBeetle in July 2020, pretty much lapping this up (I'd been a fan of Thompson's Mechanical Sympathy blog already for a few years by this point).
I think the way to see this is not as "two places for the same type of data" but rather as "separation of concerns for radically different types of data" with different compliance/retention/mutability/access/performance/scale characteristics.
It's also a natural architecture, and nothing new: it's how you would probably want to architect the "core" of a core banking system. We literally lifted the design for TigerBeetle directly out of the central bank switch's internal core, so that it would be dead simple to "heart transplant" back in later.
The surprising thing though, was when small fintech startups, energy and gaming companies started reaching out. The primitives are easy to build with and unlock significantly more scale. Again, like using object storage in addition to Postgres is probably a good idea.
- Abstract time (all timeouts etc.) in the DBMS, so that time can be accelerated (roughly by 700x) by ticking time in a while-true loop (a minimal sketch of this loop follows the list below).
- Abstract storage/network/process and do fault injection across all the storage/network/process fault models. You can read about these fault models here: https://docs.tigerbeetle.com/about/safety.
- Verify linearizability, but immediately as state machines advance state (not after the fact by checking for valid histories, which is more expensive), by comparing each state transition against the set of inflight client requests (the simulator controls the world so it can do this).
- But not only check correctness, also test liveness, that durability is not wasted, and that availability is maximized, given the durability at hand. In other words, given the amount of storage/network faults (or f) injected into the cluster, and according to the specification of the protocols (the simulator is protocol-aware), is the cluster as available as it should be? Or has it lost availability prematurely? See: https://tigerbeetle.com/blog/2023-07-06-simulation-testing-f...
- Then also do a myriad of things like verify that replicas are cache-coherent at all times with their simulated disk, that the page cache does not get out of sync (like what happened with Linux's page cache in Fsyncgate) etc.
- And while this is running, there are 6000+ assertions in all critical functions checking all pre/post-conditions at function (or block) scope.
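Here is the minimal sketch of that "tick time in a loop" idea promised above, with everything else stripped away. The class, fault probability, and invariant are illustrative assumptions, not TigerBeetle's actual simulator:

    import random

    class SimulatedCluster:
        # Stand-in for replicas plus simulated network/storage; purely illustrative.
        def __init__(self, prng: random.Random):
            self.prng = prng
            self.committed = []    # the history of committed operations
            self.inflight = set()  # client requests not yet acknowledged

        def submit(self, op: int) -> None:
            self.inflight.add(op)

        def tick(self, now: int) -> None:
            # 'now' is virtual time; a real simulator would use it for timeouts.
            # This is where fault injection hangs: dropped messages, corrupted
            # sectors, crashed processes, all decided by the same seeded PRNG.
            fault_this_tick = self.prng.random() < 0.01
            # Maybe commit one inflight request this tick (unless we faulted).
            if self.inflight and not fault_this_tick and self.prng.random() < 0.5:
                op = min(self.inflight)
                self.inflight.remove(op)
                self.committed.append(op)
                # Check invariants immediately, as state advances, rather than
                # validating whole histories after the fact. In this toy model,
                # operations must commit in submission order.
                assert len(self.committed) < 2 or self.committed[-1] > self.committed[-2]

    def simulate(seed: int, ticks: int = 100_000) -> None:
        prng = random.Random(seed)    # one seed makes the whole run reproducible
        cluster = SimulatedCluster(prng)
        for now in range(ticks):      # virtual time: ticking a counter instead of
            if prng.random() < 0.3:   # sleeping is where speedups like ~700x come from
                cluster.submit(now)
            cluster.tick(now)
        print(f"seed={seed}: committed={len(cluster.committed)} inflight={len(cluster.inflight)}")

    simulate(seed=42)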
And please come and join us live every Thursday at 10am PT / 1pm ET / 5pm UTC for matklad's IronBeetle on Twitch, where we do code walkthroughs and live Q&A: https://www.twitch.tv/tigerbeetle
Joran from TigerBeetle here! Really stoked to see a bit of Jim Gray history on the front page and happy to dive into how TB's consensus and storage engine implements these ideas (starting from main! https://github.com/tigerbeetle/tigerbeetle/blob/main/src/tig...).
> focusing on the raw TPS number is a bit like “counting the number of bills in your wallet but ignoring that some are singles, some are twenties, and some are hundreds.”
https://chainspect.app/dashboard describes each of their metrics:
- Real-Time TPS (tx/s)
- Max Recorded TPS (tx/s)
- Max Theoretical TPS (tx/s)
- Block Time (s)
- Finality (s)
USD/day probably doesn't predict TPS, because the average transaction value is higher on networks with low TPS.
Other metrics: FLOPS, FLOPS/WHr, TOPS, TOPS/WHr, $/OPS/WHr
And then there's uptime, or losses due to downtime (given SLA-prorated costs).
> USD/day probably doesn't predict TPS, because the average transaction value is higher on networks with low TPS.
Exactly. If we only look at USD/day we might not see the trend in transaction volume.
What's happened is like a TV set going from B&W to full color 4K. If you look at the dimensions of the TV, it's pretty much the same. But the number of pixels (txns) is increasing, with their size (value) decreasing, for significantly higher resolution across sectors. And this is directly valuable because higher resolution (i.e. transactionality) enables arbitrage/efficiency.
I don't know what the frontier is on TPS and value in financial systems. Does HFT or market-making really add that much value compared to direct capital investment and the trading frequency of the LTSE (Long Term Stock Exchange)?
High tx fees (which are simply burnt) disincentivize HFT, which may also be a waste of electricity, like PoW, in terms of allocation.
Low tx fees increase the likelihood of arbitrage that reduces the spread between prices (on multiple exchanges), and then what is the effect on volatility and real value?
But those are market dynamics, not general high-TPS system dynamics.
You can see the trend better in the world of instant payments, if you look to India's UPI or Brazil's PIX, how they've taken over from cash and are disrupting credit card transactions, with this starting to spread to Europe and the US: https://www.npci.org.in/what-we-do/upi/product-statistics
However, it's not limited to fintech.
For example, in some countries, energy used to be transacted once a month. Someone would come to your house or business, read your meter, and send you a bill for your energy usage. With the world moving away from coal to clean energy and solar, this is changing, because the price of energy now follows the sun. So energy providers are starting to price energy not once a month, but every 30 minutes. In other words, this 1440x increase in transaction volume (48 half-hour intervals per day × 30 days = 1440) enables cleaner, more efficient energy pricing.
You see the same trend also in content and streaming. You used to buy an album, then a song, and now you stream on Spotify and Apple. Same for movies and Netflix.
The cloud is no different. Moving from colocation, to dedicated, to per-hour and then per-minute instance pricing, and now serverless.
And then Uber and Airbnb have done much the same for car rentals or house rentals: take an existing business model and make it more transactional.
Thanks, I hadn't. Looking into it, I think Toon is asking the question:
Should the DBMS expose the raw storage operations (insert, update, delete) for the application to take the burden of correctness for their physical composition into logical functionality? Or should the DBMS export that logical functionality?
With TigerBeetle, we didn't want people to have to keep cobbling together their transaction processing system, to take that burden of correctness, but to solve this in open source, once and for all, for everyone to build around.