We model a timestamp that functionally answers "when does this activity go on the books?" We compute clearing based on both this "effective_at" and a "system time" (some business functions have known periodicity, and keeping both makes for a useful debugging tool).
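A minimal sketch of the bitemporal idea above: each entry carries both an "effective_at" (when the activity goes on the books) and a system-recorded time. The class and field names are illustrative assumptions, not an actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LedgerEntry:
    account: str
    amount_cents: int
    # When the activity "goes on the books" for business purposes.
    effective_at: datetime
    # When the system actually recorded the entry; it diverges from
    # effective_at for delayed or backfilled activity, which is what
    # makes the pair useful for debugging periodic business functions.
    recorded_at: datetime

# Hypothetical long-clearing entry: booked effective March 1,
# recorded by the system two days later.
entry = LedgerEntry(
    account="clearing/example",  # made-up account name
    amount_cents=1100,
    effective_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
    recorded_at=datetime(2024, 3, 3, tzinfo=timezone.utc),
)
float_days = (entry.recorded_at - entry.effective_at).days
```

Comparing the two timestamps directly surfaces float, which is exactly the kind of long-settlement noise discussed below.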
Long-clearing activity makes for some interesting business cases and long-standing float obviously can lead to some detection noise. For example, Brazil has uniquely long settlement times.
I work with payments at a bank in Brazil, and as I was reading the question I thought: "Interesting question. Here we have a great 'delay' in clearances, and very old, very complex government-embedded systems to deal with it. Funny to see that you know about it too."
Hello! I am the author. Thank you for the comment.
While Stripe does love to write, I think our core premise here was that being able to reason mathematically about independent distributed systems was a unique application of a ledger. Likewise, we think the data quality offering is a unique extension since it gives us tools to make improvements to our systems over time.
I appreciate the reply. I would have liked a clearer description of how you reason mathematically about your distributed systems. Simply claiming that you can and do is fine, but as an engineer I would have loved more practical examples. The only part I found that touched on this was the description of finding clearing errors by simply searching for clearing accounts with a non-zero balance. Are there other examples of mathematical reasoning you could share? I don't need/expect a reply here; the question is rhetorical, to show how I think this article could have been stronger for me.
Not the author, but I worked on the Ledger team at Stripe for 5 years.
Clearing is definitely a key part and a large amount of signal can be derived simply through the zero-balance assertion. To give you some more detailed examples of that:
1. System A bookkeeps +$11 and system B bookkeeps -$10 against the same account, where this is representative of A handing off responsibilities to B during some multi-step process and something was off. In practice, this might look like a Charge being handed off from a product team to the team integrating with card networks for submission. There's plenty of reasons this could have gone awry internally from either team like incorrect fee handling, incorrect FX handling, etc.
2. Pipelines reconciling data may bookkeep +$11 and -$10 against the same account, where the two events come from different data sources. This could be a difference between Stripe data and partner reporting, or even between two reports from the same partner (e.g. a transaction-level report against an aggregate we've estimated attributions for). Again, there are plenty of reasons this goes wrong: an error on our side, unexpected partner behaviour, an actual partner error, or our incorrect interpretation of a complex partner behaviour.
This approach is general in that we don't need to be concerned with what the actual data models or system interactions are to effectively apply monitoring to everything. Another element is not creating a lot of noise for non-zero balances while they're in an interim state. Some amount of modelling happens to establish an expected time to clear for each account, at the right granularity: "Charges from product teams to integrating teams usually take 10 minutes", "partner reports A and B arrive and are processed 1 day apart", or "we receive money from partner A about 2 days after submitting".
In the end, we're trying to boil down every other Stripe system and partner integration into a Stripe-wide set of discrete states, transitions, and the dollar values they carry. Teams may have their own system diagrams, but if they handle money they'll also have a parallel Ledger event diagram, which takes care to get right (e.g. failure modes, modelling everything well). I suspect that's what the author is getting at with reasoning more mathematically about distributed systems, as we add a twist on typical system design here.
I don't work at Stripe, but I've talked to a few ledger nerds there.
One thing I'd point out that makes Stripe's ledger different from most others you might have seen is the granularity of what they consider an account. In a bank ledger, you logically have a single account per customer. Stripe uses multiple accounts for each payment, so processing this number of accounts is a hard scaling problem, especially when most are ephemeral. The saving grace is that this can be done offline, and balances can be eventually consistent or materialised on read.
When you have this many accounts, organising them also is a big challenge, since you want to be able to roll them up to something meaningful. Conceptually asset and liability accounts in a ledger represent a financial relationship with an economic actor. If that's the top level account (e.g. how much does bank X owe us right now) you'll want to have a hierarchy of accounts down to each state of each payment. So a lot of effort went into building modelling tools for product teams to design how their financial activity gets represented in the ledger.
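A minimal sketch of the rollup idea above, assuming accounts are named as hierarchical paths from the economic-actor relationship down to per-payment states. The paths and amounts are invented for illustration; Stripe's actual modelling tools are certainly richer.

```python
from collections import defaultdict

# Hypothetical leaf balances: top-level segment is the financial
# relationship ("how much does bank X owe us"), leaves are
# per-payment states. Amounts in cents.
balances = {
    "assets/bank_x/settlement/pay_001/submitted": 1000,
    "assets/bank_x/settlement/pay_002/submitted": 2500,
    "assets/bank_x/settlement/pay_002/confirmed": -2500,
}

def rollup(balances, depth):
    """Sum leaf balances up to the first `depth` path segments."""
    totals = defaultdict(int)
    for path, amount in balances.items():
        prefix = "/".join(path.split("/")[:depth])
        totals[prefix] += amount
    return dict(totals)

# "How much does bank X owe us right now?" -- roll up to the
# economic-actor level.
top = rollup(balances, 2)
```

Because leaves net against each other as payments move between states, the rollup at the relationship level stays meaningful even while millions of ephemeral per-payment accounts churn underneath it.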
Disclaimer: I'm CTO at Fragment (http://fragment.dev) and Stripe led our last round.
To me, this reply is more valuable than the original post simply because it's written without such heavy dependence on jargon. It also does a better job of "showing, not telling" with examples of the behavior that needs to be handled.
I had to read it a few times too, I'm still not entirely convinced that "being able to reason mathematically about independent distributed systems was a unique application of a ledger" actually parses into anything.