This is similar to what I deal with at work, where we have a large number of devices (dare I say "IoT") that all send data to a cloud system. There's the time that each record was recorded, the time it was sent, the time it was actually picked up from the queue for processing, and so on.
It also arises in ETL systems like Apache Airflow, where the "logical date" of a task run might or might not correspond to the date it was actually run.
The common element here is that the time relationships are not monotonic, and you occasionally have to make business decisions around that non-monotonicity.
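To make that concrete, here's a tiny sketch (field names and the cutoff are invented, not from any real system) of the kind of business rule that non-monotonicity forces on you: each record carries both the time it was recorded on the device and the time our system received it, and a policy decides what to do when the two disagree badly.

```python
from datetime import datetime, timedelta

LATE_CUTOFF = timedelta(days=2)  # business rule: how late is too late to amend a published report

def classify(recorded_at: datetime, received_at: datetime) -> str:
    """Decide how to treat a record whose two timestamps disagree."""
    if received_at < recorded_at:
        return "clock_skew"      # device clock ran ahead of the server's; flag for correction
    if received_at - recorded_at > LATE_CUTOFF:
        return "late_arrival"    # too old to change reports we've already published
    return "normal"

print(classify(datetime(2024, 5, 1, 12), datetime(2024, 5, 4, 9)))  # -> "late_arrival"
```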
There's an old saying "two is an impossible number". In programming, we either deal with one of a thing, or a collection of things. The cases where there are exactly two things are uncommon (cue HN coming up with a bunch of exceptions). In the case of points in time, even bi-temporal leaves out a third, implicit point: now.
By human choice, though. We could be using N different thresholds of electrical charge or current, but instead we define ranges as "0-enough" and "1-enough".
> If you're new to Datomic, you probably have the same misconceptions as I did regarding the use of Datomic's historical features.
> bad news: you've probably over-estimated the usefulness of these features for implementing your own specific time travel. Unless you really know what you're doing, I recommend you don't use db.asOf(), db.history(), and :db/txInstant in your business logic code.
> good news: you've probably under-estimated the usefulness of these features for managing your entire system as a programmer.
> I believe the key to getting past this confusion is the distinction between event time (when things happened) and recording time (when your system learns they happened).
Another approach to this kind of thing would be storing most of your data as a log of events ("event sourcing") and having "Retroactive Address Change" as its own business-process, one which can be linked to the other steps like sending a refund check or crediting an account.
In that architecture, the database-row will often remain plain and boring with just one "current address." That "projection" (a result of events-so-far) would simply switch whenever the "Retroactive Address Change" event comes down the pipeline, which would be good enough for, say, the system that is sending out monthly statements.
Anybody needing more, like the exact history of changes and realizations, would be running their own processor and projection, scanning the event-stream in a way suitable to their specific use-case.
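A toy sketch of that projection idea (all names invented for illustration): the projection only ever keeps one "current address", and a retroactive change is just another event that happens to overwrite it, while anyone who cares about the history reads the event stream themselves.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AddressChanged:              # the ordinary, forward-looking change
    new_address: str
    effective: date

@dataclass(frozen=True)
class RetroactiveAddressChange:    # its own business event; can be linked to refunds etc.
    new_address: str
    effective: date                # a date in the past

def current_address(events):
    """Plain, boring projection: just the latest address, whichever event kind set it."""
    address = None
    for event in events:
        if isinstance(event, (AddressChanged, RetroactiveAddressChange)):
            address = event.new_address
    return address

events = [
    AddressChanged("12 Old Rd", date(2023, 1, 1)),
    RetroactiveAddressChange("9 New St", date(2023, 6, 1)),  # came down the pipeline later
]
print(current_address(events))  # -> "9 New St"
```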
I have never seen a productive application of event sourcing at scale. The idea is super elegant but there's so much extra overhead. Your data size is easily 10x bigger, unless things nearly never change anyway (and then why do it, right?). Even simple things such as fixing a typo require a Command to be stored as an Event, etc. Fixing fuckups by manual database edits / script runs gets harder, because it creates a mismatch between the events and the main DB, so you ought to inject events instead; but you're under time pressure, so what do you do? And the list goes on. I wish these things weren't issues, because in a very real way event sourcing feels like the "obviously best way" to deal with pretty much any kind of structured data storage. I want it to succeed.
Does anyone know of a system that successfully did/does event sourcing at scale? How did you deal with these downsides?
> Even simple things such as fixing a typo require a Command to be stored as an Event etc etc. Fixing fuckups ...
Event sourcing is a completely different way of thinking about business information. Of course you are going to run into things that look like "downsides". Consider the desired semantics of event sourcing and how your concerns run directly counter to those semantics.
If you aren't comfortable with what qualitatively seem like "trivial" updates, then I don't think a large enough leap of faith has been made. We like to empathize with the machine but sometimes we take it a bit too far.
Event sourcing absolutely works and it works at the most massive scales the best. If you can dedicate an entire wing of your 15th floor office to "enterprise messaging and events", this stuff 5000% works. It seems to be at the small scale that it fails most frequently. At small scale, it's usually just one developer who has enough motivation to see it through, and the moment they shift their focus it falls apart.
I saw it work in manufacturing in ways I cannot do justice in HN comment size limits.
Yes, the event model is the only thing that makes sense. Just wanted to say, I don't think Kent's idea is in conflict with the Event Sourcing pattern, as long as these "Effective/Posting" events are stored append-only. Once you start updating the "Effective" date, you're losing data (a customer might call back and ask what the old Effective date was, to understand a paper bill, and you've lost it).
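A minimal sketch of the append-only version (hypothetical event shape, not tied to any framework): a correction is a new event carrying both its effective date and its posting date, so "what were we telling the customer on day X?" remains answerable.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class EffectiveDateSet:
    effective: date   # when the change applies in the real world
    posted: date      # when we recorded it

# Correcting the effective date is another append, never an update in place:
log = [
    EffectiveDateSet(effective=date(2024, 3, 1), posted=date(2024, 3, 1)),
    EffectiveDateSet(effective=date(2024, 2, 1), posted=date(2024, 4, 15)),  # backdated correction
]

def effective_as_known_on(log, known_on: date):
    """What effective date were we quoting to the customer on day `known_on`?"""
    known = [e.effective for e in log if e.posted <= known_on]
    return known[-1] if known else None

assert effective_as_known_on(log, date(2024, 3, 20)) == date(2024, 3, 1)   # explains the old paper bill
assert effective_as_known_on(log, date(2024, 5, 1)) == date(2024, 2, 1)    # the corrected view today
```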
Git doesn't really. It does have a (more or less) immutable, append-only model, but the things stored in git are state snapshots, not events. The UI does tend to present it as if it were events but that's just for convenience.
I've implemented a system that uses event sourcing at "scale" (YMMV at what is "scale").
1. Data storage is not "10x bigger"; it depends on how long you keep past events around. Think about a bank account. The "monthly opening balance" is effectively a snapshot in time of all of the previous events for that account. So you can archive events to secondary storage as needed.
2. You don't "fix fuckups by manual database edits/scripts". You make sure that for every business level event, you have a reversal event defined as well. A fuckup is "reverse the affected events, then replay them with corrections". It requires the design at the start to understand that this needs to be dealt with.
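A toy sketch of that reversal pattern (nothing here comes from a specific event-sourcing framework): the fix is two more appends, a reversal and a corrected event, and every projection folds all of them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deposited:
    account: str
    amount: int          # cents

@dataclass(frozen=True)
class DepositReversed:   # a first-class business event, not a DB edit
    account: str
    amount: int
    reverses_seq: int    # which event it undoes

def balance(events) -> int:
    """Projection: fold every event, reversals included, into a balance."""
    total = 0
    for e in events:
        if isinstance(e, Deposited):
            total += e.amount
        elif isinstance(e, DepositReversed):
            total -= e.amount
    return total

# Oops: a deposit was recorded with the wrong amount. Fix = reverse + replay corrected.
events = [
    Deposited("acct-1", 10_000),                        # seq 0, wrong amount
    DepositReversed("acct-1", 10_000, reverses_seq=0),
    Deposited("acct-1", 1_000),                         # the corrected deposit
]
assert balance(events) == 1_000
```

Snapshots (like the monthly opening balance above) then let you archive everything before the snapshot without losing the ability to rebuild state.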
The benefits:
* Each entity has its own state and events; it's self-contained and easily mockable and testable on its own. The events are idempotent.
* Dependencies between entities are entirely driven by the events, so they are fully decoupled and can be deployed blue/green, sharded, scaled, etc. independently.
* The events are all that is needed to recreate a scenario or failure, so can be saved and replayed as needed.
* Deployment of new entities and/or blue/green and/or scaling becomes trivial.
Cons:
* If there's an error in production, you have to apply reversal events and then replay the previous events to "fix" a problem. There isn't a "main DB" as such, so there's nothing to change there, but anything that takes the stream of events and turns it into an RDBMS (for reporting etc.) needs to also understand the reversals and replacement events. This can lead to confusion when a report generated on day X is reversed/redone and now it is "different" to what the report showed at the time. However, this can be covered by actually dealing with the reversals as first-level events in their own right.
* When you reverse/replay events for an entity, you need to be aware of side effects on other entities. For example, if you reverse/replay an order, you need to ensure that you don't trigger another round of payment processing. That means adding logic to the payment entity to ensure that it "knows" what order it got payment for so that it doesn't repeat.
* Event sourcing is not about "structured data storage"; it's entirely about business logic. The events are business events. Programmers seem to have trouble dealing with the fact that the low level storage and processing are independent of the business events.
* You need to have versioning of event schemas so that you can add/remove attributes of the event type over time. Your processing has to accommodate the versioning (semver is your friend).
* If you have your event processing separate from your query processing, you cannot rely on the query when processing events for the entity itself. If you have multi-threaded and/or scaled processing that deals with the entity at the same time, you need to have some form of internal caching/locking/vector-clocks to ensure that you don't have two processes updating the entity from two different events simultaneously.
So it has distinct advantages when in production and in the CI/CD deployment. It has additional complexity at the architecture and high level design phase and it needs some standardized libraries and infrastructure to support the necessary features (primarily the internal synchronization of entity/event processing between threads/processes to ensure monotonic updates of the state of the entity).
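For what it's worth, one common shape for that per-entity synchronization is optimistic concurrency on an expected stream version; a rough sketch (not any particular event store's API):

```python
class ConcurrencyError(Exception):
    pass

class EntityStream:
    """Append-only event stream for a single entity with optimistic concurrency."""

    def __init__(self):
        self.events = []

    @property
    def version(self) -> int:
        return len(self.events)

    def append(self, event, expected_version: int) -> None:
        # Two writers racing on the same entity: only the one whose expectation
        # still holds wins; the other must reload and retry.
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {self.version}"
            )
        self.events.append(event)
```

The losing writer reloads the stream, re-runs its logic against the newer state, and retries, which keeps updates to each entity monotonic without a global lock.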
In case anyone is wondering, if this isn't dealt with by the system designer, then what ends up happening far too often is the data team is tasked with inferring the sequence of events based on a series of clues across this and other business systems to come up with some kind of pre-eventual-consistency guess.
I built a similar system for one of my old insurance clients. Every 3 months we would send an updated list of what we wanted the insurance company to cover. This batching worked great because changes came in daily and without it we would have to amend the policy for each individual change. We tracked the date the client requested a change (posting date in the article) alongside an effective date (the date we would base cost calculations off of).
Often the client would forget to report changes to us, so the reporting delay worked in our favor. For example, we would find out on Friday that a client needed coverage starting the previous Monday. We would make sure there were no known losses and backdate coverage in our database. The insurance carriers would receive the report in a few months and be none the wiser. Carriers will backdate coverage like this, but it can be messy with the paperwork. Tracking posting dates, effective dates, and reporting dates really cut down on the complexity of managing the day-to-day insurance needs of my client.
> Part of the reason it hasn’t taken off is because of the additional complexity it imposes on programmers.
My take is that it's specifically the additional complexity it imposes on _database_ programmers - the impacts across storage design, foreign keys and schema evolution when all data is bitemporal are profound. SQL:2011 introduced "temporal tables" but the design was arguably incomplete and implementing it ubiquitously requires a real re-think of database internals, so we're now over a decade on and these capabilities are still far from commonplace.
> However, I think part of the reason it hasn’t become more popular, given the benefits it brings, is just the name.
No disagreement from me there!
Disclosure: I work on xtdb.com, building a new database engine where bitemporality is the default. I also spoke with Kent about this last week and am running a webinar about bitemporality more generally on Thursday: https://attendee.gotowebinar.com/register/296060701290006793...
I don't think much has really changed since, and I'm not sure Postgres is any closer to addressing this natively (although there have been extensions, e.g. https://github.com/scalegenius/pg_bitemporal).
Very cool article. It's a problem I knew kind of existed but never thought about formulating. I'm glad it's been at least called something since the 90s - this means maybe there's good stuff to read!
Bit annoying as I'm writing an event store in rust and now I want to make it bitemporal. But annoying in a good way.
Except making n-dimensional indexes both fast and scalable is hard. Often it's safest to build your way up, one index at a time, until you fully understand the workload requirements.
N-dimensional event-correction logic is also likely to be a lot more complicated than the two-dimensional case, which you might be able to handle exhaustively with just a bit of care (and without breaking out the logical proof systems).
The more I learn about the relational model from a first order logic perspective, the more I wonder what we would have ended up with if more people thought of data/databases in this way (as opposed to the more record/object school of thought).
Event sourcing and CRDT-like distributed data using something like the Event Calculus?
Heh, when I read the title it suggested a completely different tangent: that pure technical stuff is always eventually consistent with business needs/goals. Sometimes very close (success). Sometimes very perpendicular (failure).
Never mind, bi(tri)-temporality is an amazing concept (together with "relations are objects too", which someone mentioned a long time ago). Back in the day, ~2006-7, I wrote a library [1] to allow for it over plain SQLAlchemy, in a bigger project that never saw the light of day. Heh, recently I made a wrapper around xtdb for Python [2], for another project which may or may not happen either.
But it is very difficult to grasp for common programmers, let alone common people. (Actually, it's similar to time jitter and other dimensional inconsistencies; wow-and-flutter.) People like to think procedurally, i.e. in sequence.
Now, 20 years later, temporality should be the default way of handling data - not a quirky write-your-own-wrongly or whatever weird extension.
It's tempting to go "wow bitemporality is a jolly great idea! Let's make a bitemporal database where every field is bitemporal" but this elides the significant jump in complexity/overhead of such systems.
Bitemporality should be explicitly enabled on data where required, part of the database schema rather than default behavior. This could be a big chunk of your data, but unlikely to be 100%.
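As a rough illustration of what "explicitly enabled" can look like at the application level (all names hypothetical, and this is not the SQL:2011 temporal-tables design): only the tables that need it carry the two time dimensions, valid time and recorded/transaction time, and everything else stays a plain current-state row.

```python
from dataclasses import dataclass
from datetime import datetime, date

DISTANT_FUTURE = datetime.max

@dataclass(frozen=True)
class AddressVersion:
    customer_id: int
    address: str
    valid_from: date          # valid time: when it's true in the real world
    valid_to: date
    recorded_at: datetime     # transaction time: when the database learned it
    recorded_until: datetime = DISTANT_FUTURE  # open, until superseded by a correction

def address_as_of(rows, customer_id: int, valid_on: date, as_known_at: datetime):
    """Bitemporal lookup: the address valid on `valid_on`, per our knowledge at `as_known_at`."""
    for r in rows:
        if (r.customer_id == customer_id
                and r.valid_from <= valid_on < r.valid_to
                and r.recorded_at <= as_known_at < r.recorded_until):
            return r.address
    return None
```

Queries against ordinary tables stay exactly as they were; only the opted-in data pays the extra lookup cost.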
I mostly agree, but it's the same as the argument for immutability in systems more generally: if you can afford the RAM / GC / storage costs then great. Eventually though costs will decrease and new systems will routinely capitalize on the productivity benefits in spite of reduced peak efficiency. HTAP database systems illustrate this trend.
But it's not clear to me that existing, popular database engines will ever be able to evolve to make bitemporality an efficient or easy default choice. And in the meantime countless person-years of effort are wasted rediscovering and solving the same time-related problems over and over.
A world where bitemporality was (somehow) a realistic default choice would be a simpler world, I reckon.
> A world where bitemporality was (somehow) a realistic default choice would be a simpler world, I reckon.
I think MongoDB sorta proves this theory false: "let's avoid RDBMS schema/migration hell by making everything a flexible JSON document".
As it turns out, JSON columns are a jolly good idea, but making that the dominant default i.e. MongoDB led to a bunch of other (perhaps unanticipated) issues. I believe it would be a similar situation with bitemporality.
So yeah, it would be cool if e.g. Postgres had fast native support for bitemporal fields in a way that makes them accessible without changing the whole paradigm.
I'm not convinced about that comparison. MongoDB entirely disregarded the relational model whereas bitemporality—as envisioned by Snodgrass et al. and standardised in SQL:2011—is merely an extension/evolution of it.
> it would be cool if e.g. Postgres had fast native support for bitemporal fields in a way that makes them accessible without changing the whole paradigm
100% agreed with you there. And there has been various work on patches & extensions in this area previously. Unfortunately I'm not sure Postgres' internals will ever be able to flex hard enough to deliver a decent "default choice" experience. At least not until some other system has proven the UX is really worth the effort (which I guess MongoDB did similarly achieve with native JSON handling).
As it turns out, other mature options offer this temporality out of the box. Fauna (disclaimer: I work there) lets you query a record at a specific point in time in the past. It combines this with native JSON documents and flexible schema (so you can add additional notes as an audit trail) without disregarding relational modeling (it supports joins, foreign keys, normalization, etc.), and with low-latency distributed ACID writes and transactions. A couple of other databases mentioned here (Datomic and xtdb) support this too, and I believe it is just a matter of time before this is more widely used.
Two additional points: a) on "let's avoid RDBMS schema/migration hell by making everything a flexible JSON document" - that's a side effect of Mongo's specific design decisions; our db handles schema migrations quite elegantly, so I wouldn't be quick to assume that because Mongo did it that way, most systems will exhibit those side effects. b) The main reason I'm convinced this will be used a lot more (call it event-driven systems, temporal systems, etc.) is that it will also result in better-trained ML behavioral and prediction models. So, might as well get ahead of it.
We just call that "point in time" (PiT) data. There can be multiple dimensions of timestamping, but you often end up with:
- published date: when this information became publicly available
- received date: when this information was actually received in our systems
- effective date: when this information starts applying