Turning the database inside-out (2015) (kleppmann.com)
158 points by andection 15 hours ago | 72 comments





We did this style on top of plain MSSQL. Each event type has a SQL table which is the primary storage. Then we have workers that listen to new data in those tables and update the projections we need. (Sometimes DB triggers, but mostly async workers.)
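Roughly, the shape is something like this (a hedged sketch: table and column names are made up, and it's written against Postgres with psycopg2 rather than MSSQL). Here the projection is rebuilt from scratch as a pure fold over the event table; in practice the workers update it incrementally:

  import psycopg2  # assumption: Postgres + psycopg2 stand in for our MSSQL setup

  conn = psycopg2.connect("dbname=shop")

  def rebuild_order_totals():
      # the append-only event table is the primary storage; the projection is derived
      with conn, conn.cursor() as cur:
          cur.execute("SELECT order_id, payload FROM order_events ORDER BY seq")
          totals = {}
          for order_id, payload in cur.fetchall():
              # payload is a jsonb column, so psycopg2 hands us a dict
              totals[order_id] = totals.get(order_id, 0) + payload["amount"]
          cur.execute("TRUNCATE order_totals")
          cur.executemany(
              "INSERT INTO order_totals (order_id, total) VALUES (%s, %s)",
              list(totals.items()),
          )

  rebuild_order_totals()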

The main issue is "listening to new data in a SQL table". I wrote this code to achieve it in MSSQL (perhaps it is somehow built into postgres?): https://github.com/vippsas/mssql-changefeed

In my experience this approach is beautiful; as Martin says, our backend code is mostly stateless and functional these days, so why have mutable objects in the DB? And the approach is extremely useful for dry-running business logic etc.

But we didn't like the prospect of adopting Kafka wholesale. Having all the data in a SQL DB is extremely convenient for debugging, and since we already used SQL it was a smaller change: we adopted the pattern first where it made the most sense and then spread it out.

It would be great to have more DB features targeting this style. Explicit partitioned event tables (Kafka-in-SQL), and writing a projection simply as a SQL query which is inverted into an async trigger by the DB, would be awesome. (MSSQL has indexed views, but that cannot be done online...)

Materialize is the DB I know about in this territory.


I use the architecture you described as the go-to architecture for most so-called "business" applications (with PostgreSQL as the DBMS though).

The missing pieces are:

- incremental materialized view maintenance

- (bi)temporal primary and foreign keys (upcoming in Pg18)

Lately I found out about DBSP and Feldera, which look very promising as they are based on sound theory: https://github.com/feldera/feldera?tab=readme-ov-file#-theor...


Great link, thanks!

What do you do in postgres for consuming old events and smoothly transitioning to consuming new events? Anything like an event ID allocated at commit time that is usable? As I talk about in a sibling comment..


> The main issue is "listening to new data in a SQL table".

You may want to take a look at Service Broker[0]. It's the idiomatic messaging and queuing part of SQL Server. It's a somewhat obscure feature and has a bit of a steep learning curve. If I were trying to implement what you're doing, it would be the tool I'd reach for.

[0] https://learn.microsoft.com/en-us/sql/database-engine/config...


“ The main issue is "listening to new data in a SQL table". I wrote this code to achieve it in MSSQL (perhaps it is somehow built into postgres?): https://github.com/vippsas”

Postgres uses "publications" for this purpose. Clients can subscribe to a publication via logical replication and receive changes to a given table.


I always thought that the most flexible approach was:

- good old mutable relational tables

- a separate DB to store immutable events (could be the same kind of DB you use for business transactions, or something fancy like BigQuery)

I feel like mixing both into one has more disadvantages


This is of course the more common approach; I was aware that ours is not the usual one, which is why I posted.

You don't list the disadvantages, so I cannot respond to that. But I really like the code resulting from flipping it around. It just fits how I think, and I now feel "unsafe" messing around with code that mutates state directly.

The programming style of those mutable relational tables is a bit like mutable objects in Java -- eventually people moved to more stateless code and immutable objects. The same shift doesn't have to happen in the storage layer, but it is what the OP (and I) argue for.

I really enjoy having the tools around to replay any object from history as part of the online backend. For instance, suppose there was a bug so that customer input was corrupted in some timeframe. Instead of writing a job that looks through each customer's history, tries to figure out whether you should mutate the data, and then rolls out that ad hoc mutation while you hold your breath -- you can add some rules at lookup time, when the relevant events are fetched, that change how they are reflected in the state. That is a read-only change that only touches deployed code, not data, and you roll it back by rolling back the service, not by reverting a data change.


> perhaps it is somehow built into postgres?

Postgres has a built-in listen/notify mechanism. The problem with that is that it doesn't guarantee delivery: if no process is listening, notifications will be lost.

Most solutions that need something like that use "logical decoding" these days. That's the built-in change data capture, exposed as a public API as part of logical replication.
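For anyone curious, a minimal sketch of polling logical decoding from plain SQL (slot and database names are made up; the 'test_decoding' contrib plugin is just for illustration, real pipelines tend to use wal2json or pgoutput over a replication connection):

  import psycopg2

  conn = psycopg2.connect("dbname=shop")
  conn.autocommit = True

  with conn.cursor() as cur:
      # one-time setup; requires wal_level = logical and replication privileges
      cur.execute("SELECT pg_create_logical_replication_slot('events_slot', 'test_decoding')")

  with conn.cursor() as cur:
      # each call returns changes committed since the last call and advances the slot
      cur.execute("SELECT lsn, xid, data FROM pg_logical_slot_get_changes('events_slot', NULL, NULL)")
      for lsn, xid, data in cur.fetchall():
          print(lsn, data)  # 'data' is a textual description of each INSERT/UPDATE/DELETE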


Yes, listen/notify is something very different. We would often write new projections that consume events from years back up until today.

You want sequence numbers that indicate the event's position in a partitioned log.

Something like "int identity", except that the number is assigned during commit, so that you have a guarantee that if you see IDs 5 and 7, then 6 will never show up later. Each consumer can then store a cursor of its progress through the table, and that cursor is safe against concurrent inserts.
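A sketch of the consumer side this enables (hypothetical schema): because the sequence number is gap-free and assigned in commit order, "seq > cursor" can never skip an event that is still in flight.

  import psycopg2

  conn = psycopg2.connect("dbname=shop")

  def consume_batch(consumer_name, handle_event, batch_size=100):
      with conn, conn.cursor() as cur:
          cur.execute("SELECT last_seq FROM consumer_cursor WHERE name = %s FOR UPDATE",
                      (consumer_name,))
          (last_seq,) = cur.fetchone()
          cur.execute("SELECT seq, payload FROM order_events WHERE seq > %s ORDER BY seq LIMIT %s",
                      (last_seq, batch_size))
          for seq, payload in cur.fetchall():
              handle_event(payload)   # e.g. update a projection table
              last_seq = seq
          cur.execute("UPDATE consumer_cursor SET last_seq = %s WHERE name = %s",
                      (last_seq, consumer_name))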

I was hoping to do it using CDC, but Microsoft SQL has a minimum one-minute delay on CDC, which destroys any live-data use case. Perhaps postgres allows listening to the replication log with lower latency?


> Perhaps postgres allows listening to the replication log with lower latency?

Yes, I think that's what the "logical decoding" referred to. Postgres can emit a "logical" version of the WAL (something with a stable spec written down so that other services can stream and decode it). My understanding was that "logical replication" was designed for low-latency situations like creating read replicas.

I haven't heard of the logical log being preserved for "years back" but that's an interesting case...


That is OK, I guess I would write a job to listen to the logical WAL and use it to do an update that writes an event sequence number.

> Something like "int identity" except that the int is assigned during commit, so that you have guarantee that if you see IDs 5 and 7, then 6 will never show up

I don't think that's possible, nor is it something you should actually need.

If two transactions tx1 and tx2 are concurrent (let's say tx2 begins after tx1 and also finishes after tx1), then tx2 has done some work without access to tx1's data (as tx1 hadn't committed yet when tx2 began). So either:

- tx1's data is relevant to tx2, so tx2 needs to abort and retry. In which case the sequence number doesn't _need_ to be assigned at commit time; it can be assigned at any time and will be increasing monotonically between related transactions.

- tx1's data is irrelevant to tx2, in which case the ordering is irrelevant and you don't need to assign the sequence number as late as commit time.

The "relevance" is what partition keys encode: if tx1 and tx2 are potentially conflicting, they should use the same partition key. It doesn't enforce that sequence numbers increase monotonically within a physical partition, but it enforces that they do for a given _partition key_ (which is what should matter, the key->partition assignment is arbitrary).

> Perhaps postgres allows listening to the replication log with lower latency?

Pretty sure it does, you can listen to the WAL which is as instant as it gets. We were doing that in a previous company: a process (debezium) would listen to the WAL for a specific "events" table and write to Kafka. The main downside is that the table isn't an outbox, it keeps growing despite the events having been pushed to Kafka.


You explain why I don't need them for a very different use case than the one I refer to.

My point is I want a new primitive -- a pub/sub sequence number -- to avoid having Kafka around at all.

After all, what Kafka does is "only" to generate and store such a sequence number (it orders events on a partition, but the sequence number I talk about is the same thing, just in a different storage format). So you do need it in the setup you describe too, you just let Kafka generate it instead of having it in SQL.

Assuming your workload is fine with a single DB, the only thing Kafka gives you is in fact assigning such a post-commit row sequence number (+API/libraries building on it).

This is the mechanism used to implement pub/sub: Every consumer tracks what sequence number it has read to (and Kafka guarantees that the sequence number is increasing).

That is what mssql-changefeed linked above is about: assigning those event log sequence numbers in the DB instead, and not using any event brokers (or outboxes) at all.

For postgres I would likely then consume the WAL and write sequence numbers to another table based on those...
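One naive way that could look (hypothetical schema, and a polling variant rather than actually consuming the WAL; not necessarily how mssql-changefeed does it): a single sequencer job stamps committed-but-unsequenced rows, so the resulting seq values are gap-free and monotone from any consumer's point of view.

  import time
  import psycopg2

  conn = psycopg2.connect("dbname=shop")

  def stamp_sequence_numbers():
      with conn, conn.cursor() as cur:
          cur.execute("SELECT pg_advisory_xact_lock(42)")  # only one sequencer stamps at a time
          cur.execute("SELECT coalesce(max(seq), 0) FROM order_events")
          (last_seq,) = cur.fetchone()
          cur.execute("SELECT id FROM order_events WHERE seq IS NULL ORDER BY id")
          for (row_id,) in cur.fetchall():
              last_seq += 1
              cur.execute("UPDATE order_events SET seq = %s WHERE id = %s", (last_seq, row_id))

  while True:
      stamp_sequence_numbers()
      time.sleep(0.1)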

It may seem clunky but IMO installing and operating Kafka just to get those pub/sub sequence numbers assigned is even clunkier.


It sounds like the Log Sequence Number in Postgres is what you are looking for. If you subscribe to a Postgres publication via logical replication, each commit will be emitted with a monotonically increasing LSN.

I would love to know if other people in the industry (besides Hickey/Datomic) use the immutable log/stream + integrators. From my limited experience in enterprise apps: auditability and time travel are always bolted onto good old SQL tables/snapshots after the fact, and the pain is already baked in.

Depends on the industry. A lot of non-tech industries will use a commercial DB with all those features in place already, rather than hacking up their own data layer. A few years ago I spent some time in the enterprise finance space and learned some unfashionable tech you don't see talked about on Hacker News much. It left me with a new appreciation for what goes on there. A staggering amount of time in tech startups is spent solving and re-solving problems for which you can buy off-the-shelf solutions, and have been able to for a long time.

After all, this talk is now 10 years old but appears to be describing features that have been around for much longer. Take your average bank - it will have a bunch of Oracle databases in it. Those already have every feature discussed in this thread and in the talk:

• Incremental materialized view maintenance (with automatic query rewrite to use it, so users don't have to know it exists).

• Exposing logical commit logs as an API, with tooling (e.g. GoldenGate, LogMiner, query change notifications).

• Time travelling SELECT (... AS OF).

• Lots of audit features.

• Integrated transactional and scalable MQ (no need for Kafka).

My experience was that faced with a data processing problem, enterprise devs will tend to just read the user guide for their corporation's database, or ask for advice from a graybeard who already did so. They go write some SQL or an Excel plugin or something old school, ship it, close the ticket, go home. Then a few years later you look at HN and find there's a whole startup trying to sell the same feature.


Who besides Oracle offers this stuff though?

Yeah, Oracle has a bunch of nice features; it also costs a gajillion dollars that no one besides a large enterprise can afford.


Time-based snapshots are in Datomic and also possible in other DBMSs (maybe via extensions).

For the rest I don't know.


XTDB (inspired by datomic) also has bitemporal queries.

Wouldn't surprise me a bit. Thanks for the detailed comment.

In large-scale business integration platforms/apps, you have operational systems like SAP and Oracle Service Cloud generate/stream raw or business events, which are published to message brokers in topics (orders, incidents, suppliers, logistics, etc.). There the data is published, validated, and transformed (filtered, routed, formatted, enriched, aggregated, etc.) into other downstream topics, which can be used to egress to other apps or enterprise data stores/data lakes. Data governance apps control who has access. Elasticsearch or Splunk for data lineage and debugging. You also have observability systems sandwiched in there as well.

thoughts from 1992 (Gray+Reuter): https://news.ycombinator.com/item?id=42829878

Very interesting; every generation foresees the same solutions somehow.


I have implemented something similar (updating projections in the background and serving them automatically via REST) and wrote a high-level article about it: https://www.fabianzeindl.com/posts/the-api-database-architec...

The thing I always get stuck on with these techniques is: how do you handle transactions which perform validations/enforce invariants on data when you're just appending writes to a log and computing materialized views down the line? How can you do, essentially, an "add item to shopping cart" if, for example, users can only have a max of 10 items, so you need to validate that there aren't already 10 items in the cart?

This all sounds to me very close to the event-sourcing/CQRS/DDD area of thinking. In which case you look at it in two parts:

- Event firing: Here is where you fire an event saying that the thing has happened (i.e. item_added_to_cart, not add_item_to_cart). Crucially, this event states the thing has happened. This isn't a request, it is a past-tense statement of fact, which is oddly important. It is therefore at this point where you must do the validation.

- Event handling: Here you receive information about an event that has already happened. You don't get to argue about it, it has happened. So you either have to handle it, or accept that you have an incomplete view of reality. So perhaps you have to accept that the cart can have more than 10 items in some circumstances, in which case you prompt the user to correct the problem before checking out.

In fact, this is typically how it goes with this kind of eventual consistency. First fire the event that is as valid as possible. Then, when handling an 'invalid' event, accept that it's just got to happen (the cart has 11 items now), then prompt the user to fix it (if there is one).

Not sure how helpful this is here, but thought it a useful perspective.


You're not "just" writing to a log. You always need a view of the state of the world to take decisions that enforce invariants (in this case, the "world" is the cart, the "decision" is whether an item can be added).

What you'd do is, when you receive the "addToCart" command, construct the current state of the cart by reading the log stream (`reduce` it into an in-memory object), which has enough data to decide what to do with the command (eg throw some sort of validation exception). Plus some concurrency control to make sure you don't add multiple items concurrently.

For reading data, you could just read the log stream to construct the projection (which doesn't need to be the same structure as the object you use for writes) in-memory, it's a completely reasonable thing to do.

So at the core, the only thing you persist is a log stream and every write/read model only exists in-memory. Anything else is an optimization.

DDD calls this "view of the world" an "aggregate". Reading the log stream and constructing your aggregate from it is usually fast (log streams shouldn't be very long), if it's not fast enough there's caching techniques (aka snapshots).

Similarly, if reducing the log stream into read models is too slow, you can cache these read models (updated asynchronously as new events are written), this is just an optimization. This comes at the cost of eventual consistency though.
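A tiny in-memory illustration of that (the event shapes are made up): the persisted log stream is the only source of truth, and both the write model and any read model are just folds over it.

  from functools import reduce

  events = [
      {"type": "ItemAddedToCart", "sku": "book-1", "price": 20},
      {"type": "ItemAddedToCart", "sku": "mug-7", "price": 8},
      {"type": "ItemRemovedFromCart", "sku": "mug-7"},
  ]

  def apply_event(cart, event):
      items = dict(cart["items"])
      if event["type"] == "ItemAddedToCart":
          items[event["sku"]] = event["price"]
      elif event["type"] == "ItemRemovedFromCart":
          items.pop(event["sku"], None)
      return {"items": items, "total": sum(items.values())}

  read_model = reduce(apply_event, events, {"items": {}, "total": 0})
  print(read_model)  # {'items': {'book-1': 20}, 'total': 20}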


You don't use it for that sort of thing.

But if you did you'd need an aggregatable (commutative) rule.

Like you can't aggregate P99 metrics. (To see why, it is similar to why you can't aggregate P50: a median of a bunch of medians is not the overall median.)

So you measure the number of requests with latency < 100ms and the total number of requests. Both of these aggregate nicely. Divide one by the other and you get the Pxx for 100ms. So if your P99 target was 100ms, you set your 100ms target to 99%.

Anyway, you'd need something like this for your shopping cart. It is probably doable as a top 10 (and anything else gets abandoned). Top 10 is aggregatable; you just need an ordering. It could be added-to-cart time or price.
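A small worked example of the commutativity point (made-up numbers): merging per-partition medians gives the wrong answer, while merging (count under threshold, total count) pairs gives the exact fraction.

  import statistics

  partition_a = [10, 20, 500]       # ms latencies on node A
  partition_b = [30, 40, 50, 900]   # ms latencies on node B

  # wrong: a median of per-partition medians is not the overall median
  median_of_medians = statistics.median(
      [statistics.median(partition_a), statistics.median(partition_b)])
  true_median = statistics.median(partition_a + partition_b)
  print(median_of_medians, true_median)  # 32.5 vs 40

  # right: counts aggregate cleanly across partitions
  def summarize(xs, threshold_ms=100):
      return (sum(1 for x in xs if x < threshold_ms), len(xs))

  under_a, total_a = summarize(partition_a)
  under_b, total_b = summarize(partition_b)
  print((under_a + under_b) / (total_a + total_b))  # ~0.714 of requests are under 100ms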


This kind of thing only works if your whole universe belongs to your database.

Transactions and conditional-updates work smoothly if it's your customer browsing your shop in your database - up to a point.

But I usually end up with partner integrations where those techniques don't work. For instance, partners will just tell you true facts about what happened - a customer quit, or a product was removed from the catalogue. Your system can't reject these just because your DB can't or won't accept that state change.


All systems I have worked on like this have some concept of a version number for each entity / aggregate.

So you get that the account_balance was $100 on version 10 of the account, and write an event that deducts $10 on version 11.

If another writer did the same at the exact same time, they would write an event to deduct $100 at version 11. There would be a conflict and only one version would win.

This is exactly like ordinary optimistic concurrency control, the same as you would do without events as the primary storage.

Didn't check if the system linked to supports this, I guess it might not? But this primitive seems quite crucial to me.
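A sketch of that primitive (hypothetical schema): a unique constraint on (aggregate_id, version) means two writers racing to append version 11 cannot both succeed; the loser sees a constraint violation, reloads, and retries.

  import psycopg2
  from psycopg2 import errors

  conn = psycopg2.connect("dbname=bank")

  def append_event(account_id, expected_version, payload):
      try:
          with conn, conn.cursor() as cur:
              cur.execute(
                  "INSERT INTO account_events (aggregate_id, version, payload) VALUES (%s, %s, %s)",
                  (account_id, expected_version + 1, payload))
      except errors.UniqueViolation:
          return False  # someone else wrote version N+1 first; reload the aggregate and retry
      return True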


I was thinking about this as well as these streaming things become more popular. Would you write add-to-cart events that trigger add-cart events, the latter containing a valid field which becomes false after the 10th add-cart? So after that you remove-from-cart, which triggers add-cart, which then becomes valid again at < 11 items? And transactions similarly roll back by running the inverse of what happened after the transaction started. I'm just thinking out loud. I understand you probably wouldn't use this for that, but let's have some fun, shall we?

I assume that shopping cart limit is a made-up example, but I'm curious what preconditions are you actually enforcing in the real world via DB transaction rollback?

TL;DR: You evaluate your preconditions and invariants before the event is published, using the current state of the aggregate.

Here's how that looks in a DDD world:

* An aggregate is responsible for encapsulating business rules and emitting events

* An aggregate is responsible for maintaining the validity of its own state (ensuring invariants are valid)

* When a command/request is received, the aggregate first rehydrates its current state by replaying all previous events

* The aggregate then validates the command against its business rules using the current state

* Only if validation passes does the aggregate emit the new event

* If validation fails, the command is rejected (e.g., throws CartMaxLimitReached error)

Example flow: Command "AddItemToCart" arrives

>> System loads CartAggregate by replaying all its events

>> CartAggregate checks its invariants (current items count < 10)

>> If valid: emits "ItemAddedToCart" event. If invalid: throws CartMaxLimitReached error
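A rough sketch of that flow (class, event, and store names are all made up):

  class CartMaxLimitReached(Exception):
      pass

  class CartAggregate:
      MAX_ITEMS = 10

      def __init__(self):
          self.items = []

      @classmethod
      def load(cls, events):
          cart = cls()
          for event in events:   # rehydrate current state by replaying history
              cart.apply(event)
          return cart

      def apply(self, event):
          if event["type"] == "ItemAddedToCart":
              self.items.append(event["sku"])
          elif event["type"] == "ItemRemovedFromCart":
              self.items.remove(event["sku"])

      def add_item(self, sku):
          # validate the command against business rules using current state
          if len(self.items) >= self.MAX_ITEMS:
              raise CartMaxLimitReached(sku)
          event = {"type": "ItemAddedToCart", "sku": sku}
          self.apply(event)
          return event           # the caller appends this to the event store

  # usage: cart = CartAggregate.load(event_store.load("cart-123"))
  #        event_store.append("cart-123", cart.add_item("book-1"))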


You write the 'add item' event regardless, and when building the 'cart' view you handle the limit.

Alternatively "invalid cart" could itself become an event.

Well, assume the non-overdraftable bank account example instead; what do you do then?

Sounds like an easy way to run out of storage space

This is also known as event sourcing [0] and is a common pattern used inside of databases, in git, in lots of popular software.

I don't generally recommend it for every application as the tooling is not as well integrated as it is in an RDBMS and the data model doesn't fit every use-case.

However, if you have a system that needs to know "when" something happened in an on-going process, it can be a very handy architecture... although with data-retention laws it can get tricky quickly (among other reasons).

[0] https://martinfowler.com/eaaDev/EventSourcing.html


> Databases are global, shared, mutable state. [...] However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?

Because the world itself is a global, shared, mutable state (which, incidentally, is also a single source of truth) and databases were invented to mirror it (well, relevant parts of it) 1-to-1, or close to it. This style of "we use the database as a proxy of the physical world itself" is still pretty common, see e.g. the example with the shopping cart somewhere else in these comments.


Seeing the world as mutable is a matter of perspective: if you explicitly model time as a dimension, it can instead be seen as a sequence of transitions from immutable state to immutable state, an accumulation of events over time, which fits the log abstraction perfectly.

> if you explicitly model time as a dimension it can instead be seen as a sequence of transitions from immutable state to immutable state, an accumulation of events over time, which fits the log abstraction perfectly

I worked with a feature that used this approach once. It even made sense for the feature (an immutable history log of patient chart data). It was absolute hell to work with. Querying current state, which was 99% of the use cases, was cumbersome and extremely slow.

Turns out doctors rarely care about any of that immutable history. They just wanna know what’s up right now. In their ideal world, you’d re-answer all the same questions 30 seconds before walking in the door.

Turns out a combination mutable table of current state + derived immutable log/snapshot/audit table works much better for most things.


> I worked with a feature that used this approach once.

You work with many features that use this approach.

  https://www.dbi-services.com/blog/oracle-log-writer-and-write-ahead-logging/
  https://git-scm.com/docs/git-reflog
  https://dev.mysql.com/blog-archive/mysql-8-0-new-lock-free-scalable-wal-design/
  https://www.postgresql.org/docs/current/wal-intro.html
  https://www.sqlite.org/wal.html
  https://www.amazon.science/blog/lessons-learned-from-10-years-of-dynamodb

You can use snapshotting to increase performance of the sequence of transitions approach.
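A minimal sketch of snapshotting (the store API here is hypothetical): persist the folded state every N events, so rehydration only replays the tail written after the snapshot.

  SNAPSHOT_EVERY = 100

  def load_aggregate(store, aggregate_id, apply_event, initial_state):
      snapshot = store.latest_snapshot(aggregate_id)   # returns (state, last_seq) or None
      state, last_seq = snapshot if snapshot else (initial_state, 0)
      snapshot_seq = last_seq
      for seq, event in store.events_after(aggregate_id, last_seq):
          state = apply_event(state, event)
          last_seq = seq
      if last_seq - snapshot_seq >= SNAPSHOT_EVERY:
          store.save_snapshot(aggregate_id, state, last_seq)
      return state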

Except lots of applications need to be able to forget things, similar to how things can be destroyed in the real world

Forgetting what you had for lunch last Friday is not necessarily the same as changing what you had for lunch last Friday.

You could, for example, throw away the encryption key for that fact. What you had for lunch is now inaccessible, but the fact remains unchanged.
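This is sometimes called crypto-shredding. A toy illustration using the cryptography library's Fernet API (the per-subject key store is hypothetical):

  from cryptography.fernet import Fernet

  keys = {}  # per-subject key store; in practice a table or a KMS

  def record_fact(subject_id, fact: bytes) -> bytes:
      key = keys.setdefault(subject_id, Fernet.generate_key())
      return Fernet(key).encrypt(fact)   # the ciphertext goes into the immutable log

  def forget_subject(subject_id):
      keys.pop(subject_id, None)         # events stay in the log but are now unreadable

  blob = record_fact("user-42", b"had soup for lunch on Friday")
  forget_subject("user-42")
  # the log still contains `blob`, but without the key it cannot be decrypted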


> perfectly

That's a joke, right? If you tried to live in this world, as a human being, not in some abstract database-building sense, you'd be completely lost in the first second of your existence, and would probably die in minutes because your body parts would "forget" how to function properly.

We make sense of the world because we have "object permanence", which requires the concept of larger things made up of components, with multiple possible configurations. This object identity is what allows us to understand that we can do things like "type on a keyboard" (a keypress that changes the physical configuration of the keyboard doesn't destroy it; it's still a keyboard, just slightly different, but functionally equivalent to the one you had a moment ago).

If all you can do is transitions, you don't know upon transitioning if you still have a keyboard you are typing on or not. This also means that if you want to be able to function somehow in this world, then after each transition, you'd have to reassess all properties of all interesting aspects of the world just to make sure they are still there.

Only in a very restricted number of cases can you use an append-only log as a useful model. And you'd be bending over backwards with this approach when modeling trivial things, like a realtor business or an online book store.


> If all you can do is transitions, you don't know upon transitioning if you still have a keyboard you are typing on or not. This also means that if you want to be able to function somehow in this world, then after each transition, you'd have to reassess all properties of all interesting aspects of the world just to make sure they are still there.

You're assuming that at every timestep all transitions are equally valid or equally likely. That's not a given, you can and of course should carefully model which transitions your system allows or not. In real life this model is Schrödinger's equation, as best as we know it. In your information system it can be whatever you design it to be.


If you want to build a stateless system with transitions, you are limited to regular expressions (finite automata).

I don't think you'd like to write real-life applications using a regular language. It might be interesting as an experiment, but life without recursion or potentially infinite loops sounds bleak... Just imagine the struggle of creating some generic containers like trees / hash-tables.

I don't think it's impossible, and may be even interesting to see how far one can take such a language to make it useful, but enjoyable this is not...


The "world" can just as easily be conceived of as being composed of discrete immutable facts rather than global mutable state. Either way, I kind of don't think that unfalsifiable ontological premises should be an important factor in system design.

While the world can tentatively be conceived of like that, it is not nearly "just as easy". Lots of those "immutable facts" can't realistically be recovered: e.g. good luck recovering "this blackboard had a drawing of a cat and a bird until five minutes ago, when it was cleared with a wet sponge" if you weren't there in time. The approach with mutable state forces this destructible nature of many things upon you and makes you cope with it, somehow.

Ok, but why should I even care about what was on this hypothetical blackboard? Do you have any real world examples with business or technical significance where this "cope with it, somehow" approach to mutability is a clear win?

https://materialize.com/ provides another approach, based on "timely dataflow" (https://timelydataflow.github.io/timely-dataflow/) - originated at MS.

Does anyone have some resources where a real, practical example is implemented? Because I can only find fairly theoretical resources, not real-world examples.

Say, like how some simple CMS would work with a datastore like this. What does the event to update the headline of an article look like? How are integrity constraints enforced, e.g. an article can't reference an author that doesn't exist? Things like that.


(2015)



This is the technique that developers used to build Aurora, not Aurora the end-product.

Customers writing code against Aurora are still doing plain ol' destructive CRUD mutations "now".

Event-sourcing is write-ahead-logging is CQRS is journaling-file-systems is Git-reflog is persistent-data-structures is copy-on-write. It's all good stuff and is decades old.


> This is the technique that developers used to build Aurora, not Aurora the end-product.

It's what I meant. But I am behaving like an LLM and economizing on tokens... :-)


Isn't this essentially how a modern transactional database works anyway? All mutations end up in the Write Ahead Log (WAL) and you can replicate or back up that to be able to recover the state at a point in time?

That's how the database works internally; that's not how the interface to the database works.

Apps don't know about the WAL, and the database has internal logic for processing it: things like transactions, rollback, etc.


Two points:

* Technically the data is probably there, but I really don't think you want back-up ops invoked by your REST call to /getUserHistory/. Is it even possible to mix old data and new data within the same SQL expression?

* The DB is still a god object at the centre of your system. It doesn't give you consistency across partner systems and end users. If a partner sends the event CustomerBanned(2025-02-04, 1234) and you try to translate it into CRUD with 'UPDATE Customer SET Banned=True WHERE id=1234' it could fail (or worse - be rejected by an invariant for "data integrity" reasons) and then it's gone. If you just blindly write the event, then you always know that fact about customer 1234 in any future query.


One difficult thing to replicate is visibility rules and rollbacks: with postgres you can abort and your changes are hidden. No such luxuries with this architecture unless you make it very complex with partial states, drafts, or something similar.

Isn't this what you get with Datomic?

> So why do we tolerate databases as they are?

Because they reflect the way we understand the world? We understand that things are made of smaller things, and that sometimes the smaller things making up larger things may change while the larger thing stays the same. The idea that as soon as one component changes the whole thing needs to be discarded and rebuilt from the ground up is insane and creates a lot of problems. It's absolutely not worth it to try to redefine the way we deal with the world just to get the benefits of stateless code.

Making the database stateless makes it worthless. The world has state, and if you want a useful program, it needs to accept this "unfortunate" aspect of the world. The alternative is a world where, as soon as you finish drinking your coffee, your cup, your table, your kitchen, your credit card history, your grandparents, and all the planets in the solar system disappear and have to be built fresh. But you wouldn't know about it, because your memory of how the world used to be would disappear too.


Datomic

My first thought as well. And then I realized that the talk is from 2015, so I wonder if there was cross-pollination between people at the time (Datomic was released in 2012 according to Wikipedia, so it's plausible it predates Martin's ideas, but I don't know).

https://news.ycombinator.com/item?id=20937215


The concept is even older than that.

Yeah, like 15th century old.

Awesome, very useful

Stopped reading at the word "Kafka".

fair


