Event Sourcing is Hard (chriskiehl.com)
485 points by goostavos 12 days ago | 157 comments





I agree with the author completely. I worked on a fairly large system using event sourcing, it was a never-ending nightmare. Maybe with better tooling someday it will be usable, but not now.

Events are pretty much a database commit log. This is extremely space inefficient to keep around, and not nearly as useful as you might think.

Re-runs need to happen pretty often as you change how events are handled. Even in our local CI environment, it eventually took DAYS to re-run the events. It was clear that the system would never survive production use for this reason alone.

Decentralizing your data storage is a bad idea. We ended up not only with a stupidly huge event log, but multiple copies of data floating around at each service. Not fun to deal with changes to common objects like User. Sometimes you would have to update the "projection" in 5-10 different projects.

In practice ES amounted to building our own (terrible) database on top of a commit log that's actually a message queue. Worst technology I've worked with in years; eventually the whole project collapsed under its own weight.

Some of these problems are fixable in theory. Perhaps a framework to manage projection updates, something to prune old events by taking snapshots, a DB migration style tool to "fixup" mistakes that are otherwise immortalized in the event stream. But right now, seriously stay away :)


I agree that it's hard, however it's doable and pays off if you know what you're doing. I worked on 3 successful implementations for the finance sector and we could replay a few million messages per second. Have a look at how we achieved that at LMAX: https://martinfowler.com/articles/lmax.html

Sorry to say it, but clearly you must have been doing something wrong or employing event sourcing where it does not belong.


> LMAX's in-memory structures are persistent across input events, so if there is an error it's important to not leave that memory in an inconsistent state. However there's no automated rollback facility. As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state. They have found that testing is a key tool in flushing out these kinds of problems before going into production.

I'm sorry, but this is saying "catch your bugs before they reach production", which just isn't feasible in non-critical software development (i.e., most software development). The important part that is left out here is: what happens when one such error slips in? How do you deal with it after the fact?

That being said, your system is impressive and I loved being able to read about it. Please keep up the good work, and especially keep sharing your findings! :)


I honestly can't imagine bug-free software, even in critical software development. Luckily for me, on the important apps I've worked on there was time to trace and repair the data when an issue arose... not like a system that handles 6 million orders per second.

Not quite! I don't think that's what they're getting at.

The idea is this: Say you have a record A with fields f1, f2, f3. When an event comes in you run a function F with steps s1, s2, s3, each of which may modify a field of record A.

Here's the issue: if s3 fails (due to "invalid input"), the modifications to A from s1 and s2 are incorrect and A is now corrupt.

There are a bunch of ways to handle this but the one described here is to avoid touching data that persists between requests until you're at a stage where nothing can fail anymore.
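A minimal sketch of that discipline in Python, using the hypothetical record and steps from the example above: all validation and computation happen before any persistent state is touched, so a failure leaves the record intact.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RecordA:
    f1: int
    f2: int
    f3: int

def handle_event(record: RecordA, event: dict) -> RecordA:
    # Phase 1: validate everything up front. Nothing has been mutated yet,
    # so a failure here leaves the persistent record untouched.
    if not all(k in event for k in ("d1", "d2", "d3")):
        raise ValueError("invalid input")

    # Phase 2: compute the new values (steps s1, s2, s3) on locals only.
    new_f1 = record.f1 + event["d1"]
    new_f2 = record.f2 + event["d2"]
    new_f3 = record.f3 + event["d3"]

    # Phase 3: apply atomically -- by this point nothing can fail.
    return replace(record, f1=new_f1, f2=new_f2, f3=new_f3)
```

A bad event raises before any field changes; the original record is never left half-updated.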


> until you're at a stage where nothing can fail anymore.

... and then there's a NullPointerException because you forgot to check something that could indeed fail (i.e.: you have a bug).

In other words: they advise you to not have bugs in that part of the codebase, which was precisely my objection.


Absolutely. But doing things in this style protects you from large classes of the especially hard-to-reproduce bugs. Nothing's perfect, but it helps a lot!

I'd never heard it articulated before but I personally discovered this style over the years as well.


Hmm. You are both right?

If LMAX fits your problem, and you're fine with the distribution aspects, and you can adopt a homogeneous architecture etc, then it works well.

But the way Event Sourcing is normally sold and implemented is: you emit events from some components, written in some mix of languages and half of them probably already legacy, into some 'event bus' thing that you adopt and that is probably written internally in some other paradigm, and then you consume these events in several different programs, all implemented in other languages by other teams at other times, and you probably do all this across the Atlantic!

I live in a gazillion-events-per-second world, perhaps not as hectic as finance, but sadly bigger events and sadly globally distributed, and sadly running on amazon (which will be orders of magnitude slower overall than if you have dedicated hardware once you're at this scale because of, aha, aws's complete lack of 'mechanical sympathy' ;) ). It sucks. Oh I wish I had an LMAX-shaped problem.


That's EDA (Event Driven Architecture) not ES.

That article is a masterwork, thank you. It seems like having a "hub and spokes" architecture is key to getting event sourcing right. LMAX's Business Logic Processor is the only thing sourcing outputs, so you don't hit dependency hell. Also, your business process is literally a process, rather than living on the ledger.

It reminds me of well-designed Paxos control systems like Borg, where the journal serves as the ledger and all business logic lives in one scheduler thread.


>Re-runs need to happen pretty often as you change how events are handled.

Aren't you supposed to take Snapshots every now and then to solve that problem?


Migrating snapshots just moves the pain of changes: instead of letting the new projection logic replay all past events, you write code to migrate snapshots and hope that it will give the same result as the replay.

In those cases, I would rather do the replay, but do it offline before the release.


You are, but if you change how events are handled, or create a new event handler you probably need to replay all events.

Most of the time what people actually want is the audit log abilities of event sourcing. Seeing how data was in the past and what (or who) made changes to it. There are dozens of ways to accomplish this at different layers but unfortunately they go all in on event sourcing instead.
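One such lighter-weight approach, sketched below under the assumption that audit history is all you need: write the change and an audit record together, keeping ordinary mutable state as the source of truth (all names are hypothetical).

```python
import time

state = {}       # ordinary mutable current state: entity_id -> value
audit_log = []   # append-only history: (timestamp, user, entity_id, old, new)

def update(entity_id, new_value, user):
    # In a real system both writes would happen in one DB transaction.
    old_value = state.get(entity_id)
    state[entity_id] = new_value
    # The audit trail records who changed what and when, but it is NOT
    # the source of truth -- nothing ever has to be replayed from it.
    audit_log.append((time.time(), user, entity_id, old_value, new_value))
```

This gives you "how data looked in the past and who changed it" without rebuilding your whole system around an event log.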

You make a convincing case that it's tricky to implement.

There's also an (somewhat obvious) case to be made that it isn't for every use case.

Doesn't mean it can't be useful sometimes, or maybe, god forbid, often.


The replays are done on the processor, right? Isn't it supposed to be fast?

Event Sourcing = everything that happens is an event. Save all the events and you can always get to the latest state, as well as what things look like at any time in the past. What an "event" is depends on your business domain and the granularity of processing. It's very common in enterprise apps with complex workflows (like payment processing or manufacturing). Good fit for functional programming techniques and makes business logic easy to reason about since everything is in reaction to some event, and usually emits another event.
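The core mechanic described above, deriving state by folding over the saved events, fits in a few lines. A sketch with hypothetical account events:

```python
from functools import reduce

def apply_event(balance: int, event: dict) -> int:
    # State is a pure function of the event history.
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance

events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

# The latest state is a left fold over the full log...
current = reduce(apply_event, events, 0)                  # 75
# ...and state "at any time in the past" is a fold over a prefix.
as_of_second_event = reduce(apply_event, events[:2], 0)   # 70
```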

With global ordering, you can have a single transaction ID that points to a snapshot of the entire system. The actual communication is usually handled by "service bus" style messaging like RabbitMQ that can support ordering, routing, acknowledgements, and retries. Kafka or an RDBMS can also be used, but requires frameworks on top.

This concept is used pretty much everywhere. Redux is event sourcing front-end state for React. Most database replication is event sourcing by using a write-ahead log as the stream of mutations to apply. All that being said, I completely agree with this article that it's the wrong solution for most cases and creates far more problems and limitations than it solves.


What about when event structures change? Now you’re having to push versions into your events and keep every version of your serialisation format.

Redux often does not have to keep track of versions, because the event stream is consistent for that session.


The most common approach is to add versions to the events. The good thing is that with event sourcing, the exact cutover and lifetimes of these schema versions can be known (and even recorded as events themselves).

Downstream apps and consumers that don't need to be compatible with the entire timeline can then migrate code over time and only deal with the latest version. You have to deal with schemas anytime you have distributed communications anyway, but event sourcing provides a framework for explicit communication and this is one area where it can make things easier.
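One common way to get that "consumers only deal with the latest version" property is an upcaster chain: each stored event is migrated forward, version by version, before any handler sees it. A sketch with a hypothetical v1-to-v2 schema change:

```python
def upcast_v1_to_v2(event: dict) -> dict:
    # Hypothetical migration: v2 split "name" into first/last name.
    first, _, last = event["name"].partition(" ")
    return {"version": 2, "type": event["type"],
            "first_name": first, "last_name": last}

# Map from version N to the function that lifts events to version N+1.
UPCASTERS = {1: upcast_v1_to_v2}

def upcast(event: dict) -> dict:
    # Apply migrations until the event is at the latest schema version,
    # so consumer code only ever handles the newest shape.
    while event["version"] in UPCASTERS:
        event = UPCASTERS[event["version"]](event)
    return event
```

The stored log stays immutable; only the in-flight representation is migrated.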


For me an issue, not usually made explicit, is that those benefits seem to require a pretty stable business and a stable application landscape.

Why? Because a changing business requires changes to the Events. Since a change in an Event requires all consumers to be updated, either immediately or at schema-version deprecation, the cost of change seems to increase faster than in an application landscape without event sourcing. At the same time, the cost of knowing who needs to be changed also grows, since "any" application can consume any Event.

A stable application landscape also seems to be required, because if the number of consumers of an Event grows quickly, the ability to update (and deprecate schemas) scales with the number of Event consumers, all of which require updates.


If your org is anything like mine, most things (data) are "additive" onto the existing structure. When you want to deprecate something, you can notify all the consumers like a third party would if you were going to change something significantly. But the latter happens much more rarely for us, though it tends to leave traces of technical debt...

> What about when event structures change? Now you’re having to push versions into your events and keep every version of your serialisation format.

Sure. Events are like filled-out forms, and there is a reason forms that are significant, modified over time, and where determining meaning from historical forms is an expected business need tend to be versioned (often by form publication date). If you ever need to reconstruct/audit history (whatever your data model), you need a log of what happened when that faithfully matches what was recorded at the time, and you need definitions of the semantics for how the aggregation of all that should map to a final state. Event sourcing is a pretty direct reification of that need, and, yes, versioning is part of what is needed to make that work.


> Now you’re having to push versions into your events

Most distributed systems already do this, especially if they use Thrift/Protobuf etc. It's par for the course.


The protocol version only handles the version of the structure. It does not change when you change the meaning of the field. For example "this is still foo-index, but from now on it's an index in system X, not system Y". (Yeah, bad example, you could change the field name here, but sometimes it's not that clear)

Typically you write an in-place migration for every version, yes. Or you snapshot your working-copy database and archive the event stream up to that point, and you play it back using the appropriate version of the processing code.

It kinda sucks and there isn't a great answer. But there are a lot of use cases where there are real benefits to having that log go back to the start.


There is no easy way (on any platform, as far as I'm aware) of discovering and updating all clients that depend on some data structure if you plan to change that structure in an incompatible way.

Better to "grow" your data structures than "change" them.


I have actually written a book on this.

It's not that hard to deal with. Dealing with it through an ad-hoc "we will come up with something as we go along" approach can be a bit of a pain, though. It is in fact fairly trivial to handle if some thought is put into it.


Any reference to the book you wrote?


> Save all the events and you can always get to the latest state

This isn't actually true, and good event sourcing guides will point out why. Event-based systems are naturally racy: on a rerun of a stream of events, the order in which the events are processed may change, and you might therefore get a different result than the first time.

For example, if you have 1 item remaining in inventory but two people attempt to purchase it at the same time, that is a natural race and only 1 of them can succeed. If you are operating at the sort of scale where you require this kind of horizontal scaling, you may be in the territory where these sorts of conflicts become common.


Event systems are not racy. An event is historical and immutable. The event source is a list of things that categorically happened. Commands can race each other, but the outcome (OrderPlaced vs OrderFailedBecauseItemOutOfStock) is definitive.

A common confusion is between commands (which can fail) and events (which have already happened). Replaying events should always result in the same state because they only reflect things that actually happened.

You can't replay commands because they have side effects.
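The command/event split being described can be shown in miniature: the command handler is where the race is decided, once; the events it emits are facts, and replaying them is deterministic. A sketch using the last-item-in-inventory example:

```python
def handle_purchase_command(stock: int, quantity: int) -> dict:
    # Commands can fail -- the outcome of the race is decided HERE, once.
    if quantity <= stock:
        return {"type": "OrderPlaced", "quantity": quantity}
    return {"type": "OrderFailedBecauseItemOutOfStock"}

def apply(stock: int, event: dict) -> int:
    # Events are facts; applying them never fails and has no side effects.
    if event["type"] == "OrderPlaced":
        return stock - event["quantity"]
    return stock

# Two commands race for the last item; only one OrderPlaced is recorded.
events = []
stock = 1
for requested in (1, 1):
    e = handle_purchase_command(stock, requested)
    events.append(e)
    stock = apply(stock, e)

# Replaying the log from scratch always lands on the same state.
replayed = 1
for e in events:
    replayed = apply(replayed, e)
```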


They kind of go hand in hand, so I find it a bit amusing to make the distinction and say that therefore event sourcing isn't racy. Fine, event sourcing by itself isn't, but issuing the events is racy if anything listening to the events issues commands.

I am quite confident every implementation of event sourcing is going to couple commands to the event stream unless it is merely making an audit log. In which case I guess it isn't really an architecture so much as an add on.

You will replay commands as part of testing and debugging and that is where the fun comes in with race conditions.


Preventing race conditions in message-queue based architectures is a neat challenge. Depending on your problem domain, there are some approaches that work really well. It is likely that these don't work for all cases though.

The approach I've seen work well is to divide your messages into categories, such that events that belong to the same category MUST be handled in order, while events in different categories can be handled in any order.

Say you have ordered updates to USERs. Say you have 10 event processors. Each processor handles events for 1/10th of the users in your system. You come up with a way of partitioning your users: user_id % 10, for example. For any given user instance, only ONE processor would process ALL the events for that user. They would be processed in order based on some ordering key (i.e. timestamp, event ID).

What is tricky is that this partitioning requires first-order attention. If it gets messed up, everything crumbles and you are in for long nights and working weekends. When it works, you have a hugely efficient event processing machine that can handle anything the pipe (message queue) can throw at it.
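The partitioning scheme above can be sketched in a few lines (the event shape and ordering key are hypothetical):

```python
from collections import defaultdict

NUM_PROCESSORS = 10

def processor_for(user_id: int) -> int:
    # All events for a given user land on the same processor, so per-user
    # ordering is preserved without needing global ordering.
    return user_id % NUM_PROCESSORS

def route(events):
    queues = defaultdict(list)
    # Sort by the ordering key (event_id here) before routing, so each
    # per-processor queue is in order for every user it owns.
    for e in sorted(events, key=lambda e: e["event_id"]):
        queues[processor_for(e["user_id"])].append(e)
    return queues
```

Note users 2 and 12 share a processor (both hash to 2), while ordering within that queue is still maintained.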


We dealt with a similar problem where I work.

At first we had ideas to do exactly what you described. We're using RabbitMQ, so partitioning is not built-in like with Kafka. We did some experiments with RabbitMQ's consistent hash exchange. Although the plugin does properly partition, we'd need to build our own stuff for rebalancing the partitions over workers as workers join and leave the pool, which happens often in our case because we sometimes process massive spikes and automatically scale up.

In the end we went with something more stupid. Each event gets a timestamp and we acquire a lock in Redis when we start processing the event. If the lock is already acquired, the event is rejected and put back into the queue with a delay. Before processing an event (and after acquiring the lock) we make sure that the timestamp of the event is newer than what we already processed. If it's older, we drop the event.

This approach trades some efficiency for correctness. Depending on your problem domain, this can work very well.
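The lock-plus-timestamp scheme described above, sketched with an in-memory stand-in for Redis (in production the lock would be something like Redis SET NX with an expiry; names here are hypothetical):

```python
locks = set()        # stands in for Redis locks (e.g. SET key NX)
last_processed = {}  # entity_id -> timestamp of the newest event applied

def try_process(event):
    """Returns 'processed', 'requeue' (lock busy), or 'dropped' (stale)."""
    key = event["entity_id"]
    if key in locks:
        # Lock already held: reject and put back on the queue with a delay.
        return "requeue"
    locks.add(key)
    try:
        if event["ts"] <= last_processed.get(key, float("-inf")):
            # Older than what we already applied: drop it.
            return "dropped"
        # ... real event handling would go here ...
        last_processed[key] = event["ts"]
        return "processed"
    finally:
        locks.discard(key)
```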


There should be 1 point in the system that says "user x has ordered item a". That then becomes an event. The commands from user x and user y to order item a might race to see which gets processed first, but once a command has been turned into an event it's immutable and without race conditions. You should never replay commands.

Event systems do not require deduplication and/or strong ordering on the ingress. You can also solve the problem on the consumption side and there are often reasons to do this.

Plus you should be more forgiving of examples. Let us strengthen the example by saying that the inventory system lied and told us there were 2 items when there was only one. So you issue 2 commands, but when the events arrive at the fully automated warehouse, one randomly receives a null pointer exception and auto-heals into a refund, but which one?


Here, you should have two order_issued events, one order_succeeded, and one order_failed (hopefully not with an actual NPE, that would be pretty bad). So much like an SSTable-based system, there are race conditions at the level of deciding what to write to the logs/event stream, but there should not be race conditions in replaying the event stream (the same customer would get a failure every time).

Never done event sourcing architecture, but I guess this kind of subtlety is what makes a system a nightmare or a breeze to work with.

Which makes me wonder if there is some kind of exercise book to practice "good architectural design", just like with algorithms.


Life has race conditions. You'll need these transactions somewhere in any kind of application to handle these situations.

Event-sourcing just provides more control and explicit ways to handle it since it's built around a stream of events. You can leave it to the messaging layer which can timestamp the messages, or use explicit partitions with sequences and strict ordering, or have consumers process messages with external atomic transactions in the database. What works depends on what you need.


There is only a request, no "customer X buys the last item" event unless the system proves that the item is actually available (and other conditions). An easy way is putting requests in a queue, to ensure one of the buying attempts comes after the other and sees the item unavailable.

First we need to agree on what "Event Sourcing" means. In my view you don't need to implement every single Event Sourcing pattern to have an "Event Sourced" system. Say you have a TODO list app (yes, pretty cliche, but that's ok). In that TODO list you have a <form> in which you post the state of the TODO list to the server. That state is stored in the database in the form of an event, "TODO_LIST_SAVED". When you want to "replay", just list all the events from the DB in chronological order while filtering to the ones the user has access to, pick the last one, then rebuild the HTML using that one event converted into the view model.

Kaboom, you have an event sourced system that doesn't even use a queue.
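The parent's minimal version really is this small. A sketch (an in-memory list stands in for the DB table; names are hypothetical):

```python
import json, time

event_store = []  # one DB table: (timestamp, user_id, type, payload)

def save_todo_list(user_id, todos):
    # Every save is appended as an event -- nothing is updated in place.
    event_store.append({
        "ts": time.time(),
        "user_id": user_id,
        "type": "TODO_LIST_SAVED",
        "payload": json.dumps(todos),
    })

def current_todo_list(user_id):
    # "Replay" is trivial: filter to the user's events, take the last one.
    mine = [e for e in event_store if e["user_id"] == user_id]
    return json.loads(mine[-1]["payload"]) if mine else []
```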

The problem is that people trash the idea after a bad experience with it, caused either by over-engineering, trying to apply all the solutions with tools or hand-made implementations instead of taking a lean approach, or by storing events for a type of business model that doesn't even require a database.

"Event Sourcing is Hard" is a statement as true as "web development is hard", or "distributed systems are hard", or "APIs are hard", or "eventual consistency is hard"... yet we build those things every day, doing the best we can. In fact, anything, any practice, any technique, any architecture can be hard, because software is hard. Even harder is not over-engineering something that can be very simple.

Simplicity is hard.


Kafka-oriented streaming folks talk about stream-table duality; the idea that one form can be expressed as the other. There is usually a little lip service paid to this idea before heavy hints are dropped that, actually, the stream is the true reality.

My own view is that there are dimensions for any data of interest, expressing some ability to show an evolution of it. Frequently that dimension is time, or can be mapped onto time.

But neither the stream nor the table is the truest representation. The truest representation is whatever representation makes sense for the problem. Sometimes, I want to clone a git repo. Sometimes I want to see a diff. Sometimes I want to query a table. Sometimes I want a change data capture stream. Sometimes I want to upload a file. Sometimes I want a websocket sending clicks. Sometimes you need a freight truck. Sometimes you need a conveyor belt. Sometimes a photo. Sometimes a movie.

Sometimes I talk about space vs time using the equations for acceleration, or for velocity, or for distance. These are all reachable from each other via the "duality" of calculus, but none of them is the One Truest Formula.

And so it is for data. The representation that makes the most sense for that domain under those constraints is the one that is "truest".


I believe this is where CQRS plays nice with “event sourcing”. You write all your events in one model, but can read them in multiple ones, that is if you can tolerate some read latency... and most systems are usually ok with that.

CQRS also alleviates a lot of the pain people experience around event sourcing and distributed systems with event evolution and manipulation. There are many cases where it's inappropriate for consuming services to be aware of the internal representation of events. Giving them a 'snapshot-centric' view of the data can be a simplification in both directions.

I agree with this a lot. I posted a comment elsewhere in this thread about our use of events, and it boils down to selectively picking the entities we need to be able to reason about past states of, and storing the new states of those, and then deriving views from that state. For many uses we don't even ever need to then explicitly apply state transformations on that to derive a materialized form of the present state - a suitable view is often sufficient. For some we do need to apply transformations into new tables, but we can do that selectively. We still always have the database as a single source of truth, as we're lucky not to need to scale writes beyond that, which simplifies things a lot.

What it gives us is the ability to rerun and regression-test all reporting at a point in time for those data sources we model as events, and the ability to re-test all the code that does transformations on that inbound data, because we don't throw it away.

"Our" form of event sourcing is very different from the "cool" form: We don't re-model most internal data changes as events. We only selectively apply it to certain critical changes. A user changing profile data is not critical to us. A partner giving us data we can't recreate without going back to them and telling them a bug messed up our data is. For data that is critical like that, being able to go back and re-create any transformations from the original canonical event is fantastic.

And as long as there is an immutable key for the entity, rather than just for the entity at time t(n), we can reference from non-evented parts of the system to either entity at time t(n) or entity at time t(now()) trivially, depending on need.


The great thing about HN is that it consistently shoves in my face how many seemingly common dev tools, frameworks, etc. there are that I've never heard of.

Event sourcing isn't anything I've ever heard of, let alone something for which broad marketing promises need debunking.

How common is this framework/architecture/product? Google isn't helping me determine how widely used it is.


It's become a fairly widely known concept in data engineering circles, expounded upon in Martin Kleppmann's Designing Data Intensive Applications book. (Buy this book if you want to get up to speed on modern ideas around distributed systems and data architecture.)

This became popular as people were trying to figure out how to use Kafka as a persisted log store that could be "replayed" into various other databases. This meant that you could potentially stream all the deltas (well, more accurately the operations that create the deltas, e.g. insert, update, delete) in your data -- through a mechanism called Change-Data-Capture (CDC) [1] -- into a single platform (Kafka) and consistently replicate that data into SQL databases, NoSQL databases, object stores, etc. Because these are deltas, this lets you reconstruct your data at any point in history on any kind of back end database or storage (it’s database agnostic).

Event sourcing to my understanding is a term used among DDD practitioners and Martin Fowler disciples but with a different nuance. This article explains what it is:

http://cqrs.wikidot.com/doc:event-sourcing

[1] Debezium is an open-source CDC tool for common open-source databases. Side note: A valid (but potentially expensive) way of implementing CDC is by defining database triggers in your SQL database.


Event Sourcing is not a framework, but a concept.

The idea is to store not the current state of your app, but the transitions (events) that derive into the current state.

Think about how git stores your source code as a series of commits.

In theory it is a beautiful idea; in the real world, it is hard to implement.


In fact, git stores a full snapshot of your entire repo with every commit. It does not store diffs from the previous commit. When you do "git show <COMMIT_SHA>" it generates a diff from the parent commit on the fly.

There's a huge optimization though: it uses a content-addressed blob store, where everything is referenced by the sha1 of its contents. So if a file's contents is exactly the same between two commits, it ends up using the same blob. They don't have to be sequential commits, or it could even be two different file paths in the same commit. Git doesn't care - it's a "dumb content tracker". If one character of a file is different, git stores a whole separate copy of the file for it. But every once in a while it packs all blobs into a single file, and compresses the whole thing at once, and the compression can take advantage of blobs which are very similar.
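The content-addressing idea fits in a few lines: the key is the hash of the content, so identical content dedupes automatically. (A sketch; real git hashes a "blob <len>\0" header along with the content, which is skipped here for brevity.)

```python
import hashlib

blobs = {}  # sha1 hex digest -> content

def put_blob(content: bytes) -> str:
    # Content-addressed: the key IS the hash of the content, so storing the
    # same bytes twice (from any commit or any path) reuses one blob.
    sha = hashlib.sha1(content).hexdigest()
    blobs[sha] = content
    return sha
```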


Git's bi-modal nature is a wonderful representation of a sanely architected Event Sourced system. When needed it can create a delta-by-delta view of the world for processing, but for most things, most of the time, it shows a file-centric view of the world.

IMO a well-factored event sourced system isn't going to feel 'event sourced' for most integrated components, APIs, and services because it's working predominantly with snapshots and materialized views. For complex process logic, or post-facto analysis, the event core keeps all the changes available for processing.

Done right it should feel like a massive win/win. Done wrong and it's going to feel hard in all the wrong places :)


Yeah... My comment was an oversimplification, but you get the point. :)

With event sourcing there's also the concept of snapshotting, btw.


Also directory structures are content-addressed, so if a commit changes nothing in a given directory, the commit data won't duplicate the listing of that directory.

So I've just found out I've worked on a 28-year-old event-sourced code base. In Clipper. An old loan management application, which had to track how installments changed over their lifetimes.

It worked well.


Basic event sourcing is quite simple to implement. All the bells and whistles people sell alongside event sourcing are hard - whether you do event sourcing or not.

Your typical application presents a user interface based on data in a set of database tables (or equivalent), the user takes some action, the database tables get updated.

The equivalent event-sourced application presents a user interface based on data in a set of database tables (or equivalent), the user takes some action, the outcome of that action is written to one table, the other database tables get updated.

For git, the "or equivalent" is the working copy. You could easily imagine a source code management system similar to git but without stored history: every commit and pull is a merge resulting in only the working copy, and every push replaces the remote working copy with your working copy.

But man, wouldn't it suck to be limited to only understanding the most recent state of your source code…
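The "outcome written to one table, other tables updated from it" difference described above, in miniature (in-memory structures stand in for the tables; names are hypothetical):

```python
events_table = []  # append-only table of outcomes -- the only thing "extra"
users_table = {}   # ordinary read table, updated as in any typical app

def rename_user(user_id, new_name):
    # 1. The outcome of the user's action is appended to one table...
    events_table.append({"type": "UserRenamed",
                         "user_id": user_id,
                         "name": new_name})
    # 2. ...and the other tables get updated, exactly as they would anyway.
    users_table[user_id] = new_name
```

The read path is unchanged; the append-only table is what preserves history.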


You need to know beforehand how exactly you’re going to pull out the data, how it’s going to be indexed, updated, etc. You can get nice benefits out of it especially if you’re doing transaction states, but your query times are going to suffer unless you’re caching the end-state. This can make “simple” things take a lot of effort. It’s a really significant departure in terms of effort to deal with your data.

Here's a talk that largely introduced the concept to me, Turning the Database Inside Out: https://www.confluent.io/blog/turning-the-database-inside-ou...

Can second this - if asked to recommend one link/article I'd choose this one too, an excellent intro

It seems to be one of those things that you probably don't need. And when you do need it, it becomes obvious that you need it. Specifically, in my research, you really shouldn't use it until it becomes painfully necessary to horizontally scale writes.

Parts of it can be useful, but you don't need to split out an event bus to get auditing for example. As you say, you can avoid that until/unless you need to scale writes.

In the meantime, you can look for inbound data that naturally correspond to immutable events and apply some of the ideas to that. E.g. that form a user submits? It's reasonably an immutable event. Many of them won't matter to you, because you'll never care to audit it. But some might.

E.g. we have projections of financials being submitted by third parties. Being able to go back and audit how original form submissions relate to changes in other system state is useful, or just being able to re-run old reports after fixing bugs and confirming that the reports show what they should before/after certain events. So instead of just storing the end state, we're increasingly looking to store the original external signals that triggered those changes, and build transformations as views over that event log, and then where we need it only drive transformations to tables we don't event the same way, often with a suitable reference to the source event(s).

It avoids the problems in the article for the most part (some, such as changes in the structure of the events will always be an issue), but gets enough of the benefits to be worth it, because it's only applied to data we have that it genuinely fits (where we have clear, natural event sources, often but not always external submissions of data) where we have a need (whether for complexity reasons or because of external auditing requirements) to be able to get past views of data.


I think you also need it when scaling reads becomes painful.

Reads that have different patterns, specifically, the kinds of patterns that can't be indexed easily because they need denormalization to generate all the indexed expressions. Or you need to read a time series, a snapshot at a point in time, or the latest version of the data, all from different places under different loads - analytic, machine learning, transactional.

One user needs to read across all the data over all time; another user wants super-fast scrollable access to user-customized sorts of a subset of the latest data. The user-configurability of the sort is what defeats the kinds of indexing you get in a traditional RDBMS. The obvious way to get this is a lambda architecture: have an immutable append-only system of record which contains all the data, and build the other views out of it. It's a small step from there to event sourcing.
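The "views built out of an immutable system of record" idea can be sketched as two different folds over the same append-only log (hypothetical event shape):

```python
log = [  # immutable, append-only system of record
    {"ts": 1, "user": "a", "score": 10},
    {"ts": 2, "user": "b", "score": 7},
    {"ts": 3, "user": "a", "score": 12},
]

def time_series(user):
    # View 1: full history per user, for the analytic/ML reader.
    return [(e["ts"], e["score"]) for e in log if e["user"] == user]

def latest():
    # View 2: latest value per user, for fast transactional reads.
    # Each view is just a different fold over the same log.
    view = {}
    for e in log:
        view[e["user"]] = e["score"]
    return view
```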


Things like Redux or Bitcoin are more or less event sourced systems. It's basically the idea of deriving your application state from a series of business events and storing those as a single source of truth. It's very appealing in theory but as the article explains, it's a bit more complicated in practice (e.g. dealing with consistency).

So somewhat akin to the Command pattern?

Somewhat but with an important difference: commands may fail. Events may not, since they are facts.
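A toy sketch of that distinction (names are hypothetical): the command handler validates and may reject, while the event applier records a fact unconditionally.

```python
class InsufficientFunds(Exception):
    pass

def handle_withdraw(balance, amount):
    """Command handler: a request that may be rejected."""
    if amount > balance:
        raise InsufficientFunds()
    return {"type": "withdrew", "amount": amount}  # accepted -> emit an event

def apply_withdrew(balance, event):
    """Event applier: a recorded fact, applied without failing."""
    return balance - event["amount"]

balance = 100
event = handle_withdraw(balance, 30)   # command accepted
balance = apply_withdrew(balance, event)
try:
    handle_withdraw(balance, 1000)     # command rejected: no event recorded
except InsufficientFunds:
    pass
```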

Although in 2019 one has to wonder.


It is more widespread in the enterprise software development space and particularly in the Microsoft/.NET ecosystem.

Greg Young was wondering in one of his talks about Event Sourcing why people try to connect CQRS/Event Sourcing with .NET

Eventsourcing is really a loose set of concepts, and each application of it will look very different. That's also why there are almost no useful frameworks here.

There are some good talks on Youtube about it.

The concept is used all over though: Redux (js) is basically a lightweight form of eventsourcing ( with redux-saga being the "process managers" mentioned in the post).


It is used in finance systems a lot.

Event-based systems that run on message queues are the backbone of money. These often co-exist with batch systems that process huge files (which also can be thought of as a journal of events to be processed in a batch).


Events are the cornerstone of product analytics. If you want to understand what your users are doing on your platform, and to look for opportunities to improve the user experience, events are a big part of that.

Events in that context are totally different. Sure there's overlap, but that's mostly a coincidence.

I've been working with user event funnels for years, with tools like mixpanel and others but it seems like this is a build state tool for development workflow.

That’s a good point! I’ve typically piggy-backed off these types of systems to do that kind of analysis. But you’re right it is a distinct use case.

The basic idea of event sourcing is to store the actions rather than the end state of the actions. This allows actions to interleave, with systems calculating the end state from all of the actions.

Think of ATMs. They don't update the balance of your bank account directly. They just record a debit against it and then the sum of all of your credits and debits is your balance. This avoids it having to have some kind of lock on your account during the transition and even allows significant delays from various transaction sources.
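That ATM example, sketched minimally (illustrative only, for a single hypothetical account): the balance is always derived, never stored directly.

```python
ledger = []  # append-only list of events for one account

def record(amount):
    # positive = credit, negative = debit
    ledger.append(amount)

def balance():
    # the balance is the fold (sum) over all credits and debits
    return sum(ledger)

record(100)   # deposit
record(-30)   # ATM withdrawal: just records a debit, no lock on a balance row
record(-20)
print(balance())  # 50
```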


In my view, most of the problems people run into with event sourcing boil down to overdoing it: modelling large parts of their systems as sets of events instead of maintaining an event log just for those things that fall out naturally as discrete events from the rest of your design.

Having certain critical events, and especially complex ones, logged as events that can be replayed and reconstructed and model transformations as operations over the state of those logs can be invaluable. Building the system around modelling all changes as events is amazingly painful.


The first thing ATMs (at least over here) do is check if you have enough balance for your transaction. If you just stored events, you'd need to collapse all the previous delta operations at that point, which would be arbitrarily slow if you never persisted data.

Storing event records vs materializing changes is sort of an old hat trick in databases. People did that in the early 90s for better TPC-C scores. It has its uses, and in some contexts storing deltas can have huge advantages (e.g. bigtable with a log structured file system), but it's no silver bullet.
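The snapshot trick that answers the balance-check objection is also old hat: periodically materialize the fold so a read only replays the tail, not the whole history. A toy sketch with invented names:

```python
events = list(range(1, 11))  # stand-in deltas: amounts 1..10

SNAPSHOT_EVERY = 5
snapshot_value = 0   # materialized fold of the events already snapshotted
snapshot_upto = 0    # how many events are folded into the snapshot

def maybe_snapshot():
    global snapshot_value, snapshot_upto
    if len(events) - snapshot_upto >= SNAPSHOT_EVERY:
        snapshot_value += sum(events[snapshot_upto:])
        snapshot_upto = len(events)

def current_value():
    # A read replays only the tail after the snapshot,
    # so it stays fast no matter how long the history grows.
    return snapshot_value + sum(events[snapshot_upto:])

maybe_snapshot()
```

An ATM balance check then reads the snapshot plus a handful of recent deltas instead of collapsing the entire log.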


Databases that try to only reflect the entire truth as-is are like everyone working from a shared whiteboard.

Databases set up to event source, i.e. reflect the truth as it happened at each step, are like everyone working from a shared spreadsheet with change tracking.

It's an entire extra dimension that makes reconciling the problems of conflicting actions in disparate systems possible.


The past few years there's probably one front-page article a week on Event Sourcing / CQRS.

For a fun introduction to this concept and the motivations behind it, you may watch Greg Young's talk https://youtu.be/JHGkaShoyNs

Stay away from it.

Datomic is, at its core, an event sourced datastore, and it works really well. I don’t think event sourcing is something that should be solved in application space — you wouldn’t write a database from scratch for your products yet implementing an in-house event sourcing system is for some reason more acceptable.

All of these stories of failure are stories of teams bandaiding event sourcing on top of other databases not optimized for the usecase.


I was thinking about Datomic when I read this, and yet came to exactly the opposite conclusion :)

Datomic may be something like 'event sourcing' internally, but it works so well because it abstracts and hides the naked overreach that the article pounds against.

Datomic users think of it as a DB, not a stream of events. So I don't think that Datomic users are following the 'event sourcing' model.


Fair point... but, events in the style of "User address changed" are very repetitive, wasteful, fragile etc.

So stopping seeing user-address-change as (primarily) an event is a win. One always can fall back later to perceiving user-address-change as an event.

For "real" events (e.g. "Transaction fraud detected") possibly I wouldn't use Datomic at all even if it was my primary DB. There are more suitable pub-sub systems.

And that results in a neat separation of concerns: data changes in one store, actual business-valuable events in the other.


Datomic: Event Sourcing without the hassle https://vvvvalvalval.github.io/posts/2018-11-12-datomic-even...

I'm using Datomic in a production environment and the ergonomics of development so far have been extremely nice. Would definitely recommend.

Good summary of the drawbacks of ES.

I think one thing can not be repeated often enough:

Eventsourcing can be an incredibly valuable approach, but almost always only for a very SMALL SUBSET of your system.

Most of the problems with ES materialize from trying to build your whole architecture around it.

I think many developers fall into this trap because theoretically the concept sounds so appealing and elegant (immutable, reproducible, modular, ...)


Event sourcing is REALLY hard to figure out how to do “right.” A lot of getting it right is modeling knowledge/experience, understanding your domain.

That said, you can succeed at building your entire arch around it and once you do, it’s glorious. Kafka Streams makes the technical aspects easy once you figure out how to model correctly.


Out of curiosity, how do you deal with consistency guarantees across aggregates? (which is much more relevant when your whole architecture is ES)

I realize this is highly domain dependent. Some will be much less affected than others. But it's another drawback not mentioned, because now you start to need sagas/managers that coordinate across services with commit/rollback patterns , conflict resolution., etc.


If you took event sourcing out of the picture, how would you solve it?

This is a problem to do with distributed transactions, not event sourcing. If you don't need distributed transactions, don't use them. If you do need distributed transactions, redesign your system so you don't need distributed transactions. If you _still_ need distributed transactions, use one of the available mechanisms for making them occasionally work, like three-phase commit, or clocking, or blind luck, all of which work just as well with events as without.

If you'd be perfectly happy using a single database server doing a more typical ORM solution with tons of database locking without event sourcing, you can use a single database server with tons of database locking with event sourcing: just serialise all (or maybe most) events.

My domain has a lot of unpredictable coupling between events, such that there's not really a sensible aggregate model to divide up the information model (remember: an aggregate is the unit level of consistency in your system). But it's also low update volume, so we quite happily just serialise all events - which is what we did before event sourcing, using table locks.


As zenpsycho said, all you get is eventual consistency across aggregates, if you’re talking about projection aggregates. As you say, for domain aggregates you can wire up transactions by writing your own 2PC on top of Kafka exactly once semantics.

I would recommend only to people who are really committed to the idea or know what they are doing and modeling. It took me a long time because I went from zero knowledge of Java/DDD. First I had to learn Java, then Kafka, then KS, but still I was lost. Learning the most basic DDD was enough for me and did the trick though. Writing a 2PC to coordinate across aggregates wasn’t pleasant but also wasn’t hard with KS. The hard part was learning all the other stuff and the modeling.

I think a well done ES framework based on Kafka streams would maybe be the first ES framework that would have a chance. The primitives in KS seem just right.


DDD in this context = Domain Driven Design?

si

The architecture obligates you to the "eventually consistent" model. If that's not good enough, you should probably look at other approaches.

Disclaimer: I'm just an avid reader, not an experienced implementor.


I think calling it an architecture is a bit strong. Isn't it more like a pattern that you can use for some of the connective lines in an architecture? Reading through this, it seems a lot of the pitfalls come from neglecting the architecture whilst implementing event sourcing.

Event sourcing is an "architecture" because it can possibly be used for a part of a system, but only all-in (otherwise, you lose events you should preserve and all integrity benefits).

I just finished spending the last 2.5 years replacing a somewhat complex legacy system. It's a lot of Spring Integration, JMS queues in between (so, technically, 'events'), and a traditional relational DB. The system was deployed 6 months after conception, with piecemeal migration and feature addition supporting a full decommission of the legacy system.

It runs very well and the business is happy. However, I feel I need to convert it to Event Sourcing/CQRS not because for any technology constraint, or business requirement...but I feel if I don't have buzz words on my resume, I won't be marketable.

Ideally, systems are created to meet current and future business needs within time and budget constraints. In reality, I need to also maintain marketability. JMS queues and 2-phase commits... old school!


> It's a lot of Spring Integration, JMS Queues in between (so technically, 'events') and a traditional relational DB.

Ah, i wonder if you're working on the project i worked on at my previous job ...

> It runs very well and the business is happy.

Apparently not.


Good one. Buzzword bingo is a real thing that afflicts our industry.

Interesting - sounds to me like you were reinventing a database. PostgreSQL has taken over 30 years, has 400 contributors and 1.1 million LOC; no surprise to me it was a bit tricky!

PG has a rock-solid ACID-enabling transaction log - the event source - and you can access this easily via logical decoding functions. You can easily replicate this data into tables if you need to keep it, and you can also add system and application timestamps to get a bi-temporal DB, enabling queries that go back in time.

You mentioned impedance mismatch with the GUI; relational databases are famous for this same problem, and fantastic tools like PostgREST and, recently, GraphQL integrations have already been built to address it.

Just wanted to say that I don't like your comment. Too shallow, wrong direction - something along those lines.

Event sourcing is still the best paradigm for exposing valuable service data to an undefined number of services downstream, especially when there’s enough data to make querying painful for some use cases. They are like a NoSQL buffer between services and databases. Not always necessary but sometimes useful.

Event sourcing brings its own complexities (eg Kafka clients) but it's still better than having one huge shared database or even RabbitMQ fanout exchanges. In the best cases, developer experience of Kafka can be very good and comfortable. You can write short python scripts that ingest data and dump to other databases and services, it's very nice. In some cases you want services writing directly to databases but sometimes you don't.


This has nothing to do with event sourcing. What you're describing is event-driven integration. Rarely should the events used to source the domain model be the same as the ones used for the messaging integration. Unfortunately, due to the repetition of the term "event" the purpose gets confused most of the time. The "Data on the Outside versus Data on the Inside"[0] paper makes a good distinction even though it doesn't use the same terms.

[0]: http://cidrdb.org/cidr2005/papers/P12.pdf


This is a good point (about how messaging integration requests differ). I think the choice comes down to whether at least once or exactly once makes more sense for a specific unit of shared data / functionality. For exactly once, APIs and RabbitMQ shine for connecting services.

I’m still wrapping my head around when to do what in SOA. Thanks for the link btw.


Btw why is it bad to use same events for messaging / data integration?

Never heard of event sourcing before, but I've used this pattern myself (as another comment mentioned, when you need it, it's a fairly obvious way to do it). Looking it up, I am guessing it's from Martin Fowler's book about patterns and architecture design? And on a related note, would you recommend the book to the average developer?

This is question more than a comment, as I have only casual knowledge of event sourcing...

"""

You wouldn't let two separate services reach directly into each other's data storage when not event sourcing – you'd pump them through a layer of abstraction to avoid breaking every consumer of your service when it needs to change its data

"""

Isn't the event itself precisely that layer of abstraction? That is, you're not publishing the details of your data store. You're publishing an event which is a thin slice or crafted combination of details that ultimately reside in that store, but which you are hiding...

Am I misunderstanding the quote?


> Am I misunderstanding the quote?

I don't think you are misunderstanding the quote, I think you are misunderstanding the nature of the problem.

If you tip your head sideways, you may notice that the persisted representation of your model is "just" a message, from the past to the future. It might describe a sequence of patches, or it might be a snapshot of rows/columns/relations. But it is still a message.

The trick that makes managing changes to this message schema easy is that you own the schema, the sender, and the receiver. So coordinating changes is "easy" -- you just need to migrate all of the information that you have into its new representation.

If the schema is stable, the risk of coupling additional consumers to the schema is relatively small. Think HTTP -- we've been pushing out new clients and servers for years, but they are still interoperable, because the schema has only changed in quite safe ways.

But if the schema _isn't_ stable, then all bets are off.

Because of concerns of scale/speed, we normally can't lock all of our information at once. Instead, we carve up little islands of information that can be locked individually. The schema that we use are often implicitly coupled to our arrangement of these islands, which means that if we need to change the boundaries later, we often need to change schema, and that ripples.

And all of this is happening in an environment where businesses expect to change, and there is competitive advantage in being able to change quickly. So it turns out to be really important that we can easily understand how many modules are going to need to be modified to respond to the needs of the business, and to ensure as often as possible that the sizes of the changes to be made are commensurate with the benefits we hope to accrue.


Stupid non-web guy here. I don't understand why you can't just start publishing events, and if you need to change the event contents just create a new version of the event and support the legacy clients until you can get around to rewriting them. If you can't control the clients and need to support the legacy events indefinitely... well, you probably would have had that same problem no matter what you did, right?

This article seems like another instance of criticizing an oversimplified example of something.


> You're publishing an event which is a thin slice or crafted combination of details that ultimately reside in that store

The event stream is the canonical store.


I was wondering this too. Seems like services publish to an event stream which other services then read from.

I've worked on a couple of ES-based trading systems (matching engines and algo trading both) and it was definitely the way to go, especially on the equity side where you can restart or snapshot the system every day. In my startup, I've designed the core transactional system and it's serving us well for 2 years now. It comes with its own challenges and you need more senior devs than your average CRUD system, hence you use it only for parts which are better specified and less subject to change. As everything, it's a trade-off.

I think this is the kind of pragmatism that is needed when talking about event sourcing. It is a _very_ advanced architectural pattern. There is no easy 'Event Sourcing made simple' way to use it.

I think one of the reasons for this is that the systems people often describe when they talk event sourcing are actually _three_ different, but interrelated, architectural patterns:

- event sourcing (build your models based on immutable facts)
- event driven (side effects triggered by messages, often delivered by queues)
- workflow / state machine

It takes a long time to get these concepts straight. In our case once it was untangled, our framework can quite clearly demonstrate how they relate: messages update the state in a workflow engine, this triggers side effects, the results are captured as facts and used to build the model, repeat.

This worked for our use case, in which our transactions have a strong workflow, it may not work in other cases.

Finally the one point I'd certainly reinforce from the article is: _don't reach directly into the event stream_. This causes huge amounts of coupling. Instead, we ended up using bounded contexts to define our systems, and then treat key events as our API. It sounds counter to some of the ideals of event sourcing, but it is absolutely needed once you grow past that toy phase.


Very good points. Clearly defined bounded contexts are the way to go, as is having a clear distinction between internally and externally visible domain events. Other tricky points to get right are how to distribute events and their ordering relationships (as in stream ordered, totally ordered, entity ordered) based on the scalability and "correctness" requirements. For instance, in algo trading it's not uncommon to have totally ordered events throughout the system.

At Qbix, early on we made the decision to do replication via logs of events, which we call "messages". This was at a time when Parse and Firebase were all the rage and collaboration was done by things like Operational Transformations and various diffing.

We figured that a social activity would be best represented as a linearly ordered set of messages, interpreted by various actors (and front end components) depending on the type of activity.

The thing is, you usually don’t want global ordering of events across all activities. That introduces problems that Google solves with Spanner and CockroachDB works hard to solve. Global Byzantine Fault Tolerant Consensus is even harder to achieve with any scalability, as can be plainly seen from Bitcoin and Ethereum. You can do it if you trust most of the nodes (like Ripple) but otherwise it’s infeasible.

But it’s also overkill. You need vector clocks locally for activities only. Like a chess game or a chat. You don’t NEED to know which message came first across chats, or unrelated transactions across countries. It’s a bit like quantum entanglement — only if actors activities and transactions start being entangled do you need to start caring about conflicts. And it’s a bit like relativity — if events happen far enough apart then it doesn’t matter which happened “first”, it will depend on the location of the observer.
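The per-activity ordering idea above is what vector clocks give you. A toy sketch (names invented): each activity tracks causality independently, and clocks from unrelated activities simply come out concurrent, which is fine because no order is needed between them.

```python
def bump(clock, actor):
    """Return a new vector clock with the actor's entry incremented."""
    out = dict(clock)
    out[actor] = out.get(actor, 0) + 1
    return out

def happened_before(a, b):
    """True if clock a is causally before clock b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

c1 = bump({}, "alice")        # alice sends a message in a chat
c2 = bump(c1, "bob")          # bob replies after seeing it
other = bump({}, "carol")     # a clock from an unrelated activity

# c1 -> c2 is ordered; c2 and `other` are concurrent either way,
# and nothing needs to decide which came "first".
```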

So anyway... the primitive for us is the Group Activity, and they can reference each other, forming Merkle Trees and DAGs if needed, just like in Git and so on.

For more info see qbix.com/platform/guide/streams


I’ve been using event sourcing as an exclusive data store for a variety of toy modules, which are slowly becoming something large... I am using all bespoke infrastructure, both for managing the logs and for building UI.

I don’t think what the author did sounds like event sourcing, as I think of it. His setup sounds more like pubsub. In all honesty I’m probably the one doing it wrong though.

My event stream isn’t typically consumed or listened to by arbitrary listeners. (Although it can be, in some rare cases) Each event is namespaced to a specific module. When you load a log, you provide singletons for each module. Each message is only received by one singleton. If you want to broadcast that message to another module you have to do that from the parent module.

Everything is totally imperative so there’s no ambiguity about what causes what.

If I need data on a different machine, or in the browser, I just send the log down as a procedure and run it with fresh singletons on the client.

I also don’t have one mega stream, I only mix modules that actually interact. If I have two pieces of data that aren’t connected, they each have their own log.

I don’t know, maybe I’m the one who’s not really doing event sourcing.

Like I said, I’m still somewhat at the toy stage, so maybe I will get to where OP is eventually. I do plan on doing log rewriting for compression. That might lead to some pain.

I also haven’t yet had to deal with sharding. I suspect there’s pain there.

My plan is to use the one-singleton/one-module per message namespace rule to keep things from getting too complex. I figure if the same module that consumes messages is responsible for rewriting them, maybe it won’t be too weird. We’ll see. I have a lot more prototyping to do.

If anyone is interested, the core module is “a-wild-universe-appeared” on NPM. It’s still in the 0.x.0 series so it could break. But the basic API is pretty simple and hasn’t changed much in a few months even though I’ve been using it regularly.

https://www.npmjs.com/package/a-wild-universe-appeared

I agree with the OP that it’s perhaps too soon to throw this kind of thing at production problems. We need more research on how stores like this integrate with other layers.


Event sourcing is often described in the context of a rich domain model where every event has slightly different semantics with respect to the domain. As a result, there is often a need to adapt these event types over time as the domain changes. This could involve revising an existing event type (which is fine for adding new attributes for capturing more information) or creating a new event type to model something that has changed in the domain. If you simply incorrectly modeled the events for the domain, then you will have the same challenges of migrating a poorly designed database schema.

The arguments that Git and Datomic are both event sourced systems are good examples of successful application of this pattern. However, these are poor examples when it comes to event sourcing where the events are "domain events". In both Git and Datomic, the data model of the events is pre-defined. With Git you have a changeset and with Datomic you have datoms, both of which are composable by design and where every changeset and datom is equivalent (from a structural standpoint).

Applying event sourcing to an arbitrary domain model means every event is both semantically and possibly structurally different. That is, how an event of type X affects the state differs from how an event of type Y does, whereas applying two different datoms or changesets in Datomic or Git, respectively, changes the state in the same way.

So I think there are two fundamental challenges faced when using event sourcing with a domain model. First is that the type and structure of events need to adapt to the domain (business, organization, etc). Datoms and changesets never change.. they are fixed and therefore don't have that challenge. Second, and related, is that as new events are introduced or existing ones are adapted, etc. the code that processes those events is unique as well (not just the model of the event itself). This challenge is exacerbated if there are multiple downstream consumers of this event stream where now there is "coupling" on the event/data side of things and likely semantic coupling as well.

Again, this is not something you run into with Datomic or Git simply because the data model of the events in those systems (the fundamental unit being left-folded) has a fixed data model and semantics.


Some contrarian opinion: https://vvvvalvalval.github.io/posts/2018-11-12-datomic-even...

Basic idea is to make the event log dumb. Instead of capturing business logic, make them as dumb as SQL commit logs: this row was inserted at this time, that row was deleted at that time. Sort of like a glorified WAL file. Benefits of this scheme? Querying old database states and auditing.
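A toy version of that "dumb log" idea (all names invented): record row-level facts with a timestamp, like a glorified WAL, and querying an old database state is just replaying the log up to that time.

```python
import itertools

_clock = itertools.count(1)
log = []  # (t, op, row_id, row) tuples: the dumb, WAL-like change log

def insert(row_id, row):
    log.append((next(_clock), "insert", row_id, row))

def delete(row_id):
    log.append((next(_clock), "delete", row_id, None))

def state_as_of(t):
    """Reconstruct the table as it looked at time t."""
    rows = {}
    for (when, op, row_id, row) in log:
        if when > t:
            break
        if op == "insert":
            rows[row_id] = row
        else:
            rows.pop(row_id, None)
    return rows

insert(1, {"name": "alice"})   # t=1
insert(2, {"name": "bob"})     # t=2
delete(1)                      # t=3
```

No business semantics live in the log, which is exactly the point: querying and auditing work without the schema-evolution pain of domain events.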


I've never read or heard that event sourcing was supposed to be all sunshine and rainbows.

Microsoft wrote about their adventures going down this path [0]. It's well worth the read if you're considering it for your project.

For a few features in my company's current platform we use event sourcing. We collect plenty of small data points over time that is aggregated into rows to provide users summary information of reams of operational data. We tried aggregating the data in queries and despite our best efforts to optimize our indices, tables, and queries there wasn't any way to compute it in any reasonable amount of time.

The pain points for us:

Our UI team wasn't in sync with how data flows in an event-based system. There's a lot of friction there. We're slowly updating the team on task-oriented user experiences and breaking down our UI components to transmit their commands directly. For now a lot of our control plane has to break up the huge form data we receive into commands and send back partial responses to the client. As we move forward and find better UI patterns this has improved.

The control plane controls some data models that are not event-sourced and are mutable. Our users tend to expect to be able to rename and delete objects in the system that our event-sourced models refer to. It caused some confusion when certain views in the application that are built from our projections wouldn't see the updated name of the object they had just renamed. And so we ended up doing the event-sourcing no-no of emitting the CRUD events so that our projections could appear in the manner our users expected. This is partly because of the aforementioned problems with the UI team but is also a problem with event-sourced models referring to data that can mutate over time.

However it hasn't been a hellish experience either. I took the liberty of developing some models of our event-sourced infrastructure and features in TLA+. This has been helpful to ensure that certain properties of the system under development would hold: consistency, availability, etc. You may not be willing to go down the path of learning TLA+ but the key take-away there was that a little planning goes a long way with a project like this: simple unit tests and whiteboard diagrams are not going to cover all of the things that can go wrong in an event-sourced system. If anything it might convince you to keep it simple, limited, and constrained as we did.

edit: forgot link

[0] https://www.microsoft.com/en-ca/download/details.aspx?id=347...


It is worrying that a central figure to Event Sourcing & CQRS like Greg Young reduces the "framework" to a function, a pattern match & a left fold.

Linked in the article https://youtu.be/LDW0QWie21s?t=1926
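For what it's worth, the function-plus-pattern-match-plus-left-fold framing really is that small. A toy version (the event names here are made up):

```python
from functools import reduce

def apply(state, event):
    """The pattern match: how each event type transforms state."""
    if event["type"] == "deposited":
        return state + event["amount"]
    if event["type"] == "withdrew":
        return state - event["amount"]
    return state  # unknown events leave state untouched

events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrew", "amount": 40},
]

# The left fold: current state is the fold of apply over the history.
state = reduce(apply, events, 0)
print(state)  # 60
```

The complaints in this thread are mostly about everything around this core: projections, schema evolution, coordination; which the fold itself says nothing about.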


This statement crystallizes the article, and indeed the discussion in the rest of the thread. Young's language is drawn from functional programming. All of the complaints about complexity seem to boil down to "event sourcing is hard in a world of mutable objects." The further you are from living in a world of pure functions and immutable data, the harder event sourcing is going to be.

Having implemented event sourcing in an existing desktop GUI application, I found there was a tremendous amount of complexity at the beginning, and very few benefits. Application state had to live in two different systems for a while. It was very hard, but we had been backed into a corner by inconsistent private mutable state and we had to do something. Once the project picked up steam, we saw a huge improvement in consistency and were able to realize some features that had been out of reach under the old system. Having lived the transition, I wouldn't recommend starting a project with event sourcing. Unless you already have a very clear idea of what constitutes application state and what constitutes an event, you're going to have a muddle. I would, however, recommend making the transition once you've figured out what the application is that you're writing.


Why do you find that worrying?

I interpreted that section as "the core bits are easy, and the frameworks people have built don't really help you with the non-easy problems that come later, so they provide little value", which (assuming the statement about the frameworks is true) seems like a reasonable position?


It is not how I interpreted it. What I heard was salesman pitching an expensive product as a bargain.

Even though frameworks have lots of drawbacks, I think it solves one problem really well, it gives you, the team, a direction.

Doing Event souring & CQRS correctly takes years of experience, this can be concluded by reading articles like this one or watching any of video by Greg Young.

In a sense the origin of this article stems from the notion that no framework is needed. I think that is a setup for disaster by selling developers the idea that this is easy, when it isn't.

In my experience frameworks have often taught me how not to do things. Frameworks are condensed experience that you don't need to acquire yourself; someone else has already made the mistakes for you. This is a huge time saver and gives you, the developer, experience at a lower cost.

With that experience going frameworkless can then be achieved if necessary.

However my interpretation of it can be exaggerated due to the fact it was a short statement without much context.


It is hard to separate DDD, Event Sourcing/CQRS - they seem to all be joined up concepts promoted by a small circle of people.

I worked on a blockbuster project financed by a local billionaire in a Gulf State where the entire shebang was “mandated” by the CTO and a well-known Scala consultancy. Event sourcing, CQRS, with DDD to define architecture.

Let me describe one simple issue that was almost intractably complex - user sign up.

We had one service which was authentication, and another which was user preferences.

The front end sends a create-new-user command (set up a new user in the auth system with a password, email, etc.); we now also need to send some kind of message to initialise the user info records (e.g. home address, telephone, language preference, etc.).

We need to implement a saga for this, i.e. a process coordinator. Or maybe the authorisation system should, on the first request where the data is not populated, fill in a blank, default record?

We genuinely achieved organisational paralysis over this, as there was no clear emergent right way to do this, and a bunch of ways that really smelled.

Additionally, from a code organisation perspective, we had a repo with the central commands and events defined as code, which every service linked to. So we had a huge central dependency with a high velocity of change.

What was meant to be a distributed, loosely connected system was in fact the most coupled system I’ve ever worked on.

Architecture is about clear communication, and the developers were simply perplexed. We were bringing in 10 developers a month, and communicating the architecture was impossible. Lofty ideals were espoused.

The first basic rule of delivery in a software engineering project, KISS, was universally ignored, in favour of using “sexy” architecture.

Fittingly, the CEO was sacked first, after 18 months at the helm; a few months later, the CTO and 95% of the dev staff followed. The year-long coding effort was turfed. Estimates are that $50m-$100m was burnt.

The winner of course was the DDD consultant who mandated everything, whose word was seen as law, and who didn’t actually seem to have much pragmatic, practical experience - he was getting paid $2k/day and found it difficult to listen to any idea which watered down his architecture in any way. I think he will have eked out enough cash from this shitshow to purchase a small dwelling.

The biggest losers were the hundreds of staff that had relocated to the region sometimes with families, that had made genuine plans to be in the region for years, that would have all had their visas cancelled.

I think the entire thing is a con - my experience was in ideologues and purists pushing it and making themselves niche consultancy/speaking/writing careers in it.

Now this is just one anecdote - but it really is an architectural style that can totally wreck a development effort or even a company - people are ideological about it - it is not low-risk, and the entire Saga/Process Coordinator stuff is just a tack-on trying (unsuccessfully, in my opinion) to answer things that simply don’t work properly. The message I mean to communicate is that the async-everywhere nature of CQRS/ES is super complex wherever coordination is required.

If you are going for loosely connected services, I far more prefer the microservices/RPC architecture - such as Netflix - as the synchronous model with distributed load balancing is a lot more sound.


I might be misinterpreting your example, but wouldn't you have both the auth service and the preferences service listening for a 'new user' event (which would contain all the details needed by both systems), and both of them acting on it? Of course it would get more complex to handle error conditions, e.g. what do you do if validation on the preferences fails...

There are commands and events in CQRS/ES. Where a front-end interaction needs commands issued and success reported from multiple subsystems, you now need a transaction coordinator for your “saga”.

Typically the front end needed to get a success message to say user created, or user creation rejected (e.g. backend checking on valid postcode, valid username, etc - rejections which would come from two separate services.)


I am completely confused by your description ...

I have also not used a 2pc transaction coordinator in well over a decade.

Also, you use the term "saga", which is apt in that what people are often doing here is not in fact a saga (which actually has a pattern description); it is usually a process manager, which is a different pattern.


If you’re confused, welcome to how everyone felt. The nuances and abstract definitions made for confusion among the staff. I’ve forgotten what I knew about it, and you may well be right in your definitions; however, what was clear was that we needed to implement asynchronous pub/sub comms on the Kafka queue to get this to work. What might have been simple with an RPC (e.g. a POST/PUT) was turned into a system which needed to track the state of its asynchronous RPCs and listen to the queue for responses.

All of this was for no real purpose, and as I said the proof was in the pudding - tens of millions of dollars wasted, hundreds of staff hired and then fired. The domain itself was not complex (e-commerce) but the implementation was ludicrous.

There were basic delivery problems in the project as there was a complete lack of keeping things simple, and massive overengineering.


I think Event Sourcing is neatly defined as an algebraic group and provides a nice abstraction to reason with. Perhaps the problem is not recognizing this notion and not proving the correctness of a system before implementing it?

Although terminology differs, storing the canonical source of truth in Kafka has worked great for many of my clients. If that is Event Sourcing, then it can be made to work easily. I do get asked many, many questions about this, often from inexperienced teams. I took their questions, and my answers, and posted them here:

http://www.smashcompany.com/technology/one-write-point-one-r...


If people are indeed directly accessing a stream, they are violating a fundamental service oriented principle - Teams must communicate with each other through their service interfaces. It is worth understanding this concept with clarity. Assuming we go with Kafka, even if we don't need any additional functionality other than what Kafka provides out of the box, it would still need to be wrapped in a service and treated as an actual service. Otherwise, it becomes a shared resource.

This is one of the core issues that make many ES systems complex. A service owns an event in the same sense that it owns internal state. It's internal to that service, which exposes only what it finds appropriate, in the way it finds appropriate. That is almost never directly as an event on a public event bus.

I agree that event sourcing (and also CQRS) are not so simple, in practice. The coupling the author mentions is something I definitely experienced, and I think the answer is to separate your internal event representation from both the input (e.g. command) and output representations. I got seduced by the simplicity of having them all be the same, but I definitely found I wanted to be able to vary these things independently.

In the first project where I applied event sourcing, I treated my commands (system inputs) as events, but ended up regretting it. The problem was that many commands cause multiple effects and as the number of commands increased, the complexity of the logic for deriving the state started to accelerate.

If I could do it again, I would have strictly separated the command and event ontologies, and adopted the concept of a command processor. The command processor takes a command and an event list and returns feedback on the validation of the command along with a new event list that is at least as long as the input. Rejected commands would result in the same event list. Accepted commands that map 1-to-1 with events would result in an event list one item longer. Complex/composite commands would result in more events. I probably would have logged the commands to retain the command-event relationship (for things like undo), but largely that would be a separate thing.
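A minimal sketch of the command-processor idea described above (all names are hypothetical): rejected commands leave the event list untouched, while an accepted command appends one or more events.

```python
def derive_state(events):
    # Fold the event list into current state (here: just the comment texts).
    return {"comments": [e["text"] for e in events
                         if e["type"] == "comment_created"]}

def process_command(command, events):
    """Return (accepted, errors, new_event_list)."""
    state = derive_state(events)
    if command["type"] == "create_comment":
        if not command.get("text"):
            return False, ["text is required"], events    # rejected: same list
        if command["text"] in state["comments"]:
            return False, ["duplicate comment"], events   # rejected: same list
        event = {"type": "comment_created", "text": command["text"]}
        return True, [], events + [event]                 # accepted: one longer
    return False, ["unknown command type"], events
```

The validation feedback comes straight out of the processor, so it can be mapped back to user-visible errors without the event model leaking into the UI.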

If commands and events are separated, CRUD doesn't require the event sourcing paradigm to bleed through to the UI. Each operation is just a command. Validation comes directly out of the command processor and can easily be mapped back to user-visible feedback.

The other mistake I made was concentrating too much logic in my state calculator. My calculator produced all derivable facts from the event log, as well as housed the validation logic (the command processor). In retrospect, I should have figured out what the most fundamental derived facts were and then moved higher level facts into their own calculation logic. I think this would have made maintenance and testability far easier.
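The layering described here can be sketched roughly like this (names illustrative): a base projection derives only the fundamental facts from the log, and higher-level facts are computed from the base projection rather than from raw events.

```python
def base_projection(events):
    # Fundamental derived facts only: the set of created comments.
    return {"comments": [e for e in events
                         if e["type"] == "comment_created"]}

def comments_per_user(base):
    # Higher-level fact, built on the base projection,
    # so it can be tested and maintained in isolation.
    counts = {}
    for c in base["comments"]:
        counts[c["userId"]] = counts.get(c["userId"], 0) + 1
    return counts
```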

While event sourcing does come with its baggage, I find that in projects that aren't using it, the cost is a bunch of ad hoc solutions to the problems of "first-class change", which you get for free. It's an extremely helpful essential technique when modeling stateful workflows.


> If I could do it again, I would have strictly separated the command and event ontologies,

Your use of "I" instead of "we" is interesting here.

I find that when there is a separation of two things that naively look like they can be collapsed into one thing (e.g. because all the simple cases have 1-1 mappings), then someone will either collapse the two -- or (more likely) force you to collapse them by building in an assumption about the 1-1 mapping.

Your beautiful separation then becomes useless. Do you have experience about how to set up teams that don't do this?


I say "I" for a couple reasons. First, I'm no longer at the same company, so that's just my personal reflection. Second, I was the architect of the project and definitely the person who sold the bill of goods on the benefits of event sourcing :). It was largely successful, but with those lessons learned.

I'm not opposed to collapsing things that have a 1:1 mapping. It's often a reversible decision, when/if you find the simplification is no longer actually simplifying things. The problem is that as these representations cross boundaries between modules and systems, reversing the decision becomes far more difficult. This isn't limited to event sourcing at all, though. It's the fundamental concept of encapsulation and coupling in system design.

I have found it difficult to socialize the benefits of encapsulation in a team, because the upfront cost is easy to see, but the downstream benefits are not. Sometimes, I've made the judgment to just step back and let people learn from their own mistakes. I've learned the hard way that it's actually not the worst thing in the world.


RISC vs CISC? :)

Yeah! I guess you could say it's kind of like how modern CISC has RISC microcode under the hood.

There's sort of a middle ground between event sourcing and ordinary mutable entities: versioned entities.

http://higherlogics.blogspot.com/2015/10/versioning-domain-e...

The particular schema described there isn't suitable for highly concurrent entities, but a more suitable schema could be employed that achieves the same goals.


We did something similar to this for a CRUD app that needed to become append only, have a full change log, have approve/deny events, and the ability to be rolled back. We still had an event table, but instead of having event data it just had a reference to a 'shadowed' (versioned) entity in the entity table. Once an event is approved, you project the shadowed entity on to the real one. That way the ID of the real entity never changes. This worked really well for our very specific use case (simple CRUD events, monolithic app.)
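A rough sketch of that shadowed-entity approach (all names hypothetical): pending changes live as versioned shadow rows, and approval projects the shadow onto the real entity, so the real entity's ID never changes.

```python
entities = {}   # real entities, keyed by a stable id
shadows = {}    # pending versions, keyed by (entity_id, version)
history = []    # append-only change log referencing shadows, not event data

def propose(entity_id, version, data):
    # Record a pending change as a versioned shadow of the entity.
    shadows[(entity_id, version)] = data
    history.append({"entity": entity_id, "version": version,
                    "status": "pending"})

def approve(entity_id, version):
    # Project the approved shadow onto the real entity; the id is stable.
    entities[entity_id] = dict(shadows[(entity_id, version)])
    history.append({"entity": entity_id, "version": version,
                    "status": "approved"})
```

Rollback falls out naturally: re-project an earlier approved shadow.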

I don't think one can do proper, robust Event Sourcing without first having a few years' experience in hardcore functional programming (particularly of the variant that strictly segregates side effects), something most of us lack.

OO and loose FP are fine for a (huge) variety of problems, but the hardest problems need the next level of correctness and elegance.

Else you end up authoring yet another Rube Goldberg machine.


I'm of the mind that the usefulness of event sourcing as an architecture is directly correlated with how easy it is to determine what an event is.

I'm also of the mind that most challenges with event sourcing are ultimately tooling problems, and that we will "get there" eventually, for some fairly pleasurable definition of "there."


Related to the topic, does anyone have experience with Axon Server? https://axoniq.io/product-overview/axon-server It claims to solve a number of the concerns over adopting an event sourcing model.

From what I've seen in demos, it addresses low-level plumbing elegantly but I don't see how a relatively unobtrusive framework (I mean it as a major compliment) can help against making plain old design mistakes with wrong events, commands etc. that don't fit together and are unable to satisfy requirements.

Event Sourcing is not hard (as compared to not event sourcing).

But event sourcing is not for every application. Event sourcing solves some, otherwise hard, problems at the cost of added complexity. You need to judge whether it pays off.

Event sourcing assumes a particular size of application. Too small an application will pay a lot in complexity with no added benefit, because the problem would otherwise be easily solvable without event sourcing. Too large an application will pay a lot in complexity because the write path will be complex due to throughput requirements.

Event sourcing requires dedication. You can't go half-way, mixing event sourcing here with direct inserts there, for example. That is going to be a hell of a complex environment to live in, with the worst of both worlds. Event sourcing only solves problems if you use it 100%.


But I suppose there is no other alternative for high-traffic systems? The belief in "event sourcing for everything" might be the issue itself, not selective use.

If you want to do event sourcing, look at persistent actors from Akka (JVM and CLR). They solve a lot of the problems discussed here, with replay, snapshotting, event upgrades, etc.

Everything in the world works off event sourcing, from your database commit logs to your React app. Hard or not, we are stuck with it.

One of the problems I think I see with event sourcing is its inability to scale. You have to guarantee the order of events, right? How do you do that in a large scale distributed system with eventual consistency, without incurring an insane synchronization time penalty? I'd genuinely love to hear if you have a good solution for this, because if you do I have a use case I need it for, so this ain't a troll comment!

You generally only have to guarantee the ordering of _some_ events--not all of them. Ideally you want a way to partition your data such that ordering is only important per partition. For example, when I worked in derivatives trading, we only needed to guarantee ordering by product class, but events could be generally unordered given multiple product classes, as long as each individual one was ordered. Similarly, the telemetry systems I work on now guarantee ordering per sensor (duh) and per sensor grouping, but not across _all_ sensors on the vehicle. Partitioning makes for easy scaling (just make a new partition yay).
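The partitioning idea can be sketched in a few lines (partition count and key names are arbitrary): a stable hash of the partition key routes each event to a partition, so ordering is preserved per key without any global ordering.

```python
import zlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def append(key, event):
    # The same key (e.g. product class or sensor id) always maps to the
    # same partition, so that key's events stay in append order.
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append(event)
    return p
```

This is essentially what Kafka does with keyed messages: ordering is guaranteed within a partition, never across partitions.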

This depends on the nature of your scaling issue. Are you having trouble with consuming events? Or with adding all of your events whilst trying to ensure that they are ordered?

For the former, you might use something like [ZooKeeper](https://www.elastic.co/blog/found-zookeeper-king-of-coordina...) in your consumers to control parallelisation.

For the latter, you order by time of insertion. I think you might have some constraints you haven't mentioned that may be the real problem you are trying to solve. What is distributed?



that basically says Vector Clock, which is a good solution where it's usually applied, but I think this is different. Having everything bottleneck through your vector clock isn't going to be feasible.

Exact order of events only needs to be guaranteed within one DDD aggregate, which is doable with simple optimistic locking. So unless your system consists of a single aggregate, you should not have scalability issues.
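A sketch of that per-aggregate optimistic locking (names illustrative): an append succeeds only if the caller's expected version matches the stream's current version, so two concurrent writers to the same aggregate cannot both win.

```python
class ConcurrencyError(Exception):
    """Raised when another writer got in first."""

streams = {}  # aggregate_id -> ordered list of events

def append_event(aggregate_id, expected_version, event):
    stream = streams.setdefault(aggregate_id, [])
    if len(stream) != expected_version:
        # Someone else appended since we loaded the aggregate: retry or fail.
        raise ConcurrencyError(
            f"expected v{expected_version}, stream is at v{len(stream)}")
    stream.append(event)
    return len(stream)  # the new version
```

Different aggregates never contend with each other, which is where the scalability comes from.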

I guess, whatever a Kafka cluster does, that's how you do it.

Here's my take on Event Sourcing: it's not particularly well defined what an "event" is. Did the event happen yet? Did it succeed? Which part? I didn't like having an event ledger say "comment created," where my application is then meant to consume this, handle validation, potentially fail on the db operation, etc.

So here is what I do: I basically combine Event Sourcing architecture with CQRS. Whenever a client makes a write request, that "command" gets written to a log. Then when things happen as a consequence of that command, for example, a successful DB write happens, I write another log that references the command.

So I'll have an event that says:

    id: '1234'
    type: 'command'
    data: {type: 'createComment', text: 'Ayy', userId: '1234', postId: '6543'}
And then that can get picked up and processed by the application, which will create another event like:

    id: '2345'
    type: 'dbWrite'
    data: {model: 'comments', data: {text: 'Ayy'}}
    commandId: '1234'
There's a lot of redundancy, and you can instead rely on something like your DB WAL for some of the consequences of the commands, but separating out the commands from the results like this has made Event Sourcing work quite well for me.
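One illustrative way to wire this together (helper names hypothetical): a processor scans the log for commands with no matching result event, applies the write, and appends a result event referencing the originating command's id. Replaying `process` is then idempotent.

```python
log = []  # combined command/result log, in the shape shown above

def submit(cmd_id, data):
    log.append({"id": cmd_id, "type": "command", "data": data})

def process(db):
    # Commands already answered by a dbWrite event are skipped on replay.
    handled = {e.get("commandId") for e in log if e["type"] == "dbWrite"}
    for entry in list(log):
        if entry["type"] == "command" and entry["id"] not in handled:
            db.append(entry["data"]["text"])  # the actual consequence
            log.append({"type": "dbWrite", "commandId": entry["id"]})
```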

> Did the event happen yet? Did it succeed?

Events have already happened when you see them in the event log. It's happened, it succeeded.

> I didn't like having an event ledger say "comment created," where my application is then meant to consume this, handle validation, potentially fail on the db operation, etc.

You don't validate the past, you just accept it. Failing on the DB operation means your application is in error - if it's a transient error, retry with whatever backoff strategy you have. If it's a permanent error, you have a serious mistake somewhere in your code.

In your example, I believe you've broken one event into two parts. The second part essentially only says "the command was accepted." The first part says why there was a change made.

The first event is essentially "a user submitted a request to create a comment." The second event is essentially "a user's request resulted in a database change." The second event is not semantically interesting without the request's information, and the first event is of little interest outside of processing the request. Your application receives the _command_ to post a comment, validates the command, and records the outcome as an event.

A complete system would record errors as events as well as successes. This might seem very noisy, but down the line you might need to answer questions like "was this user doing lots of failed comments leading up to their successful hack of our system?" Or "how many comments are failing because the spam filter rejected them?" And you can't answer those if you haven't recorded the error outcome.

(And there's no harm in logging commands separately, too, but I'd log them explicitly as _commands_, not events. Commands are imperative tense, they're an order or request for the system to take some action. Events are past tense, they're what the system actually did.)
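The "record failures too" idea might look like this (the spam filter and event names are hypothetical): every command yields an outcome event, success or not, so questions like "how many comments did the spam filter reject?" stay answerable later.

```python
events = []  # the event log: outcomes only, always past tense

def handle_create_comment(cmd, spam_filter):
    # The command is imperative; the recorded event is what actually happened.
    if spam_filter(cmd["text"]):
        events.append({"type": "comment_rejected", "reason": "spam",
                       "userId": cmd["userId"]})
        return False
    events.append({"type": "comment_created", "text": cmd["text"],
                   "userId": cmd["userId"]})
    return True
```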


Having it clear, not just in your own head but in your entire team's head, what an event is, is quite important, especially keeping in mind what happens when you "play back" a log. Do transactional emails get sent out again? Do upstream services record the playback as duplicate events? How are transactions handled? What external state are you unknowingly depending on for that playback to produce the same result?

> Do transactional emails get sent out again? Do upstream services record the playback as duplicate events? How are transactions handled?

The narrow problem here is that Event Sourcing is temporal, but not bitemporal. Or rather, the different kinds of temporality are frequently muddled.

Event streaming in the Akidau/Google style is bitemporal, but mostly accidentally, as a side effect of distinguishing "event time" (fact time) and "processing time" (transaction time / belief time).

(Snodgrass later proposed tritemporal models, including a timeline for when a fact-belief was viewed, which I find both brilliant and slightly terrifying).

The problem of evolving data models, which is the third-order problem, is often hardest when you put a log or stream at the centre of your design. Databases using SQL, for example, struggle with first-order (current state) and second-order (historical state) evolution. But they do much better on third-order evolution, since DDL is built into every relational database.


Ya, I think that is part of the importance of separating out the effects that resulted from the command. If I want to benefit from the "state rewind" capabilities, to bring my DB back to a certain point in time, for example, I can simply play back the DB writes that occurred and perform those updates directly.

Otherwise, I would have to replay the commands, ensure that my application state for processing the commands is exactly the same as it was when it last processed those particular commands, ensure that all undesirable side effects are disabled, etc.


My separate comment was partially a reply to you: https://news.ycombinator.com/item?id=19073810


