I have worked on, or cleaned up, 4 different CQRS/ES projects. They have all failed. Each time the people leading the project and championing the architecture were smart, capable, technically adept folks, but they couldn't make it work.
There's more than one flavor of this particular arch, but Event Sourcing in general is simply not very useful for most projects. I'm sure there are use cases where it shines, but I have a hard time thinking of any. Versioning events, projection, reporting, maintenance, administration, dealing with failures, debugging, etc etc are all more challenging than with a traditional approach.
Two of the projects I worked on used Event Store. That was one of the least production-ready data stores I've encountered (the other being Datomic).
I see a lot of excitement about CQRS/ES every two years or so (since 2010) and I strongly believe it is the wrong choice for just about every application.
I'm the author of this article. It sounds like you have some valuable, real-world experience with CQRS/ES.
I'd love to read more about the difficulties you've faced, and overcome.
For migration of immutable events, there's a good research paper[1] that outlines five strategies available: multiple versions; upcasting; lazy transformation; in-place transformation; copy and transformation. The last approach even allows you to rewrite events into an entirely new store.
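To make one of those concrete: here's a minimal sketch of upcasting in Python. The event shape, the upcaster registry, and the defaulted field are all invented for illustration, not taken from any particular framework.

    # Stored events stay immutable; each is passed through a chain of
    # version upgraders at read time until it reaches the current shape.
    UPCASTERS = {
        # (event_type, from_version) -> function producing the next version
        ("CustomerMoved", 1): lambda e: {**e, "version": 2, "country": "US"},
    }

    def upcast(event):
        # Apply upgraders repeatedly, so a very old event can hop through
        # several versions on its way to the latest one.
        while (event["type"], event["version"]) in UPCASTERS:
            event = UPCASTERS[(event["type"], event["version"])](event)
        return event

    stored = {"type": "CustomerMoved", "version": 1, "to_address": "1 Main St"}
    print(upcast(stored))  # -> version 2, with the defaulted country field

The write side never touches old events; only readers pay the (small) cost of the upgrade chain.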
ES, however, is more controversial. If we're speaking about "pure" ES (i.e. not having any mutable state, and reconstructing current state from input events all the time) - versioning and potential synchronization failures (and synchronized access is a prereq for event sourcing) will kill it very quickly (and I haven't even started on performance, which is going to be a very serious challenge).
OTOH, if you understand ES as merely an ADDITION to classical mutable-state processing, it can be made very useful. Not only will ES serve as a perfect audit; the duplication of information (once in mutable state and once in input events) allows such things as regression testing, and fixing data problems caused by bugs, within the DB. BTW, with this model the latter can be done as a post-factum fix, which, unlike "pure-ES" fixes, is not confusing to readers who already got and stored a previous state of the DB. With "pure" ES, after the fix all the history can change, invalidating all the data which might have been stored by third parties, and this is really crazy - imagine if your bank statements changed overnight. With an "ES+mutable" model, bugs can still be identified, and the effects of the bugs can be found too - and then a separate correcting transaction can be issued against the DB, which is a much better match for the vast majority of existing business processes.
Hope it makes sense :-) (it is admittedly very sketchy, but forum is not a good place to elaborate further)
I've been following your projects on Github for a while, good work—I don't necessarily agree with all of the design choices but we've built on the eventstore at work and I'm going to be using it on another project in the near future.
Thanks for the link to the research paper, it's a good read. Are you aware of any event sourcing frameworks/datastores that use one (or more) of these strategies to tackle the problem of schema evolution?
Wow, this succinctly sums up our experience. Fun to develop against, absolute nightmare to support in production.
The only place I'd recommend it these days are where the business views their state as an event stream, maybe finance/stocks. Not developer-forced-events like "customer address updated" or "user email changed".
Even in the workflow systems I've dealt with, the business doesn't view its state as an event stream. The state is where it is; how it got there is an interesting footnote.
No shit... if you listen to the actual talks, that is EXACTLY what they recommend. Use the patterns where appropriate, don't use it where it's not appropriate. Also consider _where_ the advice came from - finance, gambling, healthcare and so forth.
As for "not production ready" with regards to Event Store specifically, we have had zero reported incidents of data loss which were not the result of catastrophic failure of the hardware (or, usually, the cloud) it was running on.
One of the projects I worked on was in finance. The fact that we were going to get "a free audit log" from the architecture made everyone so excited.
After the CQRS/ES project failed (I was on cleanup duty) we used a more traditional arch. To handle the audit log we just had a separate table. ("customer" table had "customer_audit" table. Both were written to in transaction. Solved.)
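For anyone curious, that approach is a few lines with any relational store. A minimal sketch with Python's sqlite3 (table and column names made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, address TEXT);
        CREATE TABLE customer_audit (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            customer_id INTEGER, old_address TEXT, new_address TEXT,
            changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
    """)
    conn.execute("INSERT INTO customer (id, address) VALUES (1, '1 Main St')")

    def update_address(conn, customer_id, new_address):
        # One transaction: the row update and its audit record commit
        # together, or neither does.
        with conn:
            (old,) = conn.execute(
                "SELECT address FROM customer WHERE id = ?",
                (customer_id,)).fetchone()
            conn.execute("UPDATE customer SET address = ? WHERE id = ?",
                         (new_address, customer_id))
            conn.execute(
                "INSERT INTO customer_audit (customer_id, old_address, new_address) "
                "VALUES (?, ?, ?)", (customer_id, old, new_address))

    update_address(conn, 1, "2 Oak Ave")
    print(conn.execute("SELECT * FROM customer_audit").fetchall())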
Not production ready - the system, not the database. We never had a problem with the event data stores. The bespoke systems our CQRS/ES rescuees built around it, absolutely.
Even where there are smart, capable, technically-adept folks in-place, it in no way means that they have full command of the implications and the ways and means of event sourcing.
In my experience, event sourcing is useful in most projects - even, contrary to an earlier comment, for the user and user profile concerns.
I've worked with EventStore. I've wanted to smash EventStore out of existence on a number of occasions. On some of those occasions, I was at fault. And on others, I found the user experience (as an implementer) misleading, ambiguous, and ultimately costly. I'm quite frank with James and Greg about what I think EventStore should be achieving, and have been quite public about it in the past, so I won't rehash that here.
I guess I haven't seen the ebb and flow of excitement, though (since 2007-ish when I started crossing paths with Greg and Udi).
I do see a steady rise in awareness of it since then, and I see an increasing number of successful projects. There's also more people helping other people and shops with implementation guidance and safety.
And there's bound to be more failures - purely as a matter of numbers - just as there are with any platform or tool where the learning curve was underestimated, or the grasp of the whole was overestimated.
Yep, you can fail with event sourcing. You can fail with Rails. It will take you longer to fail with Rails, though. And for an organization that isn't into the learning as a matter of course, failing over a longer term can be an important and empowering strategy.
I have two sets of customers: those whom I help slow the looming failure of their Rails projects, and those whom I get started on event sourcing. It's not always a good cultural fit. But I have yet to see a domain that's worth the expenditure of a software development team that is somehow naturally inappropriate for event sourcing.
> I strongly believe it is the wrong choice for just about every application.
None of these are pure as-originally-outlined-by-Fowler CQRS/ES, but I'm willing to suggest they are paradigm equivalent, real world successful examples:
1. Basically all double-entry accounting/book-keeping/core banking systems.
2. Many RDBMSes. Specifically the replayable log structure of transactions.
3. I sell a service that includes two event-sourced data structures mutated by domain-specific commands. They are used for collaborative decision making in sports management. This aspect works very well indeed: the resulting characteristics are intrinsic to ES structures, are (AFAIK) unique in our market, and represent one of our most customer-retaining capabilities.
What do you think about the use of technologies like Kafka which enable what is effectively an event sourced architecture without all of the buzzwords? Not everyone uses it that way, obviously, but there is plenty of discussion about it.
One of the projects did use Kafka as the "event log". There were stability issues with the version of Zookeeper that was used. From a writer/reader perspective Kafka was sufficiently performant when it was up. (The Zookeeper issue was eventually fixed as I recall, but by then the damage was done in terms of political capital spent and lost.)
The big issues didn't really have that much to do with the persistent store of events. The bigger issue was the fact that as new features get added to your application, your event payloads change. Well, in order for projection (particularly re-projection in the case of an issue) to work, your code needs to know how to read and process all versions of every event. Of course, there are techniques like snapshotting to give you point in time "good states" so you can eventually deprecate some of these events, but thinking about it is challenging.
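A tiny sketch of what that burden looks like (Python, invented event shapes): the projection carries a branch for every payload version ever written, and snapshots are what eventually let you retire the old branches.

    def project_address(state, event):
        # Fold one event into the read model, handling each historical shape.
        if event["type"] == "AddressChanged":
            if event["version"] == 1:
                state["address"] = event["address"]  # v1: single string
            elif event["version"] == 2:
                # v2: split fields; str.format ignores the extra keys
                state["address"] = "{street}, {city}".format(**event)
        return state

    def replay(events, snapshot=None):
        # Rebuild state from a point-in-time snapshot plus later events.
        state = dict(snapshot or {})
        for e in events:
            state = project_address(state, e)
        return state

    events = [
        {"type": "AddressChanged", "version": 1, "address": "1 Main St"},
        {"type": "AddressChanged", "version": 2,
         "street": "2 Oak Ave", "city": "Springfield"},
    ]
    print(replay(events))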
Additionally, most folks argue that the event store should be immutable. This is great until some kind of bad event gets in there. Now your code needs to know how to read, and discard, this bad event forever (or until a snapshotted point in time).
Finally, projection is not the panacea the evangelists will have you believe. Inevitably there will be syncing issues between the event store and the projected database/elasticsearch cluster/mongo instance, whatever. And what do you do then? Re-project! But that is not easy :)
We use a flavor of event sourcing in production (Elixir+Postgres) and I have found many "rules" can be broken while still maintaining the core benefits of CQRS/ES
- We use Postgres as the event store and all of our other projections are stored in the same database.
- Our app is not distributed. We use a single database.
- Our event store is not immutable. Instead we will run migrations to rewrite the events, delete events, etc. You either have to deal with the complexity of maintaining two versions of code or the complexity of migrating. I've found the latter is a fixed cost (do it once and move on) vs. a variable cost (continue to deal with two versions of code). (A sketch of this kind of migration follows below.)
- Our commands aren't async. They are executed inline.
- We don't do any snapshotting.
Granted, we don't have a lot of users on our app (~50 active users) but overall this has been a positive experience.
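For what it's worth, a minimal sketch of the migrate-once approach mentioned in the list above (Python, invented event shapes; in the real system this is a database migration, a list stands in here):

    def migrate(events):
        migrated = []
        for e in events:
            if e["type"] == "AddressChanged" and e["version"] == 1:
                # Rewrite v1 events into the v2 shape so readers only
                # ever have to know one shape.
                street, _, city = e["address"].partition(", ")
                e = {"type": "AddressChanged", "version": 2,
                     "street": street, "city": city or "unknown"}
            elif e["type"] == "LegacyNoise":
                continue  # events we no longer want are simply dropped
            migrated.append(e)
        return migrated

    store = [
        {"type": "AddressChanged", "version": 1,
         "address": "1 Main St, Springfield"},
        {"type": "LegacyNoise"},
    ]
    print(migrate(store))  # only v2-shaped events remain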
We use the strategy of versioned events. Event data is still immutable, but older events are upgraded to the latest structure at read time. This works reasonably well but it is not ideal.
Storing data as immutable events implies that all data ever generated by your application becomes available to future versions of your application. Writing an application that can handle all the forms of your data across time is obviously more complex but it's a necessary consideration if you decide to go with event sourcing. Unfortunately, you cannot have all the benefits of available and useful unaltered historical data without also putting in the engineering effort to support it.
For migration of immutable events, there's a good research paper[1] that outlines five strategies available: multiple versions; upcasting; lazy transformation; in-place transformation; copy and transformation. The last approach even allows you to rewrite events into an entirely new store.
To address the concern with changing "immutable" events: I've just built a migrator [1] tool for my PostgreSQL-based event store. It implements the copy & transform migration strategy. The source event store database is untouched. Transformed events are written out to a separate database.
It uses Elixir streams to provide composable transforms to rewrite, aggregate, remove, and alter serialization format of events.
I work for a finance startup that uses Akka Persistence to implement an event sourced architecture. It has worked pretty well for us, much better than the original CRUD system. Akka can support a variety of different databases via journal plugins, which has allowed us to keep using a SQL database instead of being forced to adopt an unproven database.
The one area we've struggled with in the architecture is the query/projection side. Recently we have decided to completely separate the projection logic into a different application that is deployed independently to mitigate some of the issues we have been having.
We're currently in the midst of implementing the independent application, so we have not definitively solved our projection issues yet.
The main issue we have been running into is a performance bottleneck after restarting the application. Currently, when the application restarts (i.e. after a deployment) the projections may not update for several minutes. Our hypothesis is that if we can run projections independently of the command side of the application they can run indefinitely (i.e. as Spark jobs), since in theory a projection, once deployed, should not need to be updated. If we need to make changes to a projection then we will deploy another version to run side-by-side with the existing projection until the consumers migrate to the new version.
Do you mind detailing your experience with Datomic? We're looking at ways to store 6-ary tuples (quadstore + temporality of existence and temporality of observation) of facts and build indices on top of them. I'm hesitant to move to a (relatively) obscure data store without a really good idea of where that puts us.
We didn't get terribly far down the road with Datomic. Management/administration of the DB was not for the faint of heart. (At the time the docs were simply terrible, maybe they are better now?) We would see data corruption/loss as well. We weren't doing anything terribly complicated with it, and data loss doing the simplest things was unacceptable. (It's entirely likely we were the cause of the data loss somehow, but we followed the guides and did nothing crazy or weird). So anyways, we just threw it out almost immediately.
Conceptually it wasn't really a fit with the team or our application either. The query model and direct reading from the storage medium seemed silly - why not just use the underlying storage medium directly for everything? (Besides, now you have to administer both Datomic and Cassandra/Oracle/whatever.) Not worth it for the obtuse query approach and limited value of "never forgetting" imo :)
Direct use of the storage medium for everything would negate the advantages Datomic offers. The beautiful part of Datomic is that you can perform an expensive query without having any effect on other peers. Since Datomic also stores datoms in blocks, the n+1 problem is reduced as well. Another cool thing is that for unit-testing, you can just disconnect Datomic from the storage medium and run everything in memory. Datomic requires writes to be synchronised, which is why you can't go directly through the underlying storage medium for write cases.
While I personally would love to use Datomic for just about everything, the fact that I need at least three machines running for even the simplest app (1 actual database, the peer, and the transactor), and that those machines can't be the cheapest machines (you need enough RAM to store datoms in a cache on the client, or things will be slow), means Datomic is something I can rarely afford in practice.
Absolutely. I use CQRS fairly regularly, but I've never used ES on production software and I can't think of a case where it would have been appropriate. It's a serious solution that is not trivial to implement.
CQRS is a highly understandable, almost always reasonable technique to follow when you strip away all the other things that people tend to associate it with.
I like thinking about applications in terms of read-side and write-side. I'm not super enamored with async CQRS, but I am a huge fan of APIs and data models where the developers realized that how I consume the data is wildly different than how I update it.
I detailed some of the problems I saw in another thread, but one other thing that gets challenging is the CQRS side (if you choose to keep everything async).
For example, say you fire an "address change" event. That gets sent to the event store for eventual projection into a medium you can actually query from (realtime querying of the event store itself is the road to very bad places, I promise). So now your event is sent along, but how do you know when it has been processed/projected? What if there was a conflict with another address change event from somewhere else? How does that bubble back up?
The typical solution is to pass back some kind of "receipt token" so you can poll to see when your event is processed, then do your read from the projected database, or whatever. Of course, this can be made to work, but once you start talking edge cases and the need to support standard UX paradigms, polling for every update and handling error scenarios in this way becomes painful.
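Roughly, the receipt-token flow looks like this sketch (Python, with in-memory stand-ins; the "projection" here happens synchronously just so the example runs, which is precisely the luxury a real async system doesn't give you):

    import itertools, time

    PROCESSED = set()   # receipt tokens the projector has finished with
    READ_MODEL = {}     # the projected store the client actually queries
    _token = itertools.count(1)

    def submit_command(customer_id, new_address):
        token = next(_token)
        # Pretend an async projector picked the event up and projected it.
        READ_MODEL[customer_id] = new_address
        PROCESSED.add(token)
        return token  # the "receipt" handed back to the caller

    def wait_until_projected(token, timeout=2.0, interval=0.05):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if token in PROCESSED:
                return True
            time.sleep(interval)  # this polling loop is exactly the pain point
        return False

    token = submit_command(42, "2 Oak Ave")
    if wait_until_projected(token):
        print(READ_MODEL[42])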
> Of course, this can be made to work, but once you start talking edge cases and the need to support standard UX paradigms, polling for every update and handling error scenarios in this way becomes painful.
This really makes me think you—and the originators of these projects you're lambasting—are working from an incomplete understanding of how to apply CQRS and ES. If you're applying CQRS in a fully async fashion, polling itself is an antipattern. That receipt token? That's what everyone else calls an ID. You know when it's been processed because the subscription you should be watching tells you it's been processed.
You mentioned changing event payloads in another thread. That's another big code smell to me. In a stable, well-understood domain, your payloads don't change much. If you're applying ES to a domain that ISN'T well-understood, you need to do a LOT of discovery ahead of time, or be prepared to iterate on your data until you do. It sounds like the projects you were on failed on those accounts.
Yeah, ES is hard when you keep trying to treat it like CRUD. It's overkill when you apply it to an easy domain, and it's an antipattern when you write CRUD events. So don't do those things.
I was waiting for the "you're doing it wrong" guy to show up. You win!
Yes, some of the systems used subscriptions too, which had their own set of issues.
Additionally, domains are almost never completely understood. Even if they're well understood today, things will change tomorrow. CQRS/ES in your own words is not good when requirements change. Well guess what? That's every system I've ever worked on.
If you've had success building non-trivial CQRS/ES applications I'd love to hear more specifics about how you solved all the other issues I've presented.
You didn't succeed at building a CQRS/ES system despite several attempts. Why aren't you asking "what am I doing wrong?" instead of presuming that your personal experiences are sufficient to render informed judgement?
> Additionally, domains are almost never completely understood. Even if they're well understood today, things will change tomorrow. CQRS/ES in your own words is not good when requirements change. Well guess what? That's every system I've ever worked on.
It's still possible to design individual components that don't require constant churn of their application state. Most software teams are incapable of this, and event sourcing is not for them, even in domains where it shines (like finance).
In my experience, when teams have solid leadership, you can get your software pretty close to the target the first time you build it. Minor course corrections are straightforward, if sometimes tedious. When the business experiences big pivots, much of what you've built can be reused providing it's modular and does not make assumptions about the overall system. The rest can be discarded.
That's a big departure from the topic at hand, but my point is that if your software isn't modular, event sourcing in particular will amplify the pain you feel.
I recommend that anyone thinking about doing CQRS/ES find someone who is an expert to help guide them or their team.
Maybe they've been asking since 2010 and, not having received a satisfactory solution from the experts in the field, have stripped all the projects of CQRS/ES and gone back to what works well. There comes a point in time where you stop asking and move on, and expecting them to re-ask on HN is a poor presumption on your part, leading to an uninformed judgment.
Well, there's a lot of people who have successfully deployed ES systems at both the large and small ends of scale. So one might ask, after having looked for and received some answers from people who have done this successfully, where did I misapply or misunderstand the advice?
> I was waiting for the "you're doing it wrong" guy to show up. You win!
Well, when you base an argument on a set of known antipatterns, you shouldn't feign surprise when someone points out that you're basing your argument on known antipatterns.
>Additionally, domains are almost never completely understood. Even if they're well understood today, things will change tomorrow. CQRS/ES in your own words is not good when requirements change. Well guess what? That's every system I've ever worked on.
The first point is flat out untrue. There are domains of expertise with literal centuries of knowledge and practice in them. There are many, many more with decades. And many more still with years. Startups measure knowledge in weeks and months. This is not a suitable playground for ES.
Secondly, I didn't say CQRS/ES was unsuitable when requirements change. I said it required a lot more work when the domain was not well understood—and that the work was primarily in understanding the domain.
I've used some combination of these patterns on nearly every system I've worked on for the last 7 years. That spans medical billing, ticketing, public health, the wedding industry, and for the really esoteric, voting software for college life organizations. Here are the rules I've found:
* Keep it simple. Do not try to apply ES to all areas of your software, if you apply it at all. Use it within small bounded contexts, and guard the data from other BC's. The minute you poke a hole in the BC's data store, you've guaranteed yourself headaches down the road. This means don't try to make your user model something that's ES-based unless you're building an LDAP server or similar.
* CQRS does not require ES. ES does not require CQRS.
* On-demand projections are fine for a lot of purposes; learn to tell when you're going to need a static projection. Key indicators are reporting, background use, and expense of the projection. This is not a complete list of indications.
* A projection is part of a BC. Don't go querying other BC's at runtime for their data. If it's important to the projection, establish a public contract on the events from the other BC, listen to them, and store the data independently. Yes, it's duplicated; that's fine. YMMV.
* Do not try to back ES into an existing application, unless you're a) rebuilding an entire feature silo from scratch; b) building an entirely new feature from scratch; c) there is no C. It's tempting, I've tried it, but your best value for time is to refactor into something more modular, which is the 80/20 value of it.
* If you're going to go async, go async. Build that expectation into your UI. The pain of dealing with async commands comes from figuring out how to get feedback on them. It's a command; there is no feedback. Once it validates, it's done as far as the sender is concerned. A failure to fulfill the contract is itself an event, like any other that comes over your event bus. If you build in the facilities to treat it as such from the beginning, your life is much easier. (A sketch follows this list.)
* Use UUIDs for PKs, and originate them with the client whenever possible. This allows for optimistic concurrency and additional commands to be sent before receiving the results of the original command. Also, track command ids/causation ids as part of the metadata for events. It's not always useful to have, but when it is, it's very useful to have.
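As promised above, a minimal sketch of those last two rules (Python; the bus, command shape, and handler names are all invented):

    import uuid
    from collections import defaultdict

    subscribers = defaultdict(list)

    def publish(event):
        for handler in subscribers[event["type"]]:
            handler(event)

    def handle_change_address(command):
        if not command["new_address"]:
            # A failed command is just another event on the bus,
            # not a return value the sender has to poll for.
            publish({"type": "ChangeAddressRejected",
                     "causation_id": command["command_id"],
                     "reason": "address must not be empty"})
            return
        publish({"type": "CustomerMoved",
                 "causation_id": command["command_id"],
                 "to_address": command["new_address"]})

    subscribers["ChangeAddressRejected"].append(
        lambda e: print("rejected:", e["reason"]))
    subscribers["CustomerMoved"].append(
        lambda e: print("moved to:", e["to_address"]))

    # The client mints the command id itself, so it can keep issuing
    # commands before hearing back about this one.
    cmd = {"command_id": str(uuid.uuid4()), "new_address": ""}
    handle_change_address(cmd)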
I'm sure there's more to say, but a lot of these lessons are basically common knowledge if you're well-read on the subject. A few of them are just things I've learned the hard way—I've broken damn near every one of them at some point, with regrets. That said, you do this enough and you learn which rules can be broken and when to break them, as with any other kind of expertise.
But ES has saved my bacon more than once. I've used it to back out of a poorly designed CRUD model, report on BI questions for years past, even restore data once when a network partition created a gap of several hours with high-frequency writes. (Chalk that up as a good reason to keep your event store independent of your transactional data store.) Yes, there are headaches to it—to pretend like CRUD doesn't have different versions of those headaches is disingenuous, or simply inexperience talking.
Based on his article it looks like he's using subscriptions internally rather than polling. That's a fairly natural thing to do across an Elixir application/cluster.
In terms of conflict resolution, it seems like you'd have to clearly define a scenario where a conflict was possible. Based on the write-up, the state of an address would be based on the aggregate of the events that wrote to it. That seems like it would always lead to the last change winning.
From the write-up of the system, I actually can't imagine trying to do this in anything other than Elixir/Erlang. The set of requirements and challenges to pull it off would be really complicated on just about any other platform.
Pushing read model updates back to the client using a two way communication channel is one technical solution. I want to experiment with using Phoenix channels[1] to solve this. I think that has potential for easing the UI/UX concerns. You post a command from the web front-end and subscribe to receive updates for the read model you're looking at. Domain events can drive the client notification.
If you can write your read model into Mongo, then you can use Meteor to build a real time interface extremely quickly; it tails the database log and dispatches updated records to subscribers practically instantly over websockets, no need for the event processing code to know about how to map to frontend queries. We use this for our production CRM, albeit for internal users. No doubt Phoenix would be more performant and support more databases, but it's nontrivial to build the record-to-subscriber reverse mapping that Meteor brings out of the box. RethinkDB was going to be the Chosen One for this use case, alas...
The address thing is normally solved by the fact that you organise your commands by things that should only logically change together. So a conflict message won't revert irrelevant fields back to their old values.
So your commands should not be
UpdateCustomer
They should be
UpdateCustomerAddress
UpdateCustomerEmail
Etc
For the address, just take the last one. All the business logic I can think of makes this OK.
If your events contain the words "Create", "Update", or "Delete", or any synonym thereof, you're modeling CRUD with events and life is always going to be more complicated than it has to be for you. The names of events are data too—make them representative of the domain.
CustomerMoved(fromAddress, toAddress) is a domain event.
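As a tiny Python sketch (fields illustrative):

    from dataclasses import dataclass

    @dataclass(frozen=True)  # events are facts; freezing keeps them immutable
    class CustomerMoved:
        customer_id: int
        from_address: str
        to_address: str

    print(CustomerMoved(42, "1 Main St", "2 Oak Ave"))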
Dino Esposito describes a "historical" CRUD system in a series in MSDN Magazine: https://msdn.microsoft.com/magazine/mt703431
This is basically ES with CRUD. Not saying ES with CRUD is the best example, but for data which requires audit-trail logic it actually works fairly well.
Haven't read that article, but will check it out, thanks for the link.
My issue with audit logs in CRUD systems is that they're almost always at the row level, which is almost useless when you're trying to make sense of the audit log. An audit log of "operations"—i.e. a command log—is far more useful, and trivial to implement when CQRS is used. I'm guessing that's what this article details...
I would be interested to know why you consider ES "not production ready".
There have been zero reported incidents of data loss in stable released versions which were not as a result of catastrophic failure of the hardware on which it was running (though most people's poor understanding of Azure has caused more problems than everything else put together). Furthermore, this should be no surprise given the testing process[1] continually run.
We had no data loss incidents. My concern was around tooling and maintenance of the database. Once you had to start clearing out bad events in dev environments or doing common administrative chores, there were many holes. I never did find out how to remove specific problem events. We ended up just purging the store every time something went wrong in lower environments.
We had no solution for this problem in prod. How would you fix an event that should not have gotten in to the store? (Wrong contract, bad data etc) I understand it shouldn't happen, but in the real world all kinds of things go wrong.
I can't imagine running a massive system on something like event store. Fortunately the projects always failed well before we got in to any kind of production with real users.
I can see why your projects failed. That EventStore does not support editing an event is not something that makes it "not production ready"; it is in fact a feature that keeps you from doing stupid things.
If you go and edit an event, how do subscribers receive that edit? Let's imagine I have a projection updating a SQL DB and you now edit an event; how will this projection receive the edit?
"We had no solution for this problem in prod. How would you fix an event that should not have gotten in to the store? (Wrong contract, bad data etc) I understand it shouldn't happen, but in the real world all kinds of things go wrong."
You should do some more research into event-sourced systems as there are patterns for handling these exact scenarios. http://files.movereem.nl/2017saner-eventsourcing.pdf discusses some. In your scenario the most common is to read the problem stream out, write it to a new stream (with any changes that you want), then either delete the old stream or leave a last event in the old stream saying it has been migrated to the new stream.
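A minimal sketch of that pattern (Python, with an in-memory dict standing in for the store; names invented):

    streams = {
        "customer-42": [
            {"type": "CustomerRegistered"},
            {"type": "BadEvent"},  # should never have been written
            {"type": "CustomerMoved"},
        ]
    }

    def migrate_stream(streams, old, new, keep):
        streams[new] = [e for e in streams[old] if keep(e)]
        # Replace the old stream's contents with a marker pointing
        # at its replacement, so old references can be followed.
        streams[old] = [{"type": "StreamMigrated", "to": new}]

    migrate_stream(streams, "customer-42", "customer-42-v2",
                   keep=lambda e: e["type"] != "BadEvent")
    print(streams)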
Not saying this is great, and I don't have experience with an ES system in production, but... Could you do one of:
1. Manually remove the specific problem event from the store, then re-run all events starting from the previous accumulated state snapshot stored before that event? Then you get the new state snapshot just by re-processing all the events. This of course assumes you have state snapshots, and may not be feasible if this is a common occurrence and there are too many events to process.
Or:
2. Create a "reverse" event? This will cancel out the bad change and give you a new, valid state to work from and continue. This is nice since the historical state of the system is still represented, but that may be a disadvantage in some situations.
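A minimal sketch of option 2 (Python; the account domain and amounts are invented):

    def apply(balance, event):
        return balance + event["amount"]

    events = [
        {"type": "Deposited", "amount": 100},
        {"type": "Deposited", "amount": 50},  # oops: entered twice
    ]
    # The correction is itself part of history, so anything third
    # parties read earlier stays valid.
    events.append({"type": "DepositCorrected", "amount": -50})

    balance = 0
    for e in events:
        balance = apply(balance, e)
    print(balance)  # 100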
CQRS/ES requires a different way of thinking and a different set of best practices. This could make it easy to shoot yourself in the foot.
That being said:
- Dealing with failures can be better than in traditional systems if done properly. For example, we have services that, if they fail, won't bring down the entire system. However, this does require you to be more explicit about how you handle errors.
- I have found debugging to be easier. When an error occurs, we can trace it back to the exact command and the events it generated. This allows us to see 1. the exact state of the system at the time the error-producing command was generated and 2. the exact command that was executed. From this we can easily reproduce the error.
I have covered "versioning events" in my other comment. Please be more specific about "projection, reporting, maintenance, administration". What exactly were the challenges there?
I understand that ES is not a silver bullet but I would like others to have a clear understanding of the tradeoffs to traditional systems.