http://www.hagander.net/talks/tardis_orm.pdf (pdf)
Are the only downsides the overhead of writing to the extra schema and the space required? Anyone know how bad the hit would be for the writes?
With this setup it seems like you don't lose much by trying it. No interruption to the current db, you have the extra data if you want it, you can always cull to free up space and you can remove the whole thing at any point.
Or am I missing something?
One big pain point would be if the schema changes over time. What if you add a new NOT NULL column? What if you remove a column? What if you change the type of a column? What if you remove a table? What if you add a new CHECK constraint? Some of these things get mentioned in the video linked by 'tvon, but they are mostly passed over.
The changing schema is not so easily solved. Again, you could be a little looser with it. Don't remove columns from the history table (just mark as nullable). Adding columns would be a similar thing.
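As a rough illustration of the "be a little looser with it" idea (a minimal SQLite sketch; the table and column names are made up): when the live table gains a column, the history table gains it too, but nullable, so older history rows simply carry NULL for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE users_history (id INTEGER, name TEXT, changed_at TEXT);
CREATE TRIGGER users_hist AFTER UPDATE ON users
BEGIN
    INSERT INTO users_history (id, name, changed_at)
    VALUES (OLD.id, OLD.name, datetime('now'));
END;
""")
conn.execute("INSERT INTO users (name) VALUES ('alice')")
conn.execute("UPDATE users SET name = 'alicia' WHERE id = 1")

# Schema change: add a column to both tables. The history copy is
# nullable, so rows recorded before the change just have NULL here.
# (You'd also recreate the trigger to start capturing the new column.)
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
conn.execute("ALTER TABLE users_history ADD COLUMN email TEXT")

rows = conn.execute("SELECT name, email FROM users_history").fetchall()
print(rows)  # [('alice', None)]
```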
I guess I'm thinking of the use case of wanting to get a bit of extra history for free vs. the "needs to be auditable, 100% correct, user-queryable" scenario.
I did a college project with this where you pick a date and the system shows you how the database looked at that date/time. The past was read-only, but the future was editable. The SQL SELECTs are quite annoying and the data grows very fast.
That's not even getting into user requirements. Users almost always eventually want to know what the past data looked like. They always say at the outset of the project that they won't need it, and they always change their mind a year in. And they never understand why you can't "just get it back. You said we had backups."
It's kind of the Wild West out there.
I feel your pain. They don't want to pay for it until they need it, but when they do need it, they need it bad. They want you to implement it retroactively so all the changes made even before you implemented it are there.
When I was in that position, the solution I came up with was to do a very quick and dirty version that logged all changes for every table to a text file. Then I charged separately to parse the text file into the auditable database. I was able to get away with this because the database was tiny.
Once you start developing systems with temporally-oriented schemas, I find that it's difficult to go back. Queries can be painful (and arguably the functionality should be "baked in" deeper down in the RDBMS stack), but the functional result is very nice.
Did you go to the University of Arizona? I know Prof. Snodgrass.
In particular, it has been used in banking applications for quite a long time.
Also, the transaction log files that RDBMSs are built on are append-only and have these same event-sourcing semantics.
In the past I used the RDBMS trigger approach he describes, but have now moved to doing this more in the application layer (typically with ORM signals/events) because you can more easily write unit tests for it and make your code more portable. Neither of those are always requirements, though.
Would be interesting to see an ORM that could express triggers and generate trigger code in a declarative manner. Haven't seen anything like that, but could provide for some performance advantages in certain cases. Probably more complexity than it's worth, though.
However, I see no complexity issue with how we implement CAS semantics. It's 100% equivalent to any other implementation, with all the implications -- reduced concurrency being the primary issue.
We avoid this at all costs for most objects and, like you suggest, resolve conflicts by de-duping events or combining state changes to the final aggregate.
I'm not saying that it's impossible, but how do people manage complexity at that scale, and also the performance overhead?
That said, you don't necessarily want/need to apply ES to every single entity in your model. Typically, you want to apply it to those entities that have complex lifecycles or that are highly dynamic and might involve a lot of complexity.
ES actually often makes it much easier to reason about complex lifecycles because you can isolate the transition logic into discrete units of code. So, it's actually a tool for managing complexity.
In terms of performance overhead, ES usually results in significant write-side performance improvements because you can either remove or dramatically simplify your transaction coordination overhead (increased concurrency, reduced latency).
To address read-side performance, people generally use a cached or materialized view that is lazily updated or implemented with the events as the "write-behind" log for the cached view. This read-side detail is why ES is most often associated with CQRS or lambda-style architectures.
As an example, in banking scenarios, you would determine "current balance" by adding up all of the past transactions on an account. Rather than doing this computation every single time you want the current balance, you typically settle the account in a nightly job and store the "nightly balance" and then compute the "current balance" by only considering new transactions for the day.
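The nightly-settlement idea above can be sketched in a few lines (made-up numbers; the snapshot and transaction shapes are hypothetical):

```python
from datetime import date

# Last settled snapshot: balance as of the nightly job.
nightly_balance = {"as_of": date(2024, 1, 10), "amount": 1500.00}

transactions = [
    {"on": date(2024, 1, 10), "amount": -200.00},  # already settled
    {"on": date(2024, 1, 11), "amount": 50.00},    # today's activity
    {"on": date(2024, 1, 11), "amount": -25.00},
]

def current_balance(snapshot, txns):
    """Fold only the transactions newer than the snapshot date."""
    new = [t["amount"] for t in txns if t["on"] > snapshot["as_of"]]
    return snapshot["amount"] + sum(new)

print(current_balance(nightly_balance, transactions))  # 1525.0
```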
What I gather from Kafka-style architectures is that the storing/persistence works through the abstraction, but the processing itself you have to do separately for each type of entity. If you have many of them and your query requirements are complex, I'm not sure how to abstract that. I might be missing something.
Again, the ideas of caching, lazy updating, and horizontal scaling resonate with me, but I don't see how to write good abstractions that allow me to apply them. For most complex entities it seems I have to do this on a case-by-case basis, and I can't figure out ways to contain/manage that complexity. And while ES helps with write scalability, transactional integrity seems to become a much more complicated question and hard to reason about, doesn't it?
Graphs are simple which made this exercise even easier:
entity(vertex) -> edge($RELATIONSHIP) -> entity(vertex)
user = get_user('emehrkay')
reply = get_reply($THIS_CONTENT)
user_reply = connect(user, reply, 'posted')
save = save(user_reply, source=user)
user('emehrkay') -> posted -> reply($THIS_CONTENT)
reply($THIS_CONTENT) -> event -> event("reply's changes")
event("reply's changes") -> source -> user('emehrkay')
Very fun/cool stuff. I haven't had the need to query against the history yet, so I don't know how simple it would be, or whether it's possible to set a time and query the whole graph against that time (like in the example posted in another comment).
The game backend I've been leading uses event sourcing for almost every "object" in our system. It's proven flexible enough to model both normal CRUD style access patterns and other activity feed-like features that may or may not have an aggregate state.
Some huge advantages we've identified:
* Modeling our objects as state changes has resulted in less code and improved readability. The effects of a state change are much more obvious than a series of CRUD-style updates. This is anecdotal of course.
* Audit trail: we have every mutation for every object in our system. We can go back in time and identify bugs and fix them by just overlaying new change events. We can also feed this data into any number of analytics systems and create materialized views without having to know in advance what kind of data we wish to collect.
* Optimal performance characteristics for our chosen DBs (Redis+DynamoDB). We use Redis for storing incoming events, and DynamoDB to store our periodic snapshots that compress the aggregate state for an object (and keep the event history). This allows us to keep our Redis space usage bounded, as compressions flush the queue for each object. We optimize for DynamoDB pricing/performance by only needing eventually consistent reads for obtaining a snapshot. We keep our write capacity low because the only writes are when we write a new snapshot -- this happens much less frequently and is allowed to fail since a failure only means we continue using the previous snapshot.
* Consistent point in time snapshots. To obtain a backup of the objects, we simply need to snapshot our Redis databases. If we restore, all objects are consistent because the per-object snapshots we store in DynamoDB are immutable and will always exist with any given state of the Redis data. Another consequence of this architecture is that we can take an old snapshot and "go back in time" to the real state of the data.
* Highly concurrent without need for optimistic or pessimistic locking. Redis is obviously super performant and all we use it for is to push event items to the back of lists. For concurrency and conflicts, we can resolve the state of an object by de-duping events that may be invalid or duplicate from the point-in-time view of the aggregate object when the event was pushed. Obviously this won't work for every type of object state, but in our experience it has been extremely resilient.
A few disadvantages:
* Views on the objects must be eventually consistent and must be materialized externally to the storage of the objects.
* More CPU required to create the aggregate state when getting the object. Hasn't been an issue for us and can be mitigated by the frequency of compaction into snapshots.
* Space usage for snapshots. Eventually long lived objects will have a significant cumulative historical size. It's trivial though to traverse the snapshot history and trim the space usage by simply discarding snapshots after a certain point in time. The object will then be constructed from the last snapshot's aggregate state and events that happen after the purge.
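The snapshot-plus-pending-events pattern described above can be sketched as an in-memory toy (the names are made up; the real system uses Redis lists and DynamoDB, and `apply` here is just a dict merge standing in for real event logic):

```python
events = []     # stands in for the per-object Redis list
snapshots = []  # stands in for immutable DynamoDB snapshots

def apply(state, event):
    """Fold one event into the aggregate state (toy: shallow merge)."""
    state = dict(state)
    state.update(event)
    return state

def compact():
    """Fold pending events into a new immutable snapshot, flush the queue."""
    base = snapshots[-1]["state"] if snapshots else {}
    for e in events:
        base = apply(base, e)
    snapshots.append({"seq": len(snapshots), "state": base})
    events.clear()

def current_state():
    """Latest snapshot plus any events recorded after it."""
    base = snapshots[-1]["state"] if snapshots else {}
    for e in events:
        base = apply(base, e)
    return base

events.append({"hp": 100})
events.append({"hp": 90, "x": 3})
compact()                  # snapshot captures {'hp': 90, 'x': 3}
events.append({"x": 4})    # post-snapshot event, still in the queue
print(current_state())     # {'hp': 90, 'x': 4}
```

Trimming history then amounts to discarding old entries from `snapshots`, exactly as the last bullet describes.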
I've been tinkering with a simple game back-end for building up turn-based, multi-player games (an engine for ripping off board games) mostly as a demo/proving ground for ideas and not as a serious project since this is a pretty active space with established projects.
Triggers make things happen sort of "by magic" and it's easy to be confused by their behavior and how they impact transactions. You also need to remember to disable them if you ever want to do some kind of bulk update/correction that you don't want recorded in the history.
My approach is to use stored procedures for the updates, and make the history tracking explicit in that code. Overall this is easier to manage, with fewer gotchas.
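A minimal sketch of making the history write explicit rather than trigger-driven (Python + SQLite standing in for a stored procedure; the table and function names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE accounts_history (id INTEGER, balance INTEGER, changed_at TEXT);
""")

def update_balance(conn, account_id, new_balance, record_history=True):
    """Plays the role of the stored procedure: the history write is an
    explicit statement, so a bulk correction can simply skip it instead
    of having to disable a trigger."""
    with conn:
        if record_history:
            conn.execute(
                "INSERT INTO accounts_history "
                "SELECT id, balance, datetime('now') FROM accounts WHERE id = ?",
                (account_id,))
        conn.execute("UPDATE accounts SET balance = ? WHERE id = ?",
                     (new_balance, account_id))

conn.execute("INSERT INTO accounts (balance) VALUES (100)")
update_balance(conn, 1, 80)                        # recorded in history
update_balance(conn, 1, 75, record_history=False)  # bulk fix, not recorded
print(conn.execute("SELECT COUNT(*) FROM accounts_history").fetchone()[0])  # 1
```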
Perhaps if you're using some sort of abstraction layer, and you can make sure that only developers who've read the docs and are aware of the semantic links between tables that aren't enforced by the database touch the schema, then this can be mitigated?
Even if it doesn't bite you, it will bite someone else.
There have been many times where having the version information has helped us to debug and fix problems. We've had customers call us, freaking out because they accidentally deleted something, or confused because somebody else in the company changed something. We can tell them exactly when, what and by whom the data was changed and easily restore it.
This article is just reinventing Kimball's Slowly Changing Dimensions Type 4.
The article describes two columns, dateEffective and dateEnd. If dateEnd is null, the row is still effective. When the row has been updated or deleted, the dateEnd field is set.
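A small sketch of querying such a pair of columns (hypothetical table and data, using SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE price (item TEXT, amount INTEGER,
                                    dateEffective TEXT, dateEnd TEXT)""")
rows = [  # one superseded row, one current row (dateEnd IS NULL)
    ("widget", 10, "2024-01-01", "2024-06-01"),
    ("widget", 12, "2024-06-01", None),
]
conn.executemany("INSERT INTO price VALUES (?, ?, ?, ?)", rows)

# Current version: the row with no dateEnd.
cur = conn.execute(
    "SELECT amount FROM price WHERE item = 'widget' AND dateEnd IS NULL")
print(cur.fetchone())  # (12,)

# As-of query: the row whose effective range contains the date.
asof = conn.execute("""SELECT amount FROM price
                       WHERE item = 'widget'
                         AND dateEffective <= :d
                         AND (dateEnd IS NULL OR dateEnd > :d)""",
                    {"d": "2024-03-15"})
print(asof.fetchone())  # (10,)
```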
This was implemented in a regulated industry where we had to show the state of the system at given points in time. It worked well for this purpose.
The approach was very similar. The current versions of objects were stored in a set of tables, while old versions were stored in separate tables. In order to reduce bloat, I employed a few fairly simple compression techniques. They were reasonably good. On desktop-class machines with a Pentium processor, 256 MB of RAM and spinning hard disks, searching a million objects completed in single digit seconds.
Later, I added bit-mapped indices with dynamically self-adjusting bucket sizes that were determined based on locally-weighted densities along their respective dimensions. They reduced search time on a million objects to tens of milliseconds, with full history searched.
The obvious downside was that there was no SQL for ad hoc queries! So, a command line shell was provided with Ruby as the language, talking to a C API behind the scenes.
All in all, the system worked and scaled very well.
If I think I'm updating version 2 of an entity, then only update that version, otherwise throw an error.
I've updated the post to address this scenario.
NEW.version := OLD.version + 1;
Is there a different concurrency issue I'm not seeing caused by updating the version in the trigger?
You still have the issue of someone setting the version explicitly in an update statement without a WHERE clause which checks for the current version. But I guess as long as you enforce not doing that it's fine.
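For reference, that WHERE-clause version check amounts to a compare-and-swap. A minimal sketch (SQLite; the table and helper names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
conn.execute("INSERT INTO docs VALUES (1, 'v1 text', 1)")

def update_doc(conn, doc_id, new_body, expected_version):
    """Only succeed if the version we originally read is still the
    version in the table, bumping it in the same statement."""
    cur = conn.execute(
        "UPDATE docs SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_body, doc_id, expected_version))
    if cur.rowcount == 0:
        raise RuntimeError("stale version: somebody updated the row first")

update_doc(conn, 1, "v2 text", expected_version=1)       # succeeds
try:
    update_doc(conn, 1, "v2 again", expected_version=1)  # stale, raises
except RuntimeError as e:
    print(e)
```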
And I call a proc to add the stuff to the table to create the audit tables and triggers.
Works well. I also capture information like the host and session references in an audit batch record that points to the audit detail.
I also employ the insert only table for some very special datasets where looking up the history in a quick manner is important.
Is it common to name timestamps _on, like version_on? That doesn't seem very clear to me. What's the rationale?
As soon as the app needs to generate a list of, e.g., active users, this can get nasty pretty quickly. Sure, one could use an aggregate like max(version), but once there are joins or lots of records involved, this could become a performance hit...
EDIT: the records are kept in a separate table, so my arguments are moot.
EDIT I hadn't finished reading the article, so I didn't see that they were using triggers and the history table. Using the approach I outlined you could keep everything in the same table (if you were that way inclined).
The disadvantage is bloat, as in our case the history data was rarely used at all and was just there as a safeguard.
The one thing I would add is that using an RDBMS does not preclude you from storing events along with your relational data. So you have a representation of current state, but also all of the events that took place to get you to that stage. And that will give you the same type of history.