
Taking the 'D' Out of CRUD with Soft Updates - smerchek
http://scottsmerchek.com/2015/08/04/taking-the-d-out-of-crud-with-soft-updates
======
pjungwir
There was a very interesting presentation this year at PGConf New York with
something similar [1], but where the history table has the _same name_ as the
normal table, but lives in a separate history schema. Also there is a
timetravel schema whose tables again have the same name, but are views of the
history schema, with an extra WHERE filter to only show rows that were active
at a given time. So by changing your schema_search_path and adjusting the
"current_time" setting, you can easily query the whole database as if it were
some other time! I haven't tried this out yet, but it looks to be very
compatible with an ORM like ActiveRecord. This is something I've been looking
for for a long time.

[1]
[http://www.hagander.net/talks/tardis_orm.pdf](http://www.hagander.net/talks/tardis_orm.pdf)
(pdf)
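
A rough sketch of what that layout might look like in Postgres. All names here are mine, not necessarily the talk's, and "my.current_time" is an assumed custom setting:

```sql
-- the normal table the app writes to
CREATE TABLE public.users (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

-- same-named history table in its own schema, with a validity range
CREATE TABLE history.users (
    id         integer,
    name       text,
    valid_from timestamptz NOT NULL,
    valid_to   timestamptz          -- NULL = still current
);

-- same-named view that only shows rows active at the chosen time
CREATE VIEW timetravel.users AS
SELECT id, name
FROM history.users
WHERE valid_from <= current_setting('my.current_time')::timestamptz
  AND (valid_to IS NULL
       OR valid_to > current_setting('my.current_time')::timestamptz);

-- then, per session:
SET search_path TO timetravel;
SET my.current_time TO '2015-01-01 00:00:00';
SELECT * FROM users;  -- the world as of New Year's
```

Since the view has the same name and columns as the real table, an ORM like ActiveRecord shouldn't need to know the difference.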

~~~
aidos
That seems like a really clean way of building the abstraction.

Are the only downsides the overhead of writing to the extra schema and the
space required? Anyone know how bad the hit would be for the writes?

With this setup it seems like you don't lose much by trying it. No
interruption to the current db, you have the extra data if you want it, you
can always cull to free up space and you can remove the whole thing at any
point.

Or am I missing something?

~~~
pjungwir
I agree it seems pretty great.

One big pain point would be if the schema changes over time. What if you add a
new NOT NULL column? What if you remove a column? What if you change the type
of a column? What if you remove a table? What if you add a new CHECK
constraint? Some of these things get mentioned in the video linked by 'tvon,
but they are mostly passed over.

~~~
aidos
That's a good point. I guess in terms of the constraints you wouldn't mirror
those to your history schema anyway?

The changing schema is not so easily solved. Again, you could be a little
looser with it. Don't remove columns from the history table (just mark as
nullable). Adding columns would be a similar thing.

I guess I'm thinking of the use case of wanting to get a bit of extra history
for free vs the "needs to be auditable, 100% correct, user-queryable" scenario.

------
brixon
He is talking about Effective Dating, except in his version he can never use
it. If you don't need to reference past or future dates then his idea is nice,
since it keeps the primary tables smaller and the SQL easier to handle.

Effective Dating: [https://talentedmonkeys.wordpress.com/2010/05/15/temporal-
da...](https://talentedmonkeys.wordpress.com/2010/05/15/temporal-data-in-a-
relational-database/)
[http://www.cs.arizona.edu/~rts/tdbbook.pdf](http://www.cs.arizona.edu/~rts/tdbbook.pdf)

I did a college project with this where you pick the date and the system will
show you how the database looked at that date/time. The past was read only,
but the future was editable. The SQL Selects are quite annoying and the data
grows very fast.

~~~
moron4hire
I personally believe that the majority of CRUD projects actually need Temporal
databases. I've come to realize that--in my 12 year career--every project I've
worked on was in some way covered by at least one form of government
regulation (whether Sarbanes-Oxley or HIPAA) that required the ability to
audit change. The problem is, most organizations (at least the ones I've
worked for) don't know that they are covered by these sorts of regulations,
because most places are sub-100 employee consultoware shops that are just
reimplementing ERP for small B2B clients who also don't have much visibility
on their regulatory coverage.

That's not even getting in to user requirements. Users almost always
eventually want to know what the past data looked like. They always say at the
outset of the project that they won't need it, and they always change their
mind a year in. And they never understand why you can't "just get it back. You
said we had backups."

It's kind of the Wild West out there.

~~~
EvanAnderson
Commenting to voice my agreement. Modeling and querying temporal schema in a
traditional RDBMS is difficult so most developers don't even try to do it
(and, when they do, they do it badly).

~~~
moron4hire
Yep, that's exactly right. So many shops live and die off of new college grads
who--at best--only know relational algebra and nothing about optimization.
I've seen a rare few people coming out of college who could design a schema
that didn't completely destroy their data, but it's been rarer still to see
anyone of any experience level in that environment who knows anything about
making it work in a performant manner.

------
numbsafari
Another approach that is often used for these things is now generally referred
to as Event Sourcing. You'll see it mostly associated with Domain Driven
Design (DDD) and Command Query Responsibility Segregation (CQRS), however it is
a technique that has been widely used for decades prior to either of those
terms being made popular.

In particular, it has been used in banking applications for quite a long time.

~~~
kodablah
While I use event sourcing w/ Akka persistence, I find it is not a panacea for
the problems that the blog post may reference. I find that oftentimes people
store the resulting state in memory which may become unwieldy. And if you're
storing it on DB/disk (not to be confused w/ storing your events/snapshots)
you're basically back to the way traditional CRUD works except you have a
stream of events if you need a new data structure from scratch. Also, many
times event sourcing offers only eventual consistency; while it is obviously
preferable to develop your app that way for scalability reasons, not all apps
can do that -- some need atomic state.

~~~
hesdeadjim
For atomic updates we implement this via CAS-like semantics. A caller who
obtains a point-in-time aggregate view of an object (equivalent to the 'R' in
CRUD) is also given the sequence number of the latest event. They can then
conditionally push new events into the object's queue and request a rejection
if the current sequence id is not equal to the one they have. Since we use
Redis for the queue, the event pushes have a consistent view of the queue at
the time of the attempted push.

~~~
kodablah
While scalable, this basically turns it bidirectional by adding a side-
channel for conflict resolution, which can add system complexity. This is why
many event sourcing approaches recommend making the event represent the
mutation (e.g. count + 1) instead of the result (e.g. count = 15)...when it
does have to be atomic, I just give up and use a centralized place like you
have with Redis. Such is life with atomic, distributed systems...they can only
be so distributed and still reject conflicts in real time.
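
To make the mutation-vs-result distinction concrete (table and payloads invented for illustration), delta events can be replayed and merged in any order, while snapshot events make the last writer silently win:

```sql
CREATE TABLE events (
    stream_id integer   NOT NULL,
    seq       bigserial PRIMARY KEY,
    body      json      NOT NULL
);

-- event as mutation: two concurrent writers both survive a replay
INSERT INTO events (stream_id, body) VALUES (7, '{"op": "add", "amount": 1}');
INSERT INTO events (stream_id, body) VALUES (7, '{"op": "add", "amount": 4}');
-- folding from 0 gives 5 regardless of arrival order

-- event as result: whichever arrives last overwrites the other
INSERT INTO events (stream_id, body) VALUES (7, '{"op": "set", "count": 15}');
```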

~~~
hesdeadjim
Most of our events do in fact represent the state change as the mutation.

However I see no complexity issue with how we implement CAS semantics. It's
100% equivalent to any other implementation with all the implications --
reduced concurrency being the primary issue.

We avoid this at all costs for most objects and, like you suggest, resolve
conflicts by de-duping events or combining state changes to the final
aggregate.

------
ams6110
I've used this approach, and although there are pros and cons, I come down
against the idea of using triggers to record the history records.

Triggers make things happen sort of "by magic" and it's easy to be confused by
their behavior and how they impact transactions. You also need to remember to
disable them if you ever want to do some kind of bulk update/correction that
you _don't_ want recorded in the history.

My approach is to use stored procedures for the updates, and make the history
tracking explicit in that code. Overall this is easier to manage, with fewer
gotchas.
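
A minimal sketch of that style in PL/pgSQL, with table and column names I've made up:

```sql
CREATE OR REPLACE FUNCTION update_user(p_id integer, p_name text)
RETURNS void AS $$
BEGIN
    -- record the old row explicitly -- no trigger magic
    INSERT INTO users_history (id, name, archived_at)
    SELECT id, name, now() FROM users WHERE id = p_id;

    UPDATE users SET name = p_name WHERE id = p_id;
END;
$$ LANGUAGE plpgsql;

-- a bulk correction that should NOT be recorded just bypasses the procedure:
-- UPDATE users SET name = trim(name);
```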

~~~
sleepychu
How do you deal with future developers acting on the table? Triggers ensure
that later on I don't develop some new code and forget to update the history
(and does all your code deal with inconsistent/damaged history?).

Perhaps if you're using some sort of abstraction layer, you can make sure that
only developers who've read the docs -- and are aware of the semantic links
between tables which aren't enforced by the database -- touch those tables, and
then this can be mitigated?

------
jakejake
We do the same thing except we don't keep the version number in the main table
since we have no reason to ever display that unless we're looking at the
history anyway. We have a history table with two JSON columns representing the
before and after state, so we don't need to keep columns in sync. (Technically
the before state might be redundant, but it makes comparisons simpler). I
really like this idea of using triggers, though. We handle it at the ORM event
level so the DB isn't entirely self-sufficient.
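
A sketch of what such a history table might look like (names are assumed, not ours exactly):

```sql
CREATE TABLE entity_history (
    id         bigserial   PRIMARY KEY,
    entity_id  bigint      NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now(),
    changed_by text,
    before     json,   -- NULL for inserts
    after      json    -- NULL for deletes
);
```

Keeping both states in one row means a diff never has to join successive history rows, which is where the redundancy pays for itself.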

There have been many times where having the version information has helped us
to debug and fix problems. We've had customers call us, freaking out because
they accidentally deleted something, or confused because somebody else in the
company changed something. We can tell them exactly when, what and by whom the
data was changed and easily restore it.

------
meritt
It'd be nice if we could see a Postgres patch that works exactly like Oracle's
Flashback [1]. You can simply do: SELECT * FROM some_table AS OF TIMESTAMP
'2015-07-01 08:05:04' and it works as expected.

This article is just reinventing Kimball's Slowly Changing Dimensions Type 4
[2].

[1]
[http://docs.oracle.com/cd/B28359_01/appdev.111/b28424/adfns_...](http://docs.oracle.com/cd/B28359_01/appdev.111/b28424/adfns_flashback.htm)

[2]
[https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type...](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_4)

~~~
vmind
Postgres used to have timetravel built in, but there's still a basic
timetravel implementation in the contrib:
[http://www.postgresql.org/docs/9.4/static/contrib-
spi.html](http://www.postgresql.org/docs/9.4/static/contrib-spi.html)

------
bink-lynch
We implemented a "point-in-time" database architecture to accomplish similar
behavior based on this document:

[https://www.simple-talk.com/sql/database-
administration/data...](https://www.simple-talk.com/sql/database-
administration/database-design-a-point-in-time-architecture)

The article describes two columns, dateEffective and dateEnd. If dateEnd is
null, the row is still effective. When the row has been updated or deleted,
the dateEnd field is set.
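
With those two columns, an as-of query is just a range check. Assuming a table named `accounts` with the columns the article describes:

```sql
-- state of the system as of a given point in time
SELECT *
FROM accounts
WHERE dateEffective <= '2015-07-01 08:05:04'
  AND (dateEnd IS NULL OR dateEnd > '2015-07-01 08:05:04');

-- current rows are simply the ones with no end date
SELECT * FROM accounts WHERE dateEnd IS NULL;
```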

This was implemented in a regulated industry where we had to show the state of
the system at given points in time. It worked well for this purpose.

------
sigmaml
During 2004-5, I designed and implemented a K-V object store on the top of
BerkeleyDB. It served as the persistence layer for a number of applications my
then company developed for regulated industries.

The approach was very similar. The current versions of objects were stored in
a set of tables, while old versions were stored in separate tables. In order
to reduce bloat, I employed a few fairly simple compression techniques. They
were reasonably good. On desktop-class machines with a Pentium processor, 256
MB of RAM and spinning hard disks, searching a million objects completed in
single digit seconds.

Later, I added bit-mapped indices with dynamically self-adjusting bucket sizes
that were determined based on locally-weighted densities along their
respective dimensions. They reduced search time on a million objects to tens
of milliseconds, with full history searched.

The obvious downside was that there was no SQL for ad hoc queries! So, a
command line shell was provided with Ruby as the language, talking to a C API
behind the scenes.

All in all, the system worked and scaled very well.

------
mapleoin
I don't understand what the 'version' column is for. You can get the sequence
by just looking at the 'version_on' attribute. Deletes are already restricted
on that table and it wouldn't solve concurrency issues since it's generated by
the trigger.

~~~
smerchek
Concurrency issues can be solved by restricting the UPDATE to the version
passed in by the application.

If I think I'm updating version 2 of an entity, then only update that version,
otherwise throw an error.

I've updated the post to address this scenario.

~~~
mapleoin
I don't see the update in the post. If you pass the version in an UPDATE
statement it will get overridden by the trigger.

 _NEW.version := OLD.version + 1;_

~~~
smerchek
The version is part of the WHERE clause in the UPDATE statement. Therefore, it
is only possible to update the version of the entity that you were updating.
If the entity is updated before your statement, it would not succeed in
updating the row.

Is there a different concurrency issue I'm not seeing caused by updating the
version in the trigger?
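
In other words, an update that had read version 2 would run something like this (column names assumed):

```sql
UPDATE entities
SET name = 'new name'
WHERE id = 42
  AND version = 2;   -- the version the application read

-- If another writer got there first, version is now 3, zero rows
-- are affected, and the application raises an error or retries.
```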

~~~
mapleoin
Yes, you're right. I didn't notice the WHERE in the UPDATE statement. That
makes a lot more sense now, thanks!

You still have the issue of someone setting the version explicitly in an
update statement without a WHERE clause which checks for the current version.
But I guess as long as you enforce not doing that it's fine.

------
buckbova
I rarely do a D in prod. I've built a generic trigger function for postgres
and proc for sql server that will do the auditing in K-V format on every
table.

And I call a proc to add the stuff to the table to create the audit tables and
triggers.

Works well. I also capture information like the host and session references in
an audit batch record that points to the audit detail.

I also employ the insert only table for some very special datasets where
looking up the history in a quick manner is important.
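
I don't know the exact function, but a generic audit trigger along those lines might look like this sketch (using json rather than a K-V type, and with names I've made up):

```sql
CREATE TABLE audit_log (
    tbl      text        NOT NULL,
    action   text        NOT NULL,   -- INSERT / UPDATE / DELETE
    row_data json,
    at       timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION audit() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO audit_log (tbl, action, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD));
        RETURN OLD;
    ELSE
        INSERT INTO audit_log (tbl, action, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- one trigger per audited table:
CREATE TRIGGER users_audit
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE PROCEDURE audit();
```

Because the function only uses TG_TABLE_NAME and the row records, the same function can be attached to every table without modification.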

------
Confiks
Completely offtopic, but the site loads 1.1 MiB worth of a grinning man at a
resolution of 2446x2446, which is then rendered as an icon at 91x91 pixels.

~~~
smerchek
Totally noticed this in Google Analytics earlier. Whoops. It's now fixed!

------
jasonjei
I do something similar because we have drafts (first drafts and drafts of
existing records) in our system. My question is whether there is any downside
to maintaining drafts and historic revisions in the same table as opposed to
maintaining it in a separate table as the OP does. Thoughts? Bloat and table
optimization come to mind as potential problems, but it does make it easier to
query for records.

------
1wd
Could you get rid of the is_removed column again (by copying the deleted row
to the history table) so the main table again has only the currently active
rows and simplify queries? (Moot if both tables are merged and views are used
as suggested by others.)

Is it common to name timestamps _on, like version_on? That doesn't seem very
clear to me. What's the rationale?

------
ahachete
Oh, please, my eyes hurt so much. Please, please, please change that "WHERE
boolean_column IS FALSE" to "WHERE NOT boolean_column". Thank you ;P

------
alternize
the problem seems only half solved - the author does not show how they select
(only) the most recent record.

as soon as the app needs to generate a list of e.g. active users, this can get
nasty pretty quickly. sure, one could use an aggregate like max(version), but
once there are joins or lots of records involved, this could become a
performance hit...

EDIT: the records are kept in a separate table, so my arguments are moot.

~~~
dwrowe
Why? They're modifying the main table, which isn't growing with each update.
The changes are simply tracked in the history table.

~~~
alternize
oh right. small but important detail... thanks for pointing it out to me.

~~~
smerchek
Valid question though, I've updated the post to be more explicit and added a
couple more query examples.

