

The trouble with soft delete - concretecode
http://richarddingwall.name/2009/11/20/the-trouble-with-soft-delete/

======
patio11
A million inserts into a table a year is causing performance problems?
Assuming 200 work days and all accesses falling within a four-hour window,
that is about one row inserted every three seconds. DBs are not my bag, baby,
but that presumably should not be killing you.

At the day job, when we do a soft delete as defined here, we tend to create a
view of the active rows in the table. Accessing through the view rather than
the table prevents many of the "Whoopsie, missed a WHERE clause, now I'm
summing over deleted records" errors. I'm told it also improves performance
but take anything I say about DBs with a grain of salt.
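The view-over-active-rows pattern described above can be sketched in a few lines of SQL; here is a runnable version using Python's sqlite3 (table and column names are invented for illustration):

```python
import sqlite3

# The table keeps every row; the view exposes only the active ones, so a
# query that forgets the WHERE clause can't sum over soft-deleted records.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER,
                         is_deleted INTEGER NOT NULL DEFAULT 0);
    CREATE VIEW active_orders AS
        SELECT id, amount FROM orders WHERE is_deleted = 0;
""")
db.executemany("INSERT INTO orders (amount, is_deleted) VALUES (?, ?)",
               [(100, 0), (250, 0), (999, 1)])

# Application code queries the view, never the raw table.
total = db.execute("SELECT SUM(amount) FROM active_orders").fetchone()[0]
print(total)  # 350: the soft-deleted 999 row is excluded
```

Whether the view also helps performance depends on the engine; the correctness benefit (no accidental sums over deleted rows) holds regardless.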

~~~
bmj
We use soft deletes in our system (primarily because we _have_ to maintain all
data), and we use views to retrieve results. The views also flatten the data a
bit, joining appropriate tables, which again simplifies queries.

------
tumult
_For example, in our job board application, we might store open, expired and
filled listings using a table-per-class strategy: [image of some class
hierarchy]_

That ain't no _man's_ soft delete, boy. You want to know what soft delete is?
Well that's too bad, cuz there ain't no delete. Move from one table to
another? What in tarnation?

Boy, you need yourself the sequence of actions that produce some kind of
result. You don't store the result itself, ya'll just go an' fuck it up or
something later. "Posted a job offer for Tractor Operator at
2009-11-21.12:44:26." See, that there's an action. "Accepted job offer for
Tractor Operator." There's another action. Now what's the status of that job?
You get your list of actions, then figure it out. Yes, it's taken, you would
have just injured yourself on that tractor anyway.
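Dialect aside, the idea above is event sourcing: store the actions, then replay them to derive the current state. A minimal sketch (event names and the status mapping are invented here):

```python
from datetime import datetime

# The log of actions is the source of truth; nothing is ever deleted.
events = [
    ("posted",   "Tractor Operator", datetime(2009, 11, 21, 12, 44, 26)),
    ("accepted", "Tractor Operator", datetime(2009, 11, 22, 9, 15, 0)),
]

def status(job, log):
    """Replay the log in order; the last action for a job decides its status."""
    state = "unknown"
    for action, title, _when in log:
        if title == job:
            state = {"posted": "open", "accepted": "filled"}[action]
    return state

print(status("Tractor Operator", events))  # filled
```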

------
russell
Dingwall lumps a number of separate issues under "soft delete": undo, audit
trails, soft create, and performance in the presence of historical data. He
presents several solutions, not all of which I would buy.

The is_deleted column is a pretty simple approach that we all use, and there
are a number of solutions to the problem of retrieving only the active rows,
such as views.

Audit trails and performance are more interesting. A while back I worked for a
web analytics company and we had the problem of the storage and performance
costs of historical data. Only 5% of the data was of any real interest, but
the 95% historical data made writes slow because of the large number of
indexes. We adopted the solution of historical tables with fewer indexes on
cheaper drives.

I like the solution of serializing historical, deleted, and audit data and
storing them in a NoSQL database of your choice. Then you can bring them back
as individual undos, or into a data mining database for scenario playing.

I don't particularly like his suggestion of creating separate tables for each
state of an element. I think that's needless complexity.

~~~
jbert
> but the 95% historical data made writes slow because of the large number of
> indexes.

Why does updating an index take more time if there are many table rows?

i.e. if you have 5 indices on a table and 1 write/second, what's the
difference in the work done by the db whether there are 1000 rows or 100k
rows?

~~~
nixme
Simple explanation: most database indexes are forms of B-trees. Insertion
into a B-tree isn't constant time O(1); it's O(log n) in the number of rows.
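As a back-of-the-envelope illustration (pure arithmetic, not a benchmark) of why more rows means more work per insert:

```python
import math

# A balanced-tree insert visits roughly log2(rows) nodes, and every index
# on the table pays this cost on each write.
for rows in (1_000, 100_000, 10_000_000):
    print(f"{rows:>10} rows -> ~{math.ceil(math.log2(rows))} node visits")
```

So going from 1,000 to 100,000 rows doesn't make an insert 100x slower, but it does make each of your five indexes noticeably more expensive to maintain.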

~~~
jbert
Thanks, I was thinking in terms of hashes (which doesn't really make sense for
searching/ordering).

------
snowbird122
Separating one table into multiple tables based on status has caused me a lot
of pain over the years. Now I have multiple tables I need to join together
more often than I would hope. Is this really the best way?

------
truebosko
I read this article and the entire time I was thinking how easily this
problem is solved with a good ORM, like Django's.

All I do is set up two managers: one the default, so in the admin I can
browse my records without any filtering, and a second one, used exclusively
in my views, that simply adds .filter(hidden=False) to the get_query_set
method. So simple, and I have never run into any problems.
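The two-manager shape can be sketched in plain Python (class and field names are invented; the real version would subclass django.db.models.Manager and override get_query_set):

```python
class Record:
    def __init__(self, title, hidden=False):
        self.title = title
        self.hidden = hidden

class AllRecords:
    """Analogue of the default manager: no filtering, used in the admin."""
    def __init__(self, rows):
        self.rows = rows
    def all(self):
        return list(self.rows)

class VisibleRecords(AllRecords):
    """Analogue of the second manager: adds the hidden=False filter."""
    def all(self):
        return [r for r in super().all() if not r.hidden]

rows = [Record("a"), Record("b", hidden=True), Record("c")]
print([r.title for r in AllRecords(rows).all()])      # ['a', 'b', 'c']
print([r.title for r in VisibleRecords(rows).all()])  # ['a', 'c']
```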

------
rmason
Always record the date of the soft delete. That way you can have a daily
process to move records x days old to a historical table.

If it's something like a rapidly changing customer list, then delete any
customers whose soft-delete date is more than three years old.
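That daily archival job is a pair of statements; here is a runnable sketch with sqlite3 (schema, names, and the 30-day cutoff are invented for illustration):

```python
import sqlite3
from datetime import date, timedelta

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers
        (id INTEGER PRIMARY KEY, name TEXT, deleted_on TEXT);
    CREATE TABLE customers_history
        (id INTEGER PRIMARY KEY, name TEXT, deleted_on TEXT);
""")
today = date(2010, 1, 1)
db.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "alice", None),                          # active (NULL deleted_on)
    (2, "bob",   str(today - timedelta(days=5))),   # recently soft-deleted
    (3, "carol", str(today - timedelta(days=90))),  # stale soft delete
])

# The daily job: archive soft-deleted rows older than the cutoff.
# Active rows are safe: NULL < cutoff is not true, so they never match.
cutoff = str(today - timedelta(days=30))
db.execute("INSERT INTO customers_history "
           "SELECT * FROM customers WHERE deleted_on < ?", (cutoff,))
db.execute("DELETE FROM customers WHERE deleted_on < ?", (cutoff,))

live = db.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
archived = db.execute("SELECT COUNT(*) FROM customers_history").fetchone()[0]
print(live, archived)  # 2 1
```

In a real system you would wrap the INSERT...SELECT and DELETE in one transaction so a crash can't lose or duplicate rows.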

------
earl
So Richard basically claims that one of the big problems with soft delete is
that, from now on, all queries against the table have to add a where clause
excluding soft-deleted rows; he claims this is some sort of tax.

Is he that ignorant of what a reasonable ORM (or at least, Ruby on Rails) will
do for you? e.g. named_scope, etc...

~~~
jrockway
Yes, he is. I could implement the soft-delete feature in the database, write
one class (that sets up the queries chained off of it to be aware of
deletions), and then substitute that class for the original via the dependency
injection system. Total effort? 15 minutes. Total changes to the application
logic and the code that actually queries the database? 0.

If you are writing PHP, where you hard-code a query and then print out HTML as
you iterate over the result set, then sure, you're fucked. But not because of
soft deletes.
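The class-substitution idea above can be sketched in plain Python (all names are invented; the point is that only the binding changes, not the call sites):

```python
class OrderQueries:
    """The original query class the application codes against."""
    def __init__(self, rows):
        self.rows = rows
    def all(self):
        return list(self.rows)

class SoftDeleteAwareOrderQueries(OrderQueries):
    """Drop-in replacement: every query silently excludes deleted rows."""
    def all(self):
        return [r for r in super().all() if not r.get("deleted")]

def make_queries(rows, cls=OrderQueries):
    # Stand-in for the dependency injection container binding: to turn
    # on soft-delete awareness, swap the class here and touch nothing else.
    return cls(rows)

rows = [{"id": 1}, {"id": 2, "deleted": True}]
q = make_queries(rows, cls=SoftDeleteAwareOrderQueries)
print([r["id"] for r in q.all()])  # [1]
```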

~~~
prodigal_erik
That only works if everyone you work with religiously uses a single ORM and
dynamic programming language, because it makes calls to that ORM the only
(cumbersome) query language that gives correct results. It doesn't fix the ad
hoc SQL query your marketing guys are pasting out of email from a developer
who went to another startup last year (yes, I have seen this happen).

~~~
earl
Yeah, but any SQL query can go stale...

