
Building Scalable Databases: Perspectives on the War on Soft Deletes - Anon84
http://www.25hoursaday.com/weblog/2009/11/23/BuildingScalableDatabasesPerspectivesOnTheWarOnSoftDeletes.aspx
======
edw519
_One of the annoyances that we have to deal with when building enterprise
applications is the requirement that no data shall be lost._

Since when is a business requirement an "annoyance"? We keep data for lots of
good reasons. Just a few off the top of my head:

    
    
      - IRS requirements
      - SEC requirements
      - SOX requirements
      - data warehousing
      - data auditing
      - delayed undo
      - business intelligence
      - trend reporting & analysis
      - research & development
      - cooperative databases
      - industry databases & statistics
    

_The usual response to that is to introduce a WasDeleted or an IsActive column
in the database and implement deletes as an update that would set that flag._

Wrong. The usual response is to archive to separate tables on disk. This
solves all of the requirements above while streamlining active database
tables. It's not unusual for 98% of all the enterprise data on disk to be in
the archived state.

 _This sort of cleanup now has to be moved up into the application layer._

So what?

One of the biggest mistakes I consistently see is trying to use the facilities
of a database management system to replace application logic. Indexes,
triggers, and stored procedures are all critical tools in the developer's
arsenal, but they do not replace application analysis, design, and
programming. Like security and scaling, archiving is an _architectural
consideration_, not a bolt-on. Including proper data archiving in the
application's design renders the hard vs. soft delete debate pointless.
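
Concretely, that can be as simple as the application owning a routine like
the sketch below (hypothetical names, continuing the schema above and
assuming the orders table also carries status and closed_at columns), run
from whatever scheduled job makes sense:

    def archive_closed_orders(conn, older_than_days=90):
        """Move closed orders older than the cutoff into the archive table.

        Part of the application's design, not a database bolt-on: the
        active table stays lean without any WasDeleted/IsActive flags.
        """
        cutoff = "-%d days" % older_than_days
        with conn:  # one transaction: copy and delete together, or not at all
            conn.execute("""
                INSERT INTO orders_archive (id, customer, total, archived_at)
                SELECT id, customer, total, datetime('now')
                FROM orders
                WHERE status = 'closed' AND closed_at < datetime('now', ?)
            """, (cutoff,))
            conn.execute("""
                DELETE FROM orders
                WHERE status = 'closed' AND closed_at < datetime('now', ?)
            """, (cutoff,))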

~~~
sophacles
I agree that the "hard vs. soft delete" debate is pointless; the article
sets up a false dilemma. The interesting question is how to deal with
archive data. There are a few patterns, but the one presented several times
in this article is the "log it" pattern (if it has an official name, please
let me know). Basically, it transforms all the data into a series of
transactions, each of which is stored. It can happen at the database level
(log each changing query: insert, update, delete), at the data level
(columns for is-filled, is-backordered, and so on, i.e. the soft delete
from the article), or at the application level (each transaction gets its
own row, and its meaning is kept apart from the data itself). Really it
comes down to how often the data will need to be accessed and how
dynamically it needs to be queried.
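
A minimal sketch of the database-level flavor (Python with sqlite3; the
table layout and function names are made up): every change is appended to a
log table, and the current state of a row is whatever its latest log entry
says.

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE change_log (
            seq       INTEGER PRIMARY KEY AUTOINCREMENT,
            entity    TEXT,     -- e.g. 'order'
            entity_id INTEGER,
            action    TEXT,     -- 'insert', 'update', or 'delete'
            payload   TEXT,     -- JSON snapshot of the row after the change
            logged_at TEXT DEFAULT (datetime('now')))
    """)

    def log_change(action, entity, entity_id, payload=None):
        # Append only: nothing is ever overwritten or removed, so a
        # "delete" is just one more row in the history.
        conn.execute(
            "INSERT INTO change_log (entity, entity_id, action, payload)"
            " VALUES (?, ?, ?, ?)",
            (entity, entity_id, action, json.dumps(payload)))

    def current_state(entity, entity_id):
        # Replay semantics: the latest entry wins; a trailing 'delete'
        # means the row no longer exists in the current state.
        row = conn.execute(
            "SELECT action, payload FROM change_log"
            " WHERE entity = ? AND entity_id = ? ORDER BY seq DESC LIMIT 1",
            (entity, entity_id)).fetchone()
        return None if row is None or row[0] == "delete" else json.loads(row[1])

    log_change("insert", "order", 1, {"customer": "acme", "total": 120.0})
    log_change("update", "order", 1, {"customer": "acme", "total": 95.0})
    log_change("delete", "order", 1)
    print(current_state("order", 1))  # None: deleted now, but its history is still in the log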

In the modern age of Hadoop clusters everywhere and cheap clouds, it may
actually be best to keep the database as the maintainer of current state
and the transaction log as the historian (alternative 1 above), shipping
the logs off for data mining on the cluster at the analyst's leisure. The
part I find amusing is that, depending on which angle you come from, this
could be either a hard or a soft delete :).

------
richcollins
We get around having to "clean up" hard deletes in VertexDB by using
garbage collection.

<http://github.com/stevedekorte/vertexdb>

For archival, we append "events" to a node.
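
Roughly this idea, I mean (a toy illustration in Python, not VertexDB's
actual API; the node shape and append_event are invented for the example):

    # Instead of mutating or deleting a node, append an event describing
    # what happened; the full history stays with the node, and unreferenced
    # nodes are reclaimed later by the garbage collector.
    node = {"id": "order/1", "events": []}

    def append_event(node, kind, data):
        node["events"].append({"kind": kind, "data": data})

    append_event(node, "created", {"total": 120.0})
    append_event(node, "deleted", {})   # the node keeps its full history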

