Hacker News new | past | comments | ask | show | jobs | submit login
Soft Deletion Probably Isn't Worth It (2022) (brandur.org)
84 points by thunderbong 10 months ago | hide | past | favorite | 62 comments



> And forgetting that extra predicate on deleted_at can have dangerous consequences as it accidentally returns data that’s no longer meant to be seen.

For postgres you could consider using a view instead. See https://evilmartians.com/chronicles/soft-deletion-with-postg...


This sort of logic is a good candidate for views. Of course, someone could choose not to use the view, but you can solve it with DB permissions, etc.


You can do the same in other dbms. Just use an instead of delete trigger instead of the rule.


Soft-deletion ("recycle bin") is an important feature in many user-facing applications.

But if it's not surfaced in UI then I agree it doesn't make much sense.


Soft delete is not a recycle bin but the process of marking a record as deleted and revoke access instead of actually deleting it.


One of the most common reasons to “mark a record as deleted” rather than deleting it is so you can undo deletes without having to fish data out of a backup. I.e. soft delete is effectively a recycle bin. It might not be a user-facing recycle bin, but it’s a recycle bin nonetheless.


That makes absolute sense.

The way it was overlooked by me is because a ‘bin’ is something you clear regularly, but deleted records will likely persist forever.

That being said; I once helped out a friend who’s computer became slow to a crawl, only to find out they’d never emptied their bin so storage had only a few GB of free space left.


> The way it was overlooked by me is because a ‘bin’ is something you clear regularly, but deleted records will likely persist forever.

Clear regularly, but usually also has the option to undo. The article sort of suggests something similar to clearing regularly.

> So you may eventually find yourself writing a hard deletion process which looks at soft deleted records beyond a certain horizon and permanently deletes them from the database.

Another other reason to do soft-deletes up front, then hard delete later is because in some DBs (eg: OLAP), deletes can be extremely expensive. It's sometimes better to soft-delete (or invalidate references), and handle the hard-deletes in a batch-oriented manner on a regular basis.


>deleted records will likely persist forever.

If space is an issue, you can have a scheduled task to really delete all "deleted" items that were last updated e.g. more than 3 months ago.


Large tables use the delete table. Small tables use the deleted_at column.

Surprised you never got a request to undelete something. It’s such a common customer request in b2b saas.


Always. And sometimes they are a big customer who screwed up something big and if you can not restore it easily, you have to go find a DB backup and write some scripts to bring it back anyhow.

Soft delete is a necessity in SaaS. And then combine it with hard delete either triggered manually (empty recycle bin) or time based (after a month, etc.)


Exactly. As a “backup” for when a client shoots themselves in the foot (or face!) it’s invaluable. But only if the client is worth enough to bother.

If you don’t have high value clients I can see how it’s unnecessary I suppose.


The way I like to solve this is using database snapshots. Just make sure you have an incremental snapshot setup for:

- Last round hour

- Last round day

- Last round week

These periods can be adjusted to your/client needs. If you need something that was deleted, you restore the snapshot into a new live db and copy whatever you need. You could charge for solving "I just shot myself in the foot" support requests based on the cost of restoring those snapshots.

From my experience, this is much cleaner than having a spread of soft delete columns all over the codebase.


I mean, if you have a functional fast acting organization, maybe.

It seems in the vast majority of organizations I talk to, this is not how things work at all. To get a db restore on a different box you have to have at least 2 different teams involved. One to provide a machine/vm/aws instance to restore to, and another to provide the snapshot/backup.

With a soft delete all you need is the application administrator that has SQL access, which is typically far closer to the person that needs the data restored than the operations team is, hence it gets 'fixed' in a more timely fashion.


That's unfair. For the snapshot scenario, you described the bureaucratic sad path for doing it. For the soft delete, you used a happy path.

Let's compare both in either best-scenario or worst-scenario:

Snapshots on good team with no bureaucracy: Everything is automated. Even retrieval is automated via individual, commited, retrieval commands. Setup is done once and requires little maintenance, retrievals can be reused if fallen in same category (restore account, restore conversation, etc).

Snapshots on shitty team with lots of bureaucracy: Everything is manual, AWS-panel operated and only a few people have access, you have to escalate to initiate the proccess.

Soft deletes on good team with no bureaucracy: Everyones respect the soft deletion, no one queries it or do funny stuff unless for recovery purposes. The soft delete related columns, structures and tools are standardized.

Soft deletes on shitty team with lots of bureaucracy: There might be multiple, competing standards for soft delete column naming and structure. Other teams ignore the soft delete and query hidden data either knowingly (tricky hacks) or unknowingly (data team unaware of soft delete model), you can't do shit about it. When it breaks your soft deleted data, you're the one who has to fix it.

If we put external stuff in the mix (teams, people, org structures) we can spin it the way we want, but that doesn't reveal much information about the technical issue itself (having some kind of reasonable data recovery).


I hate that I agree with the author overall. I mean I probably will keep adding a `deleted` column to my tables and cursing myself when I mess up queries to hide them but I can't help thinking they will save me a ton of headache one day...

It's one of those "If you have it you will never need it but if you don't you will wish you had it"-types of things. I'm sure it's silly/stupid but it makes me feel better/safer.


Why not name your database "my_db_all" and then create a view called "my_db" that's just "select * from my_db_all where deleted != true", or something to that effect?


This seems to ignore the issues raised concerning foreign keys and consumer data protection pruning.


Just because a row is soft-deleted now doesn't mean it can't be actually deleted later.


This is still ignoring the issues listed in TFA?


Would a safer method be to have a <entities>-deleted table and migrate a deleted record to that table when it’s ‘deleted’.

I’m not versed in databases but I’d expect you can even set additional access rules to individual tables making cross referencing very hard.


Yes and no. If you are thinking of a no-sql database, then this is an easy approach. However, I would expect most of the systems to be using normalized tables and this approach would be pretty hard due to relational constraints.

You would have to have a separate database copy running for deleted entries so that it can be consistent within itself.


They way I see it relationships to the deleted record must always be nullified/marked deleted, because otherwise you get broken records referencing deletions. And relationships from a deleted record might as well persist because the foreign key will (fingers crossed) remain untouched.

Again, not versed in db at all so this is likely total rubbish.


You are correct.. But if you move the deleted record to a different db, you have to follow the chain of relationships and move more records to the "deleted" db. Which means you would have a create a full database to maintain the entity relationships.


Yes that would work and be safer in a number of ways. I think I always avoided that because it requires 2 operations (delete and insert) and a tiny bit of added complexity. I think I’ll try that approach the next time I do something like this.


That's what the article suggests (at the end)


Why not use a history table with JSON serialization of data, if the goal is to restore but not have to deal with orphan records and forgotten WHERE clauses?


Someone either never worked with b2b, LEA or UGC...

Someone breaks community rules and bullies someone or something similar, they delete their comments. What do you do now?

Business customers expect that everything they delete is possible to restore, even if the delete button clearly says, "this action is irreversible".

Law Enforcement Agencies would expect you to be able to pull data from them as well.


> Law Enforcement Agencies would expect you to be able to pull data from them as well.

A previous lifetime of mine was ensuring that when we legally were able to delete things we _really deleted them_ from _all the places they ended up including backups_ to ensure our response to subpoenas, LEOs, regulatory bodies and other busybodies was 'Unfortunately information about that customer was deleted on <date> under <legal authority>. We have no data to provide you'.


That's one way to deal with it. We didn't want CP to be freely exchanged on our platform.

I think unless you're operating a very privacy focused business where such privacy is the selling point - there is no harm in complying.


> I think unless you're operating a very privacy focused business where such privacy is the selling point - there is no harm in complying.

Disagree completely - giving access to data to LEOs (or others) that you have no ongoing business use for only has downsides and has zero upsides.

I have made several submissions to judges stating the date and for what reason we destroyed records and every time the answer has been ‘well, you were legally entitled to’ and the cases were dismissed early on.

It is especially difficult in modern times to enact, but my personal view is shred fast and shred often with everything.


On the other hand, other law enforcement agencies explicitly require that kind of data to be deleted, especially when the GDPR is concerned.


Well, it's not that simple. Even GDPR has circumstances where "Right to be forgotten" aren't applied. Even when it does apply, it must be done without "undue delay". Which is about a month.

Some of them are:

- "Establishment, exercise or defence of legal claims"

- "Compliance with a legal obligation, the performance of a task carried out in the public interest or in the exercise of official authority."

In my B2B product, I do soft-deletes, but there is a task that cleans up deleted records older than N days at a specified interval.


For Data Warehousing purposes it's nice if source systems use soft deletion for easy detection of deleted rows since a certain point in time. I don't think the use-case is limited to undoing user deletions.

If there is no soft deletion or audit table available, the alternative is to either 1) Select all the source primary keys to find out which no longer exist in the source but do exist in the data warehouse 2) If there is no primary key available, select the whole dateset for full row comparison

Since in corporate settings data warehouses load at least once a day, you can imagine the amounts of data being moved around needlessly.


For data warehousing, you usually have a ledger of events? Such that you would just record that a record with a given identifier was created/modified/deleted/etc. Right?


Yes. In essence what most businesses do is create a record for any given state, so it enables analysts to see when something was created/modified/deleted indeed.

Most business applications contain only the actual state of data, so by loading the data warehouse these changes can be saved and analyzed.

The simplest implementation is that each record gets a start_datetime and end_datetime so you can see the changes over time or figure out when something was deleted. Other techniques like data vault enable businesses to get a complete historic view of any given point in time.


The 'deleted records table' they suggest at the end of the article is a really good alternative!


This works well for my team. We had so many headaches with deleted_at and finally decided to create a table for all deleted data. It just dumps stuff as json. Undeleting is rare and manual.


Is there a database that does the WHERE deleted=null for you?

Like, some way to create a special column that makes it so every query to that table automatically has a certain filter unless you use some other keyword to override it?


Temporal tables, aka system versioned tables, are something like this: https://mariadb.com/kb/en/system-versioned-tables/


DBs with Materialized Views

Here is the syntax for Snowflake. CREATE MATERIALIZED VIEW mymv COMMENT='filtered view' AS SELECT * FROM mytable WHERE isdeleted=0;


Why materialize? A normal view is sufficient


It was to answer the OP question. Querying a materialized view is essentially the same as querying the original table. The MV allows for automated indexes etc.


I mean... a normal view hits all of the same properties, and is literally querying the original table. Materialization is an entirely orthogonal concern as far as I can think of.


IS NULL!!!

not = NULL!!!


cover both bases due to this ambiguity =)

deleted IS NULL OR LOWER(deleted) = 'null'


Soft delete seem like a similar issue to the entity-attribute-value systems that some people build atop relational databases in order to have 'dynamic' tables, effectively writing their own database.

I'm not sure the ideal solution in either case, but I think software should use the database primitives as they are intended, rather than building their own primitive operations on top.

In the case of delete, one idea might be for databases to natively support a recycle bin - shadow tables that contain deleted records, but I'm not a db engineer.


After having done "deleted_at" for years, we prefer systems like MSSQL Temporal Tables. Otherwise, managing a deleted_at column is something you have to remember to account for in indexes and in all queries.

Instead, just nuke that row and have a copy in a separate time-based table.

https://learn.microsoft.com/en-us/sql/relational-databases/t...



Soft deletion is important if you want to do analytics on your data. Hard deletes mean that you can never reconstruct an accurate history.

Also, if you’re not careful, hard deletes can lead to primary keys getting reused, which can cause all kinds of problems if the delete doesn’t cascade into every other system. I had this happen in practice once and it was a huge pain.


Of course, soft deletes can cause their own problems where you do want to reuse a key. That, or you have to start imposing uniqueness constraints on a non-key value. Certainly valid, but not at all user friendly.

Consider, if you are creating resources in something like a cloud provider, you can setup permissions to things if they are based on names or tags. If that name or tag is a UUID like thing, that means you cannot setup the permissions before the thing is created. And if it is ever accidentally deleted, you cannot restore things easily.


You can't reconstruct an accurate history with either style of deletes. Only append-only event logs allow that.


I have found soft deletes to be really helpful to preserve historical references and as a simple but reversible enable disable switch.

Sometimes a really destructive delete isn't something you want to do by default.

I have seen them abused and a lot of times queries forget to take the flag into account. Every decision is a trade-off.


Using a version controlled database gives you "soft deletes" with none of the downsides he mentions (except GDPR -- a versioned data store is probably hard to make GDPR compliant).

https://www.dolthub.com/blog/2022-11-03-soft-deletes/


The GDPR deletion requirement has been most compelling (and a headache) for my application


IANAL but I got confirmed several time that you can soft delete if all fields are then anonymized, the law does not care about the concept of database row as long as the PII is deleted


If you anonymize data then it isn't soft delete anymore as you cannot un-delete it. What you are describing is more suited to generating test data for machine learning algorithms, or simply for developers that need to set up db instance in DEV environment.


Yes i agree, my point was not clear. I meant to adress their "due to rgpd you are obliged to delete row so you are back again with the foreign key constraint problem" which is a often misunderstanding of developers that think that the "delete" meant by the law is bijective to a sql DELETE statement


To satisfy GDPR requirements when customers leave, you can blank out the PII fields. No need to blow up your application by deleting records.


Also, a lot of records are the businesses? They may have to remove who it was that they did business with, for compliance, but they do not have to forget that they did business? (Right? Genuine question.)


Not a lawyer. Please consult with one. The general rule is you must comply with all laws and regulations that impacts you. Tax regulations and contract laws and employment laws usually override the duty to delete PII when customers leave. Each set of laws have a generally defined retention period. Please keep in mind each country’s laws are different. Your lawyer can tell you what they are.

What that means is if you blank out or anonymize the record you can argue you are satisfying generally accepted record keeping rules. Eg you sold a widget to a person. To satisfy GDPR you just don’t know who they are after so many years.


Right, this matches my understanding. With the same caveats.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: