Soft deletion probably isn't worth it

JohnBooty · on July 19, 2022

I've been a software dev since the 90s and at this point, I've learned to basically do things like audit trails and soft deletion by default, unless there's some reason not to.

Somebody always wants to undelete something, or examine it to see why it was deleted, or see who changed something, or blah blah blah. It helps the business, it helps you as developer by giving you debug information as well as helping you to cover your ass when you are blamed for some data loss bug that was really user error.

Soft deletion has obvious drawbacks but is usually far less work than implementing equivalent functionality out-of-stream, with verbose logging or some such.

Retrofitting your app and adding soft deletion and audit trails after the fact is usually an order of magnitude more work. Can always add it pre-launch and leave it turned off.

If performance is a concern, this is usually something that can be mitigated. You can e.g. have a reaper job that runs daily and hard-deletes everything that was soft-deleted more than n days ago, or whatever.

gmiller123456 · on July 19, 2022

The author uses the "no one ever undeleted anything" as the primary justification. I think this is the part they miss. I've never undeleted a user either, but there have been many times I've gone back to look at something. Either a complaint finally gets around to me as to why the user wanted their account deleted (e.g. feature not working) and it helps to figure out why. Or they're returning and want things set up like they were. Or someone is taking over their roll and needs to be set up like the last person who's already gone.

Though you really shouldn't be relying on a database for an audit trail. It might help find some issues, but things actually used for security shouldn't be writable so easily.

JohnBooty · on July 20, 2022

    I think this is the part they miss. I've never 
    undeleted a user either, but there have been many 
    times I've gone back to look at something.

Yeah. As far as a user-facing "Undelete" button existing or being used... that's very rare in my experience.

What's much more common is a user accidentally deletes some data. They deny they made an error. The developers are blamed. You then have to go on a wild goose chase figuring out if it was possible for the app to actually screw up in that way. There's usually no definitive answer, and even if there is, management can't understand it. And regardless of how any of that plays out, you still probably have to try and recover the data from backups or something.

Alternately, maybe it was the app's fault. Still plays out nearly the same!

Soft deletes and/or audit trails save you from all of that.

    Though you really shouldn't be relying on a 
    database for an audit trail. It might help 
    find some issues, but things actually used 
    for security shouldn't be writable so easily.

I mean, at some level you need to trust the database right?

Been ages since I did it, but it's usually possible to set up a "secure" audit trail with use of database permissions. For example, the application's DB credentials can have SELECT and INSERT permissions on the audit trail table, but no UPDATE or DELETE perms.

How would you set up a secure audit trail that didn't rely on the application and/or database at some level? Even if it lives outside of the database, that data came from the database.

Not a rhetorical question. Genuinely curious!

dkarl · on July 20, 2022

+1 to investigating user error. It also happens that you have bugs, and it's hard to spot the bugs if you don't have a quick way of filtering out all the cases of user error.

> Yeah. As far as a user-facing "Undelete" button existing or being used... that's very rare in my experience.

This is pretty common in my experience, but often with slightly different terminology. I've seen "delete"/"undelete" but also "move to trash"/"restore from trash" or "mark for deletion"/"unmark for deletion" and even "archive"/"restore from archive" where deleted/marked/archived objects were excluded from normal business processes and were hidden in the UI unless the user went to a special page where they could search, view, and restore.

I've also seen chat functionality where users had the ability to delete messages but admin users had the ability to view the full history of a chat, including deleted messages. (The software was used by businesses to communicate with their customers, who were also businesses, so supervisors being able to read deleted chat messages was appropriate.)

In my experience soft delete is basically free if you do it from the start. I've never regretted soft delete but have always regretted the lack of it. However, it's easier if you also have a data retention policy from the start so you're forced to create automated cleanup processes and maintain them as you go along.

JohnBooty · on July 20, 2022

    This is pretty common in my experience, but often with 
    slightly different terminology. [...]

    I've also seen chat functionality where users had 
    the ability to delete messages but admin users had 
    the ability to view

Great points! Slack actually works that way, if I'm not mistaken. Admins can read deleted messages and, I think, the edited versions of messages right? Not sure how they store things internally, but you're right - it's common functionality in many classes of apps.

    In my experience soft delete is basically free if you do it from the start.

This is my experience too.

gmiller123456 · on July 20, 2022

    How would you set up a secure audit trail that didn't rely on the application and/or database at some level?

A database is fine for an audit trail. But the application shouldn't have permissions to cover up a change. Likewise, the audit trail should record that a record was created, deleted, maybe undeleted, etc and when, by who. Relying only on the "deleted" flag only shows you the current state, not what and when was done and who did it.

sverhagen · on July 20, 2022

Any tools you recommend here? Or just roll your own with triggers in the database?

gunnarmorling · on July 20, 2022

Log-based change data capture is a great tool for building audit logs; unlike triggers, it doesn't impact write performance, you don't need to maintain triggers in the first place, and you can have you audit log in a system separate from your main database.

Blogged about one possible implementation using Debezium a while ago here: https://debezium.io/blog/2019/10/01/audit-logs-with-change-d...

Disclaimer: I work on Debezium

gmiller123456 · on July 20, 2022

I think you'll find that most operating systems have this built in, syslog for Linux and the event log on Windows. We actually use SPLUNK which has lots of tools for collecting data from apps using different methods, and also tools for viewing and querying the data.

jrumbut · on July 20, 2022

I've long wished for an RDBMS (or perhaps ORM, but I think it would be extra cool at the database level, see below) that does things that don't hyperscale.

So many LOB applications (the part of the iceberg under the water) have modestly sized databases with a relatively consistent number of users. The data is small, but often complex, awkwardly structured, highly relational, etc. The challenges these applications face are different, but real.

An example of something that would be incredibly useful here but completely untenable for a Facebook scale system would a rich rewind mechanism. You can go back in time 15 minutes or execute a query like:

RESTORE my_db TO LAST POINT WHERE EXISTS (SELECT * FROM ACCOUNTS WHERE ID=1234)

How much trouble would that kind of query have saved you in your career? At some point this would be cost prohibitive but I bet 50% or more of all apps never reach that point.

sammoorhouse · on July 20, 2022

I think what you're looking for is called bitemporality (https://en.wikipedia.org/wiki/Bitemporal_modeling).

It allows you to query a system (some databases support this) in a temporal context. This is useful in financial services where you have market data baked into a price you've offered to a client, where the underlying market data might have been later corrected by a vendor.

So you can query how the world looked last tuesday 4pm; but also - given what we now know, how ought the world have looked last tuesday 4pm.

Interesting stuff.

fatnoah · on July 20, 2022

> This is useful in financial services where you have market data baked into a price you've offered to a client, where the underlying market data might have been later corrected by a vendor.

A long time ago, I worked on a such a system that provided service-based billing for an industry where rates were very fluid. Having this "at any point in time" view of key data was essential for many things.

salzig · on July 20, 2022

CockroachDB has "AS OF SYSTEM TIME", which even allows you to backup the database for that moment.

https://www.cockroachlabs.com/docs/stable/as-of-system-time....

jrumbut · on July 20, 2022

CockroachDB has been a font of cool ideas from the beginning IMO.

tornato7 · on July 22, 2022

BigQuery has this too, it’s extremely useful.

remus · on July 20, 2022

MS SQL Server has a fun feature called temporal tables that does some of this https://docs.microsoft.com/en-us/sql/relational-databases/ta...

jrumbut · on July 20, 2022

This is very interesting!

I haven't used this feature, is it possible to twist this to find a time associated with a table state so that you can then query the table at that time (allowing you to answer a question like what was the value of X the last time Y was true)?

namibj · on July 21, 2022

You could replay the change feed from CRDB into Materialize and catch the timestamp (s) at which some query results pop out, using an incremental query in what is basically a linear search.

mirages · on July 20, 2022

In Oracle you have the flashback tech

  SELECT * FROM employees
  AS OF TIMESTAMP
  TO_TIMESTAMP('2004-04-04 09:30:00', 'YYYY-MM-DD HH:MI:SS')
  WHERE last_name = 'Chung';

guiriduro · on July 20, 2022

For PostgreSQL with WAL logs, or in a service like RDS, can you not do a point in time recovery? Seems like an easy way to go back in time, recover the data you want, dig out any FK referents that might also have been cascade deleted etc. If you also maintained an audit log you should be able to isolate the point where the given data was deleted (if its id is in the audit log), and restore from just before it, to capture the deleted data, query it and reinsert in master.

techdragon · on July 20, 2022

One of the common ways I've seen it is writing logs in a write only way. Most common way I've seen deployed is having the audit logs setup so they go to files in an AWS S3 bucket owned by a completely separate AWS account with Write Only credentials granted to the application doing the log writing. Its effectively "dumped" there and cannot be seen or touched even by the writing application. Sort of like having a 1 way valve in your logging "pipeline".

tornato7 · on July 22, 2022

This is something that public or private blockchains could be helpful for. Since everything is built on the hash of what came before, you can’t delete something without leaving a trail.

coldtea · on July 19, 2022

>I've never undeleted a user either, but there have been many times I've gone back to look at something.

I, for one, have undeleted things tons of times, taking them of the trash can before emptying it, undoing the delete action (in apps where this is possible), and so on.

Akronymus · on July 19, 2022

I messed up a mass update query enough times to leave myself SOME provision to undo it.

The exceptions are when there is a well tested query that affects a single account or something. Like GDPR

pustan · on July 20, 2022

>The author uses the "no one ever undeleted anything" as the primary justification. I think this is the part they miss.

But did they though?

"Although I’ve never seen an undelete work in practice, soft deletion wasn’t completely useless because we would occasionally use it to refer to deleted data – usually a manual process where someone wanted to see to a deleted object for purposes of assisting with a support ticket or trying to squash a bug."

alerighi · on July 19, 2022

It may be convenient, but under the GDPR is illegal. When an user deletes an account, all the personal data associated with that user must be deleted (or anonymize it in a way that it's no longer possible to associate it back to the particular user).

You cannot just keep user information forever "just in case" they are useful again.

dataflow · on July 19, 2022

It's illegal but companies don't necessarily care to avoid soft deletes regardless. I think companies wait to get sued so they can try to argue in court why their soft deletions are reasonable and why it's too technically difficult for them to do hard deletes.

To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software. One of those details that I'm both surprised and unsurprised doesn't seem to have been addressed legally. I feel like a court ought to at least buy this aspect of the argument, so maybe they'll buy that it can be difficult in terms of the software too? Who knows. My guess is that a reasonable court would accommodate something that's reasonable for a given company, but there are lots of variations that could fall into that category.

alerighi · on July 22, 2022

> I think companies wait to get sued so they can try to argue in court why their soft deletions are reasonable and why it's too technically difficult for them to do hard deletes.

The law is clear, you can't keep user data if the user decides that you should no longer have them. It's your own responsibility to find a way to do that.

> To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software. One of those details that I'm both surprised and unsurprised doesn't seem to have been addressed legally. I feel like a court ought to at least buy this aspect of the argument, so maybe they'll buy that it can be difficult in terms of the software too? Who knows. My guess is that a reasonable court would accommodate something that's reasonable for a given company, but there are lots of variations that could fall into that category.

The deletion should be done in reasonable terms, i.e. you can't reasonably get the data back. Of course when you delete something from an hard drive being that an SSD or HDD till that block is not reused the data is recoverable, but that is another thing.

By the way, it's either not really difficult to secure erase user data: what you need is not to erase all the data of the user, is to encrypt all the data of the user with a symmetric key algorithm (such as AES, that is super fast) and only store the key in a system that gives you secure erase capability. When the user deletes his account you don't have to delete all the data but just the encryption key.

dataflow · on July 22, 2022

These points have already been discussed, see the sibling threads.

jcheng · on July 19, 2022

GDPR fines are structured to make it extremely risky/expensive to take the “ask for forgiveness not permission” approach you’re suggesting. I think for this reason, the fine is designed to scale with the size of the offending company’s revenue, not so much with the actual damages suffered by anyone. Take a look at some of these fines, tens of millions of euros cause the UI for your cookie consent dialog is poorly designed? https://www.tessian.com/blog/biggest-gdpr-fines-2020/

When the GDPR rules came out our company (pretty small at the time) pulled out all the stops to get ourselves into compliance, due to the above and other teeth put into the regulations. It seemed utterly irrational not to. Were we the exception, or the rule?

dataflow · on July 19, 2022

I don't know if you were the exception or the rule, but I know for a fact some do go the forgiveness approach for requirements that they believe would be disproportionately burdensome.

Probably almost all companies actually, including yourself (just to a different extent) - what did you do about the storage remapping thing I mentioned? Did it come up/did you guys discuss it? Do you believe you're in compliance despite your hardware (most likely) not guaranteeing erasure or overwriting of existing data? I'm curious how your assessment of that went, because I doubt one can be in strict compliance without guarantees from the hardware.

dinvlad · on July 20, 2022

Please, don't give ideas to whomever among their lawyers reads this later :-). We don't need another Schrems II..

dataflow · on July 20, 2022

Sorry!

dinvlad · on July 20, 2022

I think lawyers probably miss a lot of these technicalities (which explains why they can require something as technically unreasonable as Schrems II), but I just wonder if someone more technical can actually push for something like that, to win themselves some questionable points, in the name of improving privacy.

closewith · on July 20, 2022

What a bizarrely dismissive comment. You could just as easily say that some developers miss a lot of the legal technicalities, which explains why they believe Schrems II is an unreasonable ruling.

dinvlad · on July 20, 2022

Now I don't understand your comment, sorry :-)

namibj · on July 21, 2022

What's the unreasonable thing? Not having your data HQ in a place vulnerable to unrelated government overreach?

coder543 · on July 20, 2022

Remapping of blocks isn't the problem here. IANAL, but you should be encrypting all the data for GDPR compliance to begin with. With it encrypted, how the storage device chooses to map its blocks is completely irrelevant. When the key goes away, all the data is erased by definition. Throwing away the key is one of the easiest ways to comply with Right to be Forgotten, from what I've seen.

"Oh, but the key is still on the hard drive!" you would likely complain. No... the key could be stored any number of places that won't leave a copy. If you store the key on a TPM/HSM, then tell the TPM to delete the key, short of finding out that the TPM manufacturer failed to perform their duty in erasing the key, and simultaneously discovering a 0-day exploit to go in and retrieve that deleted key, it is gone. Full stop. It is unreasonable to ask everyone to do something impossible like prove that their TPMs are working properly, when even the TPM manufacturers cannot definitively prove this, so that is clearly not what the law is asking. You might as well ask companies to prove that they've never been hacked. How can anyone prove that?

Conceptually, one of the simplest ways to comply with this would be to perform database failover once a month, since you have a month to comply with these requests. Before the failover, you would ensure that all data is deleted that has been requested to be deleted. Then you bring up a new database replica, stream the remaining data to it, failover, and then destroy the key stored in the TPM on the primary and any other replicas. All that deleted data is now gone forever.

There are more complicated architectures where you could have per-user keys which are stored on an encrypted volume, and then you would be managing a handful of keys in the TPM to handle rotation of the encrypted volumes. Once a month, you copy the per-user keys that haven't been deleted onto a new encrypted volume, then tell the TPM to delete the key for the old encrypted volume: garbage collecting the deleted keys in an unrecoverable fashion. This avoids the hassles with having to failover a database so frequently, although having that level of automation around database failover has advantages too.

I'm sure some companies are content with treating the deleted data as erased, as long as their drives are properly encrypted. A rogue employee could potentially recover the data while the hard drives and the keys are still together, but the hard drives will be worthless once they're separated from the keys after they fail and are thrown away. A rogue employee could have stolen the data before it was deleted anyways, so how do you prove that rogue employees don't exist? Is this level of erasure good enough? This one I'm a little iffy on, but it still completely resolves your issues with the block remapping.

GDPR makes reference to "taking account of available technology and the cost of implementation", so I think it is fair to say to that GDPR is not asking you to erase any butterfly-effect waves that have propagated out from the read/write head of your hard drive. If the data is not recoverable by any known means, that is the purpose of the law, isn't it?

I will repeat that I'm not a lawyer, but your argument feels like a strawman designed to paint the Right to be Forgotten as something that no one can comply with. Your comments come off as snarky and acting like you think you're so clever. As if no one has ever thought about these problems before.

jandrewrogers · on July 20, 2022

Per user encryption such as you suggest does not address the problem. Systems that work this way have existed for decades and they have pathologically poor performance and scalability. Many classes of major optimization in databases don't work under these constraints. This isn't a novel idea, having been implemented for decades; it has been broadly rejected because it doesn't actually work in real systems without tradeoffs no one accepts, even in highly security sensitive environments.

You are overly dismissive of the technical challenges of actually deleting data, particularly in large systems. Technically naive solutions like having separate encryption keys for individual entities sounds good but it doesn't actually address the problem.

coder543 · on July 20, 2022

I feel like you’re overly dismissive of the other things I mentioned. Database failover comes with inconveniences, but it does allow you to use a single key and get great performance. For massive databases, it might be impossible to do that way, but most companies don’t have massive databases. I also opened a question about how ”erased” deleted data needs to be to be compliant, since I would imagine a lot of companies consider deletion on an encrypted drive to be good enough, even if the data may theoretically be recoverable under extreme circumstances until that hard drive is completely overwritten multiple times or destroyed.

Per user keys are also very useful for use cases like a personal file storage service, even if traditional RDBMS don’t work well with them. Techniques are situationally dependent.

I was not being dismissive of the existence of technical challenges, but rather dismissive of comments like:

> To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software.

Which seem to completely deride the possibility of compliance.

Compliance is feasible, even if it has challenges.

heisjustsosmart · on July 20, 2022

Save yourself a lot of time, read jandrewrogers comments before responding to him.

dcow · on July 20, 2022

coder543 initially responded to dataflow. Then jandrewrogers included an additional anecdote, but only in a single comment. As the thread stands at the time I posed this, jandrewrogers does not have multiple comments.

heisjustsosmart · on July 20, 2022

An ambiguous phrase - I was not criticizing you but only advising you read his historical statements on databases before engaging. He has wasted a great deal of time of other people.

dcow · on July 20, 2022

Do you know how to read usernames?

heisjustsosmart · on July 20, 2022

Yes. Your point being what.

dcow · on July 21, 2022

I am not the person you originally replied to but you (again) commented like I was. My comment was about how your comment erroneously targets the wrong person because you thought you were replying to one person when you were replying to another. You did it again in your previous comment hence my serious question.

heisjustsosmart · on July 21, 2022

It doesn't matter, it was advice for all.

dcow · on July 21, 2022

It does, actually, matter. Notably because it causes confusion when trying to understand your point, among other things.

dcow · on July 20, 2022

Is there literature on this topic? I'd like to learn at what scale per-user/tenant keys becomes untenable and the characteristics of systems that exhibit pathologically poor performance due to such a design.

Naively, it sounds like individual entity keys does solve the problem of deletion, but that your argument is that the tradeoffs aren't generally worth it?

nicoburns · on July 20, 2022

I’d imagine a pretty small scale. Imagine an endpoint that lists user accounts. You would have to individually decrypt each row! and you would never be able to do database joins of data in these tables because you it wouldn’t be possible to decrypt it until it had already left the database.

closewith · on July 20, 2022

With respect, your own comment reads as overly dismissive of the regulatory environment introduced by the GDPR. Systems (and companies) have for decades collected and stored personal data with little or no regard for the lifecycle of the data.

Since the introduction of the Regulation, companies must now refrain from the collection of personal data if they cannot satisfy the rights of the data subject, including the right to erasure (which is obviously not an absolute right).

If the technical solutions aren't performant or scalable enough for data controllers, then the result is that they cannot collect personal data. The idea that companies have in the past rejected solutions which protect the rights of data subjects because they were unwilling to accept the technical tradeoffs is exactly why strong regulation such as the GDPR is required.

Frankly, your comment reads as contemptuous to the rights of the end-user. Thankfully, the EU offers protection to its residents from people with that attitude.

dataflow · on July 20, 2022

I have no idea where the sudden hostility in your last paragraph came from, but it's ascribing nonexistent malice to me. I'll try to address a couple of your points the best I can regardless.

> Once a month, you copy the per-user keys that haven't been deleted onto a new encrypted volume, then tell the TPM to delete the key for the old encrypted volume: garbage collecting the deleted keys in an unrecoverable fashion. This avoids the hassles with having to failover a database so frequently

This ignores the existence of database indexes.

What I was saying (I thought quite explicitly) was that strict compliance to GDPR appears unreasonably burdensome to most companies, and thus most companies stop at a line they draw earlier (that line frequently being the lack of hardware guarantees re: block remapping). This is not the same as saying no one can comply with GDPR, and I had no intention of making GDPR look impossible to comply with. Quite the contrary in fact—I believe its current requirements can and should be complied with quite cheaply and efficiently, but the most effective path to that requires support from hardware manufacturers (and possibly others) that customers and regulators currently don't demand.

This is not a take-down of GDPR or a malicious strawman mockery of its requirements as you're (quite unkindly and wrongly) imputing me with. If anything, it was a plea for customers, regulators, and/or lawmakers to realize they need to expand their GDPR demands to require that vendors of other parts of the hardware/software stack (particularly storage device manufacturers) provide better support for GDPR compliance.

coder543 · on July 20, 2022

You made several comments in a row that were extremely dismissive of how GDPR compliance could be achieved, and your comments repeatedly questioned whether other people considered basic concepts of data erasure. I apologize if I misattributed malice, but I do feel those comments could have been worded differently.

Also, this quote:

> To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software.

That quote feels incompatible with your statement that…

> Quite the contrary in fact—I believe its current requirements can and should be complied with quite cheaply and efficiently

How can it be complied with cheaply and easily if it can’t be complied with without hardware support no matter how much software wants to? Surely you can see my confusion. Your last comment is far more in line with what I would expect to see in a discussion like this.

Regardless, I think I understand the argument in your last comment better now, but I still think destroying keys is the most effective way to achieve compliance. Even if a storage device manufacturer guaranteed that they would attempt to erase the specific block you want erased, there are problems with whether that data is still recoverable by reading the block repeatedly.

lelanthran · on July 20, 2022

> GDPR fines are structured to make it extremely risky/expensive to take the “ask for forgiveness not permission” approach you’re suggesting.

Expensive, maybe, but not risky.

If you're rolling the dice on a GPDR fine, the expected value of vague compliance is still largely positive, while the expected value of actual compliance is still slightly negative.

The expected value of the fine needs to be a larger negative than the expected value of vague compliance.

You don't even need to raise the fine value (which is a percentage/scale of revenue), you just need to make the probability of paying it approach 1.

In fact, with a higher probability of paying the fine, you could even lower the actual fine and still have it have a larger negative expected value than vague compliance.

pmarreck · on July 20, 2022

> YouTube sets cookies on our devices to track our online activity for marketing purposes

This is not the purpose of cookies. The purpose of cookies is to store state on the browser side. Does GDPR interpret all possibilities of "cookies" as the storage and communication of privacy-violating data? Because that would be extremely unfortunate

smolder · on July 19, 2022

I wonder what percentage of companies who even appear to comply with deletion requests actually do full deletion in practice. I suspect it's small, knowing how many things are coded to fake-delete for convenience. Businesses also tend to keep cold data backups around. (Maybe backups are exempted? I don't know.) There might even be cases where ostensibly deleted data can still be recovered from a disk, if they haven't overwritten.

ikiris · on July 20, 2022

The really hard part is dealing with the data still in backups.

It leads to having to do crazy things like individual keyed encryption per user, escrow hilarity, etc.

EdwardDiego · on July 20, 2022

I do a lot of work with Kafka, GPDR removal requests for data that's still stored in a Kafka topic is so much fun.

eloisius · on July 20, 2022

I worked at a large company as it was implementing GDPR compliance, and I can attest that it was a nightmare of drudgery. I think it's interesting that our entire tech stack is designed to assume the business has a right to keep user data for as long as they want (forever), and compliance with GDPR delete requests is like a one-off exception implemented at a very sketchy redaction task that has to be careful not to leave broken foreign keys and such around.

It's a nightmare to do now, but our tech needs to change such that we assume data might be a hot potato that you need to be able to get rid of, and not some immutable asset that you can hang onto forever.

JohnBooty · on July 20, 2022

Barring more foundational and revolutionary changes in our tech stacks, another poster commented that a best practice is to have tests that delete user accounts and all associated records without breakage and to build these tests into your test suite from the very inception of the project.

Far from a magic cure, but by building it into the app from inception, it seems more possible.

smolder · on July 20, 2022

Thanks for that insight. It's an interesting solution to use per-user encryption for that. Don't you have the same problems all over again with respect to preserving access to user specific data? I.e. needing key backups for business continuity, but needing to wipe those keys for deletion requests.

jandrewrogers · on July 20, 2022

Per user encryption doesn't work in real systems. These have been implemented for several decades and no one uses them because they have pathologically terrible scalability and performance. The simple act of requiring encryption of individual user data forces the use of extremely suboptimal data structures relative to what would be used in the non-encrypted case.

A single AES encryption block is 16 bytes; internal storage of records is managed in terms of individual bits if you don't encrypt them. You also can't compress data encrypted in this way, which is table stakes for databases. The encryption key schedule alone for a single record will be much larger than the typical record! Requiring a per-user encryption key has the unfortunate effect of bloating database storage and memory requirements by at least 10x, while massively reducing computational throughput.

The design of delete-efficient database kernels is an interesting unsolved problem in computer science. No one knows how to do it. People that think there are trivial solutions have not studied the problem adequately. It hasn't been a priority to solve this problem, but its reputation for being theoretically difficult is well-deserved.

ikiris · on July 21, 2022

"I've not seen this in real systems, therefore it can't exist" ?

telotortium · on July 20, 2022

Or...just have a documented retention policy for backups and delete them after a certain timeline.

avianlyric · on July 19, 2022

Guess we’ll find out when a leaks happen, and companies start getting fines.

GDPR does differentiate between structured data (I believe it uses the term “identifiable records” or similar), and huge heap of unstructured data where an individuals data can’t be quickly retrieved as it’s own atomic unit. With much stricter requirements for anything structured.

So for data on a HDD that could be recovered, but is an unstructured mess. You’re probably ok, as long as you took reasonable steps to protect the data, like full disk encryption (which could occurs below the file system, and thus allow you to recover data deleted from the FS with a full disk scan, as long as you still had the keys). If you just had peoples data unencrypted on a HDD, and didn’t securely erase it before dispose, then that’s probably still a GDPR violation.

JohnBooty · on July 20, 2022

What does GDPR say about obscuring data?

Instead of hard-deleting a structured record for Bob Smith, can you leave the record intact and scrub it of identifying data so that I don't e.g. break all of the records for orders that Bob Smith made?

lelanthran · on July 20, 2022

> Instead of hard-deleting a structured record for Bob Smith, can you leave the record intact and scrub it of identifying data so that I don't e.g. break all of the records for orders that Bob Smith made?

The various laws (GPDR, and similar ones in my jurisdiction) abuse the hell out of vague descriptions like "personal identifying data".

For example, scrubbing a customer record of all names, phone numbers, contact info, etc and leaving only the ID in the record still makes it possible to personally identify the person by:

1. Looking through the orders records to build up a profile depending on what they order,

2. Then using the delivery records from the courier company to get an address,

3. Then using existing public information to repopulate all the fields in the customer record.

In any given business, no matter how big or small, simply removing all references to a customers personal information will still leave the customer personally identifiable. This process can even be automated.

alerighi · on July 22, 2022

You can as long as you delete all the personal data of the user and make sure that other remaining data is anonymized in a way that you can no longer associate it with the user.

changoplatanero · on July 19, 2022

User deleting an account is just one way that stuff gets deleted. There are other deletion scenarios where it is appropriate to keep the information after its deleted.

Akronymus · on July 19, 2022

Yeah. Basically anything that an employee can mess up, should be reversible IMO. But actions such as deleting an account should have the option of "YES, DELETE IT PERMANENTLY, THIS CAN'T BE UNDONE"

Gigachad · on July 20, 2022

You probably want to put the account on hold for a period where it can be undone. And then clean it up after.

Akronymus · on July 20, 2022

Not always possible in a legal way. For example, GDPR compliance.

Gigachad · on July 20, 2022

GDPR gives you a pretty large period to do it within. Easily enough to fit this feature in.

charcircuit · on July 20, 2022

The only way that would make sense in if it was a cancelable timed deletion. People make mistakes. People change their minds. People get hacked.

closeparen · on July 20, 2022

Often it’s mark for deletion after some number of days.

twelve40 · on July 19, 2022

You can most definitely keep around the non-PII data you've generated for that user in your app, such as synthetic user ID's and other stuff. Furthermore, for some apps (fintech) in some countries (US) you are actually obligated to keep around certain information for several years. It really doesn't have to be all or nothing.

bpicolo · on July 20, 2022

Insurance has to keep around info for the lifetime of the policy. For some products, that's decades.

hnbad · on July 20, 2022

Synthetic user IDs can still qualify as PII since a human may still know which person that ID was associated with and data associated with that ID may be sufficient to narrow down who it was even if no human (or system) knows the exact mapping.

Pseudonymous data is not anonymous data.

But yes, the GDPR obviously allows for legal requirements of record keeping. It does however require you to delete the data once those requirements no longer apply (i.e. you have to ensure data is still deleted after those several years, you can't just not delete it because the deletion time is years in the future).

spamizbad · on July 20, 2022

Any reasonably big system will require you to invest in that anonymization. Doing cascading hard deletes and making sure any missing references don’t trip up your system or mess up reporting is much more difficult than just flagging fields that need to be scrambled with randomized data (most libraries have fake data generators that make doing these updates easy)

Long before GDPR was a thing we had this implemented allowing us to securely analyze production dumps.

This is one of those things that I think libraries need to mature to support and then it will be effortless. Right now it requires a modest, but hardly overwhelming, engineering effort to implement.

lazide · on July 19, 2022

This should have the asterisk of * after the period of time you’ve said it will take for it to be deleted.

It’s almost impossible to guarantee instant delete, folks just care if it’s gone in a documented and/or reasonable amount of time, and predictably.

Correct?

avianlyric · on July 19, 2022

Sure, but that time period is usually limited to 30 days, with an additional 30 day extension allowed if you contact the person and tell them you need more time.

You can’t just declare that it take years to delete data, GDPR sets reasonable limits on how long a company can delay true deletion.

EUROCARE · on July 20, 2022

Even if the company just issued you an invoice, that makes it obligated by law to hold onto your PII for the next 5 years - and GDPR of course has an exception for that.

avianlyric · on July 21, 2022

Sure, but that data doesn’t fall under your right-to-be-forgotten, so any obligations that apply to the right-to-be-forgotten are irrelevant.

LunaSea · on July 19, 2022

The GDPR gives limits on answering, I'm not sure it gives a limit on actually deleting the data everywhere.

Various types of data have various retention times for various reasons, some being legal reasons.

yuliyp · on July 20, 2022

The GDPR says it's 30 days with a 2 month extension. This allows for having safe soft deletion and backup retention policies without running afoul of it.

telotortium · on July 20, 2022

Do you have a link for that?

yuliyp · on July 20, 2022

https://gdpr.eu/article-12-how-controllers-should-provide-pe... section 3

telotortium · on July 21, 2022

Thanks!

M2Ys4U · on July 20, 2022

The GDPR says data should be deleted "without undue delay", it doesn't give a hard deadline.

astura · on July 19, 2022

>under the GDPR [soft delete] is illegal

This is obviously not always true. Any European can't, for example, delete their online account with their mortgage company and demand that the mortgage company deletes the records saying that they owe them money so they can get their house for free.

Nor can anyone call up their previous employer and require them to delete all the work they ever did from their company's computer systems.

A company can't respond to a tax audit with "well we lost a billion dollars this year but all our customers wanted their accounts deleted so we can't provide any documentation of it." I mean, you aren't even going to get to the audit if you can't fill out your taxes in the first place because you had to delete all your financial records.

Most computer systems/software products in existence aren't free media apps, yet the "privacy circlejerk" constantly talk in generalities as if they are. I'm sick of this, most applications have a business justification for using some form of soft delete for at least some items.

xxs · on July 20, 2022

Under GDPR the retention policy means the account has to be kept for whatever time stipulated by the law (10y), but then the data should not be retained any longer, hence deleted.

he0001 · on July 20, 2022

> Nor can anyone call up previous employer and require them to delete all the work they ever did from their company’s computer systems.

You can’t delete IP as this is not covered by GDPR, but you sure can ask them to delete your identification data from their records as GDPR also works for employees not just customers. It’s a problem with for example vcs.

Unless there’s another law in ruling, GDPR is the baseline. It’s for any identification of an individual, regardless of usage. Most “media apps” don’t have any other law that’s in effect, so GDPR is mandating what you need to do. (IANAL but worked on implementing GDPR with lawyers).

jsnell · on July 20, 2022

The right to erasure to is not unlimited (few things in GDPR are as absolute as people imagine). Yes, is personal data is retained based on consent, that consent can be withdrawn and the data must be deleted. But there are many other bases for collecting and retaining data as well.

The examples in the GP were mostly around the data being necessary for fulfilling a contract (can't request deletion of your mortgage records) or meeting legal obligations (can't delete tax records). But there's also legitimate interest.

Legitimate interest is vague and open to interpretation, but I'd be really surprised if there were any decisions from a DPA suggesting that a VCS commit log would be problematic. The legitimate interest seems really strong in that case: being able to audit who actually made each change, e.g. for security reasons.

M2Ys4U · on July 20, 2022

>You can’t delete IP as this is not covered by GDPR, but you sure can ask them to delete your identification data from their records as GDPR also works for employees not just customers.

You can ask, but the right to have data erased is not an absolute right. In fact there's an enumerated list of the circumstances in which the data must be deleted but data controllers are otherwise under no specific obligation to delete data, and there are even a series of criteria which dis-apply that obligation.

hnbad · on July 20, 2022

An often missed case of the GDPR requiring companies to delete data is that you can't just keep CVs "on file" without explicit permission, even of former employees you did end up hiring (unless you are preparing a legal case involving claims made in the CV for example).

Data controllers absolutely must delete data by default. The basic rule of GDPR is literally "privacy by default" (or "data minimization"), not "data hoarding by default". There are plenty of exemptions that grant companies the right (or rather require them) to delay the deletion by amounts of time as long as many years (e.g. for tax purposes), but they are only permitted to maintain only the absolute minimum required to fulfill these obligations and they must delete the data as soon as possible once these obligations have been fulfilled.

If you are obligated to keep a set of PII for 10 years, you must delete it in 10 years, unless you (still) have consent to keep it beyond that time limit.

he0001 · on July 20, 2022

It’s an absolute right, and one of the main points of GDPR, unless there are another law[0] stating that you can keep it, for a specific reason. Like financial transactions. When that law expires then GDPR kicks in. The company doesn’t own your data, you do.

[0] https://gdpr-info.eu/art-17-gdpr/

M2Ys4U · on July 20, 2022

>It’s an absolute right, and one of the main points of GDPR, unless there are another law[0] stating that you can keep it, for a specific reason.

It can't be "an absolute right... unless", it's either absolute or it's not.

And if you read Article 14 the first paragraph says that "The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies" (emphasis mine) and then lists six grounds. Further to that, paragraph 3 dis-applies paragraph 1 for more reasons than "another law states you can keep it".

>The company doesn’t own your data, you do.

Data controllers do own the data (insofar as one can "own" data), but data subjects have rights over the processing of those data that limit what the controllers may do with them.

astura · on July 20, 2022

Lol, okay dude, believe that if it makes you feel good.

afiori · on July 19, 2022

I would guess that there are quite a few limits to that...

A user has a long history of participating on your forum and other users have quoted their messages far and wide. Collectively all of the messages posted on your forums (with or without timestamps) reveal a few PII about the user. Do you have to delete those?

The user filed a bug report about a functionality not working, do you have to delete the text of the bug report?

Arguably if your user table look like [user_id, creation_date, deletion_date, status, account_type] then this table does not contain any PII.

Assuming that user content is not automatically PII, whose responsibility is it to track where PII can be?

Sometimes users doxx themselves (like mistakenly sharing tax return forms instead cat pics), in such a case it is the user responsibility to signal this to you.

If the user filed an issue saying "when I insert my name (Abe Cox) the input validation fails" is it you or the user that need to read through all the issues to find this case?

My point is that GDPR + right to be forgotten cannot make it look like you never had an account at all, especially without user assistance.

alerighi · on July 22, 2022

> A user has a long history of participating on your forum and other users have quoted their messages far and wide. Collectively all of the messages posted on your forums (with or without timestamps) reveal a few PII about the user. Do you have to delete those?

Technically yes.

> The user filed a bug report about a functionality not working, do you have to delete the text of the bug report?

Depends of the privacy policy of the bug tracking application. If it's stated that the report shouldn't contain any personal data you don't have to delete it. If the user asks for deletion stating that it contain personal data you have to delete it.

> Arguably if your user table look like [user_id, creation_date, deletion_date, status, account_type] then this table does not contain any PII.

It doesn't, as long you no longer have any way to associate user_id to the user personal data, you can keep it.

> Sometimes users doxx themselves (like mistakenly sharing tax return forms instead cat pics), in such a case it is the user responsibility to signal this to you.

In that situation if the user asks for deletion of the content you have to provide it. You are obviously not asked to monitor errors about the user.

hnbad · on July 20, 2022

user_id is PII if the identity of a user can be narrowed down sufficiently.

For example, if you created a social media app and later left the company, it could still be determined that if you had an account it likely has one of the lowest user ID values if those were assigned sequentially. So even if your name was removed from it and none of your posts provided any identifying information about you, the user ID could still qualify as PII.

Likewise if the user ID is public and non-PII content associated with that ID is linked elsewhere identifying the author of that content (e.g. a news article embedding a post) the identity of the author is still compromised.

The problem is not just that users may submit PII in unexpected places, the problem is that even if the data type could not possibly allow them to, metadata beyond your control may actually still taint your supposedly anonymous data.

afiori · on July 20, 2022

The question is not whether "spooky deanonymization as a distance" is possible; the question is how much you should proactively do to avoid this.

As an extreme example, if user A takes a screenshot if their own user setting page on a image sharing site, then sends it to user B that uploads it to the same site. Who is responsible for this piece of data? does share-a-pic.web need to scan all images for the username of a user that is deleting an account?

This is a convoluted example where the answer is obviously no, but it shows that a purely principled answer might be unsuitable.

Especially since non PII data you own mixed with non PII data someone else owns can become PII and you do not know what data others have.

(you could also replace the user ID in all locations with a new tagged UUID to preserve referential integrity in the DB, but there not much more than 2^32 humans in total, it s very hard not to be deanonymizable)

zrobotics · on July 20, 2022

What we did (and legal was OK with it) was to create a "GDPR Delete" function in the admin backend. It doesn't actually delete, but sets a soft delete identical to the regular user delete function and overwrites any PII with stuff like name:gdpr_delete_name_xxxx

That way log and system context is available for debugging, but the system no longer holds any PII. It does delete some logs, but anonymizing data seemed better than hard deletion. It wasn't hard to do, it took legal longer to approve than writing and testing the code and has come in handy a few times for debugging errors.

Edit: typo

pmarreck · on July 20, 2022

What if you anonymize it with a random key and then give the user that key in the event they want to undelete their deletion?

EUROCARE · on July 19, 2022

GDPR explicitly says that this doesn't apply when a more specific law requires you to store something - logically then you also need to know who the user was. Things like transaction records, network traffic logs, invoices, pay slips, workplace chat...

So yeah, you're right that not deleting user data just because it's inconvenient is wrong and illegal, but that doesn't mean there aren't legitimate usecases for soft deletions of users.

petre · on July 20, 2022

We don't do it for users but for certain user assets that are a pain to remove consistently without a background job or a job queue. Like user devices that have associated dependencies. The background job is automatic, it prunes objects evey x hours.

avereveard · on July 20, 2022

Not all personal information storage needs consent, gdpr art 6 comma b and c cover such cases as applicable to private companies.

Users can retract willing consent, but willing consent is not the only thing that allow companies to store user data.

Soft deletion of users whom had an account with transactions on it is fine

Art 17 comma b explain this quite clearly

Gdpr is amazingly readable.

Groxx · on July 20, 2022

You can do both.

Soft delete immediately when requested. Hard delete (if still soft deleted) when GDPR requires it.

Besides it's not like all data is the same. Use what's appropriate when it's appropriate. Nobody's saying "you must always soft delete, and then immediately tweet the audit trail, print that tweet, put that printout onto a rocket, and launch it to the moon for later retrieval. with a blockchain for extra GDPR hate".

pmontra · on July 20, 2022

Last month I had to implement customer deletion on a new project I started working on recently. The basics were: anonymize the user personal data, delete subscriptions, keep anything else. I believe that payments and invoices are not stored in that database so I don't know what happens to them but in general you must not delete that kind of data because GDPR notwithstanding the law requires that they are available for inspection for a long time. Of course they cannot be anonymized in a way that cannot be reversed. Being able to connect them with a trail of operations in the main database is probably useful too. So, soft delete.

kgeist · on July 20, 2022

>It may be convenient, but under the GDPR is illegal. When an user deletes an account, all the personal data associated with that user must be deleted

There's a grace period of 30 days. We hard-remove after 2 weeks allowing a user to change their mind if they deleted it by accident. More than enough.

411111111111111 · on July 19, 2022

Stop spreading misinformation. Nobody is required to hard delete the very second the [delete] button is pressed. You're following the law as long as that mentioned reaper job exists and runs at least once per month, to be safely within the allowed timeframe

brigandish · on July 20, 2022

I didn't see a claim that deletion had to be "the very second the [delete] button is pressed".

This is the actual text of the law[1]:

> The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay

There is a one month limit of the right of access[2] and right to be informed[3], which flow into some of the articles of the law; and certain articles also mention a month's limit on being informed, e.g. from article 12 parts 3 and 4 [4]:

> The controller shall provide information on action taken on a request under Articles 15 to 22 to the data subject *without undue delay and in any event within one month* of receipt of the request. That period may be extended by two further months where necessary, taking into account the complexity and number of the requests. The controller shall inform the data subject of any such extension within one month of receipt of the request, together with the reasons for the delay. Where the data subject makes the request by electronic form means, the information shall be provided by electronic means where possible, unless otherwise requested by the data subject. If the controller does not take action on the request of the data subject, the controller shall inform the data subject without delay and at the latest within one month of receipt of the request of the reasons for not taking action and on the possibility of lodging a complaint with a supervisory authority and seeking a judicial remedy.

I've marked out those phrases because they show the language of the law that you will see again and again in the articles - without undue delay and within one month. The latter does not obviate the former.

Hence, and in fact, there is no part of the law that explicitly gives you a month to delete something. If you could delete something right now but you leave it hanging around because you have arbitrarily chosen a job to run once a month and there's a leak in that time, that would be undue delay.

[1] https://gdpr-info.eu/art-17-gdpr/

[2] https://gdpr-info.eu/issues/right-of-access/

[3] https://gdpr-info.eu/issues/right-to-be-informed/

[4] https://gdpr-info.eu/art-12-gdpr/

pbreit · on July 20, 2022

The user has to explicitly request the erasure.

brigandish · on July 21, 2022

Yes, they do that by pressing the delete button on their side (and so forth). This is covered in the first line "The data subject shall have the right to obtain".

If you want to go to court and argue that the user should've know that pressing a big red button with "DELETE!" on it really means soft delete, be my guest.

411111111111111 · on July 20, 2022

You literally said that soft deletes are illegal, which heavily implies what you now deny to have said.

The 30 days wasn't particularly taken from the regulation, the actual grace period for the deletion should be 90 days in total, but I'm not a lawyer nor certain about it.

What I am certain about is that your original thesis, which is soft deletes being illegal under gdpr, is complete bullshit

brigandish · on July 20, 2022

> You literally said that soft deletes are illegal, which heavily implies what you now deny to have said.

Who is "you" here? Perhaps if you were less defensive you'd notice that I'm not the person you originally replied to, which, by the way, makes your response nonsensical.

Edit: I'll add, that the only person who is likely to get anywhere near to "complete bullshit" is going to be the one who writes "but I'm not a lawyer nor certain about it" and doesn't provide a shred of evidence to back up their, frankly completely ignorant, notions.

90 days grace? Risible.

411111111111111 · on July 20, 2022

Yeh, you take first place for the most retarded person 2022 award.

loeg · on July 20, 2022

Not every delete is a GDPR delete. You could have a hard-delete GDPR flow, obviously.

pradn · on July 19, 2022

Soft deletion is just one way to achieve undeletion. The author's proposed solution of moving the resource to another table works just as well. You can move it back to the non-deleted table to perform the undeletion. You can keep around these deleted objects as long as you want; they work as a subset of a proper audit trail. The cost of course is you have more tables, but that is less of a cost than having to add "deleted=False" predicates in all of your queries.

Also note, if you use a soft-deleted column, indexes need to be keyed by that column as well if you want to access non-deleted objects quickly. That's extra complexity.

mjevans · on July 19, 2022

Even more important; the deleted records don't need to live in your cache / RAM / etc. Potentially faster queries.

JohnBooty · on July 20, 2022

    but that is less of a cost than having to add "deleted=False" 
    predicates in all of your queries.

Maybe or maybe not. You can use a view. Or you may be using an ORM that lets you set a default scope (essentially, a default WHERE clause - ActiveRecord lets you do this)

It also depends on if you're designing an app for this from the ground up or if you're trying to retrofit an existing app with 90 million different hardcoded queries.

Also depends on what you want to do with the deleted records. Do you want to do things with them? Maybe after a user is deleted, you want them to be able to log in, but you would prefer them to see a "sorry - your account has been deleted" message instead of "user not found." Maybe you want your support staff to be able to look at deleted users. Etc. Now you need to update your app logic so that it's flexible enough to look at "users_deleted" and "users". Which may be at least as onerous as messing with the WHERE clause on every single query.

To be clear, I don't hate the "another table" solution. It's the right choice in a lot of situations IMO.

Gordonjcp · on July 19, 2022

> but that is less of a cost than having to add "deleted=False" predicates in all of your queries.

It's like people have forgotten what views are.

delgaudm · on July 19, 2022

Well, most likely they're queries that have "deleted=False" in them. ;-)

hoistbypetard · on July 19, 2022

You almost made me ruin a keyboard. Jerk.

smackeyacky · on July 19, 2022

What seems to be missing here is the DB tech he is using. On a proper database you can do your "undeleted" with triggers and it's relatively trivial. Nonsense like a "deleted" column on your main data table just seems silly.

JohnBooty · on July 19, 2022

    Nonsense like a "deleted" column on your main data table just seems silly.

Why silly? Yeah, it's one more column, but most DBs just represent dates as bigints AFAIK so it's not really the end of the world.

Sure, you have to be conscious of adding "where deleted_at = NULL" (or the opposite) to all your queries. If you're going through an ORM this is usually trivial e.g. in ActiveRecord it can be a default scope on your model.

    On a proper database you can do your "undeleted" with triggers 
    and it's relatively trivial.

I like this solution as well, FWIW.

If I was retrofitting an existing application with soft-deletes I would definitely favor this solution.

There are some obvious advantages over a "deleted" column on your main table for sure.

One drawback is that you now have to keep two table definitions in sync. You also have to add conditional logic in a lot of places to handle whether a given piece of code is looking at a normal user from the "users" table or a deleted user from the "users_deleted" table. Instead of messing with the WHERE clause on every query, now you're messing with the table name on every query. So that's a wash IMO.

icedchai · on July 19, 2022

I worked at a place that kept all the deleted stuff in their main tables. It turned out over 90% of the rows were deleted. I'm not sure how often something was undeleted, but it was not very frequent. Some of these soft deleted rows were 5+ years old. Archive that crap.

JohnBooty · on July 19, 2022

Yeah, for sure, if deletion is going to be the exception rather than the rule.... keeping deleted stuff in the main tables is not ideal.

Of course, you can have your cake and eat it too. You can e.g. keep your deleted crap in the main table (which makes certain things easier) and then hard-delete or archive it to another table after N days.

dubswithus · on July 20, 2022

I would never suggest archiving unless we're hitting some performance limit. Why generate more busy work? Just leave it there and if need be add more indexes and partitions.

icedchai · on July 20, 2022

well, they were running into performance issues. This table had something like 100 columns and at least 30 indexes already. This table was fat. Many columns, many rows, mostly unused garbage.

And several of these "columns" were actually JSON blobs that probably should've been in their own table. But what do I know. I left!

JohnBooty · on July 22, 2022

When we talk about best practices, it's probably not too useful to talk about these sorts of "extremely high technical debt situations."

Once you have an existing tirefire, it's a matter of damage control and "what is the least bad way to accomplish new functionality?" and the answer to that will always be unique for each unique tirefire.

robertlagrant · on July 19, 2022

Triggers don't solve what the author's getting at, i.e. who knows what else outside this database changes upon deletion that would need to be reversed? It's all in the article.

efsavage · on July 19, 2022

Also if people know that deletion is reversible, they're more likely to actually do it, which can keep things generally tidier.

I don't actually like using a "deleted" column, my standard table has a status column, and deleted is one of those states, along with active/pending/suspended/etc, as the needs dictate. This way I get soft deletes for basically free both in the schema, but also in the queries (which would generally default to active), so it's not really the spaghetti that the author discusses.

JohnBooty · on July 19, 2022

For some applications this is fine, depending on your app/business logic but for a lot of applications states like active/pending/suspended and "deleted" are not mutually exclusive.

Suppose I soft-delete an active, pending, or suspended user using your scheme.

Now I need to un-delete the user. What status should they have? We don't know.

This is another "best practice" I've learned over the years. Those flags/statuses you think are mutually exclusive? Maybe they aren't. Or maybe they are now but won't always be. It's usually easier in the long run to give each status its own column even if you think they'll always be mutually exclusive. Because I mean, what are you really saving? A few bytes per record? In most cases that's not worth it.

robertlagrant · on July 19, 2022

That still has the same issues. You have to remember to set every linked table's records to the same state, or remember to query every linked table through the table that has the lifecycle column on it.

jwarden · on July 20, 2022

Forgetting to include the lifecycle column in your query may mean not understanding what question you are trying to answer with your query.

dubswithus · on July 20, 2022

There are some Rails gems that take care of this by scoping associations (that also have the deleted_at column).

rjzzleep · on July 19, 2022

In rails you get these things for free. What I don't get is why everyone rolls their own framework with node.js. It's basically 90s PHP all over again.

EDIT: Soft delete is a trivial piece of code when the framework has a well defined transaction system for its ORM. It's not really related to Rails per se. Your statement is extremely disingenuous, while trying to look smart. Audit trails _can_ be(but don't have to be) more complex, especially when the framework uses a lot of stored procedures to handle operations. But other than that these frameworks are specifically designed to REDUCE complexity of such operations, dependency costs - which are huge in node.js, specifically because you can mix and match anything into everything.

Node.js people tend to stitch together XSS solutions, random templating solutions based on their frontend work, even basic salting of auth passwords becomes unpredictable because you have 30 options on minimal auth libraries.

But yes nothing is ever free. If you want to use Rails you still have to learn ruby and the framework and a basic understand of how ActiveRecord builds queries if you want to be writing performant code. And the same applies to Laravel, Django, or whatever of the 50 patchwork node.js solutions you want to base your code on.

joshmanders · on July 19, 2022

I don't know why you're ragging on Node.js users or even PHP for that matter as both ecosystems have this stuff covered too.

Also you're comparing language/runtime with an actual framework and then dogging those users...

If you want to compare Rails with Node/PHP then I'd suggest comparing with things like Laravel (PHP), Adonis (Node) and you'll find everything you can do in Rails is done in Node/PHP too.

nightpool · on July 19, 2022

What percentage of production Node.js systems use Adonis? It's probably vanishingly low. Laravel is nice, and it's been gaining a lot of adoption in new applications, but the GP said "90s PHP" for a reason. Modern Node JS backends are often littered with hand rolled SQL queries, poor MVC separation, and lots of shoddy, half baked model layers—often without using any library whatsoever. Which is a real shame, because I personally really love programming in JavaScript, and the performance is great, but the strength of Rails as a framework draws me back in every time.

tored · on July 20, 2022

How is hand rolled SQL queries something bad?

likortera · on July 26, 2022

Bad. Like really bad. Unless you're doing it for some small corner case where you need up to the list bit of performance, raw SQL queries and mapping them to objects or manually moving or many pulsating the result sets manually is really bad. Specially for others having to maintain your code 3 years later.

There's a reason ORMs are so popular.

tored · on July 29, 2022

Personally I prefer the opposite, I have managed many old projects (10+ years old) and I always prefer raw SQL over some outdated ORM that is hard to configure and finetune. With SQL I have access to one of the most powerful query languages in existence that can never be matched by an database layer written in general purpose language.

Raw SQL can be a bit problematic if it is spread out in the entire code base with little structure ("littered"), however that is relatively easy to fix by centralizing the queries with a repository pattern. It is much harder to fix an ORM ten years later.

Do we know why ORM's are popular? I don't think it has to do with that ORM's are good, rather that developers want to avoid SQL because lack of experience, but that is me guessing based on my prejudice (I was myself in that category as a junior).

I have written in the past why ORM's are bad https://news.ycombinator.com/item?id=27407690

joshmanders · on July 20, 2022

And I'm sure there's many similar situations like that in Ruby, Python, and other languages, you just don't hear about them because they don't power 99% of the internet like PHP and JavaScript does. ;)

abaldwin99 · on July 20, 2022

I have to admit I'm getting a bit of fatigue reviewing frameworks in Node that are claimed to be batteries-included and similar to Rails/Django/Laravel. SailsJS, Redwood, NestJS, Remix, Prisma (ORM compared to Active Record). They all have an exciting landing page, clean looking documentation... but once you dig in further the feature set starts looking very <= Rails v1.0.

Anyways I appreciate the name drop of Adonis. I'll have to give it a fair shake when I'm not feeling quite as cynical because I _really_ do want to find a solid Node framework in this area.

likortera · on July 26, 2022

Most "full stack" frameworks in the node ecosystem consider themselves "full stack" just because you can run code on the server. See the Fresh (deno) discussion here in HN from a few weeks ago.

To me a framework which doesn't come with a clear way to do translations, background jobs, query builder, ORM, migrations, asset bundling, validations, error handling, etc it is not full stack. It is just "server code compatible" at best.

likortera · on July 26, 2022

Adonis is far, far, far from reaching even a small percent of what Laravel is. It has almost no community and very little or almost no third party packages. When I tried it las year I couldn't even find a formatter for the templates.

Later I've found out the author of the framework has fake accounts pretending to be a happy user of it, and that was the end of it for me.

So, not comparable at all in my opinion.

sky_rw · on July 19, 2022

Nothing is free. In Rails you have _currently well maintained libraries_ for this. There are still complexity costs, dependency costs, data costs, performance cots, etc, etc, etc.

tomschlick · on July 20, 2022

> It's basically 90s PHP all over again.

Which is funny because Laravel give you this for free as well through it's ORM. Soft deletes are an easily solved problem with $table->softDeletes() in your migration.

likortera · on July 26, 2022

I think the key on this comment is "90s". Modern PHP is very different. As you say, Laravel makes trivial and a joy to use things that in 90s PHP are like today's node.

jesseryoung · on July 19, 2022

I was getting ready to disagree with you - but then I tried to think of any time I've actually pushed code to production with the "DELETE" keyword in it. The problems that I've had to solve in my career very rarely call for deleting something.

"Soft deletion" and "audit trail" are technical terms we developers come up for solutions the business wants but maybe hasn't asked for yet. It's not really a soft deletion it's a "deactivate" or "hide". Likewise, it's not an audit trail it's a "history" or "undo". Most of the time your stakeholders and users actually want these features, but don't ask because they perceive this as more expensive to build then just a "delete" button.

pwm · on July 20, 2022

> Likewise, it's not an audit trail it's a "history" or "undo".

Depends on the industry. The one I work in audit trail is a well-defined and mandatory business concern.

augustl · on July 19, 2022

This is why I don't understand why Datomic isn't more popular. Pretty much every system I've worked on never needed to scale past 100s of writes per second due to hard limits on the system (internal backoffice stuff, fundamenally scoped/shardable to defined regions, etc etc). And since Datomic is built with that in mind, you get the trade-off of full history, first class transactions and being able to query for things like "who changes this attribute to its current value, when, and why" is such as super power!

Scarbutt · on July 19, 2022

Too niche of a technology, tied to clojure, not open source and very slow(doesn't matter for most internal apps though). For many, it's better to do the tedious thing here and there with postgres. SQL also has strong grip on databases and if you look at it the other way around, postgres has lots of features that datomic lacks, with datomic you almost always need a secondary database.

tsuujin · on July 19, 2022

It’s hard to get adoption for expensive toys.

I think Datomic is neat, and I’d like to use it, but it is prohibitively expensive for a personal or hobby project. Personal projects are where I get excited about tech, and when I’m excited I’m more likely to adopt it in my day job.

They’re really shooting themselves in the foot by not having a one-click install free tier or a self hosted option.

marcofiset · on July 19, 2022

You may want to look into XTDB then. Not quite the same as Datomic, but they share many similarities. It's free and open-source.

tsuujin · on July 19, 2022

Neat, thanks!

nightski · on July 19, 2022

Personally looking at their pricing since it is so tied to AWS it is completely non-transparent how much it's going to cost us now or in the future.

I really like the concept of datomic though.

rjbwork · on July 19, 2022

In the past I've used MSSQL's Temporal Tables (also called System-Versioned Tables) to implement this kind of functionality. This also gets you, for free, Type 2 SCD functionality for OLAP-style queries.

I can't wait until Postgres has this kind of functionality baked in. It's such a nice feature.

dgb23 · on July 20, 2022

You likely know that there is an extension for PG that does this.

th0ma5 · on July 19, 2022

For me it is simply that isn't open source (at least last I checked.)

synthc · on July 19, 2022

Datomic is great but i think its missing some features that many enterprises need (access control and a query planner). Also it seems to be mostly built for DynamoDB/AWS.

wnevets · on July 19, 2022

This is something that I was forced to learn the hard way more than once. Literally today I needed to undelete a record because a customer was confused by what the "delete" button did and wanted their record back.

wst_ · on July 19, 2022

Isn't it the problem of UI, though. If the user would be informed about the consequences (possibly with bold red font and with a confirmation checkbox) would they still click that button?

doctor_eval · on July 19, 2022

File this under "falsehoods programmers believe about users": they act rationally.

Aeolun · on July 20, 2022

Delete

“I’m sorry sir, I didn’t think that clicking that button would actually delete something!”

“What exactly did you think that button would do?”

doctor_eval · on July 20, 2022

"falsehoods programmers believe about users #2": they read button labels and dialog box text.

thrashh · on July 19, 2022

As a user, I would still sometimes and then regret my decision

Soft deletions are awesome

wnevets · on July 19, 2022

All of that exist.

nightpool · on July 19, 2022

robertlagrant · on July 19, 2022

One will, one day.

corrral · on July 19, 2022

I'd add tagging, for anything that could conceivably use it, when you're doing DB design. May as well start with support, even if the functionality's initially dormant. Someone will ask for it, directly or indirectly, and it won't take long before they do.

JohnBooty · on July 20, 2022

I personally would not agree.

But... your experiences are also real and I believe you when you say that your experiences DO support your conclusion. :)

The "soft delete" design decision is usually pretty impactful and is in my experience potentially much more of a pain in the butt to implement later if you haven't included it from day 0.

Audit trails and soft-deletes are also crazy useful for developers (both for debugging and for general cover-thine-ass utility) even if end users never touch them.

Whereas, tags are easier to tack on later and are not intrinsically useful to developers.

dgb23 · on July 20, 2022

I agree. But if you’re asked to implement hierarchical categories, you implement them as tags underneath. This doesn’t cost you any time. When eventually they realize that categories don’t work (they never do) you can just turn a switch.

nicoburns · on July 19, 2022

+1 on audit trails. And one should always store audit trails in machine readable format. That way you can not only manually inspect what happened, but you can query it too (and reconstruct the entire state as it existed in the past if necessary).

aboodman · on July 20, 2022

  > Somebody always wants to undelete something
  > or examine it to see why it was deleted
  > or see who changed something, or blah blah blah

People for whom this resonates should look into Dolt: https://dolthub.com/.

It's a mysql-compat database that is versioned and forkable - basically imagine git and mysql had a baby. Every transaction creates an entry in the commit log that lets you see when, how, and why the database changed. And just like git you can `dolt checkout <hash>` to see what the data was like at some point in time, or `dolt checkout -b temp-branch <hash>` to make exploratory changes to some historical version, or `dolt revert` to revert a transaction... etc.

There is a lot more power that comes with making the entire database versioned and forkable by default. For example it makes it much easier to recover from catastrophic bad code pushes, etc.

note: Dolt was forked from Noms, my previous project, but I don't work for Dolt or have a stake. Just a fan.

makeitdouble · on July 19, 2022

> unless there's some reason not to.

Yes.

To note, with GDPR there's now legal reasons to do so regarding user personal data. That can be the moment the devs realize they actually can't delete the data, because they soft deleted for so long, many relations are now interlocked and the data model needs to be changed to give a starting point to the deletion cascade.

My lesson from that was to at least have one test deleting a mock user that spans the maximum data breadth of the service . We caught a bunch of these loops in test at dev time and that was pretty great.

JohnBooty · on July 20, 2022

    My lesson from that was to at least have one test 
    deleting a mock user that spans the maximum data 
    breadth of the service . We caught a bunch of these 
    loops in test at dev time and that was pretty great.

Thank you. That is SUPER insightful. If it wasn't too late to edit my initial post I would add this.

Falkon1313 · on July 22, 2022

Good point. That could be tricky when it crosses the boundaries of schema changes and data migrations.

cal85 · on July 20, 2022

> Somebody always wants to undelete something, or examine it to see why it was deleted, or see who changed something, or blah blah blah.

In my experience this happens “rarely”, not “always”.

It can happen, and in some ultra-rare cases the impact of not being able to recover some data might be huge (company-ending, even), and engineers are good at worrying about such edge cases. That’s why we habe protective measures like soft deleting and event sourcing - because of nightmare edge cases, not because we are always having to actually use them. It’s driven by engineers avoiding their worst nightmare: having to say “I’m sorry, I cannot solve this problem for you. The data is gone.” It’s a peace-of-mind thing, not an everyday-need thing.

nivertech · on July 21, 2022

> That’s why we have protective measures like soft deleting and event sourcing

IMO soft deletion is a hack trying to fix problems in CRUD, which is a hack in itself.

CRUD attempts to model everything in the universe as a collections of mutable items, loosely based on RDBMS/SQL.

Event Sourcing is more realistic: it models everything as a append-only logs of immutable events/facts, which preserves both the historical data and, more importantly, the original intent.

Unlike CRUD, Event Sourcing is technology agnostic.

femiagbabiaka · on July 20, 2022

Having multiple use cases for data is normal and okay. Treating your primary data store that is designed for one set of use cases and using it for them all is very very bad, even before you hit scale. Know which data store to use and when for a use case is a super power than can allow you to scale to much higher magnitudes than one would intuit.

dcow · on July 20, 2022

What's wrong with the author's proposed compromise/solution?

ReptileMan · on July 20, 2022

The proper thing is database snapshots or something like that.

I don't think soft delete is wrong per se, but it is something that should be native to the database engine.

cpursley · on July 20, 2022

Can you recommend any best practices, design patterns, etc - for implementing soft delete in SQL systems?

tomschlick · on July 20, 2022

The biggest one is to use a timestamp instead of a boolean. For instance Laravel used `deleted_at` as the field name with the type of TIMESTAMP (for mysql at least).

This allows you to do a `select * from users where deleted_at is null` to get the active records, but also know when the user was deleted if you need to audit / rollback.

JohnBooty · on July 20, 2022

I think the best answer is to optimize for the expected data and use cases.

- How many records are there? Enough that performance will be an issue? If I'm planning on keeping things in the same table, can I utilize database features like indexes and partitions to mitigate any perf issues? (For some access patterns, partitions might solve 100% of your perf concerns)

- How common is deletion? Are we expecting 1% of the records to be soft-deleted, or more like 90%? If it's the latter, you may not want all those records clogging up your main table.

- Is this greenfield development, or am I adding soft-delete to an existing app? Greenfield favors "same table" soft delete; if you're retrofitting an existing app it may be better to keep deleted stuff in a separate table so that you don't break existing functionality.

- What do you want to do with the soft-deleted records? Are there times when you want to treat them just like regular records, e.g. "Give me a list of all the users who joined last week, even if we've deleted their accounts?" If the answer is "yes" then a lot of those things will probably be easier if you keep deleted and non-deleted in a single table.

brightball · on July 20, 2022

Agreed. The key to soft deletes is to actually move the record out of the original table to ensure they don’t accidentally end up in a query join.

But there’s always something that needs to be undeleted. You can either have an easy way to do it or restore an entire DB backup and cross query the missing records. Soft deletes are a lot easier.

4m1rk · on July 20, 2022

It's audit log enough for that? As deletion should also be logged.

xxs · on July 20, 2022

GDPR compliance and all - we encrypt and delete stuff... or at the very least "lose" encryption keys 1st.

pg_1234 · on July 19, 2022

This 100%

dafelst · on July 19, 2022

Views are a simple solution to this problem. Pretty much all moderns RDBMSs support updatable views, so creating views over your tables with a simple WHERE deleted_at IS NULL solves the majority of the author's problems, including (IIRC) foreign key issues, assuming the deletes are done appropriately.

I feel like a lot of developers underutilize the capabilities of the massively advanced database engines they code against. Sure, concerns about splitting logic between the DB and app layers are valid, but there are fairly well developed techniques for keeping DB and app states, logic and schemas aligned via migrations and partitioning and whatnot.

pbardea · on July 19, 2022

> assuming the deletes are done appropriately

This is one gripe I have with soft-deletion. Since I can no longer rely on ON DELETE CASCADE relationships, I need to re-defined these relationship between objects at the application layer. This gets more and more difficult as relationships between objects increase.

If the goal is to keep a history of all records for compliance reasons or "just in case", I tend to prefer a CDC stream into a separate historical system of record.

dspillett · on July 19, 2022

> Since I can no longer rely on ON DELETE CASCADE relationships

Cascaded deletes scare me anyway. It only takes one idiot to implement UPSERT as DELETE+INSERT because it seems easier, and child data is lost. You could always use triggers to cascade you soft-delete flags as an alternative method, though that would be less efficient (and more likely to be buggy) than the built-in solution that cascaded deletes are.

If you look at how system-versioned (or “temporal”) tables are implemented in some DBMSs, that is a good compromise. The history table is your audit, containing all old versions of rows even deleted ones, and the base table can be really deleted from, so you don't need views or other abstractions away from the base data to avoid accidentally resurrecting data. You can also apply different storage options to the archive data (compression/not, different indexes, ... depending on expected use cases) without more manaully setting up partitioning based on the deleted/not flag. It can make some query times less efficient (you need to union two tables to get the latest version of things including deleted ones, etc.) but they make other things easier (especially with the syntactic sugar like AS AT SYSTEM_TIME <when> and so forth) and yet more things are rendered possible (if inefficient) where they were not before.

> I tend to prefer a CDC stream into a separate historical system of record.

This is similar, though with system versioned tables you are pretty much always keeping the history/audit in the same DB.

---

FWIW: we create systems for highly regulated finance companies where really deleting things is often verboten, until it isn't and then you have the other extreme and need to absolutely purge information, so these things are often on my mind.

michaelt · on July 19, 2022

> It only takes one idiot to implement UPSERT as DELETE+INSERT because it seems easier, and child data is lost.

Seems unfortunate to miss out on all the referential integrity benefits of a serious database when hiring standards, training and code reviews should all be preventing idiotic changes.

If I’m making a shopping cart system, I want to know every order line belongs to an order, every order belongs to a user and so on. Anyone who can’t be trusted to write an update statement certainly can’t be trusted to avoid creating a bunch of orphan records IMHO.

mlyle · on July 20, 2022

> Seems unfortunate to miss out on all the referential integrity benefits of a serious database when hiring standards, training and code reviews should all be preventing idiotic changes.

If you don't use ON DELETE CASCADE, the actual foreign key constraint gives you a meaningful error-- that you need to delete some stuff to have referential integrity.

ON DELETE CASCADE --- you're telling it "eh, if you need to delete some stuff to avoid an error, go ahead, don't bother me, do what I mean".

dspillett · on July 20, 2022

> Seems unfortunate to miss out on

You don't lose any referential integrity without cascades. Foreign keys are still enforced, just with an error if an action would break integrity rather than automatic deletes to satisfy the constraint that way.

> when hiring standards, training and code reviews should all be preventing idiotic changes

I was burned by this sort of thing early on, in companies where I had few such luxuries. Even though things are done better now, I'm still paranoid of that one day someone skips procedure and somehow lets a problem through all the QA loops.

> If I’m making a shopping cart system, I want to know every order line belongs to an order, every order belongs to a user and so on.

I take it from the other side: I want to know that if something is referred to elsewhere it can't be deleted until that is resolved.

If a top-level manager leaves I want an error if the hierarchy hasn't been updated before deleting his record¹, rather than his underlings, their underlings, their underling's underlings, … , being deleted when that one person is!

----

[1] Obviously this would normally be a soft-delete, there may be a lot of records referring to such an individual not just other person records. If you actually need to delete them (right to be forgotten etc.) then you need to purge the PII but keep the record so other things still hang together.

wvenable · on July 19, 2022

Often you don't have to rely on ON DELETE CASCADE relationships. Because you are never deleting anything, you will never have any orphaned records. If you don't want to see say Invoices for a deleted Customer then that's just another filter feature.

Mostly I use soft-delete because for auditing requirements we pretty much can't remove anything but also because nothing ever truly goes away. If we have an Invoice or Order then, from our perspective, we must have those forever even if the corresponding client is deleted and can never place another one.

nkozyra · on July 19, 2022

> Often you don't have to rely on ON DELETE CASCADE relationships. Because you are never deleting anything, you will never have any orphaned records

Exactly. Unless you're doing something silly like adding deleted at to bridge tables ... which, you probably don't need even in 1:many.

dexwiz · on July 19, 2022

You may end up doing this anyways if you have any application code that needs access to delete hooks, or access control varies across objects. At this point, you are probably using a ORM instead of direct queries, and place logic that could be in the db instead at the app layer.