Soft deletion probably isn't worth it (brandur.org)
654 points by lfittl on July 19, 2022 | 494 comments



I've been a software dev since the 90s and at this point, I've learned to basically do things like audit trails and soft deletion by default, unless there's some reason not to.

Somebody always wants to undelete something, or examine it to see why it was deleted, or see who changed something, or blah blah blah. It helps the business, and it helps you as a developer by giving you debug information, as well as covering your ass when you're blamed for some data loss bug that was really user error.

Soft deletion has obvious drawbacks but is usually far less work than implementing equivalent functionality out-of-stream, with verbose logging or some such.

Retrofitting your app with soft deletion and audit trails after the fact is usually an order of magnitude more work. You can always add it pre-launch and leave it turned off.

If performance is a concern, this is usually something that can be mitigated. You can e.g. have a reaper job that runs daily and hard-deletes everything that was soft-deleted more than n days ago, or whatever.
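
For example, a minimal sketch of such a reaper in PostgreSQL-flavored SQL (table and column names are hypothetical, assuming a nullable deleted_at timestamp):

  -- Run daily from cron or a job scheduler: hard-delete anything
  -- that has been soft-deleted for more than 30 days.
  DELETE FROM users
  WHERE deleted_at IS NOT NULL
    AND deleted_at < now() - interval '30 days';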


The author uses the "no one ever undeleted anything" as the primary justification. I think this is the part they miss. I've never undeleted a user either, but there have been many times I've gone back to look at something. Either a complaint finally gets around to me as to why the user wanted their account deleted (e.g. feature not working) and it helps to figure out why. Or they're returning and want things set up like they were. Or someone is taking over their role and needs to be set up like the last person who's already gone.

Though you really shouldn't be relying on a database for an audit trail. It might help find some issues, but things actually used for security shouldn't be writable so easily.


    I think this is the part they miss. I've never 
    undeleted a user either, but there have been many 
    times I've gone back to look at something.
Yeah. As far as a user-facing "Undelete" button existing or being used... that's very rare in my experience.

What's much more common is a user accidentally deletes some data. They deny they made an error. The developers are blamed. You then have to go on a wild goose chase figuring out if it was possible for the app to actually screw up in that way. There's usually no definitive answer, and even if there is, management can't understand it. And regardless of how any of that plays out, you still probably have to try and recover the data from backups or something.

Alternately, maybe it was the app's fault. Still plays out nearly the same!

Soft deletes and/or audit trails save you from all of that.

    Though you really shouldn't be relying on a 
    database for an audit trail. It might help 
    find some issues, but things actually used 
    for security shouldn't be writable so easily. 
I mean, at some level you need to trust the database right?

Been ages since I did it, but it's usually possible to set up a "secure" audit trail with use of database permissions. For example, the application's DB credentials can have SELECT and INSERT permissions on the audit trail table, but no UPDATE or DELETE perms.
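
As a rough sketch in PostgreSQL terms (role and table names made up):

  -- The app can read and append to the audit trail, but never rewrite history.
  GRANT SELECT, INSERT ON audit_trail TO app_user;
  REVOKE UPDATE, DELETE ON audit_trail FROM app_user;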

How would you set up a secure audit trail that didn't rely on the application and/or database at some level? Even if it lives outside of the database, that data came from the database.

Not a rhetorical question. Genuinely curious!


+1 to investigating user error. It also happens that you have bugs, and it's hard to spot the bugs if you don't have a quick way of filtering out all the cases of user error.

> Yeah. As far as a user-facing "Undelete" button existing or being used... that's very rare in my experience.

This is pretty common in my experience, but often with slightly different terminology. I've seen "delete"/"undelete" but also "move to trash"/"restore from trash" or "mark for deletion"/"unmark for deletion" and even "archive"/"restore from archive" where deleted/marked/archived objects were excluded from normal business processes and were hidden in the UI unless the user went to a special page where they could search, view, and restore.

I've also seen chat functionality where users had the ability to delete messages but admin users had the ability to view the full history of a chat, including deleted messages. (The software was used by businesses to communicate with their customers, who were also businesses, so supervisors being able to read deleted chat messages was appropriate.)

In my experience soft delete is basically free if you do it from the start. I've never regretted soft delete but have always regretted the lack of it. However, it's easier if you also have a data retention policy from the start so you're forced to create automated cleanup processes and maintain them as you go along.


    This is pretty common in my experience, but often with 
    slightly different terminology. [...]

    I've also seen chat functionality where users had 
    the ability to delete messages but admin users had 
    the ability to view
Great points! Slack actually works that way, if I'm not mistaken. Admins can read deleted messages and, I think, the edited versions of messages right? Not sure how they store things internally, but you're right - it's common functionality in many classes of apps.

    In my experience soft delete is basically free if you do it from the start.
This is my experience too.


    How would you set up a secure audit trail that didn't rely on the application and/or database at some level?
A database is fine for an audit trail. But the application shouldn't have permissions to cover up a change. Likewise, the audit trail should record that a record was created, deleted, maybe undeleted, etc., and when, and by whom. Relying on the "deleted" flag alone only shows you the current state, not what was done, when, and by whom.
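
As a sketch of what such a trail might capture (hypothetical PostgreSQL schema, not from the article):

  CREATE TABLE audit_trail (
      id          bigserial   PRIMARY KEY,
      table_name  text        NOT NULL,               -- which table was touched
      record_id   bigint      NOT NULL,               -- which row
      action      text        NOT NULL,               -- 'create', 'update', 'delete', 'undelete'
      actor_id    bigint      NOT NULL,               -- who did it
      occurred_at timestamptz NOT NULL DEFAULT now(), -- when
      change      jsonb                               -- optional before/after snapshot
  );

Pair it with insert-only grants for the app role and the application can append history but never rewrite it.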


Any tools you recommend here? Or just roll your own with triggers in the database?


Log-based change data capture is a great tool for building audit logs; unlike triggers, it doesn't impact write performance, you don't need to maintain triggers in the first place, and you can have your audit log in a system separate from your main database.
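
As a very rough sketch, registering a Postgres connector through Kafka Connect looks something like this (all values are made up, and property names vary a bit across Debezium versions, e.g. topic.prefix vs. the older database.server.name):

  {
    "name": "audit-source",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "plugin.name": "pgoutput",
      "database.hostname": "db.internal",
      "database.port": "5432",
      "database.user": "debezium",
      "database.password": "********",
      "database.dbname": "app",
      "topic.prefix": "app",
      "table.include.list": "public.users,public.orders"
    }
  }

Every change to the listed tables then lands on a Kafka topic, and a consumer can write it to whatever audit store you like.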

Blogged about one possible implementation using Debezium a while ago here: https://debezium.io/blog/2019/10/01/audit-logs-with-change-d...

Disclaimer: I work on Debezium


I think you'll find that most operating systems have this built in: syslog on Linux and the event log on Windows. We actually use Splunk, which has lots of tools for collecting data from apps using different methods, and also tools for viewing and querying the data.


I've long wished for an RDBMS (or perhaps ORM, but I think it would be extra cool at the database level, see below) that does things that don't hyperscale.

So many LOB applications (the part of the iceberg under the water) have modestly sized databases with a relatively consistent number of users. The data is small, but often complex, awkwardly structured, highly relational, etc. The challenges these applications face are different, but real.

An example of something that would be incredibly useful here but completely untenable for a Facebook-scale system would be a rich rewind mechanism. You could go back in time 15 minutes, or execute a query like:

RESTORE my_db TO LAST POINT WHERE EXISTS (SELECT * FROM ACCOUNTS WHERE ID=1234)

How much trouble would that kind of query have saved you in your career? At some point this would be cost prohibitive but I bet 50% or more of all apps never reach that point.


I think what you're looking for is called bitemporality (https://en.wikipedia.org/wiki/Bitemporal_modeling).

It allows you to query a system (some databases support this) in a temporal context. This is useful in financial services where you have market data baked into a price you've offered to a client, where the underlying market data might have been later corrected by a vendor.

So you can query how the world looked last Tuesday 4pm; but also, given what we now know, how the world ought to have looked last Tuesday 4pm.

Interesting stuff.
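
A rough sketch of what a bitemporal row layout can look like (column names purely illustrative):

  CREATE TABLE prices (
      instrument_id bigint      NOT NULL,
      price         numeric     NOT NULL,
      valid_from    timestamptz NOT NULL,  -- when the price applied in the real world
      valid_to      timestamptz NOT NULL,  --   (valid/application time)
      recorded_from timestamptz NOT NULL,  -- when we learned and recorded it
      recorded_to   timestamptz NOT NULL   --   (transaction/system time)
  );

  -- "How did the world look last Tuesday 4pm, as we knew it then?"
  SELECT * FROM prices
  WHERE valid_from    <= '2022-07-12 16:00' AND valid_to    > '2022-07-12 16:00'
    AND recorded_from <= '2022-07-12 16:00' AND recorded_to > '2022-07-12 16:00';

Swap the recorded_* predicate for now() and you get the "given what we now know" view instead.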


> This is useful in financial services where you have market data baked into a price you've offered to a client, where the underlying market data might have been later corrected by a vendor.

A long time ago, I worked on such a system, which provided service-based billing for an industry where rates were very fluid. Having this "at any point in time" view of key data was essential for many things.


CockroachDB has "AS OF SYSTEM TIME", which even allows you to backup the database for that moment.

https://www.cockroachlabs.com/docs/stable/as-of-system-time....
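
For example (hypothetical table, but the AS OF SYSTEM TIME syntax is CockroachDB's):

  -- The accounts table as it looked ten minutes ago.
  SELECT * FROM accounts AS OF SYSTEM TIME '-10m' WHERE id = 1234;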


CockroachDB has been a font of cool ideas from the beginning IMO.


BigQuery has this too, it’s extremely useful.


MS SQL Server has a fun feature called temporal tables that does some of this https://docs.microsoft.com/en-us/sql/relational-databases/ta...
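
For example, against a system-versioned table it looks roughly like this (table and column names made up):

  -- Rows as they existed at a given point in time.
  SELECT *
  FROM dbo.Employees
  FOR SYSTEM_TIME AS OF '2022-01-01T00:00:00'
  WHERE LastName = 'Chung';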


This is very interesting!

I haven't used this feature, is it possible to twist this to find a time associated with a table state so that you can then query the table at that time (allowing you to answer a question like what was the value of X the last time Y was true)?


You could replay the change feed from CRDB into Materialize and catch the timestamp(s) at which some query results pop out, using an incremental query in what is basically a linear search.


In Oracle you have the flashback tech

  SELECT * FROM employees
  AS OF TIMESTAMP
  TO_TIMESTAMP('2004-04-04 09:30:00', 'YYYY-MM-DD HH:MI:SS')
  WHERE last_name = 'Chung';


For PostgreSQL with WAL logs, or in a service like RDS, can you not do a point in time recovery? Seems like an easy way to go back in time, recover the data you want, dig out any FK referents that might also have been cascade deleted etc. If you also maintained an audit log you should be able to isolate the point where the given data was deleted (if its id is in the audit log), and restore from just before it, to capture the deleted data, query it and reinsert in master.
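
As a rough sketch for PostgreSQL 12+ (paths and timestamps are made up; on RDS the equivalent is driven from the console/API rather than these files):

  # postgresql.conf on the instance being restored from a base backup,
  # plus an empty recovery.signal file in the data directory
  restore_command        = 'cp /mnt/wal_archive/%f "%p"'
  recovery_target_time   = '2022-07-19 13:55:00'
  recovery_target_action = 'pause'   # pause at the target so you can inspect and extract data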


One of the common approaches I've seen is writing logs in a write-only way. The most common deployment is having the audit logs set up so they go to files in an AWS S3 bucket owned by a completely separate AWS account, with write-only credentials granted to the application doing the log writing. It's effectively "dumped" there and cannot be seen or touched even by the writing application. Sort of like having a one-way valve in your logging "pipeline".
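
A rough sketch of the kind of IAM policy attached to the writing application's role (bucket name made up); the absence of any Get/List/Delete actions is what makes the valve one-way:

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "WriteOnlyAuditLogs",
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::example-audit-logs/*"
      }
    ]
  }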


This is something that public or private blockchains could be helpful for. Since everything is built on the hash of what came before, you can’t delete something without leaving a trail.


>I've never undeleted a user either, but there have been many times I've gone back to look at something.

I, for one, have undeleted things tons of times, taking them out of the trash can before emptying it, undoing the delete action (in apps where this is possible), and so on.


I've messed up a mass update query enough times to always leave myself SOME provision to undo it.

The exceptions are when there is a well tested query that affects a single account or something. Like GDPR


>The author uses the "no one ever undeleted anything" as the primary justification. I think this is the part they miss.

But did they though?

"Although I’ve never seen an undelete work in practice, soft deletion wasn’t completely useless because we would occasionally use it to refer to deleted data – usually a manual process where someone wanted to see to a deleted object for purposes of assisting with a support ticket or trying to squash a bug."


It may be convenient, but under the GDPR it is illegal. When a user deletes an account, all the personal data associated with that user must be deleted (or anonymized in a way that makes it no longer possible to associate it back to the particular user).

You cannot just keep user information forever "just in case" they are useful again.


It's illegal, but companies don't necessarily care enough to avoid soft deletes regardless. I think companies wait to get sued so they can try to argue in court why their soft deletions are reasonable and why it's too technically difficult for them to do hard deletes.

To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software. One of those details that I'm both surprised and unsurprised doesn't seem to have been addressed legally. I feel like a court ought to at least buy this aspect of the argument, so maybe they'll buy that it can be difficult in terms of the software too? Who knows. My guess is that a reasonable court would accommodate something that's reasonable for a given company, but there are lots of variations that could fall into that category.


> I think companies wait to get sued so they can try to argue in court why their soft deletions are reasonable and why it's too technically difficult for them to do hard deletes.

The law is clear, you can't keep user data if the user decides that you should no longer have them. It's your own responsibility to find a way to do that.

> To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software. One of those details that I'm both surprised and unsurprised doesn't seem to have been addressed legally. I feel like a court ought to at least buy this aspect of the argument, so maybe they'll buy that it can be difficult in terms of the software too? Who knows. My guess is that a reasonable court would accommodate something that's reasonable for a given company, but there are lots of variations that could fall into that category.

The deletion should be done in reasonable terms, i.e. it should not be reasonably possible to get the data back. Of course, when you delete something from a drive, whether an SSD or HDD, the data is recoverable until that block is reused, but that is another matter.

By the way, it's not really difficult to securely erase user data either: what you need is not to erase all the user's data, but to encrypt all of it with a symmetric-key algorithm (such as AES, which is super fast) and store the key only in a system that gives you secure-erase capability. When the user deletes their account you don't have to delete all the data, just the encryption key.
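
A sketch of that idea at the data-model level (names hypothetical; in practice the key store might be a KMS/HSM rather than a table):

  -- Everything belonging to a user is stored encrypted with a per-user data key.
  CREATE TABLE user_data_keys (
      user_id  bigint PRIMARY KEY,
      data_key bytea  NOT NULL   -- AES key, ideally itself wrapped by a KMS master key
  );

  -- "Hard delete" for erasure purposes: drop the key and the remaining
  -- ciphertext elsewhere becomes unreadable.
  DELETE FROM user_data_keys WHERE user_id = 1234;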


These points have already been discussed, see the sibling threads.


GDPR fines are structured to make it extremely risky/expensive to take the “ask for forgiveness not permission” approach you’re suggesting. I think for this reason, the fine is designed to scale with the size of the offending company’s revenue, not so much with the actual damages suffered by anyone. Take a look at some of these fines, tens of millions of euros because the UI for your cookie consent dialog is poorly designed? https://www.tessian.com/blog/biggest-gdpr-fines-2020/

When the GDPR rules came out our company (pretty small at the time) pulled out all the stops to get ourselves into compliance, due to the above and other teeth put into the regulations. It seemed utterly irrational not to. Were we the exception, or the rule?


I don't know if you were the exception or the rule, but I know for a fact some do go the forgiveness approach for requirements that they believe would be disproportionately burdensome.

Probably almost all companies actually, including yourself (just to a different extent) - what did you do about the storage remapping thing I mentioned? Did it come up/did you guys discuss it? Do you believe you're in compliance despite your hardware (most likely) not guaranteeing erasure or overwriting of existing data? I'm curious how your assessment of that went, because I doubt one can be in strict compliance without guarantees from the hardware.


Please, don't give ideas to whomever among their lawyers reads this later :-). We don't need another Schrems II..


Sorry!


I think lawyers probably miss a lot of these technicalities (which explains why they can require something as technically unreasonable as Schrems II), but I just wonder if someone more technical can actually push for something like that, to win themselves some questionable points, in the name of improving privacy.


What a bizarrely dismissive comment. You could just as easily say that some developers miss a lot of the legal technicalities, which explains why they believe Schrems II is an unreasonable ruling.


Now I don't understand your comment, sorry :-)


What's the unreasonable thing? Not having your data HQ in a place vulnerable to unrelated government overreach?


Remapping of blocks isn't the problem here. IANAL, but you should be encrypting all the data for GDPR compliance to begin with. With it encrypted, how the storage device chooses to map its blocks is completely irrelevant. When the key goes away, all the data is erased by definition. Throwing away the key is one of the easiest ways to comply with Right to be Forgotten, from what I've seen.

"Oh, but the key is still on the hard drive!" you would likely complain. No... the key could be stored any number of places that won't leave a copy. If you store the key on a TPM/HSM, then tell the TPM to delete the key, short of finding out that the TPM manufacturer failed to perform their duty in erasing the key, and simultaneously discovering a 0-day exploit to go in and retrieve that deleted key, it is gone. Full stop. It is unreasonable to ask everyone to do something impossible like prove that their TPMs are working properly, when even the TPM manufacturers cannot definitively prove this, so that is clearly not what the law is asking. You might as well ask companies to prove that they've never been hacked. How can anyone prove that?

Conceptually, one of the simplest ways to comply with this would be to perform database failover once a month, since you have a month to comply with these requests. Before the failover, you would ensure that all data is deleted that has been requested to be deleted. Then you bring up a new database replica, stream the remaining data to it, failover, and then destroy the key stored in the TPM on the primary and any other replicas. All that deleted data is now gone forever.

There are more complicated architectures where you could have per-user keys which are stored on an encrypted volume, and then you would be managing a handful of keys in the TPM to handle rotation of the encrypted volumes. Once a month, you copy the per-user keys that haven't been deleted onto a new encrypted volume, then tell the TPM to delete the key for the old encrypted volume: garbage collecting the deleted keys in an unrecoverable fashion. This avoids the hassles with having to failover a database so frequently, although having that level of automation around database failover has advantages too.

I'm sure some companies are content with treating the deleted data as erased, as long as their drives are properly encrypted. A rogue employee could potentially recover the data while the hard drives and the keys are still together, but the hard drives will be worthless once they're separated from the keys after they fail and are thrown away. A rogue employee could have stolen the data before it was deleted anyways, so how do you prove that rogue employees don't exist? Is this level of erasure good enough? This one I'm a little iffy on, but it still completely resolves your issues with the block remapping.

GDPR makes reference to "taking account of available technology and the cost of implementation", so I think it is fair to say that GDPR is not asking you to erase any butterfly-effect waves that have propagated out from the read/write head of your hard drive. If the data is not recoverable by any known means, that is the purpose of the law, isn't it?

I will repeat that I'm not a lawyer, but your argument feels like a strawman designed to paint the Right to be Forgotten as something that no one can comply with. Your comments come off as snarky and acting like you think you're so clever. As if no one has ever thought about these problems before.


Per user encryption such as you suggest does not address the problem. Systems that work this way have existed for decades and they have pathologically poor performance and scalability. Many classes of major optimization in databases don't work under these constraints. This isn't a novel idea, having been implemented for decades; it has been broadly rejected because it doesn't actually work in real systems without tradeoffs no one accepts, even in highly security sensitive environments.

You are overly dismissive of the technical challenges of actually deleting data, particularly in large systems. Technically naive solutions like having separate encryption keys for individual entities sounds good but it doesn't actually address the problem.


I feel like you’re overly dismissive of the other things I mentioned. Database failover comes with inconveniences, but it does allow you to use a single key and get great performance. For massive databases, it might be impossible to do that way, but most companies don’t have massive databases. I also opened a question about how ”erased” deleted data needs to be to be compliant, since I would imagine a lot of companies consider deletion on an encrypted drive to be good enough, even if the data may theoretically be recoverable under extreme circumstances until that hard drive is completely overwritten multiple times or destroyed.

Per user keys are also very useful for use cases like a personal file storage service, even if traditional RDBMS don’t work well with them. Techniques are situationally dependent.

I was not being dismissive of the existence of technical challenges, but rather dismissive of comments like:

> To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software.

Which seem to completely deride the possibility of compliance.

Compliance is feasible, even if it has challenges.


Save yourself a lot of time, read jandrewrogers comments before responding to him.


coder543 initially responded to dataflow. Then jandrewrogers included an additional anecdote, but only in a single comment. As the thread stands at the time I posted this, jandrewrogers does not have multiple comments.


An ambiguous phrase - I was not criticizing you but only advising you read his historical statements on databases before engaging. He has wasted a great deal of time of other people.


Do you know how to read usernames?


Yes. Your point being what.


I am not the person you originally replied to but you (again) commented like I was. My comment was about how your comment erroneously targets the wrong person because you thought you were replying to one person when you were replying to another. You did it again in your previous comment hence my serious question.


It doesn't matter, it was advice for all.


It does, actually, matter. Notably because it causes confusion when trying to understand your point, among other things.


Is there literature on this topic? I'd like to learn at what scale per-user/tenant keys becomes untenable and the characteristics of systems that exhibit pathologically poor performance due to such a design.

Naively, it sounds like individual entity keys does solve the problem of deletion, but that your argument is that the tradeoffs aren't generally worth it?


I’d imagine a pretty small scale. Imagine an endpoint that lists user accounts. You would have to individually decrypt each row! And you would never be able to do database joins on data in these tables, because it wouldn’t be possible to decrypt it until it had already left the database.


With respect, your own comment reads as overly dismissive of the regulatory environment introduced by the GDPR. Systems (and companies) have for decades collected and stored personal data with little or no regard for the lifecycle of the data.

Since the introduction of the Regulation, companies must now refrain from the collection of personal data if they cannot satisfy the rights of the data subject, including the right to erasure (which is obviously not an absolute right).

If the technical solutions aren't performant or scalable enough for data controllers, then the result is that they cannot collect personal data. The idea that companies have in the past rejected solutions which protect the rights of data subjects because they were unwilling to accept the technical tradeoffs is exactly why strong regulation such as the GDPR is required.

Frankly, your comment reads as contemptuous to the rights of the end-user. Thankfully, the EU offers protection to its residents from people with that attitude.


I have no idea where the sudden hostility in your last paragraph came from, but it's ascribing nonexistent malice to me. I'll try to address a couple of your points the best I can regardless.

> Once a month, you copy the per-user keys that haven't been deleted onto a new encrypted volume, then tell the TPM to delete the key for the old encrypted volume: garbage collecting the deleted keys in an unrecoverable fashion. This avoids the hassles with having to failover a database so frequently

This ignores the existence of database indexes.

What I was saying (I thought quite explicitly) was that strict compliance to GDPR appears unreasonably burdensome to most companies, and thus most companies stop at a line they draw earlier (that line frequently being the lack of hardware guarantees re: block remapping). This is not the same as saying no one can comply with GDPR, and I had no intention of making GDPR look impossible to comply with. Quite the contrary in fact—I believe its current requirements can and should be complied with quite cheaply and efficiently, but the most effective path to that requires support from hardware manufacturers (and possibly others) that customers and regulators currently don't demand.

This is not a take-down of GDPR or a malicious strawman mockery of its requirements as you're (quite unkindly and wrongly) imputing to me. If anything, it was a plea for customers, regulators, and/or lawmakers to realize they need to expand their GDPR demands to require that vendors of other parts of the hardware/software stack (particularly storage device manufacturers) provide better support for GDPR compliance.


You made several comments in a row that were extremely dismissive of how GDPR compliance could be achieved, and your comments repeatedly questioned whether other people considered basic concepts of data erasure. I apologize if I misattributed malice, but I do feel those comments could have been worded differently.

Also, this quote:

> To be honest, in the age of modern overprovisioned storage drives that remap blocks frequently, I'm not really sure you can implement genuine "hard" deletes without choosing significantly unorthodox hardware (or destroying a drive every time you need a single bit erased), no matter how much you want to in software.

That quote feels incompatible with your statement that…

> Quite the contrary in fact—I believe its current requirements can and should be complied with quite cheaply and efficiently

How can it be complied with cheaply and easily if it can’t be complied with without hardware support no matter how much software wants to? Surely you can see my confusion. Your last comment is far more in line with what I would expect to see in a discussion like this.

Regardless, I think I understand the argument in your last comment better now, but I still think destroying keys is the most effective way to achieve compliance. Even if a storage device manufacturer guaranteed that they would attempt to erase the specific block you want erased, there are problems with whether that data is still recoverable by reading the block repeatedly.


> GDPR fines are structured to make it extremely risky/expensive to take the “ask for forgiveness not permission” approach you’re suggesting.

Expensive, maybe, but not risky.

If you're rolling the dice on a GPDR fine, the expected value of vague compliance is still largely positive, while the expected value of actual compliance is still slightly negative.

The expected value of the fine needs to be a larger negative than the expected value of vague compliance.

You don't even need to raise the fine value (which is a percentage/scale of revenue), you just need to make the probability of paying it approach 1.

In fact, with a higher probability of paying the fine, you could even lower the actual fine and still have it have a larger negative expected value than vague compliance.


> YouTube sets cookies on our devices to track our online activity for marketing purposes

This is not the purpose of cookies. The purpose of cookies is to store state on the browser side. Does GDPR interpret all possibilities of "cookies" as the storage and communication of privacy-violating data? Because that would be extremely unfortunate


I wonder what percentage of companies who even appear to comply with deletion requests actually do full deletion in practice. I suspect it's small, knowing how many things are coded to fake-delete for convenience. Businesses also tend to keep cold data backups around. (Maybe backups are exempted? I don't know.) There might even be cases where ostensibly deleted data can still be recovered from a disk, if they haven't overwritten.


The really hard part is dealing with the data still in backups.

It leads to having to do crazy things like individual keyed encryption per user, escrow hilarity, etc.


I do a lot of work with Kafka; GDPR removal requests for data that's still stored in a Kafka topic are so much fun.


I worked at a large company as it was implementing GDPR compliance, and I can attest that it was a nightmare of drudgery. I think it's interesting that our entire tech stack is designed to assume the business has a right to keep user data for as long as they want (forever), and compliance with GDPR delete requests is like a one-off exception implemented as a very sketchy redaction task that has to be careful not to leave broken foreign keys and such around.

It's a nightmare to do now, but our tech needs to change such that we assume data might be a hot potato that you need to be able to get rid of, and not some immutable asset that you can hang onto forever.


Barring more foundational and revolutionary changes in our tech stacks, another poster commented that a best practice is to have tests that delete user accounts and all associated records without breakage and to build these tests into your test suite from the very inception of the project.

Far from a magic cure, but by building it into the app from inception, it seems more possible.


Thanks for that insight. It's an interesting solution to use per-user encryption for that. Don't you have the same problems all over again with respect to preserving access to user specific data? I.e. needing key backups for business continuity, but needing to wipe those keys for deletion requests.


Per user encryption doesn't work in real systems. These have been implemented for several decades and no one uses them because they have pathologically terrible scalability and performance. The simple act of requiring encryption of individual user data forces the use of extremely suboptimal data structures relative to what would be used in the non-encrypted case.

A single AES encryption block is 16 bytes; internal storage of records is managed in terms of individual bits if you don't encrypt them. You also can't compress data encrypted in this way, which is table stakes for databases. The encryption key schedule alone for a single record will be much larger than the typical record! Requiring a per-user encryption key has the unfortunate effect of bloating database storage and memory requirements by at least 10x, while massively reducing computational throughput.

The design of delete-efficient database kernels is an interesting unsolved problem in computer science. No one knows how to do it. People that think there are trivial solutions have not studied the problem adequately. It hasn't been a priority to solve this problem, but its reputation for being theoretically difficult is well-deserved.


"I've not seen this in real systems, therefore it can't exist" ?


Or...just have a documented retention policy for backups and delete them after a certain timeline.


Guess we’ll find out when leaks happen, and companies start getting fines.

GDPR does differentiate between structured data (I believe it uses the term “identifiable records” or similar) and a huge heap of unstructured data where an individual’s data can’t be quickly retrieved as its own atomic unit, with much stricter requirements for anything structured.

So for data on an HDD that could be recovered but is an unstructured mess, you’re probably OK, as long as you took reasonable steps to protect the data, like full disk encryption (which can occur below the file system, and thus allow you to recover data deleted from the FS with a full disk scan, as long as you still had the keys). If you just had people’s data unencrypted on an HDD, and didn’t securely erase it before disposal, then that’s probably still a GDPR violation.


What does GDPR say about obscuring data?

Instead of hard-deleting a structured record for Bob Smith, can you leave the record intact and scrub it of identifying data so that I don't e.g. break all of the records for orders that Bob Smith made?


> Instead of hard-deleting a structured record for Bob Smith, can you leave the record intact and scrub it of identifying data so that I don't e.g. break all of the records for orders that Bob Smith made?

The various laws (GDPR, and similar ones in my jurisdiction) abuse the hell out of vague descriptions like "personal identifying data".

For example, scrubbing a customer record of all names, phone numbers, contact info, etc and leaving only the ID in the record still makes it possible to personally identify the person by:

1. Looking through the orders records to build up a profile depending on what they order,

2. Then using the delivery records from the courier company to get an address,

3. Then using existing public information to repopulate all the fields in the customer record.

In any given business, no matter how big or small, simply removing all references to a customer's personal information will still leave the customer personally identifiable. This process can even be automated.


You can as long as you delete all the personal data of the user and make sure that other remaining data is anonymized in a way that you can no longer associate it with the user.


A user deleting an account is just one way that stuff gets deleted. There are other deletion scenarios where it is appropriate to keep the information after it's deleted.


Yeah. Basically anything that an employee can mess up, should be reversible IMO. But actions such as deleting an account should have the option of "YES, DELETE IT PERMANENTLY, THIS CAN'T BE UNDONE"


You probably want to put the account on hold for a period where it can be undone. And then clean it up after.


Not always possible in a legal way. For example, GDPR compliance.


GDPR gives you a pretty large period to do it within. Easily enough to fit this feature in.


The only way that would make sense is if it were a cancelable timed deletion. People make mistakes. People change their minds. People get hacked.


Often it’s mark for deletion after some number of days.


You can most definitely keep around the non-PII data you've generated for that user in your app, such as synthetic user IDs and other stuff. Furthermore, for some apps (fintech) in some countries (US) you are actually obligated to keep around certain information for several years. It really doesn't have to be all or nothing.


Insurance has to keep around info for the lifetime of the policy. For some products, that's decades.


Synthetic user IDs can still qualify as PII since a human may still know which person that ID was associated with and data associated with that ID may be sufficient to narrow down who it was even if no human (or system) knows the exact mapping.

Pseudonymous data is not anonymous data.

But yes, the GDPR obviously allows for legal requirements of record keeping. It does however require you to delete the data once those requirements no longer apply (i.e. you have to ensure data is still deleted after those several years, you can't just not delete it because the deletion time is years in the future).


Any reasonably big system will require you to invest in that anonymization. Doing cascading hard deletes and making sure any missing references don’t trip up your system or mess up reporting is much more difficult than just flagging fields that need to be scrambled with randomized data (most libraries have fake data generators that make doing these updates easy)

Long before GDPR was a thing we had this implemented allowing us to securely analyze production dumps.

This is one of those things that I think libraries need to mature to support and then it will be effortless. Right now it requires a modest, but hardly overwhelming, engineering effort to implement.


This should come with an asterisk: *after the period of time you’ve said it will take for it to be deleted.

It’s almost impossible to guarantee instant deletion; folks just care that it’s gone in a documented and/or reasonable amount of time, and predictably.

Correct?


Sure, but that time period is usually limited to 30 days, with an additional 30 day extension allowed if you contact the person and tell them you need more time.

You can’t just declare that it takes years to delete data; GDPR sets reasonable limits on how long a company can delay true deletion.


Even if the company just issued you an invoice, that makes it obligated by law to hold onto your PII for the next 5 years - and GDPR of course has an exception for that.


Sure, but that data doesn’t fall under your right-to-be-forgotten, so any obligations that apply to the right-to-be-forgotten are irrelevant.


The GDPR gives limits on answering, I'm not sure it gives a limit on actually deleting the data everywhere.

Various types of data have various retention times for various reasons, some being legal reasons.


The GDPR says it's 30 days with a 2 month extension. This allows for having safe soft deletion and backup retention policies without running afoul of it.


Do you have a link for that?



Thanks!


The GDPR says data should be deleted "without undue delay", it doesn't give a hard deadline.


>under the GDPR [soft delete] is illegal

This is obviously not always true. A European can't, for example, delete their online account with their mortgage company and demand that the mortgage company delete the records saying that they owe them money so they can get their house for free.

Nor can anyone call up their previous employer and require them to delete all the work they ever did from their company's computer systems.

A company can't respond to a tax audit with "well we lost a billion dollars this year but all our customers wanted their accounts deleted so we can't provide any documentation of it." I mean, you aren't even going to get to the audit if you can't fill out your taxes in the first place because you had to delete all your financial records.

Most computer systems/software products in existence aren't free media apps, yet the "privacy circlejerk" constantly talk in generalities as if they are. I'm sick of this, most applications have a business justification for using some form of soft delete for at least some items.


Under the GDPR, a retention policy means the account has to be kept for whatever time is stipulated by the law (10y), but after that the data should not be retained any longer, hence deleted.


> Nor can anyone call up previous employer and require them to delete all the work they ever did from their company’s computer systems.

You can’t delete IP as this is not covered by GDPR, but you sure can ask them to delete your identification data from their records as GDPR also works for employees not just customers. It’s a problem with, for example, VCS.

Unless there’s another law that takes precedence, GDPR is the baseline. It’s for any identification of an individual, regardless of usage. Most “media apps” don’t have any other law in effect, so GDPR mandates what you need to do. (IANAL, but I worked on implementing GDPR with lawyers.)


The right to erasure is not unlimited (few things in GDPR are as absolute as people imagine). Yes, if personal data is retained based on consent, that consent can be withdrawn and the data must be deleted. But there are many other bases for collecting and retaining data as well.

The examples in the GP were mostly around the data being necessary for fulfilling a contract (can't request deletion of your mortgage records) or meeting legal obligations (can't delete tax records). But there's also legitimate interest.

Legitimate interest is vague and open to interpretation, but I'd be really surprised if there were any decisions from a DPA suggesting that a VCS commit log would be problematic. The legitimate interest seems really strong in that case: being able to audit who actually made each change, e.g. for security reasons.


>You can’t delete IP as this is not covered by GDPR, but you sure can ask them to delete your identification data from their records as GDPR also works for employees not just customers.

You can ask, but the right to have data erased is not an absolute right. In fact there's an enumerated list of the circumstances in which the data must be deleted but data controllers are otherwise under no specific obligation to delete data, and there are even a series of criteria which dis-apply that obligation.


An often missed case of the GDPR requiring companies to delete data is that you can't just keep CVs "on file" without explicit permission, even of former employees you did end up hiring (unless you are preparing a legal case involving claims made in the CV for example).

Data controllers absolutely must delete data by default. The basic rule of GDPR is literally "privacy by default" (or "data minimization"), not "data hoarding by default". There are plenty of exemptions that grant companies the right (or rather require them) to delay the deletion by amounts of time as long as many years (e.g. for tax purposes), but they are only permitted to maintain only the absolute minimum required to fulfill these obligations and they must delete the data as soon as possible once these obligations have been fulfilled.

If you are obligated to keep a set of PII for 10 years, you must delete it in 10 years, unless you (still) have consent to keep it beyond that time limit.


It’s an absolute right, and one of the main points of GDPR, unless there is another law[0] stating that you can keep it, for a specific reason. Like financial transactions. When that law expires, GDPR kicks in. The company doesn’t own your data, you do.

[0] https://gdpr-info.eu/art-17-gdpr/


>It’s an absolute right, and one of the main points of GDPR, unless there is another law[0] stating that you can keep it, for a specific reason.

It can't be "an absolute right... unless", it's either absolute or it's not.

And if you read Article 17, the first paragraph says that "The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies" (emphasis mine) and then lists six grounds. Further to that, paragraph 3 dis-applies paragraph 1 for more reasons than "another law states you can keep it".

>The company doesn’t own your data, you do.

Data controllers do own the data (insofar as one can "own" data), but data subjects have rights over the processing of those data that limit what the controllers may do with them.


Lol, okay dude, believe that if it makes you feel good.


I would guess that there are quite a few limits to that...

A user has a long history of participating on your forum and other users have quoted their messages far and wide. Collectively all of the messages posted on your forums (with or without timestamps) reveal a few PII about the user. Do you have to delete those?

The user filed a bug report about a functionality not working, do you have to delete the text of the bug report?

Arguably if your user table look like [user_id, creation_date, deletion_date, status, account_type] then this table does not contain any PII.

Assuming that user content is not automatically PII, whose responsibility is it to track where PII can be?

Sometimes users doxx themselves (like mistakenly sharing tax return forms instead of cat pics), in such a case it is the user's responsibility to signal this to you.

If the user filed an issue saying "when I insert my name (Abe Cox) the input validation fails" is it you or the user who needs to read through all the issues to find this case?

My point is that GDPR + right to be forgotten cannot make it look like you never had an account at all, especially without user assistance.


> A user has a long history of participating on your forum and other users have quoted their messages far and wide. Collectively all of the messages posted on your forums (with or without timestamps) reveal a few PII about the user. Do you have to delete those?

Technically yes.

> The user filed a bug report about a functionality not working, do you have to delete the text of the bug report?

Depends on the privacy policy of the bug tracking application. If it's stated that the report shouldn't contain any personal data, you don't have to delete it. If the user asks for deletion stating that it contains personal data, you have to delete it.

> Arguably if your user table look like [user_id, creation_date, deletion_date, status, account_type] then this table does not contain any PII.

It doesn't; as long as you no longer have any way to associate user_id with the user's personal data, you can keep it.

> Sometimes users doxx themselves (like mistakenly sharing tax return forms instead of cat pics), in such a case it is the user's responsibility to signal this to you.

In that situation if the user asks for deletion of the content you have to provide it. You are obviously not asked to monitor errors about the user.


user_id is PII if the identity of a user can be narrowed down sufficiently.

For example, if you created a social media app and later left the company, it could still be determined that if you had an account it likely has one of the lowest user ID values if those were assigned sequentially. So even if your name was removed from it and none of your posts provided any identifying information about you, the user ID could still qualify as PII.

Likewise if the user ID is public and non-PII content associated with that ID is linked elsewhere identifying the author of that content (e.g. a news article embedding a post) the identity of the author is still compromised.

The problem is not just that users may submit PII in unexpected places, the problem is that even if the data type could not possibly allow them to, metadata beyond your control may actually still taint your supposedly anonymous data.


The question is not whether "spooky deanonymization as a distance" is possible; the question is how much you should proactively do to avoid this.

As an extreme example, say user A takes a screenshot of their own user settings page on an image sharing site, then sends it to user B, who uploads it to the same site. Who is responsible for this piece of data? Does share-a-pic.web need to scan all images for the username of a user that is deleting an account?

This is a convoluted example where the answer is obviously no, but it shows that a purely principled answer might be unsuitable.

Especially since non PII data you own mixed with non PII data someone else owns can become PII and you do not know what data others have.

(You could also replace the user ID in all locations with a new tagged UUID to preserve referential integrity in the DB, but there are not much more than 2^32 humans in total; it's very hard not to be deanonymizable.)


What we did (and legal was OK with it) was to create a "GDPR Delete" function in the admin backend. It doesn't actually delete, but sets a soft delete identical to the regular user delete function and overwrites any PII with stuff like name:gdpr_delete_name_xxxx

That way log and system context is available for debugging, but the system no longer holds any PII. It does delete some logs, but anonymizing data seemed better than hard deletion. It wasn't hard to do, it took legal longer to approve than writing and testing the code and has come in handy a few times for debugging errors.
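
For illustration, the anonymizing step might look something like this (column names are hypothetical, not our actual schema):

  -- Overwrite PII in place; keep the row and its foreign keys for debugging context.
  UPDATE users
  SET name       = 'gdpr_delete_name_'  || id,
      email      = 'gdpr_delete_email_' || id,
      phone      = NULL,
      deleted_at = now()
  WHERE id = 1234;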

Edit: typo


What if you anonymize it with a random key and then give the user that key in the event they want to undelete their deletion?


GDPR explicitly says that this doesn't apply when a more specific law requires you to store something - logically then you also need to know who the user was. Things like transaction records, network traffic logs, invoices, pay slips, workplace chat...

So yeah, you're right that not deleting user data just because it's inconvenient is wrong and illegal, but that doesn't mean there aren't legitimate usecases for soft deletions of users.


We don't do it for users but for certain user assets that are a pain to remove consistently without a background job or a job queue. Like user devices that have associated dependencies. The background job is automatic; it prunes objects every x hours.


Not all personal information storage needs consent; GDPR Art. 6, points (b) and (c), covers such cases as applicable to private companies.

Users can retract willing consent, but willing consent is not the only thing that allows companies to store user data.

Soft deletion of users who had an account with transactions on it is fine.

Art. 17, point (b), explains this quite clearly.

The GDPR is amazingly readable.


You can do both.

Soft delete immediately when requested. Hard delete (if still soft deleted) when GDPR requires it.

Besides it's not like all data is the same. Use what's appropriate when it's appropriate. Nobody's saying "you must always soft delete, and then immediately tweet the audit trail, print that tweet, put that printout onto a rocket, and launch it to the moon for later retrieval. with a blockchain for extra GDPR hate".


Last month I had to implement customer deletion on a new project I recently started working on. The basics were: anonymize the user's personal data, delete subscriptions, keep everything else. I believe that payments and invoices are not stored in that database, so I don't know what happens to them, but in general you must not delete that kind of data because, GDPR notwithstanding, the law requires that they be available for inspection for a long time. Of course, they cannot be anonymized in a way that cannot be reversed. Being able to connect them with a trail of operations in the main database is probably useful too. So, soft delete.


>It may be convenient, but under the GDPR it is illegal. When a user deletes an account, all the personal data associated with that user must be deleted

There's a grace period of 30 days. We hard-remove after 2 weeks, allowing a user to change their mind if they deleted it by accident. More than enough.


Stop spreading misinformation. Nobody is required to hard delete the very second the [delete] button is pressed. You're following the law as long as that mentioned reaper job exists and runs at least once per month, to be safely within the allowed timeframe


I didn't see a claim that deletion had to be "the very second the [delete] button is pressed".

This is the actual text of the law[1]:

> The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay

There is a one month limit of the right of access[2] and right to be informed[3], which flow into some of the articles of the law; and certain articles also mention a month's limit on being informed, e.g. from article 12 parts 3 and 4 [4]:

> The controller shall provide information on action taken on a request under Articles 15 to 22 to the data subject *without undue delay and in any event within one month* of receipt of the request. That period may be extended by two further months where necessary, taking into account the complexity and number of the requests. The controller shall inform the data subject of any such extension within one month of receipt of the request, together with the reasons for the delay. Where the data subject makes the request by electronic form means, the information shall be provided by electronic means where possible, unless otherwise requested by the data subject. If the controller does not take action on the request of the data subject, the controller shall inform the data subject without delay and at the latest within one month of receipt of the request of the reasons for not taking action and on the possibility of lodging a complaint with a supervisory authority and seeking a judicial remedy.

I've marked out those phrases because they show the language of the law that you will see again and again in the articles - without undue delay and within one month. The latter does not obviate the former.

Hence, and in fact, there is no part of the law that explicitly gives you a month to delete something. If you could delete something right now but you leave it hanging around because you have arbitrarily chosen a job to run once a month and there's a leak in that time, that would be undue delay.

[1] https://gdpr-info.eu/art-17-gdpr/

[2] https://gdpr-info.eu/issues/right-of-access/

[3] https://gdpr-info.eu/issues/right-to-be-informed/

[4] https://gdpr-info.eu/art-12-gdpr/


The user has to explicitly request the erasure.


Yes, they do that by pressing the delete button on their side (and so forth). This is covered in the first line "The data subject shall have the right to obtain".

If you want to go to court and argue that the user should've known that pressing a big red button with "DELETE!" on it really means soft delete, be my guest.


You literally said that soft deletes are illegal, which heavily implies what you now deny to have said.

The 30 days wasn't particularly taken from the regulation, the actual grace period for the deletion should be 90 days in total, but I'm not a lawyer nor certain about it.

What I am certain about is that your original thesis, which is soft deletes being illegal under gdpr, is complete bullshit


> You literally said that soft deletes are illegal, which heavily implies what you now deny to have said.

Who is "you" here? Perhaps if you were less defensive you'd notice that I'm not the person you originally replied to, which, by the way, makes your response nonsensical.

Edit: I'll add, that the only person who is likely to get anywhere near to "complete bullshit" is going to be the one who writes "but I'm not a lawyer nor certain about it" and doesn't provide a shred of evidence to back up their, frankly completely ignorant, notions.

90 days grace? Risible.


Yeh, you take first place for the most retarded person 2022 award.


Not every delete is a GDPR delete. You could have a hard-delete GDPR flow, obviously.


Soft deletion is just one way to achieve undeletion. The author's proposed solution of moving the resource to another table works just as well. You can move it back to the non-deleted table to perform the undeletion. You can keep around these deleted objects as long as you want; they work as a subset of a proper audit trail. The cost of course is you have more tables, but that is less of a cost than having to add "deleted=False" predicates in all of your queries.

Also note that if you use a soft-delete column, indexes need to account for that column as well if you want to access non-deleted objects quickly. That's extra complexity.
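
For what it's worth, on PostgreSQL a partial index usually covers this; a rough sketch with made-up table and column names:

```
-- Index only the live rows, so lookups that filter on deleted_at IS NULL
-- stay small and fast even if most of the table is soft-deleted.
CREATE INDEX users_live_email_idx
    ON users (email)
    WHERE deleted_at IS NULL;

-- The planner can use it for any query that repeats the same predicate:
SELECT * FROM users WHERE email = 'a@example.com' AND deleted_at IS NULL;
```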


Even more important: the deleted records don't need to live in your cache / RAM / etc. Potentially faster queries.


    but that is less of a cost than having to add "deleted=False" 
    predicates in all of your queries.
Maybe or maybe not. You can use a view. Or you may be using an ORM that lets you set a default scope (essentially, a default WHERE clause - ActiveRecord lets you do this).

It also depends on if you're designing an app for this from the ground up or if you're trying to retrofit an existing app with 90 million different hardcoded queries.

Also depends on what you want to do with the deleted records. Do you want to do things with them? Maybe after a user is deleted, you want them to be able to log in, but you would prefer them to see a "sorry - your account has been deleted" message instead of "user not found." Maybe you want your support staff to be able to look at deleted users. Etc. Now you need to update your app logic so that it's flexible enough to look at "users_deleted" and "users". Which may be at least as onerous as messing with the WHERE clause on every single query.

To be clear, I don't hate the "another table" solution. It's the right choice in a lot of situations IMO.


> but that is less of a cost than having to add "deleted=False" predicates in all of your queries.

It's like people have forgotten what views are.


Well, most likely they're queries that have "deleted=False" in them. ;-)


You almost made me ruin a keyboard. Jerk.


What seems to be missing here is the DB tech he is using. On a proper database you can do your "undelete" with triggers and it's relatively trivial. Nonsense like a "deleted" column on your main data table just seems silly.
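
One way to read that, as a hedged PostgreSQL sketch (table and function names invented): a trigger that copies rows into an archive table on delete, so the main table never needs a "deleted" column.

```
-- Archive table: same columns as users, plus when the row was deleted.
CREATE TABLE users_deleted (LIKE users, deleted_at timestamptz NOT NULL DEFAULT now());

CREATE FUNCTION archive_deleted_user() RETURNS trigger AS $$
BEGIN
    INSERT INTO users_deleted SELECT OLD.*, now();
    RETURN OLD;  -- let the DELETE on the main table proceed
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_archive_on_delete
    BEFORE DELETE ON users
    FOR EACH ROW EXECUTE FUNCTION archive_deleted_user();
```

"Undelete" is then just copying the row back from users_deleted and removing it from the archive.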


    Nonsense like a "deleted" column on your main data table just seems silly.
Why silly? Yeah, it's one more column, but most DBs just represent dates as bigints AFAIK so it's not really the end of the world.

Sure, you have to be conscious of adding "where deleted_at IS NULL" (or the opposite) to all your queries. If you're going through an ORM this is usually trivial e.g. in ActiveRecord it can be a default scope on your model.

    On a proper database you can do your "undelete" with triggers 
    and it's relatively trivial. 
I like this solution as well, FWIW.

If I was retrofitting an existing application with soft-deletes I would definitely favor this solution.

There are some obvious advantages over a "deleted" column on your main table for sure.

One drawback is that you now have to keep two table definitions in sync. You also have to add conditional logic in a lot of places to handle whether a given piece of code is looking at a normal user from the "users" table or a deleted user from the "users_deleted" table. Instead of messing with the WHERE clause on every query, now you're messing with the table name on every query. So that's a wash IMO.


I worked at a place that kept all the deleted stuff in their main tables. It turned out over 90% of the rows were deleted. I'm not sure how often something was undeleted, but it was not very frequent. Some of these soft deleted rows were 5+ years old. Archive that crap.


Yeah, for sure, if deletion is going to be the exception rather than the rule.... keeping deleted stuff in the main tables is not ideal.

Of course, you can have your cake and eat it too. You can e.g. keep your deleted crap in the main table (which makes certain things easier) and then hard-delete or archive it to another table after N days.
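
That archive step can be a single statement run on a schedule; a rough sketch (names and the 30-day cutoff are made up, and it assumes a users_archive table with the same columns):

```
-- Move rows soft-deleted more than 30 days ago into an archive table
-- (PostgreSQL data-modifying CTE; run from cron, pg_cron, a worker, etc.).
WITH moved AS (
    DELETE FROM users
     WHERE deleted_at IS NOT NULL
       AND deleted_at < now() - interval '30 days'
    RETURNING *
)
INSERT INTO users_archive
SELECT * FROM moved;
```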


I would never suggest archiving unless we're hitting some performance limit. Why generate more busy work? Just leave it there and if need be add more indexes and partitions.


well, they were running into performance issues. This table had something like 100 columns and at least 30 indexes already. This table was fat. Many columns, many rows, mostly unused garbage.

And several of these "columns" were actually JSON blobs that probably should've been in their own table. But what do I know. I left!


When we talk about best practices, it's probably not too useful to talk about these sorts of "extremely high technical debt situations."

Once you have an existing tirefire, it's a matter of damage control and "what is the least bad way to accomplish new functionality?" and the answer to that will always be unique for each unique tirefire.


Triggers don't solve what the author's getting at, i.e. who knows what else outside this database changes upon deletion that would need to be reversed? It's all in the article.


Also if people know that deletion is reversible, they're more likely to actually do it, which can keep things generally tidier.

I don't actually like using a "deleted" column; my standard table has a status column, and deleted is one of those states, along with active/pending/suspended/etc, as the needs dictate. This way I get soft deletes basically for free, both in the schema and in the queries (which would generally default to active), so it's not really the spaghetti that the author discusses.


For some applications this is fine, depending on your app/business logic, but for a lot of applications states like active/pending/suspended and "deleted" are not mutually exclusive.

Suppose I soft-delete an active, pending, or suspended user using your scheme.

Now I need to un-delete the user. What status should they have? We don't know.

This is another "best practice" I've learned over the years. Those flags/statuses you think are mutually exclusive? Maybe they aren't. Or maybe they are now but won't always be. It's usually easier in the long run to give each status its own column even if you think they'll always be mutually exclusive. Because I mean, what are you really saving? A few bytes per record? In most cases that's not worth it.
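
A tiny sketch of the difference (illustrative column names only):

```
-- Independent, nullable timestamps instead of one mutually-exclusive status,
-- so "suspended" and "deleted" can be true at the same time, and un-deleting
-- doesn't have to guess what the previous status was.
CREATE TABLE users (
    id           bigint PRIMARY KEY,
    suspended_at timestamptz NULL,
    deleted_at   timestamptz NULL
    -- versus: status text CHECK (status IN ('active', 'pending', 'suspended', 'deleted'))
);
```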


That still has the same issues. You have to remember to set every linked table's records to the same state, or remember to query every linked table through the table that has the lifecycle column on it.


Forgetting to include the lifecycle column in your query may mean not understanding what question you are trying to answer with your query.


There are some Rails gems that take care of this by scoping associations (that also have the deleted_at column).


In rails you get these things for free. What I don't get is why everyone rolls their own framework with node.js. It's basically 90s PHP all over again.

EDIT: Soft delete is a trivial piece of code when the framework has a well-defined transaction system for its ORM. It's not really related to Rails per se. Your statement is extremely disingenuous, while trying to look smart. Audit trails _can_ be (but don't have to be) more complex, especially when the framework uses a lot of stored procedures to handle operations. But other than that, these frameworks are specifically designed to REDUCE the complexity of such operations and the dependency costs - which are huge in node.js, specifically because you can mix and match anything into everything.

Node.js people tend to stitch together XSS solutions and random templating solutions based on their frontend work; even basic salting of auth passwords becomes unpredictable because you have 30 options for minimal auth libraries.

But yes, nothing is ever free. If you want to use Rails you still have to learn Ruby and the framework, and have a basic understanding of how ActiveRecord builds queries if you want to be writing performant code. And the same applies to Laravel, Django, or whatever of the 50 patchwork node.js solutions you want to base your code on.


I don't know why you're ragging on Node.js users or even PHP for that matter as both ecosystems have this stuff covered too.

Also you're comparing language/runtime with an actual framework and then dogging those users...

If you want to compare Rails with Node/PHP then I'd suggest comparing with things like Laravel (PHP), Adonis (Node) and you'll find everything you can do in Rails is done in Node/PHP too.


What percentage of production Node.js systems use Adonis? It's probably vanishingly low. Laravel is nice, and it's been gaining a lot of adoption in new applications, but the GP said "90s PHP" for a reason. Modern Node JS backends are often littered with hand rolled SQL queries, poor MVC separation, and lots of shoddy, half baked model layers—often without using any library whatsoever. Which is a real shame, because I personally really love programming in JavaScript, and the performance is great, but the strength of Rails as a framework draws me back in every time.


How is hand rolled SQL queries something bad?


Bad. Like really bad. Unless you're doing it for some small corner case where you need every last bit of performance, writing raw SQL queries and mapping them to objects or manipulating the result sets by hand is really bad. Especially for others having to maintain your code 3 years later.

There's a reason ORMs are so popular.


Personally I prefer the opposite. I have managed many old projects (10+ years old) and I always prefer raw SQL over some outdated ORM that is hard to configure and fine-tune. With SQL I have access to one of the most powerful query languages in existence, one that can never be matched by a database layer written in a general-purpose language.

Raw SQL can be a bit problematic if it is spread out across the entire code base with little structure ("littered"); however, that is relatively easy to fix by centralizing the queries with a repository pattern. It is much harder to fix an ORM ten years later.

Do we know why ORMs are popular? I don't think it's because ORMs are good, rather that developers want to avoid SQL because of a lack of experience, but that is me guessing based on my prejudice (I was myself in that category as a junior).

I have written in the past about why ORMs are bad https://news.ycombinator.com/item?id=27407690


And I'm sure there are many similar situations like that in Ruby, Python, and other languages; you just don't hear about them because they don't power 99% of the internet like PHP and JavaScript do. ;)


I have to admit I'm getting a bit of fatigue reviewing frameworks in Node that are claimed to be batteries-included and similar to Rails/Django/Laravel. SailsJS, Redwood, NestJS, Remix, Prisma (ORM compared to Active Record). They all have an exciting landing page, clean looking documentation... but once you dig in further the feature set starts looking very <= Rails v1.0.

Anyways I appreciate the name drop of Adonis. I'll have to give it a fair shake when I'm not feeling quite as cynical because I _really_ do want to find a solid Node framework in this area.


Most "full stack" frameworks in the node ecosystem consider themselves "full stack" just because you can run code on the server. See the Fresh (deno) discussion here in HN from a few weeks ago.

To me, a framework which doesn't come with a clear way to do translations, background jobs, a query builder, an ORM, migrations, asset bundling, validations, error handling, etc., is not full stack. It is just "server code compatible" at best.


Adonis is far, far, far from reaching even a small percent of what Laravel is. It has almost no community and very few, or almost no, third party packages. When I tried it last year I couldn't even find a formatter for the templates.

Later I found out the author of the framework has fake accounts pretending to be a happy user of it, and that was the end of it for me.

So, not comparable at all in my opinion.


Nothing is free. In Rails you have _currently well maintained libraries_ for this. There are still complexity costs, dependency costs, data costs, performance costs, etc, etc, etc.


> It's basically 90s PHP all over again.

Which is funny because Laravel gives you this for free as well through its ORM. Soft deletes are an easily solved problem with $table->softDeletes() in your migration.


I think the key in this comment is "90s". Modern PHP is very different. As you say, Laravel makes trivial and a joy to use the things that in 90s PHP - much like in today's node - you had to cobble together yourself.


I was getting ready to disagree with you - but then I tried to think of any time I've actually pushed code to production with the "DELETE" keyword in it. The problems that I've had to solve in my career very rarely call for deleting something.

"Soft deletion" and "audit trail" are technical terms we developers come up for solutions the business wants but maybe hasn't asked for yet. It's not really a soft deletion it's a "deactivate" or "hide". Likewise, it's not an audit trail it's a "history" or "undo". Most of the time your stakeholders and users actually want these features, but don't ask because they perceive this as more expensive to build then just a "delete" button.


> Likewise, it's not an audit trail it's a "history" or "undo".

Depends on the industry. In the one I work in, an audit trail is a well-defined and mandatory business concern.


This is why I don't understand why Datomic isn't more popular. Pretty much every system I've worked on has never needed to scale past 100s of writes per second due to hard limits on the system (internal backoffice stuff, fundamentally scoped/shardable to defined regions, etc etc). And since Datomic is built with that in mind, you get the trade-off of full history, first class transactions, and being able to query for things like "who changed this attribute to its current value, when, and why" - which is such a super power!


Too niche of a technology, tied to clojure, not open source, and very slow (doesn't matter for most internal apps though). For many, it's better to do the tedious thing here and there with postgres. SQL also has a strong grip on databases, and if you look at it the other way around, postgres has lots of features that datomic lacks; with datomic you almost always need a secondary database.


It’s hard to get adoption for expensive toys.

I think Datomic is neat, and I’d like to use it, but it is prohibitively expensive for a personal or hobby project. Personal projects are where I get excited about tech, and when I’m excited I’m more likely to adopt it in my day job.

They’re really shooting themselves in the foot by not having a one-click install free tier or a self hosted option.


You may want to look into XTDB then. Not quite the same as Datomic, but they share many similarities. It's free and open-source.


Neat, thanks!


Personally, looking at their pricing, since it is so tied to AWS it is completely non-transparent how much it's going to cost us now or in the future.

I really like the concept of datomic though.


In the past I've used MSSQL's Temporal Tables (also called System-Versioned Tables) to implement this kind of functionality. This also gets you, for free, Type 2 SCD functionality for OLAP-style queries.

I can't wait until Postgres has this kind of functionality baked in. It's such a nice feature.
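
For anyone who hasn't used them, a minimal T-SQL sketch (names are made up):

```
-- System-versioned table: SQL Server keeps every prior row version in the
-- history table automatically on UPDATE and DELETE.
CREATE TABLE dbo.Customer (
    Id        int IDENTITY(1,1) PRIMARY KEY,
    Name      nvarchar(200) NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

-- The table as it looked at a point in time, including since-deleted rows:
SELECT * FROM dbo.Customer FOR SYSTEM_TIME AS OF '2022-07-01T00:00:00';
```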


You likely know that there is an extension for PG that does this.


For me it is simply that it isn't open source (at least last I checked).


Datomic is great but I think it's missing some features that many enterprises need (access control and a query planner). Also it seems to be mostly built for DynamoDB/AWS.


This is something that I was forced to learn the hard way more than once. Literally today I needed to undelete a record because a customer was confused by what the "delete" button did and wanted their record back.


Isn't it a problem of UI, though? If the user were informed about the consequences (possibly with a bold red font and a confirmation checkbox), would they still click that button?


File this under "falsehoods programmers believe about users": they act rationally.


Delete

“I’m sorry sir, I didn’t think that clicking that button would actually delete something!”

“What exactly did you think that button would do?”


"falsehoods programmers believe about users #2": they read button labels and dialog box text.


As a user, I would still sometimes click it and then regret my decision

Soft deletions are awesome


All of that exists.


Yes.


One will, one day.


I'd add tagging, for anything that could conceivably use it, when you're doing DB design. May as well start with support, even if the functionality's initially dormant. Someone will ask for it, directly or indirectly, and it won't take long before they do.


I personally would not agree.

But... your experiences are also real and I believe you when you say that your experiences DO support your conclusion. :)

The "soft delete" design decision is usually pretty impactful and is in my experience potentially much more of a pain in the butt to implement later if you haven't included it from day 0.

Audit trails and soft-deletes are also crazy useful for developers (both for debugging and for general cover-thine-ass utility) even if end users never touch them.

Whereas, tags are easier to tack on later and are not intrinsically useful to developers.


I agree. But if you’re asked to implement hierarchical categories, you implement them as tags underneath. This doesn’t cost you any time. When eventually they realize that categories don’t work (they never do) you can just turn a switch.


+1 on audit trails. And one should always store audit trails in a machine-readable format. That way you can not only manually inspect what happened, but you can query it too (and reconstruct the entire state as it existed in the past if necessary).
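
Something like this as a minimal sketch (table and column names invented), with the old/new row images stored as jsonb so the history stays queryable:

```
CREATE TABLE audit_log (
    id         bigserial   PRIMARY KEY,
    table_name text        NOT NULL,
    row_id     bigint      NOT NULL,
    action     text        NOT NULL CHECK (action IN ('INSERT', 'UPDATE', 'DELETE')),
    old_row    jsonb,
    new_row    jsonb,
    changed_by text        NOT NULL DEFAULT current_user,
    changed_at timestamptz NOT NULL DEFAULT now()
);

-- "Who last changed this customer's email, and when?"
SELECT changed_by, changed_at
  FROM audit_log
 WHERE table_name = 'customers'
   AND row_id = 42
   AND new_row ->> 'email' IS DISTINCT FROM old_row ->> 'email'
 ORDER BY changed_at DESC
 LIMIT 1;
```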


  > Somebody always wants to undelete something
  > or examine it to see why it was deleted
  > or see who changed something, or blah blah blah
People for whom this resonates should look into Dolt: https://dolthub.com/.

It's a mysql-compat database that is versioned and forkable - basically imagine git and mysql had a baby. Every transaction creates an entry in the commit log that lets you see when, how, and why the database changed. And just like git you can `dolt checkout <hash>` to see what the data was like at some point in time, or `dolt checkout -b temp-branch <hash>` to make exploratory changes to some historical version, or `dolt revert` to revert a transaction... etc.

There is a lot more power that comes with making the entire database versioned and forkable by default. For example it makes it much easier to recover from catastrophic bad code pushes, etc.

note: Dolt was forked from Noms, my previous project, but I don't work for Dolt or have a stake. Just a fan.


> unless there's some reason not to.

Yes.

To note, with GDPR there are now legal reasons to do so regarding user personal data. That can be the moment the devs realize they actually can't delete the data: because they soft-deleted for so long, many relations are now interlocked, and the data model needs to be changed to give the deletion cascade a starting point.

My lesson from that was to at least have one test deleting a mock user that spans the maximum data breadth of the service. We caught a bunch of these loops in test at dev time and that was pretty great.


    My lesson from that was to at least have one test 
    deleting a mock user that spans the maximum data 
    breadth of the service . We caught a bunch of these 
    loops in test at dev time and that was pretty great. 
Thank you. That is SUPER insightful. If it wasn't too late to edit my initial post I would add this.


Good point. That could be tricky when it crosses the boundaries of schema changes and data migrations.


> Somebody always wants to undelete something, or examine it to see why it was deleted, or see who changed something, or blah blah blah.

In my experience this happens “rarely”, not “always”.

It can happen, and in some ultra-rare cases the impact of not being able to recover some data might be huge (company-ending, even), and engineers are good at worrying about such edge cases. That’s why we have protective measures like soft deleting and event sourcing - because of nightmare edge cases, not because we are always having to actually use them. It’s driven by engineers avoiding their worst nightmare: having to say “I’m sorry, I cannot solve this problem for you. The data is gone.” It’s a peace-of-mind thing, not an everyday-need thing.


> That’s why we have protective measures like soft deleting and event sourcing

IMO soft deletion is a hack trying to fix problems in CRUD, which is a hack in itself.

CRUD attempts to model everything in the universe as collections of mutable items, loosely based on RDBMS/SQL.

Event Sourcing is more realistic: it models everything as append-only logs of immutable events/facts, which preserves both the historical data and, more importantly, the original intent.

Unlike CRUD, Event Sourcing is technology agnostic.


Having multiple use cases for data is normal and okay. Taking your primary data store, which is designed for one set of use cases, and using it for all of them is very, very bad, even before you hit scale. Knowing which data store to use, and when, for a given use case is a superpower that can allow you to scale to much higher magnitudes than one would intuit.


What's wrong with the author's proposed compromise/solution?


The proper thing is database snapshots or something like that.

I don't think soft delete is wrong per se, but it is something that should be native to the database engine.


Can you recommend any best practices, design patterns, etc - for implementing soft delete in SQL systems?


The biggest one is to use a timestamp instead of a boolean. For instance Laravel uses `deleted_at` as the field name with the type of TIMESTAMP (for MySQL at least).

This allows you to do a `select * from users where deleted_at is null` to get the active records, but also know when the user was deleted if you need to audit / rollback.
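
In raw SQL terms that's roughly (hypothetical table and id; the same shape works in MySQL and Postgres):

```
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMP NULL;

UPDATE users SET deleted_at = NOW() WHERE id = 123;   -- soft delete
UPDATE users SET deleted_at = NULL  WHERE id = 123;   -- undelete
```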


I think the best answer is to optimize for the expected data and use cases.

- How many records are there? Enough that performance will be an issue? If I'm planning on keeping things in the same table, can I utilize database features like indexes and partitions to mitigate any perf issues? (For some access patterns, partitions might solve 100% of your perf concerns)

- How common is deletion? Are we expecting 1% of the records to be soft-deleted, or more like 90%? If it's the latter, you may not want all those records clogging up your main table.

- Is this greenfield development, or am I adding soft-delete to an existing app? Greenfield favors "same table" soft delete; if you're retrofitting an existing app it may be better to keep deleted stuff in a separate table so that you don't break existing functionality.

- What do you want to do with the soft-deleted records? Are there times when you want to treat them just like regular records, e.g. "Give me a list of all the users who joined last week, even if we've deleted their accounts?" If the answer is "yes" then a lot of those things will probably be easier if you keep deleted and non-deleted in a single table.


Agreed. The key to soft deletes is to actually move the record out of the original table to ensure it doesn’t accidentally end up in a query join.

But there’s always something that needs to be undeleted. You can either have an easy way to do it or restore an entire DB backup and cross query the missing records. Soft deletes are a lot easier.


Isn't an audit log enough for that? As deletions should also be logged.


GDPR compliance and all - we encrypt and delete stuff... or at the very least "lose" encryption keys 1st.


This 100%


Views are a simple solution to this problem. Pretty much all moderns RDBMSs support updatable views, so creating views over your tables with a simple WHERE deleted_at IS NULL solves the majority of the author's problems, including (IIRC) foreign key issues, assuming the deletes are done appropriately.

I feel like a lot of developers underutilize the capabilities of the massively advanced database engines they code against. Sure, concerns about splitting logic between the DB and app layers are valid, but there are fairly well developed techniques for keeping DB and app states, logic and schemas aligned via migrations and partitioning and whatnot.
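
As a hedged PostgreSQL sketch (names invented): a view that exposes only live rows, with an INSTEAD OF trigger so a DELETE against the view becomes a soft delete underneath.

```
CREATE VIEW live_customers AS
    SELECT * FROM customers WHERE deleted_at IS NULL;

CREATE FUNCTION soft_delete_customer() RETURNS trigger AS $$
BEGIN
    UPDATE customers SET deleted_at = now() WHERE id = OLD.id;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

-- DELETE FROM live_customers ... now flips deleted_at instead of removing the row.
CREATE TRIGGER live_customers_soft_delete
    INSTEAD OF DELETE ON live_customers
    FOR EACH ROW EXECUTE FUNCTION soft_delete_customer();
```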


> assuming the deletes are done appropriately

This is one gripe I have with soft-deletion. Since I can no longer rely on ON DELETE CASCADE relationships, I need to re-define these relationships between objects at the application layer. This gets more and more difficult as relationships between objects increase.

If the goal is to keep a history of all records for compliance reasons or "just in case", I tend to prefer a CDC stream into a separate historical system of record.


> Since I can no longer rely on ON DELETE CASCADE relationships

Cascaded deletes scare me anyway. It only takes one idiot to implement UPSERT as DELETE+INSERT because it seems easier, and child data is lost. You could always use triggers to cascade your soft-delete flags as an alternative method, though that would be less efficient (and more likely to be buggy) than the built-in solution that cascaded deletes are.

If you look at how system-versioned (or “temporal”) tables are implemented in some DBMSs, that is a good compromise. The history table is your audit, containing all old versions of rows, even deleted ones, and the base table can be really deleted from, so you don't need views or other abstractions away from the base data to avoid accidentally resurrecting data. You can also apply different storage options to the archive data (compression/not, different indexes, ... depending on expected use cases) without having to manually set up partitioning based on the deleted/not flag. It can make some query times less efficient (you need to union two tables to get the latest version of things including deleted ones, etc.) but they make other things easier (especially with the syntactic sugar like FOR SYSTEM_TIME AS OF <when> and so forth) and yet more things are rendered possible (if inefficient) where they were not before.

> I tend to prefer a CDC stream into a separate historical system of record.

This is similar, though with system versioned tables you are pretty much always keeping the history/audit in the same DB.

---

FWIW: we create systems for highly regulated finance companies where really deleting things is often verboten, until it isn't and then you have the other extreme and need to absolutely purge information, so these things are often on my mind.


> It only takes one idiot to implement UPSERT as DELETE+INSERT because it seems easier, and child data is lost.

Seems unfortunate to miss out on all the referential integrity benefits of a serious database when hiring standards, training and code reviews should all be preventing idiotic changes.

If I’m making a shopping cart system, I want to know every order line belongs to an order, every order belongs to a user and so on. Anyone who can’t be trusted to write an update statement certainly can’t be trusted to avoid creating a bunch of orphan records IMHO.


> Seems unfortunate to miss out on all the referential integrity benefits of a serious database when hiring standards, training and code reviews should all be preventing idiotic changes.

If you don't use ON DELETE CASCADE, the actual foreign key constraint gives you a meaningful error-- that you need to delete some stuff to have referential integrity.

ON DELETE CASCADE --- you're telling it "eh, if you need to delete some stuff to avoid an error, go ahead, don't bother me, do what I mean".


> Seems unfortunate to miss out on

You don't lose any referential integrity without cascades. Foreign keys are still enforced, just with an error if an action would break integrity rather than automatic deletes to satisfy the constraint that way.

> when hiring standards, training and code reviews should all be preventing idiotic changes

I was burned by this sort of thing early on, in companies where I had few such luxuries. Even though things are done better now, I'm still paranoid of that one day someone skips procedure and somehow lets a problem through all the QA loops.

> If I’m making a shopping cart system, I want to know every order line belongs to an order, every order belongs to a user and so on.

I take it from the other side: I want to know that if something is referred to elsewhere it can't be deleted until that is resolved.

If a top-level manager leaves I want an error if the hierarchy hasn't been updated before deleting his record¹, rather than his underlings, their underlings, their underling's underlings, … , being deleted when that one person is!

----

[1] Obviously this would normally be a soft-delete, there may be a lot of records referring to such an individual not just other person records. If you actually need to delete them (right to be forgotten etc.) then you need to purge the PII but keep the record so other things still hang together.


Often you don't have to rely on ON DELETE CASCADE relationships. Because you are never deleting anything, you will never have any orphaned records. If you don't want to see say Invoices for a deleted Customer then that's just another filter feature.

Mostly I use soft-delete because for auditing requirements we pretty much can't remove anything but also because nothing ever truly goes away. If we have an Invoice or Order then, from our perspective, we must have those forever even if the corresponding client is deleted and can never place another one.


> Often you don't have to rely on ON DELETE CASCADE relationships. Because you are never deleting anything, you will never have any orphaned records

Exactly. Unless you're doing something silly like adding deleted_at to bridge tables ... which you probably don't need even in 1:many.


You may end up doing this anyway if you have any application code that needs access to delete hooks, or if access control varies across objects. At this point, you are probably using an ORM instead of direct queries, and placing logic that could be in the DB at the app layer instead.


Being unable to effectively use foreign key relationships is definitely a downside of using soft deletes. But it's also worth asking if these types of behaviors, which would also include a feature like triggers, really belong in a database or whether it's better to have them at the application level (or at least at a layer above the data layer). I'd argue that ultimately you probably don't want these things at the DB level because you get into a situation where you're sharing business logic between two (or more) places.


My perspective is that DB-level triggers are the very best place to put cascading update/delete logic, so it only ever needs to be written once and is consistent regardless of any future frontend clients that might be written in a different language and/or framework than the original codebase.

Right now in $dayjob I am converting an old non-DRY codebase from NoSQL data layer format to proper relational SQL backend.

This old front-end was created by verbosely coding up a relational cascading update/delete system for the NoSQL backend, in numerous places redundantly with subtle differences and inconsistencies, making the code brittle.

My current estimate is some front end functions will be reduced in LOC size by 95% once we use the power of SQL in the backend.

And the backend SQL Triggers+StoredProcedure required to replace these long NoSQL frontend functions doing cascading updates/deletes is only around 10% the size of the replaced front-end code.

And now future new frontends can reuse this without exploding the size of their codebase where complex data operations are required. And no need to reinvent the same data handling algorithm all over again (and risk subtle variation creeping in from the different front-end implementation of data algorithms)


I'm less likely to use triggers, but I'll say I pretty much always want proper foreign key relationships set up in the database. Unique and other constraints too. In principle I might agree with you that it's an application level concern, but being able to set up these kinds of invariants and trust that they will be enforced even in the face of bugs in my code (and there will be bugs in my code) is just too powerful to let go of in the name of purity. I'd much rather let a bit of business logic creep between multiple layers than discover that I have a whole bunch of orphaned records and no sensible way to reconcile them.


The DB's responsible for maintaining the integrity of data in it, unless there's some very good reason you can't let it do that. It's faster and better at it than your application, 99% of the time, and it can keep doing that even when/if a second application starts using the same database, or when being used from a SQL prompt, or whatever.


Presumably you have a schema defining the tables, the columns, and the types at the least, along with things like unique indexes. So you already have data constraints in your database design. And that's where they belong, to ensure the data integrity, since the database's concern is the data.

If you're doing everything as one big table of entity-attribute-value with generic blobs for the values, then yes you'll have to re-implement all the normal constraints (that the database would handle) in your application and do all your data integrity handling there. And you'll also have to duplicate that logic across every application that accesses that database now and in the future.

Data usually lives longer and has more uses than just one program. So I think it's generally better to put integrity constraints in the database, rather than having to re-implement and duplicate that logic several places.


If we're assuming you're using a view-based approach which elides the soft-deleted rows automatically, then you'll get a lot of these dependent objects correctly updated for free, assuming you're pulling them out of the DB with JOINs - SELECT ... FROM foo JOIN bar (assuming bar is a view into barwithdeleted) will automatically filter out the invalid rows from foo. If you're using this information to populate a CRUD interface, it's likely you'll be JOINing bar already to get some metadata for display (like maybe bar.name instead of the surrogate bar.id key you use for joining).


Yes, but other queries (any aggregate queries that don't join the soft deleted table, any joins to other tables) will now return rows that would have been deleted under hard deletion with cascade.


This is definitely something to watch out for, but in practice (as someone that migrated a system without soft deletes to one that had them) I found that it doesn't tend to come up nearly as much as you might think - usually the table being transitioned to support soft deletes is a relatively core table (since ancillary tables usually aren't worth the complexity to transition) so a lot of your reporting queries will already be pulling in that table. You definitely need to check to make sure you're not missing anything - and sometimes CRUD interfaces will need to be completely revamped to include the table in question - but it's usually not that hard.


You could use a trigger to cascade soft-delete flag toggles, provided all the relevant tables have such a column. Still have to factor that into your other queries, but at least you wouldn't have to make potentially-impossible-to-exhaustively-check joins to figure out if a given row was "deleted".
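
A hedged sketch of that (invented parent/child tables), cascading the timestamp down one level:

```
CREATE FUNCTION cascade_order_soft_delete() RETURNS trigger AS $$
BEGIN
    IF NEW.deleted_at IS NOT NULL AND OLD.deleted_at IS NULL THEN
        UPDATE order_lines
           SET deleted_at = NEW.deleted_at
         WHERE order_id = NEW.id
           AND deleted_at IS NULL;
    END IF;
    -- Cascading an un-delete (deleted_at back to NULL) is left out here.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cascade_soft_delete
    AFTER UPDATE OF deleted_at ON orders
    FOR EACH ROW EXECUTE FUNCTION cascade_order_soft_delete();
```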


Isn't there any attempt to improve soft deletion at the engine/SQL level? I can see it as a possible feature request.


If you're using PostgreSQL, you can implement cascading soft-deletes yourself.

The information_schema views hold all foreign key relationships, so one can write a generic procedure that cascades through the fkey graph to soft-delete rows in any related tables.
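
A starting point might look like this (a hedged sketch; the recursive walk and the actual soft-delete UPDATEs are left out):

```
-- Find every table/column with a foreign key pointing at a given table.
SELECT tc.table_schema, tc.table_name, kcu.column_name
  FROM information_schema.table_constraints tc
  JOIN information_schema.key_column_usage kcu
    ON kcu.constraint_name = tc.constraint_name
   AND kcu.table_schema   = tc.table_schema
  JOIN information_schema.constraint_column_usage ccu
    ON ccu.constraint_name = tc.constraint_name
   AND ccu.table_schema    = tc.table_schema
 WHERE tc.constraint_type = 'FOREIGN KEY'
   AND ccu.table_name = 'users';
```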


Could someone take a stab at an example of what this would look like? Sounds really interesting.


Here's an interesting approach using rules: https://evilmartians.com/chronicles/soft-deletion-with-postg...



There's the idea of temporal tables, https://pgxn.org/dist/temporal_tables/

It's not a standard (I think) but it'd let you do a cascading delete and then be able to go and look at the old objects as they were at time of deletion too.

You'd need to do things very differently to show a list of deleted objects though.


> temporal tables … It's not a standard

They were introduced in ANSI SQL 2011.

How closely implementations follow the standard I don't know, but something close exists in several DBMSs: I use them regularly in MS SQL Server, there are plug-ins for postgres, MariaDB has them, and so forth.


It appears that there's been an attempt at standardizing temporal features in SQL in the SQL:2011 standard: https://en.wikipedia.org/wiki/SQL:2011


Neat I didn't know that had happened. I can't say I follow SQL standards all that thoroughly.


You lose traditional FK constraints with temporal tables since there's multiple copies of a row. One workaround is to partition current rows separately from historical rows and only enforce FK constraints on the current partition.


One interesting feature that some DBs implement is something like SELECT AS OF SYSTEM TIME (https://www.cockroachlabs.com/docs/stable/as-of-system-time....) which _kinda_ does this.

However in practice this usually dramatically slows down reads if you have to constantly skip over the historic rows, so you probably don't want to keep garbage around longer than absolutely necessary. The concept of a historic table mentioned below could be interesting though - especially if it could be offloaded to cold storage.


Doubt it. It seems like something obvious yet I’ve waited so long for it. Seems like you have to rely on third party plugins.


> This is one gripe I have with soft-deletion. Since I can no longer rely on ON DELETE CASCADE relationships

If you use soft deletes on all tables, you can also cascade them as long as you either cascade updates to the real keys as well, or prevent such updates, by having a deleted flag column on each table, including it in a unique constraint with the actual key column(s), and including it in the foreign key.
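
One hedged reading of that scheme (illustrative names), where flipping the parent's flag cascades to the children through ON UPDATE CASCADE:

```
CREATE TABLE parents (
    id      bigint  PRIMARY KEY,
    deleted boolean NOT NULL DEFAULT false,
    UNIQUE (id, deleted)                      -- extra key for the children to reference
);

CREATE TABLE children (
    id        bigint  PRIMARY KEY,
    parent_id bigint  NOT NULL,
    deleted   boolean NOT NULL DEFAULT false,
    FOREIGN KEY (parent_id, deleted)
        REFERENCES parents (id, deleted)
        ON UPDATE CASCADE
);

-- Soft-deleting the parent flips its flag; the cascade flips it on the children too.
UPDATE parents SET deleted = true WHERE id = 1;
```

The trade-off with this particular layout is that a child row can no longer be soft-deleted independently of its parent (the referenced pair would stop matching), which may or may not be what you want.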


The main problem with views for this use case in practice is that they ossify your schema. Views and matviews are effectively a dependency tree, and many common types of schema evolution become substantially more difficult when the system forces you to wrap your DDL in a series of view drop/recreation steps.

This is merely annoying when dealing with regular views because recreating even a large number of views is fast, but can be catastrophic if you have any matviews in your table dependency tree. A matview can easily turn what should be an instantaneous DDL operation into a partial outage while the matview is being regenerated.

(this is all postgres specific, it may be untrue for other systems)


As an FYI, using a tool like DBT solves this problem. As someone who was not a data engineer, I was not familiar with tools like this.


dbt is great, but I'm not sure it's appropriate to manage objects in a production transactional database. It's designed for analytical, columnar databases.


In case you're not familiar with DBT, it helps with the "transform" in ELT https://docs.getdbt.com/docs/introduction.


I have this problem. What is DBT?



Thank you for this. I had no idea this existed either and it may solve some points for us on MySQL where we've really been yearning for materialized views... which I'm now reading may be a foot gun.


> The main problem with views for this use case in practice is that they ossify your schema. Views and matviews are effectively a dependency tree, and many common types of schema evolution become substantially more difficult when the system forces you to wrap your DDL in a series of view drop/recreation steps.

You're not wrong, especially with the second part. I.e., deeply nested or convoluted dependencies between views can definitely make it awkward or painful to make adjustments near the root of the tree.

When I started this reply I was going to say "I hear you, but it's not an issue I run into very often". But that's not true. I've actually been burned by that moderately often, and have sometimes avoided or redesigned the root-level table change to avoid having to propagate all those changes to the rest of the dependency tree.

That said, in my experience (also mostly with postgres for this context) I feel like that's usually been more of a developer laziness issue (my own laziness that is), rather than an "ossified schema" issue. It's definitely a PITA when some simple change is going to break a dozen inter-connected views, but that's a coding issue not a deployment issue almost all of the time.

To be fair I don't really use matviews very often, but for true basic views I am guessing that the actual execution of the DDL to rebuild changed views is manageable in all but the most extreme cases. Even then there _should_ be a maintenance window of some sort available.

Thinking this thru a little bit, I believe the "anti-pattern" you're warning against isn't really views themselves but deeply nested/interconnected views (views that query other views, etc). I use views often (for this logical-delete type idiom for example) and I have rarely regretted it. I have often regretted creating complicated view-based dependency trees however, so I think I'm wholeheartedly in agreement on that point.


To avoid an outage, have you tried fronting the matview with an additional view to allow hot-swapping?


It is very dangerous to have a dependency on a materialized view - it is a poor architectural decision for a DBA to allow that.

If you want a view depending on a materialized view, materialize it yourself in a table, and refresh it yourself in a controlled way.


Aren’t you just saying that materialized views should not be used? I’m pretty sure that was my original point.


Seriously. That "Downsides: Code leakage" point is nonsensical.

```
CREATE OR REPLACE VIEW active_customer AS
    SELECT *
    FROM customer
    WHERE deleted_at IS NULL OR deleted_at > NOW();
```

There, I fixed it.

Just use `active_customer` instead of `customer ... deleted_at IS NULL`.

In fact, since the deleted_at column is a timestamp, the original "leakage" query:

```
SELECT * FROM customer WHERE id = @id AND deleted_at IS NULL;
```

is actually broken. A non-null `deleted_at` timestamp that's in the future implies the record hasn't been deleted yet, right?

I've often had junior devs assert that views are some kind of code smell, but these sorts of "canned query/filter that you want to apply very very often" cases seem like the perfect use case for a view to me. It's DRY, and the fact that your standard "query" is in the database means you can change it more readily than trying to make sure you hit all the points where it might be embedded in the application code.

> I feel like a lot of developers underutilize the capabilities of the massively advanced database engines they code against

Early-ish in the JDBC days a senior dev I was working with at the time (as a junior dev myself) made a pretty good case that "the database is part of the application" that's always stuck with me. Full database independence via software level abstractions is a pretty silly goal outside of library code. If you have a service that makes extensive use of the database, don't throw away the database features in the interest of some abstract "we could swap out oracle with mysql without changing anything" objective. If you want it to be generic, use the SQL standard, but don't be afraid to have a few db-specific bits in the app code if that's a subsystem you might replace once a decade or something.

I blame the DBA/Dev divide for a lot of this. A lot of the impedance between these layers is social/procedural. If you can change the DB as easily as the code, there's a lot less fear of using the right tool for the specific job.


The query isn't broken. In the Rails community at least it is very common to use a nullable frobbed_at column to indicate both "was it frobbed" and "when was it frobbed". In that context, the boolean check is always NULL/NOT NULL, rather than a time comparison.


I can see that use case and I guess if that's a popular idiom within a particular dev community it's not indefensible, but from a first-principles code-complete/pragmatic-programmer kind of lens, that seems to me like a really terrible naming convention if that's your intent.

It seems to me that having a date-type column named `is_frobbed` is highly preferable to `frobbed_at` if your intent is for `IS NOT NULL` to mean frobbed.

Sure, it's a little weird to have boolean-sounding `is_frobbed` name for a date column, but the number of times that `frobbed_at IS [NOT] NULL` appears in the code is likely to dwarf the number of times you're inserting a date type into the `is_frobbed` column. I feel like there's always going to be someone (like me in this case) that's going to come across a `frobbed_at IS NOT NULL` case, notice that it's a date column and make the same flawed assumption I did. A similar thing will happen to the `is_frobbed` column too, but in that case trying to treat the date type as a boolean is going to make it obvious that you're not understanding the full context.

Frankly, if I was naming a column for this idiom and didn't have other constraints (like the Rails context), I would probably try to find a more exact and direct way to express it, even if that's a little clunky. Maybe something like `frobbed_when_not_null`? But I don't love that. Honestly it may make more sense to have an actual boolean-valued `is_frobbed` column and an independent `frobbed_at` timestamp column that's populated by a trigger when the value of `is_frobbed` changes. I feel like "clunky but direct" beats out "elegant but misleading" in the long run, especially given the degree of "clunky" and "elegant" we're talking about here.

I don't expect to talk you out of it, and I'd fall in line with this in a Rails context too if that's the convention, but I think it's objectively poor design.

For what it's worth, if the `frobbed_at` convention usually intended to track the timestamp at which the frobbing happened, the good news is that my `AND frobbed_at <= NOW()` check would be unlikely to break anything in practice. Assuming that there's nothing weird going on with the timestamps, the frobbed_at dates will always be in the past anyway.


I would use "deleted_after" for the use case you describe (scheduled deletions). "deleted_at" is saying that an event occurred (deletion) at a specific time. There is never going to be a case where you have a deletion time for a record that is not deleted, nor should there ever be a time where a record is deleted but a time is not recorded. So a single timestamp column is a parsimonious solution.


I agree `deleted_after` is a more appropriate name than `deleted_at`, but it starts to run into naming conventions again. For example `published_at` is often used in editorial systems (newspapers, blogs, etc.) for things that "embargoed until" some future date. `published_after` or even just `pub_dt` would probably be more appropriate, but I feel like the `_at` suffix is well established in some contexts.

For what it's worth, I have built and managed systems where "delete/disable this thing in the future" is a valid use case, but I'm wondering whether/how often/how the date aspect of the Rails-style `deleted_at` is actually used. Does anyone ever care when the record was deleted, or is it just extra metadata that is occasionally used in an ad hoc way for debugging or diagnostics. If the typical rails app replaced `deleted_at` with boolean-valued column, would it actually matter?


I have relied on *_at timestamps countless times, for support, analytics, etc. deleted_at in specific? Less often, sure.


> the database is part of the application

100% this. If you accept that the database is part of the application, you give yourself permission to use the full feature set of the database, and life becomes a lot simpler. Using views, stored procedures and other features lets you implement things like soft delete trivially, without it infecting all your application code.

In my entire career I've changed backend databases for an application exactly twice. It's not easy, and no amount of abstraction is likely to make it easier.


> Using views, stored procedures and other features lets you implement things like soft delete trivially, without it infecting all your application code.

That’s great but some of us actually like to write code. Especially Ruby on Rails where soft delete is a breeze if you don’t overthink it and build something the business doesn’t need.


Well, less code means less bugs, but go nuts. I'm just saying that the database is part of the stack. You wouldn't avoid Ruby features "just in case we stop using Ruby" - so why would we avoid using database features? It's up to you how best to assemble the features from your stack.

There is a reason we avoided database features in the 90's, which was to avoid database vendor lock-in. This was almost entirely a financial decision - you didn't want to be at the mercy of a vendor who could 10x the license fees and put you out of business, so you needed to have leverage over the vendor, and you needed to be able to credibly say you could change databases -- and in fact this exact scenario actually happened to me.

But that kind of behaviour is not really needed any more, and there's no need to avoid the great database features that systems like PG provide. In my experience, using database primitives to implement features tends to perform much faster (10x - 100x) and more consistently than the equivalent implementations higher up the stack.


The main reason was indeed vendor lock-in. Databases (like compilers) were hugely expensive back before open source got big.

But also some of the features were a bit flaky back then. I remember views in MySQL always caused problems - if you dumped and restored (ex: staging to dev), the special handling of permissions for views and such ("Access denied; you need the SUPER privilege for this operation") would break stuff. So we just didn't use them.^1

However, I'm still in favor of them, and think it's worth finding workarounds for things like that. They're not as flaky now, and people have found workarounds for the remaining flakiness. Nowadays the fear of using them is just tradition from 30 year old problems.

But now we have one or two generations of developers who were taught "Don't use the database features! 'Best practice' is to use it as a dumb datastore and re-implement them all in your application." But they don't know why, so wouldn't even know that those reasons are no longer applicable.

>^1 There exists a shortcoming with the current implementation of views. If a user is granted the basic privileges necessary to create a view (the CREATE VIEW and SELECT privileges), that user cannot call SHOW CREATE VIEW on that object unless the user is also granted the SHOW VIEW privilege.

>That shortcoming can lead to problems backing up a database with mysqldump, which may fail due to insufficient privileges. This problem is described in Bug #22062.

>The workaround to the problem is for the administrator to manually grant the SHOW VIEW privilege to users who are granted CREATE VIEW, since MySQL doesn't grant it implicitly when views are created.

> -- https://dev.mysql.com/doc/refman/8.0/en/view-restrictions.ht...


The use case for using the database here is overblown. I've never experienced or heard of anyone using a view for the users table. I suppose it may exist in the wild somewhere.

I wouldn't avoid Ruby features because the applications I develop are in Ruby and it would require a complete rewrite anyway. I avoid the database because there are development speed advantages to using Rails features. And everything works together nicely assuming you do things the Rails way.

https://github.com/rubysherpas/paranoia

> If a record has has_many associations defined AND those associations have dependent: :destroy set on them, then they will also be soft-deleted if acts_as_paranoid is set, otherwise the normal destroy will be called.

It's convenient.


> I've never experienced of heard of anyone using a view for the users table

Well, now you have heard of people doing this.


I'd argue that the simple `deleted_at IS NULL` check is not broken - unless your product / domain specifically allows and requires scheduled future deletions, adding such logic can easily introduce bugs. For example, you could get the comparison flipped by accident, and if it's only in one place out of many that bug could go unnoticed for a while.


If any non-null value for deleted_at indicates a logical delete, it seems like TIMESTAMP is the wrong data type for that column.

I mean I guess I can imagine a scenario where you want to know _when_ a record was logically deleted in addition to a flag that says is_deleted, but maybe that should be two different columns.

Or at the very least, `deleted_at` is a misleading name in that case. Interpreting `deleted_at` as "is deleted after but not before" is definitely a reasonable _enough_ interpretation. I agree it doesn't unambiguously mean that, but it's at best ambiguous enough to make for a bad column name if `IS NOT NULL` is the way you intend to check for a logical delete.

> you could to get the comparison flipped by accident, and if it's only in one place out of many that bug could go unnoticed for a while.

That's exactly why I proposed a database view as the proper solution to the whole `deleted_at IS NULL` "code leakage" issue. If you have a table that uses a logical-delete idiom that you intend to query from a bunch of different places in the code, a database view that filters out the logically deleted rows is a much more appropriate solution. It's a universally DRY solution. The single simple view covers all tech stacks, all services, etc., up to and including interacting with the database via some sort of REPL console.

If you need to add `deleted_at IS NULL` to, say, 90% of the queries against the `customer` table and you're afraid someone is going to forget to do that somewhere somehow, it seems pretty obvious that some kind of "extract common WHERE clause" refactoring is warranted. This is especially true in the actual "business meaning" context here. You are only going to be interested in the logically deleted rows in a small number of specific circumstances. Having some way to abstract away that detail from the 90% of the use cases where "logically deleted" is meant to be functionally equivalent to "actually deleted" seems extremely warranted. It's a perfect counter-example to overcome an aversion to using database views. I get that the `active_customer` abstraction is probably awkward to implement in a lot of application code contexts (depending on your ORM/DB-query framework I guess), but the database view is a simple and elegant way to address that really clearly defined need.


> If any non-null value for deleted_at indicates a logical delete, it seems like TIMESTAMP is the wrong data type for that column.

Not if I also want the timestamp of when it was deleted


That's fair. It's what I was trying to acknowledge when I wrote "I can imagine a scenario where you want to know _when_ a record was logically deleted".

That said, I'm genuinely curious about how often and in what way that time-based information is actually used.

For example, I often see created_at and modified_at columns that are completely ignored by the actual business logic and application code, and that are only used (if ever) for ad hoc diagnostic purposes like reconstructing the sequence of events after some catastrophic error in the app logic screwed up the database. If run-away-logic-screwing-up-the-database is a legitimate cause for concern, that's probably reason enough to include that extra meta-data, but I can easily count on one hand the number of times that's come up in practice across my long career. I mean, I also include that sort of column often. And there are certainly entities for which those data are relevant for application logic and/or reporting purposes. But there are definitely cases in which those data are never used at all too.

If this `deleted_at` timestamp-as-boolean convention is popular enough in the Rails community to be universally known, is there some common use for knowing _when_ a record was logically deleted (outside of reconstructing of the database state after something goes wrong) that I'm missing?

I.e., is this something that's actually used in the typical business logic and application code, or is it primarily used for ad hoc debugging?

To be clear, I'm not _doubting_ that such a use case exists; it's just that off the top of my head I can't think of a commonly occurring one.


Every Rails developer knows what “deleted_at” means. I’m absolutely repulsed at the thought of turning the users table into a view. What’s next? No code and we only use triggers?


> I’m absolutely repulsed at the thought of turning the users table into a view. What’s next? No code and we only use triggers?

Assuming there's a justifiable need to apply the logical-delete idiom to the users table, what's the Rails-native, ActiveRecord-based approach to filtering out the logically deleted rows?


You don't put the check in your code, you put it in the view, and access the data exclusively through that view. In that way, the check is defined exactly once.


And now your view is based on an unstable function, so it's impossible to write partial indexes for it (very important for performance!) and impossible to use foreign keys with. Just add a check constraint that `deleted_at` is never in the future and move on with your life. If you really, really need to schedule a delete, make it its own concern (a UserVersions table with applicable_at?); don't mix it with your soft delete logic.
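For illustration, a sketch of how I read that suggestion, assuming Postgres and made-up table/index names:

    -- Partial index: only live rows are indexed, so the common
    -- "WHERE deleted_at IS NULL" lookups stay small and fast.
    CREATE INDEX customer_email_live_idx
        ON customer (email)
        WHERE deleted_at IS NULL;

    -- Keep deleted_at in the past so IS NULL really means
    -- "not deleted" and the predicate never depends on now().
    ALTER TABLE customer
        ADD CONSTRAINT deleted_at_not_in_future
        CHECK (deleted_at IS NULL OR deleted_at <= now());

(As far as I know Postgres accepts now() in a CHECK expression even though the docs assume CHECK conditions are immutable, so treat it as an enforced convention rather than a hard guarantee.)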


At least in Postgres, having a huge amount of "dead" data in large tables is problematic because vacuum always has to read the full data set.

Even with conditional indexes where you exclude deleted data you take a significant performance hit reading dead blocks because there is no way to quickly vacuum them. You accumulate hours of bloat until your vacuum finishes.

You can't beat a separate insert only archive table which you never have to vacuum.


This is practically the #1 use-case for partitions imo. Partitions are just tables with syntactic sugar (which Postgres understands in its query planner), so each partition maintains its own indexes and is vacuumed individually. If you structure your partitions into hot/cold, or hot/cold_day1/cold_day2/etc, then you get several advantages:

* hot and cold do not churn each other's indexes or tables; you effectively have only one set of indexes (and data) that's actually churning, and the others are stable.

* hot and cold can be treated differently - you can perform more aggressive indexing on data once you know it's out of the hot-set, or partition your hot data onto a tablespace with dedicated hardware while cold data is on bulk hardware, etc. Since queries are planned onto each partition individually, postgres can often select "the right way" for each partition.

* "deleted_at" is a special case of cold table. Dropping any partition is basically free, so if you partition into date ranges, then at the end of your data retention period you just drop the partition for that date range, which doesn't churn indexes or induce vacuuming

If data can never exit the cold state, then it's effectively an append-only table too; it just exists as a supplement to your online/hot-data table and doesn't require special attention. So we're in agreement on that point; that's a good design feature if you can swing it!

(note that for audit logging, I think it's simpler to just do the separate table. But the partition strategy does have some advantages for "cold" or "dead" data as a more generic concern imo)
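A minimal sketch of the hot/cold split described above, assuming Postgres 11+ declarative partitioning and illustrative names:

    -- Partition on "is this row live?" so live and soft-deleted
    -- rows keep separate indexes and are vacuumed independently.
    CREATE TABLE customer (
        id         bigint NOT NULL,
        email      text,
        deleted_at timestamptz
    ) PARTITION BY LIST ((deleted_at IS NULL));

    CREATE TABLE customer_hot
        PARTITION OF customer FOR VALUES IN (true);
    CREATE TABLE customer_cold
        PARTITION OF customer FOR VALUES IN (false);

    -- Setting deleted_at moves the row from hot to cold
    -- (cross-partition row movement on UPDATE needs Postgres 11+).
    UPDATE customer SET deleted_at = now() WHERE id = 42;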


Vacuum does not have to read the full data set every time. The visibility map tracks, on a block level, whether all rows in a page are known to be visible to everyone (starting in 8.4 or such) and whether the page is "frozen", i.e., does not contain visibility information that might need vacuuming (starting in 9.6 IIRC).

However, indexes do currently have to be scanned as a whole. But that's only done by autovacuum if there are enough row versions for it to be worth it (in recent versions).


Shouldn’t partitioning help with that? (I have no experience with Postgres.)


Only if those partitions are on separate storage; otherwise you have the same number of dead tuples/dirty pages.


Not sure that is true; Postgres can rule out entire partitions and scan less stuff, I think?


Doesn't help with writes. More dead tuples can mean more pages scanned for the same amount of data.


Based on my experience, I like the author's approach since it makes things pretty clear-cut and optimizes the storage in the core table (in my experience as well, deletes happen frequently and the soft-deleted rows are rarely touched). In large, row-oriented tables that storage can add up, and even with views/materialized views there's a cost to using/maintaining those as well.


A problem (unless something has changed, my context is Oracle from some time ago) is that NULL values are not indexed. So the "WHERE deleted_at IS NULL" could trigger a full table scan. It can also cause row migration when the NULL value is eventually filled in. Unless you explicitly need the deleted date, it's probably better to use a non-nullable Y/N for this.
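A sketch of the Y/N-flag variant being suggested (Oracle-flavored SQL; names are made up):

    -- Non-nullable flag so the "not deleted" predicate doesn't
    -- rely on IS NULL; keep a timestamp only if you need "when".
    ALTER TABLE customer ADD (
        is_deleted CHAR(1) DEFAULT 'N' NOT NULL
                   CHECK (is_deleted IN ('Y', 'N'))
    );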


It seems Oracle does do that, although there is a special syntax to opt in. That seems wild. I'm not aware of another DBMS having that limitation, though.


Views can really bite you performance-wise, at least with Postgres. If you add a WHERE clause to a query against a view, Postgres (edit: often) won't merge your query's predicates with the predicates of the view, often leading to large table scans.


IIRC Postgres has supported predicate push down on trivial views like this for over a decade now, and possibly even more complex views these days (I haven't kept up with the latest greatest changes).


Postgres can do it, you're correct, but in my experience it rarely happens with any view that's even slightly non-trivial, even on recent versions of Postgres. Most views with a join break predicate pushdown. It greatly reduces the use cases of views in practice.


I haven’t had any problems with this at all and I’ve been using joins in my views for years.

Are you using CTEs in your views?



Postgres 12 fixed the CTE issue.


There is no such impact with views in MS SQL. You can also have indexed views and filtered indexes, so you can get even better performance.


This is one of those situations where a good ORM can simplify things greatly. For example, with EF Core you can add a global filter which will filter out soft-deleted rows in all queries automatically (unless you add .IgnoreQueryFilters()).

It couples nicely with some hackery which turns removes into soft-deletes. You can remove objects as usual and they get soft-deleted in the database.

I've used this in a few projects and it's fantastic.

https://docs.microsoft.com/en-us/ef/core/querying/filters

https://www.thereformedprogrammer.net/ef-core-in-depth-soft-...


> there are fairly well developed techniques for keeping DB and app states, logic and schemas aligned via migrations and partitioning and whatnot.

Hi, <1 yr experience swe here. Would HN mind unpacking "whatnot" with specific names of some of these techniques?


I had the same reaction to the 'code leakage' section, but 'foreign keys'? You can't have a foreign key that references a view, so you either don't use them (FKs) or they point at the underlying table and you have the problem described.

You could have views that say 'thing I have a foreign key to is not deleted' of course, but that sort of seems like 'code leakage' again, just in SQL this time.


> developers underutilize the capabilities of the massively advanced database engines

So true. There are so many amazing, powerful features in all of the major players.

Also: updatable views are amazing. With query rewriting (whatever your vendor calls it) you can effect some truly material changes to the system without any changes to the client applications. An example would be implementing temporal relations.
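As one possible illustration of that rewriting, a sketch using a Postgres rule against an `active_customer` view like the one discussed elsewhere in this thread:

    -- Rewrite DELETEs issued against the view into soft deletes
    -- on the base table; clients keep issuing ordinary DELETEs.
    CREATE RULE soft_delete AS
    ON DELETE TO active_customer
    DO INSTEAD
    UPDATE customer SET deleted_at = now()
    WHERE id = OLD.id;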


How does this help with foreign keys? Normally you can’t have foreign keys referencing a view.

I agree that one should make use of RDBMS capabilities. A check constraint may be practical instead of (or in addition to) the foreign-key constraint.


> (...updatable view...) WHERE deleted_at IS NULL

This is the way. Also, save the record creation timestamp, and you can have very flexible "time-machine" selects/views of your table essentially for free.
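A sketch of such a "time-machine" select, assuming created_at/deleted_at timestamps and an illustrative :as_of parameter:

    -- Rows as they existed at :as_of: already created by then
    -- and not yet (soft-)deleted at that point in time.
    SELECT *
    FROM customer
    WHERE created_at <= :as_of
      AND (deleted_at IS NULL OR deleted_at > :as_of);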


I'm glad to see this thread. I've been mulling over this exact issue of deleted_at code leakage with a naive soft-delete implementation. My immediate thought was to use views, so it's nice to see this is not yet another case of me using crack-brained ideas out of inexperience.

What's an appropriate naming convention?

Should I do it universally and put transparent views in front of all my tables, so I don't have to refactor my code to point to new views whenever I suddenly need to put in a model constraint that isn't 1:1 with my data layer? Is a transparent view a no-op in terms of perf? If it matters, this is being done in Postgres.

I will probably make my constraints partial over the not-deleted records, particularly for unique constraints used for upserts. Am I about to footgun myself? Is it even necessary with the new uniqueness behavior for NULLs being implemented in Postgres? Will my performance characteristics be better one way or the other in particular circumstances? It sounds like if I have a high ratio of deleted to not-deleted records, a partial index becomes necessary.
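For what it's worth, a sketch of the partial-unique-index-plus-upsert combination you're describing, assuming Postgres and an illustrative customer table:

    -- Uniqueness only among not-deleted rows, so a deleted row's
    -- email can be reused by a new signup.
    CREATE UNIQUE INDEX customer_email_active_uq
        ON customer (email)
        WHERE deleted_at IS NULL;

    -- The upsert has to name the same predicate so the planner
    -- can match the conflict target to the partial index.
    INSERT INTO customer (email, name)
    VALUES ('x@example.com', 'X')
    ON CONFLICT (email) WHERE deleted_at IS NULL
    DO UPDATE SET name = EXCLUDED.name;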


Views are such a powerful concept that I'm honestly disheartened by how hard it is to use, replicate, or leverage that functionality outside of dropping straight into the db shell.



Views can be used to implement pretty much any kind of automated inference, reasoning, rules, etc. on the "raw" table data. Filtering out deleted records is just one of the simplest examples. That one feature alone can transform a simple DB platform into a fully-featured knowledge base system, usable to support even complex reasoning tasks.


I was going to chime in with this, thanks. One issue with views, however, is that a lot of these features require more nuanced knowledge of RDBMSes. These days, unless you have a veteran architect, most of the team just knows the various libraries/tooling that interact with "a variety of databases", so there is often less effort to go deeper.


Where is this anti-db culture? Is it a startup thing?

Everywhere I have worked people know a decent amount about their data store. Not architects, just mid devs and higher.


I've seen so much torturously complex code written to coerce an ORM or query builder into generating a query that would have been simple just to write. Because "that's the way it's done", people spend 5 minutes writing a query and then 5+ hours trying to figure out the convoluted code necessary to coerce something into generating that same query.

Of course, there are times when query builders are necessary to create the queries dynamically depending on parameters. But it's all too common to see them used and abused even for static queries.

People are just plain afraid of writing SQL. Or they think it's 'best practice' to use an ORM or query builder and using actual SQL is somehow 'wrong'.

Even people who know a decent amount about the database do it anyway because "that's the way that it's supposed to be done" or "that's 'best practice'".


That doesn't solve the foreign key problem. You can still easily have a reference to a record that is "deleted"


Also in Postgres, you cannot have a foreign key constraint that references a view, not even a materialised view.

I'm with the author on this one. Any soft delete logic does have a tendency to bleed into other systems and make your systems more complicated, for very little gain.


How would a view solve the foreign key issue? Are you suggesting coding specific deletion triggers into the view such that appropriate foreign keys are "cascade" deleted when a row in the view is deleted?


Because when you are getting a child record through a join (in the view), the parent will never have deleted_at set.

Say I have a simple view `select * from foo join bar on foo.foo_id = bar.foo_id where foo.deleted_at is null`

I never have to worry about deleting from bar, because I should never grab a child when the parent is 'deleted'.


Assuming that the joins are always made against the views, not the raw tables. This introduces an opportunity for a mistake.


100% this. And yes, with inner joins it also solves the relationship issues.

