
Users should never be expected to know these gotchas for a feature called "private", documented or not. It's disappointing to see GitHub calling it a feature instead of a bug; to me it just shows a complete lack of care about security. Privacy features should _always_ have a strict, safe default.

In the meantime I'll be calling "private" repos "unlisted", which seems more appropriate.


> I'll be calling "private" repos "unlisted"

The same for “deleted” repos.


"deleted" is just a fancy word "inaccessible to the user"


No, it really isn’t. Anyone who uses that word that way is just factually incorrect, and probably pretty irresponsible depending on the context. Software should not tell lies.


> delete: remove or obliterate (written or printed matter), especially by drawing a line through it or marking it with a delete sign

Which is, indeed, what every modern database does.


I think you are referring to tombstoning. That's usually a temporary process that may immediately delete the underlying data, keeping a tombstone to ensure the deletion propagates to all storage nodes. A compaction process purges the underlying data (if still present) and the tombstones after a suitable delay. It's a fancy delete that takes some time to process, but the data is eventually gone. You could turn off the compaction, if you wanted.
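
For intuition, a toy sketch of the tombstone-plus-compaction pattern in Python (illustrative only, not any particular database's implementation):

    import time

    TOMBSTONE = object()  # marker meaning "this key was deleted"

    class ToyStore:
        """Toy LSM-flavoured store: deletes write tombstones, compaction purges."""

        def __init__(self, grace_seconds=3600):
            self.data = {}  # key -> (value, write_time)
            self.grace = grace_seconds

        def delete(self, key):
            # Write a tombstone instead of removing the entry outright, so the
            # deletion can propagate to replicas that still hold the old value.
            self.data[key] = (TOMBSTONE, time.time())

        def get(self, key):
            entry = self.data.get(key)
            if entry is None or entry[0] is TOMBSTONE:
                return None
            return entry[0]

        def compact(self):
            # Only here, after the grace period, is the data actually purged.
            now = time.time()
            self.data = {k: v for k, v in self.data.items()
                         if not (v[0] is TOMBSTONE and now - v[1] > self.grace)}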

I believe Kafka makes deletion difficult, since it's an append-only log, but Kafka doesn't work well with laws that require deletion of data, so I don't believe it's a popular choice any longer (i.e. isn't modern).


If you run a DELETE FROM in any modern SQL engine, which is the absolute best you could expect when asking for a delete in the UI^, the data is nowhere near gone. It’s still in all the backups, all the WALs, all the transactions that started before yours, etc. It’s marked for eventual removal, and that’s it. Just as the definition of delete I provided says.

^ (more likely they’ll just update the table to set a deleted flag)
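
You can see this for yourself even with SQLite (a sketch using Python's bundled sqlite3; the exact on-disk behaviour depends on journal mode and page layout, so treat it as illustrative):

    import os, sqlite3

    PATH = "demo.db"
    if os.path.exists(PATH):
        os.remove(PATH)

    con = sqlite3.connect(PATH)
    con.execute("CREATE TABLE t (body TEXT)")
    con.execute("INSERT INTO t VALUES ('SUPER-SECRET-PAYLOAD')")
    con.commit()
    con.execute("DELETE FROM t")  # gone from query results...
    con.commit()
    con.close()

    raw = open(PATH, "rb").read()
    print(b"SUPER-SECRET-PAYLOAD" in raw)  # ...but often still in the file bytes

    con = sqlite3.connect(PATH)
    con.execute("VACUUM")  # rewrites the file: the "eventual removal"
    con.close()
    print(b"SUPER-SECRET-PAYLOAD" in open(PATH, "rb").read())  # typically False
    # (SQLite's opt-in "PRAGMA secure_delete = ON" scrubs at delete time.)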


> eventual removal

To me, the fact that deletion takes time to complete doesn't negate the idea that the data will be gone once the process completes.

WAL archive and backups are external systems. You could argue that nothing supports deletion because an external backup could exist, but that's not a useful conversation.


Going back to the point of the thread, we agree the deleted data is not erased. The user is unable to access it through normal mechanisms, but the existence of side channels that could reveal it does not negate the idea that it has truly been “deleted”, especially when one looks at the historical context surrounding that word.


What? I don't agree with that.

Can you point to an example of a modern database that "supports deletion" but keeps the data around forever? Maybe I've just used different tools than you. Given modern data-retention concerns, I'd be surprised if such a thing existed.


Who said anything about that? We’re talking about side channels and eventual^TM deletion. Given enough time no information will remain anywhere, sure. But that’s not very relevant.


I think we are trying to define the word "delete". You found an archaic definition and are trying to use it in a modern technical setting. You've claimed that modern databases delete without actually removing data but haven't pointed to which systems you are talking about. I'm familiar with tombstoning, either as a "soft-delete" or as part of an eventual deletion process. But I've never seen that called deletion as that would be very confusing.

Pointing to which database you are talking about should clear this up quickly.

I don't think it's reasonable to talk about backups here. A backup is external to the database, so the database inherently cannot delete it. Similar to how a piece of paper cannot destroy a photograph of the paper, but burning the paper destroys it.


I used the first definition of delete I found which, while arguably “archaic”, matches the modern technical term almost exactly. We’d typically call that a well known word with a clear meaning.

And sure, the DELETE FROM statement in Postgres - or any other standards-compliant SQL DB I know.


In technical writing you often don't want to use the dictionary for definitions, similar to how words in a contract can have unexpected meaning in a legal setting.

For Postgres you've got to consider vacuum. Auto vacuum is enabled by default. Deleted rows are removed unless you go out of your way to make it do something different.
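
If you want to watch that happen, here's a sketch assuming a local Postgres and the psycopg2 driver (the DSN is a placeholder):

    import psycopg2

    # Placeholder DSN; point it at any database you can experiment on.
    conn = psycopg2.connect("dbname=test user=postgres")
    cur = conn.cursor()

    # n_dead_tup counts deleted-but-not-yet-vacuumed row versions.
    cur.execute("""
        SELECT relname, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
    """)
    for relname, n_dead, last_av in cur.fetchall():
        print(relname, n_dead, last_av)  # dead rows linger until (auto)vacuum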


Imagine the data that was deleted is of the highest level of illegality you can imagine. Under no circumstance can your service be associated with that content.

- What was your "definition of delete" again?

- You mentioned some of the convenient technical defaults your frameworks and tools provide out of the box; can you think of ways to improve the situation?

(You might re-run delete requests after restoring a backup; transactions should resolve in a timely fashion, failed deletes can be communicated to the user quickly, etc.)


We are missing the point here. The GP was claiming that delete meant something other than adding a mark to an item that you want to eventually be removed from the system. It doesn’t.


I understand that you're describing the status quo in many systems today.

However, besides the technical aspect, you talked about the "absolute best you could expect when asking for a delete in the UI^".

I think this is where I, other posters in the thread, most people, and probably the GDPR and other legislation would disagree. We expect significantly more effort to clean up deleted data.

This includes, for example, the ability to delete datasets from backups, as well as a general accountability of how often and where all the data is stored and if, and when a deletion process is complete.


> GDPR and other legislation

Nope. GDPR allows deleted data to be retained in backups so long as there is an expiration process in place. Doesn’t matter how long it is. But certainly nobody has a right to force a company to pull all of their backups from cold storage and comb through them all any time a deletion request takes place. That’d be the quickest path to Distributed Denial of Bank Account Funds imaginable. Even the GDPR isn’t that bone-headed.

But yes, it is part of the law that the provider should tell you that your data isn’t actually being erased and instead it will be kept around until they get around to erasing everything as part of their standard timelines. But that knowledge doesn’t do anyone much good.

> CNIL confirmed that you’ll have one month to answer to a removal request, and that you don’t need to delete a backup set in order to remove an individual from it.

https://blog.quantum.com/2018/01/26/backup-administrators-th...


But GitHub is keeping this stuff indefinitely. No long expiration, no probability of eventual disk overwriting, nothing. All they're doing is shutting the front door without shutting the side door.


Interesting point about the GDPR; I will soften my point to mean that lawmakers have started (late) to regulate data retention / deletion and the rights of users in general and that might be a trend for the future.

However, I would like to avoid the impression that describing the technical status quo settles the topic. To do so I would go back to my previous point: Imagine some truly illegal pictures are in that cold storage backup, and one day you might have to restore that data. (Since apparently the user's wish to delete data is not quite as respected as certain other hard legal requirements regarding content.)

What solutions to mitigate the situation could a company, or backup tool/web framework etc. reasonably come up with? Maybe check the restored data against a list of hashes/IDs of to-be-deleted-data?
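
A toy sketch of that hash-ledger idea (the names are hypothetical; a real system would need to ledger every shape of identifier it stores):

    import hashlib

    def fingerprint(record: bytes) -> str:
        return hashlib.sha256(record).hexdigest()

    # Fingerprints of everything users asked us to delete. This ledger must be
    # persisted outside the backup set, or a restore could resurrect the data.
    deletion_ledger = {
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    }

    def replay_deletions(restored_records):
        """Drop restored records that were deleted after the backup was taken."""
        for rec in restored_records:
            if fingerprint(rec) in deletion_ledger:
                continue  # deleted since the backup; don't bring it back
            yield rec

    print(list(replay_deletions([b"test", b"still wanted"])))
    # b"test" hashes to the ledgered value above, so only b"still wanted" survives.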


Every modern file system works like this too. Then there’s copy-on-write snapshotting and SSD wear leveling to worry about. Data isn’t actually destroyed until the space is reused to store something else at an indeterminate point in the future.

Or when its encryption key is overwritten.
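
That approach is often called crypto-shredding; a minimal sketch using the `cryptography` package's Fernet (the in-memory dict stands in for a real, erasable key service):

    from cryptography.fernet import Fernet

    # One key per record (or per user); the dict stands in for a small,
    # genuinely erasable keystore.
    keystore = {}

    def store(record_id: str, plaintext: bytes) -> bytes:
        key = Fernet.generate_key()
        keystore[record_id] = key
        return Fernet(key).encrypt(plaintext)  # ciphertext can live anywhere

    def crypto_shred(record_id: str) -> None:
        # Destroying the key renders every copy of the ciphertext unreadable:
        # backups, snapshots, wear-levelled flash, all of it.
        del keystore[record_id]

    blob = store("user-42", b"name, address, phone")
    crypto_shred("user-42")
    # blob still exists, but nothing can decrypt it any more.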

But it probably is a good idea to stop returning deleted data from web APIs.


This is why, when I'm building a confirm UI, I prefer the term "destroy?" on the confirm action. It's much clearer to the user that this is a destructive and irreversible action and that we will be removing this data/state.

*obviously doesn't apply to soft deletes.


No, deleted is a word for deleted. But we started saying things were "deleted", while our eyes flicked to the stack of backup tapes in the corner, acknowledging the white lie, because really deleting things conflicted with other priorities and was hard. And we left it there, until privacy regulations came along and it turned out not using the normal definition of deleted could get you sued. So IMO Github is wide open to paying damages to the first person able to demonstrate them.


It's tolerated for there to be temporary inaccessible copies sticking around when something is deleted.

What GitHub is doing here is neither temporary nor inaccessible.


"Bought" is just a fancy word for "granted a license for usage, subject to terms and conditions, which may be revoked at any time, for any reason, without any warning"


Yep, I see GitHub as "public only" hosting, and if I want to host something private, I will choose another vendor.


The noted issue looks to be applicable to forks only, not to all private repos.


It also applies to this situation:

    1. Create a private repo R
    2. Create a private fork F of R
    3. Push commits to the fork F
    4. Make R public
The commits pushed to F prior to R being made public will become de facto public, even though F has always been a private fork. The post makes clear that commits pushed to F after R is made public are placed into a separate, private fork network.

So basically, if you ever intend to open source anything, never do it to an existing private repo. Always start a from-scratch repo to be the root of your new public project.
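
For what it's worth, you can test whether a given commit is exposed this way through the public commits API (a sketch using Python's requests package; owner, repo, and SHA are placeholders):

    import requests

    # Placeholders: upstream repo R and a SHA that was only ever pushed to F.
    OWNER, REPO = "example-org", "R"
    SHA = "0000000000000000000000000000000000000000"

    resp = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{SHA}",
        headers={"Accept": "application/vnd.github+json"},
    )
    # Per the article, a 200 here means the commit is served through R's fork
    # network even though it only ever lived in the "private" fork F.
    print(resp.status_code)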


I find the attitude worrying. I understand that it's maybe not easy to fix, or even fixable without breaking some use cases.

However, if they "don't care" about such an issue, how can I trust them to care about other stuff?


GitHub's attitude toward, and perception of, the terms “privacy” and “security” - that is what's more important.


For the benefit of anybody thinking "with gitlab I'm safe from this": If you're saying (and perhaps you're not) that some other git hosting service

- gives you control over gc-ing their hosted remote?

- does not to your knowledge have a third-party public reflog or an events API or brute-forceable short hashes?

If so, especially the second of those seems a fragile assumption, because this is "just" the way git works (I'm not saying the consequences aren't easy to mentally gloss over). Even if GitLab lacks those things currently (but I think, for example, it does support short hashes), it's easy to imagine them showing up somehow retroactively.

If you're just agreeing with the grandparent post that github's naming ("private") is misleading or that the fork feature encourages this mistake: agreed.

Curious to know if any git hosting service does support gc-ing under user control.


> if I want to host something private, I will choose another vendor.

Or you know, self-host, preferably on-prem.

Basic git hosting only needs a sshd running on the server. If you want collaborative features with a web UI then there are solutions for that available too.


Or commit an ecryptfs.

Clone and mount; unmount and commit.


Extremely annoying, but it's the only truly private option on somebody else's computer.

I read headlines like the above with the implied "not just to the employees there anymore".


Which vendors work best for private projects?


Gitea works well. Use that on your own network.


You could consider GitLab, though this only seems to affect private forks of public repos.


I've been happy with Jetbrains Space (now Space Code); but I'm using it for private, professional work and paying for it, so perhaps that isn't what you mean.


JetBrains Space, Atlassian Bitbucket, GitLab (also On-Premises), Gitea

Order does not indicate any preference.


I've used both Bitbucket and Azure in the corporate world.


Sourcehut :)


> I'll be calling "private" repos "unlisted"

That might be a bit too strict. I'd still expect my private repos (no forks involved) to be private, unless we discover another footnote in GH's docs in a few years ¯\_(ツ)_/¯

But I'll forget about using forks except for publicly contributing to public repos.

> Users should never be expected to know these gotchas for a feature called "private".

Yes, the principle of least astonishment[0] should apply to security as well.

[0] https://en.wikipedia.org/wiki/Principle_of_least_astonishmen...


Specifically about the feature called "private", the only gotcha seems to be that when the upstream transitions from private to public, it may unexpectedly take more data public than desired, right? The other discussed gotchas were all about deleting public data not actually being deleted or inaccessible.


I see your point. On the other hand, the standard procedure for that in the GitHub UI is to create a repo and then select another as a template.

That doesn't fork, but it does what you would expect: a fully private repo.


> It's disappointing to see GitHub calling it a feature instead of a bug

Git is "distributed" version control software, after all. It means a peer can't control everything.


Anyone at your company can just push to a public git repository at any time. Nothing is stopping them except the threat of consequences.


So? Employees with access to sensitive data are capable of leaking that data. News at eleven!

And anyone in the world can pull what was pushed to a public git repo before you delete it. You should always assume that has happened.


This is about access to private repos, not public ones:

"Anyone can access deleted and private repository data on GitHub"


You might have noticed that my comment is a reply to another comment.


Disagree. If you're using a service, understand how it works.

Not everything needs to be designed for idiots and lazy people. It's OK for some tools and services, especially those aimed at technical people and engineers, to require reading to use properly and to be surprising or unintuitive at first glance.


There's got to be a word for these kinds of ridiculous arguments which use personal responsibility as a cudgel against a systematic fix.

I agree generally that interfaces have been dumbing down too far, but "private is actually not private and it's on you for not knowing that, idiot B)" is a weird place to be planting that flag.


There should probably also be a word for the belief that, when a system doesn't work how you want it to, it is so obviously a systematic problem that needs fixing, rather than, for example, evidence of differing goals or priorities, that it is reasonable to describe anyone who thinks otherwise as ridiculous.


That means having an opinion

