
Users should never be expected to know these gotchas for a feature called "private", documented or not. It's disappointing to see GitHub calling it a feature instead of a bug; to me it just shows a complete lack of care about security. Privacy features should _always_ have a strict, safe default.

In the meantime I'll be calling "private" repos "unlisted", which seems more appropriate.


> I'll be calling "private" repos "unlisted"

The same for “deleted” repos.


"deleted" is just a fancy word "inaccessible to the user"


No, it really isn’t. Anyone who uses that word that way is just factually incorrect, and probably pretty irresponsible depending on the context. Software should not tell lies.


> delete: remove or obliterate (written or printed matter), especially by drawing a line through it or marking it with a delete sign

Which is, indeed, what every modern database does.


I think you are referring to tombstoning. That's usually a temporary process that may immediately delete the underlying data, keeping a tombstone to ensure the deletion propagates to all storage nodes. A compaction process purges the underlying data (if still present) and the tombstones after a suitable delay. It's a fancy delete that takes some time to process, but the data is eventually gone. You could turn off the compaction, if you wanted.
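
For intuition, a toy sketch of the tombstone-plus-compaction pattern in Python (illustrative only, not any particular database's implementation):

    import time

    TOMBSTONE = object()  # marker meaning "this key was deleted"

    class ToyStore:
        """Toy LSM-flavoured store: deletes write tombstones, compaction purges."""

        def __init__(self, grace_seconds=3600):
            self.data = {}  # key -> (value, write_time)
            self.grace = grace_seconds

        def delete(self, key):
            # Write a tombstone instead of removing the entry outright, so the
            # deletion can propagate to replicas that still hold the old value.
            self.data[key] = (TOMBSTONE, time.time())

        def get(self, key):
            entry = self.data.get(key)
            if entry is None or entry[0] is TOMBSTONE:
                return None
            return entry[0]

        def compact(self):
            # Only here, after the grace period, is the data actually purged.
            now = time.time()
            self.data = {k: v for k, v in self.data.items()
                         if not (v[0] is TOMBSTONE and now - v[1] > self.grace)}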

I believe Kafka makes deletion difficult, since it's an append-only log, but Kafka doesn't work well with laws that require deletion of data, so I don't believe it's a popular choice any longer (i.e. isn't modern).


If you run a DELETE FROM in any modern SQL engine, which is the absolute best you could expect when asking for a delete in the UI^, the data is nowhere near gone. It’s still in all the backups, all the WALs, all the transactions that started before yours, etc. It’s marked for eventual removal, and that’s it. Just as the definition of delete I provided says.

^ (more likely they’ll just update the table to set a deleted flag)
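
You can see this for yourself even with SQLite (a sketch using Python's bundled sqlite3; the exact on-disk behaviour depends on journal mode and page layout, so treat it as illustrative):

    import os, sqlite3

    PATH = "demo.db"
    if os.path.exists(PATH):
        os.remove(PATH)

    con = sqlite3.connect(PATH)
    con.execute("CREATE TABLE t (body TEXT)")
    con.execute("INSERT INTO t VALUES ('SUPER-SECRET-PAYLOAD')")
    con.commit()
    con.execute("DELETE FROM t")  # gone from query results...
    con.commit()
    con.close()

    raw = open(PATH, "rb").read()
    print(b"SUPER-SECRET-PAYLOAD" in raw)  # ...but often still in the file bytes

    con = sqlite3.connect(PATH)
    con.execute("VACUUM")  # rewrites the file: the "eventual removal"
    con.close()
    print(b"SUPER-SECRET-PAYLOAD" in open(PATH, "rb").read())  # typically False
    # (SQLite's opt-in "PRAGMA secure_delete = ON" scrubs at delete time.)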


> eventual removal

To me, the fact that deletion takes time to complete doesn't negate the idea that the data will be gone once the process completes.

WAL archive and backups are external systems. You could argue that nothing supports deletion because an external backup could exist, but that's not a useful conversation.


Going back to the point of the thread, we agree the deleted data is not erased. The user is unable to access it through normal mechanisms, but the existence of side channels that could reveal it does not negate the idea that it has truly been “deleted”, especially when one looks at the historical context surrounding that word.


What? I don't agree with that.

Can you point to an example of a modern database that "supports deletion" but keeps the data around forever? Maybe I've just used different tools than you. Given modern data-retention concerns, I'd be surprised if such a thing existed.


Who said anything about that? We’re talking about side channels and eventual^TM deletion. Given enough time no information will remain anywhere, sure. But that’s not very relevant.


I think we are trying to define the word "delete". You found an archaic definition and are trying to use it in a modern technical setting. You've claimed that modern databases delete without actually removing data but haven't pointed to which systems you are talking about. I'm familiar with tombstoning, either as a "soft-delete" or as part of an eventual deletion process. But I've never seen that called deletion as that would be very confusing.

Pointing to which database you are talking about should clear this up quickly.

I don't think it's reasonable to talk about backups here. A backup is external to the database, so the database inherently cannot delete it. Similar to how a piece of paper cannot destroy a photograph of the paper, but burning the paper destroys it.


I used the first definition of delete I found which, while arguably “archaic”, matches the modern technical term almost exactly. We’d typically call that a well known word with a clear meaning.

And sure, the DELETE FROM statement in Postgres - or any other standards-compliant SQL DB I know.


In technical writing you often don't want to use the dictionary for definitions, similar to how words in a contract can have unexpected meaning in a legal setting.

For Postgres you've got to consider vacuum. Auto vacuum is enabled by default. Deleted rows are removed unless you go out of your way to make it do something different.
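
If you want to watch that happen, here's a sketch assuming a local Postgres and the psycopg2 driver (the DSN is a placeholder):

    import psycopg2

    # Placeholder DSN; point it at any database you can experiment on.
    conn = psycopg2.connect("dbname=test user=postgres")
    cur = conn.cursor()

    # n_dead_tup counts deleted-but-not-yet-vacuumed row versions.
    cur.execute("""
        SELECT relname, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
    """)
    for relname, n_dead, last_av in cur.fetchall():
        print(relname, n_dead, last_av)  # dead rows linger until (auto)vacuum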


Imagine the data that was deleted is of the highest level of illegality you can imagine. Under no circumstance can your service be associated with that content.

- What was your "definition of delete" again?

- You mentioned some of the convenient technical defaults your frameworks and tools provide out of the box; can you think of ways to improve the situation?

(You might re-run delete requests after restoring a backup; transactions should resolve in a timely fashion, failed deletes can be communicated to the user quickly, etc.)


We are missing the point here. The GP was claiming that delete meant something other than adding a mark to an item that you want to eventually be removed from the system. It doesn’t.


I understand that you're describing the status quo in many systems today.

However, besides the technical aspect, you talked about the "absolute best you could expect when asking for a delete in the UI^".

I think this is where I, other posters in the thread, most people, and probably the GDPR and other legislation would disagree. We expect significantly more effort to clean up deleted data.

This includes, for example, the ability to delete datasets from backups, as well as a general accountability of how often and where all the data is stored and if, and when a deletion process is complete.


> GDPR and other legislation

Nope. GDPR allows deleted data to be retained in backups so long as there is an expiration process in place. Doesn’t matter how long it is. But certainly nobody has a right to force a company to pull all of their backups from cold storage and comb through them all any time a deletion request takes place. That’d be the quickest path to Distributed Denial of Bank Account Funds imaginable. Even the GDPR isn’t that bone-headed.

But yes, it is part of the law that the provider should tell you that your data isn’t actually being erased and instead it will be kept around until they get around to erasing everything as part of their standard timelines. But that knowledge doesn’t do anyone much good.

> CNIL confirmed that you’ll have one month to answer to a removal request, and that you don’t need to delete a backup set in order to remove an individual from it.

https://blog.quantum.com/2018/01/26/backup-administrators-th...


But GitHub is keeping this stuff indefinitely. No long expiration, no probability of eventual disk overwriting, nothing. All they're doing is shutting the front door without shutting the side door.


Interesting point about the GDPR; I will soften my point to mean that lawmakers have started (late) to regulate data retention / deletion and the rights of users in general and that might be a trend for the future.

However, I would like to avoid the impression that describing the technical status quo settles the topic. To do so I would go back to my previous point: Imagine some truly illegal pictures are in that cold storage backup, and one day you might have to restore that data. (Since apparently the user's wish to delete data is not quite as respected as certain other hard legal requirements regarding content.)

What solutions to mitigate the situation could a company, or backup tool/web framework etc. reasonably come up with? Maybe check the restored data against a list of hashes/IDs of to-be-deleted-data?
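
A toy sketch of that hash-ledger idea (the names are hypothetical; a real system would need to ledger every shape of identifier it stores):

    import hashlib

    def fingerprint(record: bytes) -> str:
        return hashlib.sha256(record).hexdigest()

    # Fingerprints of everything users asked us to delete. This ledger must be
    # persisted outside the backup set, or a restore could resurrect the data.
    deletion_ledger = {
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    }

    def replay_deletions(restored_records):
        """Drop restored records that were deleted after the backup was taken."""
        for rec in restored_records:
            if fingerprint(rec) in deletion_ledger:
                continue  # deleted since the backup; don't bring it back
            yield rec

    print(list(replay_deletions([b"test", b"still wanted"])))
    # b"test" hashes to the ledgered value above, so only b"still wanted" survives.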


Every modern file system works like this too. Then there’s copy-on-write snapshotting and SSD wear leveling to worry about. Data isn’t actually destroyed until the space is reused to store something else at an indeterminate point in the future.

Or when its encryption key is overwritten.
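
That approach is often called crypto-shredding; a minimal sketch using the `cryptography` package's Fernet (the in-memory dict stands in for a real, erasable key service):

    from cryptography.fernet import Fernet

    # One key per record (or per user); the dict stands in for a small,
    # genuinely erasable keystore.
    keystore = {}

    def store(record_id: str, plaintext: bytes) -> bytes:
        key = Fernet.generate_key()
        keystore[record_id] = key
        return Fernet(key).encrypt(plaintext)  # ciphertext can live anywhere

    def crypto_shred(record_id: str) -> None:
        # Destroying the key renders every copy of the ciphertext unreadable:
        # backups, snapshots, wear-levelled flash, all of it.
        del keystore[record_id]

    blob = store("user-42", b"name, address, phone")
    crypto_shred("user-42")
    # blob still exists, but nothing can decrypt it any more.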

But it probably is a good idea to stop returning deleted data from web APIs.


This is why, when I'm building a confirm UI, I prefer the term "destroy?" on the confirm action. It's much clearer to the user that this is a destructive and irreversible action and that we will be removing this data/state.

*obviously doesn't apply to soft deletes.


No, deleted is a word for deleted. But we started saying things were "deleted", while our eyes flicked to the stack of backup tapes in the corner, acknowledging the white lie, because really deleting things conflicted with other priorities and was hard. And we left it there, until privacy regulations came along and it turned out not using the normal definition of deleted could get you sued. So IMO Github is wide open to paying damages to the first person able to demonstrate them.


It's tolerated for there to be temporary inaccessible copies sticking around when something is deleted.

What GitHub is doing here is neither temporary nor inaccessible.


"Bought" is just a fancy word for "granted a license for usage, subject to terms and conditions, which may be revoked at any time, for any reason, without any warning"


Yep, I see GitHub as "public only" hosting, and if I want to host something private, I will choose another vendor.


The noted issue looks to be applicable to forks only, not to all private repos.


It also applies to this situation:

    1. Create a private repo R
    2. Create a private fork F of R
    3. Push commits to the fork F
    4. Make R public
The commits pushed to F prior to R being made public will become de facto public, even though F has always been a private fork. The post makes clear that commits pushed to F after R is made public are placed into a separate, private fork network.

So basically, if you ever intend to open source anything, never do it to an existing private repo. Always start a from-scratch repo to be the root of your new public project.
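
For what it's worth, you can test whether a given commit is exposed this way through the public commits API (a sketch using Python's requests package; owner, repo, and SHA are placeholders):

    import requests

    # Placeholders: upstream repo R and a SHA that was only ever pushed to F.
    OWNER, REPO = "example-org", "R"
    SHA = "0000000000000000000000000000000000000000"

    resp = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{SHA}",
        headers={"Accept": "application/vnd.github+json"},
    )
    # Per the article, a 200 here means the commit is served through R's fork
    # network even though it only ever lived in the "private" fork F.
    print(resp.status_code)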


I find the attitude worrying. I understand that it's maybe not easy to fix, or even fixable without breaking some use cases.

However, if they "don't care" about such an issue, how can I trust them to care about other stuff?


GitHub's attitude toward, and perception of, the terms “privacy” and “security” - that is what's more important.


For the benefit of anybody thinking "with gitlab I'm safe from this": If you're saying (and perhaps you're not) that some other git hosting service

- gives you control over gc-ing their hosted remote?

- does not to your knowledge have a third-party public reflog or an events API or brute-forceable short hashes?

If so, especially the second of those seems a fragile assumption, because this is "just" the way git works (I'm not saying the consequences aren't easy to mentally gloss over). Even if GitLab lacks those things currently (but I think, for example, it does support short hashes), it's easy to imagine them showing up somehow retroactively.

If you're just agreeing with the grandparent post that github's naming ("private") is misleading or that the fork feature encourages this mistake: agreed.

Curious to know if any git hosting service does support gc-ing under user control.


> if I want to host something private, I will choose another vendor.

Or you know, self-host, preferably on-prem.

Basic git hosting only needs a sshd running on the server. If you want collaborative features with a web UI then there are solutions for that available too.


Or commit an ecryptfs.

Clone and mount; unmount and commit.


Extremely annoying, but it's the only truly private option on somebody else's computer.

I read headlines like the above with the implied "not just to the employees there anymore".


Which vendors work best for private projects?


Gitea works well. Use that on your own network.


You could consider GitLab, though this only seems to affect private forks of public repos.


I've been happy with Jetbrains Space (now Space Code); but I'm using it for private, professional work and paying for it, so perhaps that isn't what you mean.


JetBrains Space, Atlassian Bitbucket, GitLab (also On-Premises), Gitea

Order does not indicate any preference.


I've used both Bitbucket and Azure in the corporate world.


Sourcehut :)


> I'll be calling "private" repos "unlisted"

That might be a bit too strict. I'd still expect my private repos (no forks involved) to be private, unless we discover another footnote in GH's docs in a few years ¯\_(ツ)_/¯

But I'll forget about using forks except for publicly contributing to public repos.

> Users should never be expected to know these gotchas for a feature called "private".

Yes, the principle of least astonishment[0] should apply to security as well.

[0] https://en.wikipedia.org/wiki/Principle_of_least_astonishmen...


Specifically about the feature called "private", the only gotcha seems to be that when the upstream transitions from private to public, it may unexpectedly take more data public than desired, right? The other discussed gotchas were all about deleting public data not actually being deleted or inaccessible.


I see your point. On the other hand, the standard procedure for that in the GitHub UI is to create a repo and then select another as a template.

That doesn't fork, but it does what you would expect: a fully private repo.


> It's disappointing to see GitHub calling it a feature instead of a bug

Git is "distributed" version control software, after all. It means a peer can't control everything.


Anyone at your company can just push to a public git repository at any time. Nothing is stopping them except the threat of consequences.


So? Employees with access to sensitive data are capable of leaking that data. News at eleven!

And anyone in the world can pull what was pushed to a public git repo before you delete it. You should always assume that has happened.


This is about access to private repos, not public ones:

"Anyone can access deleted and private repository data on GitHub"


You might have noticed that my comment is a reply to another comment.


Disagree. If you're using a service, understand how it works.

Not everything needs to be designed for idiots and lazy people. It's OK for some tools and services, especially those aimed at technical people and engineers, to require reading to use properly and to be surprising or unintuitive at first glance.


There's got to be a word for these kinds of ridiculous arguments which use personal responsibility as a cudgel against a systematic fix.

I agree generally that interfaces have been dumbing down too far, but "private is actually not private and it's on you for not knowing that, idiot B)" is a weird place to be planting that flag.


There should probably also be a word for the belief that, when a system doesn't work how you want it to, it is so obviously a systematic problem that needs fixing, rather than, for example, evidence of differing goals or priorities, that it is reasonable to describe anyone who thinks otherwise as ridiculous.


That means having an opinion

