GitHub Private Repos Considered Private-Ish (tylercipriani.com)
163 points by fagnerbrack on June 4, 2023 | 145 comments



Don’t forget OpenAI and Microsoft using your GitHub data for training GPT. Their privacy statement says your content will not be read by “human eyes.”

https://docs.github.com/en/get-started/privacy-on-github/abo...


That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.


> formerly reputable corporations

I'm sorry but when was Microsoft ever reputable? They have a long history (and reputation) of being merciless in every single way they can, and have for as long as I can remember.


A cooperation is never reputable. It is easier to reason with a dog than with a cooperation.

However, methinks this relates to the times before GitHub became an offering by Microsoft. But the deal was just too good to pass up: getting this massive army of minion coders who all pray to the octocat and now do the Ballmer dance.

Oh so much fun; now, it turns out, they all feed the new AI overlords.


That misspelling of "corporation" severely breaks the meaning you almost certainly intended.


oh snap, too late to edit ... you are absolutely right.


Developers developers developers developers


GitHub wasn't always owned by Microsoft.

Please note I am not attempting to address the reputability of GitHub pre-acquisition. That is a separate matter.


I think 90s Microsoft could have something of a claim.

It made a lot of sharp business choices in that decade, but it also left a LOT of money on the table for developers, as part of a strategic goal to grow the platform.

Then the 00s came, platform growth slowed (because they were already running on everything desktop), and the "vs linux" decisions started coming.


Eh, I wouldn't say they handled the whole DR-DOS saga very reputably. And in 2001 they came within an inch of their life in US v. Microsoft, litigation based mainly on anticompetitive practices from the 90s.

Microsoft was disreputable in the 90s to the point they were almost broken up several times.


Microsoft's battle against Netscape (which is also part of the show Valley of the Boom), and the Halloween documents, are two more examples.


It bleeds over into ethics at some point.

Microsoft in the DOS days was still fighting hard for market. Microsoft in the Netscape days? Eh... less of a competitive claim. Post ~2005? No claim.

For me, the definition of "reputable" changes when you're a competitor among equals vs when you're a monopoly.


They were a monopoly since the '90s.


Okay, maybe 90s MS could've, but I have a feeling that's 30 years ago.


>I'm sorry but when was Microsoft ever reputable?

They went on an open source charm offensive a few years ago. "Oh, we've turned over a new leaf" etc.

A lot of people believed that they'd had a legitimate change of heart because of the change in strategy.

More realistically, Linux had driven them into near irrelevance in the server market and just pushed them from "extinguish" or "extend" to "embrace".

Their dubious anti-Linux tactics via leaning on OEMs in the desktop market remained more or less unchanged.


> More realistically, Linux had driven them into near irrelevance in the server market and just pushed them from "extinguish" or "extend" to "embrace".

and with VS Code, Copilot, and WSL2, they're doing a terrifyingly good job :-/

I really hope people don't let their guard down.


GP said: That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.

you said: I'm sorry but when was Microsoft ever reputable?

nobody said Microsoft had ever been reputable.

GitHub is the formerly reputable corporation here.

GP comment doesn't even make sense without that.


Define reputable.

They're still probably the most indirectly trusted company on earth.

Why? Because almost all enterprises run some non-trivial amount of MS code.

Either Windows, Azure / Azure AD (or AD at all), Teams/Outlook, VS Code, or anything else.

And please, let's not start arguing that some startup of 30 people uses Macs only.


Anybody who was a programmer through the 1980s knows the anti-competitive practices MS used to destroy many up-and-comers. They were essentially a dam on technological progress, coercing the world into using a non-protected, single-tasking OS when we could easily have been using pre-emptive multitasking OSes on the hardware of the day.


Many SaaS companies probably double dip by monetizing customer data in various ways. Good luck even knowing if it’s happening and if you do figure it out I’m sure the EULA will be properly one sided.

Can you imagine how much intel Google Docs, GMail, Salesforce, Profitwell, etc have about company performance and plans?

I’m sure nobody is using any of that data to insider trade, to give just one example. Nobody would do that.


Microsoft providing software and services to organizations running concentration camps in Texas is why I stopped using GitHub.


Care to elaborate?


What alternatives do you recommend, and why?


A private instance of gitlab on a private server works.


IP theft?

Code by itself is not IP.

Your product or a specific algorithm is.

And in my opinion, patents on algorithms should be illegal.

There is no inherent problem hosting code on GitHub.

You are not doing a good job if you move companies away from working setups because of this.

And they can't have had high security requirements anyway, because anyone who does normally hosts GitHub Enterprise or GitLab themselves


I'm always surprised that there are always people on HN who like software patents.

But sure, have your opinion, but at least try to bring your point across


I don't think anyone on HN likes software patents but god damn do they love copyright of code.


I don't think it's right to conclude "your private repo data will be used to train GPT" based on the text you linked to there.


Yes, not conclusive, but the cynic in me is usually right, esp. when there are large sums of money and power on the line. OpenAI was originally supposed to be open, but power corrupts.


I never read that before, but you are right; based on that wording, it looks like private repos are open to having LLMs trained upon them.

I assumed up to this point they could only use public ones but this wording suggests otherwise.


This comment is complete FUD: you're linking to a tertiary paragraph in a document describing how your data may be used if the owner of a private repository enables it, for uses such as Dependabot, CodeQL, or secret scanning.

The top of the document is:

> GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.

> If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."

Taking a single paragraph out of context isn’t worthy of the top-voted comment.


I call BS. OpenAI was originally meant to be an open-source and non-profit company. It is now a closed-source, for-profit company, controlled by the company that gave FUD its original meaning by discrediting those who stood against it. It is supreme naivety to think they will not use whatever they can to gain power in the AI arena. There is nothing in that statement that a high-powered lawyer cannot twist and bend to their liking. I can easily see the transformer matrix weights being defined as "aggregate data" -- "they are just floating point numbers; they are not anyone's source code."


OpenAI != GitHub even if GitHub has allowed training on public repositories (which I believe to be an absolute mistake because it should be preserving licensing on the processed repositories and we know that it is not doing so).

There are many reasons to distrust Microsoft. The wording of this particular paragraph explaining how data gets used in accordance with the linked terms of service (which are the actual governing documents, not the page you’ve linked to) is not one.


GitHub = MS, OpenAI = MS, so by transitivity OpenAI = GitHub (where = means "controlling or related interest").

I agree that the particular wording is not sufficient to specify much of anything, but it sure doesn't shut the door on the possibility either.


OpenAI is not a subsidiary of Microsoft. GitHub is.

Unless you can meaningfully show that Microsoft is actively applying a subsidiary relationship (that is, where it directs OpenAI’s product direction), I have to disagree with your base notion.

At this point, I reiterate that your original claim is 100% FUD and disinformation.

I’m not asking you to trust GitHub or Microsoft, but legal terms have meaning and the terms do not support your assertions.



iirc, they said they wouldn't do this? so this could just refer to secret scanning


Clearly they lied, if their policy says otherwise.


Their policy, if you scroll up from this link, is to scan only “aggregate metadata” and only if you opt in.

> GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.

> If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."


> The information we learn only comes from aggregated data

It seems pretty clear to me that this means they're allowed to use private repos to train copilot, etc.

I wonder if any researchers have tried putting fingerprinted source code into a private repo, and then (after retraining) getting Copilot to suggest stuff that could only have come from the injected, supposedly-private source code.

That would make a nice paper. I hope someone does it.
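A minimal sketch of such an experiment, with every name and the planted constant being hypothetical: generate a high-entropy canary string, commit it to a private repo, and after a retraining cycle prompt the model with the surrounding context to see whether it completes the token.

```python
import secrets

def make_canary(prefix="hn_canary"):
    """Generate a unique, high-entropy marker that could only be
    reproduced by a model that saw this exact source file."""
    return f"{prefix}_{secrets.token_hex(16)}"

# Embed the canary in a plausible-looking constant in the private repo:
canary = make_canary()
planted_line = f'API_FALLBACK_SEED = "{canary}"  # do not remove'
print(planted_line)

# Later, prompt the model with 'API_FALLBACK_SEED = "' and check whether
# it completes the exact hex token; the entropy makes a coincidental
# match effectively impossible, so a hit would mean the private file
# was in the training set.
```

The high entropy is the point of the design: unlike ordinary code, a 128-bit random token cannot be "independently rediscovered" by the model, so verbatim reproduction is strong evidence of training on the private data.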


I genuinely don't see how "The information we learn only comes from aggregated data" relates to training LLMs, which need raw data, not aggregated data, as their input.

Maybe we have different definitions of the term "aggregated"?

This suggests to me that GitHub need to extend that text to explain what they mean by "aggregated".


The LLM itself is a form of aggregated data.


Sure, but the raw training data isn't.

I think GitHub need to clarify this themselves.


But then the training can surreptitiously be called data aggregation. We don't look at the raw data, we aggregate it and then query the aggregation.

Definitely needs clarification, though somehow I suspect this is all by design.


Sure. All aggregated data is ultimately derived from raw, unaggregated data. One can make the argument that training an LLM is "just" an unusually complicated form of aggregation.

Whether that would hold up is another question. But yeah, I agree with the conclusion that they need to clarify this.


Training the LLM is a form of “learning”, and putting all the data in an input training set is a form of aggregation.

The clause seems to mean “we can do whatever we want with your data as long as we violate many people’s privacy at scale at the same time”.


Enter lawyers.


What is AI training but “parsing content” for “delivering generalized insights”? They intentionally use slippery language that can defend their practices.


Policy says nothing about ChatGPT or LLM/AI models.


I couldn't find the privacy policy for Azure Repos. Does anyone know if it has the same type of statement regarding 'human eyes' for private repos?


Recommendations missing from the article:

- Enable mandatory 2FA within your GitHub organization (if you don't use an organization, you probably should)

- Disable the ability to fork repos in your organization

- Configure and enable mandatory SAML authentication. In combination with mandatory 2FA, this makes phishing and even key leakage less likely (specific keys need to be double authorized for SAML, so that random private key you have on a random CI/CD platform from a few years ago can't access repos in SAML organizations even if it's leaked)

- Disable the ability to make repos public at the organization level

These are some quick setting changes that should address at least 3 of the 4 main issues in the article.

Additional suggestion:

Enable branch protection on master. Require at least 1 peer review to merge anything into it. Enforce branch restrictions to include repo admins so the restrictions can't be bypassed. This should stop obvious mistakes like accidentally committing .git or random credentials.
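For reference, a hedged sketch of applying that last suggestion programmatically, via GitHub's documented `PUT /repos/{owner}/{repo}/branches/{branch}/protection` REST endpoint. The owner/repo/token values are placeholders, not anything from the article:

```python
import json
import urllib.request

def protection_payload(required_reviews=1):
    """Build the branch-protection settings suggested above:
    at least one approving review, enforced for admins too."""
    return {
        "required_status_checks": None,          # no CI checks required here
        "enforce_admins": True,                  # admins cannot bypass the rules
        "required_pull_request_reviews": {
            "required_approving_review_count": required_reviews,
        },
        "restrictions": None,                    # no push restrictions by user/team
    }

def apply_protection(owner, repo, branch, token):
    """Hypothetical invocation; needs a token with repo admin scope."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}/protection",
        data=json.dumps(protection_payload()).encode(),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    return urllib.request.urlopen(req)

# e.g. apply_protection("acme-org", "widget", "master", token)
```

Scripting the settings this way also makes it easy to audit every repo in an organization for drift, instead of clicking through each repo's settings page.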

Edit: if the author sees this, feel free to add these things to the article!


Your other points are useful but I find the below questionable and counterproductive:

> - Disable the ability to fork repos in your organization

If someone can read it, they can trivially fork it (clone locally, then republish as new repo). The only thing you're preventing with this advice is the free discoverability and tracking of forks which you get with forks created with the GitHub "fork" button. The forks are still there but now you have a harder time finding them.


But now it becomes a harder mistake to make. Creating a new repository is easy, but not so easy that you would do it by accident.

Importantly, when this is discovered you can fire the person who did it, and they can't say it was an accident, while pressing the fork button to the wrong place is potentially an accident.


You can't accidentally share a private fork any more easily than you could a new repo. If you try to make a private fork public on GitHub, there is just a message that "For security reasons, you cannot change the visibility of a fork."

As far as I know, the article's example of 'a developer forks a private repo and makes it public' is not possible on GitHub.


I’d say it becomes much easier: if you fork a private repository, you need to go through multiple layers of warning to move it to public.

If you want a collaboration repository with forking disabled, public is the default.


The idea is to protect against accidental sharing. A rogue employee can simply take a dump and share it on 4chan no matter what tools you use.


Forks of private repos are similarly private by default, with some hoops and warnings before you can extend that. I still don't see how that recommendation brings any benefit in this regard.


As long as they don't take the dump on company time the developer should retain any intellectual property rights regarding photos they take of it.


What does "Disable the ability to fork repos in your organization" do? Isn't that entirely defeated by users changing the origin?


Yes, it's just lip service. The moment you've cloned a repo to your computer, you have "forked" it. It's simply the nature of decentralized version control.


It prevents it from happening in the GitHub UI.

Of course it’s still possible to download the code and upload it to a separate repo (but then it’s not a fork).


Of course, bypassing the UI means that normal development flows (like force rebasing, or futzing with the CI setup) now involve the same set of commands as malicious exfiltration of source code.

It also breaks GitHub's normal protection against accidentally creating public forks of private repos (and the feature where it auto-deletes your private forks when you change jobs).

There are probably other ways in which disabling private forks undermines your organization's security, but those are pretty obvious.


(Genuine question) what is a use case where an employee needs to fork your company’s repo rather than open a PR?

The only one I can think of is what you mentioned (testing CI workflows that match a branch name) but at least at our company that can usually be done in a PR.

I don’t see any obvious reason why an employee would ever need to fork a company’s private repo to their personal GitHub account.


> I don’t see any obvious reason why an employee would ever need to fork a company’s private repo to their personal GitHub account.

I don't see it either, but the GitHub docs seem to suggest that forking a private repo inside the same org is allowed/forbidden by the same setting?


Every single clone is a fork. I feel like you're missing the fundamental concept of git as a DVCS.


GitHub support disagrees: https://github.com/orgs/community/discussions/35849

Clone = copy of code that you can sync with the remote on GitHub

Fork = copy of code that isn’t connected to the original remote. Of course, every fork involves a clone. But not every clone is a fork.

I could be wrong, but “fork” isn’t really a concept native to Git.

https://en.m.wikipedia.org/wiki/Fork_(software_development)


Forks in git are connected by the fact that they contain identical SHAs.


They clearly mean fork in the GitHub UI, not fork as in running git clone locally.


For large projects this can happen. Think of the Linux forks that AMD etc. maintain, regularly sending a series of PRs to integrate them into some staging branch, which then makes its way into the kernel if approved.

It's a typical workflow for inter-org codebases.


Those are public repositories.

I do not see the benefit of forking private repositories and have disabled it in the company org that I manage.


You can also just use GitHub Enterprise. You're on GitHub.com but in a completely separate tenant, and any collaboration outside of your tenant is impossible: you're recognised, but it says you're not allowed.


> Enable branch protection on master. Require at least 1 peer review to merge anything in it. Enforce branch restrictions to include repo admins so restrictions can't be bypassed. This should stop obvious mistakes like accidentally committing .git or random credentials.

Every team I've seen do this had their productivity drop by over half when it was implemented. YMMV, but my normal heuristic is to check whether I'm making 2x market value (due to stock vesting or whatever). If not, I dust off my LinkedIn profile and start catching up with old colleagues once this gets turned on.

The mindset that leads to enabling that policy implies a few bad things:

- The team has probably seen growing pains, but did not switch to feature branches for each team, which means it needs to have a release manager, but doesn't know what those people do.

- The product has inadequate testing, and there is no QA organization, and the mismanagement is creating revenue headwinds.

- Management doesn't trust prior hiring decisions, and has decided to treat the dev team like children (treating them like a cost center comes next).

- The company could be chasing revenue via regulatory compliance. There is a backwater of companies that adopt some viral compliance standard that mandates all suppliers also comply with the boneheaded compliance rules. This last reason is the least ominous of the possible reasons, and usually comes with checkbox style implementation of master protection (e.g., admins can override, and over half the team members are admins.)


Edit: Longer response below, but if requiring peer reviews triggers you to start looking for a new job, what does your dev workflow look like? Do you do peer reviews at all or only sometimes?

Admittedly, branch protection requiring peer review into master was something we started for SOC 2 compliance.

But, it’s actually great if implemented well.

Some suggestions:

- Limit the use of “master” branch to code currently deployed to production (don’t use master to stage code that hasn’t been deployed yet). This branch should have the strictest restrictions (no admin override) because merging anything into this branch only happens immediately before a deploy. This also allows your team to assume that any pushes to master should/could trigger an automated deploy workflow

- Have a release branch that all PRs are merged/squashed into. Use this branch as the base branch for all PRs. At our company this is simply the “staging” branch. This branch can have looser restrictions (allow overriding restrictions). The release branch is anything that’s going to be deployed in the next release. Merges into the release branch can trigger automatic deploys to staging for the QA team to do any final functional testing before code gets to production

- Or for a smaller company, don’t have any restrictions on the release branch. You’ll still be in compliance with all frameworks because, at a minimum, no code gets to production without a peer review (because the entire release branch has to be peer approved before it can be merged to master)

- Still allow hotfix PRs into master if you need to deploy something urgently without deploying the release branch. But hotfixes ideally should be rare (if they happen all the time it means you’re deploying a lot of buggy code to production and probably need to do better testing/QA)

There’s always a balance between security/quality control and productivity.

At the absolute minimum, you really should enable branch restrictions even if the only restriction is requiring that all merges come from a PR. This will block developers from accidentally force pushing their local master and potentially overwriting the master branch completely (this happened at our company prior to enabling restrictions, and it required another developer force pushing their more up-to-date local master to resurrect the correct state)

The goal should be to block actions that are obviously bad in all cases (e.g. pushing a commit directly to master without a PR).

Whatever your branch restrictions are, they should align with whatever your company’s internal code review processes are. Then, the branch restrictions are simply acting as a fall back in case developers make a mistake (e.g. avoids merging/pushing to master by accident)

For a small startup, all this advice is irrelevant because you probably care a lot more about how fast you can pump out changes and care much less about bugs getting into prod. And that’s fine.

These suggestions are mostly relevant for mission critical code bases where the restrictions align with your QA process. The branch restrictions should be enabled after documenting a QA process. Branch restrictions should not be arbitrarily enabled if the reasons for the restrictions don’t align with your existing internal processes/workflows.


> requiring peer reviews

I didn’t say anything about peer reviews.

Regarding the rest of your comment:

- if you are committing directly to a branch that then immediately goes to production, you are doing something horribly wrong

- having a formal process for code reviews is like making sure there is a formal process to make sure employees wear pants to work. If this sort of thing has to be policed by security settings in the dev environment, then something deeper is wrong.

I mostly work on safety critical code, so I know what it means to ship a mission critical code base.

Most of the things you allude to suggest that you’ve never worked with a competent release manager or qa organization.

Edit: also:

> There’s always a balance between security/quality control and productivity

This is only true for definitions of “productivity” that exclude security/quality control, which usually mean that things are so out of whack, it is time to find a new job. By definition, the “productive” people aren’t worrying about product quality, and are being promoted for it, so soon the organization will be run by people that sabotaged the business.


> Most of the things you allude to suggest that you’ve never worked with a competent release manager or qa organization.

You’re right, I oversee an 8 person product team for a 20 person company. Not a huge organization. QA is a shared function and we don’t have a “release manager” on staff.

I’m genuinely confused about what you’re trying to say. Sounds like you’ve had a bad experience with branch restrictions. It would be interesting to hear what the workflow was and how branch restrictions, etc, got in your way.


It’s right there in the name: GitHub private repos are “private,” not “secure” or “secret”. That name was chosen purposefully.

When we moved our code from privately hosted SVN to GitHub 10+ years ago, folks at GitHub were quite clear that we should not trust private repos with secrets like private keys or passwords.

So why have private repos at all? It is to allow control of collaborators. GitHub originally allowed everyone to see and fork every repo—a true open source approach. Private repos were added basically so corporate codebase managers would not have to spend tons of time rejecting random PRs and responding to random people.

For a LOT of private software projects, everything secret can be easily stored in the database, environments, or a dedicated tool for managing secrets. GitHub private repos are great for those. If my entire codebase is a highly sensitive trade secret, I would not use GitHub private repos, personally.
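As a trivial sketch of the "secrets live in the environment, not the repo" idea (the variable names here are made up): the repository holds only the lookup, and the deployment platform injects the value at runtime.

```python
import os

def get_secret(name, default=None):
    """Read a secret injected by the host/orchestrator at deploy time.
    Nothing sensitive ever lands in the repository itself."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# Hypothetical usage -- DATABASE_URL is set by the platform, never committed:
# db_url = get_secret("DATABASE_URL")
```

Failing loudly on a missing secret is deliberate: a misconfigured deploy should crash at startup rather than silently fall back to an empty credential.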


When I have an expectation of privacy, say in a fenced backyard, I don't expect to be seen. It doesn't matter if it's a human or a robot commanded by a human that does the looking, it's still a violation of privacy.

It's not really "private" if the backyard has a one way glass pane on one wall. Github should not be calling repos private if they're not.


This is one of many instances where developers should not make shallow assumptions without reading the documentation. GitHub is very clear about how private repos should be used.

Edit to add: would you store your bank password on a piece of paper you always leave sitting in your backyard? Think carefully about your metaphor.


> would you store your bank password on a piece of paper you always leave sitting in your backyard?

Not that I'm advocating for checking passwords into git repos, GitHub or otherwise, but your metaphor depends on how rich you are and how bad your neighborhood is, I would think.

My mom has definitely done exactly that — her medical conditions make it very hard to use bank apps without full daylight and also hard to bring the notebook home after every use — and I don't think that her password management choices, however frowned upon they may be here, have ever been a serious problem in practice.


Did you read the article? They’re not saying that GitHub is reading your repos, they’re saying that mistakes happen and you shouldn’t assume that private repos are secure.

Go out in your fenced yard and assume you won’t be seen, but if you forget to close the gate someone might come back there anyway


I agree with the original comment -- github has secrets storage. There are also other secrets storage services that can be used out there. It's never a good idea to check in secrets in your code. Maybe the word "private" means different things to different users, but the fact that GitHub has a whole other set of features dedicated to "secrets" should be a giveaway to users that there is a distinction between "private" and "secret".


I don’t think it was “chosen very carefully”- the other word is “public”, and it refers mainly to who has access to the repo. Public and private are the ideal words imo


I dunno about other people, but I've always assumed that if we're giving employees, whom we may have to fire someday, access to the code base, then repos should be treated as if they could be exposed to the world at any time, in any way.


I've had to explain to leadership that certain employees don't even need access to the actual codebase to potentially cause harm via a competitor.

The real solution to this is to have a business model that is about more than just some source code in a repository, or "things one person could undo". Concerns like years-long relationships with customers (aka trust), platform-style lock-in patterns, etc. How many times have various parts of AAA game studio or Microsoft codebases been compromised? I am still waiting for that hacker edition of HL3 and my free copy of windows that doesn't suck.

I actually spent a few minutes walking through a hypothetical where 100% of our latest code is lifted and taken to our biggest competitor. I think it would probably cause them more harm than good. Complexity is a hell of a thing. Just having a point-in-time snapshot doesn't really give you an advantage over someone who has been at it continuously for years. Perhaps we can strike a consulting contract with whoever steals our code...

I am not against reasonable measures (i.e. MFA, GH Enterprise, VPNs), but I won't go into paranoid-tier (Citrix-style desktops) over stuff like this ever again. Code is a cheap commodity in 2023. Enshrining your repository as if it is the actual vehicle of business value is indicative of poor leadership. Customers, relationships, execution, etc. are way more challenging and important.


This is a great baseline, but it is often easier said than done.

For example, those repositories contain a lot of personally identifiable information; it is not that easy to get such a baseline ready for _"should be treated as if they could be exposed to the world at any time, in any way."_

Depending on jurisdiction, this can affect sensitive information that requires much stronger controls when you (rightfully!) expect the repository to become public despite it being a private one.


> For example, those repositories contain a lot of personally identifiable information

This seems to miss my point - I have no idea why PII is in a code repo.


That of the workforce, which is a pretty typical situation: timestamps of activity, etc.


This post is mostly FUD. Case in point, the following advice:

> So, if you’re worried about it: stop putting sensitive data into private repositories.

Most of the issues mentioned in the post (misconfiguration, phishing, mistakes, zero-days) apply to all software, including non-cloud software. So the above advice is equivalent to "stop putting sensitive data into computers". It's run-of-the-mill popular-security nonsense that conveniently ignores that the alternatives come with their own risks, including security risks, and doesn't even attempt to perform a cost-benefit analysis.

Less of this stuff, please.


I like to think of myself as a pragmatic practitioner of security, and even I would say this isn't "popular-security nonsense". The "cost-benefit analysis" here equals: don't bloody do that, _obviously_.

If your system relies on secrets being present in repos, it is a poorly designed system. There is no scenario where this is necessary, other than one not wanting to put in the effort to inject secrets sensibly.

Yes, "the alternatives come with their own risks", but those risks are where the "cost-benefit analysis" equals "not worth the effort".


> If your system relies on secrets being present in repos, it is a poorly designed system.

Repos are databases that track changes in files, nothing more and nothing less. There are millions of repos on GitHub that don't contain application code.

Some people put their entire home directory in a Git repository. Home directories almost always contain secrets. That doesn't mean it's a bad idea, it just means the repo isn't meant for the eyes of others. In other words, it's a private repository.


People putting their entire home directory in a repo, on somebody else's shared computer and network, is going to end in disaster. This sounds like a bad idea.


> Most of the issues mentioned in the post

...are directly attributed to the cited GitHub incidents. I only scanned the article once but I saw no FUD in it.


Avoid having secrets if you can. Use authentication methods that don't require them. When they are required, rotate them often - ideally automatically. Keep them out of password managers. They are like condoms: cheap enough to use once, unpleasant if used twice. Hence the common UX where you only get to see the key once, which encourages single use.

Give those tokens the minimum permissions possible. Make sure that needing to rotate one secret doesn't mean other secrets are exposed (looking at you, Azure storage account primary access key!). Automated rotation makes keeping them up to date in git untenable anyway.

Some services like Vercel and Firebase encourage bad practices in this regard. Firebase requires you to download a plaintext JSON of kingdom keys in order to do anything useful! Vercel KV storage encourages you to run an SDK command to download prod secrets into a local env file.
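Injecting secrets at runtime instead of committing them is usually only a few lines of application code. A minimal sketch, assuming your CI runner or secret manager populates the process environment (the variable name in the usage comment is illustrative):

```python
import os


def get_secret(name: str) -> str:
    """Read a secret injected at runtime (by the CI runner, orchestrator,
    or secret manager) instead of committing it to the repository."""
    try:
        return os.environ[name]
    except KeyError:
        # Fail fast and loud: a missing secret is a deployment bug,
        # not something to paper over with a hard-coded fallback.
        raise RuntimeError(f"secret {name!r} was not injected into the environment")


# api_token = get_secret("VERCEL_API_TOKEN")  # hypothetical variable name
```

The failure mode is deliberately noisy: a crash at startup beats silently running with a default credential.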


GitHub Enterprise had (has?) an interesting loophole to discover the existence of private repositories.

Attempting to transfer ownership of a repository to another user was aborted if the user had a repository of the same name—even if it was private.

Public GitHub doesn’t seem to have this issue with the transfer request system, though. Maybe it did at some point?


> We cram our secrets into git

Excuse me?!


Yeah, I reacted to that too. It's like nonchalantly saying that you have all your passwords written on post-it notes at your desk.

The topic of discussion shouldn't be how to secure your desk from spying eyes, but about why having post-it notes with passwords is bad practice and just a bad idea overall.

If your private github repo accidentally goes public, the response should be "that's annoying but ultimately harmless", anything else is misguided.


Post-its for passwords are better practice than memorized passwords. If you can memorize it, it is a bad password. Password managers are better still, but you still need the master password.

The problem is not keeping those passwords in a secure location: treat them like a stack of $100 bills.


For your home desk? I can buy that. But for your work desk? No way is that even remotely more acceptable than having a memorizable password.


You don't leave them on the desk. Lock them up with a key. Every office gives you file cabinet that locks.


That's basically an analog password manager; we have gone full circle.

Or, returning the metaphor to GitHub repos: having a separate cabinet is like having a secrets vault, so that secrets are not in plain view in the repo itself, which is exactly what you should be doing.


Not quite full circle as we have now agreed that writing your passwords down on paper is acceptable.


I'll grant you that, it's not the medium that matters. But pen on paper wasn't my objection, it was the post-it note on the bezel of your monitor.


Some folks use tools like https://github.com/mozilla/sops to store most secrets (besides the sops key, of course) in source control. Of course, you aren't committing the cleartext but if the repo gets published you should probably rotate your keys just to be safe...


Even this I would consider to be bad practice. Old versions of secrets are never relevant. Easy way to break your system:

1. Write code v1
2. Add secret
3. Write code v2
4. Rotate secret
5. Oops, some kind of problem, let's go back to known-good and redeploy (2). Broken because it tries the older secret, not the rotated secret.

Just don't store secrets in version control.


That one has an easy fix: store secrets in a separate repo that you never roll back. That's not the reason to avoid storing secrets in git. You might be giving some junior dev here the idea that if they can solve this issue, then storing secrets in git will be ok. Obviously it's not; it's still a bad idea after you've solved this minor annoyance and, indeed, this annoyance had nothing to do with the security reason why you don't store secrets in git.


This assumes that the secrets are deployed along with everything else in the repository. Even if the same repository contains your app, they needn't be deployed together. And as far as old secrets go, they are at most as sensitive as current secrets.


What for? I've seen this happen, and if there hadn't been a review, it would have stayed there unnoticed.

/edit:

Also, what someone considers a secret and what they don't is often not well defined. If management has no clue what this is about, it is often better to only commit and push on a direct and simple work order, because those need to be well understood and you have the paper trail (as that author also suggests blameful retrospectives - IMHO hilarious).

Or have we forgotten the basic rules about sending data over the interwebs to other people's computers?


It’s slightly annoying when someone blunts criticism of doing something dumb by declaring some form of “we all do it.”


If source code isn't public, it's a secret by the traditional definition of the word.


> by the traditional definition of the word.

Sure, but in computer programming “secrets” is also industry jargon for small strings of characters that enable authentication, like passwords or private keys, which have much higher standard of secrecy than the rest of the codebase.


A secret is something that you go out of your way to keep private. But there are a lot of other things that are private by default, but aren't really secrets, like "the brand of toilet paper I use".


This. Every CI platform under the sun has support for secrets and config that should never live in git. It's worth ensuring people know this, of course, but I'm not sure storing secrets in git is all that prevalent. Many platforms also have secret scanning to ensure you don't accidentally do this.


> Many platforms also have secrets scanning to ensure you don't accidentally do this too.

The reason secret scanning even became a thing is how often secrets get committed to git. Some of them have even led to intrusions.

Uber (2016) – Attackers gained unrestricted access to Uber’s private Github repositories, found exposed secrets in the source code, and used them to access millions of records in Amazon S3 buckets.

Scotiabank (2019) – Login credentials and access keys were left exposed in a public GitHub repo.

Amazon (2020) – Credentials including AWS private keys were accidentally posted to a public GitHub repository by an AWS engineer.

Symantec – Looking at hardcoded AWS keys in mobile apps, researchers discovered the keys had a much wider permissions scope, which led to significant data leakage.

GitHub – Over 100K public repositories on GitHub were found to contain access tokens.

I've worked at companies with developers who didn't know that once committed, the secret remains in the history even if a subsequent commit removes it. It's not trivial, and involves rewriting the history[1]. There's also no way to fix clones of the repo, and there are a handful of other ways secrets can still leak.

The most secure way to deal with secrets accidentally committed to git is to rotate the secret.

1 https://docs.github.com/en/authentication/keeping-your-accou...
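The secret scanning mentioned above is, at its core, pattern matching over file contents and history. A toy sketch - the AWS access-key-ID pattern is real, but production scanners (gitleaks, truffleHog, GitHub's own service) use many more rules plus entropy heuristics, applied across every commit:

```python
import re

# Toy secret scanner: AWS access key IDs start with "AKIA" followed by
# 16 uppercase alphanumeric characters. Real scanners apply hundreds of
# such patterns to every blob in the full commit history.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_aws_keys(text: str) -> list[str]:
    """Return all candidate AWS access key IDs found in the given text."""
    return AWS_KEY_RE.findall(text)


# find_aws_keys('aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"')
# -> ['AKIAIOSFODNN7EXAMPLE']
```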


Yes, they happened in the past. Over 3 years back. Times change.

You can use this and credentials rotation, amazing, I know...

https://docs.github.com/en/code-security/secret-scanning/abo...

https://docs.gitlab.com/ee/user/application_security/secret_...

https://circleci.com/blog/detect-hardcoded-secrets-with-gitg...

https://www.gitkraken.com/media/events/azure-spring-clean-20...

edit: I've used config/secrets management for as long as I can remember doing this cloud stuff (several years), so the excuses are very, very poor IMHO.


OP's target audience, judging by all the emojis and the subject matter, seems to be pretty green junior devs who might not know any better. OP is putting themselves in their shoes as a rhetorical device; the author isn't saying that they themselves put secrets into git.


Does the article have anything GitHub-specific? All software can have bugs, so moving from GitHub to anything else is useless. Self-hosting behind a VPN provides some additional protection, but that's true for all cloud services like Jira and CI/CD, and it's not a novel issue.


The article cites several GitHub incidents. If it can be verified that GitHub competitors have a roughly equal number of incidents, then the lack of novelty would be proven. I'm not supporting either view, just addressing your point.


This is true for everything in the cloud that isn’t end-to-end-encrypted. There is the saying that the cloud is just someone else's computer. But in addition, you also have no control over your data on that computer not being leaked elsewhere, or the computer ending up becoming someone different’s computer.


Adding to this, my own pessimistic take is that once $big_company that acquired $useful_service decides it has achieved a sufficient level of corporate capture among paying businesses, it may start adding limits to non-paid repository counts, sizes, pulls, and pushes, and tighten those limits each year until the non-paying members have moved on. Something similar may be slowly happening with Gmail, based on Ask HN submissions.


This indirectly outlines one of the primary reasons organizations (like mine) prefer using GitHub Enterprise - we get the collaboration benefits of GitHub, but our data is entirely controlled and hosted by us. It's extra work and more costly overall, but it's a small price to pay for the data security.


I think you mean GitHub Enterprise Server. GitHub Enterprise is still a cloud / SaaS product.


Enterprise, even cloud hosted, would prevent things like a fork being public.


From my perspective, the upkeep of maintaining such versions is far from trivial. It involves dedicated resources ensuring the regularity and reliability of backups. Furthermore, it's rare to find organizations conducting comprehensive disaster recovery (DR) restorations of their self-hosted GitLab or GitHub instances. This is typically due to an inherent expectation that each upgrade will proceed without hitches and yield flawless operation.

Moreover, the process of securing approval for NAT inbound rules, particularly for integrations with services like Jira, Slack and so on, often turns into a labyrinthine ordeal. It usually involves navigating the differing viewpoints and interpretations across multiple departments, each with its own unique stance, which further exacerbates the complexity of the task.


As an aside, it remains funny to me how much effort organizations put into fighting NAT in 2023 instead of effort to just use ipv6.

Of course then there is the embarrassment of azure ipv6 so it's perhaps somewhat forgivable:)


There are many valid reasons to not use IPv6, it would make an interesting HN article.


My ISP still doesn't even support ipv6. From my limited understanding that itself would already make it a problem to access an ipv6 only site.


I see a lot of deserved distrust of MS, but I thought GitHub was operated as a separate unit under Microsoft? [1] I expected that Copilot was an initiative of that leadership team, and that training the LLM is what's likely reading certain repositories.

On a side note, I'm trying to imagine what "sensitive" code would be read, incorporated into an LLM such as Copilot, and somehow have any meaningful impact on me once incorporated?

[1] https://github.com/about/leadership


A company I worked for a few years ago sold a source code license, and I was tasked with sharing the source, sans history and only at a specific point in time, to the purchaser’s GitHub organization. I was told it had to be done as quickly as possible, and that we might do it a few more times, so figure out a way to make it repeatable.

I wrote a Bash script to copy our ~25 repositories to a new organization using the GitHub REST API. It turns out that the default when creating a new repository was to make it public. Within a few minutes I was getting emails from third-party services that had downloaded and parsed our code, discovering a long-abandoned AWS credential hard-coded in an older codebase. (I think GitHub now does this scanning for you.)

I was lucky that nothing important was actually compromised (we had done a decent job of keeping secrets out of our repos and the one exception was an account that was long gone), but it was an eye-opening experience. If these services had found and downloaded our code in minutes you can assume any repo made public, even for a few moments, has been downloaded and cataloged by a potential bad actor.
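For what it's worth, the foot-gun here is that GitHub's repo-creation endpoint treats `private` as optional and defaults it to false, so scripts should always set it explicitly. A sketch of the request construction - endpoint and field names per GitHub's REST API; the org, repo name, and token are placeholders:

```python
import json
import urllib.request


def build_create_repo_request(org: str, name: str, token: str) -> urllib.request.Request:
    """Build a POST to GitHub's repo-creation endpoint with "private"
    set explicitly -- omitting it yields a *public* repository."""
    payload = {"name": name, "private": True}
    return urllib.request.Request(
        url=f"https://api.github.com/orgs/{org}/repos",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


# req = build_create_repo_request("example-org", "licensed-source", "<token>")
# urllib.request.urlopen(req)  # network call omitted here
```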


Any idea how successful the company you sold the license to was with the code base?

Trying to develop on a code base from a one-time snapshot, without history, seems quite hard.


How about encryption?

https://github.com/AGWA/git-crypt has been solid for me


Assuming you mean encrypting the secret files specifically and not the whole repo. This could work, but why put the secrets in the repository at all? Either everyone has to replace them with their own secrets before being able to run the code, or everyone has to have the key to decrypt the actual credentials (developers, interns, deployments, CI, scanners, etc.), making them not secret.

Why not inject the credentials at runtime, from a system meant for this, with support for auditing, key rotation, etc.?
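As one concrete example of "a system meant for this", HashiCorp Vault serves secrets over an authenticated HTTP API with audit logging. A minimal read sketch - KV v2 URL layout and the X-Vault-Token header per Vault's docs; the address, path, and token are placeholders:

```python
import urllib.request


def build_vault_request(addr: str, path: str, token: str) -> urllib.request.Request:
    """Build an authenticated read against Vault's KV v2 secrets engine.
    Every read is authenticated and appears in Vault's audit log."""
    return urllib.request.Request(
        url=f"{addr}/v1/secret/data/{path}",
        headers={"X-Vault-Token": token},
    )


def extract_secret(body: dict) -> dict:
    """KV v2 responses nest the payload twice, under data.data."""
    return body["data"]["data"]


# req = build_vault_request("https://vault.example.com:8200", "myapp/db", "<token>")
# with urllib.request.urlopen(req) as resp:          # network call omitted here
#     creds = extract_secret(json.load(resp))
```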


Old secrets are around forever, tied to long-lived credentials (PGP keys) whose access can't be revoked (because it's always in the commit history). Additionally, it can be quite painful in an ever-changing collaborative context.


I like this a lot and use it myself. Always have a tough time convincing developers though, because they don't like "all that terminal stuff" :/


Are you sure they're developers?

Nevermind, I guess they can be web developers :p


Didn't want to go there, but I think you get it. When your developers are using JavaScript to write backend programs, you know where the bar is.


Unsurprising. Private repositories are not 'private' [0] and GitHub, Microsoft and OpenAI are thanking you for allowing them to train their AI on your so-called 'private' code. [1]

This is essentially a shameless admission, and they're not even hiding it any more:

> Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.

This is another great reason to self-host instead of using services like GitHub that openly read your private code, which I have been saying for years.

[0] https://news.ycombinator.com/item?id=34258431

[1] https://docs.github.com/en/get-started/privacy-on-github/abo...


As the person on call for exactly that sort of problem, it saddens me that people still push secrets to git at all.

Every gitops platform on the planet supports not pushing secrets to version control. There is no valid reason to have credentials inside a git repo ever. If the repo itself is what's secret, you should have known the risk when you pushed it to someone else's server. There is no excuse when gitlab is free and a docker container away.


I'm not sure why the "Easy Fixes" section doesn't include the advice mentioned in the next point, "stop putting sensitive data into private repositories". That should _also_ go in "Easy Fixes" IMHO: use environment variables, vaults, etc. so your keys aren't stored in your code.


Have you considered that machines are reading your private repos all the time? Forget about LLMs - what about your text editor? And your browser likely has access to your private repo whenever you use it to browse that repo.


You could also leak private repo names via GHCR: https://www.chainguard.dev/unchained/ghcr-private-repos-some...


>We cram our secrets into git, then shove it off to the most expansive code forge in the history of humanity

Do we? Putting secrets in git is against policy almost everywhere.


People like to put API keys in their repos, and it's hard to make them grasp why they shouldn't. I guess some lessons have to be learned the hard way.


I wonder, how do Azure DevOps and AWS CodeCommit compare in this regard?


Getting a bit tired of the anti-GitHub threads here ...


Silly and HN hand-crafted. Too many noobs nowadays around lol



