That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.
I'm sorry, but when was Microsoft ever reputable? They have a long history (and reputation) of being merciless in every single way they can, and have been for as long as I can remember.
A corporation is never reputable. It is easier to reason with a dog than with a corporation.
However, methinks this relates to the times before GitHub became an offering by Microsoft. But the deal was just too good to pass up: getting this massive army of minion coders who all pray to the Octocat and now do the Ballmer dance.
Oh so much fun; now it turns out that they all feed the new AI overlords.
I think 90s Microsoft could have something of a claim.
It made a lot of sharp business choices in that decade, but it also left a LOT of money on the table for developers, as part of a strategic goal to grow the platform.
Then the 00s came, platform growth slowed (because they were already running on everything desktop), and the "vs linux" decisions started coming.
Eh, I wouldn't say they handled the whole DR-DOS saga very reputably. And in 2001 they came within an inch of their life in US v. Microsoft, litigation based mainly on anticompetitive practices from the 90s.
Microsoft was disreputable in the 90s to the point they were almost broken up several times.
> More realistically, Linux had driven them into near irrelevance in the server market and just pushed them from "extinguish" or "extend" to "embrace".
and with vscode, copilot, and wsl2 they're doing a terrifyingly good job :-/
GP said: That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.
you said: I'm sorry but when was Microsoft ever reputable?
nobody said Microsoft had ever been reputable.
GitHub is the formerly reputable corporation here.
Anybody who was a programmer through the 1980s knows the anti-competitive practices that MS used to destroy many up-and-comers. They were essentially a dam on technological progress, coercing the world to use a non-protected single-tasking OS when we could have easily been using pre-emptive multitasking OSes on the hardware of the day.
Many SaaS companies probably double dip by monetizing customer data in various ways. Good luck even knowing if it’s happening and if you do figure it out I’m sure the EULA will be properly one sided.
Can you imagine how much intel Google Docs, GMail, Salesforce, Profitwell, etc have about company performance and plans?
I’m sure nobody is using any of that data to insider trade, to give just one example. Nobody would do that.
Yes, not conclusive, but the cynic in me is usually right, especially when there are large sums of money and power on the line. OpenAI was originally supposed to be open, but power corrupts.
This comment is complete FUD: you're linking to a tertiary paragraph in a document describing how your data may be used if the owner of a private repository enables it, for uses such as Dependabot, CodeQL, or secrets scanning.
The top of the document is:
> GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.
> If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."
Taking a single paragraph out of context isn’t worthy of the top-voted comment.
I call BS. OpenAI was originally meant to be an open-source, non-profit company. It is now a closed-source, for-profit company and is controlled by the company that gave FUD its original meaning for discrediting others who stood against it. It is supreme naivety to think they will not use whatever they can to gain power in the AI arena. There is nothing in that statement that a high-powered lawyer cannot twist and bend to their liking. I can easily see the transformer matrix weights being defined as "aggregate data" -- "they are just floating point numbers -- they are not anyone's source code."
OpenAI != GitHub even if GitHub has allowed training on public repositories (which I believe to be an absolute mistake because it should be preserving licensing on the processed repositories and we know that it is not doing so).
There are many reasons to distrust Microsoft. The wording of this particular paragraph explaining how data gets used in accordance with the linked terms of service (which are the actual governing documents, not the page you’ve linked to) is not one.
OpenAI is not a subsidiary of Microsoft. GitHub is.
Unless you can meaningfully show that Microsoft is actively applying a subsidiary relationship (that is, where it directs OpenAI’s product direction), I have to disagree with your base notion.
At this point, I reiterate that your original claim is 100% FUD and disinformation.
I’m not asking you to trust GitHub or Microsoft, but legal terms have meaning and the terms do not support your assertions.
Their policy, if you scroll up from this link, is to scan only “aggregate metadata” and only if you opt in.
GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.
If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."
> The information we learn only comes from aggregated data
It seems pretty clear to me that this means they're allowed to use private repos to train copilot, etc.
I wonder if any researchers have tried putting fingerprinted source code into a private repo, and then (after it is retrained) getting copilot to suggest stuff that could only have come from the injected supposedly-private source code.
That would make a nice paper. I hope someone does it.
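A sketch of how that experiment might look (purely hypothetical; the module and function names are made up): plant a marker string that cannot occur by chance, commit it only to the private repo, and later check whether the assistant will complete it.

```python
import secrets
import textwrap
from pathlib import Path

# Generate a marker that can't plausibly exist anywhere else.
canary = secrets.token_hex(16)

# Hypothetical module committed ONLY to the private repository under test.
module = textwrap.dedent(f'''\
    def _private_repo_canary() -> str:
        """Internal build fingerprint (canary experiment)."""
        return "{canary}"
''')

Path("canary_module.py").write_text(module)

# Later, prompt the model with the function signature and docstring and
# check whether it reproduces this exact hex string.
print("Planted canary:", canary)
```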
I genuinely don't see how "The information we learn only comes from aggregated data" relates to training LLMs, which need raw data, not aggregated data, as their input.
Maybe we have different definitions of the term "aggregated"?
This suggests to me that GitHub needs to extend that text to explain what they mean by "aggregated".
Sure. All aggregated data is ultimately derived from raw, unaggregated data. One can make the argument that training an LLM is "just" an unusually complicated form of aggregation.
Whether that would hold up is another question. But yeah, I agree with the conclusion that they need to clarify this.
What is AI training but “parsing content” for “delivering generalized insights”? They intentionally use slippery language that can defend their practices.
- Enable mandatory 2fa within your Github organization (if you don't use an organization, you probably should)
- Disable the ability to fork repos in your organization
- Configure and enable mandatory SAML authentication. In combination with mandatory 2fa, this makes phishing and even key leakage less likely (specific keys need to be double authorized for SAML, so that random private key you have on a random CI/CD platform from a few years ago can't access repos in SAML organizations even if it's leaked)
- Disable the ability to make repos public at the organization level
These are some quick setting changes that should address at least 3 of the 4 main issues in the article.
Additional suggestion:
Enable branch protection on master. Require at least 1 peer review to merge anything in it. Enforce branch restrictions to include repo admins so restrictions can't be bypassed. This should stop obvious mistakes like accidentally committing .git or random credentials.
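For anyone who prefers scripting the branch-protection piece rather than clicking through settings, here's a rough sketch against GitHub's REST API (the org/repo names are placeholders and the token needs admin rights on the repo; adjust to taste):

```python
import os
import requests

OWNER, REPO, BRANCH = "my-org", "my-repo", "master"  # placeholders
token = os.environ["GITHUB_TOKEN"]  # token with repo admin rights

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "required_status_checks": None,
        "enforce_admins": True,  # restrictions apply to repo admins too
        "required_pull_request_reviews": {
            "required_approving_review_count": 1,  # at least 1 peer review
        },
        "restrictions": None,
    },
    timeout=30,
)
resp.raise_for_status()
print(f"Branch protection enabled on {BRANCH}")
```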
Edit: if the author sees this, feel free to add these things to the article!
Your other points are useful but I find the below questionable and counterproductive:
> - Disable the ability to fork repos in your organization
If someone can read it, they can trivially fork it (clone locally, then republish as new repo). The only thing you're preventing with this advice is the free discoverability and tracking of forks which you get with forks created with the GitHub "fork" button. The forks are still there but now you have a harder time finding them.
But now it becomes a harder mistake to make. Creating a new repository is easy, but not so easy you would do it by accident.
Importantly, when this is discovered you can fire the person who did it, and they can't say it was an accident, whereas hitting the fork button to the wrong place is potentially an accident.
You can't accidentally share a private fork any easier than you could a new repo. If you try to make a private fork public in GitHub there is just a message that "For security reasons, you cannot change the visibility of a fork."
As far as I know the article's example of 'a developer forks a private repo and makes it public' is not possible in github.
Forks of private repos are similarly private by default, with some hoops and warnings before you can extend it. I still don't see how that recommendation brings any benefits in this regard.
Yes, it's just lip service. The moment you've cloned a repo to your computer, you have "forked" it. It's simply the nature of decentralized version control.
Of course bypassing the UI means that normal development flows (like force rebasing, or futzing with the CI setup) now are the same set of commands as malicious exfiltration of source code.
It also breaks GitHub's normal protection against accidentally creating public forks of private repos (and the feature where it auto-deletes your private forks when you change jobs).
There are probably other ways in which disabling private forks undermines your organization's security, but those are pretty obvious.
(Genuine question) what is a use case where an employee needs to fork your company’s repo rather than open a PR?
The only one I can think of is what you mentioned (testing CI workflows that match a branch name) but at least at our company that can usually be done in a PR.
I don’t see any obvious reason why an employee would ever need to fork a company’s private repo to their personal GitHub account.
For large projects this can happen. Think the Linux forks that AMD etc maintain and then regularly do a series of PRs to integrate them into some staging branch which then makes its way to the kernel if approved.
You can also just use Github Enterprise. You're on Github.com but in a completely separate tenant and any collaboration outside of your tenant is impossible: you're recognised but it says you're not allowed.
> Enable branch protection on master. Require at least 1 peer review to merge anything in it. Enforce branch restrictions to include repo admins so restrictions can't be bypassed. This should stop obvious mistakes like accidentally committing .git or random credentials.
Every team I've seen do this had their productivity drop by over half when it was implemented. YMMV, but my normal heuristic is to see if I'm making 2x market value (due to stock vesting or whatever). If not, I dust off my LinkedIn profile and start catching up with old colleagues once this gets turned on.
The mindset that leads to enabling that policy implies a few bad things:
- The team has probably seen growing pains, but did not switch to feature branches for each team, which means it needs to have a release manager, but doesn't know what those people do.
- The product has inadequate testing, and there is no QA organization, and the mismanagement is creating revenue headwinds.
- Management doesn't trust prior hiring decisions, and has decided to treat the dev team like children (treating them like a cost center comes next).
- The company could be chasing revenue via regulatory compliance. There is a backwater of companies that adopt some viral compliance standard that mandates all suppliers also comply with the boneheaded compliance rules. This last reason is the least ominous of the possible reasons, and usually comes with checkbox style implementation of master protection (e.g., admins can override, and over half the team members are admins.)
Edit: Longer response below, but if requiring peer reviews triggers you to start looking for a new job, what does your dev workflow look like? Do you do peer reviews at all or only sometimes?
Admittedly, branch protection requiring peer review into master was something we started for SOC 2 compliance.
But, it’s actually great if implemented well.
Some suggestions:
- Limit the use of “master” branch to code currently deployed to production (don’t use master to stage code that hasn’t been deployed yet). This branch should have the strictest restrictions (no admin override) because merging anything into this branch only happens immediately before a deploy. This also allows your team to assume that any pushes to master should/could trigger an automated deploy workflow
- Have a release branch where all PRs are merged/squashed into. Use this branch as the base branch for all PRs. At our company this is simply the “staging” branch. This branch can have looser restrictions (allow overriding restrictions). The release branch is anything that’s going to be deployed in the next release. Merged into the release branch can trigger automatic deploys to staging for QA team to do any final functional testing before code gets to production
- Or for a smaller company, don’t have any restrictions on the release branch. You’ll still be in compliance with all frameworks because, at a minimum, no code gets to production without a peer review (because the entire release branch has to be peer approved before it can be merged to master)
- Still allow hotfix PRs into master if you need to deploy something urgently without deploying the release branch. But hotfixes ideally should be rare (if they happen all the time it means you’re deploying a lot of buggy code to production and probably need to do better testing/QA)
There’s always a balance between security/quality control and productivity.
At the absolute minimum, you really should enable branch restrictions, even if the only restriction is requiring all merges come from a PR. This will block developers from accidentally force pushing their local master and potentially overwriting the master branch completely (this has happened at our company prior to enabling restrictions, and it required another developer force pushing their more up-to-date local master to resurrect the correct state).
The goal should be to block actions that are obviously bad in all cases (e.g. pushing a commit directly to master without a PR).
Whatever your branch restrictions are, they should align with whatever your company’s internal code review processes are. Then, the branch restrictions are simply acting as a fall back in case developers make a mistake (e.g. avoids merging/pushing to master by accident)
For a small startup, all this advice is irrelevant because you probably care a lot more about how fast you can pump out changes and care much less about bugs getting into prod. And that’s fine.
These suggestions are mostly relevant for mission critical code bases where the restrictions align with your QA process. The branch restrictions should be enabled after documenting a QA process. Branch restrictions should not be arbitrarily enabled if the reasons for the restrictions don’t align with your existing internal processes/workflows.
- if you are committing to a branch directly that then immediately goes to production you are doing something horribly wrong
- having a formal process for code reviews is like making sure there is a formal process to make sure employees wear pants to work. If this sort of thing has to be policed by security settings in the dev environment, then something deeper is wrong.
I mostly work on safety critical code, so I know what it means to ship a mission critical code base.
Most of the things you allude to suggest that you’ve never worked with a competent release manager or qa organization.
Edit: also:
> There’s always a balance between security/quality control and productivity
This is only true for definitions of “productivity” that exclude security/quality control, which usually mean that things are so out of whack, it is time to find a new job. By definition, the “productive” people aren’t worrying about product quality, and are being promoted for it, so soon the organization will be run by people that sabotaged the business.
> Most of the things you allude to suggest that you’ve never worked with a competent release manager or qa organization.
You’re right, I oversee an 8 person product team for a 20 person company. Not a huge organization. QA is a shared function and we don’t have a “release manager” on staff.
I’m genuinely confused about what you’re trying to say. Sounds like you’ve had a bad experience with branch restrictions. It would be interesting to hear what the workflow was and how branch restrictions, etc, got in your way.
It’s right there in the name: GitHub private repos are “private,” not “secure” or “secret”. That name was chosen purposefully.
When we moved our code from privately hosted SVN to GitHub 10+ years ago, folks at GitHub were quite clear that we should not trust private repos with secrets like private keys or passwords.
So why have private repos at all? It is to allow control of collaborators. GitHub originally allowed everyone to see and fork every repo—a true open source approach. Private repos were added basically so corporate codebase managers would not have to spend tons of time rejecting random PRs and responding to random people.
For a LOT of private software projects, everything secret can be easily stored in the database, environments, or a dedicated tool for managing secrets. GitHub private repos are great for those. If my entire codebase is a highly sensitive trade secret, I would not use GitHub private repos, personally.
When I have an expectation of privacy, say in a fenced backyard, I don't expect to be seen. It doesn't matter if it's a human or a robot commanded by a human that does the looking, it's still a violation of privacy.
It's not really "private" if the backyard has a one way glass pane on one wall. Github should not be calling repos private if they're not.
This is one of many instances where developers should not make shallow assumptions without reading the documentation. GitHub is very clear about how private repos should be used.
Edit to add: would you store your bank password on a piece of paper you always leave sitting in your backyard? Think carefully about your metaphor.
> would you store your bank password on a piece of paper you always leave sitting in your backyard?
Not that I'm advocating for checking passwords into git repos, GitHub or otherwise, but your metaphor depends on how rich you are and how bad your neighborhood is, I would think.
My mom has definitely done exactly that — her medical conditions make it very hard to use bank apps without full daylight and also hard to bring the notebook home after every use — and I don't think that her password management choices, however frowned upon they may be here, have ever been a serious problem in practice.
Did you read the article? They’re not saying that GitHub is reading your repos, they’re saying that mistakes happen and you shouldn’t assume that private repos are secure.
Go out in your fenced yard and assume you won’t be seen, but if you forget to close the gate someone might come back there anyway
I agree with the original comment -- github has secrets storage. There are also other secrets storage services that can be used out there. It's never a good idea to check in secrets in your code. Maybe the word "private" means different things to different users, but the fact that GitHub has a whole other set of features dedicated to "secrets" should be a giveaway to users that there is a distinction between "private" and "secret".
I don’t think it was “chosen very carefully”- the other word is “public”, and it refers mainly to who has access to the repo. Public and private are the ideal words imo
I dunno about other people, but I've always assumed that if we're giving employees, who we may have to fire someday, access to the codebase, then repos should be treated as if they could be exposed to the world at any time, in any way.
I've had to explain to leadership that certain employees don't even need access to the actual codebase to potentially cause harm via a competitor.
The real solution to this is to have a business model that is about more than just some source code in a repository, or "things one person could undo". Concerns like years-long relationships with customers (aka trust), platform-style lock-in patterns, etc. How many times have various parts of AAA game studio or Microsoft codebases been compromised? I am still waiting for that hacker edition of HL3 and my free copy of windows that doesn't suck.
I actually spent a few minutes walking through a hypothetical where 100% of our latest code is lifted and taken to our biggest competitor. I think it would probably cause them more harm than good. Complexity is a hell of a thing. Just having a point-in-time snapshot doesn't really give you an advantage over someone who has been at it continuously for years. Perhaps we can strike a consulting contract with whoever steals our code...
I am not against reasonable measures (i.e. MFA, GH Enterprise, VPNs), but I won't go into paranoid-tier (Citrix-style desktops) over stuff like this ever again. Code is a cheap commodity in 2023. Enshrining your repository as if it is the actual vehicle of business value is indicative of poor leadership. Customers, relationships, execution, etc. are way more challenging and important.
This is a great baseline, but it is often easier said than done.
For example, those repositories contain a lot of personally identifiable information; it is not that easy to get such a baseline ready when they _"should be treated as if they could be exposed to the world at any time, in any way."_
Depending on the jurisdiction, this can affect sensitive information that requires much stronger controls when you (rightfully!) expect the repository to become public despite it being a private one.
This post is mostly FUD. Case in point, the following advice:
> So, if you’re worried about it: stop putting sensitive data into private repositories.
Most of the issues mentioned in the post (misconfiguration, phishing, mistakes, zero-days) apply to all software, including non-cloud software. So the above advice is equivalent to "stop putting sensitive data into computers". It's run-of-the-mill popular-security nonsense that conveniently ignores that the alternatives come with their own risks, including security risks, and doesn't even attempt to perform a cost-benefit analysis.
I like to think of myself as a pragmatic practitioner of security, and even I would say this isn't "popular-security nonsense". The "cost-benefit analysis" here equals: don't bloody do that, _obviously_.
If your system relies on secrets being present in repos, it is a poorly designed system. There is no scenario where this is necessary, other than one not wanting to put in the effort to inject secrets sensibly.
Yes, "the alternatives come with their own risks", but those risks are where the "cost-benefit analysis" equals "not worth the effort".
> If your system relies on secrets being present in repos, it is a poorly designed system.
Repos are databases that track changes in files, nothing more and nothing less. There are millions of repos on GitHub that don't contain application code.
Some people put their entire home directory in a Git repository. Home directories almost always contain secrets. That doesn't mean it's a bad idea, it just means the repo isn't meant for the eyes of others. In other words, it's a private repository.
People putting their entire home directory in a repo, on somebody else's shared computer and network, is going to end in a disaster scenario. This sounds like a bad idea.
Avoid having secrets if you can. Use authentication methods that don't require them. When they are required, rotate them often - ideally automated. Keep them out of password managers. They are like condoms: cheap enough to use once, unpleasant if used twice. Hence the common UX where you only get to see the key once, which encourages single use.
Give those tokens the minimum permissions possible. Make sure that having to rotate one secret doesn't expose other secrets (looking at you, Azure Storage account primary access key!). Automated rotations make keeping them up to date in git untenable anyway.
Some services like Vercel and Firebase encourage bad practices in this regard. Firebase requires you to download a plaintext JSON of kingdom keys in order to do anything useful! Vercel KV storage encourages you to run an SDK command to download prod secrets into a local env file.
Yeah, I reacted to that too. It's like nonchalantly saying that you have all your passwords written on post-it notes at your desk.
The topic of discussion shouldn't be how to secure your desk from spying eyes, but about why having post-it notes with passwords is bad practice and just a bad idea overall.
If your private github repo accidentally goes public, the response should be "that's annoying but ultimately harmless", anything else is misguided.
Post-its for passwords are better practice than memorizing passwords. If you can memorize it, it is a bad password. Password managers are better yet, but you still need the master password.
The problem is not keeping those passwords in a secure location; treat them like a stack of $100 bills.
That's basically an analog password manager, we have gone full circle.
Or we return the metaphor to github repos, having a separate cabinet is like having a secret vault so that secrets are not directly in plain view in the repo itself, which is exactly what you should be doing.
Some folks use tools like https://github.com/mozilla/sops to store most secrets (besides the sops key, of course) in source control. Of course, you aren't committing the cleartext but if the repo gets published you should probably rotate your keys just to be safe...
Even this I would consider to be bad practice. Old versions of secrets are never relevant. Easy way to break your system:
1. Write code v1
2. Add secret
3. Write code v2
4. Rotate secret
5. Oops, some kind of problem, let's go back to known-good and redeploy (2). Broken because it tries the older secret, not the rotated secret.
That one has an easy fix: store secrets in a separate repo that you never roll back. That's not the reason to avoid storing secrets in git. You might be giving some junior dev here the idea that if they can solve this issue, then storing secrets in git will be ok. Obviously it's not; it's still a bad idea after you've solved this minor annoyance and, indeed, this annoyance had nothing to do with the security reason why you don't store secrets in git.
This assumes that the secrets are deployed along with everything else in the repository. Even if the same repository contains your app, they needn't be deployed together. And as far as old secrets go, they are at most as sensitive as current secrets.
What for? I've seen this happen, and if there hadn't been a review, it would have stayed there unnoticed.
/edit:
Also, what someone considers a secret and what they don't is often not well defined. If management has no clue what this is about, it is often better to only commit and push on a direct and simple work order, because those need to be well understood and you have the paper trail (as that author also suggests blameful retrospectives - IMHO hilarious).
Or have we forgotten the basic rules of sending data over the interwebs to other people's computers?
Sure, but in computer programming “secrets” is also industry jargon for small strings of characters that enable authentication, like passwords or private keys, which have much higher standard of secrecy than the rest of the codebase.
A secret is something that you go out of your way to keep private. But there are a lot of other things that are private by default, but aren't really secrets, like "the brand of toilet paper I use".
This. Every CI platform under the sun has support for secrets and config that should never live in git. It's worth ensuring people know this, of course, but I'm not sure storing secrets in git is all that prevalent. Many platforms also have secrets scanning to ensure you don't accidentally do this too.
> Many platforms also have secrets scanning to ensure you don't accidentally do this too.
The reason secrets scanning even became a thing is because of how often secrets get committed to git. Some of them even lead to intrusions.
Uber (2016) – Attackers gained unrestricted access to Uber’s private Github repositories, found exposed secrets in the source code, and used them to access millions of records in Amazon S3 buckets.
Scotiabank (2019) – Login credentials and access keys were left exposed in a public GitHub repo.
Amazon (2020) – Credentials including AWS private keys were accidentally posted to a public GitHub repository by an AWS engineer.
Symantec – Looking at hardcoded AWS keys in mobile apps, discovered they had a much wider permissions scope and led to a significant data leakage.
GitHub – Over 100K public repositories on GitHub were found to contain access tokens.
I've worked at companies with developers who didn't know that once committed, the secret remains in the history even if a subsequent commit removes it. It's not trivial, and involves rewriting the history[1]. There's also no way to fix clones of the repo, and there are a handful of other ways secrets can still leak.
The most secure way to deal with secrets accidentally committed to git is to rotate the secret.
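To give a flavour of what that scanning does, here's a toy pre-commit check for one classic pattern (AWS access key IDs); real scanners cover hundreds of patterns plus entropy heuristics, and none of this replaces rotating a key that already made it into history:

```python
import re
import subprocess
import sys

# AWS access key IDs are "AKIA" followed by 16 uppercase alphanumerics;
# this is one of the oldest patterns secret scanners look for.
AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")

def staged_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> int:
    hits = []
    for path in staged_files():
        try:
            text = open(path, encoding="utf-8", errors="ignore").read()
        except OSError:
            continue  # deleted or unreadable file
        if AWS_KEY_RE.search(text):
            hits.append(path)
    if hits:
        print("Possible AWS keys staged in:", ", ".join(hits))
        return 1  # non-zero exit blocks the commit when run as a pre-commit hook
    return 0

if __name__ == "__main__":
    sys.exit(main())
```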
OP's target audience, judging by all the emojis and the subject matter, seems to be pretty green junior devs who might not know any better. OP is putting themselves in their shoes as a rhetorical device; the author isn't saying that they themselves put secrets into git.
Does the article have anything GitHub-specific? All software can have bugs, so moving from GitHub to anything else is useless. Only self-hosting behind a VPN provides some additional protection, but that's true for all cloud services like Jira and CI/CD, and it's not a novel issue.
The article cites several GitHub incidents. If it can be verified GitHub competitors have a roughly equal number of incidents then the lack of novelty would be proven. I'm not supporting either view, just addressing your point.
This is true for everything in the cloud that isn’t end-to-end-encrypted. There is the saying that the cloud is just someone else's computer. But in addition, you also have no control over your data on that computer not being leaked elsewhere, or the computer ending up becoming someone different’s computer.
Adding to this, my own pessimistic take is that once the $big_company that acquired $useful_service decides it has obtained a sufficient level of corporate capture with paying businesses, it may decide to add limits to non-paid repository counts, sizes, pulls, and pushes, and tighten those limits each year until the non-paying members have moved on. Something similar may be slowly happening with Gmail, based on Ask HN submissions.
This indirectly outlines one of the primary reasons organizations (like mine) prefer using GitHub Enterprise - we get the collaboration benefits of GitHub, but our data is entirely controlled and hosted by us. It's extra work and more costly overall, but it's a small price to pay for the data security.
From my perspective, the upkeep of maintaining such versions is far from trivial. It involves dedicated resources ensuring the regularity and reliability of backups. Furthermore, it's rare to find organizations conducting comprehensive disaster recovery (DR) restorations of their self-hosted GitLab or GitHub instances. This is typically due to an inherent expectation that each upgrade will proceed without hitches and yield flawless operation.
Moreover, the process of securing approval for NAT inbound rules, particularly for integrations with services like Jira, Slack and so on, often turns into a labyrinthine ordeal. It usually involves navigating the differing viewpoints and interpretations across multiple departments, each with its own unique stance, which further exacerbates the complexity of the task.
I see a lot of deserved distrust of MS, but I thought GitHub was operated as a separate unit under Microsoft? [1] I expected that Copilot was an initiative of that leadership team, and training the LLM is what's likely reading certain repositories?
On a side note, I'm trying to imagine what "sensitive" code would be read, incorporated into an LLM such as Co-pilot, and somehow have any meaningful impact to me once incorporated?
A company I worked for a few years ago sold a source code license, and I was tasked with sharing the source, sans history and only at a specific point in time, to the purchaser's GitHub organization. I was told it had to be done as quickly as possible, and that we might do it a few more times, so figure out a way to make it repeatable.
I wrote a Bash script to copy our ~25 repositories to a new organization using the GitHub REST API. It turns out the default when creating a new repository was to make it public. Within a few minutes I was getting emails from third-party services that had downloaded and parsed our code and discovered a long-abandoned AWS credential hard-coded in an older codebase. (I think GitHub now does this for you.)
I was lucky that nothing important was actually compromised (we had done a decent job of keeping secrets out of our repos and the one exception was an account that was long gone), but it was an eye-opening experience. If these services had found and downloaded our code in minutes you can assume any repo made public, even for a few moments, has been downloaded and cataloged by a potential bad actor.
Assuming you mean encrypting the secret files specifically and not the whole repo. This could work, but why put the secrets in the repository at all? Either everyone has to replace them with their own secrets before being able to run the code, or everyone has to have the key to decrypt the actual credentials (developers, interns, deployments, CI, scanners, etc.), making them not secret.
Why not inject the credentials at runtime, from a system meant for this, with support for auditing and key rotation etc?
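A minimal sketch of what that looks like from the application's side, assuming the platform (CI secret store, Kubernetes, a cloud secret manager, etc.) sets the variable at deploy time; the variable name is just an example:

```python
import os
import sys

def get_secret(name: str) -> str:
    """Read a secret injected into the process environment at runtime.

    The deployment platform is responsible for setting it; the
    repository never contains the value itself.
    """
    value = os.environ.get(name)
    if not value:
        sys.exit(f"Missing required secret: {name}")
    return value

# Example (hypothetical variable name):
db_password = get_secret("DB_PASSWORD")
```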
Old secrets are around forever, tied to long-lived credentials (PGP keys) whose access can't be revoked (because it's always in the commit history). Additionally, it can be quite painful in an ever-changing collaborative context.
Unsurprising. Private repositories are not 'private' [0] and GitHub, Microsoft and OpenAI are thanking you for allowing them to train their AI on your so-called 'private' code. [1]
This is essentially a shameless admission, and they're not even hiding it any more:
Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.
This is another great reason to self-host instead of using services like GitHub that openly read your private code, which is something I have been saying for years.
As the person on call for exactly that sort of problem, it saddens me that people still push secrets to git at all.
Every gitops platform on the planet supports not pushing secrets to version control. There is no valid reason to have credentials inside a git repo ever. If the repo itself is what's secret, you should have known the risk when you pushed it to someone else's server. There is no excuse when gitlab is free and a docker container away.
I'm not sure why the "Easy Fixes" section doesn't include the thing mentioned in the next point: "stop putting sensitive data into private repositories". That should _also_ go in "Easy Fixes" IMHO; use environment variables, vaults, etc. so you don't store your keys in your code.
Have you guys considered the machines are reading your private repos all the time? Forget about LLMs -- what about your text editor? Also, it is likely that your browser has access to your private repo when you use your browser to browse your private repo.
https://docs.github.com/en/get-started/privacy-on-github/abo...