
Don’t forget OpenAI and Microsoft using your GitHub data for training GPT. Their privacy statement says your content will not be read by “human eyes.”

https://docs.github.com/en/get-started/privacy-on-github/abo...




That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.


> formerly reputable corporations

I'm sorry but when was Microsoft ever reputable? They have a long history (and reputation) of being merciless in every single way they can, and have for as long as I can remember.


A cooperation is never reputable. It is easier to reason with a dog than with a cooperation.

However, methinks this relates to the times before GitHub became a Microsoft offering. But the deal was just too good to pass up: getting this massive army of minion coders who all pray to the Octocat and now do the Ballmer dance.

Oh, so much fun; now it turns out they all feed the new AI overlords.


That misspelling of "corporation" severely breaks the meaning you almost certainly intended.


oh snap, too late to edit ... you are absolutely right.


Developers developers developers developers


GitHub wasn't always owned by Microsoft.

Please note I am not attempting to address the reputability of GitHub pre-acquisition. That is a separate matter.


I think 90s Microsoft could have something of a claim.

It made a lot of sharp business choices in that decade, but it also left a LOT of money on the table for developers, as part of a strategic goal to grow the platform.

Then the 00s came, platform growth slowed (because they were already running on everything desktop), and the "vs linux" decisions started coming.


Eh, I wouldn't say they handled the whole DR-DOS saga very reputably. And in 2001 they came within an inch of their life in US v. Microsoft, litigation based mainly on anticompetitive practices from the '90s.

Microsoft was disreputable in the 90s to the point they were almost broken up several times.


Microsoft's battle against Netscape (which is also part of the show Valley of the Boom), and the Halloween documents, are two more examples.


It bleeds over into ethics at some point.

Microsoft in the DOS days was still fighting hard for market share. Microsoft in the Netscape days? Eh... less of a competitive claim. Post ~2005? No claim.

For me, the definition of "reputable" changes when you're a competitor among equals vs when you're a monopoly.


They've been a monopoly since the '90s.


Okay, maybe '90s MS could've, but I have a feeling that was 30 years ago.


>I'm sorry but when was Microsoft ever reputable?

They went on an open source charm offensive a few years ago. "Oh, we've turned over a new leaf" etc.

A lot of people believed they'd had a legitimate change of heart because of the change in strategy.

More realistically, Linux had driven them into near irrelevance in the server market and just pushed them from "extinguish" or "extend" to "embrace".

Their dubious anti-Linux tactics via leaning on OEMs in the desktop market remained more or less unchanged.


> More realistically, Linux had driven them into near irrelevance in the server market and just pushed them from "extinguish" or "extend" to "embrace".

and with VS Code, Copilot, and WSL2 they're doing a terrifyingly good job :-/

I really hope people don't let their guard down.


GP said: That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.

you said: I'm sorry but when was Microsoft ever reputable?

nobody said Microsoft had ever been reputable.

GitHub is the formerly reputable corporation here.

GP comment doesn't even make sense without that.


Define reputable.

They're still probably the most indirectly trusted company on Earth.

Why? Because almost all enterprises run some non-trivial amount of MS code:

Windows, Azure / Azure AD (or plain AD), Teams/Outlook, VS Code, or something else.

And please, let's not start arguing that some 30-person startup uses only Macs.


Anybody who was a programmer through the 1980s knows the anti-competitive practices that MS used to destroy many up-and-comers. They were essentially a dam on technological progress, coercing the world to use a non-protected, single-tasking OS when we could easily have been using pre-emptive multitasking OSes on the hardware of the day.


Many SaaS companies probably double dip by monetizing customer data in various ways. Good luck even knowing if it's happening, and if you do figure it out, I'm sure the EULA will be properly one-sided.

Can you imagine how much intel Google Docs, GMail, Salesforce, Profitwell, etc have about company performance and plans?

I’m sure nobody is using any of that data to insider trade, to give just one example. Nobody would do that.


Microsoft providing software and services to organizations running concentration camps in Texas is why I stopped using GitHub.


Care to elaborate?


What alternatives do you recommend, and why?


A private instance of GitLab on a private server works.
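
For what it's worth, here's a minimal sketch of migrating an existing repo to such an instance (both remote URLs below are made-up placeholders; assumes git is installed):

    # Hypothetical migration: mirror a GitHub repo to a self-hosted GitLab.
    # SRC and DST are placeholders for your own remotes.
    import subprocess

    SRC = "git@github.com:example/project.git"       # old GitHub remote
    DST = "git@gitlab.internal:example/project.git"  # self-hosted GitLab

    subprocess.run(["git", "clone", "--mirror", SRC, "project.git"], check=True)
    subprocess.run(["git", "remote", "set-url", "origin", DST],
                   cwd="project.git", check=True)
    subprocess.run(["git", "push", "--mirror"], cwd="project.git", check=True)

git clone --mirror plus git push --mirror carries over all branches and tags, not just the default branch.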


IP theft?

Code by itself is not IP.

Your product or a specific algorithm is.

And in my opinion, patents on algorithms should be illegal.

There is no inherent problem with hosting code on GitHub.

You are not doing a good job if you move companies away from working setups because of this.

And they can't have had high security requirements anyway, because everyone who does normally hosts GitHub Enterprise or GitLab themselves.


I'm always surprised that there are people on HN who like software patents.

But sure, have your opinion; at least try to bring your issue across.


I don't think anyone on HN likes software patents but god damn do they love copyright of code.


I don't think it's right to conclude "your private repo data will be used to train GPT" based on the text you linked to there.


Yes, not conclusive, but the cynic in me is usually right, especially when there are large sums of money and power on the line. OpenAI was originally supposed to be open, but power corrupts.


I never read that before, but you are right: based on that wording, it looks like private repos are open to having LLMs trained on them.

I assumed up to this point that they could only use public ones, but this wording suggests otherwise.


This comment is complete FUD, as you’re linking to a tertiary paragraph in a document describing how your data may be used if the owner of a private repository enables it, for such uses as Dependabot, CodeQL, or secret scanning.

The top of the document is:

> GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.

> If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."

Taking a single paragraph out of context isn’t worthy of the top-voted comment.


I call BS. OpenAI was originally meant to be an open-source, non-profit company. It is now a closed-source, for-profit company, controlled by the company that gave FUD its original meaning by discrediting others who stood against it. It is supreme naivety to think they will not use whatever they can to gain power in the AI arena. There is nothing in that statement that a high-powered lawyer cannot twist and bend to their liking. I can easily see the transformer matrix weights being defined as "aggregate data" -- "they are just floating point numbers -- they are not anyone's source code."


OpenAI != GitHub even if GitHub has allowed training on public repositories (which I believe to be an absolute mistake because it should be preserving licensing on the processed repositories and we know that it is not doing so).

There are many reasons to distrust Microsoft. The wording of this particular paragraph explaining how data gets used in accordance with the linked terms of service (which are the actual governing documents, not the page you’ve linked to) is not one.


GitHub = MS, OpenAI = MS, so by transitivity OpenAI = GitHub (where = means "controlling or related interest").

I agree that the particular wording is not sufficient to specify much of anything, but it sure doesn't shut the door on the possibility either.


OpenAI is not a subsidiary of Microsoft. GitHub is.

Unless you can meaningfully show that Microsoft is effectively in a subsidiary relationship with OpenAI (that is, one where it directs OpenAI’s product direction), I have to disagree with your base notion.

At this point, I reiterate that your original claim is 100% FUD and disinformation.

I’m not asking you to trust GitHub or Microsoft, but legal terms have meaning and the terms do not support your assertions.



IIRC, they said they wouldn't do this, so this could just refer to secret scanning.


Clearly they lied, if their policy says otherwise.


Their policy, if you scroll up from this link, is to scan only “aggregate metadata” and only if you opt in.

> GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.

> If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."


> The information we learn only comes from aggregated data

It seems pretty clear to me that this means they're allowed to use private repos to train Copilot, etc.

I wonder if any researchers have tried putting fingerprinted source code into a private repo and then (after the model is retrained) getting Copilot to suggest something that could only have come from the injected, supposedly private source code.

That would make a nice paper. I hope someone does it.
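
A minimal sketch of how one might generate such a canary (the function name and marker scheme are invented; the point is just an unguessable token):

    # Hypothetical canary generator: emits a uniquely identifiable function
    # to commit to a private repo. If a model later completes the marker,
    # its training set must have included the private code.
    import secrets

    def make_canary() -> str:
        token = secrets.token_hex(16)  # unguessable 32-char hex marker
        return (
            f"def frobnicate_{token[:8]}():\n"
            f"    # canary {token}\n"
            f"    return '{token}'\n"
        )

    print(make_canary())  # paste the output into the private repo

Then wait a retraining cycle and prompt the model with "def frobnicate_" to see whether it completes the token.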


I genuinely don't see how "The information we learn only comes from aggregated data" relates to training LLMs, which need raw data, not aggregated data, as their input.

Maybe we have different definitions of the term "aggregated"?

This suggests to me that GitHub need to extend that text to explain what they mean by "aggregated".


The LLM itself is a form of aggregated data.


Sure, but the raw training data isn't.

I think GitHub need to clarify this themselves.


But then the training can surreptitiously be called data aggregation: we don't look at the raw data, we aggregate it and then query the aggregation.

Definitely needs clarification, though somehow I suspect this is all by design.


Sure. All aggregated data is ultimately derived from raw, unaggregated data. One can make the argument that training an LLM is "just" an unusually complicated form of aggregation.

Whether that would hold up is another question. But yeah, I agree with the conclusion that they need to clarify this.


Training the LLM is a form of “learning”, and putting all the data in an input training set is a form of aggregation.

The clause seems to mean “we can do whatever we want with your data as long as we violate many people’s privacy at scale at the same time”.
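
To make that concrete, here's a toy illustration (entirely made-up data) of how a model built from "aggregated" counts can still spit back raw training text verbatim when a phrase is unique in the corpus:

    # Toy "aggregate": word-bigram counts. The secret phrase occurs once,
    # so greedy generation reproduces it verbatim from aggregated stats.
    from collections import Counter, defaultdict

    corpus = ("the cat sat on the mat . "
              "API_KEY = hunter2_secret_value . "
              "the dog sat on the rug .").split()

    bigrams = defaultdict(Counter)  # aggregated statistics only
    for a, b in zip(corpus, corpus[1:]):
        bigrams[a][b] += 1

    word, out = "API_KEY", ["API_KEY"]
    for _ in range(3):  # greedy "generation"
        word = bigrams[word].most_common(1)[0][0]
        out.append(word)
    print(" ".join(out))  # -> API_KEY = hunter2_secret_value .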


Enter lawyers.


What is AI training but “parsing content” for “delivering generalized insights”? They intentionally use slippery language that can be stretched to defend their practices.


The policy says nothing about ChatGPT or LLM/AI models.


I couldn't find the privacy policy for Azure Repos. Does anyone know if it has the same type of statement regarding 'human eyes' for private repos?



