That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.
I'm sorry, but when was Microsoft ever reputable? They have a long history (and reputation) of being merciless in every way they can, and have been for as long as I can remember.
A corporation is never reputable. It is easier to reason with a dog than with a corporation.
However, methinks this relates to the times before GitHub became a Microsoft offering. But the deal was just too good to pass up: acquiring this massive army of minion coders who all pray to the octocat and now do the Ballmer dance.
Oh, so much fun: now it turns out they all feed the new AI overlords.
I think 90s Microsoft could have something of a claim.
It made a lot of sharp business choices in that decade, but it also left a LOT of money on the table for developers, as part of a strategic goal to grow the platform.
Then the 00s came, platform growth slowed (because they were already running on everything desktop), and the "vs linux" decisions started coming.
Eh, I wouldn't say they handled the whole DR-DOS saga very reputably. And in 2001 they came within an inch of their life in US v. Microsoft, litigation based mainly on anticompetitive practices from the 90s.
Microsoft was disreputable in the 90s to the point they were almost broken up several times.
> More realistically, Linux had driven them into near irrelevance in the server market and just pushed them from "extinguish" or "extend" to "embrace".
and with VS Code, Copilot, and WSL2 they're doing a terrifyingly good job :-/
GP said: That's the main reason I moved my code away from GitHub and am advising clients to follow suit. It boggles the mind that we have to actively police against IP theft by formerly reputable corporations, but here we are.
You said: I'm sorry, but when was Microsoft ever reputable?
nobody said Microsoft had ever been reputable.
GitHub is the formerly reputable corporation here.
Anybody who was a programmer through the 1980s knows the anti-competitive practices MS used to destroy many up-and-comers. They were essentially a dam on technological progress, coercing the world to use a single-tasking OS with no memory protection when we could easily have been running pre-emptive multitasking OSes on the hardware of the day.
Many SaaS companies probably double-dip by monetizing customer data in various ways. Good luck even knowing whether it's happening, and if you do figure it out, I'm sure the EULA will be properly one-sided.
Can you imagine how much intel Google Docs, GMail, Salesforce, Profitwell, etc have about company performance and plans?
I’m sure nobody is using any of that data to insider trade, to give just one example. Nobody would do that.
Yes, not conclusive, but the cynic in me is usually right, especially when there are large sums of money and power on the line. OpenAI was originally supposed to be open, but power corrupts.
This comment is complete FUD: you're linking to a tertiary paragraph in a document describing how your data may be used if the owner of a private repository enables it, for uses such as Dependabot, CodeQL, or secret scanning.
The top of the document is:
> GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.
> If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."
Taking a single paragraph out of context isn’t worthy of the top-voted comment.
I call BS. OpenAI was originally meant to be an open-source, non-profit company. It is now a closed-source, for-profit company and is controlled by the company that made FUD famous as a tactic for discrediting those who stood against it. It is supreme naivety to think they will not use whatever they can to gain power in the AI arena. There is nothing in that statement that a high-powered lawyer cannot twist and bend to their liking. I can easily see transformer weight matrices being defined as "aggregate data": "they are just floating-point numbers; they are not anyone's source code."
OpenAI != GitHub even if GitHub has allowed training on public repositories (which I believe to be an absolute mistake because it should be preserving licensing on the processed repositories and we know that it is not doing so).
There are many reasons to distrust Microsoft. The wording of this particular paragraph explaining how data gets used in accordance with the linked terms of service (which are the actual governing documents, not the page you’ve linked to) is not one.
OpenAI is not a subsidiary of Microsoft. GitHub is.
Unless you can meaningfully show that Microsoft is actively exercising a subsidiary relationship (that is, directing OpenAI's product decisions), I have to disagree with your base notion.
At this point, I reiterate that your original claim is 100% FUD and disinformation.
I’m not asking you to trust GitHub or Microsoft, but legal terms have meaning and the terms do not support your assertions.
Their policy, if you scroll up from this link, is to scan only “aggregate metadata” and only if you opt in.
> GitHub aggregates metadata and parses content patterns for the purposes of delivering generalized insights within the product. It uses data from public repositories, and also uses metadata and aggregate data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph. If you enable the dependency graph for a private repository, then GitHub will perform read-only analysis of that specific private repository.
> If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service. The information we learn only comes from aggregated data. For more information, see "Managing data use settings for your private repository."
> The information we learn only comes from aggregated data
It seems pretty clear to me that this means they're allowed to use private repos to train Copilot, etc.
I wonder if any researchers have tried putting fingerprinted source code into a private repo and then, after the model is retrained, getting Copilot to suggest something that could only have come from the injected, supposedly-private source code.
That would make a nice paper. I hope someone does it.
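For anyone tempted to try it, here's a minimal sketch of the canary setup in Python (everything here is hypothetical: the function names are made up, and since there's no public hook into Copilot's retraining, the final "prompt and grep" step is only described in a comment):

    # Hypothetical canary generator: embed an unguessable token in
    # plausible-looking code, then commit the file to a private repo.
    import secrets

    def make_canary_source() -> str:
        # 128-bit random token: effectively zero chance of the model
        # emitting it unless it saw this exact file during training.
        token = secrets.token_hex(16)
        return (
            f"def checksum_{token[:8]}(data: bytes) -> str:\n"
            f"    # canary-{token}\n"
            f"    return data.hex() + '{token}'\n"
        )

    source = make_canary_source()
    print(source)  # commit this file, record the token somewhere safe
    # After a suspected retrain: prompt the assistant with the function
    # signature as a prefix and search its completions for the token.

The key property is that the token is high-entropy, so any verbatim reappearance in a completion is essentially proof of memorization rather than coincidence.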
I genuinely don't see how "The information we learn only comes from aggregated data" relates to training LLMs, which need raw data, not aggregated data, as their input.
Maybe we have different definitions of the term "aggregated"?
This suggests to me that GitHub needs to extend that text to explain what they mean by "aggregated".
Sure. All aggregated data is ultimately derived from raw, unaggregated data. One can make the argument that training an LLM is "just" an unusually complicated form of aggregation.
Whether that would hold up is another question. But yeah, I agree with the conclusion that they need to clarify this.
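To make the distinction concrete, here's a toy illustration (the repo contents are invented, and none of this is GitHub's actual pipeline):

    # "Aggregate data" in the ordinary sense: summary statistics with the
    # source text discarded. LLM training, by contrast, consumes the raw
    # text itself.
    from collections import Counter

    repo_files = {
        "app.py": "import requests\nimport flask\nprint('hi')\n",
        "util.py": "import requests\n",
    }

    # Aggregation: the counts survive, the code does not.
    import_counts = Counter(
        line.split()[1]
        for text in repo_files.values()
        for line in text.splitlines()
        if line.startswith("import ")
    )
    print(import_counts)  # Counter({'requests': 2, 'flask': 1})

    # Training input: the verbatim characters of the code.
    training_corpus = "\n".join(repo_files.values())
    print(training_corpus)  # the code itself, not a summary of it

If "aggregation" is read broadly enough to cover the second case, the word has stopped doing any work, which is exactly the ambiguity the policy text should resolve.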
What is AI training but “parsing content” for “delivering generalized insights”? They intentionally use slippery language that can defend their practices.
https://docs.github.com/en/get-started/privacy-on-github/abo...