
> Microsoft GitHub is the largest collection of open source code in the world. Microsoft GitHub is in a unique and dominant position to host and access and distribute most of the open-source code in the world

No, it's not in a "unique and dominant position". Open source code is freely available online; it's almost trivial to build a bot to scrape open source code from anywhere on the web (GitHub included).

The comparison to the Google Books antitrust case falls down completely: Google had a dominant position because it had the resources to scan all books. Anyone can build a collection of almost all open source code.

Further to that, all of these models (GPT and the image-generation models) are trained on scraped data; suggesting that only GitHub/Microsoft could do it defeats the purpose of trying to establish what the legal rights are around training models on scraped data.

We need test cases and precedent, but trying to use this as one is not going to work.

Edit:

It took me 15 seconds to find that there is a Google BigQuery dataset of open source code from GitHub: https://cloud.google.com/blog/topics/public-datasets/github-...

and that's been further curated on Hugging Face: https://huggingface.co/datasets/codeparrot/github-code
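
For example, here's a rough sketch of streaming that Hugging Face dataset with the `datasets` library (the column names are what I'd expect from the dataset card, so treat them as assumptions, and the exact arguments may vary with your datasets version):

    from itertools import islice
    from datasets import load_dataset

    # Stream the curated GitHub code dataset rather than downloading it all up front.
    ds = load_dataset("codeparrot/github-code", split="train", streaming=True)

    for sample in islice(ds, 3):
        # "repo_name", "path" and "code" are assumed from the dataset card.
        print(sample["repo_name"], sample["path"], len(sample["code"]))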

GitHub / Microsoft do not have a monopoly on this data.




> Google had a dominant position because it had the resources to scan all books.

I thought Google had a dominant position because they signed an exclusive deal with the Authors Guild that explicitly gave them a dominant position.

Anyone else could set up a project to go round libraries and scan books. Google has put more money into it than other organisations, but The Internet Archive has about 20 million scans (https://archive.org/details/texts).


There certainly are other spaces where open source code is hosted and available, but the default for most is GitHub. I think it's in a similar position to Google 10 years ago. Sure there are other search engines, but Google is by and large the standard one.

That does put Microsoft in the unique position of having direct, unfettered access to any and all open source code on GitHub without restrictions. Unless you or I get the same kind of direct access without rate limiting and anti-bot protection, they do dominate and have an advantage over everyone else.


Not sure if you posted before the edits, but I'm pretty convinced by them, seeing as how there are multiple alternatives with the same data.


it’s really not that hard to

git clone

git remote set-url origin …

It’s much harder to copy Google’s index.
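
Roughly like this, assuming you just want a one-off mirror (both URLs below are placeholders):

    import subprocess

    SRC = "https://github.com/example/project.git"       # placeholder source
    DST = "https://git.example.org/mirror/project.git"   # placeholder destination

    # --mirror grabs every branch and tag, not just the default branch.
    subprocess.run(["git", "clone", "--mirror", SRC, "project.git"], check=True)
    # Push the whole thing to the new host.
    subprocess.run(["git", "-C", "project.git", "push", "--mirror", DST], check=True)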


You think it's practical to do this with almost all the public repos on GitHub?


That's not GitHub's fault or GitHub's problem, from an antitrust perspective. If they went out of their way to make it difficult, you might have an argument, but as far as I know they haven't. It's just practically difficult by the nature of the problem.


They rate limit, so they do make it difficult though


They rate limit to protect their infrastructure, not to make it difficult. This is not anticompetitive.


So they still have an advantage then?

Microsoft can continually train on the majority of open source and/or public code with zero limitations while others can't?


Yes, they have an advantage but that's not anticompetitive. That's just the reality of the world.

> It's just practically difficult by the nature of the problem.

What I said here is why. It's not easy to allow external parties to "just download the entirety of GitHub." It's not unreasonable to rate limit your infrastructure, especially if the person using it isn't paying you money.

The fact MS can train on the code more easily is irrelevant here. It's possible for a third party to download the code; it'll just take longer.


Yup,

Well anyway, it's enough to inspire me to look for a new home for my projects.

Cheers.


Yeah, I think so.


This is addressed in the same paragraph - you can't scan/download the "whole" of GitHub because you'll be throttled.


Are you actually throttled if you try to git clone, or is that just the theory? Or is the assumption that it uses API calls to scrape through GitHub?

Has anyone actually tried? Because I've cloned lots of repos and have never been throttled. I'd go so far as to say the author of that post has never even tried it.


I'm not arguing for or against whether they are in a dominant position; I'm pointing out that the grandparent quoted part of the text (and argued against it) without quoting the justification the author provided, which is directly relevant to their point.

> There’s an important notion to address here. Open source code on GitHub might be thought of as “open and freely accessible” but it is not. It’s possible for any person to access and download one single repo from GitHub. It’s not possible for a person to download all repos from Github or a percentage of all repos, they will hit limitations and restrictions when trying to download too many repos. (Unless there’s some special archives or mechanisms I am not aware of).


> Has anyone actually tried? Because I've cloned lots of repos and have never been throttled

(Full disclosure: I have some pretty serious data hoarding issues)

When someone says "I've cloned lots of repos and have never been throttled" I'm afraid I immediately start wondering whether "lots" means multiple GB or multiple TB ... or more!


21 TB of data; they might rate limit you! But it might be possible via proxies. And only public repos.


Copilot was only trained on public repos. I'd be surprised if you were throttled.


I'd be surprised if they didn't throttle anyone trying to download 21TB of data. And I wouldn't judge them for it.


There’s no need to crawl for your own dataset:

https://pile.eleuther.ai/


@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

So if I understand this correctly, the Pile is for code from 2020 backwards? If I wanted anything released in the past 3 years, say something in the SOTA AI space (where a month is a lifetime), I would need the scraper again?

I don't follow how this can compare to direct, live, unrestricted access. I suppose this is just my own hatred of Microsoft shining through. Of course we should accept the status quo, because how dare we suggest Microsoft could operate in a manner that is anti-competitive.

For anyone else trying to catch up, just rent a datacenter, write a crawler, deal with all the intricacies of keeping it in sync in real-time. This sounds trivial, simple even.

I wonder why nobody is doing it? Perhaps not everyone has access to petabytes of storage space, unlimited bandwidth, unlimited proxy jumps, etc.

So the alternative is to buy GitHub?


> I wonder why nobody is doing it? Perhaps not everyone has access to petabytes of storage space, unlimited bandwidth, unlimited proxy jumps, etc.

There are multiple private companies and public institutions that are currently training LLMs.

The work required to train an LLM actually supports a fair use argument, just as it did with regard to Google scanning books.


> No, it's not in a "unique and dominant position". Open source code is freely available online; it's almost trivial to build a bot to scrape open source code from anywhere on the web (GitHub included).

Absolutely wrong. GitHub is doing way more than just hosting code. It hosts bug trackers, CI, and much more. For most FOSS projects it's the ONLY place where you can go and submit a bug report.

It's not just a repository; it's a communication tool, and it refuses to interoperate with other platforms.

This is a monopoly, just like NPM and LinkedIn. Microsoft never changes.


GitHub also has access to private repositories.


They don't use private repositories to train Copilot.


Maybe not yet. It's all just a change of their terms away. Oh, you don't like it? We'll give you 2 weeks to migrate. Perhaps you want this other, more expensive subscription?

Just like with the other code they shouldn't be using the way they do, they would probably take another "ask questions later" approach.


They say they don’t


I'm sure you think this is a clever reply, but the reality is that GitHub wouldn't even begin to consider it, even if it were technically possible. If it got out that it trained on confidential customer data, it would be game over. The risk is so stupidly large that nobody in their right mind would take it. So yeah, if they say they don't, they don't.


Yet it's OK to train on copyleft code?


Copyleft code is (typically) not confidential.


I don't understand why people just automatically doubt things that companies say when they could be sued (or would otherwise destroy their business) if they were lying about it. Seems unnecessarily pessimistic.


People doubt Microsoft because they've historically run a very aggressive business and done things of questionable morality many times.

They've been to court and they've lost and it definitely hasn't destroyed their business one bit.

For example, Microsoft subsidiary LinkedIn routed customer email through their servers so that they could scrape it. They did that without customer knowledge via a dark pattern.

They later apologised for doing it but still used it to propel the company's growth. In the end it didn't hurt anything but their reputation for respecting people's privacy.

Microsoft's own anti-trust history is littered with exceptional behaviour too. They are the size they are now by dint of super aggressive business practices.


Normally because history shows us that redress via the court system is rarely punitive to a company the size of Microsoft; further, Microsoft has a long history of lying to its customers with seemingly no impact on its business.


I mean, we discovered that the whole car industry was flagrantly lying on their emissions tests, which had the potential to destroy the whole business, and there were A LOT of people who knew about it and could have talked at any time.

Why wouldn't software companies do the same?


And how many of those companies were materially impacted or had more than a couple quarters of negative consumer backlash?

None... so the grandparent's comment is without evidence that either consumers or regulators hold companies to account.


But will that actually be against ToS or copyright? Many people tend to say that Copilot learning from OSS doesn't infringe any copyright and is no different from a person just learning from someone else's work. So how is it different if Copilot is learning from private repositories? Or, e.g., from leaked source code?


Isn’t it illegal to learn from leaked source code? Or even to view it at all?


It is not, at least in the US. Distribution is illegal; possession may or may not be prosecuted; and if you read the code and provably reuse it or make use of trade secrets you could lose a lawsuit. But if you "somehow" have access and don't do anything associated with the code, the basic act of reading it carries no penalties.


I fully expect the answer to this to vary wildly from jurisdiction to jurisdiction.


I'm frequently told on HN that Big Tech would willingly, flagrantly violate GDPR like it's nothing. Even if the upside of collecting that info was minimal and the downside was 4% of global revenue.

I guess if they can do that, then what's a small lie about private repos between friends.


I'm fairly confident this is untrue. At Microsoft at least, it's a big deal when there is a privacy issue, even a small, localized one on a single product - and it creates a small firestorm.

We’ll get engineers working long hours focused on it, consulting closely with our legal and trust teams. One of the first questions we ask legal when we suspect a privacy issue is “Is this a notifiable event?”

It’s not really about getting slapped by regulators - it’s the fact that much of Microsoft’s business is built by earning the trust of large companies and small ones. Many of them are in the EU of course, but we have strict compliance we apply broadly. It’s just not worth damaging our reputation (and hurting our business) for some shortcut somewhere, as trust takes a long time to build and is easily broken.


Why would they possibly lie about that?


Because they do shady shit. Like, by default, Copilot would "sample" your code for training while you used it. Maybe this is no longer the default, maybe it still is, but it was the default.

This type of thing erodes trust. Why should my proprietary code be used for training by default?

I was really annoyed by this.


OpenAI is not the same company as GitHub, and it has always been pretty clear that chats on ChatGPT are recorded and used for training (unless you now opt out).


Not sure why you're bringing OpenAI into it. My comment and the article are about "Copilot".

I'm talking about how, when you use "GitHub Copilot" and ask for a code suggestion, it would send the "context" back to GitHub / Microsoft and use that code for training.

Your comment is interesting to me though because there does seem to be a surprisingly large amount of defending OpenAI going on. Almost seems automatic now.


Because GitHub Copilot is an interface to OpenAI Codex:

"GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI."

https://docs.github.com/en/copilot/overview-of-github-copilo...


> it would send the "context" back to GitHub / Microsoft

Because this is fundamentally how the system works. The context is the prompt.

> and use that code for training

This part has never been true. It’s not how these systems work. Do you have anything to back up your claim?


> It's almost trivial to build a bot to scrape open source code from anywhere on the web.

Seems like a logistical nightmare to me. Git repos interact spectacularly poorly with web scraping in general.


I would've said you should download only archives, but really I think commits are also very important data, since they show the actual changes in the code, which would be very useful for training an AI to suggest changes to code.
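
As a rough sketch of pulling that change history out of a clone (the "project" directory is a placeholder for an already-cloned repo):

    import subprocess

    # Commit messages plus their patches: roughly the "changes" signal you'd train on.
    log = subprocess.run(
        ["git", "-C", "project", "log", "--patch", "--pretty=format:%H %s"],
        capture_output=True, text=True, check=True,
    ).stdout

    count = subprocess.run(
        ["git", "-C", "project", "rev-list", "--count", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    print(f"{count} commits, {len(log)} characters of messages + diffs")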


There are valid, non-evil reasons for git hosts to want to throttle and put up obstacles to scraping as well, whether via crawlers, 'git clone', or whatever. These are very expensive operations.


It appears to be the exact opposite to me: `git clone --depth 1 ...` will give you code that you know exactly how to parse, vs. webpages that have all sorts of semantic issues.
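
Roughly something like this (the repo URL and the *.py glob are just placeholders):

    import pathlib
    import subprocess

    REPO = "https://github.com/example/project.git"  # placeholder

    # --depth 1 fetches only the latest snapshot, which is enough for parsing source.
    subprocess.run(["git", "clone", "--depth", "1", REPO, "project"], check=True)

    # Walk the checkout and read source files directly; no HTML to untangle.
    files = list(pathlib.Path("project").rglob("*.py"))
    code = [f.read_text(errors="ignore") for f in files]
    print(f"{len(files)} Python files, {sum(len(c) for c in code)} characters")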


Git clone is a very expensive operation. Git hosts generally will try to prohibit mass git cloning for this reason.


What makes it so expensive? I'd always assumed it downloaded the .git directory statically, and the computational bits were done by the local client.


I'd assume this is in relation to how much other operations cost. With 'git clone' you at least download the whole repository. Compare that to 'git fetch', which is essentially a lookup at the last-modified timestamp.


Yeah. Git repositories can grow very large very quickly. A single clone here and there isn't too bad, but if you're scraping tens of thousands of projects, you can easily rack up terabytes in disk and network access.


How so? Can't someone just download the zip file and make a queue of downloads, or does GitHub rate limit?
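
Something like this, I'd assume (owner/repo/branch are placeholders, and GitHub may well throttle or block it at scale):

    import urllib.request

    # GitHub serves branch snapshots at this URL pattern; the values below are placeholders.
    owner, repo, branch = "example", "project", "main"
    url = f"https://github.com/{owner}/{repo}/archive/refs/heads/{branch}.zip"
    urllib.request.urlretrieve(url, f"{repo}-{branch}.zip")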


Microsoft GitHub has access to all the commits you force-pushed away or branches you deleted. We have no reason to believe that data is actually gone, given there's no transparency and the source code is closed.


> We need test cases and president

I imagine you meant "precedent".


> The comparison to the Google Books antitrust case falls down completely: Google had a dominant position because it had the resources to scan all books. Anyone can build a collection of almost all open source code.

Copying a file is not the same thing as "scanning" a book. To scan, you first need to get your hands on the book (the download part) and then use industrial scanners to scan it. So the apples-to-apples comparison here is scanning <-> training, scanned collection of books <-> trained model, and finally the portals to the loot: Google Books <~> GitHub + VS Code.

Not everyone has the resources to actually do the processing -- that is, train the 'model' -- using the publicly available 'data'. Most also don't own the GitHub and VS Code platforms to field their model on. In fact, is anyone other than Microsoft in a position to scrape OSS, train a coding AI, and then include that tool in dominant software development platforms?



