Stack Overflow to charge LLM developers for access to its coding content (theregister.com)
31 points by neverrroot 79 days ago | hide | past | favorite | 16 comments



Wasn't one of the tenets of SO that the answers were all Creative Commons and available via a regular data dump? It was a reaction to EE putting up paywalls and locking people's own answers up. The idea was that whatever you contribute here won't ever be locked up like that. How can they change that?

Does Creative Commons somehow prohibit use in training sets? What does attribution mean in that case? Are the data dumps still happening?

(( Edit: Interesting idea below about a new Creative Commons license that asks for any model trained on the data to be open. Seems in the spirit of CC ))


It looks like they announced two changes here:

1. They are charging for use of the Overflow API, not the data itself.

2. They are enforcing the attribution clause of the existing CC BY-SA license on the content. In their opinion, if an AI bot's answer includes parts drawn from the Stack Overflow API (or data dump?), it should credit the most closely matching answer by linking back to Stack Overflow.


It is not clear whether they can enforce the CC license here at all: they are not the authors of the content, and their ToS does not contain any clause delegating enforcement to SE.


Excellent answer.

Following up on the parent, for those who don't already know: EE = Experts Exchange, an older question-and-answer website that got strong Google SEO for its questions but eventually hid the actual answers behind a paywall. If you wanted to see the answer, you had to pay. Stack Overflow was, at least partially, a reaction to that in the beginning.

There was initially some confusion about the license behind the answers, and they went through a few rounds of license revisions (https://stackoverflow.com/help/licensing).

Now, most things on Stack Overflow (and the other Stack Exchange sites) are under CC BY-SA 3.0 or 4.0. It's important to point out that there is no ban on training in any of those licenses, probably because they were written before that was even a thing. Regardless (and the obligatory IANAL), the attribution clause should certainly apply if the output is directly derivative of code on SO. (How to track thousands of attributions across a large codebase is another question.)

Whether closing down the API is abiding by the spirit of the license is an open question, but it certainly seems to be allowed by the letter of the license.

My personal feeling is that this is rowing upstream, and that the large and incredibly well-funded companies like OpenAI, Anthropic, Google, and Meta have already scraped all of that historical data, so this really only hurts the new startups that are poorly funded. However, I'm sure that making this deal with Google et al. will be a good thing for Stack Exchange as a whole, and perhaps the funding will breathe new life into Stack Overflow (et al.).


It’s probably considered enough of a transformation that it no longer falls under the license, but that’s the main copyright question being worked out in the courts right now anyway.

I wonder if the Share Alike license could be updated to include sharing models trained on that data. I’d certainly like to see more CC-licensed models out there.


Can't wait for the avalanche of excuses from AI proponents about how attribution is impossible because their stealing machine totally doesn't steal and even if it did couldn't tell you what it's stealing from and anyway this is the FUTURE!!!1! How dare you be opposed to literally any technological change we decide to call progress go back to your cave luddite


I assume those ideas were formulated around 2000ish, when humanity's smartest computers were at best as smart as nematodes and incapable of using language convincingly. Now that humanity's smartest computers are as smart as an ant colony, it turns out that with enough data they can use human language convincingly, which is very interesting but nevertheless a serious challenge to existing ideas around copyright. It's not enough to completely pull up the drawbridge so that humans can't freely access the data. But it's reasonable for a community-driven org like SO to be concerned about fundamentally dumb robots like GPT-4 threatening their survival and their mission.

Let me add that Google and OpenAI are fully aware that their tech is useless in the medium-term without technically-skilled humans contributing actual human thoughts to places like Stack Overflow.


SO blog: Stack Overflow and Google Cloud Announce Strategic Partnership to Bring Generative AI to Millions of Developers https://stackoverflow.co/company/press/archive/google-cloud-... ( https://news.ycombinator.com/item?id=39559592 )

Meta: What does the new Google AI Partnership mean? https://meta.stackoverflow.com/questions/429306/what-does-th...

Tech Crunch: Google brings Stack Overflow's knowledge base to Gemini for Google cloud https://news.ycombinator.com/item?id=39552701


This just shows how great of an acquisition GitHub was for Microsoft. Having access to all that code is a real differentiator for training.

I've heard that code helps models more generally, with reasoning and language, because it's clearly structured and things like conditionals and control flow apply to logic questions and making deductions and inferences in just plain conversation. It seems plausible to me, but I haven't seen anything formal about the idea.


Technically, I think anything MIT-licensed can be cloned and trained on. At the same time, Microsoft is under no obligation to provide an easy way for large language models to do it. Given that Microsoft hosts all of this content, they can easily have Copilot learn from every bit of code hosted on GitHub under a permissive license.


All user content on Stack Overflow is licensed under a Creative Commons license, and SE provides a regular data dump of all that data. This makes it difficult to create a robust barrier against AI companies using that data. They added some crawler blocks to their robots.txt, but I'm not sure there are any other technical barriers right now. And since Google is using the data under the CC license in the deal SO announced recently, there doesn't seem to be a license barrier either.

So I don't see how SE could robustly enforce it, though maybe that isn't necessary anyway.
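For reference, the robots.txt approach mentioned above is purely advisory and looks something like the sketch below. The user-agent tokens shown are ones the respective vendors publicly document for their crawlers; the exact set Stack Overflow blocks is an assumption here, not taken from their actual file:

```
# Hypothetical robots.txt fragment blocking common AI training crawlers.
# Compliant crawlers match on the User-agent token and skip disallowed paths;
# nothing technically prevents a non-compliant scraper from ignoring this.

User-agent: GPTBot          # OpenAI's training crawler
Disallow: /

User-agent: CCBot           # Common Crawl, a frequent training-data source
Disallow: /

User-agent: Google-Extended # opt-out token for Google AI training (not Search)
Disallow: /
```

Note that Google-Extended is a control token rather than a separate crawler, which is why blocking it doesn't remove a site from Google Search results.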


[dupe]

More discussion yesterday over here: https://news.ycombinator.com/item?id=39552701


I think this is a net positive and a wonderful alternative to ad revenue. I'll take a website that charges for AI training over ads.


I wouldn’t, because the data they offer is (among others') mine. When I contributed to SO, I did not intend for the data to be used for model training. I did it under a license that specifically requires my work to be attributed when used. And I still want that. Will AI regurgitate its code with mentions of its sources? I don’t think so… Now I won’t contribute anymore, but I can’t really remove what I already contributed, can I?


The reason they can put ads on their website is because of the user submissions and people seeking out your content. I don't see a meaningful difference.


The ads do not use the content and claim it as their own?



