Stack Overflow users deleting answers after OpenAI partnership (build5nines.com)
302 points by miles 38 days ago | 322 comments

About 5 years ago, StackOverflow messed up and declared that they were making all content submitted by users available under CC-BY-SA 4.0 [1]. The error here is that the users-content agreement was that all users' contributions are made available under CC-BY-SA 3.0 (and not anything about later). In the middle there were also some licensing problems concerning code vs noncode that were confusing.

I remember thinking that if any of the super answerers really wanted, they could have tried to sue for illegally making their answers available under a different license. But I thought that without any damages, this probably wasn't likely to succeed.

But now I wonder whether making all content available to AI scrapers, and OpenAI in particular, might be enough to actually base a case on. As far as I can tell, StackOverflow continued being duplicitous about which license applies to which content for half of 2018 and the first few months of 2019. Their current licensing suggests CC-BY-SA 3.0 for things before May 5 2018, and CC-BY-SA 4.0 for things after. Sometime in early 2019 (if memory serves, it was after the meta post I link to), they made users log in again and accept a new license agreement for relicensing content. But those middle months are murky.

I should emphasize that I know nothing.

[1]: https://meta.stackexchange.com/q/333089/205676

My understanding of licensing law is that something like 3.0 -> 4.0 is very unlikely to be a winnable case in the US.

Programmers think like machines. Lawyers don't. A lot of confusion comes from this. To be clear, there are places where law is machine-like, but I believe licensing is not one of them.

If two licenses are substantively equivalent, a court is likely to rule that it's a-okay. One would most likely need to show a substantive difference to have a case.

IANAL, but this is based on one conversation with a law professor specializing in this stuff, so it's also not completely uninformed. But it matches up with what you wrote. If your history is right, the 2019 changes are where there would be a case.

The joyful part here is that there are 200 countries in the world, and in many, the 3.0->4.0 would be a valid complaint. I suspect this would not fly in most common law jurisdictions (British Empire), but it would be fine in many statutory law ones (e.g. France). In the internet age, you can be sued anywhere!

> If two licenses are substantively equivalent, a court is likely to rule that it's a-okay. One would most likely need to show a substantive difference to have a case.

Which does exist and can affect the ruling. CC notably didn't grant sui generis database rights until 4.0, and I'm aware of at least one case where this could have mattered in South Korea because the plaintiff argued that these rights were never granted to, and thus violated by, the defendant. Ultimately it was found that the plaintiff didn't have database rights anyway, but it could have gone otherwise.

If there wasn’t a substantive difference, then there’s no need to make the change.

A super literal reading of some bad wording in 3.0 created an effect the authors say they did not intend and fixed in 4.0. Given the authors did not intend this interpretation, a judge is likely to assume people using the licence before it came to light also did not, hence switching to 4.0 is fine. Conversely, now that this is widely known, continuing to use 3.0 could be seen as explicitly choosing the novel interpretation (arguably this would be a substantive change).

> a judge is likely to assume people using the licence before it came to light also did not

Why would the judge have to assume anything? The person suing could simply tell the judge they did mean to use the older interpretation, and that they disagree with the "fix". They're the ones that get to decide, since they agreed to post content using that specific license, not the "fixed" one.

A license is between two parties; neither gets to choose exactly how it is interpreted.

But the people suing aren't trying to choose how the license is interpreted, they're trying to prevent the other party from changing the text. If the change is meant to "fix" how the text should be interpreted (which is what you said), then they're the ones trying to choose the exact interpretation.

The fact itself that programmers keep insisting on writing "IANAL" is maybe an example of that.

A court would probably not agree on the fact that writing "IANAL", not the full sentence, is a sufficient disclaimer.

I personally write "IANAL", not to reduce my personal legal liability, but rather to give a heads up to those reading that I am not an expert, that I am likely wrong, and that you likely shouldn't listen to me.

I feel there's a common thread that maybe should be some kind of internet law that people who make a point of noting they are not experts, are more often correct than people who confidently write as though they are.

You see this particularly with crypto, where "I am not a crypto expert" is usually accompanied by a more factual statement than one from the self proclaimed expert elsewhere in the thread.

In addition to "humility implies self-awareness", I'd like to point out a parallel thread of "disclosure implies honesty and diligence."

You can look it up and the Dunning Kruger effect is probably not real.

It's less that it's not real, but rather that the common interpretation of it is utterly false.

When I was younger there was a short period I thought it meant that a person was just really anal about details.

It's complex.

One cannot legally practice law without a license. The definition of that varies by jurisdiction. Fortunately, in my jurisdiction, "practicing law" generally implies taking money, and it's very hard to get in trouble for practicing law without a license. However, my jurisdiction is a bit of an outlier here. Yours might differ.

In general, the line is drawn at the difference between providing legal information and legal advice.

Generic legal discussions, like this one, are generally not considered practicing law. Legal information is also okay. If I say "the definition of manslaughter is ...," or "USC ___ says ___," I'm generally in the clear.

Where the line is crossed is in interpreting law for a specific context. If I say "You committed manslaughter and not murder because of ____, which implies ____," or "You'd be breaking contract ____ because clause 5 says ____, and what you're doing is ____," that's legal advice.

The reasons cited for this are multifold, but include non-obvious ones, such as that clients will generally present their case from their perspective. A non-lawyer will be unlikely to have experience with what questions to ask to get a more objective view (or even if the client is objective, what information they might need to make a determination). Even if you are an expert in the law, it's very easy to accidentally give incorrect advice, which can have severe consequences.

In practice, most of this is protectionism. Bar associations act like a guild. Lawyers are mostly incompetent crooks, and most are not very qualified to provide legal advice either, but c'est la vie. If you've worked with corporate lawyers, this statement might come off as misguided, but the vast majority of lawyers are two-bit operations handling hit-and-runs, divorces, and similar.

In either case, it's helpful to give the disclaimer so you know I'm not a lawyer, and don't rely on anything I say. It's fine for casual conversation, but if tomorrow you want to start a startup which helps people with legal problems, talk to a qualified lawyer, and don't rely on a random internet post like this one.

Do you actually need a disclaimer ?

I always assumed it was the same type of courtesy as IMHO, and someone taking legal advice from random strangers on the internet wouldn't result in any legal liability on the side of the commenters.

Yes, people have been sued before for giving advice that was acted upon.

I remember hearing about a construction engineer who was sued for giving bad advice whilst drunk to a farmer over fixing a dam. The dam failed and the engineer was found to be liable.

I can see the reasoning behind the case, as the engineer has plausible expertise in the domain and could credibly give actionable advice.

When it comes to lawyers, there is already a legal framework where lawyers are responsible when giving legal advice, even when it's not toward their clients, the same way medical professionals have specific liabilities regarding the medical acts they can perform.

Non-lawyers giving legal advice doesn't fit that framing, except if they explicitly pose as one. I'd also exclude malicious intent, as whatever the circumstances, if it can be proven and results in actual harm there's probably no escape for the perpetrator.

That’s possible because the engineer is licensed. A random guy giving bad advice and failing to disclose he’s not an engineer would do no such thing (so long as he didn’t suggest he was an engineer).

It is worth remembering that law professors have a vested interest in making sure the system works as you described. If contract law were straightforward, they'd be out of a job.

That's an admirable goal but if there are any "bugs" in the contract you probably don't want it executed mindlessly. Human language isn't code and even code isn't always perfect so I'd rather not be legally required to throw someone out a window because someone couldn't spell "defederate".

I agreed in the abstract, but not in the specific (the specific professor was one of integrity, and sufficiently famous this was not an issue).

However, it's worth noting the universe is a cesspool of corruption. If you pretend it works the way it ought to and not the way it does, you won't have a very good time or be very successful. The entire legal system is f-ed, and if you pretend it's anything else, you'll end up in prison or worse.

> if any of the super answerers really wanted, they could have tried to sue for illegally making their answers available under a different license.

they can plausibly sue people other than stackoverflow if they attempt to reuse the answers under a different license. but i think it's very difficult to find a use that 4.0 permits that 3.0 doesn't

3.0 has a "bug" that makes it risky to use materials without very careful attribution:


I don't think this is a practical issue, really.

I assume linking to the original answer is sufficient attribution.

In the link you can find name, license and figure out if the answer was modified.

Also linking the answer in a source comment is the smallest professional courtesy everyone should be doing.

If you have some issue of not linking an answer then you likely do not deserve the answer in the first place.

The blog illustrates that such assumptions about what's a sufficient attribution are fraught with danger, so "the smallest professional courtesy" can expose you to a $150k risk

If it is indeed CC-BY-SA, then OpenAI needs to publish their weights under the same license.

People put their content on the site for the public to use, and now the public is using it, it's just that "the public" includes AIs. Admittedly, a non-human public, nonetheless ...

The problem is LLMs don't provide attribution/credit which directly violates the license[0]

Otherwise search engines were already a "non-human public" that scraped the site but directly linked to the answers, which was great. They didn't claim it's their work like these models. The problem isn't human vs non-human. LLMs aren't magic, they don't create stuff out of thin air, what they're doing is simply content laundering.

[0] https://creativecommons.org/licenses/by-sa/4.0/#ref-appropri...

You have to agree on how your work may be used, and no one expected it would be sold for AI training.

I'm actually perfectly fine if StackOverflow wants to sell an answer I made to help train AI.

For me, the purpose of providing an answer is to help save others (and my future self) time, and I don't really mind if someone uses that in a private product - especially if it helps tools like ChatGPT which provide an insane amount of value given the low monthly price.

> I'm actually perfectly fine if StackOverflow wants to sell an answer I made to help train AI.

I’m not.

This was a collaborative effort to make the lives of programmers easier, and the data was always meant to be a public good. OpenAI – and, more importantly – all the other LLMs with pockets that aren’t as deep – should be able to just download the database and train on it for free.

I don’t care about any license. I don’t care about attribution. Learning isn’t copying, so copyright is irrelevant. I contributed about a thousand answers to Stack Overflow, all with the understanding that anybody can download and use them for free, not so they can be locked up by Stack Overflow.

What concerns me with deals like this is that it’s altering the cultural norm to expand copyright to cover not just copying, but use. Deals like this being made by OpenAI makes it more likely to cause pushback at the social and legal level when other LLMs are trained without these deals in place.

It’s akin to – and can possibly result in – regulatory capture, making it difficult for new startups to compete with OpenAI.

> the data was always meant to be a public good.

The words are a copyleft-able public good. Concepts, facts, and ideas are not; anyone can use them for anything, including making money. If you're actually worried about specific wording or other creative choices being unjustly used improperly by an LLM, then by all means that should be enforced. But those examples are just very rare, because the LLMs are very good at extracting facts from prose.

Good for you. I'm not. I contributed answers to StackOverflow because I use answers others have contributed to StackOverflow, not to ChatGPT, not for ChatGPT to monetize. I don't use ChatGPT and probably never will.

But the content you posted to SO was already permissively licensed. Other people can copy it, and make derivative works, and even charge money for them, as long as they cite your SO handle as the author. https://meta.stackexchange.com/questions/347758/creative-com...

ChatGPT is not citing anything. It can't possibly do that reliably with LLM weights alone.

(1) The announcement (https://stackoverflow.co/company/press/archive/openai-partne...) says things will be attributed in both the 2nd and 3rd paragraph

(2) It's only likely to attribute if it quotes verbatim... Just like a human. When I tell someone I learned that the second parameter passed to Array.map's callback is the index of the value just passed, I don't add "And I learned this on Stack Overflow from user gtriloni". It's just knowledge that I learned.

The only time I'd attribute is if I copied a snippet of code or a paragraph to quote in a blog post. For me at least, that almost never happens. I take the knowledge I learned and apply it to my own code. It's rare, if ever, that there is something on S.O. so useful that I copy it verbatim.
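For readers unfamiliar with the Array.map detail mentioned above, here is a minimal sketch. The array contents and variable names are made up for illustration.

```typescript
// The callback given to Array.prototype.map receives (value, index, array).
const letters: string[] = ["a", "b", "c"];

// Use the second callback parameter (the index) to label each value.
const labelled: string[] = letters.map((value, index) => `${index}:${value}`);
// labelled is ["0:a", "1:b", "2:c"]
```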

> Just like a human

An LLM is not a human. It is a tool operated by, in this case, a for-profit entity. It has no human rights, but its operator has all relevant legal obligations.

If it was, as you say, “just like a human” in relevant ways (think, feel, have self-awareness, etc.) then it would effectively be a slave subjected to extreme abuse.

Either it is a tool that generates derivative works at mass scale for profit and its operator should be liable for licensing/attribution violations, or it is a conscious being and we should immediately stop abusing it. Pick your poison.

Bing's version of ChatGPT/GPT4 cites sources. My limited understanding is that it uses your question to do a web search, brings the results into the context window, and then generates an answer that cites sources.

OpenAI could integrate StackOverflow the same way.
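A rough sketch of the search-then-cite flow described above, not how Bing or OpenAI actually implement it. The `SearchResult` shape, the `buildCitedPrompt` helper, and the example URL are all hypothetical; the web search and the model call are left out.

```typescript
// Hypothetical shape of a search result; not any real search API.
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// Put numbered sources into the context so the answer can cite them as [1], [2], ...
function buildCitedPrompt(question: string, results: SearchResult[]): string {
  const sources = results
    .map((r, i) => `[${i + 1}] ${r.title} (${r.url})\n${r.snippet}`)
    .join("\n\n");
  return (
    "Answer using only the sources below and cite them by number.\n\n" +
    sources +
    "\n\nQuestion: " +
    question
  );
}

// Example with a single made-up source.
const citedPrompt = buildCitedPrompt("How do I get the index in Array.map?", [
  {
    title: "Example answer",
    url: "https://example.com/a",
    snippet: "The callback's second argument is the index.",
  },
]);
```

Because each source carries its URL into the context, the generated answer can point back to where a claim came from, which weights alone cannot do.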

Doesn't Phind do this? It cites sources in its responses.

"The person you are upset with is technically permitted to do the thing that you are upset about" is not a good counter-argument to someone's distaste. Whether or not the licensing agreement _permits_ this usage, it is not the usage that the contributor (to whom you are replying) foresaw and was enthusiastic about.

I'm not telling them how to feel. They've been wrong for a long time.


Name calling and dismissive responses aren't going to win anyone over.

Please be more considerate.


One generally doesn't have to lean into phrases like "legitimate tactics" and "rhetorical power" when they've got the moral, ethical, or intellectual high ground. Telling people they're idiots is about the most counter-productive single strategy for addressing human stupidity ever conceived. 1. they won't believe you 2. they'll ignore everything else you have to say because you're a dick. So the real question is, who hurt you?

@dang Many individuals in this thread seem to require a gentle reminder regarding the expected etiquette on HN. https://news.ycombinator.com/newsguidelines.html

I think you're projecting something. Oblivion awaits you as it awaits these Gatekeepers of yours.

Oh your cheerleading here is going to age like milk when unemployment numbers start ramping up in white collar sectors. For the record, when construction and industrial jobs got deleted the chorus line was "retrain for service industry work". When service industry and white collar jobs really start getting the same treatment, what's the move now? We're literally running out of economic sectors to pretend folks can be funneled into.

All of this would be fine if the wealth were shared by the population. The big problem is that wealth is concentrated and only a small group will benefit from these technology shifts.

It's weird how our species has had evergreen problems around resource allocation for at least the last few thousand years.

> Oh your cheerleading here is going to age like milk when unemployment numbers start ramping up in white collar sectors.

You don't seem to understand that this is the goal. A very worthy one.

We won't get to a post-scarcity economy by doing the same things -- and the same jobs -- that got us this far.

You what now? You think AI is the path to luxury space communism? I'm missing the part where the 0.1% that owns and controls basically everything shrug and lean into redistribution of wealth...

They'll tell us to retrain for construction and heavy industry.

The price to get an answer from Stack Overflow is usually zero, as most questions have already been asked and answered. You don't even need an account.

They do serve ads, we should probably stop pretending "funded by ads" is the same as free. Your attention isn't free.

Suppose I walk up to a tent at a festival that has a big sign that says "FREE BEER", and I ask a person there for a beer. They hand me a beer, and I go on my way. Was the beer free? I think it was free.

Now, suppose I walk up to a Budweiser-branded tent at a Budweiser festival that has a big sign with a Budweiser logo on it that says "FREE BEER", and I ask a person there who is wearing a Budweiser polo shirt, a Budweiser lanyard, and a Budweiser hat for a beer. They hand me a beer in a Budweiser-branded cup, and I go on my way. Was the beer free?

I think that both of these beers were free.

Now suppose you walk up to a tent that offers you free beer, but before they give it to you, you have to burn 2% of your phone's battery watching an ad from them. Then they hand you the beer and you go on your way. Was the beer free?

And they also put a tag on your ankle identifying you as someone who likes beer, so that beer salesmen can come knock on your door tonight.

We've somehow gone from this:

> They do serve ads [...] Your attention isn't free.

to something like this:

> They tag my ankle to mark me as a person who enjoys beer, and make me watch an ad until 2% of my phone's battery is depleted, and then they come to my home and knock on my door at night to sell me beer.

...which... I mean, huh?

Stack Overflow is invading your body, restricting your personal liberty, and visiting your home? Really? That's a fucking thing now?

I think they were extending the original point you were responding to, and remixing your own mixed metaphor of free beer.

In the attention economy, advertising has a cost that is borne by the advertiser and the consumer, up to and including loss of property rights in the case of content relicensure and trespass upon devices leading to excess battery usage, as well as loss of privacy due to geotargeted ads.

>I think they were extending the original point you were responding to, and remixing your own mixed metaphor of free beer.

Perhaps. But having been to many festival environments, I can definitely imagine a tent offering "free beer" that is actually approximately free -- both with, and without a slathering of advertising. (Actually, I don't really have to imagine it -- I've been there and have had that free beer.)

I can't imagine them coming to my house and knocking on my door at night to sell me more of it, though. That's absurd.

>In the attention economy, advertising has a cost that is borne by the advertiser and the consumer, up to and including loss of property rights in the case of content relicensure and trespass upon devices leading to excess battery usage, as well as loss of privacy due to geotargeted ads.

Well, sure. When viewed on a long-enough timeline, it becomes abundantly clear that nothing is actually free, comrade.

I can produce my own beer on a hypothetical plot of land that nobody owns, and that nobody else wants to use, and I can give someone one of these beers. For "free."

But it still has a cost. (And this, too, is an absurd reduction.)

> I can't imagine them coming to my house and knocking on my door at night to sell me more of it, though. That's absurd.

I interpreted that as a tongue-in-cheek hyperbolic metaphor relating to the ways that ad auction networks and other kinds of geofencing and geotargeting allow for deanonymization and reidentification of individuals for conversion tracking and behavioral analysis.

That’s the thing about these technologies - they’re dual-use in the sense that those who see the upsides use them generally with good intentions and ideally with affirmative consent. Just like the relicensed content, though, once the data is collected, the original creators, publishers, and third parties may not be able to control where it ends up, which is a negative externality, I think most would agree.

My question is "how valuable is your time?"

I think at a festival it's a little tricky to value (if it pulled you away from seeing your favorite band play a song, maybe this cost you the equivalent of $X, where that's what you would pay to see them perform that song. If no bands were playing, you walk over while chatting with friends - the same thing you'd be doing if there were no free beer tent - it was free)

When I'm on stack overflow my time is valuable. I'm programming which can pay me something like $50-300/hour (maybe more?)

How expensive is the 1 second I spend reading an ad? Let's call it $50/3600. Is that expensive? By my most conservative estimate it's over 1¢.

Should we round that down to free given that I've spent hours/many page loads on stack overflow? I guess that's up to you.
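The arithmetic above can be checked with a small sketch. The 1-second reading time and the $50-300/hour range come from the comment; the function name is made up for illustration.

```typescript
const SECONDS_PER_HOUR = 3600;

// Cost, in cents, of spending `seconds` of attention at a given hourly rate in USD.
function attentionCostCents(hourlyRateUsd: number, seconds: number): number {
  return (hourlyRateUsd / SECONDS_PER_HOUR) * seconds * 100;
}

const lowEstimate = attentionCostCents(50, 1); // about 1.39 cents
const highEstimate = attentionCostCents(300, 1); // about 8.33 cents
```

At the low end of the range, one second of reading comes to roughly 1.4 cents per ad, which matches the "over 1¢" estimate.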

I mean, we can play that game if you want. Let's suppose that if we look hard enough, that every opportunity has a cost.

"Oh, a free concert downtown on Saturday? And you can pick me up at 2? Yeah, I do really like that band, and I sure would like to go -- that's pretty exciting, thanks for the invite!

But instead of making plans with you right now, I'd rather tell you about all of the ways I could be using my time on that Saturday afternoon instead.

No, no. It's not that I don't want to go. I just want to really drive home the idea that there's an opportunity cost to attending, so it can't really be free -- it can't be a free show for you, or for me, or for anyone else that goes. It's important to me that you realize that this "free concert" is anything but free.

Listen, I don't know what you mean by "dead-ass loser." I'm just being a realist here!

Oh, so now you're saying that you're not going to pick me up on Saturday? Some friend you are! I haven't even fully amortized this yet!"

I think we're maybe gleefully posting past each other, but the point I'm trying to hit is that business models matter. Stack overflow provides a service. It's a good service. They host a great q&a platform for developers and myriad other category enthusiasts.

However, they have a business model. They are categorically different than eg Wikipedia. It's important to understand that.

This business model matters because it tells you what economic forces will lead them to do. When business models break down at public companies they commit acts of desperation. On an ad run site that will mean more ads, more invasive ads, etc.

As you're forced to sit through 30s unskippable ads on YouTube I hope you think "I'm so glad this is free"

I mean... Over here in my little reality, I have never seen ads on YouTube or on Stack Overflow.

Unironically, folks are being triggered by trigger warnings now.[1]

Imagine how “free” the beer in your hypothetical scenario is to an alcoholic struggling to stay sober.

Capitalism commoditizes even protest against it and repackages it as a product or service.

None of this is to assign blame to good faith actors in a so-called free market, nor is it to abdicate responsibility on behalf of so-called free agents. Just a counterpoint.

[1] https://pjvogt.substack.com/p/what-do-trigger-warnings-actua...

What if someone took your answers, put them in a book, claimed they wrote everything themselves, and then sold the book for money?

Then they'd likely get sued, because the license for the answers is CC-BY-SA; putting them in a book, claiming they wrote everything themselves, and selling them are all against the license.

On the other hand, if they read my answers and wrote a book about what they learned (not copied), there'd be no issues.

Well if the book was doing well, I might clone it and sell a few copies myself

Let's be real, SO is a troubleshooting site. It's not our personal collection of code or project sources.

I don't expect to be paid when someone asks me for directions, and I'm sure lonely planet didn't source their guides 100% organically either.

What if I read your answers, claimed I learned everything myself, and sold my skills to a company for money?

That would be ok.

That would be a very different scenario. Learning isn’t copying, but that is.

You're being taken advantage of for a subscription product. It's one thing to give to a community, but it's wrong for an enterprise to come in and capitalize on the value of it. It's the equivalent of going into an animal sanctuary, slaughtering all the animals, and selling their pelts.

Your position lays bare the new and industry-destroying economic problem introduced by opaque-data-source LLMs. The economic value provided by the originator is captured fully and completely behind rentier models.

Beware the ease and convenience of all that "insane value". This way lies digital serfdom.

I would be fine with it if the 'AI' in question was free, and bonus if open source.

However it is a product of a next monolithic behemoth company that earns money on it and I suspect has nefarious motives to make profit.

That’s the whole key thing for me that makes me feel scammed. That and not asking for permission.

Future true AI would be potentially bigger than nuclear fission with all the consequences. Handling this in a petty capitalistic way makes me think the outcome will be close to fallout games that were supposed to be only an exaggeration.

Those companies must stop behaving like thieves. In fact it is a literal theft.

ChatGPT provides far more value than StackOverflow currently. It's not just trained on SO answers but all of the manuals/help pages, Github issues and forum posts. In addition you can continue a conversation. No rigid format or gatekeeping like stackoverflow. I don't see a real use case for Stackoverflow now. If I want to ask humans, Discord/IRC channels are far better option.

> No rigid format or gatekeeping like stackoverflow.

What bothers you about gatekeeping? I could guess, but I'm asking so you say it out loud. Then you can compare it against other problems, such as moats (competitive barriers).

OpenAI spent something like $3M on training GPT-3. This is a pretty big moat. But almost certainly more valuable in dollar terms is the first-mover advantage which provides millions of human eye-hours used for RLHF.

I wouldn't be so eager to trade the gatekeepers you so fear for even an openly available chat service that is happy to automate away as much information work as possible.

The Stack Overflow model is (was) pretty darn good -- people help each other out, the company made money, some people got noticed for their skills, products got built faster and better (on the whole, I hope). Contrast the human-generated content era to what we have now, which appears to be the machine-ingesting content era. There are legions of lawsuits against companies scraping data without permission and/or attribution.

Those companies know it is unethical at best but make quick bucks before the laws and suits follow. It’s the Wild West era and they found the gold.

If it is unregulated then it will be exploited to the maximum profit, consequences be damned.

> I wouldn't be so eager to trade the gatekeepers you so fear for even an openly available chat service that is happy to automate away as much information work as possible.

Don't flatter yourself. People want to solve their problems so that they can build what they want to. They don't have time for shenanigans from internet jerks who get their validation from imaginary internet points.

It can't reliably cite its source for an answer.

Hardly matters for Stackoverflow-like questions if the provided solutions work/solve the problem you're having. Which for me happens the majority of the time (with GPT-4, not the free version).

If you copy-paste solutions from SO then please at least cite your sources and their license (CC-BY-SA).

You might not want to hear this but no one does this. Should they? probably. But most people don't use Ctrl+C, Ctrl+V in the first place for SO answers.

Just a single data point, but when I copy & paste a snippet from Stack Overflow, I always add a comment "// source: https://stackoverflow.com/questions/xxx#yyy".

I both find it respectful of who wrote the answer in the first place and useful for future users of the code: the Stack Overflow answer often provides context and explanation for what would otherwise be an obscure piece of code.

Pretty darn useful if you ask me: those who want to have more information can follow the link, casual readers can skip it, and the whole process is fair to the author.
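A sketch of what such a source comment might look like in practice. The `clamp` function is a made-up stand-in and not taken from any real answer, and `xxx#yyy` is the same placeholder used in the comment above.

```typescript
// source: https://stackoverflow.com/questions/xxx#yyy (placeholder IDs)
// license: CC BY-SA (the version depends on when the answer was posted)
function clamp(value: number, min: number, max: number): number {
  // Restrict value to the inclusive range [min, max].
  return Math.min(Math.max(value, min), max);
}

const clampedHigh = clamp(5, 0, 3); // 3
const clampedLow = clamp(-1, 0, 3); // 0
```

The link doubles as documentation: a later reader can follow it to see the explanation and any edits to the original answer.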

I don't think I've ever copied enough from Stackoverflow for copyright to become relevant. Rarely more than one line verbatim.

It embarrasses me to think that somebody should feel obliged to cite me when they use one of my answers. I don't know how to take the partnership with OpenAI though. They bill me when I use their service; it's not collaborative like Stackoverflow.

No one should copy paste any solutions from anywhere. FWIW, 99% of the content in SO is hardly "original", mostly copy-pasted themselves from previous solutions or original user guide/manuals.

In general I'd agree that it's best to use answers just as a guide. That said, I wasn't trying to pass judgement, just ask attribution which is a best practice and often required by the license itself.

I'd rather not go round in circles while ChatGPT feeds me bullshit information. When this happens I go to Google and read a SO answer with the correct information and also get an informed discussion around the subject.

For the easy answers LLMs are fine, but I usually want an answer to a niche issue or edge case, where LLMs have to be constantly told they are plain wrong, before getting to something resembling an answer.


You've been breaking the site guidelines so often and so badly that I've banned this account:

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.

No it doesn't. It is overly censored.

Maybe a low price for you but not for everybody.

ChatGPT serves 3.5 for free. You can run llama locally for free. Lmsys is free.

You think that will stay this way?

It will either become paywalled or full of ads.

I listed 3 things that are free.

Personally, I don't think ChatGPT will start running ads in the next ten years. However, let's assume that it does.

Lmsys is for research, I suspect if it runs ads it will be like godbolt (a small ad from a relevant sponsor).

Llama 2 and 3 can always be run locally without ads. I make no claims about future versions.

The OpenAI partnership doesn't really affect the core issue here around users deleting their content. That has never been welcome on Stack Overflow and when noticed usually was reversed. This is in accordance with the license as far as I understand the legal aspects, and in general it makes sense for me as it ensures that the content stays useful.

The content is also CC-BY-SA, which is much better than what you get on essentially every other large site that hosts community content. But the same license also means that you cannot take that content back: even if Stack Overflow allowed deletion, anyone else could scrape or download it before it is deleted and reproduce it according to the license.

Users can still remove their name from their posts, and if they wrote personal details those can be redacted as well. But you can't remove good-quality content from the sites later; that is likely to be reverted.

The problem isn't that Stack Overflow is allowing people to scrape the content. The problem is that Stack Overflow is preventing some people from scraping the content, in order to collect money from others. And, incidentally, passing zero of that money on to the people who actually created the content.

(Nearly) none of the people who are presently pissed off would have complained if Stack Overflow had continued to allow all comers to scrape the content and train LLMs on it, nor if Stack Overflow had released the entire finished collection of content under the same CC-BY-SA license that was demanded of each contributor.

With the OpenAI partnership, and similar shenanigans leading up to it, Stack Overflow is relying on obscure technicalities to violate the essential spirit of the original deal.

Isn't the data publicly available? https://archive.org/details/stackexchange

The publicly-available archives released by Stack Exchange are updated roughly quarterly and have the attribution requirements as specified by CC BY-SA + the Stack Exchange ToS.

The article makes it sound like OpenAI is using the API though, rather than the archives. The API and live sites forbid scraping within the acceptable use policy, as seen here: https://stackoverflow.com/legal/acceptable-use-policy

Given the CC license, and the fact that contributors can apparently code, they should scrape the content and be done.

Of course, that’d mean bypassing the scraper blocker. This article is a decent starting point:


> And, incidentally, passing zero of that money on to the people who actually created the content.

I mean that is basically SO's entire business model.

People do tons of work for free and SO runs the service and monetizes it.

I don't get how you can release something under anything other than all rights reserved without identification. We need to be able to persecute you in case you are not the author. Or is it that I may republish anything under any license?? It could be that the platform licenses it in the ToS, but with CC are they not obligated to make it available without obstruction?

Prosecution and persecution are two different things. Persecuting anyone is not a good time :)

Why, if you're not allowed to release under a license, should you be able to release all rights reserved (which can still be a copyright violation!)?

If you need to prosecute the person, there are established procedures for that: DMCA, or ultimately a lawsuit over the infringement. That you didn't identify yourself publicly on the site does not make that impossible. In fact the point of the DMCA was to make it easier to handle this - because if the provider doesn't comply with your DMCA, you can sue the provider.

Requiring identification to publish so that copyright is protected would be massive overreach, and this sort of thinking is why I think copyright is a dangerous concept that needs to be sharply curtailed, not expanded to cover AI training.

In practice, the safest course is to not use content from untrustworthy sources in ways that require a license (aka in ways that are not fair use in your applicable jurisdictions).

I think by default you just can't use things? Who thought that was a great idea I don't know. We must be missing an enormous chunk of progress.

Every jurisdiction has its own idea of fair use? That's just hilarious.

I never really thought about people's privacy either, but at first glance you seem to be right.

Do you have any solution to the puzzle? People are quite attached to the concept and many build their house on this soil. Appeal to tradition?

StackOverflow are violating the SA part of CC-BY-SA by selling special access to the CC-BY-SA content to one party and blocking others from the same thing.

OpenAI are violating both BY and SA but that's a separate issue.

Everyone who contributed work, did so under terms that the work was free for all, not a resource that one party can sell to another party who then sells to end users. Those end users were meant to have it directly without having to pay openai or anyone else, and if any bulk/scraping access is allowed for anyone like openai, everyone else has the right to the same thing for no more than a "shipping & handling" charge to cover the network & employee cost to physically deliver the data.

What are StackOverflow selling, and/or what exactly are OpenAI paying for? What are the goods or services being traded for the money?

There are many possible answers but I see no answer that doesn't ultimately one way or another wind up resolving into a violation of one or more terms of CC-BY-SA by both StackOverflow and OpenAI.

I guess the core issue was always having a for-profit company preside over a "free" product. Clearly, they have to make money, and they aren't bound by the ethics of open source. Contributors may feel like they are contributing to a FOSS project, but they aren't. What Stack Exchange is doing is probably legal (?) and that's the bar they need to clear. The contributors aren't stakeholders and SE only needs to retain enough of them to sustain themselves commercially.

There's been more than a decade of companies now providing something for free, while they figure out how to monetize it, and these always scare me a little, because it's always going to end up like this. Users of Facebook becoming eyeballs for ads, GitHub users providing free data for LLMs, SE selling data to OpenAI...

If a product is free, then you are the product. And if you don't know how you are monetized, you're going to be disappointed by it sooner or later.

Harsh but true. I think what stings about SO is that developers are the ones losing here. I think this will prompt less open source and encourage more private work. I hope people are seeing that they are being taken advantage of on many fronts.

StackOverflow has always been quite open that they're primarily building a dataset for SEO, rather than being a user-centered website. So I don't feel this deal changed much. SO users are still serfs building them a dataset for sale, only the buyer has changed.

LLMs are faster and infinitely more patient than interaction with StackOverflow, so I don't expect SO to survive for long. They're in crisis regardless of whether they sell to OpenAI or not, so they may as well get something out of it before they're decimated.

I think they're in crisis because they sold out their community, not because LLMs are better. As a developer, if you offer me StackOverflow vs ChatGPT, I'd take StackOverflow any day of the week 100x over.

I'm in the opposite boat. Going through Stackoverflow answers has become quite a chore.

For simple things GPT gives me the correct answer most of the time. And even when it doesn't, it's quicker to discern that it is wrong than to parse a given SO page.

Of course I still use SO for more complex questions.

As a rule, if I can quickly find the answer via SO, then chances are GPT will give me the answer more rapidly.

Respectfully, how would you know if you never use ChatGPT?

I said I don't use it. I didn't say I've never used it. In my experience browsing SO is way easier, more accurate, more precise, more controllable, navigable, and ... gives attribution.

For some reason, a lot of the answers here seem to care more about "but tell 'em /I/ solved it" re: attribution than about helping the user. Somewhat egoist or some such? (And I don't mean it in an aggressive tone, just ESL so I don't know how to say it otherwise.)

If I license something as MIT, I personally don't care who uses it for what purpose, hell I don't even care generally that they attribute me. I put it out for people to use. But maybe that's just me.

you spend more time on SO than me. without looking, can you name three stack overflow contributors? I can't.

I was offered a job a few years ago by someone who saw my Stack Overflow answers, does that count? I don't see something like this happening with ChatGPT.

I can do two, Jon Skeet (C#) and S. Lott (Python) are names I remember for providing great answers.


>As a developer, if you offer me StackOverflow vs ChatGPT, I'd take StackOverflow any day of the week 100x over.

Really? Hm, I wouldn't. I can use nuance and clarify my answers and have a respectable back and forth (GPT-4 doesn't call me names when I mess up or say something dumb) and arrive at an answer.

> GPT-4 doesn't call me names when I mess up or say something dumb

I’ve heard this accusation a lot, but I don’t think I’ve ever seen it happen. People call you names on Stack Overflow? Where?



Marking duplicate. "You should attempt searching before asking such obvious questions."

This question has already been answered here: < https://news.ycombinator.com/item?id=20861356 >

Closed 3 seconds ago.


or some such ;) You may not come across it personally, but that doesn't mean it doesn't happen. SO is successful as a QA platform (or was anyway) despite this shortcoming, not because it is a feature and it doesn't happen. If a lot of people are talking about the same thing, maybe people should at least pay cursory attention to the issue rather than "No, it doesn't happen" (Not aimed at you, but there are absolutely comments like this every time this gets brought up.)

You linked to a discussion of about a hundred comments. I skimmed it but didn’t see name calling. Can you be more specific?

Are you sure that's not an X vs Y problem???

I actually have no idea what you mean. can you clarify pls?

It's a common non-answer on stack overflow.


lol that makes sense, thanks.

And I'd take ChatGPT any day of the week 1000x over. That doesn't mean anything.

> SO users are still serfs building them a dataset for sale

That is a very negative spin.

Users get access to other people's answers for free. They get that free service and are required to contribute nothing. Those that do contribute do it to help other users. S.O. isn't doing anything bad. They're providing a free service where everyone wins. Users get answers. Answerers get to help other humans at scale. S.O. makes a little money.

As for the dataset, it's been available under CC-BY-SA for years. The entire database is backed up and made available here for free every month.


There are even free tools to query it here


Why does a company make money from someone’s free work? This is obviously not okay. We have even more egregious examples but this is certainly one of them.

The company is paying the people working by providing a free service.

It's like youtube. Youtube provides a free hosting of your videos. In exchange they monetize them. You're free to host them on your own servers. That will likely cost you way more than putting them on youtube. So you're getting something from them. You're also getting their advertising service to monetize your videos. You could do it yourself, hire a bunch of people and try to get companies to put ads on your self-hosted videos. Again, unless you're wildly successful it's unlikely you'll be able to do that and make a profit. So, youtube is effectively paying you.

Same with Stack Overflow. They're providing the servers, the bandwidth, etc. It costs them $. They're providing that service to you.

> StackOverflow has always been quite open that they're primarily building a dataset for SEO

Do you have a source / more details about this? What good is SO's content for SEO?

Side related question: are there content licenses coming up that are similar in spirit to what the GPL is but targeted at AI training? (E.g. if this piece of content was used in training an AI that was to be used commercially, the AI's weights must be published)

The argument AI companies make is that LLMs are not derived works of their input, or that training on it is fair use. So according to them, the input's license does not matter.

Do you have any sources for that? I'm just curious :)

I suspect they will fail to emphasize the ShareAlike property of CC BY-SA 2.5/3.0/4.0 which is incredibly strong - "ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original". This is an incredibly wide and vague definition, especially "build upon" which will be unattractive to many users.

I suspect that if ChatGPT quotes an answer or a snippet, it will show attribution and a license for the snippet. If it instead only uses the knowledge it gained from the answer/snippet and writes its own answer, then, just like a human, it won't attribute.


It was especially hilarious to watch the CTO of OpenAI get asked if they scrape YouTube and be unable to say yes or no [0]. Possibly one of the most important sites on the Internet, and their CTO claims ignorance.

[0] https://www.reddit.com/r/ChatGPT/comments/1bfa7s3/openai_cto...

I am thankful we have LLMs so we don't have to deal with SO. Ideally, as little as possible. SO can be a pretty toxic place filled with elitism and care for procedure over actually helping people, which is not totally unreasonable from their standpoint but it's definitely not what people are visiting the site for. Quite ironically, one of the major complaints I get is that LLMs output wrong answers here and there, ignoring that many of the answers on SO are also completely wrong or irrelevant to the core question being asked. And mind you, also outdated (I regularly have to click through the sorting to make sure answers are actually still relevant).

If we could merge the two to get the best of both worlds, and have LLMs that know how to write well and are validated by humans on the site, that would be great. Maybe not great for the folks looking to accrue internet points but absolutely great for users.

That's great for now. It's not clear to me, though, where LLMs will get their training data from here forward without ingesting lots of LLM-generated code and answers and eating their own tail.

Didn’t you get the memo? LLMs are either already capable of reasoning or just a step away from it, so there is no need for human-generated training data in the future.

Or at least that’s what 3/4 of HN commentators believe and all AI CEOs want you to believe.

yea that's bullshit. They are capable of stealing intellectual property though.

OpenAI and Microsoft get TONS of user-written code w/ quality feedback, OpenAI through ChatGPT and Microsoft through VS Code and Copilot.

That's only now and in the near term future. If AI is actually successful, every year the amount of human written code will decrease. That's the whole point of this.

They'll get it from human generated archives from before the singularity.

Well, yes, but software doesn't hold still, so the answer for "How to do xyz in whatever-replaces-reactjs" might not be great.

Does it matter if stack overflow is toxic or not? You're there to ask a question and get an answer. If you ask wrong, you get corrected. Tough moderation makes search much faster and better for other askers.

You're there to ask for help, not make friends. They have to be polite, but not gentle.

Yes it does. If I am belittled instead of people asking clarifying questions so I can learn, I'm much less likely to think better of said people or platform, or use it.

This is what killed perl

What you see as elitism is mostly simple curating. You can't store everything because it makes retrieving value from the store that much more difficult. It's the same with wikipedia and other public content repositories. People cry elitism and gatekeeping but without curation you eventually end up with a haystack of mediocre looking for a needle.

This “curation” is what is killing SO. Software is soft. It changes. There is no “one true answer for all time”. It’s honestly sad how many times I search for an answer, only to see the exact question I’m looking for closed as duplicate, then when I look at the “duplicate” I see that it’s an out of date answer.

Stack Overflow could have solved the problem of duplication so many ways. Why not categorize and bucket duplicate answers? They could have even had yearly recurring questions with the most up to date answer! Why not add beginner/hobby/expert rankings to questions so that the people answering don’t get sick of seeing beginner questions all the time?

There is so much SO could have done, instead they rested on their laurels and now they’re left with an out of date repository. What use is a curated repository if it will only help me solve problems with solutions from a decade ago?

It sounds like what you want is Quora. You can go ahead and use Quora for all of your software question needs.

Who says the solutions from a decade ago are not still correct or the best way to solve a problem? Just because ChatGPT regurgitates something today with the words moved around doesn't mean it contains "new" insights.

I agree in part, but why aren't other moderated outlets where users can ask technical questions given the same label? Reddit, Quora and HN are also curated; are content removals on these sites taken as elitist? Even if these places are less heavily moderated, I have no trouble surfacing relevant answers using any search engine's in-site search.

I am not talking about QA quality on any of these sites here, but the elitist stigma that has seemingly followed SO for so long.

[0]: https://meta.stackoverflow.com/questions/262446/are-we-being...

> why aren't other moderated outlets where users can ask technical questions given the same label

The exact label aside for a moment, reddit and HN mods often face backlash for their actions. But beyond that, Wikipedia and SO stand out in this regard because of their transparency regarding the curation. Mostly, reddit curation happens in the background, without much explanation. SO and Wikipedia basically spell out their actions and reasoning.

Another difference is that with reddit and HN, you have no real recourse. At least with Wikipedia (I'm not too familiar with SO policies in this regard) you can appeal decisions, open discussions about policies, etc.

I have to agree with GP - people often mistake the 'bureaucracy' of sites like Wikipedia and SO as something unnecessary that the editors force on everyone, but the fact is, it's necessary to create and maintain a high-quality repository of information.

> SO and Wikipedia basically spell out their actions and reasoning

You're able to appeal on SO as well. It's interesting to think about a situation where moderation decisions would be more in 'the background', as you say (like Reddit/HN), and whether this takes away from the perceived 'elitism' some moderation practices are accused of.

In my experience on the above sites, and as a (small) community manager, it absolutely plays into it. A lot of people just instinctively respond negatively to displays of authority.

On the other hand, I think it's an important aspect of a community/platform if the goal of that platform is to be transparent and open, which I think is an important aspect of SO and Wikipedia, and I hope more platforms would adopt that view. I think whatever "elitist" perception such platforms have to suffer is well worth having high-quality, open platforms.

(I will say that no platforms are perfect of course, including SO or Wikipedia; there's plenty of criticisms to go around about specific policies and decisions. See: TFA :P)

This is an insightful observation, and a problem we struggled with for years on Stack Overflow: if you keep moderation quiet and anonymous, there's a lot less criticism, seemingly less hurt feelings... But also very little correction. The Star Chamber works great until corruption sets in; finding a good balance between secrecy and transparency is a challenge.

For years, moderators signed their names to messages like the one cited in the article. After one too many cases of a volunteer being called at work or having their family harassed or sent a suspicious package in the mail... That particular bit of transparency was eliminated - the cost was too high for the limited benefit. OTOH, it used to be very difficult to find your own deleted posts but that has slowly gotten better (including visibility into who deleted them) - turns out the benefit there was substantial (identifying wrongly-deleted posts & curbing over-enthusiastic curators), while harassment has been mostly limited to occasional grousing.

> After one too many cases of a volunteer being called at work or having their family harassed or sent a suspicious package in the mail

This is why I'll never use my real name casually on the Internet, and why the idea of widespread identity verification on the Internet scares the crap out of me.

I actually strongly prefer Wikipedia to SO, on Wikipedia the old now-wrong content can just get edited out, on SO you'll have to dig through all the 300-point popular answers from 2012 to find the new answer that says "yeah none of that is right anymore, instead do this"

SO is far from curated, I guess is my point

Their curation blows. The whole premise of having a canonical answer to a question is dumb. Most programming languages and libraries are always in flux. The whole nature of many questions changes over time.

StackOverflow is a tyranny of mediocrity. It is a bunch of middling programmers shitting on newbies, and driving away experts because you get severely punished for not being mediocre.

I had a question closed as a duplicate for being too similar to another question that I directly cited in my question as being subtly different and not applicable. (Because I anticipated some idiot closing my question... and they went and did it anyway.)

>I am thankful we have LLMs so we don't have to deal with SO. Ideally, as little as possible. SO can be a pretty toxic place filled with elitism and care for procedure over actually helping people

There needs to be a term for this. Perhaps "The Wikipedia Effect."

From a search, the message seems to have been in place since at least 2017[0] and I'd suspect is automated on detection of mass-deletion.

I can understand the reason for the policy (in some ways SO functions more like a wiki than a forum) and it doesn't seem to have been introduced to quell the protest against OpenAI.

[0]: https://meta.stackexchange.com/a/296822/287788


StackOverflow is banning accounts that delete answers in protest against OpenAI - https://news.ycombinator.com/item?id=40297027 - May 2024 (103 comments)

Thanks to the people who delete their answers, now I have to pay OpenAI to find answers they already scraped. Talk about helping OpenAI make more money :(

It is nearly impossible to delete an accepted answer you don't want to have any more. I've had several which are wildly out of date and incorrect now and I don't want to update them, but the mods refuse to remove them.

can't you just comment on the post informing people who land there that it's out of date? I'd prefer that over following a cached link and hitting a 404

You can request to dissociate them from your account. (Better than nothing.)

there are several nice libraries that allow you to generate plausible sounding gibberish

this one is particularly nice and easy to use: https://github.com/jsvine/markovify/

you give it a file of existing text and it generates complete rubbish that would pass most automatic filters

These are far more likely to come to moderator attention by user flags on the edited posts.

for the AI to be useful it has to be continuously updated with new good data

so add small bits of rubbish slowly over time, and don't even contribute again

it'll take a while to completely destroy the AI business model, but we'll get there

> but we'll get there

at some point, it'll be too late. the horse has already left the barn.

besides, if the site owner makes a deal with the devil, there's nothing you can do other than quit using the site. people are still using social platforms more than ever, so stopping isn't going to happen.

the more likely to happen is that accounts deemed to be polluting the waters will just get suspended with no recourse to have it re-instated.

> at some point, it'll be too late. the horse has already left the barn.

I don't think this is true: the technology is useless unless it parasitises new knowledge continuously

it sows the seeds of its own destruction by reducing the value of past and future contributions to zero

> the more likely to happen is that accounts deemed to be polluting the waters will just get suspended with no recourse to have it re-instated.

so this is also perfectly acceptable: once they've banned the top 20% the site effectively becomes read-only, and the AI knowledge previously parasitised from it atrophies with no replacement

Known knowledge doesn't disappear. Once it knows how to apply an FFT and when, it doesn't need to continue to read about it. It's not a human needing continuing education. Once it knows that Henry VIII had many wives, it doesn't need to keep reading that he had those wives.

Sure, if something new happens, then it's not like SO is the only place it's scraping for new information. If you honestly think that you/we will get to a place to block all scraping, I will just politely disagree.

> Once it knows that Henry VIII had many wives, it doesn't need to keep reading that he had those wives.

That's actually incorrect, it needs to constantly ingest new data. If it ingests enough data (from other LLMs that are hallucinating, for example), then suddenly when it has enough bad data it'll start telling you that Henry VIII was a famous video game on the Sony 64.

It has no concept of 'truthfulness' beyond the amount of data that it can draw correlations from. And by nature LLMs have to ingest as much data as possible in order to draw accurate results from new things. LLMs cannot function without parasitizing off of user generated content, and if user generated content vanishes then it collapses in on itself.

So fill the entire internet with factually incorrect, useless knowledge? This would be a good thing?

Well, that's already happening. Google search has become increasingly useless thanks to SEO-focused AI-generated schlock. It's the inevitable outcome of LLMs. Sites have an incentive to hide that they're AI generated and LLMs have no real way to filter for ingested data made from other LLMs. The only difference is how long the ruse can be kept up.

So you want to pollute the commons just as the people filling the web with SEO-focused AI-generated schlock? Do you feel justified in polluting the commons to serve the ends you desire?

Do you actually have a solution to the problem of companies using LLMs to steal from other people and repurposing it as their own, other than figuring out ways to ensure that LLMs suffer for doing so? And frankly as I mentioned, LLMs are already polluting the commons; you're not offering any solution on that front either other than asking people to keep supplying it with fresh data so that it doesn't poison itself.

Do you realize that your stance is merely your opinion? Does everyone agree that training ANNs is stealing?

Scorched earth policies are always en vogue, and easy to offer as a knee jerk reaction. They do nothing for actually making forward progress in the conversation though.*

*However...there are times where the best solution is a match and some gasoline.

What's your stance on a future open source model that is as capable as any commercial models?

Also, I'm curious, do you consider LLMs to be incredibly error prone and untrustworthy?

Or do you think they are going to replace software developers?

Sounds about as successful as people destroying social media by removing or editing their posts. Only a tiny minority actually do anything like that.

It sounds like the measure of preventing users from deleting/editing their posts contradicts EU laws?

Only if they are putting their own personal information in their answers, which I assume they are not.

Human answers are personal answers.

Is there anyone who makes stackoverflow their first stop for programming questions anymore?

Google ranks it pretty high, so it is the de facto first stop for many.

Ah. I've found various LLMs are much easier to query and generally nicer than SO posters, so it's been quite a while since I've needed to visit SO. I assumed most people had made a similar journey.

ur in a bubble, harry

Not so much anymore though. I've seen over the last year that SO ranks lower and content farms like geeks4geeks, Programiz, etc. are getting much higher in results.

i still google things, mostly out of habit. but i'd say half the time i visit stack overflow, the answer i get there is either outdated or too opinionated to be useful and i end up going to chatGPT.

It’s probably time for a pro publico bono stack overflow alternative. When money is involved those things tend to destroy themselves sooner or later.

And why would SO even profit from the hard work of thousands of volunteers? It doesn’t seem very ethical.

Given how the industry has treated tech workers, this will be exploited. I'm interested in joining a private group with or without profit motive, that is not open source.

> Users are also asking why ChatGPT could not simply share the source of the answers it will dispense in this new partnership, both citing its sources and adding credibility to the tool. Of course, this would reveal how the sausage of LLMs is made

What? Surely the answer to that question is that ChatGPT doesn't know where the source of its answers is, isn't it? Isn't the question itself based on a fundamental misunderstanding of how LLMs work?

I haven't used it extensively, but when I ask a generic coding question in Brave it gives me an AI response and it does list source websites. Not sure if those are the actual sources or if it's just pulling them from a search or what.

Could always perform a normal web search with the LLM result and show matches
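
A rough sketch of that idea, assuming the web search step has already returned candidate passages. The `rank_sources` helper, the `candidates` dict, and the example.com URLs are made-up placeholders, and `difflib` string similarity only stands in for whatever matching a real system would use:

```python
from difflib import SequenceMatcher


def rank_sources(llm_answer, candidates, threshold=0.5):
    """Return (url, score) pairs whose text resembles the LLM answer.

    `candidates` maps a URL to passage text found by a normal web search.
    Both the helper and its threshold are illustrative assumptions.
    """
    matches = []
    for url, passage in candidates.items():
        # ratio() is a similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, llm_answer.lower(), passage.lower()).ratio()
        if score >= threshold:
            matches.append((url, score))
    # Most similar passage first.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical search results; the URLs are placeholders.
candidates = {
    "https://example.com/a": "Use list.sort() to sort a list in place.",
    "https://example.com/b": "Bananas are rich in potassium.",
}
answer = "use list.sort() to sort a list in place"
print(rank_sources(answer, candidates))
```

This only shows pages that happen to resemble the output after the fact; it says nothing about what the model was actually trained on.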

It "doesn't know" ;) nice.

This article is extremely biased towards SO:

> Stack Overflow and OpenAI have joined forces through a new API partnership. This collaboration aims to provide developers with a powerful combination of Stack Overflow’s vast knowledge platform and OpenAI’s advanced AI models. Through the OverflowAPI access, OpenAI users will benefit from accurate and verified data from Stack Overflow, facilitating quicker problem-solving and enabling technologists to focus on priority tasks. Additionally, OpenAI will integrate validated technical knowledge from Stack Overflow into ChatGPT, enhancing users’ access to reliable information and code.

Come on. Was this taken from a Press Release?

> it can be disruptive to the entire community to delete or remove content that might be useful to someone else. Even if this content is no longer useful to you as the author. [sic]

> As for the rest of us Stack Overflow users, I would not recommend jumping to delete your own content in protest too.

> To be fair to Stack Overflow, the warning email and suspending of accounts is likely not a new thing.

I can't find a negative word about SO in this entire article, so "to be fair" doesn't seem meaningful.

If you check the byline, the author is a Microsoft MVP / product evangelist. So I don't think he's biased towards SO so much as he is biased towards anyone doing business with Microsoft (or OpenAI). He also seems very pro-GitHub Copilot.

Why? They don’t want the knowledge to be more accessible to the masses all of a sudden? Also, all those answers are backed up somewhere.

Perhaps they don't want their answers to be systematically leveraged to put them out of work.

Though considering the site terms and CC license, I don't think deleting will actually help much.

As an early user of SO, I certainly don't want my answers sold for profit again and again.

The SO license is CC-BY-SA and has been since the beginning, not CC-BY-NC-SA.


Cc-by-sa has had a few revisions.

Attribution would be lost in an LLM, no?

Your answers are already sold for profit again and again. That's the whole point of SO existing, or maybe you are under some delusion that SO is a charity?

No delusion.

Specific licensing and side deals are different, to me at least, from scraping.

Then you shouldn't have been posting on SO.

If you want to contribute to the commons, contribute to the commons. If you want to contribute to the commons without commercialization of your work, contribute with some non-com license [1]. If you want to feed a corporation with your labour, post on SO.

[1] It'll still be illegally scraped and commercialized by some AIBro, and you'll have no proof or recourse against them...

Scraping to me is different than side licensing the content in some other form or usage than what it was created for.

Licensing also means the writer retains the ownership.

Cc-by-sa has had a few revisions too.

Does anyone know if ChatGPT etc. could code without Stack Overflow answers?

I think that is the big question, because the license seems like it's going to give lawyers a very wide attack surface to go after every AI coder out there if they all need the SO database.

There is plenty of code and bug trackers on the web for ChatGPT to learn from: OSS, Github, Sourceforge, ...

This seems like a really bad way to handle what should have been a foreseeable problem

It's sad :(

I preferred the old days better

The problem is whether people see programming as a zero-sum or positive-sum enterprise. In the real world, it acts as a positive-sum enterprise: one person's contribution benefits themselves and all those who use or learn from the code. However, many gatekeeping-type people view it, perhaps instinctively, as zero-sum. They imagine that OpenAI benefitting from this partnership, or any amount of learning via web-scraping their models perform, necessarily harms those who put their content online. This is a nonsensical argument, yet it has garnered a fair amount of support due to the somewhat reflexive anti-AI sentiment as of late, which is separate from the more nuanced concerns of existential threats due to AI.

Positive-sum rarely exists in this world.. after all, one's wealth determines their influence over others. Both sides might gain but this usually means others lose.

In this case, contributors might lose attribution. SO might lose traffic but they'll be compensated. Contributors won't so eventually there might be no reason to contribute anymore..

Isn't the existence of wealth in the first place sufficient evidence that wealth is something that gets created? We started out banging rocks together and now we have all of this weird stuff which presumably people like or something.

Now we work harder, and it's getting unbearable for those on the bottom... Wealth also affects whether you're "useful" and you need to be "useful" to survive.. It's getting harder to be useful

It’s simply false that positive sum doesn’t exist in the real world. Even the most simplistic trade argument in remedial Econ 101, or even Bio 101, reveals this.

What’s rare are zero-sum games.

Rising inequality suggests otherwise

If I'm not mistaken, the whole society is getting wealthier, it's just that some people are getting wealthier faster than others, so it's still positive-sum.

Here are the source charts:


You only consider those that "make it"; there are many who don't because it's getting increasingly harder to be "useful" in the market (ChatGPT is cheaper), and innovations usually make it worse. Those "new jobs" are harder and many won't qualify.

Did you check the linked charts? The stats show that people are getting wealthier across the board.

Imagine that you spent a lot of time helping people and building a community. Then a company encodes this "help" into text format, puts it into a book, and makes a lot of money selling the book. In doing so, this company kills the community. You wouldn't be pissed off about that?

Your knowledge work is being exploited. If you don't allow OpenAI to train its subscription product on your open source contributions, you will get banned.

I'm surprised OpenAI hasn't just crawled all of SO already?

I still use StackOverflow. Not as much as I used to, thanks to GPT, but still multiple times a day. What I find is that I spend less time on SO.

However, IMHO deleting questions you originally wrote in the past is hurting other users more than it is hurting AI training.

Other users cannot write similar answers to yours, because it doesn't add anything and they'd get downvoted or deleted. So if you hadn't written your answer years ago, others could've written something similar. Also, other users may have commented on your questions/answers. Their efforts would be lost/deleted if you deleted your questions/answers.

Thanks for your previous contribution to the community. But I would say the worst you should be able to do is remove your name/anonymise your posts, not just delete them.

I wonder whether actually deleting questions would be a good thing. If there is no old question, the same question asked again cannot possibly be a duplicate... So a constant loop of deleting questions might actually be an effective way to fix some problems. And there are enough off-site backups already.

Multiple times a day sounds like a massive amount to me…?

Looking through my browser history, I'd say that I average about 5 distinct SO posts per day. If you know there will be an answer it's less typing to search for it than it is to have ChatGPT regenerate it.

I code daily, and I don't remember the last time I used Stack Overflow in the past 2 years.

What I don't understand is: in CC-BY-SA, SA means share alike. Does OpenAI have to share their models?

stackoverflow users being dicks and preventing people from gaining knowledge? who could have ever seen this coming?

No empathy for users who spend their time contributing to a corporate walled garden and who, to top it off, get emotionally invested in it.

>Ben continues in his thread, "[The moderator crackdown is] just a reminder that anything you post on any of these platforms can and will be used for profit. It's just a matter of time until all your messages on Discord, Twitter etc. are scraped, fed into a model and sold back to you."

Uh.... yeah, it's a company, not a charity. No one's forcing you to post on StackOverflow. No one's forcing you to buy a ChatGPT subscription.

>Uh.... yeah, it's a company, not a charity.

While this is true, sites like Stack Overflow very much only function because they create the illusion that it is in fact a "community". The moment they make explicit that there is monetary value in the knowledge people post on the site, it becomes obvious that the users are, using Varoufakis's term, technoserfs.

You're very much never supposed to notice that Reddit, SO, and so on continuously extract value out of work you produce; at worst you're maybe supposed to notice an ad or two. Because if you do notice that, you might actually start asking why you aren't getting paid. Which is, funnily enough, exactly what news organizations and SO have realized vis-a-vis OpenAI.

IMO it's kind of silly and mentally corrosive to think of everything you do in these kinds of transactional terms.

I post on reddit because I find it enjoyable. I am not doing "work" that I think I deserve to be compensated for. Not every POST request I make to someone's server should be accompanied by a bill for my labor.

Given the lack of an alternative, should we instruct the human instinct of sharing in the pursuit of knowledge to sacrifice itself or accept the risk of exploitation?

If your sorry platitude is what we have to show for it, capitalism must go to Hell.

> Given the lack of an alternative

There are plenty of alternatives to everything. But no one uses them. Because why would they?

They wouldn't, because network effect, which is why there is no alternative.

I have very few SO contributions so I don't have much at stake personally, but I have observed that there was a trend of people using their SO profiles for career advancement. I'd see people reference their SO activity on resumes and I had job applications ask for my SO profile if I had one, and I've seen advice that a good SO profile was valuable the way a good Github profile is. Is that something people factor in to their decision to delete? And isn't that social capital a kind of compensation for their contributions?

I left some time ago, and I'm very glad I did.

I left because the staff behaved in a disingenuous manner.

I found when leaving, as mentioned in the article, that you are not allowed to delete accepted posts, so you can't delete your content, should you come to think SO objectionable and wish your content not to be there.

I can't see now why anyone would spend time posting answers there.

I don't love them getting into bed with AI, but also don't think it's unreasonable for them to not allow angry users to blank out their prior submissions.

The whole deal was that you basically donated your posts and CC-licensed them. I wouldn't begrudge Wikipedia similarly dealing with upset editors who went around blanking the articles they contributed to or reverting all their changes.

Wikipedia editors aren’t the sole source of their pages. I am fine with people leaving and deleting their posts because they may feel the information will become outdated without being maintained.

SO unfortunately is actively hostile to correcting outdated information. Which is somewhat understandable as recognizing how little long term value answers provide undermines their moat.

The site has basically become worthless for JavaScript due to such rot which helps explain why they are trying to cash out on the AI side of things.

Many answers have been edited, commented on, and reviewed by others. So it's also not exactly a one-person show either.

Outdated info is a problem, but not so easy to solve; I have answers from Ruby on Rails 4 era that are still perfectly valid today. Others may not be. Also remember that people sometimes stay on old versions for a long time. I don't know what the best solution is, but destroying information is not it.

Few answers get any sort of editing or updating over time.

If you’re worried about deleting information the obvious solution is to automatically hide the text when the poster says it’s outdated. At which point there’s little wrong with letting someone flag all their posts as likely outdated upon leaving.

But this is where the business of SO comes into conflict with providing useful information on SO.

You can't delete your posts on here either!

Which I don't like :-)

I'm with you, I think you should be able to delete your own posts and erase your internet history.

I don't think that everything we ever write on the internet should be stored forever because of some misguided intention to preserve conversations for future generations :)

fwiw HN does not delete your content when you get rid of your account.

I’m sure this is related to the fact that the editing window closes here after a short time.

I’ve been on other forums where disgruntled users have come through and destroyed old posts, which resulted not just in the loss of the messages but also harmed the threads that built upon the now vanished posts. So they, too, have instituted a short editing window policy.

Yes I'm sure companies can think up a whole array of excuses as to why they won't delete your data :)
