Stack Overflow users deleting answers after OpenAI partnership (build5nines.com)
302 points by miles 39 days ago | hide | past | favorite | 322 comments



Resistance is futile!


Hmmms. While I definitely can see SO's arguments concerning deletion, that letter seems to blatantly contradict GDPR's right to be forgotten, which Wikipedia describes as a more limited "right to data erasure" [1].

To coin a Dutch phrase: I cannot make chocolate out of that. Anyone here have an idea how to bring these two points together? Other than the obvious "wrt. EU inhabitants, SO is lying", that is. Or is it really that simple?

[1] https://en.m.wikipedia.org/wiki/Right_to_be_forgotten - under "European Union"


I find it deeply troubling that platforms are becoming so hostile that users are having to strike against the owners by mass deleting their content. And then the platforms handle this by simply undeleting the content and banning them from continuing to delete any more (StackOverflow, Reddit)

This may also be legally dubious in Europe as, while the authors may have granted copyright licenses to the platform owners, they still retain their moral rights, which may apply in this case (IANAL)


I just submitted a request to have all my content removed from SO and will challenge the outcome if needed. My right to be forgotten and have my content deleted supersedes SO’s dubious, “nobody really reads these” terms and conditions.


Great idea. Have done the same.


I’m not convinced yet that GDPR covers things.

GDPR is focused on PII and anything that can identify you directly or indirectly as an individual.

If your posts start with “My name is Jane Doe and this is my post” then that would be one thing.

However, from everything I’ve been able to ascertain, your average forum post is unlikely to be covered, apart from anonymizing the email address etc.

Granted, there are certain things that can lead to deanonymization; in the case of forum posts I’m unsure how far that goes.

It’s a myth that the GDPR “right to be forgotten” is absolute.

I also suspect that creating anonymized IDs and attributing that ID to each thread would be enough to get past such attempts to link posts to PII.
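The pseudonymization idea above can be sketched roughly as follows. This is a hypothetical illustration only: the salted-hash scheme, names, and sample data are my assumptions, and whether such a scheme actually satisfies the GDPR is a legal question, not a technical one.

```python
# Sketch: replace the real user identifier with a stable anonymized ID,
# so posts remain attributed to *an* author but not to an identifiable person.
import hashlib
import secrets

SALT = secrets.token_bytes(16)  # kept secret; without it, IDs can't be linked back

def anon_id(user_id: str) -> str:
    # Salted hash: deterministic within this dataset, but not reversible
    # to the original identifier without the salt.
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()[:12]

posts = [
    ("jane@example.com", "How do I parse JSON?"),
    ("jane@example.com", "Never mind, solved it."),
    ("bob@example.com", "Use the json module."),
]

anonymized = [(anon_id(user), body) for user, body in posts]
# The same author still maps to the same opaque ID; the email is gone.
```

Per-thread (rather than per-user) IDs would go further, breaking even the linkage between a user's posts across threads.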


This bodes poorly for the future of SO.

One of the reasons that Quora today is absolutely unusable is that it no longer is a curated discussion between internet users and knowledgeable people, but AI spamming the site with swarms of low-quality questions, and AI answering those questions with swarms of low-quality answers. I think it's likely that Stack Overflow will end up following a similar pattern.


Quora was already absolutely unusable back before GPT-2. It became unusable as soon as people realized that all they had to do was self-identify as an expert to get taken seriously on there, so people started developing whole lifestyles around building up their Quora profiles. From that point on the actually knowledgeable people weren't interested in contributing because there was no way to distinguish themselves from the people who were faking expertise. AI may have been the final nail in the coffin, but Quora was dead long ago.

Stack Overflow managed to avoid that particular hazard by placing less emphasis on real-world identity and expertise, but it also has been in a long-term decline for many other reasons. The fact that they made such a vocal stance against AI and then pivoted so dramatically is just one example of how much they've struggled to find direction lately.


Just a point of clarification: the moderator base (the site's power users) took a strong stance against AI, and the company, chasing every possible dollar, overruled them.

Short term profits over user preference is what happened here


Oh, yes, my mistake! I misremembered. Thank you!


I know this wasn’t really your point, but it’s worth noting that Quora being low-quality spam is not the problem. The problem is why the hell Google surfaces Quora so prominently, given that the results are pure shit and require registration to even see all of it.

Is there any reasonable explanation for how they’re ranked so high? Like, how can even googlers tolerate it?


Just a guess, but I think when they started losing the spam wars they put in some kind of handcrafted whitelist ranking boost, either directly based on brand/site, or link proximity to known good sites, etc. And maybe they don't update that list too often. You can find some info about an ML update Google called "Vince" that sounds a lot like that.


Not updated in over 2 years?


Not updated in a way that affects Quora over many years would not surprise me.


How could that even work to not affect one of the most popular sites whatsoever?


Poor maintenance of a probably thousands long whitelist of "brand quality" seed sites? When the only measure they really care about is ad revenue, and bad organic results might mean more ad clicks? It's not really that outlandish, just plain complacency from a company with an overwhelming market share lead in search. That's how Google started in the first place...capitalizing on complacency/stagnation on the then leaders in search.


> Like, how can even googlers tolerate it?

Assuming you mean people working at Google, the answer is probably that profit/promotions outweigh personal use. More clicks, more back-buttons, more search adjustments, more advertising revenue.


Quora and SO are rather different communities. In Quora's best days, there were celebrities or quasi-celebrities making interesting posts, just like on Twitter or Google Plus in top times. Also Quora used to have very active and talented Community Managers / Top Writers. Marc Bodnick used to do tons of curation but left a while ago to create his own social network(s).

In contrast, SO has never been so "celebrity"-driven and the content has a rather different audience. I think it's understandable that the major contributors don't like how their content is being used, similar to the Reddit revolt.

What might "replace" SO is some AI-assisted way to establish a handbook and FAQ for any new technology. That could be a chatbot as well as some effective method for feeding that bot content.

And then SO-the-community, i.e. people who want to talk to each other, will probably branch off into some other forum or network.



Good luck enforcing that.


If a human can detect it, it can be flagged. They already deal with human spam.


Congrats to OpenAI (and the rest of the LLM bros) for creating negative incentives for sharing knowledge.


I do not understand, how are they "creating negative incentives for sharing knowledge"?

If I posted on SO before in the hope that others find it useful (and not for the karma) - and now it might help others not directly through the site, but with further steps through a llm, where is the problem? Knowledge was shared.


Part of the benefit for the answerer is the experience of interacting with the questioner, receiving upvotes and comments, having answers accepted, and having your name on an answer that's helped people. You get credit for answering questions on Stack Exchange sites. It's not much -- it's not supposed to be -- it's rarely of material consequence -- but it matters. I still get upvotes on some of my old EE.SE answers when my written work helps someone enough for them to give notice. It's a little reminder that I've done something useful in my life.

Having my work ingested into ChatGPT takes the me out of it. It turns me into, essentially, unpaid contract labor for OpenAI. They get all the credit, and I get forgotten. Why would I be okay with that?

If you want to write free code for OpenAI to improve ChatGPT, you're welcome to do so. Cut out the middlemen and send it to them directly. But please leave me and my work out of it.


"Having my work ingested into ChatGPT takes the me out of it. It turns me into, essentially, unpaid contract labor for OpenAI. They get all the credit, and I get forgotten. Why would I be okay with that?"

So you are OK with unpaid contract labor in exchange for virtual points. But if you don't get virtual points as appreciation, no one should benefit. That is OK, but then sharing knowledge is not your main goal, but a secondary one. Your main goal is recognition.

But if you delete your comments, you won't get anything at all anymore. If they remain, real humans will still benefit directly or indirectly. And why should I write exclusively for OpenAI? I share my knowledge with anyone. If SO were to restrict public access and favour OpenAI, that would be the moment I would want to delete everything. But at the moment LLMs are just getting official access as well; they had access to SO before, only in a legal grey area. So nothing really changes.


Smcin has answered your other point. Let me respond to this one:

> But if you delete your comments, you won't get anything at all anymore. If they remain, real humans will still benefit directly or indirectly. And why should I write exclusicly for openAI? I share my knowledge for anyone.

The goal -- implicitly for AI companies and explicitly for many of the commenters on this story -- is to replace sites like Stack Exchange. Stack Exchange's traffic will instead go to ChatGPT. The most likely outcome of this is that Stack Exchange will eventually shut down or severely degrade its service. If ChatGPT were a supplemental tool, one user out of many, you would be right. But it's not a complement, it's a competitor, designed to make a profit off of assimilating my work without giving me any compensation or credit.


Exactly. People join SO and other SE websites to ask questions and get answers.

With ChatGPT and similar platforms, trained on SE answers (and open Github repos,...), people will eventually skip Stack Exchange and directly go to ChatGPT.


> But if you don't get virtual points as appreciation [unpaid contract labor]... then sharing knowledge is not your main, but secondary goal. Your main goal is the recognition.

It's a false dichotomy to parse out components of motivations; most SO users are motivated by a mix of altruism, sharing knowledge, some recognition, optionally linking to your profile/website/blog/resume/portfolio, getting job approaches, and a dose of pride/ego/vanity. As a longtime SO user, that has historically been the bargain, when (most/)all of your submissions were directly seen by human end-users. As a plus, all of that gave you good SEO commensurate with your contributions. So it's unreasonable to try to dichotomize into "users who mainly did it for the rep" vs ones who want to teach and share.

But the 2023 and 2024 announcements are different: the future is your submissions will be used to train AIs; however SO doesn't seem to have devoted much thought to licensees like OpenAI complying with SO's attribution requirements [0] (attribution must cite individual URL of question/answer, and SO username, which then links onwards via your SO profile page to the items mentioned above). (If the AI synthesizes an answer derived from 5 separate SO items, do they guarantee to attribute all 5 items?) So the human eyeballs are being intermediated, your incentives to participate are evaporating, and that pretty much breaks SO's historical bargain with its user community.

The next major bad development would be SO opening the floodgates on the moderation-queue backlog of thousands of items of AI-generated content (which caused the 2023 moderator strike/resignations), much of it low-quality and arguably ban-worthy. If/when that feedback loop is closed, the results might well be unholy; certainly bona-fide human contributors will be marginalized and have less incentive. (And if AI were to be used for moderation, then that could be exploitable.)

Inbound views/hits on your content on SO either come from a) Google + other search engines b) SO's search itself c) attribution from OpenAI's ChatGPT d) attribution from other(/future) AI licensees. If your code is scraped once but effectively viewed 1 million times from GPT, you won't see those 1 million hits show up; you can only vaguely infer they might be happening if the attribution is actually implemented, and some users click through on it (or by reverse-querying the AI). So c),d) will proportionately increase as a),b) proportionately decrease.

So everything has changed. And obviously the incentive to you to continue to provide unpaid volunteer labor ongoing without even attribution decreases.

[0]: https://creativecommons.org/licenses/by-sa/4.0/#ref-appropri...


What are the negative incentives? How would an LLM improving in capabilities harm those who shared their knowledge for free online at some point in the past?


My experience is worth less if an AI can summon it at-will. It hasn't necessarily come down to this yet in the software industry, but in others (like animation), folks who were previously responsible for generating concept art have found themselves without jobs as management can get "good enough" results from a much cheaper medium (that was, at least en-masse, trained on their "prior art").

I don't personally have a well-formed opinion one way or another on this, but to dismiss the existence of an issue at all is logically lacking.


the same reasoning would equally justify the claim that your experience is worth less if beginner programmers can summon it at will; if you believed that reasoning, you wouldn't have contributed to stackoverflow in the first place. i don't, and if you contributed to stackoverflow, you didn't either


The scale might be different here, since prompting AI is much cheaper than hiring a beginner programmer. The previous loss could, for instance, be compensated by attribution.


ML gives a whole new meaning to "training your replacements".


Coming to think of it, recent ML is just a scaled up version of Infosys, Wipro, etc. Shit quality answers for enterprises, now accessible for the masses.


SO made it such a pain in the ass to contribute I gave up trying every time I’ve historically been interested. Like I’m already sacrificing my time to offer my expertise helping someone, you want me to jump through a bunch of hoops to have the privilege of doing so? No thanks.


That same pain-in-the-ass gating made spam and terrible-quality answers equally discouraged. Given the volume of at least decent content on Stack Overflow, I'd say the game worked. Somebody could try to make it better with a competitor, but it would be a hard thing to succeed at.


The more hoops they've added the worse the quality has gotten. The quality has declined over time, and most of the good answers you see nowadays are from people who got in the habit of contributing back when the process was much simpler, and would likely never have joined the site if it was as onerous as it is today.


On a related note, it costs my employer way more to pay me to solve a captcha than it would to pay a captcha solving service.

At some point, passing the hoops turns into a negative predictor of comment quality.


Have you assessed the quality of Q&As that aren't years old? Anything decent that I find is usually quite old and possibly out of date.

It doesn't help that asking for a more recent answer gets your question closed as a duplicate, and new answers can never overcome the inertia of the historical ones.

September has come for Stack Overflow


What do you suggest as a better alternative?


I'm starting to wonder if the days of "free, ad-supported, user-generated content wells" are over. The audience and participation base have grown larger than these single entities' ability to cope rationally while still maintaining their original mission and profits.

We've outscaled our original hopes for the Internet. It was originally meant to be a tool genuinely controlled by its users; unfortunately, it's largely ended up in the stranglehold of a few monopolists.


Stack Overflow has been assimilated. Resistance is futile. It served a useful purpose but now it's part of the glorious AI universe to come. Rest in peace.


The knowledge that's baked into those LLMs comes from sites like Stack Overflow. Without them, how can the LLMs learn new things?


That is the big question with LLMs. How can we tell whether what is being fed in is original content or just the output fed back in, like a recursive fractal?


That also means there is probably a lot of wrong information on Stack Overflow that is baked into the training too. Hopefully, they accounted for this in training, but no way of knowing.

I have not really had a lot of accuracy issues with GPT, but then again, I am probably not savvy enough to spot them most of the time, anyway.


If it's posted on stack overflow, it's not new, it's merely been published. If this is the bar for LLM "learning" then they are doomed to live in a hazy bubble of the recent past.


SO isn't the only source. Official docs/wikis, GitHub issues, etc, are also good knowledge sources.


Haha, what's that gonna do? Ever heard of soft delete? It's a thing where even if you delete something off a website, the database still retains that information even though it becomes inaccessible by the public.

Everything we write on the web is like that, including this very comment.
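A minimal sketch of how soft delete works, for anyone unfamiliar with the term. The table and column names here are illustrative, not Stack Overflow's actual schema:

```python
# Soft delete: rows are flagged as deleted rather than removed, so the
# data survives in the database even though public queries hide it.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE answers (id INTEGER PRIMARY KEY, body TEXT, deleted_at TEXT)")
db.execute("INSERT INTO answers (body) VALUES ('original answer text')")

# "Deleting" just stamps the row; nothing is erased.
db.execute("UPDATE answers SET deleted_at = datetime('now') WHERE id = 1")

# Public-facing queries filter out flagged rows...
visible = db.execute("SELECT body FROM answers WHERE deleted_at IS NULL").fetchall()
# ...but the content is still there for anyone with database access.
retained = db.execute("SELECT body FROM answers").fetchall()
print(visible)   # []
print(retained)  # [('original answer text',)]
```

Undeleting, as Stack Overflow reportedly did, is then a one-line `UPDATE` clearing the flag.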


even if it were a hard delete, do these people think OpenAI is scraping the live version of the site?

the answers have already been exported. all you're doing by deleting it is ensuring it's only available on ChatGPT, and no longer available to web users who aren't using AI tools that ingested the content before it was deleted.


It lowers the value of the site if people stop answering and sabotage their existing answers in protest. Does that make sense? Do you understand?


I wonder if the Wikimedia Foundation couldn't just take the opportunity, now that Stack Overflow is alienating its userbase, to launch a rival Q&A site. I was always puzzled why they never attempted to enter this space, even before Stack Overflow, given their prior experience with crowdsourced information commons.


Pragmatically, the software powering most of their properties, MediaWiki, is not suited for it. It's hard to see them investing in the development of a new platform given the uncertainty of success.


In addition to deleting answers, I think protesters should upvote wrong answers and crappy posts.

For years the community has defended punitive downvotes on correct answers to crappy questions as "you can do with your vote as you like". I see no argument against flipping that around.


What is the end game here? You wanna get paid? That wouldn't be more than a few cents, just like how Spotify deals with artists.


My personal end game, if I have one (and I'm not sure I do), would be to ensure that I can help individual novice programmers become better at their craft. Not to make billion-dollar corporations even richer.


What's your stance on when there are open source LLMs that are as capable or more capable than GPT-4?


Does everyone get equal access to let their own copy of the open source llm download a copy of SO?

Are those open source llm users in turn selling access to the content they got for free, and also stripped of attribution?

What exactly is changing hands in trade for the money, that doesn't one way or another violate CC-BY-SA?

It's not merely the fact of any form of commercial activity, since there is no NC in there, but the specific actions here by both StackOverflow and OpenAI violate the terms the content was originally created and shared under.


CC-BY-SA is a copyright license. If LLMs are found to be fair use then there is no copyright infringement regardless of license.


And you intend to do this by signaling that incorrect answers are correct?


No one needs a sensible, logical, or even rational reason to do that which they are already entitled to do.


You know they read this as "They can do something illogical if they want to." instead of "They don't owe you an explanation of their reasoning nor require your approval of it, and your not knowing or understanding or agreeing with their reasoning does not mean there is none or make it invalid."


It’s all the same. Whether it’s rational or not isn’t really relevant and is subjective.


"logic is subjective." noted.


In social interactions, well, yes. If you have a formal logical framework of human behavior I'd love to see it.


SO has been doing the absolute worst things to squander their amazing lead for years

i haven't used that website since GPT came out, and now i contribute nothing to it

but i'm glad all of its content ended up training the models that put it out of business, thanks SO! you'll never be anything other than user contributions


As dour as it sounds, I am in a similar boat. Who'd have thought that needlessly getting called names when you ask a question (even if it's dumb, as that's how you learn) makes people less likely to interact with you.

what's amusing to me is that some people even in this thread are calling it a pro, not a con. I guess our field does indeed attract a certain kind of personality.


OpenAI shouldn't have paid a license. Just scrape the content. It's fair use.

Now they've paid the shakedown fee, and Stack Overflow has a user riot on its hands, albeit one that's trivial to shut down: send OpenAI the db backup.


fair use applies when there's a copyright infringement to defend against, but the cc license on stackoverflow clearly permits scraping the content


Well then, there's even less reason to pay the rent Stack Overflow was wanting.


agreed



