They used Stack Overflow to make their case, and report that user engagement has gone down since the release of ChatGPT. Could it not be the case that SO is less adept at finding related/duplicate questions than ChatGPT? Given the latter's facility with language, I would expect it to be. So I looked at the paper to see if they accounted for that, and found this.
"Second, we investigate whether ChatGPT is simply displacing simpler or lower quality posts on Stack Overflow. To do so, we use data on up- and downvotes, simple forms of social feedback provided by other users to rate posts. We observe no change in the votes posts receive on Stack Overflow since the release of ChatGPT. This finding suggests that ChatGPT is displacing a wide variety of Stack Overflow posts, including high-quality content."
Can anyone tell me how they are related and what it means?

PS: I also asked ChatGPT (4). Here's what it says: https://chat.openai.com/share/7029a1e5-63d0-4cec-bf76-10b0b5...
> They used Stack Overflow to make their case, and report that user engagement has gone down since the release of ChatGPT. Could it not be the case that SO is less adept at finding related/duplicate questions than ChatGPT? Given the latter's facility with language, I would expect it to be. So I looked at the paper to see if they accounted for that, and found this.
The moderation team and community in general on Stack Overflow are so toxic I'm not even sure you could control for that effect well enough to arrive at this conclusion. I would argue people are leaving because it's easier to ask ChatGPT your question than to be flamed and banned for asking how to do something. A half-right answer from ChatGPT is better than getting marked duplicate and closed because the moron moderation team can't detect nuance.
I have an anecdotal story about a friend corroborating this. He was rebuked and met with an unwelcoming response on Stack Overflow to his questions while trying to learn web development, and found ChatGPT a great teacher in comparison.
I'm a very experienced developer. Sometimes I have questions that are easier to ask a person than to answer by digging through miles of documentation. I could easily frame a question correctly to get a response, but even I feel extremely unwelcome there. I couldn't imagine being a junior.
It is unwise to leave the detection of nuance to volunteer moderators. Rather, highlight the nuance in your question (or answer). That strategy has always landed me high-value answers to my questions.
I've arrived from Google on "closed" questions I needed answers to, XY questions where I had question X but not the secret question Y, people giving XY answers to extremely simple and straightforward X-and-only-X questions, and questions with several upvotes and me-toos left unanswered for years.
But shallow questions where the official docs are too raw or are missing a few specifics? Stack Overflow was good for that before ChatGPT.
In 2013, or maybe it was 2014, I was tasked with building a couple of things for the SharePoint 2010 instance that ran our intranet. I'm a programmer, but it was pretty far out of my area of expertise, and it was hard to find any documentation, probably because SharePoint 2013 had been released. Anyway, I made things work by doing a lot of google programming, much of which led me to SO. Most of it was built with JavaScript, but I still needed to figure out how to work with SharePoint APIs and Lists.

Fast forward to 2023 and I'm tasked with working with our current SharePoint. It's still far out of my comfort zone, but the Microsoft documentation is on point, and it turns out to be a fairly trivial task to engage with the APIs. This story isn't really unique to SharePoint, and while obviously very anecdotal, these days it's very rare that your average blog spam or SO answers are more helpful to me than official documentation, and often the official GitHub code examples. I'm more experienced today than I was in 2013, so that helps as well, but for the most part, I think the key reduction in my personal google programming has been how much better we as software developers have become at documenting our tools.
So in 2023 I almost never visit SO, not because of GPT but because the people who write frameworks also supply you with documentation and implementation examples.
Again anecdotal, but I do use GPT quite a bit, though rarely to help me figure something out. It writes my documentation, unless the code is very secret. It sometimes writes basic code, like auto-generating a Dockerfile or basic API classes: things that aren't in competition with SO. I'm sure it'll take over the role of SO for some parts, but to me personally, it seems far more likely that SO's decline is linked to a range of other challenges as well.
They should ask the people who ask questions, not the people passively looking for answers.
The experience of asking a question on Stack Exchange sites is horrible! Each tag is its own community with edicts and customs you have no idea about, which derail your path to actually getting an answer. The auto-moderation system is completely broken because it thinks your question should be replaced with one from 2013, or the human moderators unilaterally "fix" your question, which then causes the auto-moderator to think it's similar to the one from 2013 and unceremoniously replace it.
With ChatGPT you don't need to prove anything about what you tried, you don't need a reputation score to do anything, you don't need to be a steward of every upvote for all eternity, you don't need to engage in meta discussion to get your post out of deletion, and it answers faster. You don't need to be told to ask a separate question if you have any follow-ups, with all of the same gamble of problems; ChatGPT already assumes what your follow-up question will be about and just tells you all the gotchas, because it's read all the forum posts and has seen what the recurring issues are.
It's better, faster, and cheaper time-wise.
It's not a direct comparison; ChatGPT is more like a pair programmer. If Stack Overflow had an ad hoc pair-programming live session you could hop into, it would be a better comparison.
This is a great point. I think the research is "biased" to begin with, in that it feels like you could easily p-hack your way into a "ChatGPT has degraded X" kind of paper if that's what you want to show. But with SO in particular, you're right that there are actually clear improvements that it makes as a competitor.
I can't think of a way, but it would be interesting to see if there is a similar effect on Wikipedia contributions.
> We observe no change in the votes posts receive on Stack Overflow since the release of ChatGPT. This finding suggests that ChatGPT is displacing a wide variety of Stack Overflow posts, including high-quality content.
My alternative hypothesis: ChatGPT is eliminating all the long-tail, garbage questions, so the vote patterns don't change because the missing questions weren't getting many votes anyway.
To expect a change in vote counts and/or ratios, you have to assume that users have a sort of "vote budget" that they are going to spend one way or another. If voting is mostly an independent variable that depends on the particular user and particular post, then you should not expect voting patterns or counts to change meaningfully.
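A toy simulation of this point (with made-up Poisson vote counts, purely for illustration): if each post's votes depend only on that post, removing posts at random leaves per-post vote statistics untouched.

```python
import numpy as np

rng = np.random.default_rng(42)

# Each post draws votes independently from its own quality; there is no
# shared "vote budget" coupling posts together.
quality = rng.gamma(shape=2.0, scale=1.5, size=100_000)
votes = rng.poisson(quality)

# Simulate ChatGPT displacing half of the posts at random.
surviving = votes[rng.random(votes.size) < 0.5]

print(f"mean votes, all posts:       {votes.mean():.3f}")
print(f"mean votes, surviving posts: {surviving.mean():.3f}")
# The means match up to sampling noise: under independent voting,
# fewer posts does not imply different per-post voting patterns.
```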
> Could it not be the case that SO is less adept at finding related/duplicate questions than ChatGPT? Given the latter's facility with language, I would expect it to be.
Given the latter's facility with language, I would expect it to be a better search engine. I would expect that ChatGPT is replacing Google as a sort of tokenizer. I.e., the search pipeline changes from "form natural question -> input to Google -> go to SO -> if the first few links do not yield an answer, post a new question" to "form natural question -> input to ChatGPT -> extract keyword tokens -> input to Google -> go to SO".
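A minimal sketch of that hypothesized pipeline; `ask_llm` is a placeholder for whatever chat-model call you have available, not any vendor's real API:

```python
def ask_llm(prompt: str) -> str:
    # Stand-in for a real chat-model call; returns a canned answer here
    # so the sketch runs end to end.
    return "python generator StopIteration for loop"

def question_to_search_query(question: str) -> str:
    # Use the LLM only as a "tokenizer": natural question in, keywords out.
    prompt = (
        "Extract a few search-engine keywords from this programming "
        "question. Return only the keywords, space-separated.\n\n" + question
    )
    return ask_llm(prompt)

query = question_to_search_query(
    "Why does my Python generator raise StopIteration inside a for loop?"
)
print(query)  # feed this into Google and land on SO as before
```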
There is an important bit in the article:
> > Using data on programming language popularity on GitHub, we find that the most widely used languages tend to have larger relative declines in posting activity.
Here we can form a hypothesis that the reduction in post frequency comes from entry-level posts, whose posters don't know what to search for. Under this hypothesis, ChatGPT has the strongest effect on users using SO as a knowledge base rather than a Q&A forum. This user-type distinction would affect posting frequency much more heavily than voting patterns.
Some of my colleagues are using ChatGPT and blindly follow its advice. ChatGPT's answers often look more persuasive than human answers on SO.
Still, I find ChatGPT answers much harder to validate. In programming Q&A, the answer is usually a series of calls. When you try to apply a human solution (assuming it's at least somewhat correct), the error often lies in an input-data mismatch or some minor difference in requirements where the answer isn't perfectly aligned. Rarely are the calls themselves wrong; they might be deprecated if it's an old answer, but generally they do at least exist. With ChatGPT, you never know which particular part of the answer has been hallucinated.
Also, as SO has more stringent question requirements, you are forced to construct a minimal case to reproduce the error. Often, composing an SO question got me straight to the cause of the error.
What happens after LLMs kill off SO and then seek more updated training data?
It seems like these "deaths" are either temporary, or something else will pop up that continuously improves and trains LLMs with proprietary data.
There's definitely a shift in the social contract of the internet. We're shifting from "publish a few things you know a lot about in exchange for reading stuff other smart people have shared" to "solve a problem with an LLM and train it in a manner that can be helpful to others".
What's unclear is how the middle of that will be monetized. Search engines figured out how to do it for the "publish and share" paradigm and made a boatload of money off of it, after paying for the massive infrastructure required to index everything. How will LLMs do it without killing off the data that trains them?
I think open source might benefit a lot from being so easy for LLMs to provide answers for. Hopefully they’ll be able to reason straight from the source code.
I've had GPT-3 give me wrong answers to questions. Then I google the topic, find the documentation, and go "hm, yeah, I can see how it would think that." For instance, ask it what `set +o` does in Bash, then check the online docs (without an option name, `set +o` prints the current option settings as a series of `set` commands, rather than unsetting anything).
With programming, at least, there's a validation step an LLM-based system could deploy: check whether the suggested changes actually work before spitting out the answer.
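A minimal sketch of that idea, assuming Python suggestions and a caller-supplied check; a real system would sandbox this far more carefully:

```python
import subprocess
import sys
import tempfile

def suggestion_passes(suggested_code: str, check: str, timeout: int = 10) -> bool:
    """Run the suggested code followed by a verification snippet in a
    subprocess; only answers that execute cleanly would be surfaced."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(suggested_code + "\n" + check)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, timeout=timeout
    )
    return result.returncode == 0

candidate = "def dedupe(xs):\n    return list(dict.fromkeys(xs))"
check = "assert dedupe([3, 1, 3, 2]) == [3, 1, 2]"
print(suggestion_passes(candidate, check))  # True -> safe to show the answer
```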
I'd expect user engagement to be going down along the same trend that has been occurring over the past five years, as Stack Overflow gets less and less useful.
They used a difference-in-differences design[1] to control for such longitudinal/temporal effects. Since ChatGPT isn't readily available in China or Russia, but Stack Overflow is, they can compare how SO changed pre and post ChatGPT in countries where it is widely available against the pre and post in countries where it isn't available (basically as the control).
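As a sketch of what that estimation looks like in practice (with fabricated numbers, not the paper's data), the interaction coefficient in a simple regression is the DiD estimate:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "posts":   [100, 98, 70, 80, 79, 78],  # weekly SO posts (made up)
    "treated": [1,   1,  1,  0,  0,  0],   # ChatGPT available in country?
    "post":    [0,   1,  1,  0,  1,  1],   # after ChatGPT's release?
})

# "treated * post" expands to treated + post + treated:post; the
# coefficient on treated:post is the change in the treated group
# beyond the change seen in the control group.
model = smf.ols("posts ~ treated * post", data=df).fit()
print(model.params["treated:post"])
```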
As someone who is getting their start with coding, where/what forums exist that have a high quality/helpful community? My biggest struggle has been with relatively simple questions – with a broad stroked theme/issue to 'em for the most part. AKA having a mentor or just a group of coders who are willing to help out if you are willing to be an active member (but I'm relatively useless, aka active maybe in an off-topic lounge part of it). I'd appreciate it if ya got the sauce, a PM if you'd like to keep it on the DL perhaps? Thanks!
Programming language discords might be good. Code review stack exchange has been pretty helpful, though maybe wait until you have a grasp on the basics before posting.
If you struggle with broad issues a HN post is probably a good place to start, assuming you can express your question well. I imagine you'll get the highest signal to noise ratio here.
Edit: btw nobody can pm you as you have no contact info in your bio afaict.
I see this as a rough parallel to "is the printing press a threat to illuminated manuscripts?" Maybe it is under a very narrow view, but overall, improved propagation of ideas has led to improved dissemination of ideas, and it will this time too. People whose narrow world has been disrupted will perform all sorts of mental gymnastics to tell us how we're going to be worse off for it, but we won't.
Ironically, internet discussion forums democratized specialist knowledge that you previously would have had to be in the right university or research circle to access. SO monetized that, and now we've moved on to complaints like this about further democratization of knowledge and (to quote another post) that knowledge being "stolen".
There is a weird sort of analogy here: universities spent the last 20 years putting their classes online, associate professors and grad students uploaded YouTube deep-dive videos, etc. Point is, enrollment at universities is way down now, and you could probably draw some kind of link between the two.
However, there are also links to jobs that no longer care about degrees, Gen Z being a smaller cohort, and maybe more interest in the trades as a path forward.
I'm not suggesting the paper linked here is right or wrong, but we should keep an open mind about all the other potential reasons why things suddenly shift. One random reason could be that during this overly oppressive year of layoffs engineers have either stopped needing to search or haven't had time to work on a project that needed it.
Also there's smarter IDEs which are likely the #1 reason I've stopped searching the internet for one-off small things like "what's the API to deep copy an array in Rust." Documented autocomplete has gotten really good and things like copilot are just icing on the cake.
The problem with this sort of comparison is that up until now, new ideas came from humans and technology advances merely helped to spread ideas, or greased the wheels. Now you can generate new ideas. Often without any skill of your own.
This is bad at first glance because look what happened when smart phones became popular: no one remembers phone numbers of their family, people can’t remember directions when driving. Now you don’t even need the ability to generate ideas.
And it's not clear to me that LLMs democratise information. LLMs have guard rails, are potentially trained on copyrighted data, and need to be run by large companies...
> when smart phones became popular: no one remembers phone numbers of their family, people can’t remember directions when driving.
This is a concern, though I'd argue unrelated to the concern about people no longer contributing publicly online, and one that was already present with SO. I've seen SO posts specifically reference or criticize the "copy paste" crowd that is just taking the answer and putting it into their code without thought. Some people will use any technology blindly.
Because of the implied effect on things like attention span, memory, and cognition in general. It's the same reason that mastering fundamental mathematics is critical, even though we have calculators. Trends in the Flynn Effect are relevant here. [1] The Flynn Effect is the observation that "real" IQ values were increasing, dramatically, for decades. In recent decades (since births in ~1970), this effect has not only decreased but reversed in the West. And it cannot be explained by dysgenics alone, since it even shows up in same-family cohorts. That suggests environmental causes are likely playing some role.
Those coming to adulthood in the 90s were the first generation to really get to experience mass, endless, and cheap dopamine driven digital entertainment. Notably the [positive] Flynn effect is still in full swing in places like China and India, where digital entertainment market saturation has taken longer. So they'll effectively work as perfect experiments. If it reverses over the coming decades, as everybody over in these places now has their heads shoved in e.g. smart phones as much as anywhere else, it should be telling.
What reasons are there to think that the reversed Flynn effect comes from a lack of memorization of rote knowledge like phone numbers, as opposed to attention-sucking social media apps?
That's the point. People used to be able to memorize 7-digit numbers, and now we can't because we don't have to. The brain is like a muscle, and it's atrophying. Those parts of the brain aren't being exercised and used, so we're getting dumber, individually and as a society. The fear is that this will become an impediment to progress.
You can only say this right now where there are an abundance of ideas and less skill to implement them. But even implementation requires creative thinking and the ability to generate ideas. Now we want to take away the need for the first part. This is NOT like going from handwriting to printing or radio to TV.
Something new is going on.
I use ChatGPT for work in a limited way. If you haven't tried it, I suggest that you do. This is potentially paradigm-shifting technology.
We could have said this at any point until now; all the historical squabbles over religious stories are evidence that useless ideation is an innate human feature that wastes time and resources, and that advancement in the skill of implementation is what has actually improved our day-to-day.
That's a bit of a leap. Because one set of ideas were wrong therefore having ideas is useless? Even those wrong ideas generated new and useful ideas.
I also don't understand how you intend to improve implementation without having more ideas.
True democratization of knowledge would mean LLM's are open source and usable by everyone (the way of Wikipedia).
Even then you'd want attribution to survive.
Eliminating or changing the incentives people have for doing things is the absolutely surest way to change behavior.
Postulating that all published work can be appropriated at will by certain private for-profit entities is the death knell of the knowledge economy as we know it.
> improved propagation of ideas has led to improved dissemination of ideas, and it will this time too
I sincerely hope you are shitposting. So called Web 2.0, the platformized web has placed huge incentives to stifle propagation and dissemination of ideas. However, these incentives are somewhat distributed under human review, therefore there can be bubbles with opposing ideas. LLMs centralize censorship, by design.
Currently you can find "BrawndoLounge" and "ToiletWater" subreddits that heavily censor opposing ideas. Even if one is full of hate speech or whatever, there still are opposing ideas there. If LLM guardians decide that one of these sources is "toxic", in an LLM-led web that viewpoint simply vanishes.
The paper title is clickbait and therefore the HN comments are focused on various low-effort reactionary noise, but it seems like an interesting study.
I would not be surprised to find StackOverflow usage dropped significantly because of ChatGPT. It's simply a much more effective tool for getting help with typical programming problems. Not as good of a resource for expert-level or architectural advice, but that's okay with me. Basic "debugging via internet" is much easier to do with an interactive service with lots of knowledge.
It's often pretty helpful just pasting an entire error message with backtrace into ChatGPT and seeing what it thinks.
If you're only pasting backtraces, you're not taking advantage of its coding ability. You can describe the problem, have it write some code, then iterate by telling it what to add, cases you want it to handle, features to bolt on. It's been doing a pretty good job of helping me design a database, down to the actual CREATE TABLE statements.
I won't upload any of my writing to the web anymore until this is all sorted out.
You can make fun of me all you like, but it's taken me decades to get good at this, and I'll be damned if some soft-skinned SV kid with a MacBook uses my work to power his mill.
When I felt like writing again, years after closing my previous blog, I thought about it for a while before committing to https://bitecode.dev.
But eventually, I realized that
- I also write for myself, not just for others.
- People like reading things without having to prompt for them. They will read the blog because it's nice, and topics come to them even when they didn't know they needed them.
- ChatGPT doesn't have an opinion. It tries very hard to be balanced. Your blog will have an opinion.
- There is more to the experience you provide than just knowledge. You can add tools, exercises, videos, etc. Which GPT cannot, for now, replicate.
- GPT can replicate style, but will not by default. People will come for your style as well. And pics. And design. And jokes.
- People value the interaction they feel when the content seems like there is a person behind it. They get attached. They develop sympathy.
- A blog puts things in context. "If you want to know this, you probably need to know that." It also gives information about what's happening right now.
- Humans curate. In a world where creating crappy content is very cheap, a good filter has tremendous value.
So yes, you will be scanned, and replicated. Doesn't mean you don't have value in writing what you do.
And making something of value is nice.
This may change in 10 years or so. Maybe the LLM will be able to do all that. But depriving yourself of a rewarding experience right now for the fear of what might happen is not worth it.
I agree with you. Programmers at these AI companies basically create a wealth concentration mechanism that diverts the money resulting from the value of our work into their pockets.
Sometimes I think about publishing AI-generated content that is clearly labeled as such, just to make any model trained by scraping it a little bit worse.
And Abimelech fought against the city all that day; and he took the city, and slew the people that were therein: and he beat down the city, and sowed it with salt.
Those who make fun of you are victim blaming. To be fair, a lot of these folks stating that if you put something out in public then it's not yours anymore sound like criminals.
Techno-communism, essentially. "From each according to his ability, to each according to his needs." And just like that type of economic setup, a handful of bros will reap the benefits - in this case Sam Altman's clique and others like them.
Private companies abusing works shared in good faith for free in order to profit is like communism? I'm sorry, that's just absurd "anything I don't like is communism"-level thinking. These are organizations operating under the incentives of capitalism to achieve the goal of capitalism (and lots of tech enthusiasts trying to come up with post-hoc justifications for the shiny new toy).
This is a very intriguing study. It would appear that because LLMs need public data to train on, their incentive effect on the marketplace of ideas is to reduce the very thing that gives them their power. They contain the seeds of their own destruction, destroying the web and the open-data ethos - yet another data point pointing at another AI winter.
For reasons I can't articulate I see LLMs as a vehicle for removing the creators from their ideas. This is very different than search engines. If a search engine generates traffic for documented ideas it creates a community. An LLM based internet seems to remove the creator and shim itself in between for the sake of business.
It's tricky, because in many ways, this is achieving exactly what I, as user, want computers to do for me: give me information I requested, and only that. I very much do not care about who discovered/created/published it, except only if it helps me quickly ascertain the trustworthiness of said information. I do not want to be forced or prodded to establish relationships with creators or communities. I do not want their ads and upsells.
It's the same issue as with search engines providing "information boxes": huge win for me, but a mortal enemy for those who want to monetize anything resembling intellectual property.
> An LLM based internet seems to remove the creator and shim itself in between for the sake of business.
This sounds bad, and in some cases it is, but in others it is not. Content farms and recipe sites have creators behind them too.
> I very much do not care about who discovered/created/published it, except only if it helps me quickly ascertain the trustworthiness of said information. I do not want to be forced or prodded to establish relationships with creators or communities. I do not want their ads and upsells.
And this is why humanity is going down the tubes...because you want something, you derive value from what you want, and yet you do not care about giving something back to who makes it.
> And this is why humanity is going down the tubes...
On the contrary - this is exactly how and why humanity built a technological civilization in the first place.
> because you want something, you derive value from what you want, and yet you do not care about giving something back to who makes it.
Yes, because it would be backward and limiting to do that. Note: I never said I don't want to give anything back - I said I don't personally care specifically about the author/publisher. I don't want to establish any relationship with them. If I'm paying them directly for something, I'm paying them for that thing - not also for relationship (which really is a sales channel), not also for being advertised to. If I'm paying some intermediary, then rewarding the maker is the intermediary's problem, not mine.
Consider: do you compensate directly, and have an active relationship with, the person who bakes your bread (hint: people selling bread in bakeries are not actual bakers)? The company who supplied them with flour? The farmers who supplied the flour-makers with grain? Do you pay delivery drivers directly? After all, you're deriving value from their labor directly. Etc. Then there's an entire army of people whose work benefits you directly, and whom you don't even think much about, and rely on being compensated from the common pool (e.g. taxes) or stochastically.
The whole point of money is to allow exchanging value without forcing parties into maintaining an ongoing relationship. That's a feature, not a bug. And if anything is driving humanity down the drain, it's the idea that you should, need, or are even entitled to capture all the value you produce.
What does your example have to do with this situation?
The people in the bread supply chain get paid; the author of the content we're discussing will never get paid by anyone, and will never even get a bit of personal satisfaction from their analytics, knowing that last month x thousand people read that page and it hopefully helped them.
It's completely zero reward, even worse it's completely zero feedback of any kind!
This really is the doom of the web as we know it because for the first time ever there will be an active disincentive to put knowledge on it. I think much information will retreat to places like Discord or locked down login only versions of sites like Stackoverflow.
> What does your example have to do with this situation?
It's addressing GP's complaint about me pointing out the indirect and transactional nature of the interaction between information producer and consumer.
> the author of content we're discussing will never get paid by anyone
That's... not my problem? Bear with me here.
> will never even get a bit if personal satisfaction from their analytics knowing last month x thousand people read that page and it hopefully helped them.
Aha!
So we're talking specifically about content creators that publish for free in hopes of maximizing a number on their analytics? That's healthy neither for them nor the society at large. Or you mean people publishing content for free to make money off ads? Yeah, I don't mind that content to disappear entirely.
Note that outside of web publishing, it was never the expectation of an author to have any idea how many people read their work, much less to get paid for every single "read event". They only got a lump sum or a fraction of the first sale of a printed work, but had no insight into or control over the further circulation of parts or the entirety of their works. Being able to resell your books or magazines, or give or lend them to friends, or borrow some from a library, are all good things.
All that was true before AI, and people found reasons to write new books, or to publish quality content on-line, for free and without advertising or telemetry. LLMs don't change that. If anything, they may reduce readership, not publication.
> I think much information will retreat to places like Discord or locked down login only versions of sites like Stackoverflow.
This has already been happening for the past couple years; LLMs, again, don't change anything here.
I'm really struck how this is the first time I have seen a disincentive to freely share information on the internet.
I can't tell if I have aged out of some ideal, or if individual creative efforts are being homogenized into pseudo-answers for someone to sell.
Regarding the baker example, some form of compensation is eventually directed to the employees, the farmers etc. even though you could say there are many layers of indirection.
In case of LLMs, no compensation is directed to the person authoring the information. While it may not be a problem for the consumer of the information, it removes any incentives for the people authoring the information to continue doing so, which has long term consequences.
Depends what the incentives are. Some folks like the community- that doesn't change. A website like honestwargamer or goonhammer has a clan. Mostly because they are insightful and friendly. But if an AI scraped their content, it would have to build a better goonhammer? Maybe. But I bet it would have half a dozen nerds contributing and tinkering at the backend, painting the new minis that came out, talking about tournament results etc.. That is current, constant fresh relevant content, based on IRL activity. Very hard to replicate.
For honestwargamer... build a better Twitch stream? That seems even less likely. They run rounds and stream games, review results, collate stats, and then engage with the audience on Twitch.
So when you say "contribute to the internet", this is what I consume and.. I'm sure there are similar examples in every niche- fishing, golf, coding, AI art creation ...
No I don't see this as gloomy scenario, and content creators- the goonhammers and honestwargamers, creatives, are still going to get paid (a bit, they were never rich), maybe in new ways.
That's a somewhat one-sided view. For gaming and entertainment, yes, people would do it anyway, since it's just fun, but these do not contribute much useful information to the collective consciousness anyway. Hobbies & creative communities will also survive.
OTOH, there are also plenty of technical blogs full of advanced content that is not "fun" to produce on its own, that are written to interact with a community of professionals (or juniors), and that might wither if engagement with actual human beings is reduced.
> Yes, because it would be backward and limiting to do that. Note: I never said I don't want to give anything back - I said I don't personally care specifically about the author/publisher. I don't want to establish any relationship with them. If I'm paying them directly for something, I'm paying them for that thing - not also for relationship (which really is a sales channel), not also for being advertised to. If I'm paying some intermediary, then rewarding the maker is the intermediary's problem, not mine.
And I maintain that your attitude is one that makes the world worse. We should know where things come from and not have an anonymous transactional attitude towards it. Our technological civilization as you call it has led us to destruction with only a minority benefitting.
BTW, my favourite place to get bread is one in which they actually sell and bake it in store (a tiny market with its own oven).
I don't know that I have a position yet. What bothers me is when a tech wizard says the tech is so complex that nobody knows what's going on; that's never good.
>I do not want to be forced or prodded to establish relationships with creators or communities. I do not want their ads and upsells.
I feel the same way. I lurked on HN for 7 years before signing up. I enjoy the lurking aspects of the internet.
I usually think the internet can survive anything.
This seems like the same disruption that Uber promoted to avoid regulation. Imagine all the great ideas we can generate and claim they are original to the unauditable AI thought process.
The last worst thing to happen to the internet, IMHO, was SEO. I sort of think a schism may develop between the broadcast world and the narrowcast world.
I don't have good answers. I have some high-level intuitions.
One of them is that creation costs of information are fixed, while its usefulness is unbounded, so it doesn't make sense to try and reward creators for each access/view/use, in perpetuity.
Secondly, there's a lot of information laundering going on - any random book I read carries between a few to few hundred references to prior written work. What I pay for the book goes to the author and the publishers, but AFAIK it doesn't go to any of the authors and publishers of works referenced in the book. Wikipedia takes this one step further, effectively turning all that information free.
Thirdly, AFAIK copyright explicitly does not cover information/knowledge - it covers specific works. So Google showing me an info box with a recipe scraped from some site could technically fall afoul of the law - but an LLM generating me a recipe based on associations created from being trained on millions of recipes feels like it should be in the clear, at least from the user's POV.
I think that is a somewhat narrow view. Maybe to make the contrast sharper: Why should I contribute any information just so that it immediately gets monetized by a handful of LLM firms?
The new situation isn't the same as search, since search wasn't there to hide information sources or to immediately convert information into useful artifacts (texts, guides, etc.).
> Why should I contribute any information just so that it immediately gets monetized by a handful of LLM firms?
If this matters to you, then you shouldn't. But to flip this around: why should you care?
Unless you're doing some unique work targeting a global audience, the point when an LLM gets trained on what you created is way outside the space you'd normally care about. Trying to capture all the value your work generates does not lead to a good world.
Or maybe it's me who isn't profit-minded enough, but e.g. a lot of what I wrote on-line, including blog articles and commentary on Reddit and HN, has been used by search engines for free for a long time (over a decade, in some cases), and now is (most likely) part of the training corpora for LLMs. But I never believed, and still don't believe, that I'm entitled to some share of the gains LLMs (or search engines) make.
Perhaps there will be a drop in high value information in the public domain, but right now, I can't exactly see LLMs impacting the incentives for creation and sharing of that information. I don't see how LLMs would make someone go "oh well, AI is here, I might as well stop providing people with no-strings-attached high quality information", if the existence of search engines didn't make them stop already.
For years people have been making travel blogs based on where they've visited and the practical information they've discovered, like experiences of visiting attractions or good places to stay in cities or how they got from one place to another. They monetised with ads and affiliate links so they could travel more based on that income.
In LLM land, they get no monetisation any more because nobody visits their sites, instead the LLM just regurgitates the answers they found.
The search engines actively supported these authors, by sending them people who needed the answers they had.
So in LLM land this information goes away because the feedback loop of the traveller creating information which earns them money to continue travelling goes away.
A LOT of the useful information on the web was built on similar feedback loops and they go away in LLM land.
I consider this to be a problem on its own, but it's not relevant here because:
> In LLM land, they get no monetisation any more because nobody visits their sites, instead the LLM just regurgitates the answers they found.
That can't possibly be true, because if it were, there wouldn't be any travel blogs anymore today. All that travel spam has been made redundant approximately around the time Flickr was created, and every interesting location ever has been photographed from every interesting angle in a way neither me, nor you, nor your favorite travel blogger could ever hope to match. All the information they post has also been posted many times over by travel bloggers that came before.
The point being: travel information and photography are a worthless commodity these days. Travel bloggers (or Instagrammers, or whatever) are not in the business of selling information. They're selling dreams and personal experiences. The photos and information are necessary as a delivery vector ("social object"), but by themselves are worthless and not the point. The point is entertainment, and ideally getting you trapped in a parasocial relationship with the travel blogger/grammer, which gives them a recurring revenue stream.
> In LLM land, they get no monetisation any more because nobody visits their sites, instead the LLM just regurgitates the answers they found.
It's the same model as with most other ad-monetized social media publishing. People will keep visiting them for the same reason they visit them now, and for the same reason they have their favorite youtubers and tiktokers. LLMs and other generative models don't change anything here, at least not short-to-mid-term, because they can't convincingly replicate human connection and keep it up for long.
(Also, I personally don't buy that travel instagrammers can actually sustain their travel lifestyle through ads and affiliate marketing and sponsorship deals. I suspect most are funded in some way, whether by family wealth or by services performed while traveling around.)
> The search engines actively supported these authors, by sending them people who needed the answers they had. (...) A LOT of the useful information on the web was built on similar feedback loops and they go away in LLM land.
Hard disagree. The only feedback loop this created in practice is the one that displaces quality information from the Internet - the combination of SEO and ad-based monetization means the most scummy players are the ones with most money to stalk every conceivable search query. The results speak for themselves: making a Google query for pretty much any topic of interest to general population will give you only content marketing sites - results that carry negative knowledge, as in if you waste your time reading them, you'll come more misinformed about the topic than you were before. If LLMs make all that go away, I'm 100% for it.
As for "A LOT of the useful information" - nope, can't think of a single case where ad/affiliate-supported site was a good information source, vs. just displacing a better free source.
> As for "A LOT of the useful information" - nope, can't think of a single case where ad/affiliate-supported site was a good information source, vs. just displacing a better free source.
> Today stingynomads.com is our full-time business and main source of income.
Now please tell me in what way is this not an example of an ad/affiliate supported site that provides a lot of useful information and what non ad/affiliate based resource has it displaced that was better? Cause I'm doubting someone would write up a better guide than that, publish it and not monetise it.
Then perhaps they should find other means of livelihood, instead of preventing the rest of the world from making full use of the information and technology available to it.
They will, and less information will be put into places that are freely accessible. If it's put anywhere at all, it'll be put behind login-only/paywalled/unscrapable places that LLMs can't access.
Why would they suddenly paywall information if they weren't already?
The way I see it, there are roughly three groups of information providers:
1. Those who do it pro bono - because they feel like its a worthwhile thing to do, or because they believe in by "pay it forward", or otherwise because they haven't even thought that what they share is worth trying to extract rent from.
2. Those who do it "for free", as a way to lure people to where they can expose them to ads, affiliate marketing, upsells, or other such schemes - making money by being predators using information as bait.
3. Those who just put up a paywall, being up front that they're selling information, not giving it away.
(There's also a weird "in-between" group of publishers that are almost like 1., except they're being funded out of marketing budgets of companies that figure providing quality information is good advertising.)
LLMs don't change anything for group #1. They may compete with group #3, but that's business as usual, not anything transformative. The group that's directly affected is #2, which also happens to be the group that produces all the garbage on the Internet, so I'm actually very happy to see them forced to find a more useful way of making money. Since group #2 produces "information" that's arguably negative knowledge on the net, it's likely to improve the amount of quality information you'll be able to find on-line.
Group #2 is the reason why there's so much information on the internet in the first place, especially free resources, for better or worse. I think it's pretty ignorant of you to say we can just discard one of the main ways people who create things on the internet get paid.
Ideas are copied by reading or hearing them. You can't own your ideas now, unless by "own" you mean hoard. The perpetual creators' rights you want extended are already artificial, require a non-trivial amount of our GDP to enforce, and still stifle future creation in a lot of areas.
Most people are paid for doing things every day; they don't get to create one thing and never work again. Expanding creator-compensation laws is regressive and only helps a few elites survive job uncertainty, not the bulk of the people. We're better off limiting this sort of thing specifically, to help everyone advance and share the knowledge.
LLMs aren't just a mere database containing indexed copies of other peoples' IP. AI companies are charging you for access to a sophisticated automated reasoning system, that necessarily had to memorize half of the Internet in the process of becoming capable of (some approximation of) reasoning.
(BTW. that you can even make a system this way is a huge breakthrough that's not being talked about enough.)
But even if they were a mere database indexing copies of other peoples' IP, then - copyright issues notwithstanding - the de-bullshittifying of information retrieval process alone would be service worth paying a lot of money for.
I think we are talking past each other. Let me try to narrow down where I think we disagree.
1) LLM providers harvest a commons to create their product (I don't think we disagree here much).
2) What happens next is where we diverge, I suspect: I think they will use their products to extract rents from that commons, while you think they will provide a fairly priced service.
Ultimately time will tell how the business model shakes out. Both could even be happening in sequence.
> 2) What happens next is where we diverge, I suspect: I think they will use their products to extract rents from that commons, while you think they will provide a fairly priced service.
Phrased like this, I can't really disagree with you. I don't expect a business to play fair in general, when it has a profitable option to do otherwise.
I guess my objection is more that right now, I don't see LLMs creating any kind of disincentive to publish quality content. In my eyes, LLMs are not a substitute for quality content in the first place - I see them more like using quality content to create a tool that competes with ad-hoc and shitty content.
That's not to say LLMs won't be able to eventually provide high-quality information on their own - but at that point, we'll have more important problems to deal with, such as chunk of humanity being rendered obsolete.
You're writing in English, which I doubt you came up with on your own, and I don't see you crediting the original speakers who developed your style or popularized the idioms you so casually use.
How do you claim the right to learn from the works of others and then demand government regulation and forceful intervention to keep from having to share whatever paltry innovations you may develop?
> One of them is that creation costs of information are fixed, while its usefulness is unbounded, so it doesn't make sense to try and reward creators for each access/view/use, in perpetuity.
The word "creation" is loaded. No one "creates" content. They discover it hidden in some idea-space... occasionally even two people might discover the same thing. The same melody, the same verse of a poem, the same fragment of art.
The idea that one should be rewarded while the other is slandered as the infringer is amusingly dumb.
> but an LLM generating me a recipe based on associations created from being trained on millions of recipes, this feels like it should be in the clear, at least from user's POV
There is a big potential role for open source or more specifically copyleft / free AI here that is released as a community project but can be monetized as well. The evidence from software is that there is lots of interest in contributing to such products.
I'm organizing the publishing of my thoughts on technology development and design. This is something I have been mulling over for decades, completely uncompensated. I'm not doing any of this for reasons I can understand; it's just what I think about and do.
Originally I saw a website as a way to hang out a shingle. Until recently I was thinking I could just publish away and maybe someone would hire me based off the website.
Currently, I don't feel the same way about publishing on the web. I will be more guarded in what I share.
> Even if it isn't monetizeable IP, how to share the costs?
The internet started thanks to ample government funding for research. So have many other technologies, including AI.
I wonder if there's a way we could all somehow pool our resources and use that to pay for common goods that we all use. What would we call such a scheme?
> I wonder if there's a way we could all somehow pool our resources and use that to pay for common goods that we all use. What would we call such a scheme?
Is this tongue in cheek? I think it's called the government and taxes! :)
It isn't clear to me (other than that it's an open engineering problem) why LLMs couldn't also include attribution as part of training. Tracking attribution could also lead to some insights into how their internal representations in vector space are created.
That is true. I see no obvious reason why the LLM companies take pride in not being able to document the ideation process. I have no justification, but I feel it is deceit rather than a technical limitation.
The issue here is that memorization of any distinguishable part of IP is an incidental aspect - those models aren't memorizing stuff, they're learning it. We don't expect people to keep track of the source of every single piece of information they encounter. It would arguably make learning impossible - as much for humans as for LLMs.
As an intuition pump, when I write "2+2 = " and you mentally complete it with "4", should I chastise you for not completing it with "4, as per ${your elementary class math textbook} and ${that other book you read as a kid}, corroborated by ${your first math teacher} and ${your parent} quoting ${some other work}"?
When you make an omelette, what is the technical barrier making it practically impossible to tell which egg contributed how much to any given part of the meal?
Phind's base model, which is GPT-3.5/4, doesn't itself do attribution; it's made to do that with prompt engineering, which supplies the most relevant materials on the web based on a word-embedding vector search and then asks the model to reference each source in the answer.
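A rough sketch of that retrieve-then-cite pattern; `embed` here is a stand-in (a real system would call an embedding model), so the similarity scores are meaningless, but the structure is the point:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: hash-seeded random unit vector, NOT semantically
    # meaningful. A real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by dot-product similarity to the query embedding.
    q = embed(query)
    return sorted(docs, key=lambda d: float(embed(d) @ q), reverse=True)[:k]

def build_prompt(question: str, docs: list[str]) -> str:
    # Number the retrieved sources and ask the model to cite them as [n].
    sources = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (
        f"Answer the question using only these sources, citing them "
        f"as [n]:\n{sources}\n\nQuestion: {question}"
    )

docs = [
    "Docs: copy.deepcopy recursively copies objects.",
    "Blog: slicing makes a shallow copy of a list.",
]
print(build_prompt("How do I deep copy a list?", top_k("deep copy list", docs)))
```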
I mean, this is more-less what a student does when writing a paper, when they're forced to cite their sources. They first come up with an idea based on their own understanding/recollection, then they try to figure out where did they first took that idea from. If they remember a specific source, they'll cite that; if they don't (because there may not be one specific source they learned from), they'll search for some existing work that expresses the idea in question, and cite that.
I.e. in case of both the student and an LLM, correct citation doesn't actually mean the idea originates from the cited work - only that the work contains this idea.
Thank you I want to believe responsible development is happening. I just asked an LLM my first question and the interactive processing was great to watch.
This is the only valid take against LLM's I've seen to date.
At the same time, could it not be construed as humanity leveling up? We've abstracted away that now-low level of thought, much like we don't have people pressing the button in the elevator.
Most developers and writers are still better than an LLM, but it's good enough to replace their input on simple tasks.
Writing is being commoditized much like other industries. The moment you release a new vacuum cleaner there's ten others that do the same thing and nobody is fussed over who invented it. People still know who to go for a premium vacuum though.
I've been wondering whether an LLM could list its sources, if the training data included source data for each document (perhaps in the form "The source of the following text is XYZ:")
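A minimal sketch of what that training-data formatting could look like; whether source attribution actually emerges from it is an open question:

```python
# Prefix every training document with its source, so "where is this from?"
# becomes ordinary next-token prediction over seen patterns.
corpus = [
    ("the bash manual", "Without an option-name, set +o prints the current settings."),
    ("some blog",       "You can deep copy a Python list with copy.deepcopy."),
]

training_texts = [
    f"The source of the following text is {source}:\n{text}"
    for source, text in corpus
]
print(training_texts[0])
```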
I have no doubt the last thing the LLM firms want is to attribute their sources. I always see the claim: heck, we don't know where the ideas come from; that is impossible.
I'm not sure it's all that easy though. We don't entirely know how LLMs do some of the things they do, and we can't interpret what's in them. They don't internally look up particular sources, it's just a big mess of connection weights.
Maybe my simple scheme would be all it needs. Or maybe it needs some new breakthrough and right now nobody knows how to do it. I was hoping some resident expert could let me know.
First, it seems possible that if sources were in the training data like I described, then understanding of sources could be an emergent capability, just because the LLM reads "the source of the following is X."
Second, maybe a trainer LLM could be tasked with reading the trainee's answers and any sources it provides, and judging whether the source is correct.
Well, you can train the LLM to "provide source", but LLMs are prone to hallucination. You run into the same problem with a trainer model; the trainer also has no way to confirm where the model actually got the answer from. One thing that may work is a fundamental architectural shift where the LLM looks up all its info as it needs it, and then you can just list the sources it actually used. Microsoft tried that with Bing, but it turns out the model will search for a website, read it, and then ignore what the website says and claim "according to this website, <some other belief>". So it's definitely not easy at any rate.
Sure but I don't think it's necessary to confirm where the AI got something. There might be lots of sources. I often tell someone some fact I know, mention a source if I remember it, and possibly google a link. An AI could do much the same. In training, the trainer could just check whether the claimed sources actually say something similar to what the trainee claimed it said.
In operation, the "trainer" could do the same thing in the background. And then of course, human users could also check up on the sources if they need to be sure of catching hallucinations.
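A crude sketch of such a background check, using token overlap as a stand-in for a real entailment or embedding model, and a stubbed `fetch`:

```python
def fetch(url: str) -> str:
    # Stand-in for downloading and extracting the cited page's text.
    return "set +o prints the current shell option settings to stdout"

def supported(claim: str, source_text: str, threshold: float = 0.5) -> bool:
    # Does the cited source plausibly contain the claim? A real checker
    # would use semantic similarity, not raw token overlap.
    claim_tokens = set(claim.lower().split())
    overlap = claim_tokens & set(source_text.lower().split())
    return len(overlap) / max(len(claim_tokens), 1) >= threshold

claim = "set +o prints the current shell settings"
print(supported(claim, fetch("https://example.com/bash-docs")))  # True
```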
AI winter? Labs haven't even begun to scratch the surface of the data available to train models on. We may be running out of quality human-written text, but we have yet to dive into:
Video
Audio
Images
Heat
Motion/acceleration
Lidar
RF
Sonar
Radar
Network traffic
Atmospheric pressure
Wind vectors
Magnetic fields
System/application logs
Electrical current
UV
X-ray
Microwave
Ionizing particle emissions
The focus should eventually be on building AI such that they can be given access to data and then independently figure out new and creative ways to use it, rather than requiring humans to figure that out for them and then narrowly define: do this thing with the data I've provided. Given the scale of the data, they'll be better suited to that approach if it's breakthrough outcomes that we want.
People have been testing them for ages. There's something there, but not nearly as much as the hype implies. At the same time, the hype seems to only increase.
Anyway, I really like that name. Non-dataist AIs are clearly the best kind.
I really think the deep learning maximalists are leading us down a rabbit hole. We're going to waste a lot of digital storage and computing resources on creating bigger and bigger models for more and more diminishing returns, barring some breakthrough. Without a way of encoding expertise, which is limited right now, I don't think we're going to get the kind of performance that the uninformed public expects out of these systems.
With all these LLMs browsing the internet and picking up ideas, I wonder if anyone has considered making "ads" targeted at them - to influence them to promote a particular way of doing things, etc.
Stack Overflow ceased to be a public good around 2013. Now it's a source of solutions from ten years ago, with a bunch of closed duplicates that could have had up-to-date solutions, and questions closed for being too interesting.
>> models like ChatGPT efficiently provide users with information about various topics
>> But since users interact privately with the model, these models may drastically reduce the amount of publicly available human-generated data and knowledge resources.
The premise seems a stretch when the models are simply providing a better way to find the information. It is essentially reducing the number of searches or low quality questions floating around.
For creators, it is just an aid to improve creation. The models help them find duplicates better or simulate variations faster.
This paper appears to construct a case for a walled garden by misrepresenting user questions as the actual content/creation.
User search questions always remained private with the site owners (Google search, Bing etc.,)
I don't know about that. Firstly, do we know that the site can financially withstand a gigantic drop in traffic gracefully enough to avoid becoming the next Quora? Advertising doesn't pay out more for developers with novel technical questions. Secondly, will a failed ChatGPT inquiry still lead to an SO post so new good quality questions keep rolling in? What if they start to lose prominence in search rankings because they wouldn't be the go-to anymore? I could see people using reddit or the like to serve the same purpose in a much friendlier format.
I think one thing that would tip this in a positive way is easy re-sharing of learning. I have plenty of "source code" now for conversations where I've taken something that was hard to find out (or code that was previously hard to write) through a non-LLM route. Publishing those conversations isn't as easy or re-adoptable into the commons as it could be.
(I'll note that proprietary models would have a first-order incentive to discourage this, since such conversations can be used to better train other models, but they benefit more generally from maintaining a commons of knowledge to draw from.)
Is it really removed if it is easier to get, and at a higher quality than a lot of SEO-spam search results? On a primitive level, I'm not sure it's much different from when computers and calculators replaced printed mathematical tables, such as the ones that were used in artillery firing.
That said, there are other considerations of accessibility separate from the broader topic of LLMs in general. Specifically, LLMs truly suited as better alternatives to traditional content might end up gatekept by corporations charging high prices - an especially likely scenario if copyright lawsuits rule in favor of strict protection, finding that training is not eligible for a fair-use exemption, and/or if regulatory capture puts the barrier to entry into LLM creation too high. Contrary as it might seem, losing the copyright battle might be a net benefit to OpenAI and its large competitors: it will be very difficult for more open alternatives to get the resources required to build open models.
I don't think use of LLM's, in themselves, constitutes a risk to public knowledge, in so far as they remain just as or more accessible than traditional content. That is the case right now, for some use cases, where I can get a much faster answer and immediately critique or get follow up responses to clarify things.
Large language models have a huge problem with hallucinating incorrect information, and with censorship.
Right now, most of those services very strictly refuse to say anything even mildly controversial or pornographic. Because LLMs have an increasingly large amount of influence on society, this is quite concerning, because many things will simply get memoryholed if the creators of the models deem it's not "safe" or acceptable.
I haven't felt the need to write my thoughts as much, but not because I have any concerns about it being used as part of some training data. Mostly it's because I don't know if I'm interacting with real people anymore.
We no longer need the public good (public answers) because we have a better solution (personalized answers). It's like worrying about rendering mine-safety regulators irrelevant because we switched to clean energy.
"Second, we investigate whether ChatGPT is simply displacing simpler or lower quality posts on Stack Overflow. To do so, we use data on up- and downvotes, simple forms of social feedback provided by other users to rate posts. We observe no change in the votes posts receive on Stack Overflow since the release of ChatGPT. This finding suggests that ChatGPT is displacing a wide variety of Stack Overflow posts, including high-quality content."
Can anyone tell me how they are related and what it means?
PS: I also asked ChatGPT (4). Here's what it says https://chat.openai.com/share/7029a1e5-63d0-4cec-bf76-10b0b5...