Hacker News
Reddit is OpenAI’s moat (cyberdemon.org)
286 points by dmazin 11 months ago | hide | past | favorite | 304 comments

The leaked Google memo "We have no moat, and neither does OpenAI" is instructive here:


The original author is an ML researcher, and the crux of his argument is that most weights in an LLM are significantly overdetermined. Once you have ingested several terabytes of natural language, you know how to generate natural language.

The remaining misses are facts it has never seen, usually because they are so obvious that nobody thinks to write them down explicitly. So more training data doesn't necessarily help LLM performance, unless you're ingesting either extremely basic facts that are so obvious most adult discourse overlooks them, or expert knowledge that's highly specific and only discussed in a few forums. Reddit data could perhaps help with the latter, but a) Reddit is usually not the place to go for expert discourse and b) there are other, better sources of data for it. You'd usually be better off training on trade publications, scientific journals, or fandom than Reddit.

Also, the LLaMA/RedPajama approach makes this stupidly easy, because you can pass around patch sets for models trained on specific mini-corpora and then update the weights appropriately. Hence the author of the Google memo believes neither Google nor OpenAI has a viable moat.
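(The "patch sets" in question are LoRA-style adapters, as a reply further down notes: a low-rank weight update that can be shared separately and merged into frozen base weights. A toy sketch of the merge step, with made-up 2x2 matrices and no ML library involved:)

```python
# Toy illustration of a LoRA-style "patch": the adapter stores a
# low-rank update B @ A, and merging just adds it onto the frozen
# base weights. Sizes here are illustrative only.

def matmul(a, b):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(base, lora_a, lora_b, scale=1.0):
    """Return base + scale * (B @ A): the patched weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [[base[i][j] + scale * delta[i][j]
             for j in range(len(base[0]))]
            for i in range(len(base))]

base = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
lora_a = [[0.5, 0.5]]             # rank-1 factors: A is 1x2 ...
lora_b = [[1.0], [1.0]]           # ... and B is 2x1
merged = merge_lora(base, lora_a, lora_b)
# merged == [[1.5, 0.5], [0.5, 1.5]]
```

Passing around a patch means shipping only `lora_a` and `lora_b`, which is tiny compared to the base model.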

>You'd usually be better off training on trade publications, scientific journals, or fandom than Reddit.

Bold claim.

There are plenty of engineers and doctors on Reddit answering nuanced questions that aren't talked about in any journals.

That is what is missing: the absurdly specific stuff that Reddit gets. Sure, you can answer with nonsense that sounds realistic, but you could also answer with the exact text that solved the problem in the past.

It's the difference between asking "How do I save money at the grocery store?" and getting generic answers like "Make a budget" versus specific answers like "Buy fresh rather than processed".

> There are plenty of engineers and doctors on Reddit answering nuanced questions that aren't talked about in any journals.

There are lots of really good subreddits that have expert feedback, but I think the issue is that going through all that and separating the wheat from the chaff is going to be a much more involved process than feeding in those other sources of expert opinion.

Ironic that Reddit self-sabotaged so hard the one popular place to converse with verified experts in their field (AMAs), as that was probably the most trustworthy input source on the site, considering the participant vetting.

> There are plenty of engineers and doctors on Reddit answering nuanced questions that aren't talked about in any journals.

There are also plenty of people answering nuanced questions with complete BS.

It's not that bold a claim: taken as a whole, the signal-to-noise ratio is worse on Reddit than in a publication.

The replies of engineers and doctors are significantly outweighed by the responses of bored teenagers.

Yeah, but I imagine bored teenagers are not patrolling the embedded subreddit.

Well, it depends on what you're asking the LLM to generate.

If I ask an LLM to write me an essay in the style of a caveman, the LLM simulates a redditor simulating a fictional caveman (who speaks English, but with short words and bad grammar). After all, that's what was in the training data.

If I ask the same LLM to write me an essay in the style of a Harvard professor - who's to say the same thing doesn't happen?

So I can see how, when training a model, you might want it to know what's really written by a Harvard professor, and what merely professes to be.

The GP ignored books for some reason.

But you get very different kinds of knowledge from each of those. They are not really comparable.

You won't find any useful explanation of how to solder surface-mount electronic components onto a PCB in a book, and you won't find a deep enough explanation of how to size a decoupling capacitor for an amplifier on Reddit.

This is a great analysis that really cleared up the impact of that memo for me. You clearly know, but for any readers who aren’t clear: that memo is the opinion of one engineer at Google, and far from the (apparent) opinion of the relevant execs.

It’s dead wrong; I suggest reading the original. _No one_ thinks models trained on more data would come out about the same, yet everything in the memo flows from that premise… then it throws the Current Thing on top… it’s very unhelpful.

The “patches” he refers to are LoRA, and the memo treats them as a deus ex machina. They’re not; e.g., playing with Stable Diffusion you can see they’re additive, but they’re not nearly as good as training the original model on the data.

(Disclaimer: Googler)

> they’re not nearly as good as training the original model on the data.

How are you defining "nearly as good", can you be more specific?

It's obvious that full fine-tuning > [PEFT tuning], but to my understanding the gap isn't that significant, as reported in various papers (specifically with respect to language models; I'm not familiar with diffusion models).

> Reddit is usually not the place to go for expert discourse

It's not expert research, but Reddit can be used to find probably-real-first-hand experience on a given subject, which is enough reason to prefix "reddit" to a lot of Google searches. It just needs to be a marginal improvement over Google blogspam for it to have some intrinsic value.

Also really depends on what kind of experts you're looking for. 99.99% of the world's experts are not writing in trade publications or scientific journals.

It used to. For the past 7-ish years I've seen a lot more posting that was straight-up lies, or people pretending to be experts, that wasn't heavily downvoted. Or maybe it was always like that and I just got more aware. Either way, not something I'd want ingested into a model.

I don't even think blogspam is the problem as much as SEO-gamed content.

Blogs are fine, I don't mind reading them.

The classic example is the recipe that sits 95% of the way down the page, with every possible word you could google stuffed into the preceding 95% in the form of a fake story.

> you're ingesting expert knowledge that's highly specific and only discussed in a few forums. Reddit data could perhaps help with the latter

This is exactly where I think Reddit's value lies. I disagree, though, that people don't go to Reddit for it. Here are a couple recent queries where I appended reddit to my search query:

* best places to visit from London
* best mattress sold in UK
* should I bring king mattress from US to UK

(Can you tell I just moved to London?)

Why do people go to Reddit? Because it's guaranteed (at least for now) that the answer came from humans who are not trying to sell you something. They may be wrong but... it's opinions.

> it's guaranteed (at least for now) that the answer came from humans who are not trying to sell you something.

No no no no! This is not guaranteed AT ALL on Reddit! Why do you think Redditors are constantly accusing each other of being shills? Because Reddit users are often corporate representatives doing "native marketing" or whatever they call it, and they are not upfront about it, because they are preying on the naivete of people like you!

Seeing the belief expressed that Redditors aren't trying to sell you something is... distressing to me. Half of Reddit is some kind of ad; if not for a physical product, then for a political party or cause, or some celebrity's personal brand.

Please have a little cynicism online.. please, for the sake of everyone..

The only thing that didn't sit well with a lot of people about the leaked memo is that it ignored the quality gap between GPT-4 and GPT-3 and claimed that all LLMs were poised to be on par, which hasn't proven true so far.

What it also ignored (along with some of the comments here) is data ranking. Google didn't just build a search engine by crawling more of the web -- many search engines before it had already done that. Google managed to rank what's relevant and what isn't. Relevancy is hard. Similarly, not all scientific publications are ranked equally. Or for that matter, even publications with a lot of peer reviews or citations can become obsolete through new discoveries.

Reddit's data has value in that it can fill in a lot of the gaps left by more qualitative sources and furthermore the data is user-ranked by a trusted community. This also has implications for specialised querying, for example training on just r/fitness could be fairly useful for that community.

As a side note, other valuable data stores are not just text but voice/video as well. YouTube and podcast transcripts are readily available, for example to Google. Data and ranking is valuable all over again.

>most weights in a LLM are significantly overdetermined. Once you have ingested several terabytes of natural language, you know how to generate natural language.

more than anything, AI/ML/DL enables people to make claims that cannot be refuted by any application of the scientific method. it's literally "not even wrong" in exactly the same way as people claim string theory is.

String theory wishes it had (unique) claims as easily falsifiable as "Once you have ingested several terabytes of natural language, you know how to generate natural language."

Obviously you need a page of fine print to make a claim strictly falsifiable, but complaining about the absence of fine print in casual discussion is absurdly uncharitable unless you have reason to believe that agreeable fine print couldn't be drawn up, and I'm 99% sure that in this case you have no such reason.

>String theory wishes

are you a string theorist? are you sure that you understand the claims made by string theory and how falsifiable they are?


>fine print couldn't be drawn up and I'm 99% sure...

where does your 99% surety derive from?

As someone trained both in mathematical physics and DL, i'm here to burst your bubble: the falsifiability of both disciplines is exactly the same, because they're both premised on knowledge derived from models with an astronomical number of free parameters.

just to be clear: i'm not poo-pooing LLMs here, i.e. the models themselves are fine; i'm talking about absurd meta-claims (like OP's about generating language) derived from observing many such models.

One thing to consider is that while the large transformers used in LLMs might have these diminishing returns, we don't know what discrete jump in model architecture might come next. That model might gain a lot from even more training data. And might gain more from the semi structured data on reddit than the slightly less structured data on Wikipedia and Twitter


I would argue that Wikipedia is much more structured than Reddit.

> we don't know what discrete jump in model architecture might come next

This is true, but as long as these models generate sentences based on what the sentences they've seen looked like, and not on fetching and understanding verifiable facts, these services will do more harm than good overall.

If anything, these models need two sources of training data:

1. The standard language model (as it is now) to be able to generate and process queries and provide understandable answers

2. A database of verifiable factual information that it can query in order to prevent it from completely hallucinating information and then asserting that it is verifiable factual information when asked [0].

Until we can solve the AI hallucination problem, these systems are going to require users to be much more careful with information they're given than most people can manage right now.

[0] https://www.cbsnews.com/news/lawyer-chatgpt-court-filing-avi...

The author links this in the fourth paragraph; their argument is that while compute is not a viable moat, training data may be.

the end game is MechanicGPT trained on the obscure phpBB/vBulletin car mechanic boards from the early 00's

I think from Reddit's perspective, they are extremely upset with OpenAI, in the same way that I'm sure StackOverflow is upset -- OpenAI took:

- The entire corpus of data the community had curated over the last XX years

- The "goodwill" that these platforms had developed towards third party developers in allowing developers to work with their data

- Potentially large amounts of traffic that would normally come to their sites via Google (e.g. site:reddit.com), that is now available instantly (and customized) via ChatGPT

Despite Reddit's probably closer connections to OpenAI than other startups through Y-Combinator and Sam Altman, I wonder how keen they are to actually work with a company that potentially destroyed a ton of their value, right before they were ready to IPO.

Given that sama is a board member of Reddit Inc, and that this is happening after GPT-4 was trained on Reddit data, I wouldn't jump to conclude they're upset at OpenAI.

SO had publicly available, no-auth-required data dumps, which makes it difficult for them to know who is using their data. That surely isn't the case for Reddit, which offered only API endpoints for this content, and I'm guessing you couldn't use .json to get the whole site (rate limits, etc). I wouldn't be inclined to believe that Reddit would miss a new major API user.

This is purely speculation disregarding Hanlon's razor, but I'm thinking that the API pricing comes down to killing two birds with one stone.

* Sama got to train his LLM on Reddit and some best-of-the-internet content there such as r/bodyweightfitness, informed discussions on niche topics etc for free. The catch-up players face prohibitive pricing.

* Third-party apps get killed, bringing the UX to Reddit's control. I think this is more important to Reddit than ad revenue, as they could've simply built an SDK for probably less than this PR nightmare will cost us.

* Their new development platform, however, hints at the Reddit app supporting serving "redditor-made apps" which "can be seamlessly reused between communities".

The description (and the idea to have apps in your app) weirdly reminds me of WeChat apps, and given that Tencent is a major shareholder in Reddit, I would consider the possibility that apps are something they're pushing. That idea has no chance of success without the UX being completely in Reddit's hands, even then it's questionable how it would work on Reddit.

Spez couldn't actually be that detached from Reddit?

While Reddit Inc didn’t provide data dumps, pushshift.io published no-auth dumps of all site data back to 2008, which are still available on Academic Torrents.

Further, you could use the Reddit API to ingest the full firehose of all site data in real time without violating rate limits. This is one of the ways Pushshift made their datasets.

Reddit was a lot more open than any other site approaching their size!

> and given that Tencent is a major shareholder in Reddit

Snoop Dogg owns more of reddit than Tencent. The Tencent investment is very overblown.

Snoop Dogg's investment was $50,000,000.

Indeed. And Tencent was at $150,000,000 three rounds later when the valuation was 7x.

Basic math says Snoop owns roughly twice as much.

Oh shit. Never thought of it that way. Sam is playing real 4D chess here. I’d add that he is also guaranteeing (to some extent) the quality of his data source (reddit) by closing API access, because AI bots can’t easily ruin reddit data without API access.

> I think this is more important to Reddit than ad revenue, as they could've simply built an SDK for probably less than this PR nightmare will cost us.

Who is "us"?

Oh, that's a typo I didn't catch. Sorry for the confusion, I'm not a native English speaker so words slip once in a while.

Is there anything with the new pricing approach that prevents Reddit from giving OpenAI more favorable negotiated rates that are not publicly disclosed?

- suddenly after the ChatGPT success, they realized that they have valuable data

- next step is to stop third-party apps that generated these data

- then they let the moderators show their power

I’m not sure if people care about a CEO being exposed as a liar nowadays, but maybe some former Yahoo managers have another idea on how to destroy more value.

>former Yahoo managers

Specifically the ones that ended up in charge of tumblr, so they can suggest "ban adult content on a platform famous for its adult content"

Part of reddit's API changes were going to prevent NSFW content from being viewed through the API but they reversed course on that.

Indeed. Their API changes make very little sense if the goal is to capture the value the content they host has for training LLMs, especially because I don't think there's much value they can capture there: AI researchers argue that using the content is fair use, and they can simply scrape it if they don't have API access, something the courts have allowed in similar cases.

Eh, I really strongly suspect this “you can use an LLM instead of Google, i.e. as a knowledge model” trend will be short-lived. I hope Reddit sees it the same way. It’s kinda like using better bike infrastructure for sick cyberpunk roller derbies: a nice unexpected use case, sure, but it’s not built for that, and sooner or later the issues will become all too apparent.

IMO, the future will be using LLMs with live search results - which of course will probably require a funding model better than the terrible display-ads setup we have for such content now. So best of luck to Reddit on that front - I hope your IPO fails.

> the future will be using LLMs with live search results

N of 1, but I vastly prefer Google's generative results over ChatGPT. Quality may not always be the same, but a “chat” seems like an awkward metaphor for finding info, and of course ChatGPT has no links to content when I’m worried about hallucinations.

In true Google-search fashion, GenSearch has a lot fewer deep-in-the-weeds technical answers and will push you to simpler results. E.g., if you want to know what chemicals have similar absorption properties to methane… you’re better off with ChatGPT or traditional search.

Ah, maybe. I see it differently though. reddit has never been about authentic content.

- When they started, I believe they boasted creating fake accounts to mimic engagement to help grow community

- It is painfully obvious how much political, government-sponsored, and corporate astroturfing “gets through”. Both human and bot comment farms are real and have no doubt been artificially bolstering ideas and content for years

As I see it…

They have enabled this / looked the other way at this behavior forever in exchange for engagement. The super-high-tech AI chatbots are probably going to be welcomed for their clever content (hence why the CEO couldn't care less about its users anymore).

Their real value has always been having a controlling voice and being able to push viral ideas. The data thing is all hype noise IMO. It was never going to be a serious part of their IPO. Big messy gross company.

There is a decent amount of good stuff on Reddit. Train an RLHF-style model on the shitposting, political grandstanding, porn, etc., use that to filter it all out, and you are left with great stuff, e.g. r/experienceddevs. I would fine-tune on that, filtering out negatively-voted comments, and you have a career advisor. The mods did the data cleaning already!
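(The score-filtering step alone can be sketched very simply. The field names below, "body" and "score", mirror Reddit's public JSON dumps but are an assumption here; a real pipeline would add a trained classifier on top rather than relying on votes alone:)

```python
def filter_comments(comments, min_score=1):
    """Keep only positively-voted, non-deleted comment bodies:
    the mods and the voters have already done the data cleaning."""
    return [c["body"] for c in comments
            if c.get("score", 0) >= min_score
            and c.get("body") not in ("[deleted]", "[removed]")]

# Hypothetical dump records in the assumed format.
comments = [
    {"body": "Negotiate with data, not emotion.", "score": 412},
    {"body": "[removed]", "score": 5},
    {"body": "lol shitpost", "score": -20},
]
corpus = filter_comments(comments)
# corpus == ["Negotiate with data, not emotion."]
```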

Neither Google nor Reddit matters if the Info Explosion problem the Internet produced gets solved some other way.

The main reason such sites came into existence was that there was too much info on the Internet.

These sites were attempts at simplifying the numerous websites and blogs people had to manually discover and track.

They have meandered around that problem and gotten totally distracted by all kinds of other problems (many self-created).

> If the Info Explosion problem that the Internet produced gets solved some other way.

Sure it will - asymptotically.

But it took 18 years to grow Reddit. Even if a solid alternative emerges in a fraction of that time -- that's still a multi-year gap. Plus we're at a significant knowledge deficit (for us lowly humans, never mind LLMs) if Reddit's archives don't re-emerge sometime soon.

(Google search has been braindead for some time now -- so we can leave that source out of the equation).

It’s interesting to think about an information-deduplication solution for internet. Maybe now we can use LLMs and embeddings to create an archive of all (factual) information, then build a search on top of that?
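(A minimal sketch of that deduplication idea: embed each document and drop anything whose vector is too close, by cosine similarity, to one already kept. The 2-d vectors below are stand-ins for real LLM embeddings, and a production system would use an approximate-nearest-neighbor index instead of this O(n²) loop:)

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def dedupe(docs, threshold=0.95):
    """docs: list of (text, embedding). Keep the first document
    of each near-duplicate cluster, drop the rest."""
    kept = []
    for text, vec in docs:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

docs = [
    ("how to fix a leaky faucet", (1.00, 0.10)),
    ("fixing a dripping faucet",  (0.99, 0.12)),  # near-duplicate of the first
    ("best sourdough recipe",     (0.10, 1.00)),
]
deduped = dedupe(docs)
# keeps the faucet text once, plus the recipe
```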

>a company that potentially destroyed a ton of their value

I'm not so sure that's the case, for two reasons:

1) I have not stopped appending "reddit" to my searches or stopped visiting StackOverflow or other Stack* sites. ChatGPT is simply now an additional tool, and while there's some overlap in use cases there's also plenty I can do with *GPT that I couldn't with other sites.

2) To the extent that a training corpus relies on these (or any other) data sources, OpenAI has just made them much, much more valuable! It's sort of like a mining company discovering that someone has an extremely valuable use case for decades of the mining company's tailings. This "someone" may have been allowed to haul away some of it without payment, but now the mining company knows its value, and now it can build a very lucrative business selling access to what remains, and to what will be generated in the future.

That's all highly simplified though. It's of course much more complex than this, much more complex than can be captured in an HN discussion, but we can explore the outlines a bit, and even disagreement will reveal more and more of it.

I agree with you, but not in entirety. This comment from spez[0] blaming the API price changes on LLMs is too far-fetched.

A lot of commenters here on HN have pointed that out too. Unless they build a literal brick wall (paywall) around the site, that data can and will get scraped if the intention is to use it for a model.

You could get it down to a science where you only scrape the new data whenever you train the next model.
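(That incremental approach could look something like this: remember the newest timestamp from the previous crawl and only collect posts created after it. The `created_utc` field matches Reddit's public data format; the rest is a hypothetical sketch:)

```python
def new_since(posts, last_trained_utc):
    """Return posts created after the previous training snapshot,
    plus the new cutoff timestamp to remember for next time."""
    fresh = [p for p in posts if p["created_utc"] > last_trained_utc]
    cutoff = max((p["created_utc"] for p in fresh), default=last_trained_utc)
    return fresh, cutoff

# Hypothetical scraped records.
posts = [
    {"id": "a1", "created_utc": 100},
    {"id": "b2", "created_utc": 250},
    {"id": "c3", "created_utc": 400},
]
fresh, cutoff = new_since(posts, last_trained_utc=200)
# fresh contains "b2" and "c3"; cutoff == 400
```

Persist `cutoff` after each training run and feed it back in as `last_trained_utc` for the next one.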

[0]: https://old.reddit.com/r/reddit/comments/145bram/addressing_...

Yeah I think their real intention is to kill off third party Reddit apps so people are forced to use their own app, with all its tracking and garbage.

Reddit on mobile browser is a case study of insane dark patterns

Click to sort comments while not logged in? A popup appears asking you to log in, with no close button. You have to click outside the box, but that’s not easily apparent.

View an 18+ subreddit? They let you browse for 30 seconds, then ask you to log in.

Visit the site? Ask to login or use their app.

At this point, I’m more motivated to do anything but use their app if they’re this hellbent on getting me to download it.

This exactly; it’s unusable and will only get worse. When “old” stops working it’ll be a sad, sad day. This blackout has made me realize how deeply I’ll miss reddit.

The thing is, to me, reddit would never be a viable enterprise without the volunteer moderation. If reddit had to pay for that, they’d never IPO. And if they think the moderators will stay and be exploited, they’re wrong. Thus, if they go through with this, they’ll fail as a public company. There is no there there.

Yeah they've been trying to kill old.reddit and third party apps for a long time. The experience degrades more and more as they add new features that are only fully supported (or supported at all) on new.reddit and the official app. This has been going on for years.

I'm convinced that this would have happened with or without OpenAI, especially with the mirage of an IPO on the horizon. Controlling the client to show ads and siphon data is just too valuable. Maybe the OpenAI thing pushed them to speed up the process.

I'd also point out, it's not like the pricing of the API is going to be uniform for everyone anyway - spez has already said/conceded that certain users of the API can continue to use it for free (above the free limits). At that point it's a deliberate decision to make 3rd-party apps pay the same price as AI companies; if they wanted to keep 3rd-party apps around they could have set a different, more reasonable price point for them or gone another route (e.g. requiring 3rd-party app users to have Premium).

I think he's bullshitting even about the cost of maintaining the API. They could just fold it into Reddit Premium: "want to use the API? Pay a few bucks and use the app of your choosing." Even $2 would easily cover the cost of lost ads and such. Then a much more expensive tier above that for "data ingestors".

> that data can and will get scraped if the intention is to use it for a model.

How would that work from a legal perspective, though?

Let's say there's no paywall and Reddit's terms of use disallow unauthorized commercial use of their data. Wouldn't that be a violation of Reddit's terms and liable to some legal procedure?

It is OpenAI's and Microsoft's position that obtaining, using for training, and displaying data via AI is fair use.

I really really really hope the copilot and other suits are successful. The idea that you can literally steal content in the name of “””AI””” and profit off it is just insane! How is it not like copyright infringement? The Warhol case is just one step behind training data. It’s basically the same idea.

Reddit's terms are irrelevant. Unless Reddit requires a login to view its site (which would also prevent Google indexing), anyone can view the data without agreeing to the terms.

The only question is copyright, but I find it hard to argue that LLM training is not sufficiently transformative in 99% of cases.

Plus Reddit doesn't own the copyright to the posts, the users do.

> Wouldn't that be a violation of Reddit's terms and liable to some legal procedure?

It won't, given the LinkedIn v. hiQ precedent.

How does Google use Reddit's data in its models? You can access most (all?) Reddit pages without hitting Reddit at all via the "Cached" link in the search results.

Does Google have a special agreement with Reddit (and all other sites?) or is it legally "fair use" to reproduce web pages that are available freely online?

I think that's a different legal question than LLM training, but webpage caching has been found to be fair use based on a number of factors: https://www.pinsentmasons.com/out-law/news/google-cache-does...

In the US, at least, the courts have made it clear that scraping is legal.


Interesting question but sadly I am in no position to answer it.

I think there are probably issues to address with scraping it blindly:

- Can Reddit imprint its data somehow? A watermark?

- Can Reddit prove that certain type of information appeared on Reddit first and thus that serves as proof its data was used without authorization?

If OpenAI can't work around this, I'm not sure they would be willing to cross any lines in terms of copyright, they've already done it with ChatGPT and I am guessing rules are only going to get stricter on this topic.

I think the bottom line is that Microsoft’s (and thus other for-profit AI initiatives) stance is that any and all data is fair game regardless of license or authorization. This results, in their opinion, from the fact that the AI alters the data, changes the output, and is otherwise “inspired” by the data in the same way an artist might be inspired by another without copyright infringement.

This sounds very dodgy. Will somebody be checking the degree of such "data alteration" and verify that the "AI" is actually inspired rather than copying?

To me this feels like it's opening the door to the elimination of copyright, as any algorithmic layer interjected between scraped data and end users could claim to be "inspired".

Welcome to the discussion lol. People have already provided examples of copilot producing niche code verbatim thereby proving their intuition incorrect. It’s a whole mess that will take years to be cleaned up by new legal conventions.

If my compiler was “inspired” by leaked Windows source code and altered it into a new form then I think their opinions on the matter would be very different.

Not a great example: if Apple’s code leaked, theoretically they wouldn’t include it in the training as it’s not supposed to be seen by the public. If it’s public you can be inspired by it (so their logic goes).

The true malicious, and probably effective, approach is to silently poison outputs if you suspect automated behaviour. These large language model things might be useful there. Or the old school NLP stuff.

Reddit doesn't own the copyright to it, just a license. That, plus public web scraping is legal. Reproducing the data directly might violate the users' copyrights, but through an LLM it is assumed not to.

>Unless they build a literal brick wall (paywall) around the site, that data can and will get scraped if the intention is to use for a model.

The irony if lowtax and :10bux: ends up being right all along about how to build a community...

Why are they overlooking that their biggest value as an organization is selling limited truth to AI bros who want the _true_ up/down vote information, rather than the fake numbers they decide to provide to the plebeians?

One thing that's struck me which I assume only reddit has, is the social graph data which holds all the relationships trends on who votes on what/who. There's limits on that based on how pseudoanonymous it is and the ease of making new accounts, but that seems valuable (at least to an outsider), although possibly in a "fighting yesterday's war" way as 'true' social networks like Facebook get value out of posts, which would be different to the way ML training may value it.

I think the problem is that the votes aren't necessarily representative or "true" either.

Isn't Sam Altman on their board? I wouldn't be surprised if he was the one who pushed for these API changes, to restrict others from building a competitor to OpenAI.

> There is no question that Reddit is extremely valuable as training data. How often do you append “reddit” to your searches?

Is it?

When I worked in networking, and later in web dev, I found Reddit to be a TERRIBLE place for Q&A-type situations.

Answers on Reddit are often skewed by truthy answers from people with limited perspective in the industry who are surprisingly sort of militant about a given topic.

For example, the networking topics were often "small shop" focused, with users unaware of very common enterprise-level practices, and when they saw someone mention them they would react really poorly. The result was advice tailored to small shops with limited staff, so you got ideas that "work" but would perform poorly at scale (at best), and at times could present security risks or miss cases where more performance was needed and better solutions were available given a sizable staff and planning. They weren't wrong, but the answers were skewed.

Subreddits are weird: they might be about a topic, but the community often moves that topic / develops a specific POV within it… people without those views tend to abandon those subs, and those who remain aren't necessarily "right". Reddit isn't Stack Overflow, where being correct is at least superficially valued and we can run code and see if it works. The top answer on Reddit could be wrong, and anyone taking issue with it may simply not answer. Heck, there are whole subreddits dedicated to wallowing in misinformation…

Now I don’t know if Q&A is the start and end of LLM training (not my area of expertise at all, I am open to the possibility that what I’m talking about isn’t a problem at all) but Reddit as a source makes me wonder what the results would be.

Yes. For anything from opinions of movies/games to DIY advice to nerdy stuff like the best watch to buy in a given price range or the best synthesiser. Or people's opinions on a particular episode of TV! Or basically any hobby.

It's not great for business stuff because there are way more people using e.g. AWS for their hobby project than running multi-million $ businesses on it.

What are the alternatives? Facebook and Discord are walled gardens. Quora is a shithole. The rest of the internet is blogspam and significantly more dubious than your average Reddit comment!

To be fair, I found the best alternatives are just friends, and Twitter.

In Twitter you can follow people whose content you find useful. If you're a Blender user, for example, you can follow 5 Blender tech artists and your feed will be now full of cool tips and art.

With friends, with more or less the same skillset and hobbies as you have, you can ask them whatever you wanna ask. Heck, you'll learn a lot from them without even asking!

I'd just avoid using Reddit or forums to decide if I'm gonna use or buy something.

For example, if you ask the internet what's the best OS, people will say Linux. Best DAW? Reaper. Best game engine? Godot. Best headphones? Whatever is new and costs $100 on Amazon.

Meanwhile, friends who make money, are in the industry, and have businesses just prefer Windows, FL Studio and Unity. They aren't on Reddit evangelizing the software or hardware they use - they just use it and make money. :)

I like my friends a lot but we don't always agree. With reddit I can read like 5 opinions of a game I want to play and I can instantly feel like "this person doesn't like it and their reasons are similar to things that I don't like too" and make an informed opinion.

The point you make about "professionals don't have time to use Reddit" definitely applies though. /r/synthesizers for example is a great sub if you want to spend thousands on hardware. Not so good for making real music! Although I do feel like there are a lot of Unity game devs on there. Maybe you just need the right community.

Literally anything else. Having seen the general/highly upvoted Reddit opinions on things I have domain expertise in, and having seen the outsized effect that angry subs can have on media opinions, there is absolutely nothing I trust on Reddit at face value or without having interacted in very small subs long enough to know who has useful opinions.

The only consistently good advice I see on Reddit is "call a local expert/call a lawyer." Anything else is almost certainly abject nonsense.

Don't look at highly upvoted opinions. Look for barely upvoted opinions of everyday users and aggregate them.

> Quora is a shithole

It's actually funny how much this is true. They used to be a quality source of info, but these days the default option is to show "all related" so you don't even get direct answers to the question you searched for, ads from top to bottom (and even somewhere in between) and just hordes of bots answering nonsense. At least they made it official with Chatgpt integration now lol.

Quora is a shithole because you can earn money answering, so people who have no knowledge at all are putting together convincing articles (probably with ChatGPT). And the mixing of unrelated questions into the answer stream is extremely annoying. I added it to my uBlacklist, anyway...

I make it a point to downvote and report as spam the promoted answers that have no relation to the current question.

Agreed -- before making any major purchase I start at Reddit. The Internet as you say is a dumpster fire of blogspam top-10 lists where the top of the page is just STR."Updated in \{current_year}!"

> walled gardens

Walled garden is marketing speak. Don't use it. Use curated/controlled/whatever describes the closed nature.

It spins what would otherwise be a negative term into one that gives images of delicious fruits, vegetables and beautiful flowers.

My favorite alternatives are email and going outside.

Going outside is a great alternative to mindlessly consuming Reddit, but we're talking about researching opinions here. That's something that Reddit is actually good for.

Perhaps GP meant that, since desire is the root of suffering, rather than desiring that your server work and suffering and striving to return it to functionality, you could go outside and meditate on the grass until you relinquished your desire for a working server.

Or your desire for an umbrella that doesn't break in the wind, or the best public estimate of when Starship will launch, or any of the myriad other desires that cause people to append "reddit" to their searches.

Yeah, I could just meditate the entire day instead of doing anything. Why have desires at all?

So is going outside though. In fact, I think talking in person is usually a much better way of actually understanding someone else's opinion, because normative rules of social behaviour tend to stop the discussion from becoming a flamewar.

I could spend hours just neutrally asking someone questions about a particular view they hold, why they hold it, etc. I think you gain a lot more insight into how people think this way than through an online forum.

Who cares about "particular views they hold"? We're talking about fridge or tv brands, not some personal stuff.

>Who cares

I do.

I'm sure a random passerby is a great source of information on how to configure network equipment /s

My experience in the past 5-10 years on Reddit is:

- the voting system is about what people want the truth to be, not the truth.

- the users can be very easily gamed in comments to vote one way or another just by the initial voting being negative or positive (aka most just vote with the trend)

- the opinions all come from a bias of urban and major metro area people. It's painfully obvious they don't understand any mindset outside of this one.

- it wasn't always this way.

I agree with you, the quality went down as the quantity went up, and it may be possible to use the comments, but using the vote data as a signal of truth is going to produce a terribly narrow minded AI.

> the voting system is about what people want the truth to be, not the truth

Probably the best example is legal type advice.

Personal example: The whole situation with internet archive and their e-book lending process. I was vocal that they didn’t have a leg to stand on and would lose. The comment was just about the legal situation.

I was downvoted to hell for that opinion and identified as some sort of book publisher advocate/ internet archive hater. As was anyone else predicting a loss or just raising questions.

Personally I love the internet archive, but I didn’t think they had a chance at winning.

But go by the comments and it was a slam dunk win because of some idea of “freedom” the comments and voters had. And yet they lost the lawsuit…

Interestingly, here on HN the responses seem to be generally more balanced. That’s not an endorsement of HN all the time, but it was an interesting contrast.

I see the same thing with the situation surrounding the game "Dark and Darker". The TL;DR is that the game is likely stolen intellectual property and there is a lawsuit directed at the publisher and one of the developers. The lawsuit will likely prevent the game from being released, so the prevailing opinion on reddit is that the whole thing is an unjustified shakedown, despite the copious amount of evidence that the publisher stole intellectual property.

I struggle to find the right word for this, but I find that Redditors with 10K+ comment karma have an "attitude" that pervades how they respond on virtually every topic. There's this sense of faux-civility and if you have an opinion that slights them in some way the mask comes off and you end up with hostility or sardonic replies that completely misconstrue your point. Then there tends to be a dogpile with downvotes.

Yes, I know reddit can be valuable for niche and non-contentious topics, but such examples are becoming increasingly rare. Snark and combativeness pervade the site, and they seem to be a function of the voting system and of monkey-see, monkey-do from how commenters are rewarded in the default subreddits for this kind of behavior.

I dread AI models being taught to comment like a redditor.

> Answers on Reddit are often skewed by truthy answers from people with limited perspective in the industry who are surprisingly sort of militant about a given topic.

Equally said of HN. Folks with a few years of dev or startup experience asserting this or that all over the place. I'm sure less experienced folks read these "truthy" answers and take them as gospel in some cases.

That's really only an issue for technical topics. Stack Overflow would obviously be a much better source of quality training data for the kind of questions you presented.

But for most things outside of that, Reddit is by far the best online source that actually represents answers you would get from real people. Questions like what the nightlife is like in x city don't have a single true answer, and they thrive on Reddit's diversity of thought.

> Stack Overflow would obviously be a much better source of quality training data for the kind of questions you presented.

I don't think we need an LLM that only responds with "Your question was marked as a duplicate"

Yeah, interesting point.

I'm curious what your take is on ChatGPT's response to the same sorts of questions?

I also work in a specialist field, and I've come to understand that my instinct for when a particular question is going to get me a bad answer on Reddit or SO is also finely tuned to when GPT either starts hallucinating or providing bad info.

> When I worked in networking, and later web dev work I found Reddit to be a TERRIBLE place for Q & A type situations.

Training in this sense is about how people use language, not what the right answer is. Among other sources, the exact thing you describe is a contributor to some of the things within the (in my opinion misnamed) internal taxonomy of 'hallucinations'.

Right or wrong - reddit is human beings using (mostly English) language to speak to each other, and so if you want to train an ML model to do that - it's a great source.

Not only that, but 80% of conversations on reddit are idiotic jokes and puns.

This is absolutely true for any subreddit that shows up on the front page but many subreddits with smaller user bases or draconian moderation policies (like /r/askhistorians) can have excellent quality.

> Answers on Reddit are often skewed by truthy answers from people with limited perspective in the industry who are surprisingly sort of militant about a given topic.

That's kinda sorta exactly how ChatGPT answers stuff though, no?

I may be in the minority here, but if I want the opinion of Redditors on an issue, I will use a search engine to look for it specifically, thus knowing the provenance of the information I am receiving.

I don't really want the corpus of Reddit data influencing the output of a generative AI model...because it's Reddit, after all...

Even though I am pretty sure it is already included in the training dataset already...

I could be wrong, though...

I genuinely don't understand the appeal of AI for search. The provenance of information is just as important as the resulting information for pretty much anything I search for. I almost never accept a single search result as authoritative unless I'm pretty familiar with the source.

Reddit in particular seems like a terrible set of training data. Pretty much any opinion is going to have a counter-opinion somewhere in the thread. Synthesizing multiple comments together requires nuance based on circumstances that I don't think I would trust to an automated process- which heuristics were applied, what was the reputation of the people on either side, etc.

Hell, I don't even stop at the first recipe I find if I'm looking for something new to cook for dinner. I look at a couple variations on a dish first.

> I genuinely don't understand the appeal of AI for search

If you're good at googling the flow is: Ask the question > Clock which result isn't spam and click it > Figure out how to dismiss the cookies gate without accepting the cookies > Dismiss the Google login box > Dismiss the popover pushing you to install an app > Scroll the page or Ctrl+F to find the answer

With ChatGPT it's just type your question and your answer is appearing right away.

Of course I understand ChatGPT shouldn't be used for this because it will lie to you and make things up. But I am saying that's why you'll see people who don't know that preferring it over Google.

Aaand this is basically the story of silicon valley over and over again.

Build a product that's more convenient, and people will use it. Google is so full of shit now, and even answer boxes are below 4 ads, it's just way more comfortable and efficient to ask chatgpt.

This morning I wanted to know how many calories there are in a breaded chicken breast. Chatgpt told me in 3 seconds after asking. Google would have been way more time. (sidenote: I also hate google, so I'd much rather use chatgpt anyway)

Also Google (and distressingly, DDG as of late) sanitizes the hell out of the SERPs to only present you with “approved viewpoints”.

And I’m not talking about fringe Q Anon type stuff: the other day I was looking up the specifics of China’s “Blue Sky Initiative” climate change policies, and the only thing Goog/DDG would show me (despite several attempts at rephrasing the query) was Western industry think tanks bellyaching about how the policies affect profits. It took me a good ten minutes of refining my search before I got an English translation of the actual policy bullet points.

I can’t imagine this being such a PITA on 2013 era Google.

out of curiosity, how did you verify the information?

I didn’t, but it sounded right. I know it might be incorrect.

However, I’ve seen plenty of bad information in google answer boxes too. And finding it in actual search results is going to be way more time. It’s not a life and death question.

It lies like crazy on surprising things. Database parameters for an enterprise provider, for example: I've seen hallucination in 5% of cases. That's _bad_ when it's taken as authoritative.

I like that you can get results quickly and it solves the "I don't know what I'm looking for" problem.

I'm learning typescript and ran into a weird typescript construct the other day, I threw it into chatgpt and asked "what is this" and it explained it to me. I'm not entirely sure if pasting in a bunch of braces and parentheses into google would find me the same result.

Don't get me wrong, thats a great use case!

None of the results are surprising if you simply read what OpenAI writes about their own machine. It's an autocomplete engine.

An autocomplete engine with "sparks of AGI", because gotta get that sweet sweet VC money and market it as the Next Big Thing.

My googling flow: Ask the question > Clock which result isn't spam and click it > uBlock Origin blocks everything > Ctrl+F to find the answer

Well, if you make up the answer from nowhere yourself, you save the work of creating a question and typing it on the ChatGPT prompt.

I mean the internet lies and makes stuff up too, being first on Google results has nothing to do with truthfulness.

That's the glory of a search engine. You have multiple sources available to review, compare and - with enough experience - learn which sources you can more readily trust than others.

AI offers none of the above. Ask the same question twice in a row: maybe you'll get a different answer, maybe you won't. You won't know which of the two answers are hallucinations, which are true facts it was trained on, and which are bogus things it was trained on. There's literally no provenance for the results - no trail, no references, nothing, because it's really a sophisticated game of "Whose Line" where everything's made up and nothing matters.

painful how accurate that is

I suspect this is due to a fundamental misunderstanding of what tools like ChatGPT actually are, what their capabilities are, etc. I think there is a large population of people who think they're simply more sophisticated versions of Siri. In fact, I'd go so far as to say that the vast majority of people don't even realize they're powered by different things, but instead just see the "marketing term currently known as AI" as a singular monolithic entity, and all related technology just gets lumped into the same category. That's miles away from understanding how "use a language model to interpret what I'm asking you to search for, then plug that into a normal search engine, and read me back the most relevant result" is fundamentally different from "use a language model to predict the statistically most likely next sequence of tokens based on the ones you gave me."

I don't mean this as a way of thumbing my nose at people who don't know this stuff, but rather, to point out what a massive failure the completely nonexistent consumer education surrounding ML products has been.

It's funny that you mention recipes at the end because that's exactly what I like to use ChatGPT for. Every single recipe site has been so corrupted by SEO that you need to scroll past 12 paragraphs of nonsense to get to the actual recipe, and more often than not once you actually get to the recipe part you'll be bombarded with popups about newsletters and cookies or some late-loading ad will cause the view to shift past the step you're trying to look at.

ChatGPT is great for distilling that experience down to just a simple recipe for whatever it is I'm looking for.

Tangentially: props to AnyList for their amazing plugin that scrapes recipes off of sites like this and stores them in an easy-to-use format.

I think it's pretty safe to say if any of these AI-based models gain dominance, they will eventually add junk to their output if that junk makes money.

My prediction: By 2025, you'll be able to purchase sponsored sentences and sponsored paragraphs in ChatGPT's (or whoever's on top) output.

I just used ChatGPT4 this past weekend to come up with smoothie recipes and then a shopping list to take to Whole Foods. Can also give it ingredients you already have and ask for recipes. It’s a killer application for me ha

When you're not sure what to Google (such as when working in an unfamiliar domain), asking ChatGPT for help and then researching its answer with normal searches is very effective.

The appeal is that AI answers your question. Maybe a wrong answer. Maybe from questionable sources.

Search, on the other hand, doesn't answer your question.

> I genuinely don't understand the appeal of AI for search. The provenance of information is just as important as the resulting information for pretty much anything I search for.

Search requires work on the part of the user to distinguish between good links and bad. AI is an oracle that just tells you what you're looking for.

Now you and I might think this is a terrible way to evaluate the veracity of information. But think about the new generation of mobile-native users who were raised on simplistic discourse in tweet-length messages, and would rather watch a 1-min video on a topic than scan search results for 30 seconds.

For this group, searching for information where more than 2 clicks is required is going to be "too complicated" and a "bad user experience".

> AI is an oracle that just tells you what you're looking for.

LLMs are oracles that arrange words in a probabilistic order that is grammatically correct and may be factually correct. Unfortunately there's no way to evaluate the probability of confabulation with any of the LLM chat bots. The distribution of confabulations is also not regular, predictable, or fixed. So you can't ever say "ChatGPT is bad at X", because it can be bad today, good tomorrow, then bad the next day.

It sounds like we are dooming an entire generation by catering to a maladaption of technology-addiction induced ADD.

Idiocracy and The Machine Stops ... All At Once.

Simple: get the search results, have LLM read and evaluate them.

It's devilishly difficult to get citations in there; you're not going to do it with langchain, but it's possible (cf. Bing).

Python x langchain x LLMs makes it very easy to create demos so there's been an initial influx of meh stuff, I'm very excited for 6 months from now.
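The search-then-summarize-with-citations pattern described above can be sketched without any framework. This is a minimal illustration, not Bing's or anyone's actual implementation: the prompt wording and function names are my own, and the LLM call itself is omitted. The idea is to number the retrieved sources in the prompt so the model can emit [n] markers, then parse those markers back out to render real links:

```python
import re

def build_prompt(question, results):
    """Number each search result so the model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] {r['title']}: {r['snippet']}" for i, r in enumerate(results, 1)
    )
    return (
        "Answer the question using only the sources below.\n"
        "Cite sources inline as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

def extract_citations(answer):
    """Pull cited source numbers back out so the UI can show real links."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
```

The hard part the comment alludes to is getting the model to actually ground its claims in the numbered sources rather than citing decoratively; the parsing side is the easy half.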

A niche provider that doesn't do generative AI very well but does "AI powered search with citations" is perplexity.ai

I've used it for rather obscure queries and liked the summary the AI wrote, as well as the links to dive deeper. I imagine that's what AI search will look like across all providers before long.

That site looks incredibly good.

I threw in a question Wikipedia does kinda answer but leaves a lot of details off, and it enumerated a set of possible answers, with my favored option linking into a forum where it looks like less than a dozen people post, but with people that tried all variations of it and know all of the details.

DDG and Google would never show me that site (yeah, I've tried).

Just to add, that forum has better answers than the ones GPT created directly on the search engine. If it was a "consult the Oracle" interface instead of a search engine, it wouldn't be useful at all.

This is the logical endgame for the search engine form factor. Remember watching mom ask google questions using complete sentences and punctuation with a “please and thank you”?

The appeal of a conversational response was too great to wait for a solution to the problem of training data quality.

LLMs are the worst method for information retrieval, except for every other method we've tried.

The appeal is that now regular search engines are so bad at giving you useful content that using a LLM is now the equivalent of "google-dorking" to find relevant information.

Only because the LLM didn't have any AI generated blogspam to get trained on. That's going to change very quickly.

Any non-trivial LLM that works by scraping the internet is already sufficiently advanced to be able to classify blogspam.

The issue Google has with blogspam is they don't want to filter it because it inevitably uses Google's advertising. So Google gets to self-deal traffic to blogspam which makes them money. They're not incentivized to actually eliminate blogspam.

LLMs aren't automatically disincentivized from training on blogspam so they're not going to avoid it either.

ChatGPT-like AI is the natural next step in the arms race of SEO vs. search. Right now it is a mechanism to fight SEO. It will continue to work for a few years.

You are absolutely right, minority or not. I actively avoid reddit because of the majority of users there. The fact we're seeing government agencies start wading into GPT models is frightening. We could find ourselves in a tragic comedy where all of the massive institutions and enterprises around us are addressing their serious issues via redditors by proxy.

>We could find ourselves in a tragic comedy where all of the massive institutions and enterprises around us are addressing their serious issues via redditors by proxy.

not with a bang, but with a "this. take my upvote my good sir"


depends on where you go on reddit. i've learned a lot of things for my hobbies. for example r/espresso and r/roasting are a source of good information. there are also places like r/askhistorians and many many others. reddit is not just r/funny.

niches and hobbies are dominated by beginners and ideas that can't be challenged (there's a word for this, I can't remember what it is). people aren't just talking about their experience on major subs though.

the opinions i run into real life can be very different then with people in the real world. the communities online are made up of the kinds of people who spend their time online, and the content you see on reddit is generally from the people who spend enough time on reddit that they want to browse new. these arent average people. only a minority professionals are actually engaged in reddit. even those ive seen run off because they don't agree with the acceptable opinions

>the opinions i run into real life can be very different then with people in the real world.

I think you messed that sentence up, but I get what you're trying to say....

And I think it's this. The opinions you get from people 'IRL' are not apt to be as strong as the ones online, and/or will run into recency bias.

For example, it's very unlikely you'll actually meet someone that has used 10 different coffee makers because they wanted to see which one was best. Online on some subreddit, you're very likely to meet someone who has done exactly that. Of course those people with strong opinions are the ones that are apt to post most online.

So whose opinion is wrong? Neither. That's why they are opinions.

>And I think it's this. The opinions you get from people 'IRL' are not apt to be as strong as the ones online, and/or will run into recency bias.

No, I've noticed what he's talking about, and it's not the strength of the opinion, it's what the opinion is. Reddit has a moral system that's completely misaligned with real world morality, where having a child or being autistic makes you a bad person, even if you didn't do anything wrong. You can find some really weird takes on r/AITA, which ironically points out that the subreddit has a fucked up sense of morality in its highest rated post.

>having a child

Not sure if you've noticed but younger generations that go out and do things are big into having children. Now, some groups on reddit are more extreme on that, but the general trend of Americans at least is to tell the act of having kids to screw off.


But to answer your question, it's not Reddit that has this moral system, it's social media in general that has a moral system that does not match reality (kinda). The loudest idiots tend to get voted up, moderates disappear in the bulk of posts. Binary voting systems on sites tend to amplify this. Content suggestion systems tend to lift up contentious posts for engagement. Welcome to the internet.

But coming back to (kinda). This is becoming reality. Behavior IRL affects behavior online. Online behavior affects IRL behavior. People don't talk to their neighbors these days in most places. Communities are spread all over the earth.

> there's a word for this

Hegemony, perhaps.

I think what the article tries to say is that OpenAI have already scraped Reddit for training data and with the recent API changes and subreddits going dark, new competitors in the AI space won't have it as easy to get the same training set.

Honestly this sounds like a shower-thought post. With even basic research, Internet Archive and The Eye have Reddit historical data freely available. My desktop PC has all comments and posts from 2007-early 2023, in a convenient jsonl zst. It's only 3TB.

The point isn't really to discover the opinions of redditors, it is to ingest the 'common sense' things that you would never find out from reading scientific papers or even books.

You’re not alone in thinking that. The value in using Reddit as training data would be the question-and-response format of threaded comments. The downside is that the vast majority of comments on Reddit are very low quality and repetitive. You’d have to do a lot of filtering to make it usable, and what you’d be left with would be a much smaller pile of training data.

Repetitive is good tho, it provides validation. I'm sure building a good AI means adding a grading system for repetitive facts; in reddit's case, it may even incorporate upvotes. The problem would be runaway sarcasm/irony/memes, which the llamas can't handle.

Repetition is not good, because too much of it leads to overfitting.
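The filtering step these comments debate usually starts with deduplication. A minimal sketch (my own illustration, not any lab's actual pipeline) that drops exact repeats after normalizing case and whitespace might look like this; real training pipelines go further with fuzzy methods such as MinHash to catch near-duplicates:

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def dedupe(comments):
    """Keep only the first occurrence of each normalized comment."""
    seen, kept = set(), []
    for c in comments:
        h = hashlib.sha1(normalize(c).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(c)
    return kept
```

This catches the "take my upvote" class of boilerplate reply but not paraphrases, which is exactly why overfitting from repetition is harder to eliminate than it first appears.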

I don't think you're in the minority. I want to actually read the thread and see what random individuals thought about a thing (with their different views intact), rather than getting an aggregate summary. Often the overlooked 1-point comment (overlooked because it answered the question a month after it was raised) is the one you're looking for.

One of the big values of reddit is that it serves as an easy index of non-garbage websites on the internet. This is exactly how GPT-2's WebText corpus was built: they threw away all the websites from their crawl that were not linked to from Reddit.

And by garbage I mean literally data/illegible/etc. that would ruin the pretraining of the model.
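The link-as-quality-signal filter described above reduces to a set-membership test over canonicalized URLs. This is a simplified sketch (the canonicalization rules here, dropping scheme, case, and trailing slashes, are my own assumptions; real crawl pipelines handle far more URL variants):

```python
from urllib.parse import urlsplit

def canonical(url):
    # Drop scheme, lowercase the host, and strip trailing slashes
    # so URL variants compare equal.
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")

def filter_crawl(pages, reddit_links):
    """Keep only crawled pages that some Reddit post linked to."""
    allowed = {canonical(u) for u in reddit_links}
    return [p for p in pages if canonical(p["url"]) in allowed]
```

The appeal of this approach is that the quality judgment is outsourced to humans who found the page worth sharing, rather than to heuristics about the page's contents.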

In my opinion, it comes back to search engines not giving you the best answer. If you're looking for the best headset with whatever feature, you'll find mostly sponsored reviews, or at the very least hard-to-trust reviews.

On the other hand, some random people's opinions in a Reddit community with -apparently- no further agenda seem somewhat more honest.

Not that the answer is better but it gives you new data points in your search.

Basically, it's not one or the other, you can use both tools and that's probably why it makes sense to include Reddit in AI models (which do this job for you automatically)

Seriously hope OpenAI is stripping all the canned meme replies that go on for hundreds of sub threads and end up as the top comments in threads before they train their models. How would you even do that reliably?

They don't. That's why some tokens from /r/counting mess up their models: SolidGoldMagikarp and " davidj12" or something like that

Label data and build a meme classifier? Does not have to be perfect to be useful. But yeah, data curation is probably a huge endeavor at the companies making Language model that are fit for production. Like in practically all applications of Machine Learning.

But the Reinforcement Learning from Human Feedback (RLHF) is also one of the key tools to getting useful outputs.
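As a rough illustration of how vote data could feed RLHF-style training (this is a sketch of the general technique, not OpenAI's actual pipeline; the thread structure and field names are invented), sibling replies to the same prompt can be turned into the (chosen, rejected) preference pairs that reward models train on, using net score as a crude human-preference label:

```python
def preference_pairs(thread):
    """Turn sibling replies into (prompt, chosen, rejected) pairs,
    using net vote score as a crude human-preference label."""
    pairs = []
    ranked = sorted(thread["replies"], key=lambda r: r["score"], reverse=True)
    for i, better in enumerate(ranked):
        for worse in ranked[i + 1:]:
            # Only emit a pair when there is a strict preference signal.
            if better["score"] > worse["score"]:
                pairs.append((thread["prompt"], better["text"], worse["text"]))
    return pairs
```

The obvious caveat, raised elsewhere in this thread, is that upvotes measure what people want the truth to be, so a reward model trained this way learns popularity, not correctness.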

I'm sure the non-technical public would be interested in a chatbot fed on reddit data, which is more interesting than how valid the AI model's predictions are for the people making money off of it.

Reddits community is passive aggressive, thinks it’s really smart, loves memes. I certainly hope OpenAI doesn’t view it at as some kind of a source of truth

No, they view it as a language model. You can tell lies with the same language that you can tell truth with.

If you want truth, you don't want language, you want references to reviewed work. You also want things like 'show your work' chains of thought. These are really different things.

If I tell GPT "make up a story" I don't want it coming back and saying, sorry I can only tell the truth.

You just described the C-suite at most SV companies.

I feel like I’m taking crazy pills: Reddit is on CommonCrawl, which means it’s all available without API access, permanently from AWS’s CDN. These discussions are completely moot because the path of least resistance was already available and in use by OpenAI.

I need to stress this more obviously in the article, but the goal is to protect Reddit's future data.

Why would Reddit block CommonCrawl's access to future data?

Yeah, all the publicly available reddit data is freely available and the cat has been out of the bag for a while.

Reddit has a lot of other private information though. The entire corpus of moderated / censored comments is the internet's best alignment dataset. The real upvote/downvote numbers make for perfect RLHF.

Hell, Reddit could remove ads and replace them with paid RLHFing bots. Charge OpenAI money to selectively deploy AI bots to certain QA subreddits. All AI generated comments get deleted within a week. That way it doesn't lead to comment spam, but the bot still gets amazing upvote/downvote feedback and a ton of direct comment-reply feedback. 3rd party apps can't 'ad block' it either, since only Reddit knows which comments are fake.

(psst, reddit, if you're gonna steal my idea, I want to be paid for it)

Good call, but is upvote / downvote data in there? I'd imagine OpenAI is doing something with that.

Thank you, I had no idea about that site

If you accept that Reddit could be OpenAI's moat, I think you could explain Reddit's behavior even without OpenAI's intervention. Like the premise here is that Reddit's data is super valuable and so OpenAI would want to stop others from getting it. But it also makes sense to say Reddit's data is super valuable so Reddit would want to limit access to it and be able to charge a premium for it.

That said, I'm not totally sold on the idea that Reddit has truly unique and valuable data that's a cut above what can be found elsewhere. After all, several of the so-called "glitch tokens" in GPT's tokenizer are Reddit user names that occur over and over in threads of people just counting. See: https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldm...

> If you accept that Reddit could be OpenAI's moat, I think you could explain Reddit's behavior even without OpenAI's intervention. Like the premise here is that Reddit's data is super valuable and so OpenAI would want to stop others from getting it. But it also makes sense to say Reddit's data is super valuable so Reddit would want to limit access to it and be able to charge a premium for it.

I don't think I have a rebuttal, honestly. OpenAI aside, Reddit has clearly realized that their data is valuable, perhaps more valuable than their ads business. This is why my piece is frankly very speculative: the strategy aligns really well and the financial interests are there, but I wish I had a great argument for why my theory is more likely than simply Reddit trying to capture the value of their data.

It’s a fun thought experiment, but I tend to revert to the simpler explanation:

They’re cutting off third party apps to increase their control over the user experience, to show more ads.

The relationship with OpenAI might also be a thing, but I think it’s the ad numbers they need to juice for their IPO.

I think that explanation is certainly the simpler one (and thus more believable), but one thing that stands out to me is that Reddit's ads are really bad. I don't know how much of a business they are. I think the value as a corpus is much greater. If I owned both Reddit and OpenAI, I'd prioritize OpenAI, since its upside is bigger.

Of course, I think another good explanation is that, OpenAI aside, Reddit simply wants to get paid for their data, which they now realize is valuable. Scraping aside, that might be the real goal here.

They can still scrape the website?

The fact is that many, many internet businesses today do not make sense. Many of them are bleeding money with no possible path to profitability ahead, mostly because people's expectations are very different and, I would argue, even skewed. Clearly hosting Reddit is expensive, but who really pays for it? People don't like ads, don't like to pay for services, don't like to be reminded that they can provide donations (Wikipedia). What else is there? Hoping customers suddenly have the conscience to fund the product?

I foresee many internet companies dying over the next few years with no real replacement cropping up. But people always forget and history repeats itself, so maybe we will just see endless reincarnations of similarly failed products.

They know hardly anyone would pay for a social link aggregator and discussion board, but some people might donate a coffee to a beloved app developer on their phone. Hence the only way forward, in their mind, is to charge for API calls. Even if it kills many of those apps, some will suck it up and charge their users. That it makes life harder for LLM competition and benefits their guy Sam is just a nice side effect. But I'm pretty confident that there will soon be daily Reddit dumps available as torrents for anyone that can't afford to hide their scraping, if there aren't already.

Then Reddit the company should charge Apollo a reasonable price, not an extortionate price that will kill the apps altogether. And lay off the majority of the staff, just like Twitter. Sure, the experience may get slightly worse, but it will be fine. The real issue is just trying to take Reddit to IPO.

Is hosting Reddit hard? Or do they just have thousands of employees who have little reason to exist?

OpenAI has nothing to do with it I think

I think it is about forcing people to watch ads, particularly keeping reddit safe for "dark patterns".

It's an awful experience to follow a link from a search engine for reddit because reddit systematically injects ads into results and discussion to try to trick you into clicking them and injects even more blended links to irrelevant discussions to increase your chances of getting confused and clicking on an ad accidentally.

Third party clients destroy all that.

People who want to harvest text from Reddit can just do it on the web. (Pro tip: ignore the API and implement your own 'IPA' that just works like a web browser. APIs are almost always nerfed in some way. If you have to crawl 20 websites, odds are a generic web crawler will work for 19 of them; if you are using APIs, you will have to implement something different for all of them, and maybe 3 out of 20 will have some quirk, such as a complex and poorly documented authentication process, that costs you a few hours.)

APIs are not a gift, they usually are an attempt to take access away. (Considering Hacker News, there is no API to make a post or get your upvotes. It's like 20 lines of Python to code up an "IPA" for either against the web interface.)
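A minimal sketch of such a generic 'IPA': one stdlib HTML parser that pulls outgoing links and visible text out of any page, with no per-site client code. (Illustrative only; a real crawler would add fetching, rate limiting, robots.txt handling, and error handling.)

```python
# Generic "IPA" sketch: treat every site as HTML and extract what you
# need with a single parser, instead of integrating each site's API.
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collects outgoing links and visible text from any HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_chunks.append(data.strip())

def extract(html: str):
    """Return (links, visible text) for one page of any site."""
    p = LinkAndTextExtractor()
    p.feed(html)
    return p.links, " ".join(p.text_chunks)

# The same extractor works on any site's markup -- no per-site API client.
links, text = extract('<html><body><a href="/item?id=1">A post</a>'
                      '<script>var x=1;</script><p>Some comment text</p></body></html>')
```

Pair it with any fetcher that sends a browser-like User-Agent and you have the 19-out-of-20 crawler the comment describes.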

100% OpenAI has already been making fake AI-generated posts and comments on Reddit for a while. Pack it up boys, the internet run by humans is over.

Should've packed it up back in usenet days: https://en.wikipedia.org/wiki/Mark_V._Shaney
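For the curious, the machinery behind Shaney was tiny: a word-level Markov chain. A rough sketch of the idea (not Rob Pike's original implementation):

```python
# Word-level Markov chain text generator, in the spirit of Mark V. Shaney.
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each tuple of `order` words to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Random-walk the chain to produce vaguely plausible text."""
    rng = random.Random(seed)
    key = rng.choice(list(chain))
    out = list(key)
    for _ in range(length - len(out)):
        successors = chain.get(tuple(out[-len(key):]))
        if not successors:
            break  # dead end: this context only appeared at the corpus tail
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran to the door"
print(generate(build_chain(corpus), length=10, seed=42))
```

Higher `order` values make the output more coherent but more plagiaristic, which was exactly Shaney's uncanny charm on Usenet.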

Also, if reddit is in any dataset of these models then it's no wonder that I'm not intrigued by artificial """intelligence""" at all.

Looking at the examples discussing Reagan on the Wiki page, is this generated text looking to sway public opinion? This could be used to sway elections! What sort of precautions did the hacker named Mark V Shaney take before releasing this into the wild? These so called “Mark Off Chains” are too dangerous to be let into the wild. They should only be made available to the largest corporations who can protect us from their dangers. I’m going to write my senator to make sure this isn’t made publicly available!

This post is wrong on several aspects, as other commenters have already pointed out, but it raises a tangential point which I'm curious about:

"They could buy out more 3rd party clients, which would somewhat appease the community. This would be a terrible move IPO-wise."

Why would this be a terrible move IPO wise? If Reddit has two major apps, one for casual users and one for power-users, and it has enough control to eventually put (tasteful) ads in both, how is that bad for their IPO? I don't think it's likely, given how much goodwill they've already burned with the community, but from a pure IPO perspective it seems like a solid go forward strategy

Could OpenAI acquire Reddit as an alternative to their IPO? Let's say it's worth $10B: quite a sum, especially if OpenAI is 'only' worth $40B - $50B. But in a world in which compute isn't a moat and LLMs aren't a moat... but real human-generated content IS a moat... maybe it could make sense?

I mean, that would be kind of insane, but it could actually work: OpenAI needs "clean", high-quality data as input for its AI; rather than harvesting data to sell to advertisers, they could finance major internet infrastructure (the equivalents of StackOverflow, Wikipedia, Reddit, Twitter, Facebook) to harvest data to feed their models. That might actually make the world a better place.

Why would you pay $10B for data which you can scrape for $50 from a few seedboxes?

Or if that's too much work for you, just download one of the many archives that are already freely available online.

$10b is probably too low

I'm sure I'm missing something, but: aren't there publicly available corpuses of all reddit posts up to a certain date? Why wouldn't researchers train with these? Are they just not recent enough? Even if they aren't very recent, how big of a deal is that when it comes to training models that presumably use lots of other data sources as well?

The author is aware of that and mentions it towards the end:

> Most of Reddit’s current data has been scraped anyway, so the game is to protect Reddit’s data going forward.

But yes, Pushshift archives of all posts and comments until the recent ban [1] are freely available for download [2]

[1] https://old.reddit.com/r/pushshift/comments/135tdl2/a_respon... The ban was followed by allowing the parent non-profit of Pushshift (Network Contagion Research Institute) to use the API, provided access is restricted to a use case Reddit cares about: mod tools. Reddit hasn't replaced those with its own just yet. The rest of us are shut out.

[2] https://academictorrents.com/userdetails.php?id=9863

There are massive, up-to-date archives available.

I guess either OP doesn't know about them or he assumes that AI researchers won't think of using them.

Reddit's API change will not stop LLM training, it is far too easy to manually scrape the site. It only needs to be done once.

The intention of the API change is to kill third-party app users who cannot be as effectively monetized as first-party app users.

Basically the only thing that comes to mind is that our collective views and knowledge about topics can change over time. If we had a hypothetical LLM in the 1930's, it would have had quite different views on things like mental health, civil rights, and so on.

But like you said, the models would ideally be trained on many other data sources as well.

> up to a certain date?

> just not recent enough

That's a very big deal for a lot of the content you find on Reddit that users might want to get answers about from an AI.

Reading this makes me thank the gods Wikipedia is what it is (a.k.a. a nonprofit dedicated to fulfilling its original benevolent mission). Imagine what the internet would look like if Google or Microsoft had succeeded in overthrowing Wikipedia.

Wikipedia, in itself, accounts for a good 50% of the positive image I have of the internet.

It would make a lot of sense for Microsoft to buy reddit: a curated source of data is the perfect moat!

They could come as a white knight, restoring the API access but for apps only, conveniently blocking competitors while satisfying the users.

They have no desire for the political risk of running a non-professional social media as a major government contractor.

Heh... and take responsibility for all the porn and various "degeneracy" on there? It makes them a perfect political punching bag.

"We recognize the value online communities provide in training the state of the art in generative AI. With that in mind, we have acquired Reddit and intend to operate it as a wholly owned subsidiary. After an exhaustive search, we've selected $adult_ceo_with_social_experience to lead the team as they work to ensure Reddit can continue to operate as both a safe and vibrant digital oasis, where people can come together to share their interests and passions."

One might argue these communities are too valuable to generative AI to allow them to flail under current substandard management.

"Let's now show the new Bing, where high-quality data from Reddit is ranked higher and selected by default for {whatever_category_you_think_matters}, thanks to automatic classification of the search questions by GPT"

There are many synergies that aren't properly considered by a naive take.

I use site:reddit.com with most of my queries. I now prefer bing to google. Bing market share is growing.

Coughing up a few B to avoid reddit going on with an IPO would be a very strategic move on many fronts.

The GitHub acquisition was savvy; Reddit has more peril involved but similar value.

I'm in full agreement, but I see an even better integration in other product lines (mostly gaming and search)

Also, Microsoft doesn't have a generic social network to mine data from.

They may prefer to stick to the professional world, but then acquiring a few high value sites just to close their API would be another move.

How much $$$ do you think it would take for YC to say "hell yes!" to put HN under Microsoft control?

How valuable would it be to add a few conditions, like a very generous giveaway of Azure credits to HN companies in return for exclusivity (no more S3 or Google Cloud for the new batch!)?

> They may prefer to stick to the professional world, but then acquiring a few high value sites just to close their API would be another move.

They already have LinkedIn, which is far more valuable than HN for this purpose.

Yes, Linkedin is a professional social network, but no, you will not find high quality content there.

There's a lot of value for keeping in touch with the zeitgeist (ex: sqlite innovations, llama etc) and finding new trends on HN

There's about 0 value from using linkedin.

They could claim remaining true to the original spirit or whatever excuse will fly.

How many billions will AI be worth in the future? What's the cost of the potential reputation damage of having an edgy brand? All that matters is that the operation can be expected to be profitable.

Better: what would be the synergies between the xbox audience and the reddit audience? The "degenerates" are clients like any others, maybe also with more disposable income.

The model will be aligned anyway, the internet is full of porn and degeneracy, not just Reddit.

This is so smart that I'm now convinced it's what's happening.

I agree wholeheartedly with this, and given Microsoft's trail of corporate bodies in its wake, I wouldn't put it past them to be orchestrating the cutting off of all data lakes for AI training, especially a clean and pre-processed source like Reddit. If data is water that corporations drink (which it is), Reddit is like finding a naturally occurring spring of Perrier water (by natural, I mean we are the ants who bring this water one drop at a time from the soft petals of plants which collect the night dew).

I would like to be compensated for the digital oil I've produced.

You can be compensated the same way we compensate the earth for real oil: by burning your output to create a toxic environment for you to exist in.

This is much more YC/VC maneuvering than it is Microsoft.

I wouldn't be surprised if ClosedAI closed yet another part of their product, in this case, the training data itself.

(I know it's a satire)

It's quite naive to think that the major tech big boys with at least moderate AI ambitions haven't already archived Wikipedia, Reddit, and StackOverflow.

The data Reddit has on its NSFW content (you know which one.. yes, the pornographic one) must be worth millions.

No, no... jeez, you pervs... not the content.

The upvote/downvote data for every image will probably teach any generative AI exactly what makes a good porn video/pic, and we all know sex sells. The dirty comments, etc.

On top of that, they'd make really useful leverage if you can connect any one of these accounts to their real world counterparts. I bet an AI with a large enough dataset could make short work of that.

>The data Reddit has on its NSFW content (you know which one.. yes, the pornographic one) must be worth millions.

Blackmailing people who participated in /r/jailbait? I would bump that up a 0 or two.

Sorry, just thinking about how an AI would go about acquiring resources.

Please tell me that admin guy wasn't part of that.

Why does everyone say “moat” these days? I hadn't seen this term so over-used even 2 months ago.

I had literally never heard this word before the infamous Google post; since then it's everywhere.

What google post? I think I missed that one

If Reddit is such a valuable property, it would make sense for one of the big AI players to come in and buy it out before it IPOs and the value shoots through the roof. OpenAI is the most likely one, given the links between Altman and the management. But a new entrant like Apple could swoop in. We know how useless the management of Reddit are at making money from it; a sale in its entirety might be the best exit for the VCs and for them.

>before it IPOs and the value shoots through the roof

Maybe they want to wait for it to IPO and then drop 90% instead.

Reddit is valuable for its milk, but gave away the cow years ago with the free API.

I love this theory from a conspiracy angle. But one issue is that they didn't cut off API access, they just made it more expensive. This creates a moat that small app developers can't cross, but not much of one for a funded AI company. If you believed that Reddit's corpus was the key to training your own model, you'd just pay the money. If his theory is right, it cuts off the wrong kind of actor.

> This blew up on Hacker News. An HN mod edited the title to add a ? at the end. It was not me. The answer to any headline with a ? at the end is “no.” I would not own myself like that.

Agree with the viewpoint. Hacker News is heavily moderated in a non-user-friendly way, which results in some comments being nonsensical after moderator edits, and it also disturbs searching for content by a title one remembers.

Maybe if reddit developed some expertise in using all of the user data they have to tag, categorize, annotate every comment, they would have something of value to provide. Their recent actions only angered users and won't stop future LLMs from using their data for the reasons many others here have noted.

Problem is, Reddit can’t obscure its content from Google because it is a source of traffic and new users for them.

Woah, this is a really good point. That brings up an interesting question: can Reddit stipulate in its TOS that a search engine can crawl the site for the sake of engine indexing, but not train its models on it? I don't see why they can't add such language.
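One piece of the puzzle already exists at the crawler level, though it only controls fetching, not use. A sketch of a robots.txt that welcomes a search indexer while blocking a training crawler by its user-agent token (GPTBot is OpenAI's published crawler name; the "no training on indexed copies" part would still have to live in the TOS, since robots.txt cannot express how fetched data may be used):

```text
# robots.txt sketch: per-crawler fetch permissions only.

# Search indexing: allowed
User-agent: Googlebot
Allow: /

# LLM training crawler: blocked
User-agent: GPTBot
Disallow: /

# Everyone else: default rules
User-agent: *
Disallow: /api/
```

Of course, robots.txt is honored voluntarily, which is the parent comment's point about needing TOS language on top.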

I think the author is on to something here, but isn't there a bridge across the moat re: scraping?

Sure the API made things convenient, and scraping content will be a bit of an arms race, but scraping public postings that don't require a login to view still seems like a bridge over the moat. It's tempting to refer to the recent victory of LinkedIn over HiQ, but there's an important distinction in that ruling: It pertained to use of logged in accounts and did not explicitly overturn the prior ruling that public-facing data was fair game.

For now it's all still unknown territory. I would be skeptical of anyone who adamantly affirms a position that scraping for training is allowed/not-allowed because it is not yet settled law in the US where these companies are headquartered or the rest of the world where they operate.

> Organizations/people with an interest in OpenAI – like Sam Altman and a16z – have a significant stake in Reddit, and strong ties to the board.

This triggered me to review a thought I had.

VCs are very keen on attending board meetings without being directors, retaining the right to appoint a director without exercising it, and exerting control through other means. I think some tightening of the concept of shadow directors is in order. Shadow directors are supposed to have the same legal responsibilities as properly appointed directors, yet they currently carry all the power without the responsibility.

> They could extend the June 30th deadline, which I think is likely. This would get rid of 3rd party apps while, perhaps, keeping most of the community intact.

I don't get what the author is trying to say here.

If the deadline was extended, why would it "get rid of 3rd party apps"? I assume because some (Apollo, RIF) said they're going to close anyway even if the deadline was extended? If so, the community will not be "intact".

If the author thinks the death of 3rd-party apps won't affect the community, then Reddit doesn't need to extend the deadline.

I had a thought (maybe naive). There is this idea (not new or mine) that, API-based access aside, LLMs are increasingly able to embed the (semi)structured details (like code or DOM-based UI navigation) needed for planning (in the AI sense). Then what even is a moat?

And then one fine day, some random guy has a local/private LLM trained to browse his Facebook groups as him; his account gets deactivated by Meta; he sues, arguing in court for the right to use his own tool to mine his personal data.

Data is the only moat in the AI age. The problem for businesses is that the output of models can be used to train other models, so even a massive data moat will eventually be eroded.

Eventually all B2C AI companies will be just slick interfaces over open models that are too large to run on commodity hardware. B2B AI is in a bit better shape, both because compiling niche business data sets can be expensive, and there won't be giant public data sets produced from their output.

Current VRAM prices are like that because of Nvidia's greed.

I assembled one of my previous gaming PCs 10 years ago and installing 32 GB of RAM wasn't a problem back then.

But you can't even buy a consumer GPU with 32 GB of VRAM. Data center cards are considerably more expensive.

There's no technical reason to not have 100+GB consumer GPUs today.

Historically, consumer cards have been driven by the needs of gamers, and data center cards have mostly been slightly retooled consumer cards. Since game graphics needs have plateaued to some degree, I expect that as AI gets incorporated into more things, we'll see consumer cards that have lower game performance but a lot of memory and good basic model performance.

I’m not sure exactly what the moat would be here—the current data is already probably available, and future data OpenAI will have to pay for just like everyone else.

Ok, looking closer at the article I see

> The important piece is that it’s easiest for OpenAI to get the data (given that companies with co-investors help each other)

which seems like a pretty weak basis for a moat (if Reddit IPOs, as is their plan, sama's influence will not be nearly as strong)

I’m always irritated by the theory of the motivation of closing API access being LLMs.

Forbidding LLM use could simply be a legal paragraph in Reddit's API terms of service. Good actors would abide; bad actors would still crawl and scrape the HTML. Practically the same situation as with the closed API from July forward, but without the drama.

Killing 3rd-party clients seems the far more likely motivation.

Conde Nast is majority shareholder, not OpenAI, and they're not going to be onboard with wrecking reddit in an attempt to build a moat for another company. It's just silly.

Besides, if someone really wanted to get to the data they could just scrape it. Google, Bing & co index it after all. Bit of a pain in the ass but not impossible

Some strange claims in this post. The reddit post datasets are already 'out there' in the wild, and I'm fairly certain every other major LLM release has used their data. Also - did Midjourney "steal" DALLE-2's lunch? It's a restrictive service with essentially a discord-only based CLI.

Somewhat related: are there methods to embed LLM "traps", like trap streets in maps, to establish provenance?
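In the trap-street spirit, one simple and entirely hypothetical construction: derive a unique canary token per document from a publisher secret, embed it in the published text, and later test model outputs for it. Everything below is an illustrative sketch, not an established scheme.

```python
# Hypothetical "trap street" for text: per-document canary tokens.
import hashlib

SECRET = "publisher-provenance-key"  # hypothetical publisher-held secret

def canary_for(document_id: str) -> str:
    """Derive a unique, innocuous-looking token per document -- the
    textual analogue of a map's trap street."""
    digest = hashlib.sha256(f"{SECRET}:{document_id}".encode()).hexdigest()[:12]
    return f"zx{digest}"  # prefix makes natural collisions very unlikely

def watermark(text: str, document_id: str) -> str:
    """Embed the canary where a human skims past it but a scraper keeps it."""
    return f"{text}\n[ref:{canary_for(document_id)}]"

def leaked(model_output: str, document_id: str) -> bool:
    """If a model reproduces the canary verbatim, the watermarked
    document was very likely in its training set."""
    return canary_for(document_id) in model_output
```

The obvious weaknesses are also the map-maker's: trivial to strip if the scraper knows the pattern, and memorization of any one token is not guaranteed, so real proposals spread many canaries across many documents.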

This article misses the mark for me.

It holds up Reddit as a meaningful source of OpenAI's performance, and then goes down a rabbit hole from there.

OpenAI's moat is the thousands of 'human in the loop' contractors that they hired in South America... for years... not Reddit...

Why was a ? added to the end of the title?

Reddit shut down their API for one reason: the IPO has to show growth, and the only growth left for Reddit is *FORCING* users into their own app.

They shut down i.reddit and reddit.compact, so there's no optimized mobile web format.

Now they shut down API access for apps. It's as clear as day. I don't think they give a sh*t about AI.

This might as well be as good place as any to say that I wouldn't be surprised if in 20 years time we find out that the acquisition of GitHub by Microsoft was also done in OpenAI's favor. I know it's probably a crazy theory, but it did cross my mind a few months ago.

Doesn't make any sense timeline-wise; business execs at Microsoft were probably not thinking seriously about OpenAI at the time.

The really impressive feat here is not even that, but MS inventing time travel to acquire Github before OpenAI was a twinkle in Sam Altman’s eyes

OpenAI certainly existed in 2018 but I don't think I noticed most people being aware of them outside of the industry.

You’re right, my bad, I misremembered the Github acquisition year
