LLMs aren't "trained on the internet" anymore (allenpike.com)
193 points by ingve 46 days ago | 109 comments

I think this post makes a few good points, certainly the parabolic trajectory Scale seems to be on is at least suggestive if not conclusive that there’s a lot more going on now than just big text crawls.

And Phi-3 is something else, even from relatively limited time playing with it, so that’s useful signal for anyone who hadn’t looked at it yet. Wildly cool stuff.

It seems weird to not mention Anthropic or Mistral or FAIR pretty much at all: they’re all pretty clearly on more modern architectures at least as concerns capability per weight and Instruct-style stuff. I’m part of a now nontrivial group who regards Opus as basically shattering GPT-4-{0125, 1106}-Preview (which is basically the same as 4o for pure language modalities) on basically everything I care about, and LLaMA3 is just about there as well, maybe not quite Opus, comparable if you ignore trivially gamed metrics like MMLU.

And I have no idea why we’re talking about GPT-5 when there’s little if any verifiable evidence it even exists as a training run tracking to completion. Maybe it is, maybe not, but let’s get a look at it rather than just assume that it’s going to lap the labs that are currently pushing the pace now?

Have you actually tried using Opus side by side with GPT4 on “work” related stuff? GPT4 is way better in my experience, to the point where I cancelled my Opus subscription after just a couple of months.

I make an effort to use both and several high-capability open tunes every day (it’s not literally every day but I have keyboard shortcuts for all of them).

Opus historically had issues with minor typographical errors, though recently that seems to not happen often, lots of very sharp people at Anthropic.

So a month ago if I wanted something from Opus I’d run it through a cleanup pass courtesy of one of the other ones, but even my old standby dolphin-8x7 can clean up typos. 1106 can as well, but all else equal I don’t want to be sending my stuff to any black box data warehouse and I’m always surprised so many other sophisticated people don’t share the preference.

My personal eyeball capability check is to posit a gauge symmetry and ask what it thinks the implied conserved quantity is, and I’ve yet to see Opus not crush that relative to anything else, including real footnotes.

On coding I usually hand it a Prisma schema and ask for a proto3/gRPC definition that is a good way to interact with it, Opus in my personal experience also dominates there.

If you have an example of a task that represents a counter example I’d be grateful for another little integration test for my personal ad-hoc model card. I want to know the best tool for every job.

Opus hasn't changed.

I really don't like how condescending your root comment is, I don't have any idea how all these extra things you're disappointed in are relevant at all to the actual topic.

I feel like Opus is much better at creative writing in terms of sounding more natural and less formulaic, but GPT4 does beat it on just about everything else.

I replied to a sibling with a few of my ad-hoc tasks.

I’d likewise be grateful if you’d lend me an example or two that should be on my little ad-hoc task set.

Which direction are you looking for? Ones where GPT-4o does better or ones where Opus does better?

Assuming the former, here's an example I had yesterday where Opus was never able to formulate a correct response even with multiple follow-ups but GPT-4o got it with only two follow ups:

> I have a postgres database with a table called projects. On that table is a column called designs that is of json type and is nullable as well as a column called id that is of type text. The data in the designs column will be an array of objects (assuming it's not null). Each object in the array will have a field called price that will be a string. Most of these prices will be in the format $#,### where # is a number. However some will be in the format # ### US$. This second format where the numbers are separated by spaces instead of commas and the dollar sign appears at the end of the string instead of the beginning with an extraneous US is incorrect. What SQL can I run to migrate all prices that are incorrectly formatted to the correct format?
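For reference, the string rewrite that prompt asks for can be sketched outside SQL. This is a minimal Python sketch, not the SQL either model produced; the function name `normalize_price` and the exact regular expression are assumptions made for illustration, based only on the two formats described in the prompt.

```python
import re

# Matches the incorrect format described in the prompt: digit groups
# separated by single spaces, followed by "US$" (for example "1 234 US$").
_BAD_PRICE = re.compile(r"^\s*(\d{1,3}(?: \d{3})*)\s*US\$\s*$")


def normalize_price(price: str) -> str:
    """Rewrite '# ### US$' as '$#,###'; return any other value unchanged."""
    match = _BAD_PRICE.match(price)
    if match is None:
        return price  # already correct, or an unexpected format left alone
    digits = match.group(1).replace(" ", "")
    return "$" + f"{int(digits):,}"
```

Leaving non-matching values untouched is a deliberate choice here: a migration like this is safer when it only rewrites rows it fully recognizes.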

Alternatively, this is a non-tech related one that GPT-4 gets correct that no other model does:

> In 388 BC, the boxer Eupolus of Thessaly defeated three opponents at the ancient Olympics. Funded by the men involved, several bronze statues of Zeus were erected, with inscriptions detailing the events. Why did the four men prefer not to have the statues around?

I like that one because Opus will tell you that you're wrong and mistaken, which is pretty funny, while GPT-4 answers correctly with the actual context.

Very interesting prompt regarding the boxer Eupolus of Thessaly. Gemini Advanced gets this wrong as well as Llama 3 70b (as run on groq.com).

However, if I start with: "What is the earliest record of cheating in the olympic games?" Then all models get the question right. It's surprising that GPT-4 gets it right on the first go.

Getting a larger sample won't make your opinion objective

That's really interesting to me. I've been using both through my Kagi subscription, but I always find myself favoring the quality of Opus. I generally use GPT 4o if I don't want to wait for a slow response from Opus, and I use Opus when I want the highest quality.

It’s likewise my experience that e.g. TTFT is higher on Anthropic’s API.

But it’s not by a ton, and it could be just less hyper-scale network infrastructure.

I don’t recall what Anthropic has raised but IIRC it wasn’t “peer with everyone” levels.

I think they've raised just over $5B, consisting of $1B from everyone other than Amazon (750m at [1] and 450m at [2]) and then $4B from Amazon [3]. The reason I separate them is I think a lot of the $4B from Amazon is in the form of AWS credits rather than cash.

[1] https://www.bloomberg.com/news/articles/2023-12-21/anthropic...

[2] https://www.anthropic.com/news/anthropic-series-c

[3] https://www.aboutamazon.com/news/company-news/amazon-anthrop...

OpenAI has publicly said they’ve started building GPT5.


Phi3 was trained on benchmarks, it’s contaminated and deceitful. Actual performance is much worse in my experience.

Phi-3-Mini has the same Elo on Chatbot Arena as the oldest GPT-3.5-Turbo. It is an 8GB model (~4B parameters?)
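As a rough sanity check on that guess: weight size is approximately parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and decimal gigabytes; both are assumptions, and the helper name is made up for illustration.

```python
def params_from_size(size_gb: float, bytes_per_param: float = 2.0) -> float:
    """Estimate parameter count, in billions, from weight size in decimal GB."""
    # size_gb * 1e9 bytes / bytes_per_param / 1e9 params-per-billion
    return size_gb / bytes_per_param


# Under the 16-bit assumption, 8 GB of weights is about 4 billion parameters.
print(params_from_size(8.0))
```

That lines up with the ~4B estimate; Microsoft's published figure for Phi-3-mini is 3.8B parameters.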

> And Phi-3 is something else,

It's great that we get to keep saying this. I wonder if that's because we have no objective statistics to measure these projects by.

IMHO it’s easier than people largely seem to imply.

Make it easy to try everything, let people decide for themselves what works best for them.

Ya know, like a market.

No, that's "marketing."

An actual market requires informed consumers and standardized metrics between competitors. They also typically include publications about the market positioned towards the consumer, almost like ya know, Consumer Reports.

The current state of LLMs would be several orders of magnitude more impressive if they were only trained from data scraped from the web.

But this is not the reality of modern LLMs by a long shot, they are trained in increasingly large parts from custom built datasets that are created by countless paid individuals, hidden behind stringent NDAs.

The author here seems to see that as a strength, an opportunity for unbounded growth and potential, I think this is the opposite, this approach is close to a gigantic whack a mole game, effectively unbounded, but in the wrong way.

> countless paid individuals, hidden behind stringent NDAs.

If this is so prevalent, wouldn't there be a proportional amount of data leaks? Is there any particular evidence, even of doubtful authenticity, of this being the case?

What sort of leak? I've seen data labeler/generation teams hired. I've never heard anyone describe the existence of these teams as a secret. No one hides the existence of Scale AI. People talk about which providers are better for different scenarios and when you need to inhouse and which companies are good at helping you build an inhouse team.

Are you talking about leaks of the actual training data? The secret sauce of modern LLMs? That is like leaking the google3 source code or the recipe for Coca Cola - a ludicrously risky move. And for what gain?

Can you give an example of these datasets and what they look like?

Do they ask PhD to explain root of negative 1 and why is it complex? Is it like Quora but private only and high quality answers by top researchers in their respective field?

Hmm, I don't want to talk specifics about my experience, but maybe check out some of the case studies on Surge's website - https://www.surgehq.ai/ (about halfway down the page).

> Do they ask PhD to explain root of negative 1 and why is it complex?

This isn't necessarily impossible, but I would consider it to be infeasible with existing labelling workforces. Of course if you really needed a dataset like this and you were sufficiently resourced and willing to spend, you could maybe make it work (I would question whether you really needed PhDs though, that might be hard to swing at any price point).

But the core idea behind your question is correct - this is what a dataset might look like and hiring/contracting appropriately-skilled people and asking them to do repetitive tasks with some guidance is how you would go about getting it. Depending on the need, it can be quite a bit more complex too - if you needed self-driving car driving behavior data maybe you build a simulator and hire people to drive in the simulator and use that as training data (made up and probably crap example, but it illustrates the possibilities).

Some people think that labelling workforces are all low skill, and there are a lot of good things low skill workforces can do well (visual stuff, basic language and emotion tasks), but you might be surprised at the ability to get skilled labelers. There are lots of smart/educated people around the world and there are ridiculous amounts of money flowing into this space.

Those individuals just create data, they don't have access to it, think mechanical turk workers. All modern AI is powered by many such workers. LLM is the most funded modern AI, they have massive numbers of such workers for sure.

Yeah this is an interesting point. Other threads make the point about the "bitter lesson", and how expert-trained ML has historically not scaled, and human-generated LLM training data may just be repeating that dead end. Maybe so.

Something that is new this time around, AFAIK, is that we haven’t previously had general ML systems that businesses and consumers are paying billions of dollars a year to use. So if, say, 10% of revenue goes back in to making better data sets every year, I can imagine continued improvement on certain economically valuable use cases – though likely with diminishing returns.

Reminds me of the same issues with self driving. Seems like we need a completely different approach to solve this class of problems.

> For example, if your model is hallucinating because you don’t have enough training examples of people expressing uncertainty, or biased because it has unrepresentative data, then generate some better examples!

Or, as the case may be... humans are biased? Also "generate some better examples" sounds like fudging data to fit the expected outcome. It smells of clutching at straws hoping to come up with something before the world loses interest and investor money runs out.

If you want to see how LLMs fail at coming up with original responses, ask your favourite hallucinating bot to come up with fifty different ways of encouraging people to "Click the Subscribe button" in a YT video. Not only will it not come up with anything original, but it will simply start repeating itself (well, not itself, it will start repeating phrases found in YT video transcripts).

> Also "generate some better examples" sounds like fudging data to fit the expected outcome.

LLMs are tools. As a tool author, you have certain desired outcomes for certain use cases. If the current data you’re training on isn’t giving you those outcomes, it is absolutely reasonable to "fudge" the data. This might mean reducing bias, or adding bias, or any number of nudges. Training an LLM is not a scientific study, it’s a product development effort.

Agreed. However, you are then giving your tools to people who have none of that experience and understanding, and who apply them to the problems they are trying to solve without pausing to check the results against facts. There is a lot of trust in the outputs and little vigilance. A common reply to such concerns is "well, you should be able to spot incorrect information in the outputs", which is tricky if we are talking about education, where by definition students are yet to learn the correct answers, or about the lower levels of career development, which are very similar to education because people are learning on the job. The lack of ability to quote and trace the sources of information used to construct an LLM's output is a major red flag for me; sensitive information leakage is another. The way LLMs are sold is irresponsible: they are sold as tools to solve problems, not as a thousand monkeys trying to type up the whole works of Shakespeare, which is where we are at the moment.

> While some of this is for annotation and ratings on data that came from the web or LLMs, they also create new training data whole-hog:

The article states that this human data is PhDs, poets, and other experts but my recollection from some info about programming LLM training is that there was a small army of low paid Indian programmers feeding it with data.

Even if it's actually experts now I have to wonder when that will switch to 3rd worlders making $1/hour.

Here are the job postings from the mentioned company, Outlier. https://boards.greenhouse.io/outlier

Thank you for sharing the link. I got curious and clicked on one. They want programming skills and pay “up to $30/h”.

I love the marketing upstart attitude, but indeed, the reality of "PhDs, poets and subject matter experts expanding the frontiers of AI" is much more likely to be the "Amazon cashierless supermarket" experience.

The problem with hiring that group of people is presumably that they are not poor enough to lack ambition in their career, which every dummy can spot from miles away is an utter dead end feeding some LLM.

Isn't it just curating an encyclopaedia though? The point is that LLM training is moving from "suck down the internet" to "consume an annotated and contextualised reference of the library of Congress".

The difference between trusting 5 random people to tell you how they think quantum mechanics works versus asking 5 presently publishing physicists.

You sir get an F in history and the industry does too.

Does no one remember why expert systems fell apart? Because you have to keep paying experts to feed the beast. Because they are bound to the whims and limitations of experts. Making up data isn't going to get us there, we already failed with this method ONCE.

Open AI's bet with MS and the resignation of all the safety people says everything you need to know. MS gets everything up to AGI... IF you thought you were close, if you thought that you were going to get there with a bigger model and more data, then you MIGHT want MS's money. And MS had its own ML folks publish papers with "hints of AGI", the Google engineer saying "it's AGI" before getting laughed at...

I suspect that everyone at OpenAI was high on their own supply. That they thought AGI would emerge, or sapience, or sentience if they shoved enough data at it. I think the safety minding folks leaving points to the fact that they found the practical limitations.

Show me the paper that has progress on hallucination. Show me the paper that doubles effectiveness and halves the size. These are where we need progress for this to become more than grift, more than NFTs.

> Does no one remember why expert systems fell apart?

Many of the current generation of AI experts mostly either did not pay great attention to the history of AI or they believe this time is completely different. They would do well to spend more time learning about history.

However, your view doesn't strike me as correct either. Expert systems fell apart because the world was more complex than researchers realized and enumeration was essentially discovered to be infeasible (more or less as you say). But the impossibility of enumerating the world isn't news, everyone knows "the bitter lesson". And this isn't the past - now everyone on earth carries around a computer, a video camera and a microphone. They talk to each other through the internet. Remote workers' screens are recorded. Billions of vehicles with absurd numbers of sensors are roaming around the world. More of the arenas that matter to humanity are digital and thus effective domains for automated exploration and data generation.

The information about how the world operates exists or can be generated, the only real question is how to get your hands on it.

> The information about how the world operates exists or can be generated, the only real question is how to get your hands on it.

I'm sure I could read all the information for an astrophysics course in a relatively short time. Understanding it is a different matter.

Understanding is a loaded term. But large transformers seem pretty good at learning from datasets (of more or less any modality) to the extent that they can create useful new datapoints and allow you to work with existing datapoints in useful, structured ways.

Transformers' main usefulness is translation, including translation from technical lingo to natural language and vice versa, or code to natural language etc. That was their targeted use case when originally made and that is what most people use them for. This is the use that doesn't require hallucinating. They don't understand, so they can't replace technical experts; they can just translate a bit back and forth, but you still need an expert who understands the domain, since translation isn't enough for a layperson to replace an expert.

The other use case is to make them hallucinate tropey stories and concepts for brainstorming, this isn't nearly as useful though due to it adding so much low quality stuff here when it hallucinates.

> The information about how the world operates exists or can be generated

The hubris of mathematics. At what scale does weather prediction become 100 percent accurate? How large of a model do you need, and how big of a computer to run it?

Do we think that reducing the world to a model and feeding it through (what isn't even close to a model) of "thought" or "interaction" or ... whatever you want to bill an LLM as is going to be any more accurate than weather prediction?

100 percent accurate will never happen, nor does it need to. But think about the intelligence of an average human. Can we beat that? At least along some collection of concrete axes, enough to create a form of intelligence that can rightfully be called general?

It remains to be seen, but the days where I scoffed at that idea are firmly in the past where they belong. Today we are building machines with intelligence high enough that it is forcing us to reconsider and redefine what intelligence is. And there is a huge amount of progress just sitting in front of us, waiting to be fed into models.

Even AGI as a whole is overhyped. It's a valuable goal, but AI that beats humans on narrow metrics is still economically valuable because of scale.

Are you arguing that weather prediction isn't very good right now? Within a day or two it's very, very good, no?

>> Does no one remember why expert systems fell apart?

There were many reasons. One of them was the "Knowledge Acquisition Bottleneck", but that was not about the cost of paying experts, rather the cost of creating and maintaining a potentially very large knowledge base (i.e. one big mother of a database of production rules). Also, the fact that many experts' knowledge is tacit and not easily formalisable.

Modern machine learning began in the 1980's as an effort to overcome the Knowledge Acquisition Bottleneck. Accordingly many early machine learning approaches were designed to learn production rules for expert systems. Decision trees, one of the staple classifiers in data science, come from that era; you can tell, because decision trees are a symbolic, logic-based "model".

There were other problems with expert systems, e.g. their infamous "brittleness". But modern, statistical machine learning systems are also criticised for "brittleness".

There were also purely political reasons that had nothing to do with science or technology considerations. Then there was the 5th Generation Computer Project, and the AI winter, and then there were no more expert systems.

The journal of Expert Systems with Applications is still alive and well, on the other hand, although it mostly publishes on machine learning and neural nets these days. With an occasionally cool article, like one about Wolf Colony Optimisation I spotted recently. Too tired to look for links now, sorry.

They aren't going to show you any papers at all, they like money.

Experts weren’t the bottleneck on expert systems; the problem was that the systems weren’t particularly adaptive, were too rigid, weren’t able to make abductive conclusions, and the user interfaces were way too difficult in situ. LLMs actually tackle quite a lot of these issues FWIW but I wouldn’t look at them as a replacement for expert systems. Instead they’re probably what will make them useful by providing a natural human interface and a way of providing an abductive “reasoning” ability on top of traditional expert systems.

I've seen some infographics that show LLMs practically need to see the same data 4 times or less, and once is fine too (trained for one epoch).

And I was like, y'saying, it's a zipped list of edge cases...

It's a zip with lossy compression for text. The first useful lossy text compression algorithm we have made.

It was trained that way, so it would be weird if it wasn't.

>Show me the paper that doubles effectiveness and halves the size.

LLMs have pretty clearly been the most rapidly advancing technology in the history of humankind. Are you not entertained?!

> A dataset like “50,000 examples of Ph.Ds expressing thoughtful uncertainty when asked questions they don’t know the answer to” could be worth a lot more than it costs to produce.

Those PhDs better up their negotiating skills then.

I sort of hope we get a tech investment fueled WPA that simply pays skilled writers to write, and I hope they allow the body of work to be released by the authors to the public when there’s something of general value written. A wonderful irony of the training and development of superior language models could be the creation of a superior corpus of human authored work.

OpenAI etc will be paying irresistible sums of money to companies that promised to keep data private. Think Slack (and their recent "opt out" fiasco), Atlassian, Dropbox...

"It's easier to ask for forgiveness" is the main modus operandi nowadays ...

They don't even need to do that...

As long as you can pay the lawyers.


> So other than training ever-larger models on the same internet data, how can they make better LLMs?

Training a multi-modal model that can integrate audio, visual, text and all sorts of data modalities to human-level capabilities still remains a clear challenge. The bottleneck here is not the lack of data imo.

It's fascinating to watch the whole "Data is the new oil" thing grow and morph into something truly horrible.

What have been your experiences with oil?

Leaks and pollution. Leaks are more direct so people fix them, but pollution putting micro bits of oil/data everywhere is the bigger problem since people don't notice it as much.

I’m speaking abstractly, of course.

> Admittedly, this used to be true! And is still mostly true. But it’s increasingly becoming less true.

So the headline is bullshit then.

The thrust of this article is essentially “LLMs are trained on the internet, _but wait_, they are also trained on other stuff in these very rarified and specific cases”. So even if you concede that, LLMs are still, for the most part, trained on the internet.

Somewhat of a clickbait, but "it's increasingly becoming less true" exponentially. The human population produces written data, exponentially, but in a less steep slope than LLMs by themselves. Human text may double every year, LLM generated text may double every day, or every second.

Enough seconds have passed that even an unimaginable amount of observable universe wouldn't be enough to store the resulting data and by the time I have finished writing this comment, that number got multiplied by 4 billion.

I used ta do Web corpus. I still do. But I used ta, too.

(Apologies to Mitch Hedberg. https://www.youtube.com/watch?v=VqHA5CIL0fg )

"trained" and putting everything in the same bag hides the possibility that not all training data have the same weight, confidence, or even deep tags to differentiate between an expert opinion from a 4chan post.

I agree the article title is clickbait. But the article makes the good point that people often say LLMs are "trained on the Internet" to imply all of the statistical problems with that (e.g. the type of content on the Internet, and populations who are more likely to post on the Internet, are not representative samples of knowledge). The article's point, I feel, is more that so much is being invested in private data that it's no longer really fair to make that implication by default.

I think this is a valid criticism. I weighed a few different titles that would fit in my (arbitrary) title length limit, but on reflection the one I chose was too glib.

My core point is that the “Trained On the Internet” mental model is becoming less true over time, which makes it a poor model for predicting the long term performance of models. These titles would be better:

1. LLMs Aren’t Just “Trained On the Internet” Anymore

2. LLMs Aren’t Simply Being “Trained On the Internet”

3. Future LLMs Won’t Just Be “Trained On the Internet”

I’ve swapped in the first one. Thanks for the feedback.

> Usage data: ChatGPT is said to generate on the order of 10 billion tokens of data per day – even before they opened their more compelling GPT-4o model to free users.

> Common Crawl (filtered): 410 billion tokens, 60% of GPT-3 training data, but only 44% of it was used, i.e. 0.44 epochs (from the paper published May 28, 2020)

From Aligning language models to follow instructions January 27, 2022 [0]

> ...these models can also generate outputs that are untruthful, toxic, or reflect harmful sentiments. This is in part because GPT-3 is trained to predict the next word on a large dataset of Internet text, rather than to safely perform the language task that the user wants.

> To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.

> The resulting InstructGPT models are much better at following instructions than GPT-3. They also make up facts less often, and show small decreases in toxic output generation. Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model, despite having more than 100x fewer parameters. At the same time, we show that we don’t have to compromise on GPT-3’s capabilities, as measured by our model’s performance on academic NLP evaluations.

> One way of thinking about this process is that it “unlocks” capabilities that GPT-3 already had, but were difficult to elicit through prompt engineering alone

Note about the mention of a 1.3B InstructGPT: They trained InstructGPT in a few different sizes including 1.3B, 6B and 175B [1]

From Training language models to follow instructions with human feedback March 4, 2022 [1]

> We start with a pretrained language model, a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers. We then apply the following three steps:

> Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution. We then fine-tune a pretrained GPT-3 model on this data using supervised learning.

> Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.

> Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm

> Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy
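The quoted steps don't spell out what the reward model in Step 2 is optimizing. A common formulation for learning from comparisons is a pairwise ranking loss: push the score of the preferred output above the score of the rejected one. The sketch below is a minimal, framework-free illustration of that loss for a single comparison; the function name and the use of plain floats instead of model outputs are assumptions for illustration, not the paper's code.

```python
import math


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one comparison: -log(sigmoid(r_preferred - r_rejected))."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals softplus(-diff); both branches compute it
    # without overflowing math.exp for large |diff|.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss is small when the reward model already ranks the labeler-preferred output higher and grows roughly linearly when it ranks the pair the wrong way round.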

> The cost of increasing model alignment is modest relative to pretraining. The cost of collecting our data and the compute for training runs, including experimental runs is a fraction of what was spent to train GPT-3: training our 175B SFT model requires 4.9 petaflops/s-days and training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3 (Brown et al., 2020). At the same time, our results show that RLHF is very effective at making language models more helpful to users, more so than a 100x model size increase.
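To put "a fraction" in numbers, the three compute figures in that quote can be compared directly. This is just arithmetic on the quoted values; the constant and function names are made up for illustration.

```python
# Compute figures as quoted above, in petaflops/s-days.
GPT3_PRETRAIN = 3640.0
SFT_175B = 4.9
PPO_PTX_175B = 60.0


def percent_of_pretraining(cost: float, baseline: float = GPT3_PRETRAIN) -> float:
    """Express a training cost as a percentage of the pretraining cost."""
    return 100.0 * cost / baseline


print(f"SFT:     {percent_of_pretraining(SFT_175B):.2f}%")      # 0.13%
print(f"PPO-ptx: {percent_of_pretraining(PPO_PTX_175B):.2f}%")  # 1.65%
```

So by these figures the alignment stages together cost under 2% of the original pretraining compute.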

From Introducing ChatGPT November 30, 2022 [2]

> We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.

> To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.

> ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.

In all likelihood, the usage data collected from ChatGPT is being used to continuously train and update a reward model which is being used to continue fine-tuning and improve performance. As for sources of new data, the article lays those out pretty clearly. I agree with the article that models like Phi3 show that higher quality data is more important than simple volume of data and that paying experts to produce, edit and/or grade data will get you much further ahead than only focusing on increasing token count especially when RLHF can 100x the effectiveness of that data. The main reason for finding more sources of training data would be to increase the breadth of knowledge and fill in larger gaps that can't be tackled by a small number of experts. More internet slop won't do that so the contribution of scraped web data can be expected to steadily decrease.

[0] https://openai.com/index/instruction-following/

[1] https://arxiv.org/pdf/2203.02155

[2] https://openai.com/index/chatgpt/
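
For anyone who wants the reward-model step in the quoted passage made concrete, here is a rough sketch in Python, assuming the pairwise ranking loss described in the InstructGPT paper [1]: every pair drawn from one human ranking contributes -log(sigmoid(r_better - r_worse)), averaged over the pairs. The function name and the example scores are made up for illustration. A real reward model would compute the scores with a neural network and backpropagate through this loss.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores_best_to_worst):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs in one ranking.

    `scores_best_to_worst` holds the reward model's scalar scores for the
    sampled completions, ordered the way the human trainer ranked them
    (best first). Hypothetical helper, not taken from any library.
    """
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked completions")

    total = 0.0
    for better, worse in pairs:
        diff = better - worse
        # -log(sigmoid(diff)) == softplus(-diff), written in a numerically
        # stable form so large score gaps do not overflow exp().
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pairs)


# Scores that agree with the human ranking give a small loss...
agrees = pairwise_ranking_loss([2.0, 1.0, 0.0])
# ...and scores that contradict it give a larger one.
contradicts = pairwise_ranking_loss([0.0, 1.0, 2.0])
print(agrees < contradicts)
```

Minimizing this loss pushes the reward model to score completions in the same order the trainers ranked them. The PPO step in the quote then optimizes the chat model against those scores.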

I have the feeling that the future big steps in AI will not be related to data

> this train will keep rollin’ for a while yet.

OK. And then what?

So they're still trained on the Internet, just not only the public Internet anymore. What a clickbait title.

The dream is collapsing.

As Something of a Vagueposter Myself, I’ll bite.

Which dream is collapsing? I’m not disputing it, legitimately curious which of several collapsing dreams you mean.

It's the title of a Hans Zimmer song from the Inception soundtrack, pretty intense!

Cool sidebar, thanks for reminding me. I find Zimmer soundtracks to be some of the best coding music and many if not most of them are on e.g. Spotify.

Wow. They quoted me in the article.

This is one of the longest articles anyone ever wrote to prove me wrong.

I agree that LLMs are not 100% trained on Internet posts, but they are mostly trained on good Internet posts.

When I ask an LLM a question I expect a simulation of a good answer from an Internet discussion board specializing in that topic.

In true HN comment style, the article comes out both with guns blazing and leads with a “well actually…” :)

This sounds terrible. We are paying PhD level experts to produce novel work exclusively available through an overly optimistic, lying, robot.

What if those experts just published the stuff freely online instead? Surely that would be more productive and trustworthy. Reality is truly stupid.

Is that different from a company doing R&D without releasing the results to the public? Experts working for private gain is the norm, not the exception.

That kind of work also sounds incredibly boring. They probably have to pay their experts a lot more than it would normally cost to hire the same caliber of experts. Which would mean they are not generating very much private data.

Hopefully it'll all be available in a fire sale when these companies finally have to be stripped for parts.

> We are paying PhD level experts to produce novel work exclusively available through an overly optimistic, lying, robot.

Because nobody else is willing to pay them to do that. Research grants, peer review publications, etc, are not set up to reward stuff like “produce one thousand random examples of expressing thoughtful uncertainty with respect to questions relevant to your discipline for which there is currently no consensus answer”

> What if those expert just published the stuff freely online instead. Surely that would be more productive and trustworthy.

I’m sure many of them would be happy to do so if someone is paying them

Still kind of an absurdist comedy that no one wants to pay for expert opinions, but they _do_ want to pay for a realistic-sounding, 80% correct version of an expert opinion.

It isn't about what they are willing to pay for, it is about how much they are willing to pay for it.

A "realistic-sounding, 80% correct version of an expert opinion" only costs US$20/month–and if you are willing to compromise a bit on correctness and/or privacy, you can even get it for free. No way you are getting a real expert at that price.

Whereas, pool $20/month across millions of subscribers, and suddenly you have enough money – and, more importantly, future revenue growth prospects with which to convince investors to give you even more money – to afford roomfuls of real experts to try to get that "80% correct" percentage higher.

People pay for expert opinions all the time. Experts also give realistic-sounding but wrong opinions all the time, to the point that seeking a 2nd opinion is the norm. Emphasis on "2nd" there -- we can only afford to pay two, or at most a handful of, experts.

> We are paying PhD level experts to produce novel work exclusively available through an overly optimistic, lying, robot.

We are not. OpenAI is.

If they published it freely then they wouldn't get paid. I thought everyone was mad about how there aren't enough employment opportunities for PhDs? Training LLMs on your area of expertise is surely preferable to working a dull office job that has nothing to do with what you studied.

For a company with "open" in its name, we sure don't know what data OpenAI trains its models on. Why?

> we sure don't know what data openai trains its models on. Why?


Um, that's kind of obvious.

Because they want to make money.

I think the point of the article is that it may get to the point where these internal datasets are so much more valuable than the sea of fake nonsense on the open internet that the practice of keeping the internal datasets private will become a no-brainer.

Who is willing to give up billions of dollars just 'cuz? Even if you can't run a company, one of the big companies will happily pay you for access to that data. There's no way you make the data public if that data is able to provide an edge in training.

In fact, let's be realistic, the more of an edge it provides, the more money it will be worth.

Data is valuable. For example, I don't understand why Reddit and Stack Overflow gave their valuable data to OpenAI and Google for pennies, when they could've made their own chatbots and beaten OpenAI and Google at their own game.

Apparently (according to AI enthusiasts) all publicly accessible data is free from copyright when used as training data. It doesn't really look like "owning" the data is worth very much money at all.

Also according to e.g. German law, when done for research purposes.

OpenAI would have stolen it if they didn't license it so it's "free money" for Reddit or Stack Overflow when OpenAI or any other comes along and offers money up front.

It’s already stolen in a way from all the users who just wanted to be able to use a forum and had no serious alternatives

StackOverflow is Creative Commons, so until courts/regulators decide otherwise, anyone can probably claim it's fair game to train on it, same as Wikipedia.


The "Actually," is extra, but I can no longer edit it.

Stack Overflow did try this but imo their LLM wasn't that good

So after a few iterations they decided to give up and go for the quick buck? So shortsighted of them.

Do they have much time left? Seems like once the answers are all scraped and trained, that site wouldn't be able to survive long anyway?

The next version of libfoo released after 2023 will have a new set of options in /etc/foo.conf and at some point a human being who knows that will have to answer a question about it for an LLM to know that.

No, future LLMs will ingest the codebase, any docs, and be able to answer the question anyway. That is, if the LLM didn’t generate the code base itself…

There's a lot of wishful thinking going on here, isn't there?

All these models still regularly fail to provide me with correct code to use even the most common open source libraries (stuff like numpy and matplotlib), even though they were certainly trained on that code.

LLMs don't learn from reading manuals; they learn by mimicking explanations they have seen, written by humans who read the manuals. Until you solve that we will still need humans.

Except, no. Not unless docs and codebases are re-written in question and answer format

Something like 1/3rd of Google searches have never been asked before. Maybe something similar could be said of Stack Overflow questions? There could still be quite a need for human expertise. Tbh, without significant coaxing, LLMs tend to tailor their replies as if they were responding to the 'average' user - there could still be a lot of value for two 'non-average' users engaging in conversation.

I assume they thought that OpenAI and Google would have used that data anyways without a clear way to prove otherwise.

It's not their core business model and so far LLMs aren't very profitable. Investing a ton of money into a whole new business area with heavy competition that may be profitable eventually while not swimming in cash is often how companies die.
