For some context on why this is important: this benchmark was designed to be extremely challenging for LLMs, with problems requiring several hours or days of work by expert mathematicians. Currently, LLMs solve 2% of problems in the set (which is kept private to prevent contamination).
They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO question writers):
> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
If I were going to bet, I would bet yes: they will reach above 85% performance.
The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs are much better at benchmarks created before they were trained than at ones created after. There are countless papers showing significant leakage between training and test sets for models.
This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.
In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.
There's a lot of behind the scenes talk about unethical teams that collect data which doesn't technically overlap test sets, but is extremely close. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.
Could you run the benchmark by bootstrapping (averaging over repeated subsampling), instead of reporting a straight-across performance score, and regain some leakage resistance that way, as well as a better simulation of "out of sample" data, at least for a little while?
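Something like this is what I have in mind, as a rough sketch (it assumes you have per-question pass/fail results; the names and numbers here are made up):

```python
import random

def bootstrapped_score(results, n_resamples=1000, subsample_size=50, seed=0):
    """Report the average score over many random subsamples of the question
    set, plus the spread, instead of a single score over the full set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        sample = rng.sample(results, subsample_size)   # subsample without replacement
        scores.append(sum(sample) / subsample_size)    # fraction solved in this subsample
    scores.sort()
    return sum(scores) / n_resamples, (scores[25], scores[-25])  # mean and a rough 95% band

# results[i] = 1 if the model solved question i, else 0 (hypothetical data)
results = [1] * 6 + [0] * 294
print(bootstrapped_score(results))
```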
This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem.
Ideally they would have batches of those exercises, and only use the next batch once someone has solved a suspicious number of them. If the model performs much worse on the next batch, that is a tell of leakage.
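A rough sketch of the check I mean (the numbers are hypothetical; the test itself is just a standard two-proportion comparison):

```python
from math import sqrt

def leakage_signal(solved_old, total_old, solved_new, total_new):
    """Compare accuracy on the already-used batch against a freshly held-out
    batch (two-proportion z-test). A large positive z means the old-batch
    score is inflated relative to the new one, i.e. a possible sign of
    leakage rather than genuine capability."""
    p_old = solved_old / total_old
    p_new = solved_new / total_new
    p_pool = (solved_old + solved_new) / (total_old + total_new)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_old + 1 / total_new))
    return (p_old - p_new) / se

# Hypothetical: 40/100 on the old batch vs. 8/100 on the fresh batch
print(leakage_signal(40, 100, 8, 100))  # z ≈ 5.3, a very strong tell
```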
I looked at the sample questions, and even if they get the questions, there is no way they will figure out the answers without making significant breakthroughs in understanding mathematics and logic.
>Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
Or they know the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.
The answer isn't sent, though, so it would have to be a very deliberate effort to fish those out of the API chatter and find the right domain expert with 4-10 hours to spend on cracking it.
Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.
Why do people still insist that this is unlikely? Like assuming that the company that paid 15M for chat.com does not have some spare change to pay some graduate students/postdocs to solve some math problems. The publicity of solving such a benchmark would definitely raise the valuation, so it would 100% be worth it for them...
Simple: I highly doubt they're willing to risk a scandal that would further tarnish their brand. It's still reeling from last year's drama, in addition to a spate of high-profile departures this year. Not to mention a few articles with insider sources that aren't exactly flattering.
I doubt it would be seen as a scandal. They can simply generate training data for these questions, just like they generate it for other problems. The only difference is that the pay rate is probably much higher for this kind of training data than in most other areas.
You’re not thinking about the other side of the equation. If they win (becoming the first to excel at the benchmark), they potentially make billions. If they lose, they’ll be relegated to the dustbin of LLM history. Since there is an existential threat to the brand, there is almost nothing that isn’t worth risking to win. Risking a scandal to avoid irrelevance is an easy asymmetrical bet. Of course they would take the risk.
Okay, let's assume what you say ends up being true. They effectively cheat, then raise some large fundraising round predicated on those results.
Two months later there's a bombshell exposé detailing insider reports of how they cheated the test by cooking their training data using an army of PhDs to hand-solve. Shame.
At a minimum investor confidence goes down the drain, if it doesn't trigger lawsuits from their investors. Then you're looking at maybe another CEO ouster fiasco with a crisis of faith across their workforce. That workforce might be loyal now, but that's because their RSUs are worth something and not tainted by fraud allegations.
If you're right, I suppose it really depends on how well they could hide it via layers of indirection and compartmentalization, and how hard they could spin it. I don't really have high hopes for that given the number of folks there talking to the press lately.
Of course lol. How come e.g. o1 scores so high on these reasoning and math and IMO benchmarks and then fails every simple question I ask of it? The answer is training on the test set.
> Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
Why surprisingly?
2028 is twice as far away as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!
4 years is a lot of time. It's kind of silly to assume LLM capabilities have already plateaued.
Sure, but it is also reasonable to consider that the pace of progress is not always exponential, or even linear. Diminishing returns are a thing, and we already know that a 405b model is not 5 times better than a 70b model.
Exponential progress isn't usually just one thing; if you zoom in, any particular thing may plateau, but its impact compounds by enabling the growth of successors, variations, and related inventions. Nor is it a smooth curve, if you look closely. I feel statements like "a 405b model is not 5 times better than a 70b model" are zooming in on a specific class of models so much you can see the pixels of the pixel grid.

There's plenty of open and promising research into tweaking the current architecture in training or inference (see e.g. the other thread from yesterday[0]), on top of changes to architecture, methodology, and methods of controlling or running inference on existing models by lobotomizing them or grafting networks onto networks, etc. The field is burning hot right now; we're counting the space between incremental improvements and interesting research directions in weeks. The overall exponent of "language model" power may well continue when you zoom out a little bit further.
How do you determine the multiplier? There are, e.g., many problems that GPT4 can solve while GPT3.5 can't. In that case it is infinitely better.
Let's say your benchmark puts you at 60% with a 70b-parameter model and you get to 65% with a 405b one; it's fairly obvious that this is just incremental progress, not a sustainable growth of capabilities per added parameter. Also, most of the data used these days for training these very large models is synthetic, which is probably very low quality overall compared to human-sourced data.
So if there's a benchmark on which a model scores 60%, does that mean it's literally impossible to make anything more than 67% better (since a perfect score is only about 1.67x of 60%)?
E.g. if someone scores 60% at a high school exam, is it impossible for anyone to be more than 67% smarter than this person at that subject?
Then what if you have another benchmark where GPT3.5 scores 0%, but GPT4 scores 2%? Does that make GPT4 infinitely better?
E.g. supposedly there was one LLM that did 2% in FrontierMath.
I think because if you end up having an AI that is as capable as the graduate students Tao is used to dealing with (so basically potential Fields medalists), then you are basically betting there's an 85% chance that something like AGI (at least in its consequences) will be here in 3 years. It is possible, but 85% chance?
It would also require the ability to easily handle large amounts of complex information and dependencies, such as massive codebases, and then also to operate physically like humans do, by controlling a robot of some sort.
Being able to solve self-contained exercises can obviously be very challenging, but there are other types of skills that may or may not be related and that have to be developed as well.
>then you are basically betting that 85% chance something like AGI
Not really. It would just need to do more steps in a sequence than current models do. And that number has been going up consistently. So it would be just another narrow-AI expert system. It is very likely that the benchmark will be solved, but it is very unlikely that the solver will be generally capable in the sense most researchers understand AGI today.
I am willing to bet it won't be solved by 2028 and the betting market is overestimating AI capabilities and progress on abstract reasoning. No current AI on the market can consistently synthesize code according to a logical specification and that is almost certainly a requirement for solving this benchmark.
What research are you basing this on? Because fill-in-the-middle and other non-standard approaches to code generation, in particular, have shown incredible capability. I'm pretty sure that by 2028 LLMs will be able to write code to specification better than most human programmers. Maybe not on the level of million-line monolithic codebases that certain engineers have worked on for decades, but smaller, modern projects for sure.
It's based on my knowledge of mathematics and software engineering. I have a graduate degree in math and I have written code for more than a decade in different startups across different domains ranging from solar power plants to email marketing.
I've been actively researching in this field for close to a decade now, so let me tell you: Today is nothing like when I started. Back then everyone rightly assumed this kind of AI was decades if not centuries away. Nowadays there are still some open questions regarding the path to general intelligence, but even they are more akin to technicalities that will probably be solved on a time frame of years or perhaps even months. And expert systems are basically at the point where they can start taking over.
Scaling up compute, creating and curating data (be it human or synthetically sourced) and more resilient benchmarking for example. But on the algorithmic side we already have a true general purpose, arbitrarily scalable, differentiable algorithm. So training it to do the right stuff is essentially the only missing ingredient. And models are catching up fast.
Do they? My impression's been the opposite in recent years - the S-curve is a meme at this point, and is used as a middlebrow dismissal.
"All exponents in nature are s-curves" isn't really useful unless you can point at the limiting factors more precisely than "total energy in observable universe" or something. And you definitely need more than "did you know that exponents are really s-curves?" to even assume we're anywhere close to the inflection point.
I think (to give them the most generous read) they are just betting that the halfway point is still pretty far ahead. It is a different bet, but IMO not an inherently ridiculous one like simply misidentifying the shape of the thing; everything is a logistic curve, right? At least, everything that doesn't blow up to infinity.
For one, thinking LLMs have plateaued is essentially assuming that video can't teach AI anything. It's like saying a person locked in a room his whole life with only books to read would be as good at reasoning as someone who's been out in the world.
What reason do you have to believe we're anywhere close to the middle of the S-curve? The S-curve may be the only sustainable shape in nature in the limit, but that doesn't mean any exponent someone points to is already past the inflection point.
Why are you thinking in binary? It is not clear at all to me that progress is stagnating; in fact, I am still impressed by the progress. But I couldn't tell you whether a wall is coming or not. There is no clear reason why this progress should follow some standard or historical curve.
The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.
What matters is the data structure that underlies the problem space: graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges based on a set of rules.
Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.
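A sketch of one way to set up that kind of test (the helper functions and parameters here are illustrative, not taken from any particular benchmark):

```python
import random
from collections import deque

def make_graph(n_nodes=12, n_edges=20, seed=0):
    """Random undirected graph; the full edge list goes into the prompt."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        edges.add((min(a, b), max(a, b)))
    return sorted(edges)

def shortest_path(edges, start, goal):
    """BFS ground truth: used to pick start/goal pairs more than 4 edges apart."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, queue = {start}, deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # unreachable

def is_valid_path(edges, path, start, goal):
    """Score the model's answer: right endpoints, and every hop is a real edge."""
    edge_set = {frozenset(e) for e in edges}
    return (len(path) > 1 and path[0] == start and path[-1] == goal
            and all(frozenset((a, b)) in edge_set for a, b in zip(path, path[1:])))

edges = make_graph()
prompt = (f"This undirected graph has edges {edges}. "
          f"Give a path of nodes from 0 to 11, or say 'none' if there isn't one.")
# Send `prompt` to the model, parse the node list it returns, then check it
# with is_valid_path(...) and compare its length against shortest_path(...).
```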
> Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph.
This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?
Yes, they also fail. I've found the original gpt4 to be the most consistent. One of these days I'll spend the couple of thousand dollars needed to benchmark all the top models and see how they actually perform on a task which can't be gamed.
Not sure why you're being downvoted; that is exactly why I'm using that simple problem to benchmark LLMs. If an LLM can't figure out how to traverse a graph in its working memory, then it has no hope of figuring out how to structure a proof.
Under natural deduction, all proofs are subtrees of the graph induced by the inference rules from the premises. Right now LLMs can't even manage a linear proof if it gets too long, even when given all the induced vertices.
Not to mention that math proofs are more than graph traversals... (although maybe simple math problems are not). There is also the problem of extracting the semantics of mathematical formalisms. This is easier in day-to-day language; I don't know to what extent LLMs can also extract the semantics and relations of different mathematical abstractions.
But humans can solve these problems given enough time and domain knowledge. An LLM would never be able to solve them unless it gets smarter. That's the point.
It’s not about whether a random human can solve them. It’s whether AI, in general, can. Humans, in general, have proven to be able to solve them already.
> It will be a useful benchmark to validate claims by people like Sam Altman about having achieved AGI.
I think it is possible to achieve AGI without creating an AGI that is an expert mathematician, and that it is possible to create a system that can do FrontierMath without achieving AGI. I.e. I think failure or success at FrontierMath is orthogonal to achieving AGI (though success at it may be a step on the way). Some humans can do it, and some AGIs could do it, but people and AI systems can have human-level intelligence without being able to do it. OTOH I think it would be hard to claim you have ASI if it can't do FrontierMath.
I think people just see FrontierMath as a goal post that an AGI needs to hit. The term "artificial general intelligence" implies that it can solve any problem a human can. If it can't solve math problems that an expert human can, then it's not AGI by definition.
I think we have to keep in mind that humans have specialized. Some do law. Some do math. Some are experts at farming. Some are experts at dance history. It's not the average AI vs the average human. It's the best AI vs the best humans at one particular task.
The point with FrontierMath is that we can summon at least one human in the world who can solve each problem. No AI can in 2024.
If you have a single system that can solve any problem any human can, I'd call that ASI, as it's way smarter than any human. It's an extremely high bar, and before we reach it I think we'll have very intelligent systems that can do more than most humans, so it seems strange not to call those AGIs (they would meet the definition of AGI on Wikipedia [1]).
>If you have a single system that can solve any problem any human can, I'd call that ASI
I don't think that's the popular definition.
AGI = solve any problem any human can. In this case, we've not reached AGI since it can't solve most FrontierMath problems.
ASI = intelligence far surpasses even the smartest humans.
If the definition of AGI is that it's more intelligent than the average human, you can argue that we already have AGI today. But no one thinks we have AGI today. Therefore, AGI is not Claude 3.5.
Hence, I think the most acceptable definition for AGI is that it can solve any problem any human can.
People have all sorts of definitions for AGI. Some are more popular than others, but at this point there is no one true definition. Even OpenAI's definition is different from what you have just said. They define it as "highly autonomous systems that outperform humans in most economically valuable tasks".
>AGI = solve any problem any human can.
That's a definition some people use yes but a machine that can solve any problem any human can is by definition super-intelligent and super-capable because there exists no human that can solve any problem any human can.
>If the definition of AGI is that it's more intelligent than the average human, you can argue that we already have AGI today. But no one thinks we have AGI today.
There are certainly people who do, some of whom are pretty well respected in the community, like Norvig.
>That's a definition some people use yes but a machine that can solve any problem any human can is by definition super-intelligent and super-capable because there exists no human that can solve any problem any human can.
We don't need every human in the world to learn complex topology math like Terence Tao. Some need to be farmers. Some need to be engineers. Some need to be kindergarten teachers. When we need someone to solve those problems, we can call Terence Tao.
When AI needs to solve those problems, it can't do it without humans in 2024. Period.
That's the whole point of this discussion.
The definition of ASI historically is that it's an intelligence that far surpasses humans - not at the level of the best humans.
>We don't need every human in the world to learn complex topology math like Terence Tao. Some need to be farmers. Some need to be engineers. Some need to be kindergarten teachers.
It doesn't have much to do with need. Not every human can be as capable, regardless of how much need or time you allocate for them. And some humans are head and shoulders above their peers in one field but come up a bit short in another, closely related one they've sunk a lot of time into.
Like I said, arguing about a one true definition is pointless. It doesn't exist.
>The definition of ASI historically is that it's an intelligence that far surpasses humans - not at the level of the best humans.
A Machine that is expert level in every single field would likely far surpass the output of any human very quickly. Yes, there might exist intelligences that are significantly more 'super' but that is irrelevant. Competence, like generality is a spectrum. You can have two super-human intelligences with a competence gap.
The reason for the AGI definition is to indicate a point where no human can provide more value than the AGI can. AGI should be able to replace all work efforts on its own, as long as it can scale.
ASI is when it is able to develop a much better version of itself to then iteratively go past all of that.
It is very much an open question just what an LLM can solve when allowed to generate an indefinite number of intermediate tokens and allowed to sample an arbitrary amount of text to ground itself.
There are currently no tools that let LLMs do this, and no one is building the tools for answering open-ended questions.
That's correct. Thanks for clarifying for me, because I have gotten tired of the comparison to "99% of humans can't do this" as a counter-argument to AI hype criticism.
An AI that can be onboarded to a random white collar job, and be interchangeably integrated into organisations, surely is AGI for all practical purposes, without eliminating the value of 100% of human experts.
If an AI achieved 100% in this benchmark it would indicate super-intelligence in the field of mathematics. But depending on what else it could do it may fall short on general intelligence across all domains.
If a model can't innately reason over 5 steps in a simple task but produces a flawless 500-step proof, you either have divine intervention or memorisation.
Also, AIMOv2 is doing stage 2 of their math challenge; they are now at "national olympiad" level of difficulty. They have a new set of questions. Last year's winner (27/50 points) got 2/50 on the new set. In the first 3 weeks of the competition the top score is 10/50 on the new set, mostly with Qwen2.5-math. Given that this is a purposefully made new set of problems, "made to be AI hard" according to the organizers, I'd say the regurgitation stuff is getting pretty stale.
Also, the fact that Claude 3.5 can start coding in an invented language given ~20-30k tokens of "documentation" about that language is some kind of proof that, in this case, the stochastic parrots are the dismissers.
I'm not sure if it is feasible to provide all relevant sources to someone who doesn't follow a field. It is quite common knowledge that LLMs in their current form have no ability to recurse directly over a prompt, which inherently limits their reasoning ability.
I am not looking for all sources. And I do follow the field. I just don't know the sources that would back the claim they are making. Nor do I understand why limits on recursion mean there is no reasoning and only memorization.
The closest explanation of how chain of thought works is that it suppresses the probability of a termination token.
People have found that even letting LLMs generate gibberish tokens produces better final outputs. Which isn't a surprise when you realise that the only way an LLM can do computation is by outputting tokens.
Unless you are building one of the frontier models, I’m not sure that your experience gives you insight on those models. Perhaps it just creates needless assumptions.
I’m guessing he’s probably talking about LessWrong, which nowadays also hosts a ton of serious safety research (and is often dismissed offhandedly because of its reputation as an insular internet community).
Yes, LessWrong and the Alignment Forum have a lot of excellent writing and a great community. It may not be to everyone's taste, however.
For people that have demonstrated various levels of commitment, such as post-graduate study and/or continuing education, there are various forums available.
I'm also interested to see what other people have found.
> A government agency determining limits on, say, heavy metals in drinking water is materially different than the government making declarations of what ideas are safe and which are not
Access to evaluate the models basically means the US government gets to know what these models are capable of, but the US AISI has basically no enforcement power to dictate anything to anyone.
This is just a wild exaggeration of what's happening here.
"This is just a wild exaggeration of what's happening here."
Is it? From the article...
"Both OpenAI and Anthropic said signing the agreement with the AI Safety Institute will move the needle on defining how the U.S. develops responsible AI rules."
From the UK AISI website, with which the data is also shared (their main headline, in fact):
The reality is that this would be a tremendous waste of time and money if it were all just to sate some curiosity... which of course it isn't.
Let's look at what the US AISI (part of the Department of Commerce, a regulatory agency) has to say about itself:
"
About the U.S. AI Safety Institute huuh
The U.S. AI Safety Institute, located within the Department of Commerce at the National Institute of Standards and Technology (NIST), was established following the Biden-Harris administration’s 2023 Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence to advance the science of AI safety and address the risks posed by advanced AI systems. It is tasked with developing the testing, evaluations and guidelines that will help accelerate safe AI innovation here in the United States and around the world." -- (https://www.nist.gov/news-events/news/2024/08/us-ai-safety-i...)
So to understand what the US AISI is about, you need to look at that Executive Order:
"With this Executive Order, the President directs the most sweeping actions ever taken to protect Americans from the potential risks of AI systems"
"Require that developers of the most powerful AI systems share their safety test results and other critical information with the U.S. government."
There's a fair amount of reasonable stuff in there about national security and engineering bioweapons (dunno why that's singled out). But then we get to other sections...
"Protecting Americans’ Privacy"
"Advancing Equity and Civil Rights"
"Standing Up for Consumers, Patients, and Students"
"Supporting Workers"
"Promoting Innovation and Competition"
etc.
While the US AISI may not have direct rule-making ability at this point, it is nonetheless an active participant in the process of informing those parts of government which do have such regulatory and legislative authority.
And while there is plenty in that executive order that many might agree with, the interpretations of many of those points are inherently political and would not find a meaningful consensus. You might agree with Biden/Harris on the priorities about what constitutes AI safety or danger, but what about the next administration? What if Biden hadn't dropped out and you had ended up with Trump? As much of a threat as AI might represent as it develops, I am equally nervous about an unconstrained government seizing opportunities to extend its power beyond its traditional reach, including in areas of freedom of speech and thought.
> Because lobbying exists in this country, and because legislators receive financial support from corporations like OpenAI, any so-called concession by a major US-based company to the US Government is likely a deal that will only benefit the company.
Sometimes both benefit? OAI and Anthropic benefit from building trust with government entities early on, and perhaps setting a precedent of self-regulation over federal regulation, and the US government gets to actually understand what these models are capable of, and have competent people inside the government track AI progress and potential downstream risks from it.
Of course they benefit, that's why it's a deal. But we don't. The free market or average taxpayer doesn't get anything out of it. Competition and innovation gets stifled - choices narrow down to 2 major providers. They make all the money and control the market.
> My issue with AI safety is that it's an overloaded term. It could mean anything from an llm giving you instructions on how to make an atomic bomb to writing spicy jokes if you prompt it to do so. it's not clear which safety these regulatory agencies would be solving for.
I think if you look at the background of the people leading evaluations at the US AISI [1], as well as the existing work on evaluations by the UK AISI [2] and METR [3], you will notice that it's much more the former than the latter.
Anyone who really wants to make an atomic bomb already knows how to make an atomic bomb. The limitations are in access to raw materials and ability to do large scale enrichment.
I agree. I’m really more concerned about bioweapons, for which it’s generally understood (in security studies) that access to technical expertise is the limiting factor for terrorists. See Al Qaeda’s attempts to develop bio weapons in 2001.
I believe the US AISI has published less on their specific approach, but they’re largely expected to follow the general approach implemented by the UK AISI [1] and METR [2].
This is mostly focused on evaluating models on potentially dangerous capabilities. Some major areas of work include:
- Misuse risks: For example, determining whether models have (dual-use) expert-level knowledge in biology and chemistry, or the capacity to substantially facilitate large-scale cyber attacks. Good examples are the work by Soice et al. on bioweapon uplift [5] and Meta's work on CYBERSECEVAL [6], respectively.
- Autonomy: Whether models are capable of agent-like behavior, like the kind that would be hard for humans to control. A big sub-area is Autonomous Replication and Adaptation (ARA), like the ability of the model to escape simulated environments and exfiltrate its own weights. A good example is METR's original set of evaluations on ARA capabilities [3].
- Safeguards: How vulnerable these models are to say, prompt injection attacks or jailbreaks, especially if they're also in principle capable of other dangerous capabilities (like the ones above). Good examples here are the UK AISI's work developing in-house attacks on frontier LLMs [4].
Labs like OAI, Anthropic and GDM already perform these internally as they're part of their respective responsible scaling policies, which determine which safety measures they should have implemented for every given 'capability' level of their models.
In 1995, Aum Shinrikyo carried out attacks on Japanese subways using Sarin gas, which they had produced. They killed over a dozen people, and temporarily blinded around a thousand.
You seem to be claiming that the only reason we haven't seen similar attacks from the thousands of worldwide doomsday cults and terrorist groups over the last three decades is that they don't want to. I disagree. I think that if step-by-step, adaptable directions for creating CBRN weapons were widely accessible, we would see many more such attacks, and many more deaths.
Current SOTA models do not seem to have this capability. However, it is entirely plausible that future models will exceed the capabilities of a bunch of long-haired cultists, in the mountains, in 1995. This is not a fake risk.
Yes, that is essentially the reason. It's not hard to know enough chemistry to figure out how to make these things. The reason such attacks (your example is small-scale and very ineffective, let's not forget) don't happen more often is the general incompetence of human beings and the relatively tight controls on the basic components (which aren't particularly challenging to monitor for). The tests described are theater, based on the idea that knowledge itself is dangerous.
This way of testing is a regressive stance that essentially presupposes that our adversaries are dumb babies that can't figure anything out on their own. If that was the case, they would also be too stupid to figure out the correct things to ask to get a real set of instructions. Given those things, it's theater.
Theater wastes everyone's time in order to placate people who cannot, or don't want to, evaluate the actual risks involved. This is something we shouldn't make a habit of doing. It's not worth wasting the time of people with good ability to assuage the worries of people with little ability, in a way that has no effect on actual risk. Instead, we should address real risk (which we're already doing) and educate other people so they can understand that these are the correct steps to take.
So, your argument is that groups like ISIS and Hamas
- Don't really want to hurt a lot of people that way
- Couldn't access any dangerous ingredients, even if they had the know-how
- Are too dumb to build these things
I agree with reason #3. That is why I don't want to give out open-source models which are world-class experts in chemistry, biology, logistics, operations, and tutoring dumb people.
I disagree with your belief that motivated people with a next-generation generative model doing their planning could not source dangerous ingredients. I'm not going to say much about CBRN in particular, but e.g. ANFO bombs are prevented by monitoring fertilizer sales; nobody tries to monitor natural gas sales or make sure some compound out in the hills isn't setting up their own Haber-Bosch process.
I am also opposed to security theater. Run the numbers on TSA, and it's easy to see that it's a net negative even if it cost 0 tax dollars. But not all government-led safety efforts are theater; seatbelt laws saved a lot of lives, indoor smoking bans saved a lot of lives, OSHA saved a lot of lives.
We know there are folks out there who want to kill a lot of people. We know their capabilities range from "grabbing the nearest hard or pointy object and swinging it" to "medium-scale CBRN attacks." Pushing each of these kinds of people one or two rungs up the capabilities ladder is a real danger; nothing imaginary about it.
I imagine you meant societal harms? I think this was mostly my fault. I edited the areas of work a bit to better reflect what the UK AISI is actually working on right now.
> Now, we can see from this description that nothing about the modeling ensures that the outputs accurately depict anything in the world. There is not much reason to think that the outputs are connected to any sort of internal representation at all.
This is just wrong. Accurate modelling of language at the scale of modern LLMs requires these models to develop rich world models during pretraining, which also requires distinguishing facts from fiction. This is why bullshitting happens less with better, bigger models: the simple answer is that they just know more about the world, and can also fill in the gaps more efficiently.
We have empirical evidence here: it's even possible to peek into a model to check whether the model 'thinks' what it's saying is true or not. From “Discovering Latent Knowledge in Language Models Without Supervision” (2022) [1]:
> Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. (...) We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
So when a model is asked to generate an answer it knows is incorrect, its internal state still tracks the truth value of the statements. This doesn't mean the model can't be wrong about what it thinks is true (or that it won't try to fill in the gaps incorrectly, essentially bullshitting), but it does mean that its world models are sensitive to truth.
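For reference, the core of the method in [1] (contrast-consistent search) is small enough to sketch here. This assumes you've already extracted hidden states for each statement paired with its negation, and it leaves out details like the normalization and multiple restarts the paper uses:

```python
import torch

class CCSProbe(torch.nn.Module):
    """Linear probe mapping a hidden state to P(statement is true)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h))

def ccs_loss(p_pos, p_neg):
    # Consistency: a statement and its negation should get complementary probabilities.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: rule out the degenerate "always 0.5" solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(h_pos, h_neg, epochs=1000, lr=1e-3):
    """h_pos / h_neg: [n_examples, hidden_dim] activations for 'X? Yes' / 'X? No'.
    Trained with no labels at all; the probe is read out as probe(h) > 0.5."""
    probe = CCSProbe(h_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        ccs_loss(probe(h_pos), probe(h_neg)).backward()
        opt.step()
    return probe
```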
More broadly, we do know these models have rich internal representations, and we have started learning how to read them. See for example “Language Models Represent Space and Time” (Gurnee & Tegmark, 2023) [2]:
> We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. While further investigation is needed, our results suggest modern LLMs learn rich spatiotemporal representations of the real world and possess basic ingredients of a world model.
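The "linear representations" claim is concrete enough to test with nothing more than a linear probe on hidden states. A minimal sketch, assuming you've collected the activations and ground-truth coordinates yourself (the file names are placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# acts: one hidden-state vector per place name, shape [n_places, hidden_dim]
# coords: the true (latitude, longitude) for each place, shape [n_places, 2]
acts = np.load("place_activations.npy")      # placeholder paths
coords = np.load("place_coordinates.npy")

X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)     # purely linear readout, no MLP
print("held-out R^2:", probe.score(X_te, y_te))
# A high held-out R^2 from a linear readout is what "linear representation
# of space" means in practice; the paper does the analogous thing for time.
```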
For anyone curious, I can recommend the Othello-GPT paper as a good introduction to this problem (“Do Large Language Models learn world models or just surface statistics?”) [3].
This isn't really true. LLMs are discriminating actual truth (though perhaps not perfectly). Other similar studies suggest that they can differentiate, say, between commonly held misconceptions and scientific facts, even when they're repeating the misconception in a context. This suggests models are at least sometimes aware when they're bullshitting or spreading a misconception, even if they're not communicating it.
This makes sense, since you would expect LLMs to perform better when they can differentiate falsehoods from truths, as it's necessary for some contextual prediction tasks (say, predicting Snopes.com, or predicting what a domain expert would say about topic X).
No. They are functions of their training data. There is absolutely no part of a LLM that functions as a truth oracle.
If training data contains multiple conflicting perspectives on a topic, the LLM has a limited ability to recognize that a disagreement is present and what types of entities are more likely to adopt which side. That is what those studies are reflecting.
That is, emphatically, a very different thing from "truth."
> If training data contains multiple conflicting perspectives on a topic, the LLM has a limited ability to recognize that a disagreement is present and what types of entities are more likely to adopt which side. That is what those studies are reflecting.
Again, we have empirical evidence to suggest otherwise. It's not that there's an oracle, but that the LLM does internally differentiate between facts it has stored as simple truth vs. misconceptions vs. fiction.
This becomes obvious by interacting with popular LLMs; they can produce decent essays explaining different perspectives on various issues, and it makes total sense that they can because if you need to predict tokens on the internet, you better be able to take on different perspectives.
Hell, we can even intervene on these internal mechanisms to elicit true answers from a model, in contexts where you would otherwise expect the LLM to output a misconception. To quote a recent paper, "Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface" [1], and this matches the rest of the interpretability literature on the topic.
[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...