Isn't this already possible to implement with skills and subagents? Like have a skill saying "to test these files run this script that executes a subagent for every markdown file, then check the results".
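Something like this minimal sketch, say (untested; `run_subagent` is just a stand-in for however your agent framework spawns a subagent):

    from pathlib import Path

    def run_subagent(prompt: str) -> str:
        # Stand-in: replace with whatever launches a subagent in your setup.
        return "PASS"

    results = {}
    for md in Path(".").rglob("*.md"):
        results[md] = run_subagent(f"Run the tests described in {md} and report PASS or FAIL")

    failed = [path for path, out in results.items() if "FAIL" in out]
    print(f"{len(failed)} of {len(results)} markdown files failed")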
It's interesting that in Figure 4, Sonnet and Opus show a curve clearly distinct from every other model's, even GPT 5.4's. Anthropic superiority, I guess.
If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab has published an incremental model update that is worse on some benchmarks. I remember a particular update of Gemini 2.5 Pro that improved its LiveCodeBench results but scored lower on most other benchmarks.
Apparently the score would be a little higher if it weren't for the fact that scores are penalized for being worse than the human baseline but aren't rewarded for being better than it (which seems like an arbitrary decision, since the human baseline is not optimal).
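If I'm reading that right, per task it amounts to something like this (my reading of the rule as described, not the official harness):

    def capped_task_score(model_score: float, human_baseline: float) -> float:
        # Falling short of the baseline costs you; exceeding it earns nothing extra.
        return min(model_score, human_baseline)

    capped_task_score(0.5, 0.7)  # 0.5: penalized for being worse than the baseline
    capped_task_score(0.9, 0.7)  # 0.7: no reward for being better than the baseline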
Once you have matched humans on a problem then further progress on that problem is not necessarily meaningful anymore, in terms of quantitative measurement of intelligence. ARC-AGI-3 is designed to compare AIs to humans, not to measure arbitrarily high levels of superhuman intelligence. For that you would want a different benchmark.
I think that any logic-based test that your average human can "fail" (i.e., score below 50% on) is not exactly testing for whether something is AGI. Though I suppose it depends on your definition of AGI (and on whether all humans, or at least your average human, count as AGI under that definition).
If I had a puzzle I really needed solved, then I would not ask a rando on the street, I would ask someone I know is really good at puzzles.
My point is: For AGI to be useful, it really should be able to perform at the top 10% or better level for as many professions as possible (ideally all of them).
An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.
> An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.
Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.
It seems they don't test for that, since they use the second-best human solution as a baseline.
And that's the right way to go. When computers were about to become superhuman at chess, few people cared that they had already been able to beat random people for many years. They cared when Kasparov was dethroned.
Remember, the point here is marketing as well as science. And the results speak for themselves. After all, you remember Deep Blue, and not the many runners-up that tried. The only reason you remember is because it beat Kasparov.
> The only reason you remember is because it beat Kasparov
There is an additional fascinating aspect to these matches, in that Kasparov obviously knew he was facing a computer, and decided to play a number of sub-optimal openings because he hoped they might confound the computer's opening book.
It's not at all clear Deep Blue would have eked out the rematch victory had Kasparov respected it as an opponent, in the way he did various human grandmasters at the time.
This is supposed to test for AGI, not ASI. ARC-AGI (later labelled "1") was supposed to detect AGI with a test that is easy for ordinary humans, not just for top humans.
> Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.
Humans without a clinically recognized mental disability are generally capable of some kind of skilled labor. The "general" part of intelligence is independent of, but sufficient for, any such special application.
I think the main value lies in letting the agent try many things while you aren't working (while you're sleeping or doing other activities): even if many of the tests aren't useful, with enough trials it can find something nice without any effort on your part.
This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
Even if your tests take a long time, you can always (if hardware permits) run multiple tests in parallel. This would enable you to explore many approaches at the same time.
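For example, if the hardware is there, even something as simple as this lets an overnight run try several configurations at once (`run_test` is a placeholder for whatever launches one of your long experiments):

    from concurrent.futures import ProcessPoolExecutor

    def run_test(config: dict) -> float:
        # Placeholder: launch one long-running experiment and return its metric.
        return 0.0

    if __name__ == "__main__":
        configs = [{"lr": lr} for lr in (1e-3, 3e-4, 1e-4)]
        with ProcessPoolExecutor(max_workers=len(configs)) as pool:
            scores = list(pool.map(run_test, configs))
        print(dict(zip([c["lr"] for c in configs], scores)))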
Experiments for us cost on the order of tens of dollars, so doing 100 of them every night quickly becomes the price of an entire new employee. And that’s not even including the cost of letting agents run all night.
Definitely not in the budget for non-VC-backed companies who aren’t in the AI bubble.
The "price of an entire new employee" framing is spot on. I kept running into the same thing: individual experiments are cheap, but they add up fast, and nobody wants to approve that budget for speculative ideas.
I've been thinking of this as a gap between VC/Kickstarter and just doing it yourself. Most early ML experiments are too small for formal funding but too expensive to casually self-fund. So I built ML Patron, where anyone can chip in a few bucks to sponsor an experiment they're curious about. I honestly don't have a good answer yet for how this turns into returns for sponsors in a traditional business sense. For now it's just open research patronage, like "I'd pay to know the answer to this". The platform runs everything on cloud GPUs with public MLflow tracking.
I think what they mean by this is that, for example, in "If it's raining, the outside is wet. It's raining, so the outside is wet", it's more important for the model to learn "If A then B. A, therefore B" than to learn what "raining", "outside" and "wet" mean.
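In Lean-ish terms, the thing worth learning is the schema, whose proof never looks inside the propositions (just an illustration; "raining" and "wet" are arbitrary stand-ins):

    -- Modus ponens as a schema: the proof only uses the shapes "A → B" and "A",
    -- never what A and B actually say.
    theorem mp {A B : Prop} (h : A → B) (a : A) : B := h a

    -- One particular instantiation of the schema.
    example (raining wet : Prop) (h : raining → wet) (r : raining) : wet := mp h r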
Honestly, the most interesting thing here is definitely that just 2D heads are enough to do useful computation (at least they are enough to simulate an interpreter) and that there is an O(log n) algorithm to compute argmax attention with 2D heads. It seems that you could make an efficient pseudosymbolic LLM with some frozen layers that perform certain deterministic operations, but also other layers that are learned.
This seems like a really interesting path for interpretability, especially if a big chunk of a model's behavior occurs pseudo-symbolically. This is an idea I had thought about before, integrating tools into the main computation path of a model, but I never imagined it could be done efficiently with just a vanilla transformer.
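To make the "frozen deterministic layer" idea concrete, here's a toy sketch (not the paper's 2D-head / O(log n) construction, just the standard trick of driving softmax attention toward a hard argmax with a low temperature):

    import numpy as np

    def frozen_argmax_head(values, scores, temperature=1e-3):
        # A single non-learned attention head: at low temperature the softmax
        # saturates, so the output is ~ the value vector at the argmax score.
        logits = (scores - scores.max()) / temperature
        weights = np.exp(logits)
        weights /= weights.sum()
        return weights @ values

    scores = np.array([0.1, 2.3, 0.7])
    values = np.eye(3)  # one-hot values so the output is easy to read
    print(frozen_argmax_head(values, scores))  # ~ [0, 1, 0], i.e. argmax = index 1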