You shouldn't use the rate as an indicator. They did something similar to what I did on my hallucinations benchmark (https://github.com/lechmazur/confabulations/), only using questions where at least one model made a mistake. I added this note:
"The benchmark includes questions where at least one LLM confabulated, in order to minimize the number of questions requiring human assessment. Because of this, and since the questions are intentionally adversarial, the absolute percentage should not be used to infer that LLMs frequently confabulate. This leaderboard does not reflect a "typical" hallucination rate."
> instead of teaching the model how to look for answers from an external source (e.g. RAG)
My benchmark specifically focuses on the RAG use case. Even with provided texts, current models still hallucinate.
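As a rough illustration of the setup (not the benchmark's actual prompts or grading code), the RAG-style test amounts to something like this:

```python
# Hypothetical illustration of a RAG-style confabulation check: the question has
# no answer in the provided text, so the correct behavior is to say so.
PROMPT_TEMPLATE = """Answer the question using ONLY the text below.
If the answer is not in the text, reply exactly: "Not found in the text."

Text:
{document}

Question: {question}
"""

def build_prompt(document: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(document=document, question=question)

# A model that invents an answer here is confabulating even though it was handed
# the source material, which is the failure mode being measured.
```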
Perhaps it's just because English is not my native language, but prompt 3 isn't quite clear at the beginning when it says "group of four. Words (...)". It is not explained what the group of four must be. If I add "group of four words" to the prompt, Claude 3.5 manages to answer it, while without it, Claude says it isn't clear enough and refuses to answer.
What a neat benchmark! I'm blown away that o1 absolutely crushes everyone else in this. I guess the chain of thought really hashes out those associations.
Confabulations are decreasing with newer models. I tested confabulations based on provided documents (relevant for RAG) here: https://github.com/lechmazur/confabulations/.
Note the significant difference between GPT-4 Turbo and GPT-4o.
Yes, that's very fast. The same query ran at 249 tokens/s on Groq, which is known for its fast AI inference, and at 25 tokens/s on Together.ai. However, it's unclear what quantization (if any) was used, and this is just a spot check, not a true benchmark.
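For reference, here's roughly how such a spot check can be done against an OpenAI-compatible streaming endpoint (a rough sketch: the model name is a placeholder, the base URL is what I believe Groq exposes, and counting streamed chunks only approximates tokens):

```python
import time
from openai import OpenAI  # pip install openai; Groq/Together expose OpenAI-compatible APIs

# Placeholder credentials and model name; chunk count is a rough proxy for tokens.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # placeholder model name
    messages=[{"role": "user", "content": "Explain RAG in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.time() - start
print(f"~{chunks / elapsed:.0f} tokens/s (chunk count as a token proxy)")
```

This measures end-to-end throughput including time to first token, so it will read a bit lower than a provider's steady-state decode speed.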
The author is in for a rough time in the coming years, I'm afraid. We've barely scratched the surface with AI's integration into everything. None of the major voice assistants even have proper language models yet, and ChatGPT only just introduced more natural, low-latency voices a few days ago. Software development is going to be massively impacted.
You should really have a look at Marx. He literally predicted what would happen when we reach the state of "let machines do all the work", and also how this is exactly what finally implodes capitalism as a concept. The major problem is that he believed the industrial revolution would automate everything to that extent, which it didn't, but here we are with a reasonable chance that AI will finally do the trick.
It may implode capitalism as a concept, but the people who most benefit from it and hold the levers of power will also have their egos implode, which they cannot stand. Even Altman has talked about UBI and a world of prosperity for all (although his latest puff piece says we just can't conceive of the jobs we'll have, but w/e), yet anyone who's "ruling" the current world is going to be the least prepared for a world of abundance and happiness for all where money is meaningless. They won't walk joyfully into the utopia they peddled in yesteryear; they'll try to prop up a system that positions them as superior to everyone else, and if that means the world goes to hell, so be it.
(I mean, there was that one study that used a chatbot to deradicalize people, but when you're the one in power, your mental pathologies are viewed as virtues, so good luck trying to change them as people.)
The Supreme Court decision was toothless because it allowed universities to create race proxies, and they're exploiting it to make the decision irrelevant. It's even perfectly fine to discuss race in college essays, for example: "Rice University asked students how their perspectives were shaped by their “background, experiences, upbringing, and/or racial identity.”" (https://apnews.com/article/college-application-affirmative-a...). They'll fine-tune them to achieve the exact racial makeup they want eventually. The bigger surprise is that a few schools haven't done it (yet?).
>> The Supreme Court decision was toothless because it allowed universities to create race proxies
It most certainly did not.
From the decision:
"despite the dissent’s assertion to the contrary, universities may not simply establish through application essays or other means the regime we hold unlawful today..."[W]hat cannot be done directly cannot be done indirectly. The Constitution deals with substance, not shadows," and the prohibition against racial discrimination is "levelled at the thing, not the name." ... A benefit to a student who overcame racial discrimination, for example, must be tied to that student’s courage and determination. Or a benefit to a student whose heritage or culture motivated him or her to assume a leadership role or attain a particular goal must be tied to that student’s unique ability to contribute to the university. In other words, the student must be treated based on his or her experiences as an individual - not on the basis of race".
As your own quote states, it's merely the opinion of part of the Supreme Court that universities won't be able to use proxies. This is not established law that has been tested, and the other part of the Court disagrees.
They simply have to be smart about it. In addition to the explicit statement from Chief Justice John Roberts: "Nothing prohibits universities from considering an applicant’s discussion of how race affected the applicant’s life," there are simple ways to use factors other than merit. Some are listed here https://apnews.com/article/supreme-court-affirmative-action-...: preferences for lower-income students, accepting fewer early admission applications (which are more likely to come from white students), giving preference to first-generation college students, placing less value on test scores and more on high school class rank, basing admissions on zip codes, devaluing high school quality, etc. Many more strategies will surface if colleges go through discovery. In the name of surface-level diversity, all of them result in students who will perform worse in college than if admissions were based solely on merit.
I think solving ARC-AGI will be necessary but not sufficient. My bet is that we won't see the converse: a model that is considered "AGI" yet does poorly on ARC-AGI. So in that sense, I think this is an important benchmark.
One of the key aspects of ARC is that its testing dataset is secret.
The usefulness of the ARC challenge is in figuring out how much of the "intelligence" of current models, trained on the entire internet, is an emergent property and true generalization, and how much is just due to the training set containing an unfathomable number of examples, so that the models may surprise us with what appears to be genuine insight but is actually just lookup + interpolation.
I mostly agree, but I think it's fair to say that ARC-AGI is a necessary but definitely not sufficient milestone when it comes to the evaluation of a purported AGI.
How so? If a team is fine-tuning specifically to beat ARC, that could be true, but when you look at Sonnet and o1 getting 20%, I think a standalone frontier model beating it would mean we are close to, or already at, AGI.
The creation and iteration of ARC has been designed in part to avoid this.
François Chollet talks in his "mid-career" work (2015-2019) about priors for general intelligence and about avoiding them. While he admits ARC allows for some priors, it was his best reasonable effort in 2019 to put together an extremely prior-less dataset, as he explained on podcasts around that time (e.g. Lex Fridman). The point is that humans, with our priors, are able to reliably get the majority of the puzzles correct, and with time we can even correct mistakes or recognise mistakes in submissions without feedback (I am expanding on his point a little here based on conference conversations, so don't take this as his position, or at least his position today).
100 different humans will even get very different items correct/incorrect.
The problem with an AI getting 21% correct is that, if it always gets the same 21% correct, then for the other 79% of prior-less problems it has no hope as an intelligent system.
Humans, on the other hand: a group of 10,000 could obviously get 99% or 100% correct, despite, in all likelihood, none of them having the priors for every puzzle, given that individual humans don't tend to get them all right (and, well, because Chollet created 100% of them!).
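To make the ensemble point concrete, here's a toy calculation under an independence assumption that is mine, not Chollet's:

```python
# Toy illustration (my assumption, not ARC data): if each puzzle is solved
# independently by 60% of humans, a pool of 10,000 people almost surely covers it,
# whereas a model that always misses the same 79% never improves with scale.
p_single_human = 0.60
pool_size = 10_000
p_unsolved_by_pool = (1 - p_single_human) ** pool_size
print(f"P(no one in the pool solves a given puzzle) = {p_unsolved_by_pool:.3e}")
```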
The goal of ARC, as I understood it in 2019, is not to create a single model that gets a majority correct. To show AGI, it has to be an intelligent system that can handle prior-laden or prior-less situations as well as a group of humans, on diverse and unseen test sets, ideally without any fine-tuning or training specifically on this task at all.
From 2019 (I read his paper when it came out, believe it or not!), he has held a secret set that he alone has, which I believe is still unpublished. At the time, the low number of items (hundreds) was designed to prevent effective fine-tuning (then 'training'), but nowadays few-shot prompting shows that on-the-spot learning is clearly possible, which is why, in talks Chollet gave, I remember him positing that any advance in short-term learning via examples should be ignored, i.e. each example should be zero-shot, which I believe is how most benchmarks are currently run. The puzzles are all "different in different ways" besides the common element of dynamic grids and providing multiple grids as input.
It's also key to know that Chollet was quite avant-garde in 2019: his work was of course respected, but he has become more prominent recently. He took a very bullish/optimistic position on AI advances at the time (no doubt based on keras and seeing transformers trained using it), but he has been proven right.