Two of the three strip titles are hallucinated and two of the three strips are bad examples. Haley is mute in strip 403 and does nothing. Strip 578 is the start of the arc that shows the behavior Gemini is talking about, but has things going wrong so it's not a good example either.
I think LLMs are better at finding the most helpful sources now, but that's more a testament to how much the front page of web search has lost to low value LLM content.
The fact that you can express something like "Only show me websites run by Italian companies incorporated by Greek owners born in Turkey" and have it filter through a bunch of stuff just makes searching so much easier. Fuzzy search is also on another level with language models.
This is exactly it… if Google were smart, they would focus on providing that experience in search rather than having AI summarize results. I want that fluid search experience, with refinement that remembers my previous ask…
I've used it for live service video games. It's pretty good at summarizing major changes to a game since you last played it. With regular web search you'd have to go through every major patch, and a lot of games don't even have good patch notes; it's all stuck in content creator videos.
Though I still prefer Claude for this since it's better at citing sources.
Unironically you can put "what are good demos for agentic workflows at Google I/O that would be received well by the general public" into Gemini's AI Mode and get better suggestions for use cases than what they're showing.
Welp. I bit the bullet. Copied your question, opened the Gemini app on my iPhone, and pasted it to the new 3.5 flash model. IMO, the answer was unnecessarily long, and the suggestions were on par with what was mentioned and presented during the part of the keynote that I watched. Also, the summary table didn’t render correctly in the app but does render on web.
My conclusion: maybe they did use Gemini to brainstorm.
Note the scoring function is significantly different for ARC-AGI-3. It isn't the percentage of tests passed like previous versions, it's the square of the efficiency ratio -- how many steps the model needed vs the second best human.
So if a model can solve every question but takes 10x as many steps as the second best human it will get a score of 1%.
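To make the arithmetic concrete, here is a minimal sketch of the scoring rule as described above: the score is the square of the ratio between the human baseline's step count and the model's step count. The function name and the step counts are made up for illustration, and anything beyond that (capping, averaging across tasks) is not covered here because the comment doesn't describe it.

```python
def efficiency_score(model_steps: int, human_steps: int) -> float:
    """Square of the efficiency ratio, per the description above.

    human_steps is the step count of the reference human (the comment
    says "second best human"); model_steps is what the model needed.
    This is a hypothetical helper, not the official scoring code.
    """
    if model_steps <= 0 or human_steps <= 0:
        raise ValueError("step counts must be positive")
    ratio = human_steps / model_steps
    return ratio ** 2


# A model that solves the task but needs 10x the steps of the reference human:
score = efficiency_score(model_steps=1000, human_steps=100)
print(f"{score:.0%}")  # 1%
```

So a 2x step overhead already drops the score to 25%, which is why these numbers look so much lower than pass-rate scores on earlier versions.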
It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.
However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).
It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.
And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted specifically to validating that those two things remain in sync, they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are a little lower than expected, no one's getting a late night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.
Focusing on the AI content generation aspect of this is disappointing to me, since those tools are fuzzy at best and even if the post was AI generated, it isn't necessarily a red flag since the user could be trying to disguise their writing style.
There are plenty of other signs this story is likely fake. The author claims to be posting from a library on New Year's Day (most government buildings are closed) and was responding for over 10 hours on the account. He's using a throwaway and a "burner laptop" at the library, but he also says he put his two weeks' notice in yesterday (it's also odd that this was on New Year's Eve), which would make identifying him trivial.
Fake stories get to the front page of Reddit every day. I wish journalists would point out the actual signs that something shouldn't be trusted, to set a better example.
I think this is one of those occasions when the story was so good that "journalists" skipped any validation and went right to publication. Someone played the media, and they bought it hook, line, and sinker.
It was a good indication of how things like conspiracy theories spread.
He gave his notice on NYE? Oh, he's probably just fudging that so he's not caught. He doesn't care about getting caught but he's on a burner laptop? Oh, people do weird things when they're stressed. And on and on.
Then, of course, the ultimate fallback: Well it sounds like something they would do. So even if most of this is made up, the bad things are still probably true.
FYI: Card 8's transcription is different than the image. In the image 5, 8, 12 is a Set but the transcription says Card 8 only has 2 symbols which removes that Set.
Oh no, thanks for pointing this out! I asked GPT-4o to convert the image to text for me and I only checked some of the cards, assuming the rest would be correct. That was a mistake.
I've now corrected the experiment to accurately take the image into account. This meant that Deepseek was no longer able to find all the sets, but o3-mini still did a good job.
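For anyone who wants to sanity-check a transcription like this, here's a rough sketch of the standard Set rule: three cards form a Set when each attribute is either all the same or all different across the three. The card values below are hypothetical placeholders, not the cards from the actual image; they only show how one wrong symbol count makes a Set disappear.

```python
from itertools import combinations

# A card is (count, shape, color, shading). These attribute names are an
# assumption about how the image was transcribed.
Card = tuple


def is_set(a: Card, b: Card, c: Card) -> bool:
    """True if every attribute is all-same or all-different across the cards."""
    # Exactly two distinct values for an attribute is the only failing case.
    return all(len({x, y, z}) != 2 for x, y, z in zip(a, b, c))


def find_sets(cards: dict) -> list:
    """Return every trio of card numbers that forms a Set."""
    return [
        trio
        for trio in combinations(sorted(cards), 3)
        if is_set(*(cards[i] for i in trio))
    ]


# Hypothetical cards numbered 5, 8, 12 (values invented for illustration).
from_image = {
    5: (1, "oval", "red", "solid"),
    8: (3, "oval", "green", "solid"),
    12: (2, "oval", "purple", "solid"),
}
# Same cards, but card 8 mis-transcribed as having 2 symbols.
mistranscribed = {**from_image, 8: (2, "oval", "green", "solid")}

print(find_sets(from_image))      # [(5, 8, 12)]
print(find_sets(mistranscribed))  # []
```

Running the same check over both the image-derived cards and the transcribed cards is a cheap way to catch this kind of mismatch before handing the puzzle to a model.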
Tesla's example is the most alarming to me, since people who seemingly have no business purpose for accessing highly sensitive data have access to it. Culturally, people feel safe sharing this data on internal chats, which means they don't think coworkers will report the data access violations. And since the content is spreading "like wildfire", a significant number of people at the company are abusing data access, as opposed to one individual abusing their elevated access.
Two of the three strip titles are hallucinated and two of the three strips are bad examples. Haley is mute in strip 403 and does nothing. Strip 578 is the start of the arc that shows the behavior Gemini is talking about, but has things going wrong so it's not a good example either.
Claude picks a good strip but also hallucinates the strip title: https://claude.ai/share/56be379d-c3da-443e-b60f-2d33c374eba8