> culture is intertwined with the hard problem of consciousness
The majority of people are sleep-walking as machines driven by imitation, habit and external forces. We live in a dreamlike, mechanical state, lacking awareness of this fact. Apropos: Gurdjieff
Very uncharitable, and questionable on a few levels. Every human exists in the context of society; no human exists standalone. The very definition of self, as in self-awareness, has the existence of the other as a prerequisite. The people you see are perfectly aware of themselves; it's just that awareness of yourself does not mean you have to violate societal norms and show how individual you are all the time. At best, it requires a more acute awareness of norms (you have to know what to violate first, cf. all the various counter-cultures), making one more socially integrated and in some ways paradoxically less individual; at worst (if you are properly disconnected), it makes one less of a human, not more.
> People you see are perfectly aware of themselves
Are they rote-students, imitating or copying memes and as such driven by inadequate ideas, or are they students who understand the subject from its first assumptions and as such are driven by adequate ideas? In the quote above, the suggestion is that the majority are rote-students.
There’s a lot of ground between “imitate” and “understand the subject from its first assumptions”. Arguably, the former is how all learning happens at first. We imitate to get a taste for it and start enjoying it (humans are mirrors), then we can dig deeper if we become sufficiently interested. You can hardly become truly interested in music if you are presented with all the music theory up front and don’t get to have fun playing the instrument; same with math.
Even if someone never becomes sufficiently interested to dig deeper into some academic subject and sticks to imitating, I wouldn't say they are somehow worse and have no awareness. They may have other interests and joys in life; there are many fulfilling things outside academia. Why would you expect everybody to be like you?
> The basic force behind all culture formation is imitation
We are also limited by the linguistic structures we inhabit. And many languages have multiple variants. There is the respectful, obedient "formal" variant used at the workplace and the informal "colloquial" used in other places.
The "strong" sapir-whorf hypothesis, that cognitive and behavioral categories are limited by linguistic ones, is thoroughly discredited. At most they may influence our perceptions, but they do not constrain them.
Linguistics is one of the fields where HN consensus goes directly against the scholarly mainstream of the discipline, for what I mostly find to be ideological reasons. So hopefully this isn't that and you're just a bit out of date. But there's been a big reevaluation of this in the last twenty years, and virtually no contemporary working linguists hold the strong relativist view anymore. It simply did not consistently produce useful results and has been abandoned.
I would tend to disagree. The tech types have a strong intellectual center, but weaker emotional and movement centers. I think a realignment is possible with practice. It takes time, and as one grows older, the centers begin to integrate better.
> Sometimes, yeah. I don't think we're disagreeing
I would disagree. Formalism and precision have a critical role to play which is often underestimated, even more so with the advent of LLMs. The fuzziness of natural languages is both a strength and a weakness. We have adopted precise but unnatural languages (math/C/C++) for describing machine models of the physical world or of the computing world. Such precision was a real human breakthrough which is often overlooked in these debates.
The pattern-matching rote-student is acing the class. No surprises here.
There is no need to understand the subject from first principles to ace tests.
The majority of high-school and college kids know this.
That is, respectfulness and politeness come more from intentions and actions than from speech alone. Politeness of language without any respect for the actual function of that speech is pointless. Indeed, that is exactly what LLMs are trained for: form over function. And many humans get fooled by it and are just as clueless as the person dropping the steaming turd of a PR.
> Science should be about reproducibility, and almost nothing here is reproducible.
I can see your frustration. You are looking for reproducible "benchmarks". But you have to realize several things.
1) Research-level problems are those that bring the "unknown" into the "known" and, as such, are not reproducible. That is why "creativity" has no formula. There are no prescribed processes or rules for "reproducing" creative work; if there were, it would not be considered "research".
2) Things learnt and trained on are already in the realm of the "known", i.e., boilerplate, templated and reproducible.
The problems in 2) above are where LLMs excel, but they have been hyped as excelling at 1) as well. And this experiment is trying to test that hypothesis.
We will learn if the magical capabilities attributed to these tools are really true or not. Capabilities like being able to magically solve any math problem out there. This is important because AI hype is creating the narrative that these tools can solve PhD-level problems, and this will help disinfect that narrative. In my book, any test that refutes and dispels false narratives makes a huge contribution.
> We will learn if the magical capabilities attributed to these tools are really true or not.
They're not. We already know that. FrontierMath. Yu Tsumura's 554th problem. The RealMath benchmark. The list goes on. As I said many times in this thread, there is nothing novel in this benchmark.
The fact that this benchmark is so hyped shows that the community knows nothing, NOTHING, about prior work in this space, which makes me sad.
> but gives me the "AI-generated reading fatigue."
Agree. The article could have been summarized in a few paragraphs. Instead, we get unnecessary verbiage that goes on and on in an AI-generated frenzy. Like the "organic" label on food items, I can foresee labels on content denoting the kind of human who generated it: "suburbs-raised", "freelancer", etc.
It's not angst. It's intense frustration that 1) they are not doing the science correctly, and 2) others (e.g. FrontierMath) already did everything they claim to be doing, so we won't learn anything new here, yet somehow 1stproof gets all the credit.
Are they really trying to do science, or are they just trying to determine pragmatically whether or not current AI is useful for a research mathematician in their day to day job?
If it's the latter case (which it has to be), it seems that attention credit (via, e.g., articles in NY Times) is very unfairly distributed.
None of the people who advanced the state of benchmarking and did the hard work on much bigger benchmarks got any, but a ridiculous benchmark of 10 questions scored big.
What do you mean? These are top-notch mathematicians who are genuinely trying to see how these tools can help solve cutting-edge research problems. Not toy problems like those in AIME/AMC/IMO etc., or other similar benchmarks which are easily gamed.
> that others (e.g. FrontierMath) already did everything they claim to be doing
You are kidding, right? The FrontierMath benchmark [1] is produced by a startup whose incentives are dubious, to say the least.
Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools. I see this as a welcome and positive development which is a win-win for the ecosystem.
> What do you mean? These are top-notch mathematicians
Yes. I didn't dispute that. What I said is that they are NOT top-notch ML specialists, and that they have made one of the worst benchmarks of 2025-2026. Benchmarks like these would have worked maybe in early 2024 at the latest. The field has moved on significantly since.
And yes, many, many other benchmarks don't use toy problems -- their names are just a prompt away.
> You are kidding, right? The FrontierMath benchmark [1] is produced by a startup whose incentives are dubious, to say the least.
They did 1) open-source some of their datapoints (of a similar order of magnitude) and 2) carry out detailed evals. There is much to learn from their blog posts, much more than from the current dataset.
But fair enough. If you don't like them, have a look at IMProofBench. Have a look at the AIMO competition. Have a look at HardMath. It's quite a landscape of datasets already.
> Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools
As mentioned above, realistic benchmarks that are bigger and better already exist. Unfortunately, from a benchmarking POV, these mathematicians are the hypesters, with a preprint that wouldn't even make it into the AI&Math workshops at ICML or NeurIPS.