s-macke's comments

This seems to be a formatting error. For such a huge author list, you usually write only the first author's name and then "et al." for "and others".


The 'et al.' is used for in-article citations, if done in author-year format; references in the reference list are, to the extent that I've seen, always written out in full. I guess Google just wanted to make the life of any academic citing their work miserable. There are (unfortunately) conferences that have page limits that include the reference list; I wonder if an exception would be made here.


They want authors to think twice before citing someone. A curious incentive!


There is a FAQ in the link.


Ah, thanks. None of that information shows up in a mobile browser, apparently.


This emulator [0] does basically the same thing but is much more optimized for speed. It uses the OpenRISC architecture and even has networking. What do you want to use such an emulator for?

[0] https://github.com/s-macke/jor1k


Wow this is absolutely great!


Simple Bench goes in this direction: https://simple-bench.com/


Yet another benchmark; great, I love benchmarks(!). But will this page be kept up to date?


Yes, permanently. Sonnet 3.7 is already number one in the ranking. Grok3 has no API yet.


o3-mini was announced for today, and OpenAI typically publishes in the morning hours (PT). Many people were eagerly waiting; the release was imminent. I kept checking both Twitter and Hacker News for updates. Add ten more people like me, and the news becomes top news within a few minutes. That is legit.


> Notably, no self-reflection training data or prompt was included, suggesting that advanced System 2 reasoning can foster intrinsic self-reflection.

They suggest that self-reflection is an emergent phenomenon of reasoning. Impressive. Can't wait to see the code.


The term "agent" is quite broad. In my definition, an LLM becomes an agent when it utilizes the tool usage option.

ChatGPT is a good example: you ask for an image, and you receive one; you ask for a web search, and the chatbot provides an answer based on that search.

In both cases, the chatbot can rewrite your query for the tool in question and is even able to call the tools multiple times based on the previous result.
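Roughly sketched (hypothetical "llm" and "tools" objects here, not any particular vendor's API), that loop looks like this:

    # Minimal agent loop sketch: the LLM may request a tool call,
    # sees the tool result, and may call further tools before answering.
    def run_agent(llm, tools, user_query):
        messages = [{"role": "user", "content": user_query}]
        while True:
            reply = llm(messages, tools=tools)          # model may return a tool call
            if reply.tool_call is None:
                return reply.text                       # final answer for the user
            tool = tools[reply.tool_call.name]          # e.g. "web_search" or "generate_image"
            result = tool(**reply.tool_call.arguments)  # the model rewrote the query itself
            messages.append({"role": "tool", "content": result})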


For me it is 9:05 by Adam Cadre [0]. Short, linear, easy but with a great twist.

[0] https://en.wikipedia.org/wiki/9:05


Well, my perspective on this is as follows:

Recurrent and transformer models are Turing complete, or at least close to being Turing complete (apologies, I’m not sure of the precise terminology here).

As a result, they can at least simulate a brain and are capable of exhibiting human-like intelligence. The "program" is the training data, and we have seen significant improvements in smaller models simply by enhancing the dataset.

We still don’t know what the optimal "program" looks like or what level of scaling is truly necessary. But in theory, achieving the goal of AGI with LLMs is possible.


These results are very similar to the "Alice in Wonderland" problem [1, 2], which was already discussed a few months ago. However, the authors of that paper are much more critical and call it a "Complete Reasoning Breakdown".

You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.

To me, such results indicate that you can't trust any LLM benchmark results related to math and reasoning when you see that changing the characters, numbers, or sentence structure of a problem alters the outcome by more than 20 percentage points.

[1] https://arxiv.org/html/2406.02061v1

[2] https://news.ycombinator.com/item?id=40811329


Someone (https://x.com/colin_fraser/status/1834336440819614036) shared an example relating to their reasoning capabilities that I thought was interesting:

A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

All LLMs I have tried this on, including GPT o1-preview, get this wrong, assuming that the riddle plays on a gendered assumption about the doctor being a man and answering that the doctor is in fact a woman. However, in this case there is no paradox: it is made clear that the doctor is a man ("he exclaims"), meaning he must be the father of the person being brought in. The fact that the LLMs got this wrong suggests that they find a similar reasoning pattern and then apply it. Even after additional prodding, a model continued making the mistake, arguing at one point that it could be a same-sex relationship.

Amusingly, when someone on HN mentioned this example in the O1 thread, many of the HN commentators also misunderstood the problem - perhaps humans also mostly reason using previous examples rather than thinking from scratch.


> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

Although we would like AI to be better here, the bigger problem is that, unlike with humans, you can’t get the LLM to understand its mistake and then move forward with that newfound understanding. While the LLM tries to respond appropriately and indulge you when you point out the mistake, further dialog usually exhibits noncommittal behavior, and the mistaken interpretation tends to sneak back in. You generally don’t get the feeling of “now it gets it”; instead it tends to feel more like someone with no real understanding (but a very good memory of relevant material) trying to bullshit-technobabble around the issue.


That is an excellent point! I feel like people have two modes of reasoning: a lazy mode where we assume we already know the problem, and an active mode where something prompts us to actually pay attention and reason about the problem. Perhaps LLMs only have the lazy mode?


I prompted o1 with "analyze this problem word-by-word to ensure that you fully understand it. Make no assumptions." and it solved the "riddle" correctly.

https://chatgpt.com/share/6709473b-b22c-8012-a30d-42c8482cc6...


My classifier is not very accurate:

    is_trick(question)  # 50% accurate
To make the client happy, I improved it:

    is_trick(question, label)  # 100% accurate
But the client still isn't happy because if they already knew the label they wouldn't need the classifier!
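Spelled out as toy code (same made-up signatures as above):

    import random

    def is_trick(question, label=None):
        if label is not None:
            return label                # "100% accurate": it just echoes the answer
        return random.random() < 0.5    # 50% accurate: a coin flip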

...

If ChatGPT had "sense", your extra prompt would do nothing. The fact that adding the prompt changes the output should be a clue that nobody should ever trust an LLM anywhere correctness matters.

[edit]

I also tried the original question but followed up with "is it possible that the doctor is the boy's father?"

ChatGPT said:

Yes, it's possible for the doctor to be the boy's father if there's a scenario where the boy has two fathers, such as being raised by a same-sex couple or having a biological father and a stepfather. The riddle primarily highlights the assumption about gender roles, but there are certainly other family dynamics that could make the statement true.


It's not like GP gave task-specific advice in their example. They just said "think carefully about this".

If that's all it takes, then maybe the problem isn't a lack of capabilities but a tendency not to surface them.


The main point I was trying to make is that adding the prompt "think carefully" moves the model toward the "riddle" vector space, which means it will draw tokens from there instead of the original space.

And I doubt there are any such hidden capabilities because if there were it would be valuable to OpenAI to surface them (e.g. by adding "think carefully" to the default/system prompt). Since adding "think carefully" changes the output significantly, it's safe to assume this is not part of the default prompt. Perhaps because adding it is not helpful to average queries.


I have found multiple definitions in the literature for what you describe.

1. Fast thinking vs. slow thinking.

2. Intuitive thinking vs. symbolic thinking.

3. Interpolated thinking (in terms of pattern matching or curve fitting) vs. generalization.

4. Level 1 thinking vs. level 2 thinking. (In terms of OpenAI's definitions of levels of intelligence)

The definitions all describe the same thing.

Currently, all LLMs are trained to use the "lazy" thinking approach. o1-preview is advertised as being the exception: it is trained or fine-tuned on countless reasoning patterns.


> A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

> Amusingly, when someone on HN mentioned this example in the O1 thread, many of the HN commentators also misunderstood the problem

I admit I don't understand a single thing about this "problem". To me, it's just some statement.

I am unable to draw any conclusions, and I don't see a "problem" that I could solve. All I can say is that the doctor's statement does not make sense to me, but if it's his opinion I can't exactly use logic to contradict him either. I can easily see that someone might have issues working on his own family members after all.

Do I need some cultural knowledge for this?


I'm sure we fall back on easy/fast associations and memories to answer; it's the path of least resistance. The text you quote bears more than a superficial similarity to the old riddle (there's really nothing else that looks like it), but that version also stipulates that the father has died. That adds "gendered" (what an ugly word) information to the question, a fact which is missed when recalling this particular answer. Basically, LLMs are stochastic parrots.


How people don’t see the irony of commenting “stochastic parrots” every time LLM reasoning failure comes up is beyond me.

There are ways to trick LLMs. There are also ways to trick people. If asking a tricky question and getting a wrong answer is enough to disprove reasoning, humans aren’t capable of reasoning, either.


It's all in the architecture. They literally predict the next word by association with the input buffer. o1 tries to fix part of the problem by imposing external control over it, which should improve logical reasoning, but if it can't spot the missing information in its associations, it's doomed to repeat the same error. Yes, quite a few people are also pretty stupid, emotion-driven association machines. It's commonly recognized, except perhaps by their parents.


> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

We do, but we can generalize better. If you exchange "hospital" with "medical centre" or change the sentence structure and ask humans, the statistics would not be that different.

But for LLMs, that might make a lot of difference.


Both Claude-3.5 and o1-preview nail this problem

"Let's think through this step-by-step:

1. Alice has 3 brothers

2. Alice has 2 sisters

3. We need to find out how many sisters Alice's brother has

The key here is to realize that Alice's brothers would have the same sisters as Alice, except they would also count Alice as their sister.

So, Alice's brothers would have:

- The 2 sisters Alice has

- Plus Alice herself as a sister

Therefore, Alice's brothers have 3 sisters in total."


And here lies the exact issue. Single tests don’t provide any meaningful insights. You need to perform this test at least twenty times in separate chat windows or via the API to obtain meaningful statistics.

For the "Alice in Wonderland" paper, neither Claude-3.5 nor o1-preview was available at that time.

But I tested them as well a few weeks ago, with the problem translated into German, and achieved a 100% success rate with both models.

However, when I add irrelevant information (My mother ...), Claude's success rate drops to 85%:

"My mother has a sister called Alice. Alice has 2 sisters and 1 brother. How many sisters does Alice's brother have?"


Your experience makes me think that the reason the models got a better success rate is not that they are better at reasoning, but rather that the problem made it into their training dataset.


Absolutely! It's the elephant in the room with these ducking "we've solved 80% of maths olympiad problems" claims!


We don't know. The paper and the problem were very prominent at the time. Some developers at Anthropic or OpenAI might have included it in some way, either as a test or as a task to improve the CoT via reinforcement learning.


It almost assuredly made it into their data set via RLHF. Wild that these papers are getting published when RLHF'ers and up see this stuff in the wild daily, ahead of the papers.

Timeline is roughly:

Model developer notices a sometimes highly specific weak area -> ... -> RLHF'ers are asked to develop a bunch of very specific problems improving the weak area -> a few months go by -> A paper gets published that squeezes water out of stone to make AI headlines.

These researchers should just become RLHF'ers, because their efforts aren't uncovering anything unknown and it's just being dressed up with a little statistics. And by the time the research is out, the fixes are already identified internally, worked on, and nearing pushes.

I just realized AI research will be part of the AI bubble if it bursts. I don't think there was a .com research sub-bubble, so this might be novel.


We do have Chatbot Arena, which to a degree already does this.

I like to use:

"Kim's mother is Linda. Linda's son is Rachel. John is Kim's daughter. Who is Kim's son?"

Interestingly, I just got a model called "engine test" that nailed this one in a three-sentence response, whereas o1-preview got it wrong (but has gotten it right in the past).


You also need a problem that hasn't been copy-pasted a million times on the internet.


My problem with this puzzle is: how do you know that Alice and her brothers share both parents?

Is it not correct English to call two people who share only one parent sisters or brothers?

I guess I could be misled by my native Norwegian, where you have to prefix the word with "hel" (full) or "halv" (half) if you want to specify the number of shared parents.


It is pretty much the same in English. Unqualified would usually mean sharing both parents but could include half- or step-siblings.


I am not a native English speaker. Can you reformulate the problem for me, so that every alternative interpretation is excluded?


Alice has N full sisters. She also has M full brothers. How many full sisters does Alice’s brother have?


Tried it with N=2 and M=1 (brother singular) with the gpt-4o model and CoT.

1. 50% success without "full" terminology.

2. 5% success with "full" terminology.

So, the improvement in clarity has exactly the opposite effect.


They would usually be called “half-sisters”. You could call them “sisters” colloquially, though, but given that it's presented as a logic question I think it's fine to disregard that.


Here is the larger discussion of the "Alice in Wonderland" paper on Hacker News:

https://news.ycombinator.com/item?id=40585039

