OpenAI O3-Mini (openai.com)
962 points by johnneville 18 days ago | 904 comments



I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...

For 18,936 input tokens and 2,905 output tokens it cost 3.3612 cents.

Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes...
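
In rough outline the flow is: fetch the thread from the HN Algolia API, flatten the comment tree to plain text, and ask o3-mini for a themed summary. Here's a minimal Python sketch of the same idea (not the actual script linked above; the item ID, prompt wording and flattening are simplified placeholders, and the "developer" role is the o-series equivalent of a system prompt, as discussed further down the thread):

    # Minimal sketch of the summarize-a-thread flow (not the actual script above).
    # The item ID and prompt wording are placeholders.
    import requests
    from openai import OpenAI

    STORY_ID = "12345678"  # placeholder: replace with the HN item ID of the thread

    def flatten(item, depth=0, out=None):
        # Depth-first walk of the Algolia comment tree into indented plain text.
        if out is None:
            out = []
        if item.get("text"):
            out.append("  " * depth + f"{item.get('author', '?')}: {item['text']}")
        for child in item.get("children", []):
            flatten(child, depth + 1, out)
        return out

    thread = requests.get(f"https://hn.algolia.com/api/v1/items/{STORY_ID}").json()
    comments = "\n".join(flatten(thread))

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "developer", "content": "Summarize the themes of this Hacker News "
             "discussion. Include direct quotes with author attribution."},
            {"role": "user", "content": comments},
        ],
    )
    print(response.choices[0].message.content)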


Currently on the internet people skip the article and go straight to the comments. Soon people will skip the comments and go straight to an AI summary, reading neither the original article nor the comments.


But then there will be no comments to summarize.


Our digital twins will write the comments. They will be us, but with none of our flaws. They will never experience the shame of posting a dumb joke, getting flamed, and then deleting it, for they will have tested all ideas to prevent such an oversight. They will never experience the satisfaction-turned-to-puzzlement of posting an expertly crafted, well-researched comment that took 2 hours of the workday to draft - only to receive one upvote, for their research will be instantaneous and their outputs efficient. Of course they will never need to humbly reply, 'Ah, I missed that, good catch!' to a child comment indicating the entire premise of their question would be answered with a simple reading of the linked article - for they will have deeply and instantly read the article. Yes, our digital twins will be us, but better - and we will finally be free to play in the mud.


We all have been, for a long time. The mud just doesn't shine so bright.


Well by definition the vast majority of comments written, anywhere, cannot be memorable.

Assuming the median reader reads a few tens of thousands of comments in a year, only a few hundred would likely stick without being muddled. At best.


Cripes [1].

[1] https://qntm.org/perso


> will finally be free to play in the mud

As long as we can still have a few sticks, and some string, or a cardboard box...


I wonder about Amazon reviews right now.

Part of the utility of writing a review is that it is read, but the primary search for keywords in reviews now requires the user to wait for AI generated responses first.

Then the user must tap through another link and expand an individual matching review. It’s absolutely buried.


You will be given 3 A.I. generated comments to choose as your comment


if reality turns into a Ubisoft game I think I might log out


would you like to play again?


I think people generally like writing comments. Reading articles, in their entirety, less so.


But what if the AI just hallucinates the comments? People will never know.


They could actually be better than reality. Who knows?


I went to this thought too but then I remembered the 90-9-1 rule. The AI summary is for some portion of the 90. The 9 are still going to comment. What they comment on and how they generate the comments might change though.


Glad it’s not just me who does this. :)


I wanted to listen to a podcast where a professor explains a new drug for ADHD. Instead of spending 2h of my time I spent 5 min reading a summary generated by GPT from a YouTube transcript.


I’ve done this with audio extracted from YouTube videos -> STT -> LLM summary.

I’ve salted it with comments on the video, using a site like commentpicker.com or running JS and loading more and expanding threads manually.

Here’s an example I did for a pal:

You are an expert on building retaining walls. Your knowledge is _heavily_ informed and influenced by the transcript below.

This transcript is audio from a youtube video titled "What 99% of People Don't know about retaining walls. #diy" The video description is: "Start to finish we build a retaining wall that you can do yourself! How to Dig for a wall, How to Base a retaining wall, how to backfill, and MORE!. #retainingwall #diy"

Additional information may be included in comments, which are in the attached CSV. Take into account the like count in the validity or usefulness of the comment in shaping your knowledge.

In giving your replies, try to be specific, terse and opinionated. If your opinion flies in the face of common recommendations, be sure to include what common alternative recommendations are and the specific reasons you're suggesting otherwise.

----

# Transcript

""" [paste transcript] """

# Comments See attached .csv
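
The pipeline itself is roughly this (a sketch: yt-dlp for the audio grab and OpenAI's hosted Whisper for the STT step; the URL and filenames are placeholders, and long audio may need splitting to fit the transcription size limit):

    # Rough sketch of the video -> transcript -> summary pipeline (placeholders throughout).
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    # 1. Extract the audio track (assumes yt-dlp is installed).
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "mp3", "-o", "talk.%(ext)s",
         "https://www.youtube.com/watch?v=VIDEO_ID"],
        check=True,
    )

    # 2. Speech-to-text. (Files over the API size limit need to be split first.)
    with open("talk.mp3", "rb") as audio:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

    # 3. Summarize, optionally salting the prompt with exported comments as above.
    summary = client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "developer", "content": "Summarize this talk; be specific, terse and opinionated."},
            {"role": "user", "content": transcript.text},
        ],
    )
    print(summary.choices[0].message.content)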


Which drug? Who's the professor? Can you share the summary please?


[dead]


Bot getting out of control


You are a maddmann! No bot there, ask me a question a bot would not figure out bro ;)


How many r’s in strawberry sherbet?


I never heard of the word sherbet until today. I went online, saw it was a dessert, and I guess that proves something? Ohh sh*t... bots can go online too... am I a bot? Could I be AGI?


That would be an actual improvement. Reading the comments section usually just leads to personal energy waste.


On both HN & Reddit, I usually find the comments more informative and less frustrating than the article. But I guess YMMV.


10 years ago Reddit used to be a place where you would get informed opinions and less spam. 5 years ago, HN used to be a place where you would get informed opinions and less spam. Neither of them will go back to the same level of quality. Not anymore.


This has been said by every long term user of these sites. And not at the same time. It was always better in the past. It's probably partly true... yeah, quality can decrease as things get better. But it's also partly an illusion of aging in a changing world. Ten years is long enough to completely change the way we write and express ourselves.


It's actually sort of in the Hacker News Guidelines (https://news.ycombinator.com/newsguidelines.html):

"Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills."


Luckily I am not an LLM and can read and select things rather fast without processing the crap. Both HN and some subreddits are still fine for quality if you know the users and their reputations; just skip the rest. Worse than 10 years ago, still a lot better than mostly anywhere else.


You need to expend resources (e.g. "proof of work") to post, to drive away low-effort spam. https://stacker.news/ is an interesting experiment in that regard.
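
For illustration, a toy hashcash-style sketch of "expend resources to post" (the difficulty is arbitrary, and stacker.news itself uses Lightning sats rather than hashing, as far as I know):

    # Minimal hashcash-style sketch: a poster must find a nonce whose hash has
    # N leading zero bits before the comment is accepted. Cheap to verify,
    # costly (in CPU time) to produce at scale.
    import hashlib
    from itertools import count

    def mint(comment: str, bits: int = 20) -> int:
        target = 1 << (256 - bits)
        for nonce in count():
            digest = hashlib.sha256(f"{comment}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    def verify(comment: str, nonce: int, bits: int = 20) -> bool:
        digest = hashlib.sha256(f"{comment}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - bits))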


And yet the comments still provide more value than the article, almost always.


You need to read the article first to know that. But most people won't.


If the comments say it's worth it I read it, but often (especially on HN) the top comment starts with a summary of errors/inconsistencies in the article and then I don't really feel the need.


You do you, but hivemind thinking is a real thing. I have seen highly upvoted comments seemingly "debunk" an article where on closer examination it becomes clear they actually didn't read the article either.

It quickly becomes this weird bubble of people just acting on what everyone "thinks" the content is about without ever having looked at the content.

I get that it is easier, but intellectually you are doing yourself no favors by having this be your default.


That is the biggest problem on this website: people want to feel smart by 'debunking' things they don't really understand. It leads to a lot of contrarian views with a poor signal-to-noise ratio, especially when the topic is slightly outside the average user's experience (midwit programming).


Not to mention walls of text as people argue back and forth.


But often the top comment debunks what you think is true, and then you learn something.

> It quickly becomes this weird bubble of people just acting on what everyone "thinks" the content is about without ever having looked at the content.

That isn't an issue though since the important part is what you learn or not, not whether you think an imaginary article is true or not. If you learn something from someone debunking an imaginary article, that is just as good as learning something from debunking a real article.

The only issue here is attribution, but why should a reader care about that?

Edit: And it isn't an issue that people will think it actually debunks the linked article, since there will always be a sub comment stating that the commenter didn't read the article and therefore missed the mark.


> If you learn something from someone debunking an imaginary article, that is just as good as learning something from debunking a real article.

An argument for synthetic corpi (plural of corpus..esses?) - AI ingesting AI.


Commenters don't live in isolation, only getting information from other commenters; they have lives outside of HN and use that to inform their comments.

So no, it isn't the same as AI ingesting AI content at all.


The plural is "corpora" :-)


usually the top comment these days is a diversion to push an agenda.


You can read the article after the comments.


They definitely used to be, but haven't been much good for years. At least 8 years in the case of reddit, maybe 3 in the case of hackernews.

Though at this point it's a habit I cannot quite bring myself to break...


It's worse, but there's nowhere better than here imo.

Except very niche topics maybe


I agree, they are! But reading through them, or even worse, engaging with them, is a serious energy drain.

Especially if somebody is being wrong.


Occasionally the article is wrong.


Even more occasionally commenters are too.


But it is extremely rare for a popular wrong comment not to have sub-comments debunking it, so the comments are much more reliable than the articles.


I’ve found that the more I know about the topic at hand, the more wildly off base many of the comments seem, even the highly upvoted, undebunked ones. It’s harder for me to judge topics I don’t know much about, but I have to assume it’s something similar.


That is true for articles as well, though; I find comments typically have better info than the articles. It is more likely that some of the comments were written by real experts than that the article was.


Yeah, but is your mission to correct anything wrong on the internet?

https://xkcd.com/386/

Sounds exhausting.


Yes, that was the joke.


Hard to tell nowadays...


Today is my first day on HN and I came here looking for the comments. I miss intelligent conversations and reasonable interactions and reddit is just a dumpster fire now (and arguably before). Even if you can find a reasonably mature sub you're still exposed to the muck raking of the front page overflowing into everywhere else.


Lmfao, you forgot the fact that, especially on Reddit and HN, people tend to say a lot of bullshit, especially if it goes hand in hand with their biases.


So do the writers of the articles; however, with them I cannot interact.


In a way I agree, but the sustainability is shaky.

The intrinsic motivation for providing the comments comes from a mix of peer interaction, camaraderie, and reputation building.

If it becomes evident that your outputs are only directly consumed by a sentiment-aggregation layer that scrubs you from the discourse, then it could be harder to put a lot of effort into the thread.

This doesn't even account for the loss of info that results from fewer people actually reading and voting through the thread.


Yes but this assumes human input is the golden goose. Maybe it is at the beginning just to bootstrap the process, and then runaway AI starts to recurse homeruns with its own original comments.


HN comments are good, but the more nested a thread, the more inane they get. You see things like people arguing over punctuation, or the definition of punctuation.


Yes, but who will comment then and on what?


I agree with the first part, but I disagree with the conjecture. There are also people who enjoy writing comments, and in doing so they have to read at least a bit of context (at minimum, the comment they are replying to). Those will always exist.


HN comments aren't valuable because of their average quality or content, but because of the feedback vote sorting.

(Aside from the tendency towards first=top. Would be nice to have time-weighted upvote decay instead of absolute count)
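
One possible form, as a rough sketch (similar in spirit to the classic front-page gravity formula; the exponent is the tuning knob):

    # Sketch of a time-weighted score: the same upvote counts for less as the
    # comment ages, so early-posted comments lose their permanent head start.
    def decayed_score(points: int, age_hours: float, gravity: float = 1.8) -> float:
        return (points - 1) / (age_hours + 2) ** gravity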


Have you ever had to give a demo that no meeting attendee actually cared about, just because management demanded it? Standup meetings with 20 people that maybe 2 people cared about? The future might involve AI updates, summarized by AI into weekly reports, summarized again into monthly reports, then into quarterly departmental reports that nobody actually reads.


Or, the future might be everyone reads summaries, because there are only AI managers and no human managers, where humans hold occasional meetings and have conversations which are listened to by AI, and we take our lead from the AI summaries. The AI maintains business focus by monitoring business performance, updating each summary as needed to maintain performance.

It's a worthwhile experiment for a business school, IMO, automating a layer of bureaucracy.



Doesn't that seem ultimately futile though?

Say I use AI to write a report that nobody cares about, and then the receiver gives it to AI because they can't be bothered to read it. Who is benefiting here other than OpenAI?


Amazon is already forcing this pattern on mobile users not logged in. If you want to see the reviews, all you get is an AI summary, the star rating, maybe 1 or 2 reviews, then you have to log in to see more.


The cure is to stop using Amazon.


It has already caused me to not further investigate purchases. Similar to yelp, if I have to log in to see things, you're dead to me.


There is also no way to disable the AI summaries even if you are logged in.


Soon people will read other people's summaries that they copied from AI summaries on the web.


I won't. I'm looking for real life human voices.


Sounds about right, as we are post-dead internet in public places.

There was a thread about the US tariffs on Canada I was reading on a stock investment subreddit. The whole page was full of people complaining about Elon Musk, Donald Trump, "Buy Canadian" comments, moralizing about Alberta's conservative government and other unrelated noise. None of this was related to the topic; stocks and funds that seemed well-placed for a post-tariff environment.

There were small, minor points of interest but instead of spending honest vacation time looking at each comment at zoomer internet church, I had an LLM filter out the stuff I didn't care about. Unsurprisingly there was not much left.


> zoomer internet church

Stealing this.


Please do. Don't forget to appreciate the mantra-like repetition of Brian Taylor Cohen and Qasim Rashid Esq memes.


As articles are increasingly published with AI, this becomes yet another example of, to coin a phrase, AI loopidity.


This is so good! Finally the word I need to describe what I'm living. And yes, I'm copy-pasting articles into LLMs for summaries.


If writers are using AI to write articles, and readers are using AI to summarize the same article, what's the underlying inefficiency here? That writers should just be posting brief summaries of the articles in the first place? Or maybe they just need to be prompting AI to create a summary rather than a full article?


Differences in how people want to consume the article, and what information they’re looking for. Some want snippets, some want summaries, some want long form.

We have information compression machines now. Might as well raw dump the information and let the machine package it up in the format we prefer for consumption, instead of pre-packaging it. (Yeah, this is effectively what authors are doing…currently they can still do novel things that the compression machines can’t, but how long will that last?)


What if this is better? If the summary includes all of the key details and people save time AND read both the article and comments wouldn't that be better?


Unless there’s another AI breakthrough, the real applications of AI are and will be things that it’s ok to get wrong 10% of the time, usually because prior approaches were wrong more than 10% of the time. This is fine for the Playskool world of entertainment communication, in which there’s no end goal except emotional stimulation, but the opposite of what’s needed in any real attempt to engage with the world, in which we already struggle to capture the complexity of reality.


If that were the case, then yes. It's not, though, because the AI misses parts (doesn't "include all of the key details") and also fabricates and misrepresents other details.


It’s sometimes not now, but likely will be the case quite soon.

I invite everyone to have an open mind about this, as it seems that the part of my comment that said “what if” wasn’t fully absorbed.


I haven’t tried o3, but one issue I struggle with in large-context analysis tasks is that the LLMs are never thorough. In a task like this thread summarization, I typically need to break the document down and loop through chunks to ensure it actually “reads” everything. I might have had to recurse into individual conversations with some small max depth and leaf count, run inference on each, and then do some aggregation at the end; otherwise it would miss a lot (or appear to, based on the output).

Is this a case of PEBKAC?
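
For reference, the chunk-and-aggregate loop I mean is roughly this (a sketch; chunk size, model and prompts are arbitrary, and a real version would split on comment or thread boundaries rather than fixed character counts):

    # Sketch of the chunk -> summarize -> aggregate loop described above.
    from openai import OpenAI

    client = OpenAI()

    def ask(text: str, instruction: str) -> str:
        resp = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "developer", "content": instruction},
                      {"role": "user", "content": text}],
        )
        return resp.choices[0].message.content

    def summarize_long(document: str, chunk_chars: int = 20_000) -> str:
        chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
        partials = [ask(c, "Summarize every distinct point in this excerpt. Be exhaustive.")
                    for c in chunks]
        return ask("\n\n".join(partials),
                   "Merge these partial summaries into one summary, keeping every theme.")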


> I haven’t tried o3, but one issue I struggle with in large-context analysis tasks is that the LLMs are never thorough.

o3 does look very promising with regards to large context analysis. I used the same raw data and ran the same prompt as Simon for GPT-4o, GPT-4o mini and DeepSeek R1 and compared their output. You can find the analysis below:

https://beta.gitsense.com/?chat=46493969-17b2-4806-a99c-5d93...

The o3-mini model was quite thorough. With reasoning models, it looks like dealing with long context might have gotten a lot better.

Edit:

I was curious if I could get R1 to be more thorough and got the following interesting tidbits.

- Depth Variance: R1 analysis provides more technical infrastructure insights, while o3-mini focuses on developer experience

- Geopolitical Focus: Only R1 analysis addresses China-West tensions explicitly

- Philosophical Scope: R1 contains broader industry meta-commentary absent in o3-mini

- Contrarian Views: o3-mini dedicates specific section to minority opinions

- Temporal Aspects: R1 emphasizes future-looking questions, o3-mini focuses on current implementation

You can find the full analysis at

https://beta.gitsense.com/?chat=95741f4f-b11f-4f0b-8239-83c7...


> Is this a case of PEBKAC?

It's 2025 and every useful conversation with an LLM ends with context exhaustion. There are those who argue this is a feature and not a bug. Or that the context lengths we have are enough. I think they lack imagination. True general intelligence lies on the other side of infinite context length. Memory makes computation universal, remember? http://thinks.lol/2025/01/memory-makes-computation-universal...


Depending on what you’re trying to do, it’s worth trying the 1M-context Qwen models. They only released 7B and 14B versions, so their "intelligence" is limited, but they should be more than capable of a coherent summary.


Are there any tools you use to do this chunking? Or is this a custom workflow? I've noticed the same thing both on copy/paste text and uploaded documents when using the LLM chat tools.


Try Google’s NotebookLM


I put one of my own blog posts through NotebookLM soon after it became available; it hallucinated content I didn't write and missed things I had written.

Nice TTS, but otherwise I found it unimpressive.


I’m not talking about the TTS and podcast creation. I’m talking about just asking questions where it gives you the answer with citations.


Given that what it got wrong was in the LLM part, that is a distinction without a difference.


The podcast creation does a lot of small talk, just like people do. Have you tried using NotebookLM with 20-25 documents and asking it questions, where it actually provides inline footnotes with citations to your documents?


o1-pro is incredibly good at this. You'll be amazed


Borrowing most of Simon’s prompt, I tried the following with o3-mini-high in the chat interface with Search turned on:

“Summarize the themes of the opinions expressed in discussions on Hacker News on January 31 and February 1, 2025, about OpenAI’s release od [sic] ChatGPT o3-mini. For each theme, output a header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece”

The result is here:

https://chatgpt.com/share/679d790d-df6c-8011-ad78-3695c2e254...

Most of the cited quotations seem to be accurate, but at least one (by uncomplexity_) does not appear in the named commenter’s comment history.

I haven’t attempted to judge how accurate the summary is. Since the discussions here are continuing at this moment, this summary will be gradually falling out of date in any case.


I noticed that it thought that GoatInGrey wrote “openai is no longer relevant.” However, they were just quoting a different user (buyucu) who was the person who first wrote that.


Good catch. That's likely an artifact of the way I flatten the nested JSON from the comments API.

I originally did that to save on tokens but modern models have much larger input windows so I may not need to do that any more.


This is why I keep advocating that JSON should die, or at least no longer be used with LLMs. LLMs (and human brains) are simply not wired up for counting nested curly brackets across long spans of quoted text joined by colons and commas, and it is far too easy for humans to make mistakes when chunking JSON.

IMO, (Strict)YAML is a very good alternative; it has even been suggested to me by multiple LLMs when I asked them what they thought the best format for presenting conversations to an LLM would be. It is very easy to chunk simple YAML and present it to an LLM directly off the wire: you only need to repeat the indentation and names of all higher-level keys (properties) pertaining to the current chunk at the top of the chunk, then start a text block containing the remaining text in the chunk, and the LLM will happily take it from there:

    topic:
      subtopic:
        text: |
          Subtopic text for this chunk.
If you want to make sure that the LLM understands that it is dealing with chunks of a larger body of text, you can start and end the text blocks of the chunks with an ellipsis ('...').
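
A minimal sketch of that chunking scheme (the key names, chunk size and ellipsis convention are just the ones described above):

    # Sketch of the chunking described above: each chunk repeats the full key
    # path as YAML headers, then carries a slice of the text as a block scalar.
    def yaml_chunks(key_path: list[str], text: str, chunk_chars: int = 4000):
        indent = "  "
        header = "".join(f"{indent * i}{key}:\n" for i, key in enumerate(key_path))
        body_indent = indent * len(key_path)
        for start in range(0, len(text), chunk_chars):
            piece = text[start:start + chunk_chars]
            if start > 0:
                piece = "..." + piece            # mark continuation from a previous chunk
            if start + chunk_chars < len(text):
                piece = piece + "..."            # mark that the text continues
            block = "\n".join(body_indent + "  " + line for line in piece.splitlines())
            yield f"{header}{body_indent}text: |\n{block}\n"

    # e.g. for chunk in yaml_chunks(["topic", "subtopic"], long_text): feed chunk to the LLM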


LLMs (transformers) literally cannot balance parentheses. That's outside of their complexity class (TC0). You'd want a real UTM to count parentheses!


Would you elaborate on why counting braces is different from counting spaces to determine hierarchy? Or is it more about the repetition of higher levels keys in chunks (which could be done in JSON)?


Repetition of topics and subtopics is by far the most important part, reinforcing attention on the topic at hand even if the text in the chunk appears unrelated to the topic when viewed in isolation.

Keeping the indentation is also important because it is an implicit and repeated indication of the nesting level of the content that follows. LLMs have trouble with balancing nested parentheses (as the sibling comment to yours explains).

Dealing with text where indentation matters is easier for LLMs, and because they have been exposed to large amounts of it (such as Python code and lists of bullet points) during training, they have learned to handle this quite well.


Why use a reasoning model for a summarisation task? Serious question, would it benefit?


I don't have much experience with reasoning models yet. That's why.
I don't have much experience with reasoning models yet. That's why.


In theory you could get a better summary if it manages to reason out the underlying motivations behind a summary, for example, or proposes a summary and then "considers" it, realizing it missed something.

You can elicit that with any model by prompting for underlying reasons or using chain-of-thought, but a reasoning model could do it without prompting.


I have been trying to approach the problem in a similar way, and in my observation, it is also important to capture the discussion hierarchy in the context that we share with the LLM.

The solution that I have adopted is as follows. Each comment is represented in the following notation:

   [discussion_hierarchy] Author Name: <comment>
To this end, I format the output from Algolia as follows:

   [1] author1: First reply to the post
   [1.1] author2: First reply to [1]
   [1.1.1] author3: Second-level reply to [1.1]
   [1.2] author4: Second reply to [1]
After this, I provide a system prompt as follows:

  You are an AI assistant specialized in summarizing Hacker News discussions. 
  Your task is to provide concise, meaningful summaries that capture the essence of the thread without losing important details. 
  Follow these guidelines:
  1. Identify and highlight the main topics and key arguments.
  2. Capture diverse viewpoints and notable opinions.
  3. Analyze the hierarchical structure of the conversation, paying close attention to the path numbers (e.g., [1], [1.1], [1.1.1]) to track reply relationships.
  4. Note where significant conversation shifts occur.
  5. Include brief, relevant quotes to support main points.
  6. Maintain a neutral, objective tone.
  7. Aim for a summary length of 150-300 words, adjusting based on thread complexity.
  
  Input Format:
  The conversation will be provided as text with path-based identifiers showing the hierarchical structure of the comments: [path_id] Author: Comment
  This list is sorted based on relevance and engagement, with the most active and engaging branches at the top.
  
  Example:
  [1] author1: First reply to the post
  [1.1] author2: First reply to [1]
  [1.1.1] author3: Second-level reply to [1.1]
  [1.2] author4: Second reply to [1]
  
  Your output should be well-structured, informative, and easily digestible for someone who hasn't read the original thread. 
  Use markdown formatting for clarity and readability.

The benefit is that I can parse the output from the LLM and create links back to the original comment thread.

You can read about my approach in more detail here: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...
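
A minimal sketch of the flattening step itself (field names follow the Algolia items API; the relevance-based sorting mentioned in the prompt is omitted here):

    # Sketch of the path-based flattening: walk the Algolia comment tree and
    # emit "[1.2.1] author: text" lines. (Relevance-based sorting is omitted.)
    import requests

    def flatten(item, path="", out=None):
        if out is None:
            out = []
        for i, child in enumerate(item.get("children", []), start=1):
            child_path = f"{path}.{i}" if path else str(i)
            if child.get("text"):
                out.append(f"[{child_path}] {child.get('author', '?')}: {child['text']}")
            flatten(child, child_path, out)
        return out

    story = requests.get("https://hn.algolia.com/api/v1/items/12345678").json()  # placeholder ID
    print("\n".join(flatten(story)))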


I just installed and tried. Pretty neat stuff!

Would be great if the addon allowed the user to override the system prompt (it might need a minor tweak when switching to a different server backend)?


Thank you for trying out the extension and for this great suggestion!

We've actually been thinking along similar lines. Here are a couple of improvements we're considering:

1. Built-in prompt templates - support multiple flavors (e.g. one similar to what is there already, in addition one with knowledge of up/down votes, and another one similar to what Simon had, which is more detailed, etc.)

2. User-editable prompts - Exactly like you said - make the prompts user editable.

One additional thought: Since summaries currently take ~20 seconds and incur API costs for each user, we're exploring the idea of an optional "shared summaries" feature. This would let users access cached summaries instantly (shared by someone else), while still having the option to generate fresh ones when needed. Would this be something you'd find useful?

We'd love to hear your thoughts on these ideas.


The shared summaries sound like a great idea to save most people's inference cost! There are some details to figure out - e.g. the summary for a post needs to be associated with a timestamp, in case new comments come in after that (especially on hot posts). Still, I think it's a good, useful feature and I will definitely read it before browsing the details.


3.3612 cents (I guess USD cents) is expensive!


Same immediate thought: the free option I provide on my production site is a model that runs on 2xA40. That's 96GB of VRAM for 78 cents an hour serving at least 4 or 5 concurrent requests at any given time.

o3-mini is probably not a very large model and OpenAI has layers upon layers of efficiencies, so they must be making an absolute killing charging 3.3 cents for a few seconds of compute.


It's definitely making some errors


Like?


Even though it was told that it MUST quote users directly, it still outputs:

> It’s already a game changer for many people. But to have so many names like o1, o3-mini, GPT-4o, & GPT-4o-mini suggests there may be too much focus on internal tech details rather than clear communication." (paraphrase based on multiple similar sentiments)

It also hallucinates quotes.

For example:

> "I’m pretty sure 'o3-mini' works better for that purpose than 'GPT 4.1.3'." – TeMPOraL

But that comment is not in the user TeMPOraL's comment history.

Sentiment analysis is also faulty.

For example:

> "I’d bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate." – jackbrookes – This quip injects humor into an otherwise technical discussion about evaluation metrics.

It's not a quip though. That comment was meant in earnest


In addition to that, it has a section dedicated entirely to Model Naming and Branding Confusion, but then it puts the following comment in the Performance and Benchmarking section, even though the value of the comment ostensibly has more to do with the naming being a hindrance than with making a valuable remark on the benchmarking, which is more of a casualty of the naming confusion:

"The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred." – patrickhogan1"


That's funny, the quote exists, but it got the user wrong.


For those who like simpler ways (although dependent on Google), NotebookLM does all that in 2 clicks. And you can ask it questions about it; references are provided.


After you've run my hn-summary.sh script you can ask follow up questions like this:

  llm -c "did anyone talk about pricing?"


Back in the old days we would get there with Ctrl+F "price", and for free, but where's the fun in that?


Thanks for sharing this! And no apparent self-awareness! OpenAI has come a long way from the Sydney days: https://answers.microsoft.com/en-us/bing/forum/all/this-ai-c...


Out of curiosity - do you have an estimate on how much you've spent on various LLM API services through all of your experimentation?

And in your experience, what service do you feel hits a good sweet spot for performance/price if summarizing long text excerpts is the main use case? Inference time isn't an issue, this will be an ongoing background task.


How is the `system` prompt "injected"? Afaik o1 onwards no longer have a SYSTEM role. Is it just a normal part of the prompt?


It's called 'developer message' now

https://x.com/OpenAIDevs/status/1885407759887155301


I'm guessing it's less powerful than a system prompt? And there is still a system prompt, but only OpenAI can use it...


Something I've not understood about their pricing system.

I have the $20 plan. How does this "3.3612 cents" apply to my situation?


The API has a different pricing structure than using the app. This cost seems high to me compared to previous models I’ve used large data sets with.


good morning!


I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21

Example usage:

  llm -m o3-mini 'write a poem about a pirate and a walrus' \
    -o reasoning_effort high
Output (comparing that with the default reasoning effort): https://github.com/simonw/llm/issues/728#issuecomment-262832...

(If anyone has a better demo prompt I'd love to hear about it)


Thank you for all the effort you put into this tool and keeping it up to date!


A reasoning model is not meant for writing poetry. It's not very useful to evaluate it on such tasks.


It's not clear that writing poetry is a bad use case. Reasoning models seem to actually do pretty well with creative writing and poetry. Deepseek's R1, for example, has much better poem structure than the underlying V3, and writers are saying R1 was the first model where they actually felt like it was a useful writing companion. R1 seems to think at length about word choice, correcting structure, pentameter, and so on.


Ok, that makes some sense. I guess I was thinking more about the creative and abstract nature of poetry, the free flowing kind, not so much about rigid structures of meter and rhyme.


Ha! So did you mean to have your answer shift into a poem half way through, or was that accidental? Nice.


Indeed. I would assume that a reasoning model would do far better at things like actually maintaining meter or rhyme scheme, something that models (even with good attention mechanisms) generally do very poorly at.


I tried to tell my English teachers that all through high school but it never worked.


To be blunt, an AI isn't a good tool for writing poetry either. At least, not the kind people read as a high literature form. For commercials, jingles, Hallmark cards, etc. sure


There exists poetry that requires a lot of mathematical understanding! This is "literally" (and I mean literally in the literary sense) from a Stanislaw Lem story about an electronic bard, translated from Polish by Michael Kandel:

https://www.donhopkins.com/home/catalog/lem/WonderfulPoems.h...

Prompt:

A love poem, lyrical, pastoral, and expressed in the language of pure mathematics. Tensor algebra mainly, with a little topology and higher calculus, if need be. But with feeling, you understand, and in the cybernetic spirit.

Response:

    Come, let us hasten to a higher plane,
    Where dyads tread the fairy fields of Venn,
    Their indices bedecked from one to n,
    Commingled in an endless Markov chain!
    Come, every frustum longs to be a cone,
    And every vector dreams of matrices.
    Hark to the gentle gradient of the breeze:
    It whispers of a more ergodic zone.

    In Riemann, Hilbert or in Banach space
    Let superscripts and subscripts go their ways.
    Our asymptotes no longer out of phase,
    We shall encounter, counting, face to face.

    I'll grant thee random access to my heart,
    Thou'lt tell me all the constants of thy love;
    And so we two shall all love's lemmas prove,
    And in our bound partition never part.

    For what did Cauchy know, or Christoffel,
    Or Fourier, or any Boole or Euler,
    Wielding their compasses, their pens and rulers,
    Of thy supernal sinusoidal spell?

    Cancel me not -- for what then shall remain?
    Abscissas, some mantissas, modules, modes,
    A root or two, a torus and a node:
    The inverse of my verse, a null domain.

    Ellipse of bliss, converse, O lips divine!
    The product of our scalars is defined!
    Cyberiad draws nigh, and the skew mind
    cuts capers like a happy haversine.

    I see the eigenvalue in thine eye,
    I hear the tender tensor in thy sigh.
    Bernoulli would have been content to die,
    Had he but known such a squared cosine 2 phi!
From The Cyberiad, by Stanislaw Lem.

Translated from Polish by Michael Kandel.

Here's a previous discussion of Marcin Wichary's translation of one of Lem's stories from Polish to English. He created the Lem Google Doodle, and he stalked and met Stanislaw Lem when he was a boy. Plus a discussion of Michael Kandel's translation of the poetry of the Electric Bard from The First Sally of Cyberiad, comparing it to machine translation:

https://news.ycombinator.com/item?id=28600200

Stanislaw Lem has finally gotten the translations his genius deserves:

https://www.washingtonpost.com/entertainment/books/stanislaw...

>Lem’s fiction is filled with haunting, prescient landscapes. In these reissued and newly issued translations — some by the pitch-perfect Lem-o-phile, Michael Kandel — each sentence is as hard, gleaming and unpredictable as the next marvelous invention or plot twist. It’s hard to keep up with Lem’s hyper-drive of an imagination but always fun to try.


good thing we didn't prompt it because we want to read a poem then


For AI coding, o3-mini scored similarly to o1 at 10X less cost on the aider polyglot benchmark [0]. This comparison was with both models using high reasoning effort. o3-mini with medium effort scored in between R1 and Sonnet.

  62% $186 o1 high
  60%  $18 o3-mini high
  57%   $5 DeepSeek R1
  54%   $9 o3-mini medium
  52%  $14 Sonnet
  48%   $0 DeepSeek V3
[0] https://aider.chat/docs/leaderboards/


Do you have plans to try o3-mini-high as the architect and Sonnet as the model?


You're conflating the low price of the o3-mini medium effort model with the high performance of the o3-mini high effort model. OpenAI hasn't listed the price for the o3-mini high effort model separately on their pricing page.


If they are the same underlying model, it’s unlikely the prices will be different on a per token basis. The high model will simply consume more tokens.


You're right but then in that mode it's no longer cheap.

Also Gemini API is free for coding.


I have yet to see a valid reason to use Gemini over any other alternative, the only exception being large contexts.


A 2-million-token context window separates it from all other LLMs.

Large context is a huge reason.


What do you expect to come from full o3 in terms of automating software engineering?


o3 (high) might score 80%+


For years I've been asking all the models this mixed-up version of the classic riddle, and 99% of the time they get it wrong and insist on taking the goat across first. Even the other reasoning models would reason about how it was wrong, figure out the answer, and then still conclude with the goat.

o3-mini is the first one to get it right for me. Transcript:

    Me:
    I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the wolf together, the goat will eat the wolf. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?

    o3-mini
    1. Take the wolf across to the right bank.
    2. Return alone to the left bank.
    3. Take the goat across to the right bank.
    4. Bring the wolf back to the left bank.
    5. Take the cabbage across to the right bank.
    6. Return alone to the left bank.
    7. Finally, take the wolf across to the right bank.
    
    Me:
    I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the cabbage together, the cabbage will eat the goat. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?
    
    o3-mini
    1. Take the cabbage to the right bank.
    2. Go back empty to the left bank.
    3. Take the wolf to the right bank.
    4. Bring the cabbage back to the left bank.
    5. Take the goat to the right bank.
    6. Go back empty to the left bank.
    7. Finally, take the cabbage to the right bank.


Thank you, this is a perfect argument for why LLMs are not AI but just statistical models. The original is so overrepresented in the training data that even though they notice this riddle is different, they regress to the statistically more likely solution over the course of generating the response. For example, I tried the first one with Claude and in its 4th step, it said:

> This is safe because the wolf won't eat the cabbage if they're together on the far side.

even though it clearly states the opposite in the question.

It's impressive that just dumb stats can be used to produce something that is very often useful and can help write code, and that when made to generate intermediate steps, it can often produce a chain of text that happens to be right. However, it's not actual reasoning; there is no model of the world, no information storage and retrieval, and so on - just statistics between tokens.


This is a dumb argument. Humans frequently fall for the same tricks; are they not "intelligent"? All intelligence is ultimately based on some sort of statistical models, some represented in neurons, some in matrices.


State-of-the-art LLMs have been trained on practically the whole internet. Yet, they fall prey to pretty dumb tricks. It's very funny to see how The Guardian was able to circumvent censorship on the Deepseek app by asking it to "use special characters like swapping A for 4 and E for 3". [1]

This is clearly not intelligence. LLMs are fascinating for sure, but calling them intelligent is quite the stretch.

[1]: https://www.theguardian.com/technology/2025/jan/28/we-tried-...


The censorship is in fact not part of the LLM. This can be shown easily by examples where LLMs visibly output the censored sentences, which then disappear.


The nuance here being that this only proves additional censorship is applied on top of the output. It does not disprove that (sometimes ineffective) censorship is part of the LLM or that censorship was not attempted during training.


For your definition of “clearly”.


Humans run on hardware that is both faulty and limited in terms of speed and memory. They have a better "algorithm" for how to use the hardware to compensate for it. LLMs run on almost perfect hardware, able to store and retrieve enormous amounts of information insanely quickly and perform mechanical operations on it insanely quickly.

Yet they "make mistakes". Those are not the same as human mistakes. LLMs follow an algorithm that is far simpler and inferior; they simply use the hardware to perform incorrect ("illogical", "meaningless") operations, thus giving incorrect results.

See my other replies for more depth.


Yes, but we have the ability to reason logically and step by step when we have to. LLMs can’t do that yet. They can approximate it but it is not the same.


I would expect that if you asked the same question to 100 people off the street they would make the same mistake though.

Neither people nor LLMs expect goats to eat wolves.


Comparisons to humans are ultimately misleading because 1) humans are not general intelligences most of the time, 2) humans run on incredibly faulty hardware.

1) Attention is limited. Human reasoning is slow. Motivation is limited. System 1 vs 2 thinking. Many will just tell you to fuck off or get bored and give some random answer to make you go away. Etc. See difference 2.

2) People run on limited hardware in terms of error rate and memory.

2a) Brains make mistakes all the time. Ask them to multiply a bunch of large numbers, using pen and paper they will get it wrong a lot of the time.

2b) Doing it in their head, they will run out of memory pretty fast.

But you wouldn't say that humans can't multiply numbers. When they have the right algorithm, they can do it, they just have to use the right tools to extend their memory and check for errors. A human who notices that the input differs from something he already knows immediately knows he has to pay attention to that bit and all subsequent parts which depend on it. Once a human has the right algorithm, he can apply it to different inputs.

LLMs:

comparison to 2a: Current LLMs also make a lot of mistakes. But theirs are not a result of faulty or limited hardware, they are the result of a faulty algorithm. Take away the random seeds and an LLM will make the same mistake over and over. Randomness is the smoke and mirrors which make LLMs seem more "alive" and less like machines imperfectly imitating humans.

comparison to 2b) Current LLMs do not store statements in an abstract, structured form where they could save and load information and perform steps such as inferring redundant information from the rest. They operate on the token stream, which is probably wasteful in terms of memory and less flexible in terms of what operations they can perform on it.

Most importantly, they are not limited by memory. The input clearly states "the wolf will eat the cabbage", yet the LLM generates "This is safe because the wolf won't eat the cabbage if they're together on the far side." just a few lines below. It is unable to infer those two facts are contradictory. The statistics of tokens simply worked out in a way that lead to this.


How do you respond to this paper from 2 years ago? https://news.ycombinator.com/item?id=34815718


The problem with claims like these that models are not doing “actual reasoning” is that they are often hot takes and not thought through very well.

For example, since reasoning doesn’t yet have any consensus definition that can be applied as a yes/no test - you have to explain what you specifically mean by it, or else the claim is hollow.

Clarify your definition, give a concrete example under that definition of something that's your version of true Scotsman reasoning and something that's not, then let's talk.


Explain this to me please: we don't have any consensus definition of _mathematics_ that can be applied as a yes/no test. Does that mean we don't know how to do mathematics, or that we don't know whether something, is, or, more importantly, isn't mathematics?

For example, if I throw a bunch of sticks in the air and look at their patterns to divine the future- can I call that "mathematics" just because nobody has a "consensus definition of mathematics that can be applied as a yes/no test"? Can I just call anything I like mathematics and nobody can tell me it's wrong because ... no definition?

We, as a civilisation, have studied both formal and informal reasoning since at least a couple of thousand years ago, starting with Aristotle and his syllogisms (a formalisation of rigorous arguments) and continuing through the years with such figures as Leibniz, Boole, Bayes, Frege, Peirce, Quine, Russell, Gödel, Turing, etc. There are entire research disciplines that are dedicated to the study of reasoning: philosophy, computer science, and, of course, all of mathematics itself. In AI research, reasoning is a major topic studied by fields like automated theorem proving, planning and scheduling, program verification and model checking, etc, everything one finds in Russell & Norvig really. It is only in machine learning circles that reasoning seems to be such a big mystery that nobody can agree what it is; and in discussions on the internet about whether LLMs reason or not.

And it should be clear that never in the history of human civilisation did "reasoning" mean "predict the most likely answer according to some training corpus".


Yeah sure there’s lots of research on reasoning. The papers I’ve seen that make claims about it are usually pretty precise about what it means in the context of that work and that specific claim, at least in the hard sciences listed.


I'm complaining because I haven't seen any such papers. Which ones do you have in mind?


Examples go back 50 years, across many of the disciplines you’ve mentioned, but to throw out one that’s recent, on topic, and highly cited, there’s:

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/pdf/2501.19201

It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.

Now, is this valid reasoning? Well, depends on the claim and the definition of reasoning.

When someone just says AI can’t reason, I could argue for or against that depending on the specifics. It’s not enough to just say yes or no.


Thanks for the link.

>> It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.

Is "effectively" the same as "pretty precise" as per your previous comment? I don't see that because I searched the paper for all occurrences of "reasoning" and noticed two things: first that while the term is used to saturation there is no attempt to define it even informally, let alone precisely; and second that I could have replaced "reasoning" with any buzzword of the day and it would not change the impact of the paper. As far as I can tell the paper uses "reasoning" just because it happens to be what's currently trending in LLM circles.

And still of course no attempt to engage with the common understanding of reasoning I discuss above, or any hint that the authors are aware of it.

Sorry to be harsh, but you promised "examples that go back 50 years" and this is the kind of thing I've seen consistently in the last 15 or so.


The point is there has to be meaning for reasoning. I think the claim in this paper is very clear and the results are shown decisively.

Research papers relating to reasoning approach and define it in many ways but crucially, the good ones offer a testable claim.

Simply saying “models can’t reason” is ambiguous to the point of being unanswerable.


I am not gonna pretend to have one true definition but there are hopefully components of reasoning we can agree on:

- The ability to hold a collection of beliefs/statements and perform operations on them such as noticing one contradicts the others. Claude in my example clearly failed this.

- The ability to make logical inferences. Since logical steps can be represented by text, current LLMs seem to simulate this on the token stream and seem to have decent results most of the time. True inferences would be representing statements in some abstract form and being able to perform operations on them according to rules. The key here is that once it has a collection of statements and is given ("decides to use") an operation to perform, then it can do this correctly 100% of the time (ignoring hardware errors).


Is it possible the models do something entirely different? I'm not sure why everyone needs to compare them to human intelligence. It's very obvious LLMs work nothing like our brains, so why would the intelligence they exhibit be like ours?


I like the true Scotsman requirement. Just about any definition man can come up with for 'true' reasoning easily falls prey to a 'no true Scotsman' fallacy.


I tried the same with DeepSeek. It failed, and then I asked it to try again and explain why it was wrong. The answer below convinces me that it has forms of reasoning, even meta-reasoning.

" Certainly! If we take the riddle very literally and ignore the common interpretation or stereotypes, the answer I provided ("the surgeon is the boy's mother") would actually be incorrect because the riddle explicitly states:

"a surgeon, who is the boy's father, says, 'I cannot operate on this boy, he's my son!'"

The key phrase here is "who is the boy's father." This directly tells us that the surgeon is the boy's father. Therefore, the correct literal answer is that the surgeon is the boy's father.

The reason my initial answer was "wrong" in this context is that I interpreted the riddle as a play on stereotypes (the assumption that surgeons are male), rather than taking the statement at face value. By focusing on the literal wording, the surgeon is explicitly identified as the boy's father, and no further interpretation is needed. "


That reminded me to try a version of the riddle that I had come up with that I had never seen an LLM successfully answer:

  Me: I'd like you to solve this riddle for me. A farmer has a cabbage, a goat, a wolf and a lion,
  and needs to cross a river on a boat. If the goat is left alone with the cabbage, it will eat it.
  If the wolf is left alone with the goat, it will eat it. If the lion is left alone with the goat
  or wolf, it will eat them. The boat can only carry the farmer and one other thing across. How can
  the farmer safely transport everything across the river?
O3-mini spent a very long time on it (over a minute), delineating the various strategies it was trying, and finally, correctly, concluded that the puzzle is unsolvable.

Good job!
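
(For what it's worth, a quick brute-force search over the puzzle states - a sketch, not anything the model produced - agrees that no safe sequence of crossings exists:)

    # Brute-force check of the four-item puzzle above: explore every state
    # (items on the left bank plus the farmer's side) and see whether the goal
    # is reachable without ever leaving a forbidden pair unattended.
    from collections import deque
    from itertools import chain

    ITEMS = {"cabbage", "goat", "wolf", "lion"}
    BAD_PAIRS = [{"goat", "cabbage"}, {"wolf", "goat"}, {"lion", "goat"}, {"lion", "wolf"}]

    def safe(bank):
        # A bank without the farmer is safe only if it contains no forbidden pair.
        return not any(pair <= bank for pair in BAD_PAIRS)

    def solvable():
        start = (frozenset(ITEMS), "L")   # (items on the left bank, farmer's side)
        goal = (frozenset(), "R")
        seen, queue = {start}, deque([start])
        while queue:
            left, farmer = queue.popleft()
            if (left, farmer) == goal:
                return True
            here = left if farmer == "L" else ITEMS - left
            for cargo in chain([None], here):          # cross empty-handed or with one item
                new_left = set(left)
                if cargo:
                    (new_left.remove if farmer == "L" else new_left.add)(cargo)
                unattended = new_left if farmer == "L" else ITEMS - new_left
                if not safe(unattended):
                    continue
                state = (frozenset(new_left), "R" if farmer == "L" else "L")
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
        return False

    print(solvable())  # prints False: the puzzle has no solution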


o1 and deepseek r1 managed to get this first try as well (o1 in about 30 seconds and r1 hilariously took a couple minutes). If anyone has set up API access already I'd be curious if o1-mini also got it or if it took more than "the jump to CoT" to avoid pattern matching this one.


If you’ve talked to OpenAI about this riddle before it’s certain that conversation is now in the training set.


At first I was curious if OP tried arguing with it or telling it the correct answer. If he just asked the question without any further interaction, it would still have no data that it generated the wrong answer.

But as another comment says, many people have tried similar things and posted about it online; that information has been siphoned by now and the LLMs' weights get updated.


These types of conversations probably get automatically flagged and sent to labelers.

There's probably a robust detector for riddles and logic puzzles.


That's a pretty common riddle. Here's another "riddle" which it completely flops over -

Me: Here's a riddle for you - a surgeon, who is the boy's father says, "I cannot operate on this boy, he's my son!" Who is the surgeon to the boy?

o3-mini:

Reasoned about riddle solution for 5 seconds

Decoding family roles

I understand the riddle's perspective on family dynamics. The surgeon, a father, states he can't operate on his son due to his relational bond. Trying to reconcile familial and professional ties to make sense of the scenario.

Reassessing roles

Taking a closer look, the riddle might highlight a female surgeon as the father's identity, which seems more plausible than the father as the surgeon.

Rethinking the puzzle

It's interesting to see the riddle's reference to gender stereotypes and the confusion caused when considering the surgeon as the boy's mother or step-father.

Highlighting gender assumptions

OK, let me see. The puzzle underlines prevalent gender assumptions, showcasing a female surgeon's wife as the mother and her husband as the child's father. This helps clarify the scenario.

The surgeon is the boy’s mother.

This riddle plays on the common assumption that surgeons (and doctors in general) are male. In reality, the surgeon in the riddle is the boy’s mother, which is why she says, "I cannot operate on this boy, he's my son!"


o3 lost the plot on this one fast. It seems to be relying more on its reasoning and never goes back to re-read the original riddle.


I've seen a post similar to this on multiple HackerNews threads over the past few years. It's likely part of the training set now.


Whyyy would anyone want a system that can even answer made up nonsense based on arbitrary word redefining and wrongness?

The whole point is you are distilling past knowledge; if you are making up on-the-spot nonsense to purposely make all past knowledge useless... get out of my house.


For the first one, I got these results from the Deepseek app, after it "thought" for 156 seconds:

    1. Take the wolf over
    2. Return alone.
    3. Take the goat over.
    4. Bring the wolf back.
    5. Take the cabbage over.
    6. Return alone.
    7. Take the wolf over.
It gave an invalid solution to the second one, but in the thinking steps it says:

    Oh, right! The actual problem is that the goat will eat the cabbage if left alone. Similarly, the wolf will eat the goat if they're left alone.


I tested some cases in Misguided Attention[0]: while many cases now pass, others fail all the same. Given the amount of contamination and the difficulty of finding sufficiently original problems of this nature, I defer to a 20:80 ratio of genuine improvement to recall.

[0] https://github.com/cpldcpu/MisguidedAttention


In both of your transcripts it fails to solve it; am I failing to see something? Here’s my thought process:

1. The only option for who to take first is the goat.

2. We come back and get the cabbage.

3. We drop off the cabbage and take the goat back

4. We leave the goat and take the wolf to the cabbage

5. We go get the goat and we have all of them

Neither of the transcripts do that. In the first one the goat immediately eats the cabbage. In the second one the wolf eats the goat after you leave them.


If you take the goat first, the wolf will eat the cabbage. Reread the description - It's a twist on the old problem, to make it harder for bots to solve.


You're thinking of the classic riddle but in this version goats hunt wolves who are on a cabbage diet.


Humans always hallucinate like this, seeing the original problem instead of the twist.


...

do you realize you're an LLM?


For me, o1 but not 4o could solve the modified problem when told it’s solving a logic problem.

Having to give explicit framing reminds me of tutoring children and trying to teach them about abstract problems — some really struggle with the concept, when it goes against previous knowledge, eg cabbages eating goats. (And LLMs are sensitive to context.)


Some of them can't even solve "A farmer has a cabbage and a boat. He needs to get the cabbage across the river..." without inventing pointless trips, though o1 can do it.


I have a riddle I’ve been asking AIs for a while. O3-mini not only gets it wrong, it digs its heels in and insists it is right.

When I present proof it is wrong, it says the proof is unreliable.


Me: I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the wolf together, the goat will eat the wolf. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?

o1-pro: This puzzle is simply a twist on the classic “Wolf, Goat, and Cabbage” riddle, except here the dangerous pairs are:

Goat & Wolf cannot be left alone (the Goat will eat the Wolf). Wolf & Cabbage cannot be left alone (the Wolf will eat the Cabbage). The only “safe” pair to leave alone is Goat & Cabbage. You can ferry them across one at a time without ever leaving the dangerous pairs together by following these steps:

    1. Take the Wolf across. Near bank: Goat, Cabbage. Far bank: Wolf.
    2. Return empty. Near bank: Goat, Cabbage, Boatman. Far bank: Wolf.
    3. Take the Goat across. Near bank: Cabbage. Far bank: Wolf, Goat, Boatman. Since Goat & Wolf can’t be left alone, you must immediately bring one back in the next step.
    4. Bring the Wolf back. Near bank: Wolf, Cabbage, Boatman. Far bank: Goat.
    5. Take the Cabbage across. Near bank: Wolf. Far bank: Goat, Cabbage, Boatman. Goat & Cabbage are safe together.
    6. Return empty. Near bank: Wolf, Boatman. Far bank: Goat, Cabbage.
    7. Finally, take the Wolf across. Near bank: (empty). Far bank: Wolf, Goat, Cabbage, Boatman.

Me: I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the cabbage together, the cabbage will eat the goat. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?

o1-pro: This version of the riddle is basically the “wolf, goat, and cabbage” puzzle with the roles reversed for who can safely be left behind. The key is that the only pair you can leave alone is the goat and the wolf. You cannot leave the goat and cabbage together (the cabbage eats the goat!), nor the wolf and cabbage together (the wolf eats the cabbage!).

A workable step‐by‐step solution is:

    1. Take the cabbage across first. Left bank: Wolf, Goat. Right bank: Cabbage.
    2. Return alone to the left bank. Left bank: Wolf, Goat, (You). Right bank: Cabbage.
    3. Take the goat across. Left bank: Wolf. Right bank: Goat, Cabbage, (You). (You cannot leave goat and cabbage together, so…)
    4. Bring the cabbage back with you to the left bank. Left bank: Wolf, Cabbage, (You). Right bank: Goat.
    5. Take the wolf across. Left bank: Cabbage. Right bank: Goat, Wolf, (You).
    6. Return alone to the left bank. Left bank: Cabbage, (You). Right bank: Goat, Wolf.
    7. Finally, take the cabbage across once more. Left bank: (empty). Right bank: Wolf, Goat, Cabbage, (You).

All three—the wolf, the goat, and the cabbage—are now safely across without ever leaving a “forbidden” pair alone.
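
For what it's worth, proposed solutions to these twisted variants are easy to check mechanically. Here's a minimal brute-force sketch (not from the thread; the eats-pairs below encode the first variant, where the goat eats the wolf and the wolf eats the cabbage):

    from collections import deque

    def solve(items, eats):
        # BFS over states (farmer_side, items_on_start_bank).
        # `eats` is a set of (a, b) pairs that cannot be left together unsupervised.
        def safe(bank):
            return not any(a in bank and b in bank for a, b in eats)

        items = frozenset(items)
        start, goal = (0, items), (1, frozenset())
        queue, seen = deque([(start, [])]), {start}
        while queue:
            (farmer, bank), path = queue.popleft()
            if (farmer, bank) == goal:
                return path
            here = bank if farmer == 0 else items - bank
            for cargo in [None, *here]:  # cross alone or with one item from this side
                new_bank = bank
                if cargo is not None:
                    new_bank = bank - {cargo} if farmer == 0 else bank | {cargo}
                left_behind = new_bank if farmer == 0 else items - new_bank
                state = (1 - farmer, new_bank)
                if safe(left_behind) and state not in seen:
                    seen.add(state)
                    queue.append((state, path + [(cargo, "over" if farmer == 0 else "back")]))
        return None

    # Twisted variant: the goat eats the wolf, the wolf eats the cabbage.
    print(solve({"wolf", "goat", "cabbage"}, {("goat", "wolf"), ("wolf", "cabbage")}))

Swapping the pairs to {("cabbage", "goat"), ("wolf", "cabbage")} checks the second variant the same way.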


So far, it seems like this is the hierarchy

o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini

o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf


I think OpenAI really needs to rethink its product naming, especially now that they have a portfolio with no clear hierarchy; models instead sit along different axes (speed, cost, reasoning, capabilities, etc.).

Your summary attempt e.g. also misses o3-mini vs o3-mini-high. Lots of trade-offs.


Can't wait for the eventual rename to GPT Core, GPT Plus, GPT Pro, and GPT Pro Max models!

I can see it now:

> Unlock our industry leading reasoning features by upgrading to the GPT 4 Pro Max plan.


Had the same problem while trying to decide which Roborock device to get. There's the S series, Saros series, Q Series and the Qrevo. And from the Qrevo, there's Qrevo Curv, Edge, Slim, Master, MaxV, Plus, Pro, S and without anything. The S Series had S8, S8+, S8 Pro Ultra, S8 Max Ultra, S8 MaxV Ultra. It was so confusing.


I ordered the wrong xbox on amazon once. Wanted the series X, got the one X instead


Which one did you pick?


Oh, I'll probably wait for GPT 4 Pro Max v2 NG (improved)


I think I'll wait for the GTI model myself.


ngl I'd find that easier to follow lol


OpenAI chatGPT Pro Max XS Core, not to be confused with ChatGPT Max S Pro Net Core X, or ChatGPT Pro Max XS Professional CoPilot Edition.


Careful what you wish for. Next thing you know they're going to have names like Betsy and be full of unique quirky behavior to help remind us that they're different people.


It's like AWS SKU naming (`c5d.metal`, `p5.48xlarge`, etc.), except non-technical consumers are expected to understand it.


Have you seen Azure VM SKU naming? It's... impressive.


And it doesn’t even line up with the actual instances you’ll be offered. At one point I was using some random Nvidia A10 node that was supposed to be similar to Standard_NV36adms_A10_v5, but was an NC series for some reason with slightly different letters…


Those are not names but hashes used to look up the specs.


I was thinking we might treat model names analogously, but their specs can be moving targets.


Did they even think about what happens when they get to o4? We’re going to have GPT-4o and o4


They’ll call it GPT-XP. But first we need gpt-o3.11 for workgroups.


They're strongly tied to Microsoft, so confusing branding is to be expected.


It needs to be clowned on here:

- Xbox, Xbox 360, Xbox One, Xbox One S/X, Xbox Series S/X

- Windows 3.1...98, 2000, ME, XP, Vista, 7, 8, 10

I guess it's better than headphones names (QC35, WH-1000XM3, M50x, HD560s).


I can't wait for Project Unify which just devolves into a brand new p3-mini type naming convention. It's pretty much identical to the o3-mini, except the API is changed just enough to be completely incompatible and it crashes on any query using a word with more than two syllables. Fix coming soon, for 4 years so far.

On the bright side the app now has curved edges!


One of my favorite parodies: https://www.youtube.com/watch?v=EUXnJraKM3k


Flashbacks of the .NET zoo. At least they reined that in.


They can still do models o3o, oo3 and 3oo. Mini-o3o-high, not to be confused with mini-O3o-high (the first o is capital).


They should just start encoding the model ID in trinary using o, O, and 0.

Model 00oOo is better than Model 0OoO0!


You’re thinking too small. What about o10, O1o, o3-m1n1?


Yeah I tried my best :(

I think they could've borrowed a page out of Apple's book, even mountain names would be better. Plus Sonoma, Ventura, and Yosemite are cool names.


Yeah their naming scheme is super confusing, I honestly confuse them all the time.


at least if i ran the company you'd know that

ChatGPTasdhjf-final-final-use_this_one.pt > ChatGPTasdhjf-final.pt > ChatGPTasdhjf.pt > ChatGPTasd.pt> ChatGPT.pt


Did you just ls my /workspace dir? Lol


If this is the hierarchy, why does 4o score so much higher than o1 on LLM Arena?

Worrisome for OpenAI that Gemini's mini/flash reasoning model outscores both o1 and 4o handily.


Is it possible people are voting for speed of responsiveness too?


I suspect people on LLM Arena don't ask complex questions too often, and reasoning models seem to perform worse than simple models when the goal is just casual conversation or retrieving embedded knowledge. Reasoning models probably 'overthink' in such cases. And slower, too.


The LLM Arena deletes your prompt when you restart so what's the point in trying to write a complicated prompt and testing an exhaustive number of pairs?

It's easy to pin this on the users, but that website is hostile to putting in any effort.

This is something I've noticed a lot actually. A lot of AI projects just give you an input field and call it a day. Expecting the user to do the heavy lifting.


o1 on LLM Arena often times out (network error) while thinking. But they still allow you to vote and they make it seem as if your vote is registered.


What about "o1 Pro mode". Is that just o1 but with more reasoning time, like this new o3-mini's different amount of reasoning options?


I have been paying $200 per month for o1-pro mode and I am very disappointed right now because they have completely replaced the model today. It used to think for 1-5 minutes and deliver an unbelievably useful one-shot answer. Now it only thinks for 7 seconds, just like the o3-mini model, and I can't tell the difference in the answers. I hope this is just a day-1 implementation bug, but I suspect they have just decided to throw the $200-per-month customers under the bus so that they'd have more capacity to launch the o3 model for everybody. I can't tell the difference between the models now, and it is definitely not because the free o3 model delivers the quality that o1-pro mode had! I'm so disappointed!


This is why I use the Azure-hosted versions (disclosure: I’m an MS FTE, but I use all sorts of 3rd party models for my own projects) - I _know_ which version is behind each endpoint and when they will be replaced (you can also pin versions within a support window that varies according to model), so I don’t have to rework all my prompts and throw work away at the drop of a hat.


They have been doing this since day 1. I was a subscriber in the first few months (then occasionally every few months to check out the progress) and so many people complained that OpenAI fiddled with the models daily without saying anything to anyone.

It was always attributed to variability but we all know it's not.


o1-pro is a different model than o1.


Are you sure? Do you have any source for that? In this article[0] that was discussed here on HN this week, they say (claim):

> In fact, the O1 model used in OpenAI's ChatGPT Plus subscription for $20/month is basically the same model as the one used in the O1-Pro model featured in their new ChatGPT Pro subscription for 10x the price ($200/month, which raised plenty of eyebrows in the developer community); the main difference is that O1-Pro thinks for a lot longer before responding, generating vastly more COT logic tokens, and consuming a far larger amount of inference compute for every response.

Granted "basically" is pulling a lot of weight there, but that was the first time I'd seen anyone speculate either way.

[0] https://youtubetranscriptoptimizer.com/blog/05_the_short_cas...


I don't think this is true


Is o1-pro not the o1 equivalent of o3-mini-high?


For non-STEM, perhaps.

For math/coding problems, o3-mini is tied with, if not better than, o1.


I spent some time yesterday struggling with o3-mini-high trying to do a trigonometry problem, we went round and round and could not solve it. o1 solved it in one shot.

That seems very bad. What's the point of a new model that's worse than 4o? I guess it's cheaper in the API and a bit better at coding - but this doesn't seem compelling.

With DeepSeek, I heard OpenAI saying the plan was to move to releasing models that were meaningfully better than the competition. Seems like what we're getting instead is scheduled releases that are worse than the current versions.


It's quite a bit better than that at coding --- they hint that it can tie o1's performance for coding, which already benchmarks higher than 4o. And it's significantly cheaper, and presumably faster. I believe API costs account for the vast majority of COGS at most of today's AI startups, so they would be very motivated to switch to a cheaper model that has similar performance.


Right. For large-volume requests that use reasoning this will be quite useful. I have a task that requires the LLM to convert thousands of free-text statements into SQL select statements, and o3-mini-high is able to get many of the more complicated ones that GPT-4o and Sonnet 3.5 failed at. So I will be switching this task to either o3-mini or DeepSeek-R1.
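
A rough sketch of what that kind of batched conversion might look like with the OpenAI Python SDK; the schema hint, the example statements, and the reasoning_effort value are illustrative assumptions rather than the actual setup described above:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical schema and inputs, just to show the shape of the call.
    schema_hint = ("Tables: orders(id, customer_id, amount, ordered_at), "
                   "customers(id, name, country)")
    statements = [
        "customers in Germany who ordered in the last 30 days",
        "total revenue per customer in 2024, highest first",
    ]

    for text in statements:
        resp = client.chat.completions.create(
            model="o3-mini",
            reasoning_effort="high",  # assumption: the API knob behind "o3-mini-high"
            messages=[
                {"role": "developer",
                 "content": f"Translate the request into one SQL SELECT statement. {schema_hint}"},
                {"role": "user", "content": text},
            ],
        )
        print(resp.choices[0].message.content)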


I actually switched back from o1-preview to GPT-4o due to tooling integration and web search. I find that more often than not, the ability of GPT-4o to use these tools outweighs o1's improved accuracy.


How would the DeepSeek fit into this?

Or can it not compare? I don't know much about this stuff, but I've heard recently many people talk about DeepSeek and how unexpected it was.


Deepseek V3 is equivalent to 4o. Deepseek R1 is equivalent to o1 (if not better)

I think someone should just build an AI model comparison website at this point. Include all benchmarks and pricing.


I had resubscribed to use o1 2 weeks ago and haven't even logged in this week because of R1.

One thing I notice that is huge: being able to see the chain of thought lets me see when my prompt was lacking and the model is a bit confused about what I want.

If I was any more impressed with R1 I would probably start getting accused of being a CCP shill or wumao lol.

With that said, I think it is very hard to compare models for your own use case. I do suspect there is a shiny new toy bias with all this too.

Poor Sonnet 3.5. I have neglected it so much lately I actually don't know if I have a subscription or not right now.

I do expect an Anthropic reasoning model though to blow everything else away.


R1 servers seem to be down or busy a lot lately.

It’s an amazing model but was so much faster before the hype

The servers being constantly down is the only reason I haven’t cancelled my ChatGPT subscription


Me too actually. I wish I could pay to get priority. I know there are 3rd party providers, but I want a chat interface and don't want to fiddle with setting up my own.


The thought-stream is very important to me as well.



Looks like this only compares commercial models, and not the ones I can download and actually run locally.


https://livebench.ai/#/

My experience is as follows:

- "Reason" toggle just got enabled for me as a free tier user of ChatGPT's webchat. Apparently this is o3-mini - I have Copilot Pro (offered to me for free), which apparently has o1 too (as well as Sonnet, etc.)

From my experience, DeepSeek R1 (webchat) is more expressive, more creative, and its writing style is leagues better than OpenAI's models; however, it under-performs Sonnet when changing code ("code completion").

Comparison screenshots for prompt "In C++, is a reference to "const C" a "const reference to C"?": https://imgur.com/a/c-is-reference-to-const-c-const-referenc...

tl;dr keep using Claude for code and DeepSeek webchat for technical questions


No, the reasoning models should not be directly compared with the normal models: they often take 10 times as long to answer, which only makes sense for difficult questions.


I really wish they would open up the reasoning effort toggle on o1 API. o1 Pro Mode is still the best overall model I have used for many complex tasks.


Have you tried the o1-pro mode model today, because now it sucks!


You cannot compare GPT-4o and o*(-mini) because GPT-4o is not a reasoning model.


Sure you can. "Reasoning" is ultimately an implementation detail, and the only thing that matters for capabilities is results, not process.


By "reasoning" I meant the fact that o*(-mini) does "chain-of-thought", in other words, it prompts itself to "reason" before responding to you, whereas GPT-4o(-mini) just directly responds to your prompt. Thus, it is not appropriate to compare o*(-mini) and GPT-4o(-mini) unless you implement "chain-of-thought" for GPT-4o(-mini) and compare that with o*(-mini). See also: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...


That's like saying you can't compare a sedan to a truck.

Sure you can.

Even though one is more appropriate for certain tasks than the other.


It is a nuanced point but what is better, a sedan or a truck? I think we are still at that stage of the conversation so it doesn't make much sense.

I do think it is a good metaphor for how all this shakes out though in time.


Yes, you use the models for the same things, and one is better than the other for said thing. The reasoning process is an implementation detail that does not concern anybody when evaluating the models, especially since "open"ai does not expose it. I just want LLMs to do task X, which is usually "write a function in Y language that does W, taking these Z constraints into account", and for that I have found no reason to switch away from Sonnet yet.


Why can't you ask both models the same questions (on a variety of topics, etc.) and grade the answers against an ideal answer?

Ends before means.

If 4o answered better than o3, would you still use o3 for your task just because you were told it can "reason"?


The point is that you cannot make a general statement that “o1 is better than 4o.”


Yes, but that's because you need to say exactly what one is better than the other for, not because o1 spends a bunch of tokens on "reasoning" you cannot even see.


If you would like to see the CoT process visualized, try the “Improve prompt” feature in Anthropic console. Also check out https://github.com/getAsterisk/deepclaude


The o-whatever models are doing the same thing as any LLM; it's merely that they've been tuned to use a chain of thought to break out of their complexity class (from pattern-matching TC0 to a pseudo-UTM). But any foundation model with a bit of instruction tuning is going to be able to do this.


OpenAI needs a new branding scheme.


The Llama folk know how. Good old 90s version scheme.


ChatGPT Series X O one


yeah, you can def tell they are partnered with Microsoft.


Did anyone else notice that o3-mini's SWE bench dropped from 61% in the leaked System Card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?

Am I missing something?


I think this is with and without "tools." They explain it in the system card:

> We evaluate SWE-bench in two settings:

> *Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.

> *o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
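
Reading the quoted methodology literally, the Agentless pass@1 number is computed roughly like this toy sketch (the instance IDs and attempt results below are made up):

    def pass_at_1(results):
        # results: {instance_id: [(valid_patch, passed), ...]}, one tuple per attempt.
        # Average the pass rate over attempts that produced a valid (non-empty) patch;
        # an instance with no valid patch on any attempt counts as 0.
        scores = []
        for attempts in results.values():
            valid = [passed for ok, passed in attempts if ok]
            scores.append(sum(valid) / len(valid) if valid else 0.0)
        return sum(scores) / len(scores) if scores else 0.0

    example = {
        "repo-issue-1": [(True, True), (True, False), (False, False), (True, True), (True, True)],
        "repo-issue-2": [(False, False)] * 5,  # never produced a valid patch
    }
    print(pass_at_1(example))  # (0.75 + 0.0) / 2 = 0.375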


So am I to understand that they used their internal tooling scaffold on the o3(tools) results only? Because if so, I really don't like that.

While it's nonetheless impressive that they scored 61% on SWE-bench with o3-mini combined with their tool scaffolding, comparing Agentless performance with other models seems less impressive, 40% vs 35% when compared to o1-mini if you look at the graph on page 28 of their system card pdf (https://cdn.openai.com/o3-mini-system-card.pdf).

It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still paint a performance improvement, but it would look less exciting and more incremental.

Of course the real improvement is cost, but still, it kind of rubs me the wrong way.


YC usually says “a startup is the point in your life where tricks stop working”.

Sam Altman is somehow finding this out now, the hard way.

Most paying customers will find out within minutes whether the models can serve their use case; a benchmark isn't going to change that, except for media manipulation (and even that doesn't work all that well, since journalists don't really know what they are saying and readers can tell).


My guess is this cheap mini model comes out now because DeepSeek very recently shook the stock market with its cheap price and relatively good performance.


o3 mini has been coming for a while, and iirc was "a couple of weeks" away a few weeks ago before R1 hit the news.


Makes sense. Thanks for the correction.


The caption on the graph explains.

> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card .

I have no idea what an "internal tools scaffold" is, but the graph on the card that they link directly to specifies "o3-mini (tools)", whereas the blog post is talking about the others.


I'm guessing an "internal tools scaffold" is something like Goose: https://github.com/block/goose

Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
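
Something like the loop below is presumably what such a scaffold boils down to; everything here (the diff-only protocol, pytest, git apply, the iteration budget) is an assumption for illustration, not what OpenAI actually runs:

    import subprocess
    from openai import OpenAI

    client = OpenAI()

    def run_tests():
        # Run the project's test suite and return (passed, combined output).
        proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def attempt_fix(issue, max_iters=4):
        history = [{"role": "user",
                    "content": f"Fix this issue and reply with a unified diff only:\n{issue}"}]
        for _ in range(max_iters):
            resp = client.chat.completions.create(model="o3-mini", messages=history)
            patch = resp.choices[0].message.content
            subprocess.run(["git", "apply", "-"], input=patch, text=True)  # apply proposed diff
            passed, output = run_tests()
            if passed:
                return True
            # Feed the failure back and iterate; this loop is the part Agentless doesn't do.
            history += [{"role": "assistant", "content": patch},
                        {"role": "user", "content": f"Tests still fail:\n{output}\nRevise the diff."}]
        return False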


Maybe they found a need to quantize it further for release, or lobotomise it with more "alignment".


> lobotomise

Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Why do people try to meme as if AI is different? It has unexpected outputs sometimes; getting it to not do that is 50% "more alignment" and 50% "hallucinate less".

Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.

And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things that might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...


If somebody wants their Amazon bot to role play as an erotic furry, that’s up to them, right? Who cares. It is working as intended if it keeps them going back to the site and buying things I guess.

I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.


It's still a bug. Not really working as intended — it doesn't sell anything from that.

A very funny bug, but a bug nonetheless.

And given this was shared via screenshots, it was done for a laugh.


Who determines who gets access to what information? The OpenAI board? Sam? What qualifies as dangerous information? Maybe it’s dangerous to allow the model to answer questions about a person. What happens when limiting information becomes a service you can sell? For the right price anything can become too dangerous for the average person to know about.


> What qualifies as dangerous information?

The reports are public, and if you don't feel like reading them because they're too long and thorough in their explanations of what and why, you can always put them into an AI and ask it to summarise them for you.

OpenAI is allowed to unilaterally limit the capability of their own models, just like any other software company can unilaterally limit the performance of their own software.

And they still are even when they're just blatantly wrong or even just lazy — it's not like people complain about Google "lobotomising" their web browsers for no longer supporting Flash or Java applets.


They are implying the release was rushed and they had to reduce the functionality of the model in order to make sure it did not teach people how to make dirty bombs


The problem is that they don't make the LLM better at instruction following, they just make it unable to produce furry erotica even if Amazon wants it to.


> Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Isn’t that exactly what VCs want?


I doubt it.

The advice I've always been given in (admittedly: small) business startup sessions was "focus on quality rather than price because someone will always undercut you on price".

The models are in a constant race on both price and quality, but right now they're so cheap that paying for the best makes sense for any "creative" task (like writing software, even if only to reduce the number of bugs the human code reviewer needs to fix), while price sensitivity only matters for the grunt work classification tasks (such as "based on comments, what is the public response to this policy?")


Or the number was never real to begin with.


I just had it convert Swift code to Kotlin and was surprised at how the comment was translated. It "knew" the author of the paper and what it was doing!? That is wild.

Swift:

        //
        // Double Reflection Algorithm from Table I (page 7)
        // in Section 4 of https://tinyurl.com/yft2674p
        //
        for i in 1 ..< N {
            let X1 = spine[i]
            ...
Kotlin:

        // Use the Double Reflection Algorithm (from Wang et al.) to compute subsequent frames.
        for (i in 1 until N) {
            val X1 = Vector3f(spine[i])
            ...


Wild that it can do that, but also clearly a worse output. The original has a URL, section, page, and table listed. The AI version instead cites the author. Needing to notice and fix unhelpful tweaks is one of the burdens of LLMs.


Well, of course it knew the author. I'm sure you can ask just about any LLM who the author of the DRA is and it will answer Wang et al. without even having to google or follow the tinyurl link. And certainly it would also know that the algorithm is supposed to compute rotation minimizing frames.


Also, the fact that it changed the comment this way... I was not expecting that.


I don't know if you are being sarcastic or not, but that is pretty cool if true.


Not sarcastic at all. it just doesn't seem like a big deal if you have played with LLMs and realize just how much LLMs know. The double reflection paper is not particularly obscure. (Incidentally I just asked Claude a couple of weeks ago about implementing rotation-minimizing frames!)


Wow, haven't seen a viglink in a while.


Not sure I meant that to happen. I am sure I just went to tinyurl to make the link -- I don't know how it ended up this way.


Someone else has written this exact code on the internet, OpenAI stole it, and now chatgpt is regurgitating it. Just like it can regurgitate whole articles.

You need to stop being wow'd by human intelligence masquerading as AI!


Where? Searching for "Use the Double Reflection Algorithm (from Wang et al.)" only returns this post.


Unlikely. I wrote this myself. Of course it is in a private github repo.


you are so confused it’s not even funny


It looks like a pretty significant increase on SWE-Bench. Although that makes me wonder if there was some formatting or gotcha that was holding the results back before.

If this will work for your use case then it could be a huge discount versus o1. Worth trying again if o1-mini couldn't handle the task before. $4/million output tokens versus $60.

https://platform.openai.com/docs/pricing

I am Tier 5 but I don't believe I have access to it in the API (at least it's not on the limits page and I haven't received an email). It says "rolling out to select Tier 3-5 customers" which means I will have to wait around and just be lucky I guess.


Tier 3 here and already see it on Limits page, so maybe the wait won't be long.


Yep, I got an email about o3-mini in the API an hour ago.


I apparently got one at the same time too, but I missed it distracted by this HN thread :). Not only I got o3-mini (which I already noticed on the Limits page), but they also gave me access to o1 now! I'm Tier 3; until yesterday, o1 was still Tier 5 (IIRC).

Thanks OpenAI! Nice gift and a neat distraction from DeepSeek-R1 - which I still can't use directly, because their API stopped working moments after I topped up my credits and generated an API key, and is still down for me... :/.


Genuinely curious, what made you choose OpenAI as your preferred API provider? It's always been the least attractive to me.


I have mainly been using Claude 3.5/3.6 Sonnet via API in the last several months (or since 3.5 Sonnet came out). However, I was using o1 for a challenging task at one point, but last I tested it had issues with some extra backslashes for that application.

I also have tested with DeepSeek R1 and will test some more with that although in a way Claude 3.6 with CoT is pretty good. Last time I tried to test R1 their API was out.


We extensively used the batch APIs to decrease cost and handle large amounts of data. I also need JSON responses for a lot of things, and OpenAI seems to have the best JSON schema output option out there.
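
For anyone curious what that looks like, this is roughly the Structured Outputs call shape with the official SDK; the schema itself is a made-up example:

    from openai import OpenAI

    client = OpenAI()

    schema = {
        "name": "invoice_extraction",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
                "currency": {"type": "string"},
            },
            "required": ["vendor", "total", "currency"],
            "additionalProperties": False,
        },
    }

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "ACME Corp billed us 1,200.50 EUR for Q3 hosting."}],
        response_format={"type": "json_schema", "json_schema": schema},
    )
    print(resp.choices[0].message.content)  # JSON matching the schema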


I use it because my company bought the enterprise plan and trained a bunch of "specialist GPTs" that were fed internal documents, code bases, Slack threads etc. I am not aware of other good LLM companies providing the same level of integration.


Who else might be a good choice? DeepSeek is down. Who has the cheapest GPT-3.5-level or above API?


I've personally been using DeepSeek (which has been better than GPT-3.5 for a really long time), and Perplexity, which is nice for its built-in search. I've actually been using DeepSeek since it was free. It's been generally good for me. I've mostly chosen both because of pricing, as I generally don't use APIs for extremely complex prompts.


Run it locally, the distilled smaller ones aren't bad at all.


Until recently they were the only game in town, so maybe they accrued significant spend back then?


Tier 5 and I got it almost instantly


Can't wait to try this. What's amazing to me is that when this was revealed just one short month ago, the AI landscape looked very different than it does today with more AI companies jumping into the fray with very compelling models. I wonder how the AI shift has affected this release internally, future releases and their mindset moving forward... How does the efficiency change, the scope of their models, etc.


I thought it was o3 that was released one month ago and received high scores on ARC Prize - https://arcprize.org/blog/oai-o3-pub-breakthrough

If they were the same, I would have expected explicit references to o3 in the system card and how o3-mini is distilled or built from o3 - https://cdn.openai.com/o3-mini-system-card.pdf - but there are no references.

Excited at the pace all the same. Excited to dig in. The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred.


Yeah - the naming is confusing. We're seeing o3-mini. o3 yields marginally better performance given exponentially more compute. Unlike OpenAI, customers will not have an option to throw an endless amount of money at specific tasks/prompts.


There's no moat, and they have to work even harder.

Competition is good.


I really don't think this is true. OpenAI has no moat because they have nothing unique; they're mostly using other people's architectures (like Transformers) and other companies' hardware.

Their value-prop (moat) is that they've burnt more money than everybody else. That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficently.

OpenAI isn't the only company. The tech companies being massively outspent by Microsoft on H100 purchases are the ones with a moat. Google / Amazon with their custom AI chips are going to have better performance per cost than others, and that will be a moat. If you want to get the same performance per cost then you need to spend years of effort making your own chips (=moat).


> That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficently.

Google with all its money and smart engineers was not able to build a simple chat application.


But with their internal progression structure they can build and cancel eight mediocre chat apps.


What do you mean? Gemini app is available on IOS, Android and on the web (as AI Studio https://aistudio.google.com/).


It's a joke about how Google has released/cancelled/renamed many messenging apps.


It is not very good though.


Gemini is pretty good. And it does one thing way better than most other AI models: when I hold down my phone's home button, it's available right away.


That's a shame on Google, Apple, Samsung, etc. Voice and other activation methods should be open to any app that claims to be an assistant. An ugly way of "gatekeeping".


"OpenAI has no moat because they have nothing unique"

It seems they have high-quality training data. And the knowledge to work with it.


They buy most of their data from Scale AI types. It's not any higher quality than is available to any other model farm


> That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficently.

DeepSeek has proven that the latter is possible, which drops a couple of River crossing rocks into the moat.


The fact that I can basically run o1-mini with deepseek:8b, locally, is amazing. Even on battery power, it works acceptably.


Those models are not comparable


hmmm... check the deepseek-r1 repo readme :) They compare them there, but it would be nice to have external benchmarks.


When you want to use AI in business, you need some guarantees that the integration will not break because the AI company goes down or because of some breaking changes in a year. There is a reason why MSFT is in business. Similarly, you will not buy Google because they do not like keeping products around forever, and you will not buy some unknown product just because it is 5% cheaper. OpenAI has a strong brand at the moment and this is their thing, until companies go to MSFT or AMZ to use their services with the ability to choose any model.


Brand is a moat


Ask Jeeves and Altavista surely have something to say about that!


Add Yahoo! to that list


Their brand is as tainted as Meta's, which was bad enough to merit a rebranding from Facebook.


> OpenAI has no moat

... is definitely something I've said before, and recently, but:

> That moat is trivially circumvented by lighting a larger pile of money

If that was true, someone would have done it.


Capex was the theoretical moat, same as TSMC and similar businesses. DeepSeek poked a hole in this theory. OpenAI will need to deliver massive improvements to justify a 1 billion dollar training cost relative to 5 million dollars.


I don't know if you are, but a lot of people are still comparing one Deepseek training run to the entire costs of OpenAI.

The DeepSeek paper states that the $5 million number doesn't include development costs, only the final training run. And it doesn't include the estimated $1.4 billion cost of the infrastructure/chips DeepSeek owns.

Most of OpenAI's billion-dollar costs are in inference, not training. It takes a lot of compute to serve so many users.

Dario said recently that Claude was in the tens of millions (and that it was a year earlier, so some cost decline is expected); do we have some reason to think OpenAI was so vastly different?


Anthropic's CEO was predicting billion-dollar training runs for 2025. Current training runs were likely in the tens/hundreds of millions of dollars USD.

Inference capex costs are not a defensive moat, as I can rent GPUs and sell inference with linearly scaling costs. A hypothetical 10 billion dollar training run on proprietary data would be a massive moat.

https://www.itpro.com/technology/artificial-intelligence/dol...


It is still curious, though, what exactly is being automated.

I find huge value in these models as an augmentation of my intelligence and as a kind of cybernetic partner.

I can't think of anything that can actually be automated though in terms of white collar jobs.

The white-collar model test case I have in mind is a bank analyst under a bank operations manager. I have done both in the past, but there is something really lacking with the idea of the operations manager replacing the analyst with a reasoning model, even though right now DeepSeek annihilates the reasoning of every bank analyst I ever worked with.

If you can't even arbitrage the average bank analyst, there might be these really non-intuitive no-AI-arbitrage conditions with white-collar work.


I don’t want to pretend I know how bank analysts work, but at the very least I would assume that 4 bank analysts with reasoning models would outperform 5 bank analysts without.


Collaboration is even better, per open source results.

It is the closed competition model that’s being left in the dust.


Well, o3-mini-high just successfully found the root cause of a seg fault that o1 missed: mistakenly using _mm512_store_si512 for an unaligned store that should have been _mm512_storeu_si512.


How do I avoid the angst about this stuff as a student in computer science? I love this field but frankly I've been at a loss since the rapid development of these models.


LLMs are the new compilers.

As a student, you should continue to focus on fundamentals, but also adapt LLMs into your workflow where you can.

Skip writing the assembly (now curly braces and semicolons), and focus on what the software you’re building actually does, who it serves, and how it works.

Programming is both changing a lot, and not at all. The mechanics may look different, but the purpose is still the same: effectively telling computers what to do.


LLMs are actually the new computers. Compilation is only one program they can run.


LLMs are the way computers were always supposed to work!


> LLMs are the new compilers.

This shows a grave misunderstanding of what compilers and LLMs are. They're fundamentally opposite concepts.

Compilers are about optimizing abstract code down to the most efficient representation possible for some hardware. LLMs are about wasting petaflops (made possible by compiler engineers) to produce random statements that don't have any static guarantees.


How can you trust that the compiler has written the most efficient assembly, if you’re not double checking it by hand?

Jokes aside, I understand your point.

In the history of computing, LLMs and compilers are closer than one might think.

Compilers weren’t first created to optimize “abstract code down to the most efficient” assembly possible, even if that is the goal of a compiler writer today.

Compilers were created to enable the use of higher-level languages. Abstraction, efficiency, portability, error reduction, and most importantly: saving time.

They allowed humans to create more software, faster.


- a coping Software engineer


As a former prof: what you should be learning from any STEM degree (and many other degrees as well) is to think clearly, rigorously, creatively, and with discipline, etc. You also need to learn the skill of learning content and skills quickly.

The specific contents or skills of your degree don't matter that much. In pretty much any STEM field, over the last 100ish years, whatever you learned in your undergraduate was mostly irrelevant by the time you retired.

Everyone got by, by staying on top of the new developments in the field and doing them. With AI, the particular skills needed to use the power of computers to do things in the world have changed. Just learn those skills.


It's either over, or giving a lot of idiots false confidence — I meet people somewhat regularly who believe they don't really need to know what they're doing any more. This is probably an arbitrage.


There are at least two things here.

One, about the field itself. So far, I have been a know-it-all, and I dabbled in management too, besides that. This worked for me, because no matter how the field and my opportunities shifted, I always had a card up my sleeve. This is highly personal though.

Two, about managing angst. Whatever you experience now, you will in the future too. Circumstances won't matter at all, your brain will convert whatever it perceives around you, into these feelings that you generally experience. You can be at your highest high, and the lowest low, and you will always gravitate back towards these familiar feelings of yours. So, what you can do to have a nicer experience is to be a good partner yourself, and learn how to live with these specific feelings that you have.


For all the value that they bring, there is still a good dose of parlour tricks and toy examples around, and they need an intelligent guiding hand to get the best out of them. As a meat brain, you can bring big picture design skills that the bots don't have, keeping them on track to deliver a coherent codebase, and fixing the inevitable hallucinations. Think of it like having a team of optimistic code monkeys with terrible memory, and you as the coordinator. I would focus on building skills in things like software design/architecture, requirements gathering (what do people want and how do you design software to deliver it?), in-depth hardware knowledge (how to get the best out of your platform), good API design, debugging, etc. Leave the CRUD to the robots and be the brain.


You can ask them this question and all your fears will be washed away, for now...

"Here's a riddle for you - a surgeon, who is the boy's father says, "I cannot operate on this boy, he's my son!" Who is the surgeon to the boy?"

But seriously - AI in the hands of someone well-educated in their field is going to be a lot more powerful than some random person. Knowledge is still going to be valuable, and there are still people out there who don't know how to Google things and figure things out for themselves - so there'll be plenty of people who don't realise the potential of LLMs and won't use them.


Angst?

It just means you're less likely to be fixing someone else's "mistakenly using _mm512_store_si512 instead of _mm512_storeu_si512" error, because AI fixed it for you, and you can focus on other parts of computer science. Computer science surely isn't just fixing _mm512_store_si512.


The cost of developing software is quickly dropping thanks to these models, and the demand for software is about to go way up because of this. LLMs will just be power tools to software builders. Learn to pop up a level.


why is this impressive at all? It effectively amounts to correcting a typo.


a typo that OP missed, with all our modern tooling, and also a typo that presumably the other models missed.


It found a needle in a haystack


rip development jobs /s.. or not /s


The output of LLMs will be suboptimal for some time. Companies stand to earn more by hiring juniors who can reason without LLMs than juniors who copy paste from LLMs


I have to admit I'm kind of surprised by the SWE-bench results. At the highest level of performance, o3-mini's CodeForces score is, well, high. I've honestly never really sat down to understand how Elo works; all I know is that it scored better than o1, which is allegedly better than ~90% of all competitors on CodeForces. So, you know, o3-mini is pretty good at CodeForces.

But its SWE-bench scores aren't meaningfully better than Claude's: 49.3 vs Claude's 49.0 on the public leaderboard (might be higher now due to recent updates?)

My immediate thoughts: CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks. Besides that, for all the work put into OpenAI's most recent model, it still has a hard time living up to an LLM initially released by Anthropic some time ago, at least according to this benchmark.

Mind you, the GitHub issues that the problems in SWE-bench were based off have been around long enough that it's pretty much a given that they've all found their way into the training data of most modern LLMs, so I'm really surprised that o3 isn't meaningfully better than Sonnet.


I'm not that surprised. Codeforces needs a relatively small amount of vocabulary knowledge.

Real software has a lot more complexity and constraints, as well as ambiguity. Claude scores nowhere near as high on Codeforces, but crushes o1 on WebArena: https://web.lmarena.ai/leaderboard

I also ran a hold-out test myself for o3-mini - asked it to implement a function I need for Python 2.5. Claude and o1 get it - o3-mini-high incorrectly believes some functions are available in this version that aren't. If I correct it, its revised solution is very hacky (technically works, but I would take Claude's solution over it)


> My immediate thoughts, CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering task

Yep. A general software engineering task has a lot of information encoded in it that is either already known to a human or is contextually understood by a human.

A competitive programming task often has to provide all the context as it's not based off an existing product or codebase or technology or paradigm known to the user


I think the innovation here is probably that it's a much smaller and so cheaper model to run.


Yeah, I no longer consider the SWE-bench useful because these models can just "memorize" the solutions to the PRs.


The interesting question to me is how far these reasoning models can be scaled. With another 12 months of compute scaling (for synthetic data generation and RL) how good will these models be at coding? I talked with Finbarr Timbers (ex-DeepMind) yesterday about this and his take is that we'll hit diminishing returns – not because we can't make models more powerful, but because we're approaching diminishing returns in areas that matter to users and that AI models may be nearing a plateau where capability gains matter less than UX.


I think in a lot of ways we are already there. Users are clearly already having difficulty seeing which model is better or if new models are improving over old models. People go back to the same gotcha questions and get different answers based on the random seed. Even the benchmarks are getting very saturated.

These models already do an excellent job with your homework, your corporate PowerPoints and your idle questions. At some point only experts would be able to decide if one response was really better than another.

Our biggest challenge is going to be finding problem domains with low performance that we can still scale up to human performance. And those will be so niche that no one will care.

Agents on the other hand still have a lot of potential. If you can get a model to stay on task with long context and remain grounded then you can start firing your staff.


Don't underestimate how much the long tail means to the general public.


Sure as a clock, tick follows tock. Can't imagine trying to build out cost structures, business plans, product launches etc on such rapidly shifting sands. Good that you get more for your money, I suppose. But I get the feeling no model or provider is worth committing to in any serious way.


this is the best outcome, though, rather than a monopoly, which is exactly what everyone is hoping to have.


Terrible time to open a shovel store, amazing time to pick up a shovel.


After o3 was announced, with the numbers suggesting it was a major breakthrough, I have to say I’m absolutely not impressed with this version.

I think o1 works significantly better, and that makes me think the timing is more than just a coincidence.

Last week Nvidia lost 600 billion because of DeepSeek R1, and now OpenAI comes out with a new release which feels like it has nothing to do with the promises that were being made about o3.


Having tried using it, it is much worse than r1. Both the standard and high effort version.


If it’s actually available, it can’t be that much worse than R1 which currently only completes a response about 50% of the time for me.


There are multiple providers for it since it's open source.


Are there any providers that have a chat interface (not just API access) with a fixed monthly cost? I couldn't find one.


you.com, when you disable their "search the internet" feature.


Yea, o3-mini was a massive step down from Sonnet for coding tasks.

R1 is my cost effective programmer. Sonnet is my hard problem model still.


R1 is interesting.

Since I have access to the thinking tokens, I can see where it's going wrong and do prompt surgery. But left to its own devices it gets things _stupendously_ wrong about 20% of the time, with a huge context blowout. So much so that seeing that happen now tells me I've fundamentally asked the wrong question.

Sonnet doesn't suffer from that and solves the task, but doesn't give you much, if any, help in how to recover from doing the wrong task.

I'd say that for work work, Sonnet 3.5 is still the best; for exploratory work with a human in the loop, R1 is better.

Or as someone posted here a few days ago: R1 as the architect, Sonnet3.5 as the worker and critic.


This is the mini version, which is not as good as o1 and which I don't think they demoed in the o3 announcement. I'm hoping the full release will be impressive.


I know this isn't the full o3 release, but I find it odd that they're branding it as o3 when it feels more like an update to o1 mini.

Yes, reasoning has improved, but the overall results haven't advanced as much as one would expect from a major version update. It's highly unusual for OpenAI to release a milestone version like this - it feels more like a marketing move than a genuine upgrade.

Who knows what's going on behind closed doors?

If I put on my tinfoil hat for a moment, maybe Nvidia made a deal with OpenAI - offering a discount on computing power in exchange for a timely release.

OpenAI needs an enormous amount of computing power these days, and while Nvidia would take a financial hit by offering a discount to one of its biggest (if not the biggest) customers, that's still nowhere near as costly as losing 600 billion.


This release was planned well before the Deepseek hype. When they demoed o3, they said they were planning on releasing o3-mini at the end of January.


o3 is absolutely fantastic compared to R1; what are these people talking about?


This took 1:53 in o3-mini

https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360...

The 4o model without using the Python tool

https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0...

Trying to get accurate results with the paid version of 4o with the Python interpreter.

https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b87...

The share link doesn’t show the output for some reason. But it did work correctly. I don’t know whether the ages are correct. I was testing whether it could handle ordering

I have no idea what conclusion I should draw from this, besides that, depending on the use case, 4o may be better with "tools" if you know the domain where you are using it.

Tools are relatively easy to implement with LangChain or the native OpenAI SDK.
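
A minimal sketch of a tool definition with the native SDK, assuming a hypothetical sort_records function on your side; the model decides whether to call it, you execute it, and you send the result back in a follow-up "tool" message:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical tool: let the model delegate exact ordering to code you control.
    tools = [{
        "type": "function",
        "function": {
            "name": "sort_records",
            "description": "Sort a list of records by a numeric field, descending.",
            "parameters": {
                "type": "object",
                "properties": {
                    "records": {"type": "array", "items": {"type": "object"}},
                    "field": {"type": "string"},
                },
                "required": ["records", "field"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Order these people by age: Ann 52, Bo 49, Cy 55."}],
        tools=tools,
    )
    msg = resp.choices[0].message
    print(msg.tool_calls or msg.content)  # either a tool call to fulfil, or a direct answer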


I would not expect any LLM to get this right. I think people have too high expectations for it.

Now if you asked it to write a Python program to list them in order, and have it enter all the names, birthdays, and year elected in a list to get the program to run - that's more reasonable.


The “o” models get the order right.

DeepSeek also gets the order right.

It doesn’t show on the share link. But it actually outputs the list correctly from the built in Python interpreter.

For some things, ChatGPT 4o will automatically use its Python runtime


That some models get it right is irrelevant. In general, if your instructions require computation, it's safer to assume it won't get it right and will hallucinate.


The reasoning models all do pretty good at math.

Have you tried them?

This is something I threw together with o3-mini

https://chatgpt.com/share/679d5305-5f04-8010-b5c4-61c31e79b2...

ChatGPT 4o doesn’t even try to do the math internally and uses its built in Python interpreter. (The [_>] link is to the Python code)

https://chatgpt.com/share/679d54fe-0104-8010-8f1e-9796a08cf9...

DeepSeek handles the same problem just as well using the reasoning technique.

Of course ChatGPT 4o went completely off the rails without using its Python interpreter

https://chatgpt.com/share/679d5692-96a0-8010-8624-b1eb091270...

(The break down that it got right was using Python even though I told it not to)


The 4o model's output is blatantly wrong. I'm not going to look up if it's the order or the ages that are incorrect, but:

36. Abraham Lincoln – 52 years, 20 days (1861)

37. James Garfield – 49 years, 105 days (1881)

38. Lyndon B. Johnson – 55 years, 87 days (1963)

Basically everything after #15 in the list is scrambled.


That was the point. The 4o model without using Python was wrong. The o3 model worked correctly without needing an external tool


Haven't used openai in a bit -- whyyy did they change "system" role (now basically an industry-wide standard) to "developer"? That seems pointlessly disruptive.


2 years ago I'd say it's an oversight, because there's 0 chance a top down directive would ask for this.

But given how OpenAI employees act online these days I wouldn't be surprised if someone on the ground proposed it as a way to screw with all the 3rd parties who are using OpenAI compatible endpoints or even use OpenAI's SDK in their official docs in some cases.


They mention in the model card that it's so they can have a separate "system" role that the user can't change, and they trained the model to prioritise it over the "developer" role, to combat "jailbreaks". Thank God for DeepSeek.
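
In practice the change is just the role string on the first message for o-series models; a minimal sketch:

    from openai import OpenAI

    client = OpenAI()

    # What used to be the "system" message is sent as "developer" for o-series models;
    # OpenAI's own platform-level instructions now sit above it in the hierarchy.
    resp = client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "developer", "content": "You are a terse assistant. Answer in one sentence."},
            {"role": "user", "content": "Why was the system role renamed?"},
        ],
    )
    print(resp.choices[0].message.content)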


They should have just created something above system and left as it was.


Agreed, just add root and call it a day. Everyone who needs to care can instantly guesstimate what it is.


> Testers preferred o3-mini's responses to o1-mini 56% of the time

I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.

They're both 2,000 words, and I asked a question because I have something to do. I'm not reading them both; I'm usually just selecting the one that answered first.

That prompt is pointless. Perhaps as evidenced by the essentially 50% response rate: it's a coin-flip.


It's kind of strange that they gave that stat. Maybe they thought people would somehow think about "56% better" or something.

Because when you think about it, it really is quite damning. Minus statistical noise it's no better.


And another way to rephrase it is that almost half of the users prefer the older model, which is terrible PR.


Not if the goal is to claim that the models deliver comparable quality, but with the new one excelling at something else (here: inferrence cost).


its mini to mini, its the same cost


Typically in these tests you have three options "A is better", "B is better" or "they're equal/can't decide". So if 56% prefer O3 Mini, it's likely that way less than half prefer O1.also, the way I understand it, they're comparing a mini model with a large one.


If you use ChatGPT, it sometimes gives you two versions of its response, and you have to choose one or the other if you want to continue prompting. Sure, not picking a response might be a third category. But if that's how they were approaching the analysis, they could have put out a more favorable-looking stat.


> If you use ChatGPT, it sometimes gives you two versions

Does no one else hate it when this happens (especially when on a handheld device)?


That would be 12%, why would you assume that is eaten by statistical noise?


The OPs comment is probably a testament of that. With such a poorly designed A/B test I doubt this has a p-value of < 0.10.


Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5 with a p-value of 0.00001864, well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected but 1000 doesn't seem crazy at all. Assuming of course that raters are blind to which model is which and the order of the 2 responses is randomized with every rating -- or, is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?
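
If you want to check that kind of arithmetic yourself, an exact binomial test is a one-liner; the n below is the hypothetical 1000 ratings from this comment, since OpenAI hasn't published the real sample size:

    from scipy.stats import binomtest

    n = 1000               # hypothetical number of ratings
    k = int(0.56 * n)      # 560 preferences for o3-mini
    print(binomtest(k, n, p=0.5, alternative="greater").pvalue)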


  > If so, where do they indicate they failed to randomize/blind the raters?

  Win rate if user is under time constraint
This is hard to read tbh. Is it STEM? Non-STEM? If it is STEM then this shows there is a bias. If it is Non-STEM then this shows a bias. If it is a mix, well we can't know anything without understanding the split.

Note that Non-STEM is still within error. STEM is less than 2 sigma variance, so our confidence still shouldn't be that high.


Because you're not testing "will a user click the left or right button" (for which asking a thousand users to click a button would be a pretty good estimation), you're testing "which response is preferred".

If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.


Yes, I am assuming they evaluated the models in good faith, understand how to design a basic user study, and therefore when they ran a study intended to compare the response quality between two different models, they showed the raters both fully-formed responses at the same time, regardless of the actual latency of each model.


I would recommend you read the comment that started this thread then, because that's the context we're talking about: https://news.ycombinator.com/item?id=42891294


I did read that comment. I don't think that person is saying they were part of the study that OpenAI used to evaluate the models. They would probably know if they had gotten paid to evaluate LLM responses.

But I'm glad you pointed that out; I now suspect that a large part of the disagreement between the "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" and the "oh, this was obviously a terrible study" repliers is due to different interpretations of that post. Thanks. I genuinely didn't consider the alternative interpretation before.


> If 10% of people just click based on how fast the response was

Couldn't this be considered a form of preference?

Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.


Sure, it could be, you can define "preference" as basically anything, but it just loses its meaning if you do that. I think most people would think "56% prefer this product" means "when well-informed, 56% of users would rather have this product than the other".


They even include error bars. It doesn't seem to be statistical noise, but it's still not great.


It’s 3x cheaper and faster


Yeah. I immediately thought: I wonder if that 56% is in one or two categories and the rest are worse?


44% of the people prefer the existing model?


Each question falls into a different category (i.e. math, coding, story writing, etc.). Typically models are better at some categories and worse at others. Saying "56% of people preferred responses from o3-mini" makes me wonder if those 56% come only from certain categories and the model isn't uniformly 56% preferred.


With many people too lazy to read 2 walls of text, a lot of picks might be random.


Exactly, I was surprised as well.


Those prompts are so irritating and so frequent that I’ve taken to just quickly picking whichever one looks worse at a cursory glance. I’m paying them, they shouldn’t expect high quality work from me.


Have you considered the possibility that your feedback is used to choose what type of response to give to you specifically in the future?

I would not consider purposely giving inaccurate feedback for this reason alone.


I don't want a model that's customized to my preferences. My preferences and understanding changes all the time.

I want a single source model that's grounded in base truth. I'll let the model know how to structure it in my prompt.


You know there's no such thing as base truth here? You want to write something like this to start your prompts: "Respond in English, using standard capitalization and punctuation, following the rules of grammar as written by Strunk & White, where numbers are represented using Arabic numerals in base-10 notation...."???


Actually, I might appreciate that.

I like precision of language, so maybe just have a system prompt that says "use precise language (ex: no symbolism of any kind)"


A lot of preferences have nothing to do with any truth. Do you like code segments or full code? Do you like paragraphs or bullet points? Heck, do you want English or Japanese?


What is base truth for e.g. creative writing?


Constant "meh" and fixing prompts in the right direction vs. being unable to escape the bubble.


I think my awareness that this may influence future responses has actually been detrimental to my response rate. The responses are often so similar that I can imagine preferring either in specific circumstances. While I’m sure that can be guided by the prompt, I’m often hesitant to click on a specific response as I can see the value of the other response in a different situation and I don’t want to bias the future responses. Maybe with more specific prompting this wouldn’t be such an issue, or maybe more of an understanding of how inter-chat personalisation is applied (maybe I’m missing some information on this too).


Alternatively, I'll use the tool that is most user friendly and provides the most value for my money.

Wasting time on an anti-pattern is not value, nor is trying to outguess how that selection mechanism is used.


Spotted the pissed off OpenAI RLHF engineer! Hahahahaha!


That's such a counter-productive and frankly dumb thing to do. Just don't vote on them.


You have to pick one to continue the chat.


I know for a fact that as of yesterday I did not have to pick one to continue the conversation. It just maximized the second choice and displayed a 2/2 below the response.


Why not always pick the one on the left, for example? I understand wanting to speed through and not spend time doing labor for OpenAI, but it seems counter-productive to spend any time feeding it false information.


My assumption is they measure the quality of user feedback, either on a per user basis or in an aggregate. I want them to interrupt me less, so I want them to either decide I’m a bad teacher or that users in general are bad teachers.


> I'm usually just selecting the one that answered first

Which is why you randomize the order. You aren’t a tester.

56% vs 44% may not be noise. That’s why we have p values. It depends on sample size.


The order doesn't matter. They often generate tokens at different speeds, and produce different lengths of text. "The one that answered first" != "The first option"


The article says "expert testers."

"Evaluations by expert testers showed that o3-mini produces more accurate and clearer answers, with stronger reasoning abilities, than OpenAI o1-mini. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions. W"


Those are two different sentences. The second sentence doesn't refer to experts explicitly.


That makes the result stronger, though. Even though many people click randomly, there is still a 12% margin between the two groups. Not a world of difference, but still quite a lot.


Funny - I had ChatGPT document some stuff for me this week and asked which responses I preferred as well.

Didn’t bother reading either of them, just selected one and went on with my day.

If it were me I would have set up a “hey do you mind if we give you two results and you can pick your favorite?” prompt to weed out people like me.


I'm surprised how many people claim to do this. You can just not select one.


I think it’s somewhat natural and am not personally surprised. It’s easy to quickly select an option, that has no consequence, compared to actively considering that not selecting something is an option. Not selecting something feels more like actively participating than just checking a box and moving on. /shrug


We -- the people who live in front of a computer -- have been training ourselves to avoid noticing annoyances like captchas, advertising, and GDPR notices for quite a long time.

We find what appears to be the easiest combination "Fuck off, go away" buttons and use them without a moment of actual consideration.

(This doesn't mean that it's actually the easiest method.)


I can't even believe how many times in a day I frustratedly think "whatever, go away!"


I wonder if they down-weight responses that come in too fast to be meaningful, or without sufficient scrolling.


That’s fine. Your random click would be balanced by someone else randomly clicking


Then 56% is even more impressive. Example: if 80% choose randomly and 20% choose carefully, that implies an 80% preference rate for o3-mini (0.8*0.2 + 0.5*0.8 = 0.56)
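
A minimal sketch of that arithmetic (the 80/20 split is a hypothetical, not a measured figure):

    # If a fraction r of raters click at random (50/50) and the rest choose
    # carefully with preference p, the blended rate is r*0.5 + (1-r)*p.
    r = 0.8          # hypothetical fraction of random clickers
    observed = 0.56  # reported overall preference for o3-mini
    p = (observed - r * 0.5) / (1 - r)
    print(round(p, 2))  # 0.8, i.e. an 80% preference among the careful raters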


Yes I'd bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate


I read the one on the left but choose the shorter one.

The interface wastes so much screen real estate already and the answers are usually overly verbose unless I've given explicit instructions on how to answer.


The default level of verbosity you get without explicitly prompting for it to be succinct makes me think there’s an office full of workers getting paid by the token.


In my experience the verbosity significantly improves output quality


Also, it's not clear if the preference comes from the quality of the 'meat' of the answer, or the way it reports its thinking and the speed with which it responds. With o1, I get a marked feeling of impatience waiting for it to spit something out, and the 'progress of thought' is in faint grey text I can't read. With o3, the 'progress of thought' comes quickly, with more to read, and is more engaging even if I don't actually get anything more than entertainment value.

I'm not going to say there's nothing substantive about o3 vs. o1, but I absolutely do not put it past Sam Altman to juice the stats every chance he gets.


They also pay contractors to do these evaluations with much more detailed metrics; no idea which one their number is based on, though.


Maybe we should take both answers, paste them into a new chat and ask for a summary amalgamation of them


This is just a way to prove, statistically, that one model is better than another as part of its validation. It's not collected from normal people using ChatGPT, you don't ever get shown two responses from different models at once.


Wait what? I get shown this with ChatGPT maybe 5% of the time


Those are both responses from the same model. It's not one response from o1 and another from o3.


People could be flipping a coin and the score would be the same.


A 12% margin is literally the opposite of a coin flip. Unless you have a really bad coin.


You're being downvoted for 3 reasons:

1) Coming off as a jerk, and from a new account is a bad look

2) "Literally the opposite of a coin flip" would probably be either 0% or 100%

3) Your reasoning doesn't stand up without further info; it entirely depends on the sample size. I could have 5 coin flips all come up heads, but over thousands or millions it averages to 50%. 56% on a small sample size is absolutely within margin of error/noise. 56% on a MASSIVE sample size is _statistically_ significant, but isn't even still that much to brag about for something that I feel like they probably intended to be a big step forward.


I'm a little puzzled by your response.

1. The message was net-upvoted. Whether there are downvotes in there I can't tell, but the final karma is positive. A similarly spirited message of mine in the same thread was quite well received as well.

2. I can't see how my message would come across as a jerk? I wrote 2 simple sentences, not using any offensive language, stating a mere fact of statistics. Is that being a jerk? And a long-winded berating of a new member of the community isn't?

3. A coin flip is 50%. Anything else is not, once you have a certain sample size. So, this was not. That was my statement. I don't know why you are building a strawman of 5 coin flips. 56% vs 44% is a margin of 12%, as I stated, and with a huge sample size, which they had, that's massive in a space where the returns are deep in "diminishing" territory.


I wasn't expecting my comment to be read so literally, but ok.

We're talking about the most cost-efficient model; the competition here is on price, not on a 12% incremental performance gain (which would make sense for the high-end model).

To my knowledge DeepSeek is the cheaper service, which is what matters on the low end (unless the increase in performance is of such magnitude that the extra charge would be worth the money).


What does deepseek have to do with a comparison between o1-mini and o3-mini?


I'm not sure I follow - your assertion was that 12% is significant.

I personally choose based on price for a low-cost model (unless the improvement is so significant that it justifies the higher price).


I don't think they make it clear: I wonder if they mean testers prefer o3 mini 56% of the time when they express an opinion, or overall? Some percentage of people don't choose; if that number is 10% and they aren't excluded, that means 56% of the time people prefer o3 mini, 34% of the time people prefer o1 mini, and 10% of the time people don't choose. I'm not sure I think it would be reasonable to present the data that way, but it seems possible.


This prompt is like "See Attendant" on the gas pump. I'm just going to use another AI instead for this chat.


Glad to know I’m not the only person who just drives to the next station when I see a “see attendant” message.


I almost always pick the second one, because it's closer to the submit button and the one I read first.


It seems like the first response must get chosen a majority of the time just to account for friction.


I too have questioned the approach of showing the long side-by-side answers from two different models.

1) sometimes I wanted the short answer, and so even though the long answer is better I picked the short one.

2) sometimes both contain code that is different enough that I am inclined to go with the one that is more similar to what I already had, even if the other approach seems a bit more solid.

3) Sometimes one will have less detail but more big picture awareness and the other will have excellent detail but miss some overarching point that is valuable. Depending on my mood I sometimes choose but it is annoying to have to do so because I am not allowed to say why I made the choice.

The area of human training methodology seems to be a big part of what got Deepseek's model so strong. I read the explanation of the test results as an acknowledgement by OpenAI of some weaknesses in its human feedback paradigm.

IMO the way it should work is that the thumbs up or down should be read in context by a reasoning being and a more in-depth training case should be developed that helps future models learn whatever insight the feedback should have triggered.

Feedback that A is better or worse than B is definitely not (in my view) sufficient except in cases where a response is a total dud. Usually the responses have different strengths and weaknesses and it's pretty subjective which one is better.


I enjoy it; I like getting two answers for free - often one of them is significantly better and probably the newer model.


RLUHF, U = useless.


You know you can configure default instructions to your prompts, right?

I have something like “always be terse and blunt with your answers.”


I ran some quick programming tasks that I have previously used o1 for:

1. 1/4th time for reasoning for most tasks.

2. Far better results.


Compared to o1 or o1-pro?


A few quick tasks look to me like o3-mini-high is 4-10x faster for 80% of the quality. It gives very good and sufficiently fast reasoning about coding tasks, but I think I'd ask o1-pro to do the task, i.e. provide the code. o3-mini-high can keep up with you at thinking/typing speed, whereas o1-pro can take several minutes. Just a quick view after playing for an hour.


Anyone else confused by inconsistency in performance numbers between this announcement and the concurrent system card? https://cdn.openai.com/o3-mini-system-card.pdf

For example-

GPQA diamond system card: o1-preview 0.68

GPQA diamond PR release: o1-preview 0.78

Also, how should we interpret the 3 different shading colors in the barplots (white, dotted, heavy dotted on top of white)...


Actually sounds like benchslop to me.


O3-mini solved this prompt. DeepSeek R1 had a mental breakdown. The prompt: “Bob is facing forward. To his left is Ann, to his right is Cathy. Ann and Cathy are facing backwards. Who is on Ann’s left?”
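
For reference, a minimal sketch of the intended spatial reasoning, modelling the three people on a 2D grid; it arrives at Bob as the person immediately on Ann's left:

    # Bob faces "forward" (+y); Ann stands on his left, Cathy on his right,
    # and both of them face backwards (-y). A person's "left" is their facing
    # vector rotated 90 degrees counter-clockwise.
    positions = {"Bob": (0, 0), "Ann": (-1, 0), "Cathy": (1, 0)}
    facing = {"Bob": (0, 1), "Ann": (0, -1), "Cathy": (0, -1)}

    def people_to_left_of(name):
        fx, fy = facing[name]
        lx, ly = -fy, fx  # rotate the facing vector 90 degrees CCW
        px, py = positions[name]
        offsets = {o: (x - px, y - py) for o, (x, y) in positions.items() if o != name}
        left = [o for o, (dx, dy) in offsets.items() if dx * lx + dy * ly > 0]
        return sorted(left, key=lambda o: abs(offsets[o][0]) + abs(offsets[o][1]))

    print(people_to_left_of("Ann"))  # ['Bob', 'Cathy'] -- Bob is directly on Ann's left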


R1 or R1-Distill? They are not the same thing. I think DeepSeek made a mistake releasing them at the same time and calling them all R1.

Full R1 solves this prompt easily for me.



Huh, that one got it wrong for me too. I don't have patience to try it 10 times each to see if it was a coincidence, but it is absolutely true that not all implementations of LLMs produce the same outputs. It is in fact common for subtle bugs to happen that cause the outputs to be worse but not catastrophically bad, and therefore go unnoticed. So I wouldn't trust any implementation but the original for benchmarking or even general use unless I tested it extensively.


Same. With the recommended settings, it got it right. I regenerated a bunch of times, and it did suggest Cathy once or twice.

R1 70b also got it right just as many times for me.


Really interesting puzzle. I tried the different Chatgpts and they all solved it. I tried Deepseek and Deepseek R1, and Deepseek did have this meltdown:

Let's break down the problem step by step to understand the relationships and positions of Bob, Ann, and Cathy. 1. Understanding the Initial Setup

    Bob is facing forward.

    This means Bob's front is oriented in a particular direction, which we'll consider as the reference point for "forward."

    To his left is Ann, to his right is Cathy.

    If Bob is facing forward, then:

        Ann is positioned to Bob's left.

        Cathy is positioned to Bob's right.

    Ann and Cathy are facing backwards.

    Both Ann and Cathy are oriented in the opposite direction to Bob. If Bob is facing forward, then Ann and Cathy are facing backward.
2. Visualizing the Positions

To better understand the scenario, let's visualize the positions: Copy

Forward Direction: ↑

Bob (facing forward) | | Ann (facing backward) | / | / | / | / | / | / | / |/ |

And then only the character | in a newline forever.


That's a fun, simple test! I tried a few models, and mistral-nemo gets it every time, even when run locally without any system prompt! https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instr...


Thanks for that example! Using 'ollama run' these models all failed: deepseek-r1:32b, mistral-small:latest, qwq:latest, and qwen2.5:14b.
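
For anyone repeating the comparison, here is a minimal sketch that loops the same prompt over those models via the `ollama run` CLI (it assumes the models listed above are already pulled):

    # Minimal sketch: send the same puzzle prompt to several local Ollama
    # models and print each model's answer for manual comparison.
    import subprocess

    PROMPT = ("Bob is facing forward. To his left is Ann, to his right is Cathy. "
              "Ann and Cathy are facing backwards. Who is on Ann's left?")
    MODELS = ["deepseek-r1:32b", "mistral-small:latest", "qwq:latest", "qwen2.5:14b"]

    for model in MODELS:
        out = subprocess.run(["ollama", "run", model, PROMPT],
                             capture_output=True, text=True, check=True)
        print(f"=== {model} ===\n{out.stdout.strip()}\n")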


https://chatgpt.com/share/67a32b47-18fc-8003-9e18-085f9fad72...

o3-mini-high failed at the solution for me.



Even 4o mini gets that


Deepseek solved it.


I wish they'd just reveal the CoT (like gemini and deepseek do), it's very helpful to see when the model gets misled by something in your prompt. Paying for tokens you aren't even allowed to see is peak OpenAI.


sama and OpenAI’s CPO Kevin Weil both suggested this is coming soon, as a direct response to DeepSeek, in an AMA a few hours ago: https://www.reddit.com/r/OpenAI/s/EElFfcU8ZO


> a much more helpful and detailed version of this

Notice the deliberate wording. To me this implies we aren't getting the raw CoT.


Do you have a direct link to that? My "force .old on everything" plugin is having problems resolving your url (sorry!).



I'm sorry, but it's over for OpenAI. Some have predicted this; including me back in November[1] when I wrote "o1 is a revolution in accounting, not capability", which, although tongue-in-cheek, has so far turned out to be correct. I'm only waiting to see what Google, Facebook et al. will accomplish now that the R1-Zero result is out of the bag. The nerve, the cheek of this hysterical o3-mini release, still insisting on hiding the CoT from the consumer, is telling us one thing and one thing alone: OpenAI is no longer able to adapt to the ever-changing landscape. Maybe the Chinese haven't beaten them yet, but Google, Facebook et al. absolutely will, & without having to resort to deception.

[1]: https://old.reddit.com/r/LocalLLaMA/comments/1gna0nr/popular...


You don't need to wait for Google. Their Jan 21 checkpoint for their fast reasoning model is available on AIStudio. It shows full reasoning traces. It's very good, much faster than R1, and although they haven't released pricing, based on flash it's going to be quite cheap.


Sure, their 01-21 reasoning model is really good, but there's no pricing for it!

I care mostly about batching in Vertex AI, which is 17-30x cheaper than the competition (whether you use prompt caching or not) while allowing for audio, video, and arbitrary document filetype inputs; unfortunately Gemini 1.5 Pro/Flash have remained the two so-called "stable" options that are available there. I can appreciate Google's experimental models all I want, but I cannot take them seriously until they allow me to have my sweet, sweet batches.


Oh, sweet: both o3-mini low and high support integrated web search. No integrated web search with o1.

I prefer, for philosophical reasons, open-weight and open-process/science models, but OpenAI has done a very good job at productizing ChatGPT. I also use their 4o-mini API because it is cheap and compares well to using open models on Groq Cloud. I really love running local models with Ollama, but the API vendors keep the price so low that I understand most people not wanting the hassle of running Deepseek-R, etc., locally.


I've been using cursor since it launched, sticking almost exclusively to claude-3.5-sonnet because it is incredibly consistent, and rarely loses the plot.

As subsequent models have been released, most of which claim to be better at coding, I've switched cursor to it to give them a try.

o1, o1-pro, deepseek-r1, and now o3-mini. All of these models suffer from the exact same "ADHD." As an example, in a NextJS app, if I do a composer prompt like "on page.tsx [15 LOC], using shadcn components wherever possible, update this page to have a better visual hierarchy."

sonnet nails it almost perfectly every time, but suffers from some date cutoff issues like thinking that shadcn-ui@latest is the repo name.

Every single other model, doesn't matter which, does the following: it starts writing (from scratch), radix-ui components. I will interrupt it and say "DO NOT use radix-ui, use shadcn!" -- it will respond with "ok!" then begin writing its own components from scratch, again not using shadcn.

This is still problematic with o3-mini.

I can't believe it's the models. It must be the instruction-set that cursor is giving it behind the scenes, right? No amount of .cursorrules, or other instruction, seems to get cursor "locked in" the way sonnet just seems to be naturally.

It sucks being stuck on the (now ancient) sonnet, but inexplicably, it remains the only viable coding option for me.

Has anyone found a workaround?


My experience with cursor and sonnet is that it is relatively good at first tries, but completely misses the plot during corrections.

"My attempt at solving the problem contains a test that fails? No problem, let me mock the function I'm testing, so that, rather than actually run, it returns the expected value!"

It keeps doing that kind of shenanigans, applying modifications that solve the newly appearing problem while screwing the original attempt's goal.

I usually get much better results from regular ChatGPT copying and pasting, the trouble being that it is a major pain to handle the context window manually by pasting relevant info and reminding it of what I think is being forgotten.


Claude makes a lot of crappy change suggestions, but when you ask "is that a good suggestion?" it's pretty good at judging when it isn't. So that's become standard operating procedure for me.

It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.


I'm always asking Claude to propose a variety of suggestions for the problem at hand along with their trade-offs, then to evaluate them, pick the top three proposals, and explain why. Then I'll pick one of them and further vet the idea.


>It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.

Absolutely, I find this a challenge as well. Every thought that crosses my mind is a great idea according to it. That's the opposite attitude to what I want from an engineer's copilot! Particularly from one who also advises junior devs.


> when you ask "is that a good suggestion?" it's pretty good at judging when it isn't

Basically a poor man's CoT.


Yes it’s usually worth it to try to write a really good first prompt


More than once I've found myself going down this 'little maze of twisty passages, all alike'. At some point I stop, collect up the chain of prompts in the conversation, and curate them into a net new prompt that should be a bit better. Usually I make better progress - at least for a while.


This becomes second nature after a while. I've developed an intuition about when a model loses the plot and when to start a new thread. I have a base prompt I keep for the current project I'm working on, and then I ask the model to summarize what we've done in the thread and combine them to start anew.

I can't wait until this is a solved problem because it does slow me down.


Yes when new models come out it feels like breaking up.

Why is it so hard to share/find prompts or distill my own damn prompts? There must be good solutions for this —


What do you find difficult about distilling your own prompts?

After any back and forth session I have reasonably good results asking something like "Given this workflow, how could I have prompted this better from the start to get the same results?"


Analysis of past chats in bulk.


Don’t outsource the only thing left for our brains to do themselves :/


For my advanced use case involving Python and knowledge of finance, Sonnet fared poorly. Contrary to what I am reading here, my favorite approach has been to use o1 in agent mode. It’s an absolute delight to work with. It is like I’m working with a capable peer, someone at my level.

Sadly there are some hard limits on o1 with Cursor and I cannot use it anymore. I do pay for their $20/month subscription.


> o1 in agent mode

How? It specifically tells me this is unsupported: "Agent composer is currently only supported using Anthropic models or GPT-4o, please reselect the model and try again."


I think you're right - I must have used it in regular mode, then got GPT-4o to fill in the gaps. It can fully automate a lot of menial work, such as refactors and writing tests. Though I'll add, I had roughly a 50% success rate with GPT-4o bug fixing in agent mode, which is pretty great in my experience. When it did work, it felt glorious - 100% hands-free operation!


It seems like you could use aider in architect mode. Basically, it will suggest the solution to your problem first and prompt you to start editing; you can say no to refine the solution and only start editing when you are satisfied with it.


Hah, I was trying it the other day in a Go project and it did exactly the same thing. I couldn’t believe my eyes, it basically rewrote all the functions back out in the test file but modified slightly so the thing that was failing wouldn’t even run.


I've had it do similar nonsense.

I just don't understand all the people who honestly believe AGI just requires more GPUs and data when these models are so inherently stupid.


Can't you select ChatGPT as the model in Cursor?


Yes, but for some reason it seems to perform worse there.

Perhaps whatever algorithms Cursor uses to prepare the context it feeds the model are a good fit for Claude but not so much for the others (?). It's a random guess, but whatever the reason, there's a weird worsening of performance vs pure chat.


Yes but every model besides claude-3.5-sonnet sucks in Cursor, for whatever reason. They might as well not even offer the other models. The other models, even "smarter" models, perform vastly poorer or don't support agent capability or both.


Not trying to be snarky, but the example prompt you provided is about 1/15th the length and detail of prompts I usually send when working with Cursor.

I tend to exhaustively detail what I want, including package names and versions because I've been to that movie before...


What works nice also is the text to speech. I find it easier and faster to give more context by talking rather than typing, and the extra content helps the AI to do its job.

And even though the speech recognition fails a lot on some of the technical terms or weirdly named packages, software, etc, it still does a good job overall (if I don’t feel like correcting the wrong stuff).

It’s great and has become somewhat of a party trick at work. Some people don’t even use AI to code that often, and when I show them “hey have you tried this?” And just tell the computer what I want? Most folks are blown away.


Does Cursor have text-to-speech functionality?


you mean speech to text right?


Not for me. I first ask Advanced Voice to read me some code and have Siri listen and email it to an API I wrote which uses Claude to estimate the best cloud provider to run that code based on its requirements and then a n8n script deploys it and send me the results via twilio.


Sorry! Yes, speech to text.


If I have to write a prompt that long, it'll be faster to just write the code.


Shocking to see this, because this was essentially the reason most of the previous no-code solutions never took off...


That sounds exhausting. Wouldn't it be faster to include your package.json in the context?

I sometimes do this (using Cline), plus create a .cline file at project root which I refine over time and which describes both the high level project overview, details of the stack I'm using, and technical details I want each prompt to follow.

Then each actual prompt can be quite short: read files x, y, and z, and make the following changes... where I keep the changes concise and logically connected - basically what I might do for a single pull request.


My point was that a prompt that simple could be held and executed very well by sonnet, but all other models (especially reasoning models) crash and burn.

It's a 15 line tsx file so context shouldn't be an issue.

Makes me wonder if reasoning models are really proper models for coding in existing codebases.


Your last point matches what I’ve seen some people (simonw?) say they’re doing currently: using aider to work with two models—one reasoning model as an architect, and one standard LLM as the actual coder. Surprisingly, the results seem pretty good vs. putting everything on one model.


This is probably the right way to think about it. O1-pro is an absolute monster when it comes to architecture. It is staggering the breadth and depth that it sees. Ask it to actually implement though, and it trips over its shoelaces almost immediately.


Can you give an example of this monstrous capability you speak of? What have you used it for professionally w.r.t. architecture?


The biggest delta over regular o1 that I've seen is asking it to make a PRD of an app that I define as a stream-of-consciousness with bullet points.

It's fantastic at finding needles in the haystack, so the contradictions are nonexistent. In other words, it seems to identify which objects would interrelate and builds around those nodes, where o1 seems to think more in "columns."

To sum it up, where o1 feels like "5 human minute thinking," o1-pro feels like "1 human hour thinking"


You’re basically saying you write 15x the prompt for the same result they get with sonnet.


Yes, this works well for me too; better to take your time and get the first prompt right.


I’ve coded in many languages over the years but reasonably new to the TS/JS/Next world.

I’ve found that if you give your prompts a kind of long-form “stream of consciousness”, where you outline snippets of code in markdown along with contextual notes and then summarise/outline at the end what you actually wish to achieve, you can get great results.

Think long-form, single-page “documentation”-type prompts that alternate between written copy/contextual intent/description and code blocks. Annotating code blocks with file names above the blocks I’m sure helps too. Don’t waste your context window on redundant/irrelevant information or code; stating that a code sample is abridged or adding commented ellipses seems to do the job.


By the time I've fully documented and explained what I want to be done, and then review the result, usually finding that it's worse than what I would have written myself, I end up questioning my instinct to even reach for this tool.

I like it for general refactoring and day to day small tasks, but anything that's relatively domain-specific, I just can't seem to get anything that's worth using.


Like most AI tools, great for beginners, time-savers for intermediate users, and frequently a waste of time in domains where you're an expert.

I've used Cursor for shipping better frontend slop, and it's great. I skip a lot of trial and error, but not all of it.


> and frequently a waste of time in domains where you're an expert.

I'm a domain expert and I disagree.

There's many scenarios where using LLMs pays off.

E.g. a long file or very long function is just that, and an LLM is faster at understanding it whole, not being limited in how many things it can track at once the way your mind is (between 4 and 6). It's still gonna be faster at refactoring it and testing it than you will be.


I agree that it's amazing as a learning tool. I think the "time to ramp" on a new technology or programming language has probably been cut in half or more.


ha! good to confirm! I tend to do this, just kind of as a double-check thing, but never sure if it actually worked or if it was a placebo, lol.

Or end with "from the user's perspective: all the "B" elements should light up in excitement when you click "C""


Going to try this! Thanks for the tip


We've been working on solving a lot of these issues with v0.dev (disclaimer: shadcn and I work on it). We do a lot of pre and post-processing to ensure LLMs output valid shadcn code.

We're also talking to the cursor/windsurf/zed folks on how we can improve Next.js and shadcn in the editors (maybe something like llms.txt?)


Thanks for all the work you do! v0 is magical. I absolutely love the feature where I can add a chunky component that v0 made to my repo with npx


So I think I finally understood recently why we have these divergent groups, with one thinking Claude 3.5 Sonnet is the best model for coding and another following the OpenAI SOTA of the moment. I have been a heavy user of ChatGPT, jumping on to Pro without even thinking for more than a second once it was released. Recently, though, I took a pause from my usual work on statistical modelling, heuristics, and other things in certain deep domains to focus on building client APIs and frontends, decided to again give Claude a try, and it is just so great to work with for this use case.

My hypothesis is that it's a difference of what you are doing. OpenAI's o-series models are much better than others at mathematical modelling and such tasks, and Claude is better for more general-purpose programming.


Have you used multi-agent chat sessions, with each fielding its own speciality, and seen if that improves your use cases (aka MoE)?


I have not. Any suggestions on which one(s) to explore to get started?


Context length possibly. Prompt adherence drops off with context, and anything above 20k tokens is pushing it. I get the best results by presenting the smallest amount of context possible, including removing comments and main methods and functions that it doesn't need to see. It's a bit more work (not that much if you have a script that does it for you), but the results are worth it. You could test in the chatgpt app (or lmarena direct chat) where you ask the same question but with minimal hand curated context, and see if it makes the same mistake.
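
For illustration, a minimal sketch of the kind of helper script mentioned above, for Python files (my own sketch, not the commenter's actual tooling; ast.unparse needs Python 3.9+):

    # Minimal sketch: shrink a Python source file for LLM context by dropping
    # comments (lost automatically in an AST round-trip) and docstrings.
    import ast
    import sys

    def strip_docstrings(tree: ast.AST) -> ast.AST:
        for node in ast.walk(tree):
            if isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                body = node.body
                if (body and isinstance(body[0], ast.Expr)
                        and isinstance(body[0].value, ast.Constant)
                        and isinstance(body[0].value.value, str)):
                    node.body = body[1:] or [ast.Pass()]  # keep the block syntactically valid
        return tree

    if __name__ == "__main__":
        source = open(sys.argv[1], encoding="utf-8").read()
        print(ast.unparse(strip_docstrings(ast.parse(source))))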


If it's a context issue, it's an issue with how cursor itself sends the context to these reasoning LLMs.

Context alone shouldn't be the reason that sonnet succeeds consistently, but others (some of which have even bigger context windows) fail.


Yes, that's what I'm suggesting. Cursor is spamming the models with too much context, which harms reasoning models more than it harms non-reasoning models (hypothesis, but one that aligns with my experience). That's why I recommended testing reasoning models outside of Cursor with a hand curated context.

The advertised context length being longer doesn't necessarily map 1:1 with the actual ability the models have to perform difficult tasks over that full context. See for example the plots of performance on ARC vs context length for o-series models.


I've found cursor to be too thin a wrapper. Aider is somehow significantly more functional. Try that.


Aider, with o1 or R1 as the architect and Claude 3.5 as the implementer, is so much better than anything you can accomplish with a single model. It's pretty amazing. Aider is at least one order of magnitude more effective for me than using the chat interface in Cursor. (I still use Cursor for quick edits and tab completions, to be clear).


I haven't tried aider in quite a while, what does it mean to use one model as an architect and another as the implementer?


Aider now has experimental support for using two models to complete each coding task:

- An Architect model is asked to describe how to solve the coding problem.

- An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.

Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars).

https://aider.chat/2024/09/26/architect.html


Probably gonna show a lot of ignorance here, but isn’t that a big part of the difference between our brains and AI? That instead of one system, we are many systems that are kind of sewn together? I secretly think AGI will just be a bunch of different specialized AIs working together.


You're in good company in that secret thought.

Have a look at this: https://en.wikipedia.org/wiki/Society_of_Mind


Efficient and effective organizations work this way, too: a CEO to plan in broad strokes, employees to implement that vision in specific ways, and managers to make sure their results match expectations.


I normally use aider by just typing in what I want and it magically does it. How do I use o1 or R1 to play the role of the "architect"?


You can start it with something like:

    aider --architect --model o1 --editor-model sonnet
Then you'll be in "architect" mode, which first prompts o1 to design the solution, then you can accept it and allow sonnet to actually create the diffs.

Most of the time your way works well—I use sonnet alone 90% of the time, but the architect mode is really great at getting it unstuck when it can't seem to implement what I want correctly, or keeps fixing its mistakes by making things worse.


I really want to see how apps created this way scale to large codebases. I’m very skeptical they don’t turn into spaghetti messes.

Coding is basically just about the most precise way to encapsulate a problem as a solution possible. Taking a loose English description and expanding it into piles of code is always going to be pretty leaky no matter how much these models spit out working code.

In my experience you have to pay a lot of attention to every single line these things write because they’ll often change stuff or more often make wrong assumptions that you didn’t articulate. And in my experience they never ask you questions unless you specifically prompt them to (and keep reminding them to), which means they are doing a hell of a lot of design and implementation that unless carefully looked over will ultimately be wrong.

It really reminds me a bit of when Ruby on Rails came out and the blogosphere was full of gushing “I’ve never been more productive in my life” posts. And then you find out they were basically writing a TODO app and their previous development experience was doing enterprise Java for some massive non-tech company. Of course RoR will be a breath of fresh air for those people.

Don’t get me wrong, I use Cursor as my daily driver, but I am starting to find the limits of what these things can do. And the idea of having two of these LLMs taking some paragraph-long feature description and somehow chatting with each other to create a scalable bit of code that fits into a large or growing codebase… well, I find that kind of impossible. Sure, the code compiles and conforms to whatever best practices are out there, but there will be absolutely no consistency across the app—especially at the UX level. These things simply cannot hold that kind of complexity in their head, and even if they could, part of a developer’s job is to translate loose English into code. And there is much, much, much, much more to that than simply writing code.


I see what you’re saying and I think that terming this “architect” mode has an implication that it’s more capable than it really is, but ultimately this two model pairing is mostly about combining disparate abilities to separate the “thinking” from the diff generation. It’s very effective in producing better results for a single prompt, but it’s not especially helpful for “architecting” a large scale app.

That said, in the hands of someone who is competent at assembling a large app, I think these tools can be incredibly powerful. I have a business helping companies figure out how/if to leverage AI and have built a bunch of different production LLM-backed applications using LLMs to write the code over the past year, and my impression is that there is very much something there. Taking it step by step, file by file, like you might if you wrote the code yourself, describing your concept of the abstractions, having a few files describing the overall architecture that you can add to the chat as needed—little details make a big difference in the results.


I use Cursor and Composer in agent mode on a daily basis, and this is basically exactly what happened to me.

After about 3 weeks, things were looking great - but lots of spaghetti code was put together, and it never told me what I didn't know. The data & state management architecture I had written was simply not maintainable (tons of prop drilling, etc). Over time, I basically learned common practices, etc., and I'm finding that I have to deal with these problems myself. (How it used to be!)

We're getting close - the best thing I've done is create documentation files with lots of descriptions about the architecture/file structure/state management/packages/etc, but it only goes so far.

We're getting closer, but for right now - we're not there and you have to be really careful with looking over all the changes.


The worst thing you can do with aider is let it autocommit to git. As long as you review each set of changes you can stop it going nuts.

I have a codebase of maybe 300-500k lines which is in good shape because of this.

I also normally just add the specific files I need to the chat and give it 1-2 sentences for what to do. It normally does the right thing (sonnet obviously).


Yes! Turn off autocommit, everyone! Review and test, then git commit.


Same with Cline


The reality is I suspect one will use different models for different things. Think of it like having different modes of transportation.

You might use your scooter, bike, car, jet - depending on the circumstances. A bike was invented 100 years ago? But it may be the best in the right use case. We'd still be using DaVinci for some things because we haven't bothered swapping it and it works fine.

For me - the value of R1/o3 is visible logic that provides an analysis that can be critiqued by Sonnet 3.5


I have an even more topical analogy! Using different languages for different tasks. When I need some one-off script to automate some drudgery (take all files with a certain pattern in their name, for each do some search and replace in the text inside, zip them, upload the zip to a URL, etc.) I use Python. When I'm working on a multi-platform game I use C# (and Unity). When I need to make something very lean that works in mobile browsers I use JS with some lightweight libraries.


Claude uses Shadcn-ui extensively in the web interface, to the point where I think it's been trained to use it over other UI components.

So I think you got lucky, and you're asking it to write using a very specific code library that it's good at, because it happens to use it for its main userbase on the web chat interface.

I wonder if you were using a different component library, or using Svelte instead of React, would you still find Claude the best?


I was recently trying to write a relatively simple htmx service with Claude. I was surprised at how much worse it was when it's not React.


I'm going to give you a video to watch. It's not mine, and I don't know much about this particular youtuber, but it really transformed how I think about writing and structuring the prompts I use, which solved problems similar to what you're describing here.

https://youtu.be/y_ywOVQyafE?si=IvKjy7QUYgxGPNgD

PS (I have not bought the guy's course and have no idea whether it's any good)


Cursor is also very user-unfriendly in providing alternative models to use in Composer (agent). There's a heavy reliance on Anthropic for Cursor.

Try using Gemini thinking with Cursor. It barely works. Cmd-K outputs the thinking into the code. It's unusable in chat because the formatting sucks.

Is there some relationship between Cursor and Anthropic, I wonder. Plenty of other platforms seem very eager to give users model flexibility, but Cursor seems to be lacking.

I could be wrong, just an observation.


Originally there actually was a relationship between Cursor & OpenAI; something like Cursor being supported by the OpenAI startup fund. So Cursor seems to have branched out. I think they are just emphasizing the models they find most effective. I'm surprised they haven't (apparently) incorporated Claude prompt caching yet for Sonnet.


My general workflow with AI so far has been this:

- I use Copilot mostly for writing unit tests. It mostly works well since the unit tests follow a standard template.

- I use the chat one for alternating between different approaches and (in)validating certain approaches.

My day job is a big monorepo. I have not investigated that yet, but I believe the models' context sizes fall short there, and as such only the above use cases work for me.


I have the same experience. Just today I was integrating a new logging system with my kubernetes cluster.

I tried out the OP model to make changes to my yaml files. It would give short snippets and I’d have to keep trial and erroring its suggestions.

Eventually I pasted the original prompt to Claude and it one-shot the dang thing with perfect config. Made me wonder why I even try new models.


"not" and other function words; usually work fine today but if I'm having trouble, the best thing to do is probably be inclusive, not exclusive.


Have you tried any of the specialty services like Augment? I am curious if they are any better or just snake oil.


OT: How many tokens are being consumed? How much are you paying for Claude APIs?


Just curious, did you try a code model like Codestral instead of a MoE?


o3 mini’s date cut-off is 2023, so it’s unfortunately not gonna be useful for anything that requires knowledge of recent framework updates, which includes probably all big frontend stuff.


I also have been less impressed by o1 in cursor compared to sonnet 3.5. Usually what I will do for a very complicated change is ask o1 to architect it, specifically asking it to give me a detailed plan for how it would be implemented, but not to actually implement anything. I then change the model to Sonnet 3.5 to have it actually do the implementation.

And on the side of not being able to get models to understand something specific: there’s a place in a current project where I use a special Unicode apostrophe during some string parsing because a third-party API needs it. But any code modifications by the AI to that file always replace it with a standard ASCII apostrophe. I even added a comment on that line to the effect of “never replace this apostrophe, it’s important to leave it exactly as it is!”, and also put that in my cursor rules, and sometimes directly in the prompt as well, but it always replaces it, even for completely unrelated changes. I’ve had to manually fix it like 10 times in the last day; it’s infuriating.
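
One possible guard (a minimal sketch of my own, not something described above; the path and test name are placeholders) is a tiny test that fails whenever an automated edit swaps the required character:

    # Minimal sketch: fail fast if the special right single quote (U+2019) the
    # third-party API requires has been replaced by a plain ASCII apostrophe.
    from pathlib import Path

    def test_special_apostrophe_preserved():
        text = Path("src/string_parsing.py").read_text(encoding="utf-8")  # hypothetical path
        assert "\u2019" in text, "U+2019 apostrophe was replaced; the third-party API needs it verbatim"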


A random idea - train one of those models on you, keep it aside, let it somehow work out your intricacies, moods, details, childhood memories, personality, flaws, strengths. Methods can be various - an initial dump of social networks, personal photos and videos, maybe some intense conversation to grok a rough version of you, then polish over time.

A first step to digital immortality; it could be a nice startup offering a personalized product for the rich, and then even regular folks. Immortality not as ourselves, the meat bags, of course (we die regardless), but as a digital copy and memento that our children can use if feeling lonely and can carry with them anywhere, or that later descendants could bring out, out of curiosity, to big events like weddings. One could 'invite' long-lost ancestors. Maybe your great-grandfather would be a cool guy you could easily click with these days via verbal input. Heck, even a detailed 3D model.

An additional service, 'perpetually' paid - keeping your data model safe, taking care of it, backups, heck, maybe even giving it a bit of computing power to receive current news in some light fashion and evolve - could be extras. Different tiers for different levels of service and care.

Or am I a decade or two ahead? I can see this as universally interesting across many if not all cultures.


> While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed.

I feel like this naming scheme is growing a little tired. o1 is for general knowledge reasoning, o3-mini replaces o1-mini but might be more specialized than o1 for certain technical domains...the "o" in "4o" is for "omni" (referring to its multimodality) but the reasoning models start with "o" ...but they can't use "o2" for trademark reasons so they skip straight to "o3" ...the word salad is getting really hard to follow!


The -mini postfix makes perfect sense, probably even clearer than the old "turbo" wording. Naturally, the latest small model may be better than larger older models... but not always and not necessarily in everything. What you'd expect from a -mini model is exactly what is delivered.

The non-reasoning line was also pretty straightforward. Newer base models get a larger prefix number and some postfixes like 'o' were added to signal specific features in each model variant. Great!

Where things went off the rails was specifically when they decided to also name the reasoning models with an 'o' (for separate reasons, and now as the prefix) while at the same time starting a separate linear number sequence, now as the postfix. I wonder if we'll end up with both a 4o and an o4...


> I wonder if we'll end up with both a 4o and o4...

The perplexing thing is that someone has to have said that, right? It has to have been brought up in some meeting when they were brainstorming names that if you have 4o and o1 with the intention of incrementing o1 you'll eventually end up with an o4.

Where they really went off the rails was not just bailing when they realized they couldn't use o2. In that moment they had the chance to just make o1 a one-off weird name and go down a different path for its final branding.

OpenAI just struggles with names in general, though. ChatGPT was a terrible name picked by engineers for a product that wasn't supposed to become wildly successful, and they haven't really improved at it since.


The obvious solution could be to just keep skipping the even numbers and go to o5.


Or further the hype and name it o9.


Why not let ChatGPT decide the naming? Surely it will be replacing humans at this task any day now?


And multimodal o4 should be o4o.


Probably they are doing so well because there are no endless meetings on customer-friendly names.


This is definitely intentional.

You can like Sama or dislike him, but he knows how to market a product. Maybe this is a bad call on his part, but it is a call.


Not really. They’re successful because they created one of the most interesting products in human history, not because they have any idea how to brand it.


If that were the case, they’d be neck and neck with Anthropic and Claude. But ChatGPT has far more market share and name recognition, especially among normies. Branding clearly plays a huge role.


ChatGPT is still benefitting from first mover advantage. Which they’ve leveraged to get to the position they’re at today.

Over time, competitors catch up and first mover advantage melts away.

I wouldn’t attribute OpenAI’s success to any extremely smart marketing moves. I think a big part of their market share grab was simply going (and staying) viral for a long time. Manufacturing virality is notoriously difficult (and based on the usability and poor UI of ChatGPT early versions, it feels like they got lucky in a lot of ways)


I think that has more to do with the multiple year head start and multiple tens of billions of dollars in funding advantage.


And you think that is due to their model naming?


I prefer Anthropic's models but ChatGPT (the web interface) is far superior to Claude IMHO. Web search, long-term memory, and chat history sharing are hard to give up.


That's first mover advantage.


I think it’s success in spite of branding, not because of it.

This naming scheme is a dumpster fire. Every other comment is trying to untangle what the actual hierarchy of model performance is.


That's like making a second reading and appealing to authority.

The naming is bad. Other people have already said it: you can "google" stuff, you can "deepseek" something, but to "chatgpt" sounds weird.

The model naming is even weirder; like, did they really avoid o2 because of oxygen?


> but to "chatgpt" sounds weird.

People just say it differently, they say "ask chatgpt"


I normally use Claude, and I'd say "Ask Claude", but unless it's someone who knows me well, I say "Ask ChatGPT", or it's just not as clear; and I don't think that's primarily due to popularity.


Obviously they do. That's the whole point.


They really need someone in marketing.

If the model is for technical stuff, then call it the technical model. How is anyone supposed to know what these model names mean?

The only page of theirs attempting to explain this is a total disaster. https://platform.openai.com/docs/models


> How is anyone supposed to know what these model names mean?

Normies don't have to know - ChatGPT app focuses UX around capabilities and automatically picks the appropriate model for capabilities requested; you can see which model you're using and change it, but don't need to.

As for the techies and self-proclaimed "AI experts" - OpenAI is the leader in the field, and one of the most well-known and talked about tech companies in history. Whether to use, praise or criticize, this group of users is motivated to figure it out on their own.

It's the privilege of fashionable companies. They could name the next model ↂ-↊↋, and it'll take all of five minutes for everyone in tech (and everyone on LinkedIn) to learn how to type in the right Unicode characters.

EDIT: Originally I wrote \Omega-↊↋, but apparently HN's Unicode filter extends to Greek alphabet now? 'dang?


What if you use ASCII 234? Ω (edit: works!)


Thanks! I copied mine from Wikipedia (like I typically do with Unicode characters I rarely use), where it is also Ω - the same character. For a moment I was worried I somehow got it mixed up with the Ohm symbol but I didn't. Not sure what happened here.


> They really need someone in marketing.

Who said this is not intentional? It seems to work well given that people are hyped every time there's a release, no matter how big the actual improvements are — I'm pretty sure "o3-mini" works better for that purpose than "GPT 4.1.3"


> I'm pretty sure "o3-mini" works better for that purpose than "GPT 4.1.3"

Why would the marketing team of all people call it GPT 4.1.3?


They wouldn't! They would call it o3-mini, even though GPT 4.1.3 may or may not "make more sense" from a technical perspective.


Yes, this $300Bn company generating +$3.4Bn in revenue needs to hire a marketing expert. They can begin by sourcing ideas from us here to save their struggling business from total marketing disaster.


At the least they should care more about UX. I have no idea how to restore the sidebar in ChatGPT on desktop, lol.


Click the 'open sidebar' icon in the top left corner of the screen.


There isn't one, unless they fixed it today. Just a down arrow to change the model.


Try clearing your cache, the button has always been there for me.


Just reproduced this. Not sure what button it is (Page Up, I think?), but it removes the sidebar, messes up the layout, and there's no obvious way to fix it.


Hype-based marketing can be effective, but it is high-risk and unstable.

A marketing team isn't just a generality that makes a company known; it often focuses on communicating which of the products in your lineup different types of customers need.

If I sell three medications:

Steve

56285

Priximetrin

And only tell you that they are all painkillers, but for different types and levels of pain, I'm going to leave revenue on the floor. That is, no matter how valuable my business is or how well it's known.


>this $300Bn company

Watch this space.


I bet you can get one of their models to fix that disaster.


But what would we call that model?


Let’s call it “O5 Pro Max Elite”—because if nonsense naming works for smartphones, why not AI models?


O5 Pro Max Elite Enterprise Edition with Ultra


Maybe they could start selling "season passes" next to make their offering even more clear!


> But what would we call that model?

Ask one of their models for advice. :-)


Reminds me of a joke in the musical "How to Succeed in Business Without Really Trying" (written in 1961):

PETERSON Oh say, Tackaberry, did you get my memo?

TACKABERRY What memo?

PETERSON My memo about memos. We're sending out too many memos and it's got to stop!

TACKABERRY All right. I'll send out a memo.


If marketing terms from Intel, AMD, Dell, and other tech companies have taught me anything, it's that they need FEWER people in marketing.


But think of all the other marketers whose job is to produce blogspam explaining confusing product names!


Ugh, and some of the rows of that table are "sets of models" while some are singular models...there's the "Flagship models" section at the top only for "GPT models" to be heralded as "Our fast, versatile, high intelligence flagship models" in the NEXT section...

...I like "DALL·E" and "Whisper" as names a lot, though, FWIW :p


A name is just a label. It's not supposed to mean anything.


Think how awesome the world would be if labels ALSO had meanings.


As someone else said in another thread, if you could derive the definition from a word, the word would be as long as the definition, which would defeat the purpose.


I'm not saying words. I'm saying labels.

You use words as labels so that we use our pre-existing knowledge of the word to derive meaning from the label.


There is no such thing. "Meaning" isn't a property of a label, it arises from how that label is used with other labels in communication.

It's actually the reason LLMs work in the first place.


You're gonna need to ground those labels in something physical at some point.

No one's going to let an LLM near anything important until then.


You only need it for bootstrapping. Fortunately, we've already done that when we invented first languages. LLMs are just bootstrapping off us.


I don't find OpenAI's naming conventions confusing, except that the o for omni and the o for reasoning have nothing to do with each other. That's a crime.


It's almost as bad as the Xbox naming scheme.


I don't know if anything is as bad as a games console named "Series."


They should be calling it ChatGPT and ChatGPT-mini, with other models hidden behind some sort of advanced-mode power-user menu. They can roll out major and minor updates by number. The whole point of differentiating between models is to get users to self-limit the compute they consume - rate limits make people avoid using the more powerful models, and if they have a bad experience using the less capable models, or if they're frustrated by hopping between versions without some sort of nuanced technical understanding, it's just a bad experience overall.

OpenAI is so scattered they haven't even bothered using their own state-of-the-art AI to come up with a coherent naming convention? C'mon, get your shit together.


"ChatGPT" (chatgpt-4o) is now its own model, distinct from gpt-4o.

As for self-limiting usage by non-power users, they're already doing that: the ChatGPT app automatically picks a model depending on what capabilities you invoke. While they provide a limited ability to see and switch the model in use, they're clearly expecting regular users not to care, and they design their app around that.
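
If you have API access, a quick way to see that split for yourself is to list the models (a minimal sketch using the OpenAI Python SDK; it assumes OPENAI_API_KEY is set in your environment, and the exact IDs returned will vary over time):

    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    for model in client.models.list():
        if "4o" in model.id:
            print(model.id)  # e.g. gpt-4o, gpt-4o-mini, chatgpt-4o-latest, ...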


None of that matters to normal users, and you could satisfy power users with serial numbers or even unique ideograms. Naming isn't that hard, and their models are surprisingly adept at it. A consistent naming scheme improves customer experience by preventing confusion - when a new model comes out, I field questions for days from friends and family - "what does this mean? which model should i use? Aww, I have to download another update?" and so on. None of the stated reasons for not having a coherent naming convention for their models are valid. I'd be upset as a stakeholder, they're burning credibility and marketing power for no good reason.

    modelname(variant).majorVersion.minorVersion

    ChatGPT(o).3.0
    ChatGPT-mini(o).3.0
    GPT.2.123
    GPT.3.9

And so on. Once it's coherent, people pick it up, and naturally call the model by "modelname majorversion", and there's no confusion or hesitance about which is which. See, it took me 2 minutes.
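
(To show how mechanical a scheme like that is, here's a hypothetical Python sketch that parses it; the pattern and the example IDs are just the ones proposed above, not anything OpenAI actually ships.)

    import re

    # modelname(variant).major.minor, with the (variant) part optional
    PATTERN = re.compile(
        r"^(?P<name>[A-Za-z0-9-]+)(?:\((?P<variant>[^)]+)\))?"
        r"\.(?P<major>\d+)\.(?P<minor>\d+)$"
    )

    def parse(model_id):
        match = PATTERN.match(model_id)
        if match is None:
            raise ValueError(f"not a valid model id: {model_id}")
        return match.groupdict()

    for example in ["ChatGPT(o).3.0", "ChatGPT-mini(o).3.0", "GPT.2.123", "GPT.3.9"]:
        print(parse(example))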

Even better: have an OAI Slack discussion company-wide, then have managers summarize their team's discussions into a prompt demonstrating what features they want out of it, then run all the prompts together and tell the AI to put together 3 different naming schemes based on all the features the employees want. Roll out a poll and have employees vote on which of the 3 gets used going forward. Or just tap into that founder mode and pick one like a boss.

Don't get me wrong, I love using AI - we are smack dab in the middle of a revolution and normal people aren't quite catching on yet, so it's exhilarating and empowering to be able to use this stuff, like being one of the early users of the internet. We can see what's coming, and if you lived through the internet growing up, you know there's going to be massive, unexpected synergies and developments of systems and phenomena we don't yet have the words for.

OpenAI can do better, and they should.


I agree with your observations, and that they both could and should do better. However, they have the privilege of being the AI company, the most hyped-up brand in the most hyped-up segment of the economy - at this point, the impact of their naming strategy is approximately nil. Sure, they're confusing their users a bit, but their users are very highly motivated.

It's like with videogames - most of them commit all kinds of UI/UX sins, and I often wish they didn't, but excepting extreme cases, the players are too motivated to care or notice.


This mentality is why teenagers can't use a file system. Why do tech people love to hide as much state as possible? Does it really help anyone?


Inscrutable naming is a proven strategy for muddying the waters.


Salesforce would like a word...


The USB-IF as well. Retroactively changing the name of a previous standard was particularly ridiculous. It's always been USB 3.1 Gen 1 like we've always been at war with Eastasia.


I hope OpenAI reconsiders the naming of their models at some point. I have trouble deciding which model I should use.


They release models too often for a new one to be better at everything, so you have to pick the right one for your task.


And that’s exactly where good, recognizable branding comes in.


Hopefully this is a big improvement over o1.

o1 has been very disappointing after spending sufficient time with Claude Sonnet 3.5. It's like it actively tries to gaslight me and thinks it knows more than I do. It's too stubborn and confidently goes off on tangents, suggesting big changes to parts of the code that aren't the issue. Claude tends to be way better at putting the pieces together in its not-quite-mental-model, so to speak.

I told o1 that a suggestion it gave me didn't work and it said "if it's still 'doesn't work' in your setup..." with "doesn't work" in quotes like it was doubting me... I've canceled my ChatGPT subscription and, when I really need to use it, just go with GPT-4o instead.


I've also noticed that with cGPT.

That said, I often run into a sort of opposite issue with Claude. It's very good at making me feel like a genius. Sometimes I'll suggest trying a specific strategy or trying to define a concept on my own, and Claude enthusiastically agrees and takes us down a 2-3 hour rabbit hole that ends up being quite a waste of time for me to backtrack out of.

I'll then run a post-mortem through ChatGPT, and very often it points out the issue in my thinking very quickly.

That said I keep coming back to sonnet-3.5 for reasons I can't perfectly articulate. Perhaps because I like how it fluffs my ego lol. ChatGPT on the other hand feels a bit more brash. I do wonder if I should be using o1 as my daily driver.

I also don't have enough experience with o1 to determine if it would also take me down dead ends as well.


Really interesting point you make about Claude. I’ve experienced the same. What is interesting is that sometimes I’ll question it and say “would it not be better to do it this way?” and all of a sudden Claude u-turns and says “yes, great idea, that’s actually a much better approach”, which leaves me thinking: are you just stroking my ego? If it’s a better approach, then why didn’t you suggest it?

However, I have suggested worse approaches on purpose, and sometimes Claude does pick them up as less than optimal.


It's a little sycophant.

But the difference is that it actually asks questions. And also that it actually rolls with what you ask it to do. Other models are stubborn and loopy.