OpenAI's o1-pro now available via API (platform.openai.com)
131 points by davidbarker 31 days ago | 129 comments



Pricing: $150 / 1M input tokens, $600 / 1M output tokens. (Not a typo.)

Very expensive, but I've been using it with my ChatGPT Pro subscription and it's remarkably capable. I'll give it 100,000 token codebases and it'll find nuanced bugs I completely overlooked.

(Now I almost feel bad considering the API price vs. the price I pay for the subscription.)


As far as I'm concerned, all of the other models are a waste of time to use in comparison. Most people don't know how good this model is.


Interesting... Most benchmarks show this model as being worse than o3-mini-high and Sonnet 3.7.

What difference are you seeing from these models that makes it better?

I say this as someone considering shelling out $200 for ChatGPT Pro for this.


If you're in the habit of breaking down problems to Sonnet-sized pieces you won't see a benefit. The win is that o1pro lets you stop breaking down one level up from what you're used to.

It may also have a larger usable context window, not totally sure about that.


> lets you stop breaking down one level up from what you're used to.

Can you provide an example of what you mean by this? I provide very verbose prompts where I know what needs to be done and just let AI “do” the work. I’m curious how this is different?


Partly it means you can tell it to do X and it will figure out that it implies Y and Z without you having to spell it out.

And partly it can actually execute more at the same time without starting to make mistakes


Sonnet 3.7 and O1 Pro both have 200K context windows. But O1 Pro has a 100K output window, and Sonnet 3.7 has a 128K output window. Point for Sonnet.

I routinely put about 100K+ tokens of context into Sonnet 3.7 in the form of source code, and in Extended mode, given the right prompt, it will output perhaps 20 large source files before having to make a "continue" request (for example if it's asked to convert a web app from templates to React).

I'm curious whether O1 Pro actually exceeds Sonnet 3.7 in Extended mode for coding or not. Looking forward to seeing some benchmarks.


I am very curious how 3.7 and o1 pro perform in this regard:

> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.

https://arxiv.org/abs/2502.05167


Has anyone ever tried to restructure a ~10K-token text? For example, structuring a 45 min to 1 hr interview transcript in an organized way without losing any detailed numbers / facts / supporting evidence. I find that none of OpenAI's models are capable of this task: they try to summarize and omit details. I don't think such a task requires much intelligence, but surprisingly OpenAI's "large"-context models can't manage it.


"Usable" is the key word here. Not all context is created equal.

Have a look at the RULER benchmark for a bit more detail.


There actually were almost no benchmarks for o1 pro before because it wasn't on the API. o1 pro is a different model from o1 (yes, even o1 with high reasoning).


I regularly push 100k+ tokens into it, so most of my codebase or large portions of it. I use the Repo Prompt product to construct the code prompts. It finds bugs and solutions at a rate that is far better than the others. I also speak into the prompt to describe my problem, and find spoken language is interpreted very well.

I also frequently download all the source code of libraries I am debugging, and when running into issues, pass that code in along with my own broken code. It's very good.


How long is its thinking time when compared to o1?

The naming would suggest that o1-pro is just o1 with more time to reason. The API pricing makes that less obvious. Are they charging for the thinking tokens? If so, why is it so much more expensive if there are just more thinking tokens anyways?


I think o1 pro runs multiple instances of o1 in parallel and selects the best answer, or something of the sort. And you do actually always pay for thinking models with all providers, OpenAI included. It's especially interesting if you remember the fact that OpenAI hides the CoT from you, so you're in fact getting billed for "thinking" that you can't even read yourself.


I don't have the answers for you; I just know that if they charged $400 a month I would pay it. It seems like a different model to me. I never use o3-mini or o3-mini-high, just GPT-4o or o1 pro.


Remarkably capable is a good description.

Shameless plug: One of the reasons I wrote my AI coding assistant is to make it easier to get problems into o1pro. https://github.com/jbellis/brokk


I wonder what the input/output tokens will be priced at for AGI.


They won't. Your use cases won't be something the AI can't do itself, so why would they sell it to you instead of replacing you with it?

AGI means the value of a human is the same as an LLM, but the energy requirements of a human are higher than those of an LLM, so humans won't be economical any more.


Actually, I think humans require much less energy than LLMs. Even raising a human to adulthood would be cheaper from a calorie perspective than running an AGI algorithm (probably). It's the whole reason why the premise of the Matrix was ridiculous :)

Some quick back-of-the-envelope math says that it would take around 35 MWh to get to 40 years old (at 2,000 kcal per day).
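
For anyone checking, a rough sketch of that arithmetic (assuming 2,000 kcal/day over 40 years):

    # Rough check of the 35 MWh figure (assumes 2,000 kcal/day for 40 years)
    kcal_per_day = 2000
    kwh_per_day = kcal_per_day * 4184 / 3.6e6   # 1 kcal = 4184 J, 1 kWh = 3.6 MJ
    total_mwh = kwh_per_day * 365 * 40 / 1000
    print(round(total_mwh, 1))                  # ~33.9 MWh, so ~35 MWh is in the right ballpark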


I read an article once that claimed an early draft/version, cut for time or narrative complexity, had human brains being used as raw compute for the machines, with the Matrix being the idle process to keep the minds sane and functional for their ultimate purpose.


I've read a file that claimed to be that script; it made more sense for the machines to use human brains to control fusion reactors than for humans to be directly used as batteries.

(And way more sense than how the power of love was supposed to be a nearly magical power source in #4. Boo. Some of the ideas in that film were interesting, but that bit was exceptionally cliché.)


I'd love to read that file. Of course, we're close (really close?) to being able to just ask an LLM to give us a personalized version of the script to do away with whatever set of flaws bother us the most.


One of the ways I experiment with LLMs is to get them to write short stories.

Two axes: quality and length.

They're good quality. Not award winning, but significantly better than e.g. even good Reddit fiction.

But they still struggle with length, despite what the specs say about context length. You might manage the script length needed for a kid's cartoon, but not yet a film.

I'll see if I can find another copy of the script; what I saw was long enough ago my computer had a PPC chip in it.


> PPC chip

Pizza box? I loved the 6100.


Beige proto-iMac. I had a 5200 as a teen and upgraded to either a 5300 or a 5400 at university for a few years — the latter broke while at university and I upgraded again to an eMac, but I think this was before then.

Looks like there's many different old scripts, no idea which, if any, was what I read back in the day: https://old.reddit.com/r/matrix/comments/rb4x93/early_draft_...

I miss those days. Even software development back then was more fun with REALbasic than today with SwiftUI.


HA! I used REALbasic a bit back in the day, then spent my time comparing it to LiveCode, back then called Revolution. Geoff Perlman and I once co-presented at WWDC to compare the two tools.


You need to consider all the energy spent to bring those calories to you, easily multiplying your budget by 10 or 100.


A human runs on ~100W, even when not doing anything useful. It's entirely plausible that 100W will be enough to run a future AGI level model.


OpenAI doesn’t have the pre-existing business, relationships, domain knowledge, etc to just throw AGI at every possible use case. They will sell AGI for some fraction of what an equivalent human behind a computer screen would cost.

“AGI” is also an under-specified term. It will start (maybe is already there) equivalent to, say, a human in an overseas call center, but over time improve to the equivalent of a Fortune 500 CEO or Nobel prize winner.

“ASI”, on the other hand, will just recreate entire businesses from scratch.


There could be something to what you wrote. If AGI were to be achieved by a model, why would they give access to it via an API? Why not just sell what it can do, e.g. business services? That would be far more of a moat.


Can you describe this "find a bug" workflow?


Is your prompt {$codebase} find bugs?


Typically something like:

  Look carefully through my codebase and identify any bugs/issues, or refactors that could improve it.

  <codebase>
  …
  </codebase>
Doesn't have to be anything overly complicated to get good results. It also does well if you give it a git diff.


Sorry if this is a noob question, but are you just pasting file contents in between those tags? Like the contents of file1.js and file2.js?


I do something similar, but with "raw" markdown + filename instead, so all my prompts basically end up like this:

    Do blah blah blah while taking blah and blah into account. Here is my current code:

    File `file1.js`:

    ```javascript
    console.log('I am number one!')
    ```

    File `file2.js`:

    ```javascript
    console.log("I am number two :(")
    ```
Not sure if I'm imagining it, but when I tried with/without the markdown code blocks, it seemed to do better when I used markdown code blocks, so I wrote a quick CLI that takes a directory path + prompt and creates something like that automatically for me. Oftentimes I send identical prompts to ChatGPT + DeepThink + Claude, compare the approaches, and continue with the one that works best for that particular problem, so having something reusable really saved time for this.

Edit: fuck it, in case people are curious how my little CLI works, I threw it up here: https://github.com/victorb/prompta (beware of bugs and whatnot, I've quite literally hacked this together without much thought)
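
For the curious who don't want to read the repo, here's a minimal sketch of the general idea (not the actual prompta code; the extension filter and layout are assumptions):

    # Hypothetical sketch of a directory -> markdown prompt builder (not the real prompta code).
    import sys
    from pathlib import Path

    def build_prompt(directory: str, instruction: str, exts=(".js", ".ts", ".py", ".rs")) -> str:
        parts = [instruction, "", "Here is my current code:", ""]
        for path in sorted(Path(directory).rglob("*")):
            if path.is_file() and path.suffix in exts:
                lang = path.suffix.lstrip(".")  # e.g. "js" for file1.js
                parts += [f"File `{path}`:", "", f"```{lang}", path.read_text(), "```", ""]
        return "\n".join(parts)

    if __name__ == "__main__":
        # usage: python build_prompt.py <dir> <prompt text...>
        print(build_prompt(sys.argv[1], " ".join(sys.argv[2:])))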


I actually have a perfect tool called siphon-cli for this. It adds the headers in between files and everything. https://docs.siphon-cli.com/


Yeah, like a lightweight version of my prompta CLI :)

What I end up with, is one .md file that uses variables like "$SRC", "$TESTS" and "$DOCS" inside of it, that gets replaced when you run `prompta output`, and then there is also a JSON file that defines what those variables actually get replaced with.

Bit off-topic, but curious how your repository ends up having 8023 lines of something for concatenating files, while my own CLI sits on 687 lines (500 of those are Rust) but has a lot more functionality :)


Not OP, but practically all of those lines are from a package-lock.json file (6755 lines) and a changelog (541 lines). It looks like the actual source is 179 lines long.


Repomix can take care of this for you. I pack it, cat the file to my clipboard with pbcopy, and just paste it into the prompt.

https://github.com/yamadashy/repomix


I tried the web demo (https://repomix.com/) and it seems to generate unnecessarily complex "packs" for no reason, which probably hurts LLM performance too. Why are there "Usage Guidelines" and "File Format" explanations in this, when it's supposed to just be the code "packed"? Better to just have the contents + filename; the model will infer the directory structure and everything else from that.


While possibly being strange defaults, both of those are options. Remove the file summary and directory structure, both featured in the UI and in the CLI tool, and voila, it's in your "better" state. There are also additional compression options beyond those two tweaks.


That's right. I made a VS Code extension to combine all the files I have open into one long string I can copy & paste.

https://marketplace.visualstudio.com/items?itemName=DVYIO.co...


I wrote a script to do exactly this a while ago (with the help of o1 pro). Makes it way easier https://github.com/keizo/ggrab


Do you get this -

When you say: But is that really a bug?

GPT: That's right. Now that I see it again this is not a bug….and a lot of blah blah.


This is their first model to only be available via the new Responses API - if you have code that uses Chat Completions you'll need to upgrade to Responses in order to support this.

Could take me a while to add support for it to my LLM tool: https://github.com/simonw/llm/issues/839


Oh interesting. I thought they were going to have forward compatibility with Completions. Apparently not.


It does. There are two endpoints. Eventually, all new models will only be in the new endpoint. The data interfaces are compatible.


It shouldn't be too bad. The responses API accepts the same basic interface as the chat completion one.


The harder bit is the streaming response format - that's changed a bunch, and my tool supports both streaming and non-streaming for both Python sync and async IO - so there are four different cases I need to consider.


Even the basic interface is different, actually - "input" vs "messages", no "max_completion_tokens" nor "max_tokens". That said, changing those things is quite easy.
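
Roughly, the two request shapes look like this (a sketch using the Python SDK; the model names are just examples):

    # Sketch of the two request shapes (OpenAI Python SDK; model names are only examples).
    from openai import OpenAI

    client = OpenAI()

    # Chat Completions: "messages" and "max_completion_tokens"
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Say hi"}],
        max_completion_tokens=1000,
    )
    print(chat.choices[0].message.content)

    # Responses (the only endpoint that serves o1-pro): "input" and "max_output_tokens"
    resp = client.responses.create(
        model="o1-pro",
        input=[{"role": "user", "content": "Say hi"}],
        max_output_tokens=1000,
    )
    print(resp.output_text)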


If it's easier, just ask Cursor to make the upgrade. Give it a link to the OpenAI doc. You might be surprised at how easy it is.


Simon, I see it via Chat Completions as well as Responses in their API platform playground.


I just tried sending an o1-pro prompt to the chat completions API in the playground and got:

  This is not a chat model and thus not supported in the
  v1/chat/completions endpoint. Did you mean to use v1/completions?


Sorry, since the Platform UI featured it as an option, I figured OpenAI might enable o1-pro via the chat completions endpoint. I just got around to testing it, and I also get the same 404 `invalid_request_error` via the platform UI and the API. It's such an odd and old 404 message to suggest using the old completions API! It's hard to believe it could be an intentional design decision. Maybe they see it as an important feature to avoid wasting (and refunding) o1-pro credit. I noticed that their platform's dashboard queries https://api.openai.com/dashboard/ which lists a supported_methods property for models. I can't see anything similar in the huge https://raw.githubusercontent.com/openai/openai-openapi/refs... schema yet (commit ec54f88 right now), and it lacks any mention of o1-pro at all. Like the whole developer-messages thing, the UX of the API seems like such an afterthought.


It cost me 94 cents to render a pelican riding a bicycle SVG with this one!

Notes and SVG output here: https://simonwillison.net/2025/Mar/19/o1-pro/


I’m no expert but that does not look like a 94c pelican to me.


Better than my svg pelican would be, but it's a low bar.


Your collection of pelicans is so bloody funny, genuinely brightened my day.

I don't know what I was expecting when I clicked the link but it definitely wasn't this: https://simonwillison.net/tags/pelican-riding-a-bicycle/


Whenever you experience a new pelican I always have to check it against your past pelicans to see progress towards the Artificial Super Pelican Singularity:

https://simonwillison.net/tags/pelican-riding-a-bicycle/


At this point you’d come out ahead just buying a pelican. Even before the tax benefits.


I have been using ChatGPT to generate 3d models by pasting output into OpenSCAD. Often feels like coaching someone wearing a blindfold, but it can sometimes kick things forward quickly for low effort.


Assuming a highly motivated office worker spends 6 hours per day listening or speaking, at a salary of $160k per year, that works out to a cost of ≈$10k per 1M tokens.

OpenAI is now within an order of magnitude of a highly skilled human with their frontier model pricing. o3 pro may change this, but at the same time I don't think they would have shipped this if o3 was right around the corner.
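
A rough version of the ≈$10k/1M arithmetic above (150 words/minute, ~1.3 tokens per word, and ~230 working days/year are assumed figures, not measurements):

    # Back-of-the-envelope check of the ~$10k per 1M token figure (assumptions, not measurements).
    tokens_per_year = 150 * 1.3 * 60 * 6 * 230      # ~16.1M tokens spoken/heard per year
    cost_per_million = 160_000 / tokens_per_year * 1_000_000
    print(round(cost_per_million, -2))              # ~9900, i.e. on the order of $10k per 1M tokens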


If you start paying someone and give them some onboarding docs, to a first approximation they'll start doing the job and you'll get value.

If you attach a credit card to o3 and give it some onboarding docs, it'll give you a nice summary of your onboarding docs that you didn't need.

We're a long way from a model doing arbitrary roles. Currently at the very minimum, you need a competent office worker to run the model, filter its output through their judgement, and act on it.


More like: every time you tell o3 to do something, it will first reread the onboarding docs (and charge you for doing so) before it does anything else.


Right, value per token is much more important (but harder to quantify). A medical AI that could provide a one-paragraph diagnosis and treatment plan for rare / untreatable diseases could be generating thousands of dollars of value per token. Meanwhile, Claude has probably racked up millions of tokens wandering around Mt. Moon aimlessly.


“Untreatable” disease.

Yet somehow the AI knows a treatment?


Miraculously, it does know the treatment! The question is, does it have malpractice insurance?


The treatment is a cranial amputation. No more symptoms!


I can just imagine all the executives and actuaries in Lloyd’s of London just salivating and wringing their hands: ”that’s how we get in on the AI fad too!”


I think that’s the remarkable thing - even with all of its flaws and its insane pricing, there’s plenty of people that will pay for it (myself included).

LLM’s are good at a class of tasks that humans aren’t.


> Assuming a highly motivated office worker spends 6 hours per day listening or speaking, at a salary of $160k per year, that works out to a cost of ≈$10k per 1M tokens.

I guess...if by office worker you mean a manager that does nothing but attend meetings and otherwise talk to people. For other workers you probably want to count the token equivalent of their actual work output and not just the chatting.


I suspect inner monologue is the useful metric for token count. I don't know if (any, let alone most or all) human brains think in token-like chunks, but if we do, and that's at 180/minute, that's 180x60x5x48 (working weeks/year) = 20,736,000 tokens/year. At that rate, $160k/year would be ~$7700/million tokens.

My guess is that this is better than a human who would cost $16k/year to hire. But with the logarithmic improvements in quality for linear price increases, I'm not sure it would be good enough to replace a $160k/year worker.


Just noticed I missed the 8x in the LHS, but the total is correct:

> 180x60x5x48 (working weeks/year) = 20,736,000 tokens/year


How do you reconcile issues such as the o1 pro model erroring out every 3rd attempt at an extremely large context? (that still fits but is near the limit)

Every time I try to get this thing to read my codebase and onboarding docs (about a 40k-line Angular codebase), it fails in a "pull your hair out" way, leading to frustration.


It has a 2023 knowledge cut-off, and 200k context window... ? That's pretty underwhelming.


On the flip side, the cutoff date probably makes it a lot more upbeat.


Don't know if it's me, but this is really funny.


For a second I was like "2023 isn't that bad"... and then I realized we're well into 2025...


o1-pro still holds up to every other release, including Grok 3 Think and Claude 3.7 Think (haven't tried Max out though), and it's over 3 months old, practically an eternity in AI time.

Ironic since I was getting ready to cancel my Pro subscription, but 4.5 is too nice for non-coding/math tasks.

God I can't wait for o3 pro.


"Max" as in "Claude 3.7 Sonnet MAX" is apparently Cursor-specific marketing - by default they don't use all the context of the model and set the thinking budget to a lower value than the maximum allowed. So essentially it's the exact same 3.7 Sonnet model, just with different settings.


4.5 works on Plus! I know. I was surprised too.


Question for those who have tested it and liked it: I feel very confident with Sonnet 3.7 right now; if I could wish for something, it would be for it to be faster. Most of the problems I'm facing are execution problems where I just want the AI to do it faster than me coding everything on my own.

To me it seems like o1-pro would be used more as a switch-in tool or to double-check your codebase than as a constant coding assistant (even at a lower price)? I assume it would need to get a tremendous amount of work done, including domain knowledge, to make up for the estimated 10x speed advantage of Sonnet.


o1-pro can be very useful but it's ridiculously slow. If you find yourself wishing Sonnet 3.7 was faster, you really won't like o1-pro.

I pay for it and will probably keep doing so, but I find that I use it only as a last resort.


I have always suspected that o1-pro is some kind of workflow on top of the o1 model. Is it possible that it dispatches to, say, 8 instances of o1 and then does some type of aggregation over the results?


Did not know it was that expensive to run. I'm going to use it more in my Pro subscription now. I frankly do not notice a huge difference between o1 Pro and o3-mini-high - both fail on the fairly straightforward practical problems I give them.


At first I thought, great, we can add it now to our platform. Now that I have seen the price, I am hesitant to enable the model for the majority of users (except rich enterprises), as they will most certainly shoot themselves in the foot.


> they will most certainly shoot themselves in the foot

...and then ask you for a refund or service credit.


> $150/Mtok input, $600/Mtok output

What use case could possibly justify this price?


It enables obscene, unnatural things at a fraction of most SWE hourly rates. One win that jumps to mind was writing a complete implementation of a Windows PCM player, as a Flutter plugin, with some unique design properties and emergent API behavior that it needed to replicate from existing iOS/Android code.


Does it really? Your average software engineer is like £20-30 an hour; for the cost of 1M output tokens you can get a dev for a full week.


The math doesn’t check out. A day maybe. Also it’s not just about a placeholder dev. The person needs to know your use-case and have the tech chops to deliver successfully in that timeframe.

Now to have that delivered to you in less than an hour? That’s a huge win.


No, even contractors who are more expensive per day can be had for £200-300 per diem. For employee costs you are looking at closer to £165 a day, including national insurance and pension, on a standard £35k a year salary. Even an employee on £50k a year is only £241.97 a day.

IR35 rules largely closed down a lot of the high-value contracting business and pushed day rates down heavily.


Sure.

So why are you saying we'd get a dev for a week for $600? :)

Even if we take that at face value, now we're claiming devs are $30K/year. (This was sort of baseline full-time pay when I was in high school, 20 years ago.)

Why do I need 1,000,000 tokens for 500 loc?

I don't think further futzing with the numbers makes this work; it's off by multiple OOMs while barely credible.


Ok closer to 4 days than a week, still an insane differential in cost.

As we all know it's never just 500 lines of code, and 500 lines can take up quite a lot of tokens, especially on a reasoning model like o1-pro which we know can spend quite a hefty chunk of tokens on thinking.

Once you add in iteration and testing you can find yourself racking up quite the bill even on small projects.


In a separate thread I estimated the cost of doing the full output 10x: $18. It certainly adds up, yes; the thing I get indignant about (not with you, just in general) is that even the most expensive model is definitely an OOM off, and most likely 2 OOMs off, from paying an engineer.

It's non-trivial work that requires domain knowledge, i.e. it would have been a 4-6 week project for a noogler, and I would have been impressed if they did it without heavy guidance. (Emergent requirements around purposely using an older API, stuff like that).


I don't agree with your cost estimate, but even accepting it there's no way you're spending £1800 / 4-6 weeks on something as small as 500 lines of code and as simple as PCM audio player code. Your description sounds more like the thing an undergraduate might make in a hackathon.


That's fair, for better or worse, I'm not going to spend my time proving it was hard.


Leaving the dissection of this to the separate reply, let's estimate cost:

- 80 chars per line, 30 occupied (avg'd across 300 KLOC in codebase)

- 500 lines of code

- 15000 characters

- 4 chars / token

- 3750 tokens output

- 10 full iterations, and don't apply cached token pricing that's 90% off

- 37,500 tokens req'd in output

- $600 / 1M tokens

- $0.60 / 1K tokens

- $18


Even cached input is only half price with OpenAI (and they don't even offer it for o1-pro).

Further, we also aren't counting input here, which can get long since it includes the previous output; for the last request that will be 33,750 + reasoning + any prompts, which will increase your cost quite a bit.

But yes, that is more reasonable than I'd expect, I must admit. I still think it needs to be at least an order of magnitude cheaper to compete against the other models out there.

I'm not sure I know a lot of employers who would allocate that sort of constant funding to a consumable tool for an employee, given that's your usual monthly cost for a typical SaaS product.


Been a minute since I touched audio code but isn't PCM quite basic? Really hard to beat $18 though! Even the 2hrs it'd take a decent SWE would be easily 30x that.


With the average SWE it's a toss-up whether they create more issues than they solve over time. Factor in onboarding, bugs, and taking time away from other expensive people, and it becomes >$100/hr real quick.


If this were true there would be a lot less hiring of software engineers in the world and a lot more people getting laid off.


There would be if it was easy to recognize.

You generally can't see this due to all the middlemen and bloat, and because no corp really wants to measure dev productivity against revenue like this, as it'd raise a lot of uncomfortable questions from devs and shareholders alike.


More mediocre software is all the world needs.


A tool is a tool. Your output is what you decide.


Probably not a great (or even unnatural) example. There are tons of examples of PCM players as Flutter plugins on the net, and Gemini from the free AI Studio spits an implementation out in about 20 seconds and $0.

YMMV


No, you're wrong. I wish you weren't. I hate posting this stuff because at least a few people reply to the absolutely weakest version of what I actually said.

Go check out flutter_pcm_sound_fork, find me even one package with the same streaming PCM => speakers functionality, and I'll give you $500. All I ask is, as a personal favor to me, you read the part in the Hacker News FAQ about "coming with curiosity"


I used O1 Pro to write a .NET authorization filter when I didn't even know what one was. I was like "I have this problem, how can I fix it?" and it just started going, and the solution worked on the first try. Everyone at work was like "great job!" I guess I did feed it a bunch of surrounding code and the authorization policy, but the policy only allowed us to attach one security trait when we wanted to "attach any number of security attributes and verify the user has at least one". Still, it solved in an hour-or-so conversation what would likely have been at least a day or two of research.


Is it secure?


Synthetic data generation. You can have a really powerful, expensive model create evals so you can tune a faster, cheaper system with similar performance.
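
Something like this, for example (a hypothetical sketch; the model name, prompt, and JSONL layout are all assumptions):

    # Hypothetical sketch: use an expensive model to generate eval cases for tuning a cheaper one.
    import json
    from openai import OpenAI

    client = OpenAI()
    topics = ["off-by-one errors", "race conditions", "SQL injection"]  # placeholder topics

    with open("synthetic_evals.jsonl", "w") as f:
        for topic in topics:
            resp = client.responses.create(
                model="o1-pro",  # the expensive "teacher" model
                input=f"Write one short code-review question about {topic}, "
                      "then the ideal answer. Label them 'Q:' and 'A:'.",
            )
            f.write(json.dumps({"topic": topic, "qa": resp.output_text}) + "\n")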


You could do that, but OpenAI specifically doesn't want you to: https://openai.com/policies/row-terms-of-use/

What you cannot do. You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not: Use Output to develop models that compete with OpenAI.

Presumably you run the risk of getting banned if they realize what you're doing.


> You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not: Use Output to develop models that compete with OpenAI.

This reads as if they consider developing models that compete with OpenAI as illegal, harmful or abusive. Which is crazy. (The other dot points in their list in the linked terms seem better).


Screw their TOS.

OpenAI trained on the world's data. Data they didn't license.

Anyone should be able to "rip them off" and copy their capabilities on the cheap.


The irony isn't lost on me, but irony isn't going to stop them kicking you off their platform if they feel like it.


I wonder if some of the high pricing is specifically an attempt to ward off this sort of "slow distillation" of a powerful model


I suspect they have no way to enforce that without risking false positives hurting their rich customers (and their business).


If it was possible:

1) Why wasn't OpenAI doing it themselves?

2) It would mean we've reached the technological singularity if AI models can improve themselves (as in getting a smarter model, not just compressing existing ones like DeepSeek)


It’s not a singularity because the synthetic data generated by the previous frontier model isn’t usually fed directly into the training for the next frontier model - a “discriminator” is applied to select only the highest quality responses. That discriminator could be a field expert or mechanical turk or another model trained to select for higher quality responses (i.e. trained on a dataset of books rather than internet content).

As far as I know, OpenAI has been doing this, using both experts and Kenyan workers as well as their own discriminator models. Unfiltered synthetic data is generally used more for distilled models and fine tunes for a specific use case.


Compressing existing models is exactly what we are talking about here.


Then why wasn't OpenAI doing that themselves?


Who says they aren't?


Synthetic data is just as useful for building app-layer evals. There are probably significantly cheaper ways to get the data if you're training your own model.


I compete with AI, not my models.


Full file refactoring. But I just use the webUI for this and will continue to at these prices... probably.


o1-pro doesn't support streaming, so it's reasonable to assume that they're doing some kind of best-of-n type technique to search over multiple answers.

I think you can probably get similar results for a much lower price using llm-consortium. This lets you prompt as many models as you can afford and then chooses or synthesises the best response from all of them. And it can loop until a confidence threshold is reached.
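
For a rough picture of the best-of-n idea, here's a hypothetical sketch (not how o1-pro or llm-consortium actually work; the judge prompt and model name are made up):

    # Hypothetical best-of-n sketch: sample n candidates, then have a judge pick the best one.
    from openai import OpenAI

    client = OpenAI()

    def best_of_n(prompt: str, n: int = 4, model: str = "gpt-4o") -> str:
        # n independent samples from one Chat Completions call
        choices = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            n=n,
        ).choices
        candidates = [c.message.content for c in choices]
        numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        judge = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                f"Question: {prompt}\n\nCandidates:\n{numbered}\n\n"
                "Reply with only the index of the best candidate."}],
        ).choices[0].message.content
        return candidates[int(judge.strip().strip("[]"))]  # assumes the judge follows instructions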


Seems underwhelming when OpenAI's best model, o3, was demoed almost 4 months ago.


Deepseek r1 is much better than this.


Interesting take, care to explain more exactly how it is much better?


It's exactly "much" better!



