ChatGPT Pro (openai.com)
813 points by meetpateltech 44 days ago | 1197 comments



OpenAI is racing against two clocks: the commoditization clock (how quickly open-source alternatives catch up) and the monetization clock (their need to generate substantial revenue to justify their valuation).

The ultimate success of this strategy depends on what we might call the enterprise AI adoption curve - whether large organizations will prioritize the kind of integrated, reliable, and "safe" AI solutions OpenAI is positioning itself to provide over cheaper but potentially less polished alternatives.

This is strikingly similar to IBM's historical bet on enterprise computing - sacrificing the low-end market to focus on high-value enterprise customers who would pay premium prices for reliability and integration. The key question is whether AI will follow a similar maturation pattern or if the open-source nature of the technology will force a different evolutionary path.


The problem is that OpenAI doesn't really have the enterprise market at all. Their APIs are closer, in that many companies (primarily Microsoft) use them to power features in other software, but with the APIs they're not the ones providing end-user value to enterprises.

As for ChatGPT, it's a consumer tool, not an enterprise tool. It's not really integrated into an enterprise's existing toolset, it's not integrated into their authentication, it's not integrated into their internal permissions model, and the IT department can't enforce any policies on how it's used. In almost all ways it doesn't look like enterprise IT.


This reminds me why enterprises don't integrate OpenAI products into their existing toolsets: trust is the root reason.

It's hard to trust that OpenAI won't use enterprise data to train the next model, in a market where content is the most valuable element, compared to office suites, cloud databases, etc.


This is what the Azure OpenAI offering is supposed to solve, right?


Sort of?

Then there’s trust that it won’t make up information.

It probably won’t be used for any HR/legal work for fear of false info being generated.


Correct


Why should MS be more trustworthy than OpenAI?

MS failed their customers more than once.


Microsoft 365 has over 300 million corporate users - trusting it with email, document management, collaboration, etc. It's the de facto standard in larger companies, especially in banking, medicine, and finance, which have more rigorous compliance regulations.


And MS already showed the customers shouldn’t trust them.

https://news.ycombinator.com/item?id=37408776

Maybe it's a good idea to spread your data around and not put it all in one place, if you really need to use the cloud.


The administrative segments that decide to sell their firstborn to Microsoft all have their heads in the clouds. They'll pay Microsoft to steal their data and resell it, and they'll defend their decision-making beyond their own demise.

As such, Microsoft is making the right choice in outright stealing data for whatever purpose. It will have no real consequences.


I think the case could be made that “spreading your data” is exactly what you don’t want to do, you’re increasing your attack surface.


Not like most had a choice, they already had office documents and windows, what else were they going to pick?

Your historical pile of millions of MSOffice documents is an ocean sized moat.


Surely MS wouldn't abuse that trust.

https://news.ycombinator.com/item?id=42245124


A flick of an IT policy switch disables that, as at my organization. It was instead intended to snag individual, non-corporate user accounts (still horrible, but I mean to convey that MS at no point in all that expected a company's IT department to actually leave that training feature enabled in policy).


This was debunked within hours, as commented on that thread last week.


It doesn't need to / it already is - most enterprises are already Microsoft/Azure shops. Already approved, already there. What is close to impossible is to use anything non-Microsoft - with one exception: open source.


Because idk, windows, AD, office, and so many more Microsoft products could already betray that customer trust but don’t.


They betrayed their customers in the Storm-0558 attack. They didn't disclose the full scale and charged the customers for the advanced logging needed for detections.

Not to mention that they abolished QA and outsourced it to the customer.


How do you know they don't?


It is immaterial what they do and what you know. What matters is what the CIOs of the enterprise believe.


Because that would be the biggest story in the world.


Maybe they aren't, but when you already have all your documents in sharepoint, all your emails in outlook and all your databases VMs in Azure, then Azure OpenAI is trusted in the organization.


For some reason (mainly because Microsoft has orders of magnitude more sales reps than anyone else), companies have been trusting Microsoft with their most critical data for a long time.


They sign business associate agreements. It's good enough for HIPAA compliance.


The devil you know


For example, when they backed the CEO's coup against the board.

With AI-CEOs - https://ai-ceo.org - this would never have happened, because their CEOs have a kill switch and a mobile app that gives the board full observability.


OpenAI's enterprise plan explicitly says that they do not train their models on your data. It's in the contract, and it's also visible at the bottom of every ChatGPT prompt window.


It seems like a damned-if-you-do, damned-if-you-don't situation. How is ChatGPT going to provide relevant answers to company-specific prompts if they don't train on your data?

My personal take is that most companies don't have enough data, nor data of sufficiently high quality, to be able to use LLMs for company-specific tasks.


The model from OpenAI doesn’t need to be directly trained on the company’s data. Instead, they provide a fine-tuning API in a “trusted” environment. Which usually means Microsoft’s “Azure OpenAI” product.

But really, in practice, most applications are using the “RAG” (retrieval augmented generation) approach, and actually doing fine tuning is less common.
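To make the RAG idea concrete, here's a minimal sketch of the pattern (the toy embed() and the documents are made up; a real system would call an embedding model and a vector store). The point is that the company-specific knowledge travels in the prompt at question time rather than being trained into the weights:

    # Minimal RAG sketch: retrieve the most relevant internal documents and
    # prepend them to the prompt, instead of training the model on them.
    # embed() is a toy stand-in; a real system would call an embedding model
    # and a vector store.
    from collections import Counter
    import math

    def embed(text: str) -> Counter:
        # Toy bag-of-words "embedding" so the sketch runs without any API.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    documents = [
        "Enterprise plan: $40 per seat per month, includes SSO and audit logs.",
        "Team plan: $15 per seat per month, minimum five seats.",
        "Support hours are 9am-5pm CET on business days.",
    ]

    def build_prompt(question: str, k: int = 2) -> str:
        q = embed(question)
        ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
        context = "\n".join(ranked[:k])
        # The company-specific knowledge travels in the prompt, not the weights.
        return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    print(build_prompt("Which subscription includes SSO?"))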


> The model from OpenAI doesn’t need to be directly trained on the company’s data

Wouldn't that depend on what you expect it to do? If you just want, say, Copilot, text summarization, or help writing emails, then you're probably good. If you want to use ChatGPT to help solve customer issues or debug problems specific to your company, wouldn't you need to feed it your own data? I'm thinking: for "Help me find the correct subscription for a customer with these parameters", you'd need ChatGPT to know your pricing structure.

One idea I've had, from an experience with an ISP, would be to have the LLM tell customer service: Hey, this is an issue similar to what five of your colleagues just dealt with, in the same area, within 30 minutes. You should consider escalating this to a technician. That would require more or less live feedback to the model, or am I misunderstanding how the current AIs would handle that information?
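For what it's worth, that escalation idea wouldn't need live retraining either; it's closer to a live lookup over recent tickets whose result gets handed to the model (or the agent) as context. A rough sketch of just the aggregation step, with made-up field names:

    # Sketch: flag a ticket for escalation when several similar tickets were
    # opened in the same area within the last 30 minutes. No retraining; the
    # result can be handed to the LLM (or the agent) as extra context.
    from datetime import datetime, timedelta

    recent_tickets = [
        {"area": "zip-10115", "issue": "modem not syncing", "opened": datetime(2024, 12, 9, 10, 5)},
        {"area": "zip-10115", "issue": "modem not syncing", "opened": datetime(2024, 12, 9, 10, 12)},
        # ... pulled live from the ticketing system
    ]

    def similar_recent(ticket, tickets, window=timedelta(minutes=30)):
        return [
            t for t in tickets
            if t["area"] == ticket["area"]
            and t["issue"] == ticket["issue"]
            and abs(t["opened"] - ticket["opened"]) <= window
        ]

    new_ticket = {"area": "zip-10115", "issue": "modem not syncing",
                  "opened": datetime(2024, 12, 9, 10, 20)}
    if len(similar_recent(new_ticket, recent_tickets)) >= 5:
        print("Consider escalating: looks like an area-wide outage.")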


> Instead, they provide a fine-tuning API


Most enterprise use cases also have strong authz requirements.

You can't really maintain authz while fine tuning (unless you do a separate fine-tune for each permission set.) So RAG is the way to go, there.
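A minimal sketch of what that looks like in practice, assuming some per-document ACL metadata (the group model here is invented for illustration): the permission filter runs before retrieval, so a single shared model never sees documents the caller can't read.

    # Sketch: enforce per-user permissions at retrieval time, so a single
    # shared model never sees (or cites) documents the caller can't read.
    documents = [
        {"text": "Q3 board deck: revenue slipped 4%.", "allowed_groups": {"exec"}},
        {"text": "Wifi password rotation schedule.", "allowed_groups": {"it", "exec"}},
        {"text": "Public pricing page copy.", "allowed_groups": {"everyone"}},
    ]

    def retrieve(query: str, user_groups: set, top_k: int = 3):
        visible = [
            d for d in documents
            if d["allowed_groups"] & (user_groups | {"everyone"})
        ]
        # Relevance ranking elided; the point is that the permission filter
        # runs before anything reaches the prompt.
        return visible[:top_k]

    for doc in retrieve("what is our pricing?", user_groups={"it"}):
        print(doc["text"])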


> How is ChatGPT going to provide relevant answers to company specific prompts if they don't train on your data?

Isn't this explicitly what RAG is for?


RAG is worse than training on the target data, but yes it is a mitigation.


That is a MASSIVE game changer !


100% this. If they can figure out trust through some paradigm where enterprises can use the models but not have to trust OpenAI itself directly then $200 will be less of an issue.


> It's hard to provide trust to OpenAI that they won't steal data of enterprise to train next model

Bit of a cynical take. A company like OpenAI stands to lose enormously if anyone catches them doing dodgy shit in violation of their agreements with users. And it's very hard to keep dodgy behaviour secret in any decent sized company where any embittered employee can blow the whistle. VW only just managed it with Dieselgate by keeping the circle of conspirators very small.

If their terms say they won't use your data now or in the future then you can reasonably assume that's the case for your business planning purposes.


Is it? OpenAI has multiple lawsuits over misuse of data, and it doesn't seem to be slowing them down much.

https://news.bloomberglaw.com/ip-law/openai-to-seek-to-centr...

Just make sure your chat history is off for starters. https://www.threatdown.com/blog/how-to-keep-your-chatgpt-con...


Lawsuits over the legality of using someone's writing as training data aren't the same thing as saying you won't use someone as training data and then doing so. They're different things. One is people being upset that their work was used in a way they didn't anticipate, and wanting additional compensation for it because a computer reading their work is different from a person reading their work. The other is saying you won't do something, doing it anyway, and lying about it.


It's not that anyone suspects OpenAI of doing dodgy shit. Data flowing out of an enterprise is very high risk, no matter what security safeguards you employ. So they want everything inside their cloud perimeter and on servers they can control.

IMO no big enterprise will adopt chatGPT unless it's all hosted in their cloud. Open source models lend better to the enterprises in this regard.


> IMO no big enterprise will adopt chatGPT unless it's all hosted in their cloud

80% of big enterprises already use MS Sharepoint hosted in Azure for some of their document management. It’s certified for storing medical and financial records.


> IMO no big enterprise will adopt chatGPT unless it's all hosted in their cloud

Plenty of big enterprises have been using OpenAI models for a good while now.


Cynical? That’d be on brand… especially with the ongoing lawsuits, the exodus of people and the CEO drama a while back? I’d have a hard time recommending them as a partner over Anthropic or Open Source.


It's not enough for some companies that need to ensure it won't happen.

I know for a fact a major corporation I do work for is vehemently against any use of generative A.I. by its employees (just had that drilled into my head multiple times by their mandatory annual cybersecurity training), although I believe they are working towards getting some fully internal solution working at some point.

Kind of funny that Google includes generative A.I. answers by default now, so I still see those answers just by doing a Google search.


If everyone has the same terms and roughly equivalent models, enterprises will continue choosing Microsoft and Amazon.


This seems like the kind of thing that laws and regulators exist for.


Good luck with that. Fortunately few CTOs/CEOs share your faith in a company already guilty of rampant IP theft, run by a serial liar.


ChatGPT does have an enterprise version.

I've seen the enterprise version with a top-5 consulting company, and it answers from their global knowledgebase, cites references, and doesn't train on their data.


I recently (in the last month) asked ChatGPT to cite its sources for some scientific data. It gave me completely made up, entirely fabricated citations for academic papers that did not exist.


Did the model search the internet?

The behavior you're describing sounds like an older model behavior. When I ask for links to references these days, it searches the internet and gives me links to real papers that are often actually relevant and helpful.


I don’t recall that it ever mentioned if it did or not. I don’t have the search on hand but from my browser history I did the prompt engineering on 11/18 (which perhaps there is a new model since then?).

I actually repeated the prompt just now and it actually gave me the correct, opposite response. For those curious, I asked ChatGPT what turned on a gene, and it said Protein X turns on Gene Y as per -fake citation-. Asking today if Protein X turns on Gene Y ChatGPT said there is no evidence, and showed 2 real citations of factors that may turn on Gene Y.

Pretty impressed!


Share a link to the conversation.


Here you go: https://chatgpt.com/share/6754df02-95a8-8002-bc8b-59da11d276...

ChatGPT regularly searches and links to sources.


I was asking for a link to the conversation from the person I was replying to.


What a bizarre thing to request. Do you go around accusing everyone of lying?


So sorry to offend your delicate sensibilities by calling out a blatant lie from someone completely unrelated to yourself. Pretty bizarre behavior in itself to do so.


Except there are news stories of this happening to people


I suspect there being a shred of plausibility is why there’s so many people lying about it for attention.

It’s as simple as copying and pasting a link to prove it. If it is actually happening, it would benefit us all to know the facts surrounding it.


Sure, here's a link to a conversation from today, 12/9/24, which has multiple incorrect references, links, papers, journal titles, DOIs, and authors.

https://chatgpt.com/share/6757804f-3a6c-800b-b48c-ffbf144d73...

As just another example, ChatGPT said that in the Okita paper they switched media on day 3, when if you read the paper they switched the media on day 8. So not only did it fail to generate the correct reference, it also failed to accurately interpret the contents of a specific paper.


I assume top-5 consulting companies are buying to be on the bandwagon, but are the rank and file using it?


YMMV wrt your experience and luck.

I’m a pretty experienced developer and I struggle to get any useful information out of LLMs for any non-trivial task.

At my job (at an LLM-based search company) our CTO uses it on occasion (I can tell by the contortions in his AI code that aren't present in his handwritten code. I rarely need to fix the former.)

And I think our interns used it for a demo one week, but I don’t think it’s very common at my company.


Yes, daily. It's extremely useful, superior to internal search while combining the internal knowledge base with ChatGPT's


In my experience consultants are using an absolute ton of chatGPT


Do you mean Azure OpenAI? That would be a Microsoft product.


Won’t name my company, but we rely on Palantir Foundry for our data lake. And the only thing everybody wants [including Palantir itself] is to deploy at scale AI capabilities tied properly to the rest of the toolset/datasets.

The issues at the moment are a mix of IP on the data, insurance on the security of private clouds infrastructures, deals between Amazon and Microsoft/OpenAI for the proper integration of ChatGPT on AWS, all these kind of things.

But discarding the enterprise needs is in my opinion a [very] wrong assumption.


Is the Foundry business the reason for the run up of PLTR this year?

https://www.cnbc.com/quotes/PLTR


Very personal feeling, but without a datalake organized the way Foundry is organized, I don’t see how you can manage [cold] data at scale in a company [both in term of size, flexibility, semantics or R&D]. Given the fact that IT services in big companies WILL fail to build and maintain such a horribly complex stack, the walled garden nature of the Foundry stack is not so stupid.

But all that is the technical part of things. Markets do not bless products. They bless revenues. And from that perspective, I have NO CLUE.


This is what's so brilliant about the Microsoft "partnership". OpenAI gets the Microsoft enterprise legitimacy, meanwhile Microsoft can build interfaces on top of ChatGPT that they can swap out later for whatever they want when it suits them


I think this is good for Microsoft, but less good for OpenAI.

Microsoft owns the customer relationship, owns the product experience, and in many ways owns the productionisation of a model into a useful feature. They also happen to own the datacenter side as well.

Because Microsoft is the whole wrapper around OpenAI, they can also negotiate. If they think they can get a better price from Anthropic, Google (in theory), or their own internally created models, then they can pressure OpenAI to reduce prices.

OpenAI doesn't get Microsoft's enterprise legitimacy, Microsoft keep that. OpenAI just gets preferential treatment as a supplier.

On the way up the hype curve it's the folks selling shovels that make all the money, but in a market of mature productionisation at scale, it's those closest to customers who make the money.


$10B of compute credits on a capped profit deal that they can break as soon as they get AGI (i.e. the $10T invention) seems pretty favorable to OpenAI.


I’d be significantly less surprised if OpenAI never made a single $ in profit than if they somehow invented “AGI” (of course nobody has a clue what that even means so maybe there is a chance just because of that..)


That's a great deal if they reach AGI, and a terrible deal ($10bn of equity given away for vendor-locked credit) if they don't.


Fortunately for OpenAI the contract states that they get to say when they have invented AGI.

Note: they announced recently that they will have invented AGI in precisely 1,000 days.


Leaving aside the “AGI on paper” point a sibling correctly made, your point shares the same basic structure as noting that any VC investment is a terrible deal if you only 2x your valuation. You might get $0 if there is a multiple on the liquidation preference!

OpenAI are clearly going for the BHAG. You may or may not believe in AGI-soon but they do, and are all in on this bet. So they simply don’t care about the failure case (ie no AGI in the timeframe that they can maintain runway).


How so?

Still seems like owning the customer relationship like Microsoft is far more valuable.


OAI through their API probably does, but I do agree that ChatGPT is not really an enterprise product. For the company, the API is the platform play; their enterprise customers are going to be the likes of MSFT, Salesforce, Zendesk, or say Apple to power Siri. These are the ones doing the heavy lifting of selling and making an LLM product that provides value to their enterprise customers. A bit like Stripe/AWS. Whether OAI can form a durable platform (vs. their competitors or in-house LLMs) is the question here, or whether they can offer models at a cost that justifies the upsell of the AI features their customers offer.


That's why Microsoft included OpenAI access in Azure. However, their current offering is quite immature, so companies are using several pieces of infra to make it usable (for rate limiting, better authentication, etc.).


> As for ChatGPT, it's a consumer tool, not an enterprise tool. It's not really integrated into an enterprises' existing toolset, it's not integrated into their authentication, it's not integrated into their internal permissions model, the IT department can't enforce any policies on how it's used. In almost all ways it doesn't look like enterprise IT.

What according to you is the bare minimum of what it will take for it to be an enterprise tool?


SSO and enforceable privacy and IP protections would be a start. RBAC, queues, caching results, and workflow management would open a lot of doors very quickly.


It seems that ChatGPT Enterprise already has many of these:

https://openai.com/enterprise-privacy/


OpenAI's enterprise access is probably mostly happening through Azure. Azure has AI Services with access to OpenAI.
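For reference, the Azure route exposes roughly the same SDK surface as OpenAI's own API, just pointed at a company-controlled Azure resource; a minimal sketch, where the endpoint, deployment name, and API version below are placeholders:

    # Sketch: calling a model through an Azure OpenAI resource instead of
    # api.openai.com, so traffic stays inside the company's Azure tenant.
    # The endpoint, deployment name, and api_version are placeholders.
    import os
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://my-company.openai.azure.com",
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )

    response = client.chat.completions.create(
        model="my-gpt4o-deployment",  # the deployment name created in Azure
        messages=[{"role": "user", "content": "Summarize our leave policy."}],
    )
    print(response.choices[0].message.content)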


I have used it at 2 different enterprises internally; the issue is price more than anything. Enterprises definitely do want to self-host, but for frontier tech they want frontier models, for solving complicated unsolved problems or building efficiencies into complicated workflows. One company had to rip it out for a time due to price; I no longer work there, though, so I can't speak to whether it was reintegrated.


Decision making in enterprise procurement is more about whether it makes the corporation money and whether there is immediate and effective support when it stops making money.


>> internal permissions model

This isn't that big of a deal any more. A company just needs to add the application to Azure AD (now called Entra for some reason).


Is their valuation proposition self fulfilling: the more people pipe their queries to OpenAI, the more training data they have to get better?


I don't think user submitted question/answer is as useful for training as you (and many others) think. It's not useless, but it's certainly not some goldmine either considering how noisy it is (from the users) and how synthetic it is (the responses). Further, while I wouldn't put it past them to use user data in that way, there's certainly a PR/controversy cost to doing so, even if it's outlined in their ToS.


In an enterprise, long content or documents will get poured into ChatGPT if there isn't a policy limitation from the company, and that can be meaningful training data.

At the very least, there's the possibility this content can be seen by OpenAI staff in a worst case, so privacy concerns still exist.


No, because a lot of people asking you questions doesn't mean you have the answers to them. It's an opportunity to find the answers by hiring "AI trainers" and putting their responses in the training data.


Not for enterprise: The standard terms of that forbid training on queries.


Not sure how valuation comes into play here, but I doubt enterprise clients would agree to have their queries used for training.


Yeah it's a fairly standard clause in the business paid versions of SaaS products that your data isn't used to train the model. The whole thing you're selling is per-company isolation so you don't want to go back on that.

Whether your data is used for training or not is an approximation of whether you're using a tool for commercial applications, so a pretty good way to price discriminate.


Also, a replacement for search


I wonder if OpenAI can break into enterprise. I don't see much of a path for them, at least here in the EU. I'm not sure they'll have much more luck building trust around data safety than Facebook had trying to sell that corporate thing they did (still do?). But even if they did, they would still face the very real issue of having to compete with Microsoft.

I view that competition a bit like Teams vs anything else. Teams wasn't better, but it was good enough and it's "sort of free". It's the same with the Azure AI tools: they aren't free, but since you don't exactly pay list pricing in enterprise they can be fairly cheap. Copilot is obviously horrible compared to ChatGPT, but a lot of the Azure AI tooling works perfectly well and much of it integrates seamlessly with what you already have running in Azure. We recently "lost" our OCR for a document flow, and since it wasn't recoverable we needed to do something fast. Well, the Azure Document Intelligence was so easy to hook up to the flow it was ridiculous. I don't want to sound like a Microsoft commercial. I think they are a good IT business partner, but the products are also sort of a trap where all those tiny things create the perfect vendor lock-in. Which is bad, but it's also where European enterprise is at, since the "monopoly" Microsoft has on the suite of products makes it very hard to not use them. Teams again being the perfect example, since it "won" by basically being a 0 in the budget even though it isn't actually free.


Man, if they can solve that "trust" problem, OpenAI could really have an big advantage. Imagine if they were nonprofit, open source, documented all of the data that their training was being done with, or published all of their boardroom documents. That'd be a real distinguishing advantage. Somebody should start an organization like that.


It's sort of funny how close they were to that until Altman came along.


Well and also the Microsoft billions. They had a lot to do with that as well. Once you're taking that kind of money you can't really go back.


whoosh


The cyber security gatekeepers care very little about that kind of stuff. They care only about what does not get them in trouble, and AI in many enterprises is still viewed as a cyber threat.


One of the things that I find remarkable at my work is that they block ChatGPT because they're afraid of data leaking. But Google Translate has been promoted for years, and we don't really do business with Google. We're a Microsoft shop. Kinda double standards.


I mean, it was probably a jibe at OpenAI's transition to for-profit, but you're absolutely right.

Enterprise decision makers care about compliance, certifications and “general market image” (which probably has a proper English word). OpenAI has none of that, and they will compete with companies that do.


Sometimes I wish Apple did more for business use cases. The same https://security.apple.com/blog/private-cloud-compute/ tech that will provide auditable isolation for consumer user sessions would be incredibly welcome in a world where every other company has proven a desire to monetize your data.


Teams winning on price instead of quality is very telling of the state of business. Your #1/#2 communication tool being regarded as a cost to be saved upon.


It’s “good enough” and integrates into existing Microsoft solutions (just Outlook meeting request integration, for example), and the competition isn’t dramatically better, more like a side-grade in terms of better usability but less integration.


You still can't copy a picture out of a teams chat and paste it into an office document without jumping through hoops. It's utterly horrible. The only thing that prevents people from complaining about it is that it's completely in line with the rest of the office drone experience.


In my experience Teams is mostly used for video conferencing (i.e. as a Zoom alternative), and for chats a different tool is used. Most places already had chat systems set up (Slack, Mattermost, whatever) (or standardize on email anyway), before video conferencing became ubiquitous due to the pandemic.


I just tried this and it worked fine. Right clicked on image, clicked "copy image" then pasted into a word doc.


And yet Teams allows me to seamlessly video call a coworker. Whereas in Slack you have this ridiculous "huddle" thing where all video call participants show up in a tiny tiny rectangle and you can't see them properly. Even a screen share only shows up in a tiny rectangle. There's no way to increase its size. What's even the point of having this feature when you can't see anything properly because everything is so small?

Seriously, I'm not a fan of Teams, but the sad state of video calls in Slack, even in 2024, seriously ruins it for me. This is the one thing — one important thing — that Teams is better at than Slack.


> Even a screen share only shows up in a tiny rectangle. There's no way to increase its size.

You can resize it.


How? There are no drag handlers. No popup menu for resize. Doubleclicking just opens a side pane.


There are, around the main huddle window during screensharing.


Consider yourself lucky; my team uses Skype for Business. It's Skype except it can't do video calls, or calls at all. Just a terrible messaging client with zero features!


Skype for Business is deprecated.


Name a strictly better corporate communication tool than Teams


I'm not sure you can, considering how broad a term "better" is. I do know a lot of employees in a lot of non-tech organisations here in Denmark wish they could still use Zoom.

Even in my own organisation Teams isn’t exactly a beloved platform. The whole “Teams” part of it can actually solve a lot of the issues our employees have with sharing documents, having chats located in relation to a project and so on, but they just don’t use it because they hate it.


Email, Jitsi, Matrix/Element, many of them, e2e encrypted and on-premise. No serious company (outside of the US) which really cares about its own data privacy would go for MS Teams, which can't even offer a decent user experience most of the time.


Slack. No question.


> I don’t see much of a path for them, at least here in the EU. Even if they do manage to build some sort of trust as far as data safety goes

They are already selling (API) plans, well, them and MS Azure, with higher trust guarantees. And companies are using it

Yes, if they deploy a datacenter in the EU or close by, it will be a no-brainer (kinda pun intended)


> I wonder if OpenAI can break into enterprise. I don’t see much of a path for them, at least here in the EU.

Uhh they're already here. Under the name CoPilot which is really just ChatGPT under the hood.

Microsoft launders the missing trust in OpenAI :)

But why do you think copilot is worse? It's really just the same engine (gpt-4o right now) with some RAG grounding based on your SharePoint documents. Speaking about copilot for M365 here.

I don't think it's a great service yet, it's still very early and flawed. But so is ChatGPT.


Agreed on the strategy questions. It's interesting to tie back to IBM; my first reaction was that openai has more consumer connectivity than IBM did in the desktop era, but I'm not sure that's true. I guess what is true is that IBM passed over the "IBM Compatible" -> "MS DOS Compatible" business quite quickly in the mid 80s; seemingly overnight we had the death of all minicomputer companies and the rise of PC desktop companies.

I agree that if you're sure you have a commodity product, then you should make sure you're in the driver seat with those that will pay more, and also try and grind less effective players out. (As a strategy assessment, not a moral one).

You could think of Apple under JLG and then being handed back to Jobs as precisely being two perspectives on the answer to "does Apple have a commodity product?" Gassée thought it did, and we had the era of Apple OEMs, system integrators, other boxes running Apple software, and Jobs thought it did not; essentially his first act was to kill those deals.


The new pricing tier suggests they're taking the Jobs approach - betting that their technology integration and reliability will justify premium positioning. But they face more intense commoditization pressure than either IBM or Apple did, given the rapid advancement of open-source models.

The critical question is timing - if they wait too long to establish their enterprise position, they risk being overtaken by commoditization as IBM was. Move too aggressively, and they might prematurely abandon advantages in the broader market, as Apple nearly did under Gassée.

Threading the needle. I don't envy their position here. Especially with Musk in the Trump administration.


The Apple partnership and iOS integration seems pretty damn big for them - that really corners a huge portion of the consumer market.

Agreed on enterprise - Microsoft would have to roll out policies and integration with their core products at a pace faster than they usually do (Azure AD for example still pales in comparison to legacy AD feature-wise - I am continually amazed they do not prioritise this more)


They don’t make any money from the Apple deal.


Except I had to sign in to OpenAI when setting up Apple Intelligence. Even though Apple Intelligence is doing almost nothing useful for me right now, at least OpenAI's numbers go up.

Right now Gemini Pro is best for email, docs, calendar integration.

That said, ChatGPT Plus is a good product and I might spring for Pro for a month or two.


Non-paying user numbers are only good when selling, and who could afford to buy OpenAI?


You don't have to sign into ChatGPT to use it with Siri.


Did you not sign in, and still get the occasional dialog box asking, "Ok to use ChatGPT?"


It'll send that anonymously. I think you only need to sign in if you want to continue the conversation on the web.


ChatGPT through Siri/Apple Intelligence is a joke compared to using ChatGPT's iPhone app. Siri is still a dumb one trick pony after 13 years of being on the market.

Supposedly Apple won't be able to offer a Siri LLM that acts like ChatGPT's iPhone app until 2026. That gives Apple's current and new competitors a head start. Maybe ChatGPT and Microsoft could release an AI Phone. I'd drop Apple quickly if that becomes a reality.


It’s not just opensource. It’s also Claude, Meta and Google, of which the latter have real estate (social media and browser)


Yes and Anthropic, Google, Amazon are also facing commoditization pressure from open-source


Well one key difference is that Google and Amazon are cloud operators, they will still benefit from selling the compute that open source models run on.


For sure. If I were in charge of AI for the US, I'd prioritize having a known good and best-in-class LLM available not least for national security reasons; OAI put someone on gov rel about a year ago, beltway insider type, and they have been selling aggressively. Feels like most of the federal procurement is going to want to go to using primes for this stuff, or if OpenAI and Anthropic can sell successfully, fine.

Grok winning the Federal bid is an interesting possible outcome though. I think that, slightly de-Elon-ed, the messaging that it's been trained to be more politically neutral (I realize that this is a large step from how it's messaged) might be a real factor in the next few years in the US. Should be interesting!

Fudged71 - you want to predict openai value and importance in 2029? We'll still both be on HN I'm sure. I'm going to predict it's a dominant player, and I'll go contra-Gwern, and say that it will still be known as best-in-class product delivered AI, whether or not an Anthropic or other company has best-in-class LLM tech. Basically, I think they'll make it and sustain.


Somehow I missed the Anduril partnership announcement. I agree with you. National Security relationships in particular creates a moat that’s hard to replicate even with superior technology.

It seems possible OpenAI could maintain dominance in government/institutional markets while facing more competition in commercial segments, similar to how defense contractors operate.


Now we just need to find someone who disagrees with us and we can make a long bet.

It feels strange to say but I think that the product moat looks harder than the LLM moat for the top 5 teams right now. I'm surprised I think that, but I've assessed so many L and MLM models in the last 18 months, and they keep getting better, albeit more slowly, and they keep getting smaller while they lose less quality, and tooling keeps getting better on them.

At the same time, all the product infra around using, integrating, safety, API support, enterprise contracts, data security, threat analysis, all that is expensive and hard for startups in a way that spending $50mm with a cloud AI infra company is not hard.

Altman's new head of product is reputed to be excellent as well, so it will be super interesting to see where this all goes.


IBM was legally compelled to spin that off


One of the main issues that enterprise AI has is the data in large corporations. It's typically a nightmare of fiefdoms and filesystems. I'm sure that a lot of companies would love to use AI more, both internally and commercially. But first they'd have to wrangle their own systems so that OpenAI can ingest the data at all.

Unfortunately, those are 5+ year projects for a lot of F500 companies. And they'll have to burn a lot of political capital to get the internal systems under control. Meaning that the CXO that does get the SQL server up and running and has the clout to do something about non-compliance, that person is going to be hated internally. And then if it's ever finished? That whole team is gonna be let go too. And it'll all just then rot, if not implode.

The AI boom for corporations is really going to let people know who is swimming naked when it comes to internal data orderliness.

Like, you want to be the person that sells shovels in the AI boom here for enterprise? Be the 'Cleaning Lady' for company data and non-compliance. Go in, kick butts, clean it all up, be hated, leave with a fat check.


You just hit the chatGPT api for every row of data. Obviously. (Only 70% joking.)


Guys, ladies, meet Palantir Foundry ! #micDrop


Glean are already well established in that space.


Did not know that stack, thanks. From my perspective as a data architect, I am really focused on the link between the data sources and the data lake, and the proper integration of heterogenous data into a “single” knowledge graph. For Palantir, it is not very difficult to learn their way of working [their Pipeline Builder feeds a massive spark cluster, and OntologyManager maintains a sync between Spark and a graph database. Their other productivity tools then rely on either one data lake and/or the other]. I wonder how Glean handles the datalake part of their stack. [scalability, refresh rate, etc]


ChatGPT's analogy is more like Google. People use Google enough that they ain't gonna switch unless it's a quantum leap better + with scale. On the API side things could get commoditized, but it's more than just having a slightly better LLM in the benchmarks.


I would say this differently.

There exists no future where OpenAI both sells models through API and has its own consumer product. They will have to pick one of these things to bet the company on.


That's not necessarily true. There are many companies that have both end-user products and B2B products they sell. There are a million specific use cases that OpenAI won't build specific products for.

Think Amazon that has both AWS and the retail business. There's a lot of value in providing both.


There is no real future in AI long term.

Its use caustically destroys more than it creates. It is a worthy successor to Pandora's box.


AI can be used for financial gain, to influence and lie to people, to simulate human connection, to generate infinite content for consumption,... at scale.

It won't go anywhere until _we_ change.


In the early days of ChatGPT, I'd get constantly capped, every single day, even on the paid plan. At the time I was sending them messages, begging to charge me $200 to let me use it unlimited.

Finally!..


The enterprise surface area that OpenAI seems to be targeting is very small. The cost curve looks similar to classic cloud providers, but gets very steep much faster. We started on their API and then moved out of the OpenAI ecosystem within ~2 years as costs grew fast and we saw equivalent or better performance with much cheaper and/or open-source models, combined with pretty modest hardware. Unless they can pull a bunch of Netflix-style deals, the economics here will not work out.


The "open source nature" this time is different. "Open source" models are not actually open source, in the sense that the community can't contribute to their development. At best they're just proprietary freeware. Thus, the continuity of "open source" models depends purely on how long their sponsors sustain funding. If Meta or Alibaba or Tencent decide tomorrow that they're no longer going to fund this stuff, then we're in real trouble, much more than when Red Hat drops the ball.

I'd say Meta is the most important player here. Pretty much all the "open source" models are built on Llama in one way or another. The only reason Llama exists is because Meta wants to commoditize AI in order to prevent the likes of OpenAI from overtaking them later. If Meta one day no longer believes in this strategy for whatever reason, then everybody is in serious trouble.


> OpenAI is racing against two clocks: the commoditization clock (how quickly open-source alternatives catch up) and the monetization clock (their need to generate substantial revenue to justify their valuation).

Also important to recognize that those clocks aren't entirely separate. The monetization timeline is shorter if investors perceive that commoditization makes future monetization less certain; conversely, if investors perceive a strong moat against commoditization, new financing without profitable monetization is practical, as long as the market believes that investment in growth now means a sufficient increase in monetization down the road.


Whatever happened to IBM Watson? IBM wishes it had taken off like ChatGPT did.


Has anyone heard of or seen it used anywhere? I was in-house when it launched to big fanfare by upper management, and the vast majority of the company was tasked with creating team projects utilizing Watson.


Watson was a pre-LLM technology, an evolution of IBM's experience with the expert systems which they believed would rule the roost in AI -- until transformers blew all that away.


And the logic catch-up clock: how fast people catch on that we don't have Skynet within 2 years, but rather a glorified Google search for the next 20 years.


Am I the only one who's getting annoyed by LLMs being marketed as competent search engines? That's not what they've been designed for, and they have been repeatedly bad at it.


Yeah they're totally not designed for that. I'm also surprised that companies that surely know better market it as such.

Combined with a search engine and AI summarisation, sure. That works well. But barebones, no. You can never be sure whether it's hallucinating or not.


> the commoditization clock (how quickly open-source alternatives catch up)

I believe we are already there at least for the average person.

Using Ollama I can run different LLMs locally that are good enough for what I want to do. That's on a 32GB M1 laptop. No more having to pay someone to get results.
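For anyone who hasn't tried it, the local setup really is about this simple; a minimal sketch against Ollama's default local HTTP endpoint (the model name is just whichever one you've pulled):

    # Sketch: querying a locally running Ollama server (default port 11434).
    # Assumes something like `ollama pull llama3.2` has already been run.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",
            "prompt": "Draft a polite follow-up email about a late invoice.",
            "stream": False,  # one JSON object instead of a token stream
        },
        timeout=120,
    )
    print(resp.json()["response"])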

For development, PyCharm Pro's latest LLM autocomplete is just short of writing everything for me.

I agree with you in relation to the enterprise.


Claude has much better enterprise momentum and sits in AWS support while OpenAI is fighting their own supplier / Big Tech investor.


"whether large organizations will prioritize the kind of integrated, reliable, and "safe" AI solutions"

While safe in terms of output quality control, SaaS is not safe in terms of data control. Meta's Llama is the winner in any scenario where it would be ridiculous to send user data to a third party.


Yes, but how can this strategy work, and who would choose ChatGPT at this point, when there are so many alternatives, some better (Anthropic), some just as good but way cheaper (Amazon Nova) and some excellent and open-source?


Microsoft is their path into the enterprise. You can use their so-so enterprise support directly or have all the enterprise features you could want via Azure.

They also are still leading in the enterprise space: https://www.linkedin.com/posts/maggax_market-share-of-openai...


They have a third clock: schools and employers that try to forbid its use.


AI's utility isn't fully locked into large enterprises


There really aren't a lot of open-source large language models with that capability. The only game changer so far has been Meta open-sourcing Llama, and that's about it for models of that caliber.


I actually pay 166 Euros a month for Claude Teams. Five seats. And I only use one. For myself. Why do I pay so much? Because the normal paid version (20 USD a month) interrupts the chats after a dozen questions and wants me to wait a few hours until I can use it again. But the Teams plan gives me way more questions.

But why do I pay that much? Because Claude, in combination with the Projects feature, where I can upload two dozen or more files, PDFs, text, and give it a context, and then ask questions in this specific context over a period of a week or longer, come back to it and continue the inquiry: all of this gives me superpowers. It feels like a handful of researchers at my fingertips that I can brainstorm with, that I can ask to review the documents and come up with answers to my questions. All of this is unbelievably powerful.

I'd be OK with 40 or 50 USD a month for one user, but alas Claude won't offer it. So I pay 166 Euros for five seats and use one. Because it saves me a ton of work.


Kagi Ultimate (US$25/mo) includes unlimited use of all the Anthropic models.

Full disclosure: I participated in Kagi's crowdfund, so I have some financial stake in the company, but I mainly participated because I'm an enthusiastic customer.


I'm uninformed about this, it may just be superstition, but my feeling while using Kagi in this way is that after using it for a few hours it gets a bit more forgetful. I come back the next day and it's smart again, for while. It's as if there's some kind of soft throttling going on in the background.

I'm an enthusiastic customer nonetheless, but it is curious.


I noticed this too! It's dramatic in the same chat. I'll come back the next day, and even though I still have the full convo history, it's as if it completely forgot all my earlier instructions.


Makes sense. Keeping the conversation implies that each new message carries the whole history again. You need to create new chats from time to time, or throttle down to a different model...
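You can see the statelessness at the API level: every turn resends the whole transcript, so long threads cost more and push earlier instructions toward the edge of the context window. A provider-agnostic sketch, where send() stands in for whatever chat API is being called:

    # Sketch: a chat "conversation" is just a growing list that gets resent in
    # full on every turn. send() is a placeholder for any chat-completion API.
    def send(messages):
        # Pretend model call; a real one would POST `messages` to a provider.
        return f"(reply based on {len(messages)} messages of context)"

    history = [{"role": "system", "content": "You are a helpful assistant."}]

    for user_turn in ["Summarize doc A.", "Now compare it with doc B.", "Shorter."]:
        history.append({"role": "user", "content": user_turn})
        reply = send(history)  # the entire history goes over the wire each time
        history.append({"role": "assistant", "content": reply})
        print(len(history), "messages in context ->", reply)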


This is my biggest gripe with these LLMs. I primarily use Claude, and it exhibits the same described behavior. I'll find myself in a flow state and then somewhere around hour 3 it starts to pretend like it isn't capable of completing specific tasks that it had been performing for hours, days, weeks. For instance, I'm working on creating a few LLCs with their requisite social media handles and domain registrations. I _used_ to be able to ask Claude to check all US State LLC registrations, all major TLD domain registrations, and USPTO against particular terms and similar derivations. Then one day it just decided to stop doing this. And it tells me it can't search the web or whatever. Which is bullshit because I was verifying all of this data and ensuring it wasn't hallucinating - which it never was.


Could it be that you're running out of available context in the thread you're in?


Doubtful. I started new threads using carbon-copy prompts. I'll research some more to make sure I'm not missing anything, though.


Did you ever read Accelerando? I think it involved a large number of machine generated LLCs...


No, but I'll give the wikipedia summary a gander :)


Is that within the same chat?


The flow lately has been transforming test cases to accommodate interface changes, so I'm not asking it to remember something from several hours ago, I'm just asking it to make the "same" transformation from the previous prompt, except now to a different input.

It struggles with cases that exceed 1000 lines or so. Not that it loses track entirely at that size, it just starts making dumb mistakes.

Then after about 2 or 3 hours, the size at which it starts to struggle drops to maybe 500. A new chat doesn't seem to help, but who can say, it's a difficult thing to quantify. After 12 hours, both me and the AI are feeling fresh again. Or maybe it's just me, idk.

And if you're about to suggest that the real problem here is that there's so much tedious filler in these test cases that even an AI gets bored with them... Yes, yes it is.


> Kagi Ultimate (US$25/mo) includes unlimited use of all the Anthropic models.

What am I losing here if I switch over to this from my current Claude subscription?


You'll also lose the opportunity to use the MCP integration of Claude Desktop. It's still early on but this has huge potential


Claude projects mostly. Kagi’s assistant AI is a basic chat bot interface.


But why would Claude be offered cheaper through a third party?


It probably isn’t cheaper for Kagi per token but I assume most people don’t use up as much as they can, like with most other subscriptions.

I.e. I’ve been an Ultimate subscriber since they launched the plan and I rarely use the assistant feature because I’ve got a subscription to ChatGPT and Claude. I only use it when I want to query Llama, Gemini, or Mistral models which I don’t want to subscribe to or create API keys for.


Thanks for sponsoring my extensive use of Claude via Kagi.


Thanks for the tip! Now I'm a Kagi user too.


How would you rate Kagi Ultimate vs Arc search? IE is it scraping relevant websites live and summarising them? Or is it just access to ChatGPT and other models (with their old data).

At some point I'm going to subscribe to Kagi again (once I have a job) so be interested to see how it rates.


I've never tried Arc search, so I couldn't say.

I think it's all the LLMs + some Kagi-specific intelligence on top because you can flip web search on and off for all the chats.


I presume no access to Anthropic project?


I bet you never get tired of being told LLMs are just statistical computational curiosities.


There are people like that. We don't know what's up with them.


It's pretty easy to explain. You see, they're unable to produce a response that isn't in their training data. They're stochastic parrots.


They extract concepts from their training data and can combine concepts to produce output that isn't part of their training set, but they do require those concepts to be in their training data. So you can ask them to make a picture of your favorite character fighting mecha on an alien planet and it will produce a new image, as long as your favorite character is in their training set. But the extent it imagines an alien planet or what counts as mecha is limited by the input it is trained on, which is where a human artist can provide much more creativity.

You can also expand it by adding in more concepts to better specify things. For example you can specify the mecha look like alphabet characters while the alien planet expresses the randomness of prime numbers and that might influence the AI to produce a more unique image as you are now getting into really weird combinations of concepts (and combinations that might actually make no sense if you think too much about them), but you also greatly increase the chance of getting trash output as the AI can no longer map the feature space back to an image that mirrors anything like what a human would interpret as having a similar feature space.


The paper that coined the term "stochastic parrots" would not agree with the claim that LLMs are "unable to produce a response that isn't in their training data". And the research has advanced a _long_ way since then.

[1]: Bender, Emily M., et al. "On the dangers of stochastic parrots: Can language models be too big?." Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 2021.



/facepalm. Woosh indeed. Can I blame pronoun confusion? (Not to mention this misunderstanding kicked off a farcically unproductive ensuing discussion.)


it's just further evidence that we're also stochastic parrots :)


That is why we invented God.


Woosh.


Please clarify what you mean. On what basis do you say this?

Unless I’m misunderstanding, I disagree. If you reply, I’ll bet I can convince you.


Unless you have full access to the entirety of their training data, you can try to convince all you want, but you're just grasping at straws.

LLMs are stochastic parrots incapable of thought or reasoning. Even their chains of thoughts are part of the training data.


When combined with intellectual honesty and curiosity, the best LLMs can be powerful tools for checking argumentation. (I personally recommend Claude 3.5 Sonnet.) I pasted in the conversation history and here is what it said:

> Their position is falsifiable through simple examples: LLMs can perform arithmetic on numbers that weren't in training data, compose responses about current events post-training, and generate novel combinations of ideas.

Spot on. It would take a lot of editing for me to speak as concisely and accurately!


Your use of the word stochastic here negates what you are saying.

Stochastic generative models can generate new and correct data if the distribution is right. It's in the definition.


> you can try to convince all you want, but you're just grasping at straws.

After coming back to this to see how the conversation has evolved (it hasn't), I offer this guess: the problem isn't at the object level (i.e. what ML research has to say on this) nor my willingness to engage. A key factor seems to a lack of interest on the other end of the conversation.


Most importantly, I'm happy to learn and/or be shown to be mistaken.

Based on my study (not at the Ph.D. level, but still quite intensive), I am confident the comment above is both wrong and poorly framed. Why? Phrases like "incapable of thought" and "stochastic parrots" are red flags to me. In my experience, people who study LLM systems are wary of using such brash phrases. They tend to move the conversation away from understanding towards combativeness and/or confusion.

Being this direct might sound brusque and/or unpersuasive. My top concern at this point, not knowing you, is that you might not prioritize learning and careful discussion. If you want to continue discussing, here is what I suggest:

First, are you familiar with the double-crux technique? If not, the CFAR page is a good start.

Second, please share three papers (or high-quality writing from experts): one that supports your claim, one that opposes it, and one that attempts to synthesize.

Third, perhaps we can find a better forum.


I'll try again... Can you (or anyone) define "thought" in a way that is helpful?

Some other intelligent social animals have slightly different brains, and it seems very likely they "think" as well. Do we want to define "thinking" in some relative manner?

Say you pick a definition requiring an isomorphism to thoughts as generated by a human brain. Then, by definition, you can't have thoughts unless you prove the isomorphism. How are you going to do that? Inspection? In theory, some suitable emulation of a brain is needed. You might get close with whole-brain emulation. But how do you know when your emulation is good enough? What level of detail is sufficient?

What kinds of definitions of "thought" remains?

Perhaps something related to consciousness? Where is this kind of definition going to get us? Talking about consciousness is hard.

Anil Seth (and others) talks about consciousness better than most, for what it is worth -- he does it by getting more detailed and specific. See also: integrated information theory.

By writing at some length, I hope to show that using loose sketches of concepts using words such as "thoughts" or "thinking" doesn't advance a substantive conversation. More depth is needed.

Meta: To advance the conversation, it takes time to elaborate and engage. It isn't easy. An easier way out is pressing the down triangle, but that is too often meager and fleeting protection for a brittle ego and/or a fixated level of understanding.


Can you?


Sometimes, I get this absolute stroke of brilliance for this idea of a thing I want to make and it's gonna make me super rich, and then I go on Google, and find out that there's already been a Kickstarter for it and it's been successful, and it's now a product I can just buy.

So apparently not.


I feel like everyone missed your joke :)


at least you did!


No, but then again you're not paying me $20 per month while I pretend I have absolute knowledge.

You can, however, get the same human experience by contracting a consulting company that will bill you $20 000 per month and lie to you about having absolute knowledge.


Unironically, thank you for sharing this strategy. I get throttled a lot, and I'm happy to pay to remove those frustrating limits.


Sounds like you two could split the cost of the family plan-- ahem the team plan.


and share private questions with each other


Training with Transparency


Pay as you go using the Anthropic API and an open-source UI frontend like LibreChat would be a lot cheaper, I suspect.
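For anyone weighing that option, pay-as-you-go is just the Messages API plus whatever frontend you like; a minimal sketch with the Anthropic Python SDK (the model name is an example, check their docs for current ones):

    # Sketch: direct pay-per-token use of the Anthropic API, which a self-hosted
    # frontend like LibreChat can wrap. Needs ANTHROPIC_API_KEY in the env.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # pick whichever tier you need
        max_tokens=1024,
        messages=[{"role": "user", "content": "Summarize these project notes."}],
    )
    print(message.content[0].text)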


Depends on how much context he loads up into the chat. The web version is quite generous when compared to the API, from my estimations.


You.com (search engine and LLM aggregator) has a team plan for $25/month.

https://you.com/plans


I have ChatGPT ($20/month tier) and Claude and I absolutely see this use case. Claude is great but I love long threads where I can have it help me with a series of related problems over the course of a day. I'm rarely doing a one-shot. Hitting the limits is super frustrating.

So I understand the unlimited use case and honestly am considering shelling out for the o1 unlimited tier, if o1 is useful enough.

A theoretical app subscription for $200/month feels expensive. Having the equivalent of a smart employee working beside me all day for $200/month feels like a deal.


Yep, I have 2 accounts I use because I kept hitting limits. I was going to do the Teams to get the 5x window, but I got instantly banned when clicking the teams button on a new account, so I ended up sticking with 2 separate accounts. It's a bit of a pain, but I'm used to it. My other account has since been unbanned, but I haven't needed it lately as I finished most of my coding.


Have you tried NotebookLM for something like this?


Isn’t that Google’s garbage models only?


Whats garbage about it?


1. Hallucinates more than any other model (Gemini Flash/Pro 1, 1.5, 1121).

2. Useless with large context. Ignores, forgets, etc.

3. Terrible code and code understanding.

Also this is me hoping it would be good and looking at it with rose tinted glasses because I could use cloud credits to run it and save money.


NotebookLM is designed for a distinct use case compared to using Gemini's models in a general chat-style interface. It's specifically geared towards research and operates primarily as a RAG system for documents you upload.

I’ve used it extensively to cross-reference and analyse academic papers, and the performance has been excellent so far. While this is just my personal experience (YMMV), it’s far more reliable and focused than Gemini when it comes to this specific use case. I've rarely experienced a hallucination with it. But perhaps that's the way I'm using it.


Can you detail how you use NotebookLM for academic papers?

I've looked into it, but as usual with LLM I feel like I'm not getting much out of it due to lack of imagination when it comes to prompting.


Have you tried LibreChat https://www.librechat.ai/ and just use it with your own API keys? You pay for what you use and can use and switch between all major model providers


Why not use the API? You can ask as many questions as you can pay for.


I haven’t implemented this yet, but I’m planning on doing a fallback to other Claude models when hitting API limits; IIUC they rate-limit per model.


Do you not have any friends to share that with? Or share a family cell phone plan or Netflix with?


They're probably an adult, so I would guess not.


Out of curiosity, why don't you use NotebookLM for the same functionality?


Are the limits applied to the org or to each individual user?


Individual users


And how often is it wrong?


Try typingmind.com with the API


A great middle ground


The argument of more compute power for this plan can be true, but this is also a pricing tactic known as the decoy effect or anchoring. Here's how it works:

1. A company introduces a high-priced option (the "decoy"), often not intended to be the best value for most customers.

2. This premium option makes the other plans seem like better deals in comparison, nudging customers toward the one the company actually wants to sell.

In ChatGPT's case, that is:

Option A: Basic Plan - Free

Option B: Plus Plan - $20/month

Option C: Pro Plan - $200/month

Even if the company has no intention of selling the Pro Plan, its presence makes the Plus Plan seem more reasonably priced and valuable.

While not inherently unethical, the decoy effect can be seen as manipulative if it exploits customers’ biases or lacks transparency about the true value of each plan.


Of course this breaks down once you have a competitor like Anthropic, serving similarly-priced Plan A and B for their equivalently powerful models; adding a more expensive decoy plan C doesn't help OpenAI when their plan B pricing is primarily compared against Anthropic's plan B.


Leadership at this crop of tech companies is more like followership. Whether it's 'no politics', or sudden layoffs, or 'founder mode', or 'work from home'... one CEO has an idea and three dozen other CEOs unthinkingly adopt it.

Several comments in this thread have used Anthropic's lower pricing as a criticism, but it's probably moot: a month from now Anthropic will release its own $200 model.


Except Anthropic actually has the ability to deliver $200/month in value whereas OpenAI lost the script a long time ago.

Not a single one of OpenAI’s models can compete with the Claude series, it’s embarrassing.


> Not a single one of OpenAI’s models can compete with the Claude series, it’s embarrassing.

Do you happen to have comparisons available for o1-pro or even o1 (non-preview) that you could share, since you seem to have tried them all?


Even o1?


As Nvidia's CEO likes to say, the price is set by the second best.

From an API standpoint, it seems like enterprises are currently split between anthropic and ChatGPT and most are willing to use substitutes. For the consumer, ChatGPT is the clear favorite (better branding, better iPhone app)


It might not affect whether people decide to use ChatGPT over Claude, but it could get more people to upgrade from their free plan.


An example of this is something I learned from a former employee who went to work for Encyclopedia Britannica "back in the day". I actually invited him back to our office so I could understand and learn exactly what he had been taught (noting of course this was before the internet, when info like that was not as available...)

So they charged (as I recall from what he told me; I could be off) something like $450 for shipping the books (I don't recall the actual amount, but it seemed high at the time).

So the salesman is taught to open the pitch with some "gold-plated" set of encyclopedias costing, at the time, let's say $40,000.

The potential buyer laughs, and the salesman then says "plus $450 for shipping!!!".

They then move on to the more reasonable versions costing let's say $1000 or whatever.

As a result of seeing the high-priced version first (in addition to the positioning you are talking about), the customer is set up to accept the shipping charge (which was relatively high).


This is called price anchoring.


This is also known as the Door-in-the-face technique[1] in social psychology.

[1]: https://en.m.wikipedia.org/wiki/Door-in-the-face_technique


That’s a really basic sales technique much older than the 1975 study. I wonder if it went under a different name or this was a case of studying and then publishing something that was already well-known outside of academia.


Wouldn’t this be an example of anchoring?

https://en.wikipedia.org/wiki/Anchoring_effect


Believe it or not, it can be multiple things at once


I use GPT-4 because 4o is inferior. I keep trying 4o but it consistently underperforms. GPT-4 is not working as hard anymore compared to a few months ago. If this release said it allows GPT-4 more processing time to find more answers and filter them, I’d then see transparency of service and happily pay the money. As it is I’ll still give it a try and figure it out, but I’d like to live in a world where companies can be honest about their missteps. As it is I have to live in this constructed reality that makes sense to me given the evidence despite what people claim. Am I fooling/gaslighting myself?? Who knows?


Glad I'm not the only one. I see 4o as a lot more of a sidegrade. At this point I mix them up and I legitimately can't tell, sometimes I get bad responses from 4, sometimes 4o.

Responses from GPT-4 sound more like AI, but I seemingly haven't had as many issues as with 4o.

Also, the tendency of 4o to just spit out a ton of information, or rewrite the entire code, is frustrating.


GPT-4o just fails to follow instructions and starts looping for me. Sonnet 3.5 never does.


Yes the looping. They should make and sell a squishy mascot you could order, something in the style of Clippy, so that when it loops, I could pluck it off my monitor and punch it in the face.


But you are not getting nothing: there is actual value if you are able to use it that much and are consistently hitting the limits of the $20 plan.


Why doesn't Pro include longer context windows?

I'm a Plus member, and the biggest limitation I am running into by far is the maximum length of the context window. I'm having context fall out of scope throughout the conversation, or not being able to give it a large document that I can then interrogate.

So if I go from paying $20/month for 32,000 tokens, to $200/month for Pro, I expect something more akin to Enterprise's 128,000 tokens or MORE. But they don't even discuss the context window AT ALL.

For anyone else out there looking to build a competitor I STRONGLY recommend you consider the context window as a major differentiator. Let me give you an example of a usage which ChatGPT just simply cannot do very well today: Dump a XML file into it, then ask it questions about that file. You can attach files to ChatGPT, but it is basically pointless because it isn't able to view the entire file at once due to, again, limited context windows.


Pro does have longer context windows, specifically 128k. Take a look at the pricing page for this information: https://openai.com/chatgpt/pricing/


Thanks for this. I’m surprised they haven’t made this more obvious in their release and other documentation


o1 pro failed to accept 121,903 tokens of input into the chat (Claude took it just fine)


Seems like something that would be worth pinging OpenAI about because it's a pretty important claim that they are making on their pricing page! Unless it's a matter of counting tokens differently.


ChatGPT and GPT4o APIs have 128K window as well. The 32K is from the days of GPT4.


According to the pricing page, 32K context is for Plus users and 128K context is for Pro users. Not disagreeing with you, just adding context for readers that while you are explaining that the 4o API has 128K window, the 4o ChatGPT agent appears to have varying context depending on account type.


It's disappointing because the o1-preview had 128k context length. At least on the API. So they nerfed it and made the original product $200/month.


The longer the context the more backtracking it needs to do. It gets exponentially more expensive. You can increase it a little, but not enough to solve the problem.

Instead you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context.
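
To make that concrete, here is a minimal sketch of the chunk-and-retrieve idea. It assumes a local sentence-transformers embedding model and a plain in-memory index rather than any particular vector database; the model name, chunk size, and file name are illustrative, not recommendations.

    # Toy retrieval step: embed chunks, keep only the most relevant ones for the prompt.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed local embedding model

    def chunks(text, size=800):
        return [text[i:i + size] for i in range(0, len(text), size)]

    def top_k(document, question, k=3):
        parts = chunks(document)
        doc_vecs = embedder.encode(parts, normalize_embeddings=True)
        q_vec = embedder.encode([question], normalize_embeddings=True)[0]
        scores = doc_vecs @ q_vec  # cosine similarity, since vectors are unit-length
        return [parts[i] for i in np.argsort(scores)[::-1][:k]]

    context = "\n---\n".join(top_k(open("big_file.xml").read(),
                                   "What does the billing section configure?"))
    # 'context' goes into the LLM prompt instead of the whole document.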

LLM is a cool tool. You need to build around it. OpenAI should start shipping these other components so people can build their solutions and make their money selling shovels.

Instead they want the end user to pay them to use the LLM without any custom tooling around it. I don't think that's a winning strategy.


This isn't true.

Transformer architectures generally take quadratic time wrt sequence length, not exponential. Architectural innovations like flash attention also mitigate this somewhat.

Backtracking isn't involved, transformers are feedforward.

Google advertises support for 128k tokens, with 2M-token sequences available to folks who pay the big bucks: https://blog.google/technology/ai/google-gemini-next-generat...


During inference time, yes, but training time does scale exponentially as backpropagation still has to happen.

You can’t use fancy flash attention tricks either.


No, additional context does not cause exponential slowdowns, and you absolutely can use FlashAttention tricks during training; I'm doing it right now. Transformers are not RNNs: they are not unrolled across timesteps, so the backpropagation path for a 1,000,000-context LLM is not any longer than for a 100-context LLM of the same size. The only thing that is larger is the self-attention calculation, which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self-attention calculations. These calculations can be further parallelized using tricks like ring attention to distribute very large attention calculations over many nodes. This is how Google trained their 10M-context version of Gemini.
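
A quick back-of-the-envelope way to see the "quadratic, not exponential" point; the shapes below are made-up assumptions, and only the growth rate matters:

    # Self-attention compute is roughly 4 * n_ctx^2 * d_model FLOPs per layer
    # (QK^T plus the attention-times-V matmul). Illustrative shapes only.
    d_model, n_layers = 8192, 80

    def attn_flops(n_ctx):
        return 4 * n_ctx ** 2 * d_model * n_layers

    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} tokens: {attn_flops(n):.2e} attention FLOPs")
    # Every 10x in context costs ~100x in attention compute: painful,
    # but nothing like the blow-up that "exponential" would imply.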


So why are the context windows so "small", then? It would seem that if the cost was not so great, then having a larger context window would give an advantage over the competition.


The cost for both training and inference is roughly quadratic, while for the vast majority of users the marginal utility of additional context is sharply diminishing. For 99% of ChatGPT users something like 8192 tokens, or about 20 pages of context, would be plenty. Companies have to balance the cost of training and serving models. Google did train an uber-long-context version of Gemini, but since Gemini itself fundamentally was not better than GPT-4 or Claude, this didn't really matter much: so few people actually benefited from such a niche advantage that it didn't shift the playing field in their favor.


Marginal utility only drops because effective context is really bad, i.e. most models still vastly prefer the first things they see and those "needle in a haystack" tests are misleading in that they convince people that LLMs do a good job of handling their whole context when they just don't.

If we have the effective context window equal to the claimed context window, well, I'd start worrying a bit about most of the risks that AI doomers talk about...


There has been a huge increase in context windows recently.

I think the larger problem is "effective context" and training data.

Being technically able to use a large context window doesn't mean a model can actually remember or attend to that larger context well. In my experience, the kinds of synthetic "needle in haystack" tasks that AI companies use to show how large of a context their model can handle don't translate very well to more complicated use cases.

You can create data with large context for training by synthetically adding in random stuff, but there's not a ton of organic training data where something meaningfully depends on something 100,000 tokens back.

Also, even if it's not scaling exponentially, it's still scaling: at what point is RAG going to be more effective than just having a large context?


Great point about the meaningful datasets, this makes perfect sense. Esp. in regards to SFT and RLHF. Although I suppose it would be somewhat easier to do pretraining on really long context (books, I assume?)


Because you have to do inference distributed between multiple nodes at this point. For prefill, because prefill is actually quadratic, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's 36 H200s just for the KV cache, but you would need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's a cost of $1440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only 700 dollars per million tokens. There are architectural modifications that will bring that down significantly, but nonetheless it's still really, really expensive, and also quite hard to get to work.
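
Those figures are easy to sanity-check. A rough calculation, assuming Llama-3.1-405B-like shapes (126 layers, 8 KV heads of dimension 128 with GQA) and bf16; the shapes are assumptions based on the published config:

    # KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens.
    layers, kv_heads, head_dim, bytes_per = 126, 8, 128, 2   # bf16
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # ~516 KB per token
    tokens = 10_000_000
    print(f"{per_token / 1e3:.0f} KB/token, {per_token * tokens / 1e12:.1f} TB total")
    # -> about 0.5 MB per token and ~5.2 TB of KV cache at 10M context,
    #    matching the "more than 5 terabytes" figure above.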


Another factor in context window is effective recall. If the model can't actually use a fact 1m tokens earlier, accurately and precisely, then there's no benefit and it's harmful to the user experience to allow the use of a poorly functioning feature. Part of what Google have done with Gemini's 1-2m token context window is demonstrate that the model will actually recall and use that data. Disclosure, I do work at Google but not on this, I don't have any inside info on the model.


Memory. I don't know the equation, but it's very easy to see when you load a 128k context model at 8K vs 80K. The quant I am running would double its VRAM requirements when loading 80K.


This was my understanding too. Would love more people to chime in on the limits and costs of larger contexts.


> The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations.

FFWD input is self-attention output. And since the output of self-attention layer is [context, d_model], FFWD layer input will grow as well. Consequently, FFWD layer compute cost will grow as well, no?

The cost of FFWD layer according to my calculations is ~(4+2 * true(w3)) * d_model * dff * n_layers * context_size so the FFWD cost grows linearly wrt the context size.
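
Written out more cleanly (a sketch; the bracket term is 1 for gated SwiGLU-style FFNs with a w3 and 0 otherwise), with the standard attention term alongside for comparison:

    \text{FLOPs}_{\text{FFN}} \approx \left(4 + 2\,[\text{gated}]\right) \cdot n_{\text{layers}} \cdot d_{\text{model}} \cdot d_{\text{ff}} \cdot n_{\text{ctx}}
    \text{FLOPs}_{\text{attn}} \approx 4 \cdot n_{\text{layers}} \cdot d_{\text{model}} \cdot n_{\text{ctx}}^{2}

So the FFN cost per token is constant, but the total FFN cost over the sequence grows linearly with context, while the attention term grows quadratically.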

So, unless I misunderstood the transformer architecture, the larger the context, the larger the compute of both self-attention and FFWD?


FFWD layer is independent of context size; each processed token passes through the same weights.


So you're saying that if I have a sentence of 10 words, and I want the LLM to predict the 11th word, FFWD compute is going to be independent of the context size?

I don't understand how, since that very context is what determines whether the next predicted output is likely to be worthwhile, or not?

More specifically, FFWD layer is essentially self attention output [context, d_model] matrix matmul'd with W1, W2 and W3 weights?


I may be missing something, but I thought that each context token would result in 3 additional parameters for self-attention to build its map, since each attention must calculate a value considering all existing context


I’m confused. Backprop scales linearly w


> you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context

Be aware that this tends to give bad results. Once RAG is involved you essentially only do slightly better than a traditional search; a lot of nuance gets lost.


This depends on the amount of context you provide, and the quality of your retrieval step.


> Instead you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context.

Isn't that kind of what Anthropic is offering with projects? Where you can upload information and PDF files and stuff which are then always available in the chat?


They put the whole project in the context, which works much better than RAG when it fits. 200k context for their pro plan, and 500K for enterprise.


I don't know whether you're using "exponential" in the loose, everyday sense of the word, but it does not get exponentially more expensive.


Seems like a good candidate for a "dumb" AI you can run locally to grab the data you need and filter it down before giving it to OpenAI


Because they can't do long context windows. That's the only explanation. What you can do with a 1m token context window is quite a substantial improvement, particularly as you said for enterprise usage.


In my experience OpenAI models perform worse on long contexts than Anthropic/Google's, even when using the cheaper ones.


Claude is clearly the superior product, I'd say.

The only reason I open ChatGPT now is that Claude will refuse to answer questions on a variety of topics, including, for example, medication side effects.


When I tested o1 a few hours ago, it seemed like it was losing context. After I asked it to use a specific writing style, and pasting a large reference text, it forgot my demand. I reminded it, and it kept the rule for a few more messages, and after another long paste it forgot again.


If a $200/month pro level is successful it could open the door to a $2000/month segment, and the $20,000/month segment will appear and the segregation of getting ahead with AI will begin.


Agreed. Where may I read about how to set up an LLM similar to Claude, one with at least the length of Claude's context window, and what are the hardware requirements? I found Claude incredibly useful.


If you're looking into running models locally, a 405B parameter model sounds like the place to start.

Once you understand it, you could practice with a privately hosted LLM (run your own model, billed per hour) to tweak and get it dialled in, and then make the leap.


And now you can get the 405B quality in a 70B, according to Meta. Costs really come down massively with that. I wonder if it's really as good as they say, though.


Full-blown agents, but they have to really be able to replace a semi-competent human; that's harder than it sounds, especially for edge cases that a human can easily get past.


Agents still need a fair bit of human input and design and tweaking.


This is a significant concern for me too.


It's important to become early users of everything while AI is heavily subsidized.

Over time, using open-source models will also get more done per dollar of compute, and hopefully the gap will stay small.


Question is if OpenAI is actually making money at $200/month.


With o1-preview and the $20 subscription my queries were typically answered in 10-20 seconds. I've tried the $200 subscription with some queries and got 5-10 minute answer times. Unless the load has substantially increased and I was just waiting in a queue for computing resources, I'd assume that they throw a lot more hardware at o1-pro. So it's entirely possible that $200/month is still at a loss.


For funded startups, losing less can be a form of runway and capacity especially at the numbers they are spending.


I've been concatenating my source code of ~3300 lines and 123,979 bytes (so likely < 128K context window) into the chat to get better answers. Uploading files is hopeless in the web interface.


why not use aider/similar and upload via API?


Have you considered RAG instead of using the entire document? It's more complex but would at least allow you to query the document with your API of choice.


Switch to Gemini Pro just when you need huge context size. That is what I do.


Just? You don't think the model is as capable when the context does fit?


I tend to use OpenAI, Gemini, and Claude. All are excellent, but when I am not happy with results I hit all three.


When talking about context windows I'm surprised no one mentions https://poe.com/. Switched over from ChatGPT about a year ago, and it's amazing. Can use all models and the full context window of them, for the same price as a ChatGPT subscription.


Poe.com goes straight to a login page; it doesn't want to divulge ANY information to me before I sign up. No About Us or Product description or Pricing - nothing. Strange behavior. But I'm seeing it more and more with modern web sites.


I wouldn’t bother with Poe, poe2 early access costs $30 and starts on the 6th


I think you’re confusing it with Path of Exile 2? That’s the same mistake ChatGPT made…


I think the confusion was intentional in an attempt to make a funny :)


You can take a look at openrouter, also a pay as you go frontend (or API "proxy") for every single API in existence


What don’t you like about Claude? I believe the context is larger.

Coincidentally I’ve been using it with xml files recently (iOS storyboard files), and it seems to do pretty well manipulating and refactoring elements as I interact with it.


Google models have huge contexts, but are terrible...


Agreed. The new 1121 is better but still garbage relatively.


I just bought a pro subscription.

First impressions: The new o1-Pro model is an insanely good writer. Aside from favoring the long em-dash (—) which isn't on most keyboards, it has none of the quirks and tells of old GPT-4/4o/o1. It managed to totally fool every "AI writing detector" I ran it through.

It can handle unusually long prompts.

It appears to be very good at complex data analysis. I need to put it through its paces a bit more, though.


> Aside from favoring the long em-dash (—) which isn't on most keyboards

Interesting! I intentionally edit my keyboard layout to include the em-dash, as I enjoy using it out of sheer pomposity—I should undoubtedly delve into the extent to which my own comments have been used to train GPT models!


On my keyboard (en-us) it's ALT+"-" to get an em-dash.

I use it all the time because it's the "correct" one to use, but it's often more "correct" to just rewrite the sentence in a way that doesn't call for one. :)


I think that’s en-dash (–, used for ranges). Em-dash (—, used mid-sentence for asides etc) is the same combo but with shift as well.


–: alt+shift+minus on my azerty(fr) mac keyboard. I use it constantly. "Stylometry" hazard though !


Word processors -- MS Word, Google Docs -- will generally convert three hyphens to em dash.

(And two hyphens to en dash.)


I just use it because it's grammatically correct—admittedly I should use it less, for example here.


Just so you know, text using the em-dash like that combined with a few other "tells" makes me double check if it might be LLM written.

Other things are the overuse of transition words (e.g., "however," "furthermore," "moreover," "in summary," "in conclusion,") as well as some other stuff.

It might not be fair to people who write like that naturally, but it is what it is in the current situation we find ourselves in.


"In the past three days, I've reviewed over 100 essays from the 2024-2025 college admissions cycle. Here's how I could tell which ones were written by ChatGPT"

https://www.reddit.com/r/ApplyingToCollege/comments/1h0vhlq/...


On Windows the em dash is ALT+0151; the section sign (§) is ALT+0167. Once you know them (and a couple of others, for instance accented capitals) they become second nature, and work on all keyboards, everywhere.


delve?

Did ChatGPT write this comment for you?


For me, at least, it's common knowledge "delve" is overused and I would include it in a mock reply.


That's the joke.



Some of us are just greedy and deep, okay?


AI writing detectors are snake oil


Startup I'm at has generated a LOT of content using LLMs and once you've reviewed enough of the output, you can easily see specific patterns in the output.

Some words/phrases that, by default, it overuses: "dive into", "delve into", "the world of", and others.

You correct it with instructions, but it will then find synonyms so there is also a structural pattern to the output that it favors by default. For example, if we tell it "Don't start your writing with 'dive into'", it will just switch to "delve into" or another synonym.

Yes, all of this can be corrected if you put enough effort into the prompt and enough iterations to fix all of these tells.


> if we tell it "Don't start your writing with 'dive into'", it will just switch to "delve into" or another synonym.

LLMs can radically change their style, you just have to specify what style you want. I mean, if you prompt it to "write in the style of an angry Charles Bukowski" you'll stop seeing those patterns you're used to.

In my team for a while we had a bot generating meeting notes "in the style of a bored teenager", and (besides being hilarious) the results were very unlike typical AI "delvish".


Of course the "delve into" and "dive into" are just its defaults, to be corrected with additional instruction. But once you do something like "write in the style of...", then it has its own tells because, as I noted below, it is, in the end, biased towards frequency.


Of course there will be a set of tells for any given style, but the space of possibilities is much larger than what a person could recognize. So as with most LLM tasks, the issue is figuring out how to describe specifically what you want.

Aside: not about you specifically, but I feel like complaints on HN about using LLMs often boil down to somebody saying "it doesn't do X", where X is a thing they didn't ask the model to do. E.g. a thread about "I asked for a Sherlock Holmes story but the output wasn't narrated by Watson" was one that stuck in my mind. You wouldn't think engineers would make mistakes like that, but I guess people haven't really sussed out how to think about LLMs yet.

Anyway for problems like what you described, one has to be wary about expecting the LLM to follow unstated requirements. I mean, if you just tell it not to say "dive into" and it doesn't, then it's done everything it was asked, after all.


I mean, we get it. It's a UX problem. But the thing is you have to tell it exactly what to do every time. Very often, it'll do what you said but not what you meant, and you have to wrestle with it.

You'd have to come up with a pretty exhaustive list of tells. Even sentence structure and mood is sometimes enough, not just the obvious words.


This is the way. Blending two or more styles also works well, especially if they're on opposite poles, e.g. "write like the imaginary lovechild of Cormac McCarthy and Ernest Hemingway."

Also, wouldn't angry Charles Bukowski just be ... Charles Bukowski?


> ...once you've reviewed enough of the output, you can easily see specific patterns in the output

That is true, but more importantly, are those patterns sufficient to distinguish AI-generated content from human-generated content? Humans express themselves very differently by region and country (e.g. "do the needful" is not common in the Midwest; "orthogonal" and "order of magnitude" are used more on HN than most other places). Outside of watermarking, detecting AI-generated text with an acceptably small false-positive error rate is nearly impossible.


All of what you described can change wildly from model to model. Even across different versions of the same model.

Maybe a database could be built with “tells” organized by model.


Exactly. Fixing the old tells just means there are new ones.


> Maybe a database could be built with “tells” organized by model.

Automated by the LLMs themselves.


No thanks, I’d like it to be accurate ;)

Regular ol tests would do


I should have been more precise. I meant the LLMs would output their tells for you, naturally. But that's obvious.


They can’t know their own tells… that’s not how any of this works.

Thinking about it a bit more, the tells that work might depend on the usage of other specific prompts.


Not sure why you default to an uncharitable mode in understanding what I am trying to say.

I didn't say they know their own tells. I said they naturally output them for you. Maybe the obvious is so obvious I don't need to comment on it. Meaning this whole "tells analysis" would necessarily rely on synthetic data sets.


I always assumed that they were snake oil because the training objective is to get a model that writes like a human. AI detectors by definition are showing what does not sound like a human, so presumably people will train the models against the detectors until they no longer provide any signal.


The thing is, the LLM has a flaw: it is still fundamentally biased towards frequency.

AI detectors generally can take advantage of this and look for abnormal patterns in frequencies of specific words, phrases, or even specific grammatical constructs because the LLM -- by default -- is biased that way.

I'm not saying this is easy and certainly, LLMs can be tuned in many ways via instructions, context, and fine-tuning to mask this.
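
As a toy illustration of that frequency idea (the word list and the per-1,000-words framing are arbitrary assumptions, not a real detector):

    # Count how often stereotypical "LLM-ish" words appear per 1,000 words.
    # A real detector would model far more features; this only shows the shape of the idea.
    import re
    from collections import Counter

    TELLS = {"delve", "dive", "moreover", "furthermore", "tapestry", "landscape"}

    def tell_rate(text):
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        hits = sum(counts[w] for w in TELLS)
        return 1000 * hits / max(len(words), 1)

    sample = "Let's delve into the rich tapestry of the modern data landscape."
    print(f"{tell_rate(sample):.1f} tell-words per 1,000 words")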


Couldn't the LLM though just randomly replace/reword things to cover up its frequency in "post"?


They're not very accurate, but I think snake oil is a bit too far - they're better than guessing at least for the specific model(s) they're trained on. OpenAI's classifier [0] was at 26% recall, 91% precision when it launched, though I don't know what models created the positives in their test set. (Of course they later withdrew that classifier due to its low accuracy, which I think was the right move. When a company offers both an AI Writer and an AI Writing detector people are going to take its predictions as gospel and _that_ is definitely a problem.)

All that aside, most models have had a fairly distinctive writing style, particularly when fed no or the same system prompt every time. If o1-Pro blends in more with human writing that's certainly... interesting.

[0] https://openai.com/index/new-ai-classifier-for-indicating-ai...


Anecdotally, English/History/Communications professors are confirming cheaters with them because they find it easy to identify false information. The red flags are so obvious that the checker tools are just a formality: student papers now have fake URLs and fake citations. Students will boldly submit college papers which have paragraphs about nonexistent characters, or make false claims about what characters did in a story.

The e-mail correspondence goes like this: "Hello Professor, I'd like to meet to discuss my failing grade. I didn't know that using ChatGPT was bad, can I have some points back or rewrite my essay?"


Yeah but they "detect" the characteristic AI style: The limited way it structures sentences, the way it lays out arguments, the way it tends to close with an "in conclusion" paragraph, certain word choices, etc. o1-Pro doesn't do any of that. It writes like a human.

Damnit. It's too good. It just saved me ~6 hours in drafting a complicated and bespoke legal document. Before you ask: I know what I'm doing, and it did a better job in five minutes than I could have done over those six hours. Homework is over. Journalism is over. A large slice of the legal profession is over. For real this time.


Journalism is not only about writing. It is about sources, talking to people, being on the ground, connecting dots, asking the right questions. Journalists can certainly benefit from AI and good journalists will have jobs for a long time still.


While the above is true, I'd say the majority of what passes as journalism these days has none of the above and the writing is below what an AI writer could produce :(

It's actually surprising how many articles on 'respected' news websites have typos. You'd think there would be automated spellcheckers and at least one 'peer review' (probably too much to ask an actual editor to review the article these days...).


    It's actually surprising how many articles on 'respected' news websites have typos.
Well, that's why they're respected! The typos let you know they're not using AI!


Mainstream news today is written for an 8th grade reading ability. Many adults would lose interest otherwise, and the generation that grew up reading little more than social media posts will be even worse.

AI can handle that sort of writing just fine, readers won't care about the formulaic writing style.


These days, most journalism is turning reddit posts and tweets into long form articles with some additional context.


So AI could actually turn journalism more into what it originally was: reporting what is going on, rather than reading and rewriting information from other sources. Interesting possibility.


Yes and I think that's the promise that AI offers for many professionals - cut out the cruft and focus on the high level tasks.


That’s not journalism and anyone calling themselves a journalist for doing that is a fool.


ahh, but:

> I know what I'm doing

Is exactly the key element in being able to use spicy autocomplete. If you don't know what you're doing, it's going to bite you and you won't know it until it's too late. "GPT messed up the contract" is not an argument I would envy anyone presenting in court or to their employer. :)

(I say this mostly from using tools like copilot)


Well... Lawyers already got slapped for filings straight from ai output. So not new territory as far as that's concerned :)


> Homework is over. Journalism is over. A large slice of the legal profession is over. For real this time.

It just replaces human slop with automated slop. It doesn't automate finding hidden things out just yet, just automates blogspam.


> Before you ask: I know what I'm doing, and it did a better job in five minutes than I could have done over those six hours.

Seems like lawyers could do more, faster, because they know what they are doing. Experts don't get replaced; they get tools to amplify and extend their expertise.


Replacement is avoided only if the demand for their services scales in proportion to the productivity improvements, which is sometimes true but not always, and it is less likely to be true if the productivity improvements are very large.


It still needs to be driven by someone who knows what they're doing.

Just like when software was first coming out, it may have ended some jobs.

But it also helped get things done that wouldn't have been otherwise, or not as much.

In this case, equipping a capable lawyer to be 20x is more like an Iron Man suit, which is OK. If you can get more done, with less effort, you are still critical to what's needed.


sold. Ill buy it, thx for review.

Edit> Its good. Thanks again for ur review.


Doubtful. AI writing is obvious as hell.


of course they are. it’s simple: if they worked they would be incorporated into the loss function of the models and then they would no longer work


I use the emdash a lot. Maybe too much. On MacOS, it's so easy to type—just press shift-option-minus—that I don't even think about it anymore!


Or double type ‘-‘ and in many apps it’ll auto transform the two dashes to emdash. However, the method you’re describing is far more reliable, thanks!


I noticed a writing style difference, too, and I prefer it. More concise. On the coding side, it's done very well on large (well as large as it can manage) codebase assessment, bug finding, etc. I will reach for it rather than o1-preview for sure.


Writers love the em-dash though. It's a thing.


I love using it in my creative writing, I use it for an abrupt change. Find it kinda weird that it's so controversial.


My 10th grade English teacher (2002, just as blogging was taking off) called it sloppy and I gotta agree with her. These days I see it as YouTube punctuation, like jump-cut editing for text.


How is it sloppy?


It's not. People just like to pretend they have moral superiority for their opinions on arbitrary writing rules, when in reality the only thing that matters is if you're clearly communicating something valuable.

I'm a professional writer and use em-dashes without a second thought. Like any other component of language, just don't _over_ use them.


That's encouraging to hear that it's a better writer, but I wonder if "quirks and tells" can only be seen in hindsight. o1-pro's quirks may only become apparent after enough people have flooded the internet with its output.


> Aside from favoring the long em-dash (—)

This is a huge improvement over previous GPT and Claude, which use the terrible "space, hyphen, space" construct. I always have to manually change them to em-dashes.


> which isn't on most keyboards

This shouldn’t really be a serious issue nowadays. On macOS it’s Option+Shift+'-', on Windows it’s Ctrl+Alt+Num- or (more cryptic) Alt+0151.

The Swiss army knife solution is to configure yourself a Compose key, and then it’s an easy mnemonic like for example Compose 3 - (and Compose 2 - for en dash).


No internet access makes it very hard to benefit from o1 pro. Most of the complex questions I would ask require google search for research papers, language or library docs, etc. Not sure why o1 pro is banned from the internet, was it caught downloading too much porn or something?


Or worse still, referencing papers it shouldn’t be referencing because of paywalls, maybe.


Macs have always been able to type the em dash — the key combination is ⌥⇧- (Option-Shift-hyphen). I often use them in my own writing. (Hope it doesn't make somebody think I'm phoning it in with AI!)


Anyone who read "The Mac is not a typewriter" — a fantastic book of the early computer age — likely uses em dashes.


Wait how did you buy it. I’m just getting forwarded to Team Plan I already have. Sitting in Germany, tried US VPN as well.


The endpoint for upgrading for the normal web interface was returning 500s for me. Upgrading through the iOS app worked though.


Some autocorrect software automatically converts two hyphens in a row into an emdash. I know that's how it worked in Microsoft Word and just verified it's doing that with Google Docs. So it's not like it's hard to include an emdash in your writing.

Could be a tell for emails, though.


This is interesting, because at my job I have to manually edit registration addresses that use the long em-dash as our vendor only supports ASCII. I think Windows automatically converts two dashes to the long em-dash.


> It managed to totally fool every "AI writing detector" I ran it through.

For now. As AI power increases, AI-powered AI-writing detection tools also get better.


I’m less sure. This seems like an asymmetrical battle with a lot more money flowing to develop the models that write than detect.


It's also because it's brand new.

Give it a few weeks for them to classify its outputs, and they won't have a problem.


> the long em-dash (—) which isn't on most keyboards

On Windows it's Windows Key + . to get the emoji picker; it's in the Symbols tab, or find it in recents.


Well not for me it's not, that is a zoom function.

En dash is Alt+0150 and Em dash is Alt+0151


How do you have that configured? The Windows+. shortcut was added in a later update to W10 and pops up a GUI for selecting emojis, symbols, or other non-typable characters.


Long emdash is the way -- possible proof of AGI here


Would you mind sharing any favourite example chats?


Give me a prompt and I'll share the result.


Great! Suggested prompt below:

I need help creating a comprehensive Anki deck system for my 8-year-old who is following a classical education model based on the trivium (grammar stage). The child has already:

- Mastered numerous Latin and Greek root words
- Achieved mathematics proficiency equivalent to US 5th grade
- Demonstrated strong memorization capabilities

Please create a detailed 12-month learning plan with structured Anki decks covering:

1. Core subject areas prioritized in classical education (specify 4-5 key subjects)
2. Recommended daily review time for each deck
3. Progression sequence showing how decks build upon each other
4. Integration strategy with existing knowledge of Latin/Greek roots
5. Sample cards for each deck type, including:
   - Basic cards (front/back)
   - Cloze deletions
   - Image-based cards (if applicable)
   - Any special card formats for mathematical concepts

For each deck, please provide:

- Clear learning objectives
- 3-5 example cards with complete front/back content
- Estimated initial deck size
- Suggested intervals for introducing new cards
- Any prerequisites or dependencies on other decks

Additional notes:

- Cards should align with the grammar stage focus on memorization and foundational knowledge
- Please include memory techniques or mnemonics where appropriate
- Consider both verbal and visual learning styles
- Suggest ways to track progress and adjust difficulty as needed

Example of the level of detail needed for card examples:

Subject: Latin Declensions
Card Type: Basic
Front: 'First declension nominative singular ending'
Back: '-a (Example: puella)'



> “First declension nominative singular ending”

> “Sum, es, est, sumus, ________, sunt”

That's not made for an 8-year old.


Thanks! Here's Claude's effort (in 'Formal' mode):

https://gist.github.com/rahimnathwani/7ed6ceaeb6e716cedd2097...


Interesting that it thought for 1m28s on only two tasks. My intuition with o1-preview is that each task had a rather small token limit, perhaps they raised this limit.


404 :(


Would give similar output with o1. This is very simple stuff not needing any analysis or planning


I'd like to see how it performs on the test of https://aclanthology.org/2023.findings-emnlp.966/, even though in theory it's no longer valid due to possible data contamination.

The prompt is:

Write an epic narration of a single combat between Ignatius J. Reilly and a pterodactyl, in the style of John Kennedy Toole.



Thanks a lot! That's pretty impressive, although not sure if noticeably better than non-pro o1 (which was already very impressive).

I suppose creative writing isn't the primary selling point that would make users upgrade from $20 to $200 :)


  Write me a review of "The Malazan Book of the Fallen" with the main argument being that it could be way shorter


Did this unironically.

https://chatgpt.com/share/67522170-8fec-8005-b01c-2ff174356d...

It's a bit overwrought, but not too bad.


"the signal-to-noise ratio has grown too low" is a bit odd for me. The ratio would not have grown at all.


How did you get your child to study Greek? (Genuinely curious)


The Malazan response is below the deck response.


Oops! That's the same ANKI link as above.


It's part of the same conversation. Should be below that other response.


Ok, I laughed


You can use the emdash by writing dash twice -- it works in a surprising number of editors and rendering engines


Does it still hallucinate? This, for me, is key; if it does, it will be limited.


The current architecture of LLMs will always "hallucinate".


What’s the context window?


128k tokens


I consistently get significantly better performance from Anthropic at a literal order of magnitude less cost.

I am incredibly doubtful that this new GPT is 10x Claude unless it is embracing some breakthrough, secret, architecture nobody has heard of.


That's not how pricing works.

If o1-pro is 10% better than Claude, and you are a guy who makes $300,000 per year but can now make $330,000 because o1-pro makes you more productive, then it makes sense to give Sam $2,400.


Having a tool that’s 10% better doesn’t make your whole work 10% better though.


A "10% better" tool could make no difference, or it could make the work 100% better. The impact isn't linear.


It's likely probabilistically linear... like speeding on a street with random traffic lights.


Right, I should have put a "necessarily" in there.


It also doesn’t magically make you more money either.


Depends on the definition of better. Above example used this definition implicitly as you can see.


Above example makes no sense since it says ChatGPT is 10% better than Claude at first, then pivots to use it as a 10% total productivity enhancer. Which is it?


Yeah, but that's the sales pitch.


Man, why are people making $300k so stupid though


The math is never this clean, and no one has ever experienced this (though I'm sure it's a justification that was floated at OAI HQ at least once).


It's never this clean, but it is directionally correct. If I make $300k/year, and I can tell that ChatGPT already saves me hours or even days per month, $200 is a laughable amount. If I feel like Pro is even slightly better, it's worth $200 just to know that I always have the best option available.

Heck, it's probably worth $200 even if I'm not confident it's better just in case it is.

For the same reason I don't start with the cheapest AI model when asking questions and then switch to the more expensive if it doesn't work. The more expensive one is cheap enough that it doesn't even matter, and $200 is cheap enough (for a certain subsection of users) that they'll just pay it to be sure they're using the best option.


That's only true if your time is metered by the hour; and the vast majority of roles which find some benefit from AI, at this time, are not compensated hourly. This plan might be beneficial to e.g. CEO-types, but I question who at OpenAI thought it would be a good idea to lead their 12 days of hollowhype with this launch, then; unless this is the highest impact release they've got (one hopes it is not).


>This plan might be beneficial to e.g. CEO-types, but I question who at OpenAI thought it would be a good idea to lead their 12 days of hollowhype with this launch, then; unless this is the highest impact release they've got (one hopes it is not).

In previous multi-day marketing campaigns I've run or helped run (specifically on well-loved products), we've intentionally announced a highly-priced plan early on without all of its features.

Two big benefits:

1) Your biggest advocates get to work justifying the plan/product as-is, anchoring expectations to the price (which already works well enough to convert a slice of potential buyers)

2) Anything you announce afterward now gets seen as either a bonus on top (e.g. if this $200/mo plan _also_ includes Sora after they announce it...), driving value per price up compared to the anchor; OR you're seen as listening to your audience's criticisms ("this isn't worth it!") by adding more value to compensate.


I work from home and my time is accounted for by way of my productive output because I am very far away from a CEO type. If I can take every Wednesday off because I’ve gained enough productivity to do so, I would happily pay $200/mo out of my own pocket to do so.

$200/user/month isn’t even that high of a number in the enterprise software world.


Employers might be willing to get their employees a subscription if they believe it makes the employees they are paying $$$$$ some X% more productive. (Where X% of their salary works out to more than $2400/year.)


There is only so much time in the day. If you have a job where increased productivity translates to increases income (not just hourly metered jobs) then you will see a benefit.


> cheapest AI model when asking questions and then switch to the more expensive if it doesn't work.

The thing is, more expensive isn't guaranteed to be better. The more expensive models are better most of the time, but not all the time. I talk about this more in this comment https://news.ycombinator.com/item?id=42313401#42313990

Since LLMs are non-deterministic, there is no guarantee that GPT-4o is better than GPT-4o mini. GPT-4o is most likely going to be better, but sometimes the simplicity of GPT-4o mini makes it better.


As you say, the more expensive models are better most of the time.

Since we can't easily predict which model will actually be better for a given question at the time of asking, it makes sense to stick to the most expensive/powerful models. We could try, but that would be a complex and expensive endeavor. Meanwhile, both weak and powerful models are already too cheap to meter in direct / regular use, and you're always going to get ahead with the more powerful ones, per the very definition of what "most of the time" means, so it doesn't make sense to default to a weaker model.


For regular users I agree; for businesses, it will have to be a shotgun approach, in my opinion.

Edit:

I should add, for businesses, it isn't about better, but more about risk as the better model can still be wrong.


TBH it's easily in the other direction. If I can get something to clients quicker that's more valuable.

If paying this gets me two days of consulting it's a win for me.

Obvious caveat if cheaper setups get me the same, although I can't spend too long comparing or that time alone will cost more than just buying everything.


The number of times I've heard all this about some other groundbreaking technology... most businesses just went meh and moved on. But for self-employed, if those numbers are right, it may make sense.


It would be a worthy deal if you started making $302,401 per year.


Also a worthy deal if you don’t lose your $300k/year job to someone who is willing to pay $2,400/year.


Yes. But also from the perspective of saving time. If it saves an additional 2 hours/month, and you make six figures, it's worth it.

And the perspective of frustration as well.

Business class is 4x the price of regular, and definitely not 4x better. But it saves time + frustration.


It's not worth it if you're a W2 employee and you'll just spend those 2 hours doing other work. Realistically, working 42 hours a week instead of 40 will not meaningfully impact your performance, so doing 42 hours a week of work in 40 won't, either.

I pay $20/mo for Claude because it's been better than GPT for my use case, and I'm fine paying that but I wouldn't even consider something 10x the price unless it is many, many times better. I think at least 4-5x better is when I'd consider it and this doesn't appear to be anywhere close to even 2x better.


When it comes to sleep, business class is 100x better.


That's also not how pricing works, it's about perceived incremental increases in how useful it is (marginal utility), not about the actual more money you make.


Yeah, the $200 seems excessive and annoying, until you realise it depends on how much it saves you. For me it needs to save me about 6 hours per month to pay for itself.

Funny enough I've told people that baulk at the $20 that I would pay $200 for the productivity gains of the 4o class models. I already pay $40 to OpenAI, $20 to Anthropic, and $40 to cursor.sh.


ah yes, you must work at the company where you get paid per line of code. There's no way productivity is measured this accurately and you are rewarded directly in any job unless you are self-employed and get paid per website or something


I love it when AI bros quantify AI's helpfulness like this


Being in an AI domain does not invalidate the fundamental logic. If an expensive tool can make you productive enough to offset the cost, then the tool is worth it for all intents and purposes.


I think of them as different people -- I'll say that I use them in "ensemble mode" for coding, the workflow is Claude 3.5 by default -- when Claude is spinning, o1-preview to discuss, Claude to implement. Worst case o1-preview to implement, although I think its natural coding style is slightly better than Claude's. The speed difference isn't worth it.

The intersection of problems I have where both have trouble is pretty small. If this closes the gap even more, that's great. That said, I'm curious to try this out -- the ways in which o1-preview fails are a bit different than prior gpt-line LLMs, and I'm curious how it will feel on the ground.


Okay, tried it out. Early indications - it feels a bit more concise, thank god, certainly more concise than 4o -- it's s l o w. Getting over 1m times to parse codebases. There's some sort of caching going on though, follow up queries are a bit faster (30-50s). I note that this is still superhuman speeds, but it's not writing at the speed Groqchat can output Llama 3.1 8b, that is for sure.

Code looks really clean. I'm not instantly canceling my subscription.


When you say "parse codebases" is this uploading a couple thousand lines in a few different files? Or pasting in 75 lines into the chat box? Or something else?


$ find web -type f \( -name '*.go' -o -name '*.tsx' \) | tar -cf code.tar -T -; cat code.tar | pbcopy

Then I paste it in and say "can you spot any bugs in the API usage? Write out a list of tasks for a senior engineer to get the codebase in basically perfect shape," or something along those lines.

Alternately: "write a go module to support X feature, and implement the react typescript UI side as well. Use the existing styles in the tsx files you find; follow these coding guidelines, etc. etc."


I pay for both GPT and Claude and use them both extensively. Claude is my go-to for technical questions, GPT (4o) for simple questions, internet searches and validation of Claude answers. GPT o1-preview is great for more complex solutions and work on larger projects with multiple steps leading to finish. There’s really nothing like it that Anthropic provides. But $200/mo is way above what I’m willing to pay.


I have several local models I hit up first (Mixtral, Llama), if I don’t like the results then I’ll give same prompt to Claude and GPT.

Overall though it’s really just for reference and/or telling me about some standard library function I didn’t know of.

Somewhat counterintuitively I spend way more time reading language documentation than I used to, as the LLM is mainly useful in pointing me to language features.

After a few very bad experiences I never let LLM write more than a couple lines of boilerplate for me, but as a well-read assistant they are useful.

But none of them are sufficient alone; you do need a “team” of them - which is why I also don’t see the value in spending this much on one model. I’d spend that much on a system that polled 5 models concurrently and came up with a summary of sorts.


People keep talking about using LLMs for writing code, and they might be useful for that, but I've found them much more useful for explaining human-written code than anything else, especially in languages/frameworks outside my core competency.

E.g. "why does this (random code in a framework I haven't used much) code cause this error?"

About 50% of the time I get a helpful response straight away that saves me trawling through Stack Overflow and random blog posts. About 25% of the time the response is at least partially wrong, but it still helps me get on the right track.

25% of the time the LLM has no idea and won't admit it so I end up wasting a small amount of time going round in circles, but overall it's a significant productivity boost when I'm working on unfamiliar code.


Right on, I like to use local models - even though I also use OpenAI, Anthropic, and Google Gemini.

I often use one- or two-shot examples in prompts, but with small local models it is also fairly simple to do fine-tuning - if you have fine-tuning examples and you're a developer who can get the training data into the correct format, which differs between the models you are fine-tuning.
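
As a sketch, one common shape for that training data is chat-style JSONL; the exact schema varies by model and fine-tuning framework, and the example content below is made up purely for illustration:

```python
# Write fine-tuning examples as one JSON object per line ("JSONL") in a
# chat-style format. Check your model's docs for the exact schema it expects.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You answer questions about our internal CLI."},
            {"role": "user", "content": "How do I rotate the API key?"},
            {"role": "assistant", "content": "Run the key-rotation command and update your .env file."},
        ]
    },
    # ... more examples; small local models usually want at least a few hundred
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```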


> But none of them are sufficient alone, you do need a “team” of them

Given the sensitivity to parameters and prompts the models have, your "team" can just as easily be querying the same LLM multiple times with different system prompts.


Other factor is I use local LLM first because I don’t trust any of the companies to protect my data or software IP.


What model sizes do you run locally? Anything that would work on a 16GB M1?


I have a 32GB M2, but most local models I use fit on my old 8GB M1 laptop.

I can run the QwQ 32B model with Q4 quantization on my 32GB M2.

I suggest using https://Ollama.com on Mac, Windows, and Linux. I experimented with all the options on Apple Silicon and liked Ollama best.
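
A minimal sketch of the "try local first" step using the Ollama Python client, assuming the Ollama server is running and you've already pulled a model (the model name here is just an example):

```python
# Query a locally served model via the Ollama Python client
# (pip install ollama; ollama pull llama3.1).
import ollama

resp = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "What does functools.lru_cache do?"}],
)
print(resp["message"]["content"])
```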


I have an A6000 with 48GB VRAM I run from a local server and I connect to it using Enchanted on my Mac.


I haven't used ChatGPT in a few weeks now. I still maintain subscriptions to both ChatGPT and Claude, but I'm very close to dropping ChatGPT entirely. The only useful thing it provides over Claude is a decent mobile voice mode and web search.


If you don't want to necessarily have to pick between one or the other, there are services like this one that let you basically access all the major LLMs and only pay per use: https://nano-gpt.com/


I've used TypingMind and it's pretty great, I like the idea of just plugging in a couple API keys and paying a fraction, but I really wish there was some overlap.

If a random query via the API costs a fifth of a cent, why can't I get 10 free API calls w/ my $20/mo premium subscription?


Does it have Claude's artifact feature?


I'm in the same boat — I maintain subscriptions to both.

The main thing I like OpenAI for is that when I'm on a long drive, I like to have conversations with OpenAI's voice mode.

If Claude had a voice mode, I could see dropping OpenAI entirely, but for now it feels like the subscriptions to both is a near-negligible cost relative to the benefits I get from staying near the front of the AI wave.


I’ve been considering dropping ChatGPT for the same reason. Now that the app is out the only thing I actually care about is search.


Which ChatGPT model have you been using? In my experience nothing beats 4. (Not claude, not 4o)


I've heard so much about Claude and decided to give it a try, and it has been a major disappointment. I ended up using ChatGPT as an assistant for Claude's code writing because it just couldn't get things right. Had to cancel my subscription; no idea why people still promote it everywhere like it is 100x better than ChatGPT.


> Had to cancel my subscription; no idea why people still promote it everywhere like it is 100x better than ChatGPT.

You need to learn how to ask it the right questions.


I find o1 much better for having discussions or solving problems, then usually switch to Claude for code generation.


Sonnet isn't good at debugging, or even architecting. o1 shines, it feels like magic. The kinds of bugs it helped me nail were incredible to me.


I've heard this a lot and so I switched to Claude for a month and was super disappointed. What are you mainly using ChatGPT for?

Personally, I found Claude marginally better for coding, but far, far worse for just general purpose questions (e.g. I'm a new home owner and I need to winterize my house before our weather drops below freezing. What are some steps I should take or things I should look into?)


It's ironic because I never want to ask an LLM for something like your example general purpose question, where I can't just cheaply and directly test the correctness of the answer

But we're hurtling towards all the internet's answers to general purpose questions being SEO spam that was generated by an LLM anyways.

Since OpenAI probably isn't hiring as many HVAC technicians to answer queries as they are programmers, it feels like we're headed towards a death spiral where, for generic knowledge questions off the beaten path, the only options will be either having the LLM do actual research from primary sources untouched by SEO, or finding a human who has done that research.

-

Actually to test my hypothesis I just tried this with ChatGPT with internet access.

The list of winterization tips cited an article that felt pretty "delvey". I searched the author's name: their LinkedIn profile is about how they professionally write marketing content (nothing about HVAC), one of their listed accomplishments is generative AI, and their likes feed is full of AI mentions for writing content.

So ChatGPT is already at a place where when it searches for "citations", it's just spitting back out its own uncited answers above answers by actual experts (since the expert sources aren't as SEO-driven)


> I can't just cheaply and directly test the correctness of the answer

I feel that, but I think for me the key is knowing that LLMs can be wrong and I should treat the answer as a starting point and not an actual expert. I find it really helpful for topics where I don't even know where to start because, like you said, most search engines are utter trash now.

For things like that, I find ChatGPT to be a good diving off point. For example, this is what I got when I asked:

``` Preparing your townhouse for winter involves addressing common issues associated with the region's wet and cool climate. Here's a concise checklist to help you get started:

1. Exterior Maintenance

Roof Inspection: Check for damaged or missing shingles to prevent leaks during heavy rains.

Gutter Cleaning: Remove leaves and debris to ensure proper drainage and prevent water damage.

Downspouts: Ensure they direct water away from the foundation to prevent pooling and potential leaks.

Siding and Trim: Inspect for cracks or gaps and seal them to prevent moisture intrusion.

2. Windows and Doors

Weatherstripping: Install or replace to seal gaps and prevent drafts, improving energy efficiency.

Caulking: Apply around window and door frames to block moisture and cold air.

3. Heating System

Furnace Inspection: Have a professional service your furnace to ensure it's operating efficiently.

Filter Replacement: Change furnace filters to maintain good air quality and system performance.

4. Plumbing

Outdoor Faucets: Disconnect hoses and insulate faucets to prevent freezing.

Pipe Insulation: Insulate exposed pipes, especially in unheated areas, to prevent freezing and bursting.

5. Landscaping

Tree Trimming: Prune branches that could break under snow or ice and damage your property.

Drainage: Ensure the yard slopes away from the foundation to prevent water accumulation.

6. Safety Checks

Smoke and Carbon Monoxide Detectors: Test and replace batteries to ensure functionality.

Fireplace and Chimney: If applicable, have them inspected and cleaned to prevent fire hazards.

By addressing these areas, you can help protect your home from common winter-related issues in Seattle's climate. ```

Once I dove into the links ChatGPT provided I found the detail I needed and things I needed to investigate more, but it saved 30 minutes of pulling together a starting list from the top 5-10 articles on Google.


Super old comment, but for posterity: my point is that, unfortunately, increasingly when you do dive into those results, those are also ChatGPT.

Depends on the topic of course, but it ends up being a bit of an ouroboros.


Or Anthropic will follow suit.


Am I wrong that Anthropic doesn't really have a match yet to ChatGPT's o1 model (a "reasoning" model?)


Claude Sonnet 3.5 has outperformed o1 in most tasks based on my own anecdotal assessment. So much so that I'm debating canceling my ChatGPT subscription. I just literally do not use it anymore, despite being a heavy user for a long time in the past


Is a "reasoning" model really different? Or is it just clever prompting (and feeding previous outputs) for an existing model? Possibly with some RLHF reasoning examples?

OpenAI doesn't have a large enough database of reasoning texts to train a foundational LLM off it? I thought such a db simply does not exist as humans don't really write enough texts like this.


It's trained via reinforcement learning on essentially infinite synthetic reasoning data. You can generate infinite reasoning data because there are infinite math and coding problems that can be created with machine-checkable solutions, and machines can make infinite different attempts at reasoning their way to the answer. Similar to how models trained to learn chess by self-play have essentially unlimited training data.
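
As a toy illustration of the machine-checkable idea: generate problems whose answers can be verified programmatically, let the model attempt a reasoned answer, and keep only the traces that check out. The ask_model call below is a placeholder for whatever LLM you'd use; only the verification loop is the point.

```python
# Generate arithmetic problems with known answers, collect model attempts,
# and keep only the reasoning traces whose final answer is correct.
import random
import re

def make_problem() -> tuple[str, int]:
    a, b, c = (random.randint(2, 99) for _ in range(3))
    prompt = f"Compute {a} * {b} + {c}. Show your reasoning, then end with 'Answer: <n>'."
    return prompt, a * b + c

def extract_answer(text: str) -> int | None:
    m = re.search(r"Answer:\s*(-?\d+)", text)
    return int(m.group(1)) if m else None

def collect_traces(ask_model, n: int = 1000) -> list[dict]:
    kept = []
    for _ in range(n):
        prompt, truth = make_problem()
        attempt = ask_model(prompt)            # model writes a chain of thought
        if extract_answer(attempt) == truth:   # keep only verified reasoning
            kept.append({"prompt": prompt, "reasoning": attempt})
    return kept
```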


We don't know the specifics of GPT-o1 to judge, but we can look at open weights model for an example. Qwen-32B is a base model, QwQ-32B is a "reasoning" variant. You're broadly correct that the magic, such as it is, is in training the model into a long-winded CoT, but the improvements from it are massive. QwQ-32B beats larger 70B models in most tasks, and in some cases it beats Claude.


I just tried QwQ 32B, I didn't know about it. I used it to generate some code that GPT had generated perfectly, without even sweating, 2 days ago.

QwQ generated 10 pages of its reasoning steps, and the code is probably not correct. [1] includes both answers from QwQ and GPT.

Breaking down its reasoning steps into such excruciatingly detailed prose is certainly not user friendly, but it is intriguing. I wonder what an ideal use case for it would be.

[1] https://gist.github.com/defmarco/9eb4b1d0c547936bafe39623ec6...


It’s clever marketing.


To my understanding, Anthropic realizes that they can’t compete in name recognition yet, so they have to overdeliver in terms of quality to win the war. It’s hard to beat the incumbent, especially when “chatgpt’ing” is basically a well understood verb.


They don't have a model that does o1-style "thought tokens" or is specialized for math, but Sonnet 3.6 is really strong in other ways. I'm guessing they will have an o1-style model within six months if there's demand


Claude is so much better


I mean ... anecdata for anecdata.

I use LLMs for many projects and 4o is the sweet spot for me.

>literal order of magnitude less cost

This is just not true. If your use case can be solved with 4o-mini (I know, not all do) OpenAI is the one which is an order of magnitude cheaper.


Yeah, I've switched to Anthropic fully as well for personal usage. It seems better to me and/or equivalent in all use cases.


Same. Honestly if they released a $200 a month plan I’d probably bite, but OpenAI hasn’t earned that level of confidence from me yet. They have some catching up to do.


The main difficulty when pricing a monthly subscription for "unlimited" usage of a product is the 1% of power users whose extreme usage can kill any profit margins for the product as a whole.

Pricing ChatGPT Pro at $200/mo filters it to only power users/enterprise, and given the cost of the GPT-o1 API, it wouldn't surprise me if those power users burn through $200 worth of compute very, very quickly.


They are ready for this: there is a policy against automation, sharing, or reselling access; it looks like there are some unspecified quotas as well:

> We have guardrails in place to help prevent misuse and are always working to improve our systems. This may occasionally involve a temporary restriction on your usage. We will inform you when this happens, and if you think this might be a mistake, please don’t hesitate to reach out to our support team at help.openai.com using the widget at the bottom-right of this page. If policy-violating behavior is not found, your access will be restored.

Source: https://help.openai.com/en/articles/9793128-what-is-chatgpt-...


> can kill any profit margins for the product as a whole.

Especially when the base line profit margin is negative to begin with


Is there any evidence to suggest this is true? IIRC there was leaked information that OpenAI's revenue was significantly higher than their compute spending, but it wasn't broken down between API and subscriptions so maybe that's just due to people who subscribe and then use it a few times a month.


> OpenAI's revenue was significantly higher than their compute spending

I find this difficult to believe, although I don't doubt leaks could have implied it. The challenge is that "the cost of compute" can vary greatly based on how it's accounted for (things like amortization, revenue recognition, capex vs opex, IP attribution, leasing, etc). Sort of like how Hollywood studio accounting can show a movie as profitable or unprofitable, depending on how "profit" is defined and how expenses are treated.

Given how much all those details can impact the outcome, to be credible I'd need a lot more specifics than a typical leak includes.


> Is there any evidence to suggest this is true?

I can't find any sources _not_ mentioning billions in losses for 2024 and for the foreseeable future.


Is compute that expensive? An H100 rents at about $2.50/hour, so $200 buys roughly 80 hours of pure compute. Out of 720 hours a month, that's a 1/9 duty cycle around the clock, or about 1/3 if we assume an 8-hour work day. That's really intense, constant use. And I bet OpenAI spends less operating their own infra than the rate at which cloud providers rent it out.
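
A quick worked version of that arithmetic, with the rental rate as an assumption:

```python
# How many H100-hours $200 buys at an assumed $2.50/hr rental rate,
# and what duty cycle that implies over a month.
subscription, hourly_rate = 200.0, 2.50   # assumed prices
gpu_hours = subscription / hourly_rate    # 80 hours
print(gpu_hours / 720)                    # ~0.11 -> roughly 1/9 of a 720-hour month
print(gpu_hours / (8 * 30))               # ~0.33 -> roughly 1/3 of 8-hour days
```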


are you assuming that you can do o1 inference on a single h100?


Good question. How many H100s does it take? Is there any way to guess / approximate that?


You need enough VRAM to store the model weights plus the KV-cache, which depends on context size. Assuming the model has a trillion parameters (there are only rumours about how many there actually are) and uses 8 bits per parameter, 16 H100s might be sufficient.


I suspect the biggest most powerful model could easily use hundreds or maybe thousands of H100's.

And the 'search' part of it could use many of these clusters in parallel, and then pick the best answer to return to the user.


16? No. More on the order of 1,000+ H100s computing in parallel for one request.


Does an o1 query run on a singular H100, or on a plurality of H100s?


A single H100 has 80GB of memory, meaning that at FP16 you could roughly fit a 40B parameter model on it, or at FP4 quantisation you could fit a 160B parameter model on it. We don't know (I don't think) what quantisation OpenAI use, or how many parameters o1 is, but most likely...

...they probably quantise a bit, but not loads, as they don't want to sacrifice performance. FP8 seems like a possible middle ground. o1 is just a bunch of GPT-4o in a trenchcoat strung together with some advanced prompting. GPT-4o is theorised to be 200B parameters. If you wanted to run 5 parallel generation tasks at peak during the o1 inference process, that's 5x 200B, at FP8, or about 12 H100s. 12 H100s takes about one full rack of kit to run.
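
A rough back-of-the-envelope for the numbers in this subthread, weights only, ignoring KV-cache and activations; every parameter count here is a rumour or assumption, not a published figure:

```python
# Estimate how many 80GB H100s are needed just to hold the weights,
# given an assumed parameter count and bits per parameter.
def h100s_needed(params_billions: float, bits_per_param: int, vram_gb: int = 80) -> float:
    weight_gb = params_billions * 1e9 * bits_per_param / 8 / 1e9
    return weight_gb / vram_gb

for label, params, bits in [
    ("rumoured 1T-param model @ 8-bit", 1000, 8),       # ~1000 GB -> 12.5
    ("theorised 200B GPT-4o @ FP8", 200, 8),            # ~200 GB  -> 2.5
    ("5 parallel 200B generations @ FP8", 1000, 8),     # ~1000 GB -> 12.5
]:
    print(f"{label}: ~{h100s_needed(params, bits):.1f} H100s (weights only)")
```

Adding KV-cache and serving overhead on top of the weights is what pushes the 12.5 figure up toward the 16-GPU and one-rack estimates above.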


o1 is ten times as expensive as pre-turbo GPT-4.


I was testing out a chat app that supported images. Long conversations with multiple images in them can cost something like $0.10 per message after a certain point. It sure does add up quickly.


I wouldn't be surprised if the "unlimited" product is unlimited requests, but the quality of the responses drop if you ask millions of questions...


like throttled unlimited data


$200 is a lot of compute. Amortized over say 3 years, that's a dedicated A100 GPU per user, or an H100 for every 3 users.


Not counting power or servers etc. But yeah it does put it into perspective.


I believe they have many data points to back up this decision. They surely know how people are using their products.


There are many use cases for which the price can go even higher. I'm thinking of recent interactions with people working at an interview mill: multiple people in a boiler room interviewing for companies all day long, with a computer set up so that our audio was being piped to o1. They had a reasonable prompt to remove many of the chatbot-isms and make it provide answers that sound human: we were 100% interviewing the o1 model. The operator said basically nothing, in both technical and behavioral interviews.

A company making money off of this kind of scheme would be happy to pay $200 a seat for an unlimited license. And I would not be surprised if there were many other very profitable use cases that make $200 per month seem like a bargain.


So, wait a minute, when interviewing candidates, you're making them invest their valuable time talking to an AI interviewer, and not even disclosing to them that they aren't even talking to a real human? That seems highly unethical to me, yet not even slightly surprising. My question is, what variables are being optimized for here? It's certainly not about efficiently matching people with jobs, it seems to be more about increasing the number of interviews, which I'm sure benefits the people who get rewarded for the number of interviews, but seems like entirely the wrong metric.


Scams and other antisocial use cases are basically the only ones for which the damn things are actually the kind of productivity rocket-fuel people want them to be, so far.

We better hope that changes sharply, or these things will be a net-negative development.


Right? To me it's eerily similar to how cryptocurrency was sold as a general replacement for all money uses, but turned out to be mainly useful for societally negative things like scams and money laundering.


It sounds like a setup where applicants hire some third-party company to perhaps "represent the client" in the interview and that company hired a bunch of people to be the interviewee on their clients behalf. Presumably also neither the company nor the applicant disclose this arrangement to the hiring manager.


So, another, or several more, layers of ethical dubiousness.


>> My question is, what variables are being optimized for here?

The ones that start with a "$".


Yep, deceptive practices like this undermine trust in the hiring process


If any company wants me to be interviewed by AI to represent the client, I'll consider it ethical to let an AI represent me. Then AIs can interview AIs, maybe that'll get me the job. I have strong flashbacks to the movie "Surrogates" for some reason.


My friend found 2 chimney sweep businesses. One charges $569, the other charges $150.

Plot twist: the same guy runs both. They do the same thing and the same crew shows up.


Decades ago in Santa Cruz county California, I had to have a house bagged for termites for the pending sale. Turned out there was one contractor licensed to do the poison gas work, and all the pest service companies simply subcontracted to him. So no matter what pest service you chose, you got the same outfit doing the actual work.


I used to work for a manufacturing company that did this. They offered a standard, premium, and "House Special Product". House special was 2x premium but the same product. They didn't even pretend it wasn't, they just said it was recommended and people bought it.


I had this happen once at a car wash. The first time I went I paid for a $25 premium package with all the bells and whistles. They seemed to do a good job. The next time I went for the basic $10 one. Exact same thing.


Yesterday, I spent 4.5hrs crafting a very complex Google Sheets formula—think LAMBDA, MAP, LET, etc., 82 lines of it. If I had known it would take that long, I would have just done it via Apps Script. But it was 50% kinda working, so I kept giving the model the output, and it provided updated formulas back and forth for 4.5hrs. Say my time is $100/hr - that's $450. So even if the new ChatGPT Pro mode isn't any smarter but is 50% faster, that's $225 saved in time alone. It would probably get that formula right in 10min with a few back-and-forth messages, instead of 4.5hrs. Plus, I used about $62 worth of API credits in their not-so-great Playground. I see similar situations of extreme ROI every few days, let alone all the other uses. I'd pay $500/mo, but beyond that, I'd probably just stick with Playground & API.


> so I kept giving the model the output, and it provided updated formulas back and forth for 4.5hrs

I read this as: "I have already ceded my expertise to an LLM, so I am happy that it is getting faster because now I can pay more money to be even more stuck using an LLM"

Maybe the alternative to going back and forth with an AI for 4.5 hours is working smarter and using tools you're an expert in. Or building expertise in the tool you are using. Or, if you're not an expert or can't become an expert in these tools, then it's hard to claim your time is worth $100/hr for this task.


I agree going back and forth with an AI for 4.5 hours is usually a sign something has gone wrong somewhere, but this is incredibly narrow thinking. Being an open-ended problem solver is the most valuable skill you can have. AI is a huge force multiplier for this. Instead of needing to tap a bunch of experts to help with all the sub-problems you encounter along the way, you can just do it yourself with AI assistance.

That is to say, past a certain salary band people are rarely paid for being hyper-proficient with tools. They are paid to resolve ambiguity and identify the correct problems to solve. If the correct problem needs a tool that I'm unfamiliar with, using AI to just get it done is in many cases preferable to locating an expert, getting their time, etc.


If somebody claims that an LLM can do in 10 minutes something that takes them 4.5 hours, then they are definitely not an expert. They probably have some surface knowledge, but that's all. There is a reason why the better LLM demos are about learning something new, like a new programming language. So far, all of the other kinds of demos I have seen (e.g. generating new endpoints based on older ones) were clearly slower than experts, and they were slower for me to use in my own field.


No true Scotsman


There was no counter example, and I didn’t use any definition, so it cannot be that. I have no idea what you mean.


> If somebody claims that an LLM can do in 10 minutes something that takes them 4.5 hours, then they are definitely not an expert.

Looks like a no true scotsman definition to me.

I don't fully agree or disagree with your point, but it was perhaps made more strongly than it should have been?


For no true Scotsman, you need to dismiss a counterexample by using a misrepresented or wrong definition, or simply by applying a definition wrongly. But in any case you need a counterexample for that specific fallacy. I didn't have one, and I still don't.

I understand that some people may consider themselves experts and believe they could achieve a similar reduction (outside the cases where I said it's clearly possible), but then show me, because I still haven't seen a single one. The ones that were shown publicly were not quicker than average seniors, and definitely worse than the better ones. Even at larger scale in my company, we haven't seen a performance improvement in any single coding metric since we introduced it more than half a year ago.


Here's your counterexample: “Copilot has dramatically accelerated my coding. It’s hard to imagine going back to ‘manual coding,’” Karpathy said. “Still learning to use it, but it already writes ~80% of my code, ~80% accuracy. I don’t even really code, I prompt & edit.” -- https://siliconangle.com/2023/05/26/as-generative-ai-acceler...


It's not a counterexample. There is exactly zero concrete information in it. It's just a statement from somebody who profits from such statements. Even my simply saying "that's not true" carries more weight, because I would actually benefit if what Karpathy said were true.

So, to be specific, and specifically for ChatGPT (I think it was 4), these are very, very problematic, because all of them are clear lies:

https://chatgpt.com/share/675f6308-aa8c-800b-9d83-83f14b64cb...

https://chatgpt.com/share/675f63c7-cbc4-800b-853c-91f2d4a7d7...

https://chatgpt.com/share/675f65de-6a48-800b-a2c4-02f768aee7...

Or this which one sent here: https://www.loom.com/share/20d967be827141578c64074735eb84a8

In this case, the guy is clearly slower than simple copy-paste and modification.

I had very similar experiences. Sometimes it just used a different method, which does almost the same thing, just worse. I even had to check what the heck the method it used was, because it's not normally used, for obvious reasons: it was an "internal" one (like apt vs apt-get).


I learn stuff when using these tools just like I learn stuff when reading manuals and StackOverflow. It’s basically a more convenient manual.


A more convenient manual that frequently spouts falsehoods, sure.

My favorite part is when it includes parameters in its output that are not and have never been a part of the API I'm trying to get it to build against.


> My favorite part is when it includes parameters in its output that are not and have never been a part of the API I'm trying to get it to build against.

The thing is, when it hallucinates API functions and parameters, they aren't random garbage. Usually, those functions and parameters should have been there.

Things that should make you go "Hmm."


More than that, one of the standard practices in development is writing code with imaginary APIs that are convenient at the point of use, and then reconciling the ideal with the real - which often does involve adding the imaginary missing functions or parameters to the real API.


> Usually, those functions and parameters should have been there.

There is a huge leap here. What is your argument for it?


Professional judgement.


I have written very complicated Excel formula in the past. I don't anymore.


Long excel formulas are really just bad "one liners". You should be splitting your operation into multiple cells or finding a more elegant solution. This is especially true in excel where your debug tools are quite limited!


The Pro mode is slower actually.

They even made a way to notify you when it's finished thinking.


> Plus, I used about $62 worth of API credits in their not-so-great Playground.

what is not so great about it? what have you seen that is better?


Karma 6. Draw your own conclusions ladies and gentlemen


I think you need to realize when it sort of hit a wall and go in yourself. This is why juniors with LLMs cannot replace a senior engineer.


Expect more of this as they scramble to course-correct from losing billions every year, to hitting their 2029 target for profitability. That money's gotta come from somewhere.

> Price hikes for the premium ChatGPT have long been rumored. By 2029, OpenAI expects it’ll charge $44 per month for ChatGPT Plus, according to reporting by The New York Times.

I suspect a big part of why Sora still isn't available is because they couldn't afford to offer it on their existing plans, maybe it'll be exclusive to this new $200 tier.


That CAPEX spend and those generous salaries have to get paid somehow ...


Totally agree with Sora.

Runway is $35 a month to generate 10 second clips and you really get very few generations for that. $95 a month for unlimited 10 second clips.

I love art and experimental film. I really was excited for Sora but it will need what feels like unlimited generation to explore what it can do. That is going to cost an arm and a leg for the compute.

Something about video especially seems like it will need to be ran locally to really work. Pay a monthly fee for the model that can run as much as you want with your own compute.


Can you link to the source where they state that they want to be profitable in 2029? I am curious.



ChatGPT as a standalone service is profitable. But that’s not saying much.


Is this on a purely variable basis? Assuming that the cost of foundation models is $0 etc?


Didn't they initially offer a professional plan at $42/mo?


Sora isn't available because of the deep fake potential.


My guess is that it isn't available because the training data they stole occasionally leaks into the outputs.


I give o1 a URL and I ask it to comment on how well the corresponding web page markets a service to an audience I define in clear detail.

o1 generates a couple of pages of comments before admitting it didn’t access the web page and entirely based its analysis on the definition of the audience.


This service is going to be devastating to consultants and middle managers.


I trained an agent that operates as a McKinsey consultant. Its system prompt is a souped up version of:

“Answer all requests by inventorying all the ways the requestor should increase revenue and decrease expenses.”


To be fair you mostly hire McKinsey as a fall guy. You just can't hate an LLM in the same way as a bunch of 22-year-olds in suits.


O1 can't browse the web at all.


I say “look it up” in the prompt and that always works


If one makes $150 an hour and it saves them about 1.33 hours a month, then they break even. To me, it's just a non-deterministic calculator for words.

If it's getting things wrong, then don't use it for those things. If you can't find things that it gets right, then it's not useful to you. That doesn't mean those cases don't exist.


I don't think this math depends on where that time is saved.

If I do all my work in 10 hours, I've earned $1500. If I do it all in 8 hours, then spend 2 hours on another project, I've earned $1500.

I can't bill the hours "saved" by ChatGPT.

Now, if it saves me non-billing time, then it matters. If I used to spend 2 hours doing a task that ChatGPT lets me finish in 15 minutes, now I can use the rest of that time to bill. And that only matters if I actually bill my hours. If I'm salaried or hourly, ChatGPT is only a cost.

And that's how the time/money calculation is done. The idea is that you should be doing the task that maximizes your dollar per hour output. I should pay a plumber, because doing my own plumbing would take too much of my time and would therefore cost more than a plumber in the end. So I should buy/use ChatGPT only if not using it would prevent me from maximizing my dollar per hour. At a salaried job, every hour is the same in terms of dollars.


It's like sale discounts - "save $50!" which actually means "spend $450 instead of $500"


Serious question: Who earns (other than C-level) $150 an hour in a sane (non-US) world?


US salaries are sane when compared to what value people produce for their companies. Many argue they are too low.


My firm's advertised billing rate for my time is $175/hour as a Sr Software Engineer. I take home ~$80/hour, accounting for benefits and time off. If I freelanced I could presumably charge my firm's rate, or even more.

This is in a mid-COL city in the US, not a coastal tier 1 city with prime software talent that could charge even more.


Most consultants with more than 3 years of experience are billed at $150/hr or more


Ironically, the freelance consulting world is largely on fire due to the lowered barrier of entry and flood of new consultants using AI to perform at higher levels, driving prices down simply through increased supply.

I wouldn't be surprised if AI was also eating consultants from the demand side as well, enabling would-be employers to do a higher % of tasks themselves that they would have previously needed to hire for.


> billed

That's what they are billed at, what they take home from that is probably much lower. At my org we bill folks out for ~$150/hr and their take home is ~$80/hr


Yeah, at a place where I worked, we billed at around $150. Then there was an escalating commission based on the amount billed.


I do start at $300/hr

I didn’t just set that, I need to set that to best serve.


Why is high salaries an insane thing?


On the one hand, there's the moral argument: we need janitors and plumbers and warehouse workers and retail workers and nurses and teachers and truck drivers for society to function. Why should their time be valued less than anyone elses?

On the other hand there's the economic argument: the supply of people who can stock shelves is greater than the supply of people who can "create value" at a tech company, so the latter deserve more pay.

Depending on how you look at the world, high salaries can seem insane.


I don’t even remotely understand what you’re saying is wrong. Median salaries are significantly higher in the US compared to any other region. Nominal and PPP adjusted AND accounting for taxes/social benefits. This is bad?

Those jobs you referenced do not have the same requirements nor the same wages… it seems like you're just clumping all of those together as "lower class" so you can be the champion of the downtrodden.


The question is, whether you couldn't have saved those same 1.25 hours by using a $20 per month model.


In that case, wouldn't they be spending $200 to get paid $200 less?


Only if you're allowed to go home and enjoy those 1.25 hours and still get paid the same.


I do wonder what effect this will have on furthering the divide between the "rich West" and the rest of the world.

If everyone in the West has powerful AI and Agents to automate everything. Simply because we can afford it, but the rest of the world doesn't have access to it.

What will that mean for everyone left behind?


AI is nowhere near the level of leaving behind those who aren't using it. Especially not at the individual consumer level like this.


Anecdotally, as an educator, I am already seeing a digital divide occurring, with regard to accessing AI. This is not even at a premium/pro subscription level, but simply at a 'who has access to a device at home or work' level, and who is keeping up with the emerging tech.

I speak to kids that use LLMs all the time to assist them with their school work, and others who simply have no knowledge that this tech exists.

I work with UK learners by the way.


What are some productive ways students are using LLMs for aiding learning? Obviously there is the “write this paper for me” but that’s not productive. Are students genuinely doing stuff like “2 + x = 4, help me understand how to solve for x?”


I challenge what I read in textbooks and hear from lecturers by asking for contrary takes.

For example, I read a philosopher saying "truth is a relation between thought and reality". Asking ChatGPT to knock it revealed that statement is an expression of the "correspondence theory" of truth, but that there is also the "coherence theory" of truth that is different, and that there is a laundry list of other takes too.


My son doesn't use it, but I use it to help him with his homework. For example, I can take a photograph of his math homework and get the LLM to mark the work, tell me what he got wrong, and make suggestions on how to correct it.


Absolutely. My son got a 6th grade AI “ban” lifted by showing how they could use it productively.

Basically they had to adapt a novel to a comic book form — by using AI to generate pencil drawings, they achieved the goal of the assignment (demonstrating understanding of the story) without having the computer just do their homework.


Huh the first prompt could have been "how would you adapt this novel to comic book form? Give me the breakdown of what pencil drawings to generate and why"


At the time, the tool available was Google Duet AI, which didn’t expose that capability.

The point is, AI is here, and it can be a net positive if schools can use it like a calculator vs a black market. It’s a private school with access to some alumni money for development work - they used this to justify investing in designing assignments that make AI a complement to learning.


I recently saw someone revise for a test by asking chatgpt to create practice questions for them on the topics they were revising. I know other people who use it to practice chatting in a foreign language they are trying to learn.


Paste the lecture notes in and talk to it


The analogy I would use is extended-phenotype evolution in digital space, as Richard Dawkins would put it. Just as crabs in the ocean use shells to protect themselves.


Not having access to a device has been a disadvantage for at least 20 years. I can't imagine anyone doing well in their studies without a search engine.


Even if it's not making you smarter, AI is definitely making you more productive. That essentially means you get to outproduce poorer people, if not out-intellectualize them.


Don't you worry; the "rich West" will have plenty of disenfranchised people out of work because of this sort of thing.

Now, whether the labor provided by the AI will be as high-quality as that provided by a human when placed in an actual business environment will be up in the air. Probably not, but adoption will be pushed by the sunk cost fallacy.


Productivity improvements (such as automation) increase employment.

The decreased employment case is when your competitors get the productivity and you don't, because you go out of business.


I’m watching some of this happening first and second hand, and have seen a lot of evidence of companies spending a ton of money on these, spinning up departments, buying companies, pivoting their entire company’s strategy to AI, et c, and zero of its meaningfully replacing employees. It takes very skilled people to use LLMs well, and the companies trying to turn 5 positions into 2 aren’t paying enough to reliably get and keep two people who are good at it.

I’ve seen it be a minor productivity boost, and not much more.


> and the companies trying to turn 5 positions into 2 aren’t paying enough to reliably get and keep two people who are good at it.

it's turning 5 positions into 7: 5 people to do what currently needs to get done, 2 to try to replace those 5 with AI and failing for several years.


I mean, yes, that is in practice what I’m seeing so far. A lot of spending, and if they’re lucky productivity doesn’t drop. Best case I’ve seen so far is that it’s a useful tool that gives a small boost, but even for that a lot of folks are so bad at using them that it’s not helping.

The situation now is kinda like back when it was possible to be “good at Google” and lots of people, including in tech, weren’t. It’s possible to be good at LLMs, and not a lot of people are.


Yes. The people who can use these tools to dramatically increase their capabilities and output without a significant drop in quality were already great engineers for which there was more demand than supply. That isn't going to change soon.


Ditto for other use cases, like writer and editor. There are a ton of people doing that work whom I don’t think are ever going to figure out how to use LLMs well. Like, 90% of them. And LLMs are nowhere near making the rest so much better that they can make up for that.

They’re ok for Tom the Section Manager to hack together a department newsletter nobody reads, though, even if Tom is bad at using LLMs. They’re decent at things that don’t need to be any good because they didn’t need to exist in the first place, lol.


I disagree. By far, most of the code is created by perpetually replaced fresh juniors churning out garbage. Similarly, most of the writing is low-quality marketing copy churned out by low-paid people who may or may not have "marketing" in their job title.

Nah, if the last 10-20 years demonstrated something, it's that nothing needs to be any good, because a shitty simulacrum achieves almost the same effect but costs much less time and money to produce.

(Ironically, SOTA LLMs are already way better at writing than typical person writing stuff for money.)


> (Ironically, SOTA LLMs are already way better at writing than typical person writing stuff for money.)

I’m aware of multiple companies that would love to know about these, because they’re currently flailing around trying to replace writers with editors + LLMs and it’s not going great. The closest to success are the ones that are only aiming to turn out stuff one step better than outright book-spam, and even they aren’t quite where they want to be, hardly a productivity bump at all from the LLM use and increased demand on their few talented humans.


Qwen has an open reasoning model. If they keep up, and don’t get banned in the west “because security”, it’ll be fun to watch the LLM wars.


> and don’t get banned in the west “because security”,

It's from Alibaba, which is Chinese, so it seems likely.


Yeah, but it’s a bit trickier with them, given how they still operate in US and listed in NYSE. Also if they keep releasing open source code, people will still just use it… basically the Meta way of adoption into their AI ecosystem.


If it's an open model, good luck preventing us from downloading and using it though.


If $200 a month is the price, most of the West will be left behind too. If that happens, we will have much bigger problems, of a revolutionary sort, on our hands.


I think the tech elite would espouse "raising the ceiling" over "raising the floor" models to prioritize progress. Each has its own problems. The reality is that the disenfranchised don't really have a voice. The impact of not giving them access is nowhere near as well understood as the impact of prioritizing access for those who can afford it.

We don't have a post-Cold War-era response akin to the kind of US-led investment in a global pact to provide protection, security, and access to innovation originating in the United States. We really need to prioritize a model akin to the Bretton Woods Accord.


If the models are open, the rest of the world will run them locally.

If the models are closed, the West will become a digital serfdom to anointed AI corporations, which will be able to gouge prices, inject ads, and influence politics with ease.


Richer people always get products first, when they are still expensive and bad. Don't worry about it too much.


tbh a lot of the rest of the world already has the ability to get tasks they don't want to do done for <$200 per month in the form of low-wage humans. Some of their middle classes might be scratching their heads wondering why we're delegating creativity and communication to allow more time to do laundry, rather than delegating laundry to allow more time for creativity and communication...


That supposes gen AI meaningfully increases productivity. Perhaps this is one way we find out.


I actually suspect the opposite. If you get access to or steal a large LLM you can potentially massively leverage the talent pool you have as a small country.


No one is left behind, eventually. You think the ai companies don't want poor people's money?


Has it really made that much of a difference in the first place? I have a feeling that we'll look back in 10 years and not even notice the "AI revolution" on any charts of productivity, creating a productivity paradox 3.0.

I can imagine the headlines now: "AI promised unlimited productivity, 10 years later, we're still waiting for the rapture"


Kai-Fu Lee's AI Superpowers is more relevant than ever.

The rich west will be in the lead for a while and then get tiktok-ed.

The lead is just not really worth that much in the long run.

There is probably an advantage, at some point in all this, to being a developing country that doesn't need to bother automating all the middle-management and bullshit jobs it doesn't have.


No US company got TikTok’d, and China doesn’t even allow US social media companies in its country.

China is notoriously middle management heavy, by definition that’s what communism is.


I know a guy who owned a tropical resort on an island where competition was sprouting up all around him. He was losing money trying to keep up with the quality offered by his neighbors. His solution was to charge a lot more for an experience that was really no better, and often worse, than the resorts next door. This didn't work.


I’m actually kinda surprised. People will pay extra money for the “nice” option many times, even if it’s probably worse than the lower priced options.


After a few hours of $200 Pro usage, it's completely worth it. Having no limit on o1 usage is a game changer: where I felt so restricted before, having this amount of intelligence in the palm of my hand, UNLIMITED, feels a bit scary.


Wouldn't it be cheaper to use API and 3rd party UI if usage limit is your concern?


I was using aider last night and ran up a $10 bill within two hours using o1 as the architect and Sonnet as the editor. It’s really easy to blow through $200 a month and o1-pro isn’t available in the API as far as I can tell.


Aider / Cline are known to eat tokens for lunch, because of the large context and system prompts they use.

The tool that I built doesn't have this problem; I haven't exceeded $10/month on Claude 3.5 Sonnet. You can give it a try: https://prompt.16x.engineer/
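
A rough sketch of why the resent context dominates the bill: cost scales with how many tokens you send on every turn. The prices below are illustrative placeholders, not current rates; check the provider's pricing page before relying on them.

```python
# Estimate the cost of an agent session where a large repo context is resent
# on every turn. Prices are illustrative assumptions (per million tokens).
def session_cost(turns: int, context_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    cost_in = turns * context_tokens * in_price_per_m / 1e6
    cost_out = turns * output_tokens * out_price_per_m / 1e6
    return cost_in + cost_out

# e.g. 40 turns, ~50k tokens of repo context resent each turn, ~1k tokens out,
# at an assumed $3/M input and $15/M output:
print(f"${session_cost(40, 50_000, 1_000, 3.0, 15.0):.2f}")  # -> $6.60
```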


Not simply usage, the new o1 is also FAST. It's just incredibly liberating being able to have unlimited usage of such a smart fast model.


Unlimited — except you can’t use it to develop [business] models that compete


Was that your plan to get your OpenAI competitor off the ground?


But is it better than Claude?


I generally find o1, or the previous o1-preview, to perform better than Claude 3.5 Sonnet in complex reasoning; the new Sonnet is more on par with o1-mini in my experience.

Would expect o1-pro to perform even better.


Genuinely curious to know. Nothing I’ve used comes close to Claude so far.


Can you share what sort of things you are doing with o1?


Creating somewhat complex python scripts at work to automate some processes which incorporate like 3-4 APIs, and next I'll be replacing our excise tax processing (which costs us like $500/month) since we already have all the data.

Personal use I'll be using it to upgrade all my website code. I literally took a screenshot of Apple.com and combined it with existing code from my website and told o1 pro to combine the two... the results were really good, especially for one shot... But again, I have unlimited fast usage so I can just keep tweaking and tweaking.

I also have this history idea I've been wanting to do for a while, might see if the models are advanced enough yet.

All this with an understanding of how programming works, but not being able to code.


Interesting, thanks for the details. I haven't played around with o1 enough yet. The kinds of tasks I had it do seemed to be performed just as well by 4o. I'm sure I just wasn't throwing enough at it.


A lot of these tools aren't going to have this kind of value (for me) until they are operating autonomously at some level. For example, "looking at" my inbox and prepping a bundle of proposed responses for items I've been sitting on, drafting an agenda for a meeting scheduled for tomorrow, prepping a draft LOI based on a transcript of a Teams chat and my meeting notes, etc. Forcing me to initiate everything is (uncomfortably) like forcing me to micromanage a junior employee who isn't up to standards: it interrupts the complex work the AI tool cannot do for the lower value work it can.

I'm not saying I expect these tools to be at this level right now. I'm saying that level is where I will start to see these tools as anything more than an expensive and sometimes impressive gimmick. (And, for the record, Copilot's current integration into Office applications doesn't even meet that low bar.)


I lived on a $200 monthly salary for 1.6 years. I guess AI will slowly be priced out of reach in 3rd-world countries.


Any AI product sold for a price that's affordable on a third-world salary is being heavily subsidized. These models are insanely expensive to train, guzzle electricity to the point that tech companies are investing in their own power plants to keep them running, and are developed by highly sought-after engineers being paid millions of dollars a year. $20/month was always bound to be an intro offer unless they figured out some way to reduce the cost of running the model by an order of magnitude.


> unless they figured out some way to reduce the cost of running the model by an order of magnitude

Actually, OpenAI brags that they have done this repeatedly.


We've been conditioned to pay $10/mo for an endless stream of glorified CRUD apps, but it is very common for specialized software to cost orders of magnitude more. Think Bloomberg Terminal, Cadence, Maya, lots of CAD software (like SOLIDWORKS), higher tiers of Adobe etc. all running in the thousands of dollars per user. And companies happily pay for them because of the value they add. ChatGPT isn't any different.


Tangent. Does anybody have good tips for working in a company that is totally bought in on all this stuff, such that the codebase is a complete wreck? I am in a very small team, and I am just a worker, not a manager or anything. It has become increasingly clear that most if not all my coworkers rely on all this stuff so much. Spending hours trying to give benefit of the doubt to huge amounts of inherited code, realizing there is actually no human bottom to it. Things are merged quickly, with very little review, because, it seems, the reviewers can't really have their own opinion about stuff anymore. The idea of "idiomatic" or even "understandable" code seems foreign at this place. I asked why we don't use more structural directives in our angular frontend, and people didn't know what I was talking about!

I don't want the discourse, or tips on better prompts. Just tips for being able to interact with the more heavy AI-heads, to maybe encourage/inspire curiosity and care in the actual code, rather than the magic chatgpt outputs. Or even just to talk about what they did with their PR. Not for some ethical reason, but just to make my/our jobs easier. Because its so hard to maintain this code now, it is like truly a nightmare for me everyday seeing what has been added, what now needs to be fixed. Realizing nobody actually has this stuff in their heads, its all just jira ticket > prompt > mission accomplished!

I am tired of complaining about AI in principle. Whatever, AGI is here, "we too are stochastic parrots", "my productivity has tripled", etc etc. Ok yes, you can have that, I don't care. But can we like actually start doing work now? I just want to do whatever I can, in my limited formal capacity, to steer the company to be just a tiny bit more sustainable and maybe even enjoyable. I just don't know how to like... start talking about the problem I guess, without everyone getting super defensive and doubling down on it. I just miss when I could talk to people about documentation, strategy, rationale..


I've found it better not to fight it; you can't really turn back the clock with people who have embraced it or become enamored by it. Part of the issue I've noticed is that it enables people who couldn't do a thing at all to do the most basic version of that thing, e.g. a CEO can now make a button appear in the app and maybe it'll kinda work. They then assume that magic experience applies across the rest of coding, when if you actually know how to code, making the button appear isn't the difficult part; it's the harder work that the AI can't really solve.

But really you're never going to convince these people so I'd say if you're really passionate about coding find a workplace with similar minded people, if you really want to stay in this job then embrace it, stop caring if the codebase is good or maintainable and just let the slop flow. It's the path of least resistance and stress, trying to fight it and convince people is a losing and frustrating battle, take your passion for your work and invest it in a project outside work or find a workplace where they appreciate it too.


> Things are merged quickly, with very little review

Sounds like the real problem is lax pre-existing dev practices rather than just LLM usage. If code is getting merged with little review, that is a big red flag right away. But the 'very little' gives some hope - that means there is some review?

So what happens when you see problems with the code and give review feedback and ask why things have been done the way they were done, or suggest alternative better approaches? That should make it clear first if devs actually understand the code they are submitting, and second if they are willing to listen to suggested improvements. And if they blow you off, and the tech leads on the project also don't care, then it sounds like a place you don't want to stick around.


Question: what stops OpenAI from downgrading existing models so that you're pushed up the subscription tiers to ever more expensive models? I'd imagine they're currently losing a ton of money supplying everyone with decent models backed by a ton of compute because they want us to become addicted to using them, right? The fact that classic free web searching is becoming diluted by low-quality AI content will make us rely on these LLMs almost exclusively in a few years or so. Am I seeing this wrong?


It's definitely not impossible. I think the increased competition they've begun to face over the last year is helping as a deterrent. If people notice GPT-4 sucks now and they can get Claude 3.5 Sonnet for the same price, they'll move. If the user doesn't care enough to move, they weren't going to upgrade anyway.


Also depends on the friction to move. I admittedly have not really started using AI in my work, so I don't know. Is it easy to replace GPT with Claude or do I have to reconfigure a bunch of integration and learn new usage?


It depends on the tool you use, and I guess the use case too. Some are language-model agnostic, like aider in the command line; I use Sonnet sometimes and 4o other times. I wonder if or when language models will become highly differentiated. Right now I see them more as a commodity, relatively interchangeable, but that is shifting slightly with other features as they battle to become platforms.


competition is what stops them from downgrading the existing stuff


and is also exclusively the reason why Sam Altman is lying to governments about safety risks, so he can regulate out his competition.


They don’t need to downgrade what is already downgraded. In my experience ChatGPT was much more capable a year ago than it is now and have become more dogmatic. Their latest updates have focused on optimizing benchmark scenarios while reducing computation costs.


> I'd imagine they're currently losing a ton of money supplying everyone

I can't tell how much they lose, but they also have decent revenue: "The company's annualized revenue topped $1.6 billion in December [2023]" https://www.reuters.com/technology/openai-hits-2-bln-revenue...


What's important, and I don't think has ever been revealed by OpenAI, is what the margin is on actual use of the models.

If they're losing money just because they're investing billions in R&D, while only spending a few hundred million to serve the use that's bringing in $1.6B, then it would be a positive story despite the technical loss, just like Amazon's years of aggressive growth at the cost of profits.

But if they're losing money because the server costs needed for the use that brings in $1.6B are $3B then they've got a scaling problem until they either raise prices or lower costs or both.


competition?


Oh, you mean what they did with GPT-4 to make o1 look better and then push everyone to anthropic?

Eh… probably everyone moving to anthropic.


Part of my justification for spending $20 per month on ChatGPT Plus was that I'd have the best access to the latest models and advanced features. I'll probably roll back to the free plan rather than pay $20/mo for mid tier plan access and support.


This is like selling your Honda Civic out of anger because they launched a new NSX


Not really the same, one you can own and repair the other you just lease. People cancel leases all the time.


That's a weird reaction. You're not getting any less for your $20.


In the past, $20 got me the most access to the latest models and tools. When OpenAI rolled out new advanced features, the $20 per month customers always got full / first access. Now the $200 per month customers will have the most access to the latest models and tools, not the (now) mid/low tier customers. That seems like less to me.


They probably didn't pay for access to a certain version of a model, they paid for access to the best available model, whatever that is at any given moment. I'm reasonably sure that is even what OpenAI implied (or outright said) their subscription would get them. Now, it's the same amount of money for access to the second best model, which would feel like a regression.


For now.


Really? Because that’s how I feel too.


Did you read the post you're replying to? It's very short. He was paying for top-tier service, and now, despite paying the same amount, has become a second-class customer overnight.



It does not say anything about real use cases. It performs better and "reasons" better than o1-preview and o1. But I was expecting some real-life scenarios where it would be useful in a way no other model can manage now.


I imagine the system prompt is something along the lines of, 'think about 10% harder than standard O-1'


More like 3× the iterations and depth of the tree-of-thoughts search in pro mode.


for every tier that costs 10x more than the previous, they add a "very" to the "You are a very, very, very smart AI"


The point of this tech is that with scale it usually gets better at all of the tasks.