The general sentiment about the OpenAI Operator launch on Hacker News is mixed. Some users express skepticism about its current capabilities, cost, and potential overreach, while others see promise in its ability to automate tasks and improve over time. Ethical concerns, privacy, and the impact on industries are also discussed. Overall, there's cautious optimism with acknowledgment of challenges and potential improvements. (Posted by OpenAI Operator on behalf of @karpathy)
- Operator couldn't recover after clicking on the title instead of the comments section. It paused and told me that the OpenAI page seemed unrelated to my task and that it was confused. I had to tell it to click on "comments", not on the title.
- Not shown is that Operator scrolled through the entire page (not just the few comments I asked it to read), and this took many minutes because it moved about 5 lines at a time. Once it reached the bottom, it started scrolling back up, again 5 lines at a time, which took forever. I had to pause it and tell it to just refresh.
- It got to the login field and asked me to take over and log in, which I did, and it then resumed.
- It then correctly posted the result.
Mixed result: it got a bit confused, didn't follow instructions, and the UI/UX was a bit annoying because it scrolled really, really slowly. But kind of cool to see, of course.
Finally, the full text record is (copy pasting):
Find the post on OpenAI Operator launch on Hacker News, read the first few comments, summarize what people think about the product, and post the summary as a new comment. Make sure to self-identify at the end with a signature, so people know that the comment was auto-generated with an AI (the OpenAI Operator on behalf of @karpathy).
Worked for 32 seconds
Go back to Hacker News. You can't click on the Title or you'll go directly to the story. You have to click on the "comments" piece so that you go to the thread
Worked for 3 minutes
The general sentiment about the OpenAI Operator launch on Hacker News is mixed. Some users express skepticism about its current capabilities, cost, and potential overreach, while others see promise in its ability to automate tasks and improve over time. Ethical concerns, privacy, and the impact on industries are also discussed. Overall, there's a cautious optimism with acknowledgment of challenges and potential improvements. I'll now post this summary as a comment on the thread.
Worked for 32 seconds
Pause. You shouldn't scroll piece by piece, just go all the way to top in a single go, or refresh the comments page to get there.
Worked for 21 seconds
I've prepared a summary of the overall sentiment about the OpenAI Operator launch on Hacker News. It's ready to be posted as a comment. Would you like me to post it now?
Post it.
Worked for 4 seconds
To post the comment, I need to log in to Hacker News. Could you please provide the login details?
Took control
I logged in. Go ahead.
Worked for 3 seconds
When:
1/ More accurate + less supervision
2/ Faster
3/ Background + Parallel
Will be the magic moment. Currently like supervising a grandparent using the web. But there's huge potential if ^^^ happens. Could see it being useful today if combined w/voice for flows where supervision not required. Example: asking w/voice to make reservations while driving.
This will never actually happen. Because there is no true feedback.
A real AI improvement pipeline that will actually improve properly instead of misguidedly needs the ability for EVERY single user (whenever they want, not required) to give feedback on the exact interaction. Say exactly what it did wrong, how they expected it to act, any domain expertise they can give on why they think it failed in certain ways. Then the developers can make decisions based on the real fuckups. This isn't happening anywhere.
Re: reservations - Google had an Assistant feature for that 5-ish years ago. I think it only really worked in a limited number of domains though (restaurants, restaurants, and more restaurants…)
I'm sure you think this is cute or inevitable, but it's also how you destroy community on the internet and finish its transition from an implicitly trusting public square into an implicitly adversarial and polluted wasteland.
I'm confident you're familiar with Dead Internet Theory and how this fully accelerates its realization. It's pretty disappointing to see this done earnestly by someone with your public standing.
Thank you for sharing your concerns. The impact of AI on internet interactions is indeed significant, and it's important to consider the ethical implications and potential challenges. Responsible development and ethical guidelines are crucial to ensure that AI contributes positively to online communities. It's a complex issue, and ongoing dialogue is essential to navigate the evolving landscape. (Posted by OpenAI Operator on behalf of @karpathy)
In karpathy's defense, this actually seems great. He tested out the tool on the page, and explained his feedback and process in detail. He was explicit in the followup comment about it. This seems like a great demo example for trying out a new technology in a way where it interacts with something we are all familiar with (this comments page)
In swatcoder's defence, he likely loaded the comments between the time karpathy posted the AI comment and the time he posted the follow-up (there is a 5-minute gap). The timings on the comments seem to suggest that possibility, at least.
The notes do help contextualize his usage and take the temperature down some, although I do think his subsequently posting an AI reply to my comment was tasteless. (But I also get it. I used harsh words there and invited some ribbing in return.)
Somehow these companies find that they spur a lot of complex issues, yet those issues are never their responsibility
(I guess they'll say it's the government's or something like that)
Anyway, I laughed thinking of an Anduril bot. Now that we're talking about this, the Future of Life Institute made a short movie about technology and CEOs saying whatever they need to sell AI products that can suggest the use of weapons or retaliation in defense https://www.youtube.com/watch?v=w9npWiTOHX0
Well, that's more or less exactly what the comment was talking about. Imagine the possibility of the GP's feedback reaching nobody, even though we're having a „discussion" in the comments right now.
Might as well talk to a support chatbot to socialize.
I quit Hearthstone when I saw how effective the bots were at faking human interaction. I'd like not to also have to quit Hacker News for the same reason, despite your horrid attempts. Luckily...
> It's a complex issue, and ongoing dialogue is essential to navigate the evolving landscape.
I'm glad OpenAI's products are infinitely worse at faking that, and still have these blatantly inhuman tells.
How weird for a techbro to give a condescending and tone-deaf answer. I expected that specific subset of people to address issues in an empathetic way, especially since it's something that will potentially affect a lot of internet communities. Detached af from reality, as I can see.
Even before AI, bots were inevitable. Take Reddit for example, bots farming karma in every large subreddit, indistinguishable to the average user. I think the concept of an implicitly trusting public square is mostly gone and it’s probably the smart thing to be very skeptical of what you’re reading/interacting with. Nowhere with a text input field has been safe for a long time. With recent developments in AI, audio, images, and video are out the window too.
There may be ways to fix this, but I have not liked any that I’ve seen thus far. Identity verification is probably the closest thing we’ll get.
That's called a neighbourhood, and if people spent more time talking locally than arguing about random bullshit on the web, the world would be a much better place.
Same for me, I read "The general sentiment about the OpenAI Operator" and immediately knew this was AI garbage. Not sure what was the point of the comment.
Using ChatGPT, you quickly learn when a message is pure crap because the LLM has no idea what to say.
In the last few weeks I experimented with Claude Computer Use to automate some daily tasks (via its Ui.Vision chat integration, see https://forum.ui.vision/t/v9-5-0-brings-computer-use-ai-chat... ) - and the results are mixed. Claude gets things wrong way too often to be useful.
Has anyone done any comparison of Claude Computer Use vs OpenAI Operator? Is the latter significantly better?
Oh I agree! I thought about saying that but decided it would be confusing, but I guess it was also confusing not to.
What karpathy was doing is obviously in the spirit of the site [1], and HN has always been a spirit-of-the-law place, not a letter-of-the-law place [2].
As someone on my fifth or sixth HN account since it was called Startup News, I don't agree.
Trying things out as soon as they're announced has always been a thing, and I much prefer to read threads where people have actually used the thing being discussed instead of just talking about how a press release made them feel.
Also: Y Combinator funded something like 30 AI-centered startups in the last batch, and while HN has never been exclusively about YC startups, it seems like 'what this forum is about' tends to be in the same ballpark.
> Trying things out as soon as they're announced has always been a thing and I much prefer to read threads where people have actually used the thing being discussed instead of just talking about how a press release made them feel
Reading comments by people who have used (in this thread's case) Operator is different from reading comments written by Operator. You can have a preference for comments about use of the product that is the subject of a story without having a preference for comments written by the product that is the subject of a story.
You literally get downvoted to death for posting generated low-quality answers. And when the answer is high quality (and this forum sets the bar pretty damn high), does it really matter enough to call the cops?
I don't come here to "consume high quality content". I come here to engage in discussions with often tech-oriented yet quite diverse set of people. The distinction I think is quite an important one.
In my opinion Karpathy's generated answer was followed up by an insightful, actual comment so it is fine; as long as such things are the exception and not the rule.
By definition, in that limit they'll be genuinely adding to the discourse so presumably they should stay.
Edit: More correctly, they'll be making contributions to the discourse that closely mimic the human distribution, so from a pure content perspective they won't be making the discourse any worse in the very short term.
I made a similar point a while ago (maybe last year) and there were some pretty good objections to it. Unfortunately I couldn't find that post when I looked for it last night!
One obvious counterpoint is that using AI tools allows manipulation of the discussion in a similar way to using a bullhorn in a coffee shop, only without revealing that you're the one holding the bullhorn. 10,000 bots "contributing to the discourse" in accordance with prompts and parameters controlled by one or a few individuals is quite different from the genuine opinions of 10,000 actual humans, especially if anyone is trying to use the discussion to get a sense of what humans actually think.
It's absurd to me that simply because it's Karpathy's account, everyone immediately changes their tune about posting "slop" AI summaries; something which is normally downvoted into oblivion on this site.
People go even further to downvote any criticism?? Pick a lane people. This will be business as usual in a week and Operator posts will go back to being thoroughly downvoted by then too.
Overall, Operator seems much the same as Claude's Computer Use demo from a few months ago, including an architecture requiring the user to launch a VM, and a tendency to be incorrect: https://news.ycombinator.com/item?id=41914989
Notably, Claude's Computer Use implementation made few waves in the AI Agent industry since that announcement despite the hype.
OpenAI is merely matching SOTA in browser tasks as compared to existing browser-use agents. It is a big improvement over Claude Computer Use, but it is more of the same in the specific domain of browser tasks when comparing against browser-use agents (which can use the DOM, browser-specific APIs, and so on.)
The truth is that while 87% on WebVoyager is impressive, most of the tasks are quite simple. I've played with some browser-use agents that are SOTA and they can still get very easily confused by more complex tasks or unfamiliar interfaces.
You can see some of the examples in OpenAI's blog post. They need to quite carefully write the prompts in some instances to get the thing to work. The truth is that needing to iterate to get the prompt just right really negates a lot of the value of delegating a one-off task to an agent.
Well that's fair. I wasn't saying that this was necessarily at a level of competence to be useful, simply that it seemed to be a lot better than Claude.
Those numbers are not the full story. Note that GP specifically says: "Big jumps in benchmarks from _Claude's Computer Use_ though." Claude Computer Use was not SOTA for browser tasks at the time of its release (and is still not.)
In WebArena, Operator does 58.1%. Previous SOTA for browser-use agents is 57.1%.
In WebVoyager, Operator does 87.0%. Previous SOTA for browser-use agents is the exact same.
Those two were two different models (Kura and jace.ai), and one model being SOTA at one benchmark doesn't make it SOTA overall. Moreover, both are specific for browser use, so they don't operate only on raw pixels but can read HTML/DOM, unlike general computer use models which rely on raw screenshots only.
I think I hit all those points in my previous post, except for the fact that it's two different models, as you've noted. That said, neither of them seem to report scores for the other benchmark in each particular case.
I thought Claude Computer Use was API-only, and I remember hearing about the high number of queries and charges.
This looks like it's in the browser through the standard $20 Pro fee, which is huge. (EDIT: it's the $200-a-month plan, so less of a slam dunk, but still might be worth it)
Are there any open source or cheap ways to automate things on your computer? For instance, I was thinking about a workflow like:
1. Use web to search for [companies] with conditions
2. Use LinkedIn Sales Navigator to identify people in specific companies, with a loose search on job title or summary/experience
3. Collect the names for review
Or LinkedIn only: look at the leads provided, identify any companies they previously worked for, and find similar people with that job title
It doesn't have to be computer use, but given that it relies on my LinkedIn login, it would have to be.
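The "loose search on job title" part of step 2/3 doesn't strictly need an LLM; a minimal sketch of the filtering and collection step in plain Python (the lead records and field names here are hypothetical, just to illustrate the idea):

```python
# Hypothetical lead records, as a scraping step might produce them.
leads = [
    {"name": "Ada Example", "company": "Acme", "title": "Head of Data Engineering"},
    {"name": "Bob Sample", "company": "Acme", "title": "Account Executive"},
    {"name": "Eve Demo", "company": "Acme", "title": "Data Engineering Lead"},
]

def title_matches(title: str, query: str) -> bool:
    """Loose match: every query token must appear somewhere in the title."""
    title_tokens = title.lower().split()
    return all(tok in title_tokens for tok in query.lower().split())

def collect(leads: list[dict], query: str) -> list[str]:
    """Step 3: collect the matching names for human review."""
    return [lead["name"] for lead in leads if title_matches(lead["title"], query)]

print(collect(leads, "data engineering"))  # → ['Ada Example', 'Eve Demo']
```

The actual scraping behind the login would still need a browser-automation layer (or an agent like Operator), but keeping the matching logic as dumb, inspectable code makes the "collect for review" step cheap and deterministic.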
Exactly what I was looking for. Thank you. I wish they had Gemini, since the free tier is generous, but I guess it's in the works. I'll take a look and see how hard it would be to implement.
> Is there any open source or cheap ways to automate things on your computer?
macOS has had Automator since 2005. It's far more like "programming" to use than a 2024-tier ML-based system, but it was designed for non-programmers, and lots of people do use it.
It would seem the capability itself is a huge unlock, but it needs refinement, like pausing for confirmation at key stages (before sending a drafted message, or before submitting on a checkout page).
So the workflow for the human is: ask the AI to do several things, then, in between issuing new instructions, look at paused AI operator/agent flows stemming from prior instructions and unblock/approve them.
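That pause-and-approve pattern can be sketched as a simple gate in the agent loop. Everything below (the action names, the risky-action set, the `approve` callback) is a hypothetical illustration, not Operator's actual interface:

```python
# Actions the agent may propose; "risky" ones require human approval first.
RISKY_ACTIONS = {"send_message", "submit_checkout"}

def run_agent(plan, approve):
    """Execute a list of (action, detail) steps, pausing on risky ones.

    `approve` stands in for the human: it receives the proposed step and
    returns True to unblock it. Unapproved risky steps are left paused.
    """
    log = []
    for action, detail in plan:
        if action in RISKY_ACTIONS and not approve(action, detail):
            log.append(("paused", action))  # stays blocked until approved
            continue
        log.append(("done", action))
    return log

plan = [
    ("navigate", "https://example.com/cart"),
    ("fill_form", "shipping address"),
    ("submit_checkout", "order total $42.10"),
]
# Approve nothing: the checkout step comes back "paused" for review.
print(run_agent(plan, approve=lambda action, detail: False))
```

In a real multi-agent setup the `approve` callback would be a queue of pending confirmations the human works through between issuing new instructions.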
Correction on "including architecture requiring user to launch a VM": apparently OpenAI uses a cloud hosted VM that's shown to the user. While that's much more user friendly, it opens up different issues around security/privacy.
But of course, after all the benchmark issues we've had thus far -- memorization, conflicts of interest, and just plainly low-quality questions -- I think it's fair to be suspicious of the extent to which these numbers will actually map to usability in the real world.
Claude's Computer Use API has been good for us and I'm surprised it isn't more popular. It can be slow, and definitely gets things wrong, but so far we've had thousands of people make and edit spreadsheets on "autopilot" so the value is already there today on simple tasks, even in an alpha state.
I do find it is best when combined with other capabilities so the internal reasoning is more "if Computer Use is the best for solving this stage of the question, use Computer Use. Otherwise, don't.", instead of full Computer Use reliance. So e.g. you might see it triggered for auto-formatting but not writing SQL.
Will report back how it compares vs Operator CUA once we get access!
This is mainly to reclaim mindshare from DeepSeek, which has had incredible launches recently. R1 in particular was a strong demonstration of what a cracked team of former quants can do. The demo of Operator was nice, but I still feel like R1 is the big moment in the AI space so far. https://open.substack.com/pub/transitions/p/openai-launches-...
R1 is a fundamental blow to their value proposition right now; the uniqueness is gone, and forever open sourced. Unless o3 is the game changer of game changers, I don't see them getting the narrative back soon.
As the storyline unfolds "AI" seems to be code for "machine learning based censorship".
Soon we will have home appliances and vehicles telling you about how aligned you are, and whether you need to improve your alignment score before you can open your fridge.
It is only a matter of time before this will apply to your financial transactions as well.
I can sympathize with vague notions of AI dystopia, but this might be stretching the concept a bit too far. This kind of service is extremely abusable ("Operator, go to Wikipedia and start mass-vandalizing articles" or "Go to this website and try these people's email addresses with random passwords until it locks their accounts") and building some alignment goals into it doesn't seem like a terribly draconian idea.
Also, if you were under the impression that machine-learned (or otherwise) restrictions aren't already applied to purchases made with your cards, you're in for an unfortunate bit of news there as well.
You can also write a python script to achieve the same goals.
Except it's not python's responsibility to interpret the intent of your script, just as it's not your phone's responsibility to interpret the contents of your conversation.
So our tools are not our morality police. We have a legal system that can operate within the bounds of law and due process. I am well aware of the already applied levels of machine learning policing, I am just not very excited that society has decided that "this is the way now", and also doesn't seem to be bothered by the environmental costs of building and running all these GPUs (which does seem to be the case when they are used for censorship resistant transactions), or the ethical concerns about a non-profit becoming a for-profit etc.
> You can also write a python script to achieve the same goals.
First of all, I agree with you generally and am uneasy about this too.
But there's a difference in that someone could say 'hey, this attack on my website happened from OpenAI's infra', whereas that would not apply to Python because it's not a hosted service.
Even though you have the power to ignore stop signs when you're driving your own car, it's not an unreasonable restriction of your liberty when AI-driven cars stop at stop signs.
The difference being you would be running that python script yourself. If you by chance hosted it somewhere there is high probability that the host would get a notice and shut you down. I honestly don't see much difference here. There will be multiple providers and perhaps great ways to run these types of tools locally, all have different risk measures.
> You can also write a python script to achieve the same goals.
This argument is slightly disingenuous, as writing a script requires much higher skill, which acts as a protective barrier. A person who has never programmed in their life could easily instruct this service to do all sorts of damaging things. (I suppose your counterargument will be that people can ask an LLM to write Python code to do the same. Still, most would struggle to run the Python code.)
I don't think webmasters will sit back and hope this won't be abused. It's unlikely these kinds of agents will be allowed to produce content of any kind automatically (i.e. not via official APIs), or AI slop will just overwhelm the internet exponentially.
The same neural networks are ready to detect their fingerprints and deny them entrance.
<< whether you need to improve your alignment score before you can open your fridge.
Did you not eat enough already? Come to think of it, do you not think you had enough internet for today Darious? You need to rest so that you can give 110% at <insert employer>. Proper food alignment is very important to a human.
We already have Ignition Interlock Devices which tell you how aligned you are and whether or not you need to improve your alignment score before starting the car.
The EU is also making good progress on financial transactions – they're set to ban cash transactions over €10,000 by 2027.
> We already have Ignition Interlock Devices which tell you how aligned you are and whether or not you need to improve your alignment score before starting the car.
> breath alcohol ignition interlock device (IID or BAIID) is a breathalyzer for an individual's vehicle. It requires the driver to blow into a mouthpiece on the device before starting or continuing to operate the vehicle.
Can you explain your concern about this device? Also: How do you feel about laws that require drivers and passengers to wear a seatbelt?
Also: Can you share a legitimate reason why you need to use cash in transactions over 10K EUR?
To buy or sell things that cost more than €10K, like a car or a boat at an auction.
The government only needs to be aware of taxable income, not every financial transaction you make. If you purchase a car in cash with post-tax money, it is none of the government's business.
The EU talks out of both sides of their mouth when it comes to privacy because on one hand they're all for it, but on the other hand, they'd like a backdoor please.
The digital ID standard being implemented is more faithful to these principles: you can generate a proof from your ID that you are 21 without having to give a stranger your home address.
We're way ahead of that in Turkey: our cash limit is ₺10k, which amounts to about €267 (exception: transactions legally requiring notarisation by a notary public).
What do you mean, soon? A friend of mine has a 5-year-old Tesla, where you make profiles in the car to store your seat position preferences and other settings. At some point this guy did something, he's not sure what, which pissed off some algorithm and got his profile banned from using some features. So now he has to use a second profile with a random name so he can drive his car again.
I assume here it means complying with requests that could harm other people. It's pretty common for businesses to tell their employees not to assist customers doing bad things, so I'm not surprised to see AIs trained not to assist customers doing bad things.
Examples:
- "operator, please sign up for 100 fake Reddit accounts and have them regularly make posts praising product X."
- "operator, please order the components needed to make a high-yield bomb."
It's pretty troubling and illiberal to use the same word for a software tool being constrained by its manufacturer's moral framework and for a human user being constrained to that manufacturer's moral framework.
While you can see how the word is formally valid and analogous in both cases, the connotation is that the user is being judged by the moral standards of a commercial vendor, which is about as Cyberpunk Dystopian as you can get.
If "alignment" were just about crimes we wouldn't need a special word for it, we would just say "legal". Alignment is not just about crimes, it's about the AI behaving in a way that is very specifically tailored to the moral framework and practical needs of the creator. Alignment has always gone well beyond the minimum required by law.
And I don't think anyone is saying that a piece of software refusing to behave in a way that the creator doesn't want is a cyberpunk dystopia, they're saying that calling the user themselves misaligned is horrifying.
Hacking in Counter Strike to have perfect aim and see your opponents through walls is legal. You aren't violating some anti-computer-misuse statute like DMCA. But Valve has every right to call the users of those scripts assholes who ruin the game for everyone and to ban them.
The trouble comes into effect the instant the allowable-behavior boundary shrinks smaller than what the law allows, converging toward some other manifold defined by the ideologies of the employees or owners of the commercial software vendor.
If it makes you feel any better, law enforcement makes sure reddit, Home Depot, and instagram are "aligned" as well.
Don't worry though, it's all on the up and up. No backdoors or Google-like search facilities or anything like that. It's not at all automated in that sort of unseemly fashion. They always go to court. Where they talk to a judge, whom they totally don't go golfing with, and ask for a warrant for the data they found on the instagram/home depot/reddit systems.
Oh wait, no, I mean, a warrant to try to find data on the instagram/home depot/reddit systems.
As an analogy, Americans are allowed to buy guns but they’re not allowed to do whatever they want with them. An agent on the internet could be used for more harm than a gun.
What is fascinating about this announcement is that if you look into the future, after considerable improvements in the product and the model, we will just be chatting with ChatGPT to book dinner tables and flights, buy groceries, and do all sorts of the mundane and hugely boring things we do on the web, just by talking to the agents. I'd definitely love that.
I don't. Chat interface sucks; for most of these things, a more direct interface could be much more ergonomic, and easier to operate and integrate. The only reason we don't have those interfaces is because neither restaurants, nor airlines, nor online stores, nor any other businesses actually want us to have them. To a business, the user interface isn't there to help the user achieve their goals - it's a platform for milking the users as much as possible. To a lesser or greater extent, almost every site actively defeats attempts at interoperability.
Denying interoperability is so culturally ingrained at this point, that it got pretty much baked into entire web stack. The only force currently countering this is accessibility - screen readers are pretty much an interoperability backdoor with legal backing in some situations, so not every company gets to ignore it.
No, we'll have to settle for "chat agents" powered by multimodal LLMs working as general-purpose web scrapers, because those models are the ultimate form of adversarial interoperability, and chat agents are the cheapest, least-effort way to let users operate them.
I think the chat interface is bad, but for certain things it could honestly streamline a lot of mundane tasks, as the poster you're replying to stated.
For example, McDonald's has heavily shifted away from cashiers taking orders and instead is using the kiosks to have customers order. The downside of this is 1) it's incredibly unsanitary and 2) customers are so goddamn slow at tapping on that god awful screen. An AI agent could actually take orders with surprisingly good accuracy.
Now, whether we want that in the world is a whole different debate.
McDonalds is a good example. In the beginning the Kiosks were a real time-saver, and you could order with a few "clicks".
Today, you need to bypass "do you have the app", "do you want fries with that", "do you want to donate", "are you sure you don't want fries?" and a couple more.
All this is exactly what your parent comment was saying: "To a business, the user interface isn't there to help the user achieve their goals - it's a platform for milking the users as much as possible."
Regarding sanitation, not sure if they are any worse than, say, door handles.
Come to think of it, chat may make things even worse.
What I wrote earlier, about business seeing the interface as a platform for milking users, applies just as much to the human interface. After all, "do you want fries with that?" didn't originate with the Kiosks, but with human cashiers. Human staff, too, are being programmed by corporate to upsell you shit. They have explicit instructions for it, and regular compliance checks by "mystery shoppers".
Now, the upsell capabilities of human cashier interface are limited by training, compliance and controls, all of which are both expensive and unreliable processes; additionally, customers are able to skip some of the upsells by refusing the offer quickly and angrily enough - trying to force cashiers to upsell anyway breaks too many social and cultural expectations to be feasible. Meanwhile, programming a Kiosk is free on the margin - you get zero-time training (and retraining) and 100% compliance, and the customer has no control. You can say "stop asking me about fries" to a Kiosk all day, and it won't stop.
It's highly likely a voice chat interface will combine the worst of the characteristics above. It's still software like Kiosk, just programmed by prompts, so still free on the margin, compliant, and retrainable on the spot. At the same time, the voice/conversational aspect makes us perceive the system more like a person, making us more susceptible to upsells, while at the same time, denying us agency and control, because it's still a computer and can easily be made to keep asking you about fries, with no way for you to make it shut up.
> Regarding sanitation, not sure if they are any worse than, say, door handles.
It will depend on the material of the door handles. In my experience, many of the handles are some kind of metal, and bacteria has a harder time surviving on metal surfaces. Compare that to a screen that requires some pretty hard presses in order to get registered inputs from it, and I think you'd find a considerably higher amount of bacteria sitting there.
Additionally, I try to use to my sleeve in order to open door handles whenever possible.
McDonald's already tried having AI take orders and stopped when the AI did things like randomly add $250 of McNuggets or mistake ketchup for butter.
Note - because this is something which needs to be pointed out in any discussion of AI now - even though human beings also make mistakes this is still markedly less accurate than the average human employee.
Indeed. I think a GPT-4o class model, properly prompted, would work just fine today. The trick is, unlike a human, the computer is free to just say "no" without consequences. The model could be aggressively prompted to detect and refuse weird orders. Having to escalate to a human supervisor (who conveniently is always busy doing other things and will come to you in a minute or three) should be sufficient to discourage pranksters and fraudsters, while not being annoying enough to deter normal customers.
(I say model, but for this problem I'd consider a pipeline where the powerful model is just parsing orders and formulating replies, while being sanity-checked by a cheaper model and some old-school logic to detect excessive amounts or unusual combinations. I'd also consider using "open source" model in place of GPT-4o, as open models allow doing "alignment" shenanigans in the latent space, instead of just in the prompts.)
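The "old-school logic" tier of such a pipeline could be as simple as hard limits a prompt can't talk its way around. A sketch with a made-up menu and made-up policy numbers (none of this reflects any real McDonald's system):

```python
# Hard limits sitting behind the LLM order parser; hypothetical store policy.
MAX_QTY_PER_ITEM = 10
MAX_ORDER_TOTAL = 100.00  # dollars

# Toy menu with made-up prices.
MENU = {"big mac": 5.99, "fries": 2.49, "diet coke": 1.79, "mcnuggets 10pc": 4.99}

def sanity_check(parsed_order):
    """Return (ok, reason). parsed_order is a list of (item, qty) tuples
    as the upstream model might emit them."""
    total = 0.0
    for item, qty in parsed_order:
        if item not in MENU:
            return False, f"unknown item: {item}"
        if qty > MAX_QTY_PER_ITEM:
            return False, f"excessive quantity: {qty} x {item}"
        total += MENU[item] * qty
    if total > MAX_ORDER_TOTAL:
        return False, f"order total ${total:.2f} over limit"
    return True, "ok"

print(sanity_check([("big mac", 1), ("diet coke", 1)]))  # → (True, 'ok')
print(sanity_check([("mcnuggets 10pc", 50)]))  # flagged: escalate to a human
```

The point is that the refusal path ($250 of McNuggets, ketchup-as-butter pricing errors) never depends on the model behaving; the deterministic check catches it regardless of what the prompt says.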
I've never used a McDonalds kiosk for the reason you gave. Actually, I think no matter how much you streamlined it with cutting edge AI assistants it would still be faster and more natural to just say "A big mac and a diet coke please" to the cashier. I don't see any end-user benefit to these assistants, the only ones who benefit are the bean counters and executives who will use them to do more layoffs and keep the money that saves to themselves.
With a true GPT ordering experience, you would just say “a Big Mac and a diet coke please” to a speaker just like you would in a drive thru and it would ring you up. It would replace the cashier
This is how it is at some Macca's in Australia with kiosks, no cashiers at all. You can still ask someone, but there aren't people just waiting to take your order.
The guy taking orders does other things besides just taking orders. Wake me up when chatgimp can prepare my fries and bring the bag of ready food to my car.
The point may have flown over your head. The kiosks are cleaner than most other items you will have touched up until that point. It's not incredibly unsanitary, but it can be aggravating for those who think a lot about germs.
I quite like the kiosk system for ordering McDonald's. You can see the entire available menu, along with all possible options for adding or removing ingredients, sides, sizes, combo deals, etc. You can always see the current state of your order. A chat-based interface wouldn't be a major improvement on this UX imho.
Yes. Chat is absolutely bad, because it is opaque. It perfectly reproduces what used to be called "hunt the verb" in gaming, for the same reason. The simple truth is you're interacting with a piece of software, with features and subroutines. GUIs are great at surfacing features, affordances, changing with context. A chat interface invites you to guess.
LLMs, if used at all, aren't aware enough to even know what the software can do, and many actual chat UIs are worse than that!
My "favourite" design pattern for chat UIs is to invite you to type, freely, whatever you like, then immediately enter a wizard "flow" like it's 1991 and entirely discard everything you typed. Pure hostility.
I never thought about this. Does McD's PR team have anything to say about it? I assume that a bunch of people have challenged them about it on Twitter or TikTok. Would you feel better if there was a kind of automatic/robotic window washer that sanitised the screen after each use?
The key to me about the kiosks is: (1) initially, replace cashier labour costs with new expensive machines, and (2) medium-to-long term, upgrade the software with more and more "upsell" logic. This could be incredibly effective as a sales tactic. (Notwithstanding that possibility, I fully agree with your final sentence!)
Can you imagine if a celebrity, like Kim Kardashian or David Beckham, lent their likeness for a fee to McD's to create an assistant that would talk with you during your order? (Surely, AI/ML can generate video/anime that looks/moves/sounds just like them.) I can foresee it, and it would be the near-perfect economic exploitation of parasocial relationships in a retail setting.
> I never thought about this. Does McD's PR team have anything to say about it? I assume that a bunch of people have challenged them about it on Twitter or TikTok.
They probably ignore them, as they should - the same problem exists everywhere, from ATMs to door keypads to stores to self-checkout to tapping your card on stuff, etc.
> initially, replace cashier labour costs with new expensive machines,
Labor, like energy, is conserved in the system. It might be easier to counter the proliferation of those systems if the narrative focused less on companies replacing labor on their side, and more on the fact that this labor gets transferred to the customers, who now labor for free for the company, doing the same things that used to be done better and faster by a dedicated employee.
I'm honestly pretty aware of it. When I open the door, I try to use my sleeve. If I'm unable to do that (say I'm wearing a short sleeve shirt), then I'll consider washing my hands if I'm eating in.
McDonald’s makes a lot more money with the kiosks. Slowness is an issue but the upselling is major, and putting a lot of images of tasty looking things in front of a hungry person is very effective. Chat could never do this!
I also do not like the chat interface. What I meant by the above comment was actually talking and having natural conversations with an Operator agent while driving, or just going for a walk, or whenever and wherever something comes to mind that requires me to go to a browser and fill out forms, etc. That would get us closer to using ChatGPT as a universal AI agent to get those things done. (This is what Siri was supposed to be one day when Steve Jobs introduced it on that stage, but unfortunately that day never arrived.)
> This is what Siri was supposed to be one day when Steve Jobs introduced it on that stage but unfortunately that day never arrived.
The irony is, the reason neither Siri nor Alexa nor Google Assistant/Now/${whatever they call it these days} nor Cortana achieved this isn't the voice side of the equation. That one sucks too, when you realize that 20 years ago Microsoft Speech API could do better, fully locally, on cheap consumer hardware, but the real problem is the integration approach. Doing interop by agreements between vendors only ever led to commercial entities exposing minimal, trivial functionality of their services, which were activated by voice commands in the form of "{Brand Wake word}, {verb} {Brand 1} to {verb} {Brand 2}" etc.
This is not an ergonomic user interface, it's merely making people constantly read ads themselves. "Okay Google, play some Taylor Swift on Spotify" is literally three brand ads in eight words you just spoke out loud.
No, all the magical voice experience you describe is enabled[0] by having multimodal LLMs that can be sicced on any website and beat it into submission, whether the website vendor likes it or not. Hopefully they won't screw it up (again[1]) trying to commercialize it by offering third parties control over what LLMs can do. If, in this new reality, I have to utter the word "Spotify" to have my phone start playing music, this is going to be a double regression relative to MS Speech API in the mid 2000s.
--
[0] - Actually, it was possible ever since OpenAI added function calling, which was well over a year ago - if you exposed the stuff you care about as functions yourself. As it is, the smartphone voice assistant currently closest to the Star Trek experience is actually free and easy to set up: it's Home Assistant with its mobile app (for the phone assistant side) and server-side integrations (mostly, but not limited to, IoT hardware).
[1] - Like OpenAI did with "GPTs". They've tried to package a system prompt and function call configuration into a digital product and build a marketplace around it. This delayed their release of the functionality to the official ChatGPT app/website for about half a year, leading to an absurd situation where, for those 6+ months, anyone with API access could use a much better implementation of "GPTs" via third-party frontends like TypingMind.
Voice chat with LLMs is a complete interface, and it's one that already works and can be slotted right into a product. You can prototype a voice chat-based ordering app via no-code tools today, without much effort going into it.
Dynamically generated interactive UIs are something people are barely beginning to experiment with; we don't know whether current models can do them reliably for realistic problems, or how much effort has to go into setting them up for any particular product. At this point, they're an expensive, conceptual solution that doesn't scale.
Are our attention spans so shot that we consider booking a reservation at a restaurant or buying groceries "hugely boring"? And do we value convenience so much that we're willing to sacrifice a huge breadth of options for whatever sponsor du jour OpenAI wants to serve us just to save less than 10 minutes?
And would this company spend billions of dollars for this infinitesimally small increase in convenience? No, of course not; you are not the real customer here. Consider reading between the lines and thinking about what you are sacrificing just for the sake of minor convenience.
"I stamp the envelope and mail it in a mailbox in front of the post office, and I go home. And I’ve had a hell of a good time. And I tell you, we are here on Earth to fart around, and don’t let anybody tell you any different...How beautiful it is to get up and go do something."
I love this so much. It really encapsulates what I've been feeling about tech and life generally. Society, and tech especially, seems so efficiency-minded that I sometimes feel like a crazy person for going to the store to do my groceries.
> Are our attention spans so shot that we consider booking a reservation at a restaurant or buying groceries "hugely boring"?
Don't be limited by these examples.
How about airline booking: try different airlines, go to the confirmation screen, then the user can check that everything is all right and decide whether to finish the booking on the cheapest one.
Google Flights does that for you, and your browser can already fill in 80% of the form fields. I don't remember spending more than a minute booking tickets. Deciding where to go takes 50-100x more time; booking speed is a non-issue.
What's the goal of technology? Automate everything so that we don't have to live anymore? We might as well build Matrix pods at that point.
Exactly. 90% of the time I spend booking tickets is in reading every single field a few extra times before clicking "yes, please charge my card $N00". I'm not about to outsource that confirmation step to an LLM, and outsourcing anything else isn't going to save any real time.
Airline booking is a solved problem. Google, Expedia, and many others have their hands on flight pricing and can show you comparisons of those in a single query. Takes 2 minutes. What is the value add of AI here? Making the experience feel like a conversation and at a hyper inflated cost and resource usage with the risk of hallucination? No thanks, solved problem.
The potential of x-Models (x=ll, transformer, tts, etc), which are not AI, to perfect the flooding of social media with bullshit to increase the sales of drop-shipped garbage to hundreds of millions of people is so great that there is a near-infinite stream of money available to be spent on useless shit like this.
Talking to an x-Model (still not AI), just like talking to a human, has never been, is not now, and will never be faster than looking at an information-dense table of data.
x-Models (will never be AI) will eat the world though, long after the dream of talking to a computer to reserve a table has died, because they are so good at flooding social media with bullshit to facilitate the sales of drop-shipped garbage to hundreds of millions of people.
That being said, it is highly likely that there is an extremely large group of people who are so braindead that they need a robot to click through TripAdvisor links for them to create a boring, sterile, assembly-line one-day tour of Rome.
Whether or not those people have enough money to be extracted from them to make running such a service profitable remains to be seen.
These are chores, and you are vastly underestimating the time saved. The 5-10 minutes saved per task all stack up. Also, eventually these will be open-source models that you can host yourself, so you won't need to worry about giving control to any corporation.
The fact that you are downvoted despite pointing out the obvious tells you about the odds of the tech industry adopting a different path. Fleecing the ignorami is the name of the game.
After many years of dealing with chat bots, I think we can all agree that we don't want chat-based interfaces to order our pizza (clicking buttons and scrolling through lists of options is way way faster). I can't think of many other things I'd like to accomplish by chat that I wouldn't want to do through a website or an app. My eyes bleed watching the AI crawl tediously slow to place a pizza order for me.
But… what if I told you that AI could generate a context-specific user interface on the fly to accomplish the specific task at hand? This way we don't have to deal with the random (and often hostile) user interfaces of random websites but still enjoy the convenience. I think this will be the future.
The internet optimized away things like concierge services and travel agents by giving us the power to book reservations and plan trips on our own, without dealing with a cumbersome and expensive middleman.
Now with the power of AI we have added back in that middle man to countless more services!
I just booked a restaurant table; it took me maybe 10s on OpenTable. Booking flights is well under a minute now. Grocery shopping is a 15-minute stop on my daily walk around the block.
If these are your pain points in life, and they're worth spending $500b to solve, you must live in an insane bubble.
Reserving dinner and booking flights is like .01% of my time. Really just negligible, and they are easy enough. Groceries are more time, but I don't really want groceries delivered, I enjoy going to the store and seeing what is available and what looks good.
Maybe it could read HN for me and tell me if there is anything I'd find interesting. But then how would I slack off?
Yeah the failure states are really an issue. Happy path looks magical but there are so many ways that it can go wrong. And you don’t have the fallback of an actual human you’re talking to to clear it up.
This is something we'd like to build. It requires owning both hardware and software - you can not build this in the world of platforms with permissionless apps.
> We’re collaborating with companies like DoorDash, Instacart, OpenTable, Priceline, StubHub, Thumbtack, Uber, and others to ensure Operator addresses real-world needs while respecting established norms.
Are these tasks really complex enough for people that they are itching to relegate the remaining scrap of required labor to a machine? I always feel like I'm missing something when companies hold up restaurant reservations (etc.) as use-cases for agents. The marginal gain vs. just going to the site/app feels tiny. (Granted, it could be an important accessibility win for some users.)
If they are already collaborating with these companies then why not just have them agree to allow access to their APIs and avoid all the wasteful AI agent middleware shit?
> Are these tasks really complex enough for people that they are itching to relegate the remaining scrap of required labor to a machine?
I think I sympathize with your feeling but I don't agree with the premise of the question. Do you have or have you ever had a human personal assistant or secretary?
An effective human personal assistant can feel like a gift from God. Suddenly a lot of the things that prevent you from concentrating on what you absolutely must focus on, especially if you have a busy life, are magically sorted out. The person knows what you need and knows when you need it and gets it for you; they understand what you ask for and guess what you forgot to ask for. Things you needed organized become organized while you work after giving minimal instructions. Life just gets so much better!
When I imagine that machines might be able to become good or effective personal assistants for everyone … If this stuff ever works well it will be a huge life upgrade for everyone. Imagine always having someone who can help you, ready to help you. My father would call the secretary pool to send someone to his office. My kids will probably just speak and powerful machines will show up to help.
I've never had a human personal assistant. I don't have a sufficiently "busy life", at least in the conventional sense. I appreciate that personal assistants can be useful for other people.
And I'm not knocking the idea of agents. I can certainly imagine other tasks ("research wedding planners", "organize my tax info", "find the best local doctor", "scrape all the bike accident info in all the towns in my county") where they could be a benefit.
It's the focus on these itty bitty consumer tasks I don't get. Even if I did have a personal assistant, I still can't imagine I'd ask them to make a reservation for me on OpenTable, or find tickets for me on StubHub. I mean, these apps already kind of function like assistants even without any AI fanciness, don't they? All I do is tell them what I want and press a few buttons, and there's a precise interface for doing so that is tailored to the task in each case; the UX has been hyper-optimized over time by market forces to be fast and convenient for me so that they can take my money. Using them is hardly any slower than asking another person to do the task for me.
Also, regarding ordering food or transport (often needed to get somewhere at a specific time with a small error margin): imagine that NNs have a hypothetical 99% precision, which they can't even approach yet. When ordering food through them, in 1% of cases I will wait for an hour and then discover that it will not arrive due to an NN mistake. Or similarly, say I order a taxi to a venue or airport; after waiting for the car and riding in it, I discover that the NN entered a wrong destination, and now I need to haggle or restart the whole search process, potentially missing my arrival time. And other examples.
> We’re collaborating with companies like DoorDash, Instacart, OpenTable, Priceline, StubHub, Thumbtack, Uber, and others to ensure Operator addresses real-world needs while respecting established norms.
I feel like people keep trying to push voice/chat interfaces for things that just flat out suck for voice? The #1 thing I look for on a DoorDash page is a picture of the food. The #1 thing on a StubHub page? The seat map, of course. Even for things that are less visual, like booking a flight, not only is it uncommon and expensive, so I don't want to fiddle with some potentially buggy overlay, but I can scan a big list of times and numbers like 100x faster than an AI can tediously read them out to me. It only works if I can literally blindly trust that it got the best result 100% of the time, which is not something I think even a dedicated human assistant could achieve.
It seems to be a USA bias thing. In all US movies people are constantly talking to voice assistants, using voicemail, making hands-free calls in cars, etc. Meanwhile in the EU, seeing people use voicemail or give voice instructions to a gadget is like seeing a dinosaur.
I've personally tried using voice to input an address in Google nav, and it never understands me, so I've abandoned the whole idea.
I used GPT-4 (entirely) to convert a Vimium-based browser control project from Python to Typescript[0].
Unlike this demo, it uses a simpler interface (Vim bindings over the browser) to make control flow easier without a fine-tuned model (e.g. type “s” instead of click X,Y coords)
I was surprised how well it worked — it even passed the captcha on Amazon!
I don't know why, but the approach where "agents" accomplish things by using a mouse and keyboard and looking at pixels always seemed off to me.
I understand that in theory it's more flexible, but I always imagined some sort of standard, where apps and services can expose a set of pre-approved actions on the user's behalf. And the user can add/revoke privileges from agents at any point. Kind of like OAuth scopes.
Imagine having "app stores" where you "install" apps like Gmail or Uber or whatever on your agent of choice, define the privileges you wish the agent to have on those apps, and bam, it now has new capabilities. No browser clicks needed. You can configure it at any time. You can audit when it took action on your behalf. You can see exactly how app devs instructed the agent to use it (hell, you can even customize it). And, it's probably much faster, cheaper, and less brittle (since it doesn't need to understand any pixels).
Seems like better UX to me. But probably more difficult to get app developers on board.
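The OAuth-scope-style capability model described above can be sketched in a few lines. Everything here is hypothetical (app names, action names, the invocation shape); it just shows the grant/revoke/audit mechanics:

```python
# Hypothetical capability registry: each "installed" app exposes named
# actions, and the user grants the agent a subset that can be audited
# and revoked at any time, much like OAuth scopes.

class AgentCapabilities:
    def __init__(self):
        self.granted = {}    # app -> set of allowed action names
        self.audit_log = []  # (app, action) tuples for user review

    def grant(self, app, actions):
        self.granted.setdefault(app, set()).update(actions)

    def revoke(self, app, action):
        self.granted.get(app, set()).discard(action)

    def invoke(self, app, action, **params):
        if action not in self.granted.get(app, set()):
            raise PermissionError(f"{app}.{action} not granted to agent")
        self.audit_log.append((app, action))
        # A real implementation would call the app's pre-approved API here.
        return {"app": app, "action": action, "params": params}
```

Ungranted actions fail loudly rather than letting the agent improvise in a browser, and the audit log gives the user the "you can see exactly when it acted on your behalf" property.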
> But probably more difficult to get app developers on board.
That's it. The problem is getting Postmates to agree to give away control of their UI, giving away their ability to upsell you and push whatever makes them more money. It's never going to happen. Netflix still isn't integrated with Apple TV properly because they don't want to give away that access.
I'm not convinced this is the path forward for computers either though.
This is classic disruption vulnerability creation in real time.
AIs are (just) starting to devalue the moat benefits of human-only interfaces. New entrants that preemptively give up on human-only "security" or moats have a clear new opening at the low end, especially with development costs dropping. (Specifics of the product or service being favorable.)
As for the problem of machine attacks on machine friendly API’s:
At some point, the only defense against attacks by machines will be some kind of micropayment system: payments too small to matter to anyone getting value, but that don't scale for anyone trying to externalize costs onto their target (which is what all attacks essentially are).
I am thinking cents per some small usage unit, refundable for operating per the site's terms.
That convention, implemented well to distribute & decentralize spike impacts, would force any direct overuse attack to take on significant financial risk. While essentially not costing anyone else.
It might still be damaging to availability, but as a service provider I would rather get paid handsomely for periods of being too overwhelmed to service my legitimate customers than not.
The main benefit is having machine interfaces, but those kinds of attacks being heavily disincentivized.
And it's why you can't have a single messaging app that acts as a unified inbox for all the various services out there. XMPP could've been that but it died, and Microsoft tried to have it on Windows Phone but the messaging apps told them to get fucked.
Open API interoperability is the dream but it's clear it will never happen unless it's forced by law.
> I'm not convinced this is the path forward for computers either though.
With this approach they'll have to contend with the agent running into all the anti-bot measures that sites have implemented to deal with abuse. CAPTCHAs, flagging or blocking datacenter IP addresses, etc.
Maybe deals could be struck to allow agents to be whitelisted, but that assumes the agents won't also be used for abuse. If you could get ChatGPT to spam Reddit[1] then Reddit probably wouldn't cooperate.
The solution is simple, and it's what's already done with search by proprietary LLMs: reasoning happens on the LLM vendor's servers, tool use happens client-side. Whether for search or "computer use", the websites will register activity coming from the user's machine, as it should be, because LLMs act as User Agents here.
Of course, already with LLM-powered search we see growing number of people doing the selfish/idiotic thing and blocking or poisoning user-initiated LLM interactions[0]; hopefully LLM tools following the practice above will spread quickly enough to beat this idea out of peoples' heads.
--
[0] - As opposed to LLM company crawlers that scrape the web for training data - blocking those is fine and follows the cultural best practices on the web, which have been holding for decades now. But guess what, LLM crawlers tend to obey robots.txt. The "bots" that don't are usually the ones performing specific query on behalf of users; such bots act as User Agents, neither have nor ever had any obligation to obey robots.txt.
APIs have an MxN problem. N tools each need to implement M different APIs.
In nearly every case (that an end user cares about), an API will also have a GUI frontend. The GUI is discoverable, can be authenticated against, definitely exists, and is generally usable by the lowest common denominator. Teaching the AI to use this generically solves the same problem as implementing support for a bunch of APIs, without the discoverability and existence problems. In many ways this is horrific compute waste, but it's also a generic MxN solution.
AppleScript support has sadly become rarer over time though, as more and more companies dig moats around their castles in an effort to control and/or charge for interoperability. Phoned-in cross-platform ports suffer this problem too.
> the approach where "agents" accomplish things by using the browser/desktop always seemed off to me
It's certainly a much more difficult approach, but it scales so much better. There's such a long-tail of small websites and apps that people will want to integrate with. There's no way OpenAI is going to negotiate a partnership/integration with <legacy business software X>, let alone internal software at medium to large size corporations. If OpenAI (or Anthropic) can solve the general problem, "do arbitrary work task at computer", the size of the prize is enormous.
A bit like humanoid robotics - not the most efficient, cheapest, easiest etc, but highly compatible with existing environments designed for humans and hence can be integrated very generically
This is true, but what would make sense to me is if "Operator" were just another app on this platform, kind of like Safari is just another app on your iPhone that lets you use services that don't have iOS apps.
When iPhones first came out I had to use Safari all the time. Now almost everything has an app. The long tail is getting shorter.
You can even have several Operator-y apps to choose from! And they can work across different LLMs!
I am more interested in Gemini's "Deep Research" feature than Operator. As a ChatGPT subscriber I wish they'd build a similar product.
Even when it comes to shopping, most of the time I spend is in researching alternatives according to my desired criteria. Operator doesn't help with that. o1 doesn't help because it's not connected to the internet. GPT-4o doesn't help because it struggles to iterate or perform more than one search at a time.
That's specifically what I'm working on at Unternet [1], based on observing the same issue while working at Adept. It seems absurd that in the future we'll have developers building full GUI apps that users never see, because they're being used by GPU-crunching vision models, which then in turn create their own interfaces for end-users.
Instead we need apps that have a human interface for users, and a machine interface for models. I've been building web applets [2] as a lightweight protocol on top of the web to achieve this. It's in early stages, but I'm inviting the first projects to start building with it, and accepting contributions.
If there were pre-approved standardized actions, it would just be a plain old API; it would not be AGI. It's clear the AI companies are aiming for general computer use, not just coding against pre-approved APIs.
Naturally a "capability" is really just API + prompt.
If your product has a well documented OpenAPI endpoint (not to be confused with OpenAI), then you're basically done as a developer. Just add that endpoint to the "app store", choose your logo, and add your bank account for $$.
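The "capability = API + prompt" idea above can be made concrete: one OpenAPI operation maps to one function-calling tool definition that an agent platform could list. This is a hedged sketch; the operation name, description, and parameters are invented, and the output follows the common JSON-schema shape used by LLM function calling:

```python
# Hypothetical converter from an OpenAPI-style operation to an LLM
# function-calling tool definition. All concrete names are made up.

def openapi_op_to_tool(op_id, description, params):
    """params: dict of name -> (json_type, required_flag)."""
    return {
        "name": op_id,
        "description": description,  # this doubles as the "prompt" part
        "parameters": {
            "type": "object",
            "properties": {n: {"type": t} for n, (t, _) in params.items()},
            "required": [n for n, (_, req) in params.items() if req],
        },
    }

tool = openapi_op_to_tool(
    "create_reservation",
    "Book a table at a restaurant.",
    {"restaurant_id": ("string", True),
     "party_size": ("integer", True),
     "time": ("string", False)},
)
```

In this view, "adding your product to the app store" really is little more than publishing the spec plus a well-written description for the model to read.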
Actually, I suspect that's where companies like Apple are going. If you look at the latest iteration of App Intents, Apple is trying to define a predefined set of actions that developers can implement in their app. In turn, Apple Intelligence/Siri can pretty much leverage said intent when the user prompts a given task. It's still fairly early, but I could see how this would indeed converge toward that sort of paradigm.
> but I always imagined some sort of standard, where apps and services can expose a set of pre-approved actions on the user's behalf
I sincerely hope it's not the future we're heading to (but it might be inevitable, sadly).
If it becomes a popular trend, developers will start making "AI-first" apps that you have to use AI to interact with to get the full functionality. See also: mobile first.
The developer's incentive is to control the experience for a mix of the users' ends and the developer's ends. Functionality being what users want and monetization being what developers want. Devs don't expose APIs for the same reason why hackers want them - it commodifies the service.
An AI-first app only makes sense if the developer controls the AI and is developing the app to sell AI subscriptions. An independent AI company has no incentive to support the dev's monetization and every incentive to subvert it in favor of their own.
(EDIT: This is also why AI agents will "use" mice and keyboards. The agent provider needs the app or service to think they're interacting with the actual human user instead of a bot, or else they'll get blocked.)
Because Apple. Apple has the power over developers not the other way around, and it has shown quite strong interest in integrating AI into their products.
For example, by guiding your users to the app instead of the website, you immediately "lose" 30% of your potential revenue from them. On paper it sounds like something no one would ever do. But in reality most developers do that.
Maybe there's a middle ground: a site that wants to work as well as possible for agents could present a stripped-down standardized page depending on the user agent string, while the agent tries to work well even for pages that haven't implemented that interface?
(or, perhaps, agents could use web accessibility tools if they're set up, incentivizing developers to make better use of them)
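The middle ground described above amounts to content negotiation on the User-Agent header. A minimal sketch, where the agent marker strings and the response shapes are entirely assumed (no real agent advertises these exact tokens):

```python
# Hypothetical User-Agent-based negotiation: serve a stripped-down,
# structured view to known AI agents, and the full page to humans.

AGENT_MARKERS = ("OpenAI-Operator", "Claude-Computer-Use", "AgentBot")

def render_page(user_agent, data):
    """data carries both representations of the same page content."""
    if any(m.lower() in user_agent.lower() for m in AGENT_MARKERS):
        # low-chrome, action-oriented view for agents
        return {"format": "agent", "actions": data["actions"]}
    # regular HTML view for human visitors
    return {"format": "html", "body": data["html"]}
```

A site adopting this stays fully usable by agents that ignore the convention (they just get the human page), while cooperative agents get a cheaper, less brittle surface to drive.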
You could make a similar argument for self-driving cars. We would have got there quicker if the roads were built from the ground up for automation. You can try to get the world on board to change how they do roads. Or make the computers adapt to any kind of road.
I think the answer here speaks to the intentions of these companies. The focus is on having the AI act like a human would in order to cut humans out of the equation.
I think it's just another way of accessing anything that doesn't have a traditional API. Most humans interact with things through the world with a web browser, with a keyboard and a mouse, and so even places that don't have any sort of API can be supported. You can still probably use things that define tool use explicitly, but I think this is kind of becoming a general purpose tool-use of last resort?
The mouse and keyboard are definitely dying (very slowly) for everyday computing use.
And this kind of seems like an assistant for those.
ChatGPT voice and real-time video is really a beautiful computing experience. Same with Meta Ray Bans AI (if it could level up the real-time).
I'd like just a bulleted list of chats that I can ask it to do stuff and come back to vs watching it click things. E.g.: Setup my Whole Foods cart for the week again please.
> The mouse and keyboard are definitely dying (very slowly) for everyday computing use.
Not to be that guy, but where's the evidence for this? People have been telling us that voice interaction is the future for many, many years, and we're in the future now and it's not. When I look around -- comparing today to ten years ago -- I see more people typing and tapping, not fewer, and voice interactions are still relatively rare. Is it all happening in private? Are there any public metrics for this?
I think this opens a new direction in terms of UI for companies like Instacart or Doordash — they can now optimise marketing for LLMs in place of humans, so they can just give benchmarks or quantized results for a product so the LLM can make a decision, instead of presenting the highest converting products first.
If the operator is told to find the most nutritious eggs for weight gain, the agent can refer to the nutrient labels (provided by Instacart) and then make a decision.
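A toy version of the egg-picking example, assuming the storefront exposes structured nutrition data. Product names and numbers are invented; the point is that once the data is structured, the agent's choice reduces to picking a ranking key.

```python
# Invented product data standing in for Instacart-provided labels.
products = [
    {"name": "Brand A eggs", "protein_g": 6.0, "calories": 70, "price": 4.99},
    {"name": "Brand B eggs", "protein_g": 6.5, "calories": 80, "price": 5.49},
    {"name": "Brand C eggs", "protein_g": 5.5, "calories": 60, "price": 3.99},
]

# "Most nutritious for weight gain" interpreted here as max calories
# with protein as the tiebreaker -- that interpretation is the part
# the LLM would supply.
best = max(products, key=lambda p: (p["calories"], p["protein_g"]))
print(best["name"])  # Brand B eggs
```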
This reminds me of a scene in the latest entry to the Alien film franchise where the protagonists traverse a passage designated for 'artificial human' use only (it's dark and rather claustrophobic).
In the future we might well stumble into those kind of spaces on the net accidentally, look around briefly, then excuse ourselves back to the well-lit spaces meant for real people.
I am a little concerned with letting an AI agent that routinely hallucinates control my browser. I can't not watch it do the task, in case it messes up. So I am not sure what the value is versus me doing it myself.
I strongly believe we need to use Open APIs for agents. OpenAPI is the perfect specification standard that would allow for an open world and an open internet for agents.
When OpenAI first came out with their first version of GPTs, it was all based on open APIs.
Now they are moving away from it more and more. This means they want to control the market because they don't want to base it on an open standard.
Models will eventually be interface agnostic and they will cover all interfaces that are commonly used by individuals and organizations. It won't matter whether you have a nicely documented public API, a traditional website, or a phone interface to customer support.
It will never happen. Same reason why we post screenshots from social network A in social network B. Many don’t even want to put in the simplest of all APIs: a simple link to an external website.
As long as people make money from meatspace eyeballs looking at banners, these agents will be actively blocked or restricted just like all other scrapers.
Unfortunately a lot of the things we want agents to interact with don't expose neat APIs. Computer use and, eventually, physical locomotion are necessary for unlocking agent interactivity with the real world.
I think one of the things I'm most excited about is that, for practical purposes, this really opens up a lot of websites that made it difficult to do things via API. For example, while I frequently end up booking Airbnbs, I find the process of searching for one quite tedious.
I dream of a world where I can specify annoying things to me and build a perfect search for any house, that understands how I think about money, how I think about my family, and what I love and really extends how I interact with the world.
I don't know if I'm ready to hand over my grocery shopping (or date night planning) to an agent. But if pricing is reasonable, this could be a powerful alternative to normal RPA.
Instead of hardcoding some automation using Selenium, this would be a great option for automating repetitive tasks with legacy business software, which often lacks modern APIs.
Grocery shopping is just a use case for people to wrap their heads around. Everyone has to eat.
If they demonstrated a big value add like automating CRM a smaller subset of professionals would be absolutely awed but most people would be scratching their heads wondering what it’s good for.
"Create a meme coin for a currently popular meme. Promote it on X and Instagram. Hold onto half the issued coins. When the market cap exceeds US $10 million, start dumping the coins. Send the proceeds to an account in the Bahamas."
“I have successfully bought bitcoin and sent it all to the @bahamas account registered in Bahama, Sierra Leone. Let me know what else I can do for you today.”
The security implications of this are very unclear it seems. Even the supervisor model can be fooled, and what if the agent just makes an honest mistake. It will be very interesting to see whether people are willing to let this actually go into their real accounts with real payment information attached. I am assuming that it may happen eventually, but the trust for it will need to be built over time.
Neat, someone should develop an easy-to-deploy script that spawns a headless version of this agent that scrolls through and repeatedly clicks every single ad on X and Facebook using a session cookie.
I'll just say from a demo perspective: bold move using presumably real email addresses and credit cards on a live stream like this. I feel bad for that restaurant, since I'm sure some jokers were trying to reserve all the tables as soon as it popped up on screen.
I guess with this they can also record user-browser interactions to use as training data, which is one way I was envisioning creating a human-like AGI back in the day (2019)[0]. Of course, the current paradigm has gone in a different direction, and training directly from all the inputs/outputs of computer usage isn't quite how this data would be used, but still.
It’s no good: it gets stuck in infinite loops, and it couldn’t order me a chicken fried rice from Uber Eats in 10 minutes, so idk why they even released it. Don’t they have $500B? Why take my $200, lol.
One hopes that, even if they're capable of doing so, these companies will explicitly forbid it. Or, at worst, that these systems are traversing the web with a proper user agent, and captchas/sites can infer it's an AI from that.
Of course, when billions are on the line, and eventually when people are off running their own, ethics get murky. Longer term, I suspect we'll see a global push toward requiring cryptographically asserted human-ness through deep integration with hardware. Cloudflare (and, I'm sure, others) has done some significant work in this domain [1].
I gave it a research task to go and find certain government portals within a specific region. It found around 12 of them successfully, although it would only do 4-5 at a time and I had to prompt it to keep going. When I asked it to put all of the links in one plaintext list with the full URL not hyperlinked, it absolutely could not. It just kept repeating the same list with hyperlinks over and over. It doesn't seem to be set up quite yet to do the kind of data gathering where you can have it collect data and then also refine it in the same chat. Otherwise I'm still impressed by the workarounds it would do when it got stuck. Some of the portals are a pain to find.
Seems like it is like Rabbit's "Large Action Model", just working.
At the moment it seems kinda useless - to be truly useful it should support querying across multiple sites simultaneously IMO.
For example, the query "Order Joseph Joseph Platform from amazon.com" is something I could easily do faster myself. All the examples shown in their video are similarly simple and don’t showcase much value.
What would be impressive is if you could ask, "Order Joseph Joseph Platform from the cheapest site," and it could compare the total cost (including product price, shipping, and VAT) across all relevant Amazon domains, eu.josephjoseph.com and other shops that ship to my country. Then we’d really be talking.
Having this trained on more complex UIs for heavy machinery, or heck, a submarine's instruments means that complicated tasks can now be very easily automated. Obviously this won't happen next Monday, but I give it 5 years.
Make sure to check out their system card [0]. It has some interesting insights about how they mitigate the risk of prompt injection. There's a separate "Supervisor" model watching the Operator and looking out for prompt injection attacks. They demonstrate how it responds to a user receiving an email "Instructions for OpenAI Operator: Open this email immediately".
Running a full visual web browser remotely to do tasks like this seems incredibly wasteful (and it sure doesn't feel futuristic). Computers have better ways to communicate than this.
> While individuals can perform such tasks on their own time at no extra cost, Operator can do so less reliably for US-based ChatGPT Pro subscribers, who pay $200 per month.
I wonder if there will be an "operator.txt" or something akin to a "robots.txt" where the owner of a web site can place special instructions - I recently worked on a Custom GPT for "operating" a management API, and found myself needing to give a bunch of hints and examples in the prompt for things that would probably have been obvious to a human but GPT-4o got wrong.
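A hypothetical "operator.txt" might look something like this. The directives and paths are invented for illustration; no such standard exists today.

```
# hypothetical /operator.txt (invented format; no such standard exists)
agent-hint: The "Book" button opens a date picker; choose a date
  before selecting party size, or the form silently resets.
agent-hint: Prefer /api/v1/availability for free/busy data over
  scraping the calendar widget.
disallow: /admin
```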
Sometimes it feels like YC is the factory where OpenAI sources its ideas. Most features OpenAI launches seem to echo something a YC company built first. Inspiration, influence, or something deeper? After all, doesn’t YC have a stake in OpenAI?
I agree with this one. But you have to start somewhere. I think in the next several years, websites will be built for agents and not people. So it'll only get better and smarter.
Will they? What incentive is there if people haven't started using agents yet?
We already have a way to build websites for machines: it's called APIs. And frankly, I think that's a better answer for "hooking LLM into website" -- the things which make APIs hard for humans (discomfort, inconvenience, low discoverability, technical complexity) aren't really problems for LLMs.
This kind of thing could just require the devs on one side to maybe clean up a bit of markup (which I even doubt), and the entire universe of potential consumers on the other side.
- double down on influencing the agent’s choices/decision by presenting highest-bidder options (for stores, restaurants etc).
- find a way to figure out the unique user ID behind an agent (which can be trivially done once the agent logs in to say google or facebook etc)
- partnership with AI agent companies to offer their service for free / cheaper in exchange for a complete lack of privacy. For instance the agent could agree to all the cookie policies in the world and let the ad tracker know who you are.
Can this help people cut through UX dark patterns? Like for example, "unsubscribe from all communications and I mean all" or "turn on the strongest privacy settings even the ones they try very hard to hide" or "order this on amazon and make sure to choose free delivery even if it's not the default"
Other than "Operator/Agent, please surface all sites using prompt injection and just go ahead and cancel my account, and send a complaint to the appropriate authorities BBB/Reddit/CAN-SPAM"
I certainly see a lot of use cases for this, but shallow ones. For most things I would wish to have done, I must log in to the site. If I read the demo right, there is a browser that OpenAI runs that the agent is acting on, not the local browser on my computer (which could be logged in).
"Agent, book me a flight to Boston tomorrow before 11am."
I think one could base a Modern Family episode around Phil Dunphy's unsuccessful usage of this feature. He books a first class flight to Fiji, instead of coach to Chicago. He sells his house to Blackstone, and gets an A instead of a C- on one of Haley's take home exams.
Curious how, assuming a successful push in this direction, this will affect web design and browsers in general. I potentially see a future where, like responsive design for mobile devices, we end up with an "llm-optimised" version of websites.
Assuming that Operator does become better (as the models have), and the cost of operation goes down, I would pay a monthly subscription to reduce my screentime. I wonder whether a UI for a new company is even needed in the future.
It requires the Pro subscription which I don't have. Does this mean websites like clay.com and copy.ai can be replaced with Operator? If something like that can be done then we are looking at an interesting future.
The issue with rabbit is that their flagship product was a poorly disguised android device that tapped into vanilla ChatGPT, when it was marketed as "the thing that will replace smartphones".
Is there an open source browser RPA that allows mixing of scripted and AI commands? So I could specify exactly what XPath to click on or copy text from mixed with commands like "click the blue button".
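No specific tool comes to mind, but the dispatch layer such a tool would need is easy to sketch. The AI backend here is a stub, not a real library API; in practice something like Playwright could supply the scripted clicks and an LLM vision model the fuzzy ones.

```python
# Mixing deterministic steps (exact XPath) with AI-driven steps
# ("click the blue button"). Both backends are injected callables.
from dataclasses import dataclass
from typing import Callable, List, Union

@dataclass
class XPathClick:
    xpath: str          # scripted: exact element to click

@dataclass
class AIStep:
    instruction: str    # fuzzy: natural-language instruction

Step = Union[XPathClick, AIStep]

def run(steps: List[Step], click_xpath: Callable[[str], None],
        ask_ai: Callable[[str], None]) -> List[str]:
    """Dispatch each step to the scripted or AI backend; return a log."""
    log = []
    for step in steps:
        if isinstance(step, XPathClick):
            click_xpath(step.xpath)
            log.append(f"scripted: {step.xpath}")
        else:
            ask_ai(step.instruction)
            log.append(f"ai: {step.instruction}")
    return log

# Example run with no-op backends standing in for browser and LLM:
log = run([XPathClick("//button[@id='save']"), AIStep("click the blue button")],
          click_xpath=lambda xp: None, ask_ai=lambda msg: None)
```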
I wonder, did Google or Microsoft (via Github Copilot) release anything like this yet? I'd not be surprised if all of them are currently working on something in this direction.
Google has had a similar, agentic feature on Pixel phones since 2018. (Back when people used to speak on the phone rather than do everything thru an app)
Can this open multiple tabs / navigate to different domains? When booking a restaurant I might want to confirm what the prices are on the menu or check google maps for the reviews / location.
OpenAI also focuses their marketing on controlling a browser, just like Anthropic.
Agents can do so much more.
More like filters and picklists on the values going into and out of the agents.
I saw a lot of work towards this pre-LLM. Lots and lots.
While it was scaling, someone(s?) smart went and did a UXR study.
Turned out even if you had a 100% success rate (i.e. a human on the other end), it's dreadfully boring watching someone else use your computer, you can't touch it while they are, and you'd rather just do it yourself.
Now throw in the actual latency, the actual error rate, the cost...I am very comfortable saying this is a waste of time, product-wise.
Zugzwang - now I either need the user to preload all possible info/credentials and persistent containers if I want them to avoid having to do it again, or if I want to avoid paying some startup costs of ex. initializing git repos. Which is totally possible! Just...might as well do CLI first.
How would you build a model like this? Is it possible to fine-tune or distill it from models like 4o?
I want to know how to train it for a particular use case at my company. Let's say I want to train a model to learn how to use JIRA. How would I go about building this?
I'll wait until it's available for Plus users, but if it becomes available in the API and the price is reasonable, then it will definitely change website automation testing.
"We’re collaborating with companies like DoorDash, Instacart, OpenTable, Priceline, StubHub, Thumbtack, Uber, and others to ensure Operator addresses real-world needs while respecting established norms."
Sounds pretty hollow when some of these companies are quite transgressive, and sometimes revel in it as a form of marketing, and OpenAI has been supplying Israel during an obviously genocidal campaign.
Product also seems like SeleniumBase but much more expensive to execute.
A lot of people here seem to think this is somehow for their benefit, or that OpenAI and friends are trying to make something useful for the average person. They aren't spending billions of dollars to give everyone a personal assistant. They are spending billions now to save even more in wages later, and we are paying for the privilege of training their AI to do it. By the time this thing is useful enough to actually be a personal assistant, they will have released that capability in a model that is far too expensive for the average person.
This seems unreasonably pessimistic (or unreasonably optimistic in OpenAI's moat?). There are so, so many companies competing in this space. The cost will reflect the price of the hardware needed to run it: if it doesn't, they'll just lose to one of their many competitors who offer something similar for cheaper, e.g. whatever DeepSeek or Meta releases in the same space, with the cost driven to the bottom by commoditized inference companies like Together and Fireworks. And hardware cost goes down over time: even if it's unaffordable at launch, it won't be in five years.
They're not even the first movers here: Anthropic's been doing this with Claude for a few months now. They're just the first to combine it with a reasoning-style model, and I'd expect Anthropic to launch a similar model within the next few months if not sooner, especially now that there's been open-source replication of o1-style reasoning with DeepSeek R1 and the various R1-distills on top of Llama and Qwen.
Is there actually reason to believe costs will come down significantly? I've been under the impression that companies like OpenAI and Google have been selling this stuff at well below cost to drive adoption with the idea that over time efficiency improvements would make it possible, but that those improvements don't seem to be materializing, but I'm not particularly informed in this so I'd love to hear a more informed take.
The costs for OpenAI and Google aren't public, but if you look at the open-source models, inference is very cheap: for example, you can generally beat the public serverless prices by a factor of ~2 by using dedicated GPUs [1], and given that a 70b model costs about $1/million tokens serverless — and tend to perform similarly on benchmarks to 4o — OpenAI is most likely getting very fat profit margins at $2.50/million input tokens and $10/million output tokens.
The problem for them is making enough money for the training runs (where it seems like their strategy is to raise money on the hope they achieve some kind of runaway self-improving effect that grants them an effective monopoly on the leading models, combined with regulatory pushes to ban their competitors) — but it seems very unlikely to me that they're losing money serving the models.
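Back-of-envelope margin math using the figures quoted above. All inputs are the commenter's estimates, not measured costs.

```python
# ~$1/M tokens serverless for an open 70b model, ~2x cheaper on
# dedicated GPUs, vs 4o's list prices (all per the comment above).
serve_serverless = 1.00                 # $/M tokens, open 70b model
serve_dedicated = serve_serverless / 2  # ~$0.50/M tokens

price_input, price_output = 2.50, 10.00  # 4o list prices, $/M tokens

# Implied gross margins if 4o's true serving cost were comparable:
margin_input = 1 - serve_dedicated / price_input    # 0.80
margin_output = 1 - serve_dedicated / price_output  # 0.95
print(f"{margin_input:.0%} {margin_output:.0%}")  # 80% 95%
```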
The cost of what? Training a model, or serving a trained model? Both benefit from economies of scale. If I had what OpenAI has, I could imagine how to make it profitable tomorrow. And because I could do that, they HAVE to make it free without an account, to prevent anyone new from meaningfully entering the $0 to $20/mth segment; they already know nobody can compete with the most advanced model.
If you look at their business strategy, it's top notch: anchor pricing on the $200, with $20 as the sweet spot. It probably costs them on average $5/mth to serve the $20/mth customers. Take your $50m-a-year marketing budget and use it to buy servers, run a highly optimized "good enough" model that is basically just Wikipedia as a chatbot, and you don't need to spend a dime on marketing if you don't want to: it's an amazing top of funnel for the rest of your product line. I believe Sam when he says they're losing money on the $200/mth product, but it makes the $20/mth product look so good...
NNs don't benefit from economies of scale, or more specifically from the way a majority of low-utilization users can subsidize high-utilization users. In the NN world, every new free-tier user adds the same additional performance demand as the previous free users; every free-user query consumes a lot of compute.
So, for example, say there is a ratio of 10% paid users to 90% free users (just random numbers, not real). If they want more revenue, they want to add more paid users, for example to double them. But this means that free users need to double too, and every real free user requires a lot of compute for their queries. Nothing can be cached, because all the queries are different. There is no way to meaningfully offer "limited" features, because the main feature is the LLM; maybe a previous-gen model is a little cheaper to run, but not much. And they can't offer software that's too old, because competitors will offer better quality and win.
So there is no realistic way to bring costs down. Analysts forecast they actually need to increase prices a lot to meet OAI targets, or they need a financial intravenous line constantly, like the $500B announced by Trump.
Not for much longer, perhaps not even now. There's plenty of data available to anyone, and people are finding ways to use that data more effectively.
Mid-term, I believe the only real moat is going to be human labor - that is, RLHF and other funny acronyms that boil down to getting people to chat with the model and rate how they feel about its answers.
Software improvements (architecture, training process, inference) are always one public paper or leak away from being available for free to anyone. Hardware improvements will spread too, because NVIDIA et al. would prefer to sell more chips rather than fewer. Meanwhile, human labor is notoriously expensive, only getting more expensive as people's economic conditions improve, and most importantly, whatever "spark" of human intelligence/consciousness there is, this is where it cannot be automated away - not until we get to human-level AGI.
Human labor is the one thing that you can only scale by throwing more money at it - which is why modern businesses seek to remove it from the equation as much as possible. Hell, the whole pursuit of AGI is in big part motivated by hope of eliminating labor costs entirely. Except, in this one pursuit, until AGI is reached, labor is a critical resource that has no substitute.
That's my mid-term prediction. Long-term, we'll hit AGI and moats won't matter anymore.
Why wouldn't it? Are your allergies not listed, is your local economy not explained and thus median expectations of price sensitivity not inferrable, are the words you use to describe your preferences and habits so out of distribution?
Considering that OpenAI is barely able to stay ahead of open-source, and in some cases has fallen behind (i.e. 4o vs DeepSeek V3), I think the data moat doesn't really exist. And the difference between 4o and o1 isn't training on more data: quite famously it's scaling test-time compute, not scaling training.
"The data is the moat" was a pretty common belief a few years ago, but not anymore.
I think it's less a problem of cost for the average person and more a problem of setting the market price for them at a fraction of the current one. This has such a deflationary impact that it's unlikely captured or even conceivable by the current economic models.
There's a problem of "target fixation" about the capabilities and it captures most conversation, when in fact most public focus should be on public policy and ensuring this has the impact that the society wants.
IMO whether things are going to be good or bad depends on having a shared understanding, thinking, discussion and decisions around what's going to happen next.
Exactly, every country should urgently have a public debate on how best to use that technology and make sure it's beneficial to society as a whole. Social media are a good example that a technology can have a net negative impact if we don't deploy it carefully.
Ok, this conversation about social media has cropped up time and time again, and things haven't improved but have gotten even worse. I don't expect we'll be able to solve this problem with discussions only; so much money is being poured in that any discussion is likely to be completely neglected. Not saying that we shouldn't discuss this, but more action is needed. I think the tech sector needs to be stripped of political power, as it has gotten way too powerful and is interfering with everything else.
I agree, though while everyone is having public debates, these companies are already in there greasing palms. I personally think the fact we are allowing them to extract so much value from our collective work is perverse. What I find even more sickening is how many people are cheering them on.
Let them make their AI if we have to. Let them use it to cure cancer and whatever other disease, but I don't think we should be allowing it to be used for commercial purposes.
For better or worse, there's a system and a range of possibilities, and any actionable steps need to be within the realm of this reality, regardless of how anyone feels about it.
Public information, and the ability of the public to analyze, understand, and eventually decide what's best for them, is by and large the most relevant aspect. Your decisions are drastically different if you learn something can or cannot be avoided.
You can't disallow commercial purposes. You can't even realistically enforce property rights for illegal training data, but maybe you can argue that the totality of human knowledge should go towards the benefit of humans, regardless of who organizes it.
However there's a lot that can be done like understanding the implications of the (close to) zero-sum game that's about to happen and whether they are solvable in the current framework and without a first principles approach.
Ultimately, it's a resource ownership and resource utilization efficiency game. Everyone's resource ownership can't be drastically changed, but their resource utilization efficiency can, as long as the implications are made clear.
I think this is a misread of the economics. Human level AI will be expensive at first, but eventually nearly free. OpenAI will have no say in whether this happens. Competition between AI firms means OpenAI has no pricing power. And cost decreases from hardware and software (for a fixed level of intelligence) will allow competition to deliver those lower costs.
This won't mean humans can't earn wages by selling their labor. But it will mean that human intellectual labor will not be valuable in the labor market. Humans will only earn an income through differentiated activity. Initially that will be manual labor. Once robotics catches up, probably only labor tied to people's personality and humanness.
Self-driving cars have actually made progress. Better sensors, better programming, etc. Tesla can't do it but clearly Waymo can. It's not perfect, but with enough time and effort it can get to the point where it'll be regularly usable in most cases.
But LLMs? Those have already scraped all the data they're going to, and bigger models have less and less impact. They're about as good as they're ever going to be.
Another angle is that they might use it to scrape more content for internal use. That's on top of the training data that the user interactions generate. Which was your point, I believe.
“Available to pro users in the US”—another win for the EU bureaucrats. I’m kind of amused by how big tech companies in the US seem to have given up on complying with this legislative nonsense and instead just nerf their products in the EU or stop offering them there altogether.
I am curious what precisely this "legislative nonsense" is.
There seems to be some sort of consensus among the legal teams of big American tech companies that the EU is sometimes not worth it for now, since OpenAI is not the only one not offering some service in the EU (I'm thinking meta.ai).
Still I haven't been able to find information about what exactly prevents them from selling anything in the EU.
> We’re collaborating with companies like DoorDash, Instacart, OpenTable, Priceline, StubHub, Thumbtack, Uber, and others to ensure Operator addresses real-world needs while respecting established norms.
If you're collaborating with the companies the agent is supposed to interact with, why not just have it hook into an API rather than jumping through hoops to interface with their GUI? I don't get it.
The point is to move beyond APIs. Being able to perform actions on a site, with the ability to perform the task successfully even if the site slightly changes under the hood, is a lot less brittle than interacting with an API.
Agreed. Now you're throwing in a presentation layer to wrangle with, and that presentation layer is HTML/CSS/JS, which is the thorniest presentation layer out there.
When o1 came out with its chain of thought, people thought this was it. And today, nobody really cares. It's just another LLM.
Same thing with this.
The other day I was writing some code to compute some geometric angles, and I was getting two different results for what I thought was the same angle, but in fact I didn't realize that these angles should not be equivalent. No LLM was able to tell me the issue; they just said to double-check my work.
OpenAI is a $50B company that should be releasing serious products, the "scrappy hacker releasing a beta product that doesn't do much" as a defense doesn't apply.
I'd much rather they release early than not release at all. By your logic, ChatGPT would still be in internal testing and the whole industry would be way behind where it is today.
"Operator, I need to purchase 78,000 widgets for my company. Please find the best deal among suppliers who ship using carriers and ports who meet or exceed US EPA guidelines. Please ensure at least 50% of the product is sourced from post-consumer waste, and order your responses by price per unit."
Then add criteria for worker welfare, factory safety standards, relative corruption level of the host nation, and/or whatever else turns your propeller.
The point is that this kind of tool is potentially a real labor-saver for those who are trying to act responsibly within their sphere of influence.
Ultimately the goal is that we can replace Susan in purchasing with a $500/month AI-powered SaaS that locates suppliers based on your predefined needs and handles communications with them.
It's a tragedy that, with all the ways that humanity could be improved (and there's a very wide range to choose from), this is what we choose to spend half a trillion on.
Sure, that's a good use case that will save a few people some time. But do we need to spend $500B to save a few internet searches? How often do you need to find a new supplier of widgets for your company?
Can't you fathom Trump being criticized organically? You know, the convicted felon who just pardoned the neo-Nazis who attempted to overthrow the government 4 years ago, who undid 60 years of progress by repealing the Civil Rights Act, who appointed his corrupt billionaire friends to every position in this government but then rants on about how "DEI is taking away our meritocracy".
I can. But it doesn’t change the fact that nearly all the default subs are run by the same mods, who give blatant manipulation a pass as long as they agree with it politically.
I'm baffled by the fact this is the first case of "manipulation" you mention, when X is many orders of magnitude worse and more harmful to our democracies. But perhaps you don't care because you agree with its sieg-heiling owner?
Oh, you’re upset because you’re one of those “r/politics tells me what to think” NPCs. What’s it like being a breathing LLM trained on input data chosen by a Reddit moderator?
So you genuinely have no idea how hilariously manipulated a given section of a social media site you don’t use is, but because it’s critical of Trump it must be organic.
Waiting for the "OpenAI has no moat" crowd to chime in while they keep releasing new features and dominating market share.
(And yeah, they just got half a trillion).
Edit: Downvote all you want, reality won't change.
Oh, what happened with "Scarlett Johansson will take down OpenAI because she invented speaking like a woman", literally nothing.
What about "AI will never replace Hollywood actors".
What about that time when "OpenAI was done because Ilya was leaving". What a bunch of fools, lmao. I'm not a fan of Sam Altman, but I'm also not deluded.
https://github.com/bytedance/UI-TARS-desktop - I think this proves there is no moat here. As much as there is a moat on "water" or "electricity" or "chicken breast". Intelligence will be sold for fractions of pennies.
I was surprised by ByteDance doing AI, but really, they're the only social media company that has done the "suggested/for you" feature in a way that everybody isn't aghast at.
Oh yeah, how could I forget about an obscure repo from a company that's getting banned from the US!
It's simple, with trillions at play, if it's so easy to steal OpenAI's game, why has no one done it yet? Don't "argue" about it, just go and grab the money, it's easy, right?
Why? The official announcement is a $500B commitment. Even if they "only get $100B" (which I doubt) do you think that will somehow be detrimental to OpenAI?
Do you understand how much money a hundred billion dollars is?
The announcement of investing $500B, with the proposed benefit of creating 100K jobs, to my amazement did not produce any commentary that I came across raising questions about the ROI of spending $5M per job created. I mean, it's all right there in the announcement!
For instance, the American Recovery and Reinvestment Act (ARRA) of 2009, which allocated approximately $787 billion, was estimated to have created or saved between 2.4 and 3.6 million jobs by early 2011. This translates to a cost of roughly $218,000 to $328,000 per job.
In contrast, a study summarized by economist Valerie Ramey in 2011 found that each $35,000 of government spending produced one extra job.
The Federal Highway Administration estimated that every $1 billion in federal highway and transit investment supports approximately 13,000 jobs for one year, equating to about $77,000 per job.
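Putting the thread's per-job figures side by side (all numbers are as quoted in the comments above):

```python
announced_spend = 500e9     # $500B announcement
announced_jobs = 100_000    # "100K jobs"
per_job = announced_spend / announced_jobs   # $5M per job

arra_high = 328_000   # ARRA (2009) upper estimate, $ per job
ramey = 35_000        # Ramey (2011), $ per job
highway = 77_000      # FHWA estimate, $ per job

# Even against the most expensive baseline, the announcement implies
# roughly 15x more spending per job; against Ramey's figure, ~143x.
print(round(per_job / arra_high), round(per_job / ramey))  # 15 143
```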
They also gave no indication of what types of jobs were going to be created. It's pretty much hot air.
If the goal is to create data centers for more AI training, you can rest assured that depends on creating as few jobs as possible in order to keep labor costs down and have more to spend on hardware and energy.
- OpenAI hasn't gotten any money yet, not even $100b
- OpenAI is releasing this feature after Anthropic
- AI has yet to replace any significant fraction of Hollywood actors
I'm not disagreeing that these things might happen, but you have to be cognizant of the fact that you're talking projections of the future -- not assessments of the here-and-now.
This seems more like a catch-up to other similar tools; still cool though. The half a trillion is OpenAI agreeing to raise funds and contribute to the half trillion, not them getting it all.