> The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.
> Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.
-
This sounds really interesting, and I see great use cases for it. However, I'm wondering if the API provides a text transcription of both the input and output so that I can store the data directly in a database without needing to transcribe the audio separately.
-
Edit: Apparently it does.
It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (I guess a couple of them in real-time), and `response.done` [1] with the response text.
[0] https://platform.openai.com/docs/api-reference/realtime-serv...
[1] https://platform.openai.com/docs/api-reference/realtime-serv...
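For reference, a minimal sketch of what storing those transcripts could look like, assuming the `websockets` Python package, the preview model name, and that input transcription has been enabled on the session; the event field names follow the docs linked above, `save_transcript` is a hypothetical stand-in for your own database layer, and the header kwarg may be `additional_headers` on newer `websockets` versions:

```python
# Sketch (not official sample code): listen on the Realtime API WebSocket and
# persist input/output transcripts. save_transcript is a placeholder.
import asyncio, json, os
import websockets

async def log_transcripts():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "conversation.item.input_audio_transcription.completed":
                save_transcript(role="user", text=event.get("transcript", ""))
            elif event["type"] == "response.done":
                # The response text/transcript is nested inside the output items.
                for item in event["response"].get("output", []):
                    for part in item.get("content", []):
                        save_transcript(role="assistant",
                                        text=part.get("text") or part.get("transcript", ""))

def save_transcript(role: str, text: str) -> None:
    print(role, text)  # replace with an INSERT into your database

asyncio.run(log_transcripts())
```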
yes it transcribes inputs automatically, but not in realtime.
outputs are sent as text + audio, but you'll get the text very quickly and the audio a bit slower, and of course the audio takes time to play back. the text also doesn't currently have timing cues, so it's up to you if you want to try to play it "in sync". if the user interrupts the audio, you need to send back a truncation event so it can roll its own context back, and if you never presented the text to the user you'll need to truncate it on your side as well to ensure your storage isn't polluted with fragments the user never heard.
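a rough sketch of what that truncation event could look like, based on the `conversation.item.truncate` client event in the Realtime docs; `ws`, `item_id`, and the milliseconds-played value are assumed to come from your own playback state:

```python
# Hedged sketch: tell the server how much of the assistant audio was actually
# heard after an interruption, so its context matches what the user heard.
import json

async def truncate_unheard_audio(ws, item_id: str, audio_ms_played: int) -> None:
    await ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,          # the assistant message being played back
        "content_index": 0,          # index of the audio content part
        "audio_end_ms": audio_ms_played,
    }))
```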
It's incredible that people are talking about the downfall of software engineering - now, at many companies, hundreds of call center roles will be replaced by a few engineering roles. With image fine-tuning, now we can replace radiologists with software engineers, etc. etc.
What's the role of the software engineer besides setting this up?
Your example makes me think it will merely move QA into essentially providing countless cases and then updating them over time to improve the AI's data.
And is it really gonna be cheaper than human support?
What's going to happen when we find out (see how impossible it already is to reach a human when interacting with many companies) that this brings costs down (maybe, eventually), but revenue too, because pissed-off customers will move elsewhere?
More than a majority of a software engineer’s time is spent on bug triage, reproducing bugs, simulating constituents in a test, and debugging fixes.
Doesn’t matter what the computer becomes — AI, AGI or God-incarnate — there’s always a role between that and the end-user. That role today is called software engineer. Tomorrow, it’ll be whatever whatever. Perhaps paid the same or less or more. Doesn’t matter.
There’s always an intermediary to deal with the shit.
Hmm, I wonder if that’s the roles priests & the clergy have been playing all this while. Except, maybe humanity is the shit God (as an end user) has to deal with
the _role_ of radiologists isn't going away, but as with software engineers, better tools mean fewer are needed to serve the same patient population. So it's highly likely that there is going to be displacement within that industry as well.
I don't believe that is comparable.
1. Modern algorithms started out as a cronjob (which already worked better than the alternative).
2. Advances in applying optimal control theory are well known, (mostly) deterministic, and explainable. They are in no way comparable to the black box that is the current state of computer vision.
3. Their failure can be readily observed and compensated for, since the patient will definitely notice. The same cannot be said about imaging.
OpenAI just launched the equivalent of Velvet as a full-fledged feature today.
But separate from that, you typically want some application-specific storage of the current "conversation" in a very different format than raw request logging.
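As a loose illustration (not anything Velvet- or OpenAI-specific), that application-level storage might be as simple as a per-turn table kept apart from raw request logs; the table and column names here are made up:

```python
# One possible shape for application-level conversation storage, separate
# from raw request/response logging. Names are illustrative only.
import sqlite3

conn = sqlite3.connect("conversations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS turns (
        conversation_id TEXT NOT NULL,
        turn_index      INTEGER NOT NULL,
        role            TEXT NOT NULL,      -- 'user' or 'assistant'
        text            TEXT NOT NULL,      -- transcript, not raw audio
        created_at      TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (conversation_id, turn_index)
    )
""")
conn.commit()
```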
I've never seen a company consistently publish groundbreaking features at such a speed. I really wonder how their teams work. It's unprecedented in what I've seen in 15 years of software.
They definitely use their own products internally, perhaps to a fault: While chatting with OpenAI recruiters, I received calendar events with nonsensical DALLE-generated calendar images, and "interview prep" guides that were clearly written by an older GPT model.
AFAIK a lot of these ideas are not new (the JSON thing was done with open-source models before), and OpenAI is possibly the hottest startup with the most funding this decade (maybe even the past two decades?), so I think this is actually all within expectations.
Their web UI was a glitchy mess for over a year. Rollouts of just data are staggered and often delayed. They still can't adhere to a JSON schema accurately, even though others figured this out ages ago. There are global outages regularly. Etc…
I’m impressed by some aspects of their rapid growth, but these are financial achievements (credit due Sam) more than technical ones.
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.
3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?
Not sure why you are being downvoted. You are generally right. Most of their new product rollouts were accompanied by huge production instabilities for paying customers. Only in the most recent ones did they manage that better.
> They still can’t adhere to a JSON schema accurately
Strict mode for structured output fixes at least this though.
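For example, a minimal sketch using the documented `json_schema` response format with `strict: true` via the official `openai` Python SDK; the schema itself is invented for illustration:

```python
# Structured Outputs with strict mode, per the public docs; the schema is
# just an example, and the model name is the one that supports strict mode.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract the event details."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "event",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["name", "date"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # output conforms to the schema
```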
> OpenAI is possibly the hottest startup with the most funding this decade (maybe even past two decades?)
It depends on how you define startup but I don't think they will surpass Uber, ByteDance, or SpaceX until this next rumored funding round.
I'm excluding companies that have raised funding post-IPO since that's an obvious cutoff for startups. The other cutoff being break-even, in which case Uber has raised well over $20 billion.
Is it that most models are based on the transformer architecture? And so performance improvements can then be used throughout their different products?
> 11:43 Fields are generated in the same order that you defined them in the schema, even though JSON is supposed to ignore key order. This ensures you can implement things like chain-of-thought by adding those keys in the correct order in your schema design.
Why not use an array of key value pairs if you want to maintain ordering without breaking traditional JSON rules?
> even though JSON is supposed to ignore key order
Most tools preserve the order. I consider it to be an unofficial feature of JSON at this point. A lot of people think of it as a soft guarantee, but it's a hard guarantee in all recent JavaScript and Python versions. There are some common places where it's lost, like JSONB in Postgres, but it's good to be aware that this unofficial feature is commonly relied on.
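A quick illustration in Python (dicts are insertion-ordered since 3.7, and `json.loads`/`json.dumps` keep key order), which is what makes the chain-of-thought key-ordering trick above workable on the client side:

```python
# Key order survives a JSON round trip in Python.
import json

raw = '{"reasoning": "think step by step...", "answer": 42}'
obj = json.loads(raw)
print(list(obj.keys()))   # ['reasoning', 'answer'] — order preserved
print(json.dumps(obj))    # {"reasoning": "think step by step...", "answer": 42}
```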
It's nice to have a solution from OpenAI given how much they use a variant of this internally. I've tried like 5 YC startups and I don't think anyone's really solved this.
There's the very real risk of vendor lock-in but quickly scanning the docs seems like it's a pretty portable implementation.
It's pretty amazing that they made prompt caching automatic. It's rare that a company gives a 50% discount without the customer explicitly requesting it! Of course... they might be retaining some margin, judging by their discount being 50% vs. Anthropic's 90%.
It's probably stuck in legal limbo in the EU. The recently passed EU AI Act prohibits "AI systems aiming to identify or infer emotions", and Advanced Voice does definitely infer the user's emotions.
(There is an exemption for "AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use", but Advanced Voice probably doesn't benefit from that exemption.)
Apparently this prohibition only applies to "situations related to the workplace and education", and, in this context, "That prohibition should not cover AI systems placed on the market strictly for medical or safety reasons"
So it seems to be possible to use this in a personal context.
> Therefore, the placing on the market, the putting into service, or the use of AI systems intended to be used to detect the emotional state of individuals in situations related to the workplace and education should be prohibited. That prohibition should not cover AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use.
This is true, though it may not make sense commercially for them to offer an API that can't be used for workplace (business) applications or education.
I see what you mean, but I think that "workplace" specifically refers to the context of the workplace, so that an employer cannot use AI to monitor the employees, even if they have been pressured to agree to such a monitoring. I think this is unrelated to "commercially offering services which can detect emotions".
But then I don't get the spirit of that limitation, as it should be just as applicable to TVs listening in on your conversations and trying to infer your emotions. Then again, I guess that for these cases there are other rules in place which prohibit doing this without the explicit consent of the user.
In a nutshell, this uncertainty is why firms are going to slow-roll EU rollout of AI and, for designated gatekeepers, other features. Until there is a body of litigated cases to use as reference, companies would be placing themselves on the hook for tremendous fines, not to mention the distraction of the executives.
Which, not making any value judgement here, is the point of these laws: to slow down innovation so that society, government, and regulation can digest new technologies. This is the intended effect, and the laws are working.
Companies like OpenAI definitely have the resources to let some lawyers analyze the situation and at this point it should be clear to them if they can or can't do this. It's far more likely that they're holding back because of limitations in hardware resources.
I use those words because I've never read any of the points in the EU AIA.
They definitely do have the resources, but laws and regulations are frequently ambiguous. This is one reason the outcome of litigation is often unpredictable.
I would wager this -- OpenAI lawyers have looked at the situation. They have not been able to credibly say "yes, this is okay", and so management makes the decision to wait. Obviously, they would prefer to compete in Europe if it were a no-brainer decision.
It may be possible that the path to get to "yes, definitely" includes some amount of discussion with the relevant EU authorities and/or product modification. These things will take time.
I understand the Realtime API voice novelty, and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.
The two examples shown at DevDay are the things I don't really want to do in the future. I don't want to talk to anybody, and I don't want to wait for their answer in a human form. That's why I order my food through an app or WhatsApp, or why I prefer to buy my tickets online. In the rare case I call to order food, it's because I have a weird question or a weird request (can I pick it up in X minutes? Can you prepare it in a different way?)
I hope we don't start seeing apps using conversations as interfaces, because it would be really horrible (leaving aside the fact that a lot of people don't know how to express themselves, plus different accents, noisy environments, etc.), while clicking or typing works almost the same for everyone (at least it's much more normalized than talking).
> I understand the Realtime API voice novelty, and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.
The market for realistic voice agents is huge, but also very fragmented. Customer service is the obvious example, large companies employ tens of thousands of customer service phone agents, and a large # of those calls can be handled, at least in part, with a sufficiently smart voice agent.
Sales is another, just calling back leads and checking in on them. Voice clone the original sales agent, give the AI enough context about previous interactions, and a lot of boring legwork can be handled by AI.
Answering simple questions is another great example. Restaurants get slammed with calls during their busiest hours (seriously, getting ahold of restaurant staff during peak hours can be literally impossible!), so having an AI that can pick up the phone and answer basic questions (what's in certain dishes, what the current wait time is, what the largest group that can be seated together is, etc.) is super useful.
A lot of small businesses with only a single employee can benefit from having a voice AI assistant picking up the phone and answering the easy everyday queries and then handing everything else off to the owner.
The key is that these voice AIs should be seamless, you ask your question, they answer, and you ideally don't even know it is an AI.
Hopefully, it won't cause a plethora of nuisance phone calls. As the cost tends to zero, it will become much easier to spam people, even more so than now.
they're definitely going to instruct the AI agents to lie to you, and deliberately waste your time, and be pushier than ever, because it costs them nothing to keep a real human on the line even longer. at least we'll have our own agents to waste their compute in turn
One thing I'm really excited for is having this real-time voice model in video game characters. It would be really cool to be able to have conversations with NPCs, and actually have to pick their brain for information about a quest or something.
You're right, having a voice conversation for any reason is just so passe these days. They should stop adding microphones to phones and everything. So old-fashioned and inefficient. And who wants to ever have to actually talk to someone or some AI to ask for anything? I'm sure our vocal cords will evolve away soon. They are so primitive. Vestigial organs.
keep in mind that this is just v1 of the realtime api. they'll add realtime vision/video down the road which can also have wide applications beyond synchronous communication.
Holy crud, I figured they would guard this for a long time and I was really salivating to make some stuff with it. The doors are wide open for all sorts of stuff now, Advanced Voice is the first feature since ChatGPT initially came out that really has my jaw on the floor.
> Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
> The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.
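Back-of-the-envelope from the prices quoted above, the per-minute figures imply roughly how many audio tokens a minute of speech maps to:

```python
# Derived only from the quoted prices, nothing official beyond that.
audio_in_per_token  = 100 / 1_000_000   # $100 per 1M input audio tokens
audio_out_per_token = 200 / 1_000_000   # $200 per 1M output audio tokens

print(0.06 / audio_in_per_token)   # ~600 input tokens per minute of audio
print(0.24 / audio_out_per_token)  # ~1200 output tokens per minute of audio
```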
As usual, OpenAI failed to emphasize the real game-changer feature at their Dev Day: audio output from the standard generation API.
This has severe implications for text-to-speech apps, particularly if the audio output style is as steerable as the gpt-4o voice demos.
correct - you should also be able to save a lot by skipping their built-in VAD and doing turn detection (if you need it) locally to avoid paying for silent inputs.
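as a rough idea of what "local turn detection" could start from (this is not the API's built-in VAD), a crude energy gate over 20 ms PCM16 frames; the threshold is made up, and a real VAD such as webrtcvad would do better:

```python
# Crude local gate so you don't stream (and pay for) silence: only forward
# PCM16 frames whose RMS energy clears a threshold.
import math
import struct

def frame_rms(frame: bytes) -> float:
    samples = struct.unpack(f"<{len(frame)//2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    return frame_rms(frame) >= threshold
```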
yes and you can use it in text-text mode if you want. a key benefit is for turn-based usages (where you have running back and forth between user and assistant) you only need to send the incremental new input message for each generation. this is better than "prompt caching" on the chat completions API, which is basically a pricing optimization, as it's actually a technical advantage that uses less upstream bandwidth.
I didn't expect an API for advanced voice so soon. That's pretty great. Here's the thing I was really wondering: Audio is $.06/min in, $.24/min out. Can't wait to try some language learning apps built with this. It'll also be fun for controlling robots.
Interesting choice of a 24kHz sample rate for PCM audio. I wonder if the model was trained on 24kHz audio, rather than the usual 8/16kHz for ML models.
Weekly caps are for standard accounts (not going to be talked about at DevDay). The blog does note RPM changes for the API though:
"10:30 They started with some demos of o1 being used in applications, and announced that the rate limit for o1 doubled to 10000 RPM (from 5000 RPM) - same as GPT-4 now."
I just had an evil thought: once AIs are fast enough, it would be possible to create a “dynamic” user interface on the fly using an AI. Instead of Java or C# code running in an event loop processing mouse clicks, in principle we could have a chat bot generate the UI elements in a script like WPF or plain HTML and process user mouse and keyboard input events!
If you squint at it, this is what chat bots do now, except with a “terminal” style text UI instead of a GUI or true Web UI.
The first incremental step had already been taken: pretty-printing of maths and code. Interactive components are a logical next step.
It would be a mere afternoon of work to write a web server where the dozens of “controllers” are replaced with a single call to an LLM API that simply sends the previous page HTML and the request HTML, headers and all.
“Based on the previous HTML above and the HTTP request below, output the response HTML.”
Just sprinkle on some function calling and a database schema, and the site is done!
Other than being borderline impossible to secure, it “should just work” once the AIs get smart enough.
Fine-tuning the model based on example pages and responses might be all that’s required for a sufficient level of consistency.
An immediate use-case might be prototyping in-place.
If you have an existing site, you can capture the request-response pairs and train the AI on it, annotated with the spec docs. Then tell it to implement some new functionality and it should be able to. Just route a subset of the site to the AI instead of the normal controllers.
One could “design” new components and functionality in English and try it instantly with no compilation or deployment steps!
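As a toy sketch of that idea (emphatically not something to deploy), a single catch-all handler that asks the model for the next page's HTML, assuming Flask and the official `openai` SDK; the model name and prompt are illustrative:

```python
# Toy sketch: one catch-all handler asks the model for the response HTML
# instead of running per-route controllers. Insecure by construction.
from flask import Flask, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()
last_page = "<html><body><h1>Home</h1><a href='/items'>Items</a></body></html>"

@app.route("/", defaults={"path": ""}, methods=["GET", "POST"])
@app.route("/<path:path>", methods=["GET", "POST"])
def llm_controller(path):
    global last_page
    prompt = (
        "Based on the previous HTML above and the HTTP request below, "
        "output only the response HTML.\n\n"
        f"Previous HTML:\n{last_page}\n\n"
        f"Request: {request.method} /{path}\n"
        f"Form data: {dict(request.form)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    last_page = resp.choices[0].message.content
    return last_page

if __name__ == "__main__":
    app.run(debug=True)
```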