Audiobox: Meta's new foundation research model for audio generation (meta.com)
268 points by reqo 5 months ago | 84 comments

If I shut down every voice other than the optimist's one in my head, this, along with other recent AI research, will mark the advent of never-seen-before role play game possibilities. If the current pace of progress continues, we'll see games with complete narrative freedom for players, where you aren't limited to pre-written answers anymore, but can actually talk to in-game characters with your actual voice, goals, and motivations. And those virtual conversation participants can talk back to you, react to your words and actions in a believable, fully immersive manner. That's a dream come true for every gamer on the face of this earth, I believe.

The more rational voices in my mind, though, become more and more afraid of a world where the only thing you can trust is people sitting right in front of you. That makes the world of information pretty small again.

I’m an amateur music producer and vocals are by far the toughest part of making music. I have to find a singer, convince them to work with me (I am an amateur and not particularly good tbh), and book studio space because it’s very tough to get a clean recording at home.

I’m hoping that like digital instruments, I’ll be able to splice in digital voices instead of finding singers.

This already exists, ex. Audimee: https://audimee.com/

And the millions of other RVC websites. Musicfy, Uberduck, Coversai, Kitsai, FakeYou, Voicemyai, Voicify, Bangerapp, Tryreplay, Weightsgg ...

RVC is so easy anyone can spin up a website for it. No moat. Over a hundred thousand trained weights files in the open, so it's easy to bootstrap.

For anybody else who also hadn't come across the term RVC before:

"The RVC model is a Retrieval-based Voice Conversion system using AI for high-quality voice cloning. It utilizes artificial intelligence to modify or clone voices in real-time." Source: https://speechify.com/blog/rvc-vocal-models/

Wow. I'm going to give each family member a password and make them prove that they're real when they call me :)

or go pro and issue a One Time Pad to each of them.
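For what it's worth, the "pad" version of the family password could look something like the sketch below. To be clear about assumptions: this is not a cryptographic one-time pad in the XOR-encryption sense, just a list of single-use codes handed over in person, and `make_pad` and `PadVerifier` are names I made up for illustration.

```python
import secrets


def make_pad(n_codes: int = 10, n_bytes: int = 4) -> list:
    """Generate random single-use codes to hand over in person (hypothetical helper)."""
    return [secrets.token_hex(n_bytes) for _ in range(n_codes)]


class PadVerifier:
    """Accepts each code once, in order, so a recorded call cannot be replayed."""

    def __init__(self, pad):
        self._unused = list(pad)

    def verify(self, spoken_code: str) -> bool:
        if not self._unused:
            return False  # pad exhausted: time to meet in person again
        # Constant-time comparison against the next expected code.
        if secrets.compare_digest(spoken_code, self._unused[0]):
            self._unused.pop(0)  # burn the code so it can never be reused
            return True
        return False


if __name__ == "__main__":
    pad = make_pad(3)
    verifier = PadVerifier(pad)
    print(verifier.verify(pad[0]))  # first use of the first code
    print(verifier.verify(pad[0]))  # replaying the same code
```

The point of burning codes is that overhearing one call gives an impersonator nothing to reuse on the next one.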

Would make a nice vignette in a film about a dystopian future where video can be generated cheaply and of sufficient quality.

Yeah, but I tried Audimee too, and they are working on delivering the best-quality voices. It is not about whether you train or use RVC yourself; it is about quality and saving time.

Now if only the prices would drop and I could start making custom audiobooks for the price of regular audiobooks.

Great, so the competition is going to yield cheap services with a nice UX.

An English language and accent bias in most current models

https://voice-swap.ai/ may be of interest, you can convert your rough singing to use a real singer's voice (apparently! I've not tried it)

This is already somewhat available (check out Dreamtonics Synthesizer V and the voices like Solaria etc)

I have been using Synthesizer V for a lot of my music and the quality is very high, even with little manual tuning. There are a lot of voices to choose from now, and they added a cross-synthesis feature which lets you use the voices originally intended for Japanese and Chinese in English, and they don’t have a strong accent (I do find that I have to mess with the lyrics input a little). Also SAROS is coming out which has experimental Spanish support.

Definitely recommend looking up some Synthesizer V covers to see how realistic singing voice synthesis has become in recent years. It’s also free to try lower quality versions of the voices.

Save your money and record at home under a blanket.

Closet full of jackets works too. That’s what a lot of podcasters do when traveling

They also do pillow forts, e.g. http://PillowFortStudios.com/

Layered towels are surprisingly capable too

Or if you want to feel a little fancy, a 3-sided folding project board with foam glued to it.

Part of me looks forward to these experiences. A larger part of me already mourns the decline in purely human creativity that they suggest. Perhaps AI models will make the perfect game or music or write the perfect novel. I'm certain they'll be programmed to reproduce the funky jank that humans bring to everything they create. At some level, though, we each have our own story that we want to tell and when there are so many automated voices in the room, it's going to be harder to tell those stories.

It'll affect linear entertainment far before it impacts games seriously (Especially the 'full immersive' games where you chat with AI agents)

AI is still too expensive and performance intensive to run in games cost effectively, and truly powerful AI is probably another 10-100x cost increase.

On the other hand, novels will be rapidly replaced by visual novels. The cost of having a novel fully illustrated and voiced will go down 1000x. A high quality illustration used to cost $500-$1000 (A day's work from a high-tier commercial artist), soon it will be about $0.50. I'm not counting in the author's time to prompt the images, because it would have cost them way more time to communicate with the illustrator anyways.

The entire boundary between novels, comics, cartoons etc will blur. Like if a newly written Harry Potter can have thousands of illustrations set in Hogwarts and be fully voiced, the standards for a movie adaptation will be astronomically high, which will in turn drive AI use in movie production just to keep up.

> It'll affect linear entertainment far before it impacts games seriously

Maybe. But I think you make the mistake of considering games that combine existing AAA features + AI as where it will first impact games, whereas I think it will first make its mark in games that don’t use hardware heavily for 3d rendering, by opening up new modes of gaming.

> novels will be rapidly replaced by visual novels. The cost of having a novel fully illustrated and voiced will go down 1000x.

The cost of having art made isn't the only reason novels aren't fully illustrated now, and AI doesn't impact any of the others.

Are you sure of this? LLMs seem like a pretty low-hanging fruit for game studios, even if they don't implement fully immersive player actions upfront. Even just generating background sound or NPC banter during build automatically seems like a no-brainer to me, enabling huge savings on content writing - and that's really the lowest-cost solution imaginable.

> The more rational voices in my mind, though, become more and more afraid of a world where the only thing you can trust is people sitting right in front of you. That makes the world of information pretty small again.

Gives me cosmological analogies: in the far enough future, the only things you can see in the night sky are the members of the local supercluster.

> That's a dream come true for every gamer on the face of this earth, I believe.

There's definitely some gamers who would like to never talk to a character in a video game, even if it means hours of bumbling around for the blue key that some NPC just told me about while I ignored 100% of the text in the game.

I'm old and I sympathize with your last sentence, I _imagine_ certain reinforced paths through repeated experience are myelin covered so much as to be almost static. But if I think of generations growing up, their brains aren't reinforced/seasoned enough to spot these "Wait... What?" moments in online content that is almost sensical but not quite. In a world with ever increasing AI hallucinated content, when children absorb fake content and build strong paths in their brains, at some point the words/meaning could be so distorted that you cannot understand anymore the person in front of you? I think of the political landscape where we have similar problems, social media algorithms curate content to keep you entertained within subjects A and B, and you have a community with shared values and you are seasoned and share its language domain. What would the landscape look like when the net can be bombarded by content that appears to be true, valid and useful but it really is not, for young generations? Also if somebody can paint the right myelin picture in my head (not plausible text generation but science) I would really appreciate it.

I think that an answer to finding a 'true' reality on the net comes back to the centuries old discipline of philosophy. The world has been full of 'garbage-thought' forever. I might suggest that there is less 'garbage-thought' now for those that wish it than there ever has been.

The AI-Voice revolution leads us instead to the problem of authority, and of imitation of authority. Too many of us take it as given that an authority has the truth.

The AI-Voice / AI-Content revolution allows low quality actors to imitate relevant authorities.

So, to address your question, we need to study philosophy in schools, so that kids learn to think for themselves.

The amount of context needed would require quite a bit of novel R&D that doesn't exist on the horizon yet. I think it's more realistic that it'll be a mixture of real & fake in the interim (e.g. the LLM will record important game state changes / information & then use that as context but it'll still forget a bunch of things you'd expect a human to).

>If the current pace of progress continues, we'll see games with complete narrative freedom for players, where you aren't limited to pre-written answers anymore, but can actually talk to in-game characters with your actual voice, goals, and motivations. And those virtual conversation participants can talk back to you, react to your words and actions in a believable, fully immersive manner. That's a dream come true for every gamer on the face of this earth, I believe.

It's basically Dungeons and Dragons with an AI dungeon master who can generate video in realtime. Which would be awesome, but like Dungeons and Dragons it wouldn't be easy to keep the player on track.

I'd imagine everyone else has an agenda, a schedule and ordinary life which is pre-written, so the game actually wants to tell an interesting story, but as a player, you can choose to listen or just do all shenanigans in the context of the game you can come up with. If you want to be a farmer instead of defeating the dark overlord, so be it -- until the world ends because the overlord has achieved his goals with no resistance, or maybe someone else becomes a hero instead. ...Gosh, the more I think about it the more awesome it gets.

Edit: just one more! Imagine actually having to complete quests in a given amount of time, because the rest of the world continues to revolve. People being mad at you because you left their children to die in the dungeon after arriving a day too late, because you were busy running an errand for someone else.

What I have been thinking would be really cool and fun to try would be to remake zork with llms, speech to text and text to speech.

I think you might be able to do it with a lot of prompting, and having a database that functions like a wiki for the current and past states of the game world.
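A rough sketch of what that wiki-like database could look like, purely as an illustration: every name here (`WorldWiki`, `record`, `build_context`, `build_prompt`) is hypothetical, entity lookup is naive substring matching, and the actual LLM, speech-to-text and text-to-speech calls are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorldWiki:
    """One 'page' per entity; each page keeps its full history of states."""
    pages: dict = field(default_factory=dict)  # entity -> list of (turn, fact)
    turn: int = 0

    def record(self, entity: str, fact: str) -> None:
        """Append a new state for an entity, keeping older states as history."""
        self.turn += 1
        self.pages.setdefault(entity, []).append((self.turn, fact))

    def current(self, entity: str) -> Optional[str]:
        """Return the latest known state of an entity, or None if unknown."""
        history = self.pages.get(entity)
        return history[-1][1] if history else None

    def build_context(self, player_input: str, max_past: int = 2) -> str:
        """Collect recent states of every entity mentioned in the player's input."""
        text = player_input.lower()
        lines = []
        for entity, history in self.pages.items():
            if entity.lower() in text:
                # Current state plus up to max_past earlier states.
                for turn, fact in history[-(max_past + 1):]:
                    lines.append(f"[turn {turn}] {entity}: {fact}")
        return "\n".join(lines)


def build_prompt(wiki: WorldWiki, player_input: str) -> str:
    """Assemble the text that would be sent to the model (the model call is omitted)."""
    context = wiki.build_context(player_input)
    return f"Known world state:\n{context}\n\nPlayer: {player_input}\nNarrator:"


if __name__ == "__main__":
    wiki = WorldWiki()
    wiki.record("mailbox", "closed")
    wiki.record("mailbox", "open, contains a leaflet")
    wiki.record("lamp", "off")
    print(build_prompt(wiki, "look in the mailbox"))
```

Only the pages relevant to the current input go into the prompt, which is one way to keep the context small while the full history stays in the store.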

If you got really fancy, you could also make it a pseudo MMO where the content of the story you create with a character could be used as the basis for a NPC plotline in other people's worlds, possibly reducing the amount of content needed to be written.

If it got popular you could also use it as a research tool, where you could force some subset of the player population to have some interaction, and be able to test and get a dataset for counterfactual reasoning.

The next 5 years will be wild.

> The more rational voices in my mind, though, become more and more afraid of a world where the only thing you can trust is people sitting right in front of you.

You could care less if the content you already consume on forums like HN were generated by AGI. Just sayin.

It would be fantastic to put a bunch of different LLMs in a game map with “senses” fulfilled by multimodal inputs and agency to carry out actions within the game’s universe. With a goal such as make the most money or rule the most kingdoms, it would be super interesting to see how it self organizes.

I used to be an LLM until I took an arrow to the knee. Aside from the joke, I think the barrier would definitely reduce and in-game characters would be contextually far more aware but how do you enforce plot progression in such a truly open world? Can you control the boundary of LLM expression?

> but how do you enforce plot progression in such a truly open world?

I'd imagine a mixture of general prompting/training the LLM on techniques like those used for plot progression by human game masters in TTRPGs and guidance via systems tracking progress and injecting contextual prompts based on mechanisms like those used for tracking and guiding plot progression in GM-less/GM-replacement systems (e.g., the Mythic Game Master Emulator) for TTRPGs.
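As a minimal sketch of the "track progress and inject contextual prompts" half of that: the code below is not the Mythic Game Master Emulator or any real system, just a hypothetical `PlotTracker` that watches player actions for keyword triggers and, once the player has stalled for a few turns, hands the LLM a steering directive. The `Beat` fields, the keyword matching, and the `patience` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Beat:
    name: str       # plot beat the designer wants the player to reach
    done_when: str  # lowercase keyword in a player action that completes the beat
    nudge: str      # hint handed to the LLM when the player stalls


class PlotTracker:
    def __init__(self, beats, patience: int = 3):
        self.beats = list(beats)
        self.index = 0          # position of the current beat
        self.stalled_turns = 0  # turns since the last completed beat
        self.patience = patience

    def observe(self, player_action: str) -> None:
        """Update progress from one player action."""
        if self.index >= len(self.beats):
            return
        if self.beats[self.index].done_when in player_action.lower():
            self.index += 1
            self.stalled_turns = 0
        else:
            self.stalled_turns += 1

    def directive(self) -> str:
        """Text to inject into the LLM's system prompt for the next turn."""
        if self.index >= len(self.beats):
            return "The main plot is resolved; improvise freely."
        beat = self.beats[self.index]
        if self.stalled_turns >= self.patience:
            return f"Steer toward '{beat.name}': {beat.nudge}"
        return f"Current goal is '{beat.name}'; do not force it."


if __name__ == "__main__":
    tracker = PlotTracker(
        [Beat("find the key", "key", "Have an NPC mention the blue key.")],
        patience=2,
    )
    for action in ["I plant turnips", "I go fishing"]:
        tracker.observe(action)
    print(tracker.directive())
```

The model stays free to improvise while the player is making progress, and only gets an explicit nudge after `patience` turns without one.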

Definitely agree with your last point, but regarding games with endless possibilities: I'd rather have a game that is created by someone with a clear goal in mind. One with boundaries. A good single player experience. Like for example Alan Wake 2.

Approaching a full life simulation.

Maybe soon game developers will realize black holes and the uncertainty principle are great for efficiency.

Back to Descartes

So on one hand there is the prospect of "complete narrative freedom for players" and on the other hand there is a fake information dystopia right ahead of us. Well, I would suspect, that the more dystopia we are going to have, the more people will want to flee that reality for a while into ever getting better role playing games. Is that, what people once called progress?

"a world where the only thing you can trust is people sitting right in front of you"

Oh and I think real people can be a source of misinformation as well. Also holograms (or brain implants) might have a breakthrough soon, as well. So all in all I think we are living in interesting times.

> Oh and I think real people can be a source of misinformation as well.

Yeah, definitely. But at least you have a chance to detect their intentions; for some forged digital information, you don't even have that. "Interesting times" is one way to put it...

> react to your words and actions in a believable, fully immersive manner

“I’m sorry, but as an ethically trained AI I cannot engage in this sword fight. Violence is never the answer.”

Yeah. It’s gonna be very immersive :P

you can use an uncensored model. not everything will be connected to OpenAI.

I could see Microsoft making the first move next generation since they're knee deep in it.

This is a fantastic new development in the AI Audio space! However, it's quite disappointing that the model is closed source. Nonetheless, Alibaba's equivalent was released earlier in Nov and it's open-sourced! https://github.com/QwenLM/Qwen-Audio

Does anyone have suggestions for how to integrate this into your tech stack via an internal API? Interested to hear the varying thoughts on this. From what I softly understand is that the model weights have to be swapped or altered per se to be able to commercially reuse this. Correct me if I'm wrong.

> Alibaba's equivalent was released earlier in Nov and it's open-sourced! https://github.com/QwenLM/Qwen-Audio

Openly distributed perhaps, but definitely not open source. The license appears as closed as meta (research-only with some leeway for other uses.)

Do you have any truly open-source general audio generation models yet?

I know about StyleTTS2, which is open source (MIT) and uncensored, but that model focuses on speech generation only. Having a non-proprietary model like Audiobox or Qwen-Audio would be really nice.

Thanks for the link. License is clear: Researchers and developers are free to use the codes and model weights of both Qwen-Audio and Qwen-Audio-Chat. We also allow their commercial use.

and important, if you have more than 100m active users:) 4. Restrictions If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us

So, looks like it's absolutely fine to use, except for IT behemoth.

As for how to use, API, I think. Interesting applications are possible. Like interactive mobile robots. Assistants for people with disabilities, both software and wearable.

Interesting times... this will be called AI revolution probably. It's already not a joke, after several ups and downs.

Normally I want basically everything to be open source. But as soon as I saw that audio restyling demo, I began to feel concerned they may be open sourcing this. The model can take a sample speaker's voice, new text to speak, and also a description of a new location (like a cathedral with many echoes, or other background noises) and produce a convincing new audio sample.

This technology will present serious challenges for the verification of covertly recorded audio. It will of course ultimately become widespread but I’m not inherently bothered by the idea of slowing down its release. Giving researchers extra time to examine possible detection techniques seems helpful to me.

this technology is already exploited in the wild for scams. probably for years. it's too late to worry or try to stop it. the question is what can we do about it. and the first step is to make people aware of it.

VR is gonna get wild in like 5 years if they keep this up

This is why I'm high on the metaverse long term. In ten years, there will be a $500 (or whatever the 2033 inflation adjusted value is) VR headset that blows the Apple Vision Pro out of the water in terms of optics, will run a highly optimized version of the latest revision of Llama locally (and it will be much better than anything we currently have today), come with wifi 8 (so it will have multigigabit per second real-world performance), and be capable of rendering graphics that look much more realistic than Unreal Engine 5 (with high frame rates due to AI upscaling and frame generation).

There will be people that will spend almost every waking hour with one of those things attached to their face if they can also make this device lightweight and comfortable

In 2040 we will have lenses with 32k displays and gpus with 10 trillion transistors and 1 petabyte of memory. It's hard to predict what you can do with that. The real world would be empty by then.

The real world is much richer than a bunch of stuff just being projected in to your eyes. You can’t climb a tree in VR.

Maybe for some. But for those living in unfortunate circumstances (low socioeconomic status, small apartment, poor future prospects, etc) I bet a VR world of paradise and social connection is far more inviting.

And yet I'll still be waiting five weeks to just download the world because I'm stuck on 2Mb/s ADSL.

Germany? :-)

UK, looks like.

Have you seen some of the demos people have been building with Unreal 5.3? Insane stuff. In a decade, this stuff will be hard to tell from reality.

Sounds cool - got any specific ones to share?

That's really good, though I also want to point out some of the amazing graphics that modders accomplished in Crysis (2008): https://youtu.be/3w6COXBfIY4

The new avatar (blue skin not arrow) game looks like this demo with some characters tossed in to control.

The geometry and lighting is amazing but I couldn’t detect any animation, like gentle motion due to wind.

Some motion due to wind, in another video: https://youtu.be/_B9hkn6wgNA?feature=shared&t=24


How long before someone manages to clone him/herself online and apply for relatively simple gig work? And then duplicates that 1000 times, making millions with simple work. It's almost possible, I think. Clone your voice with this, clone your looks by wiring ComfyUI-like SD nodes to your webcam. Everything instructed/orchestrated by some AI agent controlled by ChatGPT. Some wiring logic is what you need to make. The only thing you as the real person have to do is answer some critical decisions, which are sent by a push notification.

By the time a machine learning model can replace 1000 workers they’ll just stop hiring real workers. What remains will be tasks which can’t be automated.

Or isn't allowed to by law, e.g. security etc.

Multi input? Infilling? First generative audio model I've seen that starts to close the gap with image models.

> We’re inviting researchers and institutions who have been previously involved in speech research, and who want to pursue responsibility and safety research on the latest Audiobox models, to apply.

It's not clear as to what the expected outcome of this 'responsibility and safety research' effort is. Is the idea to nerf the tech such that it can't be used for morally/ethically nefarious purposes? If so, then is the "speech research" community the group best fit to do that work?

Have weights been released?

Edit: nvm, seems not from this line "In the coming weeks, we will be opening up the application here, along with an interactive demo that will showcase Audiobox’s capabilities."

The speed of progress is just incredible. I've been using all kinds of different TTS engines for years now and the rapid pace of advancement is awesome. I usually generate all my audiobooks from ebooks and articles and the quality and stability (think artifacts) has gone up so much in the last few months.

What’s the best way to try these models out?

Does Meta usually provide a web interface for them or do you have to download and run locally?

I think that for artificial intelligence to become like humans, it should be treated under the same conditions as natural humans. It should be able to see the surrounding environment, listen to the surrounding sounds, smell the surrounding smells, and taste the surrounding food. It should be given parents and relatives, and should be given its own partner and its own country. In this way, the artificial intelligence trained in the environment will naturally be more like human beings and have its own emotions.

What you're looking for is embodiment, and actually this was explicitly left aside in a recent paper that attempts to give measurement criteria for AGI[0]. But I agree with you that the entire lived sensation is critical to approaching any objective involving alignment.

[0] https://arxiv.org/abs/2311.02462

What if the "environment" for an AI is just "the internet".

A long time ago, there was a great story in a Shadowrun supplement of all things about a hacker that got trained to teach an "ai" how to break into computers. It was basically at a child's level, emotionally, and the only world it's ever known was "the matrix" (yes, really -- and written almost a decade before The Matrix came out). Eventually it turns out that it's not an ai, but a corporation was stealing kids and sticking them in Virtual Reality at birth to train a team of super hackers.

Regarding their "responsible" model, Meta's engineers aren't stupid. They know that:

1. No TTS audio output is tamper-proof. Their "safeguards" will be busted, and quickly. Whether via a small adversarial NN, some basic DSP, or just...holding a cheap recorder near your speakers, maintaining audio file provenance has no chance.

2. Impersonations have vexed humanity since the invention of vocal cords. Insofar as it's soluble, it's been solved -- authenticity is determined by a fluid mixture of context, trustworthiness, and the authority of involved parties & institutions. Always has been. Always will be. If I could drill one idea deep into every tech evangelist's head, it'd be: The solution to every problem isn't automatically "more technology." But hammers see only nails, so the vicious cycle continues, and society deals with the consequences (e.g. cryptobros decentralizing money...by slowly reinventing banks, but with more fraud).

3. This secret audio ID "feature" is probably harmful. It adds needless complexity. At best it exacerbates a false sense of safety because impersonation is trivial. Bad guys can emulate it on authentic recordings to discredit them as "fake." Nobody who'd actually benefit from such safeguards will respect them. News says this audio that affirms my confirmation bias is fake? Nah, the news is fake.

Meta knows all of this. Optimistically, I hope it's just lip service to concern fetishists; plausible deniability for the knife manufacturer when a bad guy uses one. Pessimistically, it might be pretext for an about-face on their OSS commitment. "Oops, researchers trivially broke our safeguards. Shucks. That's scary. Guess we'll build a moat instead of an OSS community. Think of the children or terminator or whatever works these days"

I suppose we'll see.

I think the release of closed source models right now is a net negative and worth opposing. Right now we’re building a future where the very wealthy and powerful will control access to AI on ethical grounds, while they have uncensored access to the latest and most powerful models. Innovation, high frequency trading, medical breakthroughs, creative output - all of these and more will be enhanced by AI, and you’ll be eating leftovers and paying a fortune for them, wondering why you can’t keep up - unless we enable a vibrant open source ecosystem, and force big tech to release models into that ecosystem.

Support open source models by celebrating their release and pressuring companies to release them, and oppose closed source AI or face a very bleak future for you and your descendants.

You may be having fun with “Open” AI’s API today, but you’re supporting and celebrating the collapse of society into megacap AI elites and a majority paying for metered access to old technology.

A utopian future would be FOSS LLMs that are private and run locally. The opposite of one where models are public, proprietary - running on your data and owned by just a few large entities.

I mean sure.. but imagine if this were open sourced as is. This is new tech that has barely had time to mature. The possibilities for abuse are endless. I for one am happy that this model isn't being open sourced. This is an excellent way for people to generate all kinds of disturbing and fake audio clips.

Uh yeah, who cares? Give an example of something this audio generator could make that is dangerous.

Same logic could be applied to Linux by Microsoft in the 90s. In fact, the “It’s for your safety” has been applied to some of the worst things humans have perpetrated including apartheid (which I lived through) and the holocaust. And it’s always those that claim to keep us safe doing the worst. And it continues with perceived dangers providing pretext and moral authority to do bad things.

"If we were to release our kettle and chickens, they would go extinct within a few years."

"I just hold on to all the money, 'cause bitches can't be trusted with it. We pool all the kissing money together, see? But if you wanna buy anything, you just talk to the bottom bitch, and then the bottom bitch talks to me.

Do you know what I am saying?"

Choose OSS, or put that mouth to work on someone's AI API.
