Hacker News
Voicebox: Generative AI model for speech that generalizes across tasks (facebook.com)
219 points by OkGoDoIt on June 19, 2023 | hide | past | favorite | 118 comments



> As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm. In our paper, we detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox to mitigate these possible future risks. We believe it is important to be open about our work so the research community can build on it and to continue the important conversations we’re having about how to build AI responsibly, which is why we are sharing our approach and results in a research paper

Learning from the pushback on releasing LLaMA, it seems. I wonder how hard this will be to replicate. (Trained on “60K hours of English audiobooks and 50K hours of multilingual audiobooks in 6 languages for the mono and multilingual setups”, this doesn’t sound intractable.)


Assuming 10 hours apiece, 6k books feels like a very achievable dataset. Even LibriVox [0] claims 18k books (with many duplicates and hugely varying quality levels). If you wanted to get expansive, you could dig into the podcast archives of the BBC, NPR, etc., which could potentially yield millions of hours.

[0] https://librivox.org/


Older BBC material, from when the "standard BBC" voice reigned supreme, might make a great training set, and could/should be publicly accessible?


From the paper:

> Model Transformer [Vaswani et al., 2017] with convolutional positional embedding [Baevski et al., 2020] and ALiBi self-attention bias [Press et al., 2021] are used for both the audio and the duration model. ALiBi bias for the flow step xt is set to 0. The audio model has 24 layers, 16 attention heads, 1024/4096 embedding/feed-forward network (FFN) dimension, 330M parameters. We add skip connections connecting symmetric layers (first layer to last layer, second layer to second-to-last layer, etc.) in the style of the UNet architecture. States are concatenated channel-wise and then combined using a linear layer. The duration model has 8 heads, 512/2048 embedding/FFN dimensions, with 8/10 layers for English/multilingual setup (28M/34M parameters in total). All models are trained in FP16.
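For illustration only, here is a toy pure-Python sketch (not Meta's code, which operates on tensors) of the symmetric-skip-connection wiring the paper describes: the first half of the stack records each layer's output, and each layer in the second half concatenates its input with the matching early state channel-wise, then mixes it back down with the "linear layer". Whether the skip feeds the layer's input or output is one plausible reading.

```python
def unet_transformer(x, layers, combine):
    """Toy sketch of UNet-style symmetric skips in a layer stack.

    x: list of channel values; layers: callables mapping a state to a
    new state of the same width; combine: callable mixing a concatenated
    2*d-channel state back down to d channels (the "linear layer").
    """
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i >= n - n // 2:                # second half: consume a skip,
            x = combine(x + skips.pop())   # concat channel-wise, then mix
        x = layer(x)
        if i < n // 2:                     # first half: record the output
            skips.append(x)
    return x
```

The stack discipline gives exactly the pairing described: the first layer's output reaches the last layer, the second layer's output reaches the second-to-last, and so on.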


That little stinger at the end was not as surprising as they thought it was :P

It's very cool tech, but it's far from transparent. It has a very obvious "autotune"-like sound to it that jumps right out. When they edited that one word, it was obvious it had been edited.

Again, super cool tech, just not going to replace voice actors or anything.


For me it's more like I wake up and check whether humans have been replaced yet. Oh good, it's another day that I don't have to share one-time pads with my mother to ensure that I'm talking to her and not a simulant performing fraud on a massive scale.


Imagine when they have robots indistinguishable from humans, with some sort of real-skin face that can be morphed into any existing or non-existing face. I mean, it almost seems easy, or inevitable at least.


Cryptographic proof of personhood is going to be a thing, is it not? Outside of BigTech, Signal is as poised as WorldCoin to be just that.


I’m just not convinced that anything not tied directly to a government issued ID is going to be strong enough.


Most of the time you want to confirm that you're talking to someone from a given context -- they own a specific Twitter account, or you met them at a party last week, or they sent you an email or were present in a meeting that you want to have a conversation about.

Government ID doesn't help much with those -- it's actually the thing that is not strong enough.


Yes and it's going to be done through digital IDs. Unless something dramatic happens, we're poised to turn to digital IDs linked to your real ID and in turn validating access to apps/communication.


And authoritarians everywhere will rejoice (and they will give out the means to duplicate these IDs to a select few in case they need to generate evidence that 'you' have offended the state).


The only sensible approach to this problem (assuming it is a real problem) is trusted individuals certifying others as human.

There are alternatives to using the state for this, but they are difficult and fraught with UX issues. Perhaps a decentralised web of trust or some sort of blockchain based registrar of trust that can trace trust routes between mutually distrusting individuals.

Unless such a system is in place, international and strong before states start playing in this space, there isn't much chance of beating a state's approach to the problem.

Just look at https certificates. The current system involves browsers shipping configured to trust a whole bunch of entities I don't really trust, and there has been relatively little interest in trying to build a working decentralised approach to site security.
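The "trace trust routes" part of that idea is, at its core, just path-finding over a graph of attestations. A toy sketch (deliberately ignoring signatures, revocation, and Sybil resistance, which are the actually hard parts):

```python
from collections import deque


def trust_path(vouches, src, dst):
    """Find a chain of "X vouches that Y is human" attestations.

    vouches: dict mapping each id to the set of ids it certifies.
    Returns the shortest chain from src to dst via BFS, or None.
    """
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in vouches.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

Note that trust is directional here: Alice vouching for Bob says nothing about Bob vouching for Alice, which is why a real web of trust needs so much more machinery.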


I'll also add that some of the state proposals have their priorities in quite a nice place, such as prioritizing working in the open: sharing research, working with standards groups, and open-sourcing tooling.

It would be nice if we could come to a thorough solution that actually does cover all bases, rather than all these companies trying to create their own digital ID services that just encourage us to instead do silly things like photograph your ID front and back.

I mean, hell, it's taken like 20 years for privacy by design to become an ISO standard. That sort of timeline is not something we can really tolerate as more and more people rely on online services and in turn wind up trusting horribly outdated techniques and a general malaise about data.


> It has a very obvious "autotune"

To me it has a very obvious "Hindi is my native language" accent. I mean, after literally the first sentence: "The research team at Meta is excited to share our work...". Ouch. The "our work": just ouch. I was wondering why a native English speaker wasn't presenting the video, when the video is precisely about generating speech.

The first seven seconds are particularly bad.

Don't get me wrong: I've got a lovely French accent when I speak English.

This has either been trained on too many audiobooks read by non-native speakers, or they've used their own tech, with the "reference audio" given as input coming from a non-native speaker.

In any case something is seriously off.

At 1:59, the "Hi guys, thank you for tuning in! Today we are going to show you...": that is obviously a Hindi speaker speaking (it's an example of fixing a real voice by removing background sounds).

I think that the main voice of the video was done by the same person who did the example at 1:59, and that they used that clip as the "reference audio".

And that person ain't a native English speaker.

To compare: when the reference audio uses a proper English accent (the example with the "diverse ecosystem" at 0:52), the text-to-speech output sounds native.

I think they just fucked up the demo video, and it may already be ready for prime time.


Maybe they deliberately chose an accent that wasn't native English to demonstrate the style transfer capability. I think the ability of the system to output accented voices is a strength not a weakness, so long as it can do other accents too.


I'm surprised you had such a negative reaction to the Hindi accent! To me, it was no more difficult to understand than my colleagues who speak English as a second language.

To me, this is a style choice for the demo. Not evidence that they "fucked" it up. Accents are common - everyone has one! It's nice to see the model can support your personal voice even if it's not completely neutral English.


> It's nice to see the model can support your personal voice even if it's not completely neutral English

There is no such thing as "neutral" English.


> Nonetheless, a form of speech known to linguists as General American is perceived by many Americans to be "accent-less", meaning a person who speaks in such a manner does not appear to be from anywhere in particular. The region of the United States that most resembles this is the central Midwest, specifically eastern Nebraska (including Omaha and Lincoln), southern and central Iowa (including Des Moines), parts of Missouri, Indiana, Ohio and western Illinois (including Peoria and the Quad Cities, but not the Chicago area).


> Nonetheless, a form of speech known to linguists as General American is perceived by many Americans to be "accent-less"

TLDR: "neutral English" is like "neutral water temperature": it feels neither hot nor cold because it matches one's body temperature. It's subjective, and terming it "temperatureless water" is even less accurate.

I'd put emphasis on "perceived" and "American" in that statement, and also note that this is limited to regional accents: General American is unambiguously American. Similar to General American, many countries have developed a "Newscaster" accent, e.g. Received Pronunciation for Britain, but it's not considered neutral as it is the "upper class" accent.

In every language I've known well enough to distinguish accents, I've noticed newscasters adopt a distinct accent/cadence that's not commonly used. But I wouldn't call it "accentless" - it's just another accent that may or may not have evolved from a culturally dominant regional accent (or a dominant figure from a specific region).


The accent was obvious enough that I wonder if they might not have been trying to hide it at all. Maybe they just happened to pick somebody from the team with a very mild accent.


The accent was part of the show. They demonstrated how to create an accent from scratch: sample a voice in the accent's original language (हिंदी, français) and then have that voice read text in the target language (English). Voilà, accent.


For video narration elocution, I'd say it was most of the way there.

When. Narrating. Videos. One. Tends. To. Speak. Differently.

Or, the more important case -- if I'm listening to audio-version-of-X, is it sufficiently human-like that I can forget that it's synthesized voice?

To me, yes.

Easy to tell if you're specifically listening for it, but to use an analogy one doesn't typically read novels and parse closely for grammar, does one? Your attention is elsewhere, on the content and plot.


SPOILER ALERT :p

It's more surprising nowadays if an article about AI doesn't have a "twist" that what you read/heard/saw was AI.


I agree with you that v1 isn't a suitable replacement... right now. But what about v2? v3?


I don’t think GP is saying it won’t improve. The thread is about the current state of the tech that Meta wrote this article about.


What stinger at the end?


Not releasing code or weights under the false pretense of misuse.


No, they are referring to the end of the video, where they have a “surprise twist” that the video narration was autogenerated and not a real person.


I mean that wasn’t even slightly non-obvious.


False? The misuse opportunities are obvious.


Reproducing the code is a matter of time, and a short one at that.


I think that the "star trek" use case of a live translation is super exciting. I think that this also will force people to have pass phrases that they use to authenticate phone calls with. I normally downplay when people bring up everyone signing everything with a public/private key (impractical for normal users) but clearly there will be a need for authentication protocols as AI proliferates


Coming to an Android and iPhone near you: voice authentication. Audio streams carry inaudible data, for example in frequencies you can't hear: a mutually known but changeable token (like the current time) encrypted with your private key and embedded in the stream. Your phone app queries the service with the time of the call and the data received, and if the caller is also a subscriber it is able to discern their identity with certainty.

If this feature is standardized and built in it could be paired with a token like a yubikey which is on the user's keychain and authenticated even if they were using someone else's phone.

All these details can be completely opaque to the user, who just needs to see a blue check by calls made with a device verifiably logged in as name@provider and a red exclamation beside calls made with no info at all.

Think fake-kidnapping scams, industrial espionage, or the recent attacks involving someone pretending to be in HR to redirect corporate transfers to the attacker.

Just a little doubt is liable to blow up such scams.
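For concreteness, here's a hypothetical sketch of the token part of that scheme using a shared secret and the Python stdlib. A deployed system would presumably use public-key signatures and a provider-run PKI instead of a shared key, and the 30-second window is an assumed value:

```python
import hmac
import hashlib

WINDOW = 30  # seconds per token window; an assumed value


def call_token(key: bytes, now: float) -> bytes:
    """Tag the current time window with a shared key (HMAC-SHA256)."""
    window = int(now) // WINDOW
    return hmac.new(key, str(window).encode(), hashlib.sha256).digest()


def verify_token(key: bytes, token: bytes, now: float) -> bool:
    """Recompute the tag for the current and adjacent windows, to
    tolerate clock skew between caller and verifier."""
    window = int(now) // WINDOW
    return any(
        hmac.compare_digest(
            token,
            hmac.new(key, str(window + d).encode(), hashlib.sha256).digest())
        for d in (-1, 0, 1))
```

Embedding those bytes inaudibly in the audio stream (the watermarking half of the idea) is the genuinely hard signal-processing part, which this sketch doesn't touch.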


> If this feature is standardized and built in it could be paired with a token like a yubikey which is on the user's keychain and authenticated even if they were using someone else's phone.

Remember the standard tech bait-and-switch: if this feature is built in, it'll not be good enough to function for purposes you describe, but it will be good enough to track you by your voice for advertising purposes.


In a fake kidnapping scheme, it doesn’t seem like that much of a stretch to say “I’ve been kidnapped, and they took my keys, so I don’t have my yubikey” or something along those lines.


A lot of these scams evaporate like a soap bubble when someone questions them. All it can take is calling the person's phone to verify that they aren't actually kidnapped.


How "live" can translation ever really be? Properly translating anything from one language to another requires context.


While I’m not necessarily in favor of this, a multimodal AI that has access to your location, vision inputs, etc. could obtain much of this context. People already explore foreign countries by cobbling together these services.


Human translators do it in approximately real-time, see for example the United Nations. At most you might need to wait for the end of the sentence to translate if e.g. the verb/object ordering is different.


like... a pin?


In the US, some phone companies have been using voice recognition to authenticate their customers when they call. That will definitely have to end.


HMRC in the UK also uses it. It has never worked once for me, and I don't even have any kind of accent.


I am mostly excited for cheaper audiobooks with consistent voices for different characters.


I'm excited about them making it faster to produce. I finished the most recently published audiobook in a series this weekend. The author posts unpublished chapters to a site called Royal Road. I listen to books while running and driving, so it's a non-starter to visually read them. It would be nice to have that pipeline accelerated.

Now, I just want to talk about my little weekend project... I spent a couple of hours scraping Royal Road and trying to get TTS working. Eventually, I settled on:

1. `wget --recursive`, filtering only the chapters
2. A python script to strip extraneous html like advertisements and the headers
3. Pipe into pandoc, emitting plain text
4. Copy it to my phone for TTS: https://f-droid.org/packages/com.danefinlay.ttsutil/
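A rough stdlib-only sketch of what step 2 (stripping the extraneous HTML) could look like; the skipped tag names here are guesses, since Royal Road's actual markup would need inspecting:

```python
from html.parser import HTMLParser


class ChapterText(HTMLParser):
    """Collect visible text, dropping ads/nav-style container tags."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_page(html: str) -> str:
    p = ChapterText()
    p.feed(html)
    return "\n".join(p.chunks)
```

The plain-text result can then be piped into pandoc or handed straight to the TTS app, per steps 3 and 4.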

I really wanted to use all local tools, but I just couldn't get any of the Linux tools to sound as good or work as fast as Google's TTS services. Also, the paid TTS services I found were just too expensive to justify (~$70 for a 20-hour book).

I'm more than happy to additionally purchase the audiobook when it is published. I just don't want to wait.


Just FYI, since you ended up using an external TTS anyway:

https://beta.elevenlabs.io/speech-synthesis

is vastly better, especially for fiction.

Also worth trying is: https://speechify.com/


Yeah, it isn’t so much that I want publishers to have a cheaper way of making an audiobook that avoids the (apparently minimal) cost of employing a voice actor.

I don’t want to wait for the publisher to decide they want to do an audiobook.


This is a huge issue in the voice acting community. There have been frequent recent discussions over at https://www.reddit.com/r/VoiceActing/.

For what it's worth, most of the cost of audiobooks doesn't come from paying talent. For intermediate-level actors, the going rate is around $50-$100 per finished hour (PFH), and for experienced actors it can be around $250-$300. This page does a decent job of laying out pay structures for audiobooks: https://speechify.com/blog/whats-the-meaning-of-per-finished...

An 8-hour audiobook might cost the author/producer about $1,800-$2,000.

Just talking about Audible exclusively, they take about 50% of sales. But it's kinda wishy-washy exactly how much an author will earn in royalties; it's not as much as you might think. Good article here from an author that lays out some sales numbers: https://selfpublishingadvice.org/how-audiobook-authors-are-p...

The other way that a narrator can get paid is called royalty share: the author/producer doesn't pay the narrator anything up front, and the voice actor instead relies on a small percentage of each book sale to get paid. Theoretically, if an audiobook ends up really taking off, the narrator could make a lot of money. But that rarely happens. Most audiobooks that you find on Audible have very, very low sales volumes.

To sum it up: it doesn't occur to lots of audiobook fans, but voice acting is a very competitive industry. It takes a lot of work to make a name for yourself, and even then the most successful actors probably aren't making much more than a highly paid software engineer. For most wannabe voice actors (including myself), it's something you do more for love than to make a career out of it. Of course, some people do make a career of it, but not the majority.

This is all why I'm personally not a fan of these voice generation models. It's going to eventually make this niche industry non-competitive for real humans, except for the talent that is already established. People keep blaming the actors for being too expensive when most are barely getting by without secondary jobs.


There’s also a different possible take on this.

Voice acting seems to be a really bad career, so eliminating that job is desirable if you can deliver same quality/better product for cheaper to customers, without requiring employees to be underpaid.

I know it sucks for people in that industry, but technical progress always eliminates jobs. Calculator used to be a job; now it’s a device.


Caveat though:

> if you can deliver same quality/better product for cheaper to customers, without requiring employees to be underpaid.

This almost never happens. Cheaper? Yes. Same or better quality? Not a chance. Automated solutions tend to allow reducing quality way below what human workers would want to do, or even could do cheaply and reliably (i.e. doing a worse job than a careless one takes actual effort/skill). Like with every other case of automation replacing humans, expect the quality to be pushed down to minimum tolerable levels, as this is the point that maximizes revenue.


I think you’re romanticizing the quality of human work. The world is full of employees who couldn’t care less and will take the path of least resistance to get a paycheck. Removing humans from the loop often directly leads to improved quality. Of course, those are the same humans who make decisions about how to use automation, so it’s not a panacea.

There are tons of things where quality improved immensely due to automation: engines, drugs, batteries, just to name a few.


I'm not romanticizing the quality of human work. I'm not claiming people give more of a shit than it seems. I'm claiming that with humans doing the work, quality can only get so low[0], and automation lets you punch through that floor, achieving much lower quality standards.

Or, put another way:

> Removing humans from the loop often directly leads to improved quality.

Yes, but improved quality for the same price means leaving money on the table, so approximately every business immediately drops quality to the baseline and pockets the difference - and from that point on, competition will optimize the quality further down.

> Of course, those are the same humans who make decisions about how to use automation, so it’s not a panacea.

It's not the humans being replaced who make that decision - it's their bosses, who rent or own the automation, that make the call.

> There’s tons of things where quality improved immensely due to automation. Engines, drugs, batteries, just to name few.

Sorta, kinda. In areas with strict regulatory standards? Yes. In areas where automation improves both cost and quality, and the competitive pressure isn't very strong? Sure. With products not yet commoditized? Often enough. When it enables market segmentation? Of course.

But then you have commodities, or automation replacing people directly on the "critical path" of the value chain. That's where products and services go to shit. Bonus points if the automation allows engaging customers in "self-service" - i.e. outsourcing work to the customers.

Case in point: automated checkout machines in stores. They reduce jobs, but in theory, they could reduce queues, increase throughput, and make shopping more pleasant - win-win deal for everyone - even the cashiers could be shifted to oversight/support jobs, ensuring increased throughput and more profit for the store, for the same number of employees.

In practice, it turns out the optimal setup for the store is deploying way too few machines, and instead of having dedicated employees for oversight/support of the machines, those responsibilities are just tacked on to the workload of the existing (reduced) work force. As a result, queues are longer, customers are frustrated, overall shopping experience is shit - but the store knows perfectly well the customers will endure it anyway[1].

The market optimizes for profits, not quality or happiness. It's not just greed - money is the lifeblood of companies, and without it they die. As a result, however, competitive pressure ensures that any value or virtue that can be sacrificed to improve profits, will be sacrificed. Those who refuse get outcompeted by those who make that sacrifice. The ratchet turns, and the sacrificed value is lost forever.

--

[0] - There are many limiting factors. If the business is pushing down quality of human work too hard, they'll eventually have to deal with employee frustration, or hit limits imposed by OSHA or labor law, or just a soft limit where producing a fixed amount of goods/services costs X in labor, and there's no point in trying to save 0.1X on quality if it requires workers to put effort, which will make them produce less per unit of time, or increase variability of output, or both.

[1] - There are many reasons for it, including customers being price-sensitive to the point of irrationality, usually valuing their free time at 0, and being easy to confuse with a constant churn of deals. Stores also know that frustration is a fleeting feeling, while a well-crafted product selection makes a store/chain sticky. Notice how automated checkout machines tend to proliferate in grocery stores and drugstores, and are seldom seen anywhere else: that's because they work best in places where customers are susceptible to the factors I described earlier - and thus will endure a bad experience and still come back for more. It's not like there are alternatives - competitive pressure ensures all competitors offer an equally shitty experience. The ratchet made a turn; there is no going back.


> Case in point: automated checkout machines in stores.

I don't know where you live, but I've never seen automated checkout machines. I have only seen self-checkout machines, which require the customer to do the cashier's job, and that's all.

The only reason it's not good is that it's not automated enough (if at all -- for me the self checkout machine is literally zero automation more than a regular cashier)


Yes, I meant self-checkout machines.

> The only reason it's not good is that it's not automated enough (if at all -- for me the self checkout machine is literally zero automation more than a regular cashier)

That's the point. But you are not the buyer of that automation; the store is. That automation displaced human cashiers and lowered the quality of service for customers, while generating better margins for the store (promptly eaten by competition). From your POV, i.e. the customer's POV, it's not automated enough - but it's not going to be for quite a while, because there is no incentive to do it. The store doesn't stand to benefit much from additional automation, not enough to justify the investment. Whether or not customers like it is irrelevant, as long as they're still coming in anyway.


I really wonder whether you have seen Midjourney, Stable Diffusion, ChatGPT or any of the other recent trendy AI things.

You can't find an illustrator who could "cheaply and reliably" do illustrations at Midjourney's level. You just can't. If you could, you would have become the biggest contractor company in the world a long time ago.


"Midjourney's level" is precisely the quality I'm talking about. It's impressive for what the computer can do, yes. It's not impressive when you find it coming out of a black box labeled "commercial commissioned art", not when compared to what used to come out of that box for about the same price. The images are... almost OK. But there's always something. A missing finger here, a tiny extra eye there, some psychedelic pattern faintly visible in the negative space, etc.

But what can you do? Every black box labeled "commercial commissioned art" is now returning similarly off images, almost but not quite there. They all dropped their prices a little, so there's that - while the few black boxes offering the quality that used to be normal now cost 2-3x of what used to be normal. Hard pass.

(Meanwhile, people operating the black boxes - i.e. companies or in-house departments churning out commercial graphics cheaply - are swimming in money made by firing all their minimum-wage artists, replacing them with Midjourney or SD, and pocketing the difference. Sure, they had to drop their prices a little to clear out the remaining human-powered competitors, and they will have to drop them way further once the competition restarts in earnest - but for a short moment, they all get to make small fortunes selling shit output, that's 100+x cheaper to produce, at roughly the same price as mediocre output before.)

Can AI be used to generate much higher quality at the same cost as human art? Sure - you'll need to spend what you used to pay the artist you just fired on generating variants and having a (cheaper, at least per unit of output) human select the best ones - but yes, AI can give you much better quality for the same price. But AI can also give you the same quality as before for cheaper, or somewhat worse quality for much cheaper. Which is the best option to choose?

The answer, I claim, is that there is no choice - competitive pressure will force everyone to go for the shittiest quality the market can bear, sold almost at cost. This will satisfy enough demand that the "standard quality" offering becomes something very expensive or outright unavailable, as the economics of using minimum-wage factory artists suddenly stop working.


Perhaps it's not the bar the GP was applying, but I think "good enough at 1/10 the price" is quite empowering for consumers. Consider all of the people who can't currently afford Audible, but who would like to listen to audiobooks while they commute to work, for example.

And of course, nothing stops you from paying what we pay now for human voice actors if there continues to be a quality differential that customers care about. (Though perhaps Baumol's Cost Disease would push the price up for today's human-generated quality.)

Extrapolating further -- if the commoditized version of audio books is AI generated voice, perhaps the new job for voice actors is human narrating/acting of AI-generated content for personalized stories ('Ractives from Stephenson's book "The Diamond Age"). Who knows, human voice actors could become more in demand, not less. To be clear I wouldn't forecast this as the most likely outcome, just pointing out that there are many possible outcomes.


> I think "good enough at 1/10 the price" is quite empowering for consumers.

That's the thing though - it's not as empowering as it seems longer-term, because the "good enough" quickly drops to "barely fit for purpose"/"if it were any worse, it would be illegal to market or sell". This has been the case with most established classes of products I can think of, including pretty much anything that's been fully commoditized.

And so

> nothing stops you from paying what we pay now for human voice actors if there continues to be a quality differential that customers care about. (Though perhaps Baumol's Cost Disease would push the price up for today's human-generated quality.)

Nothing stops me today. But even if the quality differential exists, the dropping price of the low-quality version will reduce demand for the moderate-quality version, pushing its prices up and reducing the number of suppliers (here, voice actors). The end result seems to always be a bifurcation: there is not enough demand to sustain a business doing decent-quality work for a reasonable price, so all companies move to providing either low-quality work cheaply or high-quality work at a hefty premium. The middle disappears.

In the specific context of this thread, the middle in question is the current quality of audiobooks with voice acting. The quality level available to most consumers will be below that, and the next step up will be niche recordings at high cost.


Sounds like the voice acting community is going to join the taxi driver community in the near future.


Why would they make it cheaper when they can make even more profit by not having to pay a voice actor?


If they make it better for the reader, they can potentially raise the price. If they can make it cheaper to produce, they can potentially increase their profit without raising the price.

Usually on balance this falls somewhere in between -- more value for less money for the consumer, and more profit on each marginal unit of production for the producer, which is how technology progresses across most consumer goods.


Because it would be extremely easy to produce it cheaper or record it independently.

This opens the door for unsigned authors to release audiobooks.


Yes, but they're not going to pass on the savings from not using a voice actor. They're just going to charge what they normally would have, and not worry about having to give so-and-so a cut.


Sure, and then we'll get more audiobook options, as it becomes economically viable to make more niche stuff into audiobooks.


Niche stuff has always been economically viable. Niche stuff also tends to get publisher support. Realistically, these days the only reason a book doesn't have an audiobook format is not wanting to support Amazon/Apple, or because the author just doesn't want one. Voice actors literally are not expensive for audiobooks. You can absolutely afford one if your tiny $9 book sells more than 200 copies.

Unless you're talking about illegal niches, in which case, fair, but I highly doubt stores are going to accept those. All generation helps with is content that will be free.


I think you’re underestimating how much work goes into hiring and producing something. Making it more self-service will go a long way toward making it more accessible to small creators. And my impression is that many books don’t sell 200 copies.


Get a text copy of the book and then pass it through the tool yourself, right? Who cares how much the publisher wants.


I've been watching the text-to-speech space for a while, waiting/hoping for something both open and better than Coqui TTS. ElevenLabs sounds amazing but is super expensive for something like a book, and Tortoise TTS is so slow as to be unusable.

I wrote a quick python script to read an ebook using coqui and the end result sounds pretty good. It's come in especially handy for books I want to listen to while doing yard work and stuff around the house.

https://github.com/aedocw/epub2tts
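For anyone curious what a script like that involves, the core is just sentence-level chunking plus a per-chunk synthesis call. Here's a rough stdlib sketch of the chunking step (my own toy version, not the epub2tts code; the Coqui call is shown only as a comment since it needs a downloaded model):

```python
import re

def chunk_text(text, max_chars=400):
    """Split text into sentence-aligned chunks short enough for a TTS model.

    Most TTS models degrade on long inputs, so epub-to-speech scripts
    typically synthesize chunk by chunk and stitch the audio together.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# The synthesis loop would then look something like this (requires the
# Coqui `TTS` package and a model download, so it's left commented out):
#
#   from TTS.api import TTS
#   tts = TTS("tts_models/en/vctk/vits")
#   for i, chunk in enumerate(chunk_text(chapter_text)):
#       tts.tts_to_file(text=chunk, file_path=f"part_{i:04d}.wav")
```

The chunk size is a judgment call: too small and prosody suffers at the seams, too large and some models start dropping or garbling words.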


I was wondering if it would be possible to build something like this right now.

Use text to speech and chatgpt to tag the character text and timestamps.

Then use a speech to speech to change the character voices or even the whole reader.

But as a product I feel like theres some legal hurdles to figure out.
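As a toy illustration of the tagging step, here's a hedged sketch (nothing like a production attributor, and not what an LLM pass would do): it just separates quoted spans from narration, so a later pass could assign each dialogue segment to a character voice.

```python
import re

def split_dialogue(text):
    """Split prose into (role, text) segments.

    Quoted spans are marked "dialogue" (to be attributed to a character,
    e.g. by an LLM pass), everything else "narrator". A real system would
    also need speaker attribution and timestamp alignment against the audio.
    """
    segments = []
    # Splitting on a capture group keeps the quoted text in the result,
    # alternating narrator / dialogue / narrator / ...
    for i, part in enumerate(re.split(r'"([^"]*)"', text)):
        if not part.strip():
            continue
        role = "dialogue" if i % 2 == 1 else "narrator"
        segments.append((role, part.strip()))
    return segments
```

Even this crude split gets you segments you could hand to different voices; figuring out *which* character is speaking is where the LLM would earn its keep.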


The best we can do is more robot spam.


Oh that is an amazing use case!


I've been working on that for my hobby writing project. I'm using Elevenlabs' API and homegrown scripts to automate audio-generation and synchronization between text and audio. I have separate voices for the characters (and the narrator, when the narrator is third-person speaker). Below, there are some links to a section of a chapter that you can download and judge for yourself.

This is a huge boon for independent authors, until AIs replace us as well :-) .

Things I have learned:

* A good human narrator could do much, much better, but the quality obtained this way is not totally terrible.

* The possibility to produce a section in a matter of minutes is a huge plus. The thing with a book is that it's never totally finished. If you discover a problem after you have submitted your text to a human narrator and paid $ XXXX, there is nothing you can do.

* Currently, there is no platform that I know of distributing and selling books like this. Audible only accepts audiobooks narrated by humans. To my knowledge, platforms that accept ebooks don't handle epub with media overlays. Well, Apple Books say they do but I haven't gotten it to work. There are no alternative platforms for audiobooks that I know of, but I haven't done a ton of research there.

* The possibility to have more control over the emotions expressed in the speech would be a bonus, particularly for small, overly dramatic parts of the narration. Coqui TTS's new editor is a step in the right direction, but their TTS doesn't yet sound as good as Elevenlabs. Voicebox seems promising, but there is no way to use it, at least for now.

* Cost is a big deal 1/3. With my scripts, I pay almost nothing when I fix a typo, since most of the audio is stored in little bits in the database, and only what changes is submitted to the API. But the human time of a narrator costs much more, as it should.

* Cost is a big deal 2/3. As a reader, I have learned that how much a book sells tells me nothing about how much I will like it. But only books with the potential to sell can afford audiobooks. If I want to listen to a story too quirky to be mainstream, or from an independent author I follow on Twitter, the chances I'll find it as an audiobook are next to none.

* Cost is a big deal 3/3. Voice narration is not the only aspect one needs to pay for. A good story needs an army of editors, proofreaders, and designers. Generally, the more an author or a publisher needs to disburse on those, the more bland and mainstream the book must become to sell and justify the investment.
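The fix-a-typo-for-pennies workflow described in the first cost point boils down to content-addressed caching. A minimal, hypothetical sketch (my own illustration, not the actual homegrown scripts; `synthesize` stands in for whatever paid TTS API call you use):

```python
import hashlib

def audio_for_paragraphs(paragraphs, cache, synthesize):
    """Return audio clips for each paragraph, synthesizing only what changed.

    `cache` maps a content hash of the paragraph text to a previously
    generated clip, so fixing a typo in one paragraph costs one API call
    instead of re-rendering the whole book.
    """
    clips = []
    for text in paragraphs:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = synthesize(text)  # the only paid call
        clips.append(cache[key])
    return clips

# Example with a stub standing in for a real TTS API:
calls = []
def fake_tts(text):
    calls.append(text)
    return f"<audio:{text}>"

cache = {}
book_v1 = ["It was a dark night.", "The end."]
audio_for_paragraphs(book_v1, cache, fake_tts)  # synthesizes both paragraphs
book_v2 = ["It was a dark and stormy night.", "The end."]
audio_for_paragraphs(book_v2, cache, fake_tts)  # only the edited paragraph is new
```

A persistent store (SQLite, flat files keyed by hash) instead of an in-memory dict is all it takes to make this survive between editing sessions.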

-----------------------------------

Note that this is a WIP. Book chapter with automatic narration:

An epub with media overlays. It requires an epub reader that supports that standard feature of the epub 3 specification; the ones I know of are Thorium and, for iOS, BookFusion.

https://drive.google.com/file/d/1U8XUB9xhu86JuketGH5WchM0obN...

An MP3 track from the epub above:

https://drive.google.com/file/d/1-u89ee52VZzGZ0oTGC_az5Uqbfs...


So are they releasing it or not? It's a nice PR statement but "we are not making the Voicebox model or code publicly available at this time". Phantom release?


They are holding it closed for the safety of the children and their future profit margins.


Seems to already be in the process of being reproduced by the community - https://github.com/SpeechifyInc/Meta-voicebox


Anyone know what tool is being used to create AI singing voices/renditions of Mariah Carey singing Whitney Houston to other popular songs?

Here's a bunch of results on YouTube and some are really good

https://www.youtube.com/results?search_query=mariah+carey+ai...


> we are not making the Voicebox model or code publicly available at this time

Hopefully it’ll do a LLaMA.


Pretty interesting. IIUC they haven't released the model weights, but I think most TTS papers in the upcoming months will release their weights, following OpenAI's Whisper.


I've been working on an open source audio editor which uses Whisper to slice speech. Very exciting to see more capabilities on the horizon!


Is this really better than eleven labs?


Just listening to it, it's subjectively not better, but if it's > 10x faster/cheaper, I would use it anyway -- it's good enough to be listenable.

Eleven Labs is the first voice synthesis that is good enough that I'd listen to an audiobook generated from it, but pricing is such that it would cost $100 to synthesize a 10 hour audiobook. A little too expensive. If they could get it down to $10 I'd cancel my Audible subscription and just synthesize audio from ebook text.

So if I can get a locally running voicebox model and just leave it running on my laptop over night transcribing an audiobook, that's even better.


Have you tried tortoisetts? I believe eleven labs basically forked that and made improvements on voice quality and speed there


tortoisetts is good but takes a long time to render audio even on its fastest setting with the fast fork.

Although it wasn't clear to me how voicebox compares.


How does it compare to Voicebox in quality?


I would say that properly configured Tortoise is better, but that comes with the massive caveat that Tortoise:

1 - Is a real pain to get 'working right' - it's not even remotely batteries included

and, more importantly:

2 - Is incredibly slow. I've been turning Heart Of Darkness into an audiobook as a unit test and it takes ~30m per paragraph, on average. Add to that the occasional hiccup where a block gets transcribed badly (Tortoise occasionally 'drops out' of its selected voice) and Tortoise only really works if you have a ton of compute and you still don't mind waiting forever.


FYI there’s also this fork for faster inference: https://github.com/152334H/tortoise-tts-fast


> So if I can get a locally running voicebox model and just leave it running on my laptop over night transcribing an audiobook, that's even better.

This is basically my dream for local AI... local models trained on my own data/code/styles. Even if they're slow, as long as they work (V/RAM) and are of high enough quality then I'm happy to wait!


I'm not sure. Eleven Labs requires some tweaking of the parameters, but I've created samples that friends and family are certain is me.


Came here to find out myself, still not sure.


Looking forward to being able to automatically generate audiobooks in famous voices. Gilbert Gottfried's "Blood Meridian" will be a hoot.


So where can we try this out?


> See voicebox.metademolab.com for a demo of the model

They are not releasing the model (yet?) but demo samples are available across many tasks


This is not close to "state of the art" in TTS. The output is clicky, low bitrate, and lacks vocal nuance. It's novel for using a "flow matching" approach in its architecture and being suited to cloud-based translation. Have a listen to U Washington or Google's "SoundStorm" instead!


This is like when Google published all those papers about their LLM tech, and then ChatGPT just launched something that worked. It will end the same way for their voice tech if they never release it. Someone will release something just as good and take the market.


All I want is a podcast player that automatically cleans up the crappy audio.


> introduces state of the art AI Model for speech

> narrator for the presentation is indian woman with lisp

every time


The narration was bad. I was thinking maybe it was the head developer of the project and they wanted her to narrate or something. Why would you choose that narration?


A common courtesy would be to check for naming collisions in the voice-generation domain.

https://github.com/jmiskovic/voicebox

If I ever start selling scrapbooks for collecting human faces I'll be returning the favor.


oh okay, so this is what they were recording me for...

hehe, fun world we live in.


Pretty impressive, but I had the video running in the background and it sounded a bit too sterile for my taste. Also, the narrator sounds a bit like she once gave a foot massage to Marsellus Wallace's wife.


can an expert in the field comment on whether the results are more or less impressive than Soundstorm?


I've got the same question; found some of the researchers for both projects on twitter and will see if I can get an opinion from one of them. Just waiting on verification to pm them. Will reply here if I hear back unless you have a contact with notifications.



Any available open source APIs?


And the source code for this one has....

...not been open sourced and cannot be found.

Sorry AI bros. Better read the paper this time.


It won't be a big barrier. Voice stuff is not as computationally intense as video or LLMs, so it's still an area where small teams or hobbyists can make a dent.


> There are many exciting use cases for generative speech models, but because of the risks of misuse, we are not making the Voicebox model or code publicly available at this time. While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility.

Yeah... I mean there definitely are ways to misuse this (especially the style transfer!), but I don't think you're going to do anything except delay the inevitable, Facebook.


They don't really care about misuse. They just don't want to say openly that they like to keep their shiny new tech and make money with it. No idea why; most people wouldn't bat an eye if you stated from the get-go: we built it, we'll use it.


I agree. Though in fairness they have opened other models e.g. for speech recognition.


Is Meta just operating on hope and AI moonshots now?

Every one of their products is just garbage to me and becoming less relevant by the day. When do they actually start building something useful again?

Honestly Apple seems to be using "AI" much more successfully and actually seamlessly integrating it into their existing products to improve them.

My theory is Mark is hoping the metaverse will pop out of Yann's bottom at some stage. Maybe he is right? I just can't for the life of me understand why the current products are so neglected.


Meta products have more users than ever


Doesn’t prove much honestly. How do you even know those users are real ? What value are the products providing ?

My Apple products continue to improve my day to day immensely. The meta products are just rubbish on the whole. How has Instagram improved in the last 5 years ?


I'm no fan of Meta but I'm glad there are companies investing in big moonshots like AI and yes, even Metaverse.


The anecdotal fallacy


[flagged]


Voice commands being usable is the one thing I miss from having an Android phone.


I think he’s talking about the new iOS feature that allows your text to be read in your own voice model.


Do you use an iPhone? They do pretty amazing things with images now. Even the search for a photo by text is quite amazing. I used it the other day for work for the first time and I found what I needed in my tens of thousands of photos. It almost truly felt like an extension of my memory; it was actually a pretty cool feeling.

So, sorry, I don't buy the Siri attack as proof of anything. I have found Siri has improved a lot. Even the speed at which Siri works is much faster. The other day I was driving in a really loud old car and used my Apple Watch to change the music. I said to myself "there's no way Siri will get it", and it did.

I also trust Apple with my data, at least for now. I don't even slightly trust Meta at all. I found Zuckerberg's interview on Lex Fridman more scary than before, as he said all the same freaky stuff disguised by a more "cool" and polished facade. The same dystopian ideas are still there.


Sent from my iPhone



