Hacker News
Google open-sources the Lyra audio codec (googleblog.com)
244 points by chmaynard 5 days ago | 139 comments

One thing I'm slightly worried about with "machine learning" in compression, rather than conventional everything-is-sines mathematical approaches, is the possibility of odd nonlinear errors. Remember the photocopier that worked by OCR and would occasionally mis-transcribe numbers?

I don't mind compressing a phoneme to <unintelligible> as much as I would mind it compressing it to a clearly audible different phoneme.

Are you aware that the same exact uncompressed recording sounds different depending on context? This is known as the McGurk effect.

Very worth your two minutes if you're not yet familiar with the effect: https://www.youtube.com/watch?v=2k8fHR9jKVM

While fascinating, that’s not the same as a codec failing silently by literally changing one word into another, equally clear word instead of getting fuzzy or unintelligible.

At the end of the day, it all comes to using the right tool for the job, and this is just another codec in your toolbox.

This is no different than using, for example, a probabilistic algorithm to solve some NP-hard problem in your real-world software. As long as you understand the limitations, I don't see an issue with using an algorithm that has a small, insignificant (for your use case) rate of failure. I would definitely not use this to communicate with the space station, but in the right context (Google Duo, low bandwidth), it's the perfect tool.

It would be interesting to see how a court would interpret this. Just wait for the next high profile SEC shakedown.

That’s probably meant for another thread

I meant phone taps.

[disclaimer: Personal opinion, not that of my employer.]

I had a coworker play me before/after of an early version of the codec "babbling" and it was definitely uncanny valley. It looks like some work has been done on the problem since then.

The second paper linked in the README.md of the repo talks about a few strategies to reduce 'babbling' or 'babble'. For your reference, here's the citation and the link to the PDF.

Denton, T., Luebs, A., Lim, F. S., Storus, A., Yeh, H., Kleijn, W. B., & Skoglund, J. (2021). Handling Background Noise in Neural Speech Generation. arXiv preprint arXiv:2102.11906.


This already happens with existing compression algorithms. Certain vowel sounds get collapsed, so someone will say, for example, "66" and it will come out on the other side as "6". Very annoying because you can't exactly coach a layperson on how to talk "the right way" to not trigger this vowel collapse.

> how to talk "the right way"

Not suggesting it as a fix, but this did remind me of the military phonetic alphabet, which includes numbers too.

3 is "tree", 4 is "fow er", 5 is "fife", 9 is "niner". The rest of the numbers are mostly as-is, but you'll hear very deliberate enunciation, like "Zee Row" for 0.

whiskey hotel yankee delta oscar india hotel alpha victor echo tango oscar sierra papa echo alpha kilo tango hotel echo lima alpha november golf uniform alpha golf echo oscar foxtrot tango hotel echo mike alpha charlie hotel india november echo ? tango hotel alpha tango india sierra india november sierra alpha november echo!

| perl -pe 's/(\w)\w+/\1/g'
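The perl one-liner above keeps only the first letter of each word, which decodes the NATO-alphabet message. For anyone who doesn't read perl, a rough Python equivalent:

```python
import re

def decode_phonetic(text: str) -> str:
    # Keep only the first letter of every multi-letter word,
    # mirroring perl's s/(\w)\w+/\1/g
    return re.sub(r'(\w)\w+', r'\1', text)

print(decode_phonetic("whiskey hotel yankee"))  # -> "w h y"
```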

Humans adapt a whole hell of a lot easier than machines.

Sure, it would be nice to have clean high bandwidth, low latency voice channels to everywhere so you could drop pins and expect the other side to hear it. Unfortunately, high bandwidth never really happened, and some places never ran land lines to everyone's home, and nobody wants to pay the high price of circuit switched voice when packet switched voice mostly works good enough and is enormously cheaper.

But is Lyra a significant improvement over modern Opus at 8Kbps? You can buy a Grandstream HT802 for ~$30 and its DSP can decode Opus today, whereas Lyra will require orders of magnitude more power to decode while providing much worse reproduction accuracy.

I'm having a little trouble following this, could you explain a bit more? It seems to me like "66" would be pronounced "SIKSIKS", so for that to become "SIKS" would mean the "KS" (consonants) would be collapsed, no? (Not trying to refute you or anything, just understand :) )

As someone with a weird sibilant that doesn't seem to compress well, I want to say that it goes across as "sɪkɪks" and I got used to saying "double six" on the phone.

So I would say "seven nine double six", which is another problem if I'm talking to an American.

This applies to GSM digitization and other "regular phone" compression; the newer computer calls have been better at preserving the words.

Probably turn into something like SIIIIKS.

Exactly, but sometimes it's so subtle you can't even tell it's the compression taking over.

I don’t know if it’s improved over the last 6 months, but Zoom sucks for native Spanish speakers speaking English. Zoom would not pick up the J/H sound at all on English words.

> you can't exactly coach a layperson on how to talk "the right way" to not trigger this vowel collapse

I've never noticed. At any rate, we should not coach people to adapt to technology in this way. It is Procrustean and anti-human and unnecessarily places a burden on people that belongs to the software and the developer.

For what it's worth, amateur radio operators already have specialized rules and techniques for speech, to improve clarity over a muffled noisy analog radio channel.

Going as far as using trinary for on-the-fly data encoding.

I've always suspected the optimal experience is a balance...we define some intermediate language that both the computers need to be programmed to understand and humans need to be trained to adopt.

The most obvious example is learning to type...I've had by far the most fun working with computers in a keyboard-centric environment, mostly because I'm good enough at pressing keys and the computer is good enough at understanding them.

That said, I agree with both you and GP: trying to train a layperson to talk differently based on the quirks of the codec used to encode their voice seems like a poor choice!

That's not a compression algorithm

A supersampling algorithm is a (de)compression algorithm. You give it an image and it gives you an "decompressed" image. It's not a very good one though.

The OCR issue was the first thing I thought about. Machine learning is probabilistic, not deterministic, so in the case of S being converted to 5 (or 6 to 8, etc.), which definitely impacts numerical data in the case of the OCR stuff, we can expect similar voice mis-classifications. Perhaps "You're fine" might get misclassified as "you're fired".

>Remember the photocopier that worked by OCR and would occasionally mis-transcribe numbers?

For those who don't remember: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

>photocopier that worked by OCR

The interesting bit was that it wasn't supposed to work by OCR...that had been deliberately turned off. The compression was too clever.

Back when Lyra was announced [0], I listened to the released samples and it changed an "m" sound to an "l" sound.

[0]: https://news.ycombinator.com/item?id=26309553

IMO: the output of machine learning is correlated garbage. This is confusing to most people who are used to programs that implement an algorithm (reminder that "correctness" is part of the definition of an algorithm.)

> Remember the photocopier that worked by OCR and would occasionally mis-transcribe numbers?

That was perfectly ordinary compression?

The phenomenon is all over the place, most visible in autocorrect.

It was ordinary compression, something called JBIG2. It did not mistranscribe, but it marked slightly different number or character blocks as the same, resulting in replaced parts of the images.

In other words, its match tolerance is a bit too lax, so it gets poisoned by blocks in its own dictionary, thinking it already has the blocks for things it has just scanned.

More details can be found in [0] and [1].

[0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...

[1]: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

Yes! This is why I always turn off autocorrect! It’s true that I absolutely make more typos without it, but at least they’re obvious as typos, and not different words that potentially change the meaning of the sentence.

One day, voice cloning may become so powerful that only word data and intonations will become part of the datastream. There could be various 'layers' in which encodes/decodes can occur. Voice Cloning would be at the very top of the stack.

You mean like: "Buttle" vs. "Tuttle" ?

The problem with JBIG2 and why it mistranscribed is that it worked by average error, instead of something sensible like maximum error.
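A toy illustration (not the actual JBIG2 matcher) of why an average-error criterion hides small, localized differences that a maximum-error criterion would catch:

```python
def mean_error(a, b):
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return sum(diffs) / len(diffs)

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Two 8x8 binary glyphs flattened to 64 pixels, differing in only
# 3 pixels -- enough, in principle, to turn one digit into another.
glyph_6 = [0] * 64
glyph_8 = list(glyph_6)
for i in (10, 11, 12):   # hypothetical pixels that close the top loop
    glyph_8[i] = 1

print(mean_error(glyph_6, glyph_8))  # ~0.047 -- passes a 5% "same block" test
print(max_error(glyph_6, glyph_8))   # 1.0   -- a max-error test rejects the match
```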

Recent past threads on this:

Lyra audio codec enables high-quality voice calls at 3 kbps bitrate - https://news.ycombinator.com/item?id=26300229 - March 2021 (198 comments)

Lyra: A New Very Low-Bitrate Codec for Speech Compression - https://news.ycombinator.com/item?id=26279891 - Feb 2021 (25 comments)

Is there significant new information here? https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...

Edit: it seems the SNI is the open-sourcing. I've changed the title to say that now. Corporate press releases are generally an exception to HN's rules about titles and original sources: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor....

> Is there significant new information here?

The fact that it's now open-source? The blog post is posted today and from Google themselves, so I assume there's new information.

"Please note that there is a closed-source kernel used for math operations that is linked via a shared object called libsparse_inference.so. We provide the libsparse_inference.so library to be linked, but are unable to provide source for it. This is the reason that a specific toolchain/compiler is required." - README

Yes, that will have to be removed as part of the effort of porting it to new platforms.

Any idea why this is proprietary? Is it third party? The only references I find online to "libsparse" is an MIT-licensed Python library.

Nope, no idea.

What's in it? Is there anything in there that's likely to be generally useful, or is it all Lyra-specific?

[update: proprietary .so]

They should re-implement the needed bits of libsparse_inference before releasing this thing. Otherwise it's just a distraction.

Probably they should get it building with something other than Bazel, too.

It's not a kernel module, it's a compute kernel. Nothing to do with operating systems. They provide versions for android-arm64 and linux-x86_64.

The fine README says it builds and runs on Ubuntu 20.04.

Ah, so Lyra today will not work on RISC-V, i386, Power, MIPS, lower end or older ARM chips like the Allwinner H3 (very popular in Single Board Computers) and any other new architecture that comes out?

It won't even work on Windows, macOS, or iOS.

I've been waiting for an audio codec that could actually silently change the words I've said.


Link to the website of the person who found the error in the first place: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

Doesn't seem that much better compared to Codec2, which is already fully open source (LGPL), even taking into account that the originals in Codec2's examples are of much worse quality than the ones on Lyra's website. I'd be curious to hear both working on the same set of audio samples.


Agreed; codec2 doesn't alter speech as aggressively, require proprietary components, or have as strong a connection to Google.

Encoding takes >40ms? Opus takes 5-26.5ms. Apparently 150ms[1] is the generally accepted upper bound for call latency.

I think the article could do with some bandwidth/quality/latency/power comparisons to other codecs.

[1] https://en.wikipedia.org/wiki/Latency_(audio)

Ya I was just coming here to say the same thing. 40ms _just in the codec_ feels like a lot. Because that's not even including time to pull in audio from the hardware (could be 20ms or more in Android devices), time to upload, and time to have it across the Internet, and then time to decode + play on the receiver. That adds up pretty quickly. I'm guessing 40ms was chosen because it is some sweet spot of having enough data to get a worthwhile compression on, but it's one of these things where technology, however impressive it might be, is slowly giving us a worse experience over time in the pursuit of digitization.

From my understanding the 40ms is just the feature extraction part. The encoding also does quantization, which surely adds to this number.

>These speech attributes, also called features, are extracted in chunks of 40ms, then compressed and sent over the network.

So while Encoding doesn't take 40ms, the latency + encoding will indeed be 40ms+.

150ms is the end-to-end latency, which is basically everything from encoding + network + decoding. We can't beat the speed of light on our fibre network, but we can certainly do something about encoding and decoding. And Lyra doesn't seem to help with that here. Something I pointed out the last time Lyra was on HN.
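The end-to-end budget can be sketched with rough numbers from this thread (illustrative figures, not measurements):

```python
# Illustrative one-way latency budget for a Lyra call, in milliseconds.
# Every figure here is a rough number cited in the discussion, not a benchmark.
budget_ms = {
    "audio capture buffer": 20,            # pulling audio from hardware (Android-ish)
    "codec frame (40ms chunks)": 40,       # Lyra's feature-extraction chunk size
    "network (one way)": 60,               # depends entirely on the path
    "jitter buffer + decode + playout": 25,
}
total = sum(budget_ms.values())
print(total)  # 145 -- already brushing the ~150 ms ceiling
```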

I think Opus defaults to 20ms with the option of a 10ms slot (excluding encoding speed) at the expense of quality. What we really need is a higher bitrate, lower latency, higher quality codec. Which is sort of the exact opposite of what Lyra is offering.

> We cant beat the speed of light on our fibre network.

Speed of light in what? We can absolutely be faster than fibre optics, which are quite slow relatively speaking (2/3rds that of light in a vacuum).

Internet latency is much higher than it could be, even using fiber: https://arxiv.org/abs/1811.10737

And adding an HFT-style microwave backbone could reduce Internet latency even more: https://arxiv.org/abs/1809.10897

We won't be replacing glass fibre with vacuum fibre anytime soon. I have been following this tech for a long time, but I do wish I were very wrong.

Starlink ?

Satellite links are orders of magnitude slower than fiber.

> Satellite links are orders of magnitude slower than fiber.

Minimum end-to-end latency for communications from opposite points of the earth is much lower for Starlink style LEO satellites than for fiber.

Which is only in the case of "opposite points of the earth"; otherwise you are just adding ~700 km of distance between two points. The point is, even if we had perfect speed-of-light data transfer over a direct line, we would be fundamentally limited by it and nothing could be done. But encoding, decoding, time slots and quality are all things we have control of and should look into more seriously.

Aren't they still heavily expected to feature in connecting that "next billion" ?

Yes, because they are convenient for other reasons (don't require infrastructure over land) which makes them suitable for connecting rural areas where it doesn't make sense to run fiber. But fiber will always be the fastest you can get, and if you get fiber in a vacuum, you could theoretically achieve near-speed of light communication. Satellites won't get you anywhere close to that, even if you use lasers, because there is always atmospheric disturbances that introduce latency.

There is no such thing as 5ms VOIP audio latency at 6 Kbps: the IPv4+UDP headers alone would amount to 44.8 Kbps at minimum, so it's irrelevant whether one encoder is tuned to encode 5 ms chunks instead of 40 ms chunks. 40 ms intervals require a minimum of 5.6 Kbps plus the codec rate.

I.e. at 10 Kbps it's impossible to have a lower VOIP latency than 32 ms. Likely the 40 ms number they tuned for in the real world.
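The arithmetic behind those figures: each packet carries a fixed IPv4+UDP header, so shorter frames mean more packets per second and proportionally more header overhead. A quick sketch (ignoring link-layer framing):

```python
def header_overhead_bps(frame_ms: float, header_bytes: int = 28) -> float:
    """Header bitrate for one packet per frame (28 = 20 IPv4 + 8 UDP bytes)."""
    packets_per_sec = 1000.0 / frame_ms
    return packets_per_sec * header_bytes * 8

print(header_overhead_bps(5))   # 44800.0 -> 44.8 Kbps just for headers
print(header_overhead_bps(40))  # 5600.0  -> 5.6 Kbps
```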

The favorite way to cheat compression contests. Buffer more data, get more compression.

I don't think the article is discussing encoding time; it says "features are extracted in chunks of 40ms". My reading is that it's breaking the speech down into 40ms chunks, compressing them, and sending that.

But since the buffer size has to be 40ms, the minimum latency is 40ms

Sure, latency ends up being 40 ms, but that's a function of needing to wait to send the encoded data + network headers at 6 Kbps, not a function of a slow encoder holding everything up.

Yeah, AMR (for GSM) is 10ms as well.

This seems kind of unnecessary, compared to Opus at ~10 kbps. If you're sending IPv6+UDP in 40 ms chunks, that's 9.6 kbps just from the packet headers (25 Hz * 40+8 bytes).

When the voice payload is smaller than the packet headers, you're well into diminishing returns territory.
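To put numbers on the diminishing returns: here is the payload-to-header ratio per 40 ms packet with IPv6+UDP headers (48 bytes), at the bitrates under discussion:

```python
def frame_payload_bytes(bitrate_bps: int, frame_ms: int = 40) -> int:
    """Voice payload bytes carried by one frame at the given codec bitrate."""
    return bitrate_bps * frame_ms // 1000 // 8

HEADER = 48  # 40-byte IPv6 header + 8-byte UDP header
for name, bps in [("Lyra @ 3 kbps", 3000), ("Opus @ 10 kbps", 10000)]:
    payload = frame_payload_bytes(bps)
    share = payload / (payload + HEADER)
    print(f"{name}: {payload} payload bytes vs {HEADER} header bytes "
          f"({share:.0%} of the packet is voice)")
```

At 3 kbps, fewer than a quarter of the bits on the wire are actually voice.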

Opus at 8Kbps sounds better, and commodity, inexpensive hardware like the Grandstream HT802 Analog Telephone Adapter supports this codec today (along with any cheap Android phone).

Lyra as it stands today will not support anything outside of x86-64 and ARM64 without rewriting the proprietary kernel it relies on.

Anyone listening to the sample audio linked to in the article should read this note from the last time this was discussed on HN: https://news.ycombinator.com/item?id=26309787

Summary: the Lyra audio samples are louder which muddies the comparison

If I remember correctly, the original landline audio was 64kbps at 8000 Hz. So Lyra is about 1/20 of this. And probably still sounds better.

PCMU/PCMA (G.711μ and G.711a) are not original landline quality audio, but rather what the Bell System felt they could get away with passing off as a toll quality call in 1972.

Lyra will likely sound better, but the reproduction accuracy is apt to be quite a bit poorer, as many others have commented. G.711 was created to require nearly no processing (it's nearly raw PCM data from a sound card, after all) while operating at reasonable bitrates; Lyra looks much more computationally intensive and will likely only run on smartphones in the next few years.
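For context on "nearly no processing": G.711μ is just a per-sample companding curve. Here is a sketch of the continuous μ-law form (the actual standard uses a segmented 8-bit table approximation of this curve):

```python
import math

MU = 255  # mu-law parameter used by G.711u

def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] through the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

# Quiet samples get most of the resolution; loud ones are squashed.
print(mu_law_compress(0.01))  # ~0.23
print(mu_law_compress(1.0))   # 1.0
```

One multiply-free table lookup per sample; that's the whole codec.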

Edit: Is Lyra a significant improvement over modern Opus at 8Kbps? You can buy a Grandstream HT802 analog telephone adapter for ~$30 and its DSP can decode Opus today, whereas Lyra will require orders of magnitude more power to decode while providing much worse reproduction accuracy.

> Is Lyra a significant improvement over modern Opus at 8Kbps?

It is over 6Kbps Opus[1].

> You can buy a Grandstream HT802 analog telephone adapter for ~$30 and its DSP can decode Opus today

A Raspberry Pi Zero will provide more than sufficient power for Lyra (it was originally implemented on a Pixel 2). That's ~$10.

[1] https://www.cnx-software.com/2021/02/28/lyra-audio-codec-ena...

> It is over 6Kbps Opus[1].

The overhead from packet headers when sending data every 40ms is 9.6kbps; is the difference between 12.6Kbps and 17.6Kbps meaningful at that point? We are sending the same number of packets, likely with the same packet loss rate.

> A RaspberryPi Zero will provide more than sufficient power for Lyra

A Raspberry Pi Zero can't run Lyra, as the proprietary math kernel is only offered in compiled form for x86-64 and android-arm64: https://github.com/google/lyra#license

> is the difference between 12.6Kbps and 17.6Kbps meaningful at that point

It is when you are sending video as well. One of the stated purposes of this work is to enable video conferencing over 56Kbps dial up modems.

> A Raspberry Pi Zero can't run Lyra, as the proprietary math kernel is only offered in compiled form for x86-64 and android-arm64

How annoying! Still - the point is that hardware capability isn't likely to be the issue.

It is mentioned in other comments that the math kernel will be opened.

> > You can buy a Grandstream HT802 analog telephone adapter for ~$30 and its DSP can decode Opus today

almost entirely irrelevant if you're making calls to or from the PSTN, since your SIP trunking provider most likely only supports G.711 alaw/ulaw; even if they support you handing them a call as G.722 or any other codec, their upstreams almost certainly don't support anything other than G.711.

Isn’t the Pixel 2 much more powerful than a Raspi Zero?

Original landline audio was/is analog

Not an authoritative source, but (as a point of interest) analog landlines seem to be specified as 24 dB SNR and 300-3000 Hz passband [1], giving ~21.5 kbps information rate [2].

[1] https://www.tschmidt.com/writings/POTS_Modem_Impairments.htm

[2] https://en.wikipedia.org/wiki/Shannon%E2%80%93Hartley_theore...
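Working the figures from [1] through the Shannon-Hartley theorem in [2] (a back-of-envelope check, not a precise channel model):

```python
import math

def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley channel capacity: C = B * log2(1 + SNR)."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

# 300-3000 Hz passband at 24 dB SNR, per the figures above
c = shannon_capacity_bps(3000 - 300, 24)
print(round(c))  # ~21.5 kbps
```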

Now what could go wrong? 4 dots becomes 3 dots...[1]

[1] Silicon Valley - Finale S6E7 https://www.youtube.com/watch?v=48Y77jSSHGU

There are huge wins here, but the grandiosity of "enabling voice calls" is grating. I don't think this will open voice communication to many new users. It will reduce data costs in a way that impacts a significant number of people's bottom line. But I feel manipulated by the current headline, and by the extended inability to mix the very real hope with some measure of humility.

Why is the demo link towards the bottom of the post pointing to the Basis Universal repository (which is a texture compressor)?


Copy-pasta error, or did they run the post through Lyra? ;)

In practical terms, very impressive. Anyone know what latency is like? Feels like a domain where people who have not experienced low-latency full duplex cannot fully appreciate why voice has faded in everyday life...

Sounds like at least +40ms of latency:

> features, are extracted in chunks of 40ms, then compressed and sent over the network

>"Lyra’s architecture is separated into two pieces, the encoder and decoder. When someone talks into their phone the encoder captures distinctive attributes from their speech. These speech attributes, also called features, are extracted in chunks of 40ms, then compressed and sent over the network. It is the decoder’s job to convert the features back into an audio waveform that can be played out over the listener’s phone speaker. The features are decoded back into a waveform via a generative model. Generative models are a particular type of machine learning model well suited to recreate a full audio waveform from a limited number of features. The Lyra architecture is very similar to traditional audio codecs, which have formed the backbone of internet communication for decades. Whereas these traditional codecs are based on digital signal processing (DSP) techniques, the key advantage for Lyra comes from the ability of the generative model to reconstruct a high-quality voice signal."

PDS: Audio Codec meets Machine Learning! I love it!!!

I have been using Duo more for audio calls lately and the call quality has been excellent. Compared to WhatsApp it's much, much better; WhatsApp can often mimic the sound quality of a regular phone call. I've tested in the US and with my family in India, where the connection isn't the greatest.

Huh, for me WhatsApp calls are way better quality than Duo. Wonder what causes such variations.

I'm using a variety of devices with similar results: Pixel 4 XL, Windows, Linux with Firefox and Chrome, and ChromeOS.

Usually on wifi and sometimes T-Mobile LTE

My family in India is using LTE 90% of the time.

The Github README shows that the public API uses types from Abseil, a library that "promises no ABI stability, even from one day to the next." That seems problematic.

Is abseil used at the codec API surface?

Isn't Lyra the name of Facebook's cryptocurrency as well? I cannot remember if that project was shelved.

Are you thinking of Libra, now called Diem?


Yes I was. Thanks, I had them mixed up.

Since this is explicitly targeted at "the next billion users," do we have any sense of how well-optimized this is on non-English audio corpuses? I can't imagine that a model trained primarily on English/Western phonemes would perform as well on the rest of the world.

They say they tested it on 70+ languages.

Ah you're right. I couldn't find it on the original link in this post, but the post links to https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-..., which mentions the 70+ languages statistic under the "Fairness" section. Thanks!

which is less than the number of spoken languages in India alone.

Wonder if India will ever go through a forced linguistic convergence like China did

Unlikely, there's too much pride in each local language. Might all converge on English over a couple of generations, though, but more for commercial reasons.

Or even in New York City public schools.

Yeah, I wonder about those weird languages with lots of clicks... (though they are probably not part of the next billion)

My immediate thought: train a Transformer (or Tacotron2) that transforms text to the encoded Lyra codes... And, we will finally have a good real-time open-source text-to-speech system running on mobile devices.

I find that Lyra sounds good at first but it can chop off hard consonants in certain scenarios. It sort of sounds like slightly slurred speech. Anyone else getting that impression from their samples?

It looks like "the model" has 5MB worth of coefficients so it is no problem fitting it into phones, ham radios, etc.

(Radio hams badly need a good digital speech codec for VHF/UHF operation)



Where else can I see a demo?

Sounds like NVIDIA's Maxine [1], but for voice?

1: https://developer.nvidia.com/maxine

Another reason for end-to-end speech encryption: to keep your cleartext voice signal away from these overaggressive codecs changing the words. I can understand the need for a super low bandwidth codec on top of Mt. Everest, but 64 kbit PCM was good enough for our grandparents' landlines (or 13 kbit GSM for their mobiles) and it's good enough for us.

What a spectacular failure of imagination. Why change anything ever, right? I suppose dial-up modems were good enough for you too.

Everyone is imagining that codecs like this will "change your words" but no-one has provided examples of that actually happening. I don't believe it.

Spectacular failure of imagination? I mean it's a speech codec, an incremental improvement over the many out there that work fine. We're no longer in a world where voice calls dominate the world's telecom bandwidth usage. We routinely receive a megabyte of Javascript and ads and crap to display a 288 character tweet. Soon there will be 5G everywhere, so we'll get 10MB of JS etc. to see the same 288 character tweet. 1MB is 10+ minutes of full-rate GSM, or a lot more than that of Opus. If Lyra is really free (no blobs) and its computational requirements don't make us churn our phone hardware yet again, then great, it can reduce the already very low cost of voice calls by another smidgen, increasing carrier margins while almost certainly not showing in lower prices to the end user. So at that end of things, it's tolerable, while it would be horrible if (say) it were patented and became a standard, so that FOSS VOIP clients became non-interoperable with what the big vendors were using.

Lyra is more transformative in some extreme niche areas of extremely limited bandwidth, say spacecraft radios or handheld satellite phones or whatever. Those applications already use super low bandwidth codecs that sound like crap. So Lyra won't really save bits, but it will help intelligibility a lot by sounding better in the same bits.

There are examples posted elsewhere in this thread, although from a beta version.

Also, Opus already blows past any requirements, either in terms of latency or in terms of size. Adding such unreliability for minimal gain seems foolish.

Notable that the two post authors sign it with " - Chrome", indicating I presume they are Chrome team members.

In the side by side comparisons I've seen between opus and lyra at 6kbps, lyra sounds remarkably better.

A more useful system would take Opus-compressed data as input and feature-extract that, presumably faster than this thing. Bonus for not requiring a proprietary library like libsparse_inference.so.

Also, instead of encoding independent 40ms segments, it should be much better to encode 10ms segments given the previous 30ms.
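A sketch of the framing this suggests, with hypothetical parameters: each step encodes 10 ms of new samples conditioned on the previous 30 ms of already-transmitted context, so the buffering delay per step is 10 ms rather than 40 ms.

```python
def context_frames(samples, sample_rate=16000, ctx_ms=30, new_ms=10):
    """Yield (context, new) sample pairs: encode `new` given `context`."""
    ctx = sample_rate * ctx_ms // 1000   # 480 samples at 16 kHz
    step = sample_rate * new_ms // 1000  # 160 samples at 16 kHz
    for start in range(ctx, len(samples) - step + 1, step):
        yield samples[start - ctx:start], samples[start:start + step]

# One second of 16 kHz audio yields (16000 - 480) // 160 = 97 frames
audio = [0.0] * 16000
print(sum(1 for _ in context_frames(audio)))  # 97
```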

Is there any difference from other audio codecs? It's great to see another player in the market, this time using machine learning to produce high-quality calls. I'll keep an eye on the impact in the future. This architecture will surely disrupt our communication industry.

Does anyone know how it compares to Codec2? Opus is great down to ~12kbps but Codec2 is the real contender down at the bottom. And I bet it uses way less CPU than Lyra

I'd hope they would, who cares about a closed codec?

Can't wait to try that in ffmpeg!

Is the training code open source?

Google misses the mark here...

Bad internet connectivity in the developing world isn't "only 56kbps" as some people think.

It's "random bursts of fast with random 30 second gaps of no connectivity at all". It's routed through 3 layers of proxies and firewalls which block random stuff and not others, while disconnecting long running connections.

Oh, and it'll be expensive per MB.

To that end, Lyra helps with the expense of a data connection, but is unusable for long voice calls. What would help more is a text chat system like WhatsApp.

Oh right - WhatsApp is already wildly popular in most of the developing world for mostly this reason.

> Oh right - WhatsApp is already wildly popular in most of the developing world for mostly this reason.

Not only that, but carriers will often advertise plans with "unlimited Internet for Facebook and WhatsApp" (a punch in the face of net neutrality).

So not only does WhatsApp have more impact with audio messages when audio calls are too unstable, audio calls already substitute for the bulk of phone calls even for people who have shitty data plans.

This is what my carrier says on their most basic offering:

> What does WhatsApp Unlimited mean?

> The benefit is granted automatically, without the need for activation. And the use of the app is unlimited to send messages, audios, photos, videos, in addition to making voice calls. Only video calls that are discounted from the internet package, as well as access to external links.

Heya, please could you unpack your reasoning a little bit more?

You said:

> WhatsApp is already wildly popular in most of the developing world for mostly this reason.

I can't speak for the majority of the developing world, but here in South Africa, WhatsApp is indeed the predominant communications app.

That being said, WhatsApp voice calls are also used here quite a bit.

So with that in mind, and reading from the article:

> Lyra compresses raw audio down to 3kbps for quality that compares favourably to other codecs

To me 3kbps sounds pretty great, and might actually work out cheaper / better than one might imagine.

So I'm just wondering, how does WhatsApp voice call data usage compare to Lyra?

Also whilst South Africa is indeed a developing country (where, among other things, the price of data is proportionately high relative to average household income), the cellular network infrastructure is excellent.

So I don't think the random bursts of connectivity you describe are as big of an issue here, whereas the price of data most certainly is.

In which case, I can definitely see a market for Lyra (assuming the 3kbps is indeed vastly superior to WhatsApp's data usage for a voice call).

Hope that makes sense, but I'd be happy to elaborate a little further :-)

Lyra is a good candidate for replacing the protocol already used in WhatsApp's voice calls. The binary size of WhatsApp matters, so it would depend on Lyra not requiring a multi-megabyte neural net. The 40 ms of extra enforced delay might have a negative impact on user experience.

It might be a good candidate for use in the voice message feature of WhatsApp. That feature doesn't require low-latency audio, so there might be even better compression schemes that use forward and backward compression techniques.

In the middle east I noticed a baffling-to-me usage of whatsapp: people were simply exchanging voice messages back and forth instead of calling. [0]

Presumably for exactly the reason you've stated.

[0] I later tried it myself with a friend, but you end up losing the benefits of both worlds -- you can't search or review old messages effectively (as you would text), and it's significantly slower than calling.

There is a gap in the market for "searchable" voice clips, i.e. auto-transcribed to text, allowing the user to see the text or hear the message.

This is going to be VERY useful for WebXR social platforms.


Give them at least 3-4 years ;)

I used to be fond of Google products.

I hope this never takes off.

This whole machine learning, optimization, etc. story, but the end goal is that Google can easily transcribe your voice calls and store them as text. Then it can apply all the shady practices that were previously too expensive, because storing voice and extracting information from it required huge storage costs and actual human labour.

Or worse, just imagine what some government you don't trust could do with all those voice call transcripts.

This codec has nothing to do with what you're worried about. There's no current technical limitation preventing what you're describing. Google doesn't do it because it makes no sense for their business and because your phone calls aren't routed through Google's servers. Governments outside the US are already doing it.

I mean...this is them open sourcing it?

It sounds more like an "offline" codec, not a Google service compressing your voice, so I don't immediately see how Google would violate our privacy here this time.

> This whole machine learning, optimization etc, story, but the end goal is that Google can easily transcribe your voice calls and store it as text.

Can they not do that with opus?

This will make voices radically more correlatable, most likely. It's a more effective model for voice: it has run endless regressions and found better patterns to model human sounds on. That could well make processing and comparing pieces of speech data less computationally expensive.

I don't see much relation to surveillance & transcription issues. This technology does not, would not change the field of battle significantly, if such a battle were about. Which it probably is, in some countries, perhaps even applying to Google-touched, -relayed, or Google-held data.
