I don't mind it compressing a phoneme to <unintelligible> as much as I would mind it compressing one to a clearly audible different phoneme.
Very worth your two minutes if you're not yet familiar with the effect: https://www.youtube.com/watch?v=2k8fHR9jKVM
This is no different than using, for example, a probabilistic algorithm to solve some NP-hard problem in your real-world software. As long as you understand the limitations, I don't see an issue with using an algorithm that has a small, insignificant (for your use case) rate of failure. I would definitely not use this to communicate with the space station, but in the right context (Google Duo, low bandwidth), it's the perfect tool.
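For a concrete illustration (my example, not from the thread): the Miller-Rabin primality test is exactly this kind of algorithm. It can misreport a composite as prime, but the failure probability per round is at most 1/4, so a handful of rounds makes the error rate negligible for most applications.

```python
import random

def probably_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin test: may call a composite 'prime', but with
    probability at most 4**-rounds, which is negligible for most uses."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):          # quick exact check for tiny factors
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:               # write n - 1 as d * 2**r with d odd
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False            # witness found: definitely composite
    return True                     # almost certainly prime

print(probably_prime(2**61 - 1))    # a Mersenne prime -> True
```

The point is the trade, not the specifics: you accept a bounded, tunable error probability in exchange for speed that a deterministic approach can't match.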
I had a coworker play me before/after of an early version of the codec "babbling" and it was definitely uncanny valley. It looks like some work has been done on the problem since then.
The second paper linked in the README.md of the repo discusses a few strategies to reduce 'babbling'. For reference, here's the citation and a link to the PDF.
Denton, T., Luebs, A., Lim, F. S., Storus, A., Yeh, H., Kleijn, W. B., & Skoglund, J. (2021). Handling Background Noise in Neural Speech Generation. arXiv preprint arXiv:2102.11906.
Not suggesting it as a fix, but this did remind me of the military phonetic alphabet, which includes numbers too.
3 is "tree", 4 is "fow er", 5 is "fife", 9 is "niner". The rest of the numbers are mostly as-is, but you'll hear very deliberate enunciation, like "Zee Row" for 0.
Sure, it would be nice to have clean, high-bandwidth, low-latency voice channels to everywhere so you could drop pins and expect the other side to hear it. Unfortunately, high bandwidth never really happened, some places never ran land lines to everyone's home, and nobody wants to pay the high price of circuit-switched voice when packet-switched voice mostly works well enough and is enormously cheaper.
So I would say "seven nine double six", which is another problem if I'm talking to an American.
This applies to GSM digitization and other "regular phone" compression; the newer internet calling systems have been better at preserving the words.
I've never noticed. At any rate, we should not coach people to adapt to technology in this way. It is Procrustean and anti-human and unnecessarily places a burden on people that belongs to the software and the developer.
The most obvious example is learning to type...I've had by far the most fun working with computers in a keyboard-centric environment, mostly because I'm good enough at pressing keys and the computer is good enough at understanding them.
That said, I agree with both you and GP: trying to train a layperson to talk differently based on the quirks of the codec used to encode their voice seems like a poor choice!
For those who don't remember: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...
The interesting bit was that it wasn't supposed to work by OCR...that had been deliberately turned off. The compression was too clever.
That was perfectly ordinary compression?
The phenomenon is all over the place, most visible in autocorrect.
In other words, its match tolerance is a bit too lax, so it gets poisoned by blocks in its own dictionary, thinking it already has the blocks for things it has just scanned.
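A toy sketch of that failure mode (my illustration, not the actual Xerox/JBIG2 code): a dictionary-based pattern matcher with too lax a tolerance will substitute a previously seen glyph for a new, slightly different one, silently turning one symbol into another.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def compress(blocks: list, tol: int) -> list:
    """Replace each block with the first dictionary entry within `tol`
    mismatches; otherwise add the block to the dictionary as-is."""
    dictionary, out = [], []
    for b in blocks:
        match = next((d for d in dictionary if hamming(b, d) <= tol), None)
        if match is None:
            dictionary.append(b)
            match = b
        out.append(match)
    return out

# Crude 4x5 glyph bitmaps that differ by a single pixel:
six   = "0110 1000 1110 1010 1110"
eight = "0110 1010 1110 1010 1110"

print(compress([six, eight], tol=0))  # strict: both glyphs survive
print(compress([six, eight], tol=2))  # too lax: the '8' is emitted as a '6'
```

With `tol=0` the scheme is lossless; with a tolerance looser than the real difference between glyphs, the output looks crisp but contains the wrong character, which is exactly what made the Xerox bug so insidious.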
More details can be found in:
Lyra audio codec enables high-quality voice calls at 3 kbps bitrate - https://news.ycombinator.com/item?id=26300229 - March 2021 (198 comments)
Lyra: A New Very Low-Bitrate Codec for Speech Compression - https://news.ycombinator.com/item?id=26279891 - Feb 2021 (25 comments)
Is there significant new information here? https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
Edit: it seems the SNI is the open-sourcing. I've changed the title to say that now. Corporate press releases are generally an exception to HN's rules about titles and original sources: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor....
The fact that it's not open-source? The blog post is posted today and from Google themselves, so I assume there's new information.
They should re-implement the needed bits of libsparse_inference before releasing this thing. Otherwise it's just a distraction.
Probably they should get it building with something other than Bazel, too.
The fine README says it builds and runs on Ubuntu 20.04.
I think the article could do with some bandwidth/quality/latency/power comparisons to other codecs.
So while encoding doesn't take 40ms, the framing delay plus encoding will indeed be 40ms+.
150ms is the end-to-end latency, which is basically everything: encoding + network + decoding. We can't beat the speed of light on our fibre network, but we can certainly do something about encoding and decoding. And Lyra doesn't seem to help with that case here, something I pointed out the last time Lyra was on HN.
I think Opus defaults to 20ms frames, with the option of 10ms frames (excluding encoding speed) at the expense of quality. What we really need is a higher-bitrate, lower-latency, higher-quality codec, which is more or less the exact opposite of what Lyra is offering.
Speed of light in what? We can absolutely be faster than fibre optics, which is quite slow, relatively speaking: light in fibre travels at about 2/3 of its speed in a vacuum.
And adding an HFT-style microwave backbone could reduce Internet latency even more: https://arxiv.org/abs/1809.10897
Minimum end-to-end latency for communications from opposite points of the earth is much lower for Starlink style LEO satellites than for fiber.
I.e. at 10 kbps it's impossible to have a VoIP packet interval lower than 32 ms, since the packet headers alone eat the bitrate. That's likely the source of the 40 ms number they tuned for in the real world.
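Sketching the arithmetic behind that bound (my numbers, assuming the classic 40-byte IPv4 + UDP + RTP header stack):

```python
HEADER_BITS = (20 + 8 + 12) * 8   # IPv4 + UDP + RTP headers: 40 bytes/packet
BITRATE = 10_000                  # a 10 kbps channel

# Even with zero voice payload, headers alone cap the packet rate:
max_packets_per_sec = BITRATE / HEADER_BITS          # 31.25 packets/s
min_packet_interval_ms = 1000 / max_packets_per_sec  # 32.0 ms between packets

print(min_packet_interval_ms)  # 32.0
```

So at 10 kbps you cannot send a packet more often than every 32 ms, regardless of how small the codec makes the payload, which is why the frame interval and the header stack matter as much as the codec bitrate.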
When the voice payload is smaller than the packet headers, you're well into diminishing returns territory.
Lyra as it stands today will not support anything outside of x86-64 and ARM64 without rewriting the proprietary kernel it relies on.
Summary: the Lyra audio samples are louder which muddies the comparison
Lyra will likely sound better, but the reproduction accuracy is apt to be quite a bit poorer, as many others have commented. G.711 was created to require nearly no processing (it's nearly raw PCM data from a sound card, after all) while operating at reasonable bitrates; Lyra looks much more computationally intensive and will likely only run on smartphones in the next few years.
Edit: Is Lyra a significant improvement over modern Opus at 8Kbps? You can buy a Grandstream HT802 analog telephone adapter for ~$30 and its DSP can decode Opus today, whereas Lyra will require orders of magnitude more power to decode while providing much worse reproduction accuracy.
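As a rough illustration of why G.711 is so cheap to compute (this uses the continuous mu-law companding curve; real G.711 uses a segmented 8-bit approximation of it, but the cost is comparably tiny):

```python
import math

def ulaw_encode(sample: int, mu: int = 255) -> int:
    """Compress one signed 16-bit PCM sample to 8 bits via mu-law companding."""
    x = max(-1.0, min(1.0, sample / 32768.0))          # normalize to [-1, 1]
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * 255)                      # map [-1, 1] onto one byte

# Quiet samples get most of the 8-bit resolution; loud ones are squeezed:
print(ulaw_encode(0), ulaw_encode(1000), ulaw_encode(32767))
```

One logarithm per sample (or, in the real codec, a table lookup) is the entire "model", which is why a $30 DSP handles it effortlessly while a neural vocoder needs orders of magnitude more compute.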
It is a significant improvement over Opus at 6 kbps.
> You can buy a Grandstream HT802 analog telephone adapter for ~$30 and its DSP can decode Opus today
A Raspberry Pi Zero will provide more than sufficient power for Lyra (it was originally implemented on a Pixel 2). That's ~$10.
The overhead from packet headers to send data every 40ms is 9.6kbps, is the difference between 12.6Kbps and 17.6Kbps meaningful at that point? We are sending the same number of packets, likely with the same packet loss rate.
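Unpacking those numbers (my arithmetic; assuming IPv6 + UDP headers at 48 bytes per packet, which is what yields 9.6 kbps at one packet per 40 ms):

```python
HEADER_BITS = (40 + 8) * 8      # IPv6 + UDP headers: 48 bytes per packet
PACKETS_PER_SEC = 1000 // 40    # one packet per 40 ms frame = 25 packets/s

header_kbps = HEADER_BITS * PACKETS_PER_SEC / 1000   # 9.6 kbps of pure overhead
lyra_total = 3.0 + header_kbps                       # ~12.6 kbps on the wire
opus_total = 8.0 + header_kbps                       # ~17.6 kbps on the wire

print(header_kbps, lyra_total, opus_total)
```

At the same packet rate, the fixed header cost dominates both codecs, so the on-the-wire gap between 3 kbps and 8 kbps payloads shrinks from 2.7x to about 1.4x.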
> A RaspberryPi Zero will provide more than sufficient power for Lyra
A Raspberry Pi Zero can't run Lyra, as the proprietary math kernel is only offered in compiled form for x86-64 and android-arm64: https://github.com/google/lyra#license
It is when you are sending video as well. One of the stated purposes of this work is to enable video conferencing over 56Kbps dial up modems.
> A Raspberry Pi Zero can't run Lyra, as the proprietary math kernel is only offered in compiled form for x86-64 and android-arm64
How annoying! Still - the point is that hardware capability isn't likely to be the issue.
Almost entirely irrelevant if you're making calls to or from the PSTN, since your SIP trunking provider most likely only supports G.711 alaw/ulaw. Even if they let you hand them a call as G.722 or any other codec, their upstreams almost certainly support nothing other than G.711.
1). Silicon Valley - Finale S6E7 https://www.youtube.com/watch?v=48Y77jSSHGU
Copy-pasta error, or did they run the post through Lyra? ;)
> features, are extracted in chunks of 40ms, then compressed and sent over the network
> Generative models are a particular type of machine learning model well suited to recreate a full audio waveform from a limited number of features. The Lyra architecture is very similar to traditional audio codecs, which have formed the backbone of internet communication for decades. Whereas these traditional codecs are based on digital signal processing (DSP) techniques, the key advantage for Lyra comes from the ability of the generative model to reconstruct a high-quality voice signal.
PS: Audio codec meets machine learning! I love it!!!
Usually on Wi-Fi and sometimes T-Mobile LTE
my family in India is on LTE 90% of the time
(Radio hams badly need a good digital speech codec for VHF/UHF operation)
Where else can I see a demo?
Everyone is imagining that codecs like this will "change your words" but no-one has provided examples of that actually happening. I don't believe it.
Lyra is more transformative in some extreme niche areas of extremely limited bandwidth, say spacecraft radios or handheld satellite phones or whatever. Those applications already use super low bandwidth codecs that sound like crap. So Lyra won't really save bits, but it will help intelligibility a lot by sounding better in the same bits.
Also, Opus already blows past any requirements, both in latency and in size. Adding such unreliability for minimal gain seems foolish.
Also, instead of encoding independent 40ms segments, it would likely be much better to encode 10ms segments given the previous 30ms.
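A sketch of that framing idea (my code, assuming 16 kHz mono samples): each 10 ms hop is handed to the encoder together with the preceding 30 ms as context, trading the 40 ms framing delay for a 10 ms one.

```python
def context_frames(samples, sr=16000, hop_ms=10, ctx_ms=30):
    """Yield (context, frame) pairs: a 10 ms frame plus the 30 ms before it."""
    hop = sr * hop_ms // 1000    # 160 samples per frame at 16 kHz
    ctx = sr * ctx_ms // 1000    # 480 samples of look-back context
    for start in range(0, len(samples) - hop + 1, hop):
        yield samples[max(0, start - ctx):start], samples[start:start + hop]

# 100 ms of dummy audio -> 10 frames; later frames carry 480 samples of context
pairs = list(context_frames(list(range(1600))))
print(len(pairs), len(pairs[0][0]), len(pairs[5][0]))  # 10 0 480
```

The catch is that the decoder must hold the same context, so a lost packet now corrupts the prediction for the frames that follow it, which is presumably part of why Lyra encodes self-contained 40 ms chunks.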
Bad internet connectivity in the developing world isn't "only 56kbps" as some people think.
It's "random bursts of fast with random 30 second gaps of no connectivity at all". It's routed through 3 layers of proxies and firewalls which block random stuff and not others, while disconnecting long running connections.
Oh, and it'll be expensive per MB.
To that end, Lyra helps with the expense of a data connection, but is unusable for long voice calls. What would help more is a text chat system like WhatsApp.
Oh right - WhatsApp is already wildly popular in most of the developing world for mostly this reason.
Not only that, but carriers will often advertise plans with "unlimited Internet for Facebook and WhatsApp" (a punch in the face of net neutrality).
So not only does WhatsApp have more impact with audio messages when audio calls are too unstable, WhatsApp calls already substitute for the bulk of phone calls even for people with shitty data plans.
This is what my carrier says on their most basic offering:
> What does WhatsApp Unlimited mean?
> The benefit is granted automatically, with no activation needed. Use of the app is unlimited for sending messages, audio, photos, and videos, as well as for making voice calls. Only video calls and access to external links are deducted from the data package.
> WhatsApp is already wildly popular in most of the developing world for mostly this reason.
I can't speak for the majority of the developing world, but here in South Africa, WhatsApp is indeed the predominant communications app.
That being said, WhatsApp voice calls are also used here quite a bit.
So with that in mind, and reading from the article:
> Lyra compresses raw audio down to 3kbps for quality that compares favourably to other codecs
To me 3kbps sounds pretty great, and might actually work out cheaper / better than one might imagine.
So I'm just wondering, how does WhatsApp voice call data usage compare to Lyra?
Also whilst South Africa is indeed a developing country (where, among other things, the price of data is proportionately high relative to average household income), the cellular network infrastructure is excellent.
So I don't think the random bursts of connectivity you describe are as big of an issue here, whereas the price of data most certainly is.
In which case, I can definitely see a market for Lyra (assuming the 3kbps is indeed vastly superior to WhatsApp's data usage for a voice call).
Hope that makes sense, but I'd be happy to elaborate a little further :-)
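A rough back-of-the-envelope on that question (my numbers; the ~0.5 MB/min figure for WhatsApp calls is a commonly quoted estimate, not something from the article, and it ignores Lyra's packet overhead):

```python
LYRA_BPS = 3_000                                   # Lyra's payload bitrate

lyra_mb_per_min = LYRA_BPS * 60 / 8 / 1_000_000    # payload only: 0.0225 MB/min
whatsapp_mb_per_min = 0.5                          # assumed estimate, see above

print(round(lyra_mb_per_min, 4))  # 0.0225
```

Even after adding realistic header overhead, the payload is roughly 20x smaller than the assumed WhatsApp figure, so on expensive per-MB data plans the difference could indeed be material.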
It might be a good candidate for use in the voice message feature of whatsapp. That feature doesn't require low latency audio, so there might be even better compression schemes that use forward and backward compression techniques.
Presumably for exactly the reason you've stated.
I later tried it myself with a friend, but you end up losing the benefits of both worlds: you can't search or review old messages effectively (as you would text), and it's significantly slower than calling.
This is the whole machine learning and optimization story, but the end goal is that Google can easily transcribe your voice calls and store them as text. Then it can apply all the shady practices that were previously too expensive, because storing voice and extracting information from it required huge storage costs and actual human labour.
Or worse, just imagine what some government you don't trust could do with all those voice call transcripts.
Can they not do that with opus?
I don't see much relation to surveillance and transcription issues. This technology does not and would not change that battlefield significantly, if such a battle were underway. Which it probably is, in some countries, perhaps even applying to data that Google touches, relays, or holds.