The post is padded with marketing language, which adds unnecessary noise to the information. The most important part of this whole post is this:
...
[V]ector quantization ... works well at bitrates around 1 kbps or lower, but quickly reaches its limits when using higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook with more than 1 billion vectors, which is infeasible in practice.
In SoundStream, we address this issue by proposing a new residual vector quantizer (RVQ), consisting of several layers
...
This is really impressive. They say that at 3 kbps this new codec sounds as good as Opus at 12 kbps, and that a single model is trained across a wide range of bitrates. It's a much bigger deal than Lyra was. I'd like to know whether it can run at low latency.
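To make the arithmetic in that quote concrete, here's a toy residual-vector-quantizer sketch (my own illustration in plain NumPy, not the paper's code; the 64-dim vectors and the 3 × 1024 codebooks are assumptions): the 30 bits per vector implied by 3 kbps at 100 vectors/s get split across three 10-bit stages, so you store 3 × 1024 codewords instead of 2^30.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, STAGES, CODEBOOK_SIZE = 64, 3, 1024        # assumed sizes, for illustration only
codebooks = rng.standard_normal((STAGES, CODEBOOK_SIZE, DIM))

def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left over by the previous stage."""
    residual, indices = x, []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)                      # 10 bits per stage
        residual = residual - cb[idx]            # next stage codes what's left
    return indices

def rvq_decode(indices, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.standard_normal(DIM)
codes = rvq_encode(x, codebooks)
print(codes, np.linalg.norm(x - rvq_decode(codes, codebooks)))
```

Keeping or dropping later stages at decode time is also what gives the bitrate scalability mentioned further down in the thread.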
How low, though? Lyra added significant overhead, from what I read. They said it was small, but in practice it was something like 100ms, which, on top of a typical okay connection that's easily 50ms, would put it in the perceivable range of delay.
Any sufficiently fancy compression for communication formats immediately makes me worry about the Xerox Effect[1], where the reconstructed content is valid yet semantically different in some important way to the original.
[1] I propose we call it that unless there's already a snappy name for it?
Indeed. I also expect this failure mode will be undetected for a long time due to how our sense of hearing works. My last neuroscience class was many years ago, but I do remember that in some sense, we hear what we expect to hear (more so than vision if I recall correctly, though there is plenty that happens in our vision processing) in that our ears tune for particular frequencies to filter out ambiguities.
Suppose a person says something that the codec interprets differently. Perhaps they have one of the many ever-evolving accents that almost certainly were not, and could not possibly be, included in the training set (ongoing vowel shifts might be a big cause of this). The algorithm removes the ambiguity, but the speaker can't tell, because they hear themselves through their own sense of hearing. Assume the user has somehow overcome the odd psychological effects that come with hearing the computer-generated audio played back; if that audio is mixed with what the person is already hearing, it's likely they still won't notice, because they still hear themselves. They would have to listen to a recording some time later and detect that the recording doesn't match up with what they thought they said... which happens all the time anyway, because memory is incredibly lossy and malleable.
Most of the time, it won't matter. People have context (assuming they're actually listening, which is a whole other tier of hearing what you expect to hear), and people pick the wrong word or pronounce things incorrectly (as in, not recognizable to the listener as what the speaker intended) all the time. But it'll be really hard to know that the recording doesn't reflect what was actually said. You'd need a local accurate copy, the receiver's processed copy, and you'd have to know to look for the discrepancy in what will likely be many hours of audio. It's also possible that "the algorithm said that" will become a common enough argument due to other factors (faulty memory and increasing awareness of ML-based algorithms) that it'll outnumber the cases where it really happens.
This seems similar to being able to read your own handwriting, when others can't. If it's an important recording, someone else should listen, and it would be better to verify a transcription.
In a live situation, they will ask you to repeat if it's unclear.
Yep, it's kind of happening with the music example on the page: the Lyra (3 kbps) sample has some human-sounding parts even though the original reference is just music without any speech. Probably because Lyra was trained on speech.
It's a valid concern, but I think it can also be solved. Compress, decompress, and compare to the original using a method that isn't susceptible to the Xerox effect. If the sound has materially changed, then use a fallback method, robust but less efficient, for that particular window.
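A minimal sketch of that guard, assuming a frame-by-frame pipeline and using a crude normalized-error distance as a stand-in for a real perceptual measure (every name and number below is made up):

```python
import numpy as np

THRESHOLD = 0.15  # assumed distance threshold; in practice tuned by listening tests

def lossy_roundtrip(window):
    # Stand-in for the neural encode/decode round trip: coarse quantization.
    return np.round(window * 8) / 8

def encode_window(window):
    reconstruction = lossy_roundtrip(window)
    # Normalized error as a crude stand-in for a perceptual / semantic distance.
    err = np.linalg.norm(window - reconstruction) / (np.linalg.norm(window) + 1e-9)
    if err > THRESHOLD:
        return ("fallback", window.copy())       # robust but less efficient path
    return ("neural", reconstruction)

frame = np.sin(np.linspace(0, 2 * np.pi, 160))   # one 10 ms frame at 16 kHz
print(encode_window(frame)[0])
```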
Just curious, is there any progress on relatively high-bitrate audio codecs? Not that I'm unsatisfied with the current state of AAC, but I find most of these new developments are about some super-low bitrate (this case is even more extreme: 3 kbps!?).
Opus is a remarkable codec because it’s excellent at almost everything. The only areas where it’s being beaten are extreme narrowband, which it can’t do, and narrowband, where it’s still not shabby (though some of this new stuff is redefining what’s possible).
Opus tackled a broad field of competitors that were each somewhat specialised for their part of the field, and pretty much beat all of them at their own game. And in most cases the incumbents were high-latency, while Opus not only achieves quality superiority but also supports low-latency operation.
Past about 16kb/s, Opus is pretty much just the format to use, except for extreme niches like if you want to represent over 20kHz (above which Opus cuts).
Opus is so good there's pretty much nothing left to improve (for now), and even if you improved things it probably wouldn't be worth it. That's why all the development is happening in narrowband: it's the only interesting space left. Perhaps eventually some of those techniques will be scaled up past narrowband. I don't know.
Yeah. In reality the main reason new audio codecs are developed post-Opus isn't technical, it's so that companies can get their patents into new standards and rake in the licensing royalties. There are better codecs for really low bitrates but that's quite niche these days; even telephony is going wideband and higher bitrate.
Let's assume for a moment that you're not stupid enough to confuse 200kBps (1.6Mbps) for 200kbps.
Opus is fine down to 8kbps. It fits over a cheap, shitty mid-20th century analogue telephone line with room to spare.
The ultra-narrow band stuff is very niche, and is consequently unlikely to have the broad impact you're imagining.
In contrast, there is continued enthusiasm for these pointless midband codecs that are similar in performance to Opus but have the "advantage" that somebody gets $$$.
8kbps is enough - when you don't have to spare anything for error correction. Maybe in those cases, somewhat-okay analogue audio is enough (for example, in long-distance radiotelephony). But having a very impressive digital codec raises the bar significantly, especially since the last time someone bothered with this was someone at Nokia trying to fit speech into 6kbps using what are now rudimentary phone chips.
Additionally, there are people in the world (including in the US) who are stuck on unreliable 28kbps lines. Giving them an option for excellent audio and video is something no one has seemed bothered to do.
I just compared 730 kb/s FLAC with 160 kb/s Opus, and I can see no difference, even on a spectrogram, using the 'mother of MP3' track: Tom's Diner (Live At The Royal Albert Hall, London, UK, November 18, 1986).
Very surprising; I will be migrating all my music to Opus to save space.
Beware of phase differences: they won't show up on a spectrogram but could seriously upset your stereo impression. Better to figure that out before you compress all your FLAC content rather than afterwards, which would be quite annoying.
Yeah - definitely keep your backups in FLAC. Having a lossless source gives you infinite future flexibility, and that cannot be overstated. Otherwise you're kinda doomed to hit https://www.youtube.com/watch?v=fZCRYo-0K0c eventually.
For on-listening-device though, oh heck yes - Opus is great.
As I said, spectrograms are not indicative of compression quality. Codecs should be judged by ears only. What you see in a spectrogram will vastly differ between codecs and will not reflect their compression efficiency.
For 99.9% of use cases, that's a solved problem. You'll never use anything other than FLAC and Opus for lossless and lossy compression respectively.
I'm sure there are unusual cases, such as live streaming over satellite internet, where getting an extra 1% compression on high-quality audio is a big deal, but even that's likely a temporary problem. Starlink users already get >50 Mbps.
After AAC, researchers realized that AAC produced perceptually transparent audio with excellent compression and that growing bandwidth and disk space meant there wasn't much point to further improvement in that direction.
And remember that, unlike MP3, AAC is more like an entire toolbox of compression technologies that applications/encoders can use for a huge variety of purposes and bitrates -- so if you need better performance, there's a good chance you can just tune the AAC settings rather than needing a whole new codec.
So research shifted in two other directions -- super low-bandwidth (like this post) for telephony etc., and then spatial audio (like with AirPods).
Oh, I wrote something about that [1] on HN and HydrogenAudio. But basically, there is zero incentive to do so. We are no longer limited by storage or bandwidth; bandwidth (or cost per transfer) declines at a much greater rate than any audio or video codec advances.
So if you want higher quality at a relatively high bitrate? Use a higher bitrate. Just use 256Kbps AAC-LC instead of 128Kbps; all of its patents have expired, so it is truly patent-free. The only not-so-good thing is that the open-source AAC encoders aren't anywhere near as good as the one iTunes provides. Or you could use Opus, at slightly better quality per bitrate, if it fits your usage scenario. Or even high-bitrate MP3 for quite literally 100% backward compatibility. If you want something at 256Kbps+ but don't want to go lossless, Musepack continues to be best in its class, but you get very little decoding support.
The article mentions that it can scale bitrate directly by adding or removing layers. But I sure wish they had included some hi-fi quality sample audio.
Being able to download in advance is huge. Looking forward to satellite audio for everything spoken-voice one day. Some org was working on it, but they were working with 2kbps 24/7, which is definitely not a lot!
In the paper they compare latencies from 7ms to 26ms and observe no loss of quality. Processing requirements are lower at higher latencies because of batching.
Have you ever played Counter-Strike on a lagging network? Observe the teammates running into walls and such. That's what the conversation would sound like.
Running into a wall is not a result of NN prediction; it's a result of naïve linear prediction. A NN will predict that he'll stop, turn around, maybe shoot someone. The hard thing is to not disappoint the NN by actually running into a wall.
You need to take the complexity into consideration. Counter-Strike has a very tiny set of possible actions you can take. Compare that to the space of things one can say.
A NN is going to fail on that at least as miserably as linear interpolation fails in CS.
When speech prediction fails, it should sound clearly wrong. Otherwise we risk serious misunderstanding when the prediction says something that sounds good but means something different.
We already have trouble like this with texting autocorrection.
So, at what point during a phoneme does it become distinct?
After any one phoneme has been sent down the wire you can truncate the audio and send a phoneme identifier; at the other end they just replay the phoneme from a cache.
Like doing speech to text, but using phonemes, and sending the phoneme hash string instead of the actual audio.
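A toy version of that scheme, just to show its shape (the phoneme inventory and every name below are made up); it also makes obvious what gets lost, namely prosody, timing, and the speaker's voice:

```python
# Tiny invented phoneme inventory; a real system would use something like ARPAbet.
PHONEME_IDS = {"HH": 0, "AH": 1, "L": 2, "OW": 3}
PHONEME_CACHE = {i: f"<cached waveform for {p}>" for p, i in PHONEME_IDS.items()}

def encode(phonemes):
    # Input would come from a phoneme recognizer; one byte per phoneme here.
    return bytes(PHONEME_IDS[p] for p in phonemes)

def decode(payload):
    # Replay cached snippets in order; concatenation is where the naturalness goes.
    return [PHONEME_CACHE[b] for b in payload]

print(decode(encode(["HH", "AH", "L", "OW"])))   # "hello", minus prosody and voice
```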
I remember reading that paper. This is amazing. But it's not really predicting what we're going to say beyond the current syllable:
> While our model learns how to plausibly continue speech, this is only true on a short scale — it can finish a syllable but does not predict words, per se.
Has anyone thought of trying to end-to-end train an h.265 decoder? The results might not be 100% perfect, but the resulting codec might bypass a ton of patents.
Since H.265 relies heavily on operations which are not easily differentiable, such as translation of image patches, together with a pretty complicated binary format, I'd be pretty amazed if the NN actually learned anything meaningful at all.
Interpolated translation is continuous and easily differentiated. There's lots of work on machine-learned video codecs already, from Nvidia, Qualcomm and others.
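For instance (a toy illustration, nothing to do with H.265's actual tools): shifting a signal by a fractional offset via linear interpolation produces an output that varies smoothly with the offset, which is the property that lets gradients flow through motion-compensation-like operations.

```python
import numpy as np

def translate_1d(signal, shift):
    """Shift a 1-D signal by a fractional amount using linear interpolation."""
    pos = np.arange(len(signal)) - shift
    lo = np.clip(np.floor(pos).astype(int), 0, len(signal) - 1)
    hi = np.clip(lo + 1, 0, len(signal) - 1)
    frac = pos - np.floor(pos)
    return (1 - frac) * signal[lo] + frac * signal[hi]

x = np.sin(np.linspace(0, np.pi, 16))
# Output changes continuously with the shift, unlike an integer "move this patch" step.
print(translate_1d(x, 0.25)[:4])
print(translate_1d(x, 0.30)[:4])
```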
> There's lots of work on machine-learned video codecs already, from Nvidia, Qualcomm and others.
I've tried them; the current state of the art means they're only useful on relatively static content (some shaking, etc.) and they spike up to AV1-level bitrates to reach perceptual similarity when there's too much movement. Maybe in the future (or with whatever under-wraps concoction Nvidia, Qualcomm, or another player has), ML-based video codecs will surpass hand-tuned codecs, but that's not (yet) the present state.
It's not difficult to propagate gradients while translating an image. Learning "pick an 8x8 patch from (145,17), apply X to it, and translate it by (4,-8)" from data is on a different level, isn't it?
The premise was using end-to-end learning to avoid patent issues. I am sure that with some preprocessing you can plug a NN inside the decoding process and learn very meaningful stuff.
Aren't software implementations royalty-free anyway? And even if they weren't, couldn't a neural network be classified as a software implementation (from a patent-enforcement point of view)? Because if that's not the case, this idea would be applicable to a lot of things, right? Seems like an easy hack.
Is it possible to apply this same technique to video codecs? If not general video, then at least video streams where the center subject is a human face?
> that induce the reconstructed audio to sound like the uncompressed original input.
What if we want the codec to produce more easily perceptible sound, not just a reconstruction? It could bake in noise and echo reduction, voice enhancement, etc.
I think they start with the "signal" and "noise" as separate audio files, and then play them together to create a synthetic noisy input. Then they can train the output against only the signal, so that it learns to filter out the noise.
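If that guess is right, the setup might look roughly like this (entirely my assumption; the plain MSE below is just a placeholder for whatever losses the actual system uses to make the output sound like the target):

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 40 * np.pi, 16000))   # stand-in for one second of speech
noise = 0.3 * rng.standard_normal(16000)            # stand-in for background noise

model_input = clean + noise     # what the encoder would see during training
target = clean                  # what the decoder output is compared against

def placeholder_loss(decoded, target):
    # Plain MSE as a placeholder for the real training losses.
    return float(np.mean((decoded - target) ** 2))

print(placeholder_loss(model_input, target))         # loss of a codec that does nothing
```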
Almost certainly, but this is true of most low bitrate codecs. I've got a very deep voice and it becomes largely unintelligible in marginal mobile signal conditions. If anything this one might be more tweakable and/or personalizable than what we use today.
To some extent, surely. In their samples they have some music and some audio with background noise. The music survives ok and the clanging of the background noise is reduced to clicking so maybe languages with clicking sounds do ok too.
I suppose there are a few futures:
- the paper was very innovative but nothing really happens in the ‘real world’
- Google rolls it out globally to one of their products and we discover that some voices/accents/languages work poorly (or well)
- The same but with slow rollout, feedback, studies, and improvements to try to make sure everyone gets a good result.
For a company that penalizes sites for being "not mobile friendly", they really dropped the ball on this blog. All the graphs have their right sides off-screen, hidden, with no way to see them on a normal modern Android phone.
Now this is impressive. Truly quite amazing technology. I pray we reach the day where my 1200GB music collection can be compressed to a fraction of its size without my noticing.
Depends what format you have it in now. 160kbps opus is transparent for almost all music. That's 1/5 the space of a FLAC collection, or 1/2 the size of 320kbps mp3, two other popular ways to store music without noticeable loss in quality.
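Back-of-the-envelope for the 1200 GB collection mentioned above, assuming it's currently FLAC at roughly 800 kb/s on average (that average is an assumption):

```python
collection_gb = 1200                             # from the comment above
flac_kbps, opus_kbps, mp3_kbps = 800, 160, 320   # assumed average FLAC bitrate

print(collection_gb * opus_kbps / flac_kbps)     # ~240 GB re-encoded to 160 kb/s Opus
print(collection_gb * mp3_kbps / flac_kbps)      # ~480 GB at 320 kb/s MP3
```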
I actually do encode to 128 opus and crank up the encoder settings for music synced to my own devices, because I know they can play it. It's pretty much transparent to me but I am not an audiophile with audiophile equipment, which is why I keep the originals.
It's not very portable. I can't keep it on my phone without a 1TB micro SD, and while I could theoretically carry an SSD, that's not really portable since it requires an internal connection. And if it's an external drive with a USB connection, then it's probably much larger than my phone.
It's certainly possible to keep, and it's not a particularly unwieldy amount of data for the average power user, but if I ever had to bring my music somewhere (and I mean all of it) I'd be screwed. Plus that data would take hours to copy if ever needed. I spent just an hour today dding a 64GB SD card.
I just keep my favorite albums and discographies in 320 MP3 as I know there is almost no music player on earth that will fail to play it. Then I keep that in a micro SD in my wallet, for that emergency in which you desperately need some Pink Floyd.
2. Consider https://beets.io for management; it may be just the system you want. One of its features is converting your library into a lossy version while keeping your high-res files in a very managed way.
As you seem knowledgeable: it looks like beets manages music you've already ripped, so do you have a suggestion for ripping? Last time I tried, it was a disaster (I got a lot of poor-quality audio and ended up nuking everything, because it was too much work to listen to everything and try again).
It's been a while since I ripped my last CD, but I always found https://github.com/bleskodev/rubyripper very useful. It uses cdparanoia to get multiple rips and combines them into what it thinks is the exact information located on the original cd.
These days, I can fortunately download most stuff directly from bandcamp :)
I just have a syncthing folder shared between my VPS, my home server, my desktop, laptop, and phone; it syncs it continuously so a download once syncs it everywhere else. It's basically set it and forget it.
In low-bandwidth situations, for recording speeches or podcasts, I have a similar question.
The Codec2 examples have an 8 kHz range, so they can't be compared to the Lyra ones as-is. Perhaps you could encode your own voice WAVs with ffmpeg and compare.
In this case there's also the question of portability: can the resulting files be played on Android or iPhone, and how many CPU cycles / how much battery power would it cost? I'd rather listen to 1 hour of Lyra speech than Codec2 speech if the battery would last twice as long.
I don't know, but there seemed to be a demo of Codec2 running on an STM32F4, whereas the Lyra repository README explains how its optimized implementation allows it to run on a mid-range phone in real time, so…
Lyra is also a speech-only codec, yet is included in the comparison.
Note also that Codec2 had some experimental work extending it with the WaveNet neural network, which improved the performance.
Given both of these, it seems disingenuous to exclude Codec2 from the comparison. I can only assume it's left out because it performs well at even lower bitrates.
>Over the past few years, different audio codecs have been successfully developed to meet these requirements, including Opus and Enhanced Voice Services (EVS).
I guess the Google AI team works separately from the main Google team. Lots of respect for pointing out EVS.
It would be interesting to characterize the behavior of the encoder (i.e., how does it differ from a mel-warped spectrum... or what is the warping that it learns?).
It would also be kinda neat to see something pretrained in this way and then given a small number of human-in-the-loop training iterations, to see whether quality improves, or perhaps whether it uncovers something previously unknown about human auditory perception...
> Opus is a versatile speech and audio codec, supporting bitrates from 6 kbps (kilobits per second) to 510 kbps, which has been widely deployed across applications ranging from video conferencing platforms, like Google Meet, to streaming services, like YouTube. EVS is the latest codec developed by the 3GPP standardization body targeting mobile telephony. Like Opus, it is a versatile codec operating at multiple bitrates, 5.9 kbps to 128 kbps.
Why 5.9 instead of 6 kbps? Did they want some kind of PR victory over Opus?