
A Real-Time Wideband Neural Vocoder at 1.6 Kb/S Using LPCNet - weinzierl
https://people.xiph.org/~jm/demo/lpcnet_codec/
======
loudmax
For those too impatient to read the details, check out the "Hear for yourself"
examples toward the bottom of the page. They're reproducing decent sounding
speech at 1.6 kbps.

1.6 kbps is nuts! I like to re-encode audio books or podcasts in Opus at 32
kbps and I consider that stingy. The fact that speech is even comprehensible
at 1.6 kbps is impressive. As the article explains, their technique is
analogous to speech-to-text, then text-to-speech.

The original recordings are a little stiff, and the encoded speech is a little
more stiff. It isn't perfect, but it's decent. It'll be interesting to hear
this technique applied to normal conversation. If regular speech holds up as
well as their samples, it should be perfectly adequate for conversational
speech. At 1.6 kbps, which is absurd.
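
As a rough back-of-the-envelope sketch of what those bitrates mean for storage (assuming a constant bitrate and ignoring any container overhead; the helper name is made up for illustration):

```python
def audio_size_bytes(bitrate_bps, seconds):
    """Bytes needed to store `seconds` of audio at a constant bitrate."""
    return bitrate_bps * seconds / 8  # 8 bits per byte

HOUR = 3600  # seconds

lpcnet_hour = audio_size_bytes(1_600, HOUR)  # 1.6 kbps
opus_hour = audio_size_bytes(32_000, HOUR)   # 32 kbps

print(f"1.6 kbps: {lpcnet_hour / 1e6:.2f} MB per hour")
print(f"32 kbps:  {opus_hour / 1e6:.2f} MB per hour")
print(f"ratio:    {opus_hour / lpcnet_hour:.0f}x")
```

Under those assumptions, an hour of speech fits in under a megabyte at 1.6 kbps, versus roughly 14 MB at 32 kbps.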

Also, I wonder how well this technique could be applied to music. My guess is
that it won't do justice to great musicians ... but it might be good enough
for simple pop tunes.

~~~
jmvalin
Actually, this won't work _at all_ for music because it makes fundamental
assumptions that the signal is speech. For normal conversations, it should
work, though for now the models are not yet as robust as I'd like (in case of
noise and reverberation). That's next on the list of things to improve.

~~~
petercooper
Here we go! This is the first minute or so of Penny Lane by The Beatles
converted down to a 10KB .bin and then back to a .wav:
[http://no.gd/pennylane.wav](http://no.gd/pennylane.wav) .. unsurprisingly the
vocals remain recognizable, but the music barely at all.
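
As a quick sanity check on that file size, here is a small sketch that assumes the .bin is a raw 1.6 kb/s stream with no header (an assumption, not something verified against the file; the helper name is hypothetical):

```python
def duration_seconds(size_bytes, bitrate_bps):
    """Playback time of a constant-bitrate stream of the given size."""
    return size_bytes * 8 / bitrate_bps

# "10KB" read either as 10,000 bytes or as 10 KiB (10,240 bytes)
print(duration_seconds(10_000, 1_600))
print(duration_seconds(10_240, 1_600))
```

Either reading comes out at a little under a minute, which is consistent with "the first minute or so".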

~~~
bch
As imagined by Marilyn Manson...

~~~
petercooper
Pretty much! It shows off how the codec works to a great extent though as it
seems to be misinterpreting parts of the music to be the pitch of the speech,
so Paul's voice sounds weird at the start of most lines but okay throughout
the lines.

I've also run a BBC news report through the program with better results
although it demonstrates that any background noise at all can throw things off
significantly:
[https://twitter.com/peterc/status/1111736029558517760](https://twitter.com/peterc/status/1111736029558517760)
.. so at this low bitrate, it really is only good for plain speech without any
other noise.

~~~
jmvalin
Well, in the case of music, what happens is that due to the low bit-rate there
are many different signals that can produce the same features. The LPCNet
model is trained to reproduce whatever is the most likely to be a single
person speaking. The more advanced the model, the more speech-like the music
is likely to turn out.

When it comes to noisy speech, it should be possible to improve things by
actually training on noisy speech (the current model is trained only on clean
speech). Stay tuned :-)
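
A rough way to see why so many signals collapse onto the same features is to compare the raw input rate with the coded rate. The sketch below assumes 16 kHz, 16-bit mono PCM input, which is a typical wideband format but an assumption here rather than something stated in this thread:

```python
SAMPLE_RATE_HZ = 16_000  # assumed wideband sampling rate
BITS_PER_SAMPLE = 16     # assumed PCM sample depth
CODED_BPS = 1_600        # 1.6 kb/s codec bitrate

raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
compression_ratio = raw_bps / CODED_BPS
# Bits of raw detail per second that the bitstream cannot distinguish
discarded_bps = raw_bps - CODED_BPS

print(f"raw PCM:   {raw_bps} bits/s")
print(f"ratio:     {compression_ratio:.0f}:1")
print(f"discarded: {discarded_bps} bits/s")
```

With so few bits left per second, a very large number of distinct input waveforms necessarily share each encoded bitstream, and the decoder can only pick the one that looks most like speech.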

------
mmastrac
I've been waiting for someone to do this* with audio and/or video. Amazing
work.

Also worth reading this related link:
[https://www.rowetel.com/?p=6639](https://www.rowetel.com/?p=6639)

There's some really exciting progress that could be made in this space. The
page mentions that they could use it as a post-processing filter for Opus for
better decoding to avoid changing the bitstream. It could also be useful as a
way to compensate for packet loss and recover "just enough" to avoid
interrupting the conversation.

* encoding audio through a neural network for network transmission

------
danielvf
For comparison, your standard police/fire/medical digital radio in the US
sends voice at 4.4Kb/s. So this is approximately a third of that.

~~~
jdc
So maybe this line of work will mean more spectrum available in the future.

~~~
metildaa
Only if Motorola gets out of the way and supports modern codecs and standards.
Current public safety radio networks are using ancient TDMA tech that Motorola
has cobbled together, along with audio codecs that shred voice quality. The
only good part is the durability of the pricey radios; some are even
intrinsically safe.

~~~
davidf560
Public safety digital radio networks are primarily APCO Project 25 (P25), which
uses the IMBE/AMBE vocoders developed by DVSI. Motorola's original digital
radios used a proprietary vocoder called VSELP (also used by iDEN/Nextel). When
APCO standardized public safety digital radios, they rejected VSELP and chose
IMBE from DVSI instead. Personally I think VSELP sounds better than IMBE, and
I'm not sure whether IMBE was chosen for technical superiority or for political
reasons (i.e. picking a non-Motorola solution due to Motorola's dominance).
Also, APCO Project 25 Phase 1 was not TDMA; however, Phase 2 is.

[https://en.wikipedia.org/wiki/Project_25](https://en.wikipedia.org/wiki/Project_25)

Public safety radio is a true mission critical service that moves slowly -
equipment lasts years or decades and is expensive and not frequently replaced
or upgraded, hence new technology adoption is slow. Vocoder choice is driven
by a standards committee for interoperability (which has seen more emphasis
since 9/11), and of course committees aren't typically known for working fast.
Public safety radio is definitely not a place for a "move fast and break
things" mentality.

------
zackbloom
Just to put this in perspective, a traditional phone line encodes 56 Kb/s of
data, which was believed to be the channel size needed to send the human voice
with reasonable quality. They are doing it in 1.6 Kb/s!

~~~
na85
Aren't "traditional" aka POTS lines analog, and therefore not doing any
encoding whatsoever?

~~~
metildaa
There are band filters and such on legacy, fully analog systems.

G.711 (which is standard now for non-cellular call audio) is a step down, but
Opus at 16Kbps sounds better to me than a classic, full analog system due to
the lack of band cutoff & smarter encoding.

------
walrus01
For those interested in low bandwidth audio codecs, take a look at the voice
codec used for Iridium handheld satellite phones, which was finalized in about
1998. Fully twenty plus years ago.

It doesn't sound the best, but consider the processing power constraints it
was designed with...

[https://en.wikipedia.org/wiki/Iridium_Communications#Voice_a...](https://en.wikipedia.org/wiki/Iridium_Communications#Voice_and_data_calls)

~~~
jmvalin
Iridium appears to be using a vocoder called AMBE. Its quality is similar to
that of the MELP codec from the demo and it also runs at 2.4 kb/s. LPCNet
at 1.6 is a significant improvement over that -- if you can afford the
complexity of course (at least it'll work on a phone now).

~~~
walrus01
Based on my previous experience with Iridium I believe it actually operates at
a data rate up to about 3 to 3.2 kb/s. 2400 bps of it is actual usable voice
payload; the remaining 600 bps is FEC.
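
A tiny sketch of that split, using only the figures quoted in this comment (they come from experience, not a spec, so treat them as approximate):

```python
VOICE_BPS = 2_400  # usable voice payload, per the comment above
FEC_BPS = 600      # forward error correction overhead, per the comment above

gross_bps = VOICE_BPS + FEC_BPS
code_rate = VOICE_BPS / gross_bps  # fraction of the channel carrying voice

print(f"gross rate: {gross_bps} bps")
print(f"code rate:  {code_rate:.2f}")
```

That is the low end of the "3 to 3.2 kb/s" range, with about a fifth of the channel spent on error correction.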

Iridium data service (not the new next-generation network) is around the same
speed: 2400 bps plus whatever compression v42bis can gain you. For plain
text and stuff it can be a bit faster; something that's already incompressible
by v42bis will be right around 2400 baud.

------
qwerty456127
The examples sound excellent, on par with or better than any text-to-speech
synthesizer I've ever heard. I would love to start using it for audio books
and for VoIP to save space and traffic as soon as possible. And a Linux-native
text-to-speech synthesizer capable of producing speech of this quality is a
thing I dream of (right now the only option I know of is booting to Windows
and using Ivona voices).

~~~
metildaa
Mozilla DeepSpeech (speech-to-text) and Mozilla TTS are both useful at this
point:
[https://research.mozilla.org/machine-learning/](https://research.mozilla.org/machine-learning/)

------
wanderfowl
This is really amazing work. Nice to see LPC pushed to its limits, and I can't
wait to see what's next for speech compression. Here's hoping the mobile
companies pick up on something similar soon.

------
tomcam
Voice quality at that bit rate is absolutely astounding to me. This is one of
my favorite things – to see such elegantly applied research.

------
th0ma5
Would be really cool when this is in FreeDV and the HF bands benefit from
this!

~~~
Scoundreller
My dream is for telecom monopolies to get broken up by technologies like this.

------
a-dub
might be fun to try to port this to the dadamachines doppler... :)

------
0-_-0
OK, so... Would it be possible to do something similar for video?

~~~
cooper12
Similar as in the same approach, or as in "apply neural networks to all the
things"? Because if it's the former, this approach was very specifically
tailored to human speech, taking into account how much it can
compress/interpolate qualities like pitch and the spectral envelope. That's
far too specific to apply to video.

As for the latter, you'd have to perhaps feed Google Scholar the right
incantations or ask someone with knowledge. As far as I know, video codecs
already have a huge bag of tricks they use (for example the B-frames borrowed
in this post). Even then, the key points in this codec were that firstly it's
meant for use at very low bitrates, where existing codecs break down, and then
secondly it's a vocoder, so it's converting audio to an intermediate form and
resynthesizing it. That kind of lossiness is acceptable for audio, but I'm not
sure how it would work acceptably for video.

~~~
0-_-0
I should have been more specific. I meant that instead of compressing video to
minimise pixel difference, minimise feature difference instead.

------
nategri
Really thought this was gonna be human-neural-activity-to-speech and I feel
like a doofus.

------
rasz
3 Gflops, we are deep beyond diminishing returns here. Opus seems good enough.

~~~
nullc
Opus is awesome and covers a previously unmatched spectrum of use cases... but
that isn't everything.

Opus isn't good enough to be a replacement for AMBE for use over radio. Opus
doesn't make it easier to make very high quality speech synthesis, etc.

Opus loss robustness could be much better using tools from this toolbox-- and
we're a long way from not wanting better performance in the face of packet
loss.

~~~
rasz
This is roughly a 2x improvement over AMBE+2, except AMBE peaks at maybe a
couple hundred MIPS, and there are better, less computationally intensive
alternatives at 20-70 MIPS, like
[https://dspini.com/vocoders/lowrate/twelp-lowrate/twelp2400](https://dspini.com/vocoders/lowrate/twelp-lowrate/twelp2400)

