
Another new experimental codec from xiph.org - AndrewDucker
https://xiphmont.dreamwidth.org/95505.html
======
makomk
Previous discussion:
[https://news.ycombinator.com/item?id=19520194](https://news.ycombinator.com/item?id=19520194)

------
armagon
The noise suppression demo is incredible! Scroll down to "Show Me the
Samples!" (about 3/4 of the way down) on
[https://people.xiph.org/~jm/demo/rnnoise/](https://people.xiph.org/~jm/demo/rnnoise/)
and hit play and try the buttons.

My heavens; this sort of technology would enable, say, a robot to have a much
better time understanding what a person is saying in a public place (as well
as having possibilities for better chatting for people over the internet).

~~~
jononor
One can do as you propose for speech recognition: use noise reduction first,
then a classifier. But it can be more efficient to train the speech
recognizer directly on input corrupted by noise, so that the detector learns
to be invariant to noise. This is commonly done via data augmentation, which
mixes various noise into labeled speech snippets.
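
The noise-augmentation idea above can be sketched in a few lines. This is a minimal illustration, not any particular toolkit's implementation; the function name `mix_at_snr` and the target-SNR parameterization are my own choices for the example.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a speech clip at a target SNR (in dB).

    Both inputs are 1-D float arrays of the same length. The speech
    label is unchanged, so the noisy clip can be fed straight into
    training as an augmented sample.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that P_speech / P_noise hits the target SNR.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: white noise mixed into a synthetic "speech" tone at 10 dB SNR.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
speech = np.sin(2 * np.pi * 220 * t)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

In a real augmentation pipeline you would draw the SNR and the noise clip at random per training sample, which is what gives the detector exposure to many noise conditions.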

There are still[1] circumstances where one would run noise suppression as a
pre-processing step before the speech detector. For instance, with microphone
arrays the frontend may use data from many microphones (2-20) and perform
adaptive beamforming to extract the cleanest possible mono speech signal. This
can include a Voice Activity Detection (VAD) estimator and multi-source-aware
noise reduction. The output is then run through a standard mono detector,
either on device (typically only keyword spotting) or sent to the cloud.
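
To make the beamforming step concrete, here is a toy delay-and-sum beamformer. Real array frontends use adaptive methods (e.g. MVDR), so treat this as a minimal sketch; the function name `delay_and_sum` and the integer-sample delays are assumptions for the example.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each mic channel by its integer sample delay and average.

    channels: list of 1-D arrays, one per microphone.
    delays: samples by which each mic's copy of the source lags the
    reference mic. Averaging aligned copies keeps the source coherent
    while uncorrelated noise partially cancels.
    """
    n = min(len(c) for c in channels)
    aligned = [np.roll(c[:n], -d) for c, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

# Two mics hear the same tone; mic 1 lags by 8 samples, and each mic
# adds its own independent noise. Summing in alignment keeps the tone
# and averages the noise power down.
rng = np.random.default_rng(1)
t = np.arange(1600)
source = np.sin(2 * np.pi * t / 40)
mic0 = source + 0.5 * rng.standard_normal(t.size)
mic1 = np.roll(source, 8) + 0.5 * rng.standard_normal(t.size)
clean = delay_and_sum([mic0, mic1], delays=[0, 8])
```

With two mics and independent noise, the residual noise power in the output is roughly half the per-mic noise power; larger arrays reduce it further, which is part of why far-field recognition benefits so much from them.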

Strong neural networks in combination with microphone arrays are the reason
why smart home devices like Alexa have become pretty decent (compared to what
was feasible 10 years ago) at speech recognition when the speaker is far away.

1\. Integrating more and more functions into big neural networks and
jointly optimizing the overall system is definitely a trend, and very actively
researched.

------
lousken
I'd love to see more work in the 32-96 kbit/s range, which is where Discord
and other voice chats operate. But with Opus I guess we're at the point where
consumer hardware is becoming the problem: microphones on headsets are just
not good enough anymore, and high-quality mics are still ridiculously
expensive for whatever reason.

------
rapsey
It is pretty cool and all but is there really a need for such efficiency with
regards to bandwidth? Is there a use case that requires so little?

~~~
Applejinx
This is Monty's jam. Why not? What's wrong with impressive technology? And
I'll tell you one thing right off.

Gaming.

What if you wanted to have a seemingly normal game, but when played, you
discovered that the characters have seemingly infinite dialogue trees? Tens of
thousands of hours of voice performances. What if you could code up a game
especially because it became possible to do something like that? What if you
had lines delivered in ten different tones of voice and the modulation of that
was relevant to gameplay? So it'd seem a little like normal gaming voice
acting, except that maybe you'd be getting nonverbal cues and not know why you
were reacting differently or getting more tense, except the 'other people' in
the game were acting subtly differently.

Yes. Yes, there are lots of ways to use this type of thing (I'll also note
that game engines already like using Xiph tech; Godot uses .ogg)

Absolutely there's a use case for this. Many.

~~~
est31
TTS would in fact be better than having the samples recorded and then
compressed and sent to the game. You could maybe even do procedurally
generated voices speaking procedurally generated text. The quality of TTS is
already better than the quality of that codec.

------
kibibu
Note that the demo link is incorrect, and should be
[https://people.xiph.org/~jm/demo/lpcnet_codec/](https://people.xiph.org/~jm/demo/lpcnet_codec/)

~~~
xiphmont
Oh, dang. Fixed.

