
Real-Time Noise Suppression Using Deep Learning - tigranhakobian
https://devblogs.nvidia.com/nvidia-real-time-noise-suppression-deep-learning/
======
konschubert
I am really impressed with what Nvidia is doing here.

I think there is a huge market for improving sound quality in video calls.

For me, roughly every second call I make is somehow harmed by some kind of
"bad audio" problem. Breathing, reverb, noise, clipping, too quiet: there are
so many things that can go wrong.

And this really harms the productivity of video calls.

I have started building tools to detect all of these sources of bad audio and
am collecting them at [https://www.tinydrop.io](https://www.tinydrop.io) Maybe
these APIs can help people improve their setup. But if software like Nvidia's
comes along and just fixes the problem once and for all - that's great as well!

~~~
davitb
Disclosure: I'm the author of the blog post and co-founder at 2Hz.

This is a guest post on NVIDIA Developer Blog. The author of the technology is
a startup called 2Hz (2hz.ai). Our passion is to improve voice audio quality
in audio/video calls. It's a tough problem but also fun to work on.

Agree, breathing, reverb, noise are all problems and should be fixed. We
started with noise and already shipped a product you can try on your Mac. The
app is called Krisp (krisp.ai).

Reverb, breathing, voice cutting will come next.

~~~
hathawsh
Hi! As someone who seems to struggle more than most to understand people on
video calls, I'd like to give you my impressions.

Something struck me about the sample video. The very first sample included
background noise, but it was very easy to understand regardless of the noise,
probably because it was recorded by a pro microphone rather than a phone.
Every other sample was far more difficult, regardless of noise removal. Noise
removal doesn't really seem to help; in fact, any imperfections in the noise
removal process actually make the audio more difficult to understand, because
I have to guess at not only the speaker's voice and the noise but also the
noise removal algorithm.

What does help me is low frequency pickup. I think the first sample is easy
because there are plenty of low frequency components that are later lost
through the phone.

Low frequencies are presumably difficult to pick up due to the size of the
microphone in a phone, but could there be a way to restore those frequencies
through audio processing? It would be interesting to analyze the response of
specific microphones to specific low frequencies and find patterns that an
audio processor could use to restore the low frequency components.

Anyway, kudos for doing some very interesting work. I don't know how
representative my experience is.

~~~
CharlesW
> _I don't know how representative my experience is._

As someone who works with speech content, this seems unusual. Typically, low
frequencies are reduced because there's not much useful voice signal there—for
example, NPR typically rolls off frequencies below 250 Hz.

~~~
hathawsh
Thanks for your viewpoint!

Here's something concrete: the first phrase in the video ends with "small
demonstration", but starting with the second instance, I distinctly hear
"sall" instead of "small". In the version with the noise, the "m" sounds like
an aberration of the noise and is detectable. With the noise removed, the "m"
is replaced with a blip that sounds like an encoding error.

------
johnvanommen
What's wrong with using multiple microphones?

A mic element costs about thirty cents, and the processing power required for
noise cancellation already exists in the CPU of the mobile device.

I think it's particularly interesting that Amazon has made microphone arrays
cheap, thanks to Alexa. MiniDSP offers a microphone array for under $100,
which is an unheard-of price considering what these cost ten years ago.

[https://www.minidsp.com/products/usb-audio-interface/uma-8-m...](https://www.minidsp.com/products/usb-audio-interface/uma-8-microphone-array)

~~~
sophistication
How does multi-microphone filtering work? I guess they localize different
sound sources by cross-correlation (to get the timings) and triangulation
(based on the timings and the speed of sound)?
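For the timing step, cross-correlation is indeed the standard trick: the lag that maximizes the correlation between two mic channels gives the time difference of arrival (TDOA). A toy sketch (my own illustration, not any particular product's pipeline):

```python
import numpy as np

def tdoa_samples(mic_a, mic_b):
    """Delay (in samples) of mic_b relative to mic_a, from the peak
    of their cross-correlation. Positive: sound reached mic_b later."""
    corr = np.correlate(mic_b, mic_a, mode="full")
    return int(np.argmax(corr)) - (len(mic_a) - 1)

# Synthetic test: the same click arrives 5 samples later at mic B.
fs = 16000                       # sample rate, Hz
click = np.zeros(256)
click[100] = 1.0
mic_a = click
mic_b = np.roll(click, 5)

delay = tdoa_samples(mic_a, mic_b)
print(delay)                     # 5
print(delay / fs * 343.0)        # ~0.107 m extra path length at 343 m/s
```

With three or more mics, pairwise TDOAs plus the speed of sound pin down the source position, exactly as you describe.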

~~~
gugagore
I think the (or perhaps only one) key phrase is "beamforming". A single
microphone element has a certain sensitivity pattern (e.g. it may be a very
directional microphone, or be equally sensitive in all directions). With
multiple pick-ups, you can emulate some different sensitivity patterns.
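With per-channel delays in hand, the simplest emulated pattern is delay-and-sum beamforming: shift each channel so the chosen direction lines up, then average, which reinforces that direction and averages down everything else. A minimal sketch (illustrative only; integer-sample delays assumed):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Steer the array: advance each channel by its delay (in samples)
    so the target direction adds coherently, then average."""
    out = np.zeros_like(channels[0])
    for sig, d in zip(channels, delays):
        out += np.roll(sig, -d)   # integer-sample steering, for simplicity
    return out / len(channels)

rng = np.random.default_rng(0)
t = np.arange(512)
target = np.sin(2 * np.pi * 0.05 * t)                       # the voice we want
ch0 = target + 0.5 * rng.standard_normal(512)               # mic 0: target + noise
ch1 = np.roll(target, 3) + 0.5 * rng.standard_normal(512)   # mic 1: 3 samples late

beam = delay_and_sum([ch0, ch1], delays=[0, 3])
# The target adds coherently while the independent noise averages down,
# so the beamformed error is smaller than either single mic's:
print(np.var(beam - target), np.var(ch0 - target))
```

Real systems use fractional delays and frequency-domain weighting, but the coherent-sum idea is the same.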

A related idea in radar is synthetic-aperture radar (SAR).

~~~
johnvanommen
Great point.

A lot of the interesting things in audio were inspired by radar. Dan Wiggins
at Sonos used to work on radar, and Don Keele created a loudspeaker technology
called "CBT" that's based on radar technology.

Because microphones are basically the inverse of loudspeakers, what works in
loudspeaker arrays can also work in microphone arrays.

------
opdahl
This is amazing! Full props to the Nvidia team that accomplished this.

I downloaded the Mac app they provided [1], which I highly suggest everyone
with a Mac try out. I ran it on my old MacBook Air 2013 using daily.co. It
worked like a charm. Definitely using this in the next group chat, where there
is always someone who forgets to turn off their microphone.

One cool side effect is that it actually removes the feedback that happens
when you have two computers on the same call, where the mics keep picking up
the output of the other computer and a loud high-frequency noise builds up
(which I'm sure we all have experienced). The system simply removed it and I
didn't even know it was there until I turned off the app.

Amazing work and I really hope that Skype, Apple, Google etc implement this
into their voice apps, or even phone providers build this into phones. Maybe
in the future, we actually can have phone conversations in windy weather and
on the streets.

[1]: [https://krisp.ai/?utm_source=Nvidia%20blog&utm_medium=downlo...](https://krisp.ai/?utm_source=Nvidia%20blog&utm_medium=download)

~~~
loa-in-backup
copied from another comment, NOT MINE:

COMMENT FOLLOWS

davitb 18 hours ago

Disclosure: I'm the author of the blog post and co-founder at 2Hz. This is a
guest post on NVIDIA Developer Blog. The author of the technology is a startup
called 2Hz (2hz.ai). Our passion is to improve voice audio quality in
audio/video calls. It's a tough problem but also fun to work on.

Agree, breathing, reverb, noise are all problems and should be fixed. We
started with noise and already shipped a product you can try on your Mac. The
app is called Krisp (krisp.ai).

Reverb, breathing, voice cutting will come next.

------
mehrdadn
An interesting human problem that I imagine would come up here is that the
speaker could be getting distracted by all the noise (crying
baby/siren/etc.) while the listener would have no idea what's going on and
might think the speaker is confused/dumb/slow. Very curious how this would
play out in real conversations!

~~~
samstave
This is really evident when a speaker hears an echo of themselves, slightly
delayed, back through their speakers.

It's really hard to speak when what you say comes echoing back.

------
adamloving
I downloaded the Mac app, configured a virtual device to send the system
output to the "Krisp Speaker", and verified that it cuts most of the music out
of what I'm listening to, leaving only the voice (at a somewhat degraded
quality). I wish I could configure it to _cancel_ ambient noise, not just
remove it from the input signal.

~~~
ghostly_s
In your perception, what is the difference between "cancelling" a signal and
removing it?

~~~
fredsanford
Phase Cancellation [1]

[1] [https://www.sageaudio.com/blog/pre-mastering-tips/phase-canc...](https://www.sageaudio.com/blog/pre-mastering-tips/phase-cancellation.php)
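The one-line version: summing a signal with a phase-inverted copy of itself yields silence, which is the principle ANC headphones rely on. A toy demonstration (illustrative only):

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)
noise = np.sin(2 * np.pi * 100 * t)   # unwanted 100 Hz hum
anti_noise = -noise                   # same signal, 180 degrees out of phase
residual = noise + anti_noise

print(np.max(np.abs(residual)))       # 0.0: perfect cancellation
```

Real ANC is harder because the anti-noise must be generated in real time, so any latency shifts its phase and the cancellation degrades as frequency rises.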

------
Aspos
I can imagine a codec that would suppress noise, recognize the speech, and
send the text along with the voice information, so that in case of a broken
signal the codec on the receiving end could reconstruct the speech from the
text, applying the speaker's voice characteristics via style transfer.

So, if my voice is distorted on a broken line, it would be reconstructed from
the text, and the reconstruction would sound like me. I guess it would be the
ultimate 1 kbps codec.
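Back-of-the-envelope, the text side-channel really is tiny. With some rough assumptions of my own (150 words per minute, 6 characters per word, plain ASCII):

```python
# Rough bitrate of the text side-channel (all numbers are assumptions).
words_per_minute = 150   # conversational speaking pace
chars_per_word = 6       # average English word plus a space
bits_per_char = 8        # plain ASCII, no compression

text_bps = words_per_minute * chars_per_word * bits_per_char / 60
print(text_bps)          # 120.0 bits/s, a small slice of a 1 kbps budget
```

That leaves most of the budget for prosody and speaker-identity features, which is where the style-transfer part would earn its keep.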

------
post_break
My friend is an airline mechanic. One thing his coworkers all had was the
Jawbone headset. This was back around 2008-2009. He said he could call up a
mechanic working right next to a turbine while it was running and hear him
crystal clear. I wonder if any of that technology, paired with software
processing, will make it so there is zero noise in calls. Maybe an implanted
bone mic.

~~~
johnvanommen
IIRC, that technology originated in fighter jets and worked its way down to
consumer goods. The company sold Bluetooth headsets for a while, but multi-mic
solutions were cheaper and worked better. They tried to hang on for a few
years, diversifying into consumer electronics like the "Jambox."

It didn't work out, and they went bankrupt.

------
ccostes
Really impressive results, though I wish they had gone more into the deep
learning part of it (but I guess that's probably the secret sauce).

Can't help but notice how well Nvidia is positioned for what appears to be a
growing wave of demand for GPUs. Surprised this hasn't been reflected in their
share price (feels like they could be the next Intel, but what do I know).

~~~
jononor
Dedicated chips for machine learning (inference) are being developed by many
companies. The hope is that these will be used instead of (or in addition to)
GPUs for ML tasks.

Not that Nvidia is poorly positioned. In fact, I expect that if dedicated ML
chips work out, Nvidia will also put one on the market.

~~~
twtw
> Nvidia will also put one on the market

Already done. Tegra Xavier includes DLA (deep learning accelerator).

------
Jude2711990
Have you tried SoliCall Pro ([http://solicall.com/solicall-pro/](http://solicall.com/solicall-pro/))?
Once installed, it uses virtual audio device technology to improve the audio
with multiple options like NR, PNR, RNR, and more.

------
acd
Does this deep learning noise cancelling also work for music with headphones?
If so then we can ditch proprietary noise cancelling headphones and just use
the phones?

~~~
petra
Active noise cancelling (ANC) headphones are really sensitive to latency.

Take an ANC system with zero latency that stops cancelling noise at 8 kHz, and
add 50 µs of latency to it; now it will stop cancelling noise at ~1.5 kHz.

But this article talks about 20 ms of latency.

~~~
Judgmentality
> Take an ANC system with zero latency that stops cancelling noise at 8 kHz,
> and add 50 µs of latency to it; now it will stop cancelling noise at ~1.5
> kHz.

How did you calculate this?

~~~
petra
It's a simple explanation of figure 3 here:
[https://www.edn.com/design/analog/4458544/2/A-perspective-on...](https://www.edn.com/design/analog/4458544/2/A-perspective-on-digital-ANC-solutions-in-a-low-latency-dominated-world)
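A simplified model (mine, not the figure's exact derivation) reproduces numbers in that ballpark: treat the anti-noise as a delayed inverted copy of the noise, so the residual of sin(2πft) − sin(2πf(t−τ)) has amplitude 2·sin(πfτ), and the cancellation depth is −20·log10(2·sin(πfτ)) dB:

```python
import math

def anc_attenuation_db(freq_hz, latency_s):
    """Cancellation depth when the anti-noise lags the noise by latency_s.
    Negative values mean the 'cancellation' adds noise instead."""
    residual = 2.0 * abs(math.sin(math.pi * freq_hz * latency_s))
    return float("inf") if residual == 0 else -20.0 * math.log10(residual)

for f in (500, 1500, 3300, 8000):
    print(f, round(anc_attenuation_db(f, 50e-6), 1))
# 500 Hz: ~16 dB, 1500 Hz: ~6.6 dB, 3300 Hz: ~0 dB,
# 8000 Hz: negative (the delayed anti-noise amplifies the noise)
```

So a 50 µs lag leaves only a few dB of cancellation near 1.5 kHz and nothing above ~3.3 kHz, which is consistent with the shape of the article's figure.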

~~~
wrycoder
If I understand that figure correctly, at 8 kHz with no latency one gets 12 dB
of cancellation; with 50 µs of latency, 0 dB. And above that frequency, the
cancellation actually makes the noise worse. An analog cutoff filter would be
needed.

------
TatWakie
One of my favorite teams! Looking forward to the time when I won't hear any
background noise on my calls anymore :)

------
exabrial
They have a Mac app!! Incredible! How could I get this into my car kit?

------
33a
That's crazy how well it handles nonstationary noise.

------
npunt
Love it. Don’t really love the idea of audio contents of conversations being
routed to a cloud server for processing though — needs to stay on-device for
privacy.

~~~
davitb
This technology is already integrated into the Krisp app
([https://krisp.ai](https://krisp.ai)) and it all runs locally on-device.

~~~
npunt
Thanks for the heads up! I really like what you're doing - not only is it
great for the general public, it's a game changer for people with difficulties
hearing.

------
Aic1kuir
> _devblogs.nvidia.com uses an invalid security certificate. Certificates
> issued by GeoTrust, RapidSSL, Symantec, Thawte, and VeriSign are no longer
> considered safe because these certificate authorities failed to follow
> security practices in the past._

~~~
msla
> _Certificates issued by GeoTrust, RapidSSL, Symantec, Thawte, and VeriSign
> are no longer considered safe because these certificate authorities failed
> to follow security practices in the past._

Those are some pretty big names. Names where reasonable companies could
believe that nobody would ever _dare_ enforce the rules against them, because
it would break the Web.

Who says nobody ever got fired for buying IBM?

Heck. If the for-pay CAs keep screwing up, Let's Encrypt could become the
sane, reasonable, conservative choice, even among the most Enterprise of
Enterprise Enterprises.

~~~
gsnedders
They were all subsidiaries of Symantec when their various faults occurred,
leading to the Symantec distrust; it's all just the Symantec distrust.

------
pslam
The story title is "AI powered Noise Cancellation" but the text never uses the
term "AI" at all. It's deep (machine) learning. It doesn't need the useless
marketing bonus term "AI" to make it better — it's already interesting enough
without.

~~~
sctb
We've reverted the headline from the submitted “NVIDIA on state of art in AI
powered Noise Cancellation” to that of the article.

