I had been wondering when a product like this would come along. Being able to remove background noise from calls sounds pretty great, but it's not a problem I run into too often - most people on my calls are in a meeting room or a quiet place.
But, if it can really fill in missing or distorted chunks then for me that's a killer feature. I'm going to give it a try on my morning check-in call, which is coming up.
I'd love to see a product like this be able to adapt to the voices of individual speakers, and fill in gaps or distortion in their natural voices. I presume that would be more difficult, though, because you'd need a model of each speaker at the output device, so both parties would need to be running Krisp, and you'd somehow have to share the model between them - which if you already have network issues causing dropouts and distortion might not be feasible. Unless it was a side-channel thing, where for regular calls with the same people the voice model is updated after each call, ready for the next one.
Though I'm not sure I love the idea of a model of my voice being constructed and transmitted around. But I think it would be really cool :)
> most people on my calls are in a meeting room or a quiet place
Since I moved to NYC, I think I haven't had one phone call without some obnoxiously loud siren/horn/baby/dog/jack hammer/piano/bros in the background. People must think I listen to the NYC soundtrack on repeat at full volume and make phone calls.
This call typically has decent audio quality, and as it's a stand-up / status call people tend to speak one at a time. No-one today had much in the way of background noise, so I'm not sure there was a lot of opportunity for the app to show off.
That said, I switched the Krisp speaker mode on and off repeatedly while each person was talking.
1. When just one person was talking Krisp didn't seem to ever make the sound worse, which I guess is a start.
2. When there was a conversation going back and forth, or when one person started talking right after someone else finished, the voices got mangled or muted and I couldn't understand them. I had to turn Krisp off.
3. When there was some background echo (like in a large room) or some minor distortion I thought it maybe sounded a bit clearer with Krisp running, but it didn't really make much difference. I could understand the speaker either way.
With the audio problems I mentioned at (2) and no real gain from using Krisp I doubt I would use it regularly, though if I run into a call with bad background noise I might try it again.
I also tried the Krisp microphone, and at one point I had to repeat myself, which doesn't usually happen. But I have no way of knowing whether that was due to me speaking unclearly, or audio issues at the listener's end, or something else. So I don't really have an opinion about the microphone, but as I am in a quiet place anyway I probably wouldn't use it.
It would be nice if there was a single-channel evaluation mode for the speaker. If I could hear the normal audio in my right ear and the Krisp-processed sound in my left, then I would have a better chance of evaluating the performance. I guess if you have a lot of continuous background noise that mode would be redundant, and it should be an obvious improvement switching back and forth.
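To make the left/right comparison idea concrete, here is a minimal sketch of how such an evaluation mode could combine the two streams. Everything in it is hypothetical: `ab_stereo` and `write_stereo_wav` are illustrative names, not part of Krisp, and it assumes both streams are already time-aligned mono lists of 16-bit integer samples.

```python
import struct
import wave


def ab_stereo(raw, processed):
    """Pair up two mono signals as stereo frames: (left=processed, right=raw)."""
    if len(raw) != len(processed):
        raise ValueError("both signals must have the same number of samples")
    return [(p, r) for r, p in zip(raw, processed)]


def write_stereo_wav(path, frames, sample_rate=16000):
    """Write (left, right) int16 frames to a 2-channel WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<hh", left, right)
                                 for left, right in frames))


# Hypothetical sample values: unprocessed audio and a "cleaned" version.
raw = [100, -200, 300, -400]
processed = [90, -180, 310, -390]
frames = ab_stereo(raw, processed)
```

Listening to a file produced this way on headphones gives the processed audio in the left ear and the untouched audio in the right, at the cost of needing both streams aligned to the same sample.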
Such feedback from users is priceless. Thanks for it. Krisp is still in beta and we keep experimenting with different DNN models. For example we have a DNN model which maximally preserves multiple voices while removing noise. It's not shipped yet.
Re #2, one thing we noticed is that conferencing apps themselves will distort the voice when multiple voices are overlapping, especially when there is also noise. There is not much Krisp can do here since the stream it receives is already distorted.
Unfortunately, for Krisp speaker we don't control the audio stream. Imagine how many times the stream gets processed before Krisp speaker gets it (noise cancellation in the headset, noise cancellation in RTC, codec, etc).
Re Krisp microphone, the DNN model used here is more effective since what Krisp receives is "less processed/distorted stream".
Please stay tuned, our release cycle is around 2 weeks. More quality and UX features are under way.
>When there was a conversation going back and forth, or when one person started talking right after someone else finished, the voices got mangled or muted and I couldn't understand them. I had to turn Krisp off.<
This may have nothing to do with Krisp as an app. All phones/communications devices have the ability for either half or full duplex audio. Half duplex means only one caller can be heard at a time. If someone else were to speak, the audio would cut out as you experienced. Full duplex allows for conversations to happen all at once.
So this all could just be a symptom of the feature of the device or service you're using. I would try your test on different devices and/or services.
My first internship was at a company that was doing this over a decade ago, but before machine learning had proliferated to adjacent fields. It's easy to separate out stationary signals like ambient noise, which are easily isolated in the frequency domain. But it's a totally different thing to remove something like a baby's cry, shuffling of papers, or other sharp transients. Blindly separating without a large corpus of inputs + machine learning techniques seems impossible.
I remember as an intern spending hours in front of spectrograms manually deleting noise so that the researchers could get clean targets. Let me tell you, I started being able to identify a lot of phonemes just by visually examining waveforms.
Eventually, the company did pursue some noise cancellation, but only as one part of their offering. I don't think they ever could get the holy grail of separating non-stationary noise.
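To illustrate why stationary noise is the easy case, here is a toy sketch of spectral subtraction, one classic frequency-domain approach (not necessarily what that company or Krisp used). The "voice" is just a sine tone, the noise is a steady hum, the noise profile comes from a noise-only stretch, and the naive DFT is written out only to keep the example dependency-free. All names are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtract(noisy, noise_only):
    """Subtract the noise magnitude per frequency bin, keep the noisy phase."""
    noise_mag = [abs(c) for c in dft(noise_only)]
    cleaned = []
    for c, nm in zip(dft(noisy), noise_mag):
        mag = max(abs(c) - nm, 0.0)  # clamp so magnitudes never go negative
        cleaned.append(cmath.rect(mag, cmath.phase(c)))
    return idft(cleaned)


n = 64
voice = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
hum = [0.5 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
# Noise-only stretch recorded at another time: same hum, different phase.
hum_profile = [0.5 * math.sin(2 * math.pi * 20 * t / n + 1.0) for t in range(n)]

noisy = [v + h for v, h in zip(voice, hum)]
denoised = spectral_subtract(noisy, hum_profile)
```

This works only because the hum has the same spectrum in every frame. A baby's cry or rustling paper has no fixed profile to subtract, which is where this kind of approach breaks down.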
There is a new paper out from Google research that specifically addresses the use case with screaming children in the background - very exciting!
https://www.youtube.com/watch?v=zL6ltnSKf9k
How about a VST plugin? That would give you cross-platform support (Mac/Windows/Linux/iOS) and would allow it to be used as a part of many, many audio processing pipelines - from simple streamers/youtubers all the way to pros.
What does it actually filter on? Is it trained to recognize my voice or just any human voice? I can see you can smartly filter out noisy noise, but can it also filter out nearby conversations like in an office environment?
Yeah, and I have to de-noise a conference I recorded a few days ago before uploading it. A Linux version of such a tool would be of great help, although I will rather try standard approaches first.
Mobile support is harder since Krisp won't be able to register as a virtual mic/speaker on Android or iOS.
At least we haven't found any way around it.
All audio is processed on device, but is the goal to use the public training/learning to tailor a more robust model which they can sell commercially/integrate into apps/phones/etc?
My suspicion is that the founders/engineers will be acqui-hired for a not-insignificant sum by one of the existing big players in the teleconference space (Zoom, Google, Cisco, etc.).
Further, this is probably their goal. In fact, I wouldn't be surprised if Apple bought something like this as system-wide background noise canceling would be a fantastic OS feature.
iPhones already have this feature if you hold your phone up to your ear. There is a microphone in the back of the phone near the camera that phase cancels out background noise from the input from the front mic.
They need this for the AirPods then. Everyone I call while outside in NYC says the microphone is super sensitive and picks up all the surrounding noise.
They also recently added 4 (or 5?) mics on the new iPad Pro so that it could do FaceTime calls while cancelling out the feedback.
Wouldn’t be surprised if a software update makes it possible to use all of those mics for some very good ambient noise cancellation during FaceTime calls.
I'd be a little worried about the feature that fills in missing voice chunks. It sounds ripe for accidentally replacing one lost word with a completely different word that could also make sense in context. Almost like the issue where Xerox copiers would sometimes replace one character with another. Hopefully the filling in of missing chunks is done in a way that doesn't allow it to fill in whole words, but rather just short sub-syllable chunks of audio?
> It sounds ripe for accidentally replacing one lost word with a completely different word that could also make sense in context.
For a while, it seemed like the autocorrection in iOS was deliberately trying to break up me and my then-girlfriend. It even seemed to have a penchant for doing an unfortunate autocorrect just a sliver of a second before my finger hit send on a txt. I finally turned the feature off.
I just tried this over a Zoom call. It worked well but not as well as the samples on their website would make you believe. Background noise was muted during silence periods but not always during speech from the active speaker. This makes it even more annoying because you hear tidbits of background noise only at the same time that the active speaker is speaking. I'm still very impressed though.
As an amateur radio operator, I struggle very often with noise-related issues. Nice tech. I'm used to seeing digital signal processing applied in this field, but never thought about applying machine learning. I'm curious to learn about the technology some day because, based on the solutions I currently know, this seems like magic.
Seems to me that "a virtual microphone [that] sits between your device microphone and Apps" would be a juicy target for attack.
I am /not/ saying Krisp.ai is a trojan or is nefarious, but if I was the NSA or FSB or... something like this would be very interesting to me. Both for infiltration (deliberately malicious) or for exploitation (compromised / exploited at run time).
I have a generic Android phone and its lack of background noise was quite surprising, so either it has active noise canceling or an extremely short-range mic. I suspect most if not all of them already have such functionality.
It would be nice if I could upload audio and have it processed. I have a lot of videos that were captured on an iPhone or something and would love a tool to help clean up the audio track before posting.
There is noise reduction (w/ templates) in Movie & it does a decent job. You can also try out Audacity & gate noise below a certain threshold for crisper audio.
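For anyone unsure what "gating noise below a threshold" means, here is a bare-bones sketch. It is not how Audacity implements its noise gate (real gates add attack/release smoothing); `noise_gate`, the frame size, and the threshold are all illustrative choices.

```python
import math


def noise_gate(samples, threshold, frame_size=4):
    """Silence every frame whose RMS level falls below the threshold."""
    out = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        # Loud frames pass through untouched; quiet frames are muted.
        out.extend(frame if rms >= threshold else [0.0] * len(frame))
    return out


# First four samples: faint background hiss. Last four: speech.
audio = [0.01, -0.02, 0.01, 0.0, 0.5, -0.6, 0.4, -0.5]
gated = noise_gate(audio, threshold=0.1)
```

Note that a gate only cleans up the pauses. Any noise that overlaps the speech passes straight through with it.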
This would be really great for work-at-home people, freelancers, and digital nomads! I work online for a startup and this is exactly the kind of thing I can think we could all use. Another niche that could benefit from this would be online teaching. Most of the teachers of online English teaching companies work from home. For work at home parents (and there are plenty), the mute whining/screaming child feature is a godsend. Fingers crossed, waiting for Windows support...
OMG! This is awesome. I just tested and it works wonderfully!
It just mutes all background noise (white noise, TV sound, toy sound, kids crying, people eating...). Just wonderful.
I've been waiting for something like this, except now I want it to be embedded in my Bluetooth earphones. They all suck at removing background noise when I'm speaking on a call. Will this work well even if I'm speaking into my AirPods?
Something I'd like to see is a voice recorder that only records my own voice. It would be even better if other voices were obliterated and filled over with something like the "voice" noise from Katamari Damacy.
Is there a way to run an existing audio file through this to clean it up? For example, if I have a video of a presentation with lots of background noise like chairs creaking, people coughing, etc.?
Cool! I thought I had a test recording for this, but (un?)fortunately, my workplace sprung for a fancy directional mic that already did a very good job of isolating the speaker's voice.
I still plan to play with the app for voice chat, though.
Sure you can. I don't think anybody can stop you doing that. You just need 2 computers.
On one you play the clip.
On the other you record it.
Then remove the audio from the video and put the new audio onto it.
Yes, of course I can run my audio file through it by actually playing it and recording the result, but I'm asking if there's a version of this tool that takes an input file and writes to an output file on its own.
Just tried the Skype call testing service with and without Krisp while mashing on my keyboard. It did seem to almost eliminate the typing sounds which were rather loud otherwise.
Yes, that's how phone microphones for example work.
To remove static background noise you would just need to get a silent sample and reverse out the polarity of the static noise from the audio signal. It can be done real-time or in post production.
To cancel out random background sounds you would need an omnidirectional mic that picks up everything. Then you'd need a close-range shotgun-type mic, from which you subtract all the noise that the omnidirectional mic picks up.
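Here is the polarity-inversion idea in its most idealized form, as a sketch. It assumes the reference mic hears exactly the noise that reaches the primary mic, sample for sample. `cancel_with_reference` and the sample values are made up for illustration.

```python
def cancel_with_reference(primary, reference, gain=1.0):
    """Add the polarity-inverted reference signal to the primary signal."""
    return [p + (-gain * r) for p, r in zip(primary, reference)]


voice = [0, 3, -2, 5, 1, -4]
noise = [1, 1, -1, 2, -2, 1]

# Primary (close) mic hears voice plus noise; reference mic hears noise only.
primary = [v + n for v, n in zip(voice, noise)]
recovered = cancel_with_reference(primary, noise)

# If the reference arrives one sample late, the cancellation falls apart.
late_noise = [0] + noise[:-1]
residual = [r - v for r, v in zip(cancel_with_reference(primary, late_noise), voice)]
```

The second half shows the catch: even a one-sample misalignment leaves noise behind, which is why practical systems tend to use an adaptive filter (e.g. LMS) rather than a plain subtraction.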
What I am thinking of is more like using signal phase differences on spatially separated similar microphones to isolate the desired source and filter the rest. There is probably a name for this I wouldn't remember. You wouldn't need a specific directional mic and could instead build a synthetic virtual microphone with whatever directional properties you wanted (including distance and not just directional selection).
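The usual name for this is beamforming, and the simplest variant is delay-and-sum. Below is a toy two-microphone sketch under heavy assumptions: arrival delays are whole samples, and the interferer's frequency is picked so that it cancels exactly. Real arrays use more mics and fractional delays. Function names and numbers are illustrative.

```python
import math


def delayed(signal, delay):
    """Return the signal arriving `delay` samples later (zero-padded start)."""
    return [0.0] * delay + signal[:len(signal) - delay]


def delay_and_sum(mic1, mic2, steer_delay):
    """Advance mic2 by steer_delay samples, then average it with mic1."""
    usable = len(mic1) - steer_delay
    return [0.5 * (mic1[n] + mic2[n + steer_delay]) for n in range(usable)]


n = 64
target = [math.sin(2 * math.pi * t / 16) for t in range(n)]
interferer = [math.sin(2 * math.pi * t / 8) for t in range(n)]

# The target reaches mic2 one sample after mic1; the interferer, coming from
# another direction, reaches mic2 five samples after mic1.
mic1 = [s + i for s, i in zip(target, interferer)]
mic2 = [s + i for s, i in zip(delayed(target, 1), delayed(interferer, 5))]

# Steer toward the target by undoing its one-sample delay before summing.
out = delay_and_sum(mic1, mic2, steer_delay=1)
```

After steering, the two copies of the target line up and reinforce each other, while the two copies of the interferer end up half a period apart and cancel. Changing `steer_delay` points the "virtual microphone" somewhere else without moving any hardware.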
Awesome app! Might wanna correct the grammar for the unfurl description though:
<meta name="description" content="Take calls from wherever you want without being embarassed
for a background noise. Get krisp for Mac and use with any conferecing app!">
Not really. The tech used in noise-cancelling headphones is called ANC (active noise cancellation). It eliminates the noise coming to your ear from your surrounding environment.
In contrast, Krisp eliminates the noise going from your environment to the call participants and vice versa.
I've read the front page and FAQ section but I'm still not 100% clear what this product is. Is it an active noise cancelling app that listens to your environment around the laptop and cancels noise out (like Bose QC25/35 headphones)? Serious comment - I think you need to explain this better on your site.