This post is on noise cancellation specifically, and it actually has the potential to be a huge step forward.
One of the big audio problems with group meetings is that the background noise from each participant adds up, to a point where it quickly becomes unbearable. For that reason, videoconferencing generally only plays audio from one or two participants at most, using a fairly simple estimation of whichever audio signal is currently loudest. The problem is that this can make it really hard to interrupt (people will literally not hear you), or tell the difference between two people going "mm-hmm" versus the whole group. If you've ever been in a group meeting where everybody applauds something, this is why you see everyone applauding but only hear a smattering.
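As a toy illustration of that loudest-signal heuristic (function and participant names are made up; real systems use smoothed energy estimates and hysteresis rather than a raw per-frame ranking):

```python
def rms_energy(frame):
    """Root-mean-square energy of one audio frame (a list of samples)."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def pick_active_speakers(frames_by_participant, k=2):
    """Keep only the k participants whose current frame is loudest."""
    ranked = sorted(frames_by_participant,
                    key=lambda p: rms_energy(frames_by_participant[p]),
                    reverse=True)
    return ranked[:k]

# Everyone applauds, but only the loudest one or two get played:
frames = {
    "alice": [0.40, -0.50, 0.45],
    "bob":   [0.10, -0.10, 0.12],
    "carol": [0.35, -0.30, 0.30],
}
print(pick_active_speakers(frames, k=2))  # ['alice', 'carol']
```

This is also why applause comes through as a smattering: only whichever one or two streams rank loudest at any given moment survive the cut.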
But if this noise cancellation really succeeds, it could be a huge leap forward because audio cues and overlap will actually work for the first time -- hearing the "mm-hmms", hearing everyone pipe up, and so on. Videoconferencing will feel more like an actual single shared audio environment, rather than the kind of "walkie-talkie" effect it so often feels like now.
I'm really looking forward to this.
This is driving me crazy with Google Meet in these COVID-19 times. Even in a relatively small conference, I have a really hard time interrupting someone to ask a quick question, even when the speaker is expecting interruptions. It's always: "excuse me!"; a delay as the person continues speaking; I stop; the other person says "yes, please ask away"; when I restart my question, the other person has already assumed I changed my mind and continues speaking; repeat ad infinitum. And this is if they even hear me over the audio breaking up.
It's very, very frustrating. If they solve this it would hugely improve quality of life in remote conferencing for me.
When a conference call is made up of people all in the same city on decent internet connections, latency is usually not a big issue.
But when a conference call has people from New York, San Francisco, and Japan on it, even if it's only 3 participants, latency can be bad just because of the speed of light, essentially (on top of what is otherwise reasonable hardware/software latency). Latency may be bad even if you're talking with a colleague in the same city, since the audio is "mixed" on the server, and that server might be across the world if a participant from across the world started the meeting. (Counterintuitively, the latency with your local colleague could be twice as bad as with the colleague from across the world.)
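Back-of-the-envelope, the physics floor is easy to compute: light in fiber travels at roughly two-thirds of c, about 200,000 km/s. A sketch (the distances are rough assumptions):

```python
FIBER_SPEED_KM_S = 200_000  # ~2/3 the speed of light in vacuum

def one_way_ms(distance_km):
    """Minimum propagation delay in milliseconds, ignoring routing and processing."""
    return distance_km / FIBER_SPEED_KM_S * 1000

# New York to Tokyo, very roughly 11,000 km of fiber:
print(round(one_way_ms(11_000)))      # ~55 ms one way, before any other overhead

# Two colleagues in New York whose call is mixed on a Tokyo server:
# the audio goes NY -> Tokyo -> NY, doubling the long haul.
print(round(one_way_ms(11_000) * 2))  # ~110 ms between people in the same city
```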
There are a few things that most meetings could benefit from:
- Having an organizer who's aware of the differences between leading an in-person meeting and a remote meeting.
- Cutting video to save bandwidth if the meeting doesn't absolutely need it (the organizer can usually just disable the function).
- Muting when you're not speaking (by far the best quality-of-life improvement; can be done silently by the organizer if someone is just doing their Vader impersonation throughout the meeting).
- Using the "raise hand" function (again, the organizer plays a huge role here).
- Using the native app instead of the web one, which usually provides better quality and performance.
- Using a wired connection instead of wireless if possible.
- Sometimes even starting meetings at non-standard hours (like 15 to/past the hour), which helps avoid the rush of people logging in at the same time.
The typical restart-all-the-things usually made it go away. But it wasn't unusual for 500ms of latency to slowly build up during a 30min call. Unfortunately I have nothing more useful to add; the issue resolved itself before I could track down a definitive cause.
Stop using it.
Without even looking at your setup, I would bet $100 minimum that it's Bluetooth latency. It adds a lot of latency (500ms is not unusual), and many folks have no idea that all that latency is really just the last 18 inches. This is why you're seeing more and more cases of people using good old iPhone wired earphones for conference calling, especially when Skyping into a TV interview.
How are you measuring? Half that is considered high-end from what I'm reading, and AirPods Pro apparently reduce that to 150ms.
Bluetooth is definitely the cause of 95% of these latency issues.
Crazy that AirPods have that much latency. I have Bluetooth in the car and it has way too much latency; I thought Apple had improved it to the point of not being noticeable, but I never checked the numbers.
Incidentally, for playing midi instruments you generally want things to be below 8ms to feel natural. 150ms is an eternity.
Is the lipsync ok if you watch a video on an iphone with airpods?
For video they compensate for the latency by displaying the visuals slightly delayed (this has been done for over a decade even way back on feature phones).
Even game consoles have the option to do this since some TVs/receivers have audio or video latency.
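That compensation amounts to shifting every video frame's presentation time by the audio path's known latency. A minimal sketch (the function name is invented for illustration):

```python
def delay_video(frame_times_ms, audio_latency_ms):
    """Hold video frames back so they line up with late-arriving audio."""
    return [t + audio_latency_ms for t in frame_times_ms]

# Frames that would naturally display at 0/33/66 ms are held back by the
# headset's reported 150 ms audio latency, so lips and sound stay in sync:
print(delay_video([0, 33, 66], 150))  # [150, 183, 216]
```

The same trick can't help live conversation, of course: you can delay video to match audio, but you can't deliver audio earlier than it arrives.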
One way to solve this is to have the speaker name the person, and then wait until that person speaks. For example, if someone interrupts:
Person A: Excuse me!
Speaker: Yes, Mr. A? [waits]
Person A: What about X?
Person A and B: Excuse me!
Speaker: Yes, Mr. A? Mr. B, I'll come back to you after A. [waits]
Person A: What about X?
Speaker: [Talk about X]. Mr B, you were saying?
Person B: What about Y?
I want to just be able to tap my self-view and have a big icon appear on it or something, so the person currently speaking (and everyone else) can see that I want to say something. Maybe sort these in chronological order so the speaker can see who wanted to talk first?
In theory you could do this with a good chat, but for some reason the chat in Zoom and the others is kind of an afterthought and nobody uses it.
One of the reasons I prefer text based chat is multiple people can talk at the same time without needing to deal with interrupting audio. If you can type well, the bandwidth is higher for group communication (and you get a log).
At least with video you can kind of tell when someone is waiting to speak by seeing their expression. Audio only is worse (but maybe wouldn't be, if you had good intent-to-speak tools built into the app?)
This is a solved problem. Webex has had a hand-raise feature for a decade now. Trivial for Google to just copy.
Which is one good reason to use video. At least with smaller meetings, someone can raise their hand or just look really pained. (Bigger meetings, you probably need to use chat.)
Sadly, the software does not just magically take care of it. Anytime two people talk, a typical echo canceler just starts decimating frequencies until both of them are unintelligible.
Add in a couple of clueless teams who mount a camera/mic against a conference room wall and introduce massive amounts of room echo into the mix, and I'm at the point where a conference call becomes an absolutely mentally exhausting experience of just trying to decipher what is being said. I have no hope of contributing, because I can only hear 2/3 of the syllables, and my brain is running in overdrive trying to turn those back into words. By the time I've figured out what they just said, they're halfway into the next sentence. What a stressful hellscape.
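A crude model of why double-talk is so destructive: a suppressor that attenuates any frequency bin where the far end is active will also delete the near-end speech that shares those bins. (Toy code with made-up numbers; real cancelers use adaptive filters, but they degrade in a similar way during double-talk.)

```python
def suppress_far_end_bins(near_bins, far_bins, threshold=0.1):
    """Zero out near-end frequency bins wherever the far end is active."""
    return [0.0 if abs(far) > threshold else near
            for near, far in zip(near_bins, far_bins)]

# Per-bin magnitudes while both people speak at once:
near = [0.8, 0.5, 0.9, 0.4]  # your voice
far  = [0.0, 0.6, 0.7, 0.0]  # the other speaker, leaking back as echo
print(suppress_far_end_bins(near, far))  # [0.8, 0.0, 0.0, 0.4]
```

Half the near-end bins are gone, which is roughly the "only hearing 2/3 of the syllables" experience described above.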
Ironically, if we had no echo cancellation, it would force everyone to use ear buds, and the average call quality would be a lot better.
I have some screenshots of waveforms showing laptop mic vs headset, and the signal-to-noise ratio with the headset destroys even good noise-cancelling using a laptop mic that's farther away from one's mouth.
I don't know what else I can do.
Participants often don't realize that they're the culprit when somebody else sounds terrible.
Cheaper bluetooth headsets seem to pick up everything around them. Had that issue with a coworker where the headset was worse than using the internal mic.
The biggest and most annoying issue, though, is consistent Bluetooth disconnect/reconnect problems, even on different macOS machines. Latest firmware and such. Pretty sure it's not 2.4GHz interference.
"Modern high-end Bluetooth headsets support AptX, an audio codec compression scheme that offers better sound quality. But AptX is only enabled if it’s supported on both the transmitter and receiver. When using a Bluetooth headset with a PC, it only works if your PC’s hardware and drivers are compatible." (https://www.howtogeek.com/354321/why-bluetooth-headsets-are-...)
Not sure if this applies to a Mac though.
A few different ways it's come in handy for me:
- The Bluetooth speaker I use for music has a tendency to sporadically sound super hollow. Turns out it has a mic built in, and the voodoo of Mac's bluetooth stack would decide at random when it connected whether it would go into audio-only mode (and use the higher-bandwidth AAC codec) or go into audio+mic mode (splits the available bandwidth between the two and as a result uses a lower bitrate audio codec to compensate for the bandwidth drop). Used ToothFairy to always force that device into audio-only mode.
- After the above discovery, tested doing the same with my actual headset, and leveraging the built-in mic for input. For call audio, it's pretty erratic on whether it'll have any impact at all, and depends heavily on the circumstances of the call itself. Sometimes the audio is massively better, but most of the time the audio is already degraded when it gets to my machine and the bluetooth improvement is moot. That said, makes music in between calls far more pleasant.
- My bluetooth mouse is particularly susceptible to that consistent disconnect/reconnect issue you mentioned. ToothFairy can create a menubar icon for individual devices, which helps to act as a quick sanity check to see if my mouse has disconnected. ToothFairy can also run a shell script on disconnect, which has been handy. At this point I have it trigger a system notification so I'm at least immediately made aware of it, check my idle time in case it was the mouse going into sleep mode from inactivity, then conditionally leverage blueutil to look for the device and reconnect if found (forcefully restarting the bluetooth stack in the process if it has issues). Doesn't fix whatever the root cause is for that consistent disconnect/reconnect issue, but this duct tape re-establishes a connection far more quickly when it happens, making the issue itself significantly less disruptive.
I only stumbled on it via Setapp and tried it on a whim, but it's definitely one of the more handy utility apps I've found and well worth the $5 App Store price for anyone that has similar bluetooth frustrations with their Mac.
Then I realized I'd moved to an apartment complex. I did a wifi scan, and found over fifty competing SSIDs.
Switched to ethernet, and the improvement was night and day.
No matter which software you use, some people will be at an advantage simply due to their isp/wired connection/wired mic.
This could be a business idea though: Conferencing software which equalizes everyone’s latency.
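The core of such a product would be tiny: measure each participant's one-way latency and pad everyone else's playback up to the slowest link. A sketch under that assumption (the obvious trade-off being that everyone gets the worst latency):

```python
def equalizing_delays(latency_ms):
    """Extra playback delay per participant so everyone hears at the same lag."""
    worst = max(latency_ms.values())
    return {name: worst - lat for name, lat in latency_ms.items()}

print(equalizing_delays({"ny": 20, "sf": 45, "tokyo": 140}))
# {'ny': 120, 'sf': 95, 'tokyo': 0}
```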
Or is it just worse?
Discord users are more likely to have a dedicated fast internet connection, and Discord doesn't seem to care about profitability at the moment.
It's just the difference between designing for a 100/10 connection to yourself vs. sharing a 100/100 connection with 20 other people. Zoom degrades reasonably gracefully on choppy/slow connections while Discord becomes straight-up unusable.
The problem is that Google Meet (or my connection, or whatever technical reason) wasn't up to the task. This has happened enough times that I dread interrupting now. Sadly, one person monologuing is not how face-to-face meetings really work.
People just don't care about quality; they will gladly use the crappiest mic, with all the noise, all day.
But people shouldn't take "workspace" abuse from others. Be polite, assertive, constructive. Give them credit for taking the time to care a bit about this problem, offer to debug with them later, try to demonstrate it - switch to speaker, capture what you hear. Alternatively just switch off voice and type. Tell them that the sound is garbage for you so you switched for typing. Try to do quick 1-on-1s instead of the group call, even offer to write a summary after.
This is basically the equivalent of constantly not giving a fuck about how loud one is in an open office. There ought to be proper channels [ha, no pun intended] to address and solve these.
Mute your or participants' background noise in any communication app
I have noticed this and I hate it. It makes normal conversation absolutely impossible.
Discord, which is an audio-first product, is much better than other solutions in this regard, and their video conferencing, while new, has been very enjoyable to use.
I think if it was dynamic, where turning my head towards the person speaking balanced the audio (like in real life), I would not have a problem with it. A super simple form of virtual reality that would only require a simple head-mounted gyroscope or motion sensor.
Another podcast I listen to has two people with very similar voices, and I sometimes have a hard time figuring out who's speaking, so I welcome any advancements in this space.
Just use proper equipment. A headset is an absolute must. The next step is software that only transmits when someone is talking. Gamers figured this out decades ago. Just look at Mumble, TeamSpeak, Discord. They know.
People without proper headsets in that environment get ignored after a while. Nobody wants to use brainpower to understand what you are saying. Corporate might be harder, but you also get paid for that.
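The transmit-only-while-talking behavior those gaming tools use boils down to an energy gate with a short hold time so word endings aren't clipped. A rough sketch (real voice activity detection is considerably smarter than a raw level threshold):

```python
def gate_frames(frames, threshold=0.05, hangover=2):
    """Transmit a frame only if it, or a recent frame, exceeds the threshold."""
    out, open_for = [], 0
    for frame in frames:
        if max(abs(s) for s in frame) > threshold:
            open_for = hangover + 1          # (re)open the gate
        open_for = max(open_for - 1, 0)
        out.append(frame if open_for > 0 else None)  # None = nothing sent
    return out

frames = [[0.3, 0.2], [0.01, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(gate_frames(frames))
# [[0.3, 0.2], [0.01, 0.0], None, None] -- silence simply isn't transmitted
```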
We pretty much take echo cancellation for granted at this point. Using something better than your laptop microphone on a call is still a good idea but I'm not sure that wearing headphones/earphones is that big a deal at this point.
You don't need to go back very far to a time when speakerphones, other than very expensive Polycoms and the like, were pretty mediocre and cut out because of echo.
I think you can apply it not only to your microphone (outgoing audio), but also to the other participants in the meeting (incoming audio).
It seems that what's old is new again ...
The only thing car-kits seemed to do was add minimum cut-offs before transmitting and make use of directional microphones.
It's more practical than a touch bar, at least.
This is a big issue with hearing aids. The whole industry is focused on optimizing for voice intelligibility and as a musician you end up doing trial-and-error with the audiologist to turn all that stuff off.
We need more open source hearing aids - I've read of a few but they're not mainstream.
About music, this is getting much better in hearing aids. I've been from analog through digital over 15+ years of hearing aids, and my latest (3 months ago) pair from Phonak (no affiliation) is an honest leap forward. It has a built-in Music profile that disables all sound optimizations in general, while still attempting to correct the hearing ranges you have a deficit in. I was on the verge of no longer being able to hear with hearing aids; that has probably been extended by 3-5 years with these new models. At that point I will be approaching cochlear-implant-level hearing loss. I happily embrace my cyborg future!
On top of Music, I have a Walking profile that attempts to focus on the person walking to the left or right of me and can pick which side on the fly. And they make great earplugs when things are loud.
The Normal program auto-magically selects between 8-ish profiles to pick the best one for the environment. And it has finally gotten it right. With older models I would daily need to force it into the best mode because it guessed wrong; the latest model I only have to tell what to do once every few weeks.
And to the original topic, noise cancellation, hearing aids bluetooth'ed to the phone/PC for conference calls is hands down the best possible audio experience. Built in noise cancellation, amazing microphones that can be used for your voice portion of the call, tuned to your hearing, with some of the finest sound output possible. Just amazing. These things are so good these days that they are finally being labeled as assistive devices for people without hearing loss. They can give someone with normal range hearing essentially bionic hearing. Tinnitus? They play customized white noise to make the ringing less noticeable. Doesn't help everyone, but it's really nice for me. I hear more ringing when I take my aids out.
Oh, and it does all of this on a device that fits in your ear with a battery the size of a few grains of rice and all in a few milliseconds so your brain sees the mouth move at the same time it actually hears the audio.
Again, get them if you need them.
I just got a similar model I assume (M90-R) and it's definitely not switching to music mode automatically when I play music. (Maybe it's different for listening.) I just had the audiologist add a music mode that I can switch to manually, but getting acceptable timbre for the instruments I play (accordion, melodica, and piano) is work in progress. Making an expensive instrument sound like cheap trash is disappointing, though of course I can take them out.
Having Bluetooth is nice, particularly for phone calls, but I find the sound quality is unsatisfying for listening to music, so it won't be replacing speakers for me.
Not to mention the screensharing is infinitely better as well. It's pretty pathetic of the business apps; we went through a day where I was trying to screenshare something and my remote coworkers kept complaining of lag, blurriness, or the app would just crash (Slack). We went through MS Teams, Zoom, Slack, and Google Meet. All had issues. I convinced everyone to install Discord and suddenly I was able to share my desktop perfectly at 1080p, without noticeable lag and with crystal clear audio.
It's still better than food noises, but I have noticed that as a disadvantage.
Discord's lack of lag in audio makes a huge difference for voice comms. I've only used it for gaming, but you can really tell the difference when you switch to the game's built-in voice chat feature, which probably has a third of a second of latency. And of course Zoom et al. have a lot more lag, and it really hurts the experience. In addition to low latency, the sound quality is also very good.
Discord can and does log all messages through the system, and has many internal tools that operate on the plaintext. Anything you communicate through Discord you should assume any/all Discord staff may read.
They claim that the voice comms are e2e but there are no further details available (like where the keys are generated).
I'm guessing they mean to add the feature set to standard headphones. Leveraging say the laptop microphone to provide active noise canceling to someone with a standard set of earbuds.
In every video conf I've been, you can instantly tell when "one of them" who can't be bothered to mute themselves joins. The audio quality immediately goes down the drain. It's always the same subset of people who do it, too. As soon as they're enjoined to please mute, the audio quality is restored.
No amount of magic signal processing will ever match it.
While perhaps misguided to use it that way, the mute button thus acts as a social-cluelessness meter.
Then the policy is that they cannot join unless they use the headset.
Plus I'd hate to be the intern that has to sit and hold the space bar while the boss delivers a presentation
The issue I find a bigger problem is lag causing people to talk over one another. I've been on a lot of calls where the call quality was fine but conversations were difficult because it was hard to judge when the other person had stopped talking.
From a technical point of view, that is really the best thing. It works, and sometimes it's the only thing that works.
But if you try to get people actually do it, you run into problems:
(1) They don't realize it's them. AFAIK the system doesn't play their audio back to them, so while everyone else hears the noise, they don't. The one person who needs to take action is the one person who doesn't know action is necessary.
(2) They are distracted. When their spouse is talking, they are focused on whatever their spouse is saying, not on how it affects the meeting audio. Or the meeting is boring and they're not paying attention.
(3) They just don't care enough. They are there to attend a meeting, not fiddle with computer stuff. Some people will never take the time to learn where the mute button is in the software.
Perhaps #1 could be improved, though, with some kind of blindingly obvious indicator in the UI. If "YOUR MIC IS WHAT EVERYONE IS HEARING RIGHT NOW" flashes when your mic takes the floor, maybe you'd notice it lighting up when you didn't intend for it to.
For those wondering, unmuting is a privileged operation that only the user could do themselves.
Running this economically on servers at scale in realtime, I consider this very impressive. I can't say how it compares with RTX, but I wonder if it has anything to do with the amount of computing resources that can be dedicated to it: a single expensive card dedicated to one audio stream, versus a single Google server that needs to process hundreds (thousands?) of audio streams.
Given how common voice communication is in our world, I am sure Google can build ASICs for this (if not just run it on TPUs), and get the marginal cost of vocal processing to be negligible.
Heck, they probably would just need to divert <5% of the resources of Fuchsia or any other of their "senior engineer retention" projects.
The only way this could happen is if someone is standing 3 feet away from you and does it while you talk, which would just be rude and would probably be stopped by you immediately anyway.
I'm more curious how this could work in the metro or with a washing machine nearby
Unfortunately, you need a $400 graphics card, and >100 watts of power to run RTX...
Two repeated notes and the noise cancellation just immediately shuts you down... we've been using Zoom, and luckily you can turn all the audio processing off if you go into "Advanced" and enable "Turn on original audio".
Unfortunately for them, they decided to lie on the bed in the hotel and the jet lag hit them pretty hard. Next thing you know they were asleep and started snoring, fairly loudly I guess, and everyone on the call could hear. So the people on the call spent some time trying to figure out who the snoring person was, going through all the attendees. Eventually they figured out who it was and started yelling, trying to wake them up, which they did after a while. Needless to say my coworker was very embarrassed about the incident at the time, but it did make a good story to tell people :)
For large meetings, organizers can enable a single-talker mode. Holding the talk button puts you in a queue. Your screen indicates when it's your turn to talk. This prevents folks from talking over each other. This eliminates echo by muting the talker's speakers while recording their voice. Also, attendees see the current talker, not the person whose dog just barked.
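The queueing logic described there is simple to sketch (class and method names are invented for illustration):

```python
from collections import deque

class TalkQueue:
    """Single-talker mode: holding the talk button queues you for the floor."""

    def __init__(self):
        self.waiting = deque()
        self.current = None

    def request_floor(self, who):
        """Press the talk button; returns whoever currently holds the floor."""
        if who != self.current and who not in self.waiting:
            self.waiting.append(who)
        if self.current is None:
            self.current = self.waiting.popleft()
        return self.current

    def release_floor(self):
        """Current talker finishes; the next person in line gets the floor."""
        self.current = self.waiting.popleft() if self.waiting else None
        return self.current

q = TalkQueue()
q.request_floor("alice")   # alice gets the floor immediately
q.request_floor("bob")     # bob is queued behind her
print(q.current)           # alice
print(q.release_floor())   # bob
```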
Yes, we can do VR worlds that still look like Second Life, but while we are working on fixing that, we might solve for near term things to improve interactivity.
Well, and solve the "one person speaks, all must listen, lag for response, loop" which I find is similar to how morse code discussions work.
My biggest issue (when I worked in videoconferencing) was echoing, and locking onto the delay window where echoes could occur. Depending on the distance from a conference room speaker to all the walls, echoes could occur at one or more offsets (appear at microphone input with some delay after presenting at the speaker). And ambient noises could masquerade as echoes. The filters tend to be IIR filters, and get wound up easily. It was awful.
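Locking onto that delay window is usually done by correlating the signal sent to the speaker against what comes back in on the microphone, and picking the lag that matches best. A bare-bones sketch with made-up sample values (real cancelers track this adaptively, which is exactly where the ambient-noise confusion creeps in):

```python
def estimate_echo_delay(reference, mic, max_lag):
    """Lag (in samples) at which the mic signal best matches the reference."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(r * m for r, m in zip(reference, mic[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

ref = [0.0, 1.0, -1.0, 0.5, 0.0, 0.0, 0.0]      # played out the speaker
mic = [0.0, 0.0, 0.0, 0.5, -0.5, 0.25, 0.0]     # heard back: delayed, quieter
print(estimate_echo_delay(ref, mic, max_lag=4))  # 2
```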
Edit: I had only watched the video. The article does indeed contain a lot more detail.
Can't wait to see what the Google Duo team will come up with in response. I mean, we saw the blog post on their great new video codec (AOMedia Video 1 was it?) but I personally felt it left much to be desired.
What happened to the Hangout guys? Are they still in this one? Product middle management wants to be wooed.
I messed around for a few months with speech enhancement last year and didn't really get anywhere beyond sort-of-reproducing a few existing models: https://github.com/MattSegal/speech-enhancement
All the published "state of the art" examples I could find were pretty crap, whereas Krisp AI were doing much better than what I've seen released publicly.
I don't know if it's a combination of her cheap hardware or what, but it's... odd.
Meetings should be for review/discussion and decision making not vocal exercise and grandstanding.
It would also be great if meeting providers had a dial showing current latency for all participants, to make it easier to interject.
Lastly, I do recommend using meeting tools that have features like letting you vote, raise a hand, and chat, all in sync with the main voice. It will make life easier for meeting moderators... And if you don't use moderators, then start the practice of doing so; the quality of meetings will improve hugely.
Edit: apparently can also remove kids crying; just not included in demo
Just implement push to talk with mute-by-default. 90% of the audio issues would be resolved. Another 5% could be solved by buying everyone a decent headset which hopefully has a push-to-talk button on it as well.
You don't always get to choose if background noise is present.
Also, you just asked that people push a button, and wear a headset. That is a lot, and this is about lowering the bar needed to get a good experience.
Comparatively, I was impressed that we could even have a Meet without everyone needing to be on mute.
Maybe you could also use this personal model to hide very short network interruptions. Other party could use this model to constantly predict my next piece of audio and switch to prediction in case packet is lost.
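A degenerate version of that idea already exists as packet loss concealment: when a packet goes missing, fill the gap from recent audio rather than playing silence. The sketch below uses repeat-and-fade as a stand-in "predictor"; the personal-model version would swap in a learned one:

```python
def conceal_losses(packets, fade=0.5):
    """Replace lost packets (None) with a progressively faded copy of the last one."""
    out, last = [], None
    for pkt in packets:
        if pkt is not None:
            last = pkt                       # good packet: play and remember it
            out.append(pkt)
        elif last is not None:
            last = [s * fade for s in last]  # lost: fade the prediction each repeat
            out.append(last)
        else:
            out.append([])                   # nothing to predict from yet
    return out

stream = [[0.4, -0.4], None, None, [0.1, 0.1]]
print(conceal_losses(stream))
# [[0.4, -0.4], [0.2, -0.2], [0.1, -0.1], [0.1, 0.1]]
```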
The difference is night and day based on some of the recordings I've heard.
Oh good. Meet is already a huge battery-hog on my laptop, so adding fancy signal processing client-side was worrying me.
Key Takeaway: It's fine to not be 100% accurate; roll it out and learn.
- Approval from execs
- Data -> Learning -> Training -> Variability -> Training -> Tuning
- Privacy matters (for all: digitally educated and uneducated)
- What & whys of UX -- ultimately what the user says
- Definitely cloud -- it's the 21st century
- Optimized for speed and cost (a bit irrelevant if I am Google ;) )
- Release (with presentation) -- timing matters
- Feedback (with permission)
A summary, by the PM, of the "denoiser" (not "noise cancellation"; don't want to get ranted at by the data science folks) feature of Google Meet. Applies to any such feature.
> When you’re on a Google Meet call, your voice is sent from your device to a Google datacenter, where it goes through the machine learning model on the TPU, gets reencrypted, and is then sent back to the meeting.
Obviously there is a theoretical basis to what you're saying, but somewhere along the way, end-to-end, it's problematic.
What does this mean?
Seems like over-engineering. The issue is either with the microphone, with the hi-def stuff or something else.
Normal phones never had a hint of a problem, so I'm really confused about why computers have this issue.
Better for them to just point their laptop microphone at the whole room and let the poor saps at the other end suffer.
The headset at least still has the benefit of isolating the conference output to my ears, so that the laptop mic doesn't pick it up, but it would be nice to not be tethered to the laptop and to be able to pace the room.
Especially if the other end was not on a headset.
Very poorly. Of all the available alternatives (Zoom, Skype, FaceTime), Google Meet seems to have the worst audio _and_ video quality. This is inexplicable for a company very easily capable of technological and product leadership in both of those things.
Shouldn't the title be "How the _new_ Google Meet noise cancellation works" then?
We somehow have this sexist social expectation that women who show their feelings (crying, screaming) are "hysterical" (really a nasty word) and not taken seriously. If so, men screaming should be equally considered a sign of immaturity and lack of self-control.
Also could help with customers ("Sorry, I can't hear you!").