I just want to say that this is really actually kind of mind-blowing from an audio engineering perspective.
Outputting audio from multiple laptops in the same room is easy. Perfectly syncing it is harder. Implementing echo cancellation across all of that is quite a bit trickier than regular single-device echo cancellation.
But then treating all the laptop microphones as a kind of microphone array, having to deal with sync issues and phase issues and background noise issues... that's hard core.
Kudos to the engineering team on this one. This is actually pretty amazing.
Technically very impressive, and meets a real need. My office is still running telecom hardware from a decade ago, all the wireless mics have dead batteries, and management is reluctant to replace any of it since so many meetings are completely virtual: why have custom in-office hardware?
This essentially replaces that expensive proprietary hardware with a matrix of laptops, and every user gets a mic.
As someone who grew up with iPhones coming out around the time I was in middle school, when apps producing noises like this were used as pranks in class, I have to ask that people not do this. The sound is painful.
Yeah I'm pretty sure there are plenty of adults that can hear those frequencies though. It's not like everyone reaches 18 and suddenly loses hearing.
Stupid devices IMO. When they first came out I downloaded a sample audio file from the manufacturer's website to see if I could hear it. I couldn't... because they encoded it as MP3 and the tone was completely filtered out by the encoding! Literally an empty file.
In our region, some people have devices to repel moles or martens using "sounds that are inaudible to humans".
I have yet to come near one of these devices that I can't hear.
And it's not only me, my wife can also hear them, as well as my daughter.
I also know some people who can't hear anything from these devices, but it feels like the statistics about what people can and can't hear are not that up to date.
Agreed. My mother thinks I'm lying. I will admit I feel a little bit special that I can hear her motion activated cat-poop repeller thing.
I visited her recently and wasn't sure if I had finally aged out of my sensitive ears (34 now) or if her batteries needed to be replaced.
FWIW I wonder if most people simply don't experience physical pain from certain sounds, because people seem to be totally fine with sirens, while to me it feels like I'm having a spike pushed into the side of my head.
When I was a teenager, the applause from the end of year school talent show caused me physical pain — enough that the teachers noticed and got me out of the hall.
This no longer seems to be the case, as I'm living right by a major junction and get random full volume sirens at least six times in the average day. I hate them, but they don't hurt.
I used to hear the remotes from TVs, especially old Philips ones and LGs with the single chip in them. That was until I hit 44... after that it's hit or miss, or just my imagination.
I too am in awe of the audio engineering challenges and opportunities here.
But I don't necessarily know that Meet is trying to tackle all this? Are they using the mics as a microphone array & processing signals across phases? Could be missing it but I don't see that they said so. Perhaps they're just picking the loudest mic for a given speaker? Or any of a dozen other simpler tactics?
The current baseline is to manually mute and unmute microphones. So picking the best microphone already sounds like a better idea. If other people make a sound, I think it would be acceptable if that sound was missed/softened.
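For what it's worth, "pick the best microphone" can be as simple as comparing per-frame levels centrally. A minimal sketch of that tactic (my guess at the simplest approach, not anything Meet has described; the 6 dB switching margin is invented for illustration):

    import numpy as np

    SWITCH_MARGIN_DB = 6.0  # hypothetical hysteresis so the selection doesn't flap

    def frame_db(frame: np.ndarray) -> float:
        """RMS level of one audio frame in dBFS."""
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        return 20.0 * np.log10(rms)

    def pick_mic(frames: dict, current) -> str:
        """Pick the device whose frame is loudest, with hysteresis."""
        levels = {dev: frame_db(f) for dev, f in frames.items()}
        loudest = max(levels, key=levels.get)
        if current in levels and levels[loudest] < levels[current] + SWITCH_MARGIN_DB:
            return current  # stay on the current mic unless it's clearly beaten
        return loudest

Anything fancier (mixing several mics, ducking the rest) builds on the same per-device level estimates.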
In a large room, perfect syncing is actually impossible, since different listeners will be far enough from the various speakers to cause, at best, comb filtering and, at worst, audible delays.
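Back-of-the-envelope numbers for that, assuming 343 m/s for the speed of sound and a listener sitting 3 m closer to one laptop than another (both figures chosen only for illustration):

    SPEED_OF_SOUND = 343.0                 # m/s at room temperature
    path_difference_m = 3.0                # assumed for illustration
    delay_s = path_difference_m / SPEED_OF_SOUND         # ~8.7 ms

    # Comb-filter notches sit at odd multiples of 1/(2*delay).
    first_notch_hz = 1.0 / (2.0 * delay_s)               # ~57 Hz
    notch_spacing_hz = 1.0 / delay_s                      # ~114 Hz between notches

    print(f"{delay_s*1e3:.1f} ms delay, first notch at {first_notch_hz:.0f} Hz, "
          f"then every {notch_spacing_hz:.0f} Hz")

Once the mismatch gets up to roughly 30-40 ms, the ear stops fusing the two arrivals and you hear a distinct echo rather than just coloration.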
I assume that if the speakers are not set too loud, then either you're close and the algorithm works, or you're far and the sound is quiet enough that it's not an issue.
Also, 44.1 kHz on one laptop will not exactly equal 44.1 kHz on another, and one device can run at 96 kHz while others run at 44.1 kHz, etc. That means everything has to be dynamically resampled in real time while preserving quality and low latency.
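A toy version of that dynamic resampling, assuming you can estimate the ratio between the sender's actual clock rate and your own (real implementations track this ratio continuously and use much better interpolation than this):

    import numpy as np

    def resample_by_ratio(samples: np.ndarray, ratio: float) -> np.ndarray:
        """Stretch or shrink a mono block by `ratio`, e.g. 1.00005 if the
        sender's nominal 44.1 kHz clock runs 50 ppm fast relative to ours."""
        n_out = int(len(samples) / ratio)
        src_positions = np.arange(n_out) * ratio
        return np.interp(src_positions, np.arange(len(samples)), samples)

    # Example: one 100 ms block from a device whose crystal runs slightly fast.
    block = np.sin(2 * np.pi * 440 * np.arange(4410) / 44100)
    corrected = resample_by_ratio(block, 1.00005)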
I think this is much simpler than what you're suggesting. Careful microphone level management can handle this. No need for audio sync. I know they use the word "sync" but that's a very broad term.
If the distance from the microphone to an "unwanted source" is three times the distance from the microphone to the desired source, phasing likely won't be an issue.
There are always caveats with engineering, but it's a decent rule of thumb assuming equal-volume sources... I can imagine it's not too hard to detect that anyway; we've been able to do realtime FFT for a very long time.
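The arithmetic behind that 3:1 rule of thumb, assuming a point source and free-field inverse-distance falloff:

    import math

    distance_ratio = 3.0
    level_difference_db = 20.0 * math.log10(distance_ratio)
    print(f"{level_difference_db:.1f} dB")   # ~9.5 dB quieter at 3x the distance

Roughly 9-10 dB down is usually enough that the distant pickup's contribution (and its comb filtering) is masked rather than audible.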
No, probably not: there are no phase issues if you just don't transmit the signal. The hard part would be determining who's in the room, then who's talking, and then mixing appropriately to eliminate feedback and optimize speaker sound quality. None of which requires phase-accurate signal synchronization.
If they're actually able to "sync" (again, a poorly defined term) given the problems associated with network latency and different hardware, it would border on magic.
Is a neural network even really necessary? This seems like something where careful application of some normal math would work. Find a loud event, correlate the loud event across devices to get offsets (by sequence ID, not clock time), do some fancy math to apply inverse sine waves or whatnot.
That's not to diminish the accomplishment or say that it's easy (or that I could do it), but I don't think a neural network is necessary here.
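Roughly the "normal math" version of that, sketched with toy signals (this is generic cross-correlation, not anything Meet has documented):

    import numpy as np
    from scipy.signal import correlate, correlation_lags

    fs = 16000
    rng = np.random.default_rng(0)
    clap = rng.standard_normal(256)          # stand-in for a loud, distinctive event
    dev_a = np.concatenate([np.zeros(1000), clap, np.zeros(2000)])
    dev_b = np.concatenate([np.zeros(1437), clap, np.zeros(1563)])  # 437 samples later

    xcorr = correlate(dev_b, dev_a, mode="full")
    lags = correlation_lags(len(dev_b), len(dev_a), mode="full")
    offset_samples = lags[np.argmax(xcorr)]
    print(offset_samples, 1000 * offset_samples / fs, "ms")   # ~437 samples, ~27 ms

The offset comes out in samples within each stream, which sidesteps the two devices' wall clocks entirely.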
I have a very strong feeling this is more marketing BS than a 100%-solved technical achievement. (disclaimer: xoogler, no inside info, just familiar with recent Google and, separately, audio processing)
I agree that this is amazing - but what I don't like about it is the fact that a 3rd party is doing this, when it should really be a built-in feature of the operating system, or at least be implemented as close to the device as possible. From what I can glean from this breathless press release, this functionality requires a fair bit of cloud... anathema to audio professionals, but maybe not so for the professional management classes.
Too many times these kinds of services are wrapped up at the application layer, when really they belong in the operating system. For example, wouldn't this be a perfect thing to implement as a plugin for PulseAudio, or JACK, or even... VST?
(Disclaimer: I work on high end microphone and audio products at a well-known hardware manufacturer of such, where much more effort is being made to make the devices, themselves, smarter ..)
I would honestly try to sync audio output based on a shared time reference, something along the lines of what AES67/Ravenna/Dante does, but you can be a little more lax and use NTP or system time since you don't need to be sample accurate.
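That "lax" output sync could be as little as agreeing on a shared start timestamp and letting each device compensate for its own output latency. A sketch (the 0.5 s lead time and 20 ms latency figure are placeholders):

    import time

    def schedule_playback(agreed_start_epoch: float, output_latency_s: float, play):
        """Block until it's time to hand the buffer to the audio device so that
        sound leaves the speaker at roughly agreed_start_epoch (shared NTP time)."""
        hand_off_time = agreed_start_epoch - output_latency_s
        delay = hand_off_time - time.time()
        if delay > 0:
            time.sleep(delay)   # coarse; real code would busy-wait the last millisecond
        play()                  # e.g. enqueue the buffer with whatever audio API is in use

    # Example: the server tells every participant "start at now + 0.5 s".
    start_at = time.time() + 0.5
    schedule_playback(start_at, output_latency_s=0.020, play=lambda: print("tick"))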
For the microphones that would be a little harder, but it's not that tough: a few high-end manufacturers already ship phased microphone arrays for videoconferencing. You could probably get close, though you need the audio from all the sources in a single location for processing, so you can do phase analysis on it and possibly find an optimal delay for each source by checking the group delay.
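The textbook tool for that kind of phase analysis is GCC-PHAT (generalized cross-correlation with phase transform). A compact version, assuming all streams have already been shipped to one place:

    import numpy as np

    def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
        """Estimated delay of `sig` relative to `ref`, in seconds; a positive
        result means `sig` heard the source later than `ref` did."""
        n = len(sig) + len(ref)
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        cross = SIG * np.conj(REF)
        cross /= np.abs(cross) + 1e-12       # phase transform: keep only the phase
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (int(np.argmax(np.abs(cc))) - max_shift) / fs

Per-pair delays from something like this are what you'd feed into the "optimal delay for each" step.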
The advantage they have is some latency is acceptable and they don't need to do it on a low power device.
I don't see what this has to do with Gemini but maybe that's just marketing...
My money is that outside of a couple of dog-and-pony demos with everyone on one well-administered LAN, you could not make this work with system time and NTP on consumer devices. You will regularly see 100 ms differences in NTP time.
The fact that phased array microphones exist has nothing to do with the point we are discussing, which is audio coherence across heterogeneous devices whose only real connection is a web browser.
I'm thinking more some sort of system with a sync point registered per device and using that as a time reference.
It's not inconceivable that they could easily detect multiple devices in a room and find a sync point based on microphone input from a speaker.
Once you have found a sync point, you can then set a delay on all devices to try to match it. Nobody said this is easy or everyone would be doing it, but it's simple enough.
The phased-array microphone idea is more of a pipe dream, but you would absolutely be able to do something approaching that with multiple devices in a single room, depending on how accurately you can predict microphone location within the room. I'm reasonably sure you could start by just using the closest mic, and then over time, as you improve sync, try to use multiple.
As I said, they get every single audio stream, in and out, on their servers, and they have full control of the audio the tab is playing and the timing of that.
I don't see this being any different to what the likes of Sonos/Google Home/Apple Home etc. are doing with synced appliances for stereo/multichannel playback; it's likely significantly harder because it's heterogeneous devices, as you said.
All that doesn't answer my question of how you would do this at the OS level? You don't have any of the required information per device, only the central server has even the hope of having all the relevant information and control.
We agree that doing it at the OS level is probably the wrong direction. I think you could get there with PNTP and audio hardware support, which is more how Sonos etc. do it afaik, but then again you aren't solving the heterogeneous-device problem.
It is apparently a good example of something that needs performant neural nets in the cloud to solve. At first glance it looks like a low-level hardware-firmware problem. Market conditions prevent solving it at that level though, so we had to wait for the right combination of resources, new signal processing and heavy cloud compute.