I just want to say that this is really actually kind of mind-blowing from an audio engineering perspective.
Outputting audio from multiple laptops in the same room is easy. Perfectly syncing it is harder. Implementing echo cancellation across all of that is quite a bit trickier than regular single-device echo cancellation.
But then treating all the laptop microphones as a kind of microphone array, having to deal with sync issues and phase issues and background noise issues... that's hard core.
Kudos to the engineering team on this one. This is actually pretty amazing.
Technically very impressive, and meets a real need. My office is still running telecom hardware from a decade ago, all the wireless mics have dead batteries, and management is reluctant to replace any of it since so many meetings are completely virtual: why have custom in-office hardware?
This essentially replaces that expensive proprietary hardware with a matrix of laptops, and every user gets a mic.
As someone who grew up with iPhones coming out around the time I was in middle school, when apps producing noises like this were used as pranks in class, I have to ask that people not do this. The sound is painful.
Yeah I'm pretty sure there are plenty of adults that can hear those frequencies though. It's not like everyone reaches 18 and suddenly loses hearing.
Stupid devices IMO. When they first came out I downloaded a sample audio file from the manufacturer's website to see if I could hear it. I couldn't... because they encoded it as MP3 and the tone was completely filtered out by the encoding! Literally an empty file.
In our region, some people have devices to repel moles or martens using "sounds that are inaudible to humans".
I have yet to come near one of these devices that I can't hear.
And it's not only me, my wife can also hear them, as well as my daughter.
I also know some people who can't hear anything from these devices, but it feels like the statistics about what people can and can't hear are not that up to date.
Agreed. My mother thinks I'm lying. I will admit I feel a little bit special that I can hear her motion activated cat-poop repeller thing.
I visited her recently and wasn't sure if I had finally aged out of my sensitive ears (34 now) or if her batteries needed to be replaced.
FWIW I wonder if most people simply don't experience physical pain from certain sounds, because people seem to be totally fine with sirens, while to me it feels like I'm having a spike pushed into the side of my head.
When I was a teenager, the applause from the end of year school talent show caused me physical pain — enough that the teachers noticed and got me out of the hall.
This no longer seems to be the case, as I'm living right by a major junction and get random full volume sirens at least six times in the average day. I hate them, but they don't hurt.
I used to hear the remotes from TVs, especially old Philips ones and LGs with the single chip in them. That was until I hit 44... after that it's hit or miss, or just my imagination.
I too am in awe of the audio engineering challenges and opportunities here.
But I don't necessarily know that Meet is trying to tackle all this? Are they using the mics as a microphone array & processing signals across phases? Could be missing it but I don't see that they said so. Perhaps they're just picking the loudest mic for a given speaker? Or any of a dozen other simpler tactics?
The current baseline is to manually mute and unmute microphones. So picking the best microphone already sounds like a better idea. If other people make a sound, I think it would be acceptable if that sound was missed/softened.
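For what it's worth, "pick the best microphone" can be as simple as comparing per-frame levels centrally. A minimal sketch of that tactic (my guess at the simplest approach, not anything Meet has described; the 6 dB switching margin is invented for illustration):

    import numpy as np

    SWITCH_MARGIN_DB = 6.0  # hypothetical hysteresis so the selection doesn't flap

    def frame_db(frame: np.ndarray) -> float:
        """RMS level of one audio frame in dBFS."""
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        return 20.0 * np.log10(rms)

    def pick_mic(frames: dict, current) -> str:
        """Pick the device whose frame is loudest, with hysteresis."""
        levels = {dev: frame_db(f) for dev, f in frames.items()}
        loudest = max(levels, key=levels.get)
        if current in levels and levels[loudest] < levels[current] + SWITCH_MARGIN_DB:
            return current  # stay on the current mic unless it's clearly beaten
        return loudest

Anything fancier (mixing several mics, ducking the rest) builds on the same per-device level estimates.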
In a large room, perfect syncing is actually impossible, since different listeners will be far enough from the various speakers to cause, at best, comb filtering and, at worst, audible delays.
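Back-of-the-envelope numbers for that, assuming 343 m/s for the speed of sound and a listener sitting 3 m closer to one laptop than another (both figures chosen only for illustration):

    SPEED_OF_SOUND = 343.0                 # m/s at room temperature
    path_difference_m = 3.0                # assumed for illustration
    delay_s = path_difference_m / SPEED_OF_SOUND         # ~8.7 ms

    # Comb-filter notches sit at odd multiples of 1/(2*delay).
    first_notch_hz = 1.0 / (2.0 * delay_s)               # ~57 Hz
    notch_spacing_hz = 1.0 / delay_s                      # ~114 Hz between notches

    print(f"{delay_s*1e3:.1f} ms delay, first notch at {first_notch_hz:.0f} Hz, "
          f"then every {notch_spacing_hz:.0f} Hz")

Once the mismatch gets up to roughly 30-40 ms, the ear stops fusing the two arrivals and you hear a distinct echo rather than just coloration.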
I assume that if the speakers are not set too loud, then either you're close and the algorithm works, or you're far and the sound is quiet enough that it's not an issue.
Also, 44.1 kHz on one laptop will not exactly equal 44.1 kHz on another, and one device can run at 96 kHz while others run at 44.1 kHz, etc. That means everything has to be dynamically resampled in real time while preserving quality and low latency.
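A toy version of that dynamic resampling, assuming you can estimate the ratio between the sender's actual clock rate and your own (real implementations track this ratio continuously and use much better interpolation than this):

    import numpy as np

    def resample_by_ratio(samples: np.ndarray, ratio: float) -> np.ndarray:
        """Stretch or shrink a mono block by `ratio`, e.g. 1.00005 if the
        sender's nominal 44.1 kHz clock runs 50 ppm fast relative to ours."""
        n_out = int(len(samples) / ratio)
        src_positions = np.arange(n_out) * ratio
        return np.interp(src_positions, np.arange(len(samples)), samples)

    # Example: one 100 ms block from a device whose crystal runs slightly fast.
    block = np.sin(2 * np.pi * 440 * np.arange(4410) / 44100)
    corrected = resample_by_ratio(block, 1.00005)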
I think this is much simpler than what you're suggesting. Careful microphone level management can handle this. No need for audio sync. I know they use the word "sync" but that's a very broad term.
If the distance from the microphone to an "unwanted source" is three times the distance from the microphone to the desired source, phasing likely won't be an issue.
There are always caveats with engineering, but it's a decent rule of thumb assuming equal-volume sources... I can imagine it's not too hard to detect that anyway; we've been able to do realtime FFT for a very long time.
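The arithmetic behind that 3:1 rule of thumb, assuming a point source and free-field inverse-distance falloff:

    import math

    distance_ratio = 3.0
    level_difference_db = 20.0 * math.log10(distance_ratio)
    print(f"{level_difference_db:.1f} dB")   # ~9.5 dB quieter at 3x the distance

Roughly 9-10 dB down is usually enough that the distant pickup's contribution (and its comb filtering) is masked rather than audible.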
No, probably not: there are no phase issues if you just don't transmit the signal. The hard part would be determining who's in the room, then who's talking, and then mixing appropriately to eliminate feedback and optimize speaker sound quality. None of which requires phase-accurate signal synchronization.
If they're actually able to "sync" (again, a poorly defined term) given the problems associated with network latency and different hardware, it would border on magic.
Is a neural network even really necessary? This seems like something where careful application of some normal math would work. Find a loud event, correlate the loud event across devices to get offsets (by sequence ID, not clock time), do some fancy math to apply inverse sine waves or whatnot.
That's not to diminish the accomplishment or say that it's easy (or that I could do it), but I don't think a neural network is necessary here.
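Roughly the "normal math" version of that, sketched with toy signals (this is generic cross-correlation, not anything Meet has documented):

    import numpy as np
    from scipy.signal import correlate, correlation_lags

    fs = 16000
    rng = np.random.default_rng(0)
    clap = rng.standard_normal(256)          # stand-in for a loud, distinctive event
    dev_a = np.concatenate([np.zeros(1000), clap, np.zeros(2000)])
    dev_b = np.concatenate([np.zeros(1437), clap, np.zeros(1563)])  # 437 samples later

    xcorr = correlate(dev_b, dev_a, mode="full")
    lags = correlation_lags(len(dev_b), len(dev_a), mode="full")
    offset_samples = lags[np.argmax(xcorr)]
    print(offset_samples, 1000 * offset_samples / fs, "ms")   # ~437 samples, ~27 ms

The offset comes out in samples within each stream, which sidesteps the two devices' wall clocks entirely.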
I have a very strong feeling this is more marketing BS than a 100%-solved technical achievement. (disclaimer: xoogler, no inside info, just familiar with recent Google and, separately, audio processing)
I agree that this is amazing - but what I don't like about it is the fact that a 3rd party is doing this, when it should really be a built-in feature of the operating system, or at least be implemented as close to the device as possible. From what I can glean from this breathless press release, this functionality requires a fair bit of cloud... anathema to audio professionals, but maybe not so for the professional management classes.
Too many times these kinds of services are wrapped up at the application layer, when really they belong in the operating system. For example, wouldn't this be a perfect thing to implement as a plugin for PulseAudio, or JACK, or even... VST?
(Disclaimer: I work on high end microphone and audio products at a well-known hardware manufacturer of such, where much more effort is being made to make the devices, themselves, smarter ..)
I would honestly try to sync audio output based on a shared time reference, something along the lines of what AES67/Ravenna/Dante does, but you can be a little more lax and use NTP or system time since you don't need to be sample accurate.
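That "lax" output sync could be as little as agreeing on a shared start timestamp and letting each device compensate for its own output latency. A sketch (the 0.5 s lead time and 20 ms latency figure are placeholders):

    import time

    def schedule_playback(agreed_start_epoch: float, output_latency_s: float, play):
        """Block until it's time to hand the buffer to the audio device so that
        sound leaves the speaker at roughly agreed_start_epoch (shared NTP time)."""
        hand_off_time = agreed_start_epoch - output_latency_s
        delay = hand_off_time - time.time()
        if delay > 0:
            time.sleep(delay)   # coarse; real code would busy-wait the last millisecond
        play()                  # e.g. enqueue the buffer with whatever audio API is in use

    # Example: the server tells every participant "start at now + 0.5 s".
    start_at = time.time() + 0.5
    schedule_playback(start_at, output_latency_s=0.020, play=lambda: print("tick"))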
For the microphones that would be a little harder, but it's not that tough: a few high-end manufacturers already ship phased microphone arrays for videoconferencing. You could probably get close, though you need the audio from all the sources in a single location for processing, so you can do phase analysis on it and possibly find an optimal delay for each source by checking the group delay.
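The textbook tool for that kind of phase analysis is GCC-PHAT (generalized cross-correlation with phase transform). A compact version, assuming all streams have already been shipped to one place:

    import numpy as np

    def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
        """Estimated delay of `sig` relative to `ref`, in seconds; a positive
        result means `sig` heard the source later than `ref` did."""
        n = len(sig) + len(ref)
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        cross = SIG * np.conj(REF)
        cross /= np.abs(cross) + 1e-12       # phase transform: keep only the phase
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (int(np.argmax(np.abs(cc))) - max_shift) / fs

Per-pair delays from something like this are what you'd feed into the "optimal delay for each" step.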
The advantage they have is some latency is acceptable and they don't need to do it on a low power device.
I don't see what this has to do with Gemini but maybe that's just marketing...
My money is that outside of a couple of dog-and-pony demos with everyone on one well-administered LAN, you could not make this work with system time and NTP on consumer devices. You will regularly see 100 ms differences in NTP time.
The fact that phased array microphones exist has nothing to do with the point we are discussing, which is audio coherence across heterogeneous devices whose only real connection is a web browser.
I'm thinking more some sort of system with a sync point registered per device and using that as a time reference.
It's not inconceivable that they could easily detect multiple devices in a room and find a sync point based on microphone input from a speaker.
Once you have found a sync point, you can then set a delay on all devices to try to match it. Nobody said this is easy or everyone would be doing it, but it's simple enough.
The phased-array microphone idea is more of a pipe dream, but you would absolutely be able to do something approaching that with multiple devices in a single room, depending on how accurately you can predict microphone location within the room. I'm reasonably sure you could start by just using the closest mic, and then over time, as you improve sync, try to use multiple.
As I said, they get every single audio stream, in and out, on their servers, and they have full control of the audio the tab is playing and the timing of that.
I don't see this being any different to what the likes of Sonos/Google Home/Apple Home etc. are doing with synced appliances for stereo/multichannel playback; it's likely significantly harder because it's heterogeneous devices, as you said.
All that doesn't answer my question of how you would do this at the OS level? You don't have any of the required information per device, only the central server has even the hope of having all the relevant information and control.
We agree that doing it at the OS level is probably the wrong direction. I think you could get there with PNTP and audio hardware support, which is more how Sonos etc. do it afaik, but then again you aren't solving the heterogeneous-device problem.
It is apparently a good example of something that needs performant neural nets in the cloud to solve. At first glance it looks like a low-level hardware-firmware problem. Market conditions prevent solving it at that level though, so we had to wait for the right combination of resources, new signal processing and heavy cloud compute.