
Creating ad hoc microphone arrays from personal devices (2019) - tomstokes
https://www.microsoft.com/en-us/research/blog/bring-your-phones-to-the-conference-table-creating-ad-hoc-microphone-arrays-from-personal-devices/?ocid=msr_blog_meettrans_interspeech_tw
======
crazygringo
This is a really interesting technical concept.

Capturing high-quality audio in a meeting room for videoconferencing is a
notoriously complicated problem.

Microphones are crazy sensitive and pick up things like footsteps and
conversations outside the door, shuffling feet and tapping on keyboards, and
construction and HVAC noise like you wouldn't believe.

So filtering those things out, and _then_ capturing the best quality audio
from the current speaker, _and_ trying to get everyone's voice at roughly the
same volume whether they're sitting directly across from the microphone or are
piping up from the corner of the room...

...and do this all while cancelling 100% of the echo that might be coming from
two or three speakers at once...

...it's an insanely hard problem. Beamforming microphones absolutely help in a
huge way, because if you know the speaker's voice is coming from 45° then
knowing that any sound coming from any other angle can be removed is a really
helpful piece of info.
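
To make that concrete, here is a minimal delay-and-sum sketch of the known-geometry case. Everything in it is an assumption for illustration: the four-mic setup, the integer sample delays, and white noise standing in for both the talker and the off-axis source. It is not the method from the article, just the textbook idea that aligning the channels for one direction makes that source add up while sound from another direction averages down.

```python
import random

def delay(x, d):
    """Delay signal x by d samples, padding the start with zeros."""
    return [0.0] * d + x[:len(x) - d]

def delay_and_sum(channels, steering):
    """Advance each channel by its steering delay, then average the channels."""
    n = min(len(ch) - d for ch, d in zip(channels, steering))
    m = len(channels)
    return [sum(ch[i + d] for ch, d in zip(channels, steering)) / m
            for i in range(n)]

def power(x):
    """Mean squared value of a signal."""
    return sum(v * v for v in x) / len(x)

random.seed(0)
n = 4000
talker = [random.gauss(0, 1) for _ in range(n)]  # stand-in for the wanted voice
noise = [random.gauss(0, 1) for _ in range(n)]   # stand-in for an off-axis source

talker_delays = [0, 3, 6, 9]  # hypothetical arrival delays from the talker's angle
noise_delays = [9, 6, 3, 0]   # the off-axis source reaches the mics in reverse order

# Each mic hears both sources, each with its own direction-dependent delay.
channels = [[t + v for t, v in zip(delay(talker, dt), delay(noise, dn))]
            for dt, dn in zip(talker_delays, noise_delays)]

beam = delay_and_sum(channels, talker_delays)
clean = talker[:len(beam)]
err_one_mic = power([c - s for c, s in zip(channels[0], clean)])
err_beam = power([b - s for b, s in zip(beam, clean)])
print(err_one_mic, err_beam)
```

With four mics and an uncorrelated interferer, the leftover error power in the beam should come out at roughly a quarter of what a single mic hears. Real rooms are much less kind than this, since reflections make the interferer partly coherent across mics.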

Now, with beamforming microphones, the precise relative location and direction
of each mic is known. The idea of creating one big beamforming mic for the
room out of people's individual mics is... insanely hard, but super cool.

It's interesting to me that this article is about measuring the quality of
voice transcription, rather than about the quality of audio in an actual
meeting. But I suppose the voice transcription quality measurement is simply a
proxy for the speaker audio quality generally, no?

This could actually be a huge step forward in not needing videoconferencing
equipment in meeting rooms. So far, one of the biggest reasons for needing
that equipment has been dealing with echo and feedback -- when people are in
the same call with multiple devices in the same room, it tends to end badly.
But if the audio processing is _designed_ for that... the results could
actually be quite amazing.

And it's well-known that the "bowling alley" visual of meeting participants
(camera at the end of a long conference table) isn't ideal. If each
participant has their own laptop camera on themselves, it could be a vastly
better experience for remote participants.

~~~
vxNsr
> _And it's well-known that the "bowling alley" visual of meeting
> participants (camera at the end of a long conference table) isn't ideal. If
> each participant has their own laptop camera on themselves, it could be a
> vastly better experience for remote participants._

My company pushes us to join any conference that will include remote people
from our desks, even if some or most of the attendees are in the same physical
location. It means that no audio is dropped because of too much cross-talk and
that all attendees are on the same footing. The only real issue is that we
don't automatically get headsets; you need to request/expense one.

~~~
gregmac
Yeah, this is just a much better way to conduct meetings.

I've been in many meeting rooms where there's a single projector/tv, and the
person controlling it only shows either the remote cameras OR their screen
(while they're sharing), so that isolates the remote people even more. (I've
also been the remote person in this situation, and it definitely feels more
like being an occasionally noisy fly on the wall than a full participant).

Everyone also gets their full desktop (big/multiple monitors, full keyboard,
etc).

It'll be interesting to see what happens post-lockdown.. will the people miss
the benefits of "one remote = all remote" and have more empathy for remote
people, or will we go back to the same old?

~~~
vxNsr
> _It'll be interesting to see what happens post-lockdown.. will the people
> miss the benefits of "one remote = all remote" and have more empathy for
> remote people, or will we go back to the same old?_

I think it'll be like everything that could be learned during this time:
someone has to recognize the lesson and actively work to implement it.

My main issue with using my desk is that I normally keep my laptop closed and
off to the side, so if I just open it the view of me is in profile and doesn't
look like I'm paying any attention. IT is loath to buy an external webcam for
anyone because "every laptop comes with a webcam," but luckily I was able to
source a spare one they had. I know that most of the desks are set up the
same way as mine, though, so most people either choose to use their laptop
screen as the main monitor or just don't enable video for the call.

------
pjc50
My employer calls this "far field" audio, and has a number of
hardware/firmware solutions:
[https://www.cirrus.com/products/cs48lv41f/](https://www.cirrus.com/products/cs48lv41f/)
(we're also very secretive, so I can't really discuss it beyond the public
website)

The specific improvement Microsoft are touting is _blind_ beamforming, without
knowing where the microphones are located relative to each other. Regular
beamforming is already in use in some products.

------
itchyjunk
There are obvious(?) privacy issues and what not here. But ignoring all that
for a second, it does sound pretty cool to be able to leverage all the little
computers we walk around with.

Think of all those shitty little video clips people take at a concert. Could
all those be combined to make some high quality panoramic video? Probably a
lot of other cool applications that I can't even comprehend for now. What a
time to be alive.

~~~
kick
Panoramic no (panoramas work on a single axis), but interesting despite that.

One of the applications for a thing like that would be creating 3D
environments of concerts and other historical events that were fairly accurate
from any angle, though, which could have some pretty interesting effects
(could you imagine how interesting it would be if you could watch old concerts
of dead artists, or a politician's speech from a hundred years ago, with 6
degrees of freedom, "accurate to the millimeter!" or something?) and outcomes
and so on. Much more interesting historical record-keeping.

~~~
jcims
The word(s) for your 3d application is photogrammetry and/or videogrammetry.

~~~
kick
Thank you!

------
Zenst
Interesting and doable. From my experience in this area, you need a reference
sound to calibrate, though that calibration could be ongoing for something
like this.

Gets down to matching a single sound and working out the timing of that sound
from the multiple sources. Then you also need to factor in the frequency
response as well.
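
As a rough sketch of that timing step, here is a brute-force cross-correlation between two recordings of the same sound. The signal, the 17-sample offset, and the names `phone_a` and `phone_b` are made up for illustration, and it ignores the frequency response part entirely.

```python
import random

def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`."""
    def score(lag):
        # Correlate ref[i] with other[i + lag] over the overlapping region.
        lo = max(0, -lag)
        hi = min(len(ref), len(other) - lag)
        return sum(ref[i] * other[i + lag] for i in range(lo, hi))
    return max(range(-max_lag, max_lag + 1), key=score)

random.seed(1)
clap = [random.gauss(0, 1) for _ in range(1000)]  # broadband reference sound
true_lag = 17                                     # hypothetical offset in samples

phone_a = clap
# phone_b hears the same sound later, quieter, and with its own sensor noise.
phone_b = [0.0] * true_lag + [0.6 * s for s in clap[:-true_lag]]
phone_b = [s + random.gauss(0, 0.2) for s in phone_b]

lag = estimate_delay(phone_a, phone_b, max_lag=50)
print(lag)
```

At an assumed 16 kHz sample rate, 17 samples is about 1.06 ms, or roughly 36 cm of extra path at 343 m/s. Between separate devices that number also includes whatever offset exists between their clocks, which is part of why this is harder than it looks.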

That last part would be important for handling things like the devices picking
up vibrations from the table they sit on. Remember that phones don't have a
rubber base to isolate them from the table, so any vibration of that surface
would propagate into the device and microphone. Then there's the whole aspect
of varying devices and, with that, varying microphone quality and device
housings. So calibrating at some level would be key for this to work. It's
doable, though, and processing-wise you could even run a master device and
handle the processing there, removing the server aspect, with some of the
processing done on each local device and passed on to the main device for
correlating. Certainly some phones have the power to handle this and replace
the server. But that would be more work, and is something we may well see
later on. It does make it harder to sell a bit of server processing software,
though.

Though one test I'd like to see this system handle would be how well it
filters out those vibrations.

After all you don't want to hear somebody writing or putting a cup or other
object down whilst somebody else is talking.

I'd also wonder what kind of jitter tolerances they are working with across
those devices and how that scales with the number of devices: does jitter
increase after so many devices?

~~~
ftio
Could you do the reference sound beyond the range of human hearing so that you
could do it continuously?

~~~
Zenst
Nope, as different frequencies propagate at different speeds.

However, the initial greeting at the start of the meeting would be good enough
to cover that. Some feedback and constant recalibration would be ideal and
doable, though. That covers things like people entering the room and changing
the room's acoustics with the door open briefly. Then somebody closes a blind
and things like that; even somebody moving a coffee cup on the table would
have (whilst small) an impact upon the acoustics. Though in that last
instance, somebody moving a cup nearer a device would have a bigger impact
upon that single source.

Though the easiest way would be having a sound source on the main camera that
did a simple frequency sweep - if you wanted to use a reference point sound
source for calibration. You may even get away with a single calibration then,
though dynamic calibration, using the meeting itself to constantly
recalibrate, would give a better result for more effort.

But it'd be interesting to see this in action and how they handle aspects
like that.

Indeed, thinking it thru you could have each device as it joins into the
meeting do a calibration tone sweep that the other devices would pick up. That
approach may well be better as you could get a more accurate map of all
microphones in relation to each other that way. So initial login/join of the
devices would handle that aspect nicely.
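
One hedged way to picture that pairwise map: if two devices each play a sweep and each notes when it hears the other's, the unknown offset between their clocks cancels out of the sum of the two one-way timings. The sketch below is only the arithmetic, under generous assumptions that are mine rather than the article's: every device knows exactly when its own sweep left the speaker, audio latency is zero, and the clocks don't drift between the two sweeps.

```python
SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air

def two_way_distance(a_emit, b_hear, b_emit, a_hear):
    """Distance between devices A and B from two sweeps, one in each direction.

    a_emit and a_hear are read from A's clock; b_emit and b_hear from B's.
    The unknown offset between the two clocks cancels in the sum.
    """
    one_way_ab = b_hear - a_emit  # flight time + (offset_b - offset_a)
    one_way_ba = a_hear - b_emit  # flight time + (offset_a - offset_b)
    return SPEED_OF_SOUND * (one_way_ab + one_way_ba) / 2.0

# Simulated measurement: devices 1.2 m apart, B's clock 0.75 s ahead of A's.
true_distance = 1.2
flight = true_distance / SPEED_OF_SOUND
offset_b = 0.75

a_emit = 10.0                        # A's clock: A plays its sweep
b_hear = a_emit + flight + offset_b  # B's clock: B hears A's sweep
b_emit = 12.0 + offset_b             # B's clock: B plays its sweep
a_hear = 12.0 + flight               # A's clock: A hears B's sweep

distance = two_way_distance(a_emit, b_hear, b_emit, a_hear)
print(round(distance, 6))
```

Repeating that for every pair as devices join would give a table of distances, from which relative positions could in principle be worked out. On real phones the playback and capture latency would probably dominate the error.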

------
peter_d_sherman
Excerpt:

"While the idea sounds simple, it requires overcoming many technical
challenges to be effective. The audio quality of devices varies significantly.
The speech signals captured by different microphones are not aligned with each
other. The number of devices and their relative positions are unknown. For
these reasons and others, consolidating the information streams from multiple
independent devices in a coherent way is much more complicated than it may
seem. In fact, although the concept of ad hoc microphone arrays dates back to
the beginning of this century, to our knowledge it has not been realized as a
product or public prototype so far."

Thoughts:

There's something deep here, not with respect to microphones and speech
transcription (although I wish Microsoft and whoever else attempts to wrestle
with those problems the greatest of success!)

There's a related deep problem in physics here.

Consider signals that emanate from outer space. Let's say they're from the
big bang, or heck, let's just say they're from one of our
past-the-edge-of-this-solar-system satellites that wants to communicate back
to earth.

Well, due to the incredible distances involved, the signal will get garbled in
various ways...

So here's the $64,000 question:

When that signal from deep space gets garbled, isn't it possible that it turns
into various other signals, at various different other frequencies and
wavelengths?

In other words, space itself, over long distances, acts as a prism (not
really, but as an easy way to wrap your mind around this concept), for radio,
and other electromagnetic waves...

Now, if you want to reconstruct the original message at these long distances,
you must be able to reconstruct garbled radio (and other em) waves, which are
moving at different frequencies, and may even arrive at the destination at
different rates of speed with various time shifts...

Basically, you've got to take those pieces -- move them to the correct
frequency, time correct them, speed them up or slow them down, sync them, and
overlay them -- to reconstruct the original message...

That's the greater question in physics -- the ability to do all of that, with
em signals from a long way off in space...

The article referenced -- is the microphone/audio/slow speed equivalent -- of
that larger problem...

------
pabs3
This reminds me of this open source project (and its predecessor manyears and
open hardware projects 8/16soundsusb).

[https://github.com/introlab/odas](https://github.com/introlab/odas)
[https://github.com/introlab/manyears](https://github.com/introlab/manyears)
[https://github.com/introlab/16SoundsUSB](https://github.com/introlab/16SoundsUSB)

Website of the team behind these:

[https://introlab.3it.usherbrooke.ca/](https://introlab.3it.usherbrooke.ca/)

------
geokon
Does anyone have any insight into why neural nets are used for the "blind"
beamforming? I don't have first hand experience with machine learning, but
this just doesn't seem to me like a machine learning type of problem. I get
it's not trivial, but it seems like there should be an analytic solution -
more or less

~~~
crazygringo
Acoustics are modified in extremely non-linear ways depending on the shape of
the room, bodies within it, materials, acoustic reflection, acting differently
at different frequencies, and so on.

In theory, if the entire 3D layout and material properties were known in
advance you could get clear audio analytically. But reverse-engineering the 3D
layout and materials from existing audio is essentially impossible.

So machine learning is used to find approximate solutions that work.

------
stragies
I look forward to exploring that github source drop.

------
stuaxo
Oh, I wanted this years ago when phones had terrible microphones and audio
codecs.

The idea was that at a gig loads of people would record and you could
reconstruct a much better recording.

~~~
dannypgh
I'd assume a lot of the losses would be the same across all devices - e.g. GSM
and associated preprocessing will result in dynamic compression in a uniform
way regardless of placement, no? It's an interesting idea but it seems like
you'd need a mixture of different compression types.

------
andrewfromx
wow i just added
[https://news.ycombinator.com/item?id=22956082](https://news.ycombinator.com/item?id=22956082)
a few days ago, on point no?

------
kohtatsu
Would be cool if Microsoft gave more shits about privacy.

Edit: This would be cool if I trusted Microsoft to properly handle privacy.

~~~
moron4hire
I trust MS to not sell my data to every random jabroni on the net more than I
trust Google.

~~~
airstrike
While I agree, that's also an incredibly low bar

