
The Visual Microphone: Passive Recovery of Sound from Video - csom
http://people.csail.mit.edu/mrub/VisualMic/
======
jsprogrammer
Patent pending? The approach should be trivially obvious for anyone familiar
with the basics.

What happened to the MIT License and liberating knowledge? Does MIT not do
that anymore?

~~~
joosters
"It is a rare mind indeed that can render the hitherto non-existent blindingly
obvious. The cry 'I could have thought of that' is a very popular and
misleading one, for the fact is that they didn't, and a very significant and
revealing fact it is too."

~~~
tormeh
It is a very good idea, but there's another problem at hand: It's not novel. I
remember hearing about this many years ago, though that was using vibrations
in windows to get the sound inside a building. IMO there's nothing to patent
here. Unless this is the same group it's just appropriating someone else's
work.

Besides, random people do think of cool stuff all the time. They just don't
normally patent it or start a business based on it. I thought of real-time
music streaming to phones as a subscription service way before Spotify was a
thing, but there wasn't much that 15-year-old me could do about it. To this
day I still have no idea how I would go about it if I got a similarly good
idea again.

~~~
sp332
It was a plot point in the movie Eagle Eye, which came out in 2008. Maybe
there's something more specific in the patent though.

And I don't know if there was a real service for this, but the idea of music
streaming to phones is pretty old. Peter Schickele used it on his parody album
Two Pianos are Better than One, which came out in 1994. It was called "Inter-
Ear TelecommuniCulturePhone - Trademark!"

------
kaoD
Previous discussion:
[https://news.ycombinator.com/item?id=8131785](https://news.ycombinator.com/item?id=8131785)

~~~
gus_massa
I'll partially copy two interesting comments from that thread:

> _[A video with] 60 frames per second only allowed to identify the speaker
> and the number of people in the room._

> _The demo in the video is based on a high-speed (1000+fps) recording by a
> special camera, not on 'normal' video._

~~~
GuiA
A standard mobile phone today can do 240fps. We're only a few years away from
1000+fps in everyday devices.

~~~
Potando
Really? Isn't speed limited by the amount of light that can be received in a
small lens like on a phone?

~~~
wyager
>Isn't speed limited by the amount of light that can be received in a small
lens like on a phone?

It's also limited by sensor noise and efficiency.

Presumably, we'll push all three of these limits over the next few years.
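A quick bit of arithmetic shows why light is the binding limit: the faster the frame rate, the shorter the maximum per-frame exposure. A rough sketch (illustrative numbers, ignoring sensor readout and blanking overhead):

```python
def max_exposure_ms(fps):
    """Longest possible per-frame exposure at a given frame rate,
    ignoring sensor readout and blanking time."""
    return 1000.0 / fps

# At 1000 fps each frame can gather at most 1 ms of light,
# roughly 1/33 of what a 30 fps frame can collect.
print(max_exposure_ms(30), max_exposure_ms(1000))  # ~33.3 and 1.0
```

So a 1000fps phone sensor would need to be far more light-sensitive (or far less noisy) just to match today's image quality at ordinary frame rates.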

------
downandout
Well that's pretty scary. I can see one application of this already: casinos
in the US are legally prohibited from recording audio on their floors, but
have perfectly positioned cameras everywhere. Beyond that, I'm guessing every
spy agency on earth will be buying solutions based on this.

It would be interesting to know what the genesis of this project was - for
example if the NSA or CIA was involved in suggesting to a professor that MIT
take a look at this area. This is a very mission-specific technology.

~~~
lnanek2
Spy agencies already have a laser they can put on a window to recover audio
from the vibration and other similar devices. This isn't really that
different. It seems behind current spook hardware, honestly.

~~~
deutronium
But this is passive, which would undoubtedly have advantages for spies.

------
catshirt
could you do it the other way around?

how accurately can we recreate a 3d space from sound? what
assumptions/information would you need to make it more accurate?

~~~
sjtrny
Yes. Look at the setup/calibration involved with Soundbar type audio systems.

~~~
catshirt
awesome, thanks for the lead!

i will look it up, i am mostly curious about its resolution. for instance, my
unqualified hunch is that the algorithm couldn't detect the size of the dog in
my room based on a microphone recording.

i guess the more calibration involved the easier the problem becomes. but that
is no fun. :)

------
blt
William T. Freeman is an outstanding vision researcher. His list of
publications
([http://billf.mit.edu/publications/all](http://billf.mit.edu/publications/all))
is full of these simple, clever solutions for problems slightly outside the
mainstream. I really admire his work.

------
yshalabi
Rubinstein was also behind the work on using pixel intensity variations to
visualize subtle changes. They used it to extract heart rates. I'm guessing
similar methods are used here, but now to recover vibrations induced by sound.
Interesting work.

------
baldfat
Someone should give this to the writers of crappy TV shows that use the
"enhance that photo" line. I'm sure they'll flip out at the whole new story
lines this creates.

CSI has been changed forever. I bet it shows up next season on multiple TV
crime shows.

------
sjtrny
This is an interesting extension of the ideas from this paper by the same
author/s
[http://people.csail.mit.edu/mrub/vidmag/](http://people.csail.mit.edu/mrub/vidmag/).

------
Animats
They're making progress. At 5000 FPS, it's not surprising that they can
recover audio. But recovering it from 60 FPS video is striking. That works
because some imagers use a rolling shutter and don't capture the whole frame
at once, so each sensor row is a slightly later sample in time.
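A back-of-the-envelope sketch of why a rolling shutter raises the effective sample rate (the frame rate and row count below are invented for illustration, and inter-frame blanking time is ignored):

```python
# Hypothetical 60 fps rolling-shutter camera with 720 sensor rows.
fps = 60
rows = 720

# Global shutter: the whole frame is one sample of the vibration signal.
global_rate_hz = fps  # 60 Hz -> Nyquist limit around 30 Hz

# Rolling shutter: each row is exposed slightly later than the one above,
# so every frame contributes up to `rows` samples in time.
rolling_rate_hz = fps * rows

print(global_rate_hz, rolling_rate_hz)  # 60 43200
```

In practice the blanking gap between frames leaves holes in the sample stream, so the usable bandwidth is well below this upper bound, but it is still far above the nominal frame rate.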

~~~
bsder
And I suspect that you could combine video feeds from several lower speed
cameras to give you an effective 1000FPS.

~~~
jobigoud
Yes, but you have to precisely stagger the start of each camera's exposures,
and that's hard to do on consumer hardware.

~~~
bsder
Um, why would you have to precisely stagger?

I suspect that you have enough information to actually align the videos after
the fact.

10 videos at 250FPS would probably distribute sufficiently.

~~~
Vulkum
Wouldn't it be hard to interleave the frames of these videos given different
starting times and angles (ignoring camera movement)? It would be easy if the
videos had synchronized timestamps, but that might not always be the case.

~~~
bsder
Any in-frame motion probably allows you to align frames after the fact. This
is existing technology, and it gives you timestamp-to-frame alignment.

If you are reconstructing sound, you can now fuzz the time alignments to give
the maximum signal for the maximum time (non-correlation will damp to random
noise quickly). This allows you to pairwise reconstruct time alignments.

At that point, you put them all together and run your detailed analysis.

Now, I didn't say this was _EASY_. :) Or cheap. Or real-time.

Just that it is possible.
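The "align after the fact" step could be sketched with plain cross-correlation on the recovered signals themselves (a toy example with synthetic data, not the paper's method; the signal length, noise level, and 7-sample offset are all invented):

```python
import numpy as np

def estimate_delay(ref, delayed):
    """Estimate how many samples `delayed` lags behind `ref` by locating
    the peak of their cross-correlation."""
    corr = np.correlate(delayed - delayed.mean(), ref - ref.mean(), mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

rng = np.random.default_rng(0)
sound = rng.standard_normal(500)  # stand-in for the true audio signal

# Two cameras observe the same vibration; the second starts 7 samples late,
# and each recovers a noisy version of the signal.
cam_a = sound + 0.1 * rng.standard_normal(500)
cam_b = np.concatenate([np.zeros(7), sound[:-7]]) + 0.1 * rng.standard_normal(500)

print(estimate_delay(cam_a, cam_b))  # 7
```

Once the relative offsets are known, the frames (or recovered samples) from all cameras can be merged onto one time axis, which is the pairwise alignment step described above.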

------
yummybear
I wonder how this would perform on the iPhone 6's (or other phones') high-speed camera.

------
Naushad
Slowly, these creative imaginings are coming to life. Eagle Eye....

