
Extracting audio from visual information - r0h1n
http://newsoffice.mit.edu/2014/algorithm-recovers-speech-from-vibrations-0804
======
jballanc
While it's cool to see this technique applied to consumer video, the general
idea has been used in international espionage for a while now. In fact, Léon
Theremin (yes, of the musical instrument) invented an early device capable of
eavesdropping based on window vibrations:
[http://en.wikipedia.org/wiki/L%C3%A9on_Theremin#Espionage](http://en.wikipedia.org/wiki/L%C3%A9on_Theremin#Espionage).
I also recall that this is why all of the windows in the White House are
fitted with tiny devices that vibrate the windows randomly as a
countermeasure.

~~~
clueless123
As a kid (in the early '80s!) I accidentally figured this out while "hacking"
a sound-based "walkie talkie" with a photocell and an LED pointed at each
other through a little telescope, from my house to my neighbor's across the
block.

The device worked, but had a bug: a hummmm on the receiver.

For the longest time I could not figure out what the "hummm" was, till I
noticed it was there even with the transmitter off.. and then, I heard strange
voices even with the transmitter being off!

Then I realized the humm was the reflection of incandescent light on the
window, and the voices were the people in the room making the window vibrate
too..

Ahh.. those were great fun days, early on in my life as a "hacker". :)

~~~
enraged_camel
>>till I noticed it was there even with the transmitter off.. and then, I
heard strange voices even with the transmitter being off!

...and that is when little Johnny knew he had to lay off the shrooms.

:P

I love hearing stories like yours though. I, too, miss the child-like wonder
of making discoveries. We take the physical world we live in for granted, but
there's just so much out there to see and learn from.

------
keerthiko
Wouldn't it be really neat to apply this to HD movie sequences, and hear what
the sounds on the set and the voices of actors were like pre-production? And
how unreal some of the sounds must have turned out with all the visual
tweaking that happens in production?

~~~
anigbrowl
I record the dialog for your movies. Apart from obvious sound effects like
Darth Vader voices or so, actors' dialog is changed as little as possible
during post production. It's not heavily EQed or compressed, because that
would mean a corresponding change to the room tone and background noise, which
would then fluctuate unnaturally as you went back and forth between the
participants in a scene. In scenes with a lot of movement actors' voices
sometimes sound a little deeper than in real life, because they are fitted
with a tiny wireless microphone, and there's usually some resonance from the
chest cavity.

But generally what you hear is very close to how the person actually sounds -
although their accent or inflections may be adopted for the purposes of their
role. This can be a bit jarring; I've worked with method actors who maintain
their screen accent at all times during production until the film is done, so
when they switch back to their regular accent after a month or so it's
extremely disorienting, since I've been listening to them in my headphones day
in day out for weeks, and am paid to pay as much attention to their voices as
the cinematographer pays to their faces.

Some actors go even farther in support of their public image. Rock Hudson had
a somewhat high voice that producers deemed incompatible with his looks, so
during production he would warm up every day by shouting for 20 minutes and
gargling with orange juice to inflame his vocal cords, and of course he smoked
a lot too. What actors will do to themselves in pursuit of screen presence far
exceeds anything I've ever been asked to do in post. Editing dialog is more
than enough work without trying to sculpt people's voices.

~~~
keerthiko
Thanks a lot for that insight. I will assume you work primarily for Hollywood
with that description. It is interesting to hear about the behind-the-scenes
from film industries in different parts of the world.

In the Indian film industry (maybe not so much Bollywood, but more of the
local ones like Tamil, Malayalam, etc) there's a LOT more post being done.
Some entire Malayalam films have dubbed voices (for the original movie). Not
to mention nearly every song track does not have the singing recorded during
the shooting of the dance sequences. So those sets and actor voices I would
bet would be completely different.

~~~
anigbrowl
Yeah, here in California we have the luxury to do good production sound almost
all the time. Dubbing the voices later all sounds very different, because the
recordings are not in the same space as the original production and so on. I
forgot to think about that aspect, because here we always record production
sound unless there is an impossibly high level of noise (eg from a wind
machine or something). With modern cameras it's been very good for the sound
department because they are so quiet, whereas older 16 and 35mm film cameras
were quite noisy, like a sewing machine, and it required some work to prevent
this messing up the dialog recording.

------
uptown
Reminds me a little of "Dual Photography" as presented at Siggraph in 2005.
All the data you need to construct a new view is available if you know where
and how to look:
[https://www.youtube.com/watch?v=p5_tpq5ejFQ](https://www.youtube.com/watch?v=p5_tpq5ejFQ)

~~~
moron4hire
Hah! I have the book that is used in the example picture. Though, I suppose
everyone who studied CG in the early to mid 2000s has that book.

------
infiniteri
Incredible and terrifying at the same time. If they're doing this kind of
stuff right now with consumer cameras, imagine how effective this technology
will be in just a few decades. Privacy is fading quickly with the advent of
exciting technology like this.

~~~
platz
you need a camera that captures at least 2,000 frames per second.

~~~
pavel_lishin
> _Because of a quirk in the design of most cameras’ sensors, the researchers
> were able to infer information about high-frequency vibrations even from
> video recorded at a standard 60 frames per second._

~~~
Joeboy
The audio from the 60fps video sounds pretty bad though, which I suspect is
mostly because of inherent maths/physics limitations rather than anything that
software can improve.

Edit: They mention capturing frequencies up to five times higher than the 60Hz
frame rate, which would mean a maximum frequency of 300Hz, which would suggest
the equivalent of 0.6kHz audio, which is a 73.5th of the audio rate of a CD. I
doubt you'd get intelligible speech from current consumer hardware using this
technique.
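
For anyone who wants to check the arithmetic above, here is a minimal sketch. It assumes the usual Nyquist rule (the sample rate has to be at least twice the highest frequency you want to capture) and the standard 44.1 kHz CD rate; the function and constant names are made up for illustration.

```python
FRAME_RATE_HZ = 60.0     # standard consumer video, per the article
CD_RATE_HZ = 44_100.0    # standard audio CD sample rate


def equivalent_sample_rate(max_freq_hz):
    """Sample rate needed to represent max_freq_hz (Nyquist: twice the frequency)."""
    return 2.0 * max_freq_hz


# "Up to five times higher than the frame rate" gives a 300 Hz ceiling.
max_recovered_hz = 5 * FRAME_RATE_HZ
audio_rate_hz = equivalent_sample_rate(max_recovered_hz)   # 600 Hz, i.e. 0.6 kHz
ratio_to_cd = CD_RATE_HZ / audio_rate_hz                   # 73.5

print(max_recovered_hz, audio_rate_hz, ratio_to_cd)
```

For context, telephone audio is usually sampled at 8 kHz, so a 0.6 kHz equivalent rate sits well below what is normally considered the floor for speech.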

~~~
mnw21cam
There is some small possibility of improvement through software techniques,
such as maybe data assimilation, which can use information from surrounding
time-frames to improve the measurement. This is assuming that the magnitude of
vibrations changes a lot slower than the vibrations themselves, which is
usually true, and how most audio compression works. It may be able to clean up
the sound a little. However, I would say that the results they have obtained
so far are very impressive.

~~~
T-hawk
The data comes in faster than 60 fps. A camera sensor doesn't capture the
entire frame instantly every 1/60 second. It progressively scans through the
frame over some measurable fraction of that 1/60 second. This is that quirk.

Suppose the camera scans 720 lines in HD every 1/60 second. Each row is offset
in time by 1/43200 second. A rigid object could be slightly offset in space on
each line of pixels, indicating that sound waves perturbed it in the time gap
between when the camera captured each line. So that subframe video data can be
turned back into audio at a much higher frequency than that apparent 60 Hz
video sampling rate.

In other words, we're not just talking about 60 frames-per-second from a
camera. It's really perhaps 43,200 _rows_ per second, an enormously higher
sampling frequency.
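
A toy simulation of that idea, not the researchers' actual method: it assumes the 720 rows are read out evenly across the whole 1/60 second with no gap between frames (real sensors differ), and that each row gives one clean displacement sample. All names here are hypothetical.

```python
import math

ROWS_PER_FRAME = 720
FRAMES_PER_SECOND = 60
ROW_RATE_HZ = ROWS_PER_FRAME * FRAMES_PER_SECOND  # 43,200 rows per second


def row_timestamps(num_frames):
    """Readout time of every row, assuming evenly spaced rows and no blanking gap."""
    return [i / ROW_RATE_HZ for i in range(num_frames * ROWS_PER_FRAME)]


def estimate_frequency(samples, sample_rate_hz):
    """Crude frequency estimate: count upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / sample_rate_hz)


TONE_HZ = 1000.0  # far above the 30 Hz Nyquist limit of 60 fps sampling
times = row_timestamps(FRAMES_PER_SECOND)  # one second of video
per_row = [math.sin(2 * math.pi * TONE_HZ * t + 0.3) for t in times]
per_frame = per_row[::ROWS_PER_FRAME]      # one sample per frame instead

row_estimate = estimate_frequency(per_row, ROW_RATE_HZ)
frame_estimate = estimate_frequency(per_frame, FRAMES_PER_SECOND)
print(f"per-row estimate:   {row_estimate:.0f} Hz")
print(f"per-frame estimate: {frame_estimate:.0f} Hz (aliased)")
```

Sampling once per row recovers the 1 kHz tone, while sampling once per frame can only ever report something at or below 30 Hz.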

~~~
mnw21cam
> The data comes in faster than 60 fps

Yes, yes, that was completely obvious from the article. We are getting
thousands of "measurements" per second.

However, each of those measurements is incredibly inaccurate. Each one is
trying to detect the change of colour of 1/200 of the colour range in a single
pixel. You may be getting less than a single bit of entropy per measurement.

An advanced signal processing technique will look at the longer-term picture.
Sound vibrations are not a random walk - they tend to be a combination of sine
wave vibrations, where the rate of change of magnitude of each wavelength is
significantly lower than the vibrations themselves. Therefore they are to a
certain extent predictable, and this predictability is used by audio
compression algorithms. The signal processing algorithm will have to make use
of the extremely limited information coming from the measurements, and match
up possible sets of varying sine waves that could be causing those
measurements. This may be sufficient to reject some of the noise that we could
hear on that video, and clean up the sound a bit, but it is quite a hard (and
CPU-intensive) processing task.
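
A very simplified sketch of the "look at the longer-term picture" idea. It assumes a single sine wave of known, constant frequency and amplitude, which real speech is not, and it uses a plain least-squares fit (a single-bin DFT) rather than anything as elaborate as data assimilation. The sample rate, tone, and noise level are made-up illustration values.

```python
import math
import random

SAMPLE_RATE_HZ = 43_200   # hypothetical row rate (720 rows x 60 fps)
TONE_HZ = 440.0           # assumed known in advance, unlike real speech
TRUE_AMPLITUDE = 0.1
NOISE_SIGMA = 1.0         # per-sample noise ten times larger than the signal


def fit_sine_amplitude(samples, sample_rate_hz, freq_hz):
    """Least-squares amplitude of a sinusoid at freq_hz (single-bin DFT).

    Exact only when the window holds a whole number of cycles.
    """
    n = len(samples)
    step = 2.0 * math.pi * freq_hz / sample_rate_hz
    in_phase = sum(x * math.sin(step * i) for i, x in enumerate(samples))
    quadrature = sum(x * math.cos(step * i) for i, x in enumerate(samples))
    return 2.0 * math.hypot(in_phase, quadrature) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
step = 2.0 * math.pi * TONE_HZ / SAMPLE_RATE_HZ
noisy = [
    TRUE_AMPLITUDE * math.sin(step * i + 0.7) + rng.gauss(0.0, NOISE_SIGMA)
    for i in range(SAMPLE_RATE_HZ)  # one second of measurements
]

estimate = fit_sine_amplitude(noisy, SAMPLE_RATE_HZ, TONE_HZ)
print(f"true amplitude {TRUE_AMPLITUDE}, estimated {estimate:.3f}")
```

Each individual sample is mostly noise, but tens of thousands of them combined pin the amplitude down fairly well. The hard part in practice is that the frequencies and amplitudes are unknown and keep changing, which is where the CPU time goes.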

------
FatalLogic
It's reminiscent of the laser microphone, which measures sound-induced
vibrations in objects by bouncing a laser off them, and reconstructs the
original sound waves

[http://en.wikipedia.org/wiki/Laser_microphone](http://en.wikipedia.org/wiki/Laser_microphone)

~~~
__m
Or like the difference between active and passive sonar.

------
adriancooney
This is amazing. Imagine taking high definition video of a crowd of people.
You could pick out objects nearby and hear what individuals are saying. You
could nearly produce a 3D auditorium by sampling different points in a video.
Couple this with a 3D camera and an Oculus Rift, you could have something
incredible.

~~~
atomicfiredoll
While I'm not sure how closely this matches up, it sounds like you may at
least be interested in exploring these concepts:

[http://en.wikipedia.org/wiki/Cocktail_party_effect](http://en.wikipedia.org/wiki/Cocktail_party_effect)

[http://en.wikipedia.org/wiki/Independent_component_analysis](http://en.wikipedia.org/wiki/Independent_component_analysis)

Edit: Spacing.

------
crucialfelix
Reminds me of this great hoax / video piece from a few years ago about French
scientists decoding voices that were accidentally encoded into vases in
ancient Pompeii

[http://www.youtube.com/watch?v=ZbpwBTDvXrI](http://www.youtube.com/watch?v=ZbpwBTDvXrI)

It's in French, but this makes it even funnier.

------
supahfly_remix

      Because of a quirk in the design of most cameras’ sensors,
      the researchers were able to infer information about
      high-frequency vibrations even from video recorded at a
      standard 60 frames per second.
    

Can anyone explain what this quirk is?

~~~
saticmotion
Cameras basically read their sensors one row of pixels at a time. By measuring
the distortion of each row, they can detect vibrations higher than the
camera's frame rate.

~~~
kaoD
So it's like if 960-row video at 60fps were actually a 57600 rows-per-second
video, right? Which they can extract info from because having more rows in a
still frame doesn't mean having more information (at least not linearly), i.e.
in still frames with no rolling shutter, rows contain redundant vibration
already extracted from previous rows.

So having a rolling shutter is good for this specific application because it
trades off resolution (most of which is redundant or insignificant
information) for sampling rate.

~~~
lifeformed
Between the time the first and last row are read, the object might've moved a
little bit. So if you take a picture with your phone from the side window of a
moving car, the picture will appear stretched.

~~~
kaoD
Sure, I meant it specifically as a guess of how it's applied to sound
extraction and how it means you have ROW samples per frame.

------
petercooper
And now think of how much high definition video, CCTV, and other forms of
recordings already exist.. and then think about running large collections of
pre-existing video through such algorithms, along with the best speech-to-text
in the business. You could have a whole new Wikileaks on your hands :-)

~~~
imaginenore
Thankfully most CCTVs are so crappy and so low framerate, I doubt they can get
much out of it.

If you want to record a particular person through a window, for instance, you
can get a laser microphone that catches the vibrations of the glass.

~~~
petercooper
Yeah, I've heard that's what the spies use, although that does require
specific effort. What I find more _intriguing_ about this development is how
pre-existing footage could be used. While HD CCTV is certainly not popular, I
suspect enough has been said in the presence of existing HD video to
incriminate a few people :-)

~~~
protopete
I think it unlikely that pre-existing footage can be used, because HD video is
almost always compressed, thus masking the minute vibrations in the pixels.
The algorithm described in the article works best for uncompressed video
directly from the image sensor, and they can run it in real-time without
needing to store the video.

------
sp332
This is from the movie Eagle Eye, right? The evil computer watches the
vibrations in a cup of coffee while someone is speaking.

~~~
shangxiao
lol that is the first thing that I thought as well. I just passed it off as a
typical Hollywood trope, but now I'm quite delighted to see it become a
reality.

~~~
TeMPOraL
> _I just passed it off as a typical Hollywood trope_

I often wonder what makes people pass off things like that as "typical
tropes", when they are obviously realistic and doable.

~~~
ctdonath
"Any sufficiently advanced technology is indistinguishable from magic." Those
who view it as magic will express it as a trope, even if it is merely advanced
technology.

Just yesterday I chuckled when recalling a James Bond movie involving Bond
driving a car in reverse by viewing a back-up camera. At the time, it formed
an instant "trope" because it was so cool and novel. Yesterday, I was doing
exactly that with the backup camera on my car - obviously realistic and doable
technology.

------
ColinDabritz
This is a beautiful hack in the best tradition of the term.

------
e12e
If only I hadn't been wearing my tinfoil hat, they'd never have known my
plans!

------
atuladhar
The first thing I thought of: a story called "The Extractor" that recently
featured on The Truth podcast:
[http://thetruthpodcast.com/Story/Entries/2014/3/9_The_Extrac...](http://thetruthpodcast.com/Story/Entries/2014/3/9_The_Extractor.html)

------
qzxvwt
Reminds me of Alvin Lucier's "I am sitting in a room" audio experiment.

------
NAFV_P
The _watery_ sounds you find when listening to low bit-rate mp3, there was a
bit of that in the video, especially the "Mary had a little lamb..." skit.

------
VanillaCafe
From watching the video, I get the impression that there is a very large
amplitude of the input audio -- taking advantage of the "loud" in loudspeaker.

~~~
CWuestefeld
A good question. I too got that impression from the first example. In the
second example (chips bag through glass), the "control" audio had a lot of
reverb, which might have been introduced by the phone they acquired it
through, but may also suggest that it wasn't just a person talking, but a
reproduction through some kind of amplification equipment.

------
pessimizer
If this is something that commodity hardware is now capable of, a total
surveillance society is now very cost-effective.

Is the stability required from the camera a dealbreaker when it comes to
outdoor mounted cameras in moving air, or would it be pretty easy to
algorithmically filter that out? i.e. are winds and drafts predictable enough
that they could be removed accurately enough for smaller vibrations to remain?

------
arh68
Wait, so what would be the _best_ reflector for this type of thing? If I could
host the ultimate cocktail party, placing ferns/etc all around the room, could
I record every conversation with a single camera?

What would those ferns look like? What would they be made of? I'm imagining
christmas trees made of cellophane fibers or something.

~~~
chadzawistowski
If you're allowed to place the ferns, why not simply hide microphones in them?

~~~
arh68
I don't know, you may be very far away. The camera could be in space, or maybe
on a plane. I mean, if you can hide microphones, the ferns would be irrelevant
anyways, apart from being a good distraction.

Kind of off topic but just last night I was watching some (silent) lightning
in the clouds, thinking how much more localized sound energy is when compared
to light. In other words, I can see for light-years but can't hear much of
anything 100ft away. Or perhaps it's just our sensors are more sensitive to
light waves than air pressure waves. Now I realize sound isn't so localized!
It leaks. /rant

------
yarone
Reminds me of the technique used to uncover "invisible" motion by amplifying
the tiny movements found in a video:
[http://bits.blogs.nytimes.com/2013/02/27/scientists-uncover-...](http://bits.blogs.nytimes.com/2013/02/27/scientists-uncover-invisible-motion-in-video/)

------
lucb1e
It doesn't seem to mention what kind of camera you'd need to do this at the
mentioned distance (15 feet). I'm assuming it's at least a couple thousand
bucks, or maybe even so expensive that most professional photographers won't
have it, but does anyone actually know? Or did I miss it in the article?

~~~
randomdrake
The video goes into a bit more detail and shows a couple of things that
address your question.

1) A lot of the tests were done with a camera that costs thousands of dollars.
You can see images of the camera used in one of the experiments.

2) They were able to use consumer-grade cameras to also capture sound. Even
frequencies up to 5x higher than the actual 60 FPS of the captured video.

~~~
nl
I wonder how well a GoPro would do? They can do 120fps at 720p or 240fps at
WVGA.

~~~
ygra
Not very well, at least for the initial approach with a high-speed camera.
With 240 fps you're limited to 240 samples per second. If I'm not terribly
mistaken that would limit you to frequencies up to 120 Hz in the
reconstruction.

But the GoPro has a rolling shutter as well, so their second approach would be
applicable. However, that effectively relies on rows per second and while you
have a higher frame rate you have a lower resolution. In the end they could
cancel each other out.
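
A back-of-the-envelope comparison of the two modes, as a sketch only. It assumes WVGA means 480 rows, that rows are read out evenly across the whole frame period, and that every row yields a usable sample; none of that is confirmed for an actual GoPro, and the names are hypothetical.

```python
# (rows per frame, frames per second); 480 rows for WVGA is an assumption
MODES = {
    "720p @ 120 fps": (720, 120),
    "WVGA @ 240 fps": (480, 240),
}


def frame_nyquist_hz(fps):
    """Highest frequency recoverable from whole frames alone."""
    return fps / 2.0


def row_rate_hz(rows, fps):
    """Idealised rolling-shutter sample rate: rows read per second."""
    return rows * fps


for name, (rows, fps) in MODES.items():
    print(f"{name}: frame-based limit {frame_nyquist_hz(fps):.0f} Hz, "
          f"row rate {row_rate_hz(rows, fps)} rows/s")
```

Under these idealised assumptions the two modes end up within a factor of about 1.3 of each other in rows per second (86,400 versus 115,200), so doubling the frame rate is partly offset by the lower resolution.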

------
thisjepisje
Reminds me of how deaf people can "listen" to music with a sheet of paper or a
balloon in their hands.

~~~
imjustsaying
*deaf people ;)

~~~
ProAm
Either way it's a superpower.

------
aus_
While a different technique, this reminds me of how researchers were able to
extract keystrokes based on the sound from each key press.

Here is the original paper:

[http://rakesh.agrawal-family.com/papers/ssp04kba.pdf](http://rakesh.agrawal-family.com/papers/ssp04kba.pdf)

~~~
quasiconvex
Also reminds me of this:
[http://cnx.org/content/m13224/latest/?collection=col10380/la...](http://cnx.org/content/m13224/latest/?collection=col10380/latest)

------
Houshalter
How much vibration does sound actually cause in an object? I tried humming near
various objects and noticed nothing. I used to be really interested in early
mechanical microphones and sound recording, but I couldn't find much
information on how they work either.

~~~
PeterisP
Your eyes wouldn't be able to see a 2000 Hz vibration; most vibrations that we
perceive as sounds (except for the very low ones) are so fast that even
with a huge amplitude they'd be just a motion blur to our eyes.

Now, a 2000 fps camera can see things that a naked eye can not.

~~~
Houshalter
But I should see a blur or something, unless it happens to be vibrating at
_exactly_ the frame rate of both of my eyes (and even that would cause some
weird effects, see helicopter blades spinning.)

Almost missed your comment btw.

------
jbuzbee
Reminds me of the attempts to extract audio from the grooves on ancient spun
pots:

[http://en.wikipedia.org/wiki/Archaeoacoustics#Past_interpret...](http://en.wikipedia.org/wiki/Archaeoacoustics#Past_interpretations_controversy)

------
jbaiter
Am I the only one who isn't getting any audio from the video at all? I tried
in two different browsers, downloaded the video with youtube-dl and tried to
play it with mpv, everything to no avail, it's just a video with no sound.

~~~
mrfusion
Pretty ironic ... I wonder if they could run their algorithm on the video on
their page ...

------
johnydepp
Try noise cancellation on it, and the results may be clearer. I did a project
on active noise cancellation using neural networks. It should apply here as
well.

~~~
anigbrowl
They did already, by the sound of the audio.

------
mgulaid
Now I am gonna be suspicious of chip bags and plants everywhere. We will need
to invent telepathic communication (mind to mind) to preserve privacy.

~~~
TeMPOraL
> _We will need to invent telepathic communication (mind to mind) to preserve
> privacy._

Or maybe it's time to think how to adapt to a world without privacy?

------
skywhopper
This is fun research, but it's basically just a microphone (which works by
turning these vibrations into magnetic pulses that generate electric pulses)
using a much, much more complex signal processing mechanism (observing photons
bouncing off the vibrating surface, translating the digital representation of
those photons into a digital simulation of the electric pulses a microphone
would generate, and turning that digital simulation into real electric
pulses).

------
tferraz
This algorithm is really incredible, I can only imagine the improvements to
this in 20 years

------
cagataycouk
Binoculars as bugging devices...

------
MichailP
I am just worried that they are picking up part of the information from the
camera mic. Maybe the camera mic output is not totally independent from the
video sensor, and is encoded in the final video.

~~~
abedavis
The camera used in most of these experiments (a Phantom high speed camera)
doesn't even have a microphone - so that would be quite impossible.

~~~
MichailP
Thanks. However, there is still a possibility that mechanical vibration is
coming to the video sensor through some path (floor+camera tripod), and
affecting the video. After all mechanical vibration is affecting the potato
chips. Why is it so impossible to affect video sensor? Especially in the
sensitive high speed camera? Let the downvotes begin :)

~~~
underyx
I fail to see how that would make this any less impressive.

~~~
MichailP
If my evil :) speculation is true, the camera sensor is just picking up
mechanical vibrations from environment (coming through air, tripod etc.) and
encoding them into video. There is no proof that the bag of chips is vibrating
(they even say that the vibrations are not visible on footage). They are
extracting something from video, but that something may just be spurious
pickups by equipment, and not related to the image/video of the bag of chips.
Thus it would be just a silly way to measure spurious mechanical vibrations.

------
ErikRogneby
I am amazed this wasn't funded by DARPA.

~~~
IvyMike
I would bet money that this technique is already well known and has been used
by certain intelligence agencies.

------
silus151
I think this will be a good help for visually challenged people. I wish to see
if this goes in that direction.

------
DarkIye
Can somebody executive summary this big lump of wooly garbage?

~~~
ctdonath
TL;DR - sound is a physical phenomenon, pushing on objects (more pronounced on
wide thin lightweight things like bags). A fast enough video camera with image
enhancement can "see" the sound affecting the object, which then can be
translated to recreating the sound from the video image.

Not long ago there was a spate of HN articles about apps that could measure
your heart rate via the camera (it watches for & measures subtle changes in
your skin color which occur during the pulse cycle). This is exactly the same
idea, just with a much faster "pulse".

I expect the researchers will next discover the "rolling shutter" (a "that's
not a bug, it's a feature!" of cell phone cameras) and discover how to extract
the audio info _without_ the need for high-framerate cameras. atomatica found
a perfect example:
[http://youtu.be/TKF6nFzpHBU?t=10s](http://youtu.be/TKF6nFzpHBU?t=10s)

