Extracting Audio from Visual Information (news.mit.edu)
43 points by sblank on Aug 17, 2022 | 15 comments


Related work has used visual processing to help with audio source separation, either by augmenting a separation AI [1] or by capturing the recording visually in the first place [2].

[1] https://ai.googleblog.com/2018/04/looking-to-listen-audio-vi...

[2] https://newatlas.com/music/optical-microphone-sound/


When this is eventually mainstream, I wonder how it will interact with laws around recording consent. In two-party consent states like MA you'd need my consent to record me talking, but not to record video of me. If video essentially encodes audio, however, then recording high-resolution video may also require consent.


I get that this was developed for use as a practical tool with lots of potential benefits for analyzing videos shot by amateurs, but I'm rather more interested in what kind of audio it might interpolate when presented with really mangled or artificial visual content. I wonder what it thinks is happening when looking at a cartoon or computer generated film, or footage captured from a distorted VHS tape or a modern FPS game...

How much has this project developed since 2014? Can regular people download the tools to play around with it yet? Would love to see it try to get glitchy for creative purposes.


Needs a [2014] in the title.


Interesting work, but it seems to come with scary applications.

One thing I was wondering about from the video: the narrator mentioned that some of the movement is hundreds of times smaller than a pixel. If that's the case, how are they detecting the movement in those cases? If something moves but stays within a pixel, how do you know how it moves, since the pixel is the smallest unit of information you have? Or is it that, although the physical movement in space is smaller than a pixel, the resulting "color" information for the given pixel changes in proportion to the movement, in ways large enough to be measured?


There is indeed research on exactly that -- "motion magnification":

visualizing vibrations in machinery: https://www.youtube.com/watch?v=rEoc0YoALt0&t=121s

overview of research, and discussions on HN: https://hn.algolia.com/?query=Eulerian%20Video%20Magnificati...

TED talk: https://www.youtube.com/watch?v=fHfhorJnAEI

edit: TFA:

> from the change of a single pixel’s color value over time, it’s possible to infer motions smaller than a pixel.

> the researchers borrowed a technique from earlier work on algorithms that amplify minuscule variations in video,
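To make that concrete, here's a minimal sketch (my own NumPy illustration, not the researchers' code) of how a shift far smaller than a pixel can be recovered from per-pixel intensity changes alone, brightness-constancy style:

    import numpy as np

    # A soft black-to-white edge sampled on a 1-D pixel grid.
    x = np.arange(64)
    edge = lambda shift: 1.0 / (1.0 + np.exp(-(x - 32 - shift)))

    I0 = edge(0.0)
    grad = np.gradient(I0)                      # spatial intensity gradient
    for shift in (0.01, 0.05, 0.10):            # motions of 1/100 to 1/10 of a pixel
        dI = edge(shift) - I0                   # per-pixel color-value change over time
        # First-order model: dI ~ -grad * shift, so solve for shift by least squares.
        est = -np.sum(dI * grad) / np.sum(grad ** 2)
        print(f"true shift {shift:.2f} px -> estimated {est:.4f} px")

The pixel never "sees" the sub-pixel position directly; the fractional coverage of the edge modulates its intensity in proportion to the motion, which is exactly the effect the parent comment guessed at.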


Thanks for the references! And yea, I watched the video, commented, and then read the article. Thanks for pointing to those quotes.



It seems like event cameras are going to rip this wide open, without requiring crazy cooled-down slow-mo cameras:

https://en.wikipedia.org/wiki/Event_camera

Could end up in a world where every cellphone can turn anyone's windows into a microphone or something.
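A rough sketch of why they fit here: event cameras emit asynchronous (timestamp, pixel, polarity) brightness-change events at microsecond resolution, so binning signed events over time already yields a vibration signal. The event stream below is synthesized, not real camera output:

    import numpy as np

    # Synthetic event stream for one pixel watching a surface vibrating at 440 Hz:
    # random fine-grained timestamps, polarity = direction of brightness change.
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0.0, 1.0, 200_000))
    polarity = np.sign(np.cos(2 * np.pi * 440.0 * t))

    # Sum signed events in fine time bins (event rate ~ brightness derivative),
    # then integrate to get a displacement-like waveform.
    fs = 8000
    bins = np.floor(t * fs).astype(int)
    rate = np.bincount(bins, weights=polarity, minlength=fs)
    audio = np.cumsum(rate)
    audio = audio - audio.mean()

    spectrum = np.abs(np.fft.rfft(audio * np.hanning(len(audio))))
    freqs = np.fft.rfftfreq(len(audio), 1.0 / fs)
    print("dominant frequency:", freqs[spectrum.argmax()], "Hz")   # ~440 Hz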


See also Radio2Speech [1], which uses a UNet to recover audio from an RF beam.

[1] https://zhaorunning.github.io/Radio2Speech/
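For anyone curious what "a UNet" means concretely here: an encoder-decoder with skip connections, mapping one spectrogram to another. This is a generic miniature sketch, not the Radio2Speech architecture, and the shapes are made up:

    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        # One down step, one up step, one skip connection -- the UNet pattern in miniature.
        def __init__(self, ch=16):
            super().__init__()
            self.down1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
            self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
            self.up1 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 2, stride=2), nn.ReLU())
            self.out = nn.Conv2d(ch * 2, 1, 3, padding=1)   # cat(skip, up) doubles channels

        def forward(self, x):                    # x: (batch, 1, freq, time) RF spectrogram
            d1 = self.down1(x)
            d2 = self.down2(d1)
            u1 = self.up1(d2)
            return self.out(torch.cat([u1, d1], dim=1))     # predicted speech spectrogram

    rf = torch.randn(1, 1, 64, 64)               # dummy RF measurement
    print(TinyUNet()(rf).shape)                  # same (1, 1, 64, 64) shape out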


Wonder if intelligence agencies already utilize something similar…


Theremin (yes, that Theremin) invented a device that directed an infrared beam at windows and was able to capture speech from inside. [1] This is separate from his RF-powered "Thing" bug that was installed in the US Embassy in Moscow. [2]

Laser vibrometers have since been used to achieve a similar effect.

Both of these are perhaps not visual in the same way as the method in the article, but they do illustrate the long history of non-traditional eavesdropping techniques.

[1] Albert Glinsky (2000). Theremin: Ether Music and Espionage. University of Illinois Press. p. 10. ISBN 9780252025822.

[2] https://en.wikipedia.org/wiki/The_Thing_(listening_device)


Peter Wright's book Spycatcher, the story of his adventures reverse engineering the Thing, is a good read.

He had another trick of using simple phasing to make cocktail-party-type problems easier for analysts: just play the recording out of phase in each ear, and the brain works it out, apparently.
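It's easy to try yourself: the same mono recording in both ears with one channel phase-inverted, and binaural unmasking helps tease competing talkers apart. A quick sketch using the soundfile library (the filenames are hypothetical):

    import numpy as np
    import soundfile as sf

    mono, sr = sf.read("mixture.wav")            # hypothetical cocktail-party recording
    if mono.ndim > 1:
        mono = mono.mean(axis=1)                 # fold to mono first

    # Identical signal in each ear, but inverted on one side: the interaural
    # phase difference is what the brain exploits to separate sources.
    stereo = np.stack([mono, -mono], axis=1)
    sf.write("antiphase.wav", stereo, sr)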


If you haven't already seen Tom Scott's video on background noise giving away the location of a recording, due to power line frequencies being subtly different, I recommend it: https://youtu.be/e0elNU0iOMY
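The technique (ENF, electrical network frequency analysis) is simple to sketch: track how the mains-hum peak drifts over time, then match that drift curve against the grid's logged frequency. A rough illustration, assuming a 60 Hz grid and a hypothetical filename:

    import numpy as np
    import soundfile as sf
    from scipy.signal import stft

    audio, fs = sf.read("recording.wav")         # hypothetical recording with mains hum
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # 8-second windows give ~0.125 Hz resolution around the 60 Hz hum.
    f, t, Z = stft(audio, fs=fs, nperseg=8 * fs)
    band = (f > 59.0) & (f < 61.0)
    enf = f[band][np.abs(Z[band]).argmax(axis=0)]   # hum frequency per time frame

    # Matching this drift curve against utility logs timestamps the recording.
    print(np.column_stack([t, enf]))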




