Algorithm recovers speech from a potato-chip bag filmed through glass (2014) (mit.edu)
327 points by MrJagil 6 months ago | 109 comments

It is (controversially) hypothesized that at some point we may be able to pull imprinted recordings off ancient artifacts: https://en.wikipedia.org/wiki/Archaeoacoustics#Past_interpre...

> [Jones] claimed to have extracted the hum of the potter's wheel from the grooves of a pot,

So far so good...

> and the word "blue" from an analysis of patch of blue color in a painting.

What the hell?

I don’t know about the techniques but the theory isn’t prima facie impossible. A paintbrush can act like a microphone just like anything else, and if the paintbrush (more likely a putty knife or more rigid object) picked up a sound while applying paint, that could manifest in the paint layer.

>> I don’t know about the techniques but the theory isn’t prima facie impossible.

The Total Perspective Vortex derives its picture of the whole Universe on the principle of extrapolated matter analyses.To explain — since every piece of matter in the Universe is in some way affected by every other piece of matter in the Universe, it is in theory possible to extrapolate the whole of creation — every sun, every planet, their orbits, their composition and their economic and social history from, say, one small piece of fairy cake.

- Douglas Adams

And...someone was saying “blue” the instant they applied the blue paint? That’s certainly not impossible to believe, it just seems like a bit of a stretch.

Maybe it was painted by Bob Ross?

"And the next word we recovered is....another instance of 'happy'. Are we sure the machine is working?"

Happy Little Cloud computing

Titanium white.


It’d be a stretch if they were talking about ancient aliens. Saying the color they’re painting isn’t a change of context.

The paper is behind a paywall, but perhaps it was done as a test and not on an actual artifact. Seems doubtful though.

Article is behind a paywall:


"First Example: This consisted of a pot of fine clay, hand thrown on a potter’s wheel. The wheel in this example was an old, student-made wheel, constructed of an automobile crankshaft and flywheel mounted in a (too) light wooden frame. Persistently out of alignment, the wheel had a noisy vibration almost amounting to a chatter. The pot produced on this wheel was fired at low temperatures. When the pot was suitably mounted on the phono turntable and against the side of the revolving pot was held the phono cartridge (fitted, in this instance, with a “needle” consisting of a flat-ended sliver of wood three-quarters of an inch long) the low-frequency chatter sound could be heard in the earphones."

But that imprint would be static. It would be a nightmare to go from a static imprint to a time series of some kind. Even in the article, they were analysing video, whereby the diffs from one moment to the next could be captured. I bet that as the length of video decreases (become a static picture in the limit), so would the useful output from their algorithm.

Maybe one could start with making and painting things, while blasting super loud sound at them, and do it with a robot so the only difference is the sound. Then check if the difference between sounds and no sound can be detected in any way right after the thing is done.

If you get any results, just a reproducible blip, then try things you haven't made yourself in a controlled environment... but if you can't even get it there, it seems kind of pointless to mess around with really old things.

That's my layman armchair perspective anyway, but from the armchair it makes sense :D

I also once thought of that, but maybe there are other ways in which large vibrations have left fingerprints on materials. I don't mean gravitational waves, but really acoustic phenomena. Like a comet impact or a volcanic eruption. What's a delicate material which would be able to record sound but that doesn't get destroyed by time? Clay that dries up is a logical one.

Old glass or porcelain might be the most viable candidates for a first look as they would be cooling off from a heated, more malleable state, potentially capturing sound from the immediate vicinity as they cooled off.

I have no science to back this up. It's just a hunch.

That may be complicated by old glass never fully hardening. But it sounds plausible enough.

Possibly mortar in walls could record the workers talking. All sorts of pottery start out malleable, so they might be candidates. Cave paintings might be an option - wonder whether being finger painted would leave biological fingerprints behind - heart rate, for instance.

Pretty wild idea. Here’s hoping it has some legs.

Maybe someday we’ll learn about places like Stonehenge this way.

Old glass doesn't flow and is solid. Google it to see plenty of debunking.

Interestingly, gravitational waves might end up a good idea. We now have a couple of patterns to look for, and many giant concrete slabs might have them accidentally recorded.

Anyone remember the TV show Fringe? I seem to remember them using science fiction tech like that. Life imitates art.

It was capturing telephone touch tones on a plate of glass.

Olivia's smartphone could dial based on touch tones, that's totally sci fi these days.

Dialing back based on tones is not sci fi at all. The glass thing is sci fi though. Good stuff!

In the show, Olivia remembers those tones and happens to have an app for that on her smartphone, Walter doesn't make that leap.

I found some applications on the Play Store that can allegedly decode the tones, so it might not be so strange.
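For what it's worth, decoding touch tones is a small, well-understood problem: each key is a pair of tones from the standard DTMF grid, and the Goertzel recurrence measures the energy at each candidate frequency. Below is a minimal Python sketch, not the method of any particular app. The 8 kHz sample rate, the 50 ms tone length, and the `synth_key` test helper are assumptions made for illustration, and it only handles clean, noise-free tones.

```python
import math

# Standard DTMF grid (Hz): each key is one row tone plus one column tone.
ROWS = [697, 770, 852, 941]
COLS = [1209, 1336, 1477, 1633]
KEYS = ["123A", "456B", "789C", "*0#D"]

def goertzel_power(samples, freq, sample_rate):
    """Return the signal power near `freq` using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def decode_key(samples, sample_rate=8000):
    """Pick the strongest row tone and the strongest column tone."""
    row = max(ROWS, key=lambda f: goertzel_power(samples, f, sample_rate))
    col = max(COLS, key=lambda f: goertzel_power(samples, f, sample_rate))
    return KEYS[ROWS.index(row)][COLS.index(col)]

def synth_key(key, sample_rate=8000, duration=0.05):
    """Synthesize a clean two-tone burst for one key (hypothetical test helper)."""
    for r, line in enumerate(KEYS):
        if key in line:
            f_row, f_col = ROWS[r], COLS[line.index(key)]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(sample_rate * duration)
    return [math.sin(2 * math.pi * f_row * t / sample_rate)
            + math.sin(2 * math.pi * f_col * t / sample_rate)
            for t in range(n)]
```

Calling `decode_key(synth_key("5"))` round-trips a synthesized key. Tones recovered from vibrating glass would be far noisier, so a real decoder would also need thresholds and timing checks.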

Well it's certainly thought provoking, thanks for the link.

If interested in this, you might also be interested in the lead author's other research papers


and his 2016 PhD thesis, which includes the "chip bag" research.


I'm interested in this research, from the angle of "Can passive surface measurements tell us what is happening inside the human body?"

Ah, yes, I saw this in 2010. They didn't have high enough resolution to detect actual vibrations, so they used an algorithm based on per-pixel color shifts. Really clever.

Definitely some interesting applications.

Sort of orthogonal, but is there a similar way to get sound from seeing someone's lips move in a video? Would it then be possible to recover audio/conversations from old video-only files?

Edit: Guess they can read lips via AI https://www.techemergence.com/machine-learning-that-learns-m...

Now getting the high quality sound from just video would be amazing.

What about lip-reading with reflected Wi-Fi signals :) :


"WiHear aims to detect human speech by analyzing radio reflections from mouth movements. It requires individual user to train the system extensively, and can recognize only a limited number of words (6 words) with high accuracy."

> Definitely some interesting applications.

Eavesdropping is the obvious one. What others do you have in mind?

"If I'm sitting next to a swimming pool, and somebody dives in - and she's not too pretty, so I can think of something else - I think of the waves and things that have formed in the water. And, uh, when there's lots of people have dived in the pool there's a very great choppiness of all these waves all over the water and to think that it's possible, maybe, that in those waves there's a clue as to what's happening in the pool. That some sort of insect or something with sufficient cleverness could sit in the corner of the pool and just be disturbed by the waves, and by the nature of the irregularities and bumping of the waves have figured out who jumped in where and when and where what's happening all over the pool. And that's what we're doing when we're looking at something. Uh, the light that comes out is ... is waves, just like in the swimming pool except in three dimensions instead of the two dimensions of the pool it's they're going in all directions. And we have a eighth of an inch black hole into which these things go ... which, uh, is particularly sensitive to the parts of the waves that are coming in a particular direction it's not particularly sensitive when they're coming in at the wrong angle which we say is from the corner of our eye. And if we want to get more information from the corner of our eye we swivel this ball about so that the hole moves from place to place. Then ... uh, it's quite wonderful that we can see ... figure out so easy. That's really because the light waves are easier than the ... the waves in the water are a little bit more complicated it would have been harder for the bug than for us but it's the same idea. Figure out what the thing is that we're looking at at a distance."

Transcribed from footage included in the documentary "The Last Journey of a Genius" (1989) by Christopher Sykes, a BBC TV production in association with WGBH Boston and Coronet/MTI Film and Video.


Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second.

That’s a pretty convenient quirk if you ask me.

It's known as the rolling shutter effect. It's most commonly observed when attempting to take still imagery of very fast moving objects like rotors on airplanes, but affects pretty much any image sensor that isn't designed specifically to work around it. The following Wikipedia article goes into far more depth, but the basic problem is that an image sensor doesn't sample every pixel instantly, but instead reads them out serially, and thus there's a tiny, tiny fraction of a second in-between each pixel being sampled. It's small enough to not matter for most purposes, but it's still there, and can be useful or detrimental depending on the application.


A relevant point here is that video compression almost always destroys the signal relied upon in this research.

And it is why mechanical shutters are still a thing in dedicated cameras. It is still much faster to move two pieces of metal or carbon fiber over the sensor than it is to read out a multi-megapixel CMOS sensor. When shooting video, mechanical shutters cannot be used, so the rolling shutter effect can cause artifacts such as fast-moving subjects appearing slanted.

Err, modern global shutters like those made by CMOSIS/ams are _much_ faster than that.

Their fast one is an APS-C sensor with 4k*3k px, a shutter closing time of ~1/120,000 s and a minimum shutter open time of ~1/50,000 s. The shutter closing time might be even faster; I'm just reconstructing it from frame overhead time values I remember, and adjusting for the share the row skew had. Check the datasheet if you like.

If you know a mechanical shutter that can do such, I'd like to know.

This, by chance, allows laser-flash illumination of objects that are behind a close wall of fog, as you can keep the shutter closed while the light travels to the further away object of interest. You will have the blur, but no longer the massive contrast loss due to light pollution. If it's not as bad as fog, and just e.g. normal rainfall or such, you lack the reflection artifacts that would be common, and only retain the refraction artifacts from the light passing through it by necessity.

These sensors do lack a little dynamic range, but you can compensate with some slight trickery, see the datasheet, to get ~15 stops out of this particular device.

Rolling shutter has applications, but videos for users of low knowledge and high ambitions is not one of them.

Could you expand on or link the bit about time of flight trickery in fog or rain?

c * 1 s / 50,000 is still kilometers, so I’m not sure I understand how this shutter can do what sounds like using time of flight to selectively illuminate stuff at a specific depth

No, the part that can solve this is the fast time from insensitive to sensitive, which is shorter than the time from sensitive to no longer sensitive. You will need a pulse with a short enough duration, e.g. a Q-switched Nd:YAG with frequency doubling giving you up to about a dozen Joules at up to a few hundred Hz with pulse durations of under 50 ns, and, while limited in some ways by breakdown peak power in components, a lower limit of about .5 ns.

Most of the shutter time is used to copy the data to the shadow pixel, but just releasing the dark pull won't take long.
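The numbers in this exchange are easy to check. The sketch below is only back-of-the-envelope arithmetic for a gated (time-of-flight) exposure: how far light travels during the quoted shutter times, how long to keep the sensor insensitive for a target at a given range, and how much depth a pulse of a given length smears over. The function names are made up for illustration, and it ignores how quickly the sensor actually switches from insensitive to sensitive, which the thread does not quantify.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def light_travel_m(seconds):
    """Distance light covers in the given time."""
    return C * seconds

def gate_delay_s(target_range_m):
    """Round-trip time to the target: stay insensitive this long after the pulse."""
    return 2.0 * target_range_m / C

def depth_slice_m(pulse_s):
    """Depth smear caused by the pulse length alone (the round trip halves it)."""
    return C * pulse_s / 2.0
```

`light_travel_m(1 / 50_000)` is about 6 km, which matches the objection above, while `depth_slice_m(50e-9)` is about 7.5 m and `depth_slice_m(0.5e-9)` about 7.5 cm. Under these assumptions the depth selectivity comes from the short pulse and the sharp opening edge, not from the total exposure time.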

Yes, global shutters are a thing in expensive professional video cameras. But almost every camera in the world is a phone camera prone to rolling shutter effects both in stills and video shooting. Out of the rest, almost all have mechanical shutters for stills shooting and electronic rolling shutter for video. It’s going to be a while before global shutters become a thing in consumer cameras.

Which is sad, tbh. Because global shutter, while slightly more expensive in terms of area for a given SNR, can do things to both compensate and just generally do "weird" things.

Borrowing from rendering, it seems like a solution would be for every pixel to have "double buffers", one for the measurement in progress and one for the past measurement. Then you just need to make sure you can read all the past values in less than a frame and that all the buffer swaps happen in sync.
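As a toy illustration of the double-buffer idea, here is a small Python model. It is a sketch of the concept as described in this comment, not of how any real sensor is wired; the class and method names are invented for the example.

```python
class DoubleBufferedSensor:
    """Toy model: each pixel has an exposure cell and a storage cell."""

    def __init__(self, n_pixels):
        self.exposing = [0.0] * n_pixels  # measurement in progress
        self.stored = [0.0] * n_pixels    # last finished measurement

    def integrate(self, light):
        """Accumulate incoming light into the exposure cells."""
        for i, value in enumerate(light):
            self.exposing[i] += value

    def swap(self):
        """Global shutter event: every pixel latches at the same instant."""
        self.stored = self.exposing
        self.exposing = [0.0] * len(self.stored)

    def read_out(self):
        """Slow serial readout; safe because the stored cells no longer change."""
        for value in self.stored:
            yield value
```

The point of the design is that `read_out` can take nearly a whole frame period without skewing the image, because new light only ever lands in `exposing`.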

I believe this is the premise behind image sensors which support global shutter (as opposed to rolling shutter) but other than being aware that this feature might exist, I'm no expert on the subject.

Makes me wonder why this is even possible...video of Oval Office argument... https://m.youtube.com/watch?v=Ija-VZwcznE

Pushing the argument to its limit: what happens if we can recover any speech from this or that material (moving leaves, bags, lip movements (AI would be good at that), changes in magnetic fields, whatever)? Should we assume that there is no such thing as a "safe place" anymore? One could argue there never was (because, for example, police can eavesdrop), but this article implies it can reach a whole new level... How is that kind of technology controlled by those in power? Considering Apple is valued at gazillions of dollars, I guess those technologies will be worth a lot and be available to some other very powerful people... James Bond looks quite outdated to me now :-/

Microphones are already quite good. If you can be seen, you can be "heard". That has been true for many years. I think there was a scene in the movie Enemy of the State involving cone mics. The difference with this technology is that someone can "hear" you even if you're around a corner.

Forgive me, as someone with zero actual knowledge in this field. This may be a naive question, but wouldn't this be fairly easily defeated by playing songs in the background?

It might be possible to cancel that noise out. But something like a pink noise generator or an office grey noise generator would probably fool it, the same way it would fool a microphone at distance

This doesn’t work anymore, if it ever did. There is free software capable of isolating audio like a single voice from background noise.

Not sure how this can't work, it's just audio jamming, same as for EM waves.

You just need to broadcast sufficient noise to cover the signal, i.e. the S:N ratio is such that the receiver (crisp packet, microphone, etc.) can no longer see the signal above the noise floor. Now, this might mean broadcasting a loud noise signal, which could overwhelm your (or other friendly) receivers (ears, etc.) so optimal placement of the jamming noise source becomes an issue.

Generally, you want it closer to the threat than you, so that distance attenuation keeps it bearable for you but still jams possible recording devices. Or, introduce high bandwidth noise vibrations into the surfaces of the area you are in, such as windows or walls. Anyway, it's very possible and in use today in secure facilities that must be protected from audio eavesdropping.
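The placement argument can be put in rough numbers. The following Python sketch assumes idealized free-field point sources (level drops by 20*log10 of the distance ratio, no room reflections), and the 60 dB speech and 70 dB jammer levels are made-up example values, not measurements.

```python
import math

def level_at(level_at_1m_db, distance_m):
    """Free-field point source: level falls 20*log10(d) dB relative to 1 m."""
    return level_at_1m_db - 20.0 * math.log10(distance_m)

def snr_db(speech_1m_db, speech_dist_m, noise_1m_db, noise_dist_m):
    """Speech-to-jammer ratio at a receiver, in dB."""
    return (level_at(speech_1m_db, speech_dist_m)
            - level_at(noise_1m_db, noise_dist_m))

# Friendly listener: 1 m from the talker, 4 m from the jammer.
friend = snr_db(60.0, 1.0, 70.0, 4.0)
# Threat surface (say, a window): 4 m from the talker, 0.5 m from the jammer.
window = snr_db(60.0, 4.0, 70.0, 0.5)
```

With these example values the listener keeps a slightly positive ratio (about +2 dB) while the window sees roughly -28 dB, which is the asymmetry the comment describes.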

You could identify the song and then subtract it. If you add white noise, it raises the noise floor but won't affect overall decoding that much, because there is such deep knowledge of which phonemes are likely to follow in a given sequence. You have to assume there is a model trained for each participant, e.g. using telephone intercepts, other listening devices.
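"Identify the song and subtract it" can be sketched in a few lines, under strong assumptions: the reference track is already perfectly time-aligned with the recording and reaches the sensor with only an unknown volume change (no echo, no filtering). The function name is hypothetical, and real recordings would need alignment and an adaptive filter on top of this.

```python
import math

def subtract_known(mix, reference):
    """Remove the best-fitting scaled copy of `reference` from `mix`.

    The gain is the least-squares fit <mix, ref> / <ref, ref>.
    Returns (residual, gain).
    """
    gain = (sum(m * r for m, r in zip(mix, reference))
            / sum(r * r for r in reference))
    residual = [m - gain * r for m, r in zip(mix, reference)]
    return residual, gain
```

If the speech and the song are uncorrelated, the fitted gain lands close to the true mixing level and the residual is mostly speech. Broadband noise has no such reference to subtract, which is why it is the harder case for an eavesdropper.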

Maybe if you simultaneously played back segments of dozens of conversations of the participants talking. That would certainly be confusing for the participants.

I think that would just add music to the sound recovered.

Yes, or even better moderate white noise. Ambient sound will lower the signal to noise ratio between ambient audio and the audio of interest and so essentially render using this to eavesdrop on someone effectively moot.

As an exercise in using tertiary effects to pull in signals you might otherwise be prevented from receiving, well that is pretty cool.

Thanks for the replies everyone. A follow up: would the sound recovered from such vibrations have enough resolution to perform some of the tricks described (isolating voice from background noise or identifying the song and cancelling it out, etc...)?

A fan would probably work too, as it adds more white noise to the already noisy signal

And might blow the chip bag away entirely :)

you can separate the music from the other sounds.

A different approach to decoding vibrations into speech, involving a laser microphone: https://en.wikipedia.org/wiki/Laser_microphone

So if I eat my sour cream cheddar Ruffles in a one-party consent state, could I get in trouble?

Are there any applications for this tech that aren't a massive violation of privacy? Seriously... why develop this? It's super creepy.

Tangential... but why do the US call them 'Chips' whereas in the UK we call them 'Crisps'?

This would also be really cool for recording vocals. I liked the glitchiness of potato chips!

Rule of thumb is never having sensitive conversations where such exploits are feasible.

With enough processing power, you can listen to anyone as long as you can get an object in frame that’s vibrating with speech (the UK and Chinese CCTV networks come to mind). Depending on resolution, I’d expect this to be the case for any footage already stored. It’s just a matter of having a distributed computing job kicked off to comb through video data and add the additional audio metadata.

Will you only speak of sensitive subjects in rooms with no windows? Will we be silent in public? These are issues where technology, politics, and human rights intersect.

Back when this was new, I remember discussing it on Hackaday or somewhere. Someone remarked that they were amazed that the video could pick up on the motion at all; the video looked still to them. I responded that it really was still: the demo video had been lossily encoded for the web (re-encoded by YouTube?), and the encoder took out the subtle changes.

I suspect that most footage already stored is similarly lossily encoded, and that this technique isn't possible on it.

>the UK and Chinese CCTV networks comes to mind

Minor nitpick, but the UK doesn't have a CCTV network. It has a huge number of privately owned CCTV cameras and a relatively small number of CCTV cameras operated by individual local authorities and police forces. The privately-owned cameras aren't joined up in any useful way and are often of very poor quality; the publicly owned cameras are overwhelmingly used for real-time monitoring of busy city centre locations.

Installing CCTV cameras is cheap and easy, but usefully monitoring them is expensive and difficult, even with whizz-bang CV algorithms. I'm deeply sceptical as to how useful any state-level CCTV network would be for mass surveillance. 20 million 4K/30fps cameras would produce something in the region of two exabytes per day; just storing that data would cost about $7bn per month.
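The data-volume figure can be sanity-checked with a little arithmetic. The Python sketch below works out the per-camera bitrate that "two exabytes per day from 20 million cameras" implies, using decimal exabytes (1e18 bytes); the function names are illustrative. It does not try to reproduce the $7bn figure, since that depends on a storage price and retention period the comment does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_bytes(cameras, mbit_per_s):
    """Total bytes per day for `cameras` streams at a constant bitrate."""
    return cameras * mbit_per_s * 1e6 / 8 * SECONDS_PER_DAY

def implied_bitrate_mbit(total_bytes_per_day, cameras):
    """Per-camera bitrate (Mbit/s) that a given daily total implies."""
    return total_bytes_per_day * 8 / SECONDS_PER_DAY / cameras / 1e6

implied = implied_bitrate_mbit(2e18, 20e6)
```

The implied rate comes out at roughly 9.3 Mbit/s per camera, i.e. about 100 GB per camera per day, which is in a believable range for compressed 4K video.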

Run the algo on any reflective objects in the field of view, feed the outputted audio to a transcription NN. Save all high confidence transcribed voice as compressed txt. Save the source video if you pass a threshold of danger keywords. ("bomb", "kill", "prime minister", "didn't pay my tv licence")

Best case scenario, you'll need a nuclear power station and a few dozen data centers full of ASICs. Oh, and an internet's worth of extra bandwidth.

"Will you only speak of sensitive subjects in rooms with no windows" - yes, the building standards for secure facilities (i.e. a room where discussing classified information above a certain grade is permitted) mandate no external windows, and much more than that.

I tried to look this up, info here: https://www.dni.gov/files/NCSC/documents/Regulations/Technic... (Ch. 3 sec. F, pg. 13)

F. SCIF Window Criteria

1. Every effort should be made to minimize or eliminate windows in the SCIF, especially on the ground floor.

2. Windows shall be non-opening.

3. Windows shall be protected by security alarms in accordance with Chapter 7 when they are within 18 feet of the ground or an accessible platform.

4. Windows shall provide visual and acoustic protection.

5. Windows shall be treated to provide RF protection when recommended by the CTTA.

6. All windows less than 18 feet above the ground or from the nearest platform affording access to the window (measured from the bottom of the window), shall be protected against forced entry and meet the standard for the perimeter.

There's also the room in a room concept where one is suspended inside the other. You might add things like ultrasound masking or blocking, too, since it was a known attack vector. From there, I thought about, but can't recall if implemented, some double doorway with buffer in between so opening a door didn't leak sounds/signals. Open one, go in, close it, and then go through other.

Don't sweat the CCTV or recordings:

Reconstructing audio from video requires that the frequency of the video samples — the number of frames of video captured per second — be higher than the frequency of the audio signal. In some of their experiments, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 60 frames per second possible with some smartphones, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.

Right, which is why the researchers were reliant on video artifacts caused by the rolling shutter effect. Even at NTSC resolution, that could get you 262.5 scanlines per field at 60 fields per second (525 per frame at ~29.97 FPS, because of interlacing... NTSC is complicated), which could potentially get you ~15 kHz as a baseline sample rate, depending on how prominent the rolling shutter effect actually is for that image sensor, and how much of the frame your vibrating object consumes vertically. This isn't perfect and would require all sorts of calibration, but I could see it being used to recover windows of time at much higher sample rates, and make inferences about the in-between data with some FFT analysis on what you recovered.

Basically, it's more the rate of _scanlines_ that matters for this technique, and the quality of the image sensor used. The rate of full frames isn't the limiter.
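To make the scanline argument concrete, here is a small Python sketch of the arithmetic. It assumes each row contributes one sample and that rows are read at evenly spaced times across the whole frame period, with no blanking interval; real sensors differ, and the helper names are invented for the example.

```python
def line_rate_hz(rows_per_frame, fps):
    """Rows read per second: the effective sample rate if each row is one sample."""
    return rows_per_frame * fps

def burst_and_gap_s(rows_per_frame, fps, object_rows):
    """Length of the sampled window per frame, and of the blind gap after it."""
    row_period = 1.0 / line_rate_hz(rows_per_frame, fps)
    burst = object_rows * row_period
    return burst, 1.0 / fps - burst

rate = line_rate_hz(262.5, 60)                   # the NTSC-style figures above
burst, gap = burst_and_gap_s(262.5, 60, 105)     # object spans 105 rows (40%)
```

Under these assumptions the row rate is 15,750 samples per second, so tones up to roughly half of that could in principle be represented. An object covering 40% of the frame height gives about 6.7 ms of samples followed by a 10 ms gap in every frame, which is why only windows of time are recovered.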

The article quite literally is about a technique to bypass that limitation and do it with a regular 60fps camera.

It sounds like that technique doesn't quite recover the audio though, just a portion of it:

While this audio reconstruction wasn’t as faithful as that with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.

Saying it may still be good enough pretty strongly implies that the quality is quite low.

There is a video on the webpage that includes the recovered audio. It is low quality, but enough to understand the speaker most of the time, and definitely enough to pose a security risk when analysed by a specialist.

The youtube clip in the article? The voice reconstruction there is from high speed video. If you mean some other video, I'd appreciate a link.

> Will you only speak of sensitive subjects in rooms with no windows?

That is exactly why almost all DOD secure areas are windowless.

> That is exactly why almost all DOD secure areas are windowless.

I'm pretty sure that a bigger factor in why secure facilities have limitations on windows (and especially on ground floor windows) is physical security.

Not really. Non-operable windows do not degrade physical security that much, or really at all if they are not accessible from the ground. It is primarily danger of surveillance that drives the lack of windows.

Don’t disagree. Scale that to an entire populace though.

In the brave new world that's being built, the entire population (most of it) doesn't get to have true confidentiality if somebody is sufficiently interested in them.

Well, windowless bathrooms are not uncommon. And most modern construction has walk-in closets, which typically lack windows. Even so, a source of white noise is always prudent.

And that leads to mold and bad air unless you are careful, which people aren't.

True, but only if you don't ventilate properly. And exhaust fans are also a useful source of white noise. Along with running water.

> Will you only speak of sensitive subjects in rooms with no windows?

Of course. I mean, that's been obvious for at least a decade.

> Will we be silent in public?

There's no need to be silent. However, one must be aware of surveillance risks, and act accordingly.

> These are issues where technology, politics, and human rights intersect.

For sure. But just wanting privacy doesn't work. And you must always deal with what's so.

I think this has shown up on HN previously... Can't find the link though.

>In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag photographed from 15 feet away through soundproof glass.

It's hard enough to discern intelligible speech from many people who are standing right in front of you.

TIL don't leave snack bags out, and the window shades open.

“I’m sure there will be applications that nobody will expect. I think the hallmark of good science is when you do something just because it’s cool and then somebody turns around and uses it for something you never imagined. It’s really nice to have this type of creative stuff.”

Thinking atomic bombs :(

>Thinking atomic bombs :(

The same ones which have prevented large scale open military conflicts involving superpowers for the last 60+ years? Atom bombs have likely saved more lives than they have taken, considering the conventional wars with modern technology we would have had without MAD.

That’s one read on history. Another is that the world has been tremendously lucky to not have suffered a nuclear holocaust since WWII. Many powerful people have done everything they could to advocate for nuclear bombings in the past 60 years, and we’ve had some extremely close calls with near accidental launches.

You are right. It might be too soon to definitively say which take is correct. It might be the case that a disaster (natural, economic, alien invasion, etc) might spark yet another world war.

I think the thing that saved the world was Stalin's stroke, if he had held onto power for longer he had the right kind of volatile personality to go 'fuck it' and start something..

We have no way of knowing, in retrospect, what the real risk of global nuclear war was. We live in the universe where the die roll came up "no", and we don't know if the odds of our survival were 99% or much less than that.

How do you weigh the certain death of millions vs. peace with a small chance of utter annihilation? I don't know, but I don't think it's as easy as you say.


The US atomic bomb project didn't fit that description, I don't think. It was started expressly because people believed they were in a race with the Axis to develop one first. I am surprised that Einstein's letter to FDR isn't all that universally known these days.[1]


That's not really how the history of atomic bombs went down. Szilárd imagined the destructive power of chain reactions in 1933, nuclear fission was discovered in 1938, and soon after the Manhattan Project was started with the clear intention of understanding the science enough to make a bomb.

I frequently encounter people who believe that nuclear energy was harnessed initially for power generation and then co-opted for destructive purposes. The first nuclear reactors were built to produce plutonium for a bomb.

Now imagine doing this with a high res satellite anywhere on earth.

Satellites will never get such resolution due to optics and atmosphere. Drones however...

That would only be useful if everyone were outside, and you wanted a split second of many different people's conversations. Satellites move rather quickly relative to the Earth's surface.

Some of them do, there are geosynchronous spy satellites.


"It may also have a lower resolution video streaming capacity."

Ok, I was coming to say "spy satellites at 40,000 km up - I doubt they can see anything". And if the linked article is correct, the Chinese satellites up there have a resolution of 50m - good luck finding a crisp packet.

But the new generation "might" have a resolution of 1m, which is insane.

Then again, good luck knowing which square meter of the 1/3 of the Earth's surface you can see has the crisp packet in it.

I still think there will be a place for good old bribery, corruption, and sex spy techniques for a while yet.

Recent Chinese optical satellites are thought to have 10cm resolution. All these are low earth orbit. Depends on cloud cover, atmospheric turbulence and look angle.

Still impractical to get sound vibrations from that. But a drone with a laser would work for windows. Think listening in on a conversation in a car.

It was the (seemingly seriously) proposed 1m resolution from geostationary orbit that had me.

Still all this tech is useless without knowing where to point it when. Which usually comes down to human led intel and intelligence led tasking.

I think ... when AI starts deciding which conversation to follow or record then ... well, I for one welcome our new robot overlords.

I guess I stand corrected. But the practicality, as others note, still seems limited.
