I've been doing pro audio for 25 years, and this is a landmark paper, the biggest breakthrough I've seen in years. I'm astonished at the quality of the extracted signals. Biggest thing I've seen since deconvolution became good enough for realtime or near-realtime adaptive noise reduction.
Laser microphones captured a single point. The advantage here is capturing a set of 2D pixels, which lets you see how the signal travels through space. It's essentially a large array of microphones sitting next to each other in perfect sync (which I hear is hard to achieve with multiple physical microphones).
It is hard due to phasing: when overlapping audio signals from the same source are summed during mixing, they can cancel out parts of the signal. Channels on some mixers have a phase inverter, and audio engineers will also move microphones around while watching a phase monitor.
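For anyone who hasn't run into phasing: a quick numpy sketch (sample rate and frequency are just illustrative) of how summing a phase-inverted copy cancels the signal:

```python
import numpy as np

# Two copies of the same 1 kHz tone, one phase-inverted (180 degrees out).
fs = 48_000                      # sample rate, Hz (illustrative)
t = np.arange(fs) / fs           # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)

mixed_in_phase = tone + tone     # constructive: doubles the amplitude
mixed_inverted = tone + (-tone)  # destructive: cancels completely

print(np.max(np.abs(mixed_in_phase)))  # ~2.0
print(np.max(np.abs(mixed_inverted)))  # 0.0
```

Real-world phasing is messier (partial delays cancel only some frequencies, which is the comb-filter sound), but this is the worst case the phase-invert switch exists to fix.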
That isn't how the system works. It basically gives you a small number of contact microphones (you need to aim a laser at each point you want to record).
Interesting interpretation of this as a high-density, high-resolution array! But does it need to be well isolated from movement? Sound wavelengths are long, so maybe not?
It has been, but the quality here is much better. That example might have been good enough for intelligence or legal purposes; this is good enough for commercial/entertainment purposes.
The main headache I see is that it still requires a somewhat expensive and complex camera setup, but I can see that coming within the realms of affordability/standardization quite soon.
'Real' audiophile equipment is the stuff that sells to recording studios like powered studio monitors and rackmount solid state transports. It costs a lot less than the consumer audiophile stuff, too. If you're buying 'consumer' audiophile products like huge floorstanding speakers made of whatever wood of the day is in favour then you're buying status symbols, not audio reproduction quality.
Hmm, from my perspective a pair of JBL LSR 308P MkIIs with a transport and a cheap (Topping/SMSL) amp/DAC matches the price of entry-level consumer audiophile speakers, yet will outperform them any day.
I thought that the sub was less necessary with the 8" monitors? Maybe I misread or misremember.
One of the things that's interesting to note with monitors is that being 'biased flat' is very much a thing, while consumer audiophile speakers are very much not always biased flat; I think the various companies each have a sound profile they aim to target.
I love the creative use of the rolling shutter: instead of seeing it as a downside, they turned the line-by-line nature of the sensor into a sample-rate multiplier.
The use of a rolling shutter to increase the effective sampling rate was present in the original SIGGRAPH 2014 Davis et al. paper from Bill Freeman’s group at MIT, “The Visual Microphone: Passive Recovery of Sound from Video” (https://dspace.mit.edu/handle/1721.1/100023).
The authors of the current paper cite this and other prior work. The key innovation is the use of both a rolling shutter and a global shutter reference.
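Back-of-envelope on why the rolling shutter helps (the row count here is my assumption, not a number from the paper): each sensor row is exposed at a slightly different time, so every row becomes its own time sample of the speckle pattern.

```python
# A rolling shutter reads rows one after another, so each row is a
# separate time sample rather than one sample per frame.
fps = 63              # camera frame rate (from the paper)
rows = 1000           # rows read out per frame -- illustrative assumption
effective_rate = fps * rows

print(effective_rate)  # 63000 row-samples per second, vs. 63 with a global shutter
```

The global-shutter camera then serves as the phase/amplitude reference that makes those row-samples usable as a coherent signal.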
Ooh, that's really cool that they're using the laser speckle pattern. I like that they exploit the rolling shutter too, something which https://people.csail.mit.edu/mrub/VisualMic/ also does.
There are devices called laser Doppler vibrometers, which might also be able to do this by pointing one at the strings/base of the guitar?
There do seem to be videos on YouTube of laser Doppler vibrometers being used with guitars, but I'm not sure whether the soundtrack that goes with them is just from a normal mic.
They're sensitive to very small vibrations. A friend of mine used them while working at a hard drive manufacturer to better understand head and platter vibrations.
LDVs are the gold standard for noncontact vibration measurement and are widely used in acoustics. The main problem is that they’re pretty expensive (on the order of a few tens of thousands of dollars, I think).
Combined with sophisticated noise cancellation and other relatively mature tech, this could make intentional focused listening possible, analogous to fixing your eyes on something, or closing your eyelids, but for hearing.
Imagine being able to shut off specific ambient noises (and sometimes.. people) without losing spatial awareness. Or tune in a source you're paying attention to (the cocktail party problem).
The issue with super-hearing would be to re-adjust expectations of who can reasonably hear us. Could be used for creepy things, obviously..
I'm imagining Q delivering this technology to 007, except that it would shatter the audience's suspension of disbelief. Truly astounding sorcery, yet at the same time completely straightforward.
Incredible separation, I don't think it's attainable by any other means. Should be super useful for speech in noisy environments.
I have a question, though: is capturing lateral movements of a single spot on the instrument enough to represent how it sounds to a human ear? I think it's equivalent to a polarizing filter, as it doesn't seem to capture depth-axis vibrations.
Judging by the example audio at the bottom, the answer seems to be no. But you could potentially use it as a means to isolate the sound from the recorded audio. You would have to synchronize the phases, though, because the optical signal effectively arrives at the speed of light while the acoustic one travels at the speed of sound.
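Rough sketch of the alignment you'd need before mixing the optical pickup with a conventional mic (the distance and sample rate are made-up numbers, not from the paper):

```python
# The optical channel arrives effectively instantly, while the acoustic
# channel reaches the mic at roughly the speed of sound. To mix the two,
# delay the optical channel by about distance / speed_of_sound.
speed_of_sound = 343.0   # m/s at room temperature
distance = 5.0           # metres from source to mic -- illustrative
fs = 48_000              # sample rate of the recording -- illustrative

delay_s = distance / speed_of_sound
delay_samples = round(delay_s * fs)

print(delay_s)        # ~0.0146 s
print(delay_samples)  # ~700 samples at 48 kHz
```

In practice you'd probably estimate the offset by cross-correlating the two channels rather than measuring the distance, but the scale of the correction is a handful of milliseconds.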
Be careful what you wish for; I can see a future where it's like EM today: the entire environment saturated with spread-spectrum (audio) noise, and everyone has to wear a filtering + array-gain device in their ear.
> it doesn't seem to be capturing depth axis vibrations
Good point, the paper mentions x-axis and y-axis, but doesn't mention z-axis. Maybe depth vibrations could be resolved as changes to the interference pattern?
Do human ears resolve sound in more than one dimension? I've always considered that, per ear, we only get a sequence of compressions and rarefactions on the ear drum, that other aspects of hearing are through combination or 'cheating' (skin sensitivity and such). So, would it matter?
There’s binaural localisation from phase and amplitude differences, but importantly also monaural cues from frequency response. Your pinnae act as a set of directional filters, which is part of why some binaural recordings can quite literally feel as though they’re in your head.
What’s interesting here is that while you can’t get those directional filters, you could use a system like this to provide three or more ‘sensors’ (e.g. the chip packet demo) within a scene and isolate signals the same way as an array mic.
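A minimal delay-and-sum sketch of that idea, with three made-up 'sensors' and the per-sensor delays assumed known (in a real array you'd estimate them from geometry or cross-correlation):

```python
import numpy as np

# Delay-and-sum: shift each sensor's signal by its known delay toward the
# target, then average. The target adds coherently; the noise averages down.
def delay_and_sum(signals, delays):
    aligned = [np.roll(s, -d) for s, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)

fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)     # the source we want to isolate

delays = [0, 7, 13]                       # made-up arrival delays, in samples
rng = np.random.default_rng(0)
sensors = [np.roll(target, d) + 0.5 * rng.standard_normal(fs) for d in delays]

out = delay_and_sum(sensors, delays)
# Residual noise drops roughly by sqrt(number of sensors).
print(np.std(sensors[0] - target), np.std(out - target))
```

With three sensors the noise floor falls by about sqrt(3); with a wall full of laser points it could fall much further, which is the appeal of the array reading.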
I was just wishing for something like this yesterday.
I wanted to figure out how to detect some very very low infrasound reliably, and no conventional microphone technology seemed like it could do what I needed.
This feels like it could form the basis of a new wave of scientific vibration measurement systems.
Not only is the tech here astonishing, but full credit to the authors for producing such a clear, concise video, accessible to just about everyone, explaining what they did, how they did it, and why it matters. Excellent storytelling.
When you speak, even your chest vibrates with the sound. Our bones conduct the vibrations in our skulls. This system could eventually be tuned to read those vibrations.
Ok, this is bizarre. A few days ago, and with no connection at all to any of this work, I had a similar idea. Not the same idea, just similar, and mine is just at the “hmm” stage (which also means mine may turn out to not work), but this still feels really weird.
Anyway, my thought was: laser goes to semi-silvered mirror between camera CCD and camera lens, passes through lens to diverge outward into environment, reflects off environment in the same way as a normal laser microphone, separate return signal now exists for all pixels on CCD.
Point this at a wall, do the right transformation (is a Fourier transform sufficient?) and the entire wall can be used as a computational phased array of microphones to listen to a specific (possibly moving) target.
Possibly mix with the original laser light to get beat patterns due to red/blue shift, like radar speed guns, but that feels like a separate application entirely.
I’m really impressed by this result in particular:
> Combining 63-fps video from two cameras, one with a global shutter and one with a rolling shutter, allows the researchers to recover a sound signal at 63,000 Hz
Because shutter speed was my first concern about possible limitations when I had my other idea.
> a 63-fps limit on input data would seem to place a 63-Hz upper limit on the sound this device can "see."
Obviously not, because it's not taking one amplitude sample per frame. It's analogous to taking 63 sliding FFTs per second, each of which may be based on thousands of samples and can capture high-frequency content. The speckle pattern being sampled is some kind of FFT-like transform of the signal, containing lots of information.
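A toy numpy illustration of that point (window size and tone are arbitrary choices of mine): an analysis frame taken only ~63 times per second can still resolve kHz content, because each frame contains thousands of samples.

```python
import numpy as np

# 63 analysis frames per second, but each frame is built from thousands of
# samples, so a single frame resolves frequencies far above 63 Hz.
fs = 48_000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 5000 * t)        # a 5 kHz tone

win = 2048                                 # samples per analysis frame
frame = sig[:win] * np.hanning(win)        # one windowed frame
freqs = np.fft.rfftfreq(win, 1 / fs)
peak = freqs[np.argmax(np.abs(np.fft.rfft(frame)))]

print(peak)  # close to 5 kHz, even at only ~63 such frames per second
```

What the frame rate does limit is how often you get a fresh snapshot, which is exactly the transient-smearing concern raised below.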
But the 63 Hz sampling will have to show up as a limitation. I would expect it to be excellent for periodic signals, but to struggle with transients, like the attack of a percussion instrument such as a snare drum.
I just watched the video and am a little confused about the laser illumination.
Does the laser have to be "aimed" at each object of interest or is it just laser illumination of the whole scene?
The graphics suggest there's a laser point on each object, but that means it has to be aimed and thus follow the subject, right?
Moreover, many objects, like guitars, have complex oscillation modes, so if you're "listening" to just "one point" on the surface, you're not picking up the sound from the other parts of the guitar, which are oscillating differently.
The laser needs to be aimed at each point of interest. The system can track multiple laser points simultaneously with one pair of cameras, but there's a tradeoff with quality (because essentially the camera's field of view is being divided into slices, one for each laser point). So you've got a small number of virtual contact microphones you can point at a surface, not a video where you can see the vibration at each point.
Since childhood I have been waiting for vinyl turntables that use a laser instead of a needle to 'play' records, without any wear and tear. This may be the breakthrough.
It builds on the technology behind laser microphones. According to the paper, visual vibrometry has historically required expensive cameras, and their method removes this need and appears to have other advantages over using a high-speed camera. They say they contribute “a novel method for sensing vibrations at high speeds (up to 63kHz), for multiple scene sources at once, using sensors rated for only 130Hz operation. Our method relies on simultaneously capturing the scene with two cameras equipped with rolling and global shutter sensors, respectively.”