All you need is a couple of CNN layers to identify the funniest possible translations, then make a YouTube channel like Bad Lip Reading and profit!
Even now, https://www.youtube.com/watch?v=5Krz-dyD-UQ
still cracks me up.
Frequency attenuation plus sub-pixel color profiling mean you don't even need an expensive camera in a lot of cases.
Get a plastic cup of water or a similar object, put it on someone's desk, record video from far away, combine it with something like this [0], and you've got a very interesting avenue for corporate espionage. If you could reconstruct typed passwords from the object, it would be a really powerful technique.
For windows, this already exists, IIUC: you bounce a laser beam off the window and measure the vibrations. Random guys can do this in their garage.
The Applied Science guy is most definitely not a random guy in a garage, though; he's incredibly skilled and talented. The rest of his YouTube channel is pretty amazing too.
I am a random guy in a garage, and I made a functional laser microphone using random bits of electronics I had in my spare-bits box ($1 laser pointer, old pair of earphones; snip off one earphone and wire in a light-dependent resistor). Admittedly the quality was awful (you could only just make out voices if people in the room talked abnormally loudly), but it was great for a fun weekend science project :D
Maybe I'm misunderstanding the code, but it looks like it's matching audio to video, not actually recognizing speech given a video. That is, it could answer "does this audio line up with this video?" but not "what is being said in this video?"
I didn't take a deep dive into the code, but in order to train, it's going to need to be fed audio files along with the actual video/mouth shapes/etc. Essentially it needs the audio to compute the reward signal (whether the match was right). Once it "learns", it wouldn't need the audio file.
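If it really is a matcher rather than a recognizer, then inference boils down to scoring whether an audio clip and a video clip belong together, e.g. by comparing learned embeddings. A minimal sketch of that idea (the embedding vectors and the threshold here are assumptions for illustration, not taken from the actual repo):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def audio_video_match(audio_emb: np.ndarray, video_emb: np.ndarray,
                      threshold: float = 0.5) -> bool:
    """Answer "does this audio line up with this video?" by thresholding
    the similarity of their (hypothetical) learned embeddings."""
    return cosine_sim(audio_emb, video_emb) >= threshold

# Toy embeddings: an aligned pair vs. a mismatched pair.
aligned_audio = np.array([0.9, 0.1, 0.0])
aligned_video = np.array([0.8, 0.2, 0.0])
other_video = np.array([0.0, 0.1, 0.9])

print(audio_video_match(aligned_audio, aligned_video))  # similar -> True
print(audio_video_match(aligned_audio, other_video))    # dissimilar -> False
```

Note this only ranks candidate audio tracks against a video; it can't produce a transcript on its own.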
Yeah, it is interesting, and it could also be a big boost to plain olde speech-to-text in cases where you have video, if the errors were non-correlated (which I wasn't able to determine from skimming the readme).
edit: now I see it is being used to match audio samples, not to generate text, so it wouldn't add independent value on top of the audio in this arrangement -- other than, e.g., speaker attribution, which they mentioned.
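To make the non-correlated-errors point concrete: if an audio recognizer and a lip-reading recognizer err independently, the chance that both are wrong on the same word is the product of their individual error rates, which is an optimistic bound on what an ideal combiner could achieve. The rates below are made up purely for illustration:

```python
# Hypothetical per-word error rates for two recognizers whose
# mistakes are assumed to be statistically independent.
audio_wer = 0.10   # audio-only speech-to-text
video_wer = 0.40   # lip-reading only

# With uncorrelated errors, both fail on the same word with
# probability equal to the product of the individual rates.
# (An oracle combiner that always picks the correct one would
# thus err at most this often -- hence the potential boost.)
both_wrong = audio_wer * video_wer

print(both_wrong)  # 0.04: far below either model alone
```

If the errors were strongly correlated (e.g. both models fail on the same mumbled words), this multiplicative gain disappears, which is why the correlation question matters.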
OK. But WHY? All technology has moral implications. Did you create this to actually help people? Do you care if it is weaponized? Think before you create.