Show HN: Lipreading with Deep Learning (github.com)
258 points by irsina 4 months ago | 31 comments

All you need is a couple of CNN layers to identify the funniest possible translations, make a YouTube channel like Bad Lip Reading, then profit! Even now, https://www.youtube.com/watch?v=5Krz-dyD-UQ still cracks me up.

It also has some funny Easter eggs, like a "Fibonacchos" stand.

Combine it with some parametric voice tech and you've got yourself a delicious automated stew.

Using AI to make fun of AI, in a nutshell.

Obligatory HAL 9000 reference: https://youtu.be/1s-PiIbzbhw

How many secret efforts are there to accomplish this already for the MIC?

I can't imagine that there aren't already some Palantir-like efforts to accomplish this.

Imagine a REALLY good zoom lens on a very small drone that cannot be seen or heard by a target, and that drone doing something like this to gather info.

Imagine the same zooming through windows as well.

This will be the next big ML step the military takes towards Total Information Awareness, if it's not already available in the wild.

Here you go:


Frequency attenuation + sub-pixel color profiling mean you don't even need an expensive camera in a lot of cases.

Get a plastic cup of water or similar object, put it on someone's desk, record video from far away, combine with something like this [0], and you've got a very interesting avenue for corporate espionage. If you could reconstruct typed passwords from the object, it'd be a really powerful technique.
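A toy sketch of the idea behind the linked paper: if each key's sound has a distinguishable spectral signature, you can label a few samples and classify the rest. Everything here is synthetic and hypothetical (the per-key frequencies, the classifier); the real Berkeley attack uses cepstral features plus a language model, not nearest-centroid spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
FS = 8000  # sample rate, Hz (assumed)

def keystroke(freq, n=512):
    """Synthesize a noisy 'click' whose dominant frequency identifies the key."""
    t = np.arange(n) / FS
    return np.sin(2 * np.pi * freq * t) * np.exp(-t * 200) + 0.1 * rng.standard_normal(n)

# Hypothetical per-key acoustic resonances -- real keyboards differ per unit.
KEY_FREQS = {"a": 900.0, "s": 1300.0, "d": 1700.0}

def spectrum(x):
    return np.abs(np.fft.rfft(x))

# "Train": average the spectrum per key over a few labeled recordings.
centroids = {k: np.mean([spectrum(keystroke(f)) for _ in range(10)], axis=0)
             for k, f in KEY_FREQS.items()}

def classify(x):
    """Nearest-centroid match in spectral space."""
    s = spectrum(x)
    return min(centroids, key=lambda k: np.linalg.norm(s - centroids[k]))

recovered = "".join(classify(keystroke(KEY_FREQS[k])) for k in "sad")
print(recovered)  # recovers the typed keys from audio alone
```

The interesting (and scary) part of the real attack is that it works unsupervised: keystroke statistics plus English letter frequencies are enough to bootstrap the labels.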

[0] https://people.eecs.berkeley.edu/~tygar/papers/Keyboard_Acou...

For windows, they already have this, IIUC: you bounce a laser beam off the window and measure the vibrations. Random guys can do this in their garage.


The Applied Science guy is definitely not a random guy in a garage, though; he's incredibly skilled and talented. The rest of his YouTube channel is pretty amazing, too.

I am a random guy in a garage, and I made a functional laser microphone using random bits of electronics I had in my spare-bits box ($1 laser pointer, old pair of earphones, snip off an earphone and wire in a light-dependent resistor). Admittedly the quality was awful (you could only just make out voices if people in the room talked abnormally loudly), but it was great for a fun weekend science project :D

A write-up or video of the components and build would be interesting.

I would imagine an LDR would be too slow to have a good response to audio.
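A quick back-of-the-envelope check on that point, assuming a CdS photoresistor behaves like a first-order low-pass with a ~20 ms response time (a typical datasheet ballpark, and an assumption here; photodiodes respond in microseconds):

```python
import math

tau = 0.020  # seconds, assumed LDR response time
f_cutoff = 1 / (2 * math.pi * tau)  # -3 dB corner of a first-order low-pass

def attenuation_db(f):
    """Gain of a first-order low-pass at frequency f, in dB."""
    return -10 * math.log10(1 + (f / f_cutoff) ** 2)

print(f"cutoff: {f_cutoff:.0f} Hz")   # roughly 8 Hz
for f in (100, 300, 1000):            # voice band
    print(f"{f} Hz: {attenuation_db(f):.0f} dB")
```

So voice frequencies come through attenuated by tens of dB but not entirely gone, which squares with the parent's "you could only just make out voices if people talked abnormally loud."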

You can do it with video too, even: https://www.youtube.com/watch?v=FKXOucXB4a8

Also, this is only tangentially related, but you can also see through walls using WiFi: https://www.youtube.com/watch?v=kBFMsY5ZP0o

I wonder if this technique has made its way from MIT to the clandestine services yet... you sort of have to assume it has.

Maybe I'm misunderstanding the code, but it looks like it's matching audio to video, not actually recognizing speech given a video. That is, it could answer "does this audio line up with this video?" but not "what is being said in this video?"
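A minimal sketch of the "does this audio line up with this video?" framing. The repo itself trains a neural net for this; here a plain correlation between a mouth-opening track and an audio loudness envelope stands in for the learned sync score, and all signals are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200  # frames

# Synthetic "speech activity": shared signal behind both modalities.
speech = np.abs(rng.standard_normal(T))
mouth_opening = speech + 0.05 * rng.standard_normal(T)   # "video" feature
audio_envelope = speech + 0.05 * rng.standard_normal(T)  # matching audio
wrong_audio = np.abs(rng.standard_normal(T))             # unrelated audio

def sync_score(video_feat, audio_feat):
    """Correlation-based stand-in for a learned audio/video sync score."""
    v = video_feat - video_feat.mean()
    a = audio_feat - audio_feat.mean()
    return float(v @ a / (np.linalg.norm(v) * np.linalg.norm(a)))

matched = sync_score(mouth_opening, audio_envelope)
mismatched = sync_score(mouth_opening, wrong_audio)
print(matched, mismatched)  # the matching audio should score much higher
```

That's exactly the distinction being drawn: a model like this can rank candidate audio tracks against a video, which is a different (easier) task than producing a transcript from the video alone.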

I didn't take a deep dive into the code, but in order to train it's going to need to be fed audio files along with the actual video/mouth shapes/etc. Essentially it needs the audio to compute the training signal (whether it was right). Once it "learns", it wouldn't need the audio file.

In order to train, doesn't it have to match audio output to a video of mouth movement?

Doesn't deep learning imply training on sample results?

Exactly. How is this "lipreading"? Clickbait.

Open the pod bay doors, HAL.

This scene would actually make a really cool test case!

I'm sorry, Dave. I'm afraid I can't do that.

We're getting there!

This is fascinating. Has anyone considered repurposing this for something like sign language?

That's actually a really good application with some real potential for improving lives. High five mate

Yeah, it is interesting, and it could also be a big boost to plain olde speech-to-text in cases where you have video, if the errors were uncorrelated (which I wasn't able to determine from skimming the readme).

edit: now I see it is being used to match audio samples, not to generate text, so it wouldn't create value independent of the audio in this arrangement, other than e.g. the speaker attribution they mentioned.

No demonstration video?

Not the same project, but here's one from Oxford + Deepmind:


Some prior work from 2001: https://youtu.be/1s-PiIbzbhw

OK. But WHY? All technology has moral implications. Did you create this to actually help people? Do you care if it is weaponized? Think before you create.

It reflects poorly on this community that any comment that questions the ethics of technology gets downvoted.
