All you need is a couple of CNN layers to identify the funniest possible translations, then make a YouTube channel like Bad Lip Reading and profit!
Even now, https://www.youtube.com/watch?v=5Krz-dyD-UQ
still cracks me up.
Frequency attenuation plus sub-pixel color profiling mean you don't even need an expensive camera in a lot of cases.
Get a plastic cup of water or a similar object, put it on someone's desk, record video from far away, combine it with something like this [0], and you've got a very interesting avenue for corporate espionage. If you could reconstruct typed passwords from the object, it would be a really powerful technique.
For windows, this already exists, IIUC: you bounce a laser beam off the window and measure the vibrations. Random guys can do this in their garage.
The Applied Science guy is most definitely not a random guy in a garage, though; he's incredibly skilled and talented. The rest of his YouTube channel is pretty amazing too.
I am a random guy in a garage, and I made a functional laser microphone using random bits of electronics I had in my spare-bits box ($1 laser pointer, old pair of earphones; snip off one earphone and wire in a light-dependent resistor). Admittedly the quality was awful (you could only just make out voices if people in the room talked abnormally loudly), but it was great for a fun weekend science project :D
Maybe I'm misunderstanding the code, but it looks like it's matching audio to video, not actually recognizing speech given a video. That is, it could answer "does this audio line up with this video?" but not "what is being said in this video?"
I didn't take a deep dive into the code, but in order to train, it's going to need to be fed audio files along with the actual video/mouth shapes/etc. Essentially it needs the audio to compute the reward signal (whether the match was right). Once it "learns", it wouldn't need the audio file.
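If it really is a matcher rather than a recognizer, then inference boils down to scoring whether an audio clip and a video clip belong together, e.g. by comparing learned embeddings. A minimal sketch of that idea (the embedding vectors and the threshold here are assumptions for illustration, not taken from the actual repo):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def audio_video_match(audio_emb: np.ndarray, video_emb: np.ndarray,
                      threshold: float = 0.5) -> bool:
    """Answer "does this audio line up with this video?" by thresholding
    the similarity of their (hypothetical) learned embeddings."""
    return cosine_sim(audio_emb, video_emb) >= threshold

# Toy embeddings: an aligned pair vs. a mismatched pair.
aligned_audio = np.array([0.9, 0.1, 0.0])
aligned_video = np.array([0.8, 0.2, 0.0])
other_video = np.array([0.0, 0.1, 0.9])

print(audio_video_match(aligned_audio, aligned_video))  # similar -> True
print(audio_video_match(aligned_audio, other_video))    # dissimilar -> False
```

Note this only ranks candidate audio tracks against a video; it can't produce a transcript on its own.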
Yeah, it is interesting, and it could also be a big boost to plain olde speech-to-text in cases where you have video, if the errors were non-correlated (which I wasn't able to determine from skimming the readme).
edit: now I see it is being used to match audio samples, not to generate text, so it wouldn't add independent value on top of the audio in this arrangement -- other than, e.g., speaker attribution, which they mentioned.
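To make the non-correlated-errors point concrete: if an audio recognizer and a lip-reading recognizer err independently, the chance that both are wrong on the same word is the product of their individual error rates, which is an optimistic bound on what an ideal combiner could achieve. The rates below are made up purely for illustration:

```python
# Hypothetical per-word error rates for two recognizers whose
# mistakes are assumed to be statistically independent.
audio_wer = 0.10   # audio-only speech-to-text
video_wer = 0.40   # lip-reading only

# With uncorrelated errors, both fail on the same word with
# probability equal to the product of the individual rates.
# (An oracle combiner that always picks the correct one would
# thus err at most this often -- hence the potential boost.)
both_wrong = audio_wer * video_wer

print(both_wrong)  # 0.04: far below either model alone
```

If the errors were strongly correlated (e.g. both models fail on the same mumbled words), this multiplicative gain disappears, which is why the correlation question matters.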
OK. But WHY? All technology has moral implications. Did you create this to actually help people? Do you care if it is weaponized? Think before you create.