
End-to-End Audiovisual Speech Recognition - ghosthamlet
http://arxiv.org/abs/1802.06424v2
======
nmca
Y'all didn't cite some extremely relevant work:
[https://arxiv.org/abs/1709.00572](https://arxiv.org/abs/1709.00572)
(disclaimer - they're from my lab)

~~~
radarsat1
What is HN's impression of the etiquette of "demanding" citations? I agree with
nmca that it's the author's onus to research prior work as widely as possible,
but if something is missed, presumably both by the authors and the reviewers,
and therefore clearly was not actually a reference for the work in question,
is the author really at fault? Is it "ok" to demand a citation like this?

~~~
syllogism
Closed-form peer review is basically a (very tight) spam filter in ML at the
moment. Everyone wants the papers to appear quickly, so reviewers can't
meaningfully require changes --- papers are in or out. Overall this is net
better. Requiring changes leads to bike-shedding, and really long publication
processes.

Most review is community review, of exactly this sort of form. So we really
don't want to add a lot of politeness constraints around requesting citations.

The literature is moving so fast that authors can get away with assuming
readers won't know about relevant work. If we let that thrive, we'll reward
bad actors who "forget" prior work and claim novelty.

------
rahimnathwani
Wow: "In presence of high levels of noise, the end-to-end audiovisual model
significantly outperforms both audio-only models."

------
diminish
I am curious what the success rate would be without any audio - by just
reading lips and face.

~~~
IshKebab
I'm going to guess "very bad" given that even experts can't do this very well.
I don't think there is enough information.

Edit: It does say in the paper actually. The "classification rate" (I assume %
of words correctly identified - they only recognise single words from a
dictionary of 500) is 82 for visual-only and 98 for both audio-only and
audio-visual (without noise).

Without noise the audio is good enough to be near perfect (I assume 100 is
perfect) so the video doesn't really help. It helps when there is noise (which
matches real life experience - people lip read in noisy bars).

~~~
mwcampbell
> people lip read in noisy bars

Maybe that explains why I, as a visually impaired person who can't lip read,
have trouble carrying on conversations in noisy bars. Good to know!

