
Ask HN: Why don't we use subtitled films/tv to train speech recognition? - sycren
There are thousands of films and TV episodes that are subtitled throughout, and millions of songs for which lyrics are available. Would it not be possible to use this material to train speech recognition? That would make it possible to train on the many dialects and accents of a particular language.

Speech recognition as a technology has always appeared to move slowly, although with the rise of mobile devices it is becoming increasingly popular.

Is anyone doing anything like this?
======
tcarnell
I used to work for a company that built Speech Recognition systems and I came
up with a similar/related idea - the idea being to take a load of videos of
Barack Obama (for example), and create an accurate 'voice print'. Once done,
any videos or speech could be scanned and if Barack Obama's voice print was
recognized/detected, the recognizer could be tuned to his voice print AND
could apply a set of appropriate grammars/vocabulary (for example the
'politics' grammar, or 'american' grammar or 'economics' grammar) - then you
could very accurately perform speech recognition and automatically create text
transcripts. Then when you google for text, you could actually retrieve
videos whose content exactly matches the search terms and jump directly to
that part of the video.
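
A minimal sketch of the search part, assuming a recognizer has already produced time-stamped words. The `build_index` and `find_phrase` helpers and the `transcripts` layout are hypothetical, made up here for illustration:

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each lowercased word to a list of (video_id, word_position).

    `transcripts` is a hypothetical structure:
    {video_id: [(seconds, word), ...]} in spoken order.
    """
    index = defaultdict(list)
    for video_id, words in transcripts.items():
        for pos, (_, word) in enumerate(words):
            index[word.lower()].append((video_id, pos))
    return index

def find_phrase(transcripts, index, phrase):
    """Return (video_id, start_seconds) for each exact occurrence of `phrase`."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    # Look up the first word, then check the following words in place.
    for video_id, pos in index.get(terms[0], []):
        words = transcripts[video_id]
        window = [w.lower() for _, w in words[pos:pos + len(terms)]]
        if window == terms:
            hits.append((video_id, words[pos][0]))
    return hits

# Made-up example data.
transcripts = {
    "speech_01": [(12.0, "the"), (12.3, "economy"), (12.9, "is"), (13.1, "growing")],
    "speech_02": [(4.5, "our"), (4.8, "economy"), (5.2, "needs"), (5.6, "jobs")],
}
index = build_index(transcripts)
print(find_phrase(transcripts, index, "economy is"))
```

The returned start time is what a player would seek to in order to jump to that part of the video.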

Over time you could build up a database of voice prints and grammars for not
just celebrities and politicians, but also criminals (for automatic
identification).

I had this idea almost 4 years ago, submitted it to the company, but it wasn't
taken seriously.

If anybody is interested in this, let me know!

~~~
amirmc
_> when you google for text, you could actually retrieve videos whose content
exactly matches the search terms and jump directly to that part of the video_

The search aspect of this is very interesting and I hadn't thought of it
before (though in hindsight it seems like an obvious benefit).

------
eftpotrm
Aside from issues relating to background noise on the soundtrack, the
subtitles are frequently abridged from the spoken word in the interests of
space and / or readability, so you'd need to account for that in your
algorithm.
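
One rough way to measure how much a subtitle has been abridged, assuming a small hand-made verbatim transcript is available for comparison, is a word-level similarity score. This sketch uses Python's standard `difflib`; the `word_similarity` helper and both example sentences are made up for illustration:

```python
import difflib
import re

def words(text):
    """Lowercase and split into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def word_similarity(spoken, subtitle):
    """Similarity in [0, 1] between two word sequences (1.0 means identical)."""
    matcher = difflib.SequenceMatcher(None, words(spoken), words(subtitle))
    return matcher.ratio()

# Made-up example: what was said versus what the subtitle shows.
spoken = "Well, I really don't think that we should go there tonight."
subtitle = "I don't think we should go tonight."
print(round(word_similarity(spoken, subtitle), 2))
```

Pairs that score below some chosen threshold could be dropped or down-weighted before training.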

If it were me... Project Gutenberg has free books available in both audio and
text formats. You may well again run into issues with the spoken and written
text not exactly matching (it's not something I've looked into to know) but I
wouldn't be surprised if it was rather less than what I've observed in
subtitles, and the data concerned is in a more easily parsed format.

~~~
rcthompson
Audio recordings of book readings are less practical than subtitles because
they are not synchronized. Every subtitle in a film is associated with the
sound clip that plays while it is visible, whereas for an audiobook or
similar, any algorithm would have to "align" the audio and text in order to
obtain usable training data, and then it would have to deal with the errors
introduced by this process.
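
To illustrate the synchronization that subtitles already carry, here is a minimal sketch that reads cue timings from SubRip (`.srt`) text, assuming the usual `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line. The `parse_srt` helper and the sample captions are made up for illustration:

```python
import re

# Timing line of an SRT cue, e.g. "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)

def to_seconds(h, m, s, ms):
    # Assumes a three-digit millisecond field, as is usual in SRT files.
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                groups = match.groups()
                start = to_seconds(*groups[:4])
                end = to_seconds(*groups[4:])
                # Everything after the timing line is caption text.
                caption = " ".join(l.strip() for l in lines[i + 1:])
                cues.append((start, end, caption))
                break
    return cues

sample = """1
00:00:01,000 --> 00:00:02,500
Hello there.

2
00:01:03,250 --> 00:01:05,000
How are
you today?
"""
for cue in parse_srt(sample):
    print(cue)
```

Each `(start, end)` pair marks the audio span to cut for that caption. The timing says when the caption is displayed, so it is only a rough match to when the words are spoken.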

~~~
eftpotrm
Hmm, good point, though still against a cleaner audio signal, easier input
data to process and (likely) closer matching of text. I'm not remotely
involved with the field so I won't indulge in further wild speculation, but an
interesting balance.

------
killa_bee
I happen to know that they do this at the Linguistic Data Consortium
(<http://www.ldc.upenn.edu/>), at least with cable news shows. They mostly do
that to obtain data for languages with more minimal resources though, and for
the purposes of transcription, not for speech recognition qua engineering
research. The real issue though is the research community is interested in
increasing the accuracy of recognizers on standard datasets by developing
better models, not increasing accuracy per se. Having used more data isn't
publishable. Further, in terms of real gains, the data is sparse (power law
distributed), and so we need more than just a constant increase in the amount
of data. This issue is general to any machine-learning scenario but is
particularly pronounced in anything built on language.
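
A small sketch of why a constant increase in data helps less and less when word frequencies follow a power law. It assumes an idealized Zipf distribution with exponent 1 over a 100,000-word vocabulary and independent draws; these are illustrative assumptions, not measurements from any corpus:

```python
def zipf_probabilities(vocab_size, exponent=1.0):
    """Probability of each word by rank under an idealized Zipf law."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_types_seen(probs, n_tokens):
    """Expected number of distinct words seen at least once in n_tokens draws."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)

# Assumed vocabulary size, chosen only for illustration.
probs = zipf_probabilities(vocab_size=100_000)
for n in (10_000, 20_000, 40_000, 80_000):
    print(n, round(expected_types_seen(probs, n)))
```

Each doubling of the token count adds less than double the number of distinct words seen, so the rare words stay under-sampled.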

Some related papers ~ Moore R K. 'There's no data like more data (but when
will enough be enough?)', Proc. Inst. of Acoustics Workshop on Innovation in
Speech Processing, IoA Proceedings vol.23, pt.3, pp.19-26, Stratford-upon-
Avon, 2-3 April (2001). Charles Yang. Who's afraid of George Kingsley Zipf?
Ms., University of Pennsylvania.
<http://www.ling.upenn.edu/~ycharles/papers/zipfnew.pdf>

------
hartror
Well I am sure they would do, though subtitles aren't the most reliable source
for movie dialog. Often the dialog is altered subtly to fit the space and
timing requirements.

~~~
tintin
And they are translations not speech-2-text.

~~~
andreasvc
Movies are also subtitled in the same language for the hearing impaired.

------
drKarl
There are two different but correlated fields: Speech recognition and Natural
Language Understanding. Speech Recognition is easier if the scope is minimised,
that is, if the system knows which subset of keywords or orders to recognize.
But recognizing an open scope, including different accents, slang, etc is a
much more difficult task.

~~~
sycren
I mean there must be millions of times where a character has said 'hello' in a
film or tv episode. Each person may have a slightly different way of saying it
which can then be used to make a model for speech recognition software which
may no longer require the user to train the software.

It may also be possible to automate the entire process as we have both the
audio and the words spoken at a particular time.

Take it a step further, we have millions of sung songs with lyrics that can
also be used. It's a gold mine of information that can be repurposed.

------
tcarnell
Very interesting. In a similar vein, automated language translation could be
assisted too - because a film DVD often has different audio and subtitle
languages, so it would be possible to pair up semantically similar audio and
written content... and put it all into a magic computer

------
mooism2
For the purposes of speech recognition, songs strike me as being particularly
_noisy_.

~~~
0x12
That's a plus though, on the testing front. Once it works with random sections
from songs that were not part of the training set that would be a significant
improvement over what we have today.

The problem would be the disproportionate weights given to the words 'I',
'love', 'you', 'baby'. Songs are probably not the best training data when it
comes to getting a well rounded vocabulary.

------
SandB0x
A similar idea: "Learning sign language by watching TV (using weakly aligned
subtitles)" from CVPR 2009:

<http://www.comp.leeds.ac.uk/me/Publications/cvpr09_bsl.pdf>

------
adsahay
For films and music the audio data may have too much noise, but TV programmes
with low background noise (news, documentary, interview) with available Closed
Captions (CC) are good training sources. CC transcripts are enforced by
broadcasting regulators so they should be highly accurate.

The big problem with using these sources is the huge vocabulary. Speech
recognition works better for smaller vocabularies than bigger.

------
fbnt
<http://voxforge.com> has been collecting a large speech corpus over the last
few years, under the GPL license. That should be the way to follow imho.

Training a speech recognition engine is quite a sophisticated process, and
usually requires at least a clean (not noisy) set of samples, which you can't
find in dubbed movies and surely not in music.

------
detst
Google had a service (6 or more years ago) that would search TV shows (that
they were recording themselves) and provide back the transcripts and thumbnail
images for any matches. I suspect this was used as training data.

