Ask HN: Why don't we use subtitled films/tv to train speech recognition?
30 points by sycren on Aug 30, 2011 | 34 comments
There are thousands of films and TV episodes that have subtitles throughout their duration, and millions of songs that we can find lyrics for. Would it not be possible to use this material to train speech recognition? This would then make it possible to train on the many different dialects and accents of a particular language.

Speech recognition as a technology has always appeared to move slowly, although with the rise of mobile devices it is becoming increasingly popular.

Is anyone doing anything like this?

I used to work for a company that built speech recognition systems and I came up with a related idea: take a load of videos of Barack Obama (for example) and create an accurate 'voice print'. Once done, any videos or speech could be scanned, and if Barack Obama's voice print was detected, the recognizer could be tuned to his voice print AND could apply a set of appropriate grammars/vocabularies (for example the 'politics' grammar, the 'American' grammar or the 'economics' grammar). You could then perform speech recognition very accurately and automatically create text transcriptions. Then when you google for text, you could actually retrieve videos whose content exactly matches the search terms and jump directly to that part of the video.

Over time you could build up a database of voice prints and grammars not just for celebrities and politicians, but also for criminals (for automatic identification).

I had this idea almost 4 years ago, submitted it to the company, but it wasn't taken seriously.

If anybody is interested in this, let me know!

Google is doing something comparable with Google Voice and Search recognition transcriptions, inviting corrections both manually and by using similar techniques to spell correct in text search.

I suspect a lack of data is not the biggest challenge in improving speech recognition

> when you google for text, you could actually retrieve videos whose content exactly matches the search terms and jump directly to that part of the video

The search aspect of this is very interesting and I hadn't thought of it before (though in hindsight it seems like an obvious benefit).
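
A rough sketch of that search side, assuming the transcripts already exist as timed segments. The names `build_index` and `search` and the sample segments are made up for illustration, and punctuation handling is left out to keep it short:

```python
from collections import defaultdict

def build_index(segments):
    """Map each lowercase word to the (video_id, start_seconds) pairs where it is spoken."""
    index = defaultdict(list)
    for video_id, start, text in segments:
        # set() so a word repeated inside one segment is indexed only once
        for word in set(text.lower().split()):
            index[word].append((video_id, start))
    return index

def search(index, query):
    """Return sorted (video_id, start_seconds) hits whose segment contains every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

# Hypothetical transcript segments: (video id, start time in seconds, spoken text)
segments = [
    ("speech_a", 12.0, "the economy is growing"),
    ("speech_a", 47.5, "we must fix the economy"),
    ("speech_b", 3.0, "health care reform"),
]
index = build_index(segments)
print(search(index, "the economy"))
```

Each hit carries a start time, which is what a player would need in order to jump straight to that part of the video.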

Execute the idea for yourself. Good ideas are a dime-a-dozen.

Off-topic: I understand your point but I've come to realise that good ideas are not a 'dime-a-dozen'. Mediocre ideas, maybe (although telling them apart is not necessarily obvious).

I would also say that with this particular idea, I have no idea whether you could data mine films and other media without having to pay some hefty licence fees that may deter startups without enough funding.

I don't believe there are any laws that would mandate this, at least in America. As long as you're merely analyzing legitimately acquired content, I don't see how anybody else has any say in what you're doing. Copyright isn't an absolute control over what can and can't be done with your work; it's just a right to restrict reproduction.

I'm interested.

I would be interested in what you might come up with

The problem with implementing this idea is which technology would actually have to be built, and in which technology the real value would lie. Once the technology is established, scanning videos, tv, radio etc for any source of spoken audio and building up a database of indexable dialogs would almost be the easy part.

Building a speech recognizer is not only difficult, it has also been attempted many times before, and unless a speech recognition guru could bring something new to the world, the best we could do is what is already available. So it is probably best to use existing technology, which is often not cheap to license. This is also true of voice print technologies.

The key to getting this up and running lies in finding or building a really good speech recognizer and voice print generator/verifier...

Maybe this is something Y Combinator would be interested in funding? I am based in Europe (Spain at the moment) and I think it would be really hard to convince people to fund this type of technology over here.

If anybody is up for the challenge, I'd love to be involved!

Aside from issues relating to background noise on the soundtrack, the subtitles are frequently abridged from the spoken word in the interests of space and / or readability, so you'd need to account for that in your algorithm.

If it were me... Project Gutenberg has free books available in both audio and text formats. You may well again run into issues with the spoken and written text not exactly matching (it's not something I've looked into to know) but I wouldn't be surprised if it was rather less than what I've observed in subtitles, and the data concerned is in a more easily parsed format.

Audio recordings of book readings are less practical than subtitles because they are not synchronized. Every subtitle in a film is associated with the sound clip that plays while it is visible, whereas for an audiobook or similar, any algorithm would have to "align" the audio and text in order to obtain usable training data, and then it would have to deal with the errors introduced by this process.
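
To make the synchronization point concrete, here is a minimal sketch that turns subtitle cues into (start, end, text) tuples that could be used to cut audio clips. It assumes the subtitles are in SubRip (.srt) format, which the thread does not specify, and `parse_srt` and `to_seconds` are illustrative names:

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def to_seconds(stamp):
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds (assumes millisecond precision)."""
    h, m, s, ms = TIME.match(stamp.strip()).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    # Cues are separated by blank lines
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        # Find the timing line; a numeric counter line usually precedes it
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                caption = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((to_seconds(start), to_seconds(end), caption))
                break
    return cues

SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:01:02,250 --> 00:01:04,000
How are
you today?
"""
cues = parse_srt(SAMPLE)
for start, end, caption in cues:
    print(start, end, caption)
```

The cue window only says when the text was on screen, so the words inside it would still need finer alignment against the audio before they are usable as training pairs.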

Hmm, good point, though still against a cleaner audio signal, easier input data to process and (likely) closer matching of text. I'm not remotely involved with the field so I won't indulge in further wild speculation, but an interesting balance.

As 0x12 states further down, noise can be seen as beneficial. With such a huge dataset, perhaps it would be possible to advance speech recognition to the point where it can transcribe speech in busy places, as needed in mobile applications where the user is not in a quiet room.

Perhaps, but I wouldn't use that as a starting dataset; noise resilience and a training set for enhancing this functionality is surely better developed on top of a working implementation for a lower-noise input? Better to build the easier solution and reinforce it for hard problems than try to go straight at the hard problems.

Put it another way; which would you start by teaching a student: the easy situations or the more complex situations?

I happen to know that they do this at the Linguistic Data Consortium (http://www.ldc.upenn.edu/), at least with cable news shows. They mostly do that to obtain data for languages with fewer resources though, and for the purposes of transcription, not for speech recognition qua engineering research. The real issue though is that the research community is interested in increasing the accuracy of recognizers on standard datasets by developing better models, not in increasing accuracy per se. Having used more data isn't publishable. Further, in terms of real gains, the data is sparse (power law distributed), so we need more than just a constant increase in the amount of data. This issue is general to any machine-learning scenario but is particularly pronounced in anything built on language.

Some related papers:

Moore R K. 'There's no data like more data (but when will enough be enough?)', Proc. Inst. of Acoustics Workshop on Innovation in Speech Processing, IoA Proceedings vol.23, pt.3, pp.19-26, Stratford-upon-Avon, 2-3 April (2001).

Charles Yang. Who's afraid of George Kingsley Zipf? Ms., University of Pennsylvania. http://www.ling.upenn.edu/~ycharles/papers/zipfnew.pdf
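
The sparsity point can be illustrated with a small sketch: rank words by frequency and measure how many distinct words occur only once. The function names and the toy sentence are invented for illustration; a real check would run over a full transcript corpus:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return [(rank, word, count)] from most to least frequent (ties broken alphabetically)."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]

def singleton_share(tokens):
    """Fraction of distinct words that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy data standing in for a transcript corpus
tokens = "the cat saw the dog and the dog saw a bird".split()
for rank, word, count in rank_frequency(tokens):
    print(rank, word, count)
print(singleton_share(tokens))
```

In power-law distributed text a large share of the vocabulary sits in that once-only tail, which is why simply adding a fixed amount of data gives diminishing coverage of rare words.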

Well I am sure they would do, though subtitles aren't the most reliable source for movie dialog. Often the dialog is altered subtly to fit the space and timing requirements.
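
One hedged way to quantify how far a subtitle drifts from the spoken line is word error rate: word-level edit distance divided by the length of the reference. This is a standard metric rather than anything proposed in the thread, and the example sentences are invented:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the ref words processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word dropped from the reference
                            curr[j - 1] + 1,      # extra word in the hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Invented example: what was said versus what the subtitle shows
spoken = "well i really do not think that is a good idea"
subtitle = "i don't think that's a good idea"
print(word_error_rate(spoken, subtitle))
```

Cues whose rate against a trusted transcript exceeds some chosen threshold could be dropped from the training set, though picking that threshold is a judgment call.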

I've had the subtitles turned on for about a year now and it wouldn't take more than 2 hours of watching broadcast TV with subtitles to realize this isn't a good solution. I've noticed the following.

1. Audio track is censored, subtitles are not, or vice versa.
2. Actors improvise the audio, the subtitles are based on the script.
3. English translations were done by the cheapest person possible, so there are lots of partial words because they weren't clear and the transcriber didn't understand the context.
4. A recent show (2011) seemed to have a symbol every other character. I'm not sure if this is a double-byte character issue or just a bad translation.
5. Several shows such as American Idol and America's Got Talent display song lyrics, and I'm not sure, but I would think singing would require changes to the algorithm.

I wish you well with the idea, but now you have a little more information.

And they are translations not speech-2-text.

Movies are also subtitled in the same language for the hearing impaired.

How about music lyrics?

For training speech recognition? Songs strike me as poor training data.

I wasn't arguing against movies, just that subtitles rather than a final script isn't the best data source.

Well, not exactly for speech recognition when it comes to entering text into a document, but as spoken word to subtitle. I think there was a TED video that showed this somewhere.

There are two different but related fields: Speech Recognition and Natural Language Understanding. Speech recognition is easier if the scope is minimised, that is, if the system knows which subset of keywords or commands to recognize. But recognizing an open scope, including different accents, slang, etc., is a much more difficult task.

I mean there must be millions of times where a character has said 'hello' in a film or tv episode. Each person may have a slightly different way of saying it which can then be used to make a model for speech recognition software which may no longer require the user to train the software.

It may also be possible to automate the entire process as we have both the audio and the words spoken at a particular time.

Take it a step further: we have millions of sung songs with lyrics that can also be used. It's a gold mine of information that can be repurposed.

Very interesting. In a similar vein, automated language translation could be assisted too, because a film DVD often has different audio and subtitle languages, so it would be possible to pair up semantically similar audio and written content... and put it all into a magic computer

For the purposes of speech recognition, songs strike me as being particularly noisy.

That's a plus though, on the testing front. Once it works with random sections from songs that were not part of the training set that would be a significant improvement over what we have today.

The problem would be the disproportionate weights given to the words 'I', 'love', 'you', 'baby'. Songs are probably not the best training data when it comes to getting a well rounded vocabulary.
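
A quick way to see the vocabulary problem is an out-of-vocabulary check: what fraction of the words in ordinary speech never appear in the lyrics used for training. The function name and both toy samples are invented for illustration:

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never appear in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    return sum(1 for t in test_tokens if t not in vocab) / len(test_tokens)

# Toy data: lyrics used for training, an everyday utterance used for testing
lyrics = "i love you baby i love you".split()
utterance = "i would love to book a flight".split()
print(oov_rate(lyrics, utterance))
```

A high rate on everyday utterances would suggest that lyrics alone do not cover the words a general recognizer needs.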

True, we may think about a cappella recordings then.

A similar idea: "Learning sign language by watching TV (using weakly aligned subtitles)" from CVPR 2009.

For films and music the audio data may have too much noise, but TV programmes with low background noise (news, documentary, interview) with available Closed Captions (CC) are good training sources. CC transcripts are enforced by broadcasting regulators so they should be highly accurate.

The big problem with using these sources is the huge vocabulary. Speech recognition works better for smaller vocabularies than bigger.

http://voxforge.com has been collecting a large speech corpus over the last few years, under the GPL license. That should be the way to go, imho.

Training a speech recognition engine is quite a sophisticated process, and usually requires at least a clean (not noisy) set of samples, which you can't find in dubbed movies and surely not in music.

Google had a service (6 or more years ago) that would search TV shows (that they were recording themselves) and provide back the transcripts and thumbnail images for any matches. I suspect this was used as training data.
