RTF as in real time factor - how long does the processing take compared to the l...

hiddencost · on Sept 15, 2016

Yup! Good production systems shoot for RTF ~ 1.0. This means that they can usually answer almost as soon as the speech has ended, because recognition is streaming.
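
To make the ratio concrete, here's a minimal sketch of the arithmetic. The function name and the timings are made up for illustration, not taken from any real system:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers: 12 s of compute for a 10 s utterance.
rtf = real_time_factor(processing_seconds=12.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
```

An RTF at or below 1.0 means the recognizer keeps up with incoming audio; above 1.0 it falls further behind the longer the speaker talks.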

And it's _really easy_ to increase accuracy by taking more time: building bigger DNN acoustic models; exploring a larger search space of hypotheses; using a slower language model (like an RNN) to rescore hypotheses; considering more possible pronunciations; etc.
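
As a rough illustration of the rescoring item in that list, here's a toy sketch of n-best rescoring. Everything in it is an assumption for illustration: the function names, the weight, the scores, and the lookup table standing in for a slow language model. A real system would call an actual LM, and that extra pass is where the time goes.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, first-pass log-score; higher is better)


def rescore(nbest: List[Hypothesis],
            lm_score: Callable[[str], float],
            lm_weight: float = 0.5) -> Hypothesis:
    """Add a weighted second-pass LM score to each hypothesis, return the best."""
    rescored = [(text, score + lm_weight * lm_score(text)) for text, score in nbest]
    return max(rescored, key=lambda pair: pair[1])


# Stand-in for an expensive LM: a hypothetical table of log-probabilities.
TOY_LM = {"recognize speech": -1.0, "wreck a nice beach": -3.0}


def toy_lm_score(text: str) -> float:
    # Unknown strings get a low score in this toy setup.
    return TOY_LM.get(text, -10.0)


# The first pass slightly prefers the wrong hypothesis.
nbest = [("recognize speech", -4.0), ("wreck a nice beach", -3.8)]
best_text, best_score = rescore(nbest, toy_lm_score)
print(best_text)
```

Each hypothesis costs one more LM call, so a longer n-best list or a slower LM pushes RTF up in exchange for the accuracy gain.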

(ML is usually a space / time / accuracy trade-off, so if you get phat accuracy gains at the cost of a significant slowdown, I'm usually unimpressed. The DeepMind TTS paper _was_ impressive because it went beyond the best we can do, so even though it took 90 minutes to generate 1 second of speech, it's cool because it shows where we can go. TBH all of these Switchboard papers don't do a ton of new stuff, they just get more aggressive about system combination and tuning hyperparameters.)