
Nvidia Accelerates Real Time Speech to Text Transcription 3500x with Kaldi - dsr12
https://devblogs.nvidia.com/nvidia-accelerates-speech-text-transcription-3500x-kaldi/
======
gok
They took a very simple speech recognizer that was running at ~380x real time
on a CPU and got it to run at ~3500x on a GPU. I would call that a ~9x
acceleration. In terms of perf/watt it was more like 5x.
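
The ~9x figure is just the ratio of the two real-time multiples. A minimal sketch in Python, using only the rounded numbers quoted above (the variable names are mine, not from the Nvidia post):

```python
# Rounded throughput figures quoted in this comment.
cpu_x_realtime = 380    # CPU baseline: ~380x faster than real time
gpu_x_realtime = 3500   # GPU result: ~3500x faster than real time

# The GPU-over-CPU speedup is the ratio of the two throughputs.
speedup = gpu_x_realtime / cpu_x_realtime
print(f"GPU vs CPU speedup: ~{speedup:.1f}x")  # prints "GPU vs CPU speedup: ~9.2x"
```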

~~~
dheera
Agreed, super bad title.

As an engineer I'm always impressed by a 9X speed-up from state-of-the-art,
but I'm guessing that "9X" would make tech-illiterate shareholders unhappy and
cause them to short the stock?

Would love a Chrome plugin that "fixes" numbers like this on pages based on
crowd-sourced human understanding. People flag incorrect numbers they see on
websites; new visitors start to see corrected numbers after enough people flag
it with the corrected number.

The same would be useful for e.g. scooters advertising 30 miles range
(s/30/10/), laptop battery life of 10 hours (s/10/5/), hardware companies
announcing "availability of" a certain product when it's actually not yet for
sale (s/availability/unavailability/), etc.
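
Roughly, the flagging logic could look like the sketch below. This is only an illustration of the idea, written in Python rather than as a real browser extension; the function names, the threshold, and the sample flags are all made up:

```python
from collections import Counter

# Hypothetical threshold: how many people must agree before a correction is shown.
MIN_FLAGS = 3


def consensus_correction(flags, min_flags=MIN_FLAGS):
    """Return the most-flagged correction, or None if too few people agree."""
    if not flags:
        return None
    value, count = Counter(flags).most_common(1)[0]
    return value if count >= min_flags else None


def apply_correction(text, original, flags):
    """Replace `original` in `text` with the crowd's correction, if one exists."""
    corrected = consensus_correction(flags)
    return text if corrected is None else text.replace(original, corrected)


# Made-up flags for the scooter example (s/30/10/).
flags = ["10 miles", "10 miles", "12 miles", "10 miles"]
print(apply_correction("Range: 30 miles", "30 miles", flags))  # Range: 10 miles
```

A simple majority with a minimum count is the crudest possible rule; a real system would presumably need some protection against brigading.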

~~~
ericd
Yeah, I've been wanting a similar Chrome extension, actually. HN comments
serve as a very valuable filter for information that I learn from, and I think
it'd be great if the rest of the web benefitted from that kind of
proofreading. Little annotations as a layer on top of the web.

Managing the community would be a bit tough, but if you seeded it with the HN
community, and managed to keep the same self-regulating culture, I think that
would be a strong base to start from.

------
melling
There’s an open source tool that lets developers program by voice using Kaldi.

https://www.reddit.com/r/speechrecognition/comments/5p3uxb/programming_using_speech_recognition/

------
wjruoxue
Using a GPU to run FST decoding is nothing new. The major challenge is how
to run it efficiently when the model cannot fit on one GPU. The reason Kaldi
still uses the CPU for decoding is that, on a beefy CPU machine, many decoders
can run in parallel with just one copy of the model in memory. With enough
CPUs, the RTF (real-time factor) isn't so bad.
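
To make that concrete: RTF is decoding time divided by audio duration, so lower is better, and "Nx real time" is 1/RTF. A rough sketch in Python of how parallel CPU decoders add up, assuming they scale perfectly and share one read-only model; the function names and the numbers are hypothetical, not measurements:

```python
def rtf(processing_seconds, audio_seconds):
    """Real-time factor: seconds of compute per second of audio (lower is better)."""
    return processing_seconds / audio_seconds


def aggregate_x_realtime(per_decoder_rtf, num_decoders):
    """Seconds of audio decoded per wall-clock second across all decoders.

    Assumes perfectly independent decoders sharing one copy of the model,
    which ignores memory-bandwidth and cache contention.
    """
    return num_decoders / per_decoder_rtf


# Hypothetical setup: each decoder needs 1 s to decode 20 s of audio (RTF 0.05),
# and 16 decoders run in parallel on one machine.
per_decoder = rtf(1.0, 20.0)
print(f"per-decoder RTF: {per_decoder}")
print(f"aggregate throughput: ~{aggregate_x_realtime(per_decoder, 16):.0f}x real time")
```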

------
melling
“This means 24 hours worth of human speech can be transcribed in 25 seconds.”

With this kind of performance, can’t a large body of video or audio (e.g.
podcasts) be transcribed and then manually corrected to improve the model?
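
As a sanity check, the quoted "24 hours in 25 seconds" lines up with the ~3500x headline. A quick sketch in Python (variable names are mine):

```python
audio_seconds = 24 * 60 * 60   # 24 hours of speech, in seconds
wall_seconds = 25              # quoted transcription time

# How many seconds of audio are processed per second of wall-clock time.
x_realtime = audio_seconds / wall_seconds
print(f"~{x_realtime:.0f}x real time")  # prints "~3456x real time"
```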

~~~
intopieces
A little bit, but podcasts and other instances of performed speech represent a
narrow sample of what human speech sounds like in the real world. Eventually,
there is not much to gain from it.

------
m3kw9
Now imagine this with an ASIC or FPGA. They are usually 20 to 70x faster.

