
Getting Deep Speech to Work in Mandarin - kornish
http://svail.github.io/mandarin/
======
weinzierl
The amazing part is that their system seems to be adaptable to any language
with a minimum of human effort.

> One of the reasons deep learning has been so valuable is that it has
> converted researcher time spent on hand engineering features to computer
> time spent on training networks.

[...]

> We can now train a model on 10,000 hours of speech in around 100 hours on a
> single 8 GPU node. That much data seems to be sufficient to push the state
> of the art on other languages. There are currently about 13 languages with
> more than one hundred million speakers. Therefore we could produce a near
> state-of-the-art speech recognition system for every language with greater
> than one hundred million users in about 60 days on a single node.
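
A quick back-of-the-envelope check of that 60-day figure (a sketch, assuming
the 13 models are trained one after another on the same 8-GPU node):

    hours_per_model = 100      # wall-clock training time quoted above
    num_languages = 13         # languages with >100M speakers
    total_hours = hours_per_model * num_languages   # 1,300 hours
    print(total_hours / 24)    # ~54 days, close to "about 60 days"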

~~~
nshm
The effort is definitely not minimal. You have to collect 10,000 transcribed
hours first, which is not easy in many languages.

------
larakerns
I'm surprised it doesn't use Character-Aware Neural Language Models (CNN ->
LSTM RNN) but instead a layered RNN. Interesting!
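
For reference, here's a minimal sketch of that character-aware idea, a CNN
over character embeddings feeding an LSTM (PyTorch is my choice here, not the
paper's; every name below is illustrative):

    import torch
    import torch.nn as nn

    class CharAwareLM(nn.Module):
        def __init__(self, n_chars=128, char_dim=16, n_filters=64, hidden=256):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim)
            # convolve over each word's characters to build word features
            self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(n_filters, hidden, batch_first=True)

        def forward(self, chars):  # chars: (batch, words, chars_per_word) ints
            b, w, c = chars.shape
            x = self.char_emb(chars.view(b * w, c))   # (b*w, c, char_dim)
            x = self.conv(x.transpose(1, 2))          # (b*w, n_filters, c)
            x = x.max(dim=2).values.view(b, w, -1)    # max-pool chars -> word vecs
            out, _ = self.lstm(x)                     # (b, w, hidden)
            return out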

------
EliRivers
Facebook disallows some images, based on the personal standards of whoever
happens to be in charge of image disallowing that day. Google controls what
you see based on your own past, limiting your exposure to opinions you might
not like. Companies comply with oppressive government requests for control and
surveillance.

If we surrender our ability to communicate with people speaking in foreign
languages in this fashion, we will literally become unable to talk about
things that we "shouldn't", and everything we do talk about will be on
permanent record and monitored in real-time for dissent and to target adverts
at us.

------
romaniv
I keep reading about these algorithms that are "better than humans". Perfect
image recognition, perfect speech recognition, parsing plain-text queries and
answering questions, etc., etc. So where are the practical implementations?

All the speech recognition engines I've interacted with so far were awful. Not
just bad, awful.

> Collecting such data sets could be very difficult and prohibitively
> expensive.

Uh, movie subtitles?

~~~
amake
> Uh, movie subtitles?

Movie subtitles are very rarely actual transcriptions of what is spoken; they
are instead summaries, edited for brevity and quick comprehension.

I don't know much about what kind of corpus is required for training this
kind of model, but subtitles don't seem appropriate.
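
One concrete reason for skepticism: Deep Speech-style models are trained with
CTC loss, which aligns the audio frames to an exact character sequence, so a
paraphrased subtitle would supervise the wrong labels. A toy sketch in PyTorch
(shapes and indices made up):

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)
    T, N, C = 50, 1, 28                 # frames, batch, alphabet (+ blank)
    log_probs = torch.randn(T, N, C).log_softmax(2)
    targets = torch.tensor([[8, 5, 12, 12, 15]])   # stand-in for "hello"
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((N,), T, dtype=torch.long),
               target_lengths=torch.tensor([5]))
    print(loss.item())   # a paraphrase would change the target entirely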

~~~
taneq
Is that necessarily a bad thing? Maybe if you're transcribing for dictation,
but for many voice recognition uses you would be happy with text that conveys
the semantic content of the speech.

~~~
amake
This is really not my field, but my suspicion is It Just Doesn't Work Like
That.

~~~
taneq
Not my field either, but I wouldn't have thought that mere word proximity
would lead to something as interesting as word embeddings, and yet they seem
to be a general-purpose "input token -> semantic meaning" mapping, so who
knows? (Seriously, if someone does know, tell!)
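
A toy illustration of proximity-only training, using gensim's word2vec
(gensim assumed installed; corpus and parameters invented):

    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]
    # Skip-gram predicts neighbors from each word; proximity is the only
    # signal, yet words appearing in similar contexts get similar vectors.
    model = Word2Vec(sentences, vector_size=16, window=2, min_count=1,
                     sg=1, epochs=200, seed=0)
    print(model.wv.most_similar("cat", topn=2))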

