
Reaching new records in speech recognition - igravious
https://www.ibm.com/blogs/watson/2017/03/reaching-new-records-in-speech-recognition/
======
nfriedly
I'm responsible for the Watson Speech JS SDK, which is aimed at making the
speech services easy to use in web apps.

Code is at [https://github.com/watson-developer-cloud/speech-javascript-sdk](https://github.com/watson-developer-cloud/speech-javascript-sdk)

Simple demos at [http://watson-speech.mybluemix.net/](http://watson-speech.mybluemix.net/)

More complex demo at [https://speech-to-text-demo.mybluemix.net/](https://speech-to-text-demo.mybluemix.net/)

I'm going to be out some this evening, but feel free to ask me questions and
I'll answer them as I'm available.
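For a sense of what "easy to use" means here, a minimal browser sketch along the lines of the SDK's README. The `/api/token` endpoint is my invention (you'd generate the auth token server-side so you don't ship credentials to the browser); `recognizeMicrophone` and its `token`/`outputElement` options are from the SDK, but check the repo for the current API:

```javascript
// Browser sketch: assumes the SDK's browser bundle exposes a global
// `WatsonSpeech`, and a hypothetical /api/token endpoint that returns
// a Watson auth token generated server-side.
fetch('/api/token')
  .then(function (response) { return response.text(); })
  .then(function (token) {
    var stream = WatsonSpeech.SpeechToText.recognizeMicrophone({
      token: token,
      outputElement: '#transcript' // SDK writes interim + final text here
    });
    stream.on('error', function (err) { console.error(err); });
    // Later: stream.stop() to end recognition.
  });
```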

~~~
sslalready
What's up with the lack of https and encryption in IBM's cloud? I'm sure your
machine learning stuff is great and all, but the lack of proper encryption and
security measures makes it a total no-go. How can you expect anybody to
take your cloud services seriously when you do http in 2017? I'm asking as
someone whose boss was talked into trying out ML stuff on Bluemix by your
sales people.

~~~
nfriedly
Can you give me an example?

[http://stream.watsonplatform.net/](http://stream.watsonplatform.net/) (the
domain that the speech APIs use) redirects to https. Ditto for
[http://gateway.watsonplatform.net/](http://gateway.watsonplatform.net/) which
is what most of the other APIs use.

Both of the linked demos also redirect to https if you try an http URL.

~~~
sslalready
Yes, most of your links do (like many of IBM's cloud-related web services).
If anything, you, IBM, and any potential customers should be seriously
concerned that they begin with [http://](http://) in the first place.

------
taf2
We have been a pretty large user of this feature within Watson for the last 6
months... while it is pretty good, it lacks the ability to take external
inputs such as stereo recordings with channel markers. I've been working on
migrating our solution to VoiceBase, which in my opinion has a much more
robust solution compared to IBM with respect to speaker diarization
specifically, because they include a feature to do channel markers. The
result is a conversational transcription that is much easier to read. Prior
to this we used the LIUM project to attempt diarization on a single-channel
recording, with mixed results. Without a doubt, speech to text has rapidly
improved in the last 12 months.

~~~
nshm
Why migrate to another service in 2017, when open-source toolkits like Kaldi
give you better results, more features, and no vendor lock-in?

~~~
taf2
Cool - well we are hiring, so if you'd like to do this reach out. We have lots
of neat projects like this going on all the time.

------
imh
I wish they clarified whether the claim that humans have a 5.1% error rate
means "listen to this sentence once and transcribe it" or "study this
recording however you like and transcribe it."

edit: They talk about this in the arxiv paper:

>The transcription protocol that was agreed upon was to have three independent
transcribers provide transcripts which were quality checked by a fourth senior
transcriber. All four transcribers are native US English speakers and were
selected based on the quality of their work on past transcription projects.

>...The transcription time was estimated at 12-14 times realtime (xRT) for the
first pass for Transcribers 1-3 and an additional 1.7-2xRT for the second
quality checking pass (by Transcriber 4). Both passes involved listening to
the audio multiple times: around 3-4 times for the first pass and 1-2 times
for the second.
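Put together, the quoted figures work out to roughly 14-16 hours of human effort per hour of audio:

```javascript
// Rough total human effort per hour of audio, from the xRT figures
// quoted above (first pass by Transcribers 1-3, second pass by
// Transcriber 4).
const firstPass = [12, 14];   // xRT range for the first pass
const secondPass = [1.7, 2];  // xRT range for the quality-checking pass
const total = [firstPass[0] + secondPass[0], firstPass[1] + secondPass[1]];
console.log(total); // roughly 13.7 to 16 hours of work per hour of audio
```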

~~~
woodson
For anyone wondering what the recordings in the HUB5 2000 eval data (the test
data) sound like:
[https://catalog.ldc.upenn.edu/desc/addenda/LDC2002S09.wav](https://catalog.ldc.upenn.edu/desc/addenda/LDC2002S09.wav)

~~~
joshgel
God. Transcribing that would be mind-numbing. Glad computers are getting
better at this.

------
glenngillen
I spent a hack weekend recently using this API as part of a project to help
with collaborating on usability testing reviews.

Upload a video, it strips out the audio, pushes to Watson for transcription,
converts the result to a caption/subtitle track, and then allows people to
comment on the discussion (like Google Docs).

I plan to polish it up a little and open source it soon.

~~~
nfriedly
Yes, please publish this!

Additionally, if you're using WebVTT-format subtitles, I'd be interested in
merging that code into the appropriate SDK. They're all on GitHub if you'd
like to send us a PR: [https://github.com/watson-developer-cloud/](https://github.com/watson-developer-cloud/)
(I'm on the team that is responsible for these.)
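If it helps, here's a rough sketch of what such a conversion might look like, assuming each final result from the speech service carries a `timestamps` array of `[word, startSeconds, endSeconds]` triples (the `toVTT` and `vttTime` names are mine, not the SDK's):

```javascript
// Format seconds as a WebVTT timestamp, e.g. 1.5 -> "00:00:01.500".
function vttTime(seconds) {
  const h = String(Math.floor(seconds / 3600)).padStart(2, '0');
  const m = String(Math.floor((seconds % 3600) / 60)).padStart(2, '0');
  const s = (seconds % 60).toFixed(3).padStart(6, '0');
  return `${h}:${m}:${s}`;
}

// Convert an array of results (each with per-word [word, start, end]
// timestamps, as assumed above) into a WebVTT caption track.
function toVTT(results) {
  const cues = results.map((r, i) => {
    const start = r.timestamps[0][1];
    const end = r.timestamps[r.timestamps.length - 1][2];
    const text = r.timestamps.map(t => t[0]).join(' ');
    return `${i + 1}\n${vttTime(start)} --> ${vttTime(end)}\n${text}`;
  });
  return 'WEBVTT\n\n' + cues.join('\n\n') + '\n';
}
```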

------
earthtolazlo
It would be great to see one of these frameworks offered as an offline
solution. Right now the only options are WSR/Sphinx4, which have significant
accuracy issues, and Nuance's products, which are extremely unfriendly to
developers.

~~~
braindead_in
Mozilla's DeepSpeech has a native client
[https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)

------
ghaff
It's great to see the improvements in this area. The voice recognition (if not
the natural language processing) in systems like Amazon's Echo is pretty
decent at this point for basic commands.

That said, I've tried computer speech-to-text systems for transcribing
interviews, and even with just one person talking they're nowhere near good
enough for me to use. Even budget human transcription (e.g. CastingWords) is
just so much better that it's not worth my time to use a machine-based system.

------
deepnotderp
Tl;dr: they stapled a 6-layer bidirectional LSTM to WaveNet. Good to see IBM
admit that deep learning is better than "Watson".

Also, with WaveNet in the mix, there's no way this is used in production.

~~~
rp36
Also, these are cost-prohibitive to use widely in apps or without limitations.
Currently, Google Speech to Text is relatively cheaper, at $1 for 166 messages.
Source:
[https://cloud.google.com/speech/pricing](https://cloud.google.com/speech/pricing)
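The 166-messages figure presumably comes from Google's then-current rate of $0.006 per 15-second increment (the rate on the linked pricing page at the time), assuming one increment per short message:

```javascript
// Sanity check of the "$1 for 166 messages" figure, assuming $0.006 per
// 15-second billing increment and one increment per short message.
const ratePerIncrement = 0.006; // USD per 15 seconds
const messagesPerDollar = Math.floor(1 / ratePerIncrement);
console.log(messagesPerDollar); // 166
```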

~~~
rhizome
Interesting. I worked for a company that did retail speech recognition in the
late 90s; I should look up how much that cost in comparison, to see how the
economics are shaking out.

~~~
lsseckman
We'd be interested in that cost comparison!

~~~
rhizome
Found something:

* $29.95 to register

* $9.95 per month subscription which entitles you to $14.00 worth of free transcription per month

* $3.50 per page (double-spaced, 225 words) for any pages in excess of your $14 allocation

$14 is four 225-word pages, so roughly $0.011 per word for the first 900
words (counting the $9.95 subscription), then about $0.016 per word beyond
that. Ish.
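Spelling out the arithmetic behind the price list above (the per-word rates land in the cents-per-word range, roughly 1.1¢ included and 1.6¢ for overage):

```javascript
// Per-word cost from the late-90s price list quoted above.
const monthlyFee = 9.95;    // USD/month subscription
const freeAllowance = 14.0; // USD of transcription included per month
const pricePerPage = 3.5;   // USD per page beyond the allowance
const wordsPerPage = 225;   // double-spaced page

const freeWords = (freeAllowance / pricePerPage) * wordsPerPage; // 900 words
const includedRate = monthlyFee / freeWords;     // ~ $0.011 per word
const overageRate = pricePerPage / wordsPerPage; // ~ $0.016 per word
console.log(freeWords, includedRate, overageRate);
```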

~~~
lsseckman
Ahh, interesting. I like the subscription model from the perspective of the
business. Seems like it would deter low-volume users from signing up,
though, right?

