Speech-to-Text Benchmark: Framework for benchmarking speech-to-text engines (github.com)
195 points by kenarsa on Aug 7, 2018 | 64 comments

I'm surprised no one has questioned that this benchmark is published by the creators of Cheetah. The conflict of interest is extreme.

Is anyone enough of a STT expert to weigh in? Why should I trust this?

I agree with your point. That is why we open sourced the benchmark so you can verify it yourself. You just need an Ubuntu machine with Python 3.6. I appreciate the question.

It's the methodology, not the results or the code, that I'm suspicious of. For a highly variable task like STT, I'm sure an expert could contrive a test that gives far better results for either of the other programs tested.

That's why it would help to know why this test is comprehensive and representative enough to be considered unbiased or otherwise where its biases are. I don't have that expertise myself.

I think the biases are pretty obvious, but the most serious shortcoming of this benchmark is that their result (30% WER on CV) is not reproducible: it's not clear what they trained their model on, and the model itself is not available, so you just have to take their word for it.
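
For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between the reference transcript and the engine's output, divided by the number of reference words. A minimal sketch, assuming lowercase normalization and whitespace tokenization (the benchmark's own normalization may differ); the function name `word_error_rate` is illustrative:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word against six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 100% when the output contains many inserted words, so a single number says little without knowing the test set and the text normalization used.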

Thanks for the comment. Just wanted to quickly clarify that the model is available here: https://github.com/Picovoice/stt-benchmark/tree/master/resou...

No, "reproducible" means that I can train it on the same data you used, and get the claimed result.

Anything other than that is taking your word for it.

As someone who has no idea about ML, why is knowing how it was trained important?

I would guess to anticipate possible problems, but I don't really know.

To make sure it wasn't trained on the test set, even if by accident.

Kinda like creating your acceptance test using the same scripts as your unit test... the result would look good, but in fact be less than reliable.

More than kind of like; you nailed it: exactly like. This has been an infamous issue with machine learning for decades, where unwary researchers/developers can do this quite accidentally if they're not careful.

The thing is that training data is very often hard to come by due to monetary or other cost, so it's extremely tempting to share some data between training and testing -- yet it's a cardinal sin for the reason you said.

Historically there have been a number of cases where the best machine learning packages (OCR, speech recognition, and more) have been best because of the quality of their data (including separating training from test) more than because of the underlying algorithms themselves.

Anyway it's important to note that developers can and do fall into this trap naively, not only when they're trying to cheat -- they can have good intentions and still get it wrong.

Therefore discussions of methodology, as above, are pretty much always in order, and not inherently some kind of challenge to the honesty and honor of the devs.
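
One cheap guard against this kind of accidental leakage is to check that no test clip also appears in the training set. A minimal sketch, assuming that byte-identical audio is what we want to catch (re-encoded copies, or the same speaker appearing in both sets, would need a different check); `find_leaked_clips` is a hypothetical helper name:

```python
import hashlib


def fingerprint(audio_bytes: bytes) -> str:
    """Content hash of a clip's raw bytes."""
    return hashlib.sha256(audio_bytes).hexdigest()


def find_leaked_clips(train_clips, test_clips):
    """Return names of test clips whose audio also appears in the training set.

    Both arguments map a clip name to its raw audio bytes.
    """
    train_fingerprints = {fingerprint(audio) for audio in train_clips.values()}
    return sorted(
        name for name, audio in test_clips.items()
        if fingerprint(audio) in train_fingerprints
    )


# Toy byte strings stand in for real audio files.
train = {"train_001.wav": b"\x01\x02\x03", "train_002.wav": b"\x04\x05\x06"}
test = {"test_001.wav": b"\x04\x05\x06", "test_002.wav": b"\x07\x08\x09"}
print(find_leaked_clips(train, test))
```

A check like this only catches exact duplicates; it is a sanity test, not proof that the split is clean.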

Is there more information on Cheetah anywhere? I was looking at picovoice.ai but it only mentions PORCUPiNE.

Thanks for the comment. Not at the moment, but we are going to provide more information on picovoice.ai in the coming days.

It isn’t like they are trying to hide this. They made it abundantly clear in the writeup that they made Cheetah.

Who said they're trying to hide it? I didn't. When I wrote my comment, there were 50 upvotes and 0 comments, which suggested to me that most people were ignoring how meaningless a benchmark usually is when it's intended to promote a product.

It is open source: it is open for review, forks, and improvements.

Looks like Kaldi gets 4.27% WER on CV: https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoi...

But that's likely trained exclusively on CV audio and using a language model derived from the CV training data so that's also not a fair comparison unless the other engines were trained the same way.

Comparing systems trained on different datasets (with different language models) like this is like comparing apples to oranges. Mixing wildly different CPU and memory requirements into the benchmarks just makes it worse.

It should be fairly straightforward to do an unbiased comparison by training and evaluating with the standard Librispeech split and language model. It might be interesting to see how accuracy improves as the models scale up until they match the resource requirements of the other engines.

That said, the speed and memory usage are impressive and I like the focus on very low resource environments. Seems like it has a lot of potential even if it may not be SOTA.

Thanks for the link. I'll be sure to look into it. It is impressive.

A disclaimer is that we use the valid train portion of CV as part of our training set. But it is less than 10% of the train set (in terms of hours). Also, we do not employ an LM mainly because the systems we are targeting do not have enough storage for a strong LM (usually the storage on them maxes out at 64 MB). Cheetah is an end-to-end acoustic model. For later versions, we might be able to add a well-pruned LM for specific domains with limited vocabulary to boost the accuracy with limited storage available.

I fully agree with your points. I am taking notes here as I think we should follow up on a couple of your suggestions. Scaling up to DeepSpeech model size can be a bit tricky as it would require much more compute resources (GPU). But should be quite doable with time and budget.

Thanks again for your comments and suggestions. As you correctly pointed out our main focus is the very low resource (CPU/Memory) embedded systems.

It would be great to have a service that let you personally audition a bunch of speech recognition engines to measure and compare how well they understand YOUR voice.

Yeah, but it'd be like getting a blood test and trying to interpret it yourself :).

There are so, so many variables - and many engines could be optimized for 'your voice and the things you're going to say right now given cpu/memory, quality of microphone, background noise etc..'

The game of NLP is inherently about dealing with 'noisy channels' (in the academic sense) in which there is kind of a probabilistic guarantee of imperfection. So then it comes down to creating the best products in a given context, which is almost always 'less than optimized' for any individual.

So there's the model size, cpu/ram, quality of signal (microphone, network) just to start.

Optimizing for standard English probably means reducing quality for people with accents. Maybe in a specific context you could go from 80% accuracy to 85% accuracy for 'most of us', but then you go from 60% to 40% for anyone with an accent.
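
Taking those illustrative numbers at face value, whether the trade-off helps on average depends on how many users have an accent. A worked sketch, where the share of accented speakers is an assumption not given in the comment:

```python
def overall_accuracy(standard_acc: float, accented_acc: float, accented_share: float) -> float:
    """Population-weighted accuracy for a mix of standard and accented speakers."""
    return (1 - accented_share) * standard_acc + accented_share * accented_acc


for share in (0.1, 0.2, 0.3):
    before = overall_accuracy(0.80, 0.60, share)
    after = overall_accuracy(0.85, 0.40, share)
    print(f"accented share {share:.0%}: before {before:.3f}, after {after:.3f}")
```

With these numbers the two configurations break even when 20% of speakers have an accent (a 5-point gain for 80% of users offsets a 20-point loss for the rest); above that share, the "optimized" system is worse on average.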

And if you reduce the accepted vocabulary, you can get way better results. Of course, we all might want to say words like 'obvolute' and 'abalienate' every so often.

Kind of thing.

It's really fun from an R&D perspective, but it's a product manager's nightmare. Consumer expectations with these technologies are really challenging because of the inherent ambiguity in the system; people kind of want perfection. And there are always corner cases where it would seem like things should be easy, but they're not, because the word you're saying is common and you're saying it 'perfectly clearly' ... but little do you know there are 2 or 3 other very rare words that sound 'just like that', ergo ... problems.

From a product perspective, it basically always feels 'broken' which is such a terrible feeling :).

But it can be fun if you like really hard product challenges which have less to do with tech and more to do with pure user experience, expectations, behaviours etc.

I think this is a great summary of where we are at now, but I think stronger broad-coverage language models (=expectations for what people say, better generative models of speakers) are feasible (for a couple 10s-100s of millions in R&D) to bring ASR up to parity with people. It's pretty clear we are getting close to the limits of what acoustics can offer, and it's the language model that is the next frontier both in terms of accuracy and real-time performance.

Agreed. We are getting close to a nice reality as all of the pieces are getting better.

For speakers of non-standard English, a quick test might be a useful sanity check. Many speech-to-text algorithms fail catastrophically with certain accents.

The factors in the data aren't just your voice but recording environment and microphone, and a recognition "engine" is really a combination of model architecture (sometimes separate acoustic and language models), training data, hyperparameters, and preprocessing. So the reason you don't see these is that there is an enormous hypothesis space, with only small regions that are performant.

Can this be used to transcribe voice data in real time?

I am building a docker image which will eventually accept in-browser audio via WebSockets outputting transcription in real time without needing Google WebSpeech.


I planned to use DeepSpeech, but this looks promising given its low resource requirements.

Could you tell me some more about this? I have used the web speech recognition API in Chrome (coming soon to Firefox also) to create a free project for real-time, editable, transcriptions in the browser that could be projected on a screen or subscribed to on a person's own device.

I am frustrated by:

- lack of cross-browser support on whatever device is generating the transcript. It will work from any Android phone but not from iOS

- No offline functionality, because the work is actually not done locally

- Poor accuracy. We can correct this on the fly with the live editing, but it feels like it is not using the latest and greatest Google has to offer in terms of STT

I don't know much about Docker or how to "use" a Docker image. Not asking you to teach me, but do you think your project would be useful for what I'm talking about? Currently using Firebase for the backend but it's really just exploratory at this point.

My goal is to create a Docker image with all necessary dependencies, deployable with a WebSocket endpoint exposed, similar to the Watson S2T demo (https://speech-to-text-demo.mybluemix.net), which uses WebSockets.
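
The server side of such an endpoint mostly has to turn arbitrarily sized binary WebSocket messages into the fixed-size audio frames a streaming engine expects. A minimal sketch of that buffering step, using only the standard library and leaving out the WebSocket layer itself; the frame size (512 samples of 16-bit PCM) and the `FrameBuffer` name are assumptions, not taken from any particular engine:

```python
class FrameBuffer:
    """Accumulates raw PCM bytes and yields complete fixed-size frames."""

    def __init__(self, frame_samples: int = 512, sample_width: int = 2):
        self.frame_bytes = frame_samples * sample_width
        self._pending = bytearray()

    def feed(self, message: bytes):
        """Add one incoming message; yield every complete frame now available."""
        self._pending.extend(message)
        while len(self._pending) >= self.frame_bytes:
            frame = bytes(self._pending[: self.frame_bytes])
            del self._pending[: self.frame_bytes]
            yield frame

    @property
    def pending_bytes(self) -> int:
        """Number of buffered bytes that do not yet form a full frame."""
        return len(self._pending)


# Messages of uneven size, as a browser might send them.
buffer = FrameBuffer(frame_samples=4, sample_width=2)  # 8-byte frames for the demo
for message in (b"\x00" * 5, b"\x01" * 12):
    for frame in buffer.feed(message):
        print(len(frame), "byte frame ready for the engine")
print(buffer.pending_bytes, "byte(s) still buffered")
```

In a real server, each yielded frame would be passed to the recognizer and any partial transcript sent back over the same socket. Note that `feed` is a generator, so frames are only consumed from the buffer as the caller iterates.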

This is after I struggled to find anything in usable form on GitHub and Google Cloud pricing is prohibitive for my free projects. Perhaps there has been progress?

> web speech recognition API in Chrome

I looked at this last year (see my comments from 8-12 months back), and it seems either Google or the Chromium dev team removed recognition.speechURI. It was in Chrome but was later removed. It would have been really helpful to be able to switch the provider from the browser's default to an alternative, perhaps a cheaper or free option. I don't know enough about the decision behind this, or whether it actually worked in the first place. To be fair, the whole "SpeechRecognition" API is still in draft. See: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...

The end goal is to easily provide ASR/S2T through WebSockets. I have thought of many uses and possibilities on my own, and I am sure it will be of use to others.

> I don't know much about Docker or how to "use" a Docker image [....] but do you think your project would be useful for what I'm talking about? [....]

Yes. But accuracy is probably something Google is winning on here. If your app is browser-based then, unless my assumption is incorrect, you do not need an API key and therefore are not bound by quota limits or costs.

Thanks for expanding! Yes with the spec being in draft who knows what will change. I'm really interested in seeing what the Firefox implementation turns out to be. Hopefully close enough to Chrome's that everything still works.

Are you on Twitter or is there some other way for me to know about your progress?

I too would like more information about both of your projects. I'm currently designing a digital technology platform for a charity for the blind in the UK and transcription of speech to text is one aspect I'm investigating, in addition to the more obvious text to speech.

Here's a video demo of what we are working on: https://youtu.be/xcUxd9sOkaM

There are a few features not listed (like exporting a correctly-formatted subtitle file) but you'll get the general idea. There's a link to an old demo in the video description.

If Mozilla’s DeepSpeech is getting a 30% WER on this test set with a >2GB model... something is very wrong.

This is a bit of a surprise to me as well. That being said, the dataset is a tough one. I am ESL, but I have a hard time understanding some of the accented examples. Also, the recordings are not near-field and there is sometimes some background noise.

What is the WER of your model on Librispeech?

We used the entire Librispeech dataset as part of our training set.

Until you publish Librispeech WER for your model, it's pointless to discuss any advances in efficiency. I have no clue if your results on CommonVoice are good or bad.

In my experience DeepSpeech with the default Mozilla model + 5-gram is extremely sensitive to background noise. Fiddling with the hyperparameters, e.g. upweighting the language model, helps a little.
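
For context on what upweighting the language model means: decoders of this kind typically rank candidate transcripts by combining an acoustic score with a weighted language-model score. A minimal sketch with made-up log-probabilities; the names `alpha` and `beta` follow a common convention but are assumptions here, not DeepSpeech's actual flags:

```python
def combined_score(acoustic_logp: float, lm_logp: float, num_words: int,
                   alpha: float, beta: float) -> float:
    """Acoustic log-prob plus weighted LM log-prob plus a word-count bonus."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


def best_hypothesis(candidates, alpha: float, beta: float = 0.0) -> str:
    """Pick the best text from (text, acoustic_logp, lm_logp) tuples."""
    return max(
        candidates,
        key=lambda c: combined_score(c[1], c[2], len(c[0].split()), alpha, beta),
    )[0]


# Made-up scores: noise makes the wrong transcript slightly better acoustically,
# while the language model strongly prefers the right one.
candidates = [
    ("wreck a nice beach", -10.0, -14.0),
    ("recognize speech", -11.0, -6.0),
]
print(best_hypothesis(candidates, alpha=0.1))
print(best_hypothesis(candidates, alpha=0.5))
```

With the low weight the acoustically favored candidate wins; with the higher weight the language model's preference decides. Too high a weight has its own cost, since the decoder starts to ignore what was actually said.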

Thanks. Do you use the default parameters and tune from there?

Hi, I'm a co-founder of https://snips.ai and we are building a 100% on-device and private-by-design, open-source VoiceAI platform with our own ASR, which works on Raspberry Pi 3, Linux, iOS, and Android.

We already have a community of more than 14,000 developers on the platform.

You can get your own assistant running on a Raspberry Pi in less than an hour. Here are a few tutorials:

- https://medium.com/snips-ai/voice-controlled-lights-with-a-r...

- https://medium.com/snips-ai/an-introduction-to-snips-nlu-the...

I can't be the only one that sees "Token Sale" and immediately loses all interest.

I wonder how these engines compare to cloud services described in [1].

[1]: https://blog.rebased.pl/2016/12/08/speech-recognition-1.html

Great question. I believe that someone has already performed this measurement on a variety of cloud APIs (Google, Amazon, MS, etc.). I remember seeing it on GitHub a while ago. I can't find it right now with a simple Google search, but I will spend more time later in the evening and comment here when I have it.

The comments are absolutely correct. Cloud services can do better simply because they have access to more compute resources and also more data (what is sent to them can be, and is, used for training later). There are situations where on-device is preferred due to privacy reasons, latency, cost, or lack of internet connection.

Commercial APIs from leading companies generally achieve better performance, but besides obvious price and network latency, they are complete black boxes so you can't diagnose and fix problems.

The cloud options almost certainly have better accuracy and use less memory but it's at the cost of latency, network dependency and usually price.

This basically seems like marketing paraded as research.

If I'm reading this correctly, the resources used by Cheetah would allow your laptop to do stt for around 30 streams?

If you use the CPU that we used for testing, yes. It's an Intel CPU. The details are in the README. You can do 30 streams per core. If you have 2 cores you can do 60.
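
The arithmetic behind those numbers: if the engine needs a fraction of a core-second (the real-time factor) to process one second of audio, one core can keep up with roughly one divided by that factor concurrent streams. A sketch, where the real-time factor of 1/30 is inferred from the "30 streams per core" figure rather than measured, and perfect scaling across cores is assumed:

```python
import math


def max_realtime_streams(real_time_factor: float, cores: int = 1) -> int:
    """Upper bound on concurrent real-time streams, assuming linear scaling."""
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    # The small epsilon guards against floating-point results like 29.999999.
    return math.floor(cores / real_time_factor + 1e-9)


print(max_realtime_streams(1 / 30, cores=1))
print(max_realtime_streams(1 / 30, cores=2))
```

In practice, memory bandwidth, cache contention, and other processes usually keep real systems below this bound.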

With such amazing efficiency, I would suggest forking off a second branch that massively increases complexity for gains in your WER. CPU is just going to keep getting cheaper and faster, and being able to leverage the extra cycles for a platform that has them would allow you to dominate from embedded up to the Xeon space.

That is a good point. Definitely a valid roadmap and something we should consider. Thank you for the suggestion.

I wonder if they actually trained Cheetah on a different or the same dataset they are benchmarking on.

Thanks for the comment. We don't train on the dataset being tested on. I am fairly certain the other engines don't either.

What would you folks use for a much narrower use case more like an IVR system? Like if I could give a very restricted set of inputs to recognize and don't need much if any learning to happen.
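
For a small fixed set of inputs, one simple approach on top of any general-purpose engine is to snap its transcript to the closest allowed phrase and reject anything that is not close enough. A minimal sketch using `difflib` from the standard library; the phrase list and the 0.6 cutoff are illustrative assumptions:

```python
import difflib

# Hypothetical IVR menu; replace with the phrases your system accepts.
ALLOWED_PHRASES = ("check balance", "transfer funds", "speak to an agent")


def match_command(transcript: str, allowed=ALLOWED_PHRASES, cutoff: float = 0.6):
    """Return the closest allowed phrase, or None if nothing is similar enough."""
    cleaned = transcript.lower().strip()
    matches = difflib.get_close_matches(cleaned, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_command("chek balans"))  # a slightly garbled transcript
print(match_command("xyzzy"))        # nothing in the menu resembles this
```

Character-level similarity is a crude stand-in for acoustic similarity, so the cutoff would need tuning on real transcripts; engines that accept a grammar or keyword list handle the restriction inside the decoder instead.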

IBM Watson with https://freeswitch.org/confluence/plugins/servlet/mobile#con...

Or stream to Google Cloud.

Both are used in production successfully already. Latency is on par with what you would experience calling FedEx and talking to their ASR bot.

Thanks for the question. You can use our first product: https://github.com/Picovoice/Porcupine. It is a voice-control (wake-word) engine. It allows you to detect multiple keywords (phrases) in the audio stream in real time with no delay and it fully runs on-device. No cloud connection required.

Just checked it out and tried to play with it; it turns out this is actually a binary-only (closed source) release. I'm on a MIPS platform, so there is no way I can test it there.

Unfortunately, we only offer Ubuntu x86_64 at the moment. We are planning to add more platforms (Mac and Windows). I will make a note of your request and add it to our to-do list. Thank you.


If you won't post according to the guidelines we'll ban the account.


test to see if I'm banned

I'm curious about the background of the team. They don't seem to have any linguistics experts from what I could tell.

If that quote attributed to Fred Jelinek ("Every time I fire a linguist, the performance of the speech recognizer goes up") is right, they should have negative linguists.

> If that quote attributed to Fred Jelinek ("Every time I fire a linguist, the performance of the speech recognizer goes up") is right, they should have negative linguists.

That's a really, really, really stupid quote that pushes the increasingly popular world view that experts only drag the world down on every subject.

Linguists' contribution is unrelated to performance, positive or negative. Performance is the job of computer scientists designing architecture and of skilled programmers doing smart implementations of architecture.

The performance would probably go up if the company fired all presidents/vice presidents/CFOs etc., too (giving implementors free rein, unconstrained by costs and schedules), but the company would probably go under.

Every specialist has their own role to play. Thinking that subject domain expertise is a negative is indefensible.

Just to add my two cents (I work for Mozilla on Common Voice): without help from linguists, Common Voice would have made some very different and very bad decisions about all sorts of things like accents, dialect segmentation, corpus curation, licensing, and many other things. Linguists were absolutely instrumental. We tried to thank some of them at the end of our blog post: https://medium.com/mozilla-open-innovation/more-common-voice...

I think you are misinterpreting the quote. He meant that experts (computational linguists and statisticians) had a firmer grasp of the task of inference in language than linguists, who tend to care more about formal structure (i.e. Chomsky's competence/performance distinction) and are less aware of problems like overfitting. If anything, it's stressing the importance of domain expertise, just suggesting that who those experts are may be nonobvious.

Thanks for explaining. I would then say that a quote that actually means the opposite of what it appears to mean is a stupid quote for other reasons.

Even with the explanation, it still comes across more as blaming the linguists than the people who assigned the wrong people to the wrong tasks.

(Why are they fired rather than reassigned? Why were they ever hired if they're the wrong specialty and aren't needed elsewhere?)

And if you blindly follow the machine learning path with no scientific context you will convince yourself that having negative people on a team is an answer that makes total sense.

I worry about this in a lot of fields where we are heavily pushing machine learning. I was also not saying that this project is somehow poor quality, I just could not find much about the team and was curious about their research backgrounds.

Hello, sorry to interrupt the conversation. I am Alireza. I am the founder of Picovoice (maker of Cheetah).

Thanks for mentioning the quote. I heard it from a few friends who used to work at Nuance, but didn't know where it came from originally.

We have a very small engineering team that deeply understands machine learning and (embedded) software. I invented or co-invented a few US patents in speech processing/recognition prior to Picovoice, but have no academic background in linguistics. Having worked with computational linguists, I can say they are definitely a solid plus to a team when their skill set is utilized correctly. I appreciate your question and curiosity. We hopefully will provide more information about the team on our website soon. Thank you.
