Is anyone enough of an STT expert to weigh in? Why should I trust this?
That's why it would help to know whether this test is comprehensive and representative enough to be considered unbiased, or otherwise where its biases lie. I don't have that expertise myself.
Anything other than that is taking your word for it.
I would guess to anticipate possible problems, but I don't really know.
The thing is that training data is very often hard to come by due to monetary or other cost, so it's extremely tempting to share some data between training and testing -- yet it's a cardinal sin for the reason you said.
Historically there have been a number of cases where the best machine learning packages (OCR, speech recognition, and more) won out because of the quality of their data (including how carefully training was separated from test) more than because of the underlying algorithms themselves.
Anyway it's important to note that developers can and do fall into this trap naively, not only when they're trying to cheat -- they can have good intentions and still get it wrong.
Therefore discussions of methodology, as above, are pretty much always in order, and not inherently some kind of challenge to the honesty and honor of the devs.
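To make that concrete, here is a minimal sketch (my illustration, not anyone's actual pipeline) of the kind of held-out split that avoids the trap. In STT the split usually needs to be speaker-disjoint, not just utterance-disjoint, otherwise the model gets tested on voices it has already heard:

    import random

    def split_by_speaker(utterances, test_fraction=0.1, seed=0):
        # utterances: list of (speaker_id, audio_path, transcript) tuples.
        # Splitting by utterance alone leaks speaker identity into the test set.
        speakers = sorted({spk for spk, _, _ in utterances})
        random.Random(seed).shuffle(speakers)
        n_test = max(1, int(len(speakers) * test_fraction))
        test_speakers = set(speakers[:n_test])
        train = [u for u in utterances if u[0] not in test_speakers]
        test = [u for u in utterances if u[0] in test_speakers]
        return train, test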
But that's likely trained exclusively on CV audio, with a language model derived from the CV training data, so that's also not a fair comparison unless the other engines were trained the same way.
Comparing systems trained on different datasets (with different language models) like this is like comparing apples to oranges. Mixing wildly different CPU and memory requirements into the benchmarks just makes it worse.
It should be fairly straightforward to do an unbiased comparison by training and evaluating with the standard Librispeech split and language model. It might be interesting to see how accuracy improves as the models scale up until they match the resource requirements of the other engines.
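For reference, the number behind such comparisons is word error rate (WER): the word-level edit distance between hypothesis and reference transcripts, divided by the reference length. A rough, unoptimized sketch of the computation:

    def wer(reference: str, hypothesis: str) -> float:
        # Word error rate: Levenshtein distance over words, normalized
        # by the number of reference words.
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / max(1, len(ref))

    print(wer("the cat sat", "the cat sat down"))  # one insertion / 3 words = ~0.33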
That said, the speed and memory usage are impressive and I like the focus on very low resource environments. Seems like it has a lot of potential even if it may not be SOTA.
A disclaimer: we use the valid train portion of CV as part of our training set, but it is less than 10% of our training set (in terms of hours). Also, we do not employ an LM, mainly because the systems we are targeting do not have enough storage for a strong LM (storage on them usually maxes out at 64 MB). Cheetah is an end-to-end acoustic model. For later versions, we might be able to add a well-pruned LM for specific domains with limited vocabulary, to boost accuracy within the limited storage available.
I fully agree with your points. I am taking notes here, as I think we should follow up on a couple of your suggestions. Scaling up to DeepSpeech model size can be a bit tricky, as it would require much more compute (GPU), but it should be quite doable with time and budget.
Thanks again for your comments and suggestions. As you correctly pointed out, our main focus is very low resource (CPU/memory) embedded systems.
There are so, so many variables - and many engines could be optimized for 'your voice and the things you're going to say right now, given CPU/memory, microphone quality, background noise, etc.'
The game of NLP is inherently about dealing with 'noisy channels' (in the academic sense) in which there is kind of a probabilistic guarantee of imperfection. So then it comes down to creating the best products in a given context, which is almost always 'less than optimized' for any individual.
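For the unfamiliar, the textbook noisy-channel formulation (standard material, not specific to any engine here) makes the point. The decoder picks

    W* = argmax_W P(W | A) = argmax_W P(A | W) * P(W)

where A is the audio, P(A | W) the acoustic model, and P(W) the language model - so even an optimal decoder is only ever probabilistically correct.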
So there's the model size, cpu/ram, quality of signal (microphone, network) just to start.
Optimizing for standard English probably means reducing quality for people with accents. Maybe in a specific context you could go from 80% accuracy to 85% accuracy for 'most of us' - but then you go from 60% to 40% for anyone with an accent.
And if you reduce the accepted vocabulary, you can get way better results. Of course, we all might want to say words like 'obvolute' and 'abalienate' every so often.
That kind of thing.
It's really fun from an R&D perspective, but it's a product manager's nightmare. Consumer expectations with these technologies are really challenging: because of the inherent ambiguity in the system, people kind of want perfection. And there are always corner cases where it seems like things should be easy, but they're not - because the word you're saying is common, and you're saying it 'perfectly clearly' ... but little do you know there are 2 or 3 other very rare words that sound 'just like that', ergo ... problems.
From a product perspective, it basically always feels 'broken' which is such a terrible feeling :).
But it can be fun if you like really hard product challenges which have less to do with tech and more to do with pure user experience, expectations, behaviours etc.
I am building a Docker image which will eventually accept in-browser audio via WebSockets and output transcription in real time, without needing Google WebSpeech.
I planned to use DeepSpeech, but this looks promising given its low resource requirements.
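For what it's worth, here's the rough shape I have in mind - a minimal sketch only, assuming Python with the websockets package (version 10+), and with FakeEngine standing in for whatever real STT engine ends up in the container:

    import asyncio
    import websockets

    class FakeEngine:
        # Stand-in for a real STT engine (DeepSpeech, Cheetah, etc.).
        def process(self, pcm_chunk: bytes) -> str:
            return ""  # a real engine would return a partial transcript here

    engine = FakeEngine()

    async def transcribe(websocket):
        # The browser streams raw PCM chunks; we stream partial transcripts back.
        async for chunk in websocket:
            partial = engine.process(chunk)
            if partial:
                await websocket.send(partial)

    async def main():
        async with websockets.serve(transcribe, "0.0.0.0", 8765):
            await asyncio.Future()  # run until cancelled

    asyncio.run(main())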
I am frustrated by:
- lack of cross-browser support on whatever device is generating the transcript. It will work from any Android phone but not from iOS
- No offline functionality, because the work is actually not done locally
- Poor accuracy. We can correct this on the fly with live editing, but it feels like it is not using the latest and greatest Google has to offer in terms of STT
I don't know much about Docker or how to "use" a Docker image. Not asking you to teach me, but do you think your project would be useful for what I'm talking about? Currently using Firebase for the backend but it's really just exploratory at this point.
This is after I struggled to find anything in usable form on GitHub and Google Cloud pricing is prohibitive for my free projects. Perhaps there has been progress?
> web speech recognition API in Chrome
I looked at this last year (see my comments 8-12 months back), and it seems either Google or the Chromium dev team removed recognition.speechURI. It was there in Chrome but was later removed. It would have been really helpful to just switch out the provider from the browser's default to an alternative, perhaps cheaper or free, option. I don't know enough about the decision behind this, or whether it was actually working in the first place. To be fair, the whole "SpeechRecognition" API is still in draft. See: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...
The end goal is to easily provide ASR/S2T through WebSockets. The uses and possibilities I've thought of just by myself are many, and I am sure it will be of use to others.
> I don't know much about Docker or how to "use" a Docker image [....] but do you think your project would be useful for what I'm talking about? [....]
Yes. But accuracy is probably something Google is winning on here. If your app is browser-based then, unless my assumption is incorrect, you do not need an API key and therefore are not bound by quota limits/costs.
Are you on Twitter or is there some other way for me to know about your progress?
There are a few features not listed (like exporting a correctly-formatted subtitle file) but you'll get the general idea. There's a link to an old demo in the video description.
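For anyone curious, the subtitle-export side is simple enough to sketch by hand. SRT is just numbered blocks with HH:MM:SS,mmm timestamps; the (start, end, text) segments below are my own assumption about what a recognizer emits, not this project's actual API:

    def fmt_timestamp(seconds: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm (comma before the milliseconds).
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def to_srt(segments) -> str:
        # segments: iterable of (start_sec, end_sec, text) tuples.
        blocks = []
        for i, (start, end, text) in enumerate(segments, 1):
            blocks.append(f"{i}\n{fmt_timestamp(start)} --> {fmt_timestamp(end)}\n{text}\n")
        return "\n".join(blocks)

    print(to_srt([(0.0, 2.5, "Hello world."), (2.5, 4.0, "Second line.")]))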
We already have a community of more than 14,000 developers on the platform, and you can get your own assistant running on a Raspberry Pi in less than an hour. Here are a few tutorials:
The comments are absolutely correct. Cloud services can do better simply because they have access to more compute resources and also more data (what is sent to them can be, and is, used for training later). There are situations where on-device is preferred for privacy reasons, latency, cost, or lack of an internet connection.
Or stream to Google Cloud.
Both are used in production successfully already. Latency is on par with what you would experience calling FedEx and talking to their ASR bot.
That's a really, really, really stupid quote that pushes the increasingly popular world view that experts only drag the world down on every subject.
Linguists' contribution is unrelated to performance, positive or negative. Performance is the job of computer scientists designing the architecture and of skilled programmers implementing it well.
The performance would probably go up if the company fired all the presidents/vice presidents/CFOs, etc., too (giving implementors free rein, unconstrained by costs and schedules), but the company would probably go under.
Every specialist has their own role to play. Thinking that subject domain expertise is a negative is indefensible.
Even with the explanation, it still comes across more as blaming the linguists than the people who assigned the wrong people to the wrong tasks.
(Why are they fired rather than reassigned? Why were they ever hired if they're the wrong specialty and aren't needed elsewhere?)
I worry about this in a lot of fields where we are heavily pushing machine learning. I was also not saying that this project is somehow poor quality, I just could not find much about the team and was curious about their research backgrounds.
Thanks for mentioning the quote. I heard it from a few friends who used to work at Nuance, but didn't know where it came from originally.
We have a very small engineering team who deeply understand machine learning and (embedded) software. I invented/co-invented a few US patents in speech processing/recognition prior to Picovoice, but I have no academic background in linguistics. Having worked with computational linguists, I can say they are definitely a solid plus to a team when their skill set is utilized correctly. I appreciate your question and curiosity. We will hopefully provide more information about the team on our website soon. Thank you.