Speech Recognition Is Not Solved (awni.github.io)
435 points by allenleein 11 months ago | 201 comments

It is one thing to hear and correctly identify the words. It is another to understand the meaning.

I've been thinking about this because my son has Auditory Processing Disorder, APD. He can hear great, even a whisper across the house. The trouble is the words don't always make sense. He can tell me the words he heard and they are correct, but assigning a meaning to them doesn't work like it does for most people.

After playing him a bunch of audio with missing words, the lady who tested him was blown away by how smart he is, explaining that he quickly fills in almost every missing word by guessing from all the possible things it could be, narrowing it down based on context and getting it right. I guess that is a normal, all-day-long task for him.

Since then I've thought about voice recognition differently. The AI's ability to understand the context or fill in the blanks is what will make or break it.

I'd never heard of APD.

Interesting. All my life I've struggled to follow a conversation in a crowded environment, so much so that I actively avoid background noise with words in it. I work with silicone ear plugs in, or with headphones and music with no lyrics.

Looking at the NHS symptoms, they describe me as a child: I didn't learn to read until I was 8.

I nearly ended up in the remedial track but for a single awesome teacher who spotted that it wasn't that I was incapable but that I was struggling to understand in a noisy classroom. She gave up her lunch time for six months and taught me one on one, and by the end of that year I could read much better.

Wow, I wish there was an easy way to pay back star teachers or mentors.

Write them a handwritten letter and make sure it gets to them.

This. I did this with a math teacher (pre-calc and trig) I had at a community college, years later when I was taking calculus etc. in college and killing it (after having a less than stellar math education before and during high school), and I think it was a huge deal to him. He really put his all into teaching that class, and a lot of people probably went on to a 4-year college and didn't think much of it, but after having professors who hardly spoke a word to their students and spent a straight hour just writing on a chalkboard, I totally appreciated an invested, engaged math teacher.

The best way to pay them back is to succeed, but it's not easy!

>is to succeed //

And then buy them a holiday?

Is it really that they were a star? Or just that they cared.

The two aren't necessarily mutually exclusive. Sometimes just caring when no one else does can make someone a star... at least to the person they help.

What's the difference?

I don't mean to denigrate the wonderful contribution that the teacher in question made, as it's truly a selfless commitment with life changing impact. However -

One distinction may be that the star teacher scales better. What if there were two students with this problem? Should the teacher stay late? Should teachers not take lunch? What if someone needs help in math, etc.

A star teacher may be able to reproduce this effect for a large number of students. Perhaps using technology, recorded lectures, practice apps, etc. Perhaps organizing study groups for kids to teach and learn from each other.

A caring teacher may sacrifice her lunch. A star teacher may not need to.

The thing about teaching is that you can be a star just by having one success story. If you are a vital part of the success of 2 people you've already clearly contributed more than your share.

But one is enough; some of the most famous scientists only got where they were because they had a specific advisor. Neither the teacher nor student would ever have accomplished half as much working alone.

I am famous to maybe six people. So, it's not much. However, I wanted to add an account to this.

I can tell you the exact moment, problem, response, and teacher - that changed my life forever. It was at that single moment that I understood mathematics. I wasn't suddenly good at it. No, I still pretty much sucked at math. But, I understood it was a language and it was descriptive.

Many years later, that same teacher would be there to help me celebrate defending my dissertation. I didn't actually march; I was already too busy by the time spring rolled around. He's an old man, now. I still stop in and we are still fast friends.

That one comment changed my life. There's a not insignificant chance that it has had a small impact on your life, assuming you've been impacted by modern traffic engineering, or in some pedestrian venues, or had some goods delivered. All that from a single comment that is entirely obvious, in retrospect.

I think you're making a mistake here.

You might be a star programmer, but only if you re-wrote every single program exceptionally, because clearly it's not making a fundamental change to someone's life once that counts, you're not a star until you've done that for everyone ...?

Going above and beyond the expectations of your role in order to make a fundamental change to someone's life is enough to call someone a "star" IMO. It's not like someone asked you to personally pay them a bonus; you can award all exceptional efforts a "star" without problem.

I am not clear on what definition you are using for "star teacher". But it sounds like "caring" is in the set of things that are required to be a star teacher.

Just because they didn't solve all the problems doesn't mean it's irrelevant they solved one problem.

I am sure you have thought about it but since you didn't mention it, are your symptoms those of Asperger's syndrome? Asperger's is often thought of as a sensory perception problem and Aspies tend not to be able to filter out sounds from background noise.

I took one of those online tests a while back that said if you get over 25 you need to see a specialist; I got 43.

Not sure I have Asperger's though, I'm just a programmer who likes his own company.

One of the pillars of psychiatric diagnosis, according to doctors I've met, is that any kind of diagnosis or treatment is only to be judged by its ability to improve your quality of life.

If you don't suffer in any way shape or form, it doesn't matter how many boxes you tick, if you feel fine there is no compelling reason to go through with diagnosis, except possibly if you have kids.

Autism spectrum disorders are quite heritable, and it's a tremendous advantage to be aware of one's own psychiatric oddities when you raise a child with his or her own. I speak from experience: I wish I had known about my ADHD-PI/ADD with a dash of autism earlier. I am however very happy I got to know about it early enough to help me guide my daughter into a way of thinking that has helped her a lot.

In general, without the knowledge that people on average might have a WILDLY different cognitive landscape than yours, it's very very hard to know which advice to take to heart, and what to completely ignore.

With that said, I'm fairly well aware of what type of questions are on those questionnaires. If the test is legitimate, a result that far over the limit would probably include at least a few 'yes' answers to questions about things that usually have a negative impact on quality of life.

Adding to that, in my experience people actually on the spectrum are the least likely to self diagnose prematurely, even when the answer is essentially staring them in the face.

Anything that can improve quality of life a little is worth considering, because sometimes it's really hard to judge the magnitude of how it can help.

Hmmm, I can relate to what you are describing. I have some of that and was diagnosed with ADD in my 30s, and I am a woman. Diagnosis and medication helped immensely. It's as if you can't process too much signal at once. There is such a thing as the Asperger's spectrum. I cannot deal with high-pitched noises, and noise in general really disturbs me.

Only one data point, but I find it much harder to filter out speech from background noise in my second language than I do when people are speaking my first language in a noisy environment. My guess would be that we have multiple ways of processing language that use different bits of our brains.

I also notice that even though I can't read lips, I often find it much easier to understand what people are saying when I can see them. I think part of the reason speech recognition systems aren't generally at human level is that they don't receive the same amount or kind of data that we do.

An interesting experiment might be to include a speaker's native tongue when trying to recognize their speech in a different language. I bet that speech recognition would be a lot easier on e.g. native Spanish speakers speaking English if you know to ignore, for example, the sound "e" when spoken before "st" or "sp"
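To make that idea concrete, here is a minimal sketch that works on text tokens instead of audio. The vocabulary and the function name `strip_prothetic_e` are made up for illustration; a real recognizer would do something like this at the phoneme or pronunciation-lexicon level, not on spelled-out words.

```python
# Hypothetical vocabulary; a real system would use its full pronunciation lexicon.
VOCABULARY = {"speak", "street", "stop", "estimate", "especially"}


def strip_prothetic_e(token, vocabulary=VOCABULARY):
    """Drop a leading "e" before "st"/"sp" when that yields a known word.

    Assumes the speaker's native language is known to be Spanish, so an
    extra "e" before "st" or "sp" is a likely accent artifact.
    """
    if token in vocabulary:
        # Genuine words such as "estimate" are left alone.
        return token
    if token.startswith(("est", "esp")) and token[1:] in vocabulary:
        return token[1:]
    return token


if __name__ == "__main__":
    for heard in ["espeak", "estreet", "estimate", "esplat"]:
        print(heard, "->", strip_prothetic_e(heard))
```

The check against the vocabulary first is the important part: without it, real words that happen to start with "est" or "esp" would get mangled.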

My son was so good at reading lips that we'd play a game where we only mouthed words and he'd tell us what we "said". In school we let the teacher know he needed to see her say the words on a spelling test. In those younger years that was the only way he could tell the difference in some sounds. The big one I remember was "th" and teaching him to look for the tongue on the front teeth.

As he is getting older he relies less on seeing lips. I'm hopeful at some point he'll outgrow most issues.

I'm pretty sure I've seen this exact question (do you have trouble understanding speech in noisy environments, etc.) on several Asperger and related self-assess questionnaires.

Wow, you just made me cry. That's beautiful that your teacher was willing to do that for you. Thanks to all the awesome teachers out there.

wonder if i have that. i always try to fill in words with context or words that rhyme with the sound.
- when people talk on tv, i need the volume at 95%, everything else like 4 bars.
- i hate phones because only one ear gets the information, headphones makes talking on the phone so much more relaxing.
- i dont understand how anyone understands anything at clubs, how is that even possible?

> i dont understand how anyone understands anything at clubs, how is that even possible

Mostly, I don't. Earbuds help for some weird reason, as does pressing your ear shut. Looking at their mouth does 80% of the work. And most conversations I have in clubs are pretty mundane and uninspired. “What do you want to drink?” “This DJ is good. Do you also think this DJ is good?” “Do you agree with late justice Earl Warren that Baker v Carr was the most important supreme court decision of his era?” etc... Low entropy, easy to error correct.

> Since then I've thought about voice recognition differently. The AI's ability to understand the context or fill in the blanks is what will make or break it.

Of course, and all humans rely on this as well. No one hears every word perfectly all the time --- it's impossible, because the source person doesn't pronounce every word perfectly all the time. Context clues are a huge part of speech recognition, as well as gestural typing recognition and other forms of machine interpretation of human input. While it's always been a component of NLP, you can clearly see it in action with the Android keyboard these days because after you type two words in a row, the first may be corrected after you enter the second one, based on context provided by the second one.
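A toy sketch of that "correct the previous word once the next one arrives" behaviour, using bigram counts. The counts and the confusable-word table below are invented for illustration, and this is not how the Android keyboard actually does it; it only shows the general idea of rescoring with right-hand context.

```python
# Hypothetical bigram counts; a real keyboard would learn these from a large corpus.
BIGRAM_COUNTS = {
    ("piece", "of"): 50,
    ("peace", "of"): 5,
    ("peace", "treaty"): 40,
    ("piece", "treaty"): 1,
}

# Hypothetical table of easily confused words. The word itself is listed
# first so that it wins when the context gives no evidence either way.
CONFUSABLE = {
    "piece": ["piece", "peace"],
    "peace": ["peace", "piece"],
}


def correct_previous(prev_word, next_word):
    """Return the candidate for prev_word that best fits the word typed after it."""
    candidates = CONFUSABLE.get(prev_word, [prev_word])
    # max() returns the first maximal element, so ties keep the original word.
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((w, next_word), 0))


if __name__ == "__main__":
    print(correct_previous("peace", "of"))      # context favours "piece"
    print(correct_previous("piece", "treaty"))  # context favours "peace"
```

Keeping the typed word on ties matters for the complaint further down the thread about corrections that are "almost always wrong": with no evidence, the sketch leaves the input alone.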

This is why I think Google recently saying 95% word accuracy is just as good as a human is wrong. If I ask you after dinner if you "Want to get a pizza cake", you'll probably quickly realize I mean "piece of cake". A mistake like that in 1 out of every 20 words is a lot.

Google is the best of all the big players at figuring out this context (I once asked it what a Dead Left Shrimp was, apparently I was mishearing the name of basketball player Detlef Schrempf), but it will still search for "Pizza Cake" rather than correcting it.

Not only that, but even if you were to get 100% word accuracy, context is still important for choosing between multiple possible meanings.

As an example, I was being directed by Google Maps to a new place, and I asked it "What is the ETA?" It responded, "From Wikipedia, the estimated time of arrival or ETA is the time when a ship, vehicle, aircraft, cargo or emergency service is expected to arrive at a certain place." It was a completely valid answer to the question, but not one that any human would give.

I think you give humans too much credit. If it were so, this joke[0] wouldn't be funny.

s/Microsoft/Google/g
s/Seattle/Mountain View/g

[0] http://alunthomasevans.blogspot.com/2007/10/old-microsoft-jo...

Humans make those mistakes too. Once, while talking to somebody about money, I asked "do you know what inflation is?", meaning what the rate is, to help make a decision, but they were offended and replied that yes, of course they knew what the concept meant.

Having a "the" in there like you did should remove ambiguity though. I guess Google Maps defers to Wikipedia when it doesn't know what a term means.

Have you ever considered that pizza cake is delicious?

I don’t know about you, but I wanna try this: https://g.co/kgs/4D4er9

That Gboard "feature" is extremely annoying because it's almost always wrong. This is another case where artificial intelligence ends up being worse than no intelligence.

This reminds me a lot of https://en.wikipedia.org/wiki/Prosopagnosia also known as "face blindness". People with the disorder can see perfectly fine, but have trouble recognizing people's faces, just as your son hears just fine, but has trouble recognizing people's speech. Similarly, I know someone who beat brain cancer, but after a certain point he became unable to properly taste/enjoy food.

Interestingly, these disorders all seem to have pretty direct neurological causes: there is something causing a difference in the person's brain between where they sense the information and where they process it, whether it's genetic (i.e. affecting the layout and growth of the brain) or due to trauma such as a head injury.

I'm curious as to whether your son is able to enjoy music, especially music with multiple different instruments playing simultaneously. I have a theory that the same process on a neurological level that makes food pleasurable is what makes smells smell good, music sound good, certain surfaces (e.g. soft fur) feel good, and certain sights (e.g. a mountain stream) look good. I mean that not from a neurotransmitter perspective, but in the actual neural processing. It seems these configurations are programmed into our brains genetically, just as our propensity toward recognizing faces/speech is, which is fascinating.

> After playing him a bunch of audio with missing words, the lady who tested him was blown away by how smart he is, explaining that he quickly fills in almost every missing word by guessing from all the possible things it could be, narrowing it down based on context and getting it right. I guess that is a normal, all-day-long task for him.

Sure it is; it's an essential part of speech recognition for all humans.

What you're asking for is equivalent to a general purpose intelligence, i.e. human intelligence, i.e. AI-complete. All (most?) of our ideas can be expressed in language, so in order to actually understand language, you have to actually understand any idea.

Interesting. I wonder if your son will be smarter as a result, kinda like how neural networks generalize better when trained with dropout.

He seems a bit behind on social cues, but I am often shocked at things he picks up on. We just started playing the board game "Clue". The second game he was using tricks that I use but hadn't mentioned. His deductive reasoning was awesome. He won that game and I don't "let" my kids win.

Sorry to hear that your son has APD! This makes me wonder though - do you know if this condition is language-specific? I assume you taught him English; would he have same problems with a very foreign language - say, Chinese?

I don't think it is language specific. He is in karate and I didn't see any difference.

The background noise seems to be key. He was doing great in karate and a new teacher started blasting music. He went from doing great to being unable to know he should put his right foot forward when told to. That was one of the clues that something wasn't normal.

TIL I have APD

What I would like to see would be an open source speech recognition system that is easy to use, and works. There is really no match for the proprietary solutions in the open source world right now, and that is disappointing.

Not to say that there aren't open source speech recognition systems; they just aren't completely usable in the way proprietary solutions are. A lot of research goes into open source speech recognition. Where they are lacking is in the datasets and user experience.

Hopefully, Mozilla's Common Voice project https://voice.mozilla.org/ will be successful in producing an open dataset that everyone has access to and will spur on innovation.

I agree - I don't care how good Google gets it, this is an unsolved problem until I can do it with open source tools operating without an internet connection.

Google has basically invented a special processor with a very, very, VERY weird architecture for these sorts of tasks: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...

I don't think this level of computational power can be achieved on a modern CPU, or even a GPU! But GPUs are probably the closest analog to Google's absurdly parallel architecture.

To get a GPU working at maximum performance, you either have to go OpenCL 2.0 or CUDA. Compared to OpenCL 1.2, OpenCL 2.0 has a better atomics model, dynamic parallelism (kernels that can launch kernels), shared virtual memory, and tons of other features.

NVidia of course supports those features in CUDA, but NVidia's OpenCL support is stuck at 1.2. So in effect, CUDA and OpenCL are in competition with each other.

Anyway, that's the current layout of the hardware that's available to consumers. I think it's reasonable to expect a graphics card in a modern machine; even Intel's weak integrated GPUs have a parallel-computing advantage over a CPU.

So for high-parallelism tasks like audio analysis or image analysis, it only makes sense to target GPUs today.

The TPU architecture isn't that weird...it's basically a hardware implementation of matrix multiplication. It also isn't a silver bullet for ASR, where neural networks are usually only used for a part of the recognition process.

> The TPU architecture isn't that weird

It's "weird" in the ways that matter: there's no commodity hardware in existence that replicates what a TPU does. The only place to get TPUs is through Google's cloud services.

CPUs are basically Von Neumann Architecture. GPUs (NVidia and AMD) are basically SIMD / SIMT systems.

Google's TPU is just something dramatically different, optimized, yes, for matrix multiplication, but it's not something you can buy and use offline.

It's not that dissimilar in architecture or performance to the Tensor Cores in Volta, which you can buy soon.

I wasn't aware of Tensor Cores in Volta.

I'll look into them, thanks!

> but it's not something you can buy and use offline.

But you will. The entire point is to put this in a phone, so you can distribute a trained neural net in a way that people can actually use without a desktop and $500-$4,000 GPU.

> But you will. The entire point is to put this in a phone, so you can distribute a trained neural net in a way that people can actually use without a desktop and $500-$4,000 GPU.

As far as I can tell, they put a microphone on your phone and then relay your voice to Google's servers for analysis.

Or Amazon's servers, in the case of the Echo.

I don't see any near-term future where Google's TPUs become widely available for consumers, be it on a phone or desktop. And I'm not aware of any product from the major hardware manufacturers that even attempts to replicate Google's TPU architecture.

NVidia and AMD are sorta going the opposite direction: they're making their GPUs more and more flexible (which will be useful in a wider variety of problems), while Google's TPUs specialize further and further into low-precision matrix multiplications.
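For anyone wondering what "low-precision matrix multiplication" means in practice, here is a rough pure-Python sketch of the general idea: quantize floats to 8-bit integers, multiply and accumulate in integers, then rescale. The function names and the single per-matrix scale are assumptions for illustration; this is not a description of how the TPU itself is implemented.

```python
def quantize(matrix, scale):
    """Map floats to integers in [-127, 127] by dividing by scale and rounding."""
    return [[max(-127, min(127, round(x / scale))) for x in row] for row in matrix]


def int_matmul(a, b):
    """Plain integer matrix product; hardware would do this in a multiply-accumulate array."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def quantized_matmul(a, b, scale_a, scale_b):
    """Approximate a @ b using 8-bit integer inputs and a wide integer accumulator."""
    acc = int_matmul(quantize(a, scale_a), quantize(b, scale_b))
    # Each accumulated product carries a factor of 1 / (scale_a * scale_b).
    return [[v * scale_a * scale_b for v in row] for row in acc]


if __name__ == "__main__":
    a = [[0.5, -1.0], [0.25, 0.75]]
    b = [[1.0, 0.0], [0.5, -0.5]]
    # Values lie in [-1, 1], so a scale of 1/127 uses the full 8-bit range.
    print(quantized_matmul(a, b, 1 / 127, 1 / 127))
```

The result is close to the exact product but not identical, which is the trade being made: a small, bounded error in exchange for much cheaper arithmetic.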

Neural nets take large GPUs (or TPUs) to train. Realtime inference on CPUs has been possible since forever.

Also, I just turned on airplane mode and google assistant recognized my voice.

Is that the point? I ask because the "weird" in the TPU is mostly its scale. It's not like you can't do matrix multiplies with the vector units on a CPU or with a GPU. It's really the scale; by that I mean it's more elements than what you get with existing hardware, but it's also lower precision, appears less flexible, and is bolted to a heavyweight memory subsystem.

So, in that regard it's no more "weird" than other common accelerators/coprocessors for things like compression.

So, in the end, what would show up in a phone doesn't really look anything like a TPU. I would maybe expect a lightweight piece of matrix acceleration hardware, which due to power constraints isn't going to be able to match what a "desktop" level FPGA or GPU is capable of, much less a full-blown TPU.

If this were true I’d expect to see some effort to open and standardize the hardware. Otherwise what’s the point?

Neural networks are used for nearly all of ASR now. Last I heard only the spectral components were still calculated not using a neural net and the text-to-speech is now entirely neural network (i.e. you feed text in and get audio samples out). I'd be surprised if they don't do that for ASR too soon if they haven't already.

Although some models are end-to-end neural nets, most of the ones in production (and all of the ones that get state of the art results) only use a neural net for one part of the process. Lots of people are as surprised as you, but that's the way it is.

Edit: I should say that in state of the art results there tend to be multiple components, including multiple neural nets and the tricky "decode graph" that gok and I are talking about. These are trained separately then get stuck together, as opposed to being trained in an end-to-end fashion.

Separating acoustic model and decoding graph search makes sense since you would need a huge amount of (correctly!) transcribed speech for training. See, for example, this paper by Google [1], where they used 125,000 hours (after filtering out the badly transcribed ones from the original 500,000 hours of transcribed speech) for training an end-to-end acoustic-to-word model. Good "old-school" DNN acoustic models can already be trained with orders of magnitude less training data (hundreds to thousands of hours).

[1] https://arxiv.org/abs/1610.09975

Yes, exactly. I do wonder whether a similarly good end-to-end system could be trained by constraining the alignments as I've seen done in some papers.

AFAIK state of the art models are a hybrid of HMM/GMM and CNN for phoneme classification. There are exotic CTC/RNN based architectures for end-to-end recognition but they aren't state of the art.

You're right in that TPUs allow Google to train on very large datasets faster and using less power. But I think a reasonable ASR model should be trainable with GPUs alone.

The issue previously has been a lack of large enough high quality annotated datasets, and open source ASR libraries being a bit behind or not well integrated with cutting edge deep learning. I think that's changing now though. I hope it won't take too long until pre-trained, reasonable size and high accuracy TensorFlow/Kaldi models for many languages are common.

The level of the computation can be achieved just fine with a GPU or some co-processors. What the TPU excels at is performing forward inference very efficiently. So, you can't train on it, or perform arbitrary computation that well, but if you had a pre-trained neural net, you could run it very fast and with little power.

I don't think "donate your voice" will ever work. They should incorporate it in a product and add a checkbox to share your voice/statistics to improve the system.

Why don't you think it will work? Their goal is 10,000 hours of speech, which doesn't seem like an unattainable goal to me.

The dataset is only going to grow as time goes on. They just need to get their website in order (a stats page to show where they're currently at and how far they need to go would help a lot) and throw a little advertising behind it. It's just a matter of time.

I have recently been trying out open source solutions for voice recognition for a personal project, and you are very correct that they lag very far behind proprietary solutions. Pocketsphinx is still very limited and Kaldi takes quite a bit to set up in a usable fashion. There were a few other options I looked at that I can't think of off the top of my head, but they were all in similar condition.

As the article says, latency is still a problem, and it's a huge problem in current open source solutions; some stuff I was testing easily took 5 seconds. I know that can be improved with configuration, but when dealing with libraries of 10 words or so, that's pretty bad.

I feel like anyone who is seriously interested in this space has been scooped up by all the big companies and the open source solutions have really seemed to linger because of it. It's one of the first areas I've seen where open source alternatives are really behind the proprietary solutions. Kind of bummed me out.

Kaldi is the best. TensorFlow integration was just added, which will hopefully speed up development (though I haven't seen any pretrained models for that yet).

Here's a blog post - http://www.googblogs.com/kaldi-now-offers-tensorflow-integra...

The easiest way to deploy Kaldi is this - https://github.com/alumae/kaldi-gstreamer-server (or a Docker image of that)

That's the option I was looking at using, so glad to see I'm on the right track. Thanks!

Unfortunately that Tensorflow integration didn't include acoustic modeling, so you still need to use Kaldi's neural net toolkit for that.

I too would be interested in pointers to the leading open source options.

Just yesterday there was a Show HN built with the https://github.com/kaldi-asr/kaldi project, emscripten-ized: https://news.ycombinator.com/item?id=15534531

When I looked a while back, CMUSphinx seemed to be the most promising option but I struggled to get it installed and got distracted with real work. Some discussions online suggest it’s still fairly poor compared to the online engines.

Snips was mentioned here recently but I haven’t taken a look at it.

CMUSphinx is really old. Kaldi is hard to use, but it's much better.

The first time I read this I misunderstood its meaning. I now believe the parent is discussing the differences in usability, not output. In which case I completely agree.

However I will say that for my company's use case a properly configured sphinx install produces better results than kaldi. However, getting to a point where you can say that was not an easy task.

Additionally, I actually believe that for most workloads Kaldi is likely better. Not ours though.

I had a working cmusphinx setup at one point. It was so bad that I eventually just tore it down.

It showed promise, but I don't know if any work is being done on it.

Open source speech recognition on Github:

1. speech-to-text-wavenet (https://github.com/buriburisuri/speech-to-text-wavenet)

2. kaldi (https://github.com/kaldi-asr/kaldi)

3. Speech recognition module for Python (https://github.com/Uberi/speech_recognition)

4. DeepSpeech (https://github.com/mozilla/DeepSpeech)

5. Natural Language Processing Tasks and References (https://github.com/Kyubyong/nlp_tasks)


Is it an issue of open source software being inadequate, or is it the lack of sufficient training data and compute power local to your home?

I’m under the impression that Google is mostly dogfooding its open source tooling for machine learning in GCP, and actually differentiates based on trained models and compute power.

A bit of both. When data doesn't exist, people aren't motivated enough to create the open source tools which would leverage it. While Librispeech is okay for academic research, it's not enough to create a good production-grade speech recognition system.

The problem wrt open source / free solutions is data. Kaldi is open source and gets state of the art results -- but the data costs a lot of money. Training the models is doable on a commodity GPU although it takes quite a while.

I didn't realize Kaldi could get state of the art results. Do you say that because you know of people doing that, or is your comment based on knowing the architecture of Kaldi?

The 8.5% in this file is what you'd compare to Microsoft and IBM's recent ~5% results.


Kaldi hasn't been in first place on that dataset recently, but it was a few years ago.

On other more researchy datasets (e.g. for distant speakers or languages other than English), the best system is often based on Kaldi.

One reason for the discrepancy between quoted numbers is that, if you are only after pushing that number down and not particularly interested in getting a scalable system, then you are free to run as many systems as you like in as many configurations as possible and then try to combine their outputs (ROVER etc.).
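As a rough illustration of what "combining outputs" means, here is a heavily simplified voting step in the spirit of ROVER. Real ROVER first aligns the hypotheses with dynamic programming and can weight votes by confidence; this sketch assumes the hypotheses are already aligned slot by slot, with a made-up `<eps>` placeholder where a system output nothing.

```python
from collections import Counter


def vote(aligned_hypotheses, gap="<eps>"):
    """Pick the most common word in each slot across pre-aligned system outputs.

    Assumes every hypothesis has the same number of slots; `gap` marks a
    slot where a system produced no word.
    """
    length = len(aligned_hypotheses[0])
    if any(len(h) != length for h in aligned_hypotheses):
        raise ValueError("hypotheses must be pre-aligned to the same length")

    output = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != gap:  # a winning gap means "emit nothing here"
            output.append(word)
    return output


if __name__ == "__main__":
    systems = [
        ["want", "a", "piece", "of", "cake"],
        ["want", "a", "pizza", "<eps>", "cake"],
        ["want", "a", "piece", "of", "lake"],
    ]
    print(" ".join(vote(systems)))
```

Each system here makes a different mistake, so the majority gets every slot right. That is also why stacking many systems pushes the number down without giving you something you would want to run in production.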

Yeah. I don't get the impression that the Kaldi core team has been trying very hard recently to get SOTA on eval2k/switchboard. This number uses one acoustic model with a trigram LM decode + fourgram rescoring -- there isn't even a neural net language model in there. If I remember correctly, Microsoft's first "human parity" result used something like three acoustic models and at least four types of language models. This Kaldi model is competitive with the best single acoustic model Microsoft used.

Fully agree. I think their work on training data augmentation (e.g., their ICASSP paper, http://danielpovey.com/files/2017_icassp_reverberation.pdf, or the ASPiRE model before) has a bigger impact on the practical usefulness of ASR than getting an X% relative improvement over the previous SOTA on the eval2000 set.

Mozilla may not be able to make much headway until these patents expire:


Apparently, the business model for voice services in the past has been to snap up as many broad patents as possible to keep competitors at bay. I read an interview with a Google engineer a couple of years back claiming the same thing: they have to carefully work around a patent minefield with their own services, and these patents are holding back better voice search on mobile.

You are correct that the problem is data. Kaldi is hard to use, but making it easier to use isn't as hard as getting good training data. Mozilla's project is a good start for some purposes. One flaw with it is that they're having people read sentences. When people read, they tend to speak more clearly than when they're figuring out what to say on the fly. This means models trained with Mozilla's data will tend to need people to speak more clearly than, e.g., a model trained on conversational data would.

(edits: spelling/grammar)

It's only a data issue if they're trying to reproduce Google Voice. Personally I got much better results from Dragon NaturallySpeaking 20 years ago than I get from Google Voice today. The cost was that you had to train it yourself first, but the benefit was that it was trained for you, not "everyone". The latter is the approach I'd prefer to see Mozilla/OSS take.

That's fair. I think both approaches are useful in different situations. Mozilla seems to be focussing on Siri-like use-cases as opposed to dictation. Even for dictation, for many people having to train the system themselves is more work than they're willing to do. I'm sure what you want will exist eventually :)

Oh it's not "hard" to get training data, you just need loads of money to buy the existing datasets.

or have more effective ways to collect tons of open source speech data. The Mozilla Common Voice project is really cool, but they should make it way easier for people to contribute.

Like adding a mic button for voice search next to the main search toolbar in Firefox, and then asking for permission to use that data for research.

Hah sure. By "hard," I meant that it's the largest hurdle. And probably even the common research datasets aren't enough to give you results competitive with Google, etc. AFAIK Google uses its own hand-transcribed data.

Speech synthesis is in a similar situation. The FOSS options that I know of are completely primitive compared to their proprietary counterparts, particularly those locked away behind the cloud. Unfortunately this area seems to receive much less attention than speech recognition (AFAIK it's a non-goal of the Mozilla project, for example).

I suspect that it’s widely recognized that getting incremental advances in speech synthesis is really hard and, unlike potentially speech recognition, it doesn’t really solve a problem that people have. For business services and consumer devices slightly better voices are valuable but, for example, the improvements in the latest Siri voice don’t make any new things possible. But it’s part of the fit and finish of an expensive phone.

I'm the cofounder of Snips.ai and we are building a 100% on-device Voice AI platform, which we want to open-source over time.

You can build your voice assistants and run them for free on a Raspberry Pi 3 or Android.

It's always been a puzzle to me that published WER is so low, and yet when I use dictation (which is a lot--I use it for almost all text messages) I always have to make numerous corrections.

This sentence is crucial and suggests a way to understand how WER can be apparently lower than human levels, yet ASR is still obviously imperfect:

> When comparing models to humans, it’s important to check the nature of the mistakes and not just look at the WER as a conclusive number. In my own experience, human transcribers tend to make fewer and less drastic semantic errors than speech recognizers.

This suggests that humans make mistakes specifically on semantically unimportant words, while computers make mistakes more uniformly. That is, humans are able to allocate resources to correct word identification for the important words, with fewer resources going to the less important ones. So maybe the way to improve speech recognition is not to focus on WER, but on WER weighted by word importance, or to train speech recognition systems end-to-end with some end goal language task so that the DNN or whatever learns to recognize the important words for the task.
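
To make "WER weighted by word importance" concrete, here is a rough Python sketch. It is ordinary word-level edit distance, except that each error costs the weight of the word involved. The weighting function and the list of function words are made up for illustration; they don't come from the article or from any toolkit.

```python
def weighted_wer(reference, hypothesis, weight=lambda w: 1.0):
    """Word error rate where each word carries a weight.

    Substituting or deleting a reference word costs weight(word);
    inserting a hypothesis word costs weight(inserted word).
    With the default weight of 1.0 this reduces to ordinary WER.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum cost to turn ref[:i] into hyp[:j]
    d = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = d[i - 1][0] + weight(ref[i - 1])
    for j in range(1, len(hyp) + 1):
        d[0][j] = d[0][j - 1] + weight(hyp[j - 1])
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else weight(ref[i - 1])
            d[i][j] = min(
                d[i - 1][j - 1] + sub,             # match / substitution
                d[i - 1][j] + weight(ref[i - 1]),  # deletion
                d[i][j - 1] + weight(hyp[j - 1]),  # insertion
            )
    total = sum(weight(w) for w in ref)
    return d[len(ref)][len(hyp)] / total if total else 0.0


# Hypothetical importance weights: function words count for little.
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "me", "please"}


def importance(word):
    return 0.1 if word.lower() in FUNCTION_WORDS else 1.0
```

Against the reference "please hand me the tractor", the hypotheses "please hand me a tractor" and "please hand me the tracker" both have one wrong word and so the same plain WER, but with these weights the second one scores about ten times worse, which matches the intuition that it changes the meaning.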

The low WER numbers you've probably seen are for conversational telephone speech with constrained subjects. ASR is much harder when the audio source is farther away from the microphone and when the topic isn't constrained.

Very true. Which is why all the home assistants (Google Home, Amazon Echo, etc.) use array mics and beamforming: they get a much cleaner speech signal from far-field audio, and better WER as a result.
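
For anyone who hasn't seen beamforming before, the simplest textbook form is delay-and-sum: shift each microphone's signal so the target speaker lines up across channels, then average. Below is a minimal Python sketch under strong assumptions (the delays are already known and are whole numbers of samples). I'm not claiming this is what Echo or Google Home actually run.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its delay (in samples) and average.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   delays[m] is how many samples later the target signal
              arrives at microphone m than at the reference microphone.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc, count = 0.0, 0
        for samples, delay in zip(channels, delays):
            idx = t + delay  # look ahead to undo the arrival delay
            if 0 <= idx < n:
                acc += samples[idx]
                count += 1
        out.append(acc / count if count else 0.0)
    return out
```

The target signal adds up coherently after alignment, while noise that differs from mic to mic tends to average out. The hard parts in practice are the ones this sketch assumes away: estimating the delays and handling fractional-sample shifts.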

Exactly. This is also why Google sponsors the CHiME challenge, the existence of which is more proof that ASR is pretty far from solved.


Good stuff. I've looked into cheap array mics with Linux drivers a few times in the past, but not much is available.

Speech separation - Mitsubishi Research has done some pretty impressive stuff on that - http://www.merl.com/demos/deep-clustering. Haven't seen equivalents of that in open source ASR

This is the useful info I've seen hardware-wise recently:



Amazon's 7-mic hardware has its own OEM program.

Totally. You may want to take a look at the papers from CHiME4 for more along those lines:


I'm really fascinated by the whole idea of blind source separation and the fact that speech signals are "sparse" in frequency space.

We've had a similar experience looking for hardware / open source beamforming. There's a package called beamformit, but I think it's pretty old.

This is, to me, one of the major problems with many algorithmic solutions to problems. An x% increase in precision, F-measure, or any other score in no way means that the results are better.

I've repeatedly seen improvements to traditional measures that make the subjective result worse.

It's incredibly hard to measure and solve (if anyone has good ideas, please let me know). I check a lot of sample data manually when we make changes; doing that, targeting the important cases, is really the only way I know to do things.

If you've got a dictation system on a phone, wouldn't a very good metric be the corrections people make after dictating?

I guess a problem would be if people become so used to errors that they send messages without corrections. I have some friends who do this: they send garbled messages that I have to read out loud to understand. But there will always be a subset of people who want to get it right.

Yes, exactly, the raw number of word errors is a very simplistic way to judge the accuracy of a transcription. Which words were incorrect and to what degree the failures changed the meaning are ultimately far more important. And while the test described in the article is a useful way to compare progress over time, it is clearly not nearly broad enough to cover the full range of scenarios humans will rightly expect automated speech recognition that "works" to be able to handle.

I agree. I think a big part of the reason that people use WER is that it's relatively unambiguous and easy to measure.

I think some people overestimate how good humans are at speech recognition. Unfamiliar accents and noisy environments cause havoc with many people. I had a friend in high school who had learned English in India, so I was used to that accent; many of my classmates in college could not understand anything our Indian TA said freshman year.

Similarly, I had friends for whom English was a second language who had lived in the US for years and were definitely fluent, but had to enable subtitles for a movie in which the characters had a strong southern accent; in general, non-rhotic accents were very troublesome for them, having only spoken English with midwesterners.

The article mentions the Scottish accent, and I would call that the hardest accent of native English speakers for those in the US to understand.

My grandfather was on the ferry from Norway to Denmark with my sister. We're all from Norway. He had quite reduced hearing and used a hearing aid.

They were having dinner at the ship's restaurant, and the waiter asked my grandfather something. My grandfather just didn't understand the guy and asked the waiter to repeat several times. After the 4th time, my sister told my grandfather: "Grandpa, he's Swedish". My grandpa paused for a second and then immediately recognized what the waiter had said.

Turns out he had assumed the guy was Danish, and thus failed to interpret the limited sounds he could hear, given the hearing loss and background noise.

I'm half-deaf with a cochlear implant; context switching is incredibly difficult! I've had similar problems where I assume someone is speaking either Japanese or English and "filter" accordingly.

It's also cool that your sister was able to assess what the problem was with nothing more than observation, and then cause your grandpa to switch to a more effective processing model with a simple utterance.

Obligatory "voice recognition lift in Scotland" comedy sketch: https://www.youtube.com/watch?v=sAz_UvnUeuU

Personally I've had to "translate" between a South African and an Ulsterman before, both of whom were speaking English but with extremely different accents.

I find with accents that if I try to imitate the sound, the closest word form that produces that sound can become clear, if that makes sense.

This works well both in speech and sign with our 1 year old too (oh not "poo" but "boots"; {moves open hand forward and backwards away from body at chest height} oh right "lawnmower" ...).

I had a chance to live in the middle of the "hood" in a large American city for about 6 months. I was one of the few white guys around. I spoke English, they spoke English - but it was about a week before I could understand people, especially on a crowded city bus. The first few days felt like a foreign country and it was the same country I grew up in! I wonder how well Siri and friends work with Ebonics?

That's a double whammy because diction is so different as well; our brain fills in the gaps in phoneme recognition by pattern matching and the differing dialect makes that much harder as well.

If you think that's bad, check out the Geordie accent from Newcastle: https://www.youtube.com/watch?v=ZY4TT3VtR8o

According to Wikipedia it's "a direct continuation and development of the language spoken by the Anglo-Saxon settlers" of the region. https://en.wikipedia.org/wiki/Geordie

Jimmy Nail! Nevertheless, Rab C. Nesbitt remains my high water mark for impenetrability in British television:


I defer to your obvious familiarity with British TV.

Agree with this. I’m currently traveling to a country where English is not the primary language, but is still spoken universally. I am having a hell of a time understanding people - the unfamiliar accent is wreaking havoc with my brain’s speech recognition centers.

I’m currently where speech recognition software was in the 90s.

The thing is, even if you can't do much better than a computer on pure recognition of words, you have far superior error detection and correction abilities. If a computer mishears a word, it likely just assumes it was whatever was closest, resulting in nonsense sentences. A human will usually be able to "fill in the blanks" on something they didn't hear, or failing that, be able to ask for someone to repeat it.

I agree with your point. It can be hard for a US native English speaker to recognize a Scottish accent.

But, other Scottish people certainly don't have trouble with understanding a Scottish accent. So I view that as a certificate that we should be able to build a speech recognizer which can recognize Scottish accents.

> But, other Scottish people certainly don't have trouble with understanding a Scottish accent. So I view that as a certificate that we should be able to build a speech recognizer which can recognize Scottish accents.

As a Scottish person, I'll say there's a huge amount of variation between Scots dialects. As someone who grew up in Fife, it took me well over a year of living in Glasgow to be able to reliably understand people there—and both of them are typically classed as Central Scots.

Parliamo Glasgow: https://www.youtube.com/watch?v=TfCk_yNuTGk

I also grew up in Fife, although my parents paid good money so I would have an Edinburgh accent. Glasgow was like a foreign country to us...

> I also grew up in Fife, although my parents paid good money so I would have an Edinburgh accent.

I grew up in St Andrews, both of my parents having grown up in England, and went through speech therapy as a young child (due to dyspraxia); unsurprisingly, with that, you can imagine my accent is much closer to RP than any broad Fife accent, though most of my speech is definitely Standard Scottish English.

Well...St Andrews really isn't Fife ;)

A speech recognizer that has been trained on General American (which is likely the largest corpus we have) is analogous to a US native English speaker, so I wouldn't expect it to work on Scottish accents.

It would be interesting to see whether gathering a sufficiently large corpus of other dialects would solve the problem; it might also be uneconomical to gather a large enough corpus for some dialects, leaving minorities out.

Just the other day, I participated in a discussion about how "language identification" is a solved problem -- in fact, hasn't it been solved for a decade?

As anyone who's had to use langid in practice will testify, it's solved only as long as:

A) you want to identify 1 out of N languages (reality: a text can be in any language, outside your predefined set)

B) you assume the text is in exactly one language (reality: can be 0, can be multiple, as is common with globalized English phrases)

C) you don't need a measure of confidence (most algos give an all-or-nothing confidence score [0])

D) the text isn't too short (twitter), too noisy (repeated sections à la web pages, boilerplate), too regional/dialect, etc.

In other words, not solved at all.

In my experience, the same is true for any other ML task, once you want to use it in practice (as opposed to "write an article about").

The amount of work to get something actually working robustly is still non-trivial. In some respects, things have gotten worse over the past years due to a focus on cranking up the number of model parameters, at the expense of a decent error analysis and model interpretability.

[0] https://twitter.com/RadimRehurek/status/872280794152054784

I'm reminded of how often I see Twitter offer to translate English-language tweets containing a proper noun or two from absurdly unconnected languages. And that's with text containing mostly common and distinctive English words.

I'm pretty sure Twitter's langid uses character n-grams. You'll see a tweet that's plain English that happens to match n-gram statistics unusually well with some other language, which pushes the likelihood score to just above English. (I checked this by running an example or two through my own langid code.)

It shouldn't be hard to improve on this by treating the score on a tweet as Bayesian evidence to combine with a prior from preceding tweets.
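
A minimal Python sketch of that idea, assuming you already have per-language log-likelihoods from something like a character n-gram model (not implemented here) and a prior built from the author's earlier tweets. All names and numbers are made up for illustration.

```python
import math


def posterior(log_likelihoods, prior):
    """Combine per-language log-likelihoods with a prior via Bayes' rule.

    log_likelihoods: {lang: log P(text | lang)}, e.g. from a character
                     n-gram model (not implemented here).
    prior:           {lang: P(lang)}, e.g. the share of the author's
                     earlier tweets that were in each language.
    Returns {lang: P(lang | text)}, normalised to sum to 1.
    """
    log_post = {
        lang: ll + math.log(prior[lang])
        for lang, ll in log_likelihoods.items()
        if prior.get(lang, 0.0) > 0.0
    }
    # Subtract the max before exponentiating to avoid underflow.
    top = max(log_post.values())
    unnorm = {lang: math.exp(lp - top) for lang, lp in log_post.items()}
    z = sum(unnorm.values())
    return {lang: p / z for lang, p in unnorm.items()}


# Made-up numbers: the n-gram score slightly prefers Indonesian,
# but 98% of this account's earlier tweets were English.
scores = {"en": -42.0, "id": -41.5}
history = {"en": 0.98, "id": 0.02}
result = posterior(scores, history)
best = max(result, key=result.get)
```

With a flat prior the tweet would be tagged Indonesian; with the history prior, English wins comfortably. As a side effect you get a normalised probability instead of an all-or-nothing label, though it is only as trustworthy as the underlying likelihoods.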

In other words, we have some systems that are great at processing spherical cows in a vacuum...

Speech recognition for multi-lingual speakers is another pain point.

I live with a native French speaker, so my conversations naturally include a lot of French proper names, as well as sometimes switching languages mid conversation or even mid sentence.

Lots of recognition engines can handle English and French, but treat them as mutually exclusive. It frustrates me to no end when I know that Siri can recognize a French proper name just fine if I switch it modally into French, but will botch it horribly in English.

As a multi-lingual speaker I've even had problems with speech synthesis. For instance, Google Maps insists on narrating driving directions in the system language, which is set to English; this makes the narrated local non-English place names incomprehensible. If I succumb and change the system language I can understand the narrated place names, but in return I have to put up with the poor speech synthesis of non-English languages.

I was just in France last week suffering through the same problem. I’m American, so I want to keep my system set to English both due to familiarity of the interface and because I want data in familiar units (miles, degrees Fahrenheit), but it kills me when every street and place name is absolutely butchered.

An interesting observation I made last week was that the poor synthesis of French street names in English mode was worse than American-mispronouncing-French level. I wonder if that exposes some level of mismatch between how the synthesis engine models language and how humans do.

In the case of French names it probably depends on if you have any exposure to French pronunciation rules at all. I have a pretty crappy level of high school French but I can usually manage not to butcher pronunciation. I was with a friend in Montreal though and she had absolutely no idea how to pronounce many things.

Of course it's not solved. We don't even know how to define the problem.

Speech is[1] a fundamental component of Language, which is a fundamental component of Intelligence[2]. This is addressed somewhat in the conversation around semantic error rate; that there is more to processing raw audio speech than the calculus of mapping signals to tokens; some understanding of semantics and context is required to differentiate between otherwise indistinguishable surface forms.

I find it doubtful that there's a clean interface that separates the 'intelligent' parts of the brain from the 'language' parts of the brain from the 'speech' parts of the brain. This leakiness (or richness, really) means that you can't neatly solve any one part of this chain to the level of competence that the brain solves it. That means to 'solve' speech recognition, you have to 'solve' language, and thus 'solve' general intelligence. And to 'solve' general intelligence, you have to understand it, in a theoretical sense, which we don't. Indeed, it will likely involve solving all the other modalities of sensation as well. It's definitely the case that you need to have a model for prosody to understand speech. It is entirely possible that vision is a large factor as well, in the form of body language, lip reading, eye contact, and so on.

Speech recognition is quite good for what it is. For many practical applications, especially to do with young, white, newspeaker-accented English speakers who sound the most like the people who develop it, and the data sets used to develop it, it is good enough in the 80/20 sense. But it is nowhere near solved by even the least rigorous definition of the word.


[1] according to the philosophy I subscribe to, at any rate

[2] according to the philosophy I subscribe to, at any rate

Problem definition really is the issue. Even appealing to the Turing test is less than satisfying. I’m fairly sure I’m a human, but Siri and similar will likely outperform me in certain categories like place names and popular music.

Don't forget that being too good can also cause you to fail the Turing test. I would not expect a human to be able to answer some questions that may be trivial to a computer. Things like, "What's the square-root of 137?" Or, "Identify this obscure song within 5 seconds of listening from a random starting point."

I suppose that depends on the precise setup of the test. Is the subject (if they're a human) allowed access to a calculator or a computer with an internet connection? Even if they were, timing would be an obvious tell, but an AI could easily be programmed (or could learn) to introduce an appropriate delay.

The idea behind the Turing test isn't whether they can arrive at the same answer given the tools. The idea is whether a human can tell the difference. I would expect any human to answer with something like, "I don't know." Or, "Let me find my calculator..." Either answer would be a lie for a computational AI -- it would know the answer and not require a calculator.

This is, I think, one of the failings of the Turing test. It's easy enough for us to make new humans; making a machine that acts exactly like a human seems like a silly endeavor. I want a machine that can assist us and shore up our failings. Which means that we can necessarily differentiate it from another human. I vastly prefer that over a machine that has learned to lie to us.

> The idea is whether a human can tell the difference.

I know, that's why I'm saying the precise setup of the test is important. That's why, for instance, it's usually presented as a text messaging setup, because the goal is to test intelligence via language and conversation skills. A "face to face" setup wouldn't make much sense, unless we wanted to test that aspect of robotics.

Problem definition is not the issue. Contextual awareness is the issue.

Humans use different sub-languages to speak to friends, bosses, dates, lovers, spouses, teachers, students, and so on.

Each social context has its own vocabulary, its own set of expected conversation starter statements, its own set of likely responses, its own set of problems that may need to be solved - and so on.

A lot of what passes for intelligent interaction among humans is really just this social awareness.

Machines will fail the Turing Test without it. But you can't teach a machine to mimic social awareness by throwing 100,000 hours of speech samples at it.

Nor can you expect Echo/Siri/etc to know the context you're working in with no input from social cues - location, dress, time of day, social relationship, facial expression, etc.

So practical ASR turns into spoken-command-line-plus-guessing.

That turns out to be a pretty poor imitation of even the simplest social relationship.

It's not a useless imitation. Even today's limited voice assistants do a fair job at providing a useful service.

But the idea that you can drown neural networks in sample soup to train them, and build yourself a machine capable of intelligent-seeming conversation is just naive.

That's not how baby humans learn to hear and speak, and it's certainly not going to teach machines to converse at a human level.

Mistakes in understanding speech are common even among humans. My wife and I have to repeat and clarify ourselves fairly frequently in our day to day conversations. She even jokes at times that I must check my hearing, because I often mishear what she said, while she thought she was being perfectly clear.

I think where computers fall short is in two areas:

1) The rate of errors hasn’t hit the inflection point of being comparable to day-to-day intra-human interaction, and

2) There is no good mechanism for detecting and correcting errors. At least, none that I’ve seen.

That second one is important. When I hear my wife ask me “Please, hand me a tractor”, I realize that I must’ve misheard, and ask her to clarify “what?” With speech recognition, I either have to manually re-read and modify recognized text, or cancel and repeat the entire request. Both take time, and negate some of the efficiencies of using speech recognition.

One needs to know your wife to know that "hand me a tractor" is wrong. Perhaps she makes model farms as a hobby, perhaps you work in plant rental and that means "pass me the keys to one of the tractors", the possibilities are endless. You need vision, memory, profiling, etc., to even begin to properly contextualise day-to-day conversation.

Hm, I think I disagree with you and GP. The answer to both of you is Bayesian inference, isn't it? The prior is going to give very low weight to your domestic partner asking you to "hand them a tractor", despite the fact that it's not impossible that those are the correct words.

"Please, hand me a tractor" - Did she say tracker [1]?

Something like Rhymezone (searching for similar sounding words) would be a solution. It would still need a human who's deciding which word to choose, but after some time it could learn which words you prefer.

[1] https://www.rhymezone.com/r/rhyme.cgi?Word=tractor&typeofrhy...

Just the other day I was arguing (pleasantly) with someone here on HN saying that AI has solved speech rec (among other tasks).


Wish this article was written a few days earlier.

An analogue of this article exists for most other domains claimed to have been solved.

Sorry to bother you, but the translation of your comment is pretty good (English -> Spanish -> French -> English):

  No, it's not learning new classes of objects from a single image or a few images is very difficult. See
  The machine translation is a joke.
  Put comments on this page by translating Google into another language and go back to English and see what you get.
  I did a little part of you. Just human level for just a simple little prayer.
  > But even if you do not make this assumption, identifying the object involves spitting the distance from the performance of the human level.

source: https://news.ycombinator.com/item?id=15429862

Not to debunk your claims, just interesting to see how good translation works (even if it's not human-level, I can understand what you say).

I am not sure that is a good translation because it completely misses a common idiom: "within spitting distance"

"spitting the distance from the performance of the human level"

If this were spoken out loud, it might get mistaken for "splitting the distance" which seems to have a different meaning (half as good as humans).

But the point is subtle differences matter a lot in human communication.

But, I do agree with your point that in a lot of cases even without human-level performance one can get things done.

After reading the translated version again, I'm not sure I can support my statement anymore. I think a big part of my impression was based on the fact that I read the original comment and knew what you were talking about.

> I did a little part of you. Just human level for just a simple little prayer.

It's just bad. It sounds like it's based on an n-gram model - about something religious, strangely enough.

> Latency: With latency, I mean the time from when the user is done speaking to when the transcription is complete. ... While this may sound extreme, remember that producing the transcript is usually the first step in a series of expensive computations.

For many applications, making a transcription seems like an unnecessary step and source of errors. Skipping transcription when the user doesn't need it (most cases where I use it) would seem like a way to get some gain, but perhaps at the reduction of debuggability.

> For example in voice search the actual web-scale search has to be done after the speech recognition.

That's an area where literal exact transcription is usually required. But even then, Siri/Cortana/Alexa might be better off trying to figure out the meaning of what someone's asking rather than figuring out the exact words spoken in order to return the best results.

Most people are quite bad at formulating good internet searches without a lot of trial and error. Let Google listen to a person talk about the problem they have or issue and then formulate the best results for that instead of forcing us to come up with the exact right phrasing to get appropriate results. It would help tremendously with the synonym and homophone issues that are so annoying now.

> For many applications, making a transcription seems like an unnecessary step and source of errors. Skipping transcription when the user doesn't need it (most cases where I use it) would seem like a way to get some gain, but perhaps at the reduction of debuggability.

Agree. What's really needed is research into (and development supporting) how to combine the expertise from a speech recognition layer with the next layer in a machine learning process. That higher layer would contain the domain-specific knowledge needed for the problem at hand, while still leveraging a speech layer focused on a broad speech data set and speech-specific learning (from Google, Microsoft, the community, etc.).

Today, how richly can information be shared? I see with Google's speech API you can only share a very finite list of domain-specific expected vocabulary.

Why not have speech tools at least output sets of possible transcriptions with associated probabilities? Do any of the top tools allow this?

Then you could at least train your next level models with the knowledge of where ambiguity most exists, and what a couple of options might have been for certain words or phrases...
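
Here is a small Python sketch of what consuming such output could look like: the recognizer hands over several candidate transcripts with probabilities, and the next layer re-ranks them with domain knowledge the recognizer doesn't have. The `(transcript, probability)` format, the `rescore` function and the `boost` value are all hypothetical and not taken from any real speech API.

```python
def rescore(nbest, domain_terms, boost=3.0):
    """Re-rank an n-best list using domain knowledge the recognizer lacks.

    nbest:        list of (transcript, probability) pairs, as a
                  hypothetical recognizer might return them.
    domain_terms: words the downstream application expects to hear.
    boost:        multiplicative bonus per domain term found (arbitrary).
    Returns the list re-normalised and sorted, best first.
    """
    scored = []
    for text, prob in nbest:
        hits = sum(1 for w in text.lower().split() if w in domain_terms)
        scored.append((text, prob * boost ** hits))
    total = sum(score for _, score in scored)
    ranked = [(text, score / total) for text, score in scored]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up recognizer output for an application about speech technology.
nbest = [
    ("wreck a nice beach", 0.5),
    ("recognize speech", 0.4),
    ("recognise peach", 0.1),
]
ranked = rescore(nbest, {"recognize", "speech"})
```

The recognizer's top guess loses to its second guess once the application's vocabulary is taken into account. A trained downstream model could replace the crude keyword boost; the point is only that it needs the alternatives and their probabilities to work with.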

Speech recognition, beyond transcribing phonemes and matching them against a NN of possible words, is not solved, because speech is highly integrated with the human context: who is speaking, to whom, where the speech is happening, why it was initiated, and so on.

My kids needed 5-6 years of continuous daily talking before I could say they understood a conversation almost completely. Every single word or phrase I directed at them was spoken in a certain context and had a certain role in communicating with them when it wasn't clear from the context what my intentions were. Of course, they had their fun, throwing words away, endlessly repeating some funny word or phrase, or mangling them terribly. It is interesting that you as an adult need to learn their pronunciation too; at one point I even wrote a small dictionary.

Again, speech recognition will simply stay at recognizing phonemes/words for a very long time, until we have a true AI assistant that walks with us, sees with us, and hears with us at the same time. Then it can apply some semantic and other context-based NN.

As a native Portuguese speaker forced to use foreign languages to talk to devices, I can say it is not solved at all.

Even English doesn't work the way it should. Remember Apple had to rush out a patch shortly after Siri was released because it couldn't understand Australians.

Siri still doesn't understand the majority of my Minnesota relatives.

I would argue that it can't be solved without severe privacy implications. Apple's talk-to-text, for example, should know my son's name by now, but it doesn't. And while I'm mildly frustrated at the fact that I have to go back and edit text messages on a regular basis, I'm pretty glad that Apple doesn't know my son's name. I'd hate for a company like Facebook to have access to all of the proper nouns in my everyday life.

You can teach your iPhone to know your son's name:


You'd need to create a contact if he doesn't already have one.

As a deaf person I am using Ava [0] which uses IBM's speech to text service [1] as its backend AFAIK. I am always impressed by how it picks up on context clues to make corrections in realtime and capitalizing proper nouns (Incredible Pizza for example). However, it does not work with multiple speakers on a single microphone.

[0] https://www.ava.me/

[1] https://www.ibm.com/watson/services/speech-to-text/

I am a native English speaker from California. It is hard to get my Google Personal Assistant to understand the difference between desert and dessert. It's also hard to request music by an artist with a name similar to another artist's.

"Hey Google, play Mika radio" has a 50/50 chance of starting music by Meiko. The additional "problem" is that I like Meiko, too, but I feel obligated to cancel the Meiko music & re-request Mika so that Google (hopefully) learns to recognize the difference between Mika and Meiko.

Maybe our spoken language will start to transform into distinctly unique sounds so we can verbally interact with computers with relative ease. When I was in the US Army, I was trained to speak in a certain manner to help my communication to be more clear. [1] I don't see a reason humans and computers can't each make reasonable compromises to make verbal communication easier.

[1] https://en.wikipedia.org/wiki/Voice_procedure

I'm constantly surprised by the poor contextual quality of speech recognition. I think the basic audio recognition does well, but when there is ambiguity, it seems like systems that are popularly considered high-performing degrade drastically. For instance, I'm using Dragon NaturallySpeaking to dictate this right now, but if I say a certain punctuation mark at the end of a sentence, half the time it's going to say excavation mark!

Ditto with Google's Google Now assistant, or whatever the heck it's called these days. I have a Pixel 2 phone (Dragon heard "pixel to phone" -- it doesn't have up-to-date context on proper nouns in the news), but when I tried to create a calendar event using "Create calendar event... meet Bruno for pizza", it heard "MIT pronoun for pizza". It has hundreds of samples of my voice, and it already knew I was creating an event! "Meet" has to be one of the most common first words used in events.

It seems to me like there is pretty low hanging fruit, and that we need more focus on flexibility and resourcefulness rather than acting as though we're moving from 99.5% accuracy to 99.6%.

One thing I see missing is "conversational smarts".

People mishear a certain fraction of what they hear and they'll ask you to clarify what you said.

You could have superhuman performance at tasks like Switchboard and still have something embarrassingly bad in the field because the real task (having a conversation) is not trained for in the competitions with canned training sets.

I remember using the Merlin Microsoft Agent to create speech-based "apps" in 2001, including text-to-speech as well as speech recognition. I made an SAT preparation app that would test me by presenting a word, and I had to say the meaning. I also made a Spanish learning app using Merlin, where it would read out the Spanish word (in Spanish!) and I would have to say the English one. It worked really well. It's been 16 years since then, and I have to say that I had expected this area to have been completely "solved" by now.

I'm part of the stenographer/captioner community as a hobby and you'd be surprised at how many people think that all captions are automatically computer-generated. In reality the practical limitations of autogenerated captions are demonstrated by YouTube's caption system. It's good but even a mistake every other sentence (95%+ accuracy) can completely obfuscate the meaning behind the video.

I've seen some pretty terrible human produced captions on television shows as well.

Big time. There are various levels of caption quality and part of it is the tools used.

1. QWERTY transcriptionists working brief, 5-10 minute shifts typing as fast as they can and rotating.

2. Voice-mask reporters using Dragon with a voice mask (more commonly phonetic typos)

3. Stenographers using a steno machine; some of them were not trained to do realtime and lack the ability to edit as they go.

It's all up to the individual, unfortunately. Furthermore, the budget for live captions is sometimes smaller than it should be and so the service purchased is the cheapest, not the best.

I pay attention to captions now. I've watched TSN (Canada) and it seems that past 10 p.m. they put in someone who isn't fully trained or graduated from school and they most definitely shouldn't be providing captions as they are near unreadable.

Good captions are priceless for the access that they offer.

A big thing for me is the difference between offline and online speech recognition. This really ties in with the last point, advancements need to be efficient in order to be used.

I'm not sure how much work it would take to scale down Apple's voice recognition to run on the device or if it is feasible with their model, but currently it can take 5-10 seconds longer to get an answer from Siri during peak times.

> It’s the only way to go from ASR which works for some people, most of the time to ASR which works for all people, all of the time.

All people all the time doesn't remotely work for people, either. That's why the Air Force, for example, uses the Phonetic Alphabet, and aviation in general uses special jargon that is harder to misinterpret, such as "affirmative" for "yes".


Radio DJs and actors tend to enunciate more clearly than regular people do. You'll notice this more if you speed up the sound - the professionals are still clear at higher speeds where non-professionals become unintelligible.

While understanding “any” speech is a great long-term goal, I wish they’d allow these systems to be a little dumber when I want them to.

One example: I often give the same commands to Alexa every day, even at similar times. Yet every few days it just utterly misunderstands those commands, picking words I have never even used before. It doesn’t even offer to choose similar commands I’ve used recently. Worse, the misunderstood command might trigger a paragraph of senseless babble about the misunderstood command, forcing me to shout over Alexa to make it hear the command I really wanted.

So please, please, add some “dumb” options. I want to pick some words, have the system learn them, and just obey commands 90% of the time.
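
A minimal sketch of what such a "dumb" mode could look like, assuming the recognizer hands back a plain text transcript: snap whatever was heard to the closest entry in a short, user-chosen command list, and do nothing if nothing is close. The command list, the `match_command` helper and the 0.6 cutoff are all made up for illustration; this is not anything Alexa actually exposes.

```python
import difflib

# Hypothetical commands the user has chosen to "teach" the system.
KNOWN_COMMANDS = [
    "turn on the kitchen lights",
    "turn off the kitchen lights",
    "play the morning news",
    "set a timer for ten minutes",
]

def match_command(transcript, commands=KNOWN_COMMANDS, cutoff=0.6):
    """Return the known command closest to the transcript, or None."""
    matches = difflib.get_close_matches(
        transcript.lower().strip(), commands, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

# A slightly misheard command still lands on the intended one.
print(match_command("turn on the chicken lights"))
# Something outside the list is rejected instead of guessed at.
print(match_command("what is the capital of France"))
```

The trade-off is deliberate: the system understands far less, but a mishearing can only ever resolve to something the user actually asked for before.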

We recently did a comparative analysis of cloud speech-to-text providers for a project. We looked at:

1. Google Cloud Speech API

2. Microsoft Bing Speech API

3. IBM Watson Speech to Text

The ranking was as listed above but we had real challenges working with call-center audio recordings. The quality was less than ideal but still very clear. We saw a huge reduction in accuracy compared to in-browser testing. Additionally Australian English is particularly not solved.

Because Google's API isn't currently doing speaker-detection, we looked at using Watson's speaker-detection as a secondary step but found it too complex and error prone. There is definitely room for a startup in this area and it also needs continued investment from the bigger cloud providers.

For Watson Speech to Text - Did you choose the correct model to match your source audio quality? They default to a "Broadband" model intended for high quality audio sources, but you can also select "Narrowband" for things like phone quality. Not guaranteeing a difference, but in my experience, matching the source quality to the correct model makes some difference.

I've not compared them extensively but for streaming realtime, I found that Watson beat the Google api for a specific use-case. Your mileage may vary!

They also provide a handy Mic / File reader interface for browsers: https://github.com/watson-developer-cloud/speech-javascript-...

Here are some benchmarks on telephone speech, including both APIs and human transcription services:


Google actually did pretty badly for us on extended telephone speech. Not sure why.

It seems to me that much of this can be explained by one of the fundamental principles of machine learning:

Training and test samples must be iid (ie independently sampled from the same distribution) to get good performance. Otherwise there is bias.

This applies not only to speakers as individuals, but to all the other factors mentioned, like context, bg noise, gender, age, accent, education, mike, emotion, etc etc.

Many of the issues described can be traced down to this principle.

This is why, possibly, the core issue is data.

Although models and *PUs matter too, the theoretical performance boundary depends on data quality only (with a model to match).
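
A toy illustration of the iid point, with entirely invented numbers: a one-feature classifier is fitted on "clean" samples, then scored on test data from the same distribution and on test data where every value is shifted by a constant offset (think of a different microphone or background noise). The feature, the offset and the helper names are assumptions for the sketch, not a model of any real recognizer.

```python
def fit_threshold(samples):
    """Place the decision boundary midway between the two class means."""
    negatives = [value for value, label in samples if not label]
    positives = [value for value, label in samples if label]
    mean_neg = sum(negatives) / len(negatives)
    mean_pos = sum(positives) / len(positives)
    return (mean_neg + mean_pos) / 2

def accuracy(threshold, samples):
    """Fraction of samples whose side of the threshold matches the label."""
    correct = sum((value > threshold) == label for value, label in samples)
    return correct / len(samples)

# (feature value, label) pairs; the training data is "clean".
train = [(0.8, False), (1.0, False), (1.2, False),
         (2.8, True), (3.0, True), (3.2, True)]
threshold = fit_threshold(train)

# Test data drawn the same way as the training data.
same_distribution = [(0.9, False), (1.1, False), (2.9, True), (3.1, True)]
# The same test data with a constant offset added to every value.
shifted = [(value + 1.5, label) for value, label in same_distribution]

print(accuracy(threshold, same_distribution))
print(accuracy(threshold, shifted))
```

Nothing about the model changed between the two scores; only the test distribution moved away from the training one.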

Then there is Chinese, in which the same "sound" can map to many different words. And we constantly make up new short forms or words, a lot more so than in English, which means recognizing these is much harder. (Then there is Cantonese...)

Sometimes I wonder if apps could add vocabulary to the device's dictionary, so that certain terms in gaming, fashion, tech or other domains, which are rarely used in normal conversation, would be recognized in speech-to-text as well as autocorrect.

Do you guys think “understanding” needs more than what we have right now in terms of AI? Does understanding require intelligence that is self-aware? I mean, when we talk to each other, we understand through spoken words and little expressions, even when we talk over the phone. I feel like without the ability to process emotion and without self-aware AI we won't really achieve “full understanding”, only a simulated one.

Can this be attributed to lack of good challenging open and large enough dataset that can be used as benchmark?

The Switchboard dataset was "challenging" in its time but pointless by current standards. One doesn't have to look beyond trying to do voice dictation in a car to see how horrible current speech recognition systems are compared to humans.

There are the CHiME challenges (http://spandh.dcs.shef.ac.uk/chime_challenge/) that have test data with distant microphone speech and noisy speech data, and IARPA ran its Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge in 2014/15 (https://www.iarpa.gov/index.php/working-with-iarpa/prize-cha...).

Technically yes. But in the realm of application, it is mature enough that people can use some drop-in solution, without too much tuning, and get a passable result.

It is like saying web applications are not solved. Surely, they are not. But the remaining part falls into the arts category, while the hard barrier of the technology itself is no longer there.

Don't forget speech of children and the elderly, which are also difficult. I had to laugh a little at Indian accent recognition being so low in humans. I like Indian people but I've always hated the accent because it's so difficult to understand.

In the caption for figure 2 there is this quote, "Notice the humans are worse at transcribing the non-American accents." But when I look at the bar graph, I see the word error rates of the model are much higher than the word error rates of humans for most of the non-American accents. Am I missing something?
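
For anyone unsure what the bars measure: word error rate is usually taken to be the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the system output, divided by the number of reference words. A rough sketch, assuming simple whitespace tokenization and lowercasing (real evaluations normalize text more carefully); the function name is just illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# The calendar example from upthread: two of four words are wrong.
print(word_error_rate("meet bruno for pizza", "mit pronoun for pizza"))  # 0.5
```

Note that every word counts the same, so a system can post a low error rate and still get the one word that mattered wrong.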

Too many people equate speech to text with 'speech recognition' and it clearly isn't. What is not a solved problem is Natural Language Processing. That NLP isn't solved is a fertile source of papers. What is 'kinda' solved is the voice CLI, that is where speech is converted to text, that text is inserted in a classic command interpreter, and if recognized the command is executed, otherwise it isn't.

Nobody expects to be able to type 'find me the files that changed yesterday' into a computer shell and have it do something sensible. But 'find ./ -type f -mtime -1' sure we can expect that to work. Getting from the first one to the second one requires something called an ontology.

I worked on a team at IBM that was building this sort of capability. The plan was to take an excellent toolkit for disassembling a sentence into its component parts (verbs, nouns, adjectives, etc.) and then match the elements recognized against an ontology associated with that action. 'find' being the verb would select the search action ontology, 'files' as an object of the verb would select the computer files object ontology, and then 'changed' would match 'modification' in that ontology and 'yesterday' would match one of the two sub-children of 'modification' (time and change-actor). Then you would come back with the equivalent of a compiler's abstract syntax tree, which the 'command generator' would use to emit the necessary command actions and flags.
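
To make the shape of that pipeline concrete, here is a deliberately tiny sketch. It is not the IBM system: plain keyword lookup stands in for the sentence-disassembly toolkit, every table entry is invented, and `-mtime -1` (modified within the last 24 hours) is only a rough stand-in for "yesterday".

```python
# Toy ontology; every entry here is invented for illustration.
ACTIONS = {"find": "search", "show": "search", "list": "search"}
OBJECTS = {"files": ["-type", "f"]}
ATTRIBUTES = {"changed": "modification", "modified": "modification"}
# Rough mappings from time words to find(1) flags.
TIME_VALUES = {"yesterday": ["-mtime", "-1"], "week": ["-mtime", "-7"]}

def parse(sentence):
    """Build a small tree of action / object / attribute / time."""
    tree = {"action": None, "object": None, "attribute": None, "time": None}
    for word in sentence.lower().replace(",", " ").split():
        if word in ACTIONS and tree["action"] is None:
            tree["action"] = ACTIONS[word]
        elif word in OBJECTS:
            tree["object"] = word
        elif word in ATTRIBUTES:
            tree["attribute"] = ATTRIBUTES[word]
        elif word in TIME_VALUES:
            tree["time"] = word
    return tree

def generate_command(tree, root="."):
    """Emit an argument list for find(1) from the parsed tree."""
    if tree["action"] != "search" or tree["object"] is None:
        raise ValueError("sentence not covered by the toy ontology")
    command = ["find", root] + OBJECTS[tree["object"]]
    if tree["attribute"] == "modification" and tree["time"] is not None:
        command += TIME_VALUES[tree["time"]]
    return command

tree = parse("find me the files that changed yesterday")
print(" ".join(generate_command(tree)))  # find . -type f -mtime -1
```

The hard part is everything this sketch skips: real parsing, and an ontology broad enough that sentences outside the table do not simply fall through.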

I have no idea why they cancelled it. It was an inscrutable place to work at in many ways :-).

I fully expect to be able to type (or say!) "find me the files that changed yesterday, sort by modification time, and display the first five images in that set".

The biggest failing of things like Siri and Cortana is they're not conversational, they're really not good at learning from example, they can only respond in hard-wired ways. "What's the weather?" will work. "Do I need a jacket tomorrow?" won't because they don't understand, instead giving the closest answer they can based on available information, not knowing how you prefer to dress or what you think is cold.

Until we can have full feedback neural networks we're not fully capitalizing on this AI stuff. Once we can cut the retraining time down to something marginal, maybe we can have computers "dream" and self-reprocess based on their accumulated learnings instead of waiting for a new kernel from the cloud.

"Speech Recognition" is an accepted term of art for "speech to text" or "transcription." "Speech Understanding" is a related but different problem.

I'd challenge you to transcribe a casual conversation without understanding. Too many words sound very similar, if not identical, and the context of the conversation dictates which word is in play.

Even something simple like "They're unhappy, Ness" could be interpreted as "Their unhappiness" unless you know Ness is a person in the conversation.

That's like saying because "O" and "0" have the same shape that we can't solve OCR without general AI.

A modern statistical speech recognition system has no trouble determining that "they're unhappy, ness" is a dramatically less likely word sequence than "their unhappiness".

edit: I read your example backwards, but still, a statistical system can easily incorporate contextual words without actually understanding what they mean. Names from the speaker's contacts in particular are widely used in ASR systems for this reason.
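
A rough sketch of what I mean, with invented word counts and a plain unigram model standing in for the much richer n-gram or neural language models real systems use. The only "context" is a flat score bonus for words found in a hypothetical contact list; the counts, the bonus of 7.0 and the function names are all assumptions.

```python
import math

# Invented counts standing in for a model trained on general text.
UNIGRAM_COUNTS = {
    "their": 500, "unhappiness": 20,
    "they're": 400, "unhappy": 60, "ness": 1,
}
TOTAL = sum(UNIGRAM_COUNTS.values())

def score(words, contacts=(), boost=7.0):
    """Sum of log-probabilities, plus a bonus for each contact name."""
    total = 0.0
    for word in words:
        total += math.log(UNIGRAM_COUNTS.get(word, 1) / TOTAL)
        if word in contacts:
            total += boost
    return total

def best_transcript(candidates, contacts=()):
    """Pick the candidate word sequence with the highest score."""
    return max(candidates, key=lambda c: score(c.split(), contacts))

candidates = ["their unhappiness", "they're unhappy ness"]
# With no context, the statistically common phrasing wins.
print(best_transcript(candidates))
# With "ness" in the contact list, the other reading wins.
print(best_transcript(candidates, contacts={"ness"}))
```

The model never learns who Ness is; it only learns that the token became more likely for this speaker.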

That's because it doesn't care, it just goes for the most statistically probable phrasing in a general conversation, not the one you're actually having.

As for the OCR problem, try writing one for Chinese calligraphy and get back to me on if context is important or not.

Context is certainly extremely important, but a model can incorporate context without general understanding.

> Nobody expects to be able to type 'find me the files that changed yesterday'

Google Photos is pretty close.

"photos from december" works fine. "Find photos from decemeber" doesn't.

I can type in "effects photos" and find photos Google applied effects to, which is pretty nifty.

It is probably equivalent to the old game parsers in intelligence, but for a given domain it works pretty well.

They could stand to filter out a lot more words, but it works OK in general. If I type in "photos with a blue sky" it knows to drop "photos with a", photos presumably being redundant on photos.google.com!

The query parser we had at Blekko could do this as well. It could process 'photos' and 'december' as keywords and use them in a search. It would fail on 'photos last month' for the reasons I mentioned (it didn't really know december was a date; it just looked for it to show up in the description or metadata).

That said, Google has done some great stuff in inferring things like units for conversion: '10 feet in inches', for example, is easily parsed as a conversion request.
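
The narrow version of that is pattern matching rather than understanding. A minimal sketch, assuming a fixed "<number> <unit> in|to <unit>" shape and a tiny hand-written unit table; the regex, table and function name are illustrative and say nothing about how Google actually does it.

```python
import re

# Inches per unit; a deliberately tiny, hand-written table.
INCHES_PER_UNIT = {"inches": 1.0, "feet": 12.0, "yards": 36.0}

# Matches queries shaped like "<number> <unit> in <unit>".
QUERY = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s+(\w+)\s+(?:in|to)\s+(\w+)\s*$",
    re.IGNORECASE,
)

def parse_conversion(query):
    """Return the converted amount, or None if this is not a conversion."""
    match = QUERY.match(query)
    if match is None:
        return None
    amount = float(match.group(1))
    source = match.group(2).lower()
    target = match.group(3).lower()
    if source not in INCHES_PER_UNIT or target not in INCHES_PER_UNIT:
        return None
    return amount * INCHES_PER_UNIT[source] / INCHES_PER_UNIT[target]

print(parse_conversion("10 feet in inches"))   # 120.0
print(parse_conversion("photos last month"))   # None
```

Anything outside that shape falls through to ordinary keyword search, which is exactly the 'photos last month' failure above.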

> I have no idea why they cancelled it.

My guess: because statistical methods are all the rage now, and have been for a while. This is famously illustrated by Fred Jelinek's quote "every time I fire a linguist, the performance of the speech recognizer goes up"[1], and cemented with the success of Google Translate.

[1] https://en.m.wikipedia.org/wiki/Frederick_Jelinek

Speech to text is exactly speech recognition. Speech to text isn't solved.

It's pretty good right now for English speakers. At least usable.

We've tried https://www.btwtalk.com for transcribing our internal calls, and while it does mess up a bit, it's good enough for later reference.

I would also like to confirm that speech recognition in languages other than English is completely not solved.

In Italian, for instance, any device will barely be able to recognize what you're saying, and will almost always do something completely different from what you've asked.

Did you say ... "speed trek ignition involved?"

If yes, say "yes" or press 1, otherwise say "no" or press 2 to speak your request again, star to hear the options, or stay on the line for the next available agent ...

[Slightly off-topic] And what about people with minor to major stuttering troubles, like me? It is so frustrating to tell Alexa, Siri or Google (even Cortana) what I want because the timeout is so short.

Speech Recognition (STT) used to be completely monopolized by Nuance. Slowly the gates are opening (thanks to Google and DL), but I guess it is still a minefield of patents.

Siri reminds me of that every time I attempt to use it.

Baffled every time that Siri has actually been shipped by Apple to be used in production and not as a pre-pre-pre-pre-alpha.

Seriously, who thought it was solved? Anyone who has used any voice interaction can tell that it's just not there yet.

I suspect we (humans) are very bad at hearing, and computers are already doing a much better job than us. However, I suspect that for most of our communication we do not actually hear the conversation -- we guess the conversation. Only when evidence such as a facial response or out-of-context words is caught do we actually try to hear. Even then, we are still trying to second guess.

Computers cannot compete with our guessing ability, not until they are trained with our life experiences.

We are exceptionally good at hearing. Our ears are truly exquisitely precise things. Where "hearing" ends and "understanding" begins is, shall we say, a point of contention.

My vote is for the auditory nerve as the inflection point. Anything above the cochlear nucleus on the auditory pathway is black voodoo magic and anyone who claims they understand it needs to reevaluate their kool-aid intake.

Speech recognition is such a gimmick. In practice the applications are quite narrow. The truth is, most people with properly functioning social faculties feel like complete dicks when talking to a machine.

I heard even voice synthesis, ie text to speech, is not solved yet.


Also, water is wet and rocks are hard.

Would you please stop posting unsubstantive and/or snarky comments?


I think that some speech recognition systems depend too much on language model priors. This works well for routine tasks where most of the speaker's words are easy to predict. It fails when the speaker's words are specific or unusual.

For example, try speaking the words: "OK Google, the thick round box jumped over the hazy bog."
