
Whither Speech Recognition? (1969) [pdf] - apengwin
https://pdfs.semanticscholar.org/0155/01c4d26a92993332ada795e27b126ae3028a.pdf
======
lqet
This quote by William James from 1899 struck me as important:

> How little we actually hear, when we listen to speech, we realize when we go
> to a foreign theatre - for there what troubles us is not so much that we
> cannot understand what the actors say as that we cannot hear their words.

As a non-native speaker of English, I started watching English movies with
subtitles when I was a teenager. This had an interesting effect: after a few
years of doing this, I am now used to knowing every single word spoken in
a movie - after all, each word is clearly printed on the screen.

I now get nervous watching movies in my native language (German) without
subtitles, simply because I am not able to extract each word precisely.
Somehow I trained myself to expect an exact "acoustic" understanding from
movies, as opposed to a "semantic" understanding. It is incredible how
the human brain is able to extract the meaning of a spoken sentence from
context, facial expressions, and gestures, even if we only understand
half of the sentence acoustically.

------
abecedarius
This note from 1969 implies that speech-recognition researchers were hardly
more than charlatans.
http://www.dragon-medical-transcription.com/history_speech_recognition.html
says that the founders of
Dragon Systems (iirc the first successful speech-recognition company) started
in 1970. (Though they didn't start the company until 1982.)

So in retrospect I'd guess the level of funding at the time was closer to
right than this critique suggests, even if most of the work was flimflam.
wrote a popular book about information theory which I liked, so this is
disappointing.)

------
taneq
Speech recognition is great (when it works) for when you need hands-free
control of a machine. The big issue with it, though, is that no one seems
to want to publish any reference for exactly which commands you can use.
It's not a natural language interface (and I'd argue that we don't yet
have the technical capability for a real natural language interface), so
you're left with the equivalent of a command line without any way to
discover commands. Touchscreen gesture interfaces have the same problem:
implementers are too busy trying to maintain the illusion that it's
"intuitive" to actually explain the secret handshakes you're meant to use
with them.

------
melling
They claimed 95% correctness 50 years ago.

When am I going to wake up and be able to dictate into my phone and make
corrections with my voice? We must be close.

I don’t mind the mistakes made when dictating, but having to pull up the
keyboard takes away from the “magic”.

~~~
lern_too_spel
95% on a tiny vocabulary (like the digits) spoken clearly and distinctly by a
single speaker. The author's point was that this is very nearly useless.

Dictating with corrections is a UI issue. I already dictate messages to my
phone running Android Auto, and it confirms the entire message. To make
corrections, I have to say the entire message again. It would be better if I
could just restate the part that was heard wrong and let the application
figure out which part to replace, and this is entirely doable today with the
available building blocks. Just don't expect a Google product manager to
figure that out.
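
A minimal sketch of that replacement step, assuming the recognizer hands
the application plain text for both the original utterance and the spoken
correction (the function and example strings below are hypothetical, not
any real Android Auto API): slide a window over the original transcript
and splice the correction in wherever it matches best.

    # Sketch only: word-level matching with difflib; a real system would
    # work on recognition lattices and confidence scores, not final text.
    from difflib import SequenceMatcher

    def apply_correction(transcript, correction):
        """Replace the span of `transcript` best matching `correction`."""
        words, fix = transcript.split(), correction.split()
        best_score, best_span = 0.0, (0, len(words))
        # Try windows roughly the size of the correction; the best-scoring
        # window is our guess at the part that was misheard.
        for size in range(max(1, len(fix) - 2), len(fix) + 3):
            for start in range(len(words) - size + 1):
                window = words[start:start + size]
                score = SequenceMatcher(None, window, fix).ratio()
                if score > best_score:
                    best_score, best_span = score, (start, start + size)
        start, end = best_span
        return " ".join(words[:start] + fix + words[end:])

    print(apply_correction(
        "pick up the dry cleaning at sign tomorrow",  # misrecognized
        "at nine tomorrow"))                          # restated part
    # -> pick up the dry cleaning at nine tomorrow

Anchoring on the surrounding words ("at ... tomorrow") is what lets the
replacement land even though the misheard word itself never matches.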

------
hprotagonist
We can do “sound pressure wave to phoneme stream” pretty darn well, and
in a way that generalizes to anything you have a phonetic mapping for
(cf. DeepSpeech, etc.).

Going from a stream of phonemes (“mmaaaayyynaammeezzzbahhhb”) to a stream
of words (“my name is bob”) is vastly more limited.

One of them is speech recognition; the other is language modeling.
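
A toy sketch of that second step, assuming a tiny hand-written
pronunciation lexicon (the entries and the decode function below are
illustrative, not any real system's API): segment the phoneme sequence
into known words by backtracking. Real decoders score the many possible
segmentations with a language model instead of taking the first parse
that works.

    # Hypothetical mini-lexicon: ARPAbet-style phoneme strings -> words.
    LEXICON = {
        "m ay": "my",
        "n ey m": "name",
        "ih z": "is",
        "b aa b": "bob",
    }

    def decode(phonemes):
        """Segment a phoneme list into lexicon words via backtracking."""
        if not phonemes:
            return []
        # Try every prefix that is a known pronunciation, longest first.
        for cut in range(len(phonemes), 0, -1):
            key = " ".join(phonemes[:cut])
            if key in LEXICON:
                rest = decode(phonemes[cut:])
                if rest is not None:
                    return [LEXICON[key]] + rest
        return None  # no segmentation covers the input

    print(decode("m ay n ey m ih z b aa b".split()))
    # -> ['my', 'name', 'is', 'bob']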

Language _comprehension_? From experience, it’s cheaper and faster to get
results by making a new human :)

