
Hey Siri: An On-Device DNN-Powered Voice Trigger for Apple’s Personal Assistant - gok
https://machinelearning.apple.com/2017/10/01/hey-siri.html
======
kejaed
This is interesting, and a feature I didn't know about. Hey Siri often fails
to trigger for me in the car while driving; now I know to retry, and I'll
have a better chance of triggering it.

"We compare the score with a threshold to decide whether to activate Siri. In
fact the threshold is not a fixed value. We built in some flexibility to make
it easier to activate Siri in difficult conditions while not significantly
increasing the number of false activations. There is a primary, or normal
threshold, and a lower threshold that does not normally trigger Siri. If the
score exceeds the lower threshold but not the upper threshold, then it may be
that we missed a genuine “Hey Siri” event. When the score is in this range,
the system enters a more sensitive state for a few seconds, so that if the
user repeats the phrase, even without making more effort, then Siri triggers.
This second-chance mechanism improves the usability of the system
significantly, without increasing the false alarm rate too much because it is
only in this extra-sensitive state for a short time."
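
The two-threshold logic quoted above is easy to sketch. A minimal illustration in Python; the threshold values and the length of the sensitive window are made-up placeholders, since the post doesn't give Apple's actual numbers:

```python
import time

class SecondChanceDetector:
    """Sketch of the dual-threshold trigger described in the post.

    The thresholds and window length here are illustrative assumptions,
    not Apple's parameters.
    """

    def __init__(self, primary=0.9, lower=0.6, window_s=4.0):
        self.primary = primary        # normal trigger threshold
        self.lower = lower            # "maybe we missed it" threshold
        self.window_s = window_s      # duration of the sensitive state
        self.sensitive_until = 0.0    # time at which sensitivity expires

    def update(self, score, now=None):
        now = time.monotonic() if now is None else now
        # While in the sensitive state, the lower threshold suffices.
        threshold = self.lower if now < self.sensitive_until else self.primary
        if score >= threshold:
            self.sensitive_until = 0.0   # reset after a trigger
            return True
        if score >= self.lower:
            # Near miss: relax the threshold for a few seconds in case
            # the user repeats the phrase.
            self.sensitive_until = now + self.window_s
        return False
```

A near miss arms the second chance; a repeat within the window then triggers at the lower threshold, and the extra sensitivity simply expires otherwise.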

~~~
dannyw
This is brilliant. Apple could also cut out the “hey” part to make it more
natural.

“hey Siri... Siri”

~~~
crooked-v
"Hey Siri" has always seemed like such an awkward phrase to me. "Alexa" flows
so much better, besides the general awkwardness if someone in the room has
that name.

~~~
mikehines
Alexa is actually a difficult word to pronounce for non-English speakers.

~~~
alexsb92
Really? Are you referring to non-European languages? Many European languages
have their own versions of the name Alexander for both men and women, so I'd
imagine Alexa wouldn't cause too many issues.

~~~
throwanem
The 'x' in 'Alexander' is (typically) voiced, where that in 'Alexa' is not. I
can see that making a substantial difference for speakers of languages where
those phonemes are differently composed.

~~~
arghwhat
What do you mean by voiced? I put the stress differently, but otherwise I
would pronounce Alexa as Alexander cut short. I can imagine the vowels being
pronounced differently in various accents, but I can't imagine an accent where
the 'x' in those two names is pronounced differently.

~~~
throwanem
"Voiced" in the phonetic sense [1], i.e., spoken with the vocal cords
vibrating. Voiced 'x' sounds like the /gz/ in "eggs", /ɛgz/; voiceless 'x'
sounds like the /ks/ in the American English pronunciation of 'x' itself,
/ɛks/.

Many, if not all, American English dialects pronounce the words in the fashion
I describe. In them, the name 'Alexander' would be

    /ˌæ.lɛˈ(gz)æn.dər/

while 'Alexa' would be

    /əˈlɛ.(ks)ə/

In both, the phoneme corresponding to the letter 'x' is parenthesized.

Generally in English 'x' is voiced when it precedes a stressed vowel, which it
does in 'Alexander'; in 'Alexa', 'x' precedes a reduced vowel, and therefore
would always take the unvoiced pronunciation. (It'd sound very odd to an
anglophone ear otherwise - say /əˈlɛ.gzə/ one time out loud and see if you
don't feel the same.)

That said, it wouldn't be incorrect to pronounce 'Alexander' in American
English with an unvoiced 'x', as

    /ˌæ.lɛˈksæn.dər/

but, while I believe some dialects of English may default to this
pronunciation, certainly not all do. (Neither of the dialects I speak does so,
at the very least.) This pronunciation also produces a "hitch" or break in the
word between the unvoiced 'x' and its preceding vowel, which would tend to
make it a little odd both to hear and to say.

[1] https://en.wikipedia.org/wiki/Voice_(phonetics)

------
RKearney
I own multiple iOS devices, a few Echos, and a Google Home. One of the things
I noticed after getting an Echo was how much more fluid and simple it was
invoking the "assistant". Simply asking, "Alexa, what's the weather today"
just seemed so much more natural than having to prefix everything with "Hey"
or "Okay".

Using Siri or Google Assistant for more than one question at a time quickly
makes me feel like I'm going insane. "Hey Siri.. Hey Siri.. Hey... Hey..."

I'm hoping Google and Apple fix these subtle annoyances. Or maybe it's just
me.

~~~
danso
Isn't part of the point of the prefix to avoid collisions with sounds that are
part of everyday speech not intended for the assistant? "Alexa" becomes
problematic when an Echo is used in an office or home where someone is named
Alexa -- among the top 100 most popular female baby names since 1995 [0] --
which is why the wake word can be changed to "Echo" or "Amazon" or "Computer".

But the sound of the wake word, whether it's just "Alexa" or "Hey Siri" vs
"Siri", doesn't seem to address the main issue of your complaint, which to me
is how limited "conversation" is with the assistant.

If you ask Alexa for the weather, you'll still have to say her name for any
followup questions within that immediate context, i.e. "Alexa, what's the
weather today? Alexa, what's the weather this weekend?".

Though there are a few functional exceptions in which Alexa will prompt you
for additional information without needing to be re-awakened, e.g.

You: "Alexa, set my alarm for 6 o'clock"

Alexa: "Is that 6 o'clock in the morning, or in the evening?"

[0] https://www.ssa.gov/OACT/babynames/index.html

~~~
Darthy
I always felt that Siri was named with the intent that it could be used as a
wake word (so you could say "Siri what's the weather today?") because Siri
itself is a rare given name and [si ri] are sounds rarely said at the
beginning of a sentence.

And then at some point Apple realized they had to make a longer wake word to
cut down the number of false positives ("Siri" -> "Hey Siri", from 2 syllables
to 3).

Google probably went through the same process ("Google" -> "Okay Google", from
2 syllables to 4).

Amazon probably deliberately chose a 3 syllable name with "Alexa" for the same
reason.

I can imagine future improvements where we can have the originally imagined
wake words "Siri", "Google" and "Alexa", and at that point I would be most
happy with "Siri" because it would be short and not-corporate.

~~~
osteele
Siri (the company) was a spin-off from SRI – Stanford Research Institute.

The initial product, before the Apple acquisition, was an iPhone app with a
chat interface. I don't recall that it supported voice input.

It is still possible that the founders were thinking ahead to voice input and
wake words when they named the company.

~~~
woodson
The technology was initially developed under the DARPA CALO research program
(https://en.m.wikipedia.org/wiki/CALO).

------
Isamu
Pretty cool how they reduce the power consumption - when it first came out,
"Hey Siri" required your device to be plugged in:

> To avoid running the main processor all day just to listen for the trigger
> phrase, the iPhone’s Always On Processor (AOP) (a small, low-power auxiliary
> processor, that is, the embedded Motion Coprocessor) has access to the
> microphone signal (on 6S and later). We use a small proportion of the AOP’s
> limited processing power to run a detector with a small version of the
> acoustic model (DNN). When the score exceeds a threshold the motion
> coprocessor wakes up the main processor, which analyzes the signal using a
> larger DNN.

> Apple Watch uses a single-pass “Hey Siri” detector with an acoustic model
> intermediate in size between those used for the first and second passes on
> other iOS devices. The “Hey Siri” detector runs only when the watch motion
> coprocessor detects a wrist raise gesture, which turns the screen on. At
> that point there is a lot for WatchOS to do—power up, prepare the screen,
> etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the
> rather limited compute budget. It is a challenge to start audio capture in
> time to catch the start of the trigger phrase, so we make allowances for
> possible truncation in the way that we initialize the detector.
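
The two-pass gating in the first quote boils down to a cheap always-on score deciding whether the expensive model runs at all. A minimal sketch, where `small_score` and `large_score` stand in for the small and large DNNs and the thresholds are made up:

```python
def cascade_detect(frames, small_score, large_score,
                   first_threshold=0.5, second_threshold=0.5):
    """Two-pass trigger: a cheap always-on detector gates an expensive one.

    small_score stands in for the small DNN on the low-power processor,
    large_score for the larger DNN on the main processor; the thresholds
    are illustrative, not Apple's values.
    """
    # First pass: runs on every chunk of audio, all day.
    if small_score(frames) < first_threshold:
        return False                 # main processor stays asleep
    # Second pass: reached only when the cheap model fires, so the
    # expensive model runs on a tiny fraction of the audio.
    return large_score(frames) >= second_threshold
```

The power saving comes from the asymmetry: almost all audio is rejected by the first pass, so the large model (and the main processor) wakes only rarely.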

Another interesting nugget:

> There are thousands of sound classes used by the main recognizer, but only
> about twenty are needed to account for the target phrase (including an
> initial silence), and one large class for everything else. The
> training process attempts to produce DNN outputs approaching 1 for frames
> that are labelled with the relevant states and phones, based only on the
> local sound pattern. The training process adjusts the weights using standard
> back-propagation and stochastic gradient descent. We have used a variety of
> neural network training software toolkits, including Theano, Tensorflow, and
> Kaldi.
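
That training setup (around twenty phrase classes plus a catch-all, frame targets pushed toward 1, cross-entropy with SGD) can be sketched with a single softmax layer standing in for the DNN. The feature dimension, class count, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 21    # ~20 phrase states plus one catch-all for everything else
FEAT_DIM = 13     # per-frame acoustic features (dimension is made up)

W = rng.normal(0, 0.01, size=(FEAT_DIM, N_CLASSES))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(frames, labels, lr=0.1):
    """One stochastic-gradient step of frame-wise cross-entropy training.

    Pushes the output for each frame's labelled class toward 1, as the
    quote describes; a linear layer stands in for the full DNN.
    """
    global W
    probs = softmax(frames @ W)                        # (n_frames, N_CLASSES)
    onehot = np.eye(N_CLASSES)[labels]
    grad = frames.T @ (probs - onehot) / len(frames)   # softmax-CE gradient
    W -= lr * grad
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

frames = rng.normal(size=(32, FEAT_DIM))               # fake acoustic frames
labels = rng.integers(0, N_CLASSES, size=32)           # fake frame labels
losses = [sgd_step(frames, labels) for _ in range(50)]
```

On real data the frames would be filter-bank features and the labels would come from a forced alignment, but the gradient update is the same.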

------
ComputerGuru
Just because (judging from the comments) non-iPhone owners may not be aware:
this isn't a new feature in the iPhone _at all_. Siri has had voice-activated
prompts since late 2015 [0]. What it looks like is that the team finally got
the OK to share technical details about how this works with the public, and
that's what we're seeing here.

Apple doesn't usually do tech writeups; I imagine a pinhead at the corporate
level decided it wasn't worth the risk of leaking any "secret sauce" until
now.

0: https://www.cultofmac.com/390181/5-ways-hey-siri-will-change-your-life-for-the-better/

~~~
IBM
That pinhead was Steve Jobs.

------
Viper007Bond
Pretty cool of Apple to post such technical blog posts as this. I love it when
companies do this.

~~~
ktamura
Indeed. Quietly but surely, Apple is changing its corporate policies around
developers. We used to never see Apple developers at conferences with an
official affiliation, let alone onstage as speakers. I once ran into an
engineer at a Ruby conference who demurred about who he worked for, and when
he finally told me he worked for Apple, he had to recite the whole "what I say
does not reflect nor represent..." preamble. That was as late as 2014.

Great to see such corporate changes toward developer-friendliness.

------
ChuckMcM
Great explanation; this technique has a lot of applications for extracting
event triggers from audio streams. Even though I've trained my iPad
repeatedly, it's still a bit too eager to answer others who talk to it
(either on purpose or by accident).

At some point I expect Apple to design an audio neural network processor to
put on their CPU chips, which will allow them to do both phrase recognition
and highly accurate speaker-dependent speech to text on their devices. It
will be yet another way that people who don't build silicon won't be able to
compete.

------
georgehm
Does anyone know of papers and/or example implementations of similar DNNs for
acoustic modeling using tf or some other framework?

~~~
gok
Google's Deep KWS paper [1] is kind of similar, although they don't use an
HMM.

[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42537.pdf

------
ecesena
Are there any open source projects that do this? I mean just the triggering
part, not the command recognition afterwards.

~~~
MrBuddyCasino
There are Snowboy and uSpeech on GitHub. Haven't used them yet.

~~~
woodson
Snowboy isn’t open source, though.

------
singularity2001
Your connection is not private

Attackers might be trying to steal your information from
machinelearning.apple.com (for example, passwords, messages, or credit cards).
Learn more NET::ERR_CERT_COMMON_NAME_INVALID

Access Denied

You don't have permission to access
".../machinelearning.apple.com/2017/10/01/hey-siri.html" on this server.
Reference #...

------
allenleein
2012: OK Google / 2014: Alexa / 2017: Hey Siri

~~~
gordyf
Hey Siri has been an iPhone feature since the 6s, which came out in 2015.

~~~
Viper007Bond
It's a feature on older iPhones as well (such as my regular 6) but requires
being on a power source since those lack the specialized, low-power chip.

