
Alexa and Siri Can Hear Hidden Commands - GW150914
https://www.nytimes.com/2018/05/10/technology/alexa-siri-hidden-command-audio-attacks.html
======
coleca
Isn't there an inaudible tone that you can use to disable the assistant? I
recall reading somewhere that Amazon used it in their commercials for Alexa so
that everyone's Echos weren't lighting up during the commercials. I know when
a commercial for the Echo comes on and the voices repeat "Alexa, do X" the
Echo I have near the TV speaker doesn't light up.

~~~
adambowles
If I recall correctly, the adverts omit a certain frequency range from the
assistant's invocation phrase which humans won't notice is missing

Edit: Yep, omit / reduce tones in the 3000 - 6000 Hz range
[https://www.reddit.com/r/amazonecho/comments/5oer2u/i_may_ha...](https://www.reddit.com/r/amazonecho/comments/5oer2u/i_may_have_found_how_amazon_prevents_the_echo/)
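
Assuming that is roughly the mechanism, it's easy to sketch: notch the 3000 - 6000 Hz band out of the ad audio. A minimal numpy illustration with a crude FFT notch and two stand-in tones (not whatever production filter Amazon actually uses):

```python
import numpy as np

fs = 44_100                       # typical broadcast audio sample rate
t = np.arange(fs) / fs            # one second of audio, 1 Hz FFT bins

# Toy stand-in for the ad's audio: tones at 1 kHz and 4.5 kHz.
audio = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 4500 * t)

# Zero out the 3-6 kHz band in the frequency domain (a crude notch).
spectrum = np.fft.rfft(audio)
freqs = np.fft.rfftfreq(len(audio), d=1 / fs)
spectrum[(freqs >= 3000) & (freqs <= 6000)] = 0
filtered = np.fft.irfft(spectrum, n=len(audio))

# The 4.5 kHz component is gone; the 1 kHz component is untouched.
out = np.abs(np.fft.rfft(filtered))
print(out[4500] < 1e-6, out[1000] > 1000)  # True True
```

The filtered audio still sounds like the original to most listeners, but a detector keyed on energy in that band no longer triggers.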

~~~
SlowRobotAhead
Which makes WAY more sense than a range outside of human hearing, which is what
OP implied ("inaudible").

Expecting TVs to play tones outside the range people can hear is ridiculous.

~~~
jimrandomh
TV channels already contain inaudible-sound identifiers. Nielsen has people
put listening devices in their homes, and uses the identifiers to track which
channels are being played.

~~~
jdietrich
You can't count on a TV being able to reproduce sounds below about 100Hz or
above about 16kHz. The position of the speakers on the back of the TV means
you're likely to get a lot of weird phase effects and many TVs have quite
heavy audio DSP to compensate for the inadequacy of their speakers. Any hidden
signals will need to be in-band and low bit rate with a high level of
redundancy.

~~~
tinus_hn
You can even less expect all the compression in the system, which is designed
to leave out everything people can't hear, to leave in this signal people can't
hear.

------
Jun8
The interesting thing is that Google Assistant has the same problem; it's
right there in the subtitle. It's interesting that it was omitted from the
main title.

Rather than journalistic oversight, I think this verifies what people have
commented many times: that the fact that GA does not have a personalized name
makes it awkward to refer to. So much so that a very distant third product is
included rather than GA.

~~~
lozenge
Usually the attack requires the source code (or weightings of the neural
network), I'd be surprised if they are able to actually attack these systems.

~~~
saagarjha
Does it really? As far as I was aware, it is still possible to perform a black
box attack without knowing the weights of a network. Using specially crafted
input, it's even possible to "steal" weights from a network!

~~~
srtjstjsj
[https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/carlini](https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/carlini)

"We evaluate these attacks under two different threat models. In the black-box
model, an attacker uses the speech recognition system as an opaque oracle. We
show that the adversary can produce difficult to understand commands that are
effective against existing systems in the black-box model.

Under the white-box model, the attacker has full knowledge of the internals of
the speech recognition system and uses it to create attack commands that we
demonstrate through user testing are not understandable by humans."

------
sailfast
I can't recommend highly enough that everyone at least turn on the audible
sound when their assistants are listening. At a minimum you should know the
kinds of things that end up triggering the device. There's a really wide area
of detection and it's interesting to see where that is.

I would also recommend changing your default wake word at a minimum. Then again,
I might also recommend ditching the device entirely but I happen to have one
in my kitchen that I like OK sometimes.

~~~
mkirklions
I want to get rid of my Alexa, but my wife uses it as a kitchen timer.

We literally don't use it for anything else.

Might get one of those Philips light sets since our living room is weird...
Tbh, I'd rather not use Alexa.

~~~
toast0
The clock / timer functions of Alexa/Google Home are super useful, but nothing
else seems compelling to me.

Someone could clearly make an offline device that did voice recognition clock
and timer, but does anyone?

~~~
chatmasta
Or you could spend $3 and get a kitchen timer that you can slap to start, and
beeps when it’s done!

~~~
Normal_gaussian
Alexa has named timers, which means you can time multiple things (a common
occurrence in cookery).

It is also voice activated, which means it can be done when your hands are
full (another common occurrence in cookery).

~~~
gus_massa
It looks like a nice product request that someone here with knowledge in the
area could build. Perhaps a Kickstarter.

------
teachrdan
The real problem is that Alexa and Siri increase your attack surface area to
include every speaker in your home--including any cheapo Bluetooth or
internet-connected speakers that could get hacked to produce these human-
inaudible sounds.

(To be fair, I recall this specific point coming up in an HN thread about
Alexa being able to open your door for Amazon deliveries, but I thought it was
worth reiterating here.)

------
bsenftner
Hidden audio is simply too easy. Hidden audio is the knife that kills desire
for any financial services access through a voice assistant - for those smart
enough to not follow the horde.

~~~
Shivetya
For me it kills all desire to have a voice assistant. I am already in the camp
of taping over the cameras in my computers; now will I need to worry about the
microphone, or what comes from the speakers?

So the question is, shouldn't they be able to detect the frequency of what
they are processing to weed out some of the more obvious tricks? With voice
recognition, could it also not be limited to a voice it is trained to know?

~~~
djsumdog
I don't bother taping over stuff. If you think about it, there are probably
10+ microphones in your room (Samsung TVs, phones, laptops, tablets, etc.).

I run third-party ROMs and Linux on all my dev/TV machines, disable Cortana on
my gaming laptop, and hope there isn't something listening in all that
trusted, untrusted, and OSS code I'm running.

I told my roommate I'd move out if he ever got an Alexa or Google Home device.
I do want to run Jarvis, or one of the OSS alternatives. Many of them send
your data to Google/Amazon as well if you enable their speech-to-text
services, but they also have options for using local OSS decoders (and
typically enable those by default).

Our phones are so powerful today there is no reason to send your speech to the
cloud (someone else's computer). It should just be done locally, and tech
should be improved so accuracy improves locally without needing the larger
datasets that Google/Amazon/Apple use.

More devs need to use the OSS assistants instead, and maybe that will push
other engineers to not go the easy route and to protect their privacy
instead.

~~~
freehunter
You should bother taping over cameras you're not using though. It's way too
easy to hijack them and keep the light from turning on when you do.

~~~
jonknee
Do you really put tape on your phone?

~~~
freehunter
I use my phone's camera, so no.

------
ComputerGuru
iOS now has a “Type to Siri” feature that disables spoken interfacing but
retains the “smart” capabilities of the digital assistant.

Not that digital assistants are worth the risk they bring. Even now I can’t
get Siri to do anything useful that isn’t very carefully and artificially
phrased.

~~~
mkirklions
I got an iPhone from work and I was shocked at how little Siri could do.

Given all the hype from my friends and the commercials, I expected something
outstanding.

Nope, significantly worse than Google's assistant.

That was the start of my complete disappointment in Apple as I continued to
use an iPhone and wonder: why is anyone buying this?

~~~
kfrzcode
The same reason people buy Coach bags etc... status and brand.

~~~
sillyquiet
Why do people keep repeating this tired and offensive myth? I bought my iphone
for the hardware and software capabilities that I judged to be the best for my
use cases. And I am not the only one that actually had a non-trite reason, I
am sure.

~~~
mkirklions
Is this a myth? What does Apple have better than Android in 2018?

~~~
sillyquiet
As far as what the Apple iPhone does better? Privacy, OS updates, integration
with my Mac, App Store apps, and a bunch of other things. But that is all
completely beside the point. Even if there were nothing at all iPhones do
better, it's a bit absurd to go from 'well, Apple is not better than Android'
to 'people therefore only buy Apple because they are shallow'

~~~
mkirklions
Has it been a while since you used an alternative?

All of those seem like expected features in any OS/phone.

~~~
saagarjha
What other phone does this?

------
devy
Just out of curiosity, what's the frequency range that a typical mic can pick
up a signal from? The article didn't specifically mention the range; it just
said inaudible.

And here is another article I found that mentions the normal 20-20kHz
frequency response range: [http://blog.shure.com/mic-basics-frequency-response/](http://blog.shure.com/mic-basics-frequency-response/)

Doesn't that mostly overlap with the human ear's capability? I understand each
person is different, etc. I'm just curious about the specifics.

~~~
jerf
Yes, typical mics tend to pick up the typical human frequency range, though
cheaper mics may have some really poor characteristics at the edges. Usually
in the speech range they'll be pretty solid.

However, there's a lot of play within the space. One difference is that
microphones do a very direct recording of the sound waves, but what we hear is
actually very distorted compared to the "real" sound by the nature of our ear.
One of the big differences is that if there is a very loud 4000Hz sound, we
can't hear a soft 4005Hz sound near it very well, but the microphone "hears"
it just fine. So for instance, you could put out a loud sound for a user, but
embed a very quiet command in frequencies the human couldn't hear, but if the
listening model doesn't account for that (and there are reasons it wouldn't
necessarily _want_ to, because it _wants_ to hear commands even in the
presence of significant background noise), you could get commands in to a
system. See
[https://en.wikipedia.org/wiki/Psychoacoustics](https://en.wikipedia.org/wiki/Psychoacoustics)
for discussion about how our ears fail to pick up the "real audio" signal, and
how much we've exploited that in music compression.

Now, that was a very brute force example. It sounds to me like what this
article is talking about are called "adversarial examples"
([https://blog.acolyer.org/2017/02/28/when-dnns-go-wrong-adver...](https://blog.acolyer.org/2017/02/28/when-dnns-go-wrong-adversarial-examples-and-what-we-can-learn-from-them/)). Voice recognition doesn't listen
the same way we do, it doesn't necessarily take a holistic view of the signal,
but is looking for specific frequency patterns and changes and turning that
into phonemes, into words, etc. (There's a lot of ways of doing this and I
don't specifically know what Alexa and Siri are doing, so that's a really
vague overview.) If you know what they are looking for, you can use filters to
very, very selectively remove the patterns from a bit of music or something
that Alexa might trigger on, and then insert just the bare minimum skeleton of
the sounds that it is really recognizing. A human won't be able to hear the
difference (most likely; depends on how badly the original is mangled but even
if it is audible it is almost certainly not audible without an A/B test and
very good ears), but the probably-neural-nets monitoring for sounds will end
up superstimulated and interpret the adversarial example as words.

While the adversarial examples work best with tuning to the target network,
widely-shared networks like Alexa or Siri mean that such tuning is practical
where attacking some custom-trained model used by one person isn't, _and_
experiments have shown that adversarial examples travel between separately-
trained nets and even non-neural-net models to a much, much greater degree
than what at least my own intuition would have suggested beforehand. (See
previous link and look for the discussion of "Practical black-box attacks
against deep learning systems using adversarial examples". It is extremely
counter-intuitive to me how easy this is.)
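
The 4000 Hz / 4005 Hz point above is easy to check numerically: a plain FFT, standing in for the microphone-plus-DSP view of the signal, resolves both tones even though a human would perceive only the loud one. A small numpy sketch (tone frequencies and levels are illustrative):

```python
import numpy as np

fs = 16_000
t = np.arange(fs) / fs   # one second of audio, so FFT bins are 1 Hz apart

# A loud 4000 Hz tone plus a 40 dB quieter tone at 4005 Hz. A human
# listener wouldn't perceive the quiet tone (frequency masking), but the
# spectral view separates the two cleanly.
audio = 1.0 * np.sin(2 * np.pi * 4000 * t) + 0.01 * np.sin(2 * np.pi * 4005 * t)

spectrum = np.abs(np.fft.rfft(audio))
print(spectrum[4000])  # ~8000: the loud tone
print(spectrum[4005])  # ~80: the "masked" tone, still plainly visible
```

Anything a recognizer does with that second peak is invisible to the listener, which is exactly the gap these attacks exploit.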

~~~
floatrock
hmm... the big idea that MP3 figured out was you can document all these "if
there is a very loud 4000Hz sound, we can't hear a soft 4005Hz sound near it
very well" psychoacoustic phenomena and just throw away all that extra "can't
hear it very well" information, resulting in a vastly-smaller filesize that
still sounds reasonable (yeah yeah it's not FLAC and the purist needs their
gold-plated Monster cables, lets not go there, that's not the point)

So this attack is kinda a "reverse-MP3" that adds those lossy bits back in,
but shaped with an attack payload. Or at least it adds enough pieces of the
attack payload that the neural net pattern recognition triggers, while the
humans say "Doesn't sound like anything to me".

Is that a close-enough explain-like-im-a-freshman?

~~~
jerf
I primarily brought up psychoacoustics as an example of the way we don't hear
the way microphones do. While you could abuse them, it would be more obvious.
In this case what we're getting is the audio equivalent of adversarial
examples; see the link I gave for some visual examples. What's interesting
there is that they are basically invisible to us, but surprisingly robust.

(As another sort of philosophical sidebar, this either proves, or provides
very strong evidence, that whatever it is our brains are doing, it is not what
deep learning nets are doing, nor anything else vulnerable to such trivial
adversarial examples. I've seen adversarial examples against another technique
that do seem to work against humans as well, but it requires such a distortion
to the image that "I can't tell if that's a dog or a toaster" actually makes
sense; it's not just some sort of attack against human vision or something,
it's a fancy morphed thing halfway between the two that would probably confuse
anything and anybody.)

~~~
floatrock
ah, thanks for the clarification! (and the interesting philosophical sidebar!)

------
newsbinator
I can't get Siri to turn on the flashlight or lock my phone.

What can Siri do that's dangerous?

~~~
Nelson69
Well, it can read your schedule, tell you your location and then there is the
homekit stuff. It could potentially disable certain security features you
might have installed at your house, it could possibly perform a very expensive
modification to your HVAC configuration, in an extreme case that could maybe
be fatal (disable heating in the winter at an older person's home or something
like that.) It can also read your messages which are used for MFA in some
situations. My wife and I have our accounts hooked together and I can ask Siri
where she is and it uses find my friends, it can also kick off find my iPhone
which shows my wife's presumed location on a map.

I think it can do Apple Pay actions too.

Degrees of dangerous. I don't have a HomePod, but presumably it couldn't do
anything with Apple Pay or your messages. Having Alexa or Siri control home
automation stuff seems like something you might want to think about a little,
leaving the lights on all day and burning some energy is a very different
thing than re-configuring your HVAC or a security camera.

------
exodust
I hadn't thought of this, it's quite concerning. I don't see how they can
safeguard against this without reducing the effectiveness of the voice
recognition.

A secret command to "paste clipboard into new email, send to [address]" is a
shiny new attack vector without any apparent straightforward way to plug the
security hole.

~~~
djrogers
The obvious plug would be to not allow such a ridiculously unsafe command
without requiring you to unlock your device, much like my iPhone does today.

~~~
exodust
Sure, but your phone is sometimes already unlocked because you used it 30
seconds ago and it now sits on the table. Or it's playing music, or your kid
has it etc. I don't think I was thinking about a phone anyway, more the
dedicated devices that sit there listening all the time.

~~~
DecoPerson
So lock your phone every time you set it down. Never leave it unlocked.

I used to have my iPhone lock 5 min after I pressed the sleep button. Now that
Touch ID makes it very easy to unlock, I have it locking immediately.

When I let my friend's 4yo use my iPad, I triple tap the home button and press
"Guided Access", which can prevent the user from accessing other apps until I
disable it. (I do this because I'm worried about what he may accidentally
search on the web, not because I'm worried he'll steal my data!)

------
hedora
Am I the only one that wants a nice cherry mechanical keyboard that
transcribes typed commands to inaudible voice commands?

~~~
freeone3000
Why use the voice commands, then? Why not just type?

~~~
tudelo
The only things I can think of are either to not be heard by others nearby, or
to mess with people who have these devices. The first can be done by just
typing to something that can natively store what you want, and the second is
just for fun, I guess.

------
callumprentice
DolphinAttack: Inaudible Voice commands:
[https://youtu.be/21HjF4A3WE4](https://youtu.be/21HjF4A3WE4)

~~~
Froyoh
Can't they just limit the activation frequency? Seems easy enough

~~~
tasty_freeze
I don't know anything about it, but based on the name, I'll venture a guess.

The audio system has an A/D converter which samples audio at a specific rate
-- say 48 KHz. Aliasing occurs when the input to the A/D converter is above
1/2 the sample rate. A 24001 Hz signal is indistinguishable from a 23999 Hz
signal. A 25000 Hz signal is indistinguishable from a 23000 Hz signal, etc.

To eliminate these types of problems, there will be an analog lowpass filter
before the sampling circuit. There is a gradual rolloff of signal sensitivity.
Aliasing still occurs, but the energy of the aliased signals is significantly
reduced.

My guess is you take a voice command, even if it is constrained to, say, 200
Hz to 2 KHz, then invert the spectrum and shift it to the 46-48 KHz range.
When this high-frequency signal is played back, due to aliasing, the software
after the A/D converter sees it as a 0-2 KHz signal, though greatly
attenuated. To overcome that, the source audio can be tremendously loud.
Humans can't hear it, so it remains stealthy.
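
The folding described above is easy to check numerically. A numpy sketch of the guess (with no anti-aliasing filter in the model, so the aliased tone arrives at full strength):

```python
import numpy as np

fs = 48_000                 # A/D sample rate, so Nyquist is 24 kHz
n = np.arange(fs)           # one second of samples, 1 Hz FFT bins

# "Play" a 25 kHz tone and sample it with no anti-aliasing filter.
sampled = np.sin(2 * np.pi * 25_000 * n / fs)

# The sampled data is indistinguishable from a 23 kHz tone: the spectral
# peak folds down to fs - 25000 = 23000 Hz.
spectrum = np.abs(np.fft.rfft(sampled))
peak_hz = int(np.argmax(spectrum))
print(peak_hz)  # 23000
```

In a real device the filter before the converter would knock this tone down by many dB, which is why the attack needs so much source power.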

~~~
cozzyd
That's clever but that's so many dB down with any sane anti-aliasing filter
that it would require quite the sound source.

Based on flipping through the pages of the paper
([https://arxiv.org/pdf/1708.09537.pdf](https://arxiv.org/pdf/1708.09537.pdf)),
it looks like they're taking advantage of the non-linearity in the response at
high frequencies to effectively demodulate a lower-frequency signal that was
mixed up to ~22 KHz.

Which, if that's what they're doing, is totally awesome!
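
If that reading of the paper is right, the trick can be sketched in a few lines: amplitude-modulate a baseband "command" onto an ultrasonic carrier, then let a small quadratic nonlinearity, as a toy model of the microphone, demodulate it back to baseband. All the specific frequencies and coefficients below are illustrative:

```python
import numpy as np

fs = 192_000                            # simulation rate, well above the carrier
t = np.arange(fs) / fs                  # one second of signal
command = np.sin(2 * np.pi * 500 * t)   # stand-in for the voice-command baseband

# Amplitude-modulate the command onto an inaudible 22 kHz carrier.
ultrasound = (1 + command) * np.cos(2 * np.pi * 22_000 * t)

# Toy microphone: a small quadratic nonlinearity. Squaring an AM signal
# produces a baseband term proportional to the original command.
mic = ultrasound + 0.1 * ultrasound ** 2

# Crude low-pass, standing in for the ADC/DSP chain: keep only < 2 kHz.
spec = np.fft.rfft(mic)
freqs = np.fft.rfftfreq(len(mic), d=1 / fs)
spec[freqs > 2000] = 0
recovered = np.fft.irfft(spec, n=len(mic))

# The demodulated output has a strong 500 Hz component even though the
# transmitted ultrasound had no energy below ~21.5 kHz.
rec_spec = np.abs(np.fft.rfft(recovered))
```

The carrier and its sidebands sit entirely above human hearing; only the microphone's nonlinearity brings the command back into the audible band where the recognizer lives.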

------
ColanR
This kind of thing has been on HN before...the new thing here is that the
command is embedded in a human-audible sound clip.

------
TheGuyWhoCodes
Can't this just be fixed so that Alexa, Siri, etc. will only accept your voice
pattern?

~~~
tinus_hn
While that’s really difficult it would also mean you’d have to train the
assistant before you’d be able to use it, which is a big hurdle most customers
probably don’t want.

------
mdekkers
It is beyond me why people would want to put a live mic in their home. Every
dystopian story, real or fiction, features some element of constant
observation, and here we go, happily placing these devices in our homes.
Insane

------
danShumway
> This month, some of those Berkeley researchers published a research paper
> that went further

Pet peeve, I really wish that this was a link.

Was I just blind? Is the actual paper linked anywhere in the article?

~~~
srtjstjsj
It's obliquely linked at "More recently, Mr. Carlini and his colleagues at
Berkeley have [LINK: incorporated commands] into audio recognized by Mozilla’s
DeepSpeech voice-to-text translation software, an open-source platform."
[https://nicholas.carlini.com/code/audio_adversarial_examples...](https://nicholas.carlini.com/code/audio_adversarial_examples/)

It's probably this paper:
[https://nicholas.carlini.com/papers/2018_dls_audioadvex.pdf](https://nicholas.carlini.com/papers/2018_dls_audioadvex.pdf)

discussed in January when it went up on Arxiv:
[https://news.ycombinator.com/item?id=16220376](https://news.ycombinator.com/item?id=16220376)

~~~
danShumway
Thanks, I guess I was just blind :)

------
jacksmith21006
Replaced our Echos with Google Homes. Curious if they are also vulnerable?

------
frenchie4111
> In the wrong hands, the technology could be used to unlock doors, wire money
> or buy stuff online — simply with music playing over the radio.

Why hide it in radio content? Couldn't they just play it out loud when I am
not home?

------
saagarjha
> Amazon said that it doesn’t disclose specific security measures, but it has
> taken steps to ensure its Echo smart speaker is secure.

So, security by obscurity?

~~~
notsofastbuddy
> So, security by obscurity?

Obscurity is a perfectly valid layer in a security system. It's just not
sufficient as the primary security mechanism.

------
banku_brougham
Serious off-topic question:

Are there docs for Siri so that I can learn what it can/can’t do?

I have tried skipping songs, playing a genre, setting random play, and similar
in iTunes — generally a failure, often initiating an unwanted phone call.

On the phone I can successfully call the intended contact about 50% of the
time, possibly because I have ~250 contacts.

I suspect that if I knew the right words to interact with the API I could have
a more enjoyable Siri experience.

Alternatively, is there a way to disable it completely — as in long hold on
headphones button does not initiate.

~~~
hunter2_
This doesn't address your question but might be interesting. As someone who
has used Android forever and never tried Siri, those anecdotes are mind
blowing to me. For me with Google Assistant, media commands work about 80% and
successful call initiation is about 90%. And I haven't looked for
documentation either. YMMV of course.

~~~
rootusrootus
Sometimes it works great. Sometimes I can't get Siri to do anything right.
Anecdotally I've found that Google's voice assistant is quite a lot better.
Unfortunately I am unwilling to accept the rest of Google's terms and
conditions so I am stuck with Siri for the foreseeable future.

~~~
dirkgently
And what are those specific "terms and conditions"?

It's funny how Google is somehow perceived as evil while Apple or Amazon are
not.

If I have to trust someone with my data (and we all do), I will choose Google
over anyone else.

~~~
denverkarma
I disagree that Amazon is not considered evil.

But really I don’t think anyone deeply believes that these companies are good
or evil in the personal human sense, rather it’s a question of incentives and
interests.

Google makes money by selling me to advertisers. I understand the business
value but I’m personally not comfortable with it.

Amazon makes money by selling me other people’s stuff. I’m comfortable with
the business, but sometimes I’m concerned that what’s good for Amazon isn’t
what’s good for the people who make the stuff I like.

Apple makes money by selling me stuff that they make. This is the business
model that I like best, because when they make stuff I don’t like I don’t buy
it, and when they make stuff I love I’m happy to give them my money in
exchange.

Buying from the maker is the best win-win virtuous cycle, in my opinion.

