
Hackers send silent commands to speech recognition systems with ultrasound
https://techcrunch.com/2017/09/06/hackers-send-silent-commands-to-speech-recognition-systems-with-ultrasound/
======
ConfucianNardin
Previous discussion:
[https://news.ycombinator.com/item?id=15191640](https://news.ycombinator.com/item?id=15191640)

------
baldfat
This is a MUCH bigger deal than most understand. This will cost less than $10
to build and there is no hardware solution on phones or Alexa.

Phreaking is back.

~~~
rachitgupta
This will be fixed with a simple software update that ignores all sounds at
inaudible frequencies.

~~~
femto
It's happening at the hardware level, so there is potentially limited scope to
fix it in software. My guess is that when the author refers to "harmonics"
they are really talking about intermodulation.

The idea is that if you want to create a frequency of "A", you can emit two
powerful tones at frequencies "B" and "B+A", where the frequency B is high
enough to be out of hearing range. The non-linearity of the microphone means
the two tones mix together to produce a number of other frequencies, including
the frequency "B+A"-"B" = "A".

Thus the conversion from ultrasonics to audible is happening in the microphone
itself, before the software has a chance to tell the difference. The mixing
process typically produces frequencies other than "A", so there might be hope
of a countermeasure if the microphone is able to pick up these other
frequencies and the software is smart enough to use them to figure out that an
attack is in progress. It's not a simple case of just filtering out a
particular frequency, and an intelligent choice of ultrasonic frequencies may
leave only a single frequency in the band of the microphone.

It's the same principle that is used in ultrasonic beamforming speakers. That
adds another element of stealth to the attack, in that the high frequencies
can allow the sound to be beamformed and illuminate the microphone and not
much else.
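The mixing argument above can be checked numerically. Below is a minimal
Python/numpy sketch with made-up parameters (a 30 kHz carrier "B", a 1 kHz
target "A", and a toy quadratic term standing in for the microphone's
non-linearity); it shows the difference tone at "A" appearing in the output
spectrum even though both input tones are ultrasonic:

```python
import numpy as np

fs = 192_000             # sample rate high enough to represent ultrasound
t = np.arange(fs) / fs   # one second of samples

B = 30_000               # inaudible carrier frequency, Hz (assumed)
A = 1_000                # target audible frequency, Hz (assumed)

# Two powerful ultrasonic tones at B and B+A
x = np.sin(2 * np.pi * B * t) + np.sin(2 * np.pi * (B + A) * t)

# Toy model of a non-linear microphone: a small quadratic term
y = x + 0.1 * x ** 2

# Spectrum of what the "microphone" delivers to the software
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), 1 / fs)

# The quadratic term mixes the two tones down to (B+A) - B = A
band = (freqs > 20) & (freqs < 20_000)   # audible band, DC excluded
peak = freqs[band][np.argmax(spectrum[band])]
print(peak)   # → 1000.0
```

The same quadratic term also leaves products at 2B, 2B+A and 2(B+A); those
extra components are what a software countermeasure could look for.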

~~~
tzs
> The idea is that if you want to create a frequency of "A", you can emit two
> powerful tones at frequencies "B" and "B+A", where the frequency B is high
> enough to be out of hearing range. The non-linearity of the microphone means
> the two tones mix together to produce a number of other frequencies,
> including the frequency "B+A"-"B" = "A".

Does this work for ears, too?

If so, are the non-linearities of different people's ears similar enough that
two people hearing the same A and B would get the same results, or would
person to person variations in non-linearity mean they might hear different
results?

~~~
femto
Yes it does:

[https://makezine.com/2008/10/08/homebrew-parametric-speak/](https://makezine.com/2008/10/08/homebrew-parametric-speak/)

[http://www.soundlazer.com/](http://www.soundlazer.com/)

It's reasonably consistent. Differences in non-linearity will result in
different amplitudes for each intermodulation product, but not different
frequencies. Typically these systems use the "third order" product. I gather
that the non-linearity exploited is as much a property of the air as the ear.

~~~
tzs
I wonder if this has any implications for recording?

If someone is listening to a live musical instrument that is producing both
audible sound and ultrasonic sound [1], is what the person perceives affected
by intermodulation in the ear?

If the performance is also recorded using a technology that for all practical
purposes reproduces perfectly everything in the audible range, then I can see
a couple possible cases.

1\. The microphone is designed to filter out ultrasonics or is sufficiently
linear to not have intermodulation.

In this case, the recording is what would be heard with no intermodulation.
When played back, all the listener gets is the audible portion of the original
sound, without any ultrasonics. Thus there is nothing to produce
intermodulation in the listener's ear, and so the listener might perceive the
recording as having a different timbre than the live instrument.

2\. The microphone does not filter ultrasonics and is non-linear enough to
have intermodulation. The audible intermodulation products will then be
included in the recording.

When played back the listener will hear intermodulation products, but they
will be the ones from the microphone's non-linearity, not the ear's non-
linearity.

The question then is how close are microphone non-linearities to ear non-
linearities. If they are similar, then the timbre of the recording should
match live. If they are sufficiently different, the timbre could sound off.

It should be possible to design a system that records only audible frequencies
and plays back only audible frequencies and sounds identical to live, but it
may require specifically taking into account ultrasonics instead of just
cutting them out like I think we currently do.

[1] A trumpet with a Harmon mute playing a quiet note has about 2% of its
energy above 20 KHz. Playing a loud note drops that to about 0.5%. A cymbal
crash is about 40% above 20 KHz. (Keys jangling are almost 70% above 20 KHz,
which probably has something to do with why back in the early days of TV
remote controls when they were ultrasonic instead of IR or RF people would
report that if someone's keys jangled the channel would sometimes change).
See:
[https://www.cco.caltech.edu/~boyk/spectra/spectra.htm](https://www.cco.caltech.edu/~boyk/spectra/spectra.htm)

------
akerro
Talking Behind Your Back (33c3)

[https://www.youtube.com/watch?v=WW1-xnTIDjQ](https://www.youtube.com/watch?v=WW1-xnTIDjQ)

~~~
pluma
33c3 was in 2016, for those like me who can't keep track of that.

------
athenot
Ultrasound as a side channel is also used by Cisco's proximity sensors for
telepresence units and the new Spark Boards. While on a call on my phone, I
can walk into any conference room and swipe to hand off the video call from my
phone to the room's telepresence unit, even if my phone is not on wifi (e.g.
because I just walked into the building or had wifi turned off). In my
experience, this flows much more smoothly than any wifi- or bluetooth-based
pairing.

[https://www.cisco.com/c/en/us/products/collaboration-endpoints/spark-board/index.html](https://www.cisco.com/c/en/us/products/collaboration-endpoints/spark-board/index.html)

Full disclosure: I work in that BU.

------
tomalpha
Could this phenomenon have caused the recent hearing loss among US and
Canadian diplomats in Cuba?

~~~
PerryCox
That's actually a very good question. I'm very curious to learn what the
actual cause of that was.

------
beager
As the article says, there is a physical-presence bar to meet to make this
workable now, but I can't help thinking about clever ways this could become a
"worm" or spread digitally. Infected viral (actually viral!) YouTube videos?
Robocalls that you hope go on speakerphone (where the fidelity of the attack
signal may be questionable)? Or driving a car with big speakers around a
neighborhood? How about a mobile-phone botnet that just constantly speaks to
digital assistants?

~~~
narrowtux
The speakers would have to be able to emit ultrasonic waves. I'm pretty sure
you need special hardware for this (the researchers themselves used a special
ultrasonic emitter, not the speaker of the phone they used).

~~~
beager
Good point. I wouldn't trust that the speaker on a mobile phone has a very
good frequency response, but a TV or car stereo might be able to swing it.

------
toyg
The "solution" will be another layer of security hacks: passphrases (which
anybody can trivially record), registration of a known voice (which means the
phone will stop accepting it in noisy surroundings)... another cat-and-mouse
game starts, and it's like we're back to 1970.

------
willvarfar
The range limitation seems possible to overcome too.

There are directional megaphones that can send sound for hundreds of meters.
Directional megaphones send a beam of ultrasound that is focused on the
surface where the sound is to be generated, so an attacker may be able to use
a directional megaphone to beam the voice commands to the target's vicinity.

Suddenly all kinds of movie-plot tricks could be used in heist movies! The
perp uses an air vortex cannon to press a doorbell button from a distance and
a directional megaphone to talk into the buzzer, and then ... well, it's easy
to get carried away :)

------
gourou
These speech recognition tools need to have some sort of authentication:

\- How about a secret "wake word" instead of "Alexa" or "Hey Siri"?

\- Only treat signals in the human voice range

\- Voice identification

If this isn't patched soon (by ignoring ultrasound), it could mean that these
tools are already using inaudible signals for other purposes. For example,
commercials could add ultrasound to know who's watching them.

~~~
wiredfool
You know, I'm ok with a physical button press as the authentication method.

~~~
fredley
Preferably some kind of hardware on-off switch for the physical microphone.

~~~
isostatic
And camera

------
tinus_hn
Even if the commands are not inaudible it's a security issue if the device
performs a dangerous action because someone in the neighborhood said
something.

The feature is presented as if it only responds to the owner but realistically
that distinction doesn't work at all.

If you shout 'hey Siri' into the microphone at a large event, a lot of phones
are going to respond.

~~~
js8
> If you shout 'hey Siri' into the microphone at a large event, a lot of
> phones are going to respond.

Curious - anybody ever tried that?

~~~
sangnoir
Burger King had a TV ad[1] that triggered Google Assistant and Google Home
("OK Google, what is the whopper?")

1\. [https://www.theverge.com/2017/4/12/15277278/google-home-burger-king-whopper-ad-campaign](https://www.theverge.com/2017/4/12/15277278/google-home-burger-king-whopper-ad-campaign)

------
Retr0spectrum
Presumably, you could also broadcast an AM radio signal at just the right
frequency that the wires in the microphone pick it up.

~~~
parhurs
This has already been demonstrated on headphones[0].

[0] [https://www.wired.com/2015/10/this-radio-trick-silently-hacks-siri-from-16-feet-away/](https://www.wired.com/2015/10/this-radio-trick-silently-hacks-siri-from-16-feet-away/)

------
crehn
It could get pretty sci-fi with this. Repeatedly send "Hey Siri" with a custom
message in a crowded area. Watch people shake their heads whenever they check
their phones.

------
bsenftner
Don't forget hiding audible commands inside noise: A/C startups, vacuums, any
grinding / background noise we typically ignore can be a carrier.

------
EGreg
Can't Alexa, et al simply look for a certain volume and duration in the
audible range, which sounds like speech, in order to activate?

~~~
javra
I think volume is the key here... The hackers induced harmonics in the audible
range, but at a volume that was only perceivable by the microphone and not by
bystanders. So it seems to me that it's possible to prevent this by requiring
a higher volume threshold for activation.

~~~
EGreg
I am just worried the hackers will then make sounds that resemble noise, but
at least that would be characteristic and might alert the person (maybe).

------
mozumder
Wait.. wouldn't the ear hear the harmonics as well, since it's also a
membrane-based microphone?

------
swsieber
"None worked farther than 5 feet away"

Well, that seems pretty easy to combat.

~~~
j_s
There are a lot of audio positioning options that very few have the expertise
to apply. Reminds me of this Kickstarter:

[http://www.soundlazer.com/what-is-a-parametric-speaker/](http://www.soundlazer.com/what-is-a-parametric-speaker/)

------
golergka
Can't it be fixed by a simple low-pass filter?

------
ajuc
You can't stop the phantom frequencies from appearing at the microphone, so
the solution is to stop recognition when powerful ultrasonic waves are present
and alert the user that something weird is happening.
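That detection idea can be sketched in a few lines of Python/numpy. This is a
toy illustration, not a vetted defense: the function name, the 20 kHz band
edge, and the -20 dB relative-energy threshold are all assumptions made up
for the example.

```python
import numpy as np

def ultrasonic_attack_suspected(samples, fs, threshold_db=-20.0):
    """Flag a frame whose ultrasonic band carries suspiciously high energy
    relative to the audible band, hinting that commands may be riding in
    from above the hearing range."""
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), 1 / fs)

    audible = power[(freqs > 20) & (freqs < 20_000)].sum()
    ultrasonic = power[freqs >= 20_000].sum()
    if audible == 0.0:
        return bool(ultrasonic > 0.0)

    # Ratio of ultrasonic to audible energy, in dB
    ratio_db = 10 * np.log10(ultrasonic / audible + 1e-300)
    return bool(ratio_db > threshold_db)
```

A real assistant would run this per frame and suppress the wake word while
the flag is raised; filtering after the fact can't help, since the audible
intermodulation products are already baked in at the capsule.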

