[dupe] Hackers send silent commands to speech recognition systems with ultrasound (techcrunch.com)
202 points by Garbage on Sept 8, 2017 | 82 comments

This is a MUCH bigger deal than most understand. This will cost less than $10 to build, and there is no hardware solution on phones or Alexa.

Phreaking is back.

Yeah, those parts are easily available on Aliexpress in droves. Imagine this in the crowded subway.

"hey siri" "show me pictures of CENSORED" "send the first picture to mom" "yes, send it"


I'm pretty sure that Hey Siri is keyed to the user's voice to some extent, so Siri should be safe. Not sure about Google, Cortana, Alexa etc.

"We validated DolphinAttack on major speech recognition systems, including Siri, Google Now, Samsung S Voice, Huawei HiVoice, Cortana, and Alexa."

I take that to mean Siri is keyed to the voice only for the wake-up phrase at most, and, since this attack worked, it seems to be _not_ trained to the voice at all.

That, or poorly trained. Don't know, can't verify.

Emphasis on "to some extent". Relying on it is a bad idea even though it'll keep out most randos.

Speech fingerprinting, much like actual fingerprints, is not a reliable way to establish identity, especially if you want to avoid false negatives, as you probably would in mass consumer-facing technology.

Siri recognises the "Hey Siri" bit as unique to the owner; further commands aren't validated.

"OK Google" on my Pixel is keyed to my voice and can be used to unlock the phone (though I turned that feature off), but I'm not sure if it validates my voice when the phone is already unlocked, or if other google devices like the Home allow this.

You are correct that Apple claims that the opening "Hey, Siri" is trying to match against your voice specifically: https://techcrunch.com/2015/09/11/apple-addresses-privacy-qu...

No. It absolutely is not. There is no way (currently) to train Siri to just your voice.

Cortana has that as an option in Settings, but it's not on by default.

Siri is not keyed to anything. My daughter can talk to her as I do.

"Hey Siri" is.

Not on my phone, my whole family can trigger it.

While it's locked?


How about forwarding password reset emails?

"Siri, forward my Ashley Madison password email to my wife"

Unplug Alexa, turn "Hey Siri" off. Schlep to store for Cheerios.

But how will the store acquire all my personal information that way?

Loyalty cards.

By using the same credit card for each purchase.

Turn Alexa on, order Cheerios, turn Alexa off.

"...no hardware solution..."

There is for new hardware. It's relatively simple to include a couple components on the mic input to filter out ultrasound.

There is indeed no hardware fix for anything already manufactured.

They are taking advantage of nonlinear properties in the microphone system, as described clearly in the article. There is no software filtering to protect against this using classical signal processing.

That said, it may be possible to detect the difference between human speech and the ultrasonic trick with a machine learning solution.

The thing is, this is creating harmonic vibrations in the membrane at human speech frequencies. I really don't think you need to go with ML. It seems to me that if you remove the LPF and let the ADC access all the frequencies, then (either in the ADC or in software) you can detect the harmonics and ignore them when an ultrasound source is present.

Edit: suggest LPF removal.
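If the front end could actually see above the audible band, the "ignore when an ultrasound source is present" idea might look something like this sketch. The function name, 20 kHz cutoff, and energy-ratio threshold are all made-up assumptions, not anything from the paper:

```python
import numpy as np

def suspicious_ultrasound(frame: np.ndarray, fs: int,
                          cutoff: float = 20_000.0,
                          ratio_threshold: float = 0.1) -> bool:
    """Flag an audio frame whose ultrasonic energy is large relative to
    its audible energy, so recognition can pause instead of being spoofed."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    ultrasonic = power[freqs >= cutoff].sum()
    audible = power[freqs < cutoff].sum() + 1e-12  # avoid division by zero
    return ultrasonic / audible > ratio_threshold
```

A frame dominated by a 30 kHz tone would be flagged; ordinary speech, with essentially all its energy below 20 kHz, passes. This only works on hardware whose ADC samples fast enough to see the ultrasound in the first place, which is the point of suggesting LPF removal.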

Wouldn't it go through a bandpass filter to remove the ultrasonics before it makes it to the ADC? I thought that was common with ADCs to reduce aliasing issues.

Apologies - I was amidst edits when I was called away. I've addressed this by suggesting removal of the LPF.

This has nothing to do with phreaking. Phreaking is about hacking the phone network, not covert use of the built-in features of a network-enabled device.

Well, phreaking seemed to me to be phone-based.

This will be fixed with a simple software update that ignores all sounds at inaudible frequencies.

It's happening at the hardware level, so there is potentially limited scope to fix it in software. My guess is that when the author refers to "harmonics" they are really talking about intermodulation.

The idea is that if you want to create a frequency of "A", you can emit two powerful tones at frequencies "B" and "B+A", where the frequency B is high enough to be out of hearing range. The non-linearity of the microphone means the two tones mix together to produce a number of other frequencies, including the frequency "B+A"-"B" = "A".

Thus the conversion from ultrasonics to audible is happening in the microphone itself, before the software has a chance to distinguish the difference. The mixing process typically produces frequencies other than "A", so there might be hope of a countermeasure if the microphone is able to pick up these other frequencies and the software is smart enough to use them to figure out that an attack is in progress. It's not a simple case of just filtering out a particular frequency, and an intelligent choice of ultrasonic frequencies may leave only a single frequency in the band of the microphone.

It's the same principle that is used in ultrasonic beamforming speakers. That adds another element of stealth to the attack, in that the high frequencies can allow the sound to be beamformed and illuminate the microphone and not much else.
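A back-of-the-envelope simulation of that mixing, modelling the microphone as a mildly quadratic (hence non-linear) system. The sample rate, tone frequencies, and non-linearity coefficient are all illustrative assumptions:

```python
import numpy as np

fs = 192_000             # sample rate high enough to represent ultrasound
t = np.arange(fs) / fs   # one second of samples
A, B = 1_000, 30_000     # desired audible tone "A" and ultrasonic base "B" (Hz)

# Two strong ultrasonic tones at B and B+A; neither is audible on its own.
x = np.sin(2 * np.pi * B * t) + np.sin(2 * np.pi * (B + A) * t)

# Crude microphone model: linear response plus a small quadratic term.
# The quadratic term mixes the tones, producing (B+A) - B = A among others.
y = x + 0.1 * x**2

freqs = np.fft.rfftfreq(len(y), 1 / fs)
spectrum = np.abs(np.fft.rfft(y))
energy_at_A = spectrum[np.argmin(np.abs(freqs - A))]
print(energy_at_A)       # large: a 1 kHz component now exists
```

The linear signal `x` alone has no 1 kHz component; only the quadratic term creates it. That's why no amount of filtering after the microphone can remove a tone the microphone itself synthesized into the passband.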

> the author refers to "harmonics" they are really talking about intermodulation.

No, he's talking about harmonics. It's a different effect from intermodulation. It's true that intermodulation involves the sum and difference of two or more frequencies. Harmonics, however, involve integer multiples of a single frequency.

But the impact is the same as intermodulation in that it's really a hardware issue and cannot be countered using a simple frequency filter.

Harmonics are multiples of the fundamental, so in this case they will also be ultrasonic.

Equation 2 in the paper and the subsequent paragraph show what is going on. They use an ultrasonic carrier with modulation. The non-linearity causes the carrier to mix with the sidebands, the third-order intermodulation product being a copy of the modulation centred on 0 Hz (i.e. a baseband signal).

Edit: Figure 12 talks about harmonics, in the context of harmonics of the third-order intermodulation product. What they are really referring to are the higher-order (5th, 7th, and so on) intermodulation products, which in this case will be multiples of the third-order product's frequency.
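That mechanism can be illustrated numerically: amplitude-modulate a voice-band tone onto an ultrasonic carrier, push it through a toy quadratic non-linearity, and a baseband copy of the modulation appears. The carrier frequency, modulation depth, and non-linearity coefficient below are invented for illustration, not taken from the paper:

```python
import numpy as np

fs = 192_000
t = np.arange(fs) / fs
fc = 30_000                      # ultrasonic carrier (Hz)
fm = 400                         # stand-in for a voice-band "command" tone (Hz)

m = np.sin(2 * np.pi * fm * t)   # baseband modulation
x = (1 + 0.5 * m) * np.sin(2 * np.pi * fc * t)   # AM on the ultrasonic carrier

# Toy microphone non-linearity: the quadratic term demodulates the carrier.
y = x + 0.1 * x**2

freqs = np.fft.rfftfreq(len(y), 1 / fs)
spectrum = np.abs(np.fft.rfft(y))
baseband = spectrum[np.argmin(np.abs(freqs - fm))]
print(baseband)                  # a copy of the modulation now sits at 400 Hz
```

The squared term also leaves a component at 2*fm (from the m² part of the expansion) plus components around 2*fc; the latter fall to the LPF, while the 2*fm term is the kind of "harmonic of the intermodulation product" discussed above.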

> The idea is that if you want to create a frequency of "A", you can emit two powerful tones at frequencies "B" and "B+A", where the frequency B is high enough to be out of hearing range. The non-linearity of the microphone means the two tones mix together to produce a number of other frequencies, including the frequency "B+A"-"B" = "A".

Does this work for ears, too?

If so, are the non-linearities of different people's ears similar enough that two people hearing the same A and B would get the same results, or would person to person variations in non-linearity mean they might hear different results?

Yes it does:



It's reasonably consistent. Differences in non-linearity will result in different amplitudes for each intermodulation product, but not different frequencies. Typically these systems use the "third order" product. I gather that the non-linearity exploited is as much a property of the air as the ear.

I wonder if this has any implications for recording?

If someone is listening to a live musical instrument that is producing both audible sound and ultrasonic sound [1], is what the person perceives affected by intermodulation in the ear?

If the performance is also recorded using a technology that for all practical purposes reproduces perfectly everything in the audible range, then I can see a couple possible cases.

1. The microphone is designed to filter out ultrasonics or is sufficiently linear to not have intermodulation.

In this case, the recording is what would be heard with no intermodulation. When played back, all the listener gets is the audible portion of the original sound, without any ultrasonics. Thus there is nothing to produce intermodulation in the listener's ear, and so the listener might perceive the recording as having a different timbre than the live instrument.

2. The microphone does not filter ultrasonics and is non-linear enough to have intermodulation. The audible intermodulation products will then be included in the recording.

When played back the listener will hear intermodulation products, but they will be the ones from the microphone's non-linearity, not the ear's non-linearity.

The question then is how close are microphone non-linearities to ear non-linearities. If they are similar, then the timbre of the recording should match live. If they are sufficiently different, the timbre could sound off.

It should be possible to design a system that records only audible frequencies and plays back only audible frequencies and sounds identical to live, but it may require specifically taking into account ultrasonics instead of just cutting them out like I think we currently do.

[1] A trumpet with a Harmon mute playing a quiet note has about 2% of its energy above 20 kHz. Playing a loud note drops that to about 0.5%. A cymbal crash is about 40% above 20 kHz. (Keys jangling are almost 70% above 20 kHz, which probably has something to do with why, back in the early days of TV remote controls, when they were ultrasonic instead of IR or RF, people would report that if someone's keys jangled the channel would sometimes change.) See: https://www.cco.caltech.edu/~boyk/spectra/spectra.htm

There are (marginally) commercialized ultrasonic speakers:


The air acts as the demodulator though.

Shaping the ultrasound to modulate the eardrum sounds scary.

If I understand this correctly, the phone cannot tell if the audio is outside of human hearing range. The point of the LPF is to filter out all audio that is outside of that range.

The attack they are using transmits the audio at a high frequency that, when detected by the microphone, generates harmonics that are within the normal range and can pass through the filter. By the time the audio signal gets to the processor, it is within the audible range.

> This is MUCH bigger deal than most understand.

>> This will be fixed with a simple software update that ignores all sounds at inaudible frequencies.

In the most helpful and constructive way I can possibly say this directly: the group of individuals not understanding may include you, rachitgupta.

According to my very limited understanding, the attack occurs in hardware prior to digitization. See yesterday's discussion for more details.

If you were to read the article you'd see this isn't how it works. They're inducing harmonic signals within the speech passband.

Except the inaudible sounds are used for marketing purposes. Most companies aren't going to want to just close that door.

I'm genuinely curious - can you share a link how it works please? I've never heard of it.

Watch the 33c3 presentation linked elsewhere in this discussion, or just skip to solutions/Q&A: https://youtu.be/WW1-xnTIDjQ?t=35m05s

You can also read the paper: https://petsymposium.org/2017/papers/issue2/paper18-2017-2-s...

Here is the blurb from their talk:

Cross-device tracking (XDT) technologies are currently the "Holy Grail" for marketers because they allow to track the user's visited content across different devices to then push relevant, more targeted content. For example, if a user clicks on a particular advertisement while browsing the web at home, the advertisers are very interested in collecting this information to display, later on, related advertisements on other devices belonging to the same user (e.g., phone, tablet).

Currently, the most recent innovation in this area is ultrasonic cross-device tracking (uXDT), which is the use of the ultrasonic spectrum as a communication channel to "pair" devices for the aforementioned tracking purposes. Technically, this pairing happens through a receiver application installed on the phone or tablet. The business model is that users will receive rewards or useful services for keeping those apps active, pretty much like it happens for proximity-marketing apps (e.g., Shopkick), where users receive deals for walk-ins recorded by their indoor-localizing apps.

-- https://www.blackhat.com/eu-16/briefings.html#talking-behind...

Not sure if this will help, but I was surprised by this: https://arstechnica.com/tech-policy/2015/11/beware-of-ads-th...

SilverPush has since stopped, it seems, but that might just mean others are doing it more profitably than they were...

I know of traditional media ad-tech that uses ultrasound markers embedded in ads to track/verify whether the ads were really broadcast as promised (number of times & in the correct time slots).

So the steps are:

1. Inject ultrasound markers into ad during post-production.

2. Have a server with multiple tuner cards to monitor multiple stations, grab the audio.

3. Filter audio on specific ultrasound frequencies, search for the pre-injected patterns.

4. Generate reports.

5. Get paid.
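Step 3 of that pipeline might be sketched like this: demodulate on-off keying at a single near-ultrasonic marker frequency and compare against the injected pattern. The function name, 19 kHz marker, bit timing, and threshold are all invented for illustration; real systems presumably use more robust encodings:

```python
import numpy as np

def detect_marker(audio: np.ndarray, fs: int, marker_bits: str,
                  f_mark: float = 19_000.0, bit_len: float = 0.05) -> bool:
    """Recover an on-off-keyed bit pattern at f_mark and compare it
    against the pattern injected during post-production."""
    n = int(fs * bit_len)                       # samples per bit
    decoded = []
    for i in range(len(marker_bits)):
        frame = audio[i * n:(i + 1) * n]
        # Single-bin DFT (Goertzel-style) at the marker frequency.
        ref = np.exp(-2j * np.pi * f_mark * np.arange(n) / fs)
        energy = np.abs(np.dot(frame, ref)) / n
        decoded.append('1' if energy > 0.1 else '0')
    return ''.join(decoded) == marker_bits
```

A marker near 19 kHz survives broadcast audio chains while being inaudible to most adults, which is roughly why that band gets used for this.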

The post specifically addresses why it's not that simple - they're taking advantage of harmonics to generate signal in the audible frequencies on the microphone itself.

Talking Behind Your Back (33c3)


33c3 was in 2016, for those like me who can't keep track of that.

Using ultrasound as a side-channel is also how Cisco's proximity sensors work for telepresence units and the new Spark Boards. While on a call on my phone, I can walk into any conference room and swipe to hand off the video call from my phone to the room's telepresence unit, even if my phone is not on wifi (e.g. because I just walked into the building or had wifi turned off). In my experience, this flows much more smoothly than any wifi– or bluetooth–based pairing.


Full disclosure: I work in that BU.

Could this phenomenon have caused the recent hearing loss among US and Canadian diplomats in Cuba?

That's actually a very good question. I'm very curious to learn what the actual cause of that was.

As the article says, there is a physical presence bar to meet to make this workable now, but I can't not think about clever ways that this could become a "worm" or spread digitally. Infected viral (actually viral!) YouTube videos? Robocalls that you hope go on speakerphone (where the fidelity of the attack signal may be questionable)? Or drive a car with big speakers around a neighborhood? How about a mobile phone botnet that just constantly speaks to digital assistants?

The speakers would have to be able to emit ultrasonic waves. I'm pretty sure that you need special hardware for this. (because the researchers themselves used a special ultrasonic emitter, not the speaker of the phone they used).

Good point. I wouldn't trust that the speaker on a mobile phone has a very good frequency response, but a TV or car stereo might be able to swing it.

The "solution" will be another layer of security hacks: passphrases (which anybody can trivially register), registration of known tone (which means the phone will stop accepting it in noisy surroundings)... another cat & mouse game starts and it's like we're back to 1970.

The range limitation seems possible to overcome too.

There are directional megaphones that can send sound for hundreds of meters. Directional megaphones send a beam of ultrasound that is focused on the surface where the sound is to be generated, so an attacker may be able to use a directional megaphone to beam the voice commands to the target's vicinity.

Suddenly all kinds of movie plot tricks could be used in heist movies now! Perp uses an air vortex cannon to press a door bell button from a distance and a directional megaphone to talk into the buzzer and then ... well, easy to get carried away :)

These speech recognition tools need to have some sort of authentication:

- How about having a secret "wake word" instead of "Alexa" or "Hey Siri"?
- Only treating signals in the human voice range
- Voice identification

If this isn't patched soon (excluding ultrasounds), it could mean that these tools are already using inaudible signals for other purposes. For example, commercials could add ultrasounds to know who's watching them.

> Only treating signals using human voice range

Intermodulation will let you create something the hardware can't tell isn't a normal audible signal. Two inaudible sounds, both received, end up looking like an audible one to the hardware.

The harmonic effect described in the article is similar, and damn hard to filter against physically.

> For example, commercials could add ultrasounds to know who's watching them

Yep, and we're already there. [0]

[0] https://www.blackhat.com/eu-16/briefings.html#talking-behind...

But if mics can pick it up, human ears will also be able to hear it.

Why? Most microphones have a greater range than your ears do.

Here's a Stack Exchange thread [0] exploring what sounds a smartphone can hear that a human can't.

So, going with combining two sounds to make something the mic thinks is audible, when it isn't:

Sound A is at 23 kHz, Sound B is at 22 kHz.

The human ear can't receive either, so is unlikely to notice anything at all.

But most smartphone microphones can pick up those signals, and together they produce a difference tone at 1 kHz, which is squarely in the audible range.


(Disclaimer: I'd have to investigate further to give a precise, or meaningfully correct answer, but the above outlines the theory).

[0] https://electronics.stackexchange.com/questions/59157/over-w...

You know, I'm ok with a physical button press as the authentication method.

Preferably some kind of hardware on-off switch for the physical microphone.

And camera

Most of the point of "Hey Siri" and similar features is for when your hands are occupied, soiled, or otherwise unavailable.

Even if the commands are not inaudible it's a security issue if the device performs a dangerous action because someone in the neighborhood said something.

The feature is presented as if it only responds to the owner but realistically that distinction doesn't work at all.

If you shout 'hey Siri' into the microphone at a large event, a lot of phones are going to respond.

> If you shout 'hey Siri' into the microphone at a large event, a lot of phones are going to respond.

Curious - anybody ever tried that?

Burger King had a TV ad[1] that triggered Google Assistant and Google Home ("OK Google, what is the whopper?")

1. https://www.theverge.com/2017/4/12/15277278/google-home-burg...

There are other issues with non-user-keyed audio triggers.

Stand outside a living room door and yell "Alexa, open the door" and it may open. Even before this finding, there was the ability to embed things in app/game audio or advertisements. This now lets commands be embedded there as well without being audible. Soon we'll have secret commands hidden in songs that try to turn on all your TVs and audio players and switch everything to play a band's song or YouTube channel.

Or just trackers.

Your TV can emit tones your various devices can pick up, sending back data on what content you're consuming.

Presumably, you could also broadcast an AM radio signal at just the right frequency that the wires in the microphone pick it up.

This has already been demonstrated on headphones[0].

[0] https://www.wired.com/2015/10/this-radio-trick-silently-hack...

Could get pretty sci-fi with this. Repeatedly send "Hey Siri" with a custom message in a crowded area. Watch people shake their heads whenever they check their phones.

Don't forget hiding audible commands inside noise: A/C startup, vacuums, any grinding/background noise we typically ignore can be a carrier.

Can't Alexa, et al simply look for a certain volume and duration in the audible range, which sounds like speech, in order to activate?

I think volume is the key here... The hackers induced harmonics in the audible range, but at a volume that was only perceivable by the microphone and not by the bystanders. So to me it seems that it's possible to prevent this by requiring a higher volume threshold to activate.

I am just worried the hackers will then make sounds that sound like noise, but at least it will be characteristic and alert the person (maybe).

Wait.. wouldn't the ear hear the harmonics as well, since it's also a membrane-based microphone?

"None worked farther than 5 feet away"

Well, that seems pretty easy to combat.

There are a lot of audio positioning options that very few have the expertise to apply. Reminds me of this Kickstarter:


Can't it be fixed by a simple low-pass filter?

You can't stop the phantom frequencies from appearing on the microphone, so the solution is to stop recognition when there are powerful ultrasonic waves and alert the user that something weird is happening.
