Phreaking is back.
"show me pictures of CENSORED"
"send the first picture to mom"
"yes, send it"
I take that to mean Siri is probably not voice-matched beyond the wake-up phrase, and, since this attack worked, it seems to be _not_ trained to the owner's voice at all.
That, or poorly trained. Don't know, can't verify.
Speech fingerprinting, much like actual fingerprints, is not a reliable way to establish identity, especially if you want to avoid false negatives, as you probably would in mass consumer-facing technology.
There is for new hardware. It's relatively simple to include a couple of components on the mic input to filter out ultrasound.
There is indeed no hardware fix for anything already manufactured.
That said, it may be possible to detect the difference between human speech and the ultrasonic trick with a machine learning solution.
Edit: retracting the LPF suggestion; as discussed below, the demodulation happens in the microphone itself, before any filter could act.
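A rough sketch of what such a detector might key on, assuming the mic's sample rate preserves some near-ultrasonic content; the band edges and the simple energy-ratio heuristic are my assumptions, a stand-in for an actual trained model:

```python
import numpy as np

def ultrasound_suspicion(audio, fs, voice_band=(100, 8_000), ultra_lo=18_000):
    """Ratio of near-ultrasonic energy to voice-band energy. A large ratio
    hints that the "audible" content may be a demodulation artifact."""
    power = np.abs(np.fft.rfft(audio))**2
    freqs = np.fft.rfftfreq(len(audio), d=1/fs)
    ultra = power[freqs >= ultra_lo].sum()
    voice = power[(freqs >= voice_band[0]) & (freqs <= voice_band[1])].sum()
    return ultra / (voice + 1e-12)
```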
The idea is that if you want to create a frequency of "A", you can emit two powerful tones at frequencies "B" and "B+A", where the frequency B is high enough to be out of hearing range. The non-linearity of the microphone means the two tones mix together to produce a number of other frequencies, including the frequency "B+A"-"B" = "A".
Thus the conversion from ultrasonic to audible happens in the microphone itself, before the software has a chance to tell the difference. The mixing process typically produces frequencies other than "A", so there might be hope of a countermeasure if the microphone is able to pick up these other frequencies and the software is smart enough to use them to figure out that an attack is in progress. But it's not a simple case of just filtering out a particular frequency, and an intelligent choice of ultrasonic frequencies may leave only a single frequency in the band of the microphone.
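A minimal numpy sketch of that mixing effect, in case it helps; the quadratic term is a toy stand-in for a real microphone's non-linearity, and the frequencies are arbitrary examples:

```python
import numpy as np

fs = 192_000                        # sample rate high enough to represent the ultrasound
t = np.arange(int(fs * 0.1)) / fs   # 100 ms of signal

B, A = 30_000, 1_000                # two inaudible tones at B and B+A (Hz)
x = np.sin(2*np.pi*B*t) + np.sin(2*np.pi*(B + A)*t)

y = x + 0.1 * x**2                  # toy microphone non-linearity (small quadratic term)

spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1/fs)
band = (freqs > 100) & (freqs < 20_000)        # audible band, excluding DC
print(freqs[band][np.argmax(spectrum[band])])  # ~1000 Hz: the difference tone "A"
```

The linear part of y contains only ultrasonic components; the audible 1 kHz tone exists solely because of the quadratic term.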
It's the same principle that is used in ultrasonic beamforming speakers. That adds another element of stealth to the attack, in that the high frequencies can allow the sound to be beamformed and illuminate the microphone and not much else.
No, he's talking about harmonics. It's a different effect from intermodulation. It's true that intermodulation involves the sum and difference of two or more frequencies. Harmonics, however, involve integer multiples of a single frequency.
But the impact is the same as intermodulation in that it's really a hardware issue and cannot be countered using a simple frequency filter.
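For concreteness, feeding two tones through a textbook quadratic non-linearity shows both effects side by side (a standard expansion, not taken from the article):

```latex
% y = x + \alpha x^2 driven by x = \cos\omega_1 t + \cos\omega_2 t gives
\alpha x^2 = \alpha\left[\,1
  + \tfrac{1}{2}\cos 2\omega_1 t + \tfrac{1}{2}\cos 2\omega_2 t      % harmonics
  + \cos(\omega_1{+}\omega_2)t + \cos(\omega_1{-}\omega_2)t \right]  % intermodulation
```

The 2w terms are the harmonics; the w1+-w2 terms are the intermodulation products, and with two ultrasonic inputs only the difference term lands in the audible band.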
Equation 2 in the paper and the subsequent paragraph show what is going on. They use an ultrasonic carrier with modulation. The non-linearity causes the carrier to mix with the sidebands, the third-order intermodulation product being a copy of the modulation centred on 0 Hz (i.e. a baseband signal).
Edit: Figure 12 talks about harmonics, in the context of harmonics of the third-order intermodulation product. What they are really referring to are the higher-order (5th, 7th, and so on) intermodulation products, which in this case will be multiples of the third-order product's frequency.
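A companion numpy sketch of that demodulation path, again with a toy quadratic non-linearity standing in for the microphone; the carrier frequency and test tone are arbitrary choices, not values from the paper:

```python
import numpy as np

fs = 192_000
t = np.arange(int(fs * 0.05)) / fs

fc = 30_000                          # ultrasonic carrier
m = np.sin(2*np.pi*400*t)            # stand-in for the voice (baseband) signal

s = (1 + 0.5*m) * np.cos(2*np.pi*fc*t)   # amplitude-modulated ultrasound
y = s + 0.1 * s**2                       # toy microphone non-linearity

spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1/fs)
band = (freqs > 100) & (freqs < 20_000)
print(freqs[band][np.argmax(spectrum[band])])  # ~400 Hz: the modulation, recovered at baseband
```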
Does this work for ears, too?
If so, are the non-linearities of different people's ears similar enough that two people hearing the same A and B would get the same results, or would person to person variations in non-linearity mean they might hear different results?
It's reasonably consistent. Differences in non-linearity will result in different amplitudes for each intermodulation product, but not different frequencies. Typically these systems use the "third order" product. I gather that the non-linearity exploited is as much a property of the air as the ear.
If someone is listening to a live musical instrument that is producing both audible sound and ultrasonic sound, is what the person perceives affected by intermodulation in the ear?
If the performance is also recorded using a technology that for all practical purposes reproduces perfectly everything in the audible range, then I can see a couple possible cases.
1. The microphone is designed to filter out ultrasonics or is sufficiently linear to not have intermodulation.
In this case, the recording is what would be heard with no intermodulation. When played back, all the listener gets is the audible portion of the original sound, without any ultrasonics. Thus there is nothing to produce intermodulation in the listener's ear, and so the listener might perceive the recording as having a different timbre than the live instrument.
2. The microphone does not filter ultrasonics and is non-linear enough to have intermodulation. The audible intermodulation products will then be included in the recording.
When played back the listener will hear intermodulation products, but they will be the ones from the microphone's non-linearity, not the ear's non-linearity.
The question then is how close are microphone non-linearities to ear non-linearities. If they are similar, then the timbre of the recording should match live. If they are sufficiently different, the timbre could sound off.
It should be possible to design a system that records only audible frequencies and plays back only audible frequencies and sounds identical to live, but it may require specifically taking into account ultrasonics instead of just cutting them out like I think we currently do.
A trumpet with a Harmon mute playing a quiet note has about 2% of its energy above 20 kHz. Playing a loud note drops that to about 0.5%. A cymbal crash is about 40% above 20 kHz. (Keys jangling are almost 70% above 20 kHz, which probably has something to do with why, back when TV remote controls were ultrasonic instead of IR or RF, people would report that jangling keys sometimes changed the channel.) See: https://www.cco.caltech.edu/~boyk/spectra/spectra.htm
The air acts as the demodulator though.
Shaping the ultrasound to modulate the eardrum sounds scary.
The attack they are using transmits the audio at a high frequency that, when detected by the microphone, generates harmonics within the normal range that pass through the filter. By the time the audio signal gets to the processor, it is within the audible range.
>> This will be fixed with a simple software update that ignores all sounds at inaudible frequencies.
In the most helpful and constructive way I can possibly say this directly: the group of individuals not understanding may include you, rachitgupta.
According to my very limited understanding, the attack occurs in hardware prior to digitization. See yesterday's discussion for more details.
You can also read the paper: https://petsymposium.org/2017/papers/issue2/paper18-2017-2-s...
Here is the blurb from their talk:
Cross-device tracking (XDT) technologies are currently the "Holy Grail" for marketers because they allow to track the user's visited content across different devices to then push relevant, more targeted content. For example, if a user clicks on a particular advertisement while browsing the web at home, the advertisers are very interested in collecting this information to display, later on, related advertisements on other devices belonging to the same user (e.g., phone, tablet).
Currently, the most recent innovation in this area is ultrasonic cross-device tracking (uXDT), which is the use of the ultrasonic spectrum as a communication channel to "pair" devices for the aforementioned tracking purposes. Technically, this pairing happens through a receiver application installed on the phone or tablet. The business model is that users will receive rewards or useful services for keeping those apps active, pretty much like it happens for proximity-marketing apps (e.g., Shopkick), where users receive deals for walk-ins recorded by their indoor-localizing apps.
SilverPush has since stopped, it seems, but that might just mean others are doing it more profitably than they were...
So the steps are:
1. Inject ultrasound markers into ad during post-production.
2. Have a server with multiple tuner cards to monitor multiple stations, grab the audio.
3. Filter audio on specific ultrasound frequencies, search for the pre-injected patterns (see the sketch below).
4. Generate reports.
5. Get paid.
Full disclaimer: I work in that BU.
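For the curious, a minimal sketch of what step 3 might look like, assuming the markers are simple on/off ultrasonic tones; the 19.5 kHz frequency, frame size, and threshold are made-up placeholders, not the real system's parameters:

```python
import numpy as np

def goertzel_power(frame, fs, f):
    """Signal power at a single frequency f (Hz), via the Goertzel algorithm."""
    k = round(len(frame) * f / fs)
    coeff = 2 * np.cos(2 * np.pi * k / len(frame))
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1*s1 + s2*s2 - coeff * s1 * s2

def decode_marker(audio, fs=48_000, f_marker=19_500, frame_ms=50, threshold=1e3):
    """Turn presence/absence of the marker tone in each frame into bits."""
    n = int(fs * frame_ms / 1000)
    return [int(goertzel_power(audio[i:i+n], fs, f_marker) > threshold)
            for i in range(0, len(audio) - n + 1, n)]
```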
There are directional megaphones that can send sound for hundreds of meters. They work by emitting a beam of ultrasound focused on the surface where the audible sound is to be generated, so an attacker may be able to use one to beam voice commands into the target's vicinity.
Suddenly all kinds of movie-plot tricks could show up in heist movies! The perp uses an air vortex cannon to press a doorbell button from a distance and a directional megaphone to talk into the buzzer, and then ... well, easy to get carried away :)
- How about having a secret "wake word" instead of "Alexa" or "Hey Siri"?
- Only processing signals within the human voice frequency range (see the sketch below)
- Voice identification
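For the second idea, a minimal band-pass sketch (the 85 Hz - 3.4 kHz voice band is an assumption); note that, as other comments explain, this alone can't stop the attack, because the conversion to an in-band signal happens in the microphone hardware before any such filter runs:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def voice_band_only(audio, fs, lo=85.0, hi=3400.0):
    """Keep only the rough human-voice band; attenuate everything else."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, audio)
```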
If this isn't patched soon (by filtering out ultrasound), it could mean that these tools are already using inaudible signals for other purposes. For example, commercials could add ultrasound to learn who's watching them.
Intermodulation will let you create something the hardware can't tell isn't real audio: two inaudible tones, both received, end up looking like an audible one to the hardware.
The harmonic effect described in the article is similar, and damn hard to filter against physically.
> For example, commercials could add ultrasounds to know who's watching them
Yep, and we're already there. 
Here's a StackOverflow thread exploring what sounds a smartphone can hear that a human can't.
So, going with mixing two sounds to make something the mic thinks is audible when it isn't:
Sound A is at, say, 25 kHz and Sound B is at 26 kHz.
The human ear can't receive either, so it is unlikely to notice them at all.
But a smartphone's microphone can still pick up those signals, and its non-linearity mixes them into a difference tone at 26 - 25 = 1 kHz, which is squarely in the audible range.
(Disclaimer: I'd have to investigate further to give a precise, or meaningfully correct answer, but the above outlines the theory).
The feature is presented as if it only responds to the owner, but realistically that distinction doesn't work at all.
If you shout 'hey Siri' into the microphone at a large event, a lot of phones are going to respond.
Curious - anybody ever tried that?
Stand outside a living room door and yell "Alexa, open the door" and it may open. Even before this finding, there was the ability to embed things in app/game audio or advertisements. This now also lets commands be embedded there without being audible. Soon we'll have secret commands hidden in songs that try to turn on all your TVs and audio players and switch everything to play a band's song or YouTube channel.
Your TV can emit tones your various devices can pick up, sending back data on what content you're consuming.
Well, that seems pretty easy to combat.