Personalized Hey Siri (apple.com)
117 points by dantiberian 9 months ago | 73 comments

My biggest problem with the homepod is that my 5 year old has learned that he can turn on _any song he wants_ whenever he wants by yelling "hey siri." He loves it! He yells at the dinner table, when his little sister is asleep, and when we are on the phone. His favorite phrase is "Hey Siri, set volume to 100%." His favorite song is the PJ Masks song, which Siri dutifully plays when called upon. Every. Single. Time.

I'd love it if HomePod recognized who was speaking, but I'd settle for Siri just recognizing "this is a kid's voice" and ignoring it!

Unfortunately “Hey Siri” starts to grate the 70th time you say it in 3 days. I would actually now rather walk across the room to touch the Homepod than have to say the phrase so much. It’s still in uncanny valley territory a lot of the time as it isn’t smart enough to keep context between interactions or react in a human-like way.

I think a retroactive activation feature would be the way to go, i.e. being able to say “Siri” at the end of a sentence and have it parse the phrase that came before. It feels more natural to say “skip this please Siri” than “Hey Siri, skip” because it lets you amend the mistake of not identifying the subject earlier in the sentence, rather than going through the phrase, exasperation, “Hey Siri”, phrase routine.

I asked Siri to DJ and my daughter said “hey Siri, I don’t like that song”. Siri said “I’ll remember that” and played something else. Now I have no idea how to amend Siri’s notion that I dislike one of my favourite tracks. There should really be a Homepod app where you can see recent interactions and correct misinterpreted ones.

There are also some artists and tracks with unusual spellings that are always going to cause problems for a voice interface and so they need a solution for that.

It could be worse. "OK Google" is annoying the first time you say it.

You can say “Hey Google” now, though it’s not quite as smooth as Alexa.

I recommend "Eh, Google!" 🇨🇦

Hard call when the functionality is far better with OK Google.

If you remember the song, you can play it again and say "Hey Siri, I love this song" to amend that. If you don't remember the song, you can go into Apple Music and view your votes and change from there.

Something about "Hey Siri, ..." is much more grating to me to say than "Alexa, ...". Maybe it's just that conjunction of phonemes being rare in English?

That's nothing compared to "Okay Google".

Try saying that three times fast. Actually, try saying it just once fast.

I always end up spitting out something like "Okaygggeellggglle"

It's utterly astonishing that Google still has the hubris not to give it a personal name.

People want to believe they're talking to a personal assistant, even if that isn't true, and Google is stupidly stripping them of that.

I think it’s just a branding thing. Notice how e.g. the app is named “Google Chrome” (but not “Mozilla Firefox”).

But in this case it doesn't work. I don't repeat the name "Chrome" 20 times a day like I will for a Google Home. This is a serious f*cking flaw and it actually makes me worry for Google's stock price that they can't see how obvious of a problem this is and/or are too paralyzed to fix it.

You can say "Hey Google".

The "Hey" gets extremely grating after the fifth time in a row.

Amazon has it right on this with Alexa.

I feel like this is a big missing piece for voice activated things. I've got an Echo device ~10ft from my TV, and it gets activated from Netflix (Parks & Rec mostly... I blame Rob Lowe) all the time. I should be able to go into the app and label it as a false activation so it can ignore it in the future.

The app already has a feature for that. It will keep a list of all the times it was activated and a recording you can play back of what phrase activated it. You can also tap a button to send feedback if it activated with the wrong phrase.

I'm sorry, this warmed my heart and terrified me at the same time. A hilariously adorable and obnoxious use of technology.

To clarify, this is about training your phone to recognize only "Hey Siri" as spoken by you, as opposed to letting you personalize your trigger phrase to something other than "Hey Siri".
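For anyone wondering what "recognize only 'Hey Siri' as spoken by you" could look like in practice, here is a minimal sketch, not Apple's implementation. It assumes each utterance has already been turned into a fixed-length "speaker vector" by some model (not shown), and that the device stores a few such vectors from enrollment. The function names, the 3-dimensional vectors, and the 0.8 threshold are all made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def accept_trigger(profile, utterance, threshold=0.8):
    """Accept the trigger if the utterance is, on average, close enough
    to the enrolled vectors. The threshold value is an assumption."""
    scores = [cosine(enrolled, utterance) for enrolled in profile]
    return sum(scores) / len(scores) >= threshold

# Hypothetical speaker vectors; real ones would have many more dimensions.
enrolled_profile = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]]
owner_attempt = [0.85, 0.15, 0.15]
other_speaker = [0.1, 0.9, 0.3]

print(accept_trigger(enrolled_profile, owner_attempt))
print(accept_trigger(enrolled_profile, other_speaker))
```

Raising the threshold trades fewer false accepts (someone else waking your phone) for more false rejects (your phone ignoring you).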

Yeah, they don't want to give users too much control; Siri is an important brand by itself and Apple doesn't want people going around skipping it.

Yes. In this case, the selection of a wake word ("Alexa") or phrase ("Hey Siri" or "OK Google") is driven primarily by performance (i.e., reduction of false accepts & false rejects). The folks designing these products know that, even though customers say that they'd like to customize the wake word or phrase, if the actual performance is significantly worse, those customers will not be happy.

We experimented with this extensively.
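For readers less familiar with the terms: a false accept is the device waking when nobody said the wake word, and a false reject is the device ignoring a real wake word. A rough sketch of how those counts roll up into precision, recall, and F1 is below. The function name and the counts in the example are invented for illustration and are not measurements from any product.

```python
def wake_word_metrics(true_accepts, false_accepts, false_rejects):
    """Return (precision, recall, f1) from raw detection counts.

    Any ratio with a zero denominator is reported as 0.0.
    """
    detected = true_accepts + false_accepts   # everything the device woke up for
    spoken = true_accepts + false_rejects     # every time the wake word was really said
    precision = true_accepts / detected if detected else 0.0
    recall = true_accepts / spoken if spoken else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 90 correct wake-ups, 10 spurious ones, 30 missed.
precision, recall, f1 = wake_word_metrics(90, 10, 30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

A candidate wake word that gets confused with everyday speech pushes precision down, while one that is hard to detect pushes recall down.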

If you can share, what kind of variation in F1 score were you seeing between different trigger words? I'd expect even within a single product family (i.e. Alexa/Echo/Amazon/Computer) there's probably significant variation.

I'd also be interested in the differences between a collectively-optimized wake word (without personal training) and one which the user personally trains against their voice à la the parent article. To what extent does that personal training supplement/supplant (or fail to do so) the "good enough for everyone" wake word kernels?

Nothing I can share, unfortunately.

But it's infuriating when you don't want to use the phrase and/or want to have a secret phrase. Better would be to allow customers to choose their own phrase and give immediate feedback, during setup, on how good it's likely to be.

There's such a difference, and it's so hard to find really high-performing ones, that you would be constantly telling customers, "that won't work well"...

We considered literally thousands of potential words. We also really didn't want anything like "hey" or "ok" to be required.

Note how long it took for "Computer" to be available - and you can imagine how many of us wanted that. I think it gives an idea of the difficulty.

Fair enough.

"Computer" should be the wake word, and everyone would be happy with that thanks to global, English interoperability in all places and contexts.

But thanks to the hubris of corporate branding and marketing departments we are unlikely to have this universally understandable command usable. Instead to use any light switch every consumer will need to ask what bloody set up the person has...

Alexa devices actually have "Computer" as an option. I did it for a while, because I dislike pretending a computer is a person ("Alexa") and I certainly don't want to address a company ("Amazon"). I ended up switching back to "Echo" after I realized that people in my house say "computer" too often for it to work well as a wake word.

Weak move by Amazon, "OK Computer" would have been so good of an option, also for being a small tribute to Radiohead.

The problem with that is that people use the word "computer" in natural speech far too often to make it useful as a wake word. You'll continually get false positives anytime someone in a sci-fi show says "check the computer" or "computer, tea, Earl Grey, hot".
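One crude way to get a feel for this is to count how often a candidate wake word turns up in ordinary transcripts. The sketch below does only that, on a made-up snippet of text. It is a text-level proxy and says nothing about acoustic confusability; the function name and the transcript are hypothetical.

```python
import re

def count_triggers(transcript, wake_word):
    """Count whole-word, case-insensitive occurrences of wake_word."""
    pattern = r"\b" + re.escape(wake_word) + r"\b"
    return len(re.findall(pattern, transcript, flags=re.IGNORECASE))

# Invented transcript standing in for TV dialogue and household chatter.
transcript = (
    "Check the computer. Computer, tea, Earl Grey, hot. "
    "My computer crashed again, so I asked Alexa for the weather."
)

for candidate in ("computer", "alexa", "echo"):
    print(candidate, count_triggers(transcript, candidate))
```

Run over a large body of everyday speech, a count like this gives a rough lower bound on how often a word would be heard when nobody means to address the device.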

The non-cynical viewpoint would be that "Hey Siri" is an easy phrase to detect because of how it's said, which isn't really that similar to many other phrases.

“Hey Siri” is actually a pretty challenging trigger phrase. It’s short (few phonemes), and it’s phonetically confusable with phrases like “Hey Serial is out” (could be “Hey Siri all is out”).

It's too bad, because I sure wish I could make the homepod use a different phrase.

The homepod knows to activate when I say hey siri, but my friend's phone activates instead of the homepod about 50% of the time when they say 'hey siri'.

At least Siri is easy to say internationally. I never manage to use the proper accent when I say "Okay Google".

On a separate note, I have yet to get Siri to transcribe "Isle of Dogs" correctly. It's always "I love dogs" which is quite ok by me :).

The title of that movie was chosen specifically to be a homophone for "I love dogs", so Siri's confusion is understandable. :)

In 1992 Apple gave me a T-shirt for supplying speech samples used to train their System 7 voice recognition. Below a picture of a bunch of apple cores strewn on a beach was the caption "I helped Apple wreck a nice beach!"

That's amazing.

"I helped Apple recognize speech" for those struggling.

It's also a place in the east end of London and has been for almost 500 years: https://en.wikipedia.org/wiki/Isle_of_Dogs

This is an interesting case because they’re phonemically the same (long-I, L, schwa, V) but may be realised differently phonetically depending on your accent†. I bring this up because I’ve always wanted to be able to tell speech recognition systems specific features of my accent, say through a series of questions (“Do you pronounce ‘cot’ and ‘caught’ the same or differently?”, &c.) in addition to ML training, because as it is, they persistently mix up certain words—“I’ll” and “all”, for instance.

† With “isle of”, I have two syllables and a velarised (“dark”) L in “isle”, a mid-central vowel (/ə/) in “of”; with “I love”, I have one syllable in “I”, a non-velarised (“light”) L and an open-mid back unrounded vowel (/ʌ/) in “love”.

Ah! Makes total sense now.

I was testing some Alexa stuff for work the other day and named a light "Lighty McLight Face". All Alexa would ever hear was "lightly lick my face".

You should be able to say "Hey Siri, listen carefully" brief pause "isle ... of ... dogs", or some such to have the device enter a special mode where it attempts to identify individual words more accurately.

What's wrong with just saying "isle … of … dogs"? Why would you need to enter a special mode before enunciating?

Pausing long enough to be sure the AI knows you're done pronouncing the previous words is often enough to trigger the end of the phrase. We want our AI assistant to be performant, after all!

Without a special mode, you'd likely say "Hey, Siri, play Isle..." "Okay, I found--" "Dammit, Siri, I wasn't done!"

You don't have to wait 5 seconds between words. You just have to enunciate clearly enough that the divisions between words become truly unambiguous.

It’s worth noting that, elsewhere in this thread, user smortaz wrote:

Yep, tried that too... but admittedly they sound very similar. I keep getting "I'll of dogs" and "I love dogs".

If they're getting "I'll of dogs" then they're getting the word divisions right, but the problem there is "I'll" and "isle" sound identical so no amount of enunciating will distinguish them. The only thing to suggest at this point is to spell out the word.

I've never heard about this special mode.

I meant to imply it should have this mode, not that it does.

For phonetically ambiguous utterances like this, it usually helps continuous speech recognition systems if you pause between words like “Isle. Of. Dogs.”

Yep, tried that too... but admittedly they sound very similar. I keep getting "I'll of dogs" and "I love dogs".

I've never gotten Siri to successfully play "Cuz It's Hot" by "My Life With The Thrill Kill Kult". Also the only place I have the song is "WaxTrax! Black Box: The First 13 Years (Disc 3)", which is also pretty much impossible to get Siri to play. Sigh.

You can tap anything that Siri hears you say and fix it. Unfortunately, that means you might end up with the opposite problem in the future.

I just tried "Hey Siri, where is 'Isle of Dogs' playing" and it got it first time.

That’s easier because the language model knows that “where is I love dogs playing” doesn’t make sense grammatically, whereas “where is <movie name> playing” does. On the other hand, just saying “I love dogs” might be a reasonable thing you may want to text message or something.
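The idea can be sketched with a toy bigram language model that picks between two hypotheses the acoustic side cannot tell apart. Everything here is an assumption for illustration: the six-sentence corpus is hand-written to make the effect visible, the acoustic scores are treated as tied, and real recognizers use far larger models. It is not how Siri works internally.

```python
import math
from collections import Counter

# Tiny hand-written corpus, purely illustrative.
CORPUS = [
    "where is the movie playing",
    "where is isle of wight",
    "isle of dogs is a movie",
    "i love dogs",
    "i love you",
    "i love that movie",
]

def train_bigrams(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])            # every word that acts as a context
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a sentence."""
    vocab_size = len(unigrams) + 1             # context words plus "</s>"
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = unigrams[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

def pick(hypotheses, unigrams, bigrams):
    """Choose the hypothesis the language model finds most likely."""
    return max(hypotheses, key=lambda h: log_prob(h, unigrams, bigrams))

UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)

print(pick(["where is isle of dogs playing",
            "where is i love dogs playing"], UNIGRAMS, BIGRAMS))
print(pick(["isle of dogs", "i love dogs"], UNIGRAMS, BIGRAMS))
```

With this corpus the model prefers "isle of dogs" inside the question but "i love dogs" when the phrase stands alone, which is the asymmetry described above.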

You could also give the recognition system an additional hint by dropping in "the movie", "the play", etc. -- "where is the movie 'Isle of Dogs' playing?", "where is the play 'Death of a Salesman' playing?"

Cool little insight into their progress.

I'm excited for the day Siri can differentiate my voice and my wife's so we can use a single HomePod with two iCloud accounts. For example, Siri reading off calendar appointments from an iCloud account based on who's asking.

This feature is one of my favorite things about the Google Home.

That feature is blocked not by technical limits but by arbitrary ones. Whether it is a family iPad, Apple TV or HomePod, Apple strongly believes in 1:1 device-person (or more accurately, device-iCloud account) mapping.

Macs do multi account quite well, and iOS/iPad does it for schools. I’m not sure why Apple hasn’t at least made iPads for consumers have the same multi-iCloud-account feature available, it would make things a lot easier at home!

The HomePod was designed not to be a single-user device. Especially after reading the OP article, I think it's more of a technical limit than anything because you could potentially give away information that didn't belong to a user if the user isn't detected correctly.

What's the technical reason to limit the wake words to only "Hey Siri", "OK Google", etc?

Should one be able to "personalize" it to his/her own phrase per device, such as "Hey Commander Data" for the iPhone 6, "Hey Counselor Troi" for the iPad Pro, or "OK Skynet" for the Google phone?

That should also solve the problem where someone says "OK Google" in a TV show or podcast and wakes up all the Google devices.

With the latest "Deep Voice" tech, the device should be able to respond in the actors' voices too.

I remember seeing a clip of a Chinese TV show where the actress said (in Chinese) "Honey, Honey, where are you?" and the phone on the chair woke up and said "Honey, I am here, I am here." Tons of folks in the YouTube comments asked "What kind of phone is that? I want it!" I told my wife about it and she immediately said she wanted it too.

I still don't know which brand of Chinese phone actually has that feature or if it was just the imagination of the TV show writer. Someone should build it, and I know my wife wants it.

I don't work on such projects, but on other ML projects, and I suspect the answer is: training data.

There are many, many, many samples of people saying the approved trigger phrases. Many, many, many samples. To use your own custom trigger phrase, you would likely need to say it thousands of times to begin to approach the accuracy the current systems ship with out of the box. Maybe tens of thousands.

That's just my conjecture, though.

I would love this feature if they found a way to implement it without draining the battery by listening constantly and trying to recognize 'Hey Siri' from all the background noise. It's especially bad when you play music nearby!

Are you sure you’re actually getting your battery drained from this? The phrase spotting happens in a super low power processor which can run without the rest of the phone doing anything.

I remember one day I played music through some speakers connected to my phone for about 6 hours and the battery usage showed 40% for music and 40% for Siri. I put two and two together and figured the music was keeping Siri awake.

I'm not convinced this Siri thing is really that revolutionary until I can invoke something on my phone by screaming "AYYYYYYY MUST BE THE SIRIIIIII"

God this is depressing. Only 3 years behind Amazon and Google.

Really? I'm pretty sure "Alexa" is known for responding to the TV.

When it comes to Apple being behind in machine learning, perception is all that matters. Technical explanations about how Apple makes choices that preserve privacy are irrelevant. Actual head-to-head tests showing that Apple is ahead in some areas and behind in others are irrelevant. All that matters is that we all know that Apple is behind in machine learning.
