
How Alexa knows “peanut butter” is one shopping-list item, not two - georgecarlyle76
https://developer.amazon.com/blogs/alexa/post/36ca7d4c-cd98-40a9-a9c5-0cde2ab922ab/how-alexa-knows-that-peanut-butter-is-one-shopping-list-item-not-two
======
prions
I implemented a similar BiLSTM-CRF model at my current job. The architecture
itself is really interesting, but it runs into scaling issues. With LSTMs, you
run into the constraint of having to wait on previous inputs and cache those
results as well. Although TensorFlow now offers CUDA RNNs and fused kernels to
speed up computation, I'd have thought that at Amazon's scale, an
attention/transformer-based architecture would serve them better.

I also notice a lot of dismissive comments about "black box models" or the
simple solutions of just parsing out whitespace. My two cents:

1\. Models with hand-crafted rules perform WORSE than learned representations,
especially when you have an end-to-end model with pretrained embeddings. This
is shown by one of the seminal papers on this model, Ma and Hovy (2016)
[https://arxiv.org/pdf/1603.01354.pdf](https://arxiv.org/pdf/1603.01354.pdf).

"However, even systems that have utilized distributed representations as
inputs have used these to augment, rather than replace, hand-crafted features
(e.g. word spelling and capitalization patterns). Their performance drops
rapidly when the models solely depend on neural embeddings"

2\. Human speech and human written text are messy. Having a rule for human
speech will inevitably lead to a massive list of rules and exceptions to those
rules.

3\. This model is multi-domain, meaning that you don't just need rules for one
domain, but rules for multiple domains and interactions between those domains.
Considering Amazon's hefty amount of data, it's much more efficient to learn
these representations through a machine learning model rather than constantly
playing cat-and-mouse with keeping your hand-crafted rules up to date.
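
For readers who have not met the CRF half of a BiLSTM-CRF, here is a minimal sketch of Viterbi decoding over per-token tag scores. Everything in it is an assumption made for illustration: the `B`/`I` tag set, the emission scores (standing in for what a BiLSTM would output), and the transition scores are all made up, and this is not Amazon's implementation.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one {tag: score} dict per token (hypothetical BiLSTM outputs).
    transitions: {(prev_tag, cur_tag): score}; missing pairs score 0.0.
    """
    # best[i][tag] = (best score of any path ending in tag at token i, previous tag)
    best = [{t: (emissions[0][t], None) for t in tags}]
    for em in emissions[1:]:
        row = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + transitions.get((p, cur), 0.0))
            score = best[-1][prev][0] + transitions.get((prev, cur), 0.0) + em[cur]
            row[cur] = (score, prev)
        best.append(row)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for row in reversed(best[1:]):
        tag = row[tag][1]
        path.append(tag)
    return path[::-1]

TAGS = ["B", "I"]  # B = begins an item, I = continues the current item
# Made-up scores for the tokens "peanut", "butter", "milk".
EMISSIONS = [
    {"B": 2.0, "I": 0.0},
    {"B": 1.0, "I": 0.9},  # "butter" alone looks slightly more like a new item
    {"B": 2.0, "I": 0.1},
]
TRANSITIONS = {("B", "I"): 0.5, ("I", "I"): -1.0}

decoded = viterbi(EMISSIONS, TRANSITIONS, TAGS)
greedy = [max(TAGS, key=em.get) for em in EMISSIONS]  # per-token argmax, no CRF
```

With these numbers the greedy per-token choice splits "peanut" and "butter", while the sequence-level decode keeps them together, which is the kind of interaction the CRF layer is there to capture.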

~~~
L2R
Spot on. Just a heads up, there is a decent amount of work on using
convolutions to condense the initial representations, which can reduce
computation time in proportion to your max pooling. A lot of these tasks can be
done via hyperparameter search over CNNs, so you can easily reach parity using
a CNN-LSTM approach w/ the same number of parameters.

------
NicoJuicy
In text parsing, this whole machine learning approach would be implemented as
"ignore white space as delimiter" (that's how I implemented it on one of my
projects).

Ps. I'm aware this will not be a popular comment

~~~
jchw
Sentences spoken aloud don't have commas.

~~~
coldtea
You don't need commas to do that in text parsing either. Newlines, for
example, will do.

When spoken, a shopping list is not a sentence. There's a small pause and/or
different emphasis on the start of each item that can be learned (humans, for
one, can discern it).

"Eggs milk peanut-butter" sounds different than "Eggs milk peanut butter".

(Besides it can easily learn that peanut, singular, is not a thing people
order: it's either "peanuts" or "peanut-butter" etc).

~~~
halflings
You're basically saying that they should build a speech recognition system
that links words that should belong together with a hyphen... great then:
that's exactly what this article is about.

~~~
Dylan16807
No. They're not saying anything about "words that should belong together".
They're saying two things:

1) There is a pronunciation difference between one item with two words, and
two items with one word each.

2) You can also use per-word information here, because "peanut" is not
something that goes on shopping lists.

------
smoser
Amazon is pulling ahead. Siri still adds 'All man milk' to my shopping list
instead of almond milk.

~~~
tacomonstrous
Siri has been behind Google and Alexa basically since the beginning and shows
little sign of catching up.

~~~
master_yoda_1
This article says it all [https://www.theinformation.com/articles/the-seven-
year-itch-...](https://www.theinformation.com/articles/the-seven-year-itch-
how-apples-marriage-to-siri-turned-sour)

------
chamanbuga
This is surprising, I didn't know Alexa was capable of this. Whenever I say
"Alexa, add milk, eggs, bread, & laundry detergent to my shopping list" it
shows up as one long sentence in a single entry.

~~~
BookmarkSaver
It's a fairly new feature.

------
eesmith
What about "Alexa, add fork handles to the shopping list"?
[https://www.youtube.com/watch?v=gi_6SaqVQSw](https://www.youtube.com/watch?v=gi_6SaqVQSw)
.

~~~
jcranendonk
You never know when you might need some new fork handles.

------
quirkot
This is only surprising because a human would parse the waveform as two
words and then back into a single concept. A computer could parse it as a
single concept directly, or in syllable-length chunks to be reassembled
however it likes.

~~~
stan_rogers
A human really wouldn't. It's the same thing as telling the difference between
_black bird_ and _blackbird_ in running speech; _peanut butter_ may be spelled
as two words, but it's spoken as a single thing. If the concept was new enough
that you were consciously talking about a type of butter made from (of all
things) _peanuts_ , it would be a _black bird_ vocal entity rather than a
_blackbird_ one.

~~~
akira2501
It's true that we alter our speech to provide context clues, but it is also
true that without them we're _still_ capable of piecing them together.

If someone says in an unnaturally drawn-out way "I like peanut butter
sandwiches", then I will have no problem detecting the situation and then
re-parsing it correctly.

~~~
behringer
The space is irrelevant. Consider the sentences:

The black bird ate seeds.

The blackbird flew at mach 3.

Your brain thinks of these two words completely differently and it's only
through conscious effort that you think of them together. They are different
words even though they sound and are spelled the same, regardless of the
space.

A better example I think is "bear feet" vs "bare feet"

~~~
xioxox
I'm not convinced. It takes effort for me to break apart blackbird into two
separate words in my head, as they are so commonly found together. When
speaking "black bird" I would insert a long pause between the two and
emphasise the "b" on bird to show that I'm not talking about a "blackbird".

~~~
qlm
The difference for me (maybe this is regional?) is that for “blackbird” the
stress is on “black” whereas for “black bird” the stress is on “bird”.

~~~
stan_rogers
Yes, it's mostly about the stress.

------
NeonVice
I recently heard about a book called "Anything You Want" and wanted to add it
to my shopping list. I tried repeatedly saying "Alexa, add 'Anything You Want'
book to my shopping list" and every variation thereof. Alexa was unable to add
that book title no matter what I tried.

------
munificent
This seems like a fundamentally hard problem. If I ask Alexa, "Play songs by
Simon and Garfunkel", I may want to include their solo work ("Play songs by
([Paul] Simon) and ([Art] Garfunkel)") or not ("Play songs by (Simon and
Garfunkel)"). The choice is probably more likely for some artist groups than
others. It may even vary by user. It's hard to imagine a single trained AI
that can handle that variance without a ton of very quickly-changing domain
knowledge.

~~~
exogen
In my experience doing stuff like this for artist/song record linkage, the key
is really to take a "query expansion" approach rather than a "normalization"
approach, because choosing a single normalized form is impossible. So it's
better to embrace that there are dozens, hundreds, or even thousands of
interpretations and choose probabilistically.

A great example is trying to deal with the "sort name" of artists: e.g.
"Presley, Elvis".

It's easy to assume that "Hazlewood, Lee & Nancy Sinatra" means "Lee Hazlewood
& Nancy Sinatra".

How about "Sinatra, Frank & Nancy"? Now the rules are different: the expansion
could either be "Frank Sinatra & Nancy Sinatra" (correct) or "Frank Sinatra &
Nancy" (but there's no singer who just goes by "Nancy", or is there?)

Now how about "Peter, Paul & Mary"? In that case it's already the literal
expanded form referencing three people, not two people named "Paul Peter &
Mary Peter" or "Paul Peter & Mary".

So, you just assume they are all possible and rank them based on real-world
data. You're right, not always easy!

(Treating them as an unordered bag of tokens can either help or hurt accuracy
– that has its own problems when you consider how short and similar many
titles are, and how some artists deliberately name themselves as jokes/riffs
on a more famous one. Not to mention that after all this it could still be
ambiguous: MusicBrainz knows about six artists all named "Nirvana". So context
is key!)
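
A minimal sketch of the "expand, then rank" idea described above. The function names and the play counts in `KNOWN` are hypothetical stand-ins for real-world data, and the three rewrite rules only cover the examples in this comment.

```python
def expand_sort_name(raw):
    """Return every candidate reading of a sort-name string, without picking one."""
    candidates = {raw}  # the literal reading, e.g. "Peter, Paul & Mary"
    if ", " in raw:
        last, rest = raw.split(", ", 1)
        if " & " in rest:
            first, other = rest.split(" & ", 1)
            # "Hazlewood, Lee & Nancy Sinatra" -> "Lee Hazlewood & Nancy Sinatra"
            candidates.add(f"{first} {last} & {other}")
            # "Sinatra, Frank & Nancy" -> "Frank Sinatra & Nancy Sinatra"
            candidates.add(f"{first} {last} & {other} {last}")
        else:
            # "Presley, Elvis" -> "Elvis Presley"
            candidates.add(f"{rest} {last}")
    return candidates

# Hypothetical play counts standing in for "real-world data".
KNOWN = {
    "Elvis Presley": 900,
    "Lee Hazlewood & Nancy Sinatra": 120,
    "Frank Sinatra & Nancy Sinatra": 300,
    "Peter, Paul & Mary": 500,
}

def rank(raw):
    """Order candidates by known popularity; unknown readings sort last."""
    return sorted(expand_sort_name(raw), key=lambda c: (-KNOWN.get(c, 0), c))

best_guess = rank("Sinatra, Frank & Nancy")[0]
```

Nothing is normalized away: the wrong readings ("Paul Peter & Mary Peter") stay in the candidate list and simply lose on the ranking.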

------
baybal2
This yet again shows just how far computer science is from real natural
language processing, despite all the "AI" companies' claims.

Unless they have everything hardcoded, even such natural things are
impossible to process for "natural" language processing programs.

All cloud "AI" and "natural" language processing services should really be
called "lots and lots of hardcoded stuff language processing"

~~~
blattimwind
State of the art virtual assistants offer little intelligence over a command
line interface, except instead of typing the command line in, you say it.
Besides that, not much difference; the syntax is rigid and the computer
doesn't understand your utterance more intelligently than GCC understands "gcc
-o my_prog my_prog.c".

~~~
kevin_thibedeau
My girlfriend routinely asks questions with convoluted grammar and gets a
relevant response much of the time. I'm always amazed when it happens.

~~~
taeric
Examples? I seem to constantly get nothing from simple grammar questions. :(

~~~
PascLeRasc
I use a Google Home, not Alexa, but I ask it oddly-worded things all the time.
Here's two from the past week that worked: "Can you see if Reply All has a new
episode and if so play it?" "Can I still use these green peppers I got two
weeks ago?"

~~~
taeric
That first one sounds promising. I'll have to see if I can adapt it.

My devices are basically just gateways to audible, radio, and general timers.
I have begun using the announcement features, but it is amusing to see the
kids basically having announcement wars.

------
master_yoda_1
How is the model adversarial? Also, the best configuration was already found
two years ago in an ACL paper
[http://www.aclweb.org/anthology/N/N16/N16-1030.pdf](http://www.aclweb.org/anthology/N/N16/N16-1030.pdf).
Isn't it cheating when you claim somebody else's result as your own? The
industry is full of fraudsters nowadays.

------
cphoover
Because who orders a single peanut.

~~~
entity345
Well this is not the key indicator. The key indicator is the absence of pause
between the words, that's how we differentiate between "peanut butter" and
"peanut, butter".

~~~
IshKebab
Far more important than the lack of a pause between the words is the a priori
fact that "peanut butter" is a common single item and "peanut, butter" is an
uncommon list of items. It is that fact that means that you _require_ a pause
between the words to indicate "peanut, butter".

If you ordered "butter, peanuts" for example, it would probably get that it
was two items even without the pause between words.

It's all about the prior probabilities.
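
A back-of-the-envelope sketch of that argument using Bayes' rule. All the probabilities below are made up for illustration; the point is only how a strong prior can outweigh the pause evidence.

```python
def posterior_single_item(prior_single, p_pause_if_single, p_pause_if_two, heard_pause):
    """P(the two words are ONE item | whether a pause was heard), by Bayes' rule."""
    like_single = p_pause_if_single if heard_pause else 1 - p_pause_if_single
    like_two = p_pause_if_two if heard_pause else 1 - p_pause_if_two
    numerator = prior_single * like_single
    return numerator / (numerator + (1 - prior_single) * like_two)

# Assumed: pauses are rare inside an item (10%) and common between items (80%).
# "peanut" + "butter": almost always one item a priori, and a pause WAS heard.
peanut_butter = posterior_single_item(0.99, 0.10, 0.80, heard_pause=True)

# "butter" + "peanuts": almost never one item a priori, and NO pause was heard.
butter_peanuts = posterior_single_item(0.01, 0.10, 0.80, heard_pause=False)
```

With these assumed numbers, `peanut_butter` stays above 0.9 despite the pause, and `butter_peanuts` stays below 0.1 despite the lack of one.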

~~~
entity345
I would disagree here. Especially in the context of a shopping list.

If I were to say 'peanut' <pause> 'butter' my interlocutor would probably
interrupt me to ask for confirmation because that would be enough to create
doubt.

'Peanut' and 'butter' are two unrelated words and it is the absence of pause
that creates a pseudo single word.

Pauses are extremely important in spoken language and should be exploited.

~~~
ghaff
>If I were to say 'peanut' <pause> 'butter' my interlocutor would probably
interrupt me

I'm on the parent's side on this. "Peanut butter" is such a common item that
if someone were to pause in between the words I would assume they just got
distracted for some reason. In the context of a shopping list, "peanut"
singular just doesn't make sense.

A better example would be something like "yogurt ice cream" which is
technically incorrect but it's still something people might say. In that case,
I'd expect a shorter pause than in the case of yogurt, ice cream. However, if
you were dictating a list to me I'd probably ask for confirmation in either
case because there's enough ambiguity.

~~~
mamon
Deliberate pause sounds a bit different than a pause caused by distraction.

------
CupOfJava
What happens if I say "peanuts, butter"? I expect two items, but will Alexa
give one?

~~~
torstenvl
As others have said, there's the pluralization of "peanut[s]" to distinguish
between the two. This is a useful feature of English: the adjective-like role
of a noun in a complex noun phrase is (almost?) always singular.

    
    
        - Computer engineer
        - NOT computers* engineer
    
        - Toothbrush
        - NOT teethbrush*
    
        - Foot doctor
        - NOT feet* doctor
    
        - Alarm clock
        - NOT alarms* clock, even when it supports multiple alarms!
    

Additionally, there's phrasal intonation. If the intonation and stress
decrease throughout the phrase, it's a single item. If the intonation and
stress reset for "butter," then it's a new item.

    
    
        - 'PEA ,Nut but ter
        - 'PEA nut 'BUT ter

~~~
sakebomb
Alexa, please tell me about the...

Attorneys general Senators elect

Ahhhhhhh!

~~~
ksenzee
The difference is that these are phrases with adjectives, not nouns being used
adjectivally.

~~~
a1369209993
"Attorney generals" is a noun phrase (admittedly of questionable adjectivity).
"Attorneys general" is a blind idiot translation of a phrase in a language
with different grammatical rules (Latin, IIRC).

~~~
gowld
"Attorney" is a noun. "General" as used here is an adjective. It's unusual in
that the adjective follows the noun without a hyphen, but it's common enough,
and it's where prepositional phrases are seen, like "Big man on campus" and
"powers that be".

Did ancient Romans have attorneys general?

------
jeena
The German language is superior in this regard because there it would be
either "peanut, butter" or "peanutbutter".

~~~
gnulinux
How is this relevant? Peanut butter isn't spoken like "peanut, butter" it's
spoken like "peanutbutter".

~~~
ghaff
It depends. If I'm enunciating clearly, as I tend to do when I dictate to voice
assistants, I'm going to distinctly pronounce the "t" at the end of peanut and
the "b" at the beginning of butter, which tends to produce something of a pause
between the two words.

------
harshulpandav
What about "peanut butter cookies"?

~~~
drewmate
Homer: "Alexa, please order: peanut, butter cookie, d'oh!"

~~~
nikofeyn
in that case you'd get peanut, butter cookie, and dough, so i suppose you
could make a peanut and butter cookie bread. lol.

------
2sk21
Pretty much every entity detector that I have used operates on a similar
principle. Not sure what's new here.

~~~
cptskippy
While these types of blogs might not be revolutionary, they're still useful to
people new to the subject who might just be getting into search or are having
to implement lite search functionality into their applications.

I'm actually working on an application now where the initial spec called for
"search" and it was implemented as exact token matching. A bug was immediately
filed because searches for "wlk", "walk", and "walk event" all returned
different results.
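
A small sketch of the difference, using Python's standard `difflib`. The titles and the 0.6 cutoff are assumptions for illustration, not what the application above actually does.

```python
import difflib

TITLES = ["walk event", "bake sale", "fun run"]  # hypothetical searchable titles

def exact_token_search(query, titles=TITLES):
    """Match only when every query token appears verbatim in the title."""
    wanted = set(query.lower().split())
    return [t for t in titles if wanted <= set(t.lower().split())]

def fuzzy_search(query, titles=TITLES, cutoff=0.6):
    """Match when every query token is *close* to some token in the title."""
    hits = []
    for title in titles:
        words = title.lower().split()
        if all(difflib.get_close_matches(q, words, n=1, cutoff=cutoff)
               for q in query.lower().split()):
            hits.append(title)
    return hits

exact_results = [exact_token_search(q) for q in ("wlk", "walk", "walk event")]
fuzzy_results = [fuzzy_search(q) for q in ("wlk", "walk", "walk event")]
```

Here the exact version returns nothing for the typo "wlk", while the fuzzy version returns the same title for all three queries.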

------
cozzyd
I wonder what happens when you order 101 Dalmatians.

------
zachguo
Basic probability theory and information theory can fix most of the problem,
just calculate mutual information between adjacent words.
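
A minimal sketch of that approach, using pointwise mutual information (PMI) over adjacent words. The toy corpus and the threshold are made up; with so little data, ordinary neighbours such as "milk eggs" also get a positive PMI, which is why the threshold has to be tuned.

```python
import math
from collections import Counter

# Hypothetical past shopping-list utterances, already split into words.
UTTERANCES = [
    "peanut butter milk eggs",
    "milk eggs bread",
    "peanut butter bread",
    "butter eggs",
    "milk peanut butter",
    "eggs milk butter",
]

unigrams, bigrams = Counter(), Counter()
for utterance in UTTERANCES:
    words = utterance.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
total_unigrams = sum(unigrams.values())
total_bigrams = sum(bigrams.values())

def pmi(w1, w2):
    """log2 of P(w1 w2) / (P(w1) * P(w2)); -inf if the pair was never seen."""
    joint = bigrams[(w1, w2)] / total_bigrams
    if joint == 0:
        return float("-inf")
    p1 = unigrams[w1] / total_unigrams
    p2 = unigrams[w2] / total_unigrams
    return math.log2(joint / (p1 * p2))

def segment(words, threshold=2.0):
    """Glue a word onto the previous item when their PMI exceeds the threshold."""
    if not words:
        return []
    items = [words[0]]
    for prev, cur in zip(words, words[1:]):
        if pmi(prev, cur) > threshold:
            items[-1] += " " + cur
        else:
            items.append(cur)
    return items

shopping_list = segment("milk peanut butter eggs".split())
```

This only looks at adjacent pairs, so a three-word item like "peanut butter cookies" would need both of its pairs to clear the threshold.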

------
sct202
If you google Elijah Wood, under "People also ask:" one of the options is "is
elijah a wood" and I wonder if that's from text to speech that misunderstood.

~~~
adjkant
no that's a meme lol

------
sgustard
Great. Is "milk chocolate" one item or two?

~~~
Klathmon
and what would you add in this sentence:

"Alexa add paper towels milk and eggs to my shopping list" (punctuation
intentionally left out)

It's a really cool problem to try and solve, and while I don't have an Alexa,
I do have a google home which gets this kind of stuff right often enough that
I don't really think about it any more (and kind of laugh on the rare chance
it gets it wrong).

------
sam0x17
I'm pretty sure you could do this in a standard LALR(1) grammar and give
[peanut butter] precedence over [peanut] [butter].
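
A rough sketch of the precedence idea. Note that this is a greedy longest-match over a word list, swapped in for brevity; it is not an LALR(1) parser. The `LEXICON` contents are hypothetical.

```python
# Hypothetical list of known items; multi-word entries take precedence.
LEXICON = {"peanut butter", "paper towels", "peanut", "butter", "milk", "eggs"}
MAX_WORDS = max(len(entry.split()) for entry in LEXICON)

def longest_match(words):
    """Consume the longest known phrase at each position; unknown words pass through."""
    items, i = [], 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in LEXICON or n == 1:
                items.append(candidate)
                i += n
                break
    return items

parsed = longest_match("paper towels milk eggs peanut butter".split())
```

The catch is visible in the data structure: every multi-word item has to be listed by hand.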

~~~
alttab
This would require that you have an exhaustive list of priorities typed out in
a grammar, for each language. Word embeddings are more of a semi-supervised
approach. There is no way grammars could cover all the cases in a scalable
way.

~~~
sam0x17
True, so maybe the best approach is to use machine learning to generate the
exhaustive list of priorities based on labeled human speech. Then your end
product is something we can understand and tweak, instead of a black-box
neural network.

~~~
alttab
You'd have to human-label that speech. That already won't scale very well, and
requires per-language annotation.

Understanding and tweaking it should be done with hyper parameters, not
semantic libraries.

------
dugluak
Because Alexa knows that there is a thing called "peanut butter" and you most
probably meant that. I think humans behave in a similar fashion too. If you
are dictating a shopping list to a person who has never heard of "peanut
butter", they would most likely raise a question and ask you to spell it.

------
karmasimida
So IOBES style NER...how surprising...
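
For anyone unfamiliar with the scheme: IOBES tags mark each token as Inside, Outside, Begin, End, or Single. A minimal sketch of turning such tags back into items follows; the `ITEM` label and the example tagging are assumptions for illustration, not output from Alexa's model.

```python
def iobes_to_spans(tokens, tags):
    """Collect (label, text) spans from IOBES tags such as "B-ITEM" or "O"."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "S":  # single-token span
            spans.append((kind, token))
            current, label = [], None
        elif prefix == "B":  # start of a multi-token span
            current, label = [token], kind
        elif prefix in ("I", "E") and current and kind == label:
            current.append(token)
            if prefix == "E":  # span is complete
                spans.append((label, " ".join(current)))
                current, label = [], None
        else:  # "O", or a malformed sequence: drop any partial span
            current, label = [], None
    return spans

tokens = ["add", "peanut", "butter", "and", "milk", "to", "my", "list"]
tags = ["O", "B-ITEM", "E-ITEM", "O", "S-ITEM", "O", "O", "O"]
items = iobes_to_spans(tokens, tags)
```

The B/E pair is what makes "peanut butter" one entry, while the S tag makes "milk" its own.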

------
Moru
How come we are doing so much with machine learning but we still have
situations like "You have 1 things in your shoping wagon"

~~~
ultramundane8
I think the explanation there is as simple as "there are lots of different
people." I can't write software half as well as a significant percentage
(maybe a majority) of HN users, but the average person can't write software
half as well as me.

Of course, I could just be dead wrong. I'm just conjecturing while I
procrastinate.

------
wgpete
None of these AIs are actually AI. They are client-server voice
commands. True AI requires zero internet connection, which no commercially
included AI on any smartphone or tablet can offer today.

~~~
Analog24
"True AI" (or any kind of AI for that matter) is not defined by the presence
or absence of an internet connection.

------
tomphoolery
Did Alexa figure out how to tell when I'm saying "Computer" to _it_ vs. one of
my friends?

I want to be able to talk to my computer like in Star Trek, dammit.

