I also notice a lot of dismissive comments about "black box models" or simple solutions like just parsing on whitespace. My two cents:
1. Models with hand-crafted rules perform WORSE than learned representations, especially when you have an end-to-end model with pre-trained embeddings. This is shown in one of the seminal papers on this kind of model, Ma and Hovy (2016) https://arxiv.org/pdf/1603.01354.pdf.
"However, even systems that have utilized distributed representations as inputs have used these to augment, rather than replace, hand-crafted features (e.g. word spelling and capitalization patterns). Their performance drops rapidly when the models solely depend on neural embeddings"
2. Human speech and human written text are messy. Having a rule for human speech will inevitably lead to a massive list of rules and exceptions to those rules.
3. This model is multi-domain, meaning that you don't just need rules for one domain, but rules for multiple domains and interactions between those domains. Considering Amazon's hefty amount of data, it's much more efficient to learn these representations through a machine learning model than to constantly play cat-and-mouse keeping your hand-crafted rules up to date.
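For what it's worth, the segmentation task here is usually framed as BIO sequence tagging (the output scheme a Ma-and-Hovy-style tagger produces). A toy sketch of just the decoding step, with made-up tokens and tags; the actual tags would come from the trained model:

```python
# Decode BIO tags (B = begins an item, I = continues it) into item spans.
def bio_to_items(tokens, tags):
    items, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new item starts here
            if current:
                items.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continuation of the current item
            current.append(token)
    if current:
        items.append(" ".join(current))
    return items

# A tagger trained end-to-end would predict the tags; here they're given.
tokens = ["eggs", "milk", "peanut", "butter"]
tags   = ["B",    "B",    "B",      "I"]
print(bio_to_items(tokens, tags))  # ['eggs', 'milk', 'peanut butter']
```

The point is that the model only has to learn a per-token labeling, and the grouping falls out of the decode.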
Ps. I'm aware this will not be a popular comment
When spoken, a shopping list is not a sentence. There's a small pause and/or different emphasis on the start of each item that can be learned (humans, for one, can discern it).
"Eggs milk peanut-butter" sounds different than "Eggs milk peanut butter".
(Besides it can easily learn that peanut, singular, is not a thing people order: it's either "peanuts" or "peanut-butter" etc).
1) There is a pronunciation difference between one item with two words, and two items with one word each.
2) You can also use per-word information here, because "peanut" is not something that goes on shopping lists.
True, but I've never gone to the grocery store to buy a singular peanut.
I've bought a bag of peanuts, but not just one peanut.
Eggs milk cheese-buns
Eggs milk chicken-salad
pudding pudding pudding applesauce
pudding pudding pudding applesauce
pudding pudding pudding applesauce
applesauce applesauce applesauce
Back to what I really was getting at: I'm pretty sure the person I replied to was suggesting Alexa could just split(',') and call it a day. With text, yes. With voices this would be irritatingly unreliable. Everyone talks differently and sometimes people stumble weirdly. I am certain humans use a mix of vocal cues and interpretation to place the commas in their heads.
- ignore the comma; use the point and space as delimiters, and compare the values/entities against a dictionary, including neighbouring words.
- don't ignore the comma, and compare the values against a dictionary.
Put a priority (or, in machine learning terms, a classifier) on both outcomes, because the comma is not reliable in spoken language. So it would interpret "peanuts butter" as [peanuts, butter] and "peanut butter" as [peanut butter].
PS. Now I hope that speech-to-text transcribes a spoken "peanuts" correctly to [peanuts] and not [peanut], because that would fail.
PS2. The article itself doesn't mention the punctuation problem
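That two-hypothesis idea could look roughly like this (the item dictionary and scoring function here are invented for illustration; a real ranker would be learned):

```python
# Score two candidate segmentations of an utterance against a known-item list.
KNOWN_ITEMS = {"peanuts", "butter", "peanut butter", "milk", "eggs"}

def score(segmentation):
    # Fraction of segments that are known shopping-list items.
    hits = sum(seg in KNOWN_ITEMS for seg in segmentation)
    return hits / len(segmentation)

def choose(words):
    split_all = words              # hypothesis 1: every word is its own item
    joined = [" ".join(words)]     # hypothesis 2: one multi-word item
    return max([split_all, joined], key=score)

print(choose(["peanuts", "butter"]))   # ['peanuts', 'butter']
print(choose(["peanut", "butter"]))    # ['peanut butter']
```

In practice you'd score many candidate splits, not just two, but the shape of the idea is the same.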
It doesn't go into detail but it does seem to mention it.
>Off-the-shelf broad parsers are intended to detect coordination structures, but they are often trained on written text with correct punctuation. Automatic speech recognition (ASR) outputs, by contrast, often lack punctuation, and spoken language has different syntactic patterns than written language.
Understanding pauses/inflection changes doesn't have to be a "solved problem" to work for cases such as discerning common shopping list style items.
With voices this would be irritatingly unreliable. Everyone talks differently and sometimes people stumble weirdly.
That's an argument against discerning "milk" from "silk" or "coke" from "cork", but that's still managed satisfactorily enough.
Okay... But Alexa isn't just shopping lists. You only know you are dealing with a shopping list after parsing the text.
Even if you did go back, is the narrower use case any more solved than the general one? Guessing with text alone turns out to be fairly accurate and so even if you could do this decently, it would have to be notably better to be worth the trouble.
>That's an argument against discerning "milk" from "silk" or "coke" from "cork", but that's still managed satisfactorily enough.
Irrelevant to this though, considering that problem has mostly been solved at this juncture.
If someone says in an unnaturally drawn-out way "I like peanut butter sandwiches" then I will have no problem detecting the situation and re-parsing it correctly.
The black bird ate seeds.
The blackbird flew at mach 3.
Your brain thinks of these two words completely differently and it's only through conscious effort that you think of them together. They are different words even though they sound and are spelled the same, regardless of the space.
A better example I think is "bear feet" vs "bare feet"
A great example is trying to deal with the "sort name" of artists: e.g. "Presley, Elvis".
It's easy to assume that "Hazlewood, Lee & Nancy Sinatra" means "Lee Hazlewood & Nancy Sinatra".
How about "Sinatra, Frank & Nancy"? Now the rules are different: the expansion could be either "Frank Sinatra & Nancy Sinatra" (correct) or "Frank Sinatra & Nancy" (but there's no singer who just goes by "Nancy", or is there?)
Now how about "Peter, Paul & Mary"? In that case it's already the literal expanded form referencing three people, not two people named "Paul Peter & Mary Peter" or "Paul Peter & Mary".
So, you just assume they are all possible and rank them based on real-world data. You're right, not always easy!
(Treating them as an unordered bag of tokens can either help or hurt accuracy, which has its own problems when you consider how short and similar many titles are, and how some artists deliberately name themselves as jokes/riffs on a more famous one. Not to mention that after all this it could still be ambiguous: MusicBrainz knows about six artists all named "Nirvana". So context is key!)
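A rough sketch of the "rank them based on real-world data" step, with a made-up frequency table standing in for the real data:

```python
# Expand a "sort name" into candidate display names and rank by how often
# each candidate occurs in a (toy, invented) artist-frequency table.
ARTIST_COUNTS = {
    "Lee Hazlewood & Nancy Sinatra": 120,
    "Frank Sinatra & Nancy Sinatra": 45,
    "Frank Sinatra & Nancy": 0,
}

def expand(sort_name):
    last, rest = [s.strip() for s in sort_name.split(",", 1)]
    first, other = [s.strip() for s in rest.split("&")]
    return [
        f"{first} {last} & {other}",         # surname applies to first artist only
        f"{first} {last} & {other} {last}",  # surname shared by both artists
    ]

def best_expansion(sort_name):
    return max(expand(sort_name), key=lambda c: ARTIST_COUNTS.get(c, 0))

print(best_expansion("Sinatra, Frank & Nancy"))
print(best_expansion("Hazlewood, Lee & Nancy Sinatra"))
```

The "Peter, Paul & Mary" case still needs a third candidate (the literal string itself), which is exactly why you end up ranking everything rather than writing rules.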
Humans will also screw this up. They don't have statistics about which you most likely meant but they do have context which an AI may not.
Unless they have each and every case hardcoded, even such natural things are impossible for the "natural" language processing programs to process.
All cloud "AI" and "natural" language processing services should really be called "lots and lots of hardcoded stuff language processing"
AI is mechanical turks all the way down.
My devices are basically just gateways to audible, radio, and general timers. I have begun using the announcement features, but it is amusing to see the kids basically having announcement wars.
If you ordered "butter, peanuts" for example, it would probably get that it was two items even without the pause between words.
It's all about the prior probabilities.
I'm not a linguist or anything, but it seems like in practice people may pronounce "peanut butter" a little differently when they say the two words together. Something like "peanubutter". Or maybe they convert the "t" in "peanut" into a glottal stop.
Anyway, if the "t" is absent when you're talking about peanut butter but present when you're talking about two separate items, I don't see why you shouldn't feel free to use that signal. But I also don't see why you shouldn't use probabilities as well.
If I were to say 'peanut' <pause> 'butter' my interlocutor would probably interrupt me to ask for confirmation because that would be enough to create doubt.
'Peanut' and 'butter' are two unrelated words and it is the absence of pause that creates a pseudo single word.
Pauses are extremely important in spoken language and should be exploited.
I'm on the parent's side on this. "Peanut butter" is such a common item that if someone were to pause in between the words I would assume they just got distracted for some reason. In the context of a shopping list, "peanut" singular just doesn't make sense.
A better example would be something like "yogurt ice cream" which is technically incorrect but it's still something people might say. In that case, I'd expect a shorter pause than in the case of yogurt, ice cream. However, if you were dictating a list to me I'd probably ask for confirmation in either case because there's enough ambiguity.
Domain knowledge around the intent (making a shopping list) and around language conventions helps a lot, not only for NLU, but also for NLS.
But even without a huge NLU dictionary, a simple MC or CRF segmentation model would work. As said earlier, almost no one says, "Add peanut to the shopping list." They say, "Add peanuts", or if they're very atypical, "Add a peanut". A bare "peanut" is simply unlikely.
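A minimal sketch of what such a segmentation model boils down to: a dynamic program that picks the most probable split of the word sequence. The item frequencies are invented stand-ins for learned parameters:

```python
import math

# Invented per-item frequencies; a real model would estimate these from data.
FREQ = {"eggs": 50, "milk": 80, "peanut butter": 40, "peanuts": 20, "butter": 30}
TOTAL = sum(FREQ.values())

def logp(item):
    # Smoothed log-probability; unseen items get a heavy penalty.
    return math.log((FREQ.get(item, 0) + 0.001) / TOTAL)

def segment(words, max_len=2):
    # Dynamic program: best[i] = (score, segmentation) for words[:i].
    best = [(0.0, [])] + [(-math.inf, []) for _ in words]
    for i in range(1, len(words) + 1):
        for j in range(max(0, i - max_len), i):
            item = " ".join(words[j:i])
            cand = best[j][0] + logp(item)
            if cand > best[i][0]:
                best[i] = (cand, best[j][1] + [item])
    return best[len(words)][1]

print(segment(["eggs", "milk", "peanut", "butter"]))
```

"peanut" alone scores so badly that the decoder prefers merging it with "butter", with no hand-written rule about peanuts anywhere.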
I imagine lists with post-specifiers are near impossible to parse too:
Yeah, I should just look it up ...
- Computer engineer
- NOT computers* engineer
- NOT teethbrush*
- Foot doctor
- NOT feet* doctor
- Alarm clock
- NOT alarms* clock, even when it supports multiple alarms!
- 'PEA ,Nut but ter
- 'PEA nut 'BUT ter
Did ancient Romans have attorneys general?
A better question is "coconut, milk" versus "coconut milk".
Sure, but if you were dictating to a human that would still be an easy one for them to get wrong, depending on how long you paused.
I find this interesting with phone numbers. In some countries you hear people say "thirty three sixty two" and they mean 303602
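You can see the ambiguity by enumerating the possible digit-group readings (the word-to-value table here is a toy):

```python
# Each spoken chunk can stand for a one- or two-digit group; enumerating the
# combinations shows why "thirty three sixty two" is ambiguous.
VALUES = {"thirty": 30, "three": 3, "sixty": 60, "two": 2,
          "thirty three": 33, "sixty two": 62}

def readings(words):
    if not words:
        return [""]
    results = []
    for take in (1, 2):            # consume one word, or a two-word compound
        if take > len(words):
            break
        chunk = " ".join(words[:take])
        if chunk in VALUES:
            for rest in readings(words[take:]):
                results.append(f"{VALUES[chunk]}{rest}")
    return results

print(readings(["thirty", "three", "sixty", "two"]))
```

Both 303602 and 3362 (and a couple of hybrids) come out as valid parses, which is exactly the problem.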
- a coconut
- [number] coconuts
- shredded coconut
"Coconut" is best matched to that last option, but it's not a natural word choice. (Although it is a natural list entry... do people think of themselves as dictating to Alexa, or as writing the list themselves while happening to use their voice?)
So it turns into a question of how people think about dictating to Alexa.
If this is a precursor to being able to quickly voice order stuff off amazon to be delivered though it's a different story.
As for the phone number, that's why anyone in a serious occupation (aviation, military, etc.) treats each digit as stand-alone.
You'd be surprised: https://www.youtube.com/watch?v=HoPFQm9PQ_M
PS. I implemented something similar without machine learning and that's how I did it. With text it's easier though; I suppose in NLU it could have a parameter for "pause time between words" which could also contribute to a different conclusion.
I mean, "car insurance" is much better than "motor vehicle liability insurance" too.. ;-)
I'm actually working on an application now where the initial spec called for "search" and it was implemented as exact token matching. A bug was immediately filed because searches for "wlk", "walk", and "walk event" all returned different results.
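A quick sketch of how fuzzy token matching would have caught that bug (the document list is a toy; `difflib` is stdlib, but a real search stack would use stemming and an inverted index):

```python
import difflib

# Toy document index.
DOCS = ["walk event", "walking tour", "run club"]

def tokens(text):
    return text.lower().split()

def matches(query, docs, cutoff=0.75):
    hits = []
    for doc in docs:
        for q in tokens(query):
            # Fuzzy token match catches typos like "wlk" for "walk".
            if difflib.get_close_matches(q, tokens(doc), cutoff=cutoff):
                hits.append(doc)
                break
    return hits

for q in ("wlk", "walk", "walk event"):
    print(q, "->", matches(q, DOCS))   # each query now finds 'walk event'
```

With exact token matching those three queries return three different result sets; with a similarity cutoff they converge on the same document.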
"Alexa add paper towels milk and eggs to my shopping list" (punctuation intentionally left out)
It's a really cool problem to try and solve, and while I don't have an Alexa, I do have a google home which gets this kind of stuff right often enough that I don't really think about it any more (and kind of laugh on the rare chance it gets it wrong).
Understanding and tweaking it should be done with hyperparameters, not semantic libraries.
Of course, I could just be dead wrong. I'm just conjecturing while I procrastinate.
I want to be able to talk to my computer like in Star Trek, dammit.