
Siri's creators say they've made something better - jboydyhacker
https://www.washingtonpost.com/news/the-switch/wp/2016/05/04/siris-creators-say-theyve-made-something-better-that-will-take-care-of-everything-for-you/
======
wyc
According to an episode of Acquired[0], none of Siri's original technology is
present in Apple's current products. Siri originally licensed Nuance's (also
an SRI spin-out company) voice recognition technology and built some pretty
(comparatively) simple tech on top of it, never changing out the speech-to-
text engine. All current players (Google, Amazon, etc.) use neural nets, so
Apple scooped up many Nuance employees to double down and compete[1].
Rosenthal and Gilbert agreed that Siri was a failure: "the only thing I use
Siri for is setting alarms." I'd have to agree from personal experience. They
chalked the failure up to the disadvantages of a UI where the full extent of
the functionality is hidden and difficult to understand, leading to bad
experiences. I imagine this technology will have many of the same struggles.

[0]
[http://www.acquired.fm/episodes/2015/12/14/episode-5-siri](http://www.acquired.fm/episodes/2015/12/14/episode-5-siri)

[1] [http://9to5mac.com/2014/06/30/why-is-apple-hiring-nuance-
eng...](http://9to5mac.com/2014/06/30/why-is-apple-hiring-nuance-engineers-
apparently-to-replace-siris-nuance-powered-backend/)

~~~
skylark
I think the biggest problem with voice commands is that if the command doesn't
work the first time, it would have been faster to simply input the command
yourself. I'm not confident that "get me directions to <place>" will actually
get me those directions. Because of that, I also use Siri exclusively to set
alarms (which does work every time).

~~~
mwfunk
This is the #1 problem with anything based on voice recognition and natural
language processing IMO. It has to work 99% of the time to overcome this
issue, but it seems like the nature of the technologies ensure that it will
only asymptotically approach this threshold, and is currently stuck at 75%
reliability at best.
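That break-even intuition can be sketched with back-of-the-envelope numbers (all timings below are made up for illustration, not measurements):

```python
# A one-attempt model: try voice once; on failure, fall back to manual input.
def expected_time(p, t_voice, t_manual):
    """Expected seconds per task given voice success probability p."""
    return t_voice + (1 - p) * t_manual

def breakeven_reliability(t_voice, t_manual):
    """Success rate below which voice is a net loss versus just doing it manually."""
    return t_voice / t_manual

# If speaking a command takes ~10s and doing it by hand takes ~12s,
# voice has to succeed more than ~83% of the time to win on average.
print(breakeven_reliability(10, 12))  # 0.833...
print(expected_time(0.75, 10, 12))    # 13.0 -- slower than manual at 75%
```

The closer voice is to manual speed for a given task, the closer to 100% reliability it needs to be worth attempting at all.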

Plus, even if it did work, it's just a really inexact and inefficient way to
do anything. You always end up trying to see through the abstraction layer
(natural language in this case) to the much more well-defined hidden system
underneath. It's like reverse engineering an expert system, as a UI paradigm.

Programming languages made to resemble natural language face a similar poison
pill, AppleScript being a prime example.

I totally get the appeal of trying to make software understand people better
instead of forcing people to understand software, but in practice it just ends
up being one more abstraction layer users have to struggle to get past in
order to unlock the functionality of the underlying system.

~~~
modeless
The funny thing is that voice recognition is no longer the problem. For years
the reason you couldn't do voice command interfaces was that transcription
didn't work, but today transcription works amazingly well. My commands are
transcribed perfectly, even in my car when I'm doing 65 on the freeway with my
phone stuck in a cupholder.

The problem now is Android randomly kills the "OK Google" background listening
process, or it fails because it can't handle a handover between wifi and cell
data, or it "can't open microphone" even though it _just heard_ me say OK
Google, or any one of many other Android problems rears its ugly head.

The reliability of voice transcription now is far better than the general
reliability of Android on my Nexus 5X, and the Android team ought to feel
pretty bad about that.

~~~
jordanthoms
Exactly this - there is a bug where, if your phone is locked and in a position
where it thinks it's in landscape, saying 'OK Google' makes the phone unlock
in portrait, start listening, rotate to landscape (killing the app listening
to your voice in the process), and then rotate back to portrait and lock
itself again.

It's also incredibly easy to fall off the blessed path where you can interact
solely with voice - when setting a reminder, for example, if it doesn't hear
'Yes' when it's expecting to it'll just sit there and you need to touch the
screen to continue - defeating the whole point of voice interaction.

Google's voice search has incredible voice recognition and text to speech and
can tap into an amazing amount of information through the knowledge graph, but
they don't seem to be capable of fixing the basic bugs and UX issues
preventing all that technology from actually being usable.

~~~
Stratoscope
> It's also incredibly easy to fall off the blessed path where you can
> interact solely with voice...

I had this happen with Google Maps the other day. I was driving into San
Francisco with the turn-by-turn directions when it said something like, "There
is a faster route available in two miles. It will save 5 minutes. Tap 'Accept'
to take this route."

So I had to fiddle with my phone in a hurry, in traffic (after all, traffic
was why there was a faster route coming up), and hope I didn't tap the wrong
button or crash into anyone while looking for it.

Why couldn't I just have said "Yes! Please and thank you. Of course I want the
faster route, why wouldn't I? Oh, sorry, I mean 'Accept'!"

The funny part was that the "faster route" was the way I usually go into that
part of town anyway, but Maps had been sending me a different way because of
congestion on the usually-faster route. Why did it even ask me to "accept" the
faster route instead of just redirecting me the way it does automatically on
most occasions?

Like the time I was heading south on 101 through Morgan Hill and Gilroy and
Maps had me get off the freeway and take a side street for a couple of miles
because traffic was stopped on the freeway due to a car fire. It didn't ask me
to tap Accept then, it just gave me some very practical directions on the fly.
That's how it should work.

------
spyder
It sounds like they still have to partner with the different services to
create custom integrations with their APIs, so they can interact only with
these preprogrammed services. The really impressive AI feat would be if it
could figure out how to place an order with a service on its own: mimicking
how humans do it by filling out forms on web pages, writing e-mails, or making
phone calls for reservations. That would make the interaction look like a
regular order from the service provider's perspective, so there would be no
need to create partnerships for API integrations with each provider.

~~~
sorbits
Having bots fill out forms and write emails is optimistic, but Google already
publishes guidance on how to structure your (HTML) data to make it readable by
their bots, so they can extract useful information for their knowledge graph.

Ideally we would have a standardized declarative service description language
that “digital assistants” can use to not just answer questions (already
possible) but also purchase services and goods.
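As a rough illustration, such a declarative description might look something like this (every field name and URL below is invented for the sketch, not part of any real standard):

```python
# Hypothetical declarative service description an assistant could consume.
PIZZA_SERVICE = {
    "name": "ExamplePizza",
    "actions": {
        "order_pizza": {
            "endpoint": "https://example.com/api/orders",  # illustrative URL
            "params": {
                "size": {"type": "enum", "values": ["small", "medium", "large"]},
                "toppings": {"type": "list", "required": False},
            },
        }
    },
}

def build_request(service, action, **user_params):
    """Validate user-supplied parameters against the declared schema."""
    spec = service["actions"][action]
    for name, rules in spec["params"].items():
        if rules.get("required", True) and name not in user_params:
            raise ValueError(f"missing required parameter: {name}")
        if name in user_params and rules["type"] == "enum" \
                and user_params[name] not in rules["values"]:
            raise ValueError(f"invalid value for {name}: {user_params[name]!r}")
    return {"endpoint": spec["endpoint"], "payload": dict(user_params)}

req = build_request(PIZZA_SERVICE, "order_pizza", size="large", toppings=["olives"])
print(req["payload"])
```

Because the schema is declared up front, the assistant can both validate an order and ask the user for whatever is still missing, without any bespoke per-service integration.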

Of course, we can’t just trust any “service” discovered on the internet;
someone would need to actually vouch for it before it could be used by a
digital assistant.

------
EGreg
The main problem is that Siri wasn't opened up to third party apps to hook
into, and thus remained rather limited and stupid.

Each app could have bundled a voice interface, and Apple's phone would have
been the first futuristic and extensible "star trek computer". Oh well,
opportunity missed. And by a company whose founder had the Mac speak onstage.
I think Jobs would have gone in this direction and demoed the crap out of it!

As usual, there is a systems level challenge to solve for building a
foundation for app developers. Namely, how to make a fair and EFFECTIVE way
for app developers to all share the same namespace / tree of commands?

If I was in charge of the Siri team, I would have made the following changes:

1) Fork OpenEars or another open source package and spearhead it as a first
pass on the phone to eliminate the need for internet connection.

2) Have apps register prefixes for commands

3) Have apps register for "voice intents" and verbs that connect, like they
have for inter-app audio and app extensions

4) When an app is open, have a way to speak to the app through using the iOS
library. This can be used to issue commands or dictate an email etc.

5) Feature apps that make ingenious use of voice commands and have them pitch
PR stories about how the iPhone is becoming like Star Trek and is far ahead of
Android.
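A toy sketch of what points 2-3 could look like - a shared registry where apps claim command prefixes and the system routes utterances by longest-prefix match (all names here are hypothetical, not a real iOS API):

```python
# Toy shared command namespace: apps claim prefixes, the system dispatches.
class VoiceCommandRegistry:
    def __init__(self):
        self._handlers = {}  # prefix -> handler callable

    def register(self, prefix, handler):
        prefix = prefix.lower()
        if prefix in self._handlers:
            raise ValueError(f"prefix already claimed: {prefix}")
        self._handlers[prefix] = handler

    def dispatch(self, utterance):
        text = utterance.lower()
        # Longest-prefix match so "remind me to" beats a shorter claim.
        for prefix in sorted(self._handlers, key=len, reverse=True):
            if text.startswith(prefix):
                rest = utterance[len(prefix):].strip()
                return self._handlers[prefix](rest)
        return None  # fall through to the system's default assistant

registry = VoiceCommandRegistry()
registry.register("play", lambda rest: f"music app plays {rest!r}")
registry.register("remind me to", lambda rest: f"reminder set: {rest!r}")

print(registry.dispatch("Remind me to buy milk"))  # reminder set: 'buy milk'
```

Longest-prefix matching is one crude answer to the fair-namespace question; a real system would also need conflict resolution between apps claiming similar phrases.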

~~~
visarga
Or do the same, but on Android. Why can't devs add new voice commands to the
system?

------
miguelrochefort
Hopefully they'll stop focusing on conversational UI, as it's wildly
inefficient. I'd much rather interact visually and spatially.

I'll give these guys the benefit of the doubt, but it seems like all of these
businesses miss the big picture and try to emulate the way humans communicate.
In trying to deal with the UI fragmentation problem, they're significantly
degrading the user experience. We don't need a chat bot, we need an app that
does everything, where the GUI is the language.

I don't want to describe a location verbally, I want to point to it in a map.
Most of the time, I don't know what I precisely want, and prefer to browse
options rather than have a tedious conversation.

There is no doubt in my mind that the UI of the future will be more like
Akinator/Tinder (Yes/Maybe/No) than English.

All the pieces already exist. All we need is some interface over a knowledge
graph. Why don't people see this?

~~~
mwfunk
Agreed 100%. I felt the same way 10 years ago when people would always bring
up the Minority Report UI as some sort of holy grail we should all be working
towards. It looks cool (or demos well, in the case of conversational UIs), but
outside of a few very specific use cases it would be horribly inefficient to
use all the time.

Conversational UIs almost seem like the ultimate form of skeuomorphism to me:
it's only an illusion that it works like something from the non-computer world
that you're already familiar with (talking to another person, in this case),
but the reality is something far more complicated and far less robust than it
leads you to believe.

~~~
meric
Yes, I think for a conversational UI to work, it needs to be aware of the
human world - like JARVIS from Iron Man.

------
nl
I always find these articles (and the response to them) a bit strange.

The hard part of this problem _isn't_ the voice part - that is doable. The
hard part is parsing the question or order! That is so, _so_ far from being
solved. Accuracy rates on QA tasks in academia are around 60-70% (depending on
the task), which is much worse than the 95-99% accuracy achievable with
speech-to-text.

 _That's_ why the pizza ordering thing is impressive - the software has to
understand intent.

~~~
takno
But it didn't sound any easier than calling the pizza shop, which is something
I hated doing before the internet solved that problem for me. The unpleasant
experience was caused by the cognitive load of having an unpredictable real-
time interaction with something that wasn't laid out in front of me, not by
the fact that the thing I was conversing with was human.

------
cstavish
The example of the pizza ordering is technologically impressive, yes... But
from an end-user perspective it is at best "as hard" as calling the pizza
place and talking to a human taking your order.

There are many use cases where it could actually provide value, but I found it
funny that they chose to demonstrate one to the press that really afforded
end-users no advantage.

~~~
redplasticcup
What if the pizza cost half as much now because it didn't require a human to
standby and take the order?

~~~
grogenaut
I can already order my pizza with an app or with Yo Domino's... and they cost
the same...

Not sure if it's because the person on the phone is doing other stuff at the
same time or if it's more like eBooks where they cost the same as Paperbacks
because "more profit".

~~~
ams6110
That is the case. Only when the store is very busy will they have dedicated
phone order takers. The rest of the time the drivers who are waiting for the
next delivery, the pizza makers, and/or managers handle the phones.

I used to work at a Domino's, though that was in pre-internet days. I'm not
sure how much online business the typical store gets now; I know I still
always call when I order pizza because it's faster than dealing with the
website.

Annoyingly, at Domino's they now also have an automated attendant answer the
call initially, which reads off the daily specials. That's a big reason I
don't often order Domino's anymore but instead call a competitor who still
has live humans answering the phone.

When I was at Domino's we had a standard of no more than two rings before a
person picked up. People would sprint across the store to get the phone before
the third ring.

~~~
seanp2k2
Why not get headsets like they have at fast food places? Edit: I feel like
that would be a lot cheaper than even one workers' comp claim from slipping
on a flour-covered floor in a pizza place.

~~~
bagacrap
I think it's optimistic to assume the pizza-making process at Domino's
involves any loose flour. I would have imagined frozen and/or refrigerated
lumps of premade dough on a rack.

~~~
drabiega
I don't know about Domino's, but when I worked for Papa John's many years ago,
loose flour was involved in the process of shaping the refrigerated lumps of
premade dough into a pizza crust.

------
jonathankoren
I wish them luck, and the ability to hook into 3rd-party APIs is good, but
ordering a pizza is essentially just straightforward form filling and, quite
frankly, something I could do 30 years ago. SoundHound's Hound[0] shows
refinement and retargeting, which is much harder because you have to do
entity resolution on implied subjects/objects and pronouns. Now maybe Viv can
do that too, but that's not what they demoed. What they demoed isn't actually
all that interesting.

[0] [http://www.soundhound.com/hound](http://www.soundhound.com/hound)

------
taneq
> It was their first real test of Viv, the artificial-intelligence technology
> that the team had been quietly building for more than a year.

I think they mean "first real demonstration." It would have been tested
thousands of times before reaching the board room.

That gripe aside, the real test (for me, at least) of whether a service like
this will be usable is whether it can run offline without uploading my entire
life to the mothership. I won't use Google Now or Siri or anything similar
because it would make my life (even more of) an open book to a single huge
provider. It's bad enough that Google (through Gmail) has access to my emails,
I don't want them having a log of my moment-to-moment location, voice
conversations, routines, etc.

------
justsaysmthng
I have my doubts when it comes to replacing colorful, tactile user interfaces
with voice conversations.

I don't like the idea at all. I want to interact with my device using a visual
interface, which I can "feel" and interact with, rather than a metallic voice.

I want to see what that pizza looks like, not imagine it.

A voice interface is totally unusable in a place where there are several other
people - an office, train, bus, doctor's office - the places where we tend to
dive into our devices.

~~~
ramblerman
> I want to see what that pizza looks like, not imagine it.

I think you're confusing the matter by lumping voice input and voice output
into one bucket.

Think star trek 'Computer show me a map of the omnicrom system and highlight
possible habitable planets within 10 light years of our current position'

~~~
justsaysmthng
You're right, I'm lumping them together, as per the article:

> Then, a text from Viv piped up: "Would you like toppings with that?"

That's not a very good UX in my opinion.


> Think star trek 'Computer show me a map of the omnicrom system and highlight
> possible habitable planets within 10 light years of our current position'

Yes, it would be a nice additional input method, although sometimes
_pointing_ things out (like selecting with the mouse or finger) is a lot more
efficient than trying to explain it - "highlight that one.. no, no, not that
one, the other one! ... stupid toaster!"

I can see voice input being useful when combined with all the other input
methods plus good realtime visual feedback.

My prediction is that we'll have really interesting programming environments
and programming languages based on voice input in the next couple of years.


Still, voice input is practical in an isolated space, where there are no other
people using voice input as well, so although I see a niche, I don't see it
replacing mobile devices just yet.

------
bobwaycott
I _almost_ hate to say it, because my hopes were uncharacteristically high,
but I feel like making something better than Siri is a rather low bar.

------
mempko
Another example of a government-funded invention being converted to dollars
through capitalist innovation. People still don't understand how much our
government is involved in bringing us the tech we have now. The government
invests, and about 10-20 years later we see it in our homes.

------
insulanian
Didn't Microsoft present the technology for the exact same thing (even used
the same pizza-ordering example) during their Build conference this year?

~~~
miguelrochefort
Perhaps that's because it's the most trivial of ideas and every single person
in history has had that thought at some point in life?

~~~
cududa
Pretty sure the first iPhone demo involved Steve Jobs ordering a pizza to the
Moscone.

~~~
kccqzy
He ordered Starbucks coffee, not pizza.

------
partiallypro
How is this different than a pizza app using Microsoft's Cortana API in
Windows?

------
addicted
The original Siri app acquired by Apple was more useful than Siri in its
current incarnation, so this headline doesn't really say much.

------
excalibur
They'd better have their ducks in a row with the proof that this is entirely
separate from Siri. The lawsuit is inevitable.

------
sqldba
It can't be hard considering how stupid Siri is.

