Ask HN: What are the challenges of building a personal assistant like Siri?
101 points by mandela on April 15, 2016 | 52 comments
I'm thinking of building a language-specific personal assistant, like Siri or Cortana. I'm excited about the technical challenge but wondering about its feasibility.

Some of the challenges are:

1. NLP: I am not aware of any non-English library and don't have much background on the subject.

2. Voice recognition: same as the above.

3. Web crawling: there are tons of libraries for doing crawling and I have a decent understanding of the subject.

So, a few questions:

1. are there any other challenges that I have not considered?

2. is the project feasible?

3. will it succeed?

Most of the comments here are pretty discouraging, which is reasonable. It sounds like an insanely ambitious project for a single non-expert to take on. But I completely disagree.

You should go for it. Even if you never really get it to work you'll most likely learn a ton along the way. And if you even halfway pull it off it would make an epic "Show HN". Also, a lot of the required technologies are advancing at an incredible rate (both the state of the art and public accessibility). It may well be an order of magnitude easier now than it would have been a few years ago.

I don't have any specific advice other than the usual for any large project: start small and focused, then take it one step at a time. Oh, and read lots of academic research papers.

Good luck!

I second this encouragement strongly. Some of my favorite open source projects started with people telling the author "don't be stupid, why would you try to build [x] when there is [y] and you definitely don't understand the challenges of this topic". Linux, Leaflet, the geojson spec and several other projects started exactly like this - somebody excited to learn/build something challenging.

Exactly this, it is a difficult problem but if the goal is learning then it is one which will give you a number of challenges that will teach you a lot.

Start with any part of the problem and break it up into its component challenges. Start knocking off things one by one. Also start reading all the current papers on the topic (a membership in a university library will help with this).

Since we take high school students and turn them into engineers with 4 years of training, figure on two and a half to four years of reading, implementing, and improving (depending on how much time you spend on core skills like upgrading your maths understanding).

>> "start small and focused"

I'd even go as far as to say: use an existing platform like Amazon's Alexa with their hardware to start.

Reading books like "Calm Technology" and books on lean startups would be good too.

Books: http://www.amazon.com/Calm-Technology-Principles-Patterns-No...


I'm building Abot, an open-source framework that makes it easy and fun to build digital assistants. We try to handle a lot of the heavy lifting you describe above, leaving you to create some cool things! You can see it here: https://github.com/itsabot/abot

I set out six months ago to do exactly what you describe (build my own digital assistant) but decided it would be more helpful to convert it into a framework and open-source it.

I'm happy to help you get started via our IRC or the mailing list, or discuss the challenges we've faced and specific solutions we chose and why.

That's awesome. I'm on my phone and just quickly read through the readme, but it seems a bit like Hubot on steroids. In other words, as easy to deploy and extend as Hubot, but with much more powerful NLP (and voice recognition too, no?)

It's a fantastic idea. I wonder if there's a way to share the data generated as people use it. I would imagine that everyone would benefit from this, though it would have to be anonymized effectively.

Thank you :) No voice support yet but it's easy to add your own drivers to support external APIs. Official support for voice is coming in v.2 (ETA June 1st).

Yes, we will absolutely make it easy to share training data via opt-in! If I train a plugin, everyone should benefit without needing a GPU rig or 2 TB of data.

Your challenges are almost immaterial compared with what goes into something like Siri.

First you need a way to get input. Is that a text input or do you need voice to text? Google and others can provide a voice to text API to help along here at least at first.

Second you need to make that text meaningful. But what the heck is "meaningful" in this context? Generically running an NLP over this is going to get you sentence structure, etc which may or may not be all you need.

Essentially you need to take this text and turn it into a graph of decisions which can then be executed.

Basically distill the sentence "send a text to Josh saying hi"

To a graph that might have a branch like: Action -> Send -> SMS -> ("joshbsmith@yahoo.com", "hi")
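To make the distillation step concrete, here is a minimal sketch in Python, assuming a single hand-written rule. The contact book, the phone number, and the `parse` helper are all hypothetical, and this is certainly not how Siri does it; it only shows the shape of turning a sentence into an executable branch.

```python
import re
from typing import Optional

# Hypothetical contact book; a real assistant would read the phone's contacts.
CONTACTS = {"josh": "+15550100"}

# One hand-written pattern per supported command (illustrative only).
SEND_TEXT = re.compile(
    r"^send (?:a )?(?:text|sms|message) to (?P<name>\w+) saying (?P<body>.+)$",
    re.IGNORECASE,
)


def parse(utterance: str) -> Optional[tuple]:
    """Turn an utterance into an (Action, Verb, Channel, args) branch, or None."""
    match = SEND_TEXT.match(utterance.strip())
    if match is None:
        return None  # no rule matched; a real system would fall back or ask
    recipient = CONTACTS.get(match.group("name").lower())
    if recipient is None:
        return None  # unknown contact
    return ("Action", "Send", "SMS", (recipient, match.group("body")))
```

Calling `parse("send a text to Josh saying hi")` gives a branch you can hand to an executor, and anything the rule does not cover comes back as `None`. The hard part is everything this skips: many phrasings, many intents, and ambiguous names.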

Categorization is hard. If you can get it reasonable (85% solution) then you'll be pretty good. This is how Siri essentially works.

Things to consider:

- "How long is the Titanic" is this referring to a movie or the boat?

- "Where is Aaron's?" Is this referring to the furniture store chain, a friend in your contact list or even the street name?

Source: been looking into this a ton. Wrote this during a hackathon: http://devpost.com/software/sim. So this is all very possible :)

If you'd like to start simple, you could skip the voice interface and go for a text-based interface like Messenger.

Voice can be a layer that you put on top later, with the extra complexity that the recognition may not be perfect. I guess it really depends which is the most interesting aspect for you.

With all the voice APIs out there, I would recommend choosing one and diving in. One of the skills is getting comfortable with the interactions between the voice API and the back-end.

If George Hotz came to HN and asked about the feasibility of building a self-driving car by himself, people would say the same thing and tell him it was impossible. Lucky for us, he never asked HN whether he could hack the iPhone or the PlayStation 3, because if he had, he might have been discouraged.

Anyway, geohot built a self-driving car and received investment from a16z to hire more people and expand it further. http://www.bloomberg.com/features/2015-george-hotz-self-driv...

My advice to you is to study really hard and make it happen. Use other people's work to leverage your software and go faster. And after that, show me and the others what you have made.

You can offload both 1 and 2 to third-party services. One of the major ones with a properly public API that officially supports use by other parties (and thus charges for high-volume use) is Microsoft's Project Oxford [0].

You could also use some of the Google APIs, but those are more-or-less unsupported, and subject to change as Google needs to change them.

From there, it's a matter of transforming intent into action.

It's entirely possible to do, but you'll have a lot of learning to do to implement it. One place you might start looking is at the Microsoft Python blog [1].

[0] https://www.microsoft.com/cognitive-services/

[1] https://blogs.msdn.microsoft.com/pythonengineering/2016/02/1...

As much as possible, this.

If you, singular, are building Siri v2, you need to do as little as possible yourself. Find an OSS library / tool that does the thing you want, use Caffe/Theano/etc. instead of rolling your own code, use a commercial package if you absolutely have to. Concentrate on the "hard" parts, understanding and translating language to action. Build that part, and glue on all the other stuff. After you get a working thing you can peel off the layers and replace them with something of your own, maybe, or more likely something better will come out by that time and you can plug it in with minimal effort.

I was one of those who wanted to build their own Jarvis. If you want it to do certain -pre-defined- tasks like opening Facebook or a website etc., it is not that hard. However, if you want to create something like Siri, you need to use almost everything humanity has in terms of AI/ML until today.

It seems to me though that there is a continuum between "multi-channel interactive semi-declarative rule engine" (Frankly, the continuum probably starts back in oldschool IRC bots and expert systems) and "strong AI", wherein the bulk of the value, for me at least, is on the former side of that scale.

The vast majority of the pragmatic functionality I'd use on a day to day basis could probably be built by someone with the willingness to learn about the systems involved, no rocket science or all of humanity's ML required :) (I certainly didn't know about anything from scraping to relay control to information retrieval at the start of my pet project)

I guess why I'm saying this is to address yours and some other child comments that seemed more dismissive of the idea due to the complexity (Responded to yours largely because you had gone through the steps of "try it yourself" which I see as the crux of all of this). To the OP: try it! What does it matter if it's out of your grasp. Learn new shit. Realize the extents of your knowledge, and in the process, push them a bit further. If you fail, I would be hesitant to ever call it failure if you even learned a little; I can't tell you the number of pipe dream impossible projects (microkernels, distributed OSes, rewriting AML on my own platform, etc) I've taken on, and despite failing at many of them, the learnings were well worth the time, not to mention the once in a while surprising success at something you thought was out of your reach.

Think about this: Apple and Google, the biggest tech companies in the whole world, have teams of remarkable people creating Siri etc. And it is still not enough, so they buy many startups related to those subjects, aaand we have what we call Siri.

1. are there any other challenges that I have not considered?

Always. NLP is a valid way to accept requests. Web crawling is a valid way to get data into the system. How do you tie these two together? How is the data stored? How do you create relationships between the data in your system?

2. is the project feasible?

You have not yet defined your project beyond "like Siri or Cortana". What does that mean to you? How will you know when that has been accomplished? Can you first define a more limited scope for testing the feasibility of the different parts of your system?

3. will it succeed?

First you must define success.

At my company we did it reasonably well, we just released our product to the market. We do have the advantage of having an app-based model, which means that every app the user installs on our product improves the speech recognition and the amount of actions installed.

Getting started is not that hard, getting good is. It's a hard problem to parse speech correctly, take numbers for example: Nineteen-Eightyfour can be parsed as 19 and 84, or as 1984, or as 9 10 80 4. There are challenges, certainly, but creating a simple program with things like wit.ai is do-able. Prepare to write a lot of speech parsing logic, and implement every piece of functionality by hand. Magic as Siri, Google Now, and Cortana may seem, most of it is just hardcoded responses and actions. That need not be a problem, but I can promise you, smart assistants will lose a lot of their magic once you realize it's just a bunch of hard-coded responses and actions.
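The number ambiguity can be made concrete with a small Python sketch. It assumes the recognizer has already produced clean number words, so it only covers the word-level readings (not the acoustic "nine, ten" confusion), and the function names are made up for illustration.

```python
UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
VALUES = {word: value for value, word in enumerate(UNITS)}
VALUES.update(TENS)


def group(values):
    """Merge a tens word with a following unit word: 80, 4 -> 84."""
    merged, i = [], 0
    while i < len(values):
        if values[i] >= 20 and i + 1 < len(values) and 1 <= values[i + 1] <= 9:
            merged.append(values[i] + values[i + 1])
            i += 2
        else:
            merged.append(values[i])
            i += 1
    return merged


def readings(utterance):
    """Return every candidate interpretation as a list of numbers."""
    tokens = utterance.lower().replace("-", " ").split()
    word_by_word = [VALUES[token] for token in tokens]
    grouped = group(word_by_word)
    candidates = [word_by_word]
    if grouped != word_by_word:
        candidates.append(grouped)
    # Two two-digit groups can also be read as a single year-style number.
    if len(grouped) == 2 and 10 <= grouped[0] <= 99 and grouped[1] <= 99:
        candidates.append([grouped[0] * 100 + grouped[1]])
    return candidates
```

For "nineteen eighty-four" this returns three candidates: the words one by one, the pair 19 and 84, and 1984. Picking the right one needs context, such as whether the user is dictating a year, a phone number, or an address.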

Anyway, I don't want to discourage you from trying, because it's really interesting to try and see what challenges you're going to encounter. The getting started pack for speech recognition is mostly wit.ai, or the Google STT engine. Keep in mind: none of the big companies are doing everything from scratch. Sure, Google has their own speech recognition, but recognizing the trigger phrase (jargon for 'OK Google', or 'Hey Siri') is outsourced. Every piece of software that has such a trigger uses the same library. Remember: using libraries is not cheating, it's just focusing on your core task, which is writing the parsing and actions.

"I would like to build a space programme in my back garden, but I'm wondering about its feasibility and I don't have a background in rocketry"

Yes, no and no.

You won't believe how far you can get these days with relatively modest means. "Rocket science isn't rocket science" is what lots of people involved have been saying for the last decade at least. The alt-space movement is witness to that.


Certainly not with that attitude.

Writing software - even hard, complex software - isn't even remotely comparable to a space program.

Your fake quote sounds a bit like what Elon Musk might've said when he started SpaceX.

He didn't start by asking around on HN. Generally people aren't Elon Musk. Even he probably isn't Elon Musk all of the time.

A young composer met Mozart and said 'I want to write a symphony, where do I start?'. Mozart spins him a story about starting with short pieces, then apprenticing to a serious composer, mastering the different instruments and gradually working up to it.

The young composer is surprised. 'You didn't do all that'. 'Ah,' says Mozart, 'but I didn't ask anybody how'.

I used to think that all the "geniuses" didn't ask anyone how to do things at all, and just went for it. It turns out that they do, and luckily for us, in the internet age we have history.


I like how your interpretation of 'do things' is 'set UA string in Java spiders'. Einstein visited Bohr to rap about the speed of light and Larry Page asked a mailing list how to use Java libraries.

Steve Jobs asked a lot of people, like the CEOs of Intel and HP. He made all these successful people his mentors, and it is well documented in his biography as well as in the book 'Becoming Steve Jobs'. This is how he became so good at his craft.

Justine Musk (the ex) has an opinion about this


The possibility of deep NLP suggests that there are tremendous opportunities for building a personal assistant. For deep NLP, I would suggest course CS224D at Stanford on YouTube (going on now but also delivered last year): https://www.youtube.com/watch?v=kZteabVD8sU

For Deep NLP, you will need to be solid on linear algebra and machine learning. For introduction to machine learning, check out Andrew Ng at Coursera: https://www.coursera.org/learn/machine-learning and my favorite talks on Linear Algebra are the ones done by Gilbert Strang: http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-...

For web crawling, there are plenty of open source libraries. If you are not familiar with it, check out the Common Crawl: http://commoncrawl.org/ This is a great source of data to crawl.

If you focus solely on NLP and data from the Common Crawl (or even Wikipedia), then you will see where you stand as the smoke clears and you feel comfortable with the state of the art techniques.

Ignore all the naysayers. The good news is that it has never been easier to get started in deep NLP. Once you have the experience of training a model and seeing how well it works, you can decide on the next steps. Perhaps, you will find a niche that is not well-served that you can go after as a first step of a personal assistant.

Good luck.

To get an idea of where to start you could check out Sirius [1] which is "an open end-to-end standalone speech and vision based intelligent personal assistant (IPA) similar to Apple’s Siri, Google’s Google Now, Microsoft’s Cortana, and Amazon’s Echo."

[1] http://sirius.clarity-lab.org/

In this specific case, I would start by trying to come up with an idea for a solution to an existing problem instead of trying to build a solution looking for a problem. Trying to build just another Siri will be wasting your time because you're just one person competing with pretty much all the big tech companies, whereas if you try to build a useful tech for a specific use case and people like it, then you can build your platform on top of it. I've never seen any small startup succeed by trying to build an ambitious "platform" from scratch. Most successful platform companies started out as small application providers and went from there. Some people are encouraging you to do it because that's how you learn things, but I would rather make something people like AND learn something new at the same time, instead of spending the same amount of time working on something that will never see the light of day.

Take a few minutes to discern your motivation.

If you're motivated by learning, you'll certainly do so. You'll probably learn about topics you hadn't considered learning about. Start small and build incrementally. Be patient, you're in for a lot of reading.

If you're motivated by profit, i.e. starting a company, then you need to be strategic. The Big Guys - Apple, Google, MS, etc. - are mostly looking at "general purpose" assistants. Keep in mind that they're putting massive effort into these assistants because the technology will become a central part of products like automobiles.

Perhaps you can find a niche market they're missing so that you can target your research and coding efforts.

As others said, pick a small, focused goal in building an assistant and add goals as you succeed.

For what it's worth, the BotBuilder from Microsoft is open-source: https://github.com/Microsoft/BotBuilder - could be worth a peek.

As the website states, this SDK is only "one of three main components of the Microsoft Bot Framework", and I'm not sure if the other two are open-source, but it shouldn't be all that significant given that it's just integration with Microsoft products, like Office 365, plus the bot directory, which I imagine consists of some ready-to-use bots one could possibly piggyback on.

This is not voice recognition, but once you're past that stage what you have on your hands is a piece of text anyway - which has to be parsed and processed.

Please read up on what SDK means. It's worthless in OP's case.

If you don't want to start from scratch you can use http://api.ai or http://wit.ai

The biggest challenge is probably covering all the corner cases. In this regard, building such a thing is more about "caring/nursing" than it is about "engineering".

But of course, there are some engineering challenges as well, although you might want to use existing solutions (open source or APIs) for that. For speech-to-text you can find many APIs. Here's a nice demo for doing NLP: http://nlp.stanford.edu:8080/corenlp/

That mostly depends on what scale you have in mind.

If you simply want to make an app for your own personal use, and you imagine a restricted form of dialog (by which I mean e.g. "query/reply" or "command"-type of dialogs as opposed to open discussions) to trigger a limited set of actions (say, verbatim web searches, control the built-in functionality of a smart phone, etc.) it is feasible.

But that doesn't mean it's easy. But for a project to hack on, why not?

The good news is that there are a lot of tools that can do some of the heavy lifting for you, especially if you restrict yourself to English. You are right that the situation for other languages is not quite as luxurious, but there are tools (of varying quality) for other languages as well, especially Western European languages.

However, because it's a complex subject matter, expect that you might need to first dig into some linguistic and/or NLP theory in order to get the most out of these tools.

For instance, the Kaldi speech recognition toolkit is a state-of-the-art research software for automatic speech recognition (ASR), and it's open source. The thing is, to get really good recognition results, you might need to train your own acoustic and language models. Hence, you'd need to learn about these things.

For NLU (natural language understanding) there are also a bunch of free software packages available; however, they often follow completely different philosophies and goals. Thus, in order to make an informed decision which one would be the best for you, you'd again have to be prepared to do some reading.

One quite user-friendly service for NLU you might want to check out is wit.ai which was acquired by Facebook last year. They focus on setting the entrance barrier really low for the task of turning spoken input into a domain representation. For example, you can quite easily define rules that turn the utterance "turn down the radio please" into a symbolic representation that you can use in your downstream processing. The big plus here is that they do the ASR for you, so you don't have to worry about that.

If you prefer to have more control over your tool chain, there are a wide variety of scripting languages that you can use to get your feet wet. AIML is sort of popular for writing bots, but it's quite limited and you have to write rules in XML. VoiceXML is a standard that is great for form-filling applications, ie., situations where your system needs to elicit a specific set of information that's required to run a task. A classic example would be traveling: for your system to find a flight for you, it needs to know (a) point of departure, (b) destination, (c) preferred date and time, (perhaps others). So you need to tell the system, or it has to ask about this information.
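The form-filling idea can be sketched in a few lines of Python. This is not VoiceXML, just the same pattern under simple assumptions: the slot names, the prompts, and the `run_form` helper are invented for illustration, and the user's answers are a scripted list standing in for real speech input.

```python
# Slots the flight task needs, each with the question that elicits it.
REQUIRED_SLOTS = {
    "origin": "Where are you flying from?",
    "destination": "Where are you flying to?",
    "date": "When do you want to travel?",
}


def next_missing(filled):
    """Return the first slot that still has no value, or None when complete."""
    for slot in REQUIRED_SLOTS:
        if not filled.get(slot):
            return slot
    return None


def run_form(prefilled, answers):
    """Ask for each missing slot in order; `answers` stands in for the user."""
    filled = dict(prefilled)
    asked = []
    answers = iter(answers)
    while (slot := next_missing(filled)) is not None:
        asked.append(REQUIRED_SLOTS[slot])
        filled[slot] = next(answers)
    return filled, asked
```

If the user opens with "I want to fly to Lisbon", the destination is already filled, so `run_form({"destination": "Lisbon"}, ["Berlin", "next Friday"])` only asks the two remaining questions. That is the whole trick: the user can tell the system, or the system asks.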

There are also domain-specific languages like Platon (https://github.com/uds-lsv/platon) that, again, give you more control but also try to make it quite easy to write a simple application.

A next aspect more complex dialog systems typically care about is what the intent of a specific user utterance is. Say, you ask your personal assistant: "do you know when the next bus comes?", you don't want it to answer "yes". That's because your (what is called) "dialog act" was not a yes-no-question, but a request for information. So, you might want to care about how to detect the correct dialog act. Well, first you might want to care about what kinds of dialog acts there are and which of those your system should be prepared to handle.
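A deliberately naive Python sketch of dialog act detection, to show the bus example in code. The three labels and the word lists are my own assumptions, not taken from any standard tag set, and a real system would train a classifier instead of using regular expressions.

```python
import re

# Polite wrappers that look like yes/no questions but ask for information.
INDIRECT_REQUEST = re.compile(
    r"^(do you know|can you tell me|could you tell me|do you have any idea)\b",
    re.IGNORECASE,
)
WH_WORD = re.compile(r"\b(when|where|who|what|which|why|how)\b", re.IGNORECASE)
YES_NO_START = re.compile(
    r"^(is|are|was|were|do|does|did|can|could|will|would|should|have|has)\b",
    re.IGNORECASE,
)


def dialog_act(utterance):
    """Classify an utterance into one of three illustrative dialog acts."""
    text = utterance.strip()
    # "Do you know when ...?" embeds a wh-question, so it is a request.
    if INDIRECT_REQUEST.match(text) and WH_WORD.search(text):
        return "request_information"
    if WH_WORD.match(text):
        return "request_information"
    if YES_NO_START.match(text):
        return "yes_no_question"
    return "statement"
```

With this, "do you know when the next bus comes?" comes out as a request for information, while "Do you know Josh?" stays a yes/no question. The rules break quickly on real speech, which is exactly why the choice of act set and classifier matters.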

There are many different dialog act sets developed for different domains and situations. There's also an ISO standard (ISO 24617-2) that defines such a set, but then you'd go into more advanced areas again.

Next, say your system has done all of the above processing, recognized speech, analyzed the meaning, etc. -- now your system has to make the next move! So how does it decide what's the best reaction? What's by some considered the state-of-the-art for dialog management these days runs under the label POMDP -- Partially Observable Markov Decision Processes. These are systems that learn the best strategy on how to behave from data, typically using reinforcement learning. But you also still have the more traditional approaches in which an "expert" (in this case: you) authors the dialog behavior somehow, and there are tools for that as well.
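To give a feel for the "partially observable" part, here is a small Python sketch of the belief tracking a POMDP-style dialog manager does. It is heavily simplified and rests on assumptions: the user's goal never changes during the dialog, the recognizer is right with a fixed probability and spreads its errors evenly over the other goals, and the hand-written threshold policy stands in for the policy a real system would learn with reinforcement learning.

```python
def update_belief(belief, observation, p_correct):
    """Bayes update: P(goal | obs) is proportional to P(obs | goal) * P(goal).

    `belief` maps each candidate goal to its probability (at least two goals).
    """
    n_other = len(belief) - 1
    posterior = {}
    for goal, prior in belief.items():
        if goal == observation:
            likelihood = p_correct
        else:
            likelihood = (1 - p_correct) / n_other
        posterior[goal] = likelihood * prior
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}


def choose_action(belief, threshold=0.8):
    """Act on the most likely goal only when the belief is strong enough."""
    goal, probability = max(belief.items(), key=lambda item: item[1])
    if probability >= threshold:
        return ("execute", goal)
    return ("confirm", goal)
```

Starting from a uniform belief over three goals, one noisy "music" observation at 70% recognizer accuracy is not enough to act on, so the system asks for confirmation; a second one pushes the belief past the threshold. A learned policy would make that trade-off from data instead of a fixed 0.8.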

But again, the more simple languages mentioned above like, e.g. Platon etc., also cover this in a way, so don't get discouraged just because you've never even heard of POMDPs so far, nor do you have a large data set that is required for the machine learning part: like with all of the different tasks here, there's always alternatives.

Once your assistant has made up its mind about what to do and what to say, you need to turn that into an actual utterance, right? If you just want to start, having a large-ish set of canned sentences that you simply need to select from can get you a long way. The next step would be to insert some variables into those canned sentences that your system can fill depending on the situation. That's called template-based natural language generation (NLG). More recently, machine learning has also been applied with some success to NLG, but that's (a) still researchy and (b) not even necessary for a first dab into writing a dialog system.
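Template-based NLG is simple enough to show in full. This is a minimal Python sketch using the standard library's `string.Template`; the act names and the sentences are made up for illustration.

```python
import random
from string import Template

# A few canned sentences per system act, with variables to fill in.
TEMPLATES = {
    "next_bus": [
        Template("The next $line bus leaves at $time."),
        Template("Your $line bus is due at $time."),
    ],
    "unknown": [Template("Sorry, I did not get that.")],
}


def realize(act, rng=random, **slots):
    """Pick a canned sentence for the act and fill in its variables."""
    template = rng.choice(TEMPLATES.get(act, TEMPLATES["unknown"]))
    return template.substitute(**slots)
```

`realize("next_bus", line="42", time="10:15")` returns one of the two bus sentences at random, which already makes the assistant feel less robotic. A missing variable raises `KeyError`, which is a useful early warning that the dialog manager and the templates disagree.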

Unless you just want to display the system utterance on the screen, you'd finally need to use some text-to-speech (TTS) component to vocalize the system utterance. There are some free options, such as Festival or MaryTTS, but unfortunately, they don't quite reach the quality of commercial solutions yet. But hey, who cares, right?

One topic I haven't talked about at all yet is uncertainty. Typically, a lot of the steps on the input side of a dialog system use probabilistic approaches because, starting from the audio signal, there's inevitably noise in the input and so the outputs produced on the input side should always be taken with a grain of salt. For ASR, you can often get not just one recognized utterance, but a whole list of hypotheses on what it was the user actually said. Each of these alternatives might come with a confidence score.

That, of course, has implications on all the processing that comes afterwards.
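One common way to deal with that downstream is to let the confidence score decide the system's next move. A minimal Python sketch, assuming the recognizer hands back a list of (text, confidence) pairs; the two thresholds are arbitrary placeholders you would tune on real data.

```python
def decide(n_best, accept=0.8, reject=0.4):
    """Map an n-best list of (text, confidence) pairs to a system move."""
    if not n_best:
        return ("reprompt", None)
    text, confidence = max(n_best, key=lambda hypothesis: hypothesis[1])
    if confidence >= accept:
        return ("accept", text)   # confident enough to act directly
    if confidence >= reject:
        return ("confirm", text)  # "Did you say ...?"
    return ("reprompt", None)     # too noisy, ask the user to repeat
```

More elaborate systems keep the whole list around and let later stages (parsing, dialog state) rescore it, instead of committing to the top hypothesis this early.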

Now, I've written a whole lot -- and yet, there's so much more I haven't touched yet, such as e.g. prosody processing, multimodality (e.g., using (touch) gestures together with speech), handling of speech disfluencies, barge-in, etc.

But I think that shouldn't keep you from just giving it a try. You don't have to write a Siri clone in one weekend. Just like the first video game you write doesn't have to be the next "Last of Us". You can start with Pac-Man just fine, and likewise you can write your first small voice-based assistant that cannot do half the stuff Siri can, and yet have a great time.

Whoa, very thorough answer. I wonder what the reward is in a dialog POMDP. Where does it come from?

Apple and Microsoft are investing many millions of dollars [citation needed] in this, there's just no way that you can compete. Not that miracles don't happen and it's only huge corporations that build amazing things, but statistically speaking, the answer is a clear "No, it's not feasible".

If you, however, want to learn a lot, then it's a definite "Go for it".

I think the hardest part might be getting lots of context information. Google has a huge database, you probably don't. And Google has access to your calendar, your contacts, your search requests, your location history - and the same for your contacts. They can infer a lot from that.

The most epic challenge is to start, and to get back to work when things go wrong. Do it, you'll learn a lot. It will be hard. Post it on Github, ask questions on SO and I hope to see it here as 'Show HN: My f*ckn awesome personal assistant' =)

Email me kespindler at gmail. I have relevant work that you'd be interested to hear about.

Re 2) Speech Recognition: it seems like using the new Google Speech API is the way to go but I wonder if you can obtain access or if it is very expensive for small companies. I applied a few weeks ago and haven't heard anything from them.

What is the scope of your project? I mean, is this an embedded system (hardware/software)? Are you thinking of running it on a DSP/Arduino/Raspberry Pi? There are a lot of open questions. I'm working on something like that. Good luck, the road is long.

Hey, I am also thinking of working on something similar (as a side project). Do let me know if you want to collaborate.

It's tricky but there are some nice ways to check the water quickly. Check out LUIS - https://www.luis.ai

To add a few, in addition to the ones you mentioned: 1. Infrastructure 2. Scaling 3. Huge datasets for machine learning

1. Do you know what you are doing? (not trying to be harsh... it's a serious question.)

2. Have you read any of the latest research in this area? (think Google scholar).

3. Often, apps like Siri and Cortana do not process the sound on the device, but rather send it to a server for processing. There are, of course, exceptions in cases of small, well-defined applications with limited pattern matching required.

4. Whether or not it is feasible and/or will succeed depends on your ability, performance, and approach. We cannot predict any of that here.

Hi Mandela:

By Siri, I am assuming you mean some application that understands speech and carries out commands like answering queries, or carrying out a command? Answering queries is a specific domain of NLP. As is developing the acoustic model (roughly the part that links the audio with linguistic units). You will find that there are many sub-problems to tackle. You have to decide on which problems to tackle.

I have very limited experience in this field. If I didn't try out Nuance NLU/MIX at a hackathon a few months ago (and get a lot of help from the Nuance representatives), I would have never taken a stab at writing a voice application.

I am currently building a simple request/response system. Requests of a "HAL, open the pod bay door" nature. As opposed to something more free form and conversational. One has to start somewhere.

2. Voice recognition: same as the above.

As many posters have commented, there are many speech recognition and synthesis systems out there that handle the acoustic model. Some of the systems handle languages other than English.

I have tried the Amazon Alexa SDK, Wit.ai and Nuance NLU/MIX. For the most part, they are essentially a client/server model.

1 - At "compile" time, one builds up an acoustic model/language model with samples (at least this is how Nuance works). Or maybe the language model already exists and training through additional use makes it better.

2- At "runtime," a client such as a mobile application interacts with the speech recognition system, via an API. Audio is sent. The speech system parses the audio stream and returns some data structure (or audio if speech synthesis is done). Or in a slight variation, the speech system sends the AST to an end-point for back-end processing. Most of your application will be the back-end that does the meaningful domain specific stuff. For instance, in Alexa, one is developing "Skills"

One of the things I am learning is that there is not a clear cut distinction between what is in the language model and in the back-end. Although the language model allows it, I don't necessarily want to bake in business logic (i.e. Pod bay Door maps onto part number 42). Also the back-end may have to do additional NLP related processing ("Cod bay" is probably "Pod bay").
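As a rough illustration of that back-end cleanup, here is a Python sketch using the standard library's `difflib`. The object list and the cutoff are assumptions for the example; the part number is the one mentioned above, kept in a back-end table instead of the language model.

```python
import difflib

# Business data lives in the back-end, not in the language model.
PART_NUMBERS = {"pod bay door": 42}
KNOWN_OBJECTS = ["pod bay door", "cargo hatch", "airlock"]


def normalize_object(heard, cutoff=0.8):
    """Snap a possibly misrecognized phrase to the closest known object."""
    matches = difflib.get_close_matches(
        heard.lower(), KNOWN_OBJECTS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def part_number_for(heard):
    """Resolve what was heard to a part number, or None if nothing is close."""
    name = normalize_object(heard)
    return PART_NUMBERS.get(name) if name else None
```

`normalize_object("Cod bay door")` snaps to "pod bay door", while something unrelated returns `None` so the application can ask again. Character-level similarity is crude; matching on phonetic similarity would probably fit speech errors better.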

NLP: I am not aware of any non-English library and don't have much background on the subject.

I found the Stanford course cs224n to be super useful (https://www.coursera.org/course/nlp). There is a book (Speech and Language Processing) associated with the course.

The Peter Norvig paper "How to write a Spell Corrector" (http://norvig.com/spell-correct.html) is very helpful.

Since I am working with Python, I use the book "Natural Language Processing with Python" by Bird, Klein, and Loper and the Python NLTK.

Finally I find that reading the documentation associated with the various speech systems to be very helpful. Vary to suit your needs. For instance, Alexa provides guidelines for U/X like Voice Design Best Practices (https://developer.amazon.com/public/solutions/alexa/alexa-sk...).

3. Web crawling: there are tons of libraries for doing crawling and I have a decent understanding of the subject.

I think crawling is the least of your problems. Look at Week 8, Information Retrieval of the Stanford NLP Course.

Mandela, at this point all I can say is go for it! Look at the speech SDKs out there and pick one, preferably one with a large community. Build something small. See if you can work in a computer language and system you are comfortable with (with Nuance, I could work mostly with Python and avoid Android and Java).

Have fun!!!
