Hi Mandela:

By Siri, I am assuming you mean an application that understands speech and acts on it, whether answering a query or carrying out a command. Answering queries is a specific domain of NLP, as is developing the acoustic model (roughly, the part that links the audio with linguistic units). You will find that there are many sub-problems; you have to decide which ones to tackle.

I have very limited experience in this field. If I hadn't tried out Nuance NLU/MIX at a hackathon a few months ago (and gotten a lot of help from the Nuance representatives), I would never have taken a stab at writing a voice application.

I am currently building a simple request/response system: requests of a "HAL, open the pod bay door" nature, as opposed to something more free-form and conversational. One has to start somewhere.

2. Voice recognition: same as the above.

As many posters have commented, there are many speech recognition and synthesis systems out there that handle the acoustic model. Some of these systems handle languages other than English.

I have tried the Amazon Alexa SDK, Wit.ai and Nuance NLU/MIX. For the most part, they follow a client/server model.

1. At "compile" time, one builds up an acoustic model/language model from samples (at least this is how Nuance works). Or maybe the language model already exists and training through additional use makes it better.

2. At "runtime," a client such as a mobile application interacts with the speech recognition system via an API. Audio is sent; the speech system parses the audio stream and returns some data structure (or audio, if speech synthesis is done). Or, in a slight variation, the speech system sends the AST to an end-point for back-end processing. Most of your application will be the back-end that does the meaningful, domain-specific stuff (see the sketch below). For instance, in Alexa, one develops "Skills."
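To make the runtime flow concrete, here is a minimal back-end sketch. It is not any one vendor's real API; the endpoint name, intent names, and JSON shape are made up, and the real Alexa/Wit.ai/Nuance payloads each have their own format. It only shows the general pattern: the speech service posts a parsed result, and your domain logic lives behind a web endpoint.

    # Hypothetical back-end webhook: the speech service POSTs a parsed result
    # (here assumed to look like {"intent": "...", "slots": {...}}), and the
    # domain-specific logic lives in this service, not in the language model.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Made-up intent names for illustration.
    HANDLERS = {
        "OpenDoorIntent": lambda slots: f"Opening the {slots.get('door', 'pod bay door')}.",
        "CloseDoorIntent": lambda slots: f"Closing the {slots.get('door', 'pod bay door')}.",
    }

    @app.route("/voice", methods=["POST"])
    def voice_endpoint():
        payload = request.get_json()
        handler = HANDLERS.get(payload.get("intent"))
        if handler is None:
            return jsonify({"speech": "I'm sorry, Dave. I'm afraid I can't do that."})
        return jsonify({"speech": handler(payload.get("slots", {}))})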

One of the things I am learning is that there is not a clear-cut distinction between what goes in the language model and what goes in the back-end. Although the language model allows it, I don't necessarily want to bake business logic into it (e.g. "Pod bay door" maps onto part number 42). Also, the back-end may have to do additional NLP-related processing ("Cod bay" is probably "Pod bay").
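For example, one way to keep that mapping (and the fuzz-correction) in the back-end rather than the language model, using only the standard library. The phrases and part numbers here are invented for illustration:

    # Keep the phrase -> part-number mapping in the back-end, with a fuzzy
    # matching pass to absorb recognition errors ("Cod bay" -> "Pod bay").
    import difflib

    PART_NUMBERS = {
        "pod bay door": 42,
        "cargo bay door": 43,
        "airlock": 44,
    }

    def resolve_part(spoken_phrase):
        """Map a (possibly misrecognized) phrase onto a part number, or None."""
        phrase = spoken_phrase.lower().strip()
        if phrase in PART_NUMBERS:
            return PART_NUMBERS[phrase]
        # Fall back to the closest known phrase, if it is close enough.
        matches = difflib.get_close_matches(phrase, list(PART_NUMBERS), n=1, cutoff=0.7)
        return PART_NUMBERS[matches[0]] if matches else None

    print(resolve_part("cod bay door"))  # -> 42
    print(resolve_part("warp core"))     # -> None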

NLP: I am not aware of any non-English library and don't have much background on the subject.

I found the Stanford course cs224n to be super useful (https://www.coursera.org/course/nlp). There is a book (Speech and Language Processing) associated with the course.

The Peter Norvig essay "How to Write a Spelling Corrector" (http://norvig.com/spell-correct.html) is very helpful.
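The core trick there, boiled down: generate every string within one edit of the input and keep the known word with the highest corpus count. A toy version of that idea (the tiny word-frequency table below is a stand-in for the large corpus counts the essay uses):

    # Candidates within one edit, ranked by corpus frequency.
    WORD_FREQ = {"pod": 50, "bay": 40, "door": 60, "open": 70}

    def edits1(word):
        """All strings one delete/transpose/replace/insert away from `word`."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word):
        """Prefer a known word, then the most frequent known word one edit away."""
        if word in WORD_FREQ:
            return word
        candidates = edits1(word) & WORD_FREQ.keys()
        return max(candidates, key=WORD_FREQ.get) if candidates else word

    print(correct("dor"))   # -> "door"
    print(correct("pood"))  # -> "pod"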

Since I am working with Python, I use the book "Natural Language Processing with Python" by Bird, Klein, and Loper and the Python NLTK.
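As a taste of what NLTK gives you out of the box (in the spirit of the Bird/Klein/Loper book), here is tokenizing and part-of-speech tagging a command. The tokenizer and tagger models have to be downloaded once, and the exact tags may vary by NLTK version:

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("HAL, open the pod bay door")
    print(nltk.pos_tag(tokens))
    # Roughly: [('HAL', 'NNP'), (',', ','), ('open', 'VB'), ('the', 'DT'),
    #           ('pod', 'NN'), ('bay', 'NN'), ('door', 'NN')]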

Finally, I find that reading the documentation associated with the various speech systems is very helpful; adapt it to suit your needs. For instance, Alexa provides U/X guidelines like Voice Design Best Practices (https://developer.amazon.com/public/solutions/alexa/alexa-sk...).

3. Web crawling: there are tons of libraries for doing crawling and I have a decent understanding of the subject.

I think crawling is the least of your problems. Look at Week 8, Information Retrieval of the Stanford NLP Course.
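If you do end up writing a crawler, a single fetch-and-extract-links step is only a few lines with requests and BeautifulSoup, two of the many libraries available. The seed URL below is just a placeholder, and a real crawler would add a queue, politeness delays, and robots.txt handling:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    seed = "https://example.com/"
    resp = requests.get(seed, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    links = [urljoin(seed, a["href"]) for a in soup.find_all("a", href=True)]
    print(links[:10])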

Mandela, at this point all I can say is go for it! Look at the speech SDKs out there and pick one, preferably one with a large community. Build something small. See if you can work in a computer language and system you are comfortable with (with Nuance, I could work mostly with Python and avoid Android and Java).

Have fun!!!



