

Ask HN: How do you implement a Watson?  - sown

I want to implement my own version, which will probably be very crappy. But I'd like to try anyways with the goal that I'll learn something.
======
emef
Probably start with a good knowledge representation. Facts need to be
categorized and linked to one another. I assume a solid ontology would be key.

The way you process your training data to fit your knowledge graph needs to
be unambiguous. No sense in teaching it anything if it ends up full of
contradictions.
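
To make the "linked facts, no contradictions" idea concrete, here is a minimal
sketch of a triple store that refuses a fact when it conflicts with one it
already holds. The class name `FactStore`, the `single_valued` parameter, and
the sample predicates are all made up for illustration; a real ontology would
need far richer constraints than this.

```python
from collections import defaultdict


class FactStore:
    """Toy store of (subject, predicate, object) facts."""

    def __init__(self, single_valued):
        # Assumption: these predicates may hold only one object per subject.
        self.single_valued = set(single_valued)
        self.facts = defaultdict(set)  # (subject, predicate) -> {objects}

    def add(self, subject, predicate, obj):
        existing = self.facts[(subject, predicate)]
        # Reject a second, different value for a single-valued predicate.
        if predicate in self.single_valued and existing and obj not in existing:
            raise ValueError(
                f"contradiction: {subject} {predicate} "
                f"{sorted(existing)[0]!r} vs {obj!r}"
            )
        existing.add(obj)

    def query(self, subject, predicate):
        # Return the known objects in a stable order (empty list if none).
        return sorted(self.facts.get((subject, predicate), ()))


store = FactStore(single_valued={"born_in"})
store.add("Sherlock Holmes", "created_by", "Arthur Conan Doyle")
store.add("Arthur Conan Doyle", "born_in", "Edinburgh")
print(store.query("Sherlock Holmes", "created_by"))
```

Adding `("Arthur Conan Doyle", "born_in", "London")` afterwards raises a
`ValueError` instead of silently storing both values.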

Finding a dataset for it to learn from would be tough as well. Given the
breadth of possible questions, you'll need to parse huge encyclopedias and/or
Wikipedia.

Finally, you need a way to efficiently query this massive amount of data. If
it has to come up with an answer faster than its competitors, it had better be
able to look up information pretty damn fast.

------
th0ma5
There are some architectural diagrams and other info in the Quora thread here:
[http://www.quora.com/IBM-Watson/Whats-the-system-
architectur...](http://www.quora.com/IBM-Watson/Whats-the-system-architecture-
of-the-IBM-Watson)

~~~
sown
I saw those, but I suspect they aren't at the level of detail needed for an
implementation, especially for an imbecile such as myself. :)

I'm going to start with an information retrieval book I found at Stanford, but
something tells me I need much more than just that!

~~~
th0ma5
No, fair enough. I guess it is all in the field of search or natural language
processing. For NLP I recommend the NLTK tutorial for Python as an intro, and
for search I would recommend checking out Apache Lucene. Perhaps to implement
it you need to pull concepts from those fields into your own ideas tailored
for the task at hand. If you choose Hadoop for job management and
distribution, it would scale out widely, up to the point where overhead takes
over. If you just want something highly specialized on a smaller dataset, and
traceable, and I might get shunned for this, heh, I recommend looking at RDFS
and OWL as a somewhat approachable entry into the realm of description logic,
though that may be too formal. Creating ngrams and other NLP techniques may be
pretty neat to play around with, but at smaller scales, it is hard for me to
envision something simple that is really any better than just free text
search.
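
For anyone who wants to play with ngrams before pulling in NLTK, a rough
stdlib-only sketch follows. The function names and the idea of scoring a
passage by how many bigrams it shares with a clue are illustrative
assumptions, not anything Watson is known to do.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(clue, passage, n=2):
    # Count n-grams that occur in both texts (multiset intersection).
    a = Counter(ngrams(tokenize(clue), n))
    b = Counter(ngrams(tokenize(passage), n))
    return sum((a & b).values())


clue = "the detective of Baker Street"
passage = "Holmes, the detective of Baker Street, solved the case."
print(ngrams(tokenize(clue), 2))
print(ngram_overlap(clue, passage))
```

A higher overlap only says the passage reuses the clue's wording; on a small
corpus that may well do no better than plain text search, which is the point
made above.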

------
gojomo
I think you could get pretty far (even reproducing some of Watson's errors)
probing a full-text index of a local copy of Wikipedia. My reasoning with
examples is here:

[http://memesteading.com/2011/02/16/ibm-watson-
overprovisione...](http://memesteading.com/2011/02/16/ibm-watson-
overprovisioned-big-iron/)
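
A rough sketch of what "probing a full-text index" could look like: build an
inverted index over article bodies, score articles against the clue with a
simple TF-IDF-style weight, and offer the best-scoring article title as the
answer. The `TinyIndex` class, its scoring formula, and the three toy
"articles" are assumptions for illustration; a real attempt would load a
Wikipedia dump into something like Lucene instead.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    def __init__(self, articles):
        # articles: dict mapping title -> body text
        self.n_docs = len(articles)
        self.postings = defaultdict(dict)  # term -> {title: term count}
        for title, body in articles.items():
            for term, count in Counter(tokenize(body)).items():
                self.postings[term][title] = count

    def search(self, clue, top_k=3):
        scores = Counter()
        for term in set(tokenize(clue)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms get a larger weight.
            idf = math.log(1 + self.n_docs / len(docs))
            for title, count in docs.items():
                scores[title] += count * idf
        return [title for title, _ in scores.most_common(top_k)]


# Hypothetical stand-ins for Wikipedia articles.
articles = {
    "Sherlock Holmes": "fictional detective created by Arthur Conan Doyle "
                       "who lives at Baker Street",
    "Hadoop": "open source framework for distributed storage and processing "
              "of large data sets",
    "Lucene": "open source search library written in Java",
}
index = TinyIndex(articles)
print(index.search("This detective lives at Baker Street"))
```

Treating the top article title as the answer is the crude part; it would also
reproduce the kind of confident wrong answers mentioned above whenever the
clue's wording matches the wrong article.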

Apparently the open-source Apache UIMA and Hadoop projects are key parts of
Watson's preprocessing and live operation:

[https://blogs.apache.org/foundation/entry/apache_innovation_...](https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s)

