
How to build your own "Watson Jr." in your basement - flapjack
https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ibm_watson_how_to_build_your_own_watson_jr_in_your_basement7?lang=en
======
gojomo
My hunch is that 90-99% of all Jeopardy questions can be answered with
information in Wikipedia/Wiktionary, properly understood.

So I'd start with Wikipedia: ~30GB uncompressed full article text. Break it
into chunks; canonicalize phrasings to be more declarative, and include
synonyms/hypernym/hyponym phrasings (via something like WordNet), so that
various 'cluesy' ways of saying things still bring up the same candidate
answers.
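
A minimal sketch of the chunk-and-expand idea above, assuming a tiny hand-written synonym table standing in for WordNet. The names `SYNONYMS`, `tokenize`, `chunk_text`, `build_index`, and `candidates` are hypothetical illustrations, not part of any real system:

```python
import re
from collections import defaultdict

# Hypothetical stand-in for WordNet: maps a term to one canonical form.
SYNONYMS = {"author": "writer", "novelist": "writer", "penned": "wrote"}

def tokenize(text):
    """Lowercase, split on non-letters, and map synonyms to a canonical term."""
    return [SYNONYMS.get(t, t) for t in re.findall(r"[a-z]+", text.lower())]

def chunk_text(text, size=40):
    """Break article text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(articles):
    """Inverted index: canonical term -> set of (title, chunk_number)."""
    index = defaultdict(set)
    for title, text in articles.items():
        for n, chunk in enumerate(chunk_text(text)):
            for term in tokenize(chunk):
                index[term].add((title, n))
    return index

def candidates(index, clue):
    """Rank article titles by how many distinct clue terms appear in them."""
    scores = defaultdict(int)
    for term in set(tokenize(clue)):
        titles = {title for title, _ in index.get(term, ())}
        for title in titles:
            scores[title] += 1
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))
```

Because "novelist" and "writer" collapse to the same term, a clue phrased one way can still hit an article phrased the other way. A real system would need stop-word handling and proper scoring (e.g. TF-IDF), which this sketch leaves out.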

Because it's free and compact and well-structured, throw in Freebase, too.

Jeopardy goes back to certain topics/answers again and again. So I'd scrape
the full 200K+ clue "J!Archive", and use it as both source and testing
material (though of course not testing the system on rounds in its memory).
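
One way to keep the test clean, sketched under the assumption that each scraped clue is a dict carrying a `game_id` field (a hypothetical field name; the real archive layout may differ), is to hold out whole games rather than individual clues:

```python
import random

def split_by_game(clues, test_fraction=0.2, seed=0):
    """Hold out whole games so no test clue shares a game with the source set."""
    game_ids = sorted({c["game_id"] for c in clues})
    random.Random(seed).shuffle(game_ids)
    n_test = max(1, int(len(game_ids) * test_fraction))
    test_games = set(game_ids[:n_test])
    source = [c for c in clues if c["game_id"] not in test_games]
    test = [c for c in clues if c["game_id"] in test_games]
    return source, test
```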

And I'd add special interpretation rules for commonly-recurring category
types: X-letter words, before-and-after, quasi-multiple-choice,
words-in-quotes.
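
As a rough sketch of what two such rules could look like (X-letter words and words-in-quotes), assuming category names arrive as plain strings; the function names and the regexes are hypothetical, and real categories are phrased far more loosely:

```python
import re

def letter_count_rule(category):
    """Return N if the category looks like 'N-LETTER WORDS', else None."""
    m = re.search(r"(\d+)-LETTER WORDS", category.upper())
    return int(m.group(1)) if m else None

def quoted_fragment_rule(category):
    """Return the lowercased text inside double quotes in the category, if any."""
    m = re.search(r'"([^"]+)"', category)
    return m.group(1).lower() if m else None

def filter_candidates(category, answers):
    """Keep only candidate answers consistent with the category's rule."""
    n = letter_count_rule(category)
    if n is not None:
        return [a for a in answers if len(a) == n and a.isalpha()]
    frag = quoted_fragment_rule(category)
    if frag is not None:
        return [a for a in answers if frag in a.lower()]
    return list(answers)  # no special rule recognized; keep everything
```

The point is that these rules prune the candidate list cheaply before any heavier scoring runs.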

I think such a system might get half or more of the questions in a typical
round correct, and in a matter of seconds, even on a single machine.

~~~
aristus
"properly understood" is the whole point. That's the hard problem they are
trying to solve.

~~~
gojomo
Sure, but it does put somewhat of a cap on the amount of reference material
you need to import. And a fairly low cap in the tens of GB:
Wikipedia/Wiktionary/WordNet/Freebase/J!Archive is probably enough.

Beyond that, you want software/heuristics. You might find far more data
helpful to initially create that software, but once it's created, the
reference material to have at hand can come from a small set of sources.

~~~
knowledgesale
There is a project to formalize human "common sense", with a (partly)
open-source database. Take a look: <http://en.wikipedia.org/wiki/Cyc>

~~~
jokermatt999
For those who were intrigued by the story of Eurisko (the space-fleet-battle
bot that completely crushed the competition and was then impossible to find
decent info on): the developer of Eurisko, Douglas Lenat, is the one who
started this project.

------
tel
_Search optimization: No, this team focused on making IBM Watson optimized to
answer in 3 seconds or less. We can accept a slower response, so we can skip
this._

That makes me laugh. I'd guess that search optimization effort has a power-law
response here. 3 seconds is extraordinary, 1 minute is tricky, 10 minutes is
possible after some solid effort, and anywhere from 3 days to the heat death
of the universe is what you get without optimization.

Not saying you actually ignore it. It's built into those libraries they
casually throw around. Just thought the wording was funny.

------
srean
I somehow cannot give up daydreaming wistfully about a personal CM-5. From a
previous discussion on HN, it seems it is still going to be an expensive thing
to build as a toy project, particularly because of the hypercube
interconnection. I'm not sure if the source code for star-Lisp is available,
but I think an emulator lives on at SourceForge.

Edit 1: Here it is <http://sourceforge.net/projects/starsim/>

Edit 2: Just double-checked; the SourceForge repository has no code!! But I
found it here:
[http://examples.franz.com/category/Application/ParallelProgr...](http://examples.franz.com/category/Application/ParallelProgramming/index.html)

@dhess Thanks a lot for that link. I just ordered a copy :)

~~~
dhess
I think you might like this book: The Paralation Model, by Gary W. Sabot.

[http://www.amazon.com/Paralation-Model-Architecture-Independ...](http://www.amazon.com/Paralation-Model-Architecture-Independent-Programming-Intelligence/dp/0262192772/ref=sr_1_1?ie=UTF8&s=books&qid=1298514403&sr=8-1)

~~~
denimboy
Interesting. Found this article which explains the paralation model.

[http://www.mactech.com/articles/mactech/Vol.08/08.07/Paralat...](http://www.mactech.com/articles/mactech/Vol.08/08.07/Paralation/index.html)

------
charlesju
It seems to me it would be a lot easier to use EC2 than to set up all the
machines in the basement.

------
kirpekar
Watson has become an unbelievable marketing tool for IBM.

~~~
rudiger
Has become? From the beginning, Watson's purpose has been to advertise IBM's
computers.

~~~
tel
And it's little surprise that it worked! IBM's been here before
(Deep Blue).

------
moomba
This article doesn't really tell you how to build a "Watson Jr.", as they call
it. It just tells you to use OpenNLP and UIMA (which is unnecessary, though
it's understandable why it's advocated, since IBM created it).

I was kind of hoping that there would be a deeper dive into how the data was
being stored and retrieved. I'm also interested in the machine learning side
of it. They don't really give any hints about that either.

~~~
nl
Actually, UIMA _was_ used to train Watson
(<https://cwiki.apache.org/UIMA/powered-by-apache-uima.html>).

You are right that UIMA isn't _needed_, but some kind of tool for importing
unstructured or semi-structured data is required.

------
joakin
I hate it when a page I'm interested in makes an AJAX call whenever I click
something, since I read long texts by clicking and selecting text.

Why, man? Why? If you want to track clicks on links with JavaScript, don't
trigger the AJAX call when I click on :not(a) ... -_-'

------
maeon3
This is one of the great moments in the history of humanity, right up there
with the first self-powered flying machine. North Carolina got a licence
plate: "FIRST IN FLIGHT". Someone is going to get the credit for the first
open-ended question-answering machine shortly. Who gets it?

The race is on. Whoever creates the first reasonably good question-answering
machine for demonstration will get their name etched into the sands of time
for the next ten thousand years. Get to it!

This industry has the chance to be bigger than Google and Microsoft combined.
Every person on the Earth will demand one of these. Those who won't have one
will be at a remarkable disadvantage. This is going to turn into a trillion
dollar industry.

~~~
dailystatusrpt
Open-ended question-answering systems have been around for many years. There
are stacks of research papers written about them; Watson is an improvement.

