How to build your own "Watson Jr." in your basement (ibm.com)
147 points by flapjack 2518 days ago | 27 comments

My hunch is that 90-99% of all Jeopardy questions can be answered with information in Wikipedia/Wiktionary, properly understood.

So I'd start with Wikipedia: ~30GB uncompressed full article text. Break it into chunks; canonicalize phrasings to be more declarative, and include synonym/hypernym/hyponym phrasings (via something like WordNet), so that various 'cluesy' ways of saying things still bring up the same candidate answers.
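A rough sketch of the chunk-and-expand idea in Python, just to make it concrete. The tiny SYNONYMS table is a made-up stand-in for what WordNet would supply, and all function names here are hypothetical, not anything Watson actually uses:

```python
import re
from collections import defaultdict

# Hypothetical stand-in for WordNet lookups: word -> related terms.
SYNONYMS = {
    "author": {"writer", "novelist"},
    "wrote": {"penned", "authored"},
}

def tokenize(text):
    """Lowercase the text and pull out alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def expand(tokens):
    """Add related terms so differently phrased clues hit the same chunk."""
    out = set(tokens)
    for t in tokens:
        out |= SYNONYMS.get(t, set())
    return out

def build_index(chunks):
    """Inverted index: term -> ids of chunks containing it (after expansion)."""
    index = defaultdict(set)
    for i, c in enumerate(chunks):
        for term in expand(tokenize(c)):
            index[term].add(i)
    return index

def search(index, clue):
    """Rank chunk ids by how many distinct clue terms they match."""
    scores = defaultdict(int)
    for term in set(tokenize(clue)):
        for i in index.get(term, ()):
            scores[i] += 1
    return sorted(scores, key=lambda i: (-scores[i], i))
```

Expansion happens at index time here, so a clue that says "penned" still finds a chunk that says "wrote". A real system would presumably want weighting (rare terms count more) rather than a raw match count.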

Because it's free and compact and well-structured, throw in Freebase, too.

Jeopardy goes back to certain topics/answers again and again. So I'd scrape the full 200K+ clue "J!Archive", and use it as both source and testing material (though of course not testing the system on rounds in its memory).
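To keep the system from being tested on rounds it has memorized, one option is to split by game rather than by clue. A minimal sketch, assuming each scraped clue is a dict with a `game_id` field (that field name is my invention, not the archive's format):

```python
import random

def split_by_game(clues, test_fraction=0.2, seed=0):
    """Split clues so that no game contributes to both train and test."""
    game_ids = sorted({c["game_id"] for c in clues})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(game_ids)
    n_test = max(1, int(len(game_ids) * test_fraction))
    test_games = set(game_ids[:n_test])
    train = [c for c in clues if c["game_id"] not in test_games]
    test = [c for c in clues if c["game_id"] in test_games]
    return train, test
```

This doesn't catch clues that recur nearly verbatim across different games, so some deduplication on top would probably be needed.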

And I'd add special interpretation rules for commonly-recurring category types: X-letter words, before-and-after, quasi-multiple-choice, words-in-quotes.
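For instance, two of those category rules could be sketched as candidate filters. This assumes categories look like "5-LETTER WORDS" or contain a quoted fragment that must appear in the response, which is my reading of the convention; the function names are hypothetical:

```python
import re

def letter_count_rule(category):
    """Return a filter for 'N-LETTER WORDS' categories, else None."""
    m = re.search(r"(\d+)-LETTER WORDS?", category.upper())
    if not m:
        return None
    n = int(m.group(1))
    return lambda answer: len(answer) == n and answer.isalpha()

def quoted_rule(category):
    """Return a filter for categories with a quoted fragment, else None."""
    m = re.search(r'"([^"]+)"', category)
    if not m:
        return None
    fragment = m.group(1).lower()
    return lambda answer: fragment in answer.lower()

RULES = [letter_count_rule, quoted_rule]

def filter_candidates(category, candidates):
    """Keep only candidates that satisfy every rule the category triggers."""
    checks = [r for r in (rule(category) for rule in RULES) if r]
    return [c for c in candidates if all(check(c) for check in checks)]
```

Categories that trigger no rule pass every candidate through unchanged, so the filters only ever narrow the list produced by retrieval.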

I think such a system might get half or more of the questions in a typical round correct, and in a matter of seconds, even on a single machine.

"properly understood" is the whole point. That's the hard problem they are trying to solve.

Which is especially interesting, considering the other competitors are coming from exactly the opposite direction. They are playing the same game, but facing completely different challenges.

But properly understanding what? You can diagram the question all you want, but if you don't know the answer it's pointless, no? Wikipedia is just a source of data.

Given that Jeopardy focuses a lot on literature, I'd also throw in Project Gutenberg books. And probably a newspaper archive going way back, like the NYT.

Sure, but it does put somewhat of a cap on the amount of reference material you need to import. And a fairly low cap in the tens of GB: Wikipedia/Wiktionary/WordNet/Freebase/J!Archive is probably enough.

Beyond that, you want software/heuristics. You might find far more data helpful to initially create that software, but once it's created, the reference material to have at hand can come from a small set of sources.

There is a project to formalize human "common sense", with a (partly) open-source database. Take a look: http://en.wikipedia.org/wiki/Cyc

For those who were intrigued by the story of Eurisko (that space fleet battle playing bot that completely crushed the competition and then was impossible to find decent info on), the developer of Eurisko, Douglas Lenat, is the one who started this program.

I'm not certain, but I'd bet Watson was explicitly not allowed to use something like J!Archive as training data. For one, the questions used in the Jeopardy games it played were drawn randomly from previous questions. More importantly, though, learning a stilted, domain-specific language model to play Jeopardy isn't anywhere near as challenging, impressive, or worth pursuing as generating something that includes Jeopardy as a subset of its capacity.

Now, Watson was tuned on Jeopardy questions. I'm sure the learning processes were adjusted in light of mistakes made on the Jeopardy corpus, but interpolation is far less of a big deal than a full language model.

the questions used in the Jeopardy games it played were drawn randomly from previous questions

I've not heard that, and if true, it would have given Jennings and Rutter, both excellent crammers, a knowledge advantage.

Further, human contestants absolutely review the J!Archive before competing, so why wouldn't Watson?

We don't yet know for sure Jeopardy is only one subset of all the impressive things Watson can do. Notably, in the 'Ask Reddit' answers, the Watson team says: "At this point, all Watson can do is play Jeopardy and provide responses in the Jeopardy format."

So it seems like they're trying to claim the accolades for solving a bigger problem, when in fact they've only done well on a very constrained problem.

Can't find the quote at the moment, but IIRC the questions (or technically, answers) were drawn from previously prepared questions, but not previously used questions. The point being that aside from eliminating audio/video based questions, these had been designed with humans in mind and there was no tailoring of the content to be "Watson friendly/unfriendly".

That may help explain the confusion.

For what it's worth, I scraped J-Archive.com and wrote a couple of articles for Slate Magazine about what I found. More for the purpose of learning about Jeopardy than learning how to win.

- The first piece: http://www.slate.com/id/2284678/

- The follow-up, in which I answer reader questions: http://www.slate.com/id/2287705/

Search optimization: No, this team focused on making IBM Watson optimized to answer in 3 seconds or less. We can accept a slower response, so we can skip this.

That makes me laugh. I'd guess that search optimization effort has a power law response here. 3 seconds is extraordinary, 1 minute is tricky, 10 minutes is possible after some solid effort, 3 days-heat death of universe is what you get without optimization.

Not saying you actually ignore it. It's built into those libraries they casually throw around. Just thought the wording was funny.

I somehow cannot give up daydreaming wistfully about a personal CM-5. From a previous discussion on HN it seems it still is going to be an expensive thing to build as a toy project. Particularly because of the hyper-cube inter-connection. Not sure if the source code for star-Lisp is available. But I think an emulator lives on at Sourceforge.

Edit 1: Here it is http://sourceforge.net/projects/starsim/

Edit 2: Just double-checked: the Sourceforge repository has no code! But I found it here http://examples.franz.com/category/Application/ParallelProgr... @dhess Thanks a lot for that link. I just ordered a copy :)

I think you might like this book: The Paralation Model, by Gary W. Sabot.


Interesting. Found this article which explains the paralation model.


It seems to me it would be a lot easier to use EC2 than to set up all the machines in the basement.

Watson has become an unbelievable marketing tool for IBM.

Has become? From the beginning, Watson's purpose has been to advertise IBM's computers.

And there's pretty little surprise that it worked! IBM's been here before (Deep Blue).

I believe they also sell some of the tech behind this in their business-intelligence products, as a kind of extended semantic version of information extraction / information retrieval.

You sure its purpose isn't to reinstantiate the cryogenically frozen brain of IBM's founder? Maybe that's just a bonus.

It's worked pretty well for Jeopardy too.

This article doesn't really tell you how to build a "Watson Jr." as they call it. It just tells you to use OpenNLP and UIMA (which is unnecessary, though it's understandable why it's advocated, since IBM created it).

I was kind of hoping that there would be a deeper dive into how the data was being stored and retrieved. I'm also interested in the machine learning side of it. They don't really give any hints about that either.

Actually, UIMA was used to train Watson (https://cwiki.apache.org/UIMA/powered-by-apache-uima.html).

You are right that UIMA isn't needed, but some kind of tool for importing unstructured or semi-structured data is required.

I hate it when a page I'm interested in makes an ajax call whenever I click something, since I read long texts by clicking and selecting text.

Why, man? Why? If you want to track clicks on links with javascript, don't trigger the ajax call when I click in :not(a) ... -_-'

This is one of the great moments in the history of humanity, right up there with the first self-powered flying machine. North Carolina got a licence plate: "FIRST IN FLIGHT". Someone is going to get the credit for the open-ended question-answering machine shortly. Who gets it?

The race is on, whoever creates the first reasonably good question-answering machine for demonstration will get their names etched into the sands of time for the next ten thousand years. Get to it!

This industry has the chance to be bigger than Google and Microsoft combined. Every person on the Earth will demand one of these. Those who won't have one will be at a remarkable disadvantage. This is going to turn into a trillion dollar industry.

Open-ended question answering systems have been around for many years. There are stacks of research papers written about them; Watson is an improvement.
