So I'd start with Wikipedia: ~30GB of uncompressed full article text. Break it into chunks; canonicalize phrasings to be more declarative, and include synonym/hypernym/hyponym phrasings (via something like WordNet), so that the various 'cluesy' ways of saying things still bring up the same candidate answers.
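As a minimal sketch of that expansion step: index each chunk not just under the words it contains, but under their synonyms and hypernyms as well, so a clue phrased as "this beverage" retrieves the same chunk as "coffee". The tiny `THESAURUS` dict here is a hypothetical stand-in for a real WordNet lookup, and the function names are illustrative.

```python
# Toy thesaurus standing in for WordNet synonym/hypernym lookups.
THESAURUS = {
    "coffee": {"synonyms": ["java", "joe"], "hypernyms": ["beverage", "drink"]},
    "paris":  {"synonyms": [], "hypernyms": ["capital", "city"]},
}

def expand_terms(term):
    """Return the term plus its synonym and hypernym phrasings."""
    entry = THESAURUS.get(term.lower(), {})
    expanded = {term.lower()}
    expanded.update(entry.get("synonyms", []))
    expanded.update(entry.get("hypernyms", []))
    return expanded

def index_chunk(chunk_id, text, index):
    """Add every expanded form of each word in a chunk to an inverted index."""
    for word in text.lower().split():
        for form in expand_terms(word):
            index.setdefault(form, set()).add(chunk_id)

index = {}
index_chunk("chunk-42", "Paris coffee culture", index)
# A clue saying "beverage" now pulls up the same chunk as one saying "coffee".
```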
Because it's free and compact and well-structured, throw in Freebase, too.
Jeopardy goes back to certain topics/answers again and again. So I'd scrape the full 200K+ clue "J!Archive", and use it as both source and testing material (though of course not testing the system on rounds in its memory).
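One way to keep the "don't test on rounds in its memory" discipline is to split the scraped archive by show rather than by clue, so every round lands wholly on one side. A hash-based sketch, with hypothetical field names:

```python
import hashlib

def split_by_show(clues, holdout_fraction=0.1):
    """Deterministically assign whole shows to memory or held-out sets."""
    memory, held_out = [], []
    for clue in clues:
        digest = hashlib.sha1(str(clue["show_id"]).encode()).hexdigest()
        bucket = int(digest, 16) % 100
        (held_out if bucket < holdout_fraction * 100 else memory).append(clue)
    return memory, held_out

# Fake archive: 50 shows, 3 clues each.
clues = [{"show_id": s, "clue": f"clue {i}"} for s in range(50) for i in range(3)]
memory, held_out = split_by_show(clues)
# All clues from any one show end up on the same side of the split,
# so test rounds never appear in the indexed source material.
```

Hashing the show ID (rather than random sampling) makes the split stable across re-scrapes.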
And I'd add special interpretation rules for commonly-recurring category types: X-letter words, before-and-after, quasi-multiple-choice, words-in-quotes.
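Those interpretation rules can be cheap constraint filters applied before answer scoring. A sketch for two of the category types (the rule set and function names are illustrative, not any actual Watson mechanism):

```python
import re

def category_constraint(category):
    """Return a filter predicate implied by the category name, if any."""
    m = re.search(r"(\d+)-LETTER WORDS?", category.upper())
    if m:
        # X-letter words: discard candidates of the wrong length.
        n = int(m.group(1))
        return lambda answer: len(answer.replace(" ", "")) == n
    if '"' in category:
        # Words-in-quotes: answer must contain the quoted fragment.
        fragment = re.search(r'"([^"]+)"', category).group(1).lower()
        return lambda answer: fragment in answer.lower()
    return lambda answer: True  # no special rule recognized

keep = category_constraint("4-LETTER WORDS")
filtered = [c for c in ["epee", "sword", "foil"] if keep(c)]
# "sword" is eliminated before any retrieval scoring happens.
```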
I think such a system might get half or more of the questions in a typical round correct, and in a matter of seconds, even on a single machine.
Given that Jeopardy focuses a lot on literature, I'd also throw in Project Gutenberg books. And probably a newspaper archive going way back, like the NYT's.
Beyond that, you want software/heuristics. You might find far more data helpful when initially creating that software, but once it's created, the reference material to keep at hand can come from a small set of sources.
Now, Watson was tuned on Jeopardy questions. I'm sure the learning processes were adjusted in light of mistakes made on the Jeopardy corpus, but interpolation is far less of a big deal than a full language model.
I've not heard that, and if true, it would have given Jennings and Rutter, both excellent crammers, a knowledge advantage.
Further, human contestants absolutely review the J!Archive before competing, so why wouldn't Watson?
We don't yet know for sure that Jeopardy is only one subset of all the impressive things Watson can do. Notably, in the 'Ask Reddit' answers, the Watson team says: "At this point, all Watson can do is play Jeopardy and provide responses in the Jeopardy format."
So it seems like they're trying to claim the accolades for solving a bigger problem, when in fact they've only done well on a very constrained problem.
That may help explain the confusion.
- The first piece: http://www.slate.com/id/2284678/
- The follow-up, in which I answer reader questions: http://www.slate.com/id/2287705/
That makes me laugh. I'd guess that search-optimization effort has a power-law response here: 3 seconds is extraordinary, 1 minute is tricky, 10 minutes is possible after some solid effort, and anywhere from 3 days to the heat death of the universe is what you get without optimization.
Not saying you actually ignore it. It's built into those libraries they casually throw around. Just thought the wording was funny.
Edit 1: Here it is http://sourceforge.net/projects/starsim/
Edit 2: Just double-checked: the SourceForge repository has no code! But I found it here: http://examples.franz.com/category/Application/ParallelProgr...
@dhess Thanks a lot for that link. I just ordered a copy :)
I was kind of hoping there would be a deeper dive into how the data was being stored and retrieved. I'm also interested in the machine-learning side of it; they don't really give any hints at that, either.
You are right that UIMA isn't needed, but some kind of tool for importing unstructured or semi-structured data is required.
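At its simplest, that import tool just needs to normalize messy semi-structured input into uniform records. A minimal sketch standing in for UIMA-style ingestion, with a hypothetical pipe-delimited clue format and illustrative field names:

```python
def import_clues(lines):
    """Parse 'Category | Clue | Answer' lines, skipping blanks and malformed rows."""
    records = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3 or not all(parts):
            continue  # tolerate noise rather than fail the whole import
        category, clue, answer = parts
        records.append({"category": category, "clue": clue, "answer": answer})
    return records

raw = [
    "POTENT POTABLES | This fortified wine comes from Portugal | port",
    "garbage line without delimiters",
    "U.S. CITIES | Its airport is named for a WWII hero | Chicago",
]
records = import_clues(raw)
# Two clean records come out; the noisy line is dropped silently.
```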
The race is on: whoever creates the first reasonably good question-answering machine for demonstration will get their names etched into the sands of time for the next ten thousand years. Get to it!
This industry has the chance to be bigger than Google and Microsoft combined. Every person on Earth will demand one of these, and those who don't have one will be at a remarkable disadvantage. This is going to turn into a trillion-dollar industry.