> > Google have manually mocked up their early product? How would

> Crawl an intentional community (remember webrings?) or other small directed subset of the web and see if you're able to get better search results using terms you know exist in the corpus, rather than all of the Internet.

But that isn't a mock-up; it's the real thing on a smaller dataset. If you're going to build the real thing anyway, why not run it on all the data you can?

After all, the limiting factor for release is the engine, not the dataset. If you're going to write the full engine anyway, nothing is saved by restricting it to a subset of the data.