I've been thinking for a while about building a domain-specific / "vertical" search engine. I usually end up telling myself it's crazy to think I can compete with Google, but the recent thread (http://news.ycombinator.com/item?id=902999) on the apparently declining quality of Google search, especially for niche queries, got me thinking that it might not be so crazy after all. What's more, I know of at least one existing search startup in this domain that seems to be doing well, so that's some additional validation of this market.
Although I'm a pretty competent hacker, I'm new to search. I've done some reading on search and information retrieval in general, and picked up a copy of Manning et al.'s "Introduction to Information Retrieval" (http://nlp.stanford.edu/IR-book/information-retrieval-book.html), which looks fantastic.
However, I don't want to reinvent too many wheels, at least not until I really have to. I'm planning to dive into Hadoop, Lucene, Nutch, Solr, and other open source tools and see how far those can take me.
I would appreciate any advice on this endeavor -- preferably from your own real-world experience -- tips, tricks, pitfalls, resources, etc.
One particular issue I've pondered is how to do a good "targeted" web crawl -- how to restrict the crawl (or at least the indexing) to pages I know are relevant to my domain. It occurred to me to seed the crawler with a set of "authoritative" domains I know are relevant, then use pages from those domains to train a simple classifier that's applied to each page visited to decide whether to index it and follow its outgoing links. Any other (simpler?) strategies?
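For concreteness, here's a rough sketch of the kind of focused crawl I have in mind (Python, standard library only; SEED_URLS, RELEVANCE_THRESHOLD, and MAX_PAGES are placeholder values, and the "classifier" is just cosine similarity against a bag-of-words centroid built from the seed pages -- a stand-in for whatever real classifier I'd actually train):

    import re
    import urllib.request
    from collections import Counter, deque
    from html.parser import HTMLParser
    from math import sqrt
    from urllib.parse import urljoin

    SEED_URLS = ["http://example.org/"]   # placeholder "authoritative" pages
    RELEVANCE_THRESHOLD = 0.2             # placeholder; tune on held-out pages
    MAX_PAGES = 100

    class PageParser(HTMLParser):
        """Collects visible text and outgoing links from an HTML page."""
        def __init__(self):
            super().__init__()
            self.text, self.links = [], []
        def handle_data(self, data):
            self.text.append(data)
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="ignore")

    def tokens(text):
        return re.findall(r"[a-z]{3,}", text.lower())

    def cosine(a, b):
        common = set(a) & set(b)
        num = sum(a[t] * b[t] for t in common)
        den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def parse(url):
        parser = PageParser()
        parser.feed(fetch(url))
        text = " ".join(parser.text)
        links = [urljoin(url, href) for href in parser.links]
        return text, links

    # Build a crude "domain model" from the seed pages.
    centroid = Counter()
    frontier = deque()
    for url in SEED_URLS:
        text, links = parse(url)
        centroid.update(tokens(text))
        frontier.extend(links)

    # Crawl: only index pages that look relevant, and only expand their links.
    seen, indexed = set(SEED_URLS), []
    while frontier and len(indexed) < MAX_PAGES:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            text, links = parse(url)
        except Exception:
            continue
        score = cosine(Counter(tokens(text)), centroid)
        if score >= RELEVANCE_THRESHOLD:
            indexed.append(url)      # in a real system: hand off to Lucene/Solr
            frontier.extend(links)   # only follow links from relevant pages

    print(f"indexed {len(indexed)} pages")

Obviously a real crawler would need politeness delays, robots.txt handling, dedup, and a proper indexer behind it -- this just illustrates the "classify, then decide whether to index and expand" loop I'm describing.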
Also, how far can I realistically expect to get with "off-the-shelf" OSS tools like Nutch, etc.? Those of you who've used such tools "in anger", what roadblocks and brick walls can I expect to run into?
Thanks in advance!