Ask HN: Building a domain-specific search engine. Advice, experiences?
3 points by jdale27 2947 days ago | 3 comments
I've been thinking for a while about building a domain-specific / "vertical" search engine. I usually end up telling myself it's crazy to think that I can compete with Google, but the recent thread (http://news.ycombinator.com/item?id=902999) on the apparently declining quality of Google search, especially for niche queries, got me thinking that it might not be so crazy after all. What's more, I know of at least one existing search startup in this domain that seems to be doing well, so that's some additional validation of this market.

Although I'm a pretty competent hacker, I'm new to search. I've done some reading on search and information retrieval in general, and picked up a copy of Manning et al.'s "Introduction to Information Retrieval" (http://nlp.stanford.edu/IR-book/information-retrieval-book.html), which looks fantastic.

However, I don't want to reinvent too many wheels, at least not until I really have to. I'm planning to dive into Hadoop, Lucene, Nutch, Solr, and other open source tools and see how far those can take me.

I would appreciate any advice on this endeavor -- preferably from your own real world experience -- tips, tricks, pitfalls, resources, etc.

One particular issue that I've pondered is how to do a good "targeted" web crawl -- how to restrict the crawl (or at least indexing) to pages I know are relevant to my domain. It occurred to me to seed the crawler with a set of "authoritative" domains that I know are relevant, then use pages from those domains to train a simple classifier to apply to each page visited, to decide whether to index it and crawl its outgoing links. Any other (simpler?) strategies?
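
A rough sketch of that seed-and-classify idea, to make it concrete. Everything here is an illustrative assumption: the tiny naive Bayes model, the `fetch` callback, the toy in-memory "web", and the need for some background (non-relevant) pages as negative examples. A real crawl would plug in Nutch or similar for fetching and a properly trained classifier.

```python
import math
import re
from collections import Counter, deque


def tokenize(text):
    """Lowercase and split on anything that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny two-class multinomial naive Bayes with add-one smoothing.

    Class priors are ignored; only per-token likelihood ratios are used.
    """

    def __init__(self, relevant_docs, background_docs):
        self.counts = {True: Counter(), False: Counter()}
        for doc in relevant_docs:
            self.counts[True].update(tokenize(doc))
        for doc in background_docs:
            self.counts[False].update(tokenize(doc))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        self.totals = {c: sum(self.counts[c].values()) for c in (True, False)}

    def log_odds(self, text):
        score = 0.0
        for tok in tokenize(text):
            if tok not in self.vocab:
                continue  # unseen tokens carry no evidence either way
            p_rel = (self.counts[True][tok] + 1) / (self.totals[True] + len(self.vocab))
            p_bg = (self.counts[False][tok] + 1) / (self.totals[False] + len(self.vocab))
            score += math.log(p_rel / p_bg)
        return score

    def is_relevant(self, text):
        return self.log_odds(text) > 0.0


def focused_crawl(seeds, fetch, classifier, max_pages=100):
    """Breadth-first crawl that only indexes and expands relevant pages.

    fetch(url) is a hypothetical callback returning (text, links) or None.
    """
    queue = deque(seeds)
    seen = set(seeds)
    indexed = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        fetched += 1
        text, links = page
        if not classifier.is_relevant(text):
            continue  # neither index the page nor follow its links
        indexed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# Toy in-memory "web" standing in for real HTTP fetches.
TOY_WEB = {
    "a.example/1": ("guitar chords and guitar tabs for beginners", ["a.example/2", "b.example/1"]),
    "a.example/2": ("acoustic guitar strings and tuning", ["c.example/1"]),
    "b.example/1": ("celebrity gossip and movie news", ["b.example/2"]),
    "b.example/2": ("guitar amp review", []),
    "c.example/1": ("electric guitar amp settings", []),
}

clf = NaiveBayes(
    relevant_docs=["guitar chords tabs", "guitar strings tuning amp"],
    background_docs=["celebrity gossip news", "movie review news"],
)
indexed = focused_crawl(["a.example/1"], TOY_WEB.get, clf)
print(indexed)
```

Note that `b.example/2` is never visited because it is only reachable through an off-topic page. That is the trade-off of pruning links on rejected pages: tighter focus, but relevant pages behind irrelevant ones get missed.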

Also, how far can I realistically expect to get with "off-the-shelf" OSS tools like Nutch, etc.? Those of you who've used such tools "in anger", what roadblocks and brick walls can I expect to run into?

Thanks in advance!

Solr should serve your purposes well. As far as a domain-specific crawl is concerned, maybe you could use our semantic classification API, which takes in a URL and tells you the DMOZ category it belongs to - http://www.wingify.com/contextsense/

Some other helpful books:

- Data Mining, by Witten and Frank; covers the basics and how to use Weka, which they wrote

A couple of Java-based books from Manning:

- Collective Intelligence in Action (by Satnam Alag) and

- Algorithms of the Intelligent Web (Marmanis, Babenko)

I've had good results combining the results of Solr and Sphinx, using each with and without different numbers of stopwords (50-200), i.e. 4 queries through 2 indexes. You might want to look into Xapian as well.
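
One possible way to merge several ranked lists like that is reciprocal rank fusion. This is only a sketch of one common technique, not necessarily how the results above were combined; the run names and document ids are made up, and `k=60` is just the value usually quoted for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over all lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical output of 4 queries through 2 indexes (ids are placeholders).
runs = {
    "solr_stop50": ["d1", "d2", "d3"],
    "solr_nostop": ["d2", "d1", "d4"],
    "sphinx_stop50": ["d2", "d3", "d1"],
    "sphinx_nostop": ["d1", "d2", "d5"],
}
fused = reciprocal_rank_fusion(runs.values())
print(fused)
```

Rank-based fusion avoids having to normalize Solr and Sphinx relevance scores against each other, since those scores are not on a comparable scale.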

You end up spending a lot of time evaluating combinations of stopwords, stemmers, token separators, UTF-8 to ISO latin conversions, etc.


Here are some things to look at for evaluating the quality of your search hits, e.g. precision vs. recall: mean reciprocal rank (MRR), mean average precision (MAP), precision at 10.
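
For reference, a minimal sketch of those three metrics. It assumes binary relevance judgments, with each query represented as a ranked list of document ids plus a set of ids judged relevant; the sample queries are invented.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant result appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Invented judgments: (ranked results, set of relevant ids) per query.
queries = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z"], {"y"}),
]
mrr = mean(reciprocal_rank(r, rel) for r, rel in queries)
map_score = mean(average_precision(r, rel) for r, rel in queries)
p_at_2 = mean(precision_at_k(r, rel, k=2) for r, rel in queries)
print(mrr, map_score, p_at_2)
```

MRR only cares about the first good hit, MAP rewards getting all relevant documents near the top, and precision at 10 measures what a user sees on the first page.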



Social media (reddit, digg, Y-C, delicious, tweets, facebook, stumbles, stackoverflow) can bootstrap you pretty far. Also look at Mixx, Y Buzz, etc. For setting domain bounds on your crawl, you can assemble a list of the most frequently tagged domains from e.g. delicious or subreddits. This works well for generating Google custom search engines.
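
Assembling that domain list could look roughly like this. It assumes you have already exported a flat list of bookmarked or submitted URLs for your topic; the URLs, the `min_count` threshold, and the `www.` stripping are illustrative choices.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_domains(urls, n=10, min_count=2):
    """Return the most frequently seen hostnames as (host, count) pairs."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no network location
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return [(host, c) for host, c in counts.most_common(n) if c >= min_count]


# Made-up URLs standing in for a tag or subreddit export.
tagged_urls = [
    "http://www.guitar-tabs.example/song/1",
    "http://guitar-tabs.example/song/2",
    "https://luthier.example/blog/fret-wear",
    "https://luthier.example/blog/bracing",
    "http://guitar-tabs.example/song/3",
    "http://random.example/one-off",
]
domains = top_domains(tagged_urls)
print(domains)
```

The `min_count` cutoff drops one-off domains, so the resulting list can be used as a crawl whitelist or as seed sites.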

(email w/ questions/comments)
