

Ask HN: Building a domain-specific search engine. Advice, experiences? - jdale27

I've been thinking for a while of building a domain-specific / "vertical" search engine. I usually end up telling myself it's crazy to think that I can compete with Google, but the recent thread (http://news.ycombinator.com/item?id=902999) on the apparently declining quality of Google search, especially for niche queries, got me thinking that it might not be so crazy after all. What's more, I know of at least one existing search startup in this domain that seems to be doing well, so that's some additional validation of the market.

Although I'm a pretty competent hacker, I'm new to search. I've done some reading on search and information retrieval in general, and picked up a copy of Manning et al.'s "Introduction to Information Retrieval" (http://nlp.stanford.edu/IR-book/information-retrieval-book.html), which looks fantastic.

However, I don't want to reinvent too many wheels, at least not until I really have to. I'm planning to dive into Hadoop, Lucene, Nutch, Solr, and other open source tools and see how far those can take me.

I would appreciate any advice on this endeavor -- preferably from your own real-world experience -- tips, tricks, pitfalls, resources, etc.

One particular issue that I've pondered is how to do a good "targeted" web crawl -- how to restrict the crawl (or at least the indexing) to pages I know are relevant to my domain. It occurred to me to seed the crawler with a set of "authoritative" domains that I know are relevant, then use pages from those domains to train a simple classifier to apply to each page visited, to decide whether to index it and crawl its outgoing links (rough sketch at the end of this post). Any other (simpler?) strategies?

Also, how far can I realistically expect to get with "off-the-shelf" OSS tools like Nutch, etc.? Those of you who've used such tools "in anger", what roadblocks and brick walls can I expect to run into?

Thanks in advance!
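
To make the classifier idea concrete, here's a minimal sketch of what I have in mind, assuming scikit-learn; the training texts, function names, and threshold are placeholders, and fetching/parsing pages is left out:

    # Sketch of a classifier-gated crawl: train on pages from seed
    # domains vs. random out-of-domain pages, then only index a page
    # (and follow its links) if the classifier scores it as in-domain.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def train_domain_classifier(positive_texts, negative_texts):
        X = positive_texts + negative_texts
        y = [1] * len(positive_texts) + [0] * len(negative_texts)
        clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                            MultinomialNB())
        clf.fit(X, y)
        return clf

    def should_index_and_expand(clf, page_text, threshold=0.7):
        # threshold is a knob: higher keeps the crawl tighter,
        # lower trades topical drift for recall
        return clf.predict_proba([page_text])[0][1] >= threshold

The same score would gate both indexing and link expansion, so off-topic pages neither pollute the index nor widen the crawl frontier.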
======
paraschopra
Solr can serve your purposes well. As far as a domain-specific crawl is
concerned, maybe you could use our semantic classification API, which takes
a URL and returns the DMOZ category it belongs to -
<http://www.wingify.com/contextsense/>

------
gtani
Some other helpful books:

- Data Mining, by Witten and Frank; covers the basics and how to use Weka,
which they wrote

and a couple of Java-based books from Manning:

- Collective Intelligence in Action (Satnam Alag)

- Algorithms of the Intelligent Web (Marmanis, Babenko)

~~~
gtani
I've had good results combining the results of Solr and Sphinx, querying
each with and without stopword lists of different sizes (50-200 words),
i.e. 4 queries through 2 indexes; you might want to look into Xapian as well.
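
For the combining step, reciprocal rank fusion is one simple option (just a sketch; plenty of other merge strategies work):

    # Merge ranked result lists (best first) from multiple indexes /
    # query variants via reciprocal rank fusion; k=60 is the usual
    # smoothing constant.
    from collections import defaultdict

    def rrf_merge(ranked_lists, k=60):
        scores = defaultdict(float)
        for results in ranked_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # e.g. rrf_merge([solr_stop, solr_nostop, sphinx_stop, sphinx_nostop])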

You end up spending a lot of time evaluating combinations of stopwords,
stemmers, token separators, UTF-8 to ISO Latin-1 conversions, etc.

__________________________________________

Here are some resources on evaluating the quality of your search hits, e.g.
precision vs. recall:

<http://www.alistapart.com/articles/testing-search-for-relevancy-and-precision/>

<http://www.alistapart.com/articles/internal-site-search-analysis-simple-effective-life-altering/>

mean reciprocal rank (MRR), mean average precision (MAP), precision at 10

<http://www.computationalmedicine.org/challenge/cmcChallengeDetails.pdf>
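
In case it helps, rough reference implementations of those metrics, assuming binary relevance (a ranked list of doc ids and a set of relevant ids):

    def precision_at_k(results, relevant, k=10):
        # fraction of the top-k results that are relevant
        return sum(1 for d in results[:k] if d in relevant) / float(k)

    def reciprocal_rank(results, relevant):
        # 1/rank of the first relevant hit; 0 if none retrieved
        for rank, d in enumerate(results, start=1):
            if d in relevant:
                return 1.0 / rank
        return 0.0

    def average_precision(results, relevant):
        hits, total = 0, 0.0
        for rank, d in enumerate(results, start=1):
            if d in relevant:
                hits += 1
                total += hits / float(rank)  # precision at this hit
        return total / max(len(relevant), 1)

    # MRR and MAP are the means of the last two over your query set.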

______________________________

Social media sites (reddit, digg, Y-C, delicious, tweets, facebook,
stumbles, stackoverflow) can bootstrap you pretty far. Also look at Mixx,
Yahoo Buzz, etc. For setting domain bounds on your crawl: you can assemble
a list of the most frequently tagged domains from e.g. delicious or
subreddits (sketch below). This works well for generating Google Custom
Search engines.
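
A quick sketch of that domain-frequency step (the URL dump file is a placeholder -- however you export the bookmarks or submissions is up to you):

    # Given a dump of bookmarked/submitted URLs for your topic (one
    # per line), rank domains by frequency to get crawl seeds.
    from collections import Counter
    from urllib.parse import urlparse

    def top_domains(url_file, n=100):
        counts = Counter()
        with open(url_file) as f:
            for line in f:
                host = urlparse(line.strip()).netloc.lower()
                if host:
                    counts[host] += 1
        return counts.most_common(n)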

(email w/ questions/comments)

