

Ask HN: How to do targeted web crawling? - newcrawler

I'm thinking of building a domain-specific search engine. What are some general techniques for doing a targeted web crawl?

First, there's the issue of crawling pages that are relevant to my domain. How do I keep my crawler "on topic"? I know that I can start crawling from relevant seed sites, but how do I prevent my crawler from moving onto the general web once it hits a very general (and high PageRank) site such as Wikipedia?

Second, suppose that I am interested in finding a particular type of file (for example, PDFs). Are there techniques (again, other than seeding) for guiding the crawler toward sites that are likely to have lots of those files?

Thanks for any assistance.
======
wehriam
You could try running latent semantic analysis (LSA) on each newly fetched
page to score its similarity to other documents in your domain.
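
To make that concrete, here is a rough sketch of the idea, not a production
recipe. It assumes you already have a handful of known on-topic documents (the
astronomy sentences in DOMAIN_DOCS are made up for illustration), uses raw term
counts rather than TF-IDF, and computes a small truncated SVD by power
iteration in pure Python so it runs without extra libraries. The names
LsaModel, score and the 0.5 cutoff are all hypothetical choices; at real scale
you would likely reach for a linear algebra library instead.

```python
import math
import random
import re

# Hypothetical seed documents standing in for "your domain" (astronomy here).
DOMAIN_DOCS = [
    "telescope observes distant galaxy and star clusters",
    "the star formation rate in a spiral galaxy",
    "radio telescope survey of star forming regions",
    "galaxy redshift survey with a large telescope",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0.0 else 0.0

def dominant_eigenpair(matrix, rng, steps=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    vec = [rng.random() + 0.1 for _ in range(len(matrix))]
    for _ in range(steps):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    return dot(vec, [dot(row, vec) for row in matrix]), vec

class LsaModel:
    def __init__(self, docs, k=2, seed=0):
        terms = sorted({t for d in docs for t in tokenize(d)})
        self.vocab = {t: i for i, t in enumerate(terms)}
        cols = [self.counts(d) for d in docs]  # one term-count vector per document
        n = len(cols)
        # Gram matrix A^T A (documents x documents); its eigenvectors are the
        # right singular vectors of the term-document matrix A.
        gram = [[dot(cols[i], cols[j]) for j in range(n)] for i in range(n)]
        rng = random.Random(seed)
        self.basis = []  # top-k left singular vectors, living in term space
        for _ in range(k):
            value, v = dominant_eigenpair(gram, rng)
            if value <= 1e-9:
                break
            sigma = math.sqrt(value)
            # u = A v / sigma maps the document-space vector back to term space.
            self.basis.append(
                [sum(cols[j][t] * v[j] for j in range(n)) / sigma
                 for t in range(len(terms))]
            )
            # Deflate so the next pass finds the next singular direction.
            gram = [[gram[i][j] - value * v[i] * v[j] for j in range(n)]
                    for i in range(n)]
        self.doc_latent = [self.project(c) for c in cols]

    def counts(self, text):
        vec = [0.0] * len(self.vocab)
        for t in tokenize(text):
            if t in self.vocab:  # words outside the domain vocabulary are ignored
                vec[self.vocab[t]] += 1.0
        return vec

    def project(self, count_vec):
        """Fold a term-count vector into the k-dimensional latent space."""
        return [dot(u, count_vec) for u in self.basis]

    def score(self, text):
        """Best cosine similarity between the page and any domain document."""
        q = self.project(self.counts(text))
        return max(cosine(q, d) for d in self.doc_latent)

if __name__ == "__main__":
    model = LsaModel(DOMAIN_DOCS, k=2)
    for page in ["a new telescope image of a nearby galaxy",
                 "cheap flights hotel booking discounts"]:
        s = model.score(page)
        # 0.5 is an arbitrary cutoff for illustration; tune it on real pages.
        print(f"{s:.3f}  follow_links={s >= 0.5}  {page}")
```

In a crawler you would call score() on each fetched page and only enqueue its
outgoing links when the score clears your cutoff, which is one way to stop the
crawl from wandering off once it reaches a general site.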

I've been doing a lot of web crawling recently, so feel free to get in touch
if you'd like to discuss further.

