
Ask HN: What's a good general seed list for a web crawler - jdrock
I'm developing a web crawler on top of a large distributed computer.  As part of the testing process, I want to keep a background job running that crawls the web over and over.  I was wondering if anyone had ideas for a general seed list from which the crawler could reach a wide variety of links.  It would be great if the links it traversed were a good representation of the Internet as a whole, taking into account content variety, frequency of updates, and other variables.
======
soult
Wikipedia provides dumps of its link table:
<http://download.wikimedia.org/backup-index.html>
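
A rough sketch of pulling seed URLs out of one of those dumps, in Python,
assuming the gzipped "externalinks"-style SQL dump (the exact filename and
dump format are assumptions, so treat this as a starting point, not the
canonical way to parse it):

    import gzip
    import re

    # Matches quoted http/https URLs inside the SQL INSERT statements.
    URL_RE = re.compile(r"'(https?://[^']+)'")

    def seeds_from_dump(path, limit=10000):
        """Yield up to `limit` distinct URLs found in the dump."""
        seen = set()
        with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                for url in URL_RE.findall(line):
                    if url not in seen:
                        seen.add(url)
                        yield url
                        if len(seen) >= limit:
                            return

    if __name__ == "__main__":
        # Hypothetical filename; use whichever dump you actually downloaded.
        for url in seeds_from_dump("enwiki-latest-externallinks.sql.gz", limit=20):
            print(url)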

------
alex_c
I've never done something like this myself, but what about using something
like <http://www.dmoz.org/>?

------
gojomo
DMOZ, Wikipedia, and Yahoo Directory are the classic broad starting points. You
could also begin with the top 100, 500, 1000, etc. sites from some ranking
service (like Alexa), or top N results from major search engines on queries of
special interest.
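
For the ranking-service route, a minimal sketch, assuming a "rank,domain"
CSV along the lines of what Alexa has historically published (the filename
and column layout here are assumptions):

    import csv

    def seeds_from_ranking(path, top_n=500):
        """Turn the top N ranked domains into homepage seed URLs."""
        seeds = []
        with open(path, newline="") as f:
            for row in csv.reader(f):
                seeds.append("http://" + row[1] + "/")  # row = [rank, domain]
                if len(seeds) >= top_n:
                    break
        return seeds

    # e.g. seeds_from_ranking("top-1m.csv", top_n=100)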

Depending on how you order discovered URLs and sites for crawling, it may not
make too much of a difference where you start a truly web-wide crawl: you'll
quickly reach major hubs, and everything else, after a short period. Then it's
a matter of where the crawler chooses to spend its attention: which paths, how
deep.
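
A minimal sketch of that point: with a plain breadth-first frontier, the
choice of seeds washes out quickly, and the depth and page caps are where
the crawler's attention actually gets spent. Here fetch_links is a
placeholder for your own fetcher and link extractor, not a real library
call:

    from collections import deque

    def crawl(seeds, fetch_links, max_depth=3, max_pages=10000):
        """Breadth-first crawl: FIFO frontier, capped depth and page count."""
        frontier = deque((url, 0) for url in seeds)
        seen = set(seeds)
        fetched = 0
        while frontier and fetched < max_pages:
            url, depth = frontier.popleft()
            fetched += 1
            if depth >= max_depth:
                continue  # "which paths, how deep" is decided right here
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
        return seen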

If you keep crawling 'over and over', you may want to pick what you revisit
based on your own follow-up analysis, not on the seeds of your first crawl(s).
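
One way to act on that, sketched with arbitrary factors and bounds (an
assumption of mine, not a standard algorithm): shrink a page's revisit
interval when it changed since the last fetch, and grow it when it didn't:

    import heapq
    import time

    MIN_INTERVAL, MAX_INTERVAL = 3600, 30 * 86400  # 1 hour .. 30 days

    class RevisitScheduler:
        def __init__(self):
            self.heap = []      # (next_visit_time, url), soonest first
            self.interval = {}  # url -> current revisit interval in seconds

        def record(self, url, changed, now=None):
            """Call after each fetch; halve or double the revisit interval."""
            now = now or time.time()
            cur = self.interval.get(url, 86400)
            cur = cur / 2 if changed else cur * 2
            cur = max(MIN_INTERVAL, min(MAX_INTERVAL, cur))
            self.interval[url] = cur
            heapq.heappush(self.heap, (now + cur, url))

        def due(self, now=None):
            """Yield every URL whose scheduled revisit time has passed."""
            now = now or time.time()
            while self.heap and self.heap[0][0] <= now:
                yield heapq.heappop(self.heap)[1]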

------
fizx
dmoz

~~~
jdrock
Haha, why didn't I think of this... probably the one link I need - thanks!

------
okeumeni
Yahoo Directory is a good start.

------
xenophanes
Could you Google each word in the dictionary and use the top 10 results from
each? I don't know if this is a decent idea or not. Maybe someone will tell me :)
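
As a sketch, with search_top_urls standing in for whatever search API you
have access to (it's hypothetical, and note that scraping result pages
usually violates a search engine's terms of service):

    def seeds_from_dictionary(words, search_top_urls, per_word=10):
        """Union the top results of one search per dictionary word."""
        seeds = set()
        for word in words:
            seeds.update(search_top_urls(word)[:per_word])
        return seeds

    # e.g. words = [w.strip() for w in open("/usr/share/dict/words")]
    #      seeds = seeds_from_dictionary(words[:100], my_search_fn)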

