
Creating the list of URLs and prioritizing them is the hardest thing about building a crawler! That is, a good, web-scale one. A replacement for wget might be sort of fun, but the real way to make a fast crawler is to be choosy: figure out which pages get updated frequently, which are likely to contain good content (by computing a PageRank-like score on the fly), and so on.
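
A rough sketch of that idea in Python: keep the URL frontier as a priority queue, where each URL's priority mixes a crude quality score (a stand-in for the PageRank-ish signal) with how stale the page is. The class name, the weighting, and the 30-day staleness cap are all made up for illustration, not taken from any real crawler.

    import heapq
    import time

    class Frontier:
        """Priority queue of URLs; lower score means crawl sooner."""

        def __init__(self):
            self._heap = []
            self._seen = set()

        def add(self, url, quality=0.5, last_crawled=None):
            # quality: a 0..1 guess at how good the page is (e.g. a
            # PageRank-like score computed on the fly).
            # last_crawled: unix time of the previous fetch, or None.
            if url in self._seen:
                return
            self._seen.add(url)
            staleness = time.time() - last_crawled if last_crawled else float("inf")
            # Favor high-quality, stale pages; the cap and weighting are arbitrary.
            priority = -(quality * min(staleness, 86400 * 30))
            heapq.heappush(self._heap, (priority, url))

        def pop(self):
            # Return the most promising URL, or None if the frontier is empty.
            if not self._heap:
                return None
            _, url = heapq.heappop(self._heap)
            return url

    frontier = Frontier()
    frontier.add("http://example.com/news", quality=0.9, last_crawled=time.time() - 3600)
    frontier.add("http://example.com/old-page", quality=0.2, last_crawled=time.time() - 60)
    print(frontier.pop())  # the higher-quality, staler news page comes out first

A real crawler would layer politeness (per-host rate limits), revisit scheduling, and dedup on top of this, but the core is the same: a scored queue, not a flat list.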

It is far from my area of expertise, but the Wikipedia page about this looks very useful. It cites a bunch of wicked smart people. http://en.wikipedia.org/wiki/Web_crawler

If you just want to suck down a bunch of pages, then there's nothing wrong with wget.



