
>I found that there are no web crawling frameworks out there that allow for large-scale and continuous crawling of changing data.

Are you distinguishing between "I found that there are no" and "I didn't find any so far"?

Which ones that came close have you rejected, and why?

I can't say for sure that there are none, but I believe I've done quite a bit of research. If there really were an excellent web crawling framework, it should have bubbled up to the top.

I don't remember the names of all the projects that I've looked at, but the main ones were Nutch, Heritrix, Scrapy and crawler4j. I've come across several companies/startups that have built their crawlers in-house for the same reasons (e.g. http://blog.semantics3.com/how-we-built-our-almost-distribut...).
