Hello,
I've been trying to use 80legs to crawl about 120,000 pages to extract lawyer bios from the websites of 200 top US law firms. I created a simple Django app with about 400 lawyers in the database (SQLite). The app pretends to be a who-knows-who database by matching schools, memberships in professional associations, etc. Now I want to add those 100,000+ lawyers to make the app a bit more interesting.
I love working with Django and Python, but I don't know Java, and 80legs requires the app to be written in Java (see http://80legs.pbworks.com/80Apps). Shion Deysarkar of 80legs mentions that they offer custom coding, and he is willing to give me an estimate. Before doing that, I wanted to ask HN whether anyone here has used 80legs and written an 80app to analyze the crawled pages, and whether they could give me instructions and advice about the process.
In fact, I would appreciate any general advice on the best way to approach this project.
Thank you.
Crawling "only" 120k pages can be done easily with a pure Python solution over a normal home or office internet connection. The urllib, urllib2, robotexclusionrulesparser, and lxml packages are a good start.
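For illustration, here is a minimal sketch of that stack in Python 2 (urllib2 is a 2.x module). The user agent, host, URL, and XPath below are placeholders I made up; every firm's markup will differ:

    import urllib2
    import lxml.html
    import robotexclusionrulesparser

    USER_AGENT = "LawyerBioCrawler/0.1 (contact: you@example.com)"  # placeholder

    # Honor robots.txt before fetching anything from a host.
    robots = robotexclusionrulesparser.RobotExclusionRulesParser()
    robots.fetch("http://www.example-lawfirm.com/robots.txt")  # placeholder host

    url = "http://www.example-lawfirm.com/attorneys/jane-doe.html"  # placeholder
    if robots.is_allowed(USER_AGENT, url):
        request = urllib2.Request(url, headers={"User-Agent": USER_AGENT})
        html = urllib2.urlopen(request).read()
        doc = lxml.html.fromstring(html)
        # Hypothetical XPath; you would write one per firm's bio template.
        name = doc.xpath("//h1/text()")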
Important: Don't forget to implement a crawl rate limit.
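The simplest version is a fixed delay between requests; a sketch, with an arbitrary delay value and a hypothetical helper wrapping the fetch-and-parse code above:

    import time

    CRAWL_DELAY = 2.0  # seconds between requests; arbitrary, tune per site

    for url in bio_urls:       # hypothetical list of bio page URLs
        fetch_and_parse(url)   # hypothetical helper, see the sketch above
        time.sleep(CRAWL_DELAY)

At 2 seconds per request, 120k pages take roughly three days of continuous crawling, which is why a normal home connection is enough.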