

Ask HN: Writing an 80App - zeynel1

Hello,

I've been trying to use 80legs to crawl about 120,000 pages to extract lawyer bios from the websites of 200 top US law firms. I created a simple Django app with about 400 lawyers in the database (SQLite). The app pretends to be a who-knows-who database by matching schools, memberships in professional associations, etc. Now I want to add those 100,000+ lawyers to make the app a bit more interesting.

I love working with Django and Python, but I don't know Java, and 80legs requires the app to be written in Java (see http://80legs.pbworks.com/80Apps). Shion Deysarkar of 80legs mentions that they offer custom coding, and he is willing to give me an estimate. Before going that route, I wanted to ask HN whether anyone here has used 80legs and written an 80app to analyze the crawled pages, and whether they could share instructions and advice about the process.

In fact, I would appreciate general advice about the best way to realize this project.

Thank you.
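For what it's worth, the "who-knows-who" matching described above can be sketched in a few lines of Python: two lawyers are linked whenever they share a school or a professional association. The record shape (`name`, `schools`, `associations`) is an assumption for illustration, not the actual Django models.

```python
from itertools import combinations

def shared_links(lawyers):
    """Return (name_a, name_b, shared_items) for every pair of lawyers
    that overlaps in schools or professional associations.

    Hypothetical data shape: each lawyer is a dict with keys
    'name', 'schools', 'associations' (not the real app's models).
    """
    links = []
    for a, b in combinations(lawyers, 2):
        shared = (set(a["schools"]) & set(b["schools"])) | (
            set(a["associations"]) & set(b["associations"])
        )
        if shared:
            links.append((a["name"], b["name"], shared))
    return links
```

With 100,000+ lawyers, the all-pairs loop above would be too slow; an inverted index (school -> list of lawyers) would be the natural next step, but the idea is the same.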
======
codepoet
<http://news.ycombinator.com/item?id=840244>

Crawling "only" 120k pages can be done easily with a pure Python solution over
a normal home / office internet connection. The packages urllib, urllib2,
robotexclusionrulesparser and lxml are a good start.

Important: Don't forget to implement a crawl rate limit.
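A minimal sketch of that rate limit, assuming Python 3 (where `urllib.request` and `urllib.robotparser` take the place of the `urllib2` and `robotexclusionrulesparser` packages named above; `lxml` would then parse the fetched HTML):

```python
import time
import urllib.request
from urllib.parse import urlparse

class RateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._last = {}  # host -> timestamp of the last request

    def wait(self, url, now=None, sleep=time.sleep):
        # `now` and `sleep` are injectable for testing; by default
        # they use the real clock.
        host = urlparse(url).netloc
        if now is None:
            now = time.monotonic()
        last = self._last.get(host)
        if last is not None and now - last < self.min_delay:
            sleep(self.min_delay - (now - last))
        self._last[host] = now

def fetch(url, limiter, agent="bio-research-crawler"):
    """Fetch one page politely (requires network access; a real crawler
    would also check robots.txt via urllib.robotparser first)."""
    limiter.wait(url)
    req = urllib.request.Request(url, headers={"User-Agent": agent})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()
```

The user-agent string and the one-second default delay are placeholders; per-host delays matter here because the 120k pages are spread over only ~200 firm sites.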

~~~
jdrock
80legs automatically handles the crawl rate limits for you.

~~~
codepoet
That's probably not the primary reason to use 80legs - the real draw is avoiding
having to implement a whole crawler yourself.

------
zeynel1
<http://www.80legs.com/>

<http://80legs.pbworks.com/80Apps>

