Ask HN: Writing an 80App
4 points by zeynel1 on Nov 17, 2009 | 5 comments
Hello,

I've been trying to use 80legs to crawl about 120,000 pages to extract lawyer bios from the websites of 200 top US law firms. I created a simple app in Django with about 400 lawyers in the database (SQLite). The app pretends to be a who-knows-who database by matching schools, memberships in professional associations, etc. Now I want to add those 100,000+ lawyers to make the app a bit more interesting.

I love working with Django and Python, but I don't know Java, and 80legs requires the app to be written in Java (see this page: http://80legs.pbworks.com/80Apps). Shion Derserkar of 80legs mentions that they offer custom coding, and he is willing to give me an estimate. Before doing that, I wanted to ask HN if anyone here has used 80legs and written an 80App to analyze the crawled pages, and if so, whether they could give me instructions and advice about the process.

In fact, I would appreciate general advice about the best way to carry out this project.

Thank you.




http://news.ycombinator.com/item?id=840244

Crawling "only" 120k pages can be done easily with a pure Python solution over a normal home / office internet connection. The packages urllib, urllib2, robotexclusionrulesparser and lxml are a good start.

Important: Don't forget to implement a crawl rate limit.
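To make the rate-limit point concrete, here is a minimal sketch of a per-host delay plus a robots.txt check. It is only an illustration under some assumptions: it uses Python 3's standard library (urllib.parse and urllib.robotparser) rather than the packages named above, and the names HostRateLimiter, allowed_by_robots and the 2-second default delay are made up for this example, not taken from any crawler.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; min_delay=2.0 is an arbitrary
    assumption, not a recommended value for any particular site.
    """

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds slept.
        """
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a crawl loop you would call allowed_by_robots(...) once per URL and limiter.wait(url) right before each fetch, so requests to the same law firm's site are spaced out while different hosts are not held up by each other.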


80legs automatically handles the crawl rate limits for you.


That's probably not the primary reason to use 80legs, though. The main benefit is not having to implement a whole crawler yourself.


Thank you for posting that link to the previous HN discussion. They mention Scrapy (http://scrapy.org/) and I looked at it. I like that it is Python-based, and the tutorial is very good. They even have a shell for testing XPath selectors. Now I have a better understanding of the process. Of course, it is not like filling in a form, as is the case with 80legs, but I am having fun working through the tutorial. I also ran a couple of small jobs with 80legs, but I am unable to see the results. I guess 80legs would be good for huge projects. In any case, I will try to work with both. Thanks again.

Another discussion about Scrapy: http://news.ycombinator.com/item?id=411733




