
Scrapy: New Python web crawling & scraping framework (built on Twisted) - lunchbox
http://dev.scrapy.org/
======
tdavis
Awesome, now somebody go back in time 3 months and release this so I could
have not spent that time writing the same thing (okay, not exactly the same,
mine isn't nearly as pluggable).

------
lunchbox
This looks quite promising and I look forward to trying it out. I wonder how
it will work compared with my current approach of using Mechanize and
BeautifulSoup, along with the threading module.

~~~
tdavis
For long-running, large jobs I can tell you from experience it would work
about _a gazillion times faster_. Especially if you drop BeautifulSoup for
something like lxml.

~~~
breck
Agreed. BeautifulSoup can be quite slow (unless I'm doing it wrong, which I
probably am).

It accounts for 98% of the time of a current job I'm running. If anyone can
provide some tips it'd be much appreciated.

~~~
tdavis
It's rather unlikely you're doing anything wrong. IIRC, BeautifulSoup docs
acknowledge that it is rather slow.

Also, if you don't make use of the methods that extract/unravel object trees,
they may not be properly GC'd, leading to further slowdowns. I can't remember
the method names exactly (might be decompose() and extract()), but they're in
the docs.

------
sachinag
If anyone is interested in doing a small project for us using Scrapy (and lxml
and whatever else), please drop me an e-mail.

------
glazz
Please tell me why Scrapy is better than wget. I can easily call wget from my
Python scripts...

~~~
tdavis
wget is synchronous while Twisted is an asynchronous networking engine. This
means that you don't need to wait for a request to finish before making
another one (or making pancakes, or doing whatever you want).

I essentially wrote a parallelized version of scrapy which has the ability to
make hundreds of requests per second, depending on available CPUs. You could
never achieve that level of performance using wget.
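
To make the "don't wait for one request before making another" point concrete,
here is a minimal sketch. It uses the standard library's asyncio rather than
Twisted or Scrapy, and the network is simulated with a sleep, so fake_fetch,
crawl, run_demo and the example.com URLs are all made up for illustration:

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    # Stand-in for a real HTTP request (assumption: no network is touched).
    # Awaiting the sleep hands control back to the event loop, so other
    # "requests" can make progress in the meantime.
    await asyncio.sleep(delay)
    return url, len(url)


async def crawl(urls):
    # Start every request at once; gather returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def run_demo(count=20):
    urls = [f"http://example.com/page/{i}" for i in range(count)]
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"{len(results)} pages in {elapsed:.2f}s")
```

Done one after another, twenty 0.1 second waits would take about two seconds.
Here they overlap, so the whole batch should finish in roughly the time of a
single wait.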

~~~
breck
This is great. I was running threads on a current crawl job but the real
bottleneck is BeautifulSoup and not the network. So splitting the project into
threads (while it helped about 10%) wasn't really necessary and Twisted
probably would have done the trick.

------
liuliu
Does anyone know how the memory leak happens? I use Scrapy to fetch some data;
the total network in/out is about 400M, but Scrapy's memory usage grew by
about 1.5G.

------
msie
Can it simulate a browser as well as HtmlUnit?

------
agentbleu
I'd like to hear people's thoughts on frameworks. This looks well suited to a
project I have, but it is built on Twisted, and the preferred framework seems
to be Django. I'm a PHP coder who is just about to step up to the Python
challenge, so I'm thinking it might be better to start with a more established
framework. Thoughts would be most welcome.

~~~
iamelgringo
Django is a framework for creating web applications. Twisted is a framework
for network programming. Scrapy is a framework for scraping web pages.

If you're thinking about learning web development with Python, I'd suggest
Django. Other Python web frameworks are TurboGears, Pylons, web.py and
CherryPy. Django tends to have the best documentation and probably the
largest community right now, however.

------
agentbleu
Ah just what I needed and on the day I needed it! Thanks HN for posting it and
the creators for making it.

------
agentbleu
Is anyone from Scrapy here? I have some tech questions. Is there an IRC
channel?

~~~
lowkey
I am looking for some community action for Scrapy. It looks useful for a
project I'm currently working on with BeautifulSoup, but I'm not digging the
sluggish performance.

I am having trouble matching the docs to the code. Is there an IRC channel,
mailing list or forum?

