
Crawl a website with scrapy and store extracted results with MongoDB - BaltoRouberol
http://isbullsh.it/2012/04/Web-crawling-with-scrapy/
======
JackC
For really quick one-off scraping, httplib2+lxml+PyQuery is a pretty neat
combination:

    
    
      import httplib2, lxml.etree, pyquery
      h = httplib2.Http(".cache")  # cache responses on disk under ./.cache
      def get(url):
          resp, content = h.request(url, headers={'cache-control': 'max-age=3600'})
          return pyquery.PyQuery(lxml.etree.HTML(content))
    

This gives you a little function that fetches any URL as a jquery-like object:

    
    
      pq = get("http://foo.com/bar")
      checkboxes = pq('form input[type=checkbox]')
      nextpage = pq('a.next').attr('href')
    

And of course all of the requests are cached using whatever cache headers you
want, so repeated requests will load instantly as you iterate.

Just something else to throw in the toolbelt ...
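
If httplib2 isn't at hand, the caching idea can be roughly sketched with the standard library alone. This is only an in-memory approximation, not how httplib2 does it (httplib2 caches on disk and honours HTTP cache headers). The names `CachedFetcher`, `fetch_url`, `max_age` and `clock` are made up for illustration:

```python
import time
import urllib.request


def fetch_url(url):
    """Download a URL and return the raw body bytes (real network call)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


class CachedFetcher:
    """Tiny in-memory cache keyed by URL; entries expire after max_age seconds."""

    def __init__(self, fetch=fetch_url, max_age=3600, clock=time.monotonic):
        self.fetch = fetch        # callable doing the actual download
        self.max_age = max_age    # freshness window in seconds
        self.clock = clock        # injectable so expiry can be tested
        self._cache = {}          # url -> (timestamp, body)

    def get(self, url):
        now = self.clock()
        hit = self._cache.get(url)
        if hit is not None and now - hit[0] < self.max_age:
            return hit[1]         # still fresh: no network round trip
        body = self.fetch(url)
        self._cache[url] = (now, body)
        return body
```

Calling `CachedFetcher().get("http://foo.com/bar")` twice within the hour hits the network only once; the cache is lost when the process exits, which is the main thing the on-disk httplib2 version does better.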

~~~
the_cat_kittles
Have you checked out Kenneth Reitz's requests? It's fantastic, you might like it.

~~~
codehenge
Link, for the interested:

<https://github.com/kennethreitz/requests>

~~~
the_cat_kittles
Thanks, I should have included a code sample too:

    
    
        import requests
        from lxml import etree
    
        # .content (bytes) lets lxml detect the page encoding itself
        jquery_like_page = etree.HTML(requests.get('url').content)

------
jat1
Also check this out for a pretty good discussion on scraping:
<http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data>

~~~
BaltoRouberol
Yeah, I actually learnt scraping from Asheesh :) He's awesome.

~~~
jat1
I have been playing with scraping for quite some time now and have my own
scripts and stuff, but I found that video informative, and there were a few
useful snippets I had missed.

I keep meaning to check out more of the PyCon vids.

------
danneu
Here's the same functionality written in Ruby using Chris Kite's crawler
called Anemone[1]. Gist: <https://gist.github.com/2475824>. Screenshot:
<http://i.imgur.com/cbv9A.png>

[1]: <http://anemone.rubyforge.org/doc/index.html>

------
ananthrk
Cool. BTW, is there a reason for naming the file "isullshit_spiders.py" and
not as "is _b_ ullshit_spiders.py"? :)

~~~
BaltoRouberol
Oh, that's just a typo. My bad. Edit: there, corrected.

------
hack_edu
I really want to read this. The topic is right up my alley.

Unfortunately, the page is literally broken and unreadable on Android ICS with
Chrome :(

~~~
mumphster
Same using Mobile Safari.

~~~
jordanmessina
Same using Chrome on Windows, unless I resize the browser and make the width
about 1000px.

~~~
martius
Yes, it's true for any viewport with a width of less than 940px. I'll do my
best to fix this too.

