

An Introduction to Compassionate Screen Scraping - helwr
http://dev.lethain.com/an-introduction-to-compassionate-screenscraping/

======
jp
Pretending to be human is problematic when the server has already decided you
are a robot based on your User-Agent, your IP subnet (dynamic IP cloud
systems), or your DNS look-up patterns (CNN and similar sites check these).

So "behaving like a human" on HN might result in an IP ban because /x is
denied in robots.txt. And this gets really funny when you get banned randomly
because of dynamic IP addresses in cloud infrastructure.

------
hung
Caching is nice, but HTTP has a built-in method: conditional GETs. I wrote up
a blog post on how to do this with App Engine, but it should work generally
in Python using urllib2.

[http://www.hung-truong.com/blog/2010/12/01/conditional-gets-...](http://www.hung-truong.com/blog/2010/12/01/conditional-gets-in-app-engine/)
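
Roughly, a conditional GET with urllib2 might look like the sketch below (the
URL is a placeholder; note that urllib2 surfaces the 304 response as an
HTTPError):

    import urllib2
    
    url = 'http://example.com/feed'  # placeholder resource
    
    # First fetch: remember the validators the server hands back.
    response = urllib2.urlopen(url)
    etag = response.info().getheader('ETag')
    last_modified = response.info().getheader('Last-Modified')
    cached_body = response.read()
    
    # Later fetch: send the validators back; 304 means the cache is still good.
    request = urllib2.Request(url)
    if etag:
        request.add_header('If-None-Match', etag)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    
    try:
        response = urllib2.urlopen(request)
        cached_body = response.read()  # content changed; refresh the cache
    except urllib2.HTTPError, e:
        if e.code == 304:
            pass  # not modified; keep using cached_body
        else:
            raise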

------
runningdogx
Screen scraping is taking visual data and transforming it into structured
data. A screen scraper would graphically capture a window and try to identify
or pick out data. Bots for MMOs tend to do that, along with providing input
to the MMO depending on what they "see".

Web or data scraping is what the article talks about. Still a hard problem,
easily broken by minor changes to the scraped webpage, but not subject to the
vagaries of OCR and computer vision or graphical interpretation problems,
which is what I was expecting from the title.

~~~
joshu
The original screen scraping was in the context of emulating a 3270 terminal
as an "API" into some system running on a mainframe.

I worked at a place that allowed web-based trading in the mid-90s by wrapping
a web server (on a Sun SPARC 10) around a single emulated terminal running on
a SCO box.

It was called screen scraping then, too :)

------
eli
No mention of observing robots.txt?

~~~
megamark16
It was my understanding that urllib2 respects robots.txt automatically. I
can't find much to back that up, but I really thought I read that somewhere
reliable once.

~~~
mcav
This link corroborates that urllib2 respects robots.txt:

[http://stackoverflow.com/questions/3197299/urllib2-connectio...](http://stackoverflow.com/questions/3197299/urllib2-connection-timed-out-error)

~~~
arst
That link is wrong. Python does ship with a robotparser module in the standard
library that parses robots.txt files, but urllib2 does not use it out of the
box. This can be easily confirmed using Wireshark or a quick glance at the
source:
[http://hg.python.org/cpython/file/08b5e2c9112c/Lib/urllib2.p...](http://hg.python.org/cpython/file/08b5e2c9112c/Lib/urllib2.py).
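
If you actually want robots.txt handling, you have to wire up robotparser
yourself. A minimal sketch (the user-agent string is made up):

    import robotparser  # in the stdlib, but urllib2 never calls it for you
    import urllib2
    
    rp = robotparser.RobotFileParser()
    rp.set_url('http://news.ycombinator.com/robots.txt')
    rp.read()
    
    url = 'http://news.ycombinator.com/x'
    if rp.can_fetch('MyScraper/1.0', url):  # hypothetical agent name
        page = urllib2.urlopen(url).read()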

------
storborg
The author makes some great suggestions, namely to cache heavily and throttle
requests. However, they lost a lot of credibility for me with "screen scraper
traffic should be indistinguishable from human traffic". Sorry, but that's BS
-- socially responsible scraping leaves control with the publisher. If the
publisher doesn't want you scraping their content, you shouldn't fake being
human in order to do it anyway.

~~~
dotBen
I see your point - however, I read it as the author referring more to the
load/level of activity your requests place on the server: that is what should
be indistinguishable from human traffic.

I.e., if the server's log files show hundreds of requests from the same IP
address on successive lines, that doesn't look like human behavior.

What would have been nice for a 'best practice' document would be to show how
to set the HTTP User-Agent string for the crawler so that it carries an
identifier, a version number, and some contact method.
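
Something along these lines, say (the agent string and contact address here
are just placeholders):

    import urllib2
    
    # A descriptive User-Agent: identifier, version, and a way to reach you.
    request = urllib2.Request(
        'http://news.ycombinator.com/',
        headers={'User-Agent': 'ExampleScraper/1.0 (+mailto:bot@example.com)'})
    response = urllib2.urlopen(request)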

~~~
helwr
This was already asked:
[http://groups.google.com/group/comp.lang.java/browse_thread/...](http://groups.google.com/group/comp.lang.java/browse_thread/thread/6923c024ed392c85/88fa10845061c8ba?pli=1)

~~~
dotBen
I'm confused - your URL is for a thread about Java, but this best-practice
document is Python-oriented.

~~~
njs12345
It's a joke, look at the person asking the question :)

------
dhruvbird
I can't help but mention that you should probably be using node.js with the
jsdom module for such a task these days. You get the complete power of jQuery
with jsdom, making screen scraping child's play.

~~~
jerf
It may not use the _MIGHTY POWER OF JAVASCRIPT_, but BeautifulSoup is a
best-of-breed real-world HTML parser. Not just in the API, but in the
verification that its parsing algorithm is effective against HTML found on
real web sites. I find it unlikely that jsdom is actually significantly
better. Does that code really look like it's going to be significantly
improved with jQuery?

    
    
        titles = [x for x in soup.findAll('td','title') if x.findChildren()][:-1]
    

packs a lot of punch.

~~~
meatmanek
The ability to use CSS/jQuery selectors is really nice, though; in order to
find all <td>s whose parents have the class "blah", you have to use a list
comprehension:

    
    
      tds = [td for td in soup.findAll('td') if td.parent.get('class') == 'blah']
    

In jQuery, this is more compactly written:

    
    
      tds = $('.blah > td')
    

And if you just want to look for <td>s somewhere within a .blah element, you
can use

    
    
      tds = $('.blah td')
    

This is a lot less clear in BeautifulSoup:

    
    
      tds = [td for td in soup.findAll('td') if td.findParents(attrs={'class': 'blah'})]
    
    

(If there are better ways to write this BeautifulSoup code, please let me
know)

Selectors have some other benefits too - you can just go to the CSS file and
grab the selector that matches what you want, and you can be reasonably sure
it'll work in most cases.

~~~
joshu
lxml in Python lets you use CSS selectors.
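
For example, a sketch assuming page_source holds HTML you've already fetched
(and that lxml's cssselect support is available):

    from lxml import html
    
    doc = html.fromstring(page_source)
    tds = doc.cssselect('.blah > td')    # direct <td> children of .blah
    all_tds = doc.cssselect('.blah td')  # any <td> descendant of .blah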

