

Web scraping: Reliably and efficiently pull data from pages that don't expect it - zmitri
http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data

======
epoxyhockey
My last web scraping activity consisted of using PhantomJS to drive the site and inject my own JavaScript that submitted the pertinent data to my own web
service. <http://phantomjs.org/>

~~~
epoxyhockey
Check that... I _wanted_ to use JS to submit to my own web service, but forgot
about the cross-domain issues, so I just used _console.log_ to output the
data. I suppose I could have used the iframe querystring parameter trick.

~~~
easp
Hmmm, seems like something you could have worked around by running phantom.js
against a proxy that intercepted and rerouted requests to the data collection
server.

~~~
zackzackzack
It's hella slow but there is always this:
<https://github.com/sgentle/phantomjs-node>

------
superasn
My favorite language when it comes to scraping the web has always been Perl.
With tools like LWP, Mechanize and Win32::Mechanize (OLE), scraping any site is
a breeze. Unfortunately, I haven't seen many good modules on CPAN for DOM
processing. Of course there are TokeParser and XPath, but those generally
don't work well with street HTML (which most sites are) and are nowhere near
as fast or friendly as jQuery selectors. By the way, there is one module
called pQuery, which is a Perl port of jQuery, but it only supports a handful
of selectors and doesn't work with Mechanize.

If only there were a module in Perl that could marry Mechanize and jQuery
(without using IE or OLE), it would make the best scraper in the world!

~~~
mickeyp
Actually, libxml (lxml for Python) is very, very good at handling real-life
HTML content.

I'm currently on a contract where I do a significant amount of ETL and web
scraping as part of the project, and I almost exclusively use lxml and XPath
for parsing real-world HTML.

XPath is, without a doubt, the best tool for DOM manipulation that I can think
of. And that's just XPath 1.0 -- 2.0 is reputedly even better, but no 2.0
support is forthcoming for lxml as near as I can tell.
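
For anyone who hasn't tried it, here's a minimal sketch of what that looks like (the HTML fragment and the XPath expression are purely illustrative):

    from lxml import html

    # lxml.html happily parses tag soup: unquoted attributes, unclosed tags, etc.
    doc = html.fromstring("<div><p class=price>$9.99<p>In stock</div>")

    # XPath 1.0 query against the repaired tree
    print(doc.xpath("//p[@class='price']/text()"))  # ['$9.99']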

~~~
gauravk92
XPath is good when the content is well formed, but it usually doesn't work well
with even slightly messed-up tags. Mojolicious handles broken content much
better. And it has full CSS3 selector support, which we all know is hands down
the best way to access DOM elements.

~~~
lusr
If you're in the .NET world, HtmlAgilityPack does a great job of producing
proper XML from broken HTML. CSS selectors don't always suffice, particularly
with sites that don't use CSS :) Sometimes you have sites where the best you
can do is e.g. get the text of the 2nd h2 header following the span with text
'X'. With some utility functions I can just write:

    
    
        result.ContentX = doc.Element("span").WithText("X:").FollowingText("h2", 2);
    

which translates to:

    
    
        //span[text()='X:']/following-sibling::h2[2]
    

From which my code then selects the HtmlAgilityPack InnerText (i.e. less
formatting, etc.). (Practically speaking, my code also does some case-
insensitive translation in there, which is an area where XPath is a bit
annoying, plus string trimming, checking and propagating nulls, etc.)

In my experience the greater challenge with scraping lots of data is dealing
with stuff like:

- caching disabled in the response headers, but you've scraped 10K pages and just
discovered a page with e.g. a deformed href in an anchor (e.g. "<a
href+'....'>"); after giving up on understanding how the hell they managed
that, it's not long before you're writing a crawl repository so you can
selectively ignore the caching rules your proxy cache happily abides by and
quickly restart your debug session for the next weird thing you discover
(unfortunately the nature of the site has forced you to do in-memory
preprocessing of 50K pages before you can do the real processing for the rest
of the site, because they have done some OTHER weird stuff)

- sites that treat EVERYTHING as dynamic content even though it could easily
be cached... now you get to do the webmaster's job, because you're feeding
from many data sources and don't want to hammer their servers

- sites with bad links but no 404 responses (just redirects); easy to detect,
but still a nuisance

- _proper_ request throttling (i.e. throttling on the basis of requests
_serviced_, not merely _requested_; a rough sketch of what I mean follows after this list)

- dynamically adjusting the above throttling, because sites can be weird :)

- efficiently issuing millions of requests per week to a bunch of sites and
scraping data from the responses in custom formats for each site
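
As promised above, here's the "serviced, not requested" throttling idea in a rough sketch (plain Python; the class, its parameters and fetch() are made up for illustration, not production code):

    import threading
    import time

    class ServicedThrottle:
        """Allow at most max_in_flight outstanding requests, and wait a
        minimum interval measured from the last *completed* response, so a
        struggling site automatically slows the crawler down."""

        def __init__(self, max_in_flight=4, min_interval=1.0):
            self.slots = threading.BoundedSemaphore(max_in_flight)
            self.min_interval = min_interval
            self.lock = threading.Lock()
            self.last_completed = 0.0

        def __enter__(self):
            self.slots.acquire()                      # wait for a free slot
            with self.lock:
                wait = self.min_interval - (time.time() - self.last_completed)
            if wait > 0:
                time.sleep(wait)
            return self

        def __exit__(self, *exc):
            with self.lock:
                self.last_completed = time.time()     # measured at completion
            self.slots.release()

    # usage -- fetch() is a placeholder for whatever does the actual request:
    # throttle = ServicedThrottle(max_in_flight=2, min_interval=2.0)
    # with throttle:
    #     body = fetch(url)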

~~~
beagle3
Indeed. And I would add:

- site layouts changing and breaking your scraping logic. I'm not sure how
common this is today, but I was scraping hundreds of commerce sites in 2001,
each with several (often 5 but sometimes 50) different product page layouts
for different sections, each with its own field names, fields, and craziness,
for a total of a few thousand different "scraping logics" (each just 5-10
lines long, but each had to be maintained individually). Now, every day just
two (out of a few thousand) broke, but to keep everything robust you had to
(a) be able to tell which one broke, and (b) fix it within a reasonable time
frame. Neither of these is simple (a sketch of the kind of check I mean follows after this list).

- sites that depend nontrivially on JavaScript. That gives you the choice of
either (a) reverse engineering the JavaScript and making your scraper figure
out all the details the same way the JavaScript would, or (b) using something
like PhantomJS or e.g. a controlled IE session to let the JavaScript run and
then taking the data from the DOM. (a) is more efficient and more work, but was
(unexpectedly for me) much more stable. (b) is less work up front, more
maintenance, and a LOT more resource-intensive.

- sites whose traffic management system you trip while scraping. Many will
block you, some actively (with an error message, so you know what is
happening), and some will just leave you hanging or suddenly throttle you down
to a few hundred bytes/second, with no explanation and no one to contact.
Amazon contacted us when they figured out we were scraping (we weren't hiding
anything and were doing it with a logged-in user that had contact details),
and they were cool about it.

- sites that randomly break and stop in the middle of a page. This happens much
more than you'd think; when using the site by hand, you just reload or interact
with the half-loaded page. You could, of course, still scrape a half-loaded
page, but what if only 20 of the 23 items you need are there? What if the site
is stateful, and reloading that page would cause a state change you do not want?
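
To make the broken-layout detection concrete, here's roughly the kind of per-layout sanity check I mean (Python; the field names, report format and counts are all made up for illustration):

    # Each "scraping logic" declares which fields it must produce; a periodic
    # check flags the layouts whose output no longer validates, so you know
    # which of the few thousand scrapers broke and whether a page came back
    # truncated.
    REQUIRED_FIELDS = {"title", "price", "sku"}

    def validate(layout_name, items, expected_count=None):
        problems = []
        if expected_count is not None and len(items) != expected_count:
            problems.append("expected %d items, got %d" % (expected_count, len(items)))
        for i, item in enumerate(items):
            missing = REQUIRED_FIELDS - {k for k, v in item.items() if v}
            if missing:
                problems.append("item %d missing %s" % (i, ", ".join(sorted(missing))))
        if problems:
            # in practice this goes into a morning report, not stdout
            print("[%s] BROKEN: %s" % (layout_name, "; ".join(problems)))
        return not problems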

------
AdrianRossouw
I found node.js to be a phenomenal scraping tool. There's also a pretty simple /
easy framework for these tasks called node.io (<http://node.io/>).

Being able to spin up a jsdom and extract data from the page using jQuery is a
lot of fun.

~~~
lancefisher
I've found jsdom to break on quite a bit of real-world crap HTML. Next time,
I'm going to try driving PhantomJS from node using
<https://github.com/sgentle/phantomjs-node>

I've had pretty good luck with PhantomJS, but it is somewhat difficult to
debug.

~~~
chrisohara
FYI, JSDOM isn't the default parser in node.io. It uses a faster and more
forgiving parser if you can survive with a subset of jQuery functionality
([https://github.com/chriso/node.io/wiki/API---CSS-
Selectors-a...](https://github.com/chriso/node.io/wiki/API---CSS-Selectors-
and-Traversal-methods))

------
tszming
We at Rewritely (<http://www.rewritely.com>) have quite a bit of experience
dealing with large-scale content migration for clients (not just a single HTML
page, but whole sites), mainly using scraping techniques.

We have seen so many invalid, funny, ugly markups and edge cases that
eventually we concluded the only reliable parser is the "browser" you use
every day, which is heavily tested by hundreds of millions of users!
We chose to run the scraper in a real browser, so what you scrape is what you
see; not to mention you can fire events, issue Ajax requests, inject JavaScript
or even play Flash video... these are only possible when you have a real
browser.

Also, the scraper code needs to be concise and expressive (keys: maintainable
& testable), because your code is going to break sooner or later if you are in
the serious scraping business. Less LOC = easier to change.
JavaScript & jQuery are the obvious winners in this category and, what's more,
they are fun to work with.

Disclaimer: we have no relationship with PhantomJS :)

------
nicksergeant
Kind of surprised I haven't seen PyQuery in here yet:
<http://pypi.python.org/pypi/pyquery> -- highly recommend it. I've used it on
several projects.
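
If you haven't used it, it feels almost exactly like jQuery; a tiny sketch (the HTML is just for illustration):

    from pyquery import PyQuery as pq

    # pyquery can also fetch a page directly with pq(url=...)
    d = pq("<div><h1 class='title'>Hello</h1><a href='/next'>next</a></div>")

    print(d("h1.title").text())   # 'Hello'
    print(d("a").attr("href"))    # '/next'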

------
uptown
Recently I've been using this method to inject jQuery into pages that may not
already have it available, then executing some custom JavaScript to extract
the page contents:

<http://www.guru.net.nz/blog/2009/06/screen-scraping-with-jquery.html>

------
salvadors
<https://scraperwiki.com/> is a great source of thousands of pre-written
scrapers to use / copy / extend etc. It's sort of like GitHub, except you can
actually schedule the scrapers to run at regular intervals and then just
access the scraped data over a standard API.

------
elchief
This one uses WebDriver to get the full HTML from AJAX pages:
<http://vancouverdata.blogspot.com/2012/02/less-painful-ajax-javascript-web.html>

------
ricksta
Anyone recommend some equivalent tools for Ruby?

~~~
rb2k_
There is Mechanize (<http://mechanize.rubyforge.org/>), which uses Nokogiri
(<http://nokogiri.org/>) internally and thus nicely supports CSS and XPath expressions.

There is also Capybara, usually used as a testing framework, but you can
easily navigate pages with it and choose a backend (Selenium/WebKit for
compatibility, Mechanize for speed): <https://github.com/jnicklas/capybara>

------
hk_kh
Also, Tor + Privoxy as a rule of thumb when your scraper's IP is getting banned.

And - I know, it is (now) so old - memcached as a good place to store things.
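
Roughly like this (Python sketch; assumes Privoxy chained to Tor on its default port 8118, a local memcached on 11211, and the requests and python-memcached packages; the URL keys and timeouts are just illustrative):

    import hashlib

    import memcache   # python-memcached
    import requests

    # route requests through privoxy (which forwards to tor)
    PROXIES = {"http": "http://127.0.0.1:8118", "https": "http://127.0.0.1:8118"}
    mc = memcache.Client(["127.0.0.1:11211"])

    def fetch(url):
        key = "page:" + hashlib.md5(url.encode("utf-8")).hexdigest()
        body = mc.get(key)
        if body is None:
            body = requests.get(url, proxies=PROXIES, timeout=30).text
            mc.set(key, body, time=3600)   # keep it for an hour
        return body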

~~~
duskwuff
When your scraper IP is getting banned, that's typically a sign that you
should stop, talk to the site owner, and/or reconsider what you're doing.

~~~
hk_kh
That's not always the case.

Even where a government or its private contractors don't allow citizens to
gather that data (there are examples of cities that do allow it), I consider
myself within my rights to do so, and also within my rights to redistribute
that data freely so other developers can play with it, investigate it, and
learn from it.

Why? Well, for starters:

1) Their own apps suck (or don't exist).

2) They don't want to help their own users.

3) It's fun.

4) Their services are paid for with public money.

5) It raises awareness of the need for public data legislation.

There's a lot more to say on the subject.

If you want to check it out: <http://citybik.es>

I am helping projects and visualizations like <http://bikes.oobrien.com/> and
my own <http://citybik.es/realtime/>.

