
Scraping web sites which dynamically load data (like Twitter) - wslh
http://blog.databigbang.com/scraping-web-sites-which-dynamically-load-data/
======
malandrew
You can use PhantomJS for automated scraping. You have two options:

(1) Design your scraper with a 1-to-1 correspondence to the app's routing
logic and client-side templates. Create a new scraping module for each
template and use the scraping modules according to the data visible at each
route.

(2) A simpler approach is to design your scraper to hijack the app's own XHR
or sockets module and collect the data directly via the API exposed to the
web app.

The latter approach is the really smart way to scrape client-side web-apps
since you can get a lot of additional valuable metadata that may not be
written to the screen.
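A minimal sketch of what approach (2) can look like: wrap the XHR object's
open/send so every response is captured before the app renders it. To keep the
sketch self-contained it patches a stub XHR rather than a real browser's
`XMLHttpRequest`; under PhantomJS you would apply the same patch to
`window.XMLHttpRequest` from `page.onInitialized`. All names here (`captured`,
`hijackXHR`, the URL) are illustrative, not any real app's API.

```javascript
var captured = [];  // every (url, body) pair the app fetches ends up here

function hijackXHR(XHR) {
    var origOpen = XHR.prototype.open;
    var origSend = XHR.prototype.send;
    XHR.prototype.open = function (method, url) {
        this._url = url;                       // remember where it's going
        return origOpen.apply(this, arguments);
    };
    XHR.prototype.send = function () {
        var self = this;
        var origCb = this.onload;
        this.onload = function () {            // tap the response on the way in
            captured.push({ url: self._url, body: self.responseText });
            if (origCb) origCb.apply(this, arguments);
        };
        return origSend.apply(this, arguments);
    };
}

// --- stub XHR so the sketch runs anywhere ---
function FakeXHR() {}
FakeXHR.prototype.open = function (method, url) {};
FakeXHR.prototype.send = function () {
    this.responseText = '{"tweets": []}';      // canned API payload
    if (this.onload) this.onload();
};

hijackXHR(FakeXHR);
var x = new FakeXHR();
x.open('GET', '/i/api/timeline.json');
x.send();
console.log(captured[0].url);   // → /i/api/timeline.json
```

The point is that `captured` sees the raw JSON the server sent, including any
metadata the app never writes to the screen.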

~~~
lambtron
Interesting. Do you have an example of the latter approach? I have been trying
to build a web scraper, but many times after the HTML is returned, the
JavaScript then loads the data dynamically. I'd love to learn a few techniques
for fetching the AJAX data.

~~~
HarryRotha
Get Firebug, or use the developer tools in Chrome, and look at the requests
the browser makes when you load the page and when you scroll down and it loads
more data. Then you just implement that in whatever language you are scraping
with. I usually use Python's requests module. That's about as easy as it gets
for something like this.

------
level09
I've used headless Selenium successfully for executing JavaScript. It has a
nice feature that makes the browser wait until some element appears in the
HTML.
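The wait-for-an-element idea boils down to polling a condition until it's
truthy or a deadline passes. Here's a minimal helper in that spirit; the name
`waitUntil` and its signature are my own, not Selenium's API (Selenium calls
this an explicit wait). The usage below exercises both paths synchronously so
the sketch is easy to verify.

```javascript
function waitUntil(check, timeoutMs, intervalMs, done) {
    var deadline = Date.now() + timeoutMs;
    (function poll() {
        var result = check();                  // e.g. document.querySelector(...)
        if (result) return done(null, result);
        if (Date.now() > deadline) return done(new Error('timed out'));
        setTimeout(poll, intervalMs);          // try again shortly
    })();
}

var results = [];
// condition already true: succeeds on the first check
waitUntil(function () { return 'ready'; }, 1000, 50,
          function (err, v) { results.push(err ? err.message : v); });
// deadline already past and condition never true: times out immediately
waitUntil(function () { return null; }, -1, 50,
          function (err, v) { results.push(err ? err.message : v); });
console.log(results);   // → [ 'ready', 'timed out' ]
```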

~~~
91bananas
Here is another little snippet I used to scrape Yelp a while back. It involves
polling for a specific element on the page; if you can inject or find jQuery,
this is pretty light and simple. Edit: I should note that jQuery isn't
strictly necessary for this (since I'm sure someone would have pointed that
out right after I posted), it just saves me some typing :)

[https://gist.github.com/91bananas/5644737](https://gist.github.com/91bananas/5644737)

------
hpaavola
Let's install a couple of things

    
    
        pip install robotframework-selenium2library
        npm install phantomjs
    

Run PhantomJS and tell it to listen on a port for WebDriver traffic

    
    
        phantomjs --webdriver=4444
    

This goes to foo.txt

    
    
        *** Settings ***
        Library           Selenium2Library
    
        *** Test Cases ***
        Scrape Twitter
            [Setup]    Open Browser    https://twitter.com/HackerNews    firefox    main browser    http://localhost:4444
            Capture Page Screenshot
            Wait Until Keyword Succeeds    30    3    Find Waldo    Demand for HTML5 Skills On the Rise, Report Says http://bit.ly/90Vcnr
            Capture Page Screenshot
            [Teardown]    Close All Browsers
    
        *** Keywords ***
        Find Waldo
            [Arguments]    ${waldo}
            Click Element    css=.stream-footer
            Wait Until Page Contains    ${waldo}
    

And finally execute

    
    
        pybot foo.txt
    

Hope I didn't mess up the formatting.

~~~
wslh
I think using a browser extension is faster than using the webdriver. Do you
agree?

~~~
hpaavola
That depends on what we mean by "faster" and what the use case is.

------
agibsonccc
Disclaimer: I don't have much experience with PhantomJS and many of the more
traditional tools mentioned here, since I stick mainly to JVM languages for my
backends.

That being said:

I've run into a lot of problems scraping sites with heavy JavaScript. I
personally use Selenium with a headless Google Chrome for it.
[http://www.alittlemadness.com/2008/03/05/running-selenium-
he...](http://www.alittlemadness.com/2008/03/05/running-selenium-headless/)

Granted, there's a bit of custom logic involved at times, but it's by far the
most reliable compared to the alternatives I've seen like HTMLUnit.

It's allowed me to focus on page content rather than worrying about whether
something will work or not.

------
jackschultz
I'm actually working on this now for an app. The data is the scores from a
golf tournament. When I saw that it loaded dynamically, I dug through all the
requests using Chrome's developer tools. I was able to find the JSON that the
page was using to load the data, which turned out to be easier to deal with
than scraping the HTML in the first place. So if you're running into this
problem, try to find the URL they snag the data from in the first place.
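The payoff of finding that URL is that the feed is already structured; you
just pick the fields you want instead of parsing rendered HTML. A hedged
sketch with a made-up leaderboard payload (the field names `leaderboard`,
`player`, and `total` are invented, not any real site's format):

```javascript
// Canned stand-in for the JSON the page's own XHR returned.
var payload = JSON.stringify({
    leaderboard: [
        { player: 'A. Scott', total: -9 },
        { player: 'T. Woods', total: -5 }
    ]
});

function topScores(json) {
    var data = JSON.parse(json);               // the raw feed, already structured
    return data.leaderboard.map(function (row) {
        return row.player + ': ' + row.total;  // keep only the fields we need
    });
}

console.log(topScores(payload));   // → [ 'A. Scott: -9', 'T. Woods: -5' ]
```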

------
septicmadman
When all else fails you can resort to using WebDriver. I generally try to
reverse-engineer the requests, but occasionally authentication weirdness and
data obfuscation get in the way to the point where it would take more time
than I'd like.

------
tzs
I've been using PhantomJS to get upcoming movie listings from Comcast, so I
can plan my DVRing. The hardest part (aside from having only dabbled in
JavaScript, so I have no idea how to write good or idiomatic JS) was keeping
straight in my mind what is executing in the page environment, and what is
executing in the environment that controls the script. Get the environments
mixed up, and you won't be a happy scraper.

Here's the code if anyone wants to play with it. Note that a couple of things
are hard coded to my area. There's a cookie that gives the zip code, and
there's a 981 that is the channel number of a channel that is higher than all
the channels I'm interested in. This needs to be a channel that exists--you
can't just pick a large number.

I was going to put this on Github at one point, but decided not to because I
don't know what Comcast thinks of this sort of thing.

    
    
       var system = require('system');
       var page = require('webpage').create();
       var fs = require('fs');
       var of = fs.open("out", "w");
       var timer0 = null;
       var min_start = null;
       var want_exit = false;
       var days = 2;   // how many 24 hours worth of movies to get
       
       // if the page generates any console messages, dump them to our console
       page.onConsoleMessage = function (msg) {
           console.log('page: ' + msg);
       };
       
       
       //{ These functions are meant for use with page.evaluate()
       function set5()
       {
           var e = document.createEvent('MouseEvents');
           e.initMouseEvent('click', true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
           var t = document.getElementById('options-advanced-hours-5');
           t.dispatchEvent(e);
           return 0;
       }
       
       function tag_as_stale()
       {
           var marker = document.getElementById("981");
           marker.setAttribute("foo","bar");
       }
       
       function fresh()
       {
           var marker = document.getElementById("981");
           if ( marker == null )
               return false;
           if (marker.getAttribute("foo") == "bar")
               return false;
           return true;
       }
       
       function get_width()
       {
           // The 'timeline' class is the div that contains the headings
           // for each half hour on the listings. It has a child div for
           // each half hour. The count of these children lets us tell
           // if the listings are in 1 hour mode, 3 hour mode, or 5 hour mode.
           var t = document.getElementsByClassName('timeline');
           return t[0].children.length;
       }
       
       function forward()
       {
           var e = document.createEvent('MouseEvents');
           e.initMouseEvent('click', true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
       
           // 'option-forward' is the class of the thingies that advance
           // the listings when clicked.
           var t = document.getElementsByClassName('option-forward')[0];
           t.dispatchEvent(e);
       }
       
       // Find all the movies on the current listings, and put their
       // information into an array attached to th window, from whence
       // they can later be extracted (see get_movie() below).
       function find_movies()
       {
           window.tzs_movies = new Array();
           window.tzs_next = 0;
           var listings = document.getElementsByClassName('listing movies');
           for (var i = 0; i < listings.length; i++) {
               var el = listings[i];
               var name = el.getElementsByClassName('listing-entity')[0].innerText;
               var start = el.getAttribute('data-starttime');
               var chan = el.parentElement.getElementsByClassName('channel-actions')[0];
               var call = chan.getAttribute('data-callsign');
               var cnum = chan.getAttribute('data-vcn');
               window.tzs_movies.push(start + ":" + call + ":" + cnum + ":" + name);
           }
           return window.tzs_movies.length;
       }
       
       // Get movie data stashed away earlier by find_movies().
       function get_movie()
       {
           return window.tzs_movies[window.tzs_next++];
       }
       //}
       
       // This cookie seems to set the zip code for the listings.
       phantom.addCookie({
           'name':     'rh',
           'value':    'h%3D25095X%26z%3D98370',
           'domain':   '.comcast.net'
       });
       
       // This gets the movies from the page and prints them.
       function dump_movies()
       {
           var num_movies = page.evaluate(find_movies);
           var max_start = null;
           var page_start = null;
           for (var i = 0; i < num_movies; ++i) {
               var m = page.evaluate(get_movie);
               var parts = m.split(":");
               var start = parseInt(parts[0]);
               if (min_start == null || start < min_start)
                   min_start = start;
               if (page_start == null || start < page_start)
                   page_start = start;
               if (max_start == null || start > max_start)
                   max_start = start;
               if (start - min_start > days*86400000)
                   want_exit = true;
               of.writeLine(m);
           }
           if (want_exit) {
               of.close();
               phantom.exit();
           }
           var date_max = new Date(parseInt(max_start));
           var date_page = new Date(parseInt(page_start));
           var from = date_page.toDateString() + " " + date_page.toTimeString();
           var to = date_max.toDateString() + " " + date_max.toTimeString();
           console.log(from + " ==> " + to);
       
           page.evaluate(tag_as_stale);
           page.evaluate(forward);
           timer0 = setInterval(wait_for_data,1000);
       }
       
       // Check for data ready. Called from interval timer when we are
       // waiting for data. If we've got the data, kill the timer and
       // dump the movies.
       function wait_for_data()
       {
           var result = page.evaluate(fresh);
           if (result) {
               clearInterval(timer0);
               dump_movies();
           } else {
               console.log("wait_for_data...");
           }
       }
       
       if (system.args.length == 2)
           days = parseInt(system.args[1]);
       page.open('http://xfinitytv.comcast.net/tv-listings', function () {
           // The site seems to get upset if you press the "forward" button
           // too many times. It doesn't seem to matter if each press is
           // advancing by an hour or by 5 hours--it is the number of requests
           // that seem to be limited. In 1 hour mode it is hard to even get
           // a day's worth of data. To allow getting a couple days or more,
           // we'll start by looking at the current mode, and if it is not the 5
           // hour mode, we'll change the mode to 5 hour mode first.
           var w = page.evaluate(get_width);
           if (w < 10) {
               // in 1 hour or 3 hour mode
               page.evaluate(tag_as_stale);
               page.evaluate(set5);
               timer0 = setInterval(wait_for_data,1000);
           } else {
               // in 5 hour mode (or better?) so we can go right to data extraction
               dump_movies();
           }
       });

~~~
wslh
_The hardest part was keeping straight in my mind what is executing in the
page environment, and what is executing in the environment that controls the
script_

YES! That's why we used a browser extension in this and the previous article:
you know that all the objects live in the same world, so you don't have this
controller/scraper duality. If someone prefers the PhantomJS approach, they
can implement the controller as a background page and the scraper logic in the
content.

------
coherentpony
Doesn't Twitter have an API you can query?

~~~
bluetidepro
From the title:

> _...(like Twitter)._

I think that's just an easy example the author used of a popular site that
uses a lot of front-end AJAX calls to show content.

------
hernan604
Use Perl with HTML::Robot::Scrapper.

~~~
wslh
Does it scrape JavaScript/AJAX sites?

~~~
hotpockets
Don't think so, but there are a lot of Perl options that do.

In no particular order: Gtk2::WebKit::Mechanize, Win32::IE::Mechanize,
WWW::Mechanize::Firefox, WWW::Scripter, WWW::Selenium

Personally, I use WWW::Mechanize::Firefox

