Looking at the source right now. Noticed comments in the code along the lines of "selenium is really slow traversing the dom", etc. Also noticed the script uses the non-headless Firefox WebDriver. Wouldn't it have been much faster to use GhostDriver or some similar headless solution?
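For what it's worth, the kind of headless setup I have in mind would look roughly like this with the Python Selenium bindings. This is just a sketch to show the idea, not the paper's script; the URL is a placeholder, and I'm using headless Firefox as a stand-in for GhostDriver/PhantomJS:

    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")          # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    driver.get("https://example.org")          # placeholder URL
    print(driver.title)
    driver.quit()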
Hi, one of the authors here. First, though I worked on the paper, I am not employed by CGD, so these comments are my own.
The main reason for using the non-headless Firefox WebDriver was that we wanted the script to access the site just like a human user. This made it easy to explain to non-technical people exactly how we had gotten the data. We didn't want to do anything that could be seen as circumventing the interface that the World Bank had created for that purpose.
Up to a point, performance was not a concern. In fact, as slow as Selenium is, we still artificially limited the speed of the script by waiting three seconds between each set of queries. However, selecting options could take Selenium tens of seconds, so that part was done with JavaScript instead.
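Roughly, the pattern looked like the sketch below. To be clear, this is an illustration of the approach rather than the actual script; the element ID, option values, and URL are made up:

    import time
    from selenium import webdriver

    driver = webdriver.Firefox()               # non-headless, as described above
    driver.get("https://example.org/query")    # placeholder URL

    def select_option(element_id, value):
        # Setting the value and firing a change event directly in the page is
        # much faster than driving the <select> through Selenium's own API.
        driver.execute_script(
            "var el = document.getElementById(arguments[0]);"
            "el.value = arguments[1];"
            "el.dispatchEvent(new Event('change'));",
            element_id, value)

    for country in ["ALB", "DZA", "AGO"]:      # hypothetical option values
        select_option("countrySelect", country)
        # ... submit the form and read the results here ...
        time.sleep(3)                          # artificial delay between query sets

    driver.quit()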
Unless you need to evaluate JavaScript or take screenshots of the rendered page, is there any point at all in using a WebDriver like that instead of building a plain old scraper?
I agree: using Selenium seems like an unnecessary waste of time and CPU resources when replicating the GET/POST requests and parsing the HTML responses with a simple Perl or Python script would have sufficed.
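Something along these lines would have covered it. The endpoint and form fields below are hypothetical, and I'm sketching it in Python with requests and BeautifulSoup rather than Perl:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    resp = session.post(
        "https://example.org/query",           # hypothetical endpoint
        data={"country": "ALB", "year": "2012"},
        timeout=30,
    )
    resp.raise_for_status()

    # Parse the returned HTML table without ever rendering the page.
    soup = BeautifulSoup(resp.text, "html.parser")
    for row in soup.select("table.results tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(cells)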