

We scraped the World Bank's website - danso
http://www.cgdev.org/publication/we-just-ran-twenty-three-million-queries-world-banks-website-working-paper-362

======
mileswu
An interesting, though only tangentially related, article I read a couple of
days ago found that nearly a third of World Bank reports are never read, not
even by a single person:
[http://www.washingtonpost.com/blogs/wonkblog/wp/2014/05/08/t...](http://www.washingtonpost.com/blogs/wonkblog/wp/2014/05/08/the-solutions-to-all-our-problems-may-be-buried-in-pdfs-that-nobody-reads/)

It was submitted to HN
([https://news.ycombinator.com/item?id=7715881](https://news.ycombinator.com/item?id=7715881))
by another user, but probably never got traction because the title of the
article is very vague.

~~~
ohashi
It sounds like a marketing problem. Although, I suspect the PDFs aren't the
only dissemination method (as stated in the article). After someone spends so
much time writing a report, it probably sits in their consciousness and
spreads through their other interactions (e.g. shaping their perspective,
thinking, and pursuits). With so many people producing so much content on a
daily basis, it's hard to imagine anyone actually reading it all. So hopefully
the good ideas stay in the author's thoughts and come out repeatedly until
they're heard. Otherwise, I'm not sure there really is a good filtering
mechanism at a system scale.

~~~
frozenport
Or maybe they are spreading bad ideas, we won't know unless somebody reviews
them.

------
stevoski
Or... you could simply use the World Bank's website and free API, which
already offer much public data for download in various formats:
[http://data.worldbank.org/](http://data.worldbank.org/)

A similar data repository is available from the European Central Bank:
[http://sdw.ecb.europa.eu/](http://sdw.ecb.europa.eu/)
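For reference, pulling a series from that World Bank API can be a couple of
stdlib calls. This is a minimal sketch, assuming the v2 JSON endpoint layout
(`/country/{code}/indicator/{code}?format=json`, which returns a
`[metadata, rows]` pair); treat the exact response shape as an assumption to
verify against the API docs.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.worldbank.org/v2"

def indicator_url(indicator, country="all", per_page=20000):
    """Build a v2 API query URL for one indicator series."""
    query = urllib.parse.urlencode({"format": "json", "per_page": per_page})
    return f"{BASE}/country/{country}/indicator/{indicator}?{query}"

def fetch_indicator(indicator, country="all"):
    """Download a series; v2 JSON responses come as a [metadata, rows] pair."""
    with urllib.request.urlopen(indicator_url(indicator, country)) as resp:
        meta, rows = json.load(resp)
    return rows
```

For example, `fetch_indicator("SP.POP.TOTL")` would pull the total-population
series for all countries in one request.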

------
yp_all
"public domain... but can [only] be accessed in small pieces"

Sounds like the wonderful world of "APIs" on the www.

This sort of data should be on an FTP server.

I can build my own "apps". Give me the option of raw data.

Just my opinion, nothing more.

~~~
aaron-lebo
That's not quite what is going on. I recently had to write a paper for class
looking at economic indicators. If I needed data on "Agriculture & Rural
Development":

[http://data.worldbank.org/topic/agriculture-and-rural-develo...](http://data.worldbank.org/topic/agriculture-and-rural-development)

Down at the bottom is a link that will give you a CSV file:

[http://api.worldbank.org/v2/en/topic/1?downloadformat=csv](http://api.worldbank.org/v2/en/topic/1?downloadformat=csv)

I thought it was really easy. I ended up having to visit multiple download
links, but stitching the data together was simple using Python.
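Stitching a handful of those CSV downloads together really can be a few lines
of Python. A minimal sketch, assuming each file carries the same header row on
its first line (the real World Bank bundles prepend a few metadata lines you'd
skip first):

```python
import csv
from pathlib import Path

def stitch(csv_dir, out_path):
    """Concatenate CSVs that share a header into one file, keeping the
    header only once."""
    header = None
    rows = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"{path} has a different header")
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)  # number of data rows written
```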

Without looking at the paper, I'm not sure what exactly they have. Have they
just done the stitching already and are providing the complete data set? Or is
there additional data not covered in the csv downloads?

Edit: Having looked at the paper, it looks like the data they scraped was not
in the CSVs. But I really can't tell. If that's the case, I don't know why
only some data is available as a bulk download and other data is not. So...
back to your original point.

------
obstacle1
Looking at the source right now. I noticed comments in the code along the
lines of "selenium is really slow traversing the dom", etc. Also noticed the
script uses the non-headless Firefox WebDriver. Wouldn't it have been _much_
faster to use GhostDriver or some similar headless solution?

~~~
tokenizerrr
Unless you need to evaluate JavaScript or take screenshots of the rendered
page, is there any point at all in using a webdriver like that instead of
building a plain old scraper?

~~~
dmn001
I agree. Using Selenium seems like an unnecessary waste of time and CPU
resources when replicating the GET/POST requests and parsing the HTML
responses with a simple Perl or Python script would have sufficed.
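A sketch of what such a browser-free scraper could look like in stdlib Python:
fetch the page with a straight GET and pull cell text out of the raw HTML.
Targeting `<td>` cells here is only an illustrative assumption, not the actual
structure of the World Bank's pages.

```python
import urllib.request
from html.parser import HTMLParser

class TableCellExtractor(HTMLParser):
    """Collect the text of every <td> cell in a page -- the kind of
    parsing a plain scraper does on the raw HTML response."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

def scrape(url):
    # Replicate the GET request directly instead of driving a browser
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = TableCellExtractor()
    parser.feed(html)
    return parser.cells
```

For POST-backed pages you would swap the `urlopen` call for one carrying the
same form data the browser sends, which is usually visible in the network tab
of any browser's dev tools.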

------
CHY872
I'm not sure what to make of this. On the one hand, the data could be useful.

On the other hand, the data was probably already useful through the world
bank's tool. Furthermore, it's reasonable to assume that the data was only
released because the people who had to make the final decision were convinced
that the effort required to reassemble the data would be prohibitive. The fact
that it's now all been released might have a chilling effect on future
releases.

~~~
bjelkeman-again
Not sure I follow. Do you think the World Bank will now hesitate to release
their data because someone scraped it?

They have an open data policy, using CC-BY [1], so unless this scraping effort
took data that wasn't covered by that it should be ok I think.

[1]
[http://web.worldbank.org/WBSITE/EXTERNAL/NEWS/0,,contentMDK:...](http://web.worldbank.org/WBSITE/EXTERNAL/NEWS/0,,contentMDK:23164491~pagePK:64257043~piPK:437376~theSitePK:4607,00.html)

------
Zaheer
The code is _very_ well commented, and if anyone's interested in scraping I'd
highly recommend reading through it. As a bonus, the appendix details all the
steps to run the script, making it very easy for beginners.

------
coldcode
Why would such important public data be only available to internal
researchers? Cost? Politics? Fear?

~~~
unreal37
If 70% of their reports are downloaded only a couple of times, ever, I would
guess that the answer is that nobody is interested in them. Have you ever read
one?

------
obstacle1
There is an appendix to the paper describing how to install and run the
author's script. I'd like to take a look at the source, but can't find an
actual link to the code. Am I overlooking something?

~~~
officialjunk
1\. go here
[http://www.cgdev.org/section/publications?f[0]=field_documen...](http://www.cgdev.org/section/publications?f\[0\]=field_document_type%3A2057)

2\. click on the link to "We Just Ran Twenty-Three Million Queries of the
World Bank's Web Site"

3\. click on the "Data & Analysis" tab

4\. scroll to the bottom and there are download links to
harvester_parameters.py, harvester.py and unloader.py

BUT I can't seem to actually download them, as there is some redirect that
fails :( I tried creating an account on that site as well. Anyone else have
any luck?

~~~
obstacle1
Perfect, thanks.

I got the redirect error too when I tried to download _all_ files. However
when I deselected all and just checked the readme, csv data, and python
scripts, it worked just fine.

------
dalek2point3
I thought the point of this was: gee, the World Bank makes it hard for you to
get data out, so we scraped them, and here is the data for you guys to play
around with. I looked for a bit but couldn't find the data -- am I missing
something?

~~~
bybjorn
The last sentence on that page reads: "The full data can be downloaded at
www.cgdev.org/povcalnet."

~~~
keithpeter
And the linked page has a large number of files that appear to be relevant to
_other_ papers, but not this one.

However, this link

[http://www.cgdev.org/section/publications?f[0]=field_documen...](http://www.cgdev.org/section/publications?f\[0\]=field_document_type%3A2057)

leads to a page that lists two papers by title. Click the second of the
papers, then the "Data & Analysis" tab, and you can download the CSV files
they obtained. You don't need a login, but you do need to agree to some
sensible-looking conditions.

I can't seem to link directly to the agree terms/download page.

------
iand
I guess they could have just gone to
[http://povertydata.worldbank.org/poverty/home/](http://povertydata.worldbank.org/poverty/home/)
and downloaded via CSV

------
jacquesm
I wonder if they would have released this if Weev had not been freed.

------
e12e
Looking at some random datasets, these look pretty small (less than a MB?) --
so it'd be nice to just have them zipped up and available as a torrent (say, a
plain-text description in one file and a CSV in a folder per set)?

Rather than re-creating the problem of data being hard to find by splitting
the download links over two pages, and apparently requiring a click-through
for every dataset to get at the data?

------
nickthemagicman
I'm not even sure what kind of data this would be but it sounds fascinating.

