The infoboxes, which is what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.
The reason is that the dumps just have pointers to templates, and you need to understand quite a bit about Wikipedia's bespoke rendering system to know how to fully realize them (or use a constantly-evolving library like wtf_wikipedia [1] to parse them).
The rendered HTML, on the other hand, is designed for humans, and so what you see is what you get.
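To make that concrete, here's a minimal sketch of pulling an infobox out of the rendered HTML as key/value pairs. It assumes Python with requests and beautifulsoup4 installed, and it assumes the `infobox` CSS class plus the usual th/td row layout that most English Wikipedia infobox templates render to — a sketch, not a spec.

```python
# Minimal sketch: extract an infobox as key/value pairs from rendered HTML.
# Assumes the "infobox" CSS class and th/td rows used by most en.wikipedia
# infobox templates (an assumption, not a guarantee for every template).
import requests
from bs4 import BeautifulSoup

def fetch_infobox(title: str) -> dict:
    url = f"https://en.wikipedia.org/wiki/{title}"
    headers = {"User-Agent": "infobox-sketch/0.1 (example, not a real crawler)"}
    html = requests.get(url, headers=headers, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    box = soup.find("table", class_="infobox")
    if box is None:
        return {}

    data = {}
    for row in box.find_all("tr"):
        # Most infobox rows are a <th> label next to a <td> value.
        th, td = row.find("th"), row.find("td")
        if th and td:
            data[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)
    return data

print(fetch_infobox("Elizabeth_II"))
```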
You could, and if he were doing this across the entire corpus, that would be the responsible thing to do.
But, his project really was very reasonable:
- it fetched ~2,400 pages
- he cached them after the first fetch
- Wikipedia aggressively caches anonymous page views (e.g. the Queen Elizabeth page has a cache age of 82,000 seconds; see the header check sketched below)
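You can see that caching for yourself by looking at the response headers on an anonymous fetch. A quick sketch, assuming the `age` and `x-cache`/`x-cache-status` headers that Wikipedia's edge currently sends:

```python
# Quick check of Wikipedia's edge caching for an anonymous page view.
# "age" is how long the object has sat in cache; "x-cache" / "x-cache-status"
# report hit/miss per cache layer (header names are assumptions about the
# current setup and may change).
import requests

resp = requests.get(
    "https://en.wikipedia.org/wiki/Elizabeth_II",
    headers={"User-Agent": "cache-header-check/0.1 (example)"},
    timeout=30,
)
for name in ("age", "x-cache", "x-cache-status"):
    print(name, "=", resp.headers.get(name))
```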
English Wikipedia serves about 250,000,000 pageviews/day, so this guy's ~2,400 fetches were roughly 0.001% of a single day's traffic.
I get the slippery-slope argument, but to me it just doesn't apply here. As someone who has donated $1,000 to Wikipedia in the past, I'm totally happy to have those funds spent supporting use cases like this, rather than demanding that anyone who wants to benefit from Wikipedia set up a MySQL server, spend hours doing the import, install and configure a PHP server, etc, etc.
He was probably one of the biggest users that day, so that makes sense.
The 2,400 pages, assuming a 50 KB average gzipped size, come to about 120 MB of transfer. I'm assuming CPU usage is negligible thanks to CDN caching, so bandwidth is the main cost. 120 MB is roughly 150x less transfer than the 18.5 GB dump -- about two orders of magnitude.
Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia Foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is aggressively optimized for cost: caching, CDNs, negotiated bandwidth discounts.
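For comparison, here's roughly what the API route looks like. This sketch assumes the Wikimedia REST API's per-page HTML endpoint (`/api/rest_v1/page/html/{title}`), which returns rendered Parsoid HTML that still contains the infobox markup:

```python
# Sketch of the API route instead of scraping /wiki/ pages directly.
# Endpoint path is assumed from the public Wikimedia REST API docs; the
# response is rendered (Parsoid) HTML, so the same infobox parsing applies.
import requests

def fetch_page_html(title: str) -> str:
    url = f"https://en.wikipedia.org/api/rest_v1/page/html/{title}"
    resp = requests.get(
        url,
        headers={"User-Agent": "rest-api-sketch/0.1 (example)"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

html = fetch_page_html("Elizabeth_II")
print(len(html), "bytes of rendered HTML")
```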
If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.
That's true. On the other hand, pages with infoboxes are likely well-linked and will end up in the cache either due to legitimate popularity or due to crawler visits.
Checking a random sample of 50 pages from this guy's dataset, 70% of them were cached.
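For anyone who wants to reproduce that kind of check, a rough sketch -- the `x-cache-status` header name and its hit/miss values are assumptions about the current Varnish/ATS setup:

```python
# Rough cache-hit check over a sample of titles: count responses whose
# x-cache-status header reports a hit at any cache layer.
import requests

titles = ["Elizabeth_II", "Python_(programming_language)", "Nitrogen"]  # sample
hits = 0
for t in titles:
    resp = requests.get(
        f"https://en.wikipedia.org/wiki/{t}",
        headers={"User-Agent": "cache-sample-sketch/0.1 (example)"},
        timeout=30,
    )
    status = resp.headers.get("x-cache-status", "")
    if "hit" in status:
        hits += 1
print(f"{hits}/{len(titles)} pages served from cache")
```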
Note - there are several levels of caching at Wikipedia. Even if those pages aren't in the CDN (Varnish) cache, they may be in the parser cache (an application-level cache of most of the page).
This amount of activity really isn't something to worry about, especially when it takes the fast path of a logged-out user viewing a likely-to-be-cached page.
En Wikipedia has some standards. Generally, though, infoboxes are user-created tables and it's up to the editors to keep them consistent (if they so desire). En Wikipedia generally does, but it's not exactly a hard guarantee.
If you want machine-readable data, use Wikidata (if you hate RDF, you can still scrape the HTML preview of the data).
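For example, a sketch of pulling one entity's structured data from Wikidata without touching RDF at all -- the Special:EntityData JSON endpoint and Q42 (Douglas Adams) are just the usual documentation examples:

```python
# Fetch one Wikidata entity's labels and claims as plain JSON.
# Q42 (Douglas Adams) is the standard example item; swap in any item ID.
import requests

def fetch_entity(qid: str) -> dict:
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    resp = requests.get(
        url,
        headers={"User-Agent": "wikidata-sketch/0.1 (example)"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["entities"][qid]

entity = fetch_entity("Q42")
print(entity["labels"]["en"]["value"])          # "Douglas Adams"
print(len(entity.get("claims", {})), "properties with claims")
```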
Even just being able to download a tarball of the HTML of the infoboxes would be really powerful, setting aside the difficulty of, say, translating them into a consistent JSON format.
That plus a few other key things (categories, opening paragraph, redirects, pageview data) enable a lot of powerful analysis.
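One of those key things, pageview data, is already easy to pull programmatically. A sketch assuming the Wikimedia pageviews REST API's per-article endpoint with daily granularity:

```python
# Daily pageviews for one article via the Wikimedia pageviews REST API.
# Endpoint path and YYYYMMDDHH timestamp format are assumed from the public
# API docs; adjust the article and date range as needed.
import requests

def daily_views(article: str, start: str, end: str) -> list:
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"en.wikipedia/all-access/all-agents/{article}/daily/{start}/{end}"
    )
    resp = requests.get(
        url,
        headers={"User-Agent": "pageviews-sketch/0.1 (example)"},
        timeout=30,
    )
    resp.raise_for_status()
    return [(item["timestamp"], item["views"]) for item in resp.json()["items"]]

for ts, views in daily_views("Elizabeth_II", "2024010100", "2024010700"):
    print(ts, views)
```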
That actually might be kind of a neat thing to publish. Hmmmm.
Better yet -- what is the set of Wikipedia articles with an infobox that cannot be sensibly interpreted as key/value pairs where the key is a simple string?
Hehe -- I am going to rankly speculate that nearly all of them follow an obvious standard of key/value pairs where the key is a string. And then there are two or three subcultures on Wikipedia that put rando stuff in there and would troll to the death before being forced to change their infobox class to "rando_box" or whatever negligible effort it would take them if a standard were enforced.
[1]: https://github.com/spencermountain/wtf_wikipedia