
The infoboxes, which is what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.

The reason is that the dumps just have pointers to templates, and you need to understand quite a bit about Wikipedia's bespoke rendering system to know how to fully realize them (or use a constantly-evolving library like wtf_wikipedia [1] to parse them).

The rendered HTML, on the other hand, is designed for humans, and so what you see is what you get.

[1]: https://github.com/spencermountain/wtf_wikipedia
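
To make that concrete, here's a rough sketch of what pulling an infobox out of the rendered HTML can look like (assuming requests and BeautifulSoup; the infobox-label / infobox-data class names are what current en.wikipedia markup appears to use, so treat them as an assumption):

    import requests
    from bs4 import BeautifulSoup

    # Fetch the rendered article HTML, just as an anonymous reader would.
    resp = requests.get(
        "https://en.wikipedia.org/wiki/Elizabeth_II",
        headers={"User-Agent": "infobox-sketch/0.1 (contact: you@example.com)"},
    )
    soup = BeautifulSoup(resp.text, "html.parser")

    # The rendered infobox is a <table class="infobox ...">; in current markup,
    # label and value cells carry the infobox-label / infobox-data classes.
    box = soup.find("table", class_="infobox")
    for row in box.find_all("tr"):
        label = row.find(class_="infobox-label")
        data = row.find(class_="infobox-data")
        if label and data:
            print(label.get_text(" ", strip=True), "=>", data.get_text(" ", strip=True))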




Still, I guess you could get the dumps and do a local MediaWiki setup based on them, and then crawl that instead?


You could, and if he were doing this on the entire corpus, that'd be the responsible thing to do.

But his project really was very reasonable:

- it fetched ~2,400 pages

- he cached them after first fetch

- Wikipedia aggressively caches anonymous page views (e.g. the Queen Elizabeth page has a cache age of 82,000 seconds, roughly 23 hours; you can check this yourself, as sketched below)

English Wikipedia does about 250,000,000 pageviews/day. This guy's use was 0.001% of traffic on that day.
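
You can see that cache age yourself in the response headers; a quick sketch with requests (the age and x-cache headers are what I'm going by, and exact values will vary by edge node):

    import requests

    # Anonymous page view, the same path a crawler takes; we only care about headers.
    resp = requests.get(
        "https://en.wikipedia.org/wiki/Elizabeth_II",
        headers={"User-Agent": "cache-age-check/0.1 (contact: you@example.com)"},
    )
    # "age": seconds the object has been sitting in the CDN cache.
    # "x-cache": which cache hosts answered and whether each was a hit.
    print(resp.headers.get("age"), resp.headers.get("x-cache"))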

I get the slippery slope arguments, but to me, it just doesn't apply. As someone who has donated $1,000 to Wikipedia in the past, I'm totally happy to have those funds spent supporting use cases like this, rather than demanding that people who want to benefit from Wikipedia be able to set up a MySQL server, spend hours doing the import, install and configure a PHP server, etc, etc.


> This guy's use was 0.001% of traffic on that day

For one person consuming from one of the most popular sites on the web, that actually reads as pretty big.


He was probably one of the biggest users that day, so that makes sense.

The 2,400 pages, assuming a 50 KB average gzipped size, equate to 120 MB of transfer. I'm assuming CPU usage is negligible due to CDN caching, and so bandwidth is the main cost. 120 MB is orders of magnitude less transfer than the 18.5 GB dump.

Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia Foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is going to be aggressively optimized for cost: caching, CDNs, negotiated bandwidth discounts.
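
For comparison, the API route would look something like this (a sketch against the REST endpoint that serves already-parsed page HTML; the action API's parse module would also work):

    import requests

    # The REST API returns the already-parsed HTML of a page, and anonymous
    # requests to it go through the same CDN caching as regular page views.
    title = "Elizabeth_II"
    resp = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/html/{title}",
        headers={"User-Agent": "infobox-sketch/0.1 (contact: you@example.com)"},
    )
    resp.raise_for_status()
    html = resp.text  # scrape the infobox out of this just like the regular page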

If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.


I don't think I agree. Cache has a cost too.

In theory, you'd want to cache more popular pages and let the rarely visited ones go through the uncached flow.

Crawling isn't user behavior, so the odds are that a large percentage of the crawled pages were not cached.


That's true. On the other hand, pages with infoboxes are likely well-linked and will end up in the cache either due to legitimate popularity or due to crawler visits.

Checking a random sample of 50 pages from this guy's dataset, 70% of them were cached.
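
If anyone wants to reproduce that, here's roughly the kind of check I mean; the titles list is a placeholder (in practice they'd come from his dataset), and I'm going by the x-cache header, which is only an approximation:

    import random
    import requests

    # Placeholder titles; in practice these would come from the dataset in question.
    titles = ["Elizabeth_II", "Bakerloo_line", "Douglas_Adams"]
    sample = random.sample(titles, min(50, len(titles)))
    hits = 0
    for title in sample:
        r = requests.get(
            f"https://en.wikipedia.org/wiki/{title}",
            headers={"User-Agent": "cache-sample/0.1 (contact: you@example.com)"},
        )
        # x-cache reports per-layer hit/miss, e.g. "cp1089 miss, cp3066 hit/42".
        if "hit" in r.headers.get("x-cache", "").lower():
            hits += 1
    print(f"{hits}/{len(sample)} served from cache")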


Note: there are several levels of caching at Wikipedia. Even if those pages aren't in the CDN (Varnish) cache, they may be in the parser cache (an application-level cache of most of the page).

This amount of activity really isn't something to worry about, especially when it takes the fast path of a logged-out user viewing a page that's likely to be cached.


> The infoboxes, which is what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.

How is it possible that "give me all the infoboxes, please" is more than a single query, download, or even URL at this point?


The problem lies in parsing them.

Look at the template for a subway line infobox, for example. https://en.wikipedia.org/wiki/Template:Bakerloo_line_RDT

It's a whole little clever language (https://en.wikipedia.org/wiki/Wikipedia:Route_diagram_templa...) for making complex diagrams out of rather simple pictograms (https://commons.wikimedia.org/wiki/Template:Bsicon).


Oh wow.

But every other infobox I've seen has key/value pairs where the key is always a string.

So what's the spec for an infobox? Is it simply to have a starting `<table class="hello_i_am_infobox">` and an ending `</table>`?


En Wikipedia has some standards. Generally, though, they are user-created tables and it's up to the users to make them consistent (if they so desire). En Wikipedia generally does, but it's not exactly a hard guarantee.

If you want machine-readable data, use Wikidata (if you hate RDF, you can still scrape the HTML preview of the data).
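
Each item is available as plain JSON, no RDF tooling needed; a minimal sketch (Q42, Douglas Adams, is just a convenient well-known item ID):

    import requests

    # Wikidata serves every item as plain JSON via Special:EntityData.
    resp = requests.get(
        "https://www.wikidata.org/wiki/Special:EntityData/Q42.json",
        headers={"User-Agent": "wikidata-sketch/0.1 (contact: you@example.com)"},
    )
    entity = resp.json()["entities"]["Q42"]
    print(entity["labels"]["en"]["value"])      # "Douglas Adams"
    print(len(entity["claims"]), "properties")  # structured statements, keyed by P-ids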


Even just being able to download a tarball of the HTML of the infoboxes would be really powerful, setting aside the difficulty of, say, translating them into a consistent JSON format.

That plus a few other key things (categories, opening paragraph, redirects, pageview data) enable a lot of powerful analysis.

That actually might be kind of a neat thing to publish. Hmmmm.
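
Pageview data, at least, is already easy to pull per article; a rough sketch against the Wikimedia pageviews REST API (the date range here is just a placeholder):

    import requests

    # Daily pageview counts for one article from the Wikimedia analytics REST API.
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/user/Elizabeth_II/daily/20240101/20240131")
    resp = requests.get(url, headers={"User-Agent": "pageview-sketch/0.1 (contact: you@example.com)"})
    for item in resp.json()["items"]:
        print(item["timestamp"], item["views"])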


Better yet: what is the set of Wikipedia articles with an infobox that cannot be sensibly interpreted as key/value pairs where the key is a simple string?


The infoboxes aren't standardized at all. The HTML they generate is.


Hehe -- I am going to rankly speculate that nearly all of them follow an obvious standard of key/value pairs where the key is a string. And then there are like two or three subcultures on Wikipedia that put rando stuff in there and would troll to the death before being forced to change their infobox class to "rando_box" or whatever negligible effort it would take them if a standard were to be enforced.

Am I anywhere close to being correct?


I think you'll have to more clearly define what you mean by "key-value" pairs.



