
Wikipedia data dumps and stats - kola
https://meta.wikimedia.org/wiki/Research:Data
======
fauigerzigerk
Sadly, they don't publish up-to-date HTML dumps and there is no reliable way
of reproducing them short of installing the entire Wikipedia system locally,
including the database. I know there are quite a few projects that claim to do
it but they're all abandoned, incomplete or unsuitable in various other ways
(as far as I know).

~~~
atdt
"The entire Wikipedia system, including the database" is slightly overblown:
it's just MediaWiki and MySQL.

~~~
yuvipanda
MediaWiki (plus a finely tuned PHP setup, because MediaWiki is unusably slow
without one), MySQL, and plenty of time and resources for the large import of
enwiki (since that is what most people are interested in).

~~~
fauigerzigerk
Exactly, and at that point you still don't have the static HTML files. You
have to crawl the entire local site, which takes ages. Then you have to repeat
all of this according to your desired update frequency.

~~~
alexkus
EC2 + S3

Minimal charge to cover the bandwidth and hosting costs. Any profit gets
donated to the Wikimedia Foundation.

Solves that same problem for lots of people.

------
wikiburner
Hey everybody, fauigerzigerk sort of gets into this, but I just downloaded the
dump yesterday expecting there to be a relatively straightforward way to parse
and search it with Python and extract and process articles of interest w/
NLTK.

I'm not sure what I was expecting exactly, but it sure wasn't a single 40 GB
XML file that I can't even open in Notepad++.

Is my only real option (for parsing and data mining this thing) to basically
set up a clone of wikipedia's system, and then screen scrape localhost?

~~~
fauigerzigerk
It's not your only option. You can open the XML dump with a streaming XML
parser (not a DOM parser) and use one of the existing wiki syntax parsers to
extract what you need. If you just need a few specific items (for instance
just the links to reconstruct the page graph or just the info boxes) that's a
perfectly workable solution. There is a large number of small tools and
scripts that extract various bits and pieces from the XML dump. You may well
find a tool that suits your needs.
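
A minimal sketch of that streaming approach with Python's standard library
(the filename is a placeholder, and the tag matching is kept loose because
the XML namespace changes between dump export versions):

    import xml.etree.ElementTree as ET

    def iter_pages(dump_path):
        """Stream (title, wikitext) pairs from a MediaWiki XML dump
        without loading the whole file into memory."""
        title, text = None, None
        for _event, elem in ET.iterparse(dump_path, events=('end',)):
            tag = elem.tag.rsplit('}', 1)[-1]  # drop the versioned namespace
            if tag == 'title':
                title = elem.text
            elif tag == 'text':
                text = elem.text or ''
            elif tag == 'page':
                yield title, text
                title, text = None, None
                elem.clear()  # free the finished subtree so memory stays flat

    # e.g. print the first few article titles
    for i, (title, wikitext) in enumerate(iter_pages('enwiki-pages-articles.xml')):
        print(title)
        if i >= 4:
            break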

But there are two issues. First, the available parsers are not very robust or
complete, because the wiki syntax is extremely convoluted and there is no
formal spec. Second, the wiki syntax includes a kind of macro system
(templates). Without actually executing those macros you don't get the
complete page as you see it online. The only way to get the complete and
correct page content, to my knowledge, is to install MediaWiki and import the
data.

If you just want to look at the XML dump quickly you can use less or tail.
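
For the wiki syntax side, one of the existing parsers, e.g. mwparserfromhell
(just one possible choice, with the robustness caveats above), can pull links
and infobox templates out of the raw markup. It does not execute templates,
so you only get what is literally written in the page source:

    import mwparserfromhell

    def links_and_infoboxes(wikitext):
        """Extract wikilink targets and infobox templates from raw markup.
        Templates come back unexpanded; their rendered output is not here."""
        code = mwparserfromhell.parse(wikitext)
        links = [str(link.title) for link in code.filter_wikilinks()]
        infoboxes = [tpl for tpl in code.filter_templates()
                     if str(tpl.name).strip().lower().startswith('infobox')]
        return links, infoboxes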

~~~
yareally
> _because the wiki syntax is extremely convoluted and there is no formal
> spec_

I ran into that when parsing out pages with Python for an app I am working on.
Parsing by special-casing leads to a lot of conditions for edge cases, and as
you might expect, the edge cases get more frequent the more obscure the topic
is, since those articles aren't updated or brought in line with the formatting
of heavily trafficked ones. If you are looking for something in particular,
ranking the elements on a page helps up to a point, provided the elements you
want are among the most common on the page.

Aside from the more obscure, less trafficked articles, I noticed that many of
the non-English wiki articles are also formatted in awkward ways and appear
far less updated than their English counterparts. I thought I had most edge
cases covered until I started parsing out wiki markup for other languages.

~~~
fauigerzigerk
Ah, thanks for the warning. I haven't even touched the non-English articles
yet :/

~~~
yareally
If you plan on doing both, which is pretty easy with their API (you can grab
all the available languages for an article along with their URLs), I would
test against foreign languages first and then English once you have a basic
parser and search going. The non-English articles had more weirdness and it
showed up more often, so handling it made it easier to eliminate the similar
cases that only appear infrequently in English articles.
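
Grabbing the language links for an article is a single API request. Roughly
(llprop=url asks for full URLs next to the language codes; most articles fit
in one response with lllimit=max):

    import requests

    API = 'https://en.wikipedia.org/w/api.php'

    def language_versions(title):
        """Return {lang_code: article_url} for each language an article exists in."""
        params = {
            'action': 'query',
            'prop': 'langlinks',
            'titles': title,
            'lllimit': 'max',
            'llprop': 'url',
            'format': 'json',
            'formatversion': 2,
        }
        page = requests.get(API, params=params).json()['query']['pages'][0]
        return {ll['lang']: ll['url'] for ll in page.get('langlinks', [])}

    # e.g. language_versions('Natural language processing')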

I ended up doing a lot of unit testing against various edge cases to make
sure things were working. Even with that, I would still log any anomalies and
set them aside for manual inspection later (by running checks on what "good"
data should look like), just to be safe.

