
Json-wikipedia - ikuyamada
https://github.com/diegoceccarelli/json-wikipedia
======
sktrdie
What about DBpedia?
http://dbpedia.org/page/The_Shining_(film)

That's available as JSON, and there's a much larger community and effort
behind it.

~~~
ikuyamada
In my understanding, DBpedia is a project for extracting data mainly from
Wikipedia infoboxes (not the whole dumps) by collaboratively creating rules
that convert the data into a cleaner schema, one that can be queried with
languages such as SPARQL. This project, on the other hand, directly converts
the Wikipedia dump XML to JSON for easier manipulation, which is a different
goal from DBpedia's.
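
For the curious, a minimal sketch of such a SPARQL query from Python against
DBpedia's public endpoint (the endpoint URL, the dbo:director property, and
the JSON result layout are assumptions to check against the DBpedia docs):

    # Sketch: ask DBpedia for the director of The Shining (film).
    # Assumes the public endpoint at http://dbpedia.org/sparql and the
    # dbo:director property; both belong to DBpedia, not to this repo.
    import json
    import urllib.parse
    import urllib.request

    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?director WHERE {
      <http://dbpedia.org/resource/The_Shining_(film)> dbo:director ?director .
    }
    """
    url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urllib.request.urlopen(url) as resp:
        for row in json.load(resp)["results"]["bindings"]:
            print(row["director"]["value"])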

------
andrewvc
Some might be interested in https://github.com/andrewvc/wikiparse, code I
wrote to dump Wikipedia into Elasticsearch. Here's a chapter in my book on
searching it:
http://exploringelasticsearch.com/book/searching-natural-language/searching-the-wikipedia-dataset.html

------
DenisM
So, does it parse the markdown of the text as well? I could not quite get it
from the documentation. It would be quite exciting if it could - the markdown
is notoriously complex, and creates a lot of problems if all you want is the
body of text (e.g. for training a machine learning algorithm).

~~~
TorKlingberg
Wikipedia articles are not written in markdown but in wikicode. Wikicode is
much more complex to parse, but allows more complex formatting. You may be
looking for the word 'markup', which is a general term that includes
markdown, HTML, wikicode, bbcode etc.

I doubt this library parses the wikicode. What format would it parse it into?
As matthewarkin said, there is a Wikipedia API to convert wikicode into HTML.
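
That API is MediaWiki's action=parse endpoint; a minimal sketch of calling it
from Python (parameter names per the MediaWiki API docs, so double-check them
there):

    # Sketch: convert a snippet of wikicode to HTML with the MediaWiki
    # parse API (https://www.mediawiki.org/wiki/API:Parsing_wikitext).
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "action": "parse",
        "text": "'''Bold''' wikicode with a [[Stanley Kubrick|link]].",
        "contentmodel": "wikitext",
        "format": "json",
    })
    # Wikimedia asks clients to send a descriptive User-Agent header.
    req = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "wikicode-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["parse"]["text"]["*"])  # rendered HTML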

~~~
diegolo
It does that indeed! It will produce a paragraphs field, containing the list
of parsed paragraphs in the article as simple UTF-8 text. Lists and tables
are collected in separate fields; have a look at the Article object. For the
wikicode parsing I relied on the JWPL library:
http://www.ukp.tu-darmstadt.de/software/jwpl/
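
To make that concrete, a hedged sketch of reading those fields from one
record of the dump (the exact field names are taken from the comment above
and should be verified against the Article object and the sample linked
elsewhere in the thread):

    # Sketch: pull the plain-text body out of one dump record.
    # Field names ("paragraphs", "lists", "tables") are assumptions;
    # verify them against the Article object before relying on them.
    import json

    with open("json-wikipedia-sample.json") as f:  # any json-wikipedia output
        article = json.loads(f.readline())         # one article per line

    body = "\n".join(article.get("paragraphs", []))
    lists, tables = article.get("lists"), article.get("tables")
    print(body[:200])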

------
fox91
Why would I need the dump in JSON? XML is uglier, for sure, but at least
there are efficient SAX parsers for it.

~~~
dk8996
I would like to know this as well. What is gained by using this tool...
converting to JSON?

~~~
tcdent
JSON is cool and hip.

More seriously, it has been gaining popularity because of its direct
relationship to built-in structures in most programming languages.

For example: XML doesn't have a concept of arrays, but accomplishes storing
collections in its own special way. It's up to the parser to determine how
this should be referenced and interacted with, while with JSON there's no
question: it's an array.
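
A small illustration of the difference, using only the Python standard
library (the sample documents are made up, but the parsing behaviour is the
point):

    # The same collection as XML and as JSON. With JSON the parser hands
    # back a list directly; with XML it is up to us to decide that the
    # repeated <lang> elements form a collection.
    import json
    import xml.etree.ElementTree as ET

    xml_doc = "<langs><lang>en</lang><lang>it</lang></langs>"
    json_doc = '["en", "it"]'

    xml_langs = [el.text for el in ET.fromstring(xml_doc).findall("lang")]
    json_langs = json.loads(json_doc)  # already a Python list

    assert xml_langs == json_langs == ["en", "it"]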

~~~
sillysaurus2
For example, now I know for a fact I can easily manipulate Wikipedia's data
dump using Python thanks to this JSON change.

    import json

    with open('wiki-dump.json') as f:          # path to json-wikipedia output
        js = [json.loads(line) for line in f]  # one JSON article per line

That's it. The only reason it's so simple is this JSON conversion.

XML? No clue how to manipulate it properly. I could figure it out of course,
but now I don't have to. JSON + Python just works™.

~~~
diegolo
Yes, and it also makes it easy to manipulate the dump using Hadoop, and to
retrieve some values for all the articles using jq
(http://stedolan.github.io/jq/) - could you do the same with XML? Btw, you
can find a little sample of the dump here:
https://dl.dropboxusercontent.com/u/4663256/tmp/json-wikipedia-sample.json
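
As an illustration of that per-article extraction (a jq one-liner along the
lines of `jq .title` would do the same; the title field name is an assumption
to check against the sample):

    # Sketch: stream the dump and print one value per article, similar
    # to what a jq filter does. "title" is an assumed field name.
    import json

    with open("json-wikipedia-sample.json") as f:
        for line in f:
            print(json.loads(line).get("title"))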

------
diegolo
A little sample of the dump:
https://dl.dropboxusercontent.com/u/4663256/tmp/json-wikipedia-sample.json

