
How to Extract Data from Wikipedia and Wikidata - okket
https://linktosheets.com/extract-data-from-wikipedia-wikidata/
======
Extracting just raw text from the Wikipedia dumps is pretty hellish.
Apparently the official Wikipedia parser itself is something like one
5,000-line PHP function. There are thirty or so alternative parsers that
attempt to do this, with limited success. I tried a lot of those options, but
eventually had to hack together some terrible scripts of my own to do the job.

[https://www.mediawiki.org/wiki/Alternative_parsers](https://www.mediawiki.org/wiki/Alternative_parsers)
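
For anyone attempting this today, here is a minimal sketch of the stripping
step using mwparserfromhell, one of the parsers on that list (the sample
wikitext is just an illustration):

```python
import mwparserfromhell

def wikitext_to_plain(wikitext: str) -> str:
    """Strip templates, links, and formatting from raw wikitext."""
    code = mwparserfromhell.parse(wikitext)
    # strip_code() drops templates (infoboxes, citations) and markup,
    # keeping only the readable text of internal links.
    return code.strip_code(normalize=True, collapse=True)

sample = "'''Hacker News''' is a [[social news]] website.{{citation needed}}"
print(wikitext_to_plain(sample))
# -> Hacker News is a social news website.
```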

It was for a web app that I made as a way to learn web dev. The data
extraction from Wikipedia was what really killed my motivation and eventually
left the project in a sort of amateurish, half-finished state. But if any
language learners want to practice reading difficult English texts, it might
be somewhat useful:

[https://lexical.io](https://lexical.io)

~~~
logicallee
>Extracting just raw text from the Wikipedia dumps is pretty hellish.

Did you try just asking them? I'm sure someone there can flick a switch and
get it all to you. After all, it's _just_ text, and they must have it all set
up to serve however they want.

The total size of the English Wikipedia is under 100 GB as uncompressed
XML [1] (bearing in mind that XML is pretty verbose).

This should not be prohibitive for a Wikipedia admin to generate and to serve
to you at your request. Why not?

It's a community project sponsored by donations.

[1] Eyeballing the trend on this page -
[https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia)
which mentions "As of May 2015, the current version of the English Wikipedia
article / template / redirect text was about 51 GB uncompressed in XML
format", so I estimate based on the shape of the graph that it is less than
100 GB today.

~~~
taejo
You can download a database dump, which contains the wikitext source, and
render that into HTML. What GP means is extracting the natural-language text
from it (separating it out from the infoboxes, presentation markup, etc).
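
If it helps anyone, a rough sketch of streaming wikitext out of a
pages-articles dump without loading the whole thing into memory; the filename
is a placeholder, and the XML namespace version differs between dumps:

```python
import bz2
import xml.etree.ElementTree as ET

# pages-articles dumps are bz2-compressed XML; iterparse streams the file
# page by page instead of reading tens of gigabytes into memory.
DUMP = "enwiki-latest-pages-articles.xml.bz2"  # assumed local filename
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check your dump's version

with bz2.open(DUMP) as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            wikitext = elem.findtext(f"{NS}revision/{NS}text") or ""
            # ... hand `wikitext` to a wikitext parser here ...
            elem.clear()  # free already-processed pages
```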

~~~
logicallee
oh, okay. thanks.

------
hmottestad
Worth mentioning is DBpedia, an open database built by extracting structured
data from Wikipedia.

[http://wiki.dbpedia.org](http://wiki.dbpedia.org)

~~~
frik
Is there an easy way to host a local DBpedia copy? Last time I checked, it
used a little-known piece of database software. MySQL, Postgres, or
Lucene/Elasticsearch support would be preferred.

~~~
smarx007
That "little known database software" is called an RDF triplestore. You can
think of it as a graph store on steroids, which allows you to formulate very
complex queries on linked data (which wikipedia has plenty of). Take a look at
[https://www.futurelearn.com/courses/linked-
data](https://www.futurelearn.com/courses/linked-data) for learning more (not
affiliated with the authors, just enjoyed the course).
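
If you only want to query rather than host a copy, DBpedia also exposes a
public SPARQL endpoint. A minimal sketch (the particular query and resource
name are just illustrations):

```python
import requests

# DBpedia's public endpoint speaks the standard SPARQL-over-HTTP protocol.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?abstract WHERE {
  dbr:Hacker_News dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["abstract"]["value"])  # English abstract of the article
```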

------
Animats
Note what this is for. "When to use it: Seek spikes in pageviews, find out the
reason it triggered interest, and try to replicate it. If you’re unable to
replicate it, you can produce content related to that trigger." This is an
article on how to generate clickbait content.

~~~
theoh
A little cynically, one could say that a lot of "life" is about seeing someone
exploiting a niche, and imitating them.

Unprincipled, yes.

------
WikipediasBad
Wikidata is a very interesting experience, and using its API is rather cool.
For example, here is the Wikidata API call for Hacker News:
[https://www.wikidata.org/w/api.php?action=wbgetentities&ids=...](https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q686797&format=json)

The mainsnak syntax is a bit confusing, but once you get used to it it's quite
powerful. I use it quite often.
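
A small sketch of walking the claims/mainsnak structure in that response;
P856 ("official website") is an assumption about which properties the Hacker
News item carries:

```python
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": "Q686797", "format": "json"},
)
entity = resp.json()["entities"]["Q686797"]

print(entity["labels"]["en"]["value"])  # the item's English label

# Each statement sits under claims[<property id>]; the actual value is
# nested in statement["mainsnak"]["datavalue"]["value"].
for statement in entity["claims"].get("P856", []):  # P856: official website
    snak = statement["mainsnak"]
    if snak["snaktype"] == "value":
        print(snak["datavalue"]["value"])
```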

------
garysieling
I've been investigating Wikidata, because I want to add informational cards to
the sidebar of [https://www.findlectures.com](https://www.findlectures.com)
(like Google does), so this spreadsheet is a great head start.

------
wlib
I made a CLI tool [0] a while ago to simplify this. I remember the API being
pretty difficult.

[0] [https://rubygems.org/gems/wik](https://rubygems.org/gems/wik)

------
redsummer
Is there a way to run a local copy of the English Wikipedia on a Raspberry
Pi, one that can be accessed and searched via the shell? It would live on a
USB stick, with images and media files excluded.

I know there is Gozim -
[http://scaleway.nobugware.com](http://scaleway.nobugware.com) - but that
serves a local website.

~~~
JCharante
You could use Kiwix and have the RPi host a NAS with the database.

