How to Extract Data from Wikipedia and Wikidata (linktosheets.com)
201 points by okket on March 27, 2017 | 22 comments



Worth mentioning is DBpedia, an open database built by extracting data from Wikipedia.

http://wiki.dbpedia.org


Is there an easy way to host a local DBpedia copy? Last time I checked, it used little-known database software. MySQL, Postgres, or Lucene/Elasticsearch support would be preferred.


That "little known database software" is called an RDF triplestore. You can think of it as a graph store on steroids, which allows you to formulate very complex queries on linked data (which wikipedia has plenty of). Take a look at https://www.futurelearn.com/courses/linked-data for learning more (not affiliated with the authors, just enjoyed the course).


I meant the OpenLink Virtuoso database software. https://en.wikipedia.org/wiki/Virtuoso_Universal_Server

RDF triple stores were all the hype during the Semantic Web era. Wouldn't a more general NoSQL/graph or SQL database be suitable as well, if someone wrote an import script? I read that someone achieved it with MySQL and a star schema (a rough sketch of the idea is at the end of this comment).

What database software does Wikidata use? Wikimedia properties usually use MySQL. The dumps page has SQL files as well, so probably MySQL. But they promote only the JSON, RDF, and XML dumps.
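
For what it's worth, the "plain SQL table of triples" idea can be sketched roughly like this in Python with SQLite; the file names, predicates, and schema are purely illustrative, and real N-Triples parsing needs more care than this regex:

    # Naive sketch: load an N-Triples dump into one (subject, predicate, object)
    # table and query it with ordinary SQL self-joins. Not a complete importer:
    # literals with embedded whitespace, datatypes, and blank nodes need real parsing.
    import sqlite3, re

    TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')

    con = sqlite3.connect("dbpedia.db")
    con.execute("CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_sp ON triples (subject, predicate)")

    with open("dump.nt", encoding="utf-8") as f:
        rows = (m.groups() for m in map(TRIPLE.match, f) if m)
        con.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    con.commit()

    # "Labels of people born in Berlin" becomes an ordinary self-join:
    for (label,) in con.execute("""
        SELECT t2.object FROM triples t1
        JOIN triples t2 ON t1.subject = t2.subject
        WHERE t1.predicate = '<http://dbpedia.org/ontology/birthPlace>'
          AND t1.object    = '<http://dbpedia.org/resource/Berlin>'
          AND t2.predicate = '<http://www.w3.org/2000/01/rdf-schema#label>'
        LIMIT 5"""):
        print(label)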


Take a look at Stardog. It should let you load DBpedia on your laptop, no problem. They have a free trial license. http://stardog.com/

It's modern and simple to get started with.


There is also YAGO, which serves a similar purpose.

https://www.mpi-inf.mpg.de/departments/databases-and-informa...

https://gate.d5.mpi-inf.mpg.de/webyago3spotlx/SvgBrowser

(Disclaimer: colleague of a YAGO developer.)


Note what this is for. "When to use it: Seek spikes in pageviews, find out the reason it triggered interest, and try to replicate it. If you’re unable to replicate it, you can produce content related to that trigger." This is an article on how to generate clickbait content.


A little cynically, one could say that a lot of "life" is about seeing someone exploiting a niche, and imitating them.

Unprincipled, yes.


Wikidata is a very interesting experience. Using their API is rather cool. For example, here is the Wikidata API call for Hacker News: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=...

The mainsnak syntax is a bit confusing, but once you get used to it, it's quite powerful. I use it quite often.
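
If the mainsnak structure is new to you, here is a small Python sketch of where the values live; Q42 (Douglas Adams) and P569 (date of birth) are just a well-known example entity and property, not anything from the article:

    # Fetch one entity via wbgetentities and walk its claims/mainsnak structure.
    import requests

    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": "Q42", "format": "json"},
        timeout=30,
    )
    entity = resp.json()["entities"]["Q42"]

    print(entity["labels"]["en"]["value"])  # "Douglas Adams"

    # Each statement lives under claims[<property id>][i]["mainsnak"]["datavalue"].
    for claim in entity["claims"]["P569"]:  # P569 = date of birth
        snak = claim["mainsnak"]
        if snak["snaktype"] == "value":
            print(snak["datavalue"]["value"]["time"])  # "+1952-03-11T00:00:00Z"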


I've been investigating Wikidata, because I want to add informational cards to the sidebar of https://www.findlectures.com (like Google does), so this spreadsheet is a great head start.


I made a CLI tool [0] a while ago to simplify this. I remember the API being pretty difficult.

[0] https://rubygems.org/gems/wik


Extracting just raw text from the Wikipedia dumps is pretty hellish. Apparently the official Wikipedia parser itself is something like one 5000-line PHP function. There are about 30 alternate parsers that attempt to do this, with limited success. I tried a lot of those options, but eventually had to hack together some terrible scripts of my own to do the job (a rough sketch of that kind of approach is at the end of this comment).

https://www.mediawiki.org/wiki/Alternative_parsers

It was for a web app that I made as a way to learn web dev. The data extraction from Wikipedia was what really killed my motivation and eventually left the project in a sort of amateurish, half-finished state. But if any language learners want to practice reading difficult English texts, it might be somewhat useful:

https://lexical.io
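
For anyone hitting the same wall, here is a rough Python sketch of the general approach (not my actual scripts): stream the pages-articles XML dump and strip the markup with mwparserfromhell. The dump file name is illustrative, the {*} wildcard needs Python 3.8+, and templates are simply dropped rather than expanded, which is exactly the limitation described above.

    # Stream the XML dump and yield (title, plain text) per page.
    import bz2
    import xml.etree.ElementTree as ET
    import mwparserfromhell  # pip install mwparserfromhell

    def plain_text_pages(dump_path):
        with bz2.open(dump_path) as f:
            for _, elem in ET.iterparse(f):
                if elem.tag.endswith("}page"):
                    title_el = elem.find(".//{*}title")
                    text_el = elem.find(".//{*}text")
                    if title_el is not None and text_el is not None:
                        wikitext = text_el.text or ""
                        yield title_el.text, mwparserfromhell.parse(wikitext).strip_code()
                    elem.clear()  # keep memory bounded while streaming

    for title, text in plain_text_pages("enwiki-latest-pages-articles.xml.bz2"):
        print(title, text[:80])
        break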


MediaWiki's original parser is a nasty pile of PHP and regex, but there's a new officially supported parser you can use called Parsoid. It's used extensively in production on Wikimedia sites and basically does a round trip between MediaWiki wikitext and HTML.

https://www.mediawiki.org/wiki/Parsoid

If I were parsing lots of MediaWiki wikitext, I'd put Parsoid at the top of the list as the rest seem incomplete or unsupported.
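
One note: you often don't need to run Parsoid yourself, since Wikimedia's public REST API serves Parsoid-rendered HTML for articles. A minimal Python sketch (the User-Agent string is just a placeholder):

    # Fetch Parsoid HTML for one article from the English Wikipedia REST API.
    import requests

    title = "Hacker_News"
    resp = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/html/{title}",
        headers={"User-Agent": "wiki-extract-sketch/0.1 (example)"},
        timeout=30,
    )
    resp.raise_for_status()
    html = resp.text  # Parsoid HTML, with data-mw attributes preserving wikitext semantics
    print(html[:200])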


There are dumps available that already have lots of metadata extracted and the content available as plain text, without wiki markup or HTML. Specifically, these are the dumps of the search indices used at Wikimedia:

https://dumps.wikimedia.org/other/cirrussearch/current/

These dumps contain one JSON line per article, formatted roughly like this: https://en.wikipedia.org/wiki/Hacker_News?action=cirrusdump
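
A minimal Python sketch for reading such a dump, assuming one JSON object per line with the article documents carrying an already-extracted plain-text "text" field (as in the cirrusdump example above); the file name is illustrative:

    # Stream a gzipped cirrussearch dump and print the first article document.
    import gzip, json

    with gzip.open("enwiki-cirrussearch-content.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if "text" not in doc:
                continue  # skip index/metadata lines that carry no article text
            print(doc.get("title"), doc["text"][:80])
            break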


>Extracting just raw text from the Wikipedia dumps is pretty hellish.

Did you try just asking them? I'm sure someone there can flick a switch and get it all to you. After all, it's just text, and they must have it all set up to serve however they want.

The total size of the English Wikipedia is under 100 GB as uncompressed XML [1] (bearing in mind that XML is pretty verbose).

This should not be prohibitive for a Wikipedia admin to generate and to serve to you at your request. Why not?

It's a community project sponsored by donations.

[1] Eyeballing the trend on this page - https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia - which mentions "As of May 2015, the current version of the English Wikipedia article / template / redirect text was about 51 GB uncompressed in XML format", so based on the shape of the graph I estimate it is less than 100 GB today.


You can download a database dump that contains the wikitext source, and you can render this into HTML. What the GP means is extracting the natural-language text from it (separating it from the infoboxes, presentation markup, etc.).


oh, okay. thanks.


Data extraction from Wikipedia seems to be a task for which there exist a plethora of partial solutions, all different.

Partial solutions are perhaps all we can hope for, given how hopelessly intermingled semantics and presentation are in the Wikitext format. If you had a completely accurate reimplementation of MediaWiki and all its templates and extensions, what you'd get in the end is HTML pages, and that's not actually data extraction.

Here's my project [1], which uses Haskell's Attoparsec library to get at the parts of Wikimedia content that I care about. Currently it extracts plain text from Wikipedia articles, and word relationships from three languages of Wiktionary.

[1] https://github.com/LuminosoInsight/wikiparsec


I was in a similar situation some years ago, when I needed Wikipedia documents to build a corpus for text mining. I eventually had to build a solution myself from Wikipedia's XML dump files.

Here's the link to it, in case someone needs to do something similar: https://github.com/joaoventura/WikiCorpusExtractor


By coincidence, I was trying to mine Wikipedia for another project. It is better to parse the HTML than to mess about with the wikitext. The Wikidata stuff can also be quite inconsistent in what it returns.
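
A quick Python sketch of that "just parse the HTML" approach, using requests and BeautifulSoup; the selector details are illustrative and will need tweaking per page:

    # Fetch a rendered article and pull out the body paragraph text.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "https://en.wikipedia.org/wiki/Hacker_News",
        headers={"User-Agent": "wiki-extract-sketch/0.1 (example)"},
        timeout=30,
    )
    soup = BeautifulSoup(resp.text, "html.parser")

    # On Wikipedia, article body paragraphs live inside the #mw-content-text container.
    body = soup.find(id="mw-content-text")
    paragraphs = [p.get_text(" ", strip=True) for p in body.find_all("p")]
    print("\n\n".join(paragraphs[:3]))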


Is there a way to run a local version of Wikipedia (English) on a Raspberry Pi that can be accessed and searched via the shell? On a USB stick, with images and media files excluded.

I know there is Gozim - http://scaleway.nobugware.com - but that serves a local website.


You could use Kiwix and have the RPi host a NAS with the database.




