
Extracting data from Wikipedia using curl, grep, cut and other bash commands - loige
http://loige.co/extracting-data-from-wikipedia-using-curl-grep-cut-and-other-bash-commands/
======
lacksconfidence
Because I was randomly curious: you can extract this data from the structured
HTML with some DOM selectors in a similarly haphazard way.

Start with:
[https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olymp...](https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olympic_medalists_in_judo)

Run this JS one-liner (unwrapped here for readability):

    [].slice.call(document.querySelectorAll(
      'table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr td:nth-child(n+2) > a:nth-child(1),' +
      ' table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr:nth-child(3) td > a:nth-child(1)'
    )).map(function (e) {
      return e.innerText;
    }).reduce(function (res, el) {
      res[el] = res[el] ? res[el] + 1 : 1;
      return res;
    }, {});

The result is an object with the medalists as keys and the counts as values.
JS objects are unordered, so sorting is left as an exercise for the reader.

~~~
austinjp
Nice. Nasty, but nice :)

I can't help but notice a small bug... Driulis Gonzalez, for example, has
medalled 4 times, but your script gives her count as only 3. Similarly, Gévrise
Émane isn't listed by your script. Something to do with split table cells, I
suspect.

Still, it's inspiring. I've often used bash and perl to scrape data from web
pages. I'll definitely consider JS in future.

------
jpatokal
This is you-can't-parse-HTML-with-regex [1] level hideous, only worse, because
MediaWiki markup is essentially a Turing-complete programming language thanks
to template inclusion, parser functions [2], etc.

The only remotely sane way to do this is to use the MediaWiki API [3] to get
the pages you want, then use an actual parser like mwlib [4] to extract the
content you need. Wikidata and DBpedia are also promising efforts, but both
have a long way to go in terms of coverage.

[1] [http://stackoverflow.com/questions/1732348/regex-match-open-...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)

[2]
[https://www.mediawiki.org/wiki/Help:Extension:ParserFunction...](https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions)

[3]
[https://www.mediawiki.org/wiki/API:Main_page](https://www.mediawiki.org/wiki/API:Main_page)

[4]
[https://www.mediawiki.org/wiki/Alternative_parsers](https://www.mediawiki.org/wiki/Alternative_parsers)
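
Step one, at least, is a single request; a minimal sketch, assuming curl and
jq are installed (the judo list from the article is just an example title):

    # Fetch the raw wikitext of one page through the MediaWiki API.
    # formatversion=2 gives the flatter JSON layout used by the jq path below.
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvslots=main&titles=List_of_Olympic_medalists_in_judo&format=json&formatversion=2' \
      | jq -r '.query.pages[0].revisions[0].slots.main.content'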

~~~
taneq
This isn't for production-level data migration. It's for smooshing some source
text into a shape which is useful to you.

Parsing HTML with regexps is fine if you're just curious roughly how many
images are in a page. It's great for quick command line experiments. It's just
not good when you need to be "doing it properly".

~~~
elmigranto
I mostly agree; I've been using regexes to parse a known, limited subset of
HTML tags in a known, limited format (one tag per line, forced padding, no
attrs, etc.).

But on the other hand, this is often how long-term "proper" solutions are born
— evolved from something cobbled together in a couple of hours.

~~~
taneq
Oh yeah, the golden rule for quick hacks is NEVER show them to someone non-
technical, especially someone non-technical who's in your chain of command.

I learned _that_ one after mocking up some UI screens using VB6 and then
having to explain over and over again that no, just because we showed you some
buttons on a page doesn't mean that the program (which had to be a Java
applet, mind you) was "almost done."

------
betolink
...or we can just use SPARQL and DBpedia!
([http://wiki.dbpedia.org/](http://wiki.dbpedia.org/)) There are questions
where you'll have to scrape more than one page to get an answer, and things
could get really complicated with shell commands.

DBpedia is a triple-store that allows us to perform simple queries against
Wikipedia data, like listing music bands based in a particular city:

      SELECT ?name ?place
      WHERE {
        ?place rdfs:label "Denver"@en .
        ?band dbo:hometown ?place .
        ?band rdf:type dbo:Band .
        ?band rdfs:label ?name .
        FILTER langMatches(lang(?name),'en')
      }
    

or queries that involve multiple subjects, categories, etc.
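
The same endpoint is scriptable too; a rough sketch of running the query above
from a shell, assuming curl (the public endpoint predefines the dbo:/rdfs:
prefixes; add PREFIX lines if yours does not):

    # POST the SPARQL query above to the public DBpedia endpoint, as CSV
    curl -s 'https://dbpedia.org/sparql' \
      --data-urlencode 'format=text/csv' \
      --data-urlencode 'query=SELECT ?name ?place WHERE {
          ?place rdfs:label "Denver"@en .
          ?band dbo:hometown ?place .
          ?band rdf:type dbo:Band .
          ?band rdfs:label ?name .
          FILTER langMatches(lang(?name), "en")
        }'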

~~~
lacksconfidence
I looked at DBpedia, but it was non-obvious to me what statements to use. We
can also use SPARQL with Wikidata, although the coverage isn't particularly
great. I threw together an example query for medalists, and Michael Phelps
doesn't make the list because he doesn't have the appropriate participant
of/award received statements:

[https://query.wikidata.org/#SELECT%20%3Fhuman%20%3FhumanLabe...](https://query.wikidata.org/#SELECT%20%3Fhuman%20%3FhumanLabel%20%3Fcount%20WHERE%20%7B%0A%20%20%7B%0A%20%20%20%20SELECT%20%3Fhuman%20%28COUNT%28%2a%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%20%20%20%20%3Fevent%20wdt%3AP31%20wd%3AQ18536594%20.%20%23%20All%20items%20that%20are%20instance%20of%20Olympic%20sporting%20event%0A%20%20%0A%20%20%20%20%20%20%3Fmedal%20wdt%3AP279%20wd%3AQ636830%20.%20%20%20%20%23%20All%20items%20that%20are%20subclass%20of%20Olympic%20medal%20%0A%0A%20%20%20%20%20%20%3Fhuman%20p%3AP1344%20%3FparticipantStat%20.%20%23%20Humans%20with%20a%20participant%20of%20statement%0A%20%20%20%20%20%20%3FparticipantStat%20ps%3AP1344%20%3Fevent%20.%20%23%20..%20that%20has%20any%20of%20the%20values%20of%20%3Fevent%0A%20%20%20%20%20%20%3FparticipantStat%20pq%3AP166%20%3Fmedal%20.%20%23%20..%20with%20the%20award%20received%20qualifier%20of%20any%20of%20the%20values%20of%20%3Fmedal%0A%20%20%20%20%7D%0A%20%20%20%20GROUP%20BY%20%3Fhuman%0A%20%20%7D%0A%20%20%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%7D%0AORDER%20BY%20DESC%28%3Fcount%29%0ALIMIT%20100%0A)

      SELECT ?human ?humanLabel ?count WHERE {
        {
          SELECT ?human (COUNT(*) as ?count) WHERE {
            ?event wdt:P31 wd:Q18536594 . # All items that are instance of Olympic sporting event
      
            ?medal wdt:P279 wd:Q636830 .    # All items that are subclass of Olympic medal 
    
            ?human p:P1344 ?participantStat . # Humans with a participant of statement
            ?participantStat ps:P1344 ?event . # .. that has any of the values of ?event
            ?participantStat pq:P166 ?medal . # .. with the award received qualifier of any of the values of ?medal
          }
          GROUP BY ?human
        }
      
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
      ORDER BY DESC(?count)
      LIMIT 100
    

EDIT: The dbpedia search should be something like:
[http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbp...](http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fhuman+%3Fcount+WHERE%0D%0A%7B%0D%0A++%7B%0D%0A++++SELECT+%3Fhuman+%28count%28*%29+as+%3Fcount%29+WHERE+%7B%0D%0A++++++%3Fevent+rdf%3Atype+dbo%3AOlympicEvent%0D%0A++++++%7B%0D%0A++++++++%3Fevent+dbo%3AbronzeMedalist+%3Fhuman+.++%0D%0A++++++%7D+UNION+%7B%0D%0A++++++++%3Fevent+dbo%3AsilverMedalist+%3Fhuman%0D%0A++++++%7D+UNION+%7B%0D%0A++++++++%3Fevent+dbo%3AgoldMedalist+%3Fhuman%0D%0A++++++%7D%0D%0A++++%7D%0D%0A++++GROUP+BY+%3Fhuman%0D%0A++%7D%0D%0A%7D%0D%0AORDER+BY+DESC%28%3Fcount%29%0D%0ALIMIT+100&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on)

      SELECT ?human ?count WHERE
      {
        {
          SELECT ?human (count(*) as ?count) WHERE {
            ?event rdf:type dbo:OlympicEvent
            {
              ?event dbo:bronzeMedalist ?human .  
            } UNION {
              ?event dbo:silverMedalist ?human
            } UNION {
              ?event dbo:goldMedalist ?human
            }
          }
          GROUP BY ?human
        }
      }
      ORDER BY DESC(?count)
      LIMIT 100

~~~
betolink
It's not surprising that Phelps is not on the list: you're grouping by ?human
and he is not... jk. Maybe there is an ontology issue, i.e. new triples are
using a different class, or that triple-store is not up to date.

------
minimaxir
> You will not need to open an editor and write a long script and then to have
> an interpreter like Node.js to run it, sometime the bash command line is
> just enough you need!

This is a bad attitude to have for working with _data processing_, where QA
is necessary and the accuracy of the output is important. A 50 LOC scraper
with comments and explicitly defined function inputs and outputs is far
preferable to an 8 LOC scraper that those without bash knowledge will be
unable to parse.

And the 8 LOC bash script is not much of a time saving, as this post
demonstrates; you still have to check each function's output manually to find
which data to parse and to handle edge cases.

~~~
MichaelBurge
I've used Makefiles to coordinate a zoo of perl/bash scripts before. It's
pretty effective.

~~~
black_knight
Makefiles are underrated. The concept is so simple – it is basically a
dependency graph with attached shell scripts – yet so powerful. Not just for
building software, but also for everyday tasks where you need some files to be
updated under some condition.

I recommend mk (the original make replacement for Plan 9, available for
different OSes through Plan9Port [0,1]). It has a somewhat more uniform
syntax, and can check that the dependency graph is well-founded.

[0] [https://github.com/9fans/plan9port](https://github.com/9fans/plan9port)

[1] [https://swtch.com/plan9port/](https://swtch.com/plan9port/)
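
To make the dependency-graph point concrete, here is a toy Makefile in that
spirit. The file names and the sed pattern are illustrative only, not from the
article (and recipe lines must be indented with real tabs):

    # Each target is a file, rebuilt only when its prerequisites change
    judo.wiki:
            curl -s 'https://en.wikipedia.org/w/index.php?title=List_of_Olympic_medalists_in_judo&action=raw' > $@

    medalists.txt: judo.wiki
            sed -n 's/.*flagIOCmedalist|\[\[\([^]|]*\).*/\1/p' $< > $@

    counts.txt: medalists.txt
            sort $< | uniq -c | sort -rn > $@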

~~~
gcb0
another advantage of make is that every host has it. like vim.

mk is not there yet.

~~~
viraptor
Did you mean "vi"? Lots of systems don't install vim by default.

~~~
taneq
And contrariwise, lots of hosts install "vi" which actually seems to be vim.
Aah, 'standards'. :P

------
ianseyer
This is the kind of query that excites me about WikiData's development.

[http://wikidata.org](http://wikidata.org)

~~~
lacksconfidence
see also the SPARQL search:
[https://query.wikidata.org/](https://query.wikidata.org/)
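
It works from a shell as well; a minimal sketch, assuming curl (P1344 is the
"participant in/of" property used in the query upthread):

    # Query the Wikidata SPARQL endpoint directly and ask for CSV back
    curl -s -H 'Accept: text/csv' 'https://query.wikidata.org/sparql' \
      --data-urlencode 'query=
        SELECT ?human ?humanLabel WHERE {
          ?human wdt:P1344 ?event .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        } LIMIT 10'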

------
Washuu
There is also the option of using Parsoid.

[https://github.com/wikimedia/parsoid](https://github.com/wikimedia/parsoid)

It is MediaWiki's official off-wiki parser, and it can turn wikitext into HTML
or HTML back into wikitext. It would be reasonably simple to hook into its API
and use it for data extraction instead.
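
Wikipedia already serves Parsoid output through its REST API (the same
endpoint used in the top comment), so a first experiment is a single request:

    # Fetch Parsoid-generated HTML (with RDFa attributes) for one page
    curl -s 'https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olympic_medalists_in_judo' -o judo.html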

~~~
rspeer
Is converting Wikitext to HTML/RDFa really going to help with this task? I'd
say it's actually clearer how to get the data out of the original Wikitext.

------
orfix
My 2 cents: the cut/grep lines could be replaced by a sed/awk one-liner such
as:

    sed -n 's/.*flagIOCmedalist|\[\[\([^]|]*\).*/\1/p'
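
For context, a sketch of a full pipeline around that one-liner, assuming the
wikitext is fetched with action=raw (untested against the live page):

    # Pull the raw wikitext and count medal entries per name
    curl -s 'https://en.wikipedia.org/w/index.php?title=List_of_Olympic_medalists_in_judo&action=raw' \
      | sed -n 's/.*flagIOCmedalist|\[\[\([^]|]*\).*/\1/p' \
      | sort | uniq -c | sort -rn | head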

~~~
oxymoron
Agreed. I was a long-time abuser of cut, but have moved to relying on sed
instead. I find that it's generally a lot more robust if you think through
your expressions. For certain cases awk will also do the job. Perl one-liners
do seem convenient, but they have never been my cup of tea.

------
CydeWeys
There is an active project sponsored by the Wikimedia Foundation called
PyWikiBot that I've been a contributor to and user of for over a decade now.
If you want to do anything and everything with Wikipedia, look no further
than: [https://github.com/wikimedia/pywikibot-core](https://github.com/wikimedia/pywikibot-core)

------
mickael-kerjean
You should try Wikidata for any type of query that can't be answered using
Google and where all the information itself is already on Wikipedia. It's way
faster (if you know SPARQL) and way more powerful and flexible. It's
surprising there aren't more people talking about it; triplestores are
awesome.

------
davidgerard
... there's an API making half of this superfluous. You can do pretty much any
MediaWiki reading or writing through it. (All Wikipedia bots are required to
use it, for instance.)

[https://en.wikipedia.org/w/api.php](https://en.wikipedia.org/w/api.php)

The article text is a raw blob of wikitext you have to process, but you don't
have to go to stupid lengths trying to parse HTML without a browser.
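
If all you want is readable text, the API's TextExtracts module will even
hand you plain text, no wikitext processing at all; a minimal sketch, assuming
curl and jq:

    # Plain-text extract of a page via the API's TextExtracts module
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext&titles=Judo&format=json&formatversion=2' \
      | jq -r '.query.pages[0].extract'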

~~~
gkbrk
But he didn't parse any HTML in the article.

------
tpetricek
Extracting data from Wikipedia with type providers:
[http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/](http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/)

------
turtlebits
An xpath like `//table/tr/td[2]/a[1]/text()` seems like it would be a lot
simpler.
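
For instance, a sketch with xmllint (assumes libxml2's xmllint is installed;
note //table//tr rather than //table/tr, since HTML parsers insert tbody):

    # Evaluate a similar XPath against the page HTML from the REST endpoint
    curl -s 'https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olympic_medalists_in_judo' \
      | xmllint --html --xpath '//table//tr/td[2]/a[1]/text()' - 2>/dev/null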

~~~
san_dimitri
This is my go-to approach every time I have to parse HTML or XML. I still
don't understand why people don't use something as simple as Google
Spreadsheets and write a simple XPath to load tabular data using
=IMPORTXML().

------
kasperset
A large part of bioinformatics data processing involves these commands. They
seem a little cryptic but get the job done. I would also like to mention
Datamash:
[https://www.gnu.org/software/datamash/](https://www.gnu.org/software/datamash/)
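
A tiny illustration of what Datamash adds over the usual cut/sort/uniq
pipelines (hypothetical input; group-by and mean in one command):

    # Group by column 1 and average column 2 (space-separated input)
    printf 'a 1\na 3\nb 5\n' | datamash -t ' ' --sort groupby 1 mean 2
    # output:
    # a 2
    # b 5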

------
hbogert
Isn't this a poster child example for the semantic web?

------
ShakataGaNai
errrrrrrrrk. Extracting raw wiki-markup and trying to use it? Not the greatest
of ideas. The only true parser of that language is MediaWiki itself. Doing it
yourself is a recipe for a massive headache.

~~~
ClayFerguson
I was thinking the same thing. Wikimedia makes all of Wikipedia available for
download; you don't need to screen-scrape. LOL. I guess they had some other
wikis they want to get data from, but the "main" worldwide Wikipedia site that
everyone thinks of as Wikipedia makes its data freely downloadable, and I've
downloaded it before.

------
yarrel
A couple of years ago I found Perl was fastest at processing Wikipedia dumps.

It also didn't require having a JVM preloaded to make startup times acceptable
during development (naming no other tools).

I do use shell tools to process data, a lot. They're particularly good for
exploratory programming and initial analysis of new datasets.

~~~
Steeeve
cut, awk, grep, and perl can churn through an initial data dump like nobody's
business.

------
dangravell
Or, for a lot of the structured elements, you could use DBpedia.

------
vram22
There is also a wikipedia library for Python. An example of its use:

Using the wikipedia Python library (to search for oranges :)

[https://jugad2.blogspot.in/2015/11/using-wikipedia-python-li...](https://jugad2.blogspot.in/2015/11/using-wikipedia-python-library.html)

And there may be libraries for other languages too, since the above library
wraps a Wikipedia API:

[https://en.wikipedia.org/wiki/Wikipedia:API](https://en.wikipedia.org/wiki/Wikipedia:API)

------
loige
I actually added some of your alternative solutions to the bottom of the
article, thanks for commenting :)

------
lovelearning
Python has excellent packages like mwparserfromhell and wikitables for this
kind of processing.

------
opensourcedude
I appreciate this for the novelty factor, but somebody show this dude how to
use a spreadsheet!

~~~
junke
Honestly, what novelty factor?

