
Show HN: Easy to use Wikipedia API for Python - jgoldsmith
https://github.com/goldsmith/Wikipedia/blob/master/
======
cjbarber
Fantastic job Johnathon. To HN readers, I just want to point out that this
awesome library was created by a high school student (who is extremely
motivated and builds a bunch of stuff)!

This is Hacker News at its best. Highlighting creation.

Interesting to note that this was submitted a few days ago and received only 3
points - I mention this because this is the sort of thing that should have
made the front page the first time it was posted!

I'm thinking of writing a weekly post that highlights the things people
created and posted to Hacker News that fell on deaf ears (thankfully this is
not the case with this post!).

In fact I'm going to go and write it now (and create!).

Edit 1: If anyone is on Medium, here's my draft.

[https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km...](https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km_postIds=e394f6d917d3)

Edit 2: And while I have the chance for something not to fall on deaf ears,
here's something I just wrote that would be interesting to anyone who's
annoyed with recruiters and would rather work at SpaceX than Snapchat.

[https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km...](https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km_postIds=de5c73174a4e)

~~~
DjangoReinhardt
Wait, HN doesn't mind reposts? Shut the front door! o_O

Also, I wistfully disagree with you about the "highlighting creation" bit. I
mean, FSM knows I want to agree with you but my experience so far says
otherwise. In all the time I've spent on HN (most of it as a lurker) I've
found HN to be quite snobbish about the show-and-tell attempts.

Then again, maybe I am experiencing sour grapes since my own Show HN posts
seem to disappear rapidly even before I can say, "Hey HN, loo-"...

:(

I'm thinking of creating an HN spin-off for young, upcoming devs to do a Show-
And-Tell about their recent attempts at learning/developing. Heck I've been
dying to give discourse ~~(the django-based discussions platform)~~ a try,
maybe I'll finally get around to it now. In fact, I'm going to go and write it
now... (Sorry, couldn't resist. ;) Not meant as a dig.)

Question is, should I do a Show HN, when it is done? :P

EDIT: Turns out discourse is rails-based, not django-based. Still gonna give
it a try, I guess... :(

~~~
voltagex_
Highly recommend Discourse, even if you don't know Rails.

~~~
DjangoReinhardt
Sold!

Caveat: I don't know Rails. I've only recently started teaching myself to
code/program/develop. So I guess now is as good a time as any to start with
Rails, eh?

8 months ago I jumped straight into working with Django/Python - one fine day,
I made a list of project ideas that I've had in my head for a while and
started chalking out the corresponding algos & coding them without a care
about how 'bad' my code was going to look. (Yeah, I belong to the 'learn
first, refine later' school of thought.)

8 months since I first started, the score is two ideas done, six more to go.
Wait, scratch that, seven more to go. Wish me luck!

~~~
sdoering
I can only agree. I started learning to code because I wanted to accomplish
one specific thing (downloading and parsing XML from a weather source to look
at historical weather data for five places).

I started with Python, learned a little bit of SQLite on the way, learned about
parsing files (good for understanding our devs at work better) and so on. Now
I've gone on, got a little sandbox server at work (I'm an editor) and automated
some really bad jobs at work with Python. And did some fun things to make
life better for our editorial team.

Yes, I am probably still writing spaghetti code, and there are a lot of things
left that I'd really like to learn, but with time comes understanding and I get
better week by week.

When I look at code from 3 months ago, I get the urge to refactor it. But it
works, it runs as it should and today I am more inclined to learn new things
and get new things done first.

So yes. Keep coding and have fun!

------
nevermore
Please note that this API does not make any specific attempts to obey the
mediawiki etiquette
([http://www.mediawiki.org/wiki/API:Etiquette](http://www.mediawiki.org/wiki/API:Etiquette)).
This sort of API is easy and clean for something like a command line script,
but if you're going to do further automation or crawling I strongly recommend
using the pywikipediabot library
([http://www.mediawiki.org/wiki/Manual:Pywikipediabot](http://www.mediawiki.org/wiki/Manual:Pywikipediabot))
which includes a very full API, has tunable throttling, and makes a more
direct attempt to require a user agent string that is in line with the api
etiquette.
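
For a small script that stays on plain HTTP calls, the two etiquette points mentioned above (an identifying user agent and some throttling) can be sketched roughly as follows. This is only an illustration using the standard library: the `USER_AGENT` string, the `Throttle` class and `build_request` are hypothetical names, not part of the library under discussion or of pywikipediabot.

```python
import time
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical identification string; replace with your own tool name and contact.
USER_AGENT = "ExampleWikiScript/0.1 (https://example.org/contact)"


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def build_request(params):
    """Return a urllib Request for the MediaWiki API with a User-Agent set."""
    query = urllib.parse.urlencode(dict(params, format="json"))
    return urllib.request.Request(
        API_URL + "?" + query, headers={"User-Agent": USER_AGENT}
    )
```

In a loop you would call `throttle.wait()` before each `urllib.request.urlopen(build_request(...))`, so requests go out one at a time with at least `min_interval` seconds between them.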

If you just want a bash script to look things up on wikipedia, you can always
use something like

    function wp {
      curl "http://en.wikipedia.org/wiki/$(echo "$@" | tr ' ' '_')" | gunzip | html2text
    }

which will work for basic queries (needs url encoding and words to be properly
capitalized).
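
The URL-encoding caveat can be handled in a few lines of Python. A minimal sketch, assuming the same `/wiki/<title>` URL scheme as the one-liner above; `article_url` is a hypothetical helper, and it does not fix capitalization, so that caveat still applies.

```python
import urllib.parse


def article_url(query):
    """Build an English Wikipedia article URL from a free-text query."""
    # Collapse runs of whitespace into single underscores, as the tr call does.
    title = "_".join(query.split())
    # Percent-encode everything that is not safe in a URL path.
    return "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title)
```

Non-ASCII titles are encoded as UTF-8 before percent-encoding, which is what `urllib.parse.quote` does by default.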

A full api reference is here
([http://en.wikipedia.org/w/api.php](http://en.wikipedia.org/w/api.php)).

~~~
zalew
> If you just want a bash script to look things up on wikipedia

or for basic description:

    
    
        wp() { dig +short txt "$*".wp.dg.cx; }

~~~
jonbaer
+1 thank you

------
languagehacker
Nice work on this. It's always good to see people giving more visibility to
MediaWiki's capabilities.

To engineers who would like to use this library, I would give a caveat that
there are way more API actions than the ones enumerated here. So if you're
looking to make some contributions to a project, this one is rife with
possible pull requests.

In terms of article access and analysis, I'd recommend looking at Pattern
([https://github.com/clips/pattern](https://github.com/clips/pattern)) before
starting with this library. Not only do you get access to the rest of
Pattern's IR/text analysis capabilities, but the approach in Pattern is
written to support any site built off of MediaWiki, and not just English
Wikipedia. This means not only foreign-language Wikipedia instances, but all
350k wikis on Wikia, and Project Gutenberg (which, interestingly enough, runs
MW 1.13).

------
echohack
Nice. I can see lots of room for improvement. Great start, especially using
Requests will streamline things.

I'll work in some changes tonight. Let's start with PEP8, shall we? :)

~~~
sbuccini
As a budding coder, sometimes following PEP8 is harder than actually coding
the stuff!

~~~
frakkingcylons
For PEP-8 compliance, I would recommend looking into using an IDE in some part
of your workflow that has active inspections for things like PEP-8 compliance.
I personally use PyCharm.

------
DenisM
Any advice on stripping wiki markup to obtain plain text from the wikipedia
dump? A friend is doing linguistic research and could benefit from large
bodies of text in different languages. Ideally this would be a C# library, but
a simple command line tool in any other language would do as well - accept
.xml.bz2, strip the wiki markup, return something that's easily processable by
further tools in a single file. Thanks in advance.

~~~
jgoldsmith
Since I am using the MediaWiki extracts API, I never had to find/write my own
Wikitext parser. However, I did run into a couple in my research that seemed
relatively popular:

\- [https://github.com/dcramer/py-wikimarkup](https://github.com/dcramer/py-wikimarkup)
(converts wikitext to HTML using Python, would need to extract text
with BeautifulSoup or something)

\-
[http://wiki.eclipse.org/Mylyn/Incubator/WikiText](http://wiki.eclipse.org/Mylyn/Incubator/WikiText)
(also to HTML, but in Java)

\-
[https://github.com/earwig/mwparserfromhell](https://github.com/earwig/mwparserfromhell)

I'm sure if you did a bit more digging you could find a C# library that does
this, or you could roll your own pretty easily using the others as a model.
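
For the "accept .xml.bz2" half of the question, reading the dump itself needs only the Python standard library. The sketch below is an assumption-laden illustration, not a tested pipeline: `iter_pages` is a hypothetical name, it assumes the usual `<page>`/`<title>`/`<text>` layout of MediaWiki XML exports, and it yields raw wikitext, so each string would still have to go through one of the parsers listed above to strip the markup.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a bz2-compressed MediaWiki XML dump."""
    with bz2.open(dump_path, "rb") as stream:
        title = None
        # iterparse streams the file, so the dump is never fully loaded.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                yield title, elem.text or ""
            elif tag == "page":
                elem.clear()  # release the children of the finished page
```

Writing the stripped text of each page to a single output file, one page after another, would give the "easily processable" result asked for.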

~~~
krichman
Easy? Haven't they got a Turing-complete language hiding in there?

------
toyg
I've used (and patched) this alternative:
[https://github.com/richardasaurus/wiki-api](https://github.com/richardasaurus/wiki-api)

------
harlowja
You might really want to avoid caching everything that comes back in a
never-ending Python dictionary. Look up "memory leak" on Wikipedia ;)

~~~
jgoldsmith
Each function has its own cache, and the size of the cache is limited by the
number of unique requests you make. How would this be a problem? (totally
genuine question)

Also, if you have a better way of doing it, please totally fork and request a
pull!
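
To make the concern concrete: a cache keyed by unique requests grows for as long as new requests keep arriving, which in a long-running crawler means it never stops growing. One common way to cap it is the standard library's `functools.lru_cache`. A minimal sketch, where `fetch_summary` is a hypothetical stand-in for the real network call and `maxsize=2` is only small so the eviction is easy to see:

```python
from functools import lru_cache

calls = []  # records which titles actually reached the "network"


def fetch_summary(title):
    """Hypothetical stand-in for a real API request."""
    calls.append(title)
    return "summary of " + title


@lru_cache(maxsize=2)
def cached_summary(title):
    """Cache at most two results; the least recently used entry is evicted."""
    return fetch_summary(title)
```

Repeated lookups of the same title are served from memory, while memory use stays bounded by `maxsize` no matter how many distinct titles are requested.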

------
level09
Excellent !

Are there any APIs for other languages? I tried to query using Unicode strings
and it worked, but I only got English content.

~~~
draegtun
For Perl library see MediaWiki::API -
[https://metacpan.org/module/MediaWiki::API](https://metacpan.org/module/MediaWiki::API)

~~~
LukeShu
GP meant APIs for Wikipedia languages other than English, not other
programming languages.

~~~
draegtun
Yes, I thought that might be the case but covered both bases because that CPAN
module is multi-lingual.
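
On the original question: each language edition of Wikipedia is served from its own subdomain, so English-only results usually mean the request went to the `en` endpoint. A rough sketch of building a plain-text extract query against another edition; `extract_query_url` and its `lang` parameter are hypothetical, and whether the library in this thread exposes such a setting is not something I've checked.

```python
import urllib.parse


def extract_query_url(title, lang="en"):
    """Build a MediaWiki API URL asking for a plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    # The language code selects the edition, e.g. "fr" -> fr.wikipedia.org.
    return "https://{}.wikipedia.org/w/api.php?{}".format(
        lang, urllib.parse.urlencode(params)
    )
```

For example, `extract_query_url("Café", lang="fr")` targets the French edition with the title percent-encoded.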

------
ksrm
Is there an API for extracting data from infoboxes?

------
leoplct
Great! I'm looking forward to a Ruby version!

~~~
benastan
Like [https://github.com/kenpratt/wikipedia-client](https://github.com/kenpratt/wikipedia-client) ?

