Show HN: Easy to use Wikipedia API for Python (github.com/goldsmith)
218 points by jgoldsmith on Aug 26, 2013 | 35 comments



Fantastic job, Jonathan. To HN readers, I just want to point out that this awesome library was created by a high school student (who is extremely motivated and builds a bunch of stuff)!

This is Hacker News at its best: highlighting creation.

Interesting to note that this was submitted a few days ago and received only 3 points - I mention this because this is the sort of thing that should have made the front page the first time it was posted!

I'm thinking of writing a weekly post that highlights the things people created and posted to Hacker News only for them to fall on deaf ears (thankfully that's not the case with this post!).

In fact I'm going to go and write it now (and create!).

Edit 1: If anyone is on Medium, here's my draft.

https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km...

Edit 2: And while I have the chance for something not to fall on deaf ears, here's something I just wrote that should interest anyone who's annoyed with recruiters and would rather work at SpaceX than Snapchat.

https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km...


Wait, HN doesn't mind reposts? Shut the front door! o_O

Also, I wistfully disagree with you about the "highlighting creation" bit. I mean, FSM knows I want to agree with you but my experience so far says otherwise. In all the time I've spent on HN (most of it as a lurker) I've found HN to be quite snobbish about the show-and-tell attempts.

Then again, maybe I am experiencing sour grapes since my own Show HN posts seem to disappear rapidly even before I can say, "Hey HN, loo-"...

:(

I'm thinking of creating an HN spin-off for young, upcoming devs to do a Show-And-Tell about their recent attempts at learning/developing. Heck I've been dying to give discourse ~~(the django-based discussions platform)~~ a try, maybe I'll finally get around to it now. In fact, I'm going to go and write it now... (Sorry, couldn't resist. ;) Not meant as a dig.)

Question is, should I do a Show HN when it's done? :P

EDIT: Turns out discourse is rails-based, not django-based. Still gonna give it a try, I guess... :(


Highly recommend Discourse, even if you don't know Rails.


Sold!

Caveat: I don't know Rails. I've only recently started teaching myself to code/program/develop. So I guess now is as good a time as any to start with Rails, eh?

8 months ago I jumped straight into working with Django/Python - one fine day, I made a list of project ideas that I've had in my head for a while and started chalking out the corresponding algos & coding them without a care about how 'bad' my code was going to look. (Yeah, I belong to the 'learn first, refine later' school of thought.)

8 months since I first started, the score is two ideas done, six more to go. Wait, scratch that, seven more to go. Wish me luck!


I can only agree. I started learning to code because I wanted to accomplish one specific thing (downloading and parsing XML from a weather source to look at historical weather data for five places).

I started with Python, learned a little SQLite along the way, learned about parsing files (good for understanding our devs at work better) and so on. Since then I've got a little sandbox server at work (I'm an editor) and automated some really tedious jobs with Python. And did some fun things to make life better for our editorial team.

Yes, I am probably still writing spaghetti code, and there are a lot of things left that I'd really like to learn, but with time comes understanding, and I get better week by week.

When I look at code from three months ago, I get the urge to refactor it. But it works, it runs as it should, and these days I am more inclined to learn new things and get new things done first.

So yes. Keep coding and have fun!


Actually, if your goal is to set up a discussion forum, then you can get away with "ignoring" the code behind Discourse (except for some YAML config files).


Also, email me (details in profile) if you need some more help with this idea - it's a good one!


Code without giving a shit. The best way to start!


Please note that this API does not make any specific attempt to obey the MediaWiki etiquette (http://www.mediawiki.org/wiki/API:Etiquette). This sort of API is easy and clean for something like a command-line script, but if you're going to do further automation or crawling, I strongly recommend the pywikipediabot library (http://www.mediawiki.org/wiki/Manual:Pywikipediabot), which includes a very full API, has tunable throttling, and makes a more direct attempt to require a user-agent string that is in line with the API etiquette.
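
For instance, the etiquette page asks clients to send a descriptive User-Agent with contact details. If you're rolling your own calls with Requests, that's just a header; a minimal sketch (the script name and contact address are placeholders):

    import requests

    # MediaWiki etiquette: identify your script and how to reach you
    HEADERS = {'User-Agent': 'my-wiki-script/0.1 (me@example.com)'}

    resp = requests.get('http://en.wikipedia.org/w/api.php',
                        params={'action': 'query', 'list': 'search',
                                'srsearch': 'Hacker News', 'format': 'json'},
                        headers=HEADERS)
    for hit in resp.json()['query']['search']:
        print(hit['title'])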

If you just want a bash script to look things up on wikipedia, you can always use something like

    function wp { curl -sH 'Accept-Encoding: gzip' "http://en.wikipedia.org/wiki/$(echo "$@" | tr ' ' '_')" | gunzip | html2text; }

which will work for basic queries (titles need URL encoding and proper capitalization).

A full API reference is here (http://en.wikipedia.org/w/api.php).


Hi (creator here),

Thanks for bringing this to my attention. I've added a disclaimer to the GitHub page regarding Pywikipediabot and plan to make changes to fully comply with MediaWiki API etiquette. The last thing I'd want to do is inadvertently cause problems for the site or foundation.


I would go one step further and suggest that people who need structured queries use Google BigQuery against the public Wikipedia dataset. Granted, the public dataset is from 2010, so it's slightly outdated, but you can write structured SQL against all of the Wikipedia article metadata and then use the MediaWiki API itself to grab only the article text that you're interested in.

The wikipedia data is hosted here: https://bigquery.cloud.google.com/table/publicdata:samples.w...

Here is a sample query, searching for all articles whose titles match ^Positive* (i.e., start with "Positiv"):

    SELECT id, title
    FROM [publicdata:samples.wikipedia]
    WHERE REGEXP_MATCH(title, r'^Positive*')
    LIMIT 10

Query complete (2.0s elapsed, 9.13 GB processed)

     1 |   464347 | Positive airway pressure
     2 | 10008223 | Positive behavior support
     3 |   464347 | Positive airway pressure
     4 |  1354851 | Positivism in Poland
     5 |  1023857 | Positive set theory
     6 |  5154273 | Positivism dispute
     7 |  2871407 | Positivism
     8 | 17179765 | Positive psychological capital
     9 |  9033239 | Positive Action Group
    10 |  4163012 | Positive K

Here is the Python API documentation: https://developers.google.com/api-client-library/python/
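
Running that same query from Python with the client library looks roughly like this. A sketch only: it assumes you've already built an OAuth-authorized httplib2 client, and the project id is a placeholder:

    from apiclient.discovery import build

    # 'authorized_http' is an httplib2.Http wrapped with oauth2client credentials
    bigquery = build('bigquery', 'v2', http=authorized_http)

    job = bigquery.jobs().query(
        projectId='your-project-id',  # placeholder: any billing-enabled project
        body={'query': "SELECT id, title FROM [publicdata:samples.wikipedia] "
                       "WHERE REGEXP_MATCH(title, r'^Positive*') LIMIT 10"}
    ).execute()

    for row in job['rows']:
        print('%s  %s' % (row['f'][0]['v'], row['f'][1]['v']))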


> If you just want a bash script to look things up on wikipedia

or, for a basic description:

    wp() { dig +short txt "$*".wp.dg.cx; }


+1 thank you


Nice work on this. It's always good to see people giving more visibility to MediaWiki's capabilities.

To engineers who would like to use this library, I would give a caveat that there are way more API actions than the ones enumerated here. So if you're looking to make some contributions to a project, this one is rife with possible pull requests.

In terms of article access and analysis, I'd recommend looking at Pattern (https://github.com/clips/pattern) before starting with this library. Not only do you get access to the rest of Pattern's IR/text analysis capabilities, but the approach in Pattern is written to support any site built off of MediaWiki, and not just English Wikipedia. This means not only foreign-language Wikipedia instances, but all 350k wikis on Wikia, and Project Gutenberg (which, interestingly enough, runs MW 1.13).
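
To give a flavor of Pattern's approach, its web module wraps Wikipedia behind a tiny interface. Roughly like this, going off Pattern's documented API (the article title is just an example):

    from pattern.web import Wikipedia

    # language='fr' points the wrapper at fr.wikipedia.org;
    # any MediaWiki-backed language edition works the same way
    engine = Wikipedia(language='fr')
    article = engine.search('Python (langage)')

    print(article.title)
    print(article.plaintext()[:300])  # plain-text body, markup stripped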


Nice. I can see lots of room for improvement, but it's a great start, and using Requests especially will streamline things.

I'll work in some changes tonight. Let's start with PEP8, shall we? :)


As a budding coder, I sometimes find following PEP8 harder than actually coding the stuff!


For PEP-8 compliance, I would recommend looking into using an IDE in some part of your workflow that has active inspections for things like PEP-8 compliance. I personally use PyCharm.
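
If an IDE feels heavy, the pep8 package itself is scriptable too. A small sketch (the filename is a placeholder):

    import pep8  # pip install pep8

    # show_source prints the offending line under each violation
    style = pep8.StyleGuide(show_source=True)
    report = style.check_files(['wikipedia.py'])  # placeholder path
    print('%d style issue(s) found' % report.total_errors)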


Any advice on stripping wiki markup to obtain plain text from the Wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command-line tool in any other language would do as well: accept .xml.bz2, strip the wiki markup, and return something that's easily processable by further tools in a single file. Thanks in advance.


Since I am using the MediaWiki extracts API, I never had to find/write my own Wikitext parser. However, I did run into a couple in my research that seemed relatively popular:

- https://github.com/dcramer/py-wikimarkup (converts wikitext to HTML using Python; you'd need to extract text with BeautifulSoup or something)

- http://wiki.eclipse.org/Mylyn/Incubator/WikiText (also to HTML, but in Java)

- https://github.com/earwig/mwparserfromhell

I'm sure if you did a bit more digging you could find a C# library that does this, or you could roll your own pretty easily using the others as a model.
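
For what it's worth, mwparserfromhell makes the Python route very short. A quick sketch using its documented strip_code() method:

    import mwparserfromhell

    raw = "'''Python''' is a [[programming language]] created by [[Guido van Rossum]]."
    wikicode = mwparserfromhell.parse(raw)

    # strip_code() drops templates, tags and markup, keeping display text
    print(wikicode.strip_code())
    # -> Python is a programming language created by Guido van Rossum.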


Easy? Haven't they got a Turing-complete language hiding in there?


Thanks!


I've used (and patched) this alternative: https://github.com/richardasaurus/wiki-api


You might really want not to cache everything that comes back in a never-ending Python dictionary. Look up "memory leak" on Wikipedia ;)


Each function has its own cache, and the size of the cache is limited by the number of unique requests you make. How would this be a problem? (totally genuine question)

Also, if you have a better way of doing it, please totally fork it and open a pull request!
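
For context, the worry would be that "number of unique requests" is unbounded in a long-running process. A common fix is to cap the cache and evict the least recently used entry; a rough sketch of the idea (Python 3.2+ ships functools.lru_cache, which does exactly this):

    import collections
    import functools

    def lru_cache(maxsize=128):
        # Memoize a function, keeping at most `maxsize` results.
        def decorator(func):
            cache = collections.OrderedDict()

            @functools.wraps(func)
            def wrapper(*args):
                if args in cache:
                    cache[args] = cache.pop(args)  # re-insert: mark as fresh
                    return cache[args]
                result = func(*args)
                cache[args] = result
                if len(cache) > maxsize:
                    cache.popitem(last=False)      # evict least recently used
                return result
            return wrapper
        return decorator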


Excellent!

Are there any APIs for other languages? I tried to query using Unicode strings and it worked, but I only got English content.


There's MediaWiki::Gateway for Ruby: https://github.com/jpatokal/mediawiki-gateway

Disclaimer: I'm the main author. There are other implementations, but this seems to have become the most popular one.


For a Perl library, see MediaWiki::API - https://metacpan.org/module/MediaWiki::API


GP meant APIs for Wikipedia languages other than English, not other programming languages.


Yes, I thought that might be the case, but I covered both bases because that CPAN module is multilingual.


Thanks! I'm working on some additional features including international support, and I hope to push them within a week.
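
In the meantime, each language edition exposes the same API on its own subdomain, so you can hit, say, fr.wikipedia.org directly. A rough sketch with Requests (the helper name and contact address are mine, not part of the library):

    import requests

    def summary(title, lang='fr'):  # hypothetical helper, not in the library
        resp = requests.get(
            'http://%s.wikipedia.org/w/api.php' % lang,
            params={'action': 'query', 'prop': 'extracts', 'exintro': '',
                    'explaintext': '', 'titles': title, 'format': 'json'},
            headers={'User-Agent': 'example-script/0.1 (me@example.com)'})
        pages = resp.json()['query']['pages']
        return next(iter(pages.values()))['extract']

    print(summary(u'Paris'))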


Is there an API for extracting data from infoboxes?


Great! I'm looking forward to a Ruby version!



There are a couple of implementations for Ruby, but I haven't been able to get them to work.

So, I wrote one that was dead simple, but I have only implemented the API features I needed, which are mostly related to deleting things, as I use it for mass spam deletion. If you're interested, I'll upload it somewhere, but like I said, it's very incomplete.




