Fantastic job Johnathon. To HN readers, I just want to point out that this awesome library was created by a high school student (who is extremely motivated and builds a bunch of stuff)!
This is Hacker News at its best. Highlighting creation.
Interesting to note that this was submitted a few days ago and received only 3 points - I mention this because this is the sort of thing that should have made the front page the first time it was posted!
I'm thinking of writing a weekly post that highlights the things people created and posted to Hacker News only to fall on deaf ears (thankfully this is not the case with this post!).
In fact I'm going to go and write it now (and create!).
Edit 2: And while I have the chance for something not to fall on deaf ears, here's something I just wrote that would be interesting to anyone who's annoyed with recruiters and would rather work at SpaceX than Snapchat.
Wait, HN doesn't mind reposts? Shut the front door! o_O
Also, I wistfully disagree with you about the "highlighting creation" bit. I mean, FSM knows I want to agree with you but my experience so far says otherwise. In all the time I've spent on HN (most of it as a lurker) I've found HN to be quite snobbish about the show-and-tell attempts.
Then again, maybe I am experiencing sour grapes since my own Show HN posts seem to disappear rapidly even before I can say, "Hey HN, loo-"...
:(
I'm thinking of creating an HN spin-off for young, upcoming devs to do a Show-And-Tell about their recent attempts at learning/developing. Heck I've been dying to give discourse ~~(the django-based discussions platform)~~ a try, maybe I'll finally get around to it now. In fact, I'm going to go and write it now... (Sorry, couldn't resist. ;) Not meant as a dig.)
Question is, should I do a Show HN, when it is done? :P
EDIT: Turns out discourse is rails-based, not django-based. Still gonna give it a try, I guess... :(
Caveat: I don't know Rails. I've only recently started teaching myself to code/program/develop. So I guess now is as good a time as any to start with Rails, eh?
8 months ago I jumped straight into working with Django/Python - one fine day, I made a list of project ideas that I've had in my head for a while and started chalking out the corresponding algos & coding them without a care about how 'bad' my code was going to look. (Yeah, I belong to the 'learn first, refine later' school of thought.)
8 months since I first started, the score is two ideas done, six more to go. Wait, scratch that, seven more to go. Wish me luck!
I can only agree. I started learning to code because I wanted to accomplish one specific thing (downloading and parsing XML from a weather source to look at historical weather data for five places).
I started with Python, learned a bit of SQLite along the way, learned about parsing files (good for understanding our devs at work better) and so on. Now I've gone further, got a little sandbox server at work (I'm an editor), and automated some really tedious jobs with Python. And did some fun things to make life better for our editorial team.
Yes, I am probably still writing spaghetti code, and there are a lot of things left that I'd really like to learn, but with time comes understanding and I get better week by week.
When I look at code from 3 months ago, I get the urge to refactor it. But it works, it runs as it should, and today I am more inclined to learn new things and get new things done first.
Actually, if your goal is to set up a discussion forum, then you can get away with "ignoring" the code behind Discourse (except for some YAML config files).
Please note that this API does not make any specific attempt to obey the MediaWiki etiquette (http://www.mediawiki.org/wiki/API:Etiquette). This sort of API is easy and clean for something like a command-line script, but if you're going to do further automation or crawling I strongly recommend using the Pywikipediabot library (http://www.mediawiki.org/wiki/Manual:Pywikipediabot), which includes a very full API, has tunable throttling, and makes a more direct attempt to require a user-agent string in line with the API etiquette.
If you just want a bash script to look things up on wikipedia, you can always use something like
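(The snippet itself didn't survive in this comment. As a rough, stdlib-only sketch of that kind of lookup — the function names and User-Agent string below are my own, not part of any library:)

```python
# Minimal sketch: look up an article's plain-text intro via the
# MediaWiki "extracts" API, using only the Python standard library.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def build_url(title):
    """Build a MediaWiki API query URL for the intro extract of `title`."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,   # plain text instead of HTML
        "exintro": 1,       # intro section only
        "format": "json",
        "titles": title,
    }
    return API + "?" + urllib.parse.urlencode(params)

def fetch_intro(title):
    """Fetch and return the intro extract for `title`."""
    # A descriptive User-Agent, since the etiquette page asks for one.
    req = urllib.request.Request(
        build_url(title),
        headers={"User-Agent": "example-lookup-script/0.1 (contact: you@example.com)"},
    )
    with urllib.request.urlopen(req) as resp:
        pages = json.load(resp)["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")
```

Something like `fetch_intro("Hacker News")` would return the article's intro paragraph as plain text.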
Thanks for bringing this to my attention. I've added a disclaimer to the GitHub page regarding Pywikipediabot and plan to make changes to fully comply with MediaWiki API etiquette. The last thing I'd want to do is inadvertently cause problems for the site or foundation.
I would go one step further and suggest that people who need structured queries use the Google BigQuery API to query the structured Wikipedia data. Granted, the public dataset is from 2010, so it's slightly outdated, but you can write structured SQL against all of the Wikipedia article metadata and then use the MediaWiki API itself to grab only the article text that you're interested in.
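(For flavor, a sketch of that workflow. The table name and legacy-SQL dialect here are my assumptions about the old public sample dataset, not something verified against the current offering:)

```python
# Hypothetical sketch: select article metadata with SQL, then fetch
# only the matching article text through the MediaWiki API.
sql = """
SELECT title, id, revision_id
FROM [publicdata:samples.wikipedia]
WHERE wp_namespace = 0           -- main/article namespace only
  AND title CONTAINS 'Python'
LIMIT 10
"""

# Actually running it needs the separately installed BigQuery client, e.g.:
#   from google.cloud import bigquery
#   rows = bigquery.Client().query(sql).result()
# ...then pass each row's `title` to the MediaWiki extracts API for text.
```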
Nice work on this. It's always good to see people giving more visibility to MediaWiki's capabilities.
To engineers who would like to use this library, I would give a caveat that there are way more API actions than the ones enumerated here. So if you're looking to make some contributions to a project, this one is rife with possible pull requests.
In terms of article access and analysis, I'd recommend looking at Pattern (https://github.com/clips/pattern) before starting with this library. Not only do you get access to the rest of Pattern's IR/text analysis capabilities, but the approach in Pattern is written to support any site built off of MediaWiki, and not just English Wikipedia. This means not only foreign-language Wikipedia instances, but all 350k wikis on Wikia, and Project Gutenberg (which, interestingly enough, runs MW 1.13).
For PEP-8 compliance, I would recommend looking into using an IDE in some part of your workflow that has active inspections for things like PEP-8 compliance. I personally use PyCharm.
Any advice on stripping wiki markup to obtain plain text from the wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command line tool in any other language would do as well - accept .xml.bz2, strip the wiki markup, return something that's easily processable by further tools in a single file. Thanks in advance.
Since I am using the MediaWiki extracts API, I never had to find/write my own Wikitext parser. However, I did run into a couple in my research that seemed relatively popular:
I'm sure if you did a bit more digging you could find a C# library that does this, or you could roll your own pretty easily using the others as a model.
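(To illustrate the "roll your own" route: a very rough sketch that strips a few common wikitext constructs with regular expressions. Real parsers handle templates, tables, and nesting far more carefully; this is only a starting point, and the function name is my own.)

```python
import re

def strip_wikitext(text):
    """Crudely reduce wikitext markup to plain text."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                          # {{templates}} (unnested only)
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)   # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                   # ''italic'' / '''bold'''
    text = re.sub(r"<[^>]+>", "", text)                                 # inline HTML/<ref> tags
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)     # == headings ==
    return text
```

For the bulk-dump use case, you'd feed this the text of each `<page>` element pulled from the .xml.bz2 with a streaming XML parser.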
Each function has its own cache, and the size of the cache is limited by the number of unique requests you make. How would this be a problem? (totally genuine question)
Also, if you have a better way of doing it, please totally fork and request a pull!
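(For readers following along, a sketch of the kind of per-function cache being discussed — this is an assumed shape, not the library's actual implementation: each decorated function gets its own dict, keyed by arguments, so it grows only with the number of unique requests.)

```python
import functools

def cached(func):
    """Memoize `func` with its own private cache dict."""
    cache = {}                      # one cache per decorated function
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)   # only unique argument tuples hit the network
        return cache[args]
    wrapper.cache = cache           # exposed so callers can inspect or clear it
    return wrapper
```

Repeated calls with the same arguments return the stored result, which is why the cache size is bounded by the number of unique requests.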
There are a couple of implementations for Ruby, but I haven't been able to get them to work.
So, I wrote one that was dead simple, but have only implemented the API features I needed, which is mostly related to deleting things, as I use it for mass spam deletion. If you're interested, I'll upload it somewhere, but like I said, it's very incomplete.
Edit 1: If anyone is on Medium, here's my draft.
https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km...
And here's the recruiters/SpaceX piece mentioned in Edit 2:
https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km...