

Parsing Wikipedia Articles with Node.js and jQuery - BenjaminCoe
https://github.com/bcoe/wikifetch

======
mkl
There are lots of attempts to write new Wikipedia parsers that just do "the
useful stuff", like getting the text. They all fail, for the simple reason
that some of the text comes from MediaWiki templates.

E.g.

    
    
      about {{convert|55|km|0|abbr=on}} east of
    

will turn into

    
    
      about 55 km (34 mi) east of
    

and

    
    
      {{As of|2010|7|5}}
    

will turn into

    
    
      As of 5 July 2010
    

and so on (there are thousands of relevant templates). It's simply not
possible to get the full plain text without processing the templates, and the
only system that can correctly and completely parse the templates is MediaWiki
itself.

Yes, it's a huge system entirely written in PHP, but you can make a simple
command-line parser with it pretty easily (though it took me quite a while to
figure out how). The key points are to put something like

    
    
      $IP = strval(getenv('MW_INSTALL_PATH')) !== ''
            ? getenv('MW_INSTALL_PATH')
            : '/usr/share/mediawiki';
      require_once("$IP/maintenance/commandLine.inc");
    

at the start of it, and then use the Parser class. You get HTML out, but it's
simple and well-formed (to get plain text, start with the top-level <p> tags).
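
For illustration, a minimal wikitext-to-HTML script built that way might look
roughly like this (just a sketch, assuming an older MediaWiki where
commandLine.inc sets up the global $wgParser; it reads wikitext from stdin,
and the title used for parsing is arbitrary):

      <?php
      // Set up the MediaWiki environment, as above.
      $IP = strval(getenv('MW_INSTALL_PATH')) !== ''
            ? getenv('MW_INSTALL_PATH')
            : '/usr/share/mediawiki';
      require_once("$IP/maintenance/commandLine.inc");

      // Read raw wikitext from stdin.
      $wikitext = file_get_contents('php://stdin');

      // Parse it; templates are expanded against the local database.
      $title   = Title::newFromText('CommandLineParse');  // arbitrary title
      $options = new ParserOptions();
      $output  = $wgParser->parse($wikitext, $title, $options);

      // ParserOutput::getText() returns the rendered HTML.
      echo $output->getText();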

To get it to process templates, get a Wikipedia dump, extract the templates,
and use the mwdumper tool to import them into your local MediaWiki database.
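
For reference, the import step looks roughly like this (a sketch; the file and
database names are placeholders, and your MySQL credentials will differ):

      java -jar mwdumper.jar --format=sql:1.5 templates-only.xml \
          | mysql -u wikiuser -p wikidb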

I don't know if this is the best or "right" way to do it, but it's the only
way I've found that actually works.

~~~
alok-g
>> To get it to process templates, get a Wikipedia dump, extract the
templates, and use the mwdumper tool to import them into your local MediaWiki
database.

Could you please explain this more? Specifically, what is meant by
"extracting" the templates? From what I gather from your message, you are
proposing using MediaWiki itself to process the templates and output something
closer to plain text (within the HTML output).

~~~
mkl
The Wikipedia dumps come as big bzip2-compressed XML files containing all the
articles, templates, etc., each in a "page" tag. The templates have titles
starting with "Template:", so they are easy to detect:

    
    
      <page>
        <title>Template:Convert</title>
        ...
      </page>
    

It's these page tags that need to be copied to a new XML file, along with the
header and footer from the original.
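
As an illustration, a filter along those lines could look something like this
(just a sketch; it assumes the dump has already been decompressed, the file
names are placeholders, and a real script should copy the original <siteinfo>
header rather than the bare wrapper used here):

      <?php
      // Stream the dump and keep only <page> elements whose title
      // starts with "Template:".
      $reader = new XMLReader();
      $reader->open('enwiki-pages-articles.xml');

      $out = fopen('templates-only.xml', 'w');
      fwrite($out, "<mediawiki>\n");  // stand-in for the real header

      while ($reader->read()) {
          if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'page') {
              $page = $reader->readOuterXml();
              if (strpos($page, '<title>Template:') !== false) {
                  fwrite($out, $page . "\n");
              }
          }
      }

      fwrite($out, "</mediawiki>\n");
      fclose($out);
      $reader->close();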

 _From what I gather from your message, you are proposing using MediaWiki
itself to process the templates and output something closer to plain text
(within the HTML output)._

Correct. The MediaWiki parser outputs HTML containing all the text to be
displayed, including that generated by templates.

------
tillk
This is interesting, but why not use their API?

<http://en.wikipedia.org/w/api.php>

It's part of MediaWiki and available for each and every Wikipedia subsite, as
far as I can tell. We are using it ourselves for autocompleting data, and it
works really well.

I prefer this method over 'scraping' the content.
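
For example, fetching the parsed HTML of a single article is roughly a request
like this (action=parse is one of the standard api.php modules; the page name
is just an example):

      http://en.wikipedia.org/w/api.php?action=parse&page=Albert_Einstein&format=json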

~~~
logn
Wikipedia prefers and advises this method too.

------
decad
This is very interesting, although anyone aiming to crawl Wikipedia should
make sure they read this section on the Database download page.
[http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why...](http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F)

Everything should be fine as long as you respect their one-request-per-second
rule and their robots.txt.

~~~
kefs
A quick skim of the source shows that rate-limiting is not implemented, and
the code is non-compliant with Wikipedia's crawling rules.

~~~
BenjaminCoe
Thanks for the heads up. I'll add rate-limiting directly into the API.

------
taliesinb
For anyone who might find it useful: I wrote this really simple spidering
tool in Go; it comes in handy when you just want a small subgraph of
Wikipedia.

<https://github.com/taliesinb/wikispider>

------
kenshiro_o
That looks really good and neat! I am currently working on a project that
uses information from Wikipedia articles, and having a parser such as yours
would make things a lot easier. I am on vacation for the next two weeks, but
I'd like to fork your project when I get back. Let me know if there is
anything you need help with (bug fixes or new features).

~~~
BenjaminCoe
I would love the help. I'll be hammering on the library next weekend, and I'm
sure I'll have a laundry list of features I want to add.

------
bsb
There are also the DBpedia interfaces, including SPARQL access. NLP, meet
SemWeb; SemWeb, meet NLP. Or have you met already?

