

The Parser that Cracked the MediaWiki Code - rams
http://dirkriehle.com/2011/05/01/the-parser-that-cracked-the-mediawiki-code/

======
neilk
This isn't the first alternative parser for MediaWiki content -- there are 28
rows in this table. (I just added Sweble's and my own project...)

[http://www.mediawiki.org/wiki/Alternative_parsers#Known_impl...](http://www.mediawiki.org/wiki/Alternative_parsers#Known_implementations)

Most of these are special-purpose hacks. Kiwi and Sweble are the most serious
projects I'm aware of that have tried to generate a full parse.

However, few of these projects are useful for upgrading Wikipedia itself. Even
the general parsers like Sweble are effectively special-purpose, since we have
a lot of PHP that hooks into the parser and warps its behaviour in
"interesting" ways. The average parser geek usually wants to write to a
cleaner spec in, well, any language other than PHP. ;)

Currently the Wikimedia Foundation is just starting a MediaWiki.next project.
Parsing is just one of the things we are going to change in major ways --
fixing this will make it much easier to do WYSIWYG editing or to publish
content in ways that aren't just HTML pages.

(Obviously we will be looking at Sweble carefully.)

If this sounds like a fun project to you, please get in touch! Or check out
the "Future" portal on MediaWiki.org.

<http://www.mediawiki.org/wiki/Future>

~~~
knowtheory
Hey Neilk!

Did you ever turn up anything regarding this?
<http://news.ycombinator.com/item?id=2216249>

btw, neat JS parser, I'll have to check it out. :)

~~~
neilk
FYI the JS parser is broken for some cases, but it works great for most things
you want from message strings.

As for your original question, I don't think there is a forum that tries to
unite the left-brained and right-brained wikipedians. There is a bit of a
divide. I'll send an email right now to someone who might know better.

We don't have contests per se to try to steer the community, other than I
guess GSoC, or reaching out to developers that we think are already doing good
things.

------
sigil
It's great to see people tackling this problem, but I wouldn't declare victory
for sweble just yet ("The Parser That Cracked..."). There are other promising
MediaWiki parser efforts out there.

For one, sweble is a Java parser, and I'm not sure this makes it a good drop-
in replacement for the current MediaWiki PHP code. The DBPedia Project also
has what looks like a decent AST-based Java parser [1]. I would be interested
in a comparison between sweble and DBPedia's WikiParser.

I stumbled across a very nice MediaWiki scanner and parser in C a while ago
[2]. It uses ragel [3] for the scanner; the parser is not a completely generic
AST builder, but is rather specific to the problem of converting MediaWiki
markup to some other wiki markup. It does do quite a bit of the parser work
already though.

Presumably a PHP extension around a C or C++ scanner/parser could someday
replace the current MediaWiki parsing code.

[1]
[http://wiki.dbpedia.org/DeveloperDocumentation/WikiParser?v=...](http://wiki.dbpedia.org/DeveloperDocumentation/WikiParser?v=hdy)

[2] <http://git.wincent.com/wikitext.git>

[3] <http://www.complang.org/ragel/>

~~~
ZoFreX
Given the complexity of Wikipedia's deployment compared to a typical MediaWiki
installation, it really wouldn't be much effort to hook into a parser in, say,
Java rather than PHP, and it would be well worth doing if it had significant
benefits.

Of course, a PHP parser would still have to be maintained in parallel as not
everyone would be able to do the Java option.

~~~
sigil
> Given the complexity of Wikipedia's deployment compared to a typical
> MediaWiki installation, it really wouldn't be much effort to hook into a
> parser in, say, Java rather than PHP...

No doubt the incremental complexity for Wikipedia would be small in relative
terms. I assume that argument would support a variety of proposals.

A solid scanner and parser in C/C++ would benefit a broader audience though.
All the major scripting languages can be extended in C/C++. In fact, the
ragel-based parser I mentioned earlier [1] was built to be used from within
Ruby code.
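
For instance, binding a hypothetical C parser from Python takes only a few
lines of ctypes ("libwikitext.so" and wikitext_to_html are made-up names here,
just to show the pattern):

    import ctypes

    # Hypothetical: assumes a C library exporting
    #     char *wikitext_to_html(const char *source);
    lib = ctypes.CDLL("./libwikitext.so")
    lib.wikitext_to_html.argtypes = [ctypes.c_char_p]
    lib.wikitext_to_html.restype = ctypes.c_char_p

    def render(source):
        return lib.wikitext_to_html(source.encode("utf-8")).decode("utf-8")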

[1] <http://git.wincent.com/wikitext.git>

------
bjonathan
Site down, here is a mirror: <https://www.readability.com/articles/r9i55x6e>

cache version:
[http://webcache.googleusercontent.com/search?q=cache:8xjwEj-...](http://webcache.googleusercontent.com/search?q=cache:8xjwEj-4nrcJ:dirkriehle.com/2011/05/01/the-parser-that-cracked-the-mediawiki-code/+http://dirkriehle.com/2011/05/01/the-parser-that-cracked-the-mediawiki-code/&cd=1&hl=fr&ct=clnk&gl=fr&source=www.google.fr)

~~~
rwolf
Your Readability link redirects me to Readability's home page.

~~~
VMG
Ditto (because an upvote doesn't suffice anymore)

~~~
bjonathan
Mirror of the mirror:
[http://dl.dropbox.com/u/2577298/The%20Parser%20that%20Cracke...](http://dl.dropbox.com/u/2577298/The%20Parser%20that%20Cracked%20the%20MediaWiki%20Code%20%20%20webcache.googleusercontent.com%20%20%20Readability.htm)

------
sunir
This is a breakthrough and a welcome one. From an end user's point of view, it
has a couple of major implications.

First, I believe this reveals the complexity of the parser, which implies a
complex syntax, which in turn implies a complex user interface as felt by end
users. A more complex user interface may make it harder to attract new
editors, although it's unclear (to me) whether that is actually the case.

Second, having an AST representation is awesome. It makes it possible to even
think about building a path towards WYSIWYG or some other form of rich text
editing. It was not really possible to build a WYSIWYG editor around the wiki
syntax.

If you have an AST, you can also store the page as the AST since you can
regenerate the wiki syntax from the AST for people who need text-based
editors.
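
To illustrate the round-trip idea (a rough Python sketch, not Sweble's actual
API; the node types here are made up):

    class Text:
        def __init__(self, value):
            self.value = value
        def to_wikitext(self):
            return self.value

    class Bold:
        def __init__(self, children):
            self.children = children
        def to_wikitext(self):
            # Regenerate the ''' markers that the parser consumed
            return "'''" + "".join(c.to_wikitext() for c in self.children) + "'''"

    # A WYSIWYG editor would manipulate the tree directly; text-based
    # editors get the wiki syntax regenerated from it on demand.
    page = [Text("Hello, "), Bold([Text("world")])]
    print("".join(node.to_wikitext() for node in page))  # Hello, '''world'''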

~~~
tokenadult
_A more complex user interface may make it harder to attract new editors_

There may be friction against gaining new editors from the user interface of
the MediaWiki software, but I think the greatest barrier to participation by
new editors is the hostile, drama-filled environment on many controversial
topics on Wikipedia. My evidence for that is the decline, in "unsustainable
fashion,"

[http://strategy.wikimedia.org/wiki/Story_of_Wikimedia_Editor...](http://strategy.wikimedia.org/wiki/Story_of_Wikimedia_Editors#Chapter_Three:_The_future_.282007-present.29)

in the number of Wikipedian administrators, who presumably for the most part
are people who know how to use Wikimedia software. Too many of the best
contributors (people who look up facts in reliable sources and edit articles
for better readability) on Wikipedia feel attacked and that their time is
wasted. I know a lot of dedicated hobbyists who quietly work on their hobby-
related subjects putting together great articles, but on any subject that is
controversial, and for which looking up reliable sources takes some effort,
Wikipedia is becoming a war zone and is not improving in quality.

[http://strategy.wikimedia.org/wiki/Strategic_Plan/Movement_P...](http://strategy.wikimedia.org/wiki/Strategic_Plan/Movement_Priorities#Increase_participation)

[http://strategy.wikimedia.org/wiki/Strategic_Plan/Movement_P...](http://strategy.wikimedia.org/wiki/Strategic_Plan/Movement_Priorities#Improve_Content_Quality)

------
mdaniel
From reading the article, and especially the interesting comments thereon, it
seems this problem is half a bogus "language" specification and half that the
unwashed masses are inputting any damn thing they like and Wikipedia accepts
it.

I suppose this is one of the knobs that must be tuned to balance between
reproducible I/O and turning away meaningful contributions from the community.

------
pornel
AST of an example page is the interesting bit:

[http://sweble.org/crystalball/result?query=ASDF&format=t...](http://sweble.org/crystalball/result?query=ASDF&format=text&stage=postpro&expMode=with_expansion)

------
Semiapies
I hadn't realized that there were any parsing issues around MediaWiki's
markup. 5000 lines of PHP? Eek.

~~~
sigil
It's worse. The MediaWiki PHP code doesn't implement a proper scanner and
parser; it's a bunch of regexes around which the code has grown more or less
organically. Silent compensation for mismatched starting and ending tokens
abounds, and it causes the same kinds of problems for consumers of the markup
that lenient HTML parsers do. The difference is that Wikipedia, as the sole
channel for editing the markup, could easily have rejected syntax errors with
helpful messages instead of silently compensating.
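
A contrived Python sketch of the difference (MediaWiki's real regexes are far
hairier than this, of course):

    import re

    SOURCE = "'''bold text with no closing marker"

    # Lenient, regex-style handling: an unmatched ''' simply doesn't match,
    # so it passes through silently and every consumer of the markup is left
    # to guess what was meant.
    lenient = re.sub(r"'''(.*?)'''", r"<b>\1</b>", SOURCE)

    # Strict handling: a real parser could reject the input at edit time
    # with a helpful message instead.
    def strict(source):
        if source.count("'''") % 2 != 0:
            raise SyntaxError("unmatched ''' (bold) marker; did you forget to close it?")
        return re.sub(r"'''(.*?)'''", r"<b>\1</b>", source)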

If it were anything else, I'd say "who cares," but this is "the world's
knowledge" -- we absolutely should care about the format it's stored in. I'm
glad to see people tackling this problem.

~~~
VMG
Interestingly, Markdown has the same problem.

Another example of the imperfect but working implementation winning.

~~~
seanp2k
Markdown... ugh. Let's just stick to DokuWiki or MediaWiki syntax for
everything, please. If you need something more advanced than that, you should
be using LaTeX. Actually, it'd be cool to build a working MediaWiki + Markdown
=> LaTeX converter... in something like Python.
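
A toy sketch of that idea (regex-based, so it has exactly the problems
discussed elsewhere in this thread; all of it is made up for illustration):

    import re

    # Handles only a few constructs; rule order matters (bold before links).
    RULES = [
        (re.compile(r"^== (.+?) ==$", re.M), r"\\section{\1}"),
        (re.compile(r"'''(.+?)'''"), r"\\textbf{\1}"),
        (re.compile(r"''(.+?)''"), r"\\emph{\1}"),
        (re.compile(r"\[\[(.+?)\]\]"), r"\1"),  # keep the link label, drop the link
    ]

    def wiki_to_latex(source):
        for pattern, replacement in RULES:
            source = pattern.sub(replacement, source)
        return source

    print(wiki_to_latex("== Intro ==\nSee '''[[Sweble]]''', an ''AST-based'' parser."))
    # \section{Intro}
    # See \textbf{Sweble}, an \emph{AST-based} parser.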

~~~
VMG
There is markdown2pdf, written in Haskell, which seems to use XeTeX as an
intermediate step.

Personally I'd be happy to see _any_ markup language becoming the default,
regardless which one it is. Having a proper grammar would be a bonus.

~~~
gbog
"Personally I'd be happy to see any markup language becoming the default."

I don't agree with this. Not all lightweight humane markup languages are born
equal; some are better than others, and MediaWiki's is not among the best.
There now seems to be a trend towards Markdown, but it should be improved
first, and then migrating Wikipedia to this "Markdown 2" could be a really
good thing.

------
car
Site is down due to hard disk problems, but the Sweble Wikipedia Parser
project site that the article actually references is at <http://www.sweble.org>.

------
brianjolney
link died. any mirrors?

~~~
pornel
Googlecache:
[http://webcache.googleusercontent.com/search?client=opera...](http://webcache.googleusercontent.com/search?client=opera&rls=en&q=cache:http://dirkriehle.com/2011/05/01/the-
parser-that-cracked-the-mediawiki-
code/&sourceid=opera&ie=utf-8&oe=utf-8&channel=suggest)

~~~
driehle
It's back up at <http://dirkriehle.com> - the project site is actually
<http://sweble.org>, where under the Crystalball Demo you can play with the
parser without having to install anything.

------
seanp2k
I think we killed this poor site.

