
Parsoid in PHP, or There and Back Again - sthottingal
https://phabricator.wikimedia.org/phame/post/view/189/parsoid_in_php_or_there_and_back_again/
======
trynewideas
Reintegrating the parser into MediaWiki's PHP core goes beyond performance.
Many prominent MW features -- particularly the visual editor, translation
functions, and mobile endpoints -- depend heavily on Parsoid/JS, which
required running it as a Node.js microservice, something not all smaller (or
especially shared-hosting) wikis could manage for quite some time.

Bringing Parsoid closer to core makes it easier for non-Wikimedia admins to
use more of MW's modern core features. The performance gains are a nice bonus,
and I suspect they are related to the general improvements in PHP 7.2.

~~~
dingdingdang
Performance gains of "Parsoid/PHP ... roughly twice as fast on most requests
as the original version" are a bit more than a casual bonus and well into
pretty-dang-awesome territory in my book.

~~~
trynewideas
I totally agree, but on wikis of a smaller scale it's not a difference of
performance; it's potentially being able to use some of these MW components
for the first time. It could be 10x faster than Parsoid/JS and that would be
irrelevant compared to being _possible_ on a service that won't let you run
Parsoid/JS.

------
mhd
> We will (re)examine some of the reasons behind this architectural choice in
> a future blog post, but the loose coupling of Parsoid and the MediaWiki core
> allowed rapid iteration in what was originally a highly experimental effort
> to support visual editing.

I'd really like to read that. The decision to have this parser as a completely
separate component is the main reason why a lot of local MediaWiki
installations completely avoided having a visual editor -- which in turn
probably cost lots of hours and/or left documentation missing, because
WikiCreole ain't exactly a thing of beauty or something that's used in other
places (as opposed to Markdown, which is an ugly beast too, but at least it's
the ugly beast you know).

You need a heavy JS frontend for a visual editor anyway, so why not do it
client-side?

Having to deploy a separate component, probably in an environment you don't
otherwise use at all, is pretty much the worst choice possible. Yes, I'm
aware, you readers here probably do all kinds of hip docker/keights setups
where yet another node microservice ain't nothing special (and should've been
rewritten in Rust, of course), but a wiki is something on a different level of
ubiquity.

------
cscottnet
I'm on the team. Part 2 of this post series should have lots of interesting
technical details for y'all; be patient, I'm still writing it.

But to whet your appetite: we used
[https://github.com/cscott/js2php](https://github.com/cscott/js2php) to
generate a "crappy first draft" of the PHP code for our JS source. We weren't
going for correctness; instead we tried to match code style and syntax so that
we could more easily review git diffs from the crappy first draft to the
"working" version, and concentrate attention on the important bits, not the
boring syntax-change-y parts.

The original legacy MediaWiki parser used a big pile of regexps and had all
sorts of corner cases caused by the particular order in which the regexps were
applied, etc.

Parsoid uses a PEG tokenizer, written with pegjs (we wrote a PHP backend to
pegjs for this project). There are still a bunch of regexps scattered
throughout the code, because they are still very useful for text processing
and a valuable feature of both JavaScript and PHP as programming languages,
but they are not the primary parsing mechanism. Translating the regexps was
actually one of the more difficult parts, because there are some subtle
differences between JS and PHP regexps.
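
To give a flavour of the kind of subtlety involved, here's one well-known
gotcha as a toy illustration (this is not code from the Parsoid source): in
PCRE, '$' also matches just before a trailing newline unless the 'D' modifier
is set, whereas a JS '$' without the 'm' flag only matches at the very end of
the string.

    <?php
    // Toy example, not Parsoid code: '$' behaves differently in PCRE and JS.
    $subject = "foo\n";
    var_dump( preg_match( '/foo$/',  $subject ) );  // int(1) -- PCRE matches before the "\n"
    var_dump( preg_match( '/foo$/D', $subject ) );  // int(0) -- 'D' gives the JS-like behaviour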

We made a deliberate choice to switch from JS-style loose typing to strict
typing in the PHP port. Whatever you consider the long-term merits to be for
maintainability, programming-in-the-large, etc., strict types were _extremely useful_
for the porting project itself, since they caught a bunch of non-obvious
problems where the types of things were slightly different in PHP and JS. JS
used anonymous objects all over the place; we used PHP associative arrays for
many of these places, but found it very worthwhile to take the time to create
proper typed classes during the translation where possible; it really helped
clarify the interfaces and, again, catch a lot of subtle impedance mismatches
during the port.
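
As a rough sketch of what that translation looked like in practice (the class
and field names below are made up for illustration, not actual Parsoid code):
a JS anonymous object like { offset: 0, inTemplate: false } could be carried
over as a loose associative array, or promoted to a small typed class so that
mismatches surface immediately under strict_types.

    <?php
    declare( strict_types = 1 );

    // Hypothetical sketch, not actual Parsoid code.
    // Option 1: keep the JS anonymous object as a loose associative array.
    $state = [ 'offset' => 0, 'inTemplate' => false ];

    // Option 2: promote it to a small typed class; strict_types makes the
    // engine reject wrongly-typed arguments instead of silently coercing them.
    class TokenizerState {
        /** @var int */
        public $offset;
        /** @var bool */
        public $inTemplate;

        public function __construct( int $offset = 0, bool $inTemplate = false ) {
            $this->offset = $offset;
            $this->inTemplate = $inTemplate;
        }
    }

    $state = new TokenizerState( 0, false );
    // new TokenizerState( 'five' );  // TypeError under strict_types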

We tried to narrow scope by not converting _every_ loose interface or
anonymous object to a type -- we actually converted as many things as possible
to proper JS classes in the "pregame" before the port, but the important thing
was to get the port done and complete as quickly as possible. We'll be
continuing to tighten the type system -- as much for code documentation as
anything else -- as we address code debt moving forward.

AMA, although I don't check Hacker News frequently, so I can't promise to
reply.

~~~
lovasoa
One question: did you investigate why the PHP version was so much faster than
the JS one? Do you think the performance gains of the PHP version could be
achieved in JS, or do you use any special feature of the PHP interpreter?

~~~
subbu_ss
No, we haven't investigated it yet since we haven't had the time to do it.
But we've filed a task for maybe someone to look at it (
[https://phabricator.wikimedia.org/T241968](https://phabricator.wikimedia.org/T241968)
), but our hunch is that it would be incorrect to conclude that PHP is
faster than JS.

~~~
cscottnet
A slightly longer answer is that we looked into a number of possible reasons
and it's not clear there's an easy answer. There are lots of differences
between the two setups, and every time we come up with a possible answer like
"oh, it's the reduced network API latency" we come up with a counter like "but
html2wt is also faster and it does barely any network requests". Casual
investigation raises more questions than answers. So we've looked into it but
don't yet have an answer that we fully believe.

------
lolphp111
Why? The editor needs a frontend in JavaScript anyway, so why not handle this
all in real time on the client?

Now they've rewritten it in PHP, which is probably one of the worst languages
out there, and why not rewrite in something compiled if speed was the main
reason for a rewrite?

For me PHP sits in the middle as a poor language, and still slow compared to
any compiled language. Also, I would want to see some wasm vs. PHP benchmarks
they did before starting with PHP.

Lots of poor decisions from the wiki team.

~~~
Piskvorrr
Lots of assumptions from you. "PHP bad, PHP slow, everyone has fast client."
Did you assume it was 2010?

~~~
lolphp111
Well, actually I made zero assumptions. There are tonnes of markdown editors
that work in the browser in real time. I assume the wiki syntax is not that
much "heavier" to work with; if it is, that's another lol for the wiki team.

And by PHP I mean: if this rewrite is done partly because of speed, why not
build the parser in a compiled language? It's just silly that you have to work
with PHP, as it's one of the worst languages out there in terms of DX and
features.

I assume it's 2020, where clients are "fast enough" to handle a syntax
transformation like markdown -> HTML.

~~~
Piskvorrr
"For me PHP sits in the middle as a poor language, and still slow compared to
any compiled languages." No assumptions whatsoever.

2020: mobile clients. Yes, they're where the rest of the web was in 2005. Yes,
they're ubiquitous.

~~~
lolphp111
I can't believe users on legacy phones (phones from, say, the 2010-ish era)
would be actively writing new content on Wikipedia. I don't care if you have
the newest iPhone, you still don't write anything, you just read. Those
0.00001% of users who actually create new content are most def not using a
phone to do this.

> "For me PHP sits in the middle as a poor language, and still slow compared
> to any compiled languages." No assumptions whatsoever.

It's not an assumption. PHP is slower than a compiled language, simple and
easy. Need speed? Do it in a compiled language. Period.

~~~
kmaher
Not exactly accurate regarding mobile editing. Edits on mobile-heavy wikis are
up 18% YoY, which we attribute to improvements in the mobile editing
interface. We have people who edit on T9 interfaces. Not many, but they exist.

------
tjpnz
> Parsoid/JS had very few unit tests focused on specific subsections of code.
> With only integration tests we would find it difficult to test anything but
> a complete and finished port.

I found this a little frightening given Parsoid/JS is handling user input.

~~~
cscottnet
There are thousands of integration tests. The "correct" output of the parser
is well-known for a given input, and those test cases have been accumulating
for over a decade. But the internal structure of the parser is much more
fluid, and so it wasn't (historically) thought worthwhile to try to write
tests against that shifting target.

------
znpy
The last time I tried getting MediaWiki up and running as a personal wiki I
found that getting Parsoid working was quite a mess. Hopefully now it will be
easier to get a fully fledged MediaWiki installation, together with a visual
editor.

~~~
cscottnet
Yes. We're not there yet, but that's the goal!

~~~
znpy
awesome! keep rocking!

------
TicklishTiger
I would assume that the code is open source?

For some reason, I did not manage to find it. Neither linked from this
article, nor via the MediaWiki page:

[https://www.mediawiki.org/wiki/Parsoid](https://www.mediawiki.org/wiki/Parsoid)

Nor via the Phabricator page:

[https://phabricator.wikimedia.org/project/profile/487/](https://phabricator.wikimedia.org/project/profile/487/)

What am I missing?

~~~
perlgeek
> What am I missing?

Google fu.

parsoid source code =>
[https://github.com/wikimedia/parsoid](https://github.com/wikimedia/parsoid)

~~~
TicklishTiger
Woops .. so they use Phabricator for issue management and GitHub to host the
code?

~~~
perlgeek
The GitHub repo is a mirror. It has links to more documentation on how the
development process works.

~~~
TicklishTiger
I saw that. But why not have the code on Phabricator?

~~~
jedieaston
Here it is on Phabricator:
[https://phabricator.wikimedia.org/diffusion/GPAR/](https://phabricator.wikimedia.org/diffusion/GPAR/)

------
harryf
Years ago I rewrote the wiki parser for DokuWiki (which is used at
[https://wiki.php.net/](https://wiki.php.net/) among other places). Originally
the parser scanned a wiki page multiple times using various regular
expressions. I used a stack machine as a way to manage the regular
expressions, which made it possible to parse a page in a single pass - it's
documented here:
[https://www.dokuwiki.org/devel:parser](https://www.dokuwiki.org/devel:parser)
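
To sketch the stack-machine idea (a heavily reduced toy, not the actual
DokuWiki lexer): the current mode sits on top of a stack, each mode owns the
regexes that are legal inside it, and matching an enter/exit pattern pushes or
pops the stack, so the page is scanned left to right exactly once.

    <?php
    // Toy single-pass lexer driven by a mode stack (illustrative only).
    $patterns = [
        'base'   => [ '/\*\*/' => 'push:strong' ],  // "**" enters strong mode
        'strong' => [ '/\*\*/' => 'pop' ],          // "**" leaves it again
    ];

    $text   = 'Some **bold** text.';
    $stack  = [ 'base' ];
    $offset = 0;
    $tokens = [];

    while ( $offset < strlen( $text ) ) {
        $mode = end( $stack );
        $best = null;
        // Find the earliest match among the patterns legal in the current mode.
        foreach ( $patterns[$mode] as $regex => $action ) {
            if ( preg_match( $regex, $text, $m, PREG_OFFSET_CAPTURE, $offset )
                && ( $best === null || $m[0][1] < $best[1] )
            ) {
                $best = [ $action, $m[0][1], strlen( $m[0][0] ) ];
            }
        }
        if ( $best === null ) {  // no special syntax left: flush the rest as text
            $tokens[] = [ 'cdata', substr( $text, $offset ) ];
            break;
        }
        [ $action, $pos, $len ] = $best;
        if ( $pos > $offset ) {  // plain text before the match
            $tokens[] = [ 'cdata', substr( $text, $offset, $pos - $offset ) ];
        }
        if ( $action === 'pop' ) {
            $tokens[] = [ $mode . '_close', null ];
            array_pop( $stack );
        } else {
            $new = substr( $action, 5 );  // strip the "push:" prefix
            $tokens[] = [ $new . '_open', null ];
            $stack[] = $new;
        }
        $offset = $pos + $len;
    }

    print_r( $tokens );
    // cdata "Some ", strong_open, cdata "bold", strong_close, cdata " text."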

A nice (unexpected) side effect is that it became much easier for people to
extend the parser with their own syntax, leading to an explosion of plugins (
[https://www.dokuwiki.org/plugins?plugintype=1#extension__tab...](https://www.dokuwiki.org/plugins?plugintype=1#extension__table)
)

I'm no expert on parsing theory, but I have the impression that applying
standard source-code parsing approaches (building syntax trees, attempting to
express the language as a context-free grammar, etc.) is the wrong approach
for parsing wiki markup, because wiki markup is context-sensitive. There's
some discussion of the problem here:
[https://www.mediawiki.org/wiki/Markup_spec#Feasibility_study](https://www.mediawiki.org/wiki/Markup_spec#Feasibility_study)

Another challenge for wiki markup, from a usability perspective: if a user
gets part of the syntax of a page "wrong", you need to show them the end
result so they can fix the problem, rather than have the entire page "fail"
with a syntax error.

From looking at many wiki parsers before rewriting the DokuWiki parser, what
_tends_ to happen when people try to apply context-free grammars or build
syntax trees is that they reach 80% and then stumble at the remaining 20% of
edge cases of how wiki markup is actually used in the wild.

Instead of building an object graph, the DokuWiki parser produces a simple
flat array representing the source page (
[https://www.dokuwiki.org/devel:parser#token_conversion](https://www.dokuwiki.org/devel:parser#token_conversion)
), which I'd argue makes it simpler to write code for rendering output (hence
lots of plugins), as well as being more robust at handling "bad" wiki markup
it might encounter in the wild - less chance of some kind of infinite
recursion or similar.
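
For a feel of the shape of that output, here's a simplified sketch (not the
exact DokuWiki instruction format; the real one is documented in the link
above): an ordered list of renderer calls, and rendering is just a linear walk
over it.

    <?php
    // Simplified sketch of a flat "instruction list" and a renderer walking it.
    $instructions = [
        [ 'header',       [ 'Intro', 1 ] ],
        [ 'p_open',       [] ],
        [ 'cdata',        [ 'Some ' ] ],
        [ 'strong_open',  [] ],
        [ 'cdata',        [ 'bold' ] ],
        [ 'strong_close', [] ],
        [ 'cdata',        [ ' text.' ] ],
        [ 'p_close',      [] ],
    ];

    class XhtmlRenderer {
        public $doc = '';
        public function header( $text, $level ) { $this->doc .= "<h$level>" . htmlspecialchars( $text ) . "</h$level>\n"; }
        public function p_open() { $this->doc .= '<p>'; }
        public function p_close() { $this->doc .= "</p>\n"; }
        public function cdata( $text ) { $this->doc .= htmlspecialchars( $text ); }
        public function strong_open() { $this->doc .= '<strong>'; }
        public function strong_close() { $this->doc .= '</strong>'; }
    }

    // Rendering (or writing a plugin) is a linear walk over the flat list --
    // no tree to traverse.
    $renderer = new XhtmlRenderer();
    foreach ( $instructions as [ $method, $args ] ) {
        $renderer->$method( ...$args );
    }
    echo $renderer->doc;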

Ultimately it's a similar discussion to the SAX vs. DOM debates people used
to have around XML parsing (
[https://stackoverflow.com/questions/6828703/what-is-the-difference-between-sax-and-dom](https://stackoverflow.com/questions/6828703/what-is-the-difference-between-sax-and-dom)
). From a glance at the Parsoid source they seem to be taking a DOM-like
approach - I wish them luck with that - my experience is that this will
probably lead to a great deal more complexity, especially when it comes to
edge cases.

~~~
cscottnet
It sounds like you were describing the lexer pass in your description of
DokuWiki. Indeed, tokenization is a very hard problem for wikitext. We use a
pegjs grammar for it, but it contains lots of lookahead, special conditions,
novel extensions, etc. It's hard. Wikitext is messy precisely because it was
intentionally designed to be easy and forgiving to write.

It seems like we've learned many of the same lessons building our parsers.
Markup parsers do seem to be a unique thing, not really like parsing either
programming languages or natural languages. If we ever meet I'm sure we could
happily share a beverage of your choice trading stories.

------
TicklishTiger
Anybody else feeling that strict typing and long var names are not worth all
the visual overload?

Example:
[https://github.com/wikimedia/parsoid/blob/master/src/Parsoid...](https://github.com/wikimedia/parsoid/blob/master/src/Parsoid.php#L235)

This is how I would write the function definition:

    
    
        function html2wikitext($config, $html, $options = [], $data = null)
    

This how Wikimedia did it:

    
    
        public function html2wikitext(
            PageConfig $pageConfig, string $html, array $options = [],
            ?SelserData $selserData = null
        ): string
    

I see this "strictness over readability" on the rise in many places and I
think it is a net negative.

Not totally sure, but this seems to be the old JS function definition:

[https://github.com/abbradar/parsoid/blob/master/lib/parse.js...](https://github.com/abbradar/parsoid/blob/master/lib/parse.js#L70)

A bit cryptic, and it suffers from the typical promise / async / callback /
nesting horror so common in the JavaScript world:

    
    
        _html2wt = Promise.async(function *(obj, env, html, pb)

~~~
Cthulhu_
Your first function definition leans HEAVILY on implicit knowledge - what does
$config mean? What does $html mean? What are valid properties of $config and
$data?

This is fine in smaller codebases, 'your' code, and code that you can read to
the point where you can extrapolate these variables from the implementation,
but it simply does not scale beyond a certain code size - or, more
importantly, a certain number of contributors.

It can be compensated for with documentation (phpDoc), but that is just as
verbose as adding type information, if not more so - although you should
probably do both.

Type systems come into play where you are no longer expected to fully
comprehend the code. They are useful when you are just a consumer / user of
this function and all you want to do is convert some HTML to wikitext without
having to understand the internals of that particular function (and whatever
else goes on beneath it). Types are documentation; they prevent shooting
yourself in the foot, reduce trial-and-error, and save the user from having to
read and comprehend hundreds or thousands of lines of code.

~~~
egeozcan
Why would you add a docblock on top of your already typed functions? Maybe
just a description would be cool, but phpDoc with all the duplicated parameter
definitions? I think it's unnecessary, but I'd love to hear other
perspectives.

~~~
Eremotherium
You can, for example, add descriptions to your params if needed. Also, phpDoc
understands types like string[] for arrays and union types like (int|string)
for untyped params.
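
For instance, something along these lines (a made-up function, purely to
illustrate the split between what the signature enforces and what the docblock
can add on top):

    <?php
    /**
     * Strip the given tags from a fragment of wikitext.
     *
     * The native types are already enforced by the signature; the docblock
     * adds only what the type system can't express: descriptions, array
     * element types, and union types for untyped params.
     *
     * @param string          $wikitext Fragment to clean up.
     * @param string[]        $tagNames Tag names to remove, lower-case.
     * @param int|string|null $revision Revision ID, or "latest".
     * @return string The cleaned fragment.
     */
    function stripTags( string $wikitext, array $tagNames, $revision = null ): string {
        // Body omitted; only the signature/docblock pattern matters here.
        return $wikitext;
    }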

------
echelon
I'm actually curious why PHP was chosen instead of Rust or Go, given that the
parsing team wasn't familiar with the language. I understand that MediaWiki is
written in PHP, but it sounds like they were already comfortable with language
heterogeneity.

They claim,

> The two wikitext engines were different in terms of implementation language,
> fundamental architecture, and modeling of wikitext semantics (how they
> represented the "meaning" of wikitext). These differences impacted
> development of new features as well as the conversation around the evolution
> of wikitext and templating in our projects. While the differences in
> implementation language and architecture were the most obvious and talked-
> about issues, this last concern -- platform evolution -- is no less
> important, and has motivated the careful and deliberate way we have
> approached integration of the two engines.

Which is I suppose a compelling reason for a rewrite if you're understaffed.

I'd still be interested in writing it in Rust and then writing PHP bindings.
There's even a possibility of running a WASM engine in the browser and
skipping the roundtrip for evaluation.

~~~
shimst3r
> Parsoid/PHP also brings us one step closer to integrating Parsoid and other
> MediaWiki wikitext-handling code into a single system, which will be easier
> to maintain and extend.

I assume that Wikimedia works on a rather tight budget. Choosing (and unifying
on) tech stacks with a larger supply of devs seems to be an economically
reasonable choice.

~~~
maxfromua
[https://meta.wikimedia.org/wiki/Wikimedia_Foundation_salarie...](https://meta.wikimedia.org/wiki/Wikimedia_Foundation_salaries)

~~~
krick
Huh. What does the ED of the Wikimedia Foundation even do?

