Parsoid in PHP, or There and Back Again (wikimedia.org)
87 points by sthottingal 15 days ago | 72 comments

Reintegrating the parser into MediaWiki's PHP core goes beyond performance. Many prominent MW features -- particularly the visual editor, translation functions, and mobile endpoints -- depend heavily on Parsoid/JS, which required running it as a Node.js microservice, something not all smaller (and especially shared-hosting) wikis could manage for quite some time.

Bringing Parsoid closer to core makes it easier for non-Wikimedia admins to use more of MW's modern core features. The performance gains are a nice bonus, and I suspect they are related to the general improvements in PHP 7.2.

Performance gains of "Parsoid/PHP ... roughly twice as fast on most requests as the original version" are a bit more than a casual bonus and into the territory of pretty dang awesome in my book.

I totally agree, but on wikis of a smaller scale it's not a difference of performance, it's potentially being able to use some of these MW components for the first time. It could be 10x faster than Parsoid/JS and that's irrelevant compared to being _possible_ on a service that won't let you run Parsoid/JS.

> We will (re)examine some of the reasons behind this architectural choice in a future blog post, but the loose coupling of Parsoid and the MediaWiki core allowed rapid iteration in what was originally a highly experimental effort to support visual editing.

I'd really like to read that. The decision to have this parser as a completely separate component is the main reason why a lot of local MediaWiki installations completely avoided having a visual editor -- which in turn probably created lots of missing hours and/or missing documentation because WikiCreole ain't exactly a thing of beauty or something that's used in other places (as opposed to Markdown, which is an ugly beast, too, but at least the ugly beast you know).

You need a heavy JS frontend for a visual editor anyway, so why not do it client-side?

Having to deploy a separate component, probably in an environment that's not used at all is pretty much the worst choice possible. Yes, I'm aware, you readers here probably do all kinds of hip docker/keights setups where yet another node microservice ain't nothing special (and should've been rewritten in Rust, of course), but a wiki is something on a different level of ubiquity.

I'm on the team. Part 2 of this post series should have lots of interesting technical details for y'all; be patient, I'm still writing it.

But to whet your appetite: we used https://github.com/cscott/js2php to generate a "crappy first draft" of the PHP code for our JS source. Not going for correctness, instead trying to match code style and syntax changes so that we could more easily review git diffs from the crappy first draft to the "working" version, and concentrate attention on the important bits, not the boring syntax-change-y parts.

The original legacy MediaWiki parser used a big pile of regexps and had all sorts of corner cases caused by the particular order in which the regexps were applied, etc.

Parsoid uses a PEG tokenizer, written with pegjs (we wrote a PHP backend to pegjs for this project). There are still a bunch of regexps scattered throughout the code, because they are still very useful for text processing and a valuable feature of both JavaScript and PHP as programming languages, but they are not the primary parsing mechanism. Translating the regexps was actually one of the more difficult parts, because there are some subtle differences between JS and PHP regexps.

We made a deliberate choice to switch from JS-style loose typing to strict typing in the PHP port. Whatever you may consider the long term merits are for maintainability, programming-in-the-large, etc, they were extremely useful for the porting project itself, since they caught a bunch of non-obvious problems where the types of things were slightly different in PHP and JS. JS used anonymous objects all over the place; we used PHP associative arrays for many of these places, but found it very worthwhile to take the time to create proper typed classes during the translation where possible; it really helped clarify the interfaces and, again, catch a lot of subtle impedance mismatches during the port.

We tried to narrow scope by not converting every loose interface or anonymous object to a type -- we actually converted as many things as possible to proper JS classes in the "pregame" before the port, but the important thing was to get the port done and complete as quickly as possible. We'll be continuing to tighten the type system -- as much for code documentation as anything else -- as we address code debt moving forward.

AMA, although I don't check hacker news frequently so I can't promise to reply.

One question: did you investigate why the PHP version was so much faster than the JS one? Do you think the performance gains of the PHP version could be achieved in JS, or do you use any special feature of the PHP interpreter?

No, we haven't investigated it yet since we haven't had the time to do it. We've filed a task for someone to maybe look at it ( https://phabricator.wikimedia.org/T241968 ), but our hunch is that it would be incorrect to conclude that PHP is faster than JS.

A slightly longer answer is that we looked into a number of possible reasons and it's not clear there's an easy answer. Lots of differences between the two setups, and every time we come up with a possible answer like "oh, it's the reduced network API latency" we come up with a counter like "but html2wt is also faster and it does barely any network requests". Casual investigation raises more questions than answers. So we've looked into it but don't yet have an answer that we fully believe.

If you want to dig through the history some: https://github.com/wikimedia/parsoid/blame/6eb00df3e090b20cc... is a pretty good example of the porting technique. You'll see quite a decent number of lines are still unchanged from the "automatic conversion from JS". https://github.com/wikimedia/parsoid/commit/6eb00df3e090b20c... shows what the initial port process was like. Still quite a bit of work, but you'll see it's almost all "real" work that needs a human to think about things, not just mechanical syntax translation. The syntax translation part was done automatically.

Then https://github.com/wikimedia/parsoid/commits/master/src/Ext/... is a not-too-atypical view of the process after the "initial working port" was done (post Aug 2019). Some nasty bugs fixed (https://github.com/wikimedia/parsoid/commit/34fcb4241aa0f3a0... a GC bug in PHP!), some more subtle bugs (PHP's crazy behavior of '$' at the end of a regexp, unless you use the 'D' flag), etc.
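For anyone who has not hit the '$' quirk: a minimal sketch in plain PHP (nothing Parsoid-specific, variable names are only for illustration). By default PCRE lets `$` match just before a trailing newline, whereas a JS regexp without the `m` flag only matches at the very end of the string, so a pattern copied over verbatim can quietly change meaning.

```php
<?php
// Subject ends with a newline, which is where the two engines disagree.
$subject = "foo\n";

// Default PCRE behaviour: `$` also matches right before a final newline.
$withoutD = preg_match('/^foo$/', $subject);

// The D modifier (PCRE_DOLLAR_ENDONLY) makes `$` match only at the very end.
$withD = preg_match('/^foo$/D', $subject);

// `\z` is another way to say "very end of subject", independent of modifiers.
$withZ = preg_match('/^foo\z/', $subject);

var_dump($withoutD, $withD, $withZ);
```

The first call reports a match and the other two do not, which is the JS-like behaviour a port usually wants.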

If you look through the history earlier in 2019, you'll even see JS commits like https://github.com/wikimedia/parsoid/commit/2853a90ceda7cdfa... which are to the JS code (in production at the time) preparing the way for the PHP port. In that particular case, our tooling was doing offset conversion between JS UTF-16 and PHP UTF-8 as part of the output-testing-and-comparison QA framework we'd built for the port, and it was getting hugely confused by Gallery since Gallery was using "bogus" offsets into the source text. Since fixing the offsets was rather involved (the patchset for this commit in gerrit went through 56 revisions : https://gerrit.wikimedia.org/r/505319 ) the change was first done on the JS side, thoroughly tested, and deployed to production to ensure it had no inadvertent effects, before that now-better JS code was ported to PHP. It would have been a disaster to try to make this change in the PHP version directly during the port.
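To make the offset problem concrete, here is a rough sketch of converting a JS-style offset (UTF-16 code units) into a PHP-style offset (UTF-8 bytes). The function name `utf16ToUtf8Offset` is hypothetical and this is not Parsoid's actual conversion code, just an illustration of why the two numbers drift apart as soon as non-ASCII text appears.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: map an offset counted in UTF-16 code units
 * (a JS string index) to a byte offset into the same text as UTF-8
 * (a PHP string index).
 */
function utf16ToUtf8Offset(string $utf8Text, int $utf16Offset): int {
    // Split into code points; the /u modifier makes PCRE treat the subject as UTF-8.
    $chars = preg_split('//u', $utf8Text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $units = 0; // UTF-16 code units consumed so far
    $bytes = 0; // UTF-8 bytes consumed so far
    foreach ($chars as $char) {
        if ($units >= $utf16Offset) {
            break;
        }
        $len = strlen($char);           // byte length of this code point in UTF-8
        $bytes += $len;
        $units += ($len === 4) ? 2 : 1; // 4-byte code points need a surrogate pair in UTF-16
    }
    return $bytes;
}

// "a", EURO SIGN (3 bytes, 1 unit), GRINNING FACE (4 bytes, 2 units), "b"
$text = "a\u{20AC}\u{1F600}b";
$byteOffset = utf16ToUtf8Offset($text, 4); // 4 is the index of "b" in JS
echo substr($text, $byteOffset), "\n";
```

In this example the JS index 4 corresponds to byte offset 8 in PHP, so any stored offset that is already "bogus" on one side becomes very hard to compare on the other.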

Why? The editor needs a frontend in javascript anyways, so why not handle this all in real time on the client?

Now they rewrote in PHP, thats probably one of the worst languages out there, and why not rewrite in something compiled if speed was the main reason for a rewrite?

For me PHP sits in the middle as a poor language, and still slow compared to any compiled languages. Also i would want to see some wasm vs php benchmarks they did before starting with php.

Lots of poor decisions from the wiki team.

> For me PHP sits in the middle as a poor language, and still slow compared to any compiled languages.

After switching jobs and ending up in a PHP-based company, I can say that such thing is not entirely true.

Poor language?

Not really, it does a lot of stuff and solves a lot of problems.

Still slow?

Kinda true but not really: php 7.x saw huge improvements and rumors have it that php 8.x will be getting a JIT-compiler.

Also, from my own observation, most of php slowness derives from the fact that the usual approach to deploying a php web application means using php fpm, that starts a whole new php interpreter for each request.

This in turn derives from the fact that php was born to create "dynamic websites" as in websites that were mostly static but with some occasional dynamic page.

IMHO some framework (Laravel? Symfony? some new player?) should try and start a single php process to handle requests and persist between one request and the next.

Starting a new php process is SUPER expensive: there's the whole fork+exec overhead, I/O to load data from disk, parsing and byte-compiling. Every single time. Even with opcache, you might skip some of the last steps, but you'll have to re-load the cache in the next execution.

Lots of assumptions from you. "PHP bad, PHP slow, everyone has fast client." Did you assume it was 2010?

Well, actually i made zero assumptions. Theres tonnes of markdown editors that work in the browser in real time. I assume the wiki syntax is not that much "heavier" to work with, if it is its another lol for the wiki team.

And by PHP i mean if this rewrite is done partly because of speed, why not build the parser in a compiled language? Its just silly that you have to work with PHP as its one of the worst languages out there in terms of dx and features.

I assume its 2020 where clients are "fast enough" to handle a syntax transformation like markdown -> HTML

"For me PHP sits in the middle as a poor language, and still slow compared to any compiled languages." No assumptions whatsoever.

2020: mobile clients. Yes, they're where the rest of the web was in 2005. Yes, they're ubiquitous.

I cant believe users on legacy phones (phones from say 2010-ish era) would be writing new content on wikipedia actively. I dont care if you have the newest iphone, you still dont write anything, you just read. For those 0.00001% of users who actually create new content they most def are not using a phone to do this.

> "For me PHP sits in the middle as a poor language, and still slow compared to any compiled languages." No assumptions whatsoever.

It's not an assumption. PHP is slower than a compiled language. Simple and easy. Need speed? Do it in a compiled language. Period.

Not exactly accurate regarding mobile editing. Edits on mobile-heavy wikis are up 18% YoY, which we attribute to improvements in the mobile editing interface. We have people who edit on T9 interfaces. Not many, but they exist.

you probably want to do stuff on server side too.

One of the things that come to my mind is rendering in formats other than html.

wikibooks for example lets you render wikimedia pages into pdf, and that's cool. but to do that you have to parse the page server side.

>Parsoid/JS had very few unit tests focused on specific subsections of code. With only integration tests we would find it difficult to test anything but a complete and finished port.

I found this a little frightening given Parsoid/JS is handling user input.

There are thousands of integration tests. The "correct" output of the parser is well-known for a given input, and those test cases have been accumulating for over a decade. But the internal structure of the parser is much more fluid, and so it wasn't (historically) thought worthwhile to try to write tests against that shifting target.

Last time I tried getting MediaWiki up and running as a personal wiki I found out that getting Parsoid working was quite a mess. Hopefully now it will be easier to get a fully fledged MediaWiki installation, together with a visual editor.

Yes. We're not there yet, but that's the goal!

awesome! keep rocking!

I would assume that the code is open source?

For some reason, I did not manage to find it. Neither linked from this article, nor via the MediaWiki page:


Nor via the Phabricator page:


What am I missing?

> What am I missing?

Google fu.

parsoid source code => https://github.com/wikimedia/parsoid

Woops .. so they use Phabricator for issue management and GitHub to host the code?

The GitHub repo is a mirror. It has links to more documentation on how the development process works.

I saw that. But why not have the code on Phabricator?

Years ago I rewrote the wiki parser for DokuWiki (which is used at https://wiki.php.net/ among other places). Originally the parser was scanning a wiki page multiple times using various regular expressions. I used a stack machine as a way to manage the regular expressions, which resulted in being able to parse a page in a single pass - it's documented here - https://www.dokuwiki.org/devel:parser

A nice (unexpected) side effect is it became much easier for people to extend the parser with their own syntax, leading to an explosion of plugins ( https://www.dokuwiki.org/plugins?plugintype=1#extension__tab... )

I'm no expert on parsing theory but I have the impression that applying standard approaches to parsing source code; building syntax trees, attempting to express it with context free grammar etc. is the wrong approach for parsing wiki markup because it's context-sensitive. There's some discussion of the problem here https://www.mediawiki.org/wiki/Markup_spec#Feasibility_study

Another challenge for wiki markup, from a usability perspective, is that if a user gets part of the syntax of a page "wrong", you need to show them the end result so they can fix the problem, rather than have the entire page "fail" with a syntax error.

From looking at many wiki parsers before re-writing the DokuWiki parser, what _tends_ to be the case when people try to apply context-free grammars or build syntax trees is that they reach 80% then stumble at the remaining 20% of edge cases of how wiki markup is actually used in the wild.

Instead of building an object graph, the DokuWiki parser produces a simple flat array representing the source page ( https://www.dokuwiki.org/devel:parser#token_conversion ) which I'd argue makes it simpler to write code for rendering output (hence lots of plugins) as well as being more robust at handling "bad" wiki markup it might encounter in the wild - less chance of some kind of infinite recursion or similar.
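To illustrate the flat-array idea, here is a toy sketch. It is only loosely modelled on the approach described above: `TinyHtmlRenderer`, `renderInstructions` and the instruction names are made up for this example and are not DokuWiki's real classes or instruction set.

```php
<?php
declare(strict_types=1);

// Hypothetical renderer: one method per instruction name.
final class TinyHtmlRenderer {
    public string $doc = '';

    public function header(string $text, int $level): void {
        $this->doc .= "<h$level>" . htmlspecialchars($text) . "</h$level>";
    }
    public function p_open(): void { $this->doc .= '<p>'; }
    public function p_close(): void { $this->doc .= '</p>'; }
    public function cdata(string $text): void { $this->doc .= htmlspecialchars($text); }
}

// Walk the flat list once; no tree, no recursion.
function renderInstructions(array $instructions): string {
    $renderer = new TinyHtmlRenderer();
    foreach ($instructions as [$name, $args]) {
        // Unknown instructions are skipped instead of failing the whole page.
        if (method_exists($renderer, $name)) {
            $renderer->$name(...$args);
        }
    }
    return $renderer->doc;
}

// A page as a flat list of [instruction name, arguments] pairs.
$instructions = [
    ['header', ['Hello & welcome', 1]],
    ['p_open', []],
    ['cdata', ['Some text']],
    ['unknown_plugin_thing', ['ignored']],
    ['p_close', []],
];

echo renderInstructions($instructions), "\n";
```

Adding a new output format or a new piece of syntax then mostly means adding another method, which is presumably part of why plugins were easy to write.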

Ultimately it's a similar discussion to the SAX vs. DOM discussions people used to have around XML parsing ( https://stackoverflow.com/questions/6828703/what-is-the-diff... ). From a glance at the Parsoid source they seem to be taking a DOM-like approach - I wish them luck with that - my experience was this will probably lead to a great deal more complexity, especially when it comes to edge cases.

It appears that you were probably describing the lexer pass in your description of DokuWiki. Indeed tokenization is a very hard problem for wikitext. We use a pegjs grammar for it, but it contains lots of lookahead/special conditions/novel extensions, etc. It's hard. Wikitext is messy precisely because it was intentionally designed to be easy and forgiving to write.

Seems like we've learned many of the same lessons building our parsers. Markup parsers do seem to be a unique thing, not really like parsing either programming languages or natural languages. If we ever meet I'm sure we could happily share a beverage of your choice trading stories.


So, remove the majority of dynamic websites from the internet, basically? I don’t see how that’s helpful. I get you’re (probably) being hyperbolic, but the fact remains that it’s a highly useful language for generating dynamic web content.

I was refactoring some JS code yesterday, and bad JS code is worse than bad PHP code, so by your logic JS has to go too, and probably many other languages.

Good code uses small functions that do a simple thing, then you combine those functions , it will look similar for most programming languages

Let's smite the vast majority of web content due to technical purity?

Why are web engineers snobs? Get the job done and move on.

Anybody else feeling that strict typing and long var names are not worth all the visual overload?

Example: https://github.com/wikimedia/parsoid/blob/master/src/Parsoid...

This is how I would write the function definition:

    function html2wikitext($config, $html, $options = [], $data = null)

This is how Wikimedia did it:

    public function html2wikitext(
        PageConfig $pageConfig, string $html, array $options = [],
        ?SelserData $selserData = null
    ): string

I see this "strictness over readability" on the rise in many places and I think it is a net negative.

Not totally sure, but this seems to be the old JS function definition:


A bit cryptic and it suffers from the typical promise / async / callback / nesting horror so common in the Javascript world:

    _html2wt = Promise.async(function *(obj, env, html, pb)

Your first function definition leans HEAVILY on implicit knowledge - what does $config mean? What does $html mean? What are valid properties of $config and $data?

This is fine in smaller codebases, 'your' code, and code that you can read to a point where you can extrapolate these variables from the implementation, but this simply does not scale beyond a certain code size - or more importantly, a certain amount of contributors.

It can be compensated with documentation (phpDoc), but that is just as verbose if not moreso than adding type information - although you should probably do both.

Type systems come into place where you are not expected anymore to fully comprehend the code. They are useful when you are just a consumer / user of this function and all you want to do is convert some html to wiki text without having to understand the internals of that particular function (and whatever else goes on beyond it). Types are documentation, prevent shooting yourself in the foot, reduce trial-and-error, and avoid the user having to read and comprehend hundreds - thousands of lines of code.

This. Passing a variable named $config, or options, settings... etc, to a function with no type definition is the recipe for unmaintainability.

Why would you add a docblock on top of your already typed functions? Maybe just description would be cool but phpdoc with all the duplicated parameter definitions? I think it's unnecessary but would love to hear other perspectives.

You can add e.g. descriptions to your params if needed. Also phpdoc understands types like string[] for arrays and union types like (int|string) for untyped params.
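A small sketch of what that looks like in practice. The function `describe` and its parameters are invented for this example; the point is only that the docblock carries information the native signature does not.

```php
<?php
/**
 * Build a short label line for an item.
 *
 * @param string[]   $labels Natively this can only be declared as "array";
 *                           the docblock says what is inside it.
 * @param int|string $id     Left untyped natively; the docblock documents
 *                           the accepted types.
 * @return string
 */
function describe(array $labels, $id): string {
    return $id . ': ' . implode(', ', $labels);
}

echo describe(['draft', 'needs-review'], 42), "\n";
echo describe(['stable'], 'main-page'), "\n";
```

So the docblock is not pure duplication here: it adds the element type and the allowed union, plus a place for per-parameter descriptions.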

Reformat your example slightly, and suddenly it's not so bad:

  public function html2wikitext(
    PageConfig  $pageConfig, 
    string      $html, 
    array       $options = [],
    ?SelserData $selserData = null
  ): string

I am sure an IDE could do that for you.

That being said, I think the Python way of formatting type annotations ("variable : type") is more readable than C-style "type variable = ...", especially when the annotation is optional.

I did not like/see the real benefit of strict typing until actually using Typescript. Now for me the more verbose code and extra typing is vastly outweighed by knowing what arguments the function actually wants, especially when using 3rd party libraries. The time saved by real-time type checking in my editor compared to edit->run->crash->edit is almost unbelievable and the amount of errors in code I ported over that we did not come across in production is also huge. So no, I think the visual overload is definitely worth it.

You can then e.g. run a static analyzer over it, which can say things like 'in SomeOtherFile.php:130, you're passing $config="nope"; this is a string, not what the function expects to handle' or 'html2wikitext is supposed to return string, you're returning a DateTimeImmutable at line 160'. Same with the access modifiers: 'private function whatever() is unused'.

Further, documentation: "of course everybody knows what you're supposed and forbidden to pass into $data" - NOT. Even if it's just you writing the code: the you+1year will have trouble reading it (been there). Not even when it's supposed to be documented. If you have an explicit data structure, this becomes far more evident, even before any documentation (note: not replacing it).

I'm not interested in playing computer in my head any more, juggling internal state that's completely superfluous to me: am I a higher primate? Yes. Are higher primates tool users? Also yes. Should I let machines do the menial tasks for me, leaving me to do the creative ones? A hundred times yes.

(NB: this is not a silver bullet - e.g. won't help against logic errors - but it's a useful guard against going completely off the rails)
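A rough illustration of the `$config="nope"` case, under stated assumptions: `PageConfigStub` and `html2wikitextSketch` are stand-ins invented for this example, not Parsoid's real classes or its real conversion logic. A static analyzer would flag the bad call before the code runs; the sketch below only shows the runtime safety net that the type declaration gives you even without one.

```php
<?php
declare(strict_types=1);

// Stand-in for a real configuration class.
final class PageConfigStub {
    public string $title = 'Main Page';
}

// Typed signature in the style discussed above; the body is a placeholder.
function html2wikitextSketch(PageConfigStub $pageConfig, string $html, array $options = []): string {
    return strip_tags($html);
}

// Correct call: the types document what is expected.
$ok = html2wikitextSketch(new PageConfigStub(), '<b>hi</b>');

// Wrong call: a string where a PageConfigStub is required.
try {
    html2wikitextSketch('nope', '<p>x</p>');
    $caught = false;
} catch (TypeError $e) {
    // The mistake surfaces at the call boundary, not deep inside the function.
    $caught = true;
}

var_dump($ok, $caught);
```

With the untyped `($config, $html, ...)` version the same mistake would only show up wherever `$config` is first used as an object, possibly many calls later.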

Honestly, I think the tooling should allow you to write just

    function html2wikitext($config, $html, $options = [], $data = null)

and then infer the types, and subsequently enforce them. The typechecker should only complain when it can't infer the types (a-la strict). If you use tooling to reduce verbosity, you'll have the best of both worlds simultaneously.

I like long names because autocomplete usually means typing speed doesn't matter and they are more verbose, so less thinking when I come back to the code next week. Same goes for strong typing, it makes you think when you write it so you (or somebody else) have to think less when you read it, and obviously it keeps you from making mistakes, helps the IDE make useful suggestions etc.

Me: Many people like X. Anybody here who likes Y?

You: I like X!

Sorry, I'm bad at reading, I understood your question as mostly "why do people like X?"

But you also say

> I see this "strictness over readability" on the rise in many places and I think it is a net negative.

which makes your taste sound a bit more... absolute, so to speak. So it makes sense to me that people reply to your negative view with counter arguments.

Strict typing has an overhead, but the payoff is apparent when working with a large dev team of variable skill levels. People can make a mess in any language but it's easier to decode with strict typing.

For lone wolf coding or rapid prototyping the equation is different.

I've never worked in a large team that has used a dynamically typed language. To me it sounds like a nightmare, especially given that I know how little time is typically left for documentation. I see that it could work if you enforce type annotations. But then you might as well use a strongly typed language.

I've worked on a 15 person team using Perl that was actually pleasant because everyone was A grade, actively did code reviews and the two seniors were A+ and would kick your ass for bad code. But most teams aren't like that so would benefit from strong typing.

And Perl has many aspects of strength compared with Python or Javascript. See for example:




For what comes out of the box with Perl regarding undefined variables. Perl has "my", "local" and "our" for a start, but even more: the array and hash variables have by design different syntax, which helps immensely: the sigils are of real help as a kind of written "type." It's like how in old BASICs string variables looked different from the numeric ones, and Larry Wall acknowledged he was inspired by that. I can go on and on. Perl looks hard to the uninitiated but it can produce much more stable code, in my experience, than the "typical" scripting languages like Python and Javascript. Stable in the sense that you know it will work after it compiles, not only once you test with every possible input.

But yes, the people using it should learn enough before they start contributing, and in a lot of places it's preferred to have people who barely know the basics of what they use (because they are "cheap" and "easily replaceable").

I'll bet the typed version is a lot more readable with syntax highlighting.

I agree with you, I prefer your structure. I actually wish PHP hadn't added typing, if I wanted that I wouldn't have chosen PHP in the first place.

I'm actually curious why PHP was chosen instead of Rust or Go given that the parsing team wasn't familiar with the language. I understand that MediaWiki is written in PHP, but it sounds like they were already comfortable with language heterogeneity.

They claim,

> The two wikitext engines were different in terms of implementation language, fundamental architecture, and modeling of wikitext semantics (how they represented the "meaning" of wikitext). These differences impacted development of new features as well as the conversation around the evolution of wikitext and templating in our projects. While the differences in implementation language and architecture were the most obvious and talked-about issues, this last concern -- platform evolution -- is no less important, and has motivated the careful and deliberate way we have approached integration of the two engines.

Which is I suppose a compelling reason for a rewrite if you're understaffed.

I'd still be interested in writing it in Rust and then writing PHP bindings. There's even a possibility of running a WASM engine in the browser and skipping the roundtrip for evaluation.

> I'm actually curious why PHP was chosen

From the article: "However, by 2015, as VisualEditor and Parsoid matured and became established, maintaining two parallel wikitext engines in perpetuity was untenable"

They didn't write it in PHP for speed, that was merely a side effect. They wrote it in PHP so they could have a single language for the system.

> Parsoid/PHP also brings us one step closer to integrating Parsoid and other MediaWiki wikitext-handling code into a single system, which will be easier to maintain and extend.

I assume that Wikimedia works on a rather tight budget. Choosing (and unifying on) tech stacks with a larger supply in devs seems to be an economically reasonable choice.

It's more complicated than that. MediaWiki is PHP based because back when it was developed PHP was everywhere. Since then the world has moved on, but PHP still powers a huge percentage of the web via things like WordPress.

The other side to using PHP was having support in other host providers. Wikipedia is not the only installation of MediaWiki and there has been consideration in the past for those installing MediaWiki on shared hosts where you don't necessarily have root access to install things like node. Moving forward that's less of a concern because you can containerise MediaWiki (and the other services), but not even Wikimedia run that in production yet AFAIK.

However, even if they weren't budget constrained (which they aren't) unifying on a single language used by the majority of their devs isn't a bad idea, especially when the effort to port the entire stack to a new language would be unjustifiable.

and... migrating an entire codebase to something new because there's a subset of devs that jump between tech stacks and want 'newer' stuff isn't an economically reasonable choice.

server-side JS was a thing 10 years ago, but it didn't offer enough benefits to switch. same with python, java, ruby - all existed, but didn't offer enough benefits to switch then, and probably still don't now.

also, what would be a "larger supply"? C? Java? C#? JS? PHP has a huge supply of developers at all skill levels, which may make it just as easy (or easier) in finding the talent they need. And... hey - they wrote that initial parsoid in JS and... they've doubled the speed by converging on PHP.

Huh. What does the ED of the Wikimedia Foundation even do?

Wikimedia is swimming in donations. More than $100,000,000 yearly since 2017/2018.

The parser is probably a bunch of regexes that no one understands. So they just converted the code to PHP without touching the expressions.

My suspicion is correct - code is full of things like: /\[\[([^\[\]])\]\]|\{\{([^\{\}])\}\}|-\{([^\{\}]*)\}-/

I never understood why people find regex so intimidating. Obviously you probably didn't look to find the worst of all, but the one you posted is very straightforward.

You jest, but that regex looks machine-generated. My Emacs is full of these in places used for syntax coloring, but I know these are optimized. There's an elisp function, regexp-opt, into which you can throw a bunch of strings, and you get out a regex like the one above.

To be honest I was serious. Personally I believe that regular expressions are one of the few tools that are super useful even for people outside of IT, because everyone has to extract or format some text or table data from time to time. You can even learn them just by playing a game:


The example quoted required some mental work to unparse, so I assumed you're joking.

But in general, I agree with you. Regular expressions aren't hard, and there's no excuse for not learning to read and use them.

Regex are dreaded as difficult to comprehend, but the real danger in using them is more subtle - especially nowadays when you'd have most text as UTF-8, possibly escaped, etc. and regex are prone to misbehave in odd ways, and introduce security issues - they should only be handled by expert programmers. Even parsing apparently simple stuff like email addresses, IP addresses, phone numbers and date/time is tricky, far beyond what a newbie would expect. There's a reason we have dedicated validation functions in PHP for all of the above. That said, regex have their use case too, and if your parsing case is not covered by a dedicated function, are usually the best option.

I never understood why people who understand regex don’t understand people who don’t understand regex. Obviously you are not the worst of all, but it’s not that hard to imagine how a regex looks to someone who doesn’t know regex, is it?

I couldn't agree more. I know regex fairly well and parsing regex is still annoying and takes a lot more concentration than just reading normal code.

Plus there are so many cases where people build insane regex where they are just the wrong tool for the job, e.g. parsing/extracting or manipulating HTML. It always starts out with "I just need the src from that <img>, what could go wrong" and ends in despair, because you never just need that src and you never only deal with perfect html and you'd be done already if you had just used some dom parser.
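For comparison, a minimal sketch of the "just use a DOM parser" route in PHP. It assumes the standard `dom` extension is available, and `imageSources` is a name made up for this example. The sample markup is deliberately sloppy (unclosed paragraph, uppercase tag, mixed quoting, a `>` inside an attribute), which is the kind of input where a quick regex tends to fall over.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: collect the src of every <img> in an HTML fragment.
function imageSources(string $html): array {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

$html = '<p>Unclosed <IMG alt="a > b" SRC=\'one.png\'><img data-src="x" src=two.png>';
print_r(imageSources($html));
```

The parser deals with quoting, case and unclosed tags, so the calling code only has to say what it wants.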

Yeah, I get that regular expressions might look complex and tangled, like brainfuck looks to me since I never tried to learn it. Yet I just see comments on how regular expressions are hard to understand from all kinds of IT people who are solving hundred times more complex puzzles every day. I guess it's just a reputation that stuck to a certain technology and really has nothing to do with actual complexity.

Experience i guess. I've spent hundreds of hours on debugging and fixing regexes that other people wrote - usually just to find there's a quirk in certain regex parser implementation.

Regexes are easy to understand if you write them, but reading them can take lots of time.

Note that HN formatting messed it up (there are stars missing before the first two closing parens). The regex itself is indeed quite straightforward, just a bit hard to read due to all the required backslash-escaping.
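For reference, a sketch with the two stars put back where the comment above says they went missing. This is a reconstruction from that description, not a line copied from the Parsoid source, and the sample text is invented. Read left to right it is three alternatives: `[[...]]`, `{{...}}` and `-{...}-`, each capturing whatever is inside as long as it contains no nested brackets or braces.

```php
<?php
// Reconstructed pattern: a `*` restored after the first two character classes.
$pattern = '/\[\[([^\[\]]*)\]\]|\{\{([^\{\}]*)\}\}|-\{([^\{\}]*)\}-/';

$text = 'See [[Main Page]], {{cite}} and -{zh-hans}-.';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // Index 0 is the whole match; exactly one of groups 1 to 3 is non-empty here.
    echo $match[0], "\n";
}
```

Most of the visual noise is the backslashes; the structure underneath is three short alternatives.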
