
A New RegExp Engine in SpiderMonkey - simonpure
https://hacks.mozilla.org/2020/06/a-new-regexp-engine-in-spidermonkey/
======
ridiculous_fish
I wrote the JavaScript regex engine in Hermes [1], and also regress, an
ES-compatible regex engine in Rust. I want to explain what's sucky about this
decision, but first some ES regexp background.

The regexp grammar portion of the ES spec is _different_, in that it's not
sufficient. You can implement JavaScript from the ES spec, but NOT regexp. The
grammar doesn't really make sense [2]. The Chakra team found the same thing
[3].

I have not confirmed this but I have been told the regexp portion is tightly
controlled by Google, more than most parts of the spec. When v8's regexp
implementation is found to deviate from the spec, the spec gets changed. This
makes sense (why risk breaking websites), but it illustrates how little weight
the regexp spec actually has.

With Firefox using irregexp, it further cements this unfortunate reality.

1: [https://hermesengine.dev](https://hermesengine.dev)

2: Example: the NonemptyClassRangesNoDash production may include a dash

3: [https://docs.microsoft.com/en-us/archive/blogs/ie/chakra-interoperability-means-more-than-just-standards#whats-the-real-regular-expression-grammar](https://docs.microsoft.com/en-us/archive/blogs/ie/chakra-interoperability-means-more-than-just-standards#whats-the-real-regular-expression-grammar)

~~~
IainIreland
> I have not confirmed this but I have been told the regexp portion is tightly
> controlled by Google, more than most parts of the spec. When v8's regexp
> implementation is found to deviate from the spec, the spec gets changed.
> This makes sense (why risk breaking websites), but it illustrates how little
> weight the regexp spec actually has.

A few responses:

1. This was not at all my experience working with the V8 team. For example,
when I noticed that Irregexp did not properly implement the Canonicalize
algorithm for non-unicode case-insensitive comparisons [1], they were happy to
accept my patches [2][3]. Nobody ever suggested changing the spec.

2. Nothing changes in the JS spec without consensus from all the major
stakeholders. See, for example, the second sentence in the TC39 process
document here [4]: "The committee operates by consensus and has discretion to
alter the specification as it sees fit." Mozilla can block (and has blocked)
proposals that it feels are bad for the web platform. Google has no special
power to bend the committee to its will.

3. Another way of saying "when V8's regexp implementation is found to deviate
from the spec, the spec gets changed" is "when the spec is found to deviate
from the consensus of implementations, we correct the spec". The Chakra team's
post notes: "In practice, all browsers accept regular expressions such as
`/]/` and web developers write them." The purpose of the spec is to make the
web work better by maximizing compatibility between implementations. In cases
where all implementations agree, and real websites depend on that behaviour,
the spec _should_ change to match reality.

Don't get me wrong: Chromium monoculture is a real problem. If we thought that
Google was going to add a bunch of new regexp features without going through
the standards process, or if we had plans to prototype regexp features that we
feared Google wouldn't accept upstream, we might have made a different
decision. But regexps are not a plausible battleground, and the compatibility
benefits of sharing code outweigh the harms.

PS: regress looks really sweet, and I am excited to see how it turns out.

[1] [https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch](https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch)

[2] [https://github.com/v8/v8/commit/3fab9d05cf34a7f0bc0e9405729ab8b78c0671ac](https://github.com/v8/v8/commit/3fab9d05cf34a7f0bc0e9405729ab8b78c0671ac)

[3] [https://github.com/v8/v8/commit/b65fcfe92566415c8aae373d5fb5a9b16bed21ce](https://github.com/v8/v8/commit/b65fcfe92566415c8aae373d5fb5a9b16bed21ce)

[4] [https://tc39.es/process-document/](https://tc39.es/process-document/)

~~~
ridiculous_fish
Great reply, thank you for sharing your perspective.

First, I confess to some skepticism of the standardization process. For
example, consider ES2018 lookbehinds. At the point of standardization, only
one browser implemented them: v8. And in the standardized form (arbitrary
width) they are very difficult to retrofit onto existing engines. Moddable had
to throw their engine out and start over [1]. JSC's YARR still doesn't support
them, neither does SpiderMonkey. Did all stakeholders really agree on this
feature and then not implement it? Or was this just v8 in the driver's seat?

Second, regexps actually do have a role in driving monoculture. You can't
polyfill regexp syntax. At one point, Steam's website only worked on Chrome
because only v8 implemented lookbehinds [2] - again because it's hard to
retrofit.
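
For context, this is roughly what the feature looks like; the variable-width case is the part that forced engines to learn to match backwards. A sketch of standard ES2018 behavior:

```javascript
// Fixed-width lookbehind: match digits only when preceded by "$".
console.log("$42".match(/(?<=\$)\d+/)[0]); // "42"

// Variable-width lookbehind: \s* can match any length, which is what makes
// this hard to retrofit onto a forward-only backtracking engine.
console.log("total:   42".match(/(?<=total:\s*)\d+/)[0]); // "42"
```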

Lastly, it remains a problem that one cannot implement a conforming JS regexp
engine from the spec alone. The rest of JavaScript does not have that problem,
only regexp. And I disagree with the "real websites depend on that behaviour"
qualifier. No website depended on `/{1}/` being invalid syntax, yet it was made
so in ES8.
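
In `u`-mode at least, a lone braced quantifier is unambiguously rejected under the strict Pattern grammar (a sketch; non-`u` behavior is governed by the Annex B web-reality rules):

```javascript
// With the "u" flag, the strict Pattern grammar applies and a lone braced
// quantifier with nothing to repeat is a SyntaxError:
try {
  new RegExp("{1}", "u");
} catch (e) {
  console.log(e instanceof SyntaxError); // true
}
```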

Incidentally, it's amusing that v8's Canonicalize was buggy. Non-Unicode case-
insensitive char ranges are THE ugliest part of regexp (QuickJS doesn't even
try [3]) and I just assumed the spec was codifying v8's implementation. Guess
it was someone else's, hah.
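
For the curious, the ugly corner is observable from plain JS: non-Unicode Canonicalize refuses case mappings that cross the ASCII boundary, so a non-ASCII character whose uppercase is ASCII won't match case-insensitively without `/u`. A sketch using U+017F, the classic example:

```javascript
// "ſ" (U+017F LATIN SMALL LETTER LONG S) uppercases to ASCII "S", but
// non-Unicode Canonicalize blocks non-ASCII -> ASCII mappings:
console.log(/\u017f/i.test("S"));  // false
// With /u, simple case folding is used instead, and they match:
console.log(/\u017f/iu.test("S")); // true
```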

1: [https://blog.moddable.com/blog/regexp/](https://blog.moddable.com/blog/regexp/)

2: [https://github.com/webcompat/web-bugs/issues/51385](https://github.com/webcompat/web-bugs/issues/51385)

3: [https://github.com/ldarren/QuickJS/blob/mod/libregexp.c#L259](https://github.com/ldarren/QuickJS/blob/mod/libregexp.c#L259)

~~~
IainIreland
The TC39 minutes are all online [1]. You can see the initial introduction of
lookbehind assertions [2], the meeting where they reached stage 3 [3], a
status update while waiting for more implementations [4], and then the
eventual agreement to move them to stage 4 (full standardization) [5].

They reached that point because: a) everybody agreed that they were a good
feature, with precedent in other languages; b) the spec text had been
scrutinized by the people who like to spend time scrutinizing spec text; and
c) two implementations (Moddable and V8) had successfully implemented them.
That's the process, and it was followed to the letter.

Does that put a burden on other engines to keep up? Sure. But that's what we
signed up for. The webcompat bug you linked is from April 2020, more than two
years after lookbehind assertions were standardized. That's nobody's fault but
ours, and we did this project so that we don't end up in a similar situation
in the future.

I'm not sure what your source is for JS regexps being impossible to implement
from the spec, but if you have specific points of concern, you should open an
issue [6]. Writing a spec is hard work, and things get missed, but the people
working on this are genuinely trying their best.

(As far as I can tell, Canonicalize is what everybody _thought_ they were
implementing. V8 tried to get clever with ICU and ran facefirst into some of
the dark corners of Unicode.)

1: [https://github.com/tc39/notes/tree/master/meetings](https://github.com/tc39/notes/tree/master/meetings)

2: [https://github.com/tc39/notes/blob/master/meetings/2016-11/nov-30.md#12ivc-regexp-lookbehind](https://github.com/tc39/notes/blob/master/meetings/2016-11/nov-30.md#12ivc-regexp-lookbehind)

3: [https://github.com/tc39/notes/blob/master/meetings/2017-03/mar-21.md#10ib-regexp-lookbehind-assertions](https://github.com/tc39/notes/blob/master/meetings/2017-03/mar-21.md#10ib-regexp-lookbehind-assertions)

4: [https://github.com/tc39/notes/blob/master/meetings/2017-05/may-23.md#16ic-status-update-on-regexp-proposals-lookbehind-unicode-properties-dotall-flag-and-named-groups-status-update](https://github.com/tc39/notes/blob/master/meetings/2017-05/may-23.md#16ic-status-update-on-regexp-proposals-lookbehind-unicode-properties-dotall-flag-and-named-groups-status-update)

5: [https://github.com/tc39/notes/blob/master/meetings/2018-01/jan-24.md#13iij-regexp-lookbehind-assertions-for-stage-4](https://github.com/tc39/notes/blob/master/meetings/2018-01/jan-24.md#13iij-regexp-lookbehind-assertions-for-stage-4)

6: [https://github.com/tc39/ecma262/issues](https://github.com/tc39/ecma262/issues)

~~~
ridiculous_fish
I want to say four things. First, I think Mozilla probably made the right
decision here. Mozilla is an org with limited resources and must direct them
to what matters - and as you say, regexps are not a plausible battleground. I
wish I had said so in my top-level post.

Second, I cast no aspersions against any of the people involved in the ES
spec. I recognize they are engaged in hard and unrewarding work driven by a
sincere effort to push JavaScript forward, while everyone's a critic,
especially myself. It's awesome that the meeting minutes are available. Thank
you for the links.

Third, I honestly believe the meeting minutes support my point. I risk
overstepping, because I had no involvement, but from your third link:

> We have two implementations. One in V8 and one in the Dart VM. It's not a JS
> implementation, but it does implement this feature.

This supports my speculation that it's all about v8. Later:

> I would assume Chakra or V8 would have said something by now if they had
> issues

SpiderMonkey and JSC are chopped liver? This feature really seems to have been
driven by the implementors.

And this one feature (lookbehinds) was so disruptive, so damn hard to
implement, that Mozilla _abandoned their implementation_ and now just uses
v8's. The linked blog post cites this feature as an impetus to switch.

Was that the goal? Did SpiderMonkey engineers give the nod to lookbehinds with
the intention of abandoning their engine in favor of irregexp? I have to
believe not; I speculate it was the very human response of conflict avoidance.
Easier to say "yes" especially if you do not appreciate how much work you are
signing up for. But it means ES2018 regexp advanced the monoculture. One less
regexp engine in the wild.

Fourth, I really do know that the ES regexp spec is not useful for
implementors, because I am an implementor. Ok, it's not really a question,
huh; I should just make a PR.

------
sam_goody
Nice to see the V8 and FF teams working together. And preventing FF from
falling behind on features is good.

On the other hand, the more dependent Mozilla is on the Chromium base, the
more power it gives Google (even if at this point Google already acts like
they own the internet and do things that break in FF at will).

~~~
ldeangelis
I agree, but the article mentions that the V8 team already wanted to make
their RegExp engine more independent. Also, since Mozilla runs on limited
resources, it's great that they will have less maintenance to do in the
future.

~~~
nerdponx
Hopefully we start seeing more "modularized" browser components. It's a huge
detriment IMO that we are stuck with 3-4 (depending on who you ask) monolithic
browser engines.

~~~
LockAndLol
I honestly don't understand why Mozilla went the route of a monolithic,
Firefox-only browser engine. They had a lead before Chrome showed up, but
refused to cater to devs. Now they're paying for it, and so is everybody else.

------
29athrowaway
Regular expressions come from this paper, "Representation of events in nerve
nets and finite automata" by Stephen Cole Kleene.

[https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM704.pdf](https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM704.pdf)

It's incredible that biology served as inspiration for this.

------
jedisct1
Slightly related question: does a RegExp engine in pure JavaScript exist?

~~~
ridiculous_fish
Yes! RegExp Tree, by Dmitry Soshnikov (who is extremely awesome).

[https://github.com/DmitrySoshnikov/regexp-tree](https://github.com/DmitrySoshnikov/regexp-tree)

A fun bug I found in it: [https://github.com/DmitrySoshnikov/regexp-tree/issues/69](https://github.com/DmitrySoshnikov/regexp-tree/issues/69)

------
wereHamster

      [regexp] Remove trivial assertion
    
      The assertion in BytecodeSequenceNode::ArgumentMapping cannot fail,
      because size_t is an unsigned type. This triggered static analysis
      warnings in SpiderMonkey.
    

Does that mean Chromium doesn't use any static analysis tool? Or one that does
not work for this trivial assertion?

~~~
wahern
> Or one that does not work for this trivial assertion?

Or one that _does_ work for this trivial assertion, which is to say not emit a
diagnostic.

What ends up happening in a situation like this is that somebody removes the
assertion to quiet the compiler diagnostic. Then the types change and all of a
sudden it _can_ fail, but the assertion is gone.

You might say, "well, now that one of the operands is size_t how could the
types ever change to reintroduce an issue?" That's the wrong question to ask.
The whole point of the assertion is to avoid having to answer such a tricky
question, or at least to encode your answer in a way that if the unforeseen
happens things break loudly instead of silently. Anyhow, you'd be surprised by
how things can break. I personally _religiously_ use size_t for anything
related to object size, but many other developers don't (including at Google
and Mozilla), and so you often see a mix of, e.g., size_t and uint64_t, size_t
and uint32_t, size_t and int, etc, and regular tweaks back-and-forth, which
can easily introduce regressions.

I understand why compilers emit a warning--the idea is that if the assertion
couldn't possibly be false, maybe it has a bug. But, IME, the opposite is
usually true--it's well-written and deliberate, because the developer is
trying to catch spooky action at a distance where the type, which is defined
far away, is changed, accidentally or intentionally. I don't know where to
draw the line in terms of second-guessing the code to help catch bugs, but GCC
and clang need to provide far more succinct constructs to tell the compiler to
shut up. Currently you're stuck with inline pragmas, __extension__, statement
expressions, and other weird convolutions that require far more code than the
assertion or operation itself. The issue makes writing arithmetic overflow-
safe code _more_ tedious and error prone.

(Other languages simply prohibit mixing integer operands of different types,
so if a type is changed far away then code will break loudly even without any
assertions. But in codebases like SpiderMonkey and V8 that use a wide range of
integer types for space and performance optimizations, that tends to encourage
casting, which has the exact same problems.)

~~~
wereHamster
But the bug is not in that function. If a function accepts size_t, then the
value it receives can never be < 0; if someone were to pass (size_t)-1 to it,
it would still be positive. The issue is in the caller of the function, which
must have converted a negative value of a signed type into size_t, and that
conversion is the actual bug in this situation. Perhaps those conversions
should trigger at least a compiler warning.

Asserting things which are given from the types is a waste. Or do you think we
should assert that a value of type bool is really either true or false and
fail the program otherwise?

------
tannhaeuser
I thought Mozilla wanted to rewrite their browser in Rust didn't they?

~~~
sstangl
JIT engines like this one don't receive much benefit from being rewritten in a
memory-safe language. Errors typically occur in the generated machine code,
not in the compiler itself. The benefit would be small.

~~~
monocasa
There are more benefits to Rust than just its memory safety. Being a non-GCed
language with ADTs is a big one, and that's been nice for the couple of JITs
I've written in Rust.

And the partial memory safety over the metadata around the actual JITed code
is a big win as well.

Yes, Rust doesn't get you there 100%, but IMO it gets you closer than C or
C++.

~~~
saagarjha
> It being a non-GCed language with ADTs is a big one

C++ has both, too, so in this case Rust's only advantage would be memory
safety.

~~~
monocasa
Having used both (and in the space of writing JITs particularly), C++'s
support for ADTs is very weak. Like most things in C++ you _can_ reach it by
straitjacketing yourself into a particular style, with static analysis rules
catching maybe 80% of violations and manual code review the other 20%, but
it's difficult to maintain.

Rust gives you that more or less by default and for free wrt tooling. It's
sort of the classic "Rust makes you write the C++ you should have been writing
all along", which makes it a net win IMO.

~~~
saagarjha
It might be weak, but I think every browser engine makes do with what they
have available fairly well…

~~~
monocasa
Given the fact that Mozilla is the primary sponsor of Rust, and Rust has been
sneaking its way into Chrome as well, I'd say the authors of those browsers
disagree with you.

~~~
saagarjha
I'm not them, but I suspect that they're slowly switching not because of its
slightly better abstract data types but because it offers better memory
safety.

~~~
monocasa
Those are one and the same. The ADTs are how the shape and validity of the
data is described to the compiler in a lot of cases. Rust wouldn't be able to
be memory safe without them.

C++'s ADTs are easy to subvert, even accidentally; Rust's can't be subverted
without explicitly calling it out as unsafe.

It allows you to describe transformations of state in a formal,
set-theoretical way. You should check out formally verified software like
CompCert and seL4 and their heavy use of ADTs internally to achieve that. Rust
obviously isn't fully formally verified, but it's a neat 80/20 step in that
direction.

