
ECMAScript regular expressions are getting better (2017) - tosh
https://mathiasbynens.be/notes/es-regexp-proposals
======
kibwen
Aside, but I'd love for someone to come up with an alternative syntax for
regexes that sacrifices speed-of-typing for ease-of-reading. Regexes are
obviously still fantastically useful for things like command-line scripting
and jumping around in text editors, but the fact that people still reach for
regexes within programs that are meant to receive long-term maintenance tells
me that we're missing a tool in our toolbox.

~~~
couchand
This was one of the places CoffeeScript really innovated, IMO (though of
course Perl did it first). The "Block Regular Expressions"[0] allow whitespace
and comments:

    
    
        NUMBER     = ///
          ^ 0b[01]+    |              # binary
          ^ 0o[0-7]+   |              # octal
          ^ 0x[\da-f]+ |              # hex
          ^ \d*\.?\d+ (?:e[+-]?\d+)?  # decimal
        ///i
    

[0]: [https://coffeescript.org/#regexes](https://coffeescript.org/#regexes)

~~~
pitaj
Rust allows this too:

    
    
        Regex::new(r"(?x)
          (?P<y>\d{4}) # the year
          -
          (?P<m>\d{2}) # the month
          -
          (?P<d>\d{2}) # the day
        ")

~~~
burntsushi
Yup! rust/regex has had this for a while. More recently, its error messages
have also gotten a lot better, which I'm proud of!

    
    
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        regex parse error:
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        1: (?x)
        2:       (?P<y>\d{4}) # the year
        3:       -
        4:       (?P<m>*\d{2} # the month
                       ^
        5:       -
        6:       (?P<d>\d{2}) # the day
        7:     
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        error: repetition operator missing expression
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    

Playground link: [https://play.rust-
lang.org/?version=stable&mode=debug&editio...](https://play.rust-
lang.org/?version=stable&mode=debug&edition=2015&gist=c9fb1d58a8132d57d66417752a0f7282)

------
JoelJacobson
JavaScript regexes are "getting better" feature-wise, but performance-wise not
much has improved. The a?a?a? example regex from Russ Cox's 2007 article
"Regular Expression Matching Can Be Simple And Fast" still gets exponentially
slower as the number of "a?" repetitions grows
([https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html))

    
    
        for (let i=20; i<30; i++) {
          let str = "a".repeat(i);
          let regex = new RegExp("a?".repeat(i) + "a".repeat(i));
          let t0 = performance.now();
          str.match(regex);
          let t1 = performance.now();
          console.log(i + " took " + Math.round(t1 - t0) + " milliseconds.");
        }
    

Output:

    
    
        20 took 7 milliseconds.
        21 took 14 milliseconds.
        22 took 27 milliseconds.
        23 took 52 milliseconds.
        24 took 102 milliseconds.
        25 took 224 milliseconds.
        26 took 421 milliseconds.
        27 took 814 milliseconds.
        28 took 1604 milliseconds.
        29 took 3470 milliseconds.
    

Tested today on the latest Chrome on a MacBook Pro (2.6 GHz Intel Core i7).

Here's an idea for improving performance for regexes that are actually
regular, while still keeping support for advanced non-regular regexes that
e.g. contain back-references:

Add a step immediately after parsing the regex: check whether it is regular
and can be handled by an engine like RE2; otherwise, handle it with the
normal regex engine.
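
A rough sketch of what that dispatch step could look like (the `LinearEngine`
here is hypothetical, standing in for an RE2-style linear-time engine, and a
real implementation would inspect the parsed AST rather than the pattern
string):

    
    
        // Hypothetical sketch: route regular patterns to a linear-time engine
        // and fall back to the built-in backtracking RegExp otherwise.
        function compileRegex(source, flags) {
          // Back-references and look-around make a pattern non-regular.
          // (A naive source-level check; a real one would use the parsed AST.)
          const nonRegular = /\\[1-9]|\(\?<?[=!]/.test(source);
          if (nonRegular) {
            return new RegExp(source, flags);     // normal backtracking engine
          }
          return new LinearEngine(source, flags); // hypothetical RE2-style engine
        }
    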

An even more exotic idea would be to let multiple regex engines execute in
parallell, and return the result for the one that finishes first.

~~~
jhpriestley
You can also implement linear-time regex matching in user code with
WebAssembly. In my tests it was faster than native regex matching
([https://jasonhpriestley.com/regex-dfa](https://jasonhpriestley.com/regex-
dfa))

~~~
ridiculous_fish
As someone familiar with JS engine regex internals, those results are totally
shocking to me. How can I try your example? In particular I wasn't able to
figure out which regexes you are testing with.

~~~
jhpriestley
I tested with this regex:

    
    
      /^abcd(a(b|c)*d)*a(bc)*d$/
    

I used long strings of repeated "abcdabcd" as the test strings.
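
For reference, the native side of such a comparison can be timed with
something like this (a rough sketch, not the exact harness):

    
    
        const re = /^abcd(a(b|c)*d)*a(bc)*d$/;
        const input = "abcdabcd".repeat(100000);
        const t0 = performance.now();
        re.test(input);
        const t1 = performance.now();
        console.log(`native match took ${Math.round(t1 - t0)} ms`);
    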

It's possible I made a mistake somewhere, and I can put together an open-
source repo with the test setup when I get the time. But I'm curious why you
find the results shocking?

~~~
ridiculous_fish
The results are shocking because the native regular expression engines have
thousands of man hours poured into them, with specialized JITs, exotic
optimizations, etc. And your page is reporting the native one at 10x slower!

One thing I spotted is that you're asking the native engines to record capture
groups, but your wasm implementation doesn't support those. For fairness you
might switch to non-capturing groups: `(?:bc)` instead of `(bc)`. However this
cannot explain the magnitude of the difference.

I dug into it some more, reducing it to this case:

    
    
        /^(?:abc*d)*$/
    

What happens is that the backtracking engine has to be prepared to backtrack
_into_ every iteration of the outermost loop. This means that the backtracking
stack grows as the length of the input; the engine spends most of its time
allocating memory! These nested loops definitely make the NFA approach look
good.
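
If you want to poke at this yourself, something like the following shows how
the cost scales with input length (a sketch; the numbers will obviously vary
by engine):

    
    
        const re = /^(?:abc*d)*$/;
        for (let n = 1 << 14; n <= 1 << 18; n *= 2) {
          const input = "abcd".repeat(n);
          const t0 = performance.now();
          re.test(input);
          const t1 = performance.now();
          console.log(`${input.length} chars: ${Math.round(t1 - t0)} ms`);
        }
    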

Regardless it's a cool project, thanks for sharing!

~~~
stplsd
There is nothing shocking here; the parent implements strictly regular
expressions and compiles them to a DFA, so of course it will be fast,
especially when using only ASCII characters and hand-chosen regular
expressions. Russ Cox's articles cover this very well.

------
dbrgn
It's funny to see that in almost all languages the regular expression engines
are not actually regular. This results in developers regularly (pun maybe
intended) shooting themselves in the foot with overly clever expressions.

(Notable exception: Rust [https://docs.rs/regex/](https://docs.rs/regex/))

~~~
burntsushi
Thanks for the mention! I just want to expand on some things you said:

\- Rust's regex library was greatly inspired by RE2, which is a C++ library
that also executes regexes in linear time.

\- Go's regex library also runs in linear time; however, it is still missing
some optimizations, which can make it run more slowly on non-pathological
regexes when compared to RE2 and Rust's regex crate.

\- You don't need to use fancy features with a backtracking regex engine in
order to shoot yourself in the foot. e.g.,

    
    
        >>> import re
        >>> re.search('(a*)*c', 'a' * 30)
    

\- Even with linear time regex engines, you can get big slowdowns. You'll
never get exponential (in the size of the text) slowdowns of course, but
regexes like `[01]*1[01]{20}$`[1] can generate large finite state machines
(the DFA essentially has to remember the last 21 characters it has seen,
which is on the order of 2^21 states), and that can be problematic in either
memory or match speed, depending on the implementation.

[1] -
[https://cyberzhg.github.io/toolbox/min_dfa?regex=KDB8MSkqMSg...](https://cyberzhg.github.io/toolbox/min_dfa?regex=KDB8MSkqMSgwfDEpKDB8MSkoMHwxKSgwfDEpKDB8MSkk)

~~~
ridiculous_fish
Does the linear time hold up in the face of capture groups? For example, say
you have a bunch of capture groups in a loop:

    
    
        /((a)|(b)|(c)|(d))*/
    

If the string has length N, the loop iterates N times, and each iteration must
clear capture groups proportional to the length of the regex. So in this case
the time grows as the product of the input length and the regex length, not
linearly in each independently.

~~~
burntsushi
Yes, it does. Capture groups aren't special here. The issue is a semantic
quibble. Namely, when folks say, "a regex engine that guarantees matching in
linear time," what they actually mean is, "a regex engine that guarantees
matching in linear time with respect to the input searched where the regex
itself is treated as a constant." If you don't treat the regex as a constant,
then the time complexity can vary quite a bit depending on the implementation
strategy.

For example, if you do a Thompson NFA simulation (or, more practically, a Pike
VM), then the time complexity is going to be O(mn), where m ~ len(regex) and n
~ len(input), regardless of capturing groups.
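
(For intuition, here is a toy sketch of that kind of simulation in
JavaScript, with a hand-compiled program for /a*b/. It is nothing like
rust/regex's real implementation, just enough to show why the per-character
work is bounded by the program size rather than by the input already
consumed.)

    
    
        // Toy Thompson-style NFA simulation (illustrative only, not rust/regex).
        // Hand-compiled program for the regex /a*b/:
        //   0: split 1, 3   (enter the a-loop or skip it)
        //   1: char 'a' -> 2
        //   2: jmp 0
        //   3: char 'b' -> 4
        //   4: match
        const prog = [
          { op: "split", x: 1, y: 3 },
          { op: "char", c: "a", next: 2 },
          { op: "jmp", to: 0 },
          { op: "char", c: "b", next: 4 },
          { op: "match" },
        ];
        
        // Follow epsilon transitions (jmp/split) so the state set only ever
        // holds char/match instructions; its size is bounded by prog.length.
        function addState(states, pc) {
          const inst = prog[pc];
          if (inst.op === "jmp") return addState(states, inst.to);
          if (inst.op === "split") {
            addState(states, inst.x);
            addState(states, inst.y);
            return;
          }
          states.add(pc);
        }
        
        // For each input character we only visit states in the current set,
        // so the total work is O(len(prog) * len(input)) regardless of how
        // much input has already been consumed; there is no backtracking stack.
        function matches(input) {
          let current = new Set();
          addState(current, 0);
          for (const ch of input) {
            const next = new Set();
            for (const pc of current) {
              const inst = prog[pc];
              if (inst.op === "char" && inst.c === ch) addState(next, inst.next);
            }
            current = next;
          }
          return [...current].some((pc) => prog[pc].op === "match");
        }
        
        console.log(matches("aaab")); // true
        console.log(matches("ab"));   // true
        console.log(matches("aac"));  // false
    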

As another example, if you compile the regex to a DFA before matching, then
the time complexity is going to be O(n) since every byte of input results in
executing a small constant number of instructions, regardless of the size of
the regex. However, DFAs typically don't handle capturing groups (although
they certainly can handle grouping), with the notable exception of Laurikari's
Tagged DFAs, but I don't know off-hand if the time complexity of O(n) usually
associated with a DFA carries over to Tagged DFAs. Of course, the principal
downside of building a DFA is that it can use exponential (in the size of the
regex) memory. This is why GNU grep, rust/regex and RE2 use a hybrid approach
("lazy DFA"), which avoids O(2^n) space, but falls back to O(mn) matching when
the DFA would otherwise exceed some memory budget during matching.

~~~
ridiculous_fish
> when folks say, "a regex engine that guarantees matching in linear time,"
> what they actually mean is, "a regex engine that guarantees matching in
> linear time with respect to the input searched where the regex itself is
> treated as a constant."

Well the Rust docs say "all searches execute in linear time with respect to
the size of the regular expression and search text." Their engine compiles to
a DFA, not an NFA or PikeVM; I suppose this is the basis for their claim.

> As another example, if you compile the regex to a DFA before matching, then
> the time complexity is going to be O(n) since every byte of input results in
> executing a small constant number of instructions, regardless of the size of
> the regex. However, DFAs typically don't handle capturing groups

Now you have arrived at my question! Rust compiles to a DFA that supports
capture groups. My question is whether capture groups ruin the linearity of
the DFA matching.

Thanks for the Laurikari's Tagged DFAs reference, I hadn't heard of that. I'll
check it out!

~~~
burntsushi
> Well the Rust docs say "all searches execute in linear time with respect to
> the size of the regular expression and search text."

Yeah, that's ambiguous phrasing on my part. I meant that it was linear time
with respect to _both_ the size of the regex and the search text.

> Their engine compiles to a DFA, not an NFA or PikeVM; I suppose this is the
> basis for their claim.

No, it doesn't. rust/regex uses some combination of the Pike VM, (bounded)
backtracking and a lazy DFA. It will compile a DFA ahead of time in some cases
where Aho-Corasick can be used.

> Rust compiles to a DFA that supports capture groups.

No, it uses a lazy DFA to answer "where does it match," but it still must use
either the Pike VM or the bounded backtracker to resolve capture groups.

> My question is whether capture groups ruin the linearity of the DFA
> matching.

Yeah I think I would probably look at Tagged DFAs to answer this. You'll want
to check out recent papers that cite Laurikari's work, since I think there
have been some developments!

------
idbehold
This article is nearly two years old. Not that this isn't a great article, but
people should probably know that these features aren't entirely new.

~~~
Lord_Zero
I am also curious about which of the proposed features actually made it, and
which did not.

\- dotAll mode (the s flag)

\- Lookbehind assertions

\- Named capture groups

\- Unicode property escapes

\- String.prototype.matchAll

\- Legacy RegExp features

~~~
mathias
I make sure to keep the article up-to-date. It correctly states the status for
each of the features. They’re all part of ES2018 with the exception of
String#matchAll which is currently at Stage 3.
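
For anyone who hasn't tried them yet, a few of these in action (assuming an
engine that has shipped the ES2018 regex features):

    
    
        /foo.bar/s.test("foo\nbar");              // true, dotAll: `.` also matches newlines
        "$42 and $7".match(/(?<=\$)\d+/)[0];      // "42", lookbehind assertion
        "I like π".match(/\p{Script=Greek}/u)[0]; // "π", Unicode property escape
    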

------
fock
I hope you don't need such things anymore then:
[https://github.com/sindresorhus/shebang-
regex/blob/master/in...](https://github.com/sindresorhus/shebang-
regex/blob/master/index.js)

------
tobyhinloopen
I like the `matchAll` feature. I now use the awesome/awful (depending on who
you ask) hack using `.replace`:

    
    
        var matches = [];
        "12345678".replace(/\d/g, (m) => matches.push(m));
        console.log(matches);
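
With `matchAll` (once it ships), the same thing becomes roughly:

    
    
        var matches = [..."12345678".matchAll(/\d/g)].map((m) => m[0]);
        console.log(matches);
    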

------
SiVal
This was a status report on multiple, changing features as of two years ago.
The people who know best are probably busy building, but I can't help
wondering if anyone is able to provide the current status of any of these
things.

------
nik1aa5
Since this is from 2017, isn't the year missing in the title? When I initially
saw the submission, for a moment I thought all the struggles I had some weeks
ago had been solved... :-)

~~~
mathias
I mean, they _have_ been solved. With the exception of String#matchAll (which
is currently a Stage 3 proposal), all these features are part of ES2018 and
shipping in Chrome. Other browsers implement some of them already and are
working on shipping more.

You can view the implementation status of the various features here:
[https://kangax.github.io/compat-table/es2016plus/#test-
RegEx...](https://kangax.github.io/compat-table/es2016plus/#test-
RegExp_named_capture_groups)

------
hidiegomariani
finally named capture groups

~~~
dclowd9901
Finally lookbehind!!!

~~~
JadeNB
There's a difference, right, in that named capture groups are purely cosmetic,
whereas look-behind can dramatically slow down matching? (I dunno in any
theoretical sense, but that's the way it goes in Perl.)

~~~
DFHippie
Named captures means you don't have to count parentheses to figure out which
group you want. I don't know whether it qualifies as cosmetic, but it sure is
nice.
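
For example, with the ES2018 syntax:

    
    
        const m = "2018-06-27".match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
        m.groups.year;  // "2018", no counting of parentheses required
    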

~~~
JadeNB
> Named captures means you don't have to count parentheses to figure out which
> group you want. I don't know whether it qualifies as cosmetic, but it sure
> is nice.

Agreed! (As a regex-heavy Perl hacker, I loved the day that they entered the
language.) I didn't mean to minimise them; rather quite the opposite, to point
out that they gave a great return essentially for free (compared to regexes
that still have capturing groups, but without names), as opposed to look-
behind, which (I think) can slow down a match dramatically.

------
jancsika
Is there a linter that can scan ECMAScript regex constants and disallow any
that go beyond what is allowed in a regular language?

~~~
ridiculous_fish
Is the idea to avoid catastrophic backtracking? Unfortunately you aren't
immunized from catastrophic backtracking by limiting yourself to regular
languages. The catastrophe is a property of the implementation, not the
matched language.

------
pier25
This is cool, but I really wish the TC39 was more focused on solving _major_
problems with the language instead of small incremental updates.

~~~
interesthrow2
issues such as what?

~~~
pier25
For example, the lack of types.

------
shawn
_Another proposal specifies certain legacy RegExp features, such as the
RegExp.prototype.compile method and the static properties from RegExp.$1 to
RegExp.$9. Although these features are deprecated, unfortunately they cannot
be removed from the web platform without introducing compatibility issues.
Thus, standardizing their behavior and getting engines to align their
implementations is the best way forward. This proposal is important for web
compatibility._

Interesting view. Is this better than a "let it break" approach?

Link rot already claims N% of websites per year. I wonder if cleaning up APIs
like this one would increase N noticeably.

~~~
alangpierce
I think breaking JS tends to be much more destructive than dead links. At
least with dead links there's a fairly clear non-technical fix. With a web
standards break, maybe 8 years ago you hired a contractor to build a website
that uses a library that uses a library that uses a library that uses some JS
feature that is now broken, so your website is now completely broken. Some of
those libraries are on older unmaintained versions, where the only upgrade
would be through non-trivial breaking changes, or you would need to just find
alternate libraries. Getting things working again is a huge undertaking, not
just a matter of "don't use that weird JS feature anymore", and I think in
that situation it seems reasonable to blame the browser for the breaking
change.

It's also maybe more widespread than you'd think. Adding `global` seemed safe
for a long time, but ended up breaking both Flickr and Jira because they both
use a library that broke: [https://github.com/tc39/proposal-
global/issues/20](https://github.com/tc39/proposal-global/issues/20)

------
dclowd9901
‘matchAll’ looks great but why make it an iterator vs a ‘map’ style callback?
Just seems so arcane.

~~~
masklinn
> ‘matchAll’ looks great but why make it an iterator vs a ‘map’ style
> callback?

Because that makes it significantly easier to e.g. optionally collect into an
array, or to only partially iterate the sequence (e.g. only get the first 3
matches), which ranges from painful to impossible with JS-style callbacks. An
iterator is simply more flexible.
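
For example, taking just the first three matches without scanning the rest of
the string:

    
    
        const firstThree = [];
        for (const match of "a1b2c3d4e5".matchAll(/\d/g)) {
          firstThree.push(match[0]);
          if (firstThree.length === 3) break; // later matches are never computed
        }
        console.log(firstThree); // ["1", "2", "3"]
    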

