Aside, but I'd love for someone to come up with an alternative syntax for regexes that sacrifices speed-of-typing for ease-of-reading. Regexes are obviously still fantastically useful for things like command-line scripting and jumping around in text editors, but the fact that people still reach for regexes within programs that are meant to receive long-term maintenance tells me that we're missing a tool in our toolbox.
People have a number of good suggestions, but let me suggest unit tests as one excellent way to make regexes maintainable.
The maintenance problem for me with regular expressions is that when I write them I have carefully studied a variety of inputs and then make something that matches. Typically I will also have tested it as I go, trying out various known strings. But then a year later I have forgotten all the cases, and I have to somehow decompress them from the regex.
If I save my experiments in well-named unit tests, though, it's much easier for me to figure out my original intent, and to see whether the new case I'm thinking about is covered.
In OCaml, most people use the re[1] library for regular expressions (example[2]). I wrote a library called tyre[3] for typed extraction that follows a similar API.
Of course these APIs are much more verbose, but they are also very regular(hah!): Regular operators are normal functions of the language, typechecking and completion works, etc.
This was one of the places CoffeeScript really innovated, IMO (though of course Perl did it first). The "Block Regular Expressions"[0] allow whitespace and comments:
CL-PPCRE turns a PCRE string into a parse tree, as a native data structure, and the programmer has full access to this interface. For example, the sample regex:
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u
could be specified in your program as something like:
(Whether this is easier to read depends on your relative familiarity with CL and PCRE. It's a lot easier to generate or manipulate with native CL functions, though, if you ever need to do that.)
Most HLLs today wrap an existing C regular expression library, and I don't know any C regular expression library that provides a public interface to its parse tree, so it's unlikely that other languages will be able to do something similar without a lot of work.
Of course, regular-expressions-as-strings are still strings, so if you only need to write them, you can get most of the benefit by using your language's native string facilities: https://news.ycombinator.com/item?id=241373
> Most HLLs today wrap an existing C regular expression library, and I don't know any C regular expression library that provides a public interface to its parse tree, so it's unlikely that other languages will be able to do something similar without a lot of work.
I don't think I know of one either. But Go's regexp library provides access to the syntax[1], and so does rust/regex[2]. In the case of [2], it provides both an AST and a high level IR for regexes. It's not as convenience to build expressions as in your Lisp example, though, there's nothing stopping someone from building such a convenience. :-)
Yup! There's performance reasons not to write a regular expression engine in a language like Python or Ruby. I'm not surprised that other natively compiled languages like Go and Rust are the ones that come closest.
Perl has always let you split regexs across multiple lines, with freeform formatting and even comments. Absolute godsend for those (rare?!) times your regex gets a bit too complex for its own good...
As someone who never found regex difficult to understand, and observed how some others are the same while others seem to view any moderately long regex with a look of shock and horror, it appears to largely be a matter of attitude and perception --- the ones in the former group have, for lack of a better term, an "APL mindset" to learning. That is, they're not afraid of the fact that something looks completely foreign and incomprehensible at first, and realise that it's just another language to be learned.
The main issue I think is that they get convoluted...fast. Combined with varying dialects, the issue that they’re usually write-once, update rarely, and all regexes are imperfect and will fail in unexpected fashions, and that they usually don’t appear that many times in any given codebase, means that they’re never really worth the effort of properly memorizing. Understanding is fine, its not that complex and each operator is relatively simple conceptually, but I have to re-lookup the regex syntax anytime I’m writing/reading any regex that goes beyond pure regex (*+.|)
Character classes beyond \w, d and b always require a doc-lookup and I always misremember the look{ahead,behind} operators.
The issue is that its too small to remember fully, makes no effort to be self-documenting, the docs are always annoying (I can never get a ^F search to not return 20 items), and theres always an uncovered edge case somehow
Rosie (https://gitlab.com/rosie-pattern-language/rosie) is an interesting approach to this problem, and I think PEGs in general would be better for reading. My memory is a bit foggy on the details, but I do recall that PEGs have some problems parsing certain inputs performance wise, though I could be mistaken.
Wolfram Language / Mathematica has “string patterns” which extend the built in pattern language to character-based matching. Within the context of Mathematica is a quite readable and familiar syntax.
The terseness of regex isn't about speed-of-typing but ease-of-reading. Once you get used to the syntax, you can pattern match in your head pretty easily (for any non-insane use of regex).
Pattern matching with a verbose syntax that is "easy to ready" has been done hundreds of times but it rarely translates into "easy to read" for anyone familiar with regular expressions.
Most regexp libraries let you use named groups and comments which let you write more maintainable expressions. The fact javascript does not may explain why so many people don't know how to use them.
Javascript regexes are "getting better" feature-wise, but performance-wise not much have improved. The the a?a?a? example regex in Ross Cox article "Regular Expression Matching Can Be Simple And Fast" from 2007 is still getting exponentially slower and slower with the number of "a?"-repetitions (https://swtch.com/~rsc/regexp/regexp1.html)
for (let i=20; i<30; i++) {
let str = "a".repeat(i);
let regex = new RegExp("a?".repeat(i) + "a".repeat(i));
let t0 = performance.now();
str.match(regex);
let t1 = performance.now();
console.log(i + " took " + Math.round(t1 - t0) + " milliseconds.");
}
20 took 7 milliseconds.
21 took 14 milliseconds.
22 took 27 milliseconds.
23 took 52 milliseconds.
24 took 102 milliseconds.
25 took 224 milliseconds.
26 took 421 milliseconds.
27 took 814 milliseconds.
28 took 1604 milliseconds.
29 took 3470 milliseconds.
Tested today on latest Chrome on a MacBook Pro 2,6GHz Intel Core i7.
Here comes an idea on what could be done to improve performance for regexes that are actually regular, and to still keep support for non-regular advanced regexes that e.g. contain back-references:
Add a step immediately after parsing the regex, check if it is regular and can be handled by an engine like RE2, otherwise handle it using the normal regex engine.
An even more exotic idea would be to let multiple regex engines execute in parallell, and return the result for the one that finishes first.
I can't source this, but IIRC, Chrome devs have tried something like this, but couldn't get the match semantics in the RE2 case to exactly match existing Javascript regex match semantics. I heard about this long ago from a friend, and I don't remember the details. But it's quite plausible.
In general, "obvious" optimizations like this are almost always harder than they seem.
One possible path would be to precisely characterize regexes on which both FSMs and backtrackers agree precisely, and of only those, use the FSM based approach. But still, even then, you're maintaining two regex engines instead of one, and both need to be production grade and very fast. It is no easy task.
You can also implement linear-time regex matching in user code with webassembly. In my tests it was faster than native regex matching (https://jasonhpriestley.com/regex-dfa)
As someone familiar with JS engine regex internals, those results are totally shocking to me. How can I try your example? In particular I wasn't able to figure out which regexes you are testing with.
I used long strings of repeated "abcdabcd" as the test strings.
It's possible I made a mistake somewhere, and I can put together an open-source repo with the test setup when I get the time. But I'm curious why you find the results shocking?
The results are shocking because the native regular expression engines have thousands of man hours poured into them, with specialized JITs, exotic optimizations, etc. And your page is reporting the native one at 10x slower!
One thing I spotted is that you're asking the native engines to record capture groups, but your wasm implementation doesn't support those. For fairness you might switch to non-capturing groups: `(?:bc)` instead of `(bc)`. However this cannot explain the magnitude of the difference.
I dug into it some more, reducing it to this case:
/^(?:abc*d)*$/
what happens is that the backtracking engine has to be prepared to backtrack into every iteration of the outermost loop. This means that the backtracking stack grows as the length of the input; the engine spends most of its time allocating memory! These nested loops definitely make the NFA approach look good.
Regardless it's a cool project, thanks for sharing!
There is nothing shocking here, parent implements strictly regular expressions and compile them to DFA, so of course it will be fast, especially using only ASCII characters and hand chosen regular expressions. Russ Cox articles covers this very well.
> Add a step immediately after parsing the regex, check if it is regular and can be handled by an engine like RE2, otherwise handle it using the normal regex engine.
This would be a bad idea. For one thing, it would be slower on non-pathological cases, since backtracking is faster in practice. But more importantly, devs who test on Chrome might unwittingly create regexes that would run exponentially slower in other browsers.
The real answer is that it's complicated. Backtrackers can certainly be faster in some cases, but not all. Moreover, as someone who has worked on regex engines for a while now, it's not clear to me how exactly to characterize performance differences (outside of pathological cases or cases that otherwise invoke catastrophic backtracking), so my guess is that it's probably mostly up to implementation quality and the optimizations that are implemented.
It's a strange property to want. It's not a property of most other parts of programming languages. You can write loops or function calls in any language that take exponential time. So why demand that regexps be immune to it?
It would make sense to worry about it if it were a common mistake, but I've never seen a gratuitously exponential regexp in the wild.
One of the key points here is that it's not always obvious from looking at a regex whether it will exhibit catastrophic backtracking (at least, to an untrained eye). Even more insidious, catastrophic backtracking may only happen on certain inputs, which means that tests won't necessarily track it either. Therefore, desiring a performance guarantee here is certainly not "strange."
The bottom line is that the various implementation strategies for regex engines have trade offs. On Internet forums, it's popular to "proclaim" for one side or the other. The backtracking advocates like to believe that catastrophic cases never happen or are easily avoided and the linear time advocates like to believe that so long as matching doesn't take exponential time, then it will be fast enough. :-)
Because most of the time you don't need the features that demand backtracking, and it's simple to use a NFA approach that avoids those kinds of issues.
Besides, people make mistakes all the time. Why not prevent the possibility?
It's funny to see that in almost all langues the regular expression engines are not actually regular. This results in developers regularly (pun maybe intended) shooting themselves in the foot with overly clever expressions.
Thanks for the mention! I just want to expand on some things you said:
- Rust's regex library was greatly inspired by RE2, which is a C++ library that also executes regexes in linear time.
- Go's regex library also runs in linear time, however, it is still missing some optimizations that can make it run more slowly on non-pathological regexes when compared to RE2 and Rust's regex crate.
- You don't need to use fancy features with a backtracking regex engine in order to shoot yourself in the foot. e.g.,
>>> import re
>>> re.search('(a*)*c', 'a' * 30)
- Even with linear time regex engines, you can get big slow downs. You'll never get exponential (in the size of the text) slow downs of course, but regexes like `[01]*1[01]{20}$`[1] can generate large finite state machines, which can be problematic in either memory or match speed, depending on the implementation.
Thank you for all your hard work. I've been using Rust for two years now and have invoked your code probably a hundred times over.
I always feel so spoiled with Rust because it seems every core "functionality" library (serde, uuid, rand, chrono, soon to be futures I hope, and innumerable more) is the best implementation of that concept in any language ever so far.
Its go weird to go from working on decade old PHP or C++ code where I constantly curse the developers for undefined or illogical behavior to Rust where I rarely ever have a hiccup - if I do, the crazy powerful documentation engine often solves it in seconds - if that doesn't work, the lints, compiler, and more and more the RLS point me in the right direction, and finally in the worst case I go actually look at the crate source and discover how well thought out the developer made things.
Its legitimate programming magic and I'm worried I'm developing a psychological dependence on Rust where in my next job if I can't get built in backtraces from my errors like Failure I might jump out a window. So thank you again (and the rest of the Rust community in general) so much for making me miserable whenever I think about having to write regular expressions (or code, in general) with any other language (except Python, even with the warts its still a sweetheart who means well, I can't stay mad at it).
Does the linear time hold up in the face of capture groups? For example, say you have a bunch of capture groups in a loop:
/((a)|(b)|(c)|(d))*/
If the string has length N, the loop iterates N times, and each iteration must clear capture groups proportional to the length of the regex. So in this case the time varies as the product of the input and regex length, not their sum independently.
Yes, it does. Capture groups aren't special here. The issue is a semantic quibble. Namely, when folks say, "a regex engine that guarantees matching in linear time," what they actually mean is, "a regex engine that guarantees matching in linear time with respect to the input searched where the regex itself is treated as a constant." If you don't treat the regex as a constant, then the time complexity can vary quite a bit depending on the implementation strategy.
For example, if you do a Thompson NFA simulation (or, more practically, a Pike VM), then the time complexity is going to be O(mn), where m ~ len(regex) and n ~ len(input), regardless of capturing groups.
As another example, if you compile the regex to a DFA before matching, then the time complexity is going to be O(n) since every byte of input results in executing a small constant number of instructions, regardless of the size of the regex. However, DFAs typically don't handle capturing groups (although they certainly can handle grouping), with the notable exception of Laurikari's Tagged DFAs, but I don't know off-hand if the time complexity of O(n) usually associated with a DFA carries over to Tagged DFAs. Of course, the principal downside of building a DFA is that it can use exponential (in the size of the regex) memory. This is why GNU grep, rust/regex and RE2 use a hybrid approach ("lazy DFA"), which avoids O(2^n) space, but falls back to O(mn) matching when the DFA would otherwise exceed some memory budget during matching.
> when folks say, "a regex engine that guarantees matching in linear time," what they actually mean is, "a regex engine that guarantees matching in linear time with respect to the input searched where the regex itself is treated as a constant."
Well the Rust docs say "all searches execute in linear time with respect to the size of the regular expression and search text." Their engine compiles to a DFA, not an NFA or PikeVM; I suppose this is the basis for their claim.
> As another example, if you compile the regex to a DFA before matching, then the time complexity is going to be O(n) since every byte of input results in executing a small constant number of instructions, regardless of the size of the regex. However, DFAs typically don't handle capturing groups
Now you have arrived at my question! Rust compiles to a DFA that supports capture groups. My question is whether capture groups ruin the linearity of the DFA matching.
Thanks for the Laurikari's Tagged DFAs reference, I hadn't heard of that. I'll check it out!
> Well the Rust docs say "all searches execute in linear time with respect to the size of the regular expression and search text."
Yeah, that's ambiguous phrasing on my part. I meant that it was linear time with respect to both the size of the regex and the search text.
> Their engine compiles to a DFA, not an NFA or PikeVM; I suppose this is the basis for their claim.
No, it doesn't. rust/regex uses some combination of the Pike VM, (bounded) backtracking and a lazy DFA. It will compile a DFA ahead of time in some cases where Aho-Corasick can be used.
> Rust compiles to a DFA that supports capture groups.
No, it uses a lazy DFA to answer "where does it match," but it still must use either the Pike VM or the bounded backtracker to resolve capture groups.
> My question is whether capture groups ruin the linearity of the DFA matching.
Yeah I think I would probably look at Tagged DFAs to answer this. You'll want to check out recent papers that cite Laurikari's work, since I think there have been some developments!
There's no way to reconstruct the captures after the match completes. You have to record them as you go in case you successfully match, or run the whole regex again.
Finding the match can use the lazy DFA, which is roughly one or two orders of magnitude faster than the Pike VM, and similarly faster than bounded backtracking. Typically, the span of a match is much much smaller than the entire input you're searching, so it makes sense to find that span as quickly as possible. When it comes time to run the Pike VM (or the bounded backtracker), you can run the regex on just the span of the match, which will be much better than running it on the entire input to find the match if the input is large.
This strategy doesn't always make sense, and it can be better to just run the NFA right away if the input is small.
RE2 is written in C++. Go's regex library is written in pure Go. Russ Cox wrote both of them. Both RE2 and Go's regex library have a Pike VM, a bitstate backtracker, and a one-pass NFA, but RE2 has a lazy DFA which Go lacks. The lazy DFA is a key component for performance in a lot of cases.
I don't know specifically, but I'd guess the former over the latter. Adding a lazy DFA is a lot of work. There is an issue tracking the addition of a lazy DFA on the golang tracker. My guess is that it's up to a contributor to do it.
But backrefs are super useful though. I think it was a mistake to conflate what developers actually want, a DSL to detect the shape of data and extract portions of it, and the implementation detail of basing such a thing on regular languages.
They are useful, but also a bit dangerous. Case in point, a few years ago I foolishly thought that I'd "just take a few minutes to fix the Django URLValidator". I picked up where someone else had failed a year previously (that should have warned me).
After a lot of time I finally got the test suite to pass and was happy, naïvely thinking that "if it passes the tests it must be correct". Unfortunately I also integrated a nice case of catastrophic backtracking into the regex that timgraham fortunately caught. This could have resulted in DoS-attacks against web forms that contain validated URL fields. (This is especially nice when doing it against non-asynchronous Python servers.)
Well, most commonly used features like back-references allow to define not-regular language, so there is nothing funny that most regular expression engines are not "regular".
I like to use term regex for "regular" expressions implemented in most languages, by PCRE engine or in Perl and term regular expressions for actual regular expressions as defined in theoretical computer science, that is expressions which can be recognised by finite (either deterministic or non-deterministic) automata.
>Well, most commonly used features like back-references allow to define not-regular language, so there is nothing funny that most regular expression engines are not "regular".
The funny thing is that they're still called regular.
Hey, I’m the author of the article. Despite its age, it’s still accurate and up-to-date.
> these features aren't entirely new
Depends on what you mean by that. These features are still not universally supported by all modern browsers, for example, so I can imagine they’re still new to a lot of developers.
Chrome supports all these features (except for String#matchAll, which is currently at Stage 3). Other browsers don’t yet support the full set, but they’re all working on getting there.
I make sure to keep the article up-to-date. It correctly states the status for each of the features. They’re all part of ES2018 with the exception of String#matchAll which is currently at Stage 3.
This was a status report on multiple, changing features as of two years ago. The people who know best are probably busy building, but I can't help wondering if anyone is able to provide the current status of any of these things.
Since this is from 2017, isn't the year missing in the title? When I initially saw the submission, for a moment I thought all the struggles I had some weeks ago had been solved... :-)
I mean, they have been solved. With the exception of String#matchAll (which is currently a Stage 3 proposal), all these features are part of ES2018 and shipping in Chrome. Other browsers implement some of them already and are working on shipping more.
There's a difference, right, in that named capture groups are purely cosmetic, whereas look-behind can dramatically slow down matching? (I dunno in any theoretical sense, but that's the way it goes in Perl.)
Named captures means you don't have to count parentheses to figure out which group you want. I don't know whether it qualifies as cosmetic, but it sure is nice.
> Named captures means you don't have to count parentheses to figure out which group you want. I don't know whether it qualifies as cosmetic, but it sure is nice.
Agreed! (As a regex-heavy Perl hacker, I loved the day that they entered the language.) I didn't mean to minimise them; rather quite the opposite, to point out that they gave a great return essentially for free (compared to regexes that still have capturing groups, but without names), as opposed to look-behind, which (I think) can slow down a match dramatically.
Is the idea to avoid catastrophic backtracking? Unfortunately you aren't immunized from catastrophic backtracking by limiting yourself to regular languages. The catastrophe is a property of the implementation, not the matched language.
Another proposal specifies certain legacy RegExp features, such as the RegExp.prototype.compile method and the static properties from RegExp.$1 to RegExp.$9. Although these features are deprecated, unfortunately they cannot be removed from the web platform without introducing compatibility issues. Thus, standardizing their behavior and getting engines to align their implementations is the best way forward. This proposal is important for web compatibility.
Interesting view. Is this better than a "let it break" approach?
Link rot already claims N% of websites per year. I wonder if cleaning up APIs like this one would increase N noticeably.
I think breaking JS tends to be much more destructive than dead links. At least with dead links there's a fairly clear non-technical fix. With a web standards break, maybe 8 years ago you hired a contractor to build a website that uses a library that uses a library that uses a library that uses some JS feature that is now broken, so your website is now completely broken. Some of those libraries are on older unmaintained versions, where the only upgrade would be through non-trivial breaking changes, or you would need to just find alternate libraries. Getting things working again is a huge undertaking, not just a matter of "don't use that weird JS feature anymore", and I think in that situation it seems reasonable to blame the browser for the breaking change.
It's also maybe more widespread than you'd think. Adding `global` seemed safe for a long time, but ended up breaking both Flickr and Jira because they both use a library that broke: https://github.com/tc39/proposal-global/issues/20
> Interesting view. Is this better than a "let it break" approach?
Or perhaps a branch and let the old stagnate approach?
Strip the deprecated features to create RegEx2/RegExNG/whatever[1] and build the new features on that. Old code can keep using the old version as they always have, new code can use all the nice new shinys, and the new version doesn't have to worry about backwards compatibility with the broken parts of dark age UAs. Also make sure everything in the new spec can be polyfilled for cases when new code needs to work on those older UAs.
[1] or while we are branching off, why not answer the complaint of regexs usually not actually being regular (as per the mathematical concept of regular languages) and rename them completely? Maybe SearchExpressions? Or SearchExp if you want something smaller. Or SExp if you want even shorter and don't mind attracting bad puns.
ECMA is pretty bad about this. For example, in ES5 the RegExp prototype is a regexp. In ES6, it became an ordinary object, which broke some stuff. In ES8 they kept it as an object, but instead changed a bunch of regexp methods to require them to specially check for the regexp prototype object. This churn is pointless and baffling.
> ‘matchAll’ looks great but why make it an iterator vs a ‘map’ style callback?
Because that makes it significantly easier to e.g. optionally collect into an array, or to only partially iterate the sequence (e.g. only get the first 3 matches) which is painful to impossible using JS's callbacks. An iterator is simply more flexible.
For what it’s worth Array.from() [0] lets you pass a map style callback as its second argument, so Array.from(yourStr.matchAll(yourRegex), yourMapFn) works.