
Describing and Inventing a New Regular Expression Quantifier - robertelder
https://blog.robertelder.org/regular-expression-quantifiers/
======
jlokier
From the article:

> I will suggest, rather opinionatedly, that possessive quantifiers aren't
> particularly useful.

I will suggest the opposite, that possessive quantifiers are so useful they
should have been the default, as backtracking causes problems of performance
and accidental mismatching that you usually don't want.

Backtracking is a really useful invention. But it should not be used on every
term.

I default to using possessives more often than not these days. I've written a
lot of parsers over the years where regexes played a role, and this is what I
have found in practice.

Now, a non-possessive quantifier makes me stop and think about whether it
should be there, and what kind of problem it is hiding. Usually it doesn't
matter, especially when the regex is being used on "known" data, but
occasionally it's hiding a subtle parsing bug.

For example, consider any regex that matches whitespace-separated terms. For
the whitespace, which makes more sense, "\s+" or "\s++"?

The latter. Backtracking over whitespace is just a waste of time; it won't
change what is matched, and if it does, that's usually a bug in the rest of
the regex.

What about matching words in a parser of a typical programming language: does
it make more sense to write "\w+" or "\w++"? The latter again. You do not want
to accidentally treat "iffy" as though it were an "if" followed by "fy".

You can explicitly require a word boundary or a non-word character after the
word. But which is more straightforward (and faster) out of "\w++", "\w+\b"
and "\w+(?:\W|$)"?

------
eru
Note: the author is not describing regular expressions (i.e. those that
correspond to regular languages and nothing else), but seems to be describing
some mixed beast. See e.g. their mention of backtracking, which is an
implementation technique used in some implementations of regular expressions
(where it adds nothing), and also in many implementations of not-quite-regular
expressions, where it can make a real difference.

~~~
bawolff
Unix-style regular expressions are what most programmers mean by regular
expressions ("regular expressions" have been non-regular since long before
Perl, fyi; even old-style Unix "basic" regular expressions are irregular, and
I think they go back to the 80s), and they are much more common than the
concept in language theory that they are named after. I think it's pretty
obvious from context that the author is talking about the common concept and
not the obscure one.

~~~
compsciphd
I think (perhaps incorrectly) that POSIX regular expressions can be converted
to finite state machines (and in fact, that is how you make them run
efficiently); it is PCRE that cannot be fully converted to finite state
machines. (But I'm willing to be corrected on this!)

~~~
bawolff
/^\\(.*\\)\1$/

(Match all lines that are palindromes in POSIX BRE syntax) cannot be done with
a finite state machine.

~~~
throwaway_pdp09
That will match 'abab', for example, but a palindrome is not that: it reads
the same forward as backward, so 'abba' is a palindrome but your regex won't
match it.

I don't think a state machine can do it, as it needs to remember the captured
part, but I'm not sure. Fairly sure capture groups can't be done with FSMs.

~~~
lgeorget
The language of palindromes over a finite alphabet is described by a
context-free grammar; you need a pushdown automaton (i.e. a finite state
machine with a stack) to recognize it.

[https://en.wikipedia.org/wiki/Pushdown_automaton](https://en.wikipedia.org/wiki/Pushdown_automaton)

~~~
eru
To be more precise, you'd need to say that the palindrome language is not only
described by a context-free grammar, but that it is _not_ described by
anything simpler.

Multiple different grammars can define the same language. There are context
free grammars for regular languages.

~~~
throwaway_pdp09
Ah, now that makes more sense! The original comment threw me rather. Thanks.

A -> bAb where b is a terminal, ok.

------
lifthrasiir
Yeah, the possessive quantifier looks as if it simply disables backtracking,
but it also silently forces the matching to be greedy. It should really have
been orthogonal: `a+` is greedy and backtrackable, `a+?` is non-greedy and
backtrackable, `a++` is greedy and atomic, and `a+?+` is non-greedy and
atomic. It's not pretty, but multiple quantifiers were never pretty after all.

Or better, ditch the opportunistic use of otherwise invalid quantifier
sequences and stick to the atomic group. Vim got this almost correct [1]:

    
    
                  multi ~
             'magic' 'nomagic'  matches of the preceding atom ~
        
        |/\{|   \{n,m}  \{n,m}  n to m          as many as possible
                \{n}    \{n}    n               exactly
                \{n,}   \{n,}   at least n      as many as possible
                \{,m}   \{,m}   0 to m          as many as possible
                \{}     \{}     0 or more       as many as possible (same as *)
        
        |/\{-|  \{-n,m} \{-n,m} n to m          as few as possible
                \{-n}   \{-n}   n               exactly
                \{-n,}  \{-n,}  at least n      as few as possible
                \{-,m}  \{-,m}  0 to m          as few as possible
                \{-}    \{-}    0 or more       as few as possible
        
        |/\@>|  \@>     \@>     1, like matching a whole pattern
    

So a non-greedy and atomic pattern would be `\%(a\{-}\)\@>`. Unfortunately
you can't write `a\{-}\@>` and have to wrap `a\{-}` in a non-capturing group
though.

[1] [https://vimhelp.org/pattern.txt.html#pattern-overview](https://vimhelp.org/pattern.txt.html#pattern-overview)

------
wodenokoto
This one was pretty cool and new to me. I'm not sure whether I would have read
it the way it actually works, or as "any one, but only one, of these, one or
more times".

    
    
        (cat|dog|bird){1,2}
             will match any of the following strings:
    
        cat
        dog
        bird
        catcat
        catdog
        catbird
        dogcat
        dogdog
        dogbird
        birdcat
        birddog
        birdbird
    

Regardless, it's pretty neat!
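The enumeration above can be sanity-checked in Python; the pattern is from the comment, the harness is my own sketch:

```python
import re

words = ['cat', 'dog', 'bird']
pat = re.compile(r'(?:cat|dog|bird){1,2}')

# {1,2} allows one animal or any ordered pair: 3 + 3*3 = 12 strings.
combos = words + [a + b for a in words for b in words]
assert len(combos) == 12
assert all(pat.fullmatch(s) for s in combos)

# Three repetitions is too many for {1,2}.
assert pat.fullmatch('catdogbird') is None
```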

------
compsciphd
Story time about an interview experience I once had.

So I once interviewed at Google (well, I've interviewed at Google many times
and have never gotten past the hiring committee, but my resume looks decent,
so recruiters keep coming after me).

Anyway, in said interview, I was asked to program a relatively simple regex
matcher (characters and star, perhaps '.', I don't remember).

Anyway, I totally bombed that question (I'm a big believer that 45 minutes
isn't enough time to organize one's thoughts on said Q as phrased, as it gives
you nothing to build on, but I'll get back to that in a bit). This of course
threw me off for the rest of the interview day, because I tend to chew on
problems that I have trouble with, and tend not to let go of them quickly
until I have some better understanding of the issue (including, perhaps, my
limitations in being able to do anything about it).

So after the interview, as I tend to do, I did a bit of research and came
across this article by Kernighan:

[https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...](https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html)

which put everything into context beautifully.

Now, why do I think it's a terrible interview Q? Because, as Kernighan writes:

 _" Rob disappeared into his office, and at least as I remember it now,
appeared again in no more than an hour or two with the 30 lines of C code that
subsequently appeared in Chapter 9 of TPOP."_

If it would take Rob Pike an hour or two to come up with a nice solution, what
chance do we mere mortals have in the context of an interview, with people
staring at us?

With that said, I do think this serves as the basis for a good interview
question. Take a simple character-by-character string matcher like the one Rob
Pike wrote, and ask the interviewee to expand on it: how would they add
different functionality to it? 1) Could they add '.', 2) could they enable it
to match anywhere in the text, 3) could they add anchors, 4) and if really
good, could they add star support? You get to see how the interviewee thinks
about code and modifies code, and, being regex, it still provides a good basis
for thinking about tests. I hope people I interview find this to be a
tractable Q.
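For reference, the matcher Kernighan describes (literals, '.', '*', and the '^'/'$' anchors) is small enough to sketch from memory; this is a Python transliteration of the idea, not Pike's actual C code:

```python
# Minimal regex matcher in the spirit of Pike's ~30 lines of C:
# supports literal characters, '.', 'x*', and '^'/'$' anchors.

def match(regexp: str, text: str) -> bool:
    """Search for regexp anywhere in text."""
    if regexp.startswith('^'):
        return match_here(regexp[1:], text)
    # Try the pattern at every starting position, including the empty tail.
    while True:
        if match_here(regexp, text):
            return True
        if not text:
            return False
        text = text[1:]

def match_here(regexp: str, text: str) -> bool:
    """Match regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == '*':
        return match_star(regexp[0], regexp[2:], text)
    if regexp == '$':
        return text == ''
    if text and (regexp[0] == '.' or regexp[0] == text[0]):
        return match_here(regexp[1:], text[1:])
    return False

def match_star(c: str, regexp: str, text: str) -> bool:
    """Match c* followed by regexp at the beginning of text."""
    while True:
        if match_here(regexp, text):
            return True
        if not (text and (c == '.' or c == text[0])):
            return False
        text = text[1:]
```

For example, `match('a*b', 'xaaabz')` is true because the unanchored search slides along the text, while `match('^fy$', 'iffy')` is false.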

Getting back to my story: in order to prove to myself I wasn't an idiot (and
to learn some Java; I was mostly a C programmer in the Linux kernel at the
time), I decided to take Rob Pike's approach and try to implement as much of a
regex matching engine as I could,

which turned into
[https://github.com/sjpotter/regex](https://github.com/sjpotter/regex).

Then, when I needed to improve my Go programming skills (and explore possibly
bad ways of doing Go programming), I ported it to
[https://github.com/sjpotter/regex-go](https://github.com/sjpotter/regex-go)

Now, what's the point of this story in regard to the article? Implementing the
quantifiers gave me a better appreciation of how they work, both greedy and
not (and of their performance impact, especially with my naive implementation
of a matcher).

By following [https://www.regular-expressions.info/reference.html](https://www.regular-expressions.info/reference.html)
and adding as much as I could (including backreferences, lookaheads/behinds
and conditionals), it gave me a much better understanding of the power in
regular expressions (at least of the PCRE type; I'm not sure my old discrete
structures teacher would be happy calling all of these "regular expressions").

~~~
eru
If you are interested in 'proper' regular expressions, the approach via
derivatives is really neat. See
[https://en.wikipedia.org/wiki/Brzozowski_derivative](https://en.wikipedia.org/wiki/Brzozowski_derivative)
or [https://www.ccs.neu.edu/home/turon/re-deriv.pdf](https://www.ccs.neu.edu/home/turon/re-deriv.pdf)
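The derivative idea fits in a few dozen lines: the derivative of a language with respect to a character c is the set of suffixes w such that cw is in the language, and matching is just repeated derivation followed by a nullability check. A toy sketch over a minimal AST of my own devising (not code from either link):

```python
from dataclasses import dataclass

class Re:
    pass

@dataclass(frozen=True)
class Empty(Re):   # matches no string at all
    pass

@dataclass(frozen=True)
class Eps(Re):     # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Re):     # matches a single literal character
    c: str

@dataclass(frozen=True)
class Cat(Re):     # concatenation: a then b
    a: Re
    b: Re

@dataclass(frozen=True)
class Alt(Re):     # alternation: a or b
    a: Re
    b: Re

@dataclass(frozen=True)
class Star(Re):    # Kleene star: zero or more of a
    a: Re

def nullable(r: Re) -> bool:
    """Does r accept the empty string?"""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Cat):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    return False  # Empty and Chr

def deriv(r: Re, c: str) -> Re:
    """Brzozowski derivative of r by c: suffixes w such that c+w matches r."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Star):
        return Cat(deriv(r.a, c), r)
    # Cat: differentiate the head; if the head is nullable, c may also
    # begin inside the tail.
    d = Cat(deriv(r.a, c), r.b)
    return Alt(d, deriv(r.b, c)) if nullable(r.a) else d

def matches(r: Re, s: str) -> bool:
    """Match by repeated derivation, then check nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

For example, `(a|b)*c` written as `Cat(Star(Alt(Chr('a'), Chr('b'))), Chr('c'))` accepts 'abbac' and rejects 'ab'. No backtracking is involved, and the same construction can be used to build a DFA by treating distinct derivatives as states.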

As for hiring processes, have a look at the linked article and discussion at
[https://news.ycombinator.com/item?id=9159557](https://news.ycombinator.com/item?id=9159557)

