
The Regular Expression Visualizer, Simulator and Cross-Compiler Tool - robertelder
https://blog.robertelder.org/regular-expression-visualizer/
======
robertelder
Hi, I'm the author of this page. This tool has a lot of complexity in it and a
huge number of corner-cases. If you discover any bugs or XSS vulnerabilities,
feel free to let me know.

~~~
mlyle
I'm having trouble with backtracking and groups. It's possible I'm doing
something wrong, but

    
    
        [bc]*(cd)+
    

doesn't seem to match

    
    
        cbcdcd
    

Nor does the simpler

    
    
        a*(ab)
    

match

    
    
        aab
    

But

    
    
        a*ab
    

matches aab.

~~~
robertelder
You found a bug. The bug was: When I was computing the animation frames for
the visual, after a backtracking event, I forgot to restore the 'progress'
values for the PROGRESS nodes to what they were before the stack push took
place. I think I fixed this issue, and I just pushed an update which should
appear any second when the cache invalidates.
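
The shape of the fix, as a minimal sketch in Python (illustrative names only,
not the tool's actual code):

    
    
        import copy
    
        # Sketch: each choice point snapshots the per-node 'progress'
        # counters so a backtrack can restore them to exactly what they
        # were before the stack push.
        class Matcher:
            def __init__(self):
                self.progress = {}  # node id -> iteration count
                self.stack = []     # saved (pos, progress) choice points
    
            def push_choice_point(self, pos):
                self.stack.append((pos, copy.deepcopy(self.progress)))
    
            def backtrack(self):
                # Restoring 'pos' but forgetting 'progress' is exactly
                # the class of bug described above.
                pos, self.progress = self.stack.pop()
                return pos
    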

~~~
mlyle
Cool! Thanks for looking at it/fixing it.

P.S. The whole split() thing you do is an interesting hack to not have to
preserve/rewind so much state. :D I've never quite seen fork() used so...

------
hprotagonist
along these lines, [https://regex101.com](https://regex101.com) is
consistently a godsend for writing regex.

~~~
rtkwe
Yeah regex101 is probably the best one I've found so far. It allows you to
provide sample text and shows the analysis and output you'll get, including
captured groups.

------
jjice
I would have loved this when I worked on a regex engine a few years back. It
was before I took my CS theory course, so I had no idea what an automaton was.
Let me tell you, having a formal theory and a visualization in front of you
while working on something is extremely helpful.

This looks like it would be a great tool both for people looking to learn and
for those looking to optimize their regular expressions.

------
maddyboo
When I read the title, I half hoped ‘cross-compiler’ referred to translating
between different regex varieties. Anyone know of a tool that does this?

------
gfxgirl
Anyone written an AST-aware regex engine?

Examples: s/(?ast:comment)foo/bar/ to replace foo with bar, but only in
comments. Or

    
    
        s/(\w+): (?ast:openparen)(.*?)(?ast:closeparen)\s+(?ast:openbracket)(.*?)(?ast:closebracket)/function $1($2)\n{\n$3\n}\n/
    

to convert from

    
    
        foo: (args) {
          codeblock
        }
    

to

    
    
        function foo(args) {
          codeblock
        }
    

etc.?
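
For what it's worth, dropping the (hypothetical) ?ast: assertions, the purely
textual version of that rewrite is easy in Python; the problem is precisely
that it has no idea about strings or comments:

    
    
        import re
    
        src = 'foo: (args) {\n  codeblock\n}'
    
        # Non-AST-aware approximation: matches the 'name: (args) { body }'
        # shape purely textually, so it can also fire where it shouldn't.
        pat = re.compile(r'(\w+): \((.*?)\)\s+\{(.*?)\}', re.DOTALL)
        print(pat.sub(r'function \1(\2)\n{\3}', src))
        # function foo(args)
        # {
        #   codeblock
        # }
    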

~~~
chubot
semgrep will at least do the search portion, not sure about replace. Posted a
few days ago:

[https://news.ycombinator.com/item?id=23919313](https://news.ycombinator.com/item?id=23919313)

Slightly related:

 _How to Build Static Checking Systems Using Orders of Magnitude Less Code_

[https://web.stanford.edu/~mlfbrown/paper.pdf](https://web.stanford.edu/~mlfbrown/paper.pdf)

They use "microgrammars" and found that the technique is surprisingly
effective despite not being "correct".

------
twistedpair
Nice graphical explanations! It would be nice to see positive/negative
lookahead/lookbehind support, as they're pretty powerful.

The `\K` escape, which drops everything matched so far from the reported match
(e.g. Perl flavor), is also a very useful regexp feature absent here (e.g.
when testing `grep` commands).
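
For reference, a quick sketch of what lookaround gives you, using Python's re
(which has lookaround but not `\K`; PCRE-flavor engines have both):

    
    
        import re
    
        # Positive lookahead: match 'foo' only when 'bar' follows,
        # without consuming it.
        re.findall(r'foo(?=bar)', 'foobar foobaz')    # ['foo']
    
        # Positive lookbehind: digits only when preceded by '$'; this
        # covers many uses of \K, which resets the match start.
        re.findall(r'(?<=\$)\d+', 'costs $42 or 17')  # ['42']
    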

------
x87678r
Any suggestions to keep regexes simple? They often grow to huge lengths which
are hard to read and/or debug. It would be nice to keep them modular. Maybe
using websites like this is the only way.

~~~
chubot
In Python or Perl, you can use re.VERBOSE or /x to spread your regex across
multiple lines. That helps a little, but the syntax is still pretty obscure.
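
For example, in Python:

    
    
        import re
    
        # re.VERBOSE ignores unescaped whitespace in the pattern and
        # allows '#' comments, so the regex can be laid out and annotated.
        phone = re.compile(r"""
            (\d{3})    # area code
            [-.\s]?    # optional separator
            (\d{4})    # line number
        """, re.VERBOSE)
    
        phone.match('555-0199').groups()  # ('555', '0199')
    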

I created a new regular language syntax for Oil which looks more like a
grammar:

 _Eggex Example: Recognizing Python Integer Literals_
[http://www.oilshell.org/blog/2019/12/22.html](http://www.oilshell.org/blog/2019/12/22.html)

And the nice thing is that you can modularize and compose patterns (without
string concatenation). It is more like AST splicing of regexes. There are
details in the article but here's how it looks:

    
    
        DecDigit = / [0-9] /
        BinDigit = / [0-1] /
        OctDigit = / [0-7] /
        HexDigit = / [0-9 a-f A-F] /
    
        DecInt   = / [1-9] ('_'? DecDigit)* | '0'+ ('_'? '0')* /
        BinInt   = / '0' [b B] ('_'? BinDigit)+ /
        OctInt   = / '0' [o O] ('_'? OctDigit)+ /
        HexInt   = / '0' [x X] ('_'? HexDigit)+ /
    
        Integer  = / DecInt | BinInt | OctInt | HexInt /
    

This accepts 123, 0b10101, 0o755, and 0xff, but rejects non-integers.
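
For comparison, composing the same sub-patterns with plain re means splicing
strings together by hand (a sketch; note the non-capturing groups you have to
remember yourself):

    
    
        import re
    
        # Manual string splicing; operator precedence is on you, hence (?:...).
        DEC = r'(?:[1-9](?:_?[0-9])*|0+(?:_?0)*)'
        BIN = r'(?:0[bB](?:_?[01])+)'
        OCT = r'(?:0[oO](?:_?[0-7])+)'
        HEX = r'(?:0[xX](?:_?[0-9a-fA-F])+)'
        INTEGER = re.compile('%s|%s|%s|%s' % (DEC, BIN, OCT, HEX))
    
        for s in ['123', '0b10101', '0o755', '0xff']:
            assert INTEGER.fullmatch(s)
        assert not INTEGER.fullmatch('0b2')
    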

------
chubot
Note that when you use grep, awk, sed, libc regexec(), RE2 [1], or rust/regex,
this isn't how your regexes are executed.

They use automata-based engines whereas this is a backtracking engine, which
can blow up on small inputs. (Although I will say the C code here is cool,
because it gives you a fork() bomb, so it might exhaust your operating system
and not just your CPU!)
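
If you want to see the blowup, the classic pathological case fits in a couple
of lines of Python (the stdlib re module backtracks; an automata-based engine
answers the same query in linear time):

    
    
        import re, time
    
        # Each extra 'a' roughly doubles the work: (a+)+ can split the run
        # of a's exponentially many ways before the match finally fails.
        s = 'a' * 25 + 'c'
        start = time.time()
        re.match(r'(a+)+b', s)                      # no match
        print('%.2f seconds' % (time.time() - start))
    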

Canonical reference: _Regular Expression Matching Can Be Simple And Fast (but
is slow in Java, Perl, PHP, Python, Ruby, ...)_

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

Archive since it's down at the moment due to some App Engine stuff (!):
[https://web.archive.org/web/20200624195819/https://swtch.com...](https://web.archive.org/web/20200624195819/https://swtch.com/~rsc/regexp/regexp1.html)

My blog post and sample code that verifies that GNU grep, awk, sed, and libc
don't backtrack:

 _Comments on Eggex and Regular Languages_
[http://www.oilshell.org/blog/2020/07/eggex-
theory.html](http://www.oilshell.org/blog/2020/07/eggex-theory.html)

[https://github.com/oilshell/blog-code/tree/master/regular-
la...](https://github.com/oilshell/blog-code/tree/master/regular-languages)

[1] [https://github.com/google/re2](https://github.com/google/re2)

\----

Also, the production-quality way of compiling regular expressions to C code
looks like this [2], which is completely different from what's shown in the
article:

[https://www.oilshell.org/release/0.8.pre8/source-
code.wwz/_d...](https://www.oilshell.org/release/0.8.pre8/source-
code.wwz/_devbuild/gen/osh-lex.h)

The regex is converted to an NFA, then a DFA, then a bunch of switch/goto
statements in C that implement the DFA.

This example is huge, but the switch/goto and lack of any notion of
"threading" is the important point.

Essentially, the instruction pointer is used as the DFA state. It's a very
natural use of "goto" (where you don't write or debug the gotos yourself).
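
In a language without goto, the same compiled DFA collapses to a table-driven
loop. A hand-written sketch for /ab*c/ (not re2c output) shows the shape that
matters: one state variable, one pass, no backtracking:

    
    
        # DFA for ab*c: linear time, constant space.
        DFA = {
            0: {'a': 1},          # start: expect 'a'
            1: {'b': 1, 'c': 2},  # loop on 'b', finish on 'c'
        }
        ACCEPT = {2}
    
        def match(s):
            state = 0
            for ch in s:
                state = DFA.get(state, {}).get(ch)
                if state is None:
                    return False
            return state in ACCEPT
    
        assert match('abbbc') and match('ac')
        assert not match('ab') and not match('acx')
    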

[2] via [http://re2c.org/](http://re2c.org/) (not related to RE2)

~~~
robertelder
It's worth pointing out that the resource

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

linked above is also by the same author, Russ Cox,

as the reference I included at the end of the blog post:

[https://swtch.com/~rsc/regexp/regexp2.html](https://swtch.com/~rsc/regexp/regexp2.html)

All of the insights that I presented through this visualization tool are based
upon the knowledge found in Russ Cox's articles. Also, the link above on RE2:

[https://github.com/google/re2](https://github.com/google/re2)

is a project that was also (started by|heavily contributed to by) Russ Cox.
His writings on the topic of regular expressions are absolutely world-class,
and I have never found a better resource for understanding the deep low-level
theory of how they work in practice.

~~~
chubot
But Cox would dislike this formulation of regular expressions. His whole point
is that Perl-style regexes are not regular languages, and RE2 goes to great
lengths to work around this fact.

I'm taking up his arguments with some more traditional terminology that I
think may help:

[http://www.oilshell.org/blog/2020/07/eggex-
theory.html](http://www.oilshell.org/blog/2020/07/eggex-theory.html)

The entire O'Reilly book "Mastering Regular Expressions" by Friedl is based on
the "debugging regexes by backtracking" approach you present here. It talks
about "optimizing" regexes by reordering clauses and optimizing the
backtracking.

Russ Cox HATES THAT APPROACH and he wrote the entire RE2 project to advocate
that it NOT be used! You do not need to optimize, reorder clauses, debug, or
trace backtracking when you're using automata-based engines, because they run
in guaranteed linear time and constant space (with respect to the length of
the input). There is no backtracking.

(I was at Google when he wrote RE2 and used it shortly afterward on big data;
regular languages are indeed better than regexes for that kind of engineering.
The HyperScan work also makes this clear.)

Since his site is down I can't find the original comments, but here's a
twitter exchange from 2018 that backs this up:

[https://twitter.com/geofflangdale/status/1060313603268501504](https://twitter.com/geofflangdale/status/1060313603268501504)

[https://twitter.com/_rsc/status/1060702201159565313](https://twitter.com/_rsc/status/1060702201159565313)

Also, I believe your article would be improved if you simply take out this
sentence:

 _In computer science theory, this node type should remind you of the
difference between DFAs and NFAs._

It is treading close to the confusion that Cox is talking about. So I would
just remove it, add a proper explanation, or add a link to an authoritative
source.

\----

I would love to see a followup that goes through the rest of the article with
the Thompson VM and Pike VM approaches! :) I think the whole point is that you
don't have to backtrack to implement this VM, and this page confuses that
issue.

~~~
robertelder
The goal with this article was never to show the best or most efficient way to
match regular expressions, but rather to convey the correct interpretation of
what a given regular expression will do when you try to use it. Even
discussions of slow and simple backtracking regular expression matchers are a
pretty niche topic and likely too much for casual users of regular
expressions.

Yes, I was thinking about also allowing it to optionally cross-compile to the
more asymptotically efficient matching algorithm, also documented by Russ Cox.
Everything takes time though, and I figured it would make sense to put this
out there before investing too much time in something that possibly no one
would use.

~~~
chubot
Sure, but automata-based engines are also "correct" and used in common tools
like grep, awk, and sed.

It would be nice to make that clear in the article.

\-----

It's a matter of preference, but to me debugging by backtracking is a
pointless exercise. It's easier to reason about regexes as sets of strings,
composed by sequence and alternation.

Write unit tests that enumerate what you accept and reject, and build them up
by composition. There's no need for debugging or tracing.
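
E.g., a sketch of what I mean in Python (same idea in any language):

    
    
        import re
    
        # Illustrative pattern; grow the pattern and both lists together,
        # by composition, instead of tracing how a match executes.
        INTEGER = re.compile(r'0[xX][0-9a-fA-F]+|[0-9]+')
    
        for s in ['0', '123', '0xff']:    # must accept
            assert INTEGER.fullmatch(s), s
        for s in ['', 'x', '0x', '0xg']:  # must reject
            assert not INTEGER.fullmatch(s), s
    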

I understand that the imperative style is more natural to many programmers,
but I don't think it requires too much of a leap of thinking. And once you get
used to the other style, it saves effort overall (as well as bounding your
worst case running time).

~~~
ridiculous_fish
How does the "sets of strings" model express the difference between:

    
    
       /x*/
    

and

    
    
       /x*?/

~~~
chubot
Good question, and here's something I wrote a couple weeks ago to the author
of the Regex Lingua Franca paper [1]:

 _I admit I occasionally use non-greedy repetition in Python, but I think that
is the only construct I use that's not in the shell-ish engines. I believe
it's usually pretty easy to rewrite without non-greedy, but I haven't tested
that theory._

\----

OK so I started testing that theory. I collected all the times I used

    
    
        .*?
    

[https://github.com/oilshell/blog-code/blob/master/regular-
la...](https://github.com/oilshell/blog-code/blob/master/regular-
languages/survey.txt)

With a cursory inspection, most of them are unnecessary in an automata-based
"regular language" engine.

They are "nice to have" in a Perl-style regex engine for PERFORMANCE. But
automata-based engines don't have that performance problem. They don't
backtrack.

The distinction between .* and .*? doesn't affect whether it matches or not,
by definition. It does affect how much it backtracks. (And it affects where
the submatches are, which I need to test out.)
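
Concretely, in Python:

    
    
        import re
    
        # Same answer to "does it match" either way; only the extent of
        # the match (and hence the submatches) differs.
        re.match(r'x*',  'xxx').span()           # (0, 3)  greedy
        re.match(r'x*?', 'xxx').span()           # (0, 0)  non-greedy
        re.match(r'<(.*)>',  '<a><b>').group(1)  # 'a><b'
        re.match(r'<(.*?)>', '<a><b>').group(1)  # 'a'
    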

So I think most of my own usages ARE "nice to have" in Python, but unnecessary
in awk because it uses an automata-based engine.

But I'd like to look at it more carefully, and I'm interested in answering
this question more thoroughly (feel free to contact me about it, e-mail in
profile).

I started a demo here. Still needs work.

[https://github.com/oilshell/blog-code/blob/master/regular-
la...](https://github.com/oilshell/blog-code/blob/master/regular-
languages/greedy.sh)

\----

So here's another conjecture: regexes are bad because they force you to make
UNNECESSARY DISTINCTIONS in the name of performance.

You have to care about greedy vs. nongreedy, because there is backtracking.

In contrast, with regular languages, you just write what you mean, and they do
it quickly for you. You can't write a slow expression for a regular language.

There are some caveats around that, like submatch extraction with nongreedy
repetition, which I will test out. If anyone has specific examples or use
cases, send them my way.

I can't see where I really rely on nongreedy submatches, but I'll have to look
more closely. I think the nongreedy use case is usually just "skip quickly
until this other string appears", e.g. the one in pgen2/tokenize.py that I
didn't write.

\----

As far as I remember submatch extraction works fine in automata-based engines,
and Cox's articles are in a large part about the (then) little known
algorithms for doing it in automata-based engines. re2c also has a 2017 paper
for their algorithm to do it in DFAs:

[http://re2c.org/2017_trofimovich_tagged_deterministic_finite...](http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf)

\----

[1] thread:
[https://news.ycombinator.com/item?id=23687478](https://news.ycombinator.com/item?id=23687478)

~~~
ridiculous_fish
I don't agree it's primarily about performance. /".+"/ will find a quoted
string, but you will certainly prefer the non-greedy match instead.

Now you may torture the greedy regex to make it work (/"[^"]+"/), but the
point stands. "Did it match" is rarely enough; the regex author usually wishes
to know _where_ it matched, and to exercise control over which match is
preferred if there are multiple possibilities. It's indeed possible to do this
with DFAs but the theory is comparatively hard, as a cursory reading of your
linked paper will show!
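
To make the three variants concrete (Python):

    
    
        import re
    
        line = 'say "foo" then "bar"'
        re.search(r'".+"',    line).group()  # '"foo" then "bar"' (greedy overshoots)
        re.search(r'".+?"',   line).group()  # '"foo"' (non-greedy)
        re.search(r'"[^"]+"', line).group()  # '"foo"' (the "tortured" greedy form)
    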

~~~
chubot
Yeah that is a good example, and I recorded it here:

[https://github.com/oilshell/blog-code/blob/master/regular-
la...](https://github.com/oilshell/blog-code/blob/master/regular-
languages/greedy.sh#L44)

I don't mind the rewrite, but I'm not surprised that some would be
inconvenienced by it. (Though that is part of the reason for the Eggex
syntax.)

\----

On the other side I'll offer that I wrote a lexer for bash and all its
sublanguages without any nongreedy repetitions. It's dozens if not hundreds of
regexes.

[https://github.com/oilshell/oil/blob/master/frontend/lexer_d...](https://github.com/oilshell/oil/blob/master/frontend/lexer_def.py)

Though regular languages are more like a building block for the lexer than the
lexer itself. As mentioned elsewhere, they're too limited to express the
entire lexer, and too limited to express syntax highlighters (but so are
regexes! Both need some logic outside the model, e.g. for lexing shell
here-docs).

\-----

So I pose the question of whether there's a more powerful abstraction that's
still linear time and constant space:

[http://www.oilshell.org/blog/2020/07/ideas-
questions.html#mo...](http://www.oilshell.org/blog/2020/07/ideas-
questions.html#more-powerful-linear-time-constant-space-languages)

Of course there is, but it's a matter of picking one that will address a lot
of use cases effectively.

I think re2c has grown something like

    
    
        .*?
    

for example. grep needs something like that too. Often you want to be able to
skip input in a linear fashion, without invoking the pattern matching logic at
all.

Regexes are OK if you know how to use them, but evidently most people have
lots of problems with them (the necessity of OP's demo being an example of
that). I think there's something better out there built on top of regular
languages.
languages.

~~~
ridiculous_fish
You hit upon an important cleavage: features vs worst-case perf. A generous
lexer supports partial input, syntax highlighting, helpful error messages,
etc., while a strict one is not so nice, but guarantees protection against
malicious input. But the way you write these is very different! I also would
like to write one lexer that can operate in both modes, but with one syntax.

BTW, high five from one shell author to another :)

~~~
chubot
Yeah I've been looking at fish. Definitely has the cleanest code out of any
shell I've looked at :) (e.g. bash, dash, zsh, mksh)

And I came to realize what a huge effort it is on top of the language:

[http://www.oilshell.org/blog/2020/02/recap.html#fish-oil-
is-...](http://www.oilshell.org/blog/2020/02/recap.html#fish-oil-is-a-good-
idea-link)

Also I wonder if you knew of the existence of ble.sh, a fish-like UI written
in pure bash! I think it's the biggest shell program in the world:

[http://www.oilshell.org/cross-
ref.html?tag=ble.sh#ble.sh](http://www.oilshell.org/cross-
ref.html?tag=ble.sh#ble.sh)

[https://github.com/oilshell/oil/wiki/The-Biggest-Shell-
Progr...](https://github.com/oilshell/oil/wiki/The-Biggest-Shell-Programs-in-
the-World)

\----

I did add a little hack in the lexer for autocompletion of partial input. But
my hope is to keep most of that logic in the parser and not the lexer. Though
it is true I was thinking of breaking up the exact regex we talked about for
similar reasons, e.g. a single token:

    
    
        ' [^']* '
    

vs. 3 tokens

    
    
        '
        [^']*
        '
    
    

I think the latter could be better for completion of paths in quotes, which
bash does, but I haven't revisited it yet.
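
A toy illustration of why the three-token scheme is friendlier to partial
input (hypothetical, not Oil's actual lexer):

    
    
        import re
    
        # Quote marks and runs of non-quote characters are separate tokens,
        # so an unterminated quote still tokenizes cleanly and a completer
        # can see the partial path inside it. The single-token rule
        # ' [^']* ' would simply fail to produce a token here.
        def lex(s):
            return re.findall(r"'|[^']+", s)
    
        lex("echo '/usr/lo")  # ['echo ', "'", '/usr/lo']
        lex("'/usr' x")       # ["'", '/usr', "'", ' x']
    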

------
vmchale
neato bandito

