Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Regular expression compilation visualized (compiler.org)
219 points by JoelJacobson 27 days ago | hide | past | favorite | 38 comments



This is an amazing interactive learning tool. Kudos! I vaguely remember from a "computer science modelling" course at uni, learning how to do this process by hand, from the textbook "Introduction To Automata Theory Languages , and Computation". Wish we had had this tool back then.


Thanks for the kind words. I made it for fun during a compiler course held by professor Christian Schulte at the Royal Institute of Technology here in Sweden. My original plan was to make similar visualizations for all the different algorithms in Christian's compiler course, but tragically he passed away recently.

Christian is the main designer and developer of many great open source projects, including Unison, Gecode and Mozart. https://chschulte.github.io/software.html

R.I.P. Christian, you were an amazing professor and person. I miss you.


I would like to echo @allanrbo kind words I to think it's a great way to understand regular expressions and their various implementations. I was a little confused about how to change the regex but I'm often confused so it's no reflection on you! I think Christian would have been proud to see what you've done with his ideas.


I feel your loss about Christian, I just realized I have used his software a ton of times. At least you got to meet him!


I wonder what the state of the art is in regex compilation? It feels to me like it shouldn’t be compilation to a program that processes one character at a time as you’ll be doing a lot of branches and so can’t process the data that fast. It feels like there ought to be a way to take advantage of ILP or SIMD operations. Maybe regex just doesn’t matter enough for that to be important.


Author here. The generated LLVM IR at the bottom actually make uses of vector instructions, which will make LLVM generate SIMD instructions if supported by the CPU.

For instance, try entering just a long word as the regex, e.g. "helloworld":

  %state0.goto.1.0.cmp_mask = icmp eq <10 x i8> %state0.rhs, < i8 104, i8 101, i8 108, i8 108, i8 111, i8 119, i8 111, i8 114, i8 108, i8 100 > ; helloworld


> It feels to me like it shouldn’t be compilation to a program that processes one character at a time

High performance regex engines do use SIMD in various places. Others have mentioned Hyperscan, which is undoubtedly a showcase of the most sophisticated SIMD techniques when it comes to regex and substring searching. But even something like RE2 uses vectorized code in places, although indirectly.

But, in all instances I can think of, using SIMD is a matter of identifying some optimization opportunity in regex matching that doesn't generalize to handling all cases. At some point, to handle some cases, you'll need that "one character at a time" loop. (Whether it's backtracking or FSM based.) So visualizations like these are still rather helpful, and at the very least, give a basic conceptual understanding of what the most general kind of FSM-based regex might look like internally. (Although, most general purpose regex engines don't actually utilize all of the transformations presented in the OP.)


https://www.microsoft.com/en-us/research/wp-content/uploads/... is an interesting read on ILP with FSM's.

Regex is used all over the place, so it's certainly worthwhile to speed things up. And huge FST's are also used in language tech, e.g. to analyze all frillions of the forms of Finnish verbs into all their ambiguous readings with tags and dictionary forms: https://beta.apertium.org/index.eng.html?choice=fin&qA=Taite... The language supported by that FST is basically infinite due to derivations and compounding, but the zipped binary fits on a handful of floppies.


Many high-performance regex libraries use SIMD these days. Hyperscan might be the most prominent example of that, being made by Intel and all.


This is really cool, im just really surprised that you can afford that domain and just put one thing on it :D


Heh, the idea was to put more things on it, would be cool to visualize more algorithms. Too bad ReasonML became such a mess, I really liked the coding experience in it. Now I would probably need to look for alternatives before implementing more algorithms.

I actually bought compiler.org to get a nice personal email address after selling a company I co-founded, but I didn't realize how difficult it is to spell "compiler" for non-geeks. Is it with a "k" or "c" and "m" or "n"? Bummer. I've bought a new one now which is hopefully easier to spell. :)


I do not program with ReasonML, but have looked it few times as it was touted as replacement for JS/TS/React. What made it a mess?



This led me to look into sexpr based regular expressions, and indeed there are some. SRFI 115 seems pretty neat example:

https://srfi.schemers.org/srfi-115/srfi-115.html



I guess this might be hard, but it would be pretty cool to be able to "stop" the optimizations at some point and see the resulting code.

Say "give me the generated code for the Rabin–Scott powerset construction (NFA to DFA)". That way, one could see the impact of each optimization step.

Nevertheless, this is a pretty cool proeject!


Great work! I did something similar, albeit with less features, for one of my uni classes. If anyone is curious about Rust code: https://github.com/miedzinski/regex.


Regex101.com is the super-lite version of this, also worth a look if you haven’t tried it


This is so very cool.

btw, Does anyone know if there is a regex standard (syntax and semantics) that is truly cross-platform, and by cross-platform I mean having an implementation in all the major languages (Javascript, Rust, Python, etc?)


What are the limitations of regular expressions compiled in this way, compared to things like Python’s `re` module? Is it just that they can’t support backreferences?


Too many to enumerate, already the utter lack of Unicode is a killer. The tool supports only the academic interpretation of regular expressions that are strictly equivalent to some NFA/state machine, which is useless because in the last forty years we use programming languages and libraries with extended abilities to deal with real world problems.

Some programmers have adopted to vernacular "regular expression" for the former and "regex" for the latter for easier distinction, see quote in http://enwp.org/Regexen#Patterns_for_non-regular_languages


Real regular expressions are hardly useless, I speculate that majority of real world uses of regexp are actually regular. Many of state of art regex engines (re2, hyperscan, rust regex) support only regular expressions


> I speculate that majority of real world uses of regexp are actually regular

I just surveyed a corpus of regexes with a crude static analysis tool and only 4% fit that restriction, I believe the result to be accurate within the order of magnitude. It makes sense: non-regular features are widely available, and thus people use them.

> state of art regex engines (re2, hyperscan, rust regex)

These are a clear regression from the actual state of art that's in use everywhere. (I know that re2's reason for being is precisely to have less features.) The advent of Perl (and related, libpcre) has utterly obliterated the competition at that time, and newcomers were not able to wrest their crown.


PCRE is hardly used "everywhere". Vim switched from backtracking engine to dfa one. Emacs has it's own non-backtracking engine. As does GNU grep. Ripgrep notoriously is built around rust-regex. VSCode uses ripgrep for searches. Google obviously uses its own re2 in its products. Hyperscan has some case studies of its use:

https://www.hyperscan.io/2016/01/21/rspamd-1-1-released-hype...

https://www.hyperscan.io/2020/09/28/optimize-azure-cloud-sec...

https://www.hyperscan.io/2018/10/19/hyperscan-adopted-by-git...

Even PCRE itself has a dfa mode. Because people want and use that. It's not just some academic navelgazing.

I'm not saying that non-regular engines, pcre in the forefront, are not popular. They are. But at the same time they are not be-all end-all for regex, regular engines still see lot of use especially in performance sensitive applications.


> PCRE is hardly used "everywhere".

The topic under discussion was extensions that make regex non-regular (as popularised by Perl and libpcre), not PCRE per se. Per this site's rules, I assume good faith and that you simply misunderstood me and did not deliberately put up and topple this straw-man.

Adoption of non-regular extensions is overwhelmingly larger than adoption of the opposite.

1. These non-regular extensions can be found in Java/Kotlin/Scala/etc., Javascript, Perl, PHP, Python, Ruby, C#, R, Swift, Matlab, Julia, Haxe, Ocaml and literally dozens of other languages on various popularity charts, and as a first pick option in C, C++ and Lua. Go and Rust are the exemptions to the rule! There are millions of pieces of software written using these which one can't even see because they are not public.

2. Programmers and end users want features and power much more than they want determinism. (Performance is a red herring because the vast majority of the time, performance is good enough, or even identical to non-extended.) That's why ripgrep and GNU grep and rspamd have them.

https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#how...

https://www.gnu.org/software/grep/manual/html_node/Regular-E...

https://rspamd.com/announce/2016/03/21/rspamd-1.2.0.html

3. A factual survey where libraries are used. This will be invisible for the aforementioned programming languages because they have built-in regex, but simply libpcre alone versus re2 and libhs shows clearly which paradigm is dominant and which is a niche.

libpcre: ag, apache2, blender, clamav, cppcheck, exim, fish, git, gnome-builder, godot, grep, haproxy, kodi, libvte, lighttpd, lldb, mariadb, mongodb, mutt/neomutt, mysql-workbench, nginx, nmap, pam, postfix, Qt5/Qt6, rspamd, selinux, sway, swig, syslog-ng, systemd, uwsgi, uwsgi, varnish, vlc, wget, zsh … … … and 110 more.

re2: bloaty, chromium/chromedriver/qtwebengine, clickhouse, libgrpc

libhs: libndpi, rspamd


The tool could be extended to support Unicode, whereas AFAIK it would not be possible to extend it to support backreferences. Are there any other “regex” features that would be impossible to support?


>> Too many to enumerate

I take back my previous claim, this is a wrong exaggeration.

> The tool could be extended to support Unicode

Not an easy task. There are some things in the standard that do not map neatly to states, notably foldcasing of characters that change the count of characters and the treatment of the generic line boundary. Edit: after browsing UTS#18, I am almost certain that a conforming implementation cannot be mapped as exemplified in the tool. Maybe there's a neat work-around possible.

> features that would be impossible to support?

(?=, (?!, (?<=, (?<!, (?{, (??{, (?&, (?(…), (?>, (*asr:, (*SKIP)


Some of these are in fact possible, though with some restrictions. I wrote a blog post on how to support some negative lookbehinds: http://allanrbo.blogspot.com/2020/01/alternative-to-negative...


Intersection and complement are both expressible in regular expressions (in the CS sense). Is that not what you mean by impossible to support?

(I'm too lazy to look up what the other listed notations mean.)


> Is that not what you mean

no, I mean generic line boundary


This is really neat! I've been putting off learning regex's for some time now, guess I need to add it to my tool belt.


Try the following input, if you dare:

(aa)*|(aaa)*|(aaaaa)*|(aaaaaaa)*


most of the crazy email regexes found here https://stackoverflow.com/questions/201323/how-to-validate-a... fail to validate in the input field


Only regular expressions[0] can compile to FST's; Perl-style regexes aren't regular.

[0] https://en.wikipedia.org/wiki/Regular_language


That's probably, because the regexes you linked to are Perl-flavoured regexes, whereas this website seems to use the "Unix Extended" variant of regex.


If you want crazy, you should look up RegEx match open tags except XHTML self-contained tags.


In all DM books. Kudos!


I don't know what I'm looking at but this looks like the solution to a front-end interview question.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: