Irregular Expressions (tavianator.com)
73 points by signa11 on April 9, 2023 | 40 comments



There is also an unrelated Scheme/Racket library called IrRegular Expressions (irregex): https://synthcode.com/scheme/irregex

It is notable for supporting the lispy "SRE" syntax, standardized as SRFI 115, in addition to traditional PCRE syntax: https://srfi.schemers.org/srfi-115/srfi-115.html


Great writeup, clear and easy to follow.

For readers unfamiliar with nom, it would be even more helpful if the author included reference links. I was immediately confused by the sudden appearance of the "alt(..)" function, which seemingly materialized out of thin air (presumably the imports were omitted for brevity).

https://docs.rs/nom/latest/nom/branch/fn.alt.html
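
For anyone else who hit the same wall, here is a minimal, self-contained sketch of how nom's `alt` is typically used (a hypothetical parser, assuming nom 7.x, not code from the article): `alt` takes a tuple of parsers and tries each in order, returning the first success.

    use nom::{branch::alt, bytes::complete::tag, IResult};

    // Try each parser in the tuple in order; the first success wins.
    fn boolean(input: &str) -> IResult<&str, &str> {
        alt((tag("true"), tag("false")))(input)
    }

    fn main() {
        assert_eq!(boolean("true!"), Ok(("!", "true")));
        assert_eq!(boolean("false?"), Ok(("?", "false")));
        assert!(boolean("maybe").is_err());
    }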


What's with the 21 parsers limit?


It's currently not possible to write a Rust function that's generic over a tuple with arbitrarily many entries, so they had to implement a trait by hand for tuple sizes up to 21 (presumably that limit was chosen arbitrarily).


Does Rust have a template or macro system that would facilitate generating up to e.g. 221 instead?

If I'm understanding correctly, some Regex features may require implementing more than 21 rules for a thorough and complete implementation.


Yes, Rust declarative macros shine at exactly this kind of problem, and nom does use them to generate the implementations.

But for some reason it's customary in Rust to stick to a fairly low number of autogenerated implementations. My first guess is that it's a compile-time concern.

https://docs.rs/nom/latest/src/nom/branch/mod.rs.html#245
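
As a rough illustration of the technique (a hedged sketch with a made-up trait, not nom's actual macro, which lives in the source linked above), a macro_rules! macro can stamp out one trait implementation per tuple size; nom uses declarative macros in a similar spirit to cover tuple arities up to 21.

    // Hypothetical trait used only to demonstrate per-tuple-size impls.
    trait Describe {
        fn describe(&self) -> String;
    }

    macro_rules! impl_describe_for_tuples {
        // Each invocation generates the impl for one tuple size.
        ($($name:ident : $idx:tt),+) => {
            impl<$($name: std::fmt::Debug),+> Describe for ($($name,)+) {
                fn describe(&self) -> String {
                    let mut parts = Vec::new();
                    $(parts.push(format!("{:?}", self.$idx));)+
                    parts.join(", ")
                }
            }
        };
    }

    impl_describe_for_tuples!(A: 0);
    impl_describe_for_tuples!(A: 0, B: 1);
    impl_describe_for_tuples!(A: 0, B: 1, C: 2);
    // ... and so on, one line per supported arity.

    fn main() {
        println!("{}", (1, "two", 3.0).describe());
    }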


The workaround is just to nest multiple `alt`s inside each other.
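
A hypothetical sketch of that workaround (made-up parsers, assuming nom 7.x): a nested `alt` counts as a single branch of the outer one, so the 21-branch limit only applies per level.

    use nom::{branch::alt, bytes::complete::tag, IResult};

    fn keyword(input: &str) -> IResult<&str, &str> {
        alt((
            tag("fn"),
            tag("let"),
            // ... up to 21 branches here ...
            alt((
                tag("struct"),
                tag("enum"),
                // ... and the overflow goes into the nested `alt` ...
            )),
        ))(input)
    }

    fn main() {
        assert_eq!(keyword("enum Foo"), Ok((" Foo", "enum")));
    }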


Does anybody know what the problem with negation is? Can't you just swap accepting and non-accepting states? The result should accept exactly the complement, no? After all, we're not talking about Büchi automata here.


It's because this construction produces NFAs, not DFAs (due to the epsilon arcs), so negation can't be done by just flipping the final states.

Take, for example, the regex "a|(aa)+" (the set of even-length strings of "a"s, plus the single "a"). If you use the construction from the article, you get an NFA with basically two arms: one that recognizes "a" and one that recognizes "(aa)+". The initial state has an epsilon arc to the start state of each of these arms. If you just flip the final/non-final states in each arm, the resulting language still contains "a": the second arm no longer accepts the even-length "a" strings but now accepts the odd-length ones, of which "a" is a member. Thus the new language is not the proper complement of "a|(aa)+", which would not contain "a".


Ah ok, thanks. What I meant was that you could flip the states if it's a DFA. If you have an NFA then that won't work, as shown by your counterexample.

I suppose you generally accept in an NFA if there is an accepting state that is epsilon-reachable, so if you flip the accepting states, you could still have an accepting state epsilon-reachable, which is why this doesn't work. In a DFA, there is only one state that's (trivially) epsilon-reachable, so the construction works.
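
To make the DFA case concrete, here is a minimal sketch (toy code, not from the article): for a complete DFA, one with a transition out of every state on every symbol, flipping the accepting states really does recognize the complement language. The toy automaton below accepts strings over {a} of even length.

    use std::collections::HashSet;

    struct Dfa {
        start: usize,
        accept: HashSet<usize>,
        // delta[state][symbol] -> next state; the only symbol here is 0 = 'a'
        delta: Vec<Vec<usize>>,
    }

    impl Dfa {
        fn accepts(&self, input: &[usize]) -> bool {
            let mut s = self.start;
            for &c in input {
                s = self.delta[s][c];
            }
            self.accept.contains(&s)
        }

        // Complement a *complete* DFA by flipping final/non-final states.
        fn complement(&self) -> Dfa {
            let all: HashSet<usize> = (0..self.delta.len()).collect();
            Dfa {
                start: self.start,
                accept: all.difference(&self.accept).copied().collect(),
                delta: self.delta.clone(),
            }
        }
    }

    fn main() {
        // States 0 (even, accepting) and 1 (odd); reading 'a' toggles them.
        let even_as = Dfa {
            start: 0,
            accept: HashSet::from([0]),
            delta: vec![vec![1], vec![0]],
        };
        let odd_as = even_as.complement();
        for len in 0..6 {
            let w = vec![0; len];
            assert_eq!(even_as.accepts(&w), !odd_as.accepts(&w));
        }
        println!("flipping accepting states complements a complete DFA");
    }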


Epsilon transitions often cause this issue of not being able to complement the language by just flipping final/non-final states, but they aren't really the root cause. For example, you can make an ambiguous NFA with no epsilon transitions at all by just doing epsilon removal, and then you still have the same problem even though there are no non-trivial epsilon-reachable states!

On the other hand, you can have an unambiguous [1] NFA (for example, a DFA, but you insert some dummy epsilon transitions between a split up state) where you can just flip the final/non-final states and complement the language.

So, in the end, complementing languages described by NFAs without determinizing them first is a bit of a tricky problem.

[1]: https://en.wikipedia.org/wiki/Unambiguous_finite_automaton --- a superset of DFAs, but they can have epsilon transitions


Oh, interesting! Thanks for the pointer!

But you can't always complement the language easily for a UFA either, right? The path to an accepting state may be unambiguous, but there could at the same time be a path to a non-accepting state, so flipping the states may keep some words in the language (and may even make the automaton ambiguous).


That example doesn’t really fit, since one could easily compute the negation of “a|(aa)+” in linear time by simply returning the opposite of the top-level node in the tree.

Perhaps something like “X|!Y” or similar might be impossible?


The parent comment asked specifically about negating the finality of each state, which I addressed. As for how to actually implement complement in a regex engine, there are certainly some strategies like the one you mentioned that could be used (but the point of my comment was that just swapping final/non-final states is not a valid one).


Ah I thought they meant flipping the finality of the final state, but I guess they said "states" so your interpretation makes more sense.


The problem arises when combining the complement of an expression with another expression (such as when constructing an alternation of the two), because then the complemented expression has to be written out explicitly for one of the two, and that can grow double-exponentially in length: https://en.wikipedia.org/wiki/Regular_expression#Expressive_...


Often you want to capture something, not just check if a match exists. But the theory of capture groups in DFAs is relatively obscure - for example tagged DFAs in the Laurikari paper, which is a difficult read. Does anyone know of an accessible writeup of these techniques?

Interestingly, neither re2 nor Rust's regex crate performs capturing in DFAs. Rather, a DFA is used to locate the end of a match, a DFA for the reversed pattern is run backwards to determine the beginning of the match, and then an NDFA is used to match a third time, to extract the capture groups.
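
To illustrate the shape of that strategy, here is a heavily simplified toy sketch for the single hard-coded pattern `a+(b+)` (my own illustration, not re2's or the regex crate's actual code): a forward DFA finds where the match ends, a DFA for the reversed pattern runs backwards to find where it starts, and a third pass over the matched span recovers the capture group. Real engines use an NFA simulation (the PikeVM) for that last step and handle leftmost-match semantics far more carefully.

    // Pass 1: forward DFA for an unanchored `a+b+`; returns the end offset
    // of the leftmost match, if any.
    fn forward_end(input: &[u8]) -> Option<usize> {
        let mut state = 0u8; // 0: scanning, 1: inside a+, 2: inside b+ (accepting)
        let mut end = None;
        for (i, &c) in input.iter().enumerate() {
            state = match (state, c) {
                (_, b'a') => 1,
                (1, b'b') | (2, b'b') => 2,
                _ => 0,
            };
            if state == 2 {
                end = Some(i + 1); // keep extending the current match
            } else if end.is_some() {
                break; // the leftmost match cannot be extended any further
            }
        }
        end
    }

    // Pass 2: DFA for the reversed pattern `b+a+`, run backwards from `end`
    // to recover where the match starts.
    fn reverse_start(input: &[u8], end: usize) -> usize {
        let mut state = 0u8; // 0: init, 1: inside b+, 2: inside a+ (accepting)
        let mut start = end;
        for i in (0..end).rev() {
            state = match (state, input[i]) {
                (0, b'b') | (1, b'b') => 1,
                (1, b'a') | (2, b'a') => 2,
                _ => break,
            };
            if state == 2 {
                start = i;
            }
        }
        start
    }

    fn main() {
        let input = b"xxaaabbbyy";
        let end = forward_end(input).expect("no match");
        let start = reverse_start(input, end);
        // Pass 3: one more pass over the matched span extracts the capture
        // group `(b+)`; a real engine runs a tagged NFA simulation here
        // instead of this ad-hoc scan.
        let group_start = start + input[start..end].iter().position(|&c| c == b'b').unwrap();
        println!("match   = {:?}", std::str::from_utf8(&input[start..end]).unwrap());
        println!("group 1 = {:?}", std::str::from_utf8(&input[group_start..end]).unwrap());
    }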


The "Papers" section on re2c's web site continues Laurikari's work: http://re2c.org/

... but I haven't found them particularly accessible. And it's not clear it's a viable strategy in a general purpose regex engine. Namely, I'm not sure how much bigger it makes the DFA.

Also, AFAIK, these aren't DFAs. They are different theoretical structures with explicitly more power.

> and then an NDFA is used to match a third time, to extract the capture groups.

That's the PikeVM. It's an NFA simulation. Although it uses additional storage and is otherwise more computationally powerful than just a plain NFA.


Thanks! I hadn't encountered that top paper "A closer look at TDFA."

That paper claims to be "the first practical submatch extraction and parsing algorithm based on DFA" and it came out only last year! It shows how new this theory is.


I don't know about sound theory, but pulling something out of bullshit:

Firstly, the nice part about DFAs is that they're really simple to implement. In terms of capture groups, bear in mind that the generation mechanism means that the nodes are probably superpositions, and the extracts across aspects of your language can intersect if you're not careful.

You can get into messy places really quickly - so I'd suggest not fucking with the machine directly.

However, there's one thing that you can do without compromising machine generation: add a processing mode that returns the trace of machine states for the matching string.

And then process that state trace.

Now bear in mind that some of the states that you went through might have been superpositions, so you don't quite know what they are.

But (now that you have a match) you can run the reverse of the match through the reverse of the network and intersect the reverse of its trace.

(I haven't actually done that, but it will clean up some trace ambiguity - where earlier branches were only later found not to have been taken.)


The animated diagrams are cool! Does anyone know how they're made?


Glad you said this, because they do not work in mobile Safari. All I saw was empty ovals.


Also broken in Desktop Safari.


I wrote the SVGs by hand


They're beautifully animated.

How did you know that your article was posted? Do you have a program to alert you of mentions?


Nope, a friend of mine messaged me that I was on the front page of HN! I'm busy with family for Easter so I can't reply to everything I want to, but I'll come back later


Cool. Could your irregular expressions take something interesting from Johnathan's snippet recalled on this wiki page: http://lambdaway.free.fr/lambdawalks/?view=levithan ?


GPT has completely changed how I use and feel about regexes. I used to hate them and would rather write the logic myself in a language I understand. But now I just ask GPT for the regex I need and voila!


That's interesting. I personally like regexes very much, but I'm slow to create them, so I figured I'd ask ChatGPT a few times and just double-check its work. It got them all wrong.


I recently asked it to produce a regex for me, and it responded that as a large language model, it wasn’t allowed to produce hate speech… Not making this up.


Got them wrong for me as well. :/

I could point out the flaw and it would fix it, but introduce another.

However, I got enough out of it to rewrite it myself.


Every regular expression I’ve seen produced by a large language model has either been broken, or has done an unacceptably bad job because a regex was the wrong tool for the job (e.g. validating an email address with a fairly short regular expression, which I’ve seen people share answers for at least three times).


Writing regexes is one of the most fun things that sometimes pops up at work. A little puzzle: well-defined, clearly solvable, somewhat easily checkable... the easiest kind of task. I enjoy this just as much as some people enjoy crosswords or sudoku. Nobody would ask ChatGPT to solve a sudoku for them (it would get them all wrong anyway). No way I'm gonna let GPT take this away from me.


"No way I'm gonna let GPT take this away from me" surely you can see a photoshop expert from years ago saying this about pixel blurring?


What is "pixel blurring"?


The regex GPT gave me the other day was pure shite. However, that was only a small part of the job, so GPT was still a big win overall.


And next you'll be asking GPT "how do I do multiplication"?


Didn't people probably say this about calculators too?


they did, and they were right


There will be no need to ask.



