For readers unfamiliar with nom, it would be even more helpful if the author included reference links, because I was immediately confused by the sudden appearance of the "alt(..)" function, which seemingly materialized out of thin air (presumably the imports were omitted for brevity).
It's currently not possible to write a Rust function that's generic over a tuple with arbitrarily many entries, so they had to implement a trait by hand for tuple sizes up to 21 (presumably that limit was chosen arbitrarily).
Yes, Rust declarative macros shine at exactly this kind of problem. And nom does use them to generate the implementations.
But for some reason it's customary in Rust to stick to a fairly low number of autogenerated implementations. My first guess would be that it's a compile-time issue?
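For the curious, the macro pattern can be sketched like this. This is not nom's actual code; the trait, method, and macro names here are made up purely to show the recursive trick of peeling one type parameter off per step, so a single invocation generates an impl for every tuple size up to the listed limit:

```rust
// Toy trait standing in for nom's tuple-based traits (real names differ).
trait Arity {
    fn arity(&self) -> usize;
}

// Counts how many identifiers it is given.
macro_rules! count {
    () => (0usize);
    ($head:ident $($tail:ident)*) => (1usize + count!($($tail)*));
}

// Peels one type parameter off per recursion step, emitting one
// impl per tuple size. One invocation with 21 identifiers therefore
// covers every arity from 1 to 21.
macro_rules! impl_arity {
    () => {};
    ($head:ident $($tail:ident)*) => {
        impl<$head, $($tail),*> Arity for ($head, $($tail,)*) {
            fn arity(&self) -> usize {
                count!($head $($tail)*)
            }
        }
        impl_arity!($($tail)*);
    };
}

impl_arity!(A B C D E F G H I J K L M N O P Q R S T U);

fn main() {
    assert_eq!((1u8,).arity(), 1);
    assert_eq!((1u8, "x", 3.0f64).arity(), 3);
}
```

Each extra impl is more code for the compiler to process (and for rustdoc to render), which is one plausible reason crates cap the arity at a small fixed number.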
Does anybody know what the problem with negation is? Can't you just swap accepting with non-accepting states? The result should accept exactly the complement, no? Since we're not talking about Büchi automata here.
It's because this produces NFAs not DFAs (due to the epsilon arcs), and so negation can't be done by just flipping the final states.
Take, for example, the regex "a|(aa)+" (the set of even-length strings of "a"'s, plus the string "a" itself). If you use the construction from the article, you get an NFA with basically two arms: one that recognizes "a" and one that recognizes "(aa)+". The initial state has an epsilon arc to the start state of each of these arms. If you just flip the final/non-final states in each arm, the resulting language still contains "a": in the "(aa)+" arm we are no longer accepting each even-length string of "a"'s but now each odd-length one, of which "a" is a member. Thus the new language is not the proper complement of "a|(aa)+", which would not contain "a".
Ah ok, thanks. What I meant was that you could flip the states if it's a DFA. If you have an NFA then that won't work, as shown by your counterexample.
I suppose an NFA generally accepts if an accepting state is epsilon-reachable, so if you flip the accepting states, you could still have an accepting state epsilon-reachable, which is why this doesn't work. In a DFA, only the current state is (trivially) epsilon-reachable, so the construction works.
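To make this concrete, here is a runnable sketch (the state numbering and code are mine, not from the article): an epsilon-NFA for "a|(aa)+" where flipping the final/non-final states still accepts "a", followed by a DFA for the same language where flipping the finals does yield the complement.

```rust
use std::collections::BTreeSet;

// Hand-built epsilon-NFA for "a|(aa)+" over the single-letter alphabet {a}.
// State 0 is the start. Arm one: 1 --a--> 2 (final).
// Arm two: 3 --a--> 4 --a--> 5 (final), with 5 --eps--> 3 for the "+".
const EPS: &[(usize, usize)] = &[(0, 1), (0, 3), (5, 3)];
const ON_A: &[(usize, usize)] = &[(1, 2), (3, 4), (4, 5)];

fn eps_closure(states: &BTreeSet<usize>) -> BTreeSet<usize> {
    let mut closure = states.clone();
    loop {
        let mut grew = false;
        for &(from, to) in EPS {
            if closure.contains(&from) && closure.insert(to) {
                grew = true;
            }
        }
        if !grew {
            return closure;
        }
    }
}

// Standard subset simulation. The alphabet is just {a}, so the
// character itself is ignored; only the input length matters.
fn nfa_accepts(input: &str, finals: &BTreeSet<usize>) -> bool {
    let mut current = eps_closure(&BTreeSet::from([0]));
    for _ch in input.chars() {
        let step = ON_A
            .iter()
            .filter(|(from, _)| current.contains(from))
            .map(|&(_, to)| to)
            .collect();
        current = eps_closure(&step);
    }
    current.iter().any(|s| finals.contains(s))
}

fn main() {
    // The NFA accepts "a" and the even-length strings of 'a'.
    let finals = BTreeSet::from([2, 5]);
    assert!(nfa_accepts("a", &finals));
    assert!(nfa_accepts("aaaa", &finals));
    assert!(!nfa_accepts("aaa", &finals));

    // Flip final/non-final states: "a" is STILL accepted (it now ends
    // in state 4 of the "(aa)+" arm), so this is not the complement.
    let flipped: BTreeSet<usize> = (0..=5).filter(|s| !finals.contains(s)).collect();
    assert!(nfa_accepts("a", &flipped));

    // A DFA for the same language: on 'a', q0 -> q1 -> q2 -> q3 -> q2 -> ...
    // with finals {q1, q2}. Flipping its finals to {q0, q3} complements correctly.
    let dfa = [1usize, 2, 3, 2];
    let run = |input: &str| input.chars().fold(0usize, |q, _| dfa[q]);
    assert!([1, 2].contains(&run("a")));    // "a" in the language
    assert!(![1, 2].contains(&run("aaa"))); // "aaa" not in the language
    assert!(![0, 3].contains(&run("a")));   // flipped DFA rejects "a": correct
    assert!([0, 3].contains(&run("aaa"))); // flipped DFA accepts "aaa": correct
}
```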
Epsilon transitions often cause this issue of not being able to complement the language by just flipping final/non-final states, but they aren't the root cause. For example, you can turn the NFA above into an ambiguous NFA with no epsilon transitions by just doing epsilon removal, and then you still have the same problem but no non-trivially epsilon-reachable states!
On the other hand, you can have an unambiguous [1] NFA (for example, a DFA into which you insert some dummy epsilon transitions between the halves of a split-up state) where you can just flip the final/non-final states and complement the language.
So, in the end, complementing languages described by NFAs without determinizing them first is a bit of a tricky problem.
But you can't always complement the language easily for a UFA either, right? The path to an accepting state may be unambiguous, but there could at the same time be a path to a non-accepting state, so flipping the states may keep some words in the language. And it may even make the automaton ambiguous.
That example doesn’t really fit, since one could easily compute the negation of “a|(aa)+” in linear time by simply returning the opposite of the top-level node in the tree.
Perhaps something like “X|!Y” or similar might be impossible?
The parent comment asked specifically about negating the finality of each state, which I addressed. As for how to actually implement complement in a regex engine, there are certainly some strategies like the one you mentioned that could be used (but the point of my comment was that just swapping final/non-final states is not a valid one).
The problem is when combining the complement of an expression with another expression (such as when constructing an alternation of the two), because then the complement has to be written out as an explicit regular expression, and that can grow double-exponentially in length: https://en.wikipedia.org/wiki/Regular_expression#Expressive_...
Often you want to capture something, not just check if a match exists. But the theory of capture groups in DFAs is relatively obscure - for example tagged DFAs in the Laurikari paper, which is a difficult read. Does anyone know of an accessible writeup of these techniques?
Interestingly, neither re2 nor Rust's regex crate performs capturing in DFAs. Rather, a DFA is used to locate the end of a match, a DFA is run in reverse to determine the beginning of the match, and then an NFA is used a third time to extract the capture groups.
The "Papers" section on re2c's web site continues Laurikari's work: http://re2c.org/
... but I haven't found them particularly accessible. And it's not clear it's a viable strategy in a general purpose regex engine. Namely, I'm not sure how much bigger it makes the DFA.
Also, AFAIK, these aren't DFAs. They are different theoretical structures with explicitly more power.
> and then an NFA is used a third time to extract the capture groups.
That's the PikeVM. It's an NFA simulation, although it uses additional storage and is otherwise more computationally powerful than just a plain NFA.
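For readers curious what an NFA simulation with "additional storage" looks like, here is a minimal Pike-style VM sketch. This is not the actual code from re2 or the regex crate; the instruction set and the compiled program are illustrative. The extra storage is the per-thread vector of capture slots:

```rust
// A tiny bytecode for a Pike-style regex VM.
#[derive(Clone, Copy)]
enum Inst {
    Char(char),         // consume one matching character
    Split(usize, usize), // try both branches (alternation / repetition)
    Jump(usize),
    Save(usize),        // record the current input position in a slot
    Match,
}

// Add a thread to the list, eagerly following the epsilon-like
// instructions (Split/Jump/Save); `seen` deduplicates by pc, which is
// what bounds the work per input position.
fn add_thread(
    prog: &[Inst],
    list: &mut Vec<(usize, Vec<Option<usize>>)>,
    seen: &mut Vec<bool>,
    pc: usize,
    slots: Vec<Option<usize>>,
    pos: usize,
) {
    if seen[pc] {
        return;
    }
    seen[pc] = true;
    match prog[pc] {
        Inst::Jump(t) => add_thread(prog, list, seen, t, slots, pos),
        Inst::Split(a, b) => {
            add_thread(prog, list, seen, a, slots.clone(), pos);
            add_thread(prog, list, seen, b, slots, pos);
        }
        Inst::Save(slot) => {
            let mut s = slots;
            s[slot] = Some(pos);
            add_thread(prog, list, seen, pc + 1, s, pos);
        }
        _ => list.push((pc, slots)),
    }
}

// Anchored match; on success returns the capture slots.
fn pike_vm(prog: &[Inst], input: &str, nslots: usize) -> Option<Vec<Option<usize>>> {
    let mut current = Vec::new();
    let mut seen = vec![false; prog.len()];
    add_thread(prog, &mut current, &mut seen, 0, vec![None; nslots], 0);
    for (i, ch) in input.chars().enumerate() {
        let mut next = Vec::new();
        let mut seen = vec![false; prog.len()];
        for (pc, slots) in current.drain(..) {
            if let Inst::Char(c) = prog[pc] {
                if c == ch {
                    add_thread(prog, &mut next, &mut seen, pc + 1, slots, i + 1);
                }
            }
        }
        current = next;
    }
    current
        .into_iter()
        .find(|(pc, _)| matches!(prog[*pc], Inst::Match))
        .map(|(_, slots)| slots)
}

fn main() {
    use Inst::*;
    // Hand-compiled program for the anchored regex "a(b*)c",
    // with slots 0 and 1 delimiting capture group 1.
    let prog = [
        Char('a'),
        Save(0),
        Split(3, 5),
        Char('b'),
        Jump(2),
        Save(1),
        Char('c'),
        Match,
    ];
    let slots = pike_vm(&prog, "abbc", 2).unwrap();
    assert_eq!(slots, vec![Some(1), Some(3)]); // group 1 spans "bb"
    assert!(pike_vm(&prog, "abb", 2).is_none());
}
```

Because the thread list holds at most one thread per program counter, the simulation stays linear in the input for a fixed program, even though each thread drags its capture slots along.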
Thanks! I hadn't encountered that top paper "A closer look at TDFA."
That paper claims to be "the first practical submatch extraction and parsing algorithm based on DFA" and it came out only last year! It shows how new this theory is.
I don't know about sound theory, but pulling something out of bullshit:
Firstly, the nice part about DFAs is that they're really simple to implement. In terms of capture groups, bear in mind that the generation mechanism means that the nodes are probably superpositions, and the extracts across aspects of your language can intersect if you're not careful.
You can get into messy places really quickly - so I'd suggest not fucking with the machine directly.
However, there's one thing that you can do without compromising machine generation: add a processing mode that returns the trace of machine states for the matching string.
And then process that state trace.
Now bear in mind that some of the states that you went through might have been superpositions, so you don't quite know what they are.
But (now that you have a match) you can run the reverse of the match through the reverse of the network and intersect the reverse of its trace.
(I haven't actually done that, but it will clean up some trace ambiguity - where earlier branches were only later found not to have been taken.)
Nope, a friend of mine messaged me that I was on the front page of HN! I'm busy with family for Easter so I can't reply to everything I want to, but I'll come back later
GPT has completely changed my use of and feelings about regexes. I used to hate them and would rather write the logic myself in a language I understand. But now I just ask GPT for the regex I need and voila!
That's interesting. I personally like regexes very much, but I'm slow to create them, so I figured I'd ask ChatGPT a few times and just double-check its work, and it got them all wrong.
I recently asked it to produce a regex for me, and it responded that as a large language model, it wasn’t allowed to produce hate speech… Not making this up.
Every regular expression I’ve seen produced by a large language model has either been broken, or done an unacceptably bad job because a regex was the wrong tool for the job (e.g. validating an email address with a fairly short regular expression, answers which I’ve seen people share at least three times).
Writing regexes is one of the most fun things that sometimes pop up at work. A little puzzle: well-defined, clear that there must be a solution, somewhat easily checkable... the easiest kind of task. I enjoy this just as much as some people enjoy crosswords or sudoku. Nobody would ask ChatGPT to solve a sudoku for them (it would get them all wrong anyway). No way I'm gonna let GPT take this away from me.
It is notable for supporting lispy "SRE" syntax in addition to traditional PCRE syntax, standardized as SRFI 115: https://srfi.schemers.org/srfi-115/srfi-115.html