
Show HN: Wregex – How Regular Expression Engines Work - wernsey
http://wstoop.co.za/wregex.php
======
burntsushi
I'm with glangdale on this one. Some parts of this are strange, but it was
overall a fun read. Thanks. :-)

> DFAs have some big advantages over NFAs: Because DFA regex engines don't
> need to backtrack they are in general much faster than NFAs. Also, because
> NFAs need to backtrack, it is possible to structure your pattern in such a
> way that the backtracking will cause nearly infinite loops on certain input
> sequences. DFAs also don't need the non-greedy operators *? and +?

Two points to make here. NFAs certainly do not need to backtrack. You can
write an NFA implementation that inspects each byte at most M times, where M ~
len(regex). You can implement this with a virtual machine:
[https://swtch.com/~rsc/regexp/regexp2.html](https://swtch.com/~rsc/regexp/regexp2.html)
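
To make the "no backtracking" point concrete, here is a minimal sketch of a thread-list simulation in the spirit of the linked article. The instruction encoding, the names `add_thread` and `full_match`, and the hand-compiled program for `a+b+` are all assumptions made up for this illustration; they are not taken from Wregex or from any particular library.

```python
# Hypothetical instruction encoding for this sketch:
#   ("char", c)      consume one input character equal to c
#   ("split", x, y)  continue at both x and y without consuming input
#   ("jmp", x)       continue at x without consuming input
#   ("match",)       accept
# Hand-compiled program for the regex a+b+ (an assumption, not compiler output).
A_PLUS_B_PLUS = [
    ("char", "a"),
    ("split", 0, 2),
    ("char", "b"),
    ("split", 2, 4),
    ("match",),
]


def add_thread(prog, pc, threads, seen):
    """Follow jmp/split from pc; record each reachable char/match once."""
    stack = [pc]
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue  # each instruction is added at most once per step
        seen.add(pc)
        op = prog[pc]
        if op[0] == "jmp":
            stack.append(op[1])
        elif op[0] == "split":
            stack.append(op[2])
            stack.append(op[1])
        else:
            threads.append(pc)


def full_match(prog, text):
    """Return True if prog matches all of text; never revisits earlier input."""
    current = []
    add_thread(prog, 0, current, set())
    for ch in text:
        nxt, seen = [], set()
        for pc in current:
            op = prog[pc]
            if op[0] == "char" and op[1] == ch:
                add_thread(prog, pc + 1, nxt, seen)
        current = nxt
        if not current:
            return False  # no live threads left
    return any(prog[pc][0] == "match" for pc in current)


print(full_match(A_PLUS_B_PLUS, "aab"))    # True
print(full_match(A_PLUS_B_PLUS, "aabba"))  # False
```

Because `seen` deduplicates program counters, each input character is looked at by at most len(prog) threads, which is the M ~ len(regex) bound.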

The other point is that "DFAs don't need non-greedy operators" doesn't really
make any sense. Were you thinking of possessive quantifiers instead? Non-
greedy operators don't impact whether a match occurs or not, but they can
certainly impact the length of a match. Both NFAs and DFAs can implement those
semantics.
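
A quick way to see the difference, using Python's standard `re` module purely as a convenient illustration (it is a backtracking engine, so this shows the semantics only, not how an NFA or DFA would implement them):

```python
import re

text = "aaa"

greedy = re.search(r"a+", text)   # greedy: take as many 'a's as possible
lazy = re.search(r"a+?", text)    # non-greedy: take as few as possible

# Both patterns match; only the length of the match differs.
print(greedy.group())  # aaa
print(lazy.group())    # a
```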

~~~
glangdale
You can write an NFA implementation that can inspect each byte _once_. Of
course, you won't get capturing or greedy/non-greedy semantics as a result.
Even Start of Match is a challenge.
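
As a rough illustration of the once-per-byte idea, here is a sketch of the classic Shift-And technique, which keeps the whole NFA state set in the bits of one integer. It handles only literal patterns, and the function name `shift_and_search` is made up for this example; it is not how Hyperscan or any specific engine is implemented.

```python
def shift_and_search(pattern: str, text: str) -> int:
    """Return the end offset (exclusive) of the first occurrence, or -1.

    Bit i of `state` is set when pattern[:i + 1] matches ending at the
    current input position, so `state` is the set of live NFA states.
    """
    if not pattern:
        return 0
    # masks[c] has bit i set wherever pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)

    state = 0
    for pos, c in enumerate(text):
        # Advance every live state by one and start a new attempt (| 1),
        # then keep only the states whose expected character is c.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            return pos + 1
    return -1


print(shift_and_search("aab", "xaaab"))  # 5
print(shift_and_search("abc", "ab"))     # -1
```

Each input character is read exactly once, and all the loop knows at the end is that some match ended at that offset. For a literal the start can be recovered from the pattern length, but for a general regex the bit vector alone does not say where the match began.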

We did have a project for a while which did capturing (in a Glushkov NFA) in
two passes - you had to inspect each byte twice (if you were content with an
O(N) size requirement for side storage, where N is the size of the input),
three times (with an O(sqrt(N)) side storage requirement), ... k times (with an
O(power(N, 1/(k-1))) side storage requirement).

It was quite cool, but we never found a customer, and it polluted the
Hyperscan source base in many weird ways.

------
glangdale
I am not convinced that what is shown as an NFA corresponds to what is
normally understood as an NFA. I don't think that makes it bad (in fact, it
can do more things than a strict NFA, like capturing and back-references), but
it is remarkably different to a Thompson NFA or a Glushkov NFA.

Something that can backtrack has a stack. I suspect this removes you from
'finite' territory; the 'F' in '*FA'.

I think Friedl has a lot to answer for in his muddying of the waters here.
It's created a lot of headaches for us over the years, as we implement both
DFA and NFA (strict) matching in Hyperscan, and people do ask us why our NFAs
can't do backreferences. :-)

~~~
warkdarrior
Something that has a _finite_ stack is still in 'finite' territory. Is there
an upper bound on the number of backtracking steps?

~~~
glangdale
Anything that can grow proportionally with the input (regardless of which O()
it is) takes us out of finite territory.

------
chenster
Even better, a visualized state machine, with email regex demo:
[https://tinyurl.com/ycel7z3k](https://tinyurl.com/ycel7z3k)

------
wernsey
I created a regular expression engine many years ago. I did it mostly for my
own enrichment, but I've made some time recently to write an essay on how it
works.

~~~
bobosha
nice and concise explanation.

