
Incremental Regular Expressions - necrodome
http://jkff.info/articles/ire/
======
ridiculous_fish
Very clever approach.

It strikes me that the transition function for a character or a substring
depends only on that string, not on its position. The article annotates ranges
of the string with their transition functions, so if the string contains
repeated substrings, it needlessly stores the same data more than once.

To fix this, you could store the string and its substrings in a trie: the data
is automatically de-duplicated. You may also be able to get the same effect
with a hash table, using a rolling hash to efficiently check whether a
substring is already in the table.
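The rolling-hash part can be sketched in a few lines (a Rabin-Karp-style polynomial hash; the class and constants here are mine, purely for illustration). Each length-k window's hash is derived from the previous window's in O(1), so repeated substrings can be detected and share one cached transition function:

```java
public class RollingHash {
    static final long B = 257, M = 1_000_000_007L;

    // Hash of every length-k window of s; each hash after the first is
    // computed in O(1) from the previous window's hash.
    static long[] windowHashes(String s, int k) {
        long[] out = new long[s.length() - k + 1];
        long h = 0, pow = 1;                  // pow = B^(k-1) after the loop
        for (int i = 0; i < k; i++) {
            h = (h * B + s.charAt(i)) % M;
            if (i > 0) pow = pow * B % M;
        }
        out[0] = h;
        for (int i = k; i < s.length(); i++) {
            // Drop the outgoing character, shift, add the incoming one.
            h = ((h - s.charAt(i - k) * pow % M + M) * B + s.charAt(i)) % M;
            out[i - k + 1] = h;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] hs = windowHashes("abcabc", 3);
        System.out.println(hs[0] == hs[3]);  // true: both windows are "abc"
        System.out.println(hs[0] == hs[1]);  // false: "abc" vs "bca"
    }
}
```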

This sort of optimization seems valuable, because the suggested approach is
quite memory-intensive. Per the Java implementation, if your NFA has N nodes,
then each transition function stores a bitset of length N for each of the N
possible states, so memory usage grows as the square of the number of nodes
in the NFA.

 _edit_ : Also, thinking about this de-duplication made me realize a simpler
explanation of the technique. You have a regex, from which you generate an NFA
with some set of states. You also have a target string, which has a lot of
substrings. For every state, precompute and cache the state transition
produced by every substring. So you end up with a map whose keys are pairs
(state, substring) and whose values are sets of states.

Then if the target string is mutated, you can use the cached state transitions
for the unchanged parts to quickly determine whether the regex matches the
now-mutated string.
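A minimal sketch of the core operation, assuming a toy 2-state NFA for a b* (the class and matrix encoding are mine, not the library's API): the transition function of a substring is a boolean matrix over NFA states, and the function for a concatenation is just the boolean matrix product of the parts, so cached substring results compose directly:

```java
public class TransitionMonoid {
    // Transition function of a string fragment over an NFA with n states:
    // m[i][j] == true iff reading the fragment in state i can leave the
    // NFA in state j. Composition is boolean matrix multiplication.
    static boolean[][] compose(boolean[][] a, boolean[][] b) {
        int n = a.length;
        boolean[][] c = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                if (a[i][k])
                    for (int j = 0; j < n; j++)
                        c[i][j] |= b[k][j];
        return c;
    }

    public static void main(String[] args) {
        // Toy NFA for a b*: state 0 --a--> 1, state 1 --b--> 1, accept 1.
        boolean[][] ta = {{false, true}, {false, false}};  // on 'a'
        boolean[][] tb = {{false, false}, {false, true}};  // on 'b'
        boolean[][] tab = compose(ta, tb);    // cached function for "ab"
        boolean[][] tabb = compose(tab, tb);  // reuse "ab" to get "abb"
        System.out.println(tab[0][1]);   // true: "ab" reaches the accept state
        System.out.println(tabb[0][1]);  // true: so does "abb"
    }
}
```

If the string changes, only the matrices for fragments that overlap the edit need recomputing; everything else is reused from the cache.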

~~~
jamii
PEGs are effectively regular expressions with recursion. What you are
describing is a packrat parser. As long as the parsing language is context-
free (i.e. you don't allow the values extracted during parsing to be in scope
while parsing), you can make PEGs incremental in the same way. I've been
thinking about this lately in the context of
<https://github.com/jamii/strucjure>
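The memoization that makes packrat parsing linear is the same caching trick, keyed on (rule, position) instead of (state, substring). A minimal sketch for the toy grammar S <- 'a' S 'b' / '' (my own example; nothing here is strucjure's API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy packrat parser: memoizing each parse position makes the PEG linear,
// since every (rule, position) pair is computed at most once.
public class Packrat {
    final String input;
    final Map<Integer, Integer> memo = new HashMap<>(); // pos -> end index

    Packrat(String input) { this.input = input; }

    // Returns the index just past a parse of S starting at pos.
    // S <- 'a' S 'b' / ''  -- the empty alternative always succeeds.
    int parseS(int pos) {
        Integer cached = memo.get(pos);
        if (cached != null) return cached;
        int result = pos;  // empty alternative: consume nothing
        if (pos < input.length() && input.charAt(pos) == 'a') {
            int mid = parseS(pos + 1);
            if (mid < input.length() && input.charAt(mid) == 'b')
                result = mid + 1;  // 'a' S 'b' alternative succeeded
        }
        memo.put(pos, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new Packrat("aabb").parseS(0));  // 4: full match
        System.out.println(new Packrat("ab").parseS(0));    // 2: full match
    }
}
```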

------
pjscott
tl;dr: NFA transition functions form a monoid under composition. By using
balanced ropes as a monoid-cached tree, we can get incremental regular
expression matching in O(lg n) time for a string of length n undergoing a one-
character change (for a given automaton), and O(n) for a string undergoing an
n-character change.

And if that didn't make perfect sense, read the article. It's one of the best
things to appear on Hacker News in quite a while.
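To make the tl;dr concrete, here is a sketch of the cached-tree idea with a segment tree standing in for the article's balanced rope (the class, names, and toy NFA are mine, for illustration only). Each node caches the composed transition matrix of its range, so replacing one character recomputes only the O(lg n) nodes on the path to the root:

```java
import java.util.function.Function;

public class IncrementalMatcher {
    final int size;                  // leaves, padded to a power of two
    final boolean[][][] tree;        // 1-based segment tree of matrices
    final Function<Character, boolean[][]> charMatrix;
    final int states;

    IncrementalMatcher(String s, int states, Function<Character, boolean[][]> cm) {
        this.states = states;
        this.charMatrix = cm;
        int sz = 1;
        while (sz < s.length()) sz *= 2;
        this.size = sz;
        this.tree = new boolean[2 * sz][][];
        for (int i = 0; i < sz; i++)  // pad with the monoid identity
            tree[sz + i] = i < s.length() ? cm.apply(s.charAt(i)) : identity();
        for (int i = sz - 1; i >= 1; i--)
            tree[i] = compose(tree[2 * i], tree[2 * i + 1]);
    }

    boolean[][] identity() {
        boolean[][] m = new boolean[states][states];
        for (int i = 0; i < states; i++) m[i][i] = true;
        return m;
    }

    // Boolean matrix product: the monoid operation.
    static boolean[][] compose(boolean[][] a, boolean[][] b) {
        int n = a.length;
        boolean[][] c = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                if (a[i][k])
                    for (int j = 0; j < n; j++)
                        c[i][j] |= b[k][j];
        return c;
    }

    // Replace the character at pos and refresh O(lg n) cached ancestors.
    void set(int pos, char c) {
        int i = size + pos;
        tree[i] = charMatrix.apply(c);
        for (i /= 2; i >= 1; i /= 2)
            tree[i] = compose(tree[2 * i], tree[2 * i + 1]);
    }

    // Can the whole string take the NFA from state `from` to state `to`?
    boolean reaches(int from, int to) { return tree[1][from][to]; }

    public static void main(String[] args) {
        // Toy NFA for a b*: state 0 --a--> 1, state 1 --b--> 1, accept 1.
        Function<Character, boolean[][]> nfa = c ->
            c == 'a' ? new boolean[][]{{false, true}, {false, false}}
          : c == 'b' ? new boolean[][]{{false, false}, {false, true}}
          : new boolean[2][2];
        IncrementalMatcher m = new IncrementalMatcher("abb", 2, nfa);
        System.out.println(m.reaches(0, 1));  // true: "abb" matches a b*
        m.set(1, 'x');                        // now "axb": O(lg n) update
        System.out.println(m.reaches(0, 1));  // false
    }
}
```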

~~~
talloaktrees
can we get the "too stupid, didn't understand" version?

~~~
RyanZAG
This is a Java library that handles regular expressions in a new way. The
regular expression can be run over a large character string, and it supports
changes to the string without having to recompute the entire match.

The supported changes are concatenating two strings and cutting a string into
parts, which together allow adding or removing characters from a string.

The best use for this (imo) is running a regular expression over a large
user-entered document: as the user types, we can update the match results in
real time, since we never need to recompute everything.

Limits: "we only have character classes, braces and the operators +, *, ?, |",
so only basic expressions.

~~~
pjscott
Nice explanation. One more thing: this also leads to an algorithm for doing
ordinary, non-incremental regular expression matching in O(lg n) time on a
parallel machine. It uses a clever method which I think was first discovered
on the old Connection Machines by Danny Hillis and Guy Steele:

[http://cva.stanford.edu/classes/cs99s/papers/hillis-steele-data-parallel-algorithms.pdf](http://cva.stanford.edu/classes/cs99s/papers/hillis-steele-data-parallel-algorithms.pdf)
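The parallel version falls out of associativity: the per-character transition matrices can be combined with a Hillis-Steele inclusive scan, where each of the lg n rounds is fully parallel across positions. A sketch (sequential here, with the parallelizable loop marked; the class and toy NFA are my own illustration):

```java
public class ParallelScan {
    // Boolean matrix product: the (associative) monoid operation.
    static boolean[][] compose(boolean[][] a, boolean[][] b) {
        int n = a.length;
        boolean[][] c = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                if (a[i][k])
                    for (int j = 0; j < n; j++)
                        c[i][j] |= b[k][j];
        return c;
    }

    // Hillis-Steele inclusive scan: lg n rounds; within a round, every
    // position i is independent, so the inner loop can run on n processors.
    static boolean[][][] scan(boolean[][][] xs) {
        boolean[][][] cur = xs.clone();
        for (int d = 1; d < xs.length; d *= 2) {
            boolean[][][] next = cur.clone();
            for (int i = d; i < cur.length; i++)  // parallel on a real machine
                next[i] = compose(cur[i - d], cur[i]);
            cur = next;
        }
        return cur;  // cur[i] = composition of xs[0..i]
    }

    public static void main(String[] args) {
        // Per-character matrices for "abb" over a toy NFA for a b*.
        boolean[][] a = {{false, true}, {false, false}};
        boolean[][] b = {{false, false}, {false, true}};
        boolean[][][] prefixes = scan(new boolean[][][]{a, b, b});
        System.out.println(prefixes[2][0][1]);  // true: "abb" matches
    }
}
```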

