Hacker News
Incremental Regular Expressions (jkff.info)
108 points by necrodome on Dec 24, 2012 | 6 comments

Very clever approach.

It strikes me that the transition function for a character or a substring depends only on that string, not on its position. The article annotates ranges of the string with their transition functions, so if the string contains repeated substrings, it needlessly stores the same data more than once.

To fix this, you could store the string and its substrings in a trie: the data is automatically de-duplicated. You may also be able to get the same effect with a hash table, using a rolling hash to efficiently check whether a substring is already in the table.
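As a concrete sketch of the rolling-hash idea (my own toy code, not part of the article's library; `B` and `M` are arbitrary hash parameters):

```java
public class RollingHash {
    static final long B = 131, M = 1_000_000_007L;

    // Hash of every length-w window of s, each computed in O(1) from the
    // previous one: drop the outgoing char, shift, add the incoming char.
    static long[] windowHashes(String s, int w) {
        long[] out = new long[s.length() - w + 1];
        long pow = 1;                                    // B^(w-1) mod M
        for (int i = 0; i < w - 1; i++) pow = pow * B % M;
        long h = 0;
        for (int i = 0; i < w; i++) h = (h * B + s.charAt(i)) % M;
        out[0] = h;
        for (int i = w; i < s.length(); i++) {
            h = (h - s.charAt(i - w) * pow % M + M) % M; // drop the old char
            h = (h * B + s.charAt(i)) % M;               // add the new one
            out[i - w + 1] = h;
        }
        return out;
    }
}
```

Windows with equal hashes are candidates for sharing one cached transition function; a real implementation would still compare the substrings to rule out hash collisions.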

This sort of optimization seems valuable, because the suggested approach is quite memory intensive. Per the Java implementation, if your NFA has N nodes, the transition function stores a bitset of length N for each of the N possible states, so memory usage grows as the square of the number of NFA nodes: an NFA with 1,000 nodes needs 10^6 bits, about 125 KB, per cached transition function.

edit: Also, thinking about this de-duplication made me realize a simpler explanation of the technique. You have a regex, from which you generate an NFA with some set of states. You also have a target string with many substrings. For every state, precompute and cache the state transition produced by every substring. So you end up with a map whose keys are pairs (state, substring) and whose values are sets of states.

Then if the target string is mutated, you can use the cached state transitions for the unchanged parts to quickly determine whether the regex matches the now-mutated string.
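A minimal sketch of that (state, substring) cache, assuming the NFA is given as an explicit edge map (the names and representation here are mine, not the article's):

```java
import java.util.*;

public class SubstringCache {
    // Memo: (state, substring) -> set of NFA states reachable by consuming
    // that substring starting from `state`. Cached entries for unchanged
    // substrings can be reused after the target string is edited.
    private final Map<String, Set<Integer>> memo = new HashMap<>();
    private final Map<Integer, Map<Character, Set<Integer>>> delta; // NFA edges

    public SubstringCache(Map<Integer, Map<Character, Set<Integer>>> delta) {
        this.delta = delta;
    }

    public Set<Integer> run(int state, String s) {
        String key = state + "\u0000" + s;  // pair encoded as one key
        Set<Integer> cached = memo.get(key);
        if (cached != null) return cached;
        Set<Integer> states = new HashSet<>(Set.of(state));
        for (char c : s.toCharArray()) {     // simulate the NFA on s
            Set<Integer> next = new HashSet<>();
            for (int q : states) {
                Set<Integer> t = delta.getOrDefault(q, Map.of()).get(c);
                if (t != null) next.addAll(t);
            }
            states = next;
        }
        memo.put(key, states);
        return states;
    }
}
```

For example, with an NFA for a*b (state 0 loops on 'a', moves to accepting state 1 on 'b'), `run(0, "aab")` yields {1}; a second call for the same substring is answered from the memo.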

PEGs are effectively regular expressions with recursion. What you are describing is a packrat parser. As long as the parsing language is context-free (i.e. you don't allow the values extracted to be in scope during parsing), you can make PEGs incremental in the same way. I've been thinking about this lately in the context of https://github.com/jamii/strucjure

tl;dr: NFA transition functions form a monoid under composition. By using balanced ropes as a monoid-cached tree, we can get incremental regular expression matching in O(lg n) time for a string of length n undergoing a one-character change (for a given automaton), and O(n) for a string undergoing an n-character change.
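A toy version of that idea (my own sketch, not the article's code), using a segment tree in place of a balanced rope and dense boolean matrices in place of bitsets: each tree node caches the composed transition function for its substring, so a one-character edit recomputes only the O(lg n) nodes on the path to the root.

```java
import java.util.function.Function;

public class IncrementalMatcher {
    private final int n;                                  // number of NFA states
    private final Function<Character, boolean[][]> delta; // per-character matrix
    private final char[] text;
    private final boolean[][][] seg;                      // cached compositions

    public IncrementalMatcher(int n, Function<Character, boolean[][]> delta, String s) {
        this.n = n; this.delta = delta; this.text = s.toCharArray();
        this.seg = new boolean[4 * text.length][][];
        build(1, 0, text.length - 1);
    }

    // Monoid operation: h = "f then g". h[i][k] iff some j has f[i][j] && g[j][k].
    private boolean[][] compose(boolean[][] f, boolean[][] g) {
        boolean[][] h = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (f[i][j])
                    for (int k = 0; k < n; k++)
                        if (g[j][k]) h[i][k] = true;
        return h;
    }

    private void build(int node, int lo, int hi) {
        if (lo == hi) { seg[node] = delta.apply(text[lo]); return; }
        int mid = (lo + hi) / 2;
        build(2 * node, lo, mid);
        build(2 * node + 1, mid + 1, hi);
        seg[node] = compose(seg[2 * node], seg[2 * node + 1]);
    }

    // One-character edit: recompute only the O(lg n) ancestors of position pos.
    public void set(int pos, char c) {
        text[pos] = c;
        update(1, 0, text.length - 1, pos);
    }

    private void update(int node, int lo, int hi, int pos) {
        if (lo == hi) { seg[node] = delta.apply(text[lo]); return; }
        int mid = (lo + hi) / 2;
        if (pos <= mid) update(2 * node, lo, mid, pos);
        else update(2 * node + 1, mid + 1, hi, pos);
        seg[node] = compose(seg[2 * node], seg[2 * node + 1]);
    }

    // Does the whole string take state `start` to state `accept`?
    public boolean matches(int start, int accept) {
        return seg[1][start][accept];
    }
}
```

With a two-state NFA for a*b (state 0 loops on 'a', moves to accepting state 1 on 'b'), "aab" matches; changing its last character to 'a' and back flips the answer, touching only lg n cached matrices each time.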

And if that didn't make perfect sense, read the article. It's one of the best things to appear on Hacker News in quite a while.

can we get the "too stupid, didn't understand" version?

This is a Java library that handles regular expressions in a new way. The regular expression can be run on a large character string, and the library supports changes to the string without having to re-run the match over the entire string.

The supported changes to the string are concatenating two strings, or cutting a string into parts. This allows for adding/removing characters from a string.

Best use for this (imo) is running a regular expression over a large user-entered document. As the user types, we can update the match results in real time, since we do not need to recompute everything.

Limits: "we only have character classes, braces and the operators +, *, ?, |", so only basic expressions.

Nice explanation. One more thing: this also leads to an algorithm for doing ordinary, non-incremental regular expression matching in O(lg n) time on a parallel machine. It uses a clever method which I think was first discovered on the old Connection Machines by Danny Hillis and Guy Steele.
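A sketch of that log-depth reduction (my own toy code, written sequentially; `compose` is the same AND/OR matrix product as before): because composition is associative, each round can combine all adjacent pairs independently, i.e. in parallel, and lg n rounds cover the whole string.

```java
public class ParallelScan {
    // Monoid operation: h = "f then g". h[i][k] iff some j has f[i][j] && g[j][k].
    static boolean[][] compose(boolean[][] f, boolean[][] g) {
        int n = f.length;
        boolean[][] h = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (f[i][j])
                    for (int k = 0; k < n; k++)
                        if (g[j][k]) h[i][k] = true;
        return h;
    }

    // Reduce the per-character matrices to the whole-string transition
    // function in ceil(lg n) rounds; the pairs within one round are
    // independent, which is what makes the scheme parallelizable.
    static boolean[][] reduceAll(boolean[][][] ms) {
        boolean[][][] cur = ms;
        int len = ms.length;
        while (len > 1) {
            int half = (len + 1) / 2;
            boolean[][][] next = new boolean[half][][];
            for (int i = 0; i < half; i++)
                next[i] = (2 * i + 1 < len)
                        ? compose(cur[2 * i], cur[2 * i + 1])
                        : cur[2 * i];               // odd element passes through
            cur = next;
            len = half;
        }
        return cur[0];
    }
}
```

On a machine with n processors each round takes constant depth, giving O(lg n) total for a length-n string and a fixed automaton.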

