Hacker News new | past | comments | ask | show | jobs | submit login

Very clever approach.

It strikes me that the transition functions for a character or a substring is only dependent on that string, not its position. The article annotates ranges of the string with the transition function, so if the string contains repeated substrings, it needlessly repeats the same data.

To fix this, you could store the string and its substrings in a trie: the data is automatically de-duplicated. You may also be able to get the same effect with a hash table, using a rolling hash to efficiently compute when a substring is in the table.

This sort of optimization seems valuable, because the suggested approach is quite memory intensive. Per the Java implementation, if your NFA has N nodes, then the transition function needs to store a bitset of length N for each of N possible states, meaning that the memory usage goes as the square of the number of nodes in the NFA.

edit: Also, thinking about this de-duplication made me realize a simpler explanation of the technique. You have a regex, for which you generate an NFA, that has some states. We also have a target string, that has a lot of substrings. For every state, precompute and cache the state transition produced by every substring. So you end up with a map whose keys are pairs (state, substring) and whose values are sets of states.

Then if the target string is mutated, you can use the cached state transitions for the unchanged parts to quickly determine whether the regex matches the now-mutated string.

PEGs are effectively regular expressions with recursion. What you are describing is a packrat parser. So long as the parsing-language is context-free (ie you don't allow the values extracted to be in scope during parsing) then you can make PEGs incremental in the same way. I've been thinking about this lately in the context of https://github.com/jamii/strucjure

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact