Hacker News new | past | comments | ask | show | jobs | submit login

Yet another approach: Translate the regular expression for the disjunction of the search terms into a DFA. Then traverse each document with the DFA to find all matches. Traversing a string of length n with a DFA takes O(n) time regardless of the DFA's size.

The DFA turns out to be basically the same thing as the trie. Indeed, the standard DFA construction factors out common prefixes of the search terms, just like a trie. But this formulation has the advantage that anyone can implement it in a few minutes with their language's regex library. In Python you might do something like this:

    def searcher(terms):
        pattern = r"(?:^|\s)(%s)(?:$|\s|,|\.|!|\?)"
        dfa = re.compile(pattern % ('|'.join(map(re.escape, terms)),))
        return lambda string: dfa.findall(string)
This snippet also tries to deal properly (if simplistically) with search term occurrences bracketed by whitespace and punctuation.

As a side note, the problem didn't specify whether matches could overlap. Actualy, it isn't even solvable in O(n) time (plus precomputation) if overlapping matches are allowed. Let's say the search terms are "a", "aa", "aaa", etc, and the search string consists of n copies of "a". Then every substring of the string is a match and so there are O(n^2) matches, which obviously cannot be listed in O(n) time.

Most regex libraries don't do determinization (perhaps awk or grep do, but perl doesn't).

You'll want to do better than just "use a trie" - like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_mat...

Perl will actually use Aho-Corasick string matching for regular expressions like these.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact