
Fast String Matching for Analytics Pipelines - utopkara
http://www.jwplayer.com/blog/fast-string-matching/
======
sytelus
The article doesn't seem to clarify all the details. Isn't a compressed trie
or a DAWG a better approach here? If you want to match substrings, wouldn't a
suffix tree be the obvious choice? (It can be made to take O(n) space.)

~~~
utopkara
The difference between a DAWG and the minimized deterministic dictionary
automaton is the way they are constructed. Otherwise, the resulting automata
should be very similar in size, and of course their running times will be the
same. Note that we build the automata offline.
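
To make the size comparison concrete, here is a minimal sketch of one way to get a minimized dictionary automaton: build a plain trie, then merge equivalent states bottom-up. This is only an illustration, not the code from the article; the names `State`, `build_trie`, `minimize`, `accepts`, and `count_states` are made up for this example, and the recursion assumes short words.

```python
class State:
    """One automaton state: outgoing edges plus an accepting flag."""
    def __init__(self):
        self.edges = {}      # char -> State
        self.final = False


def build_trie(words):
    """Build an unminimized dictionary automaton (a trie)."""
    root = State()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, State())
        node.final = True
    return root


def minimize(node, registry):
    """Merge equivalent states bottom-up and return the canonical state.

    Two states are equivalent when they agree on the accepting flag and
    their edges lead, symbol by symbol, to the same canonical states.
    """
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    signature = (
        node.final,
        tuple(sorted((ch, id(target)) for ch, target in node.edges.items())),
    )
    return registry.setdefault(signature, node)


def accepts(root, word):
    """Return True if the automaton accepts exactly this word."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    """Count the distinct states reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)
```

For the words "tap", "taps", "top", and "tops", the trie has 8 states and the minimized automaton has 5, because the shared suffixes "p" and "ps" collapse into shared states. A DAWG built incrementally should end up at a similar size for the same word list.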

The main advantage of using general-purpose automata becomes clear when we are
able to reuse the codebase and the pipelines we built for a dictionary
automaton to both build and execute a search automaton. Otherwise, a
generalized suffix tree would have done the job at least as well.

In addition, when we have general-purpose automata combined with a powerful
tool like OpenFST, we can compose the automata with transducers, and the
resulting automata will transform the input strings as we are matching them.
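
As a rough illustration of that last point, here is a pure-Python sketch of composing an epsilon-free transducer with an acceptor by product construction. It does not use OpenFST or mirror its API; the dict-based representation and the names `compose` and `transduce` are assumptions made for this example, and the lowercasing transducer and the word "cat" are arbitrary.

```python
import string
from collections import deque


def compose(t_arcs, t_start, t_finals, a_arcs, a_start, a_finals):
    """Product construction for an epsilon-free transducer and acceptor.

    t_arcs: {state: {in_sym: (out_sym, next_state)}}
    a_arcs: {state: {sym: next_state}}
    Returns (arcs, start, finals) of a transducer over pair states.
    """
    start = (t_start, a_start)
    arcs, finals = {}, set()
    queue = deque([start])
    while queue:
        pair = queue.popleft()
        if pair in arcs:
            continue  # already expanded
        t_state, a_state = pair
        arcs[pair] = {}
        if t_state in t_finals and a_state in a_finals:
            finals.add(pair)
        for in_sym, (out_sym, t_next) in t_arcs.get(t_state, {}).items():
            # Keep an arc only if the acceptor can read the transducer's output.
            a_next = a_arcs.get(a_state, {}).get(out_sym)
            if a_next is not None:
                nxt = (t_next, a_next)
                arcs[pair][in_sym] = (out_sym, nxt)
                queue.append(nxt)
    return arcs, start, finals


def transduce(arcs, start, finals, text):
    """Return the transformed text if it is accepted, otherwise None."""
    state, out = start, []
    for ch in text:
        arc = arcs.get(state, {}).get(ch)
        if arc is None:
            return None
        out_sym, state = arc
        out.append(out_sym)
    return "".join(out) if state in finals else None


# Hypothetical one-state transducer that lowercases ASCII letters.
lower = {0: {ch: (ch.lower(), 0) for ch in string.ascii_letters}}

# Hypothetical acceptor for the single word "cat": 0 -c-> 1 -a-> 2 -t-> 3.
word = "cat"
a_arcs = {i: {ch: i + 1} for i, ch in enumerate(word)}

arcs, start, finals = compose(lower, 0, {0}, a_arcs, 0, {len(word)})
```

With this setup, `transduce(arcs, start, finals, "CaT")` returns "cat", while "dog" and "ca" return None. The composed machine normalizes case and matches in a single pass, which is the effect described above.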

