
Monoid machines: a O(log n) parser for regular languages (2006) [pdf] - Cieplak
http://www.dcc.fc.up.pt/~acm/semigr.pdf
======
pjscott
If I'm reading this right, it's a very cute trick. If not, maybe someone will
correct my summary:

You can parse regular languages with a DFA -- think of a set of states and a
transition function saying "if I'm in state x and we see letter c, go to state
y".

Equivalently, for any letter you can ask "for each DFA state, where will this
letter take me?" Like, "if we see an 'a', state 1 will go to state 5, state 2
will go to state 1, ...". You could represent this as a lookup table with an
entry for each state, if you like.

These functions can be composed in the obvious way: just look up where the
first one will take you, then look up where the second will go from there. So
e.g. from the functions for 'c', 'a', and 't', you can get the function for
'cat', that tells what the DFA will do -- starting at any of its states -- if
it next sees 'cat'.

(For a DFA with a constant number of states, composing two of these functions
takes O(1) time.)

Composition is associative, of course. If you see 'ca' and then 't', this is
the same as seeing 'c' then 'at', which is the same as seeing 'cat'.

To run a DFA on some string, you can think of it as equivalent to using the
first letter's function to see where you go from the start state, then using
the second letter's function on whatever state you end up in after that, and
so on... which is the same as using the entire string's function to get to the
final state in one jump.

How do you compute the whole string's function? Just compose the ones for its
letters. This composition is associative, so you can do it in parallel as a
tree-structured thing. For a string of length n, you can do this in O(log n)
time as long as you can throw enough processors at it.

Boom, logarithmic-time parallel regular expression matching.

EDIT: I thought this sounded familiar, and sure enough: Danny Hillis and Guy
Steele wrote essentially the same thing back in 1989, as one of the examples
in "Data Parallel Algorithms". They also give a very clever explanation of how
this can be extended to make a parallel lexer, also in logarithmic time:

[http://cva.stanford.edu/classes/cs99s/papers/hillis-
steele-d...](http://cva.stanford.edu/classes/cs99s/papers/hillis-steele-data-
parallel-algorithms.pdf)

~~~
foota
I'm missing something here, how do you know your start state when you do it in
parallel?

~~~
pjscott
The part you do in parallel is computing the function "If the DFA is in state
x, and it sees the whole input string, where will it end up?" You can
represent that function as a table, one entry for each state. Once you've got
that table, just look in the entry for the start state.

~~~
Dylan16807
So each thread might have to do up to m times the work, where m is the number
of states? That sounds like it might be a pretty bad hit on the number of
processors you need just to break even.

------
mac01021
Interesting. One thing they leave implicit is the amount of _work_ that
parsing requires.

The result is that you can parse in time O(log n) if you use O(n) processors,
so parsing requires O(nlogn) work. (This is probably to be expected.) I think
this means that, if you have o(n) strings to parse, you're better off with the
good old DFA solution which uses O(n) work and parsing the different strings
in parallel.

~~~
gue5t
Yes, given no context one assumes serial computation, where work and depth are
identical. An O(log n)-work parser would be astounding, because it's
impossible--it would have to ignore effectively all of the input. This isn't
clickbait because it's a real article and a good technique, but it's
definitely exaggerating the surprise factor with misleading phrasing.

------
smadge
Pretty cool! I was having trouble understanding the relationship between
monoids and deterministic finite automata, but once I worked out an example by
hand I was able to implement that transformation in Haskell:

    
    
        data MonoidMachine monoid input = Monoid monoid =>  MonoidMachine
          (input  -> monoid)
          (monoid -> Bool)
    
        -- From the proof of Theorem 1 in the paper.
        translate :: DFA state input -> MonoidMachine (Endo state) input
        translate (DFA _ _ delta initial accepting) = 
        MonoidMachine f g where
          f a = Endo (\q -> delta q a)
          g (Endo h) = h initial `member` accepting
    
        runMonoidMachine :: MonoidMachine monoid input -> [input] -> Bool
        runMonoidMachine (MonoidMachine f g) = g . mconcat . map f
    

The rest is here:
[https://gist.github.com/newjam/428613f083d289b83de0ddfeedf4e...](https://gist.github.com/newjam/428613f083d289b83de0ddfeedf4e9a7)

------
dwenzek
Here is a very smart blog about "Incremental Regular expressions" which is
based on this relationship with monoids.

[http://jkff.info/articles/ire/#f7](http://jkff.info/articles/ire/#f7)

