
Regular Expression Engine in 14 lines of Python - okal
http://paste.lisp.org/display/24849
======
abecedarius
Sometimes loops forever, sometimes takes exponential time. Avoiding these
problems:

    
    
        """
        Regular-expression matching by the Thompson construction.
        Explained in C at http://swtch.com/~rsc/regexp/regexp1.html
        """
    
        def match(re, s): return run(prepare(re), s)
    
        def run(states, s):
            for c in s:
                # set().union(...) also works once no states remain;
                # set.union(*[]) would raise TypeError.
                states = set().union(*[state(c) for state in states])
            return accepting_state in states
    
        def accepting_state(c): return set()
        def expecting_state(char, k): return lambda c: k(set()) if c==char else set()
    
        def state_node(state): return lambda seen: set([state])
        def alt_node(k1, k2):  return lambda seen: k1(seen) | k2(seen)
        def loop_node(k, make_k):
            def loop(seen):
                if loop in seen: return set()
                seen.add(loop)
                return k(seen) | looping(seen)
            looping = make_k(loop)
            return loop
    
        def prepare(re): return re(state_node(accepting_state))(set())
    
        def lit(char):     return lambda k: state_node(expecting_state(char, k))
        def alt(re1, re2): return lambda k: alt_node(re1(k), re2(k))
        def many(re):      return lambda k: loop_node(k, re)
        def empty(k):      return k
        def seq(re1, re2): return lambda k: re1(re2(k))
    

From <https://github.com/darius/sketchbook>

I'm not sure how the original comes to 14 lines, btw.

~~~
walrus
It comes to 14 if you ignore the first line of each function (e.g. `def
iconcat(xs, ys):`). I'm not sure why one would count lines like that, though.

------
Peaker
Transliteration into Haskell brings type-safety into it, and cuts it down to 7
lines:

    
    
      import Prelude hiding (seq)
      
      seq l r s = concatMap r (l s)
      alt l r s = l s ++ r s
      star e s = s : seq e (star e) s
      plus e = seq e (star e)
      char c (x:xs) 
        | x == c    = [xs]
        | otherwise = []
      char _ _ = []
      
      example = seq (char 'c') $ seq (plus (alt (char 'a') (char 'd'))) (char 'r')
      
      main = do
        line <- getLine
        mapM_ (putStrLn . ("Match with remainder: "++)) (example line)
    

(Note that the original implementation failed to reuse itertools.chain; in
Haskell I just use (++).)
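
For comparison, here is a rough Python rendering of the same combinators, using generators and `itertools.chain` where the Haskell uses `concatMap` and `(++)`. It is only a sketch and not the code from the original paste; the function names simply mirror the Haskell ones, and `char` assumes a single-character argument.

```python
from itertools import chain

def char(c):
    # Succeed with the remainder if s starts with the character c.
    def parse(s):
        if s[:1] == c:
            yield s[1:]
    return parse

def seq(l, r):
    # Run r on every remainder produced by l (concatMap in the Haskell).
    return lambda s: chain.from_iterable(r(rest) for rest in l(s))

def alt(l, r):
    # All results of l, then all results of r ((++) in the Haskell).
    return lambda s: chain(l(s), r(s))

def star(e):
    # Zero repetitions first, then one repetition followed by more.
    def parse(s):
        yield s
        yield from seq(e, star(e))(s)
    return parse

def plus(e):
    return seq(e, star(e))

example = seq(char('c'), seq(plus(alt(char('a'), char('d'))), char('r')))

for remainder in example("cadr!"):
    print("Match with remainder: " + remainder)
```

Like the Haskell version, each parser yields every possible remainder, so a caller that wants a full match has to look for an empty remainder among the results.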

------
cousin_it
Why call it a regexp engine? Those are actually parser combinators with
backtracking. They can parse much more than just regular languages, but are
exponentially slower compared to a proper regexp engine.
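
Both halves of that claim can be illustrated with a small list-of-remainders version of such combinators. This is a sketch under my own naming (`counter`, `anbn`, and `count_steps` are made up for the illustration), not the code from the paste: `anbn` recognizes n 'a's followed by n 'b's, which no regular expression can, and `count_steps` shows the work for `(a|a)*` roughly doubling with each extra input character.

```python
counter = {"steps": 0}  # number of single-character tests performed

def char(c):
    def parse(s):
        counter["steps"] += 1
        return [s[1:]] if s[:1] == c else []
    return parse

def empty(s):
    return [s]

def seq(l, r):
    return lambda s: [rest2 for rest1 in l(s) for rest2 in r(rest1)]

def alt(l, r):
    return lambda s: l(s) + r(s)

def star(e):
    # star(e) is only expanded when the parser runs, so building it terminates.
    return lambda s: [s] + seq(e, star(e))(s)

def matches(parser, s):
    # A full match means some alternative consumed the whole input.
    return "" in parser(s)

def anbn(s):
    # Not a regular language: n 'a's followed by n 'b's, by plain recursion.
    return alt(empty, seq(char("a"), seq(anbn, char("b"))))(s)

def count_steps(n):
    # Work done by (a|a)* on a string of n 'a's.
    counter["steps"] = 0
    star(alt(char("a"), char("a")))("a" * n)
    return counter["steps"]

print(matches(anbn, "aaabbb"), matches(anbn, "aab"))
print([count_steps(n) for n in range(1, 8)])
```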

~~~
Peaker
regexps aren't regular, so most "proper" engines also do backtracking.

Indeed, if you stick with the regular subset, you can be really fast:

<http://swtch.com/~rsc/regexp/regexp1.html>

------
127
Here's a complete PEG parser in 3 lines:

    
    
        consume = lambda c: lambda inp: inp[1:] if inp and inp[0] in c else None
        ordered_choice = lambda va,vb: lambda inp: va(inp) or vb(inp)
        concatenate = lambda va,vb: lambda inp: (lambda x: x and vb(x))(va(inp))
    

Used like this (something I hacked together recently):
<http://pastebin.com/dUyTZttZ>

If you want it to actually do something useful, just wrap the functions with
other functions that do what you need.

~~~
abecedarius
In `va(inp) or vb(inp)`, couldn't `va(inp)` succeed with an empty string as the
result (at the end of the input), which Python interprets as false, making it
mistakenly try `vb(inp)`? I haven't tried to run it.

Also I think c seems misleading as a variable name -- it suggests a single
character, but the 'in c' means it's a set of characters.
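
A quick check of that suspicion, using the three definitions exactly as posted (the names `a_or_b`, `a_then_b`, `choice_result`, and `concat_result` are only for this illustration). It also suggests that the truthiness test in `concatenate` has a related problem:

```python
# Definitions copied from the comment above.
consume = lambda c: lambda inp: inp[1:] if inp and inp[0] in c else None
ordered_choice = lambda va,vb: lambda inp: va(inp) or vb(inp)
concatenate = lambda va,vb: lambda inp: (lambda x: x and vb(x))(va(inp))

# consume("a") succeeds on "a" with remainder "", but "" is falsy,
# so `or` falls through to consume("b"), which fails.
a_or_b = ordered_choice(consume("a"), consume("b"))
choice_result = a_or_b("a")      # None, although the first branch matched

# Here `x and vb(x)` returns the falsy "" without ever calling vb,
# so the result looks like a complete match even though "b" is missing.
a_then_b = concatenate(consume("a"), consume("b"))
concat_result = a_then_b("a")    # "", although "b" was never matched

print(repr(choice_result), repr(concat_result))
```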

~~~
127
Yes, that is a bug and yes, those variable names do suck. Thanks for the
correction.

    
    
        ordered_choice = lambda va,vb: lambda inp: (lambda a,b: a if a or a=="" else b)(va(inp),vb(inp))
    

Would work, but starts to be so ugly that I'd have to write proper functions
and the code would not be 3 lines anymore. ;)

~~~
abecedarius
How about

    
    
        eat = lambda chars: lambda s: s[1:] if s and s[0] in chars else None
        alt = lambda p, q: lambda s: (lambda r: q(s) if r is None else r)(p(s))
        seq = lambda p, q: lambda s: (lambda r: None if r is None else q(r))(p(s))
    

going a bit far in the short-names direction, oh well. :) It's too bad Python
has this artificial statement/expression distinction.
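
A few calls against those definitions, as a sanity check that an empty remainder is now kept distinct from failure (`digit`, `two_digits`, and `a_or_b` are names made up for the example):

```python
eat = lambda chars: lambda s: s[1:] if s and s[0] in chars else None
alt = lambda p, q: lambda s: (lambda r: q(s) if r is None else r)(p(s))
seq = lambda p, q: lambda s: (lambda r: None if r is None else q(r))(p(s))

digit = eat("0123456789")
two_digits = seq(digit, digit)
a_or_b = alt(eat("a"), eat("b"))

print(repr(two_digits("42")))    # '' : both digits consumed
print(repr(two_digits("423")))   # '3': one character left over
print(repr(two_digits("4")))     # None: second digit missing
print(repr(a_or_b("a")))         # '' : success, not mistaken for failure
```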

------
psykotic
This was a bit of a joke at the time. I'm surprised no one has yet pointed out
the bug with unproductive left recursion which haunts this and all other
similar matchers. Harper's classic Proof-Directed Debugging paper shows how to
pre-normalize regular expressions to eliminate the issue:
<http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.5602>

~~~
qntm
How can this left-recurse? Can you give an example? I thought regular
expressions weren't recursive at all.

~~~
psykotic
The Kleene star is a fixed-point/recursion operator. Here's an example that
causes an infinite loop:

    
    
        (|a)*

~~~
qntm
That's not left-recursion.

~~~
psykotic
I probably shouldn't have used the 'left' adjective. But that expression
results in an unproductive recursion (unproductive meaning that no characters
are consumed through each cycle) analogous to what happens when you recursive-
descent parse a context-free grammar containing left-recursive productions.
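
To make the unproductive cycle concrete, here is a minimal sketch with list-returning combinators (my own names, not the paste's code). In Python the endless recursion shows up as a `RecursionError` rather than a hang. The guarded variant is a simple runtime check that skips iterations which consumed nothing; that is a different technique from the pre-normalization in Harper's paper, which rewrites the expression up front.

```python
def char(c):
    return lambda s: [s[1:]] if s[:1] == c else []

def empty(s):
    return [s]

def alt(l, r):
    return lambda s: l(s) + r(s)

def naive_star(e):
    # Zero iterations, or one iteration of e followed by the star again.
    def parse(s):
        return [s] + [rest2 for rest1 in e(s) for rest2 in parse(rest1)]
    return parse

def guarded_star(e):
    # Same idea, but ignore iterations of e that consumed no input.
    def parse(s):
        results = [s]
        for rest in e(s):
            if len(rest) < len(s):
                results += parse(rest)
        return results
    return parse

empty_or_a = alt(empty, char("a"))   # the (|a) from the example

def recurses_without_end(parser, s):
    try:
        parser(s)
    except RecursionError:
        return True
    return False

print(recurses_without_end(naive_star(empty_or_a), "aa"))
print(guarded_star(empty_or_a)("aa"))
```

Dropping the empty iterations does not change which strings the star accepts, since an iteration that consumes nothing leaves the remainder where it already was.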

------
skrebbel
The one-character variable names remove any value this snippet might possibly
have for documentation or demonstration.

------
Wilfred
Interesting that it's on paste.lisp.org. It strongly reminds me of Emacs' rx
macro. I suspect this is a port of another toy regexp implementation -- otherwise
I would expect the author to use `itertools.chain` rather than implementing
something that does exactly the same (`iconcat` in this sample).
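
For illustration, a two-argument helper of that kind is interchangeable with `itertools.chain`. The body of `iconcat` below is a guess at what such a helper would look like, not a copy of the paste:

```python
from itertools import chain

def iconcat(xs, ys):
    # Hypothetical body: yield everything from xs, then everything from ys.
    for x in xs:
        yield x
    for y in ys:
        yield y

print(list(iconcat([1, 2], iter([3]))))
print(list(chain([1, 2], iter([3]))))
```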

~~~
darklajid
Pasted by: psykotic When: 5 years, 2 months ago

I don't know Python, but was your preferred choice available at that time?

~~~
nknight
It was, though it was fairly new at that time, so the author may not have had
it on his system.

Also, the Python standard library is quite large. Most people will never
really know about all its nooks and crannies, and it's easy to forget they
exist. I've accidentally reimplemented parts of it more than once.

------
berntb
Work on a new implementation might be a good idea. :-)

For the last few months I've used the PCRE regexp library, which afaik is the one used
in most implementations wanting Perl compatibility. I was a bit shocked... I
thought it was good?

I ran into a couple of cases with bugs (one regexp couldn't be ported from
Perl, and another overly complex one from a newbie hung(!) the process). It
seems inefficient with backtracking, doesn't it? (Maybe this is fixed now?)

~~~
albertzeyer
Even though this implementation is short, don't think it is efficient.

Matching regular expressions can be done in linear time.

The algorithm presented here takes exponential time.
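
A minimal sketch of the linear-time idea: simulate an NFA by tracking the set of states it could be in, so each input character is handled once and nothing is ever revisited. The transition table below is hand-built for the hypothetical pattern `(a|b)*abb`; a real engine would compile it from the expression.

```python
# Hand-built NFA for the hypothetical pattern (a|b)*abb.
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}

def nfa_match(s):
    states = {START}
    for c in s:
        # Advance every live state at once; the set never holds more
        # than the NFA's four states, so the work per character is bounded.
        states = {nxt for st in states
                  for nxt in TRANSITIONS.get((st, c), ())}
    return bool(states & ACCEPTING)

print(nfa_match("babaabb"), nfa_match("ab"))
```

For a fixed pattern the running time grows linearly with the input length, because the per-character cost depends only on the number of NFA states.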

~~~
nandemo
> Matching regular expressions can be done in linear time.

Unfortunately, the expression "regular expression" has been so abused in the
context of programming that, unless otherwise specified, we cannot expect it
to correspond to the well-defined regular expression that is used in formal
language theory.

For example, statements such as "you can use regular expressions to parse
HTML" and "regexps aren't regular" (mentioned in this thread) are now
acceptable. I wish people would at least say "extended regular expressions",
but it's a lost cause.
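
As a concrete case of the gap, here is a short sketch with Python's `re` module: a backreference lets a pattern require the same number of 'a's on both sides of a 'b', which is not a regular language in the formal sense. The helper name `is_an_b_an` is made up for the example.

```python
import re

# \1 must repeat exactly what group 1 matched.
same_count = re.compile(r"(a*)b\1")

def is_an_b_an(s):
    return same_count.fullmatch(s) is not None

print(is_an_b_an("aaabaaa"), is_an_b_an("aaabaa"))
```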

------
Encryptor
in 14 lines of ... python?? really? Ok, I can do better. A Regular Expression
Engine in 1 line of JavaScript: string.match();

~~~
dchest
You're confusing the implementation of an algorithm with a library call.

