

A regular expression matcher in 19 lines of Clojure - decklin
https://plus.google.com/117377434815709898403/posts/HQek48JMJb3

======
gjm11
Repeating, mutatis mutandis, what I said in an earlier thread
(<http://news.ycombinator.com/item?id=2723366>) about the C code on which this
is based:

It's very nice pattern-matching code, but it isn't a regular expression
matcher. The class of patterns it's able to match is much smaller than the
class of all regular expressions -- not because of missing "sugar" (character
classes, +, non-greedy wildcards, ...) but because it doesn't support groups
or the | operator.

As a result, the language supported by this matcher has, e.g., no way to match
the pattern (ab)* or the pattern a|b. It's much, much less powerful than an
actual regular expression matcher would be, and much of what makes it possible
to do it in 19 lines of Clojure is that loss of power.

(I've written extremely similar code before: this level of functionality --
basically, glob or DOS wildcards -- is pretty useful. I'm not Rob Pike, and my
code for similar functionality would probably be longer than his. But my code,
or even Rob Pike's code, for even the simplest thing that could honestly be
called a regular expression matcher, would be longer than this by a bigger
factor.)

------
alexgartrell
I'm no haskell hacker (by a long, long, long shot[0]), and I've played a lot
more with Clojure than with Haskell, but I still found this easier to write
and read in haskell than in clojure.

    
    
        matches :: String -> String -> Bool
        matches rs = dm ((fixHead . fixTail) rs) where
          fixHead ('^':rs) = rs
          fixHead rs = '.':'*':rs
          fixTail rs = case (reverse rs) of
            ('$':rrs) -> reverse rrs
            rrs -> reverse ('*':'.':rrs)
          dm "" "" = True
          dm ('.':'*':rs) (c:cs) = (dm ('.':'*':rs) cs) || (dm rs (c:cs))
          dm (r:'*':rs) (c:cs) = (dm rs (c:cs)) || (c == r && (dm (r:'*':rs) cs))
          dm ('.':rs) (c:cs) = dm rs cs
          dm (r:rs) (c:cs) = r == c && dm rs cs
          dm (_:'*':[]) "" = True
          dm _ _ = False
    

[0] I'm particularly ashamed of fixHead and fixTail and the way I special
cased matching a star and nothing else against the empty string

edit: couldn't resist doing the golf thing

    
    
        matches :: String -> String -> Bool
        matches rs = dm (fixHead rs) where
          fixHead ('^':rs) = rs
          fixHead rs = '.':'*':rs
          cm r c = r == '.' || r == c
          dm (r:'*':rs) (c:cs) = (dm rs (c:cs)) || ((cm r c) && (dm (r:'*':rs) cs))
          dm (_:'*':"") "" = True
          dm (r:rs) (c:cs) = cm r c && dm rs cs
          dm "" _ = True
          dm "$" "" = True
          dm _ _ = False

~~~
alexgartrell
Ok, now I'm done

    
    
        matches :: String -> String -> Bool
        matches ('^':rs) = dm rs where
          cm r c = r == '.' || r == c
          dm (r:'*':rs) (c:cs) = (dm rs (c:cs)) || ((cm r c) && (dm (r:'*':rs) cs))
          dm (_:'*':"") "" = True
          dm (r:rs) (c:cs) = cm r c && dm rs cs
          dm "" _ = True
          dm "$" "" = True
          dm _ _ = False
        matches rs = matches ("^.*" ++ rs)

------
drallison
LOC (Lines of Code) is not a good characterization of the quality of code;
clarity and understandability is. The Clojure implementation is pretty nice on
those counts.

The LOC measure may be important when, for example, the goal is to show the
number of operations to be as small as possible.

~~~
gruseom
_LOC (Lines of Code) is not a good characterization of the quality of code_

Actually it is. Or at least it's the best we have. When it comes to anything
you can measure, the evidence points to program length as the best predictor
of quality of code. And LOC is, for better or worse, the standard metric for
code size.

We had a whole thread on HN about this a month ago. Specific studies are cited
there. <http://news.ycombinator.com/item?id=3037293>

Speaking of which, someone later told me that Chapter 27 of McConnell's _Code
Complete_ cites studies according to which bug count grows as a mild
exponential function of code size. I don't have that book handy. If anyone
does, can they look at Chapter 27 and tell us what it says?

~~~
arohner
My edition is copyright 1993. In Chapter 21.3, they cite Capers Jones 1977
"Program Quality and Programmer Productivity". (Couldn't find a handy link
online, appears to be [http://www.amazon.com/Program-quality-programmer-
productivit...](http://www.amazon.com/Program-quality-programmer-productivity-
Technical/dp/B0006XJJ80))

They report the following table

    
    
        Project Size (LoC)       Error Density (Bugs / kLoC)
        0-2k                     0-25
        2k-16k                   0-40
        16k-64k                  0.5-50
        64k-512k                 2-70
        512k or more             4-100
    

So yes, the error _density_ increases with LoC.

------
ramiyer
I so understand this
[http://www.cs.princeton.edu/courses/archive/spr09/cos333/bea...](http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html)
way better than clojure. What matters? number of lines or readability? and I
am sure the Clojure implementation would be better understood by people who
understand lisp. Ofcourse I haven't played with clojure or any lisp dialects
and the main reason for that is when I see a lisp code I see only "(" and ")".
I still don't know how to get out of that.

~~~
dewitt
The best way I found to see past the parens is to simply start writing some
lisp. I recently started teaching myself by working through the problems on
4clojure.com, reading pg's _On Lisp_, and books like _Land of Lisp_,
_Programming Clojure_, _Joy of Clojure_, etc., and even within a day or two of
learning I found I no longer saw the syntax at all. In fact, one of the great
beauties of Lisp and s-expressions is that the syntax is so so minimal and so
consistent that it all but disappears. (Clojure is a bit more complicated, in
that it introduces additional special forms for java interop, metadata,
vectors and maps, but you get used to that quickly, too.)

A much trickier thing for me personally has been memorizing what all of the
standard functions do, particularly bouncing back and forth between common
lisp and clojure like I've been. Clojuredocs.org and the clojure source code
itself have been essential companions in this journey.

So the only thing holding you back is the parens, then have no fear, you'll
actually grow to love them (especially if you're an emacs user, thanks to
show-paren-mode and electric-pair-mode and the like).

------
0x12
Both versions are really nice but I find the C version to be lots easier to
understand (I come from a C background, so that's probably not a surprise).

What does this look like to an experienced lisp/clojure programmer, do they
have the same feeling but reversed?

What about programmers familiar with other languages?

