
The absolute bare minimum every programmer should know about regular expressions - rayvega
http://immike.net/blog/2007/04/06/the-absolute-bare-minimum-every-programmer-should-know-about-regular-expressions/
======
philwelch
Slight terminological quibble:

 _Regular expressions are strings formatted using a special pattern notation
that allow you to describe and parse text._

"Parse" is a questionable word choice. A proper Regular Expression can only
describe patterns found in "regular languages", though modern backtracking
regex engines have a bit more power. Parsing implies a _parser_ , which means
matching a grammar somewhat more sophisticated than a regex engine can handle.

~~~
michaelfairley
Regular expressions in modern programming languages and the regular
expressions you learned about in your automata theory class are not the same
thing. The back references present in the regular expressions discussed in the
article make the entire system capable of handling even NP-complete problems.
So yes, regular expressions (in the more common use of the term) can "parse".

~~~
philwelch
I did allude to that, but regular expressions are still not quite up to the
tasks you would use a parser for.

~~~
Goosey
I can only hope this challenge inspires a spiteful hacker to create a full
grammar parser in a regular expression. It won't be me, but I love how these
types of comments can sometimes spawn the most interesting toy projects. :)

~~~
riffraff
You may enjoy this <http://www.a-k-r.org/abnf/>

------
lbrandy
Considering he listed just about everything I can remember about regexes
without google, and nothing more, I believe I'm duty bound to upvote.

My only quibble: I'd promote the character classes (especially \s and maybe
\S) into this list from the "expert" category. Here's part2 (or what I call:
stuff I can google later, if I need it), if anyone wants to keep reading:
[http://immike.net/blog/2007/06/21/extreme-regex-foo-what-
you...](http://immike.net/blog/2007/06/21/extreme-regex-foo-what-you-need-to-
know-to-become-a-regular-expression-pro/)

~~~
davidw
(Off topic, but that same character you randomly dump into irc messages
appeared at the beginning of this comment)

~~~
jacquesm
That appears to be a backspace.

------
cousin_it
The title makes me wonder... I was an okay programmer before I learned about
regexes. When the need arose, I learned how to use them easily. Should I have
learned them before that?

Many programmers seem to think that they possess just the right amount of
knowledge. Someone who knows Unix utilities but not algorithm design will
argue that Unix utilities are vital, but algorithm design is a random obscure
topic; someone with the opposite skillset will defend it just as eloquently. I
think programmers have no obligation to know anything. If you can do the job,
I don't care how much you use Google.

I'm spoiled enough to think that boring things shouldn't be deliberately
memorized. It's enough for me to know the mathematical concept of a regular
language, which lets me recognize situations where regexes would help. For
concrete syntax we always have cheatsheets. This attitude has its advantages:
for example, I never needed to be explained why parsing XML with regexes is a
bad idea.

~~~
kaens
I agree that boring things shouldn't be memorized, and that being familiar
with the concept of a regular language helps identify where a regular
expression could help.

Two points:

\+ A lot of programmers aren't familiar with the concept of a regular
language, especially those that are hobbyists, or who went to shitty schools.
Sad but true.

\+ Regular expressions are useful commonly enough that I would be very
surprised if an okay programmer with any sort of experience under their belt
couldn't remember at least [] . * ? + .

I definitely agree with your main point. There is, however, a weird subjective
layer of knowledge that _tends_ to be present in developers of any experience.
That doesn't mean that I would look down on an individual programmer who
claimed experience that didn't know how to make a regex character class off
the top of their head, but I _would_ be a bit surprised.

~~~
cousin_it
I agree. When you're planning to invade Iraq, you need people who shoot well,
and you need people who have been to Iraq before. They don't have to be the
same people, though.

We had a great experience at one of our contract jobs: we found a smart math
student with no webdev experience, and he did most of the programming while I
constantly consulted him on "platform issues" and jumped in to code any tricky
parts he had problems with. He became a competent webdev in about two months -
way faster than it took me to learn this stuff by myself - then I handed the
project over to him and left. I wonder why other employers don't do this more:
mix experienced programmers with smart newbies who come from hard-science
backgrounds. Seems to be a winning combination.

------
snitko
It's very disappointing to find the author didn't mention greedy matchers.
Consider the difference:

    
    
      >> s = "hello world, I love you world"
      >> s.sub(/(.*?)world/, '\1universe')
      => "hello universe, I love you world"
      >> s.sub(/(.*)world/, '\1universe')
      => "hello world, I love you universe"

~~~
samdk
Those are in the next article in the series, which is linked from the bottom
of this one: [http://immike.net/blog/2007/06/21/extreme-regex-foo-what-
you...](http://immike.net/blog/2007/06/21/extreme-regex-foo-what-you-need-to-
know-to-become-a-regular-expression-pro/).

------
brehaut
I prefer fishbowl's law

    
    
       Every regexp that you apply to a particular block of text reduces 
       the applicability of regular expressions by an order of magnitude.
    

\-
[http://fishbowl.pastiche.org/2003/08/18/beware_regular_expre...](http://fishbowl.pastiche.org/2003/08/18/beware_regular_expressions/)

The minimum any programmer needs to know about regexps is when they are
applicable, and more importantly, when they are not.

~~~
brehaut
Seeing how im advocating that everyone should know when to be using parsing
tools rather than cobbling together a regexp cludge, here are some links to
some relevant tools in a variety of languages. Note that these are not parser
generator tools, but parser combinator libraries:

    
    
        * Python: 
            PyParsing - http://pyparsing.wikispaces.com/ - this   
            is probably the most well respected parser library 
            for python 
            
            picoparse - http://github.com/brehaut/picoparse - My 
            parsec inspired parser combinator library. It is 
            significantly more light weight than pyparsing, and
            I would recommend pyparsing in its place for more 
            heavy lifting parsers.
    
        * Ruby:
            RParsec - http://rubyforge.org/projects/rparsec/ - A 
            ruby implementation of Parsec. I've heard good 
            things about it.
    
        * Haskell: 
            Parsec - http://legacy.cs.uu.nl/daan/parsec.html - 
            This is an amazing library, worth anyones time. 
            http://book.realworldhaskell.org/read/using-parsec.html 
            http://www.haskell.org/haskellwiki/Parsec
    
        * F#:
            FParsec - http://www.quanttec.com/fparsec/ - This is 
            reason enough to learn F#; it is a fairly true port 
            of the haskell library and very nice to use. 
    
        * Scala: 
            Packrat Parsing - I believe this is part of the 
            standard lib. http://www.scala-lang.org/docu/files/api/scala/util/parsing/combinator/PackratParsers.html 
    
        * Clojure: 
            fnparse - http://github.com/joshua-choi/fnparse
    
        * Java: 
            jparsec - http://jparsec.codehaus.org/
    

[edits: formating]

~~~
jasonkester
I think this actually illustrates the other guy's point. If you want a parser,
you need to go download code off the internet. If you want a Regex, you type
something along the lines of new Regex().

~~~
util
[http://research.microsoft.com/en-
us/um/people/daan/parsec.ht...](http://research.microsoft.com/en-
us/um/people/daan/parsec.html) "Nowadays, Parsec is distributed with the
standard Haskell libraries with most Haskell compilers (GHC, NHC, Hugs)."

------
meric
It's quite a good guide but I had thought it was going to contain warnings
like "Don't use regular expressions to parse XML" and then explain why.

------
joubert
In Python, you can get the regex parse tree to help you debug your regex
expressions: [http://stackoverflow.com/questions/101268/hidden-features-
of...](http://stackoverflow.com/questions/101268/hidden-features-of-
python/143636#143636)

------
klochner
I learned something new - disjunction is sometimes called "alternation":

<http://www.britannica.com/EBchecked/topic/165607/disjunction>

------
pmiller2
One thing everyone should know about regular expressions is that the things
called "regular expressions" provided by libpcre (which, of course, stands for
"Perl compatible regular expressions") aren't regular expressions at all.
Since they support backtracking, they provide a more powerful parsing
framework than regular expressions alone (which don't support backtracking).
That's how the (in)famous regular expression that recognizes prime numbers can
work, even though the language of prime numbers is not regular.

------
mbateman
For some reason, I've never been able to get the knack of regular expressions.
This is the best introduction to them that I've seen, though I wish I'd read
it years ago. However, if you've tinkered with regex at all ever, which I
suspect virtually everyone here has, you'll know this. (I know it, and I hate
and avoid regular expressions and am not even a programmer.)

~~~
Natsu
It's weird, because they've always just clicked for me. I mean, you have text
and you compare it to a pattern. The regex is just a crazy shorthand for the
pattern you're looking for. What pattern should you match? Well, first write
down examples of what you want to match and see what they have in common.

I think it's easier to get if you match a pattern to some text by hand, taking
a simple pattern like ^a+b _c$ and trying to see if it matches:

Here are a few test cases:

"aaaaaaaaabbbc" ?

1) ^a+ means it starts with 1 or more 'a's, so those eat all my 'a's and I'm
matching bbbcc against 'b_c' now 2) b* means 0 or more 'b's; those eat up
everything but the c at the end. 3) c$ matches exactly one c at the end, so
yeah, we have a match. If there had been one more c (or anything else), the
match would fail because it wouldn't hit the $ (which is end of string).

aaac ?

1) ^a+ eats all the 'a's again. 2) b* ... what b _? It's "zero or more" so
yes, we can skip it. 3) c$ Yes, there's one c at the end. We have a match.

bc ?

1) We need _one* or more 'a's at the start. That's the difference between +
and _. Match fails immediately.

acbbc ?

1) ^a+ matches the a. 2) b_ means it's optional, so we don't fail... yet. But
we don't match anything, either. 3) c$ That c matches, but we're not at the
end of the string yet. There's still "bbc" left to match and we're out of
pattern. Fail.

You can do something similar with the tools other people have linked in
comments. This was just to give an example pattern and a few test cases to get
you started.

It's weird, though. Even though this clicks for me, I don't feel like I get
monads at all. Yeah, I've read the 3 monad laws (great... so something being a
monad means that I can monad it, un-monad it and do something vaguely similar
to composition on monads) and I can do some very simple Lisp which uses built-
in stuff they say is built from monads (like printing). But I don't know how
they work internally at all, or how they're fundamentally different from all
the other Lisp statements ("causes side affects" isn't terribly specific...
why is printing to the screen a side effect? Is setq a monad?).

In other words, there's probably some topic for every programmer that they
really have to struggle with. As easily as regexes come to me, I struggle to
understand other stuff.

I've been at it for quite a while, incidentally. But I'm not going to give up
any time soon. So keep working on it.

~~~
lincolnq
I wrote some stuff about monads on HN a while ago that some people found
helpful: <http://news.ycombinator.com/item?id=247446>

~~~
Natsu
Thanks. I just read it.

It does help, but I think there's still a lot I don't know. I mean, 'bind' and
'return' still seem like the names are backwards to me. Somehow, I feel like
'return' should give me a value, rather than shoving a value into a monad. But
that's probably because I'm used to C and its relatives.

Perhaps I simply need to read up more on FP itself. I have a huge bookshelf
full of half-read books (and yes, plenty of fully-read books) but nowhere near
enough free time...

~~~
lincolnq
Oh. Yeah. The names are poorly chosen, but we're stuck with them. :(

------
SteveC
When writing regular expressions I always use this website to test them out.

<http://www.rubular.com>

------
raffi
I learned regular expressions when writing the documentation for my Sleep
programming language. They never made sense to me until I had to organize my
thoughts on them in a logical way. I'm still quite happy with how that chapter
turned out:

<http://sleep.dashnine.org/manual/regex.html>

------
JoelSutherland
Here's a cool online tool my friend (HN user KrisJordan) made:

<http://www.gethifi.com/tools/regex>

It's super handy when learning Regular Expressions because it shows results as
you type.

~~~
mark_h
Obligatory emacs post: M-x re-builder is also pretty handy. Of course, then
you're using emacs re syntax, which is a bit clunkier.

------
ez77
While we’re at it, could someone please explain why \\[[^\\[\\]]*\\] does not
match things like [X], where X is a string of non-square-bracket characters?
How would you fix it? Thanks!

~~~
henrikschroder
In what language? That expression works fine for me when I test it?

~~~
ez77
Say in ed or sed.

~~~
ralph

        $ printf 'a[]b[c]d[ef][g[h]i\n' | sed 's/\[[^][]*]/-/g'
        a-b-d-[g-i
        $ ed
        a
        a[]b[c]d[ef][g[h]i
        .
        s/\[[^][]*]/-/g
        p
        a-b-d-[g-i
        q
        ?
        q
        $

------
pmccool
I was surprised to find no mention of the issues with nesting. Every
programmer should know about this limiation, even if they don't get into the
pumping lemma and whatnot.

------
lanstein
(2007)

kind of disappointed that there's 13 comments and no mention of jwz's quote.

in terms of actual thoughts, this is certainly an aptly named article.

~~~
kingkilr
Of course there's always the counterpart to jwz's quote:

> Some people, when confronted with a problem, think "I know, I'll quote Jamie
> Zawinski." Now they have two problems.

[<http://twitter.com/diveintomark/statuses/1249729494>]

~~~
lanstein
awesome.

------
crpatino
1\. Regular expressions cannot match recursive patterns.

2\. No, they cannot! extended regexps are a nice hack, but not really met
formal regular expression definition.

3\. man grep

------
d0m
Did someone learn something on this site? Why it's on hacker news instead of
newbies news.

------
dbz
Meh, no one here wants to here my spoiled brat opinion, so I apologize in
advance:

It should also be included in the bare minimum:

Regex will drive you crazy. Not just writing the expression, but installing a
library (especially if there is no handy installer) ect. Regex likes to punch
you in the balls whenever possible. (I can say this because I've written many
line of Regex and I _do_ love it.)

.

.

Some people, when confronted with a problem, think "I know, I'll use regular
expressions." Now they have two problems. -Jamie Zawinski

