
Stop avoiding regular expressions damn it - bradt
http://bradt.ca/blog/stop-avoiding-regular-expressions-damn-it/
======
dasil003
The core criticism of regular expressions is legitimately directed at
intermediate programmers who know enough to be dangerous, but is sometimes
inappropriately cargo-culted by beginner programmers who use it as an excuse
not to learn regular expressions.

The fact is that despite pithy slogans, there is a sweet spot where a regular
expression does the job of matching a string in a clearer fashion than
anything else. But that sweet spot is well shy of the theoretical power of
regular expressions (especially in Perl!). Beyond it, you should further your
understanding of a range of parsing techniques before hacking together a
baroque regex.

~~~
dllthomas
Though there's nothing wrong with busting out a baroque regex in one time use
contexts (editor's search function, throw-away application of grep or sed,
&c).

~~~
dasil003
Of course not.

------
4ad
I found the house I currently live in with regular expressions.

A couple of years ago I moved to a different country, and for some reason I
needed _two_ apartments, preferably close to each other. As you can imagine,
the real estate websites are not designed for the kind of query I needed, so I
wrote some code to aid me in my quest[1].

It's just a shell script and text processing with awk. I download various
results with all the available apartments from many real estate websites, then
I scrape the data I care about (with regular expressions!) like address,
rooms, price, anything really, and query the Google Maps API with all the
addresses to retrieve the geographical coordinates, then I compute the
distances between any two houses and sort them.

It's fantastically modular. Adding support for a new website meant just
creating some regular expressions that work for that website. This was great
because I was doing this on the road, as I was visiting the foreign city and
found new sources of information.

Regular expressions were also great because these websites didn't have any API
where I could query for the address, etc. I had to rely on what _people_ wrote
in their ads. This meant that when I wrote a regexp to match a set of results
I had to inspect the failures to see new ways people described their houses
and improve my matching based on that. Initially I had hoped I'd be able to
parse 80% of the ads, but measurements and careful coding allowed me to
match approximately 99% of the ads!
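
A minimal sketch of that match-then-inspect-failures loop, written in Python rather than the shell and awk the project actually used. The patterns, function names, and sample ads are invented for illustration and are not taken from the project:

```python
import re

# Hypothetical patterns for a room count; real ads would need more variants.
ROOM_PATTERNS = [
    re.compile(r"(\d+)\s*rooms?\b", re.IGNORECASE),
    re.compile(r"(\d+)[- ]?(?:br|bedrooms?)\b", re.IGNORECASE),
]

def extract_rooms(ad_text):
    """Return the room count as an int, or None if no pattern matches."""
    for pattern in ROOM_PATTERNS:
        match = pattern.search(ad_text)
        if match:
            return int(match.group(1))
    return None

def split_matches(ads):
    """Separate ads into (parsed, failures) so the failures can be read by hand."""
    parsed, failures = [], []
    for ad in ads:
        rooms = extract_rooms(ad)
        if rooms is None:
            failures.append(ad)
        else:
            parsed.append((ad, rooms))
    return parsed, failures

# Made-up sample ads.
ads = [
    "Sunny flat, 3 rooms, near the park",
    "2BR apartment downtown",
    "Cozy studio with balcony",
]
parsed, failures = split_matches(ads)
```

Reading through `failures` is what suggests the next pattern to add, and the share of ads that land in `parsed` gives the match rate.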

The textual operation of this software allowed me to easily input some data
manually. For example I realized that I'm also interested in having these
apartments close to a subway station. No problem, just manually create the
file with the subway stations in the correct, simple, textual format and the
program will pick it up and use automatically.

The textual interface also helped with fancy queries, like "price between X
and Y, 6 rooms total, prefer 4-2 to 3-3 if distance less than D, but 3-3 if
distance greater than D, prefer Z subway line to Q, only one apartment might
be from an agency rather than an individual, try to put one in K part of the
city". Try to do that with an existing website.

[1] <https://code.google.com/p/operation-housefinder/>

------
bradt
A little back story on this article for those who are interested...

I noticed my coworker was going out of his way to use string manipulation,
writing many lines of code instead of a simple regular expression. When I
asked why, he explained that he didn't know regular expressions, but more
importantly that he felt that he had read a lot of posts on Stack Overflow
discouraging use of regular expressions. From what he had read, he felt that
it was better practice to avoid regular expressions. Although this could be
anecdotal, there may be a real danger here that inexperienced programmers are
getting the wrong message, that regular expressions are somehow bad in most
situations and not worth learning.

------
Titanous
More concise? Sometimes. Slower? Always.

    
    
        BenchmarkRegexp       500000      5136 ns/op
        BenchmarkStrings    10000000       173 ns/op
    

<http://play.golang.org/p/YT29Ao-tOt>

~~~
buro9
You could more than double the performance of the regexp if you did
MustCompile just the once rather than within every loop.

MustCompile is generally used to make the regexp a global so that it isn't
done over and over.

Just move it out of the loop, as it's really not necessary to compile regular
expressions every time you want to match/replace against it.
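
The same habit sketched in Python, as an analogue to a package-level `regexp.MustCompile` in Go. The pattern and the function name are made up for illustration. Note that Python's `re` module also caches recently compiled patterns internally, so the cost of getting this wrong may differ from Go:

```python
import re

# Compiled once at module level and reused by every call below.
# The pattern is a made-up example, not the one from the benchmark.
WORD_RE = re.compile(r"\b\w+\b")

def count_words(lines):
    """Count word-like tokens across all lines using the shared pattern."""
    total = 0
    for line in lines:
        # No compile step inside the loop; only the match work is repeated.
        total += len(WORD_RE.findall(line))
    return total
```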

~~~
Titanous
It does have the MustCompile outside of the loop. I pasted the wrong link
originally.

~~~
buro9
Ah, my apologies, I saw the earlier link.

------
nraynaud
As a general rule I ask people to avoid using non-trivial regular expressions.
The grammar is too tricky and often the expression doesn't mean what the
developer intends it to mean. Or the next developer will make a mistake.

My current favorite is parser combinators, which seem a good compromise
(not a magic wand) between maintenance (external parser generators don't
blend well into your code), parsing what you think you are parsing (more so
when your grammar was defined with rules in a reference document), and
integrating the parser with your code.

------
bane
Does anybody know of a good Perl or Python library that will take a regex (with
constraints on the repetition operators) and generate an exhaustive list of
matching strings (instead of generating a random list)?

I think this would be helpful in many cases in getting people to understand
how regexes work. I've seen lots of cases where toolsets designed to help
people build regexes end up with them confused when their regex also matches
other stuff beyond their test strings.
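
For tiny cases the idea can be approximated by brute force. This is a sketch of a stand-in technique, not a library: it tries every string over a small alphabet up to a length limit and keeps the ones the regex fully matches. The function name and parameters are hypothetical, and the cost grows exponentially with the length limit:

```python
import re
from itertools import product

def matching_strings(pattern, alphabet, max_len):
    """Yield every string over `alphabet`, up to `max_len` long, that fully matches."""
    compiled = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                yield candidate

result = list(matching_strings(r"a+b?", "ab", 3))
# result is ['a', 'aa', 'ab', 'aaa', 'aab']
```

Seeing the full list makes it obvious when a pattern accepts more than the handful of test strings it was written against.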

~~~
pjkundert
<https://github.com/ferno/greenery>

~~~
bane
cool, looks like the strings() method in lego.py might work

------
gbog
OT: Where does this seemingly odd and new habit of spacing inside
parentheses come from? I always write "(a, b)", mostly because it is closer to
English (or other languages) typography, and it seems to have good readability,
plus it is, I believe, the standard in most languages. So why write "( a, b )"?

By the way, if some like spacing that much, and if the reason is to have a
better mouse-selectability, then I humbly propose "( a , b )".

~~~
bradt
It's WordPress' PHP Coding Standards:
<http://make.wordpress.org/core/handbook/coding-standards/php/#space-usage>

~~~
gbog
Ok. Is there any rationale behind it?

------
buro9
I feel that this needs posting again: <http://www.debuggex.com/>

Basically a great online tool for testing your regular expressions and
stepping through what is actually happening. As soon as you get non-trivial,
it's a Godsend.

------
Su-Shee
THE single best resource to really learn how to deal competently with regex
is still Jeffrey Friedl's book "Mastering Regular Expressions".

You will profit from it for the rest of your career.

(There's also a Regex short reference and a Regex cookbook by O'Reilly...)

~~~
krat0sprakhar
Sincere question - is it worth investing time in reading a 500-odd-page book
for something that I might not use that frequently in my career? In my
experience, I can get by just Googling or experimenting whenever I'm stuck on
a regex.

~~~
Su-Shee
Absolutely.

The book doesn't just teach you regex, but the why, how AND the dialects. It
gives you an overview of different tools and programming languages and their
regex-related functions and methods.

On top, it contains a ton of examples, is very well written (considering the
insanely dry and difficult to typeset subject :) and is very polished (I think
it's in the 3rd edition by now..)

If you just google or experiment on regex, you usually get bad regex, badly
crafted regex, brittle regex and make every single mistake the book prevents
you from making.

It's really one of the most worthwhile books to read through - it's also an
excellent handbook to look things up.

Remember that a lot of commandline tools take in regex too - grep, sed, awk,
you name it - it's not just for use in programming languages.

Your favorite editor has regex too.

I simply don't know how people can live without; I'm using regex practically
every day.

P.S.: And _after_ reading the book, you will understand why people yell at you
when you parse HTML with regex but you will know how to do it anyways and at
least not completely badly. ;)

P.P.S: And here's the canonical "BUT OF COURSE you can parse HTML with
regex" post from Stack Overflow.. :)
<http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491>

------
notyourpal
I'm very guilty of this myself. I'm officially a loser if I haven't delved
into regex within two weeks.

------
ExpiredLink
Stop propagating bad interfaces like 'regular expressions' damn it!

An interface that e.g. makes me 'escape' half of my input because its
designers think their special use of characters _must_ take precedence over
all user input is a bad interface.

~~~
Su-Shee
Many programming languages have a function to do that for you...

In Perl, it's called quotemeta (qw, qq and family, too), in Python and Ruby
it's .escape... and there's always \Q ... \E to use...

I'm sure others have similar methods/functions.
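
A short Python sketch of the escaping step using `re.escape`; the sample input and text are made up for illustration:

```python
import re

# Input containing characters that are regex operators: "+", "(", ")", "?".
user_input = "1+1=2 (really?)"
text = "Proof that 1+1=2 (really?) holds."

# Escaped: every special character is matched literally.
escaped_pattern = re.compile(re.escape(user_input))
found_escaped = escaped_pattern.search(text) is not None

# Unescaped: "+" and "(...)?" are read as operators, so the literal text is missed.
found_raw = re.search(user_input, text) is not None
```

Here `found_escaped` is `True` and `found_raw` is `False`, which is the difference the escaping function exists to cover.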

------
3minus1
What's a good resource for learning reg exp?

~~~
bradt
Great question. I learned them through osmosis over many years of looking at
them in other people's code and tinkering myself. I don't remember ever going
through a tutorial or reading a book. Probably not the best way to learn them
as it definitely took a long time to have a good grip on them and I was
missing important pieces for a long time. For example, it was only relatively
recently that I learned that you can turn off "greedy" when using .* by adding
a ? after it.
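
A small Python illustration of the difference, using a made-up input string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", html)

# Lazy: ".*?" stops at the first ">" it can, so each tag is matched separately.
lazy = re.findall(r"<.*?>", html)

# greedy is ['<b>bold</b> and <i>italic</i>']
# lazy is ['<b>', '</b>', '<i>', '</i>']
```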

