
Why Using .* in Regular Expressions Is Almost Never What You Actually Want - mariusschulz
http://blog.mariusschulz.com/2014/06/03/why-using-in-regular-expressions-is-almost-never-what-you-actually-want
======
perlgeek
Basically the same advice, from the year 2000:
[http://www.perlmonks.org/?node_id=24640](http://www.perlmonks.org/?node_id=24640)

There is one legitimate use case of .* though: advancing to the last match of
something. If you want to find the last digit in a string, /.*(\d)/ will
readily find it for you.

~~~
eCa
A somewhat clearer way (imho) is /(\d)\D*$/ since it anchors to the end of the
string.

~~~
Rynant
This is not the same as finding the last match though. The parent's example
will match '2' in '1 of 2 steps.'

~~~
ronaldx
On the contrary, it _does_ give the same result.

$ anchors to the end of the string, \D clears the non-digits from the end to
allow \d to match the digit '2'.
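
A quick way to check this, sketched with Python's `re` module (the variable names are illustrative, not from the thread):

```python
import re

text = "1 of 2 steps."

# Greedy .* runs to the end of the string, then backtracks until \d can match.
greedy_dot = re.search(r".*(\d)", text)

# Anchored form: a digit followed only by non-digits up to the end.
anchored = re.search(r"(\d)\D*$", text)

print(greedy_dot.group(1))  # 2
print(anchored.group(1))    # 2
```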

~~~
Rynant
Thanks, I see where I was wrong now.

In this case when finding the last match from the end, would the lazy
quantifier reduce backtracking? e.g. /(\d)\D*?$/

~~~
icambron
No, that would work very similarly to the greedy version. The backtracking
happens because the \d gets matched to the '1' and the whole thing has to be
rolled back when the $ attempts to match and instead finds '2' (this would happen
again if there were more digits for \d to speculatively match on). So the
backtracking is not caused by the laziness or greediness of the \D* ; we
really do want to gobble up all of the non-digits.

On the two options generally:

    /(\d)\D*$/

is problematic if you have a lot of digits, while

    /.*(\d)/

is problematic if you have a lot of text after the last digit. Both could
potentially be optimized by the engine to run right-to-left (the former
because it's anchored to the end and the latter because it greedily matches to
the beginning), and then both would do well. I'm not sure if that happens in
practice.

Overall, I prefer the latter, both because I think it's clearer and because
its perf characteristics hold up under a wider variety of inputs.
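
As a rough sketch (assuming Python's `re` engine; the `last_digit` helper and the sample strings are made up for illustration), the greedy, lazy, and `.*` variants can be checked for agreement on inputs that stress each one. This only checks results, not timing:

```python
import re

# The three patterns discussed in this subthread.
PATTERNS = {
    "greedy dot": re.compile(r".*(\d)"),
    "anchored greedy": re.compile(r"(\d)\D*$"),
    "anchored lazy": re.compile(r"(\d)\D*?$"),
}

def last_digit(pattern, text):
    """Return the captured digit, or None if the text has no digit."""
    match = pattern.search(text)
    return match.group(1) if match else None

samples = [
    "1 of 2 steps.",
    "1234567890" * 20,             # many digits
    "7" + " trailing text" * 20,   # long tail after the last digit
    "no digits here",
]

for text in samples:
    results = {name: last_digit(p, text) for name, p in PATTERNS.items()}
    # All three patterns agree on every sample.
    assert len(set(results.values())) == 1
```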

Edit: how do you make literal asterisks on HN without having a space after
them?

------
icambron
About eight years ago I finally read Friedl's _Mastering Regular Expressions_
[1]. I know, right? A 500-page book about regular expressions, a tool I
already knew (or thought I did). But it's actually a great book: easy to read
and full of genuinely good information on the how and why of regex, and it
totally changed my understanding of them. If absolutely anything in this
article surprised you, I highly recommend you read the book.

[1] [http://regex.info/book.html](http://regex.info/book.html)

------
Pxtl
I've gotten into the habit of using the "not" operation instead of .* a lot.
If I'm looking for bracketed text, I use not-bracket to match the contents.

I tend to avoid the non-greedy operator just because it often fails in
terrible half-assed regex implementations (e.g. Visual Studio 2010).
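
A minimal sketch of the not-bracket approach, assuming Python's `re` module and a made-up input string:

```python
import re

text = "call f[a] then g[b]"

# Greedy .* overshoots to the last closing bracket in the string.
with_dot_star = re.findall(r"\[(.*)\]", text)

# "Not a closing bracket" stops at the first closing bracket.
with_negated_class = re.findall(r"\[([^\]]*)\]", text)

print(with_dot_star)       # ['a] then g[b']
print(with_negated_class)  # ['a', 'b']
```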

~~~
bane
I wish the not operator allowed for sub-expressions instead of just character
classes. It'll probably make it slower, but it would remove lots of unreadable
convolutions people have to go through.

~~~
ori_b
There are some regex implementations that allow it, but it's a very confusing
feature. Remember that '' is not 'a'.

Arbitrary expressions can have arbitrary length, so excluding an expression
simply will match it, fail the match, and backtrack to the next option.
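
One common way to express "not this sub-expression" in engines with lookahead is a negative lookahead checked before each character. The sketch below assumes that is the kind of feature meant here (the thread does not name one) and uses Python's `re` with made-up inputs:

```python
import re

text = "start<!-- note -->end"

# (?:(?!-->).)* consumes one character at a time, but only at positions
# where the sub-expression "-->" does NOT start.
pattern = re.compile(r"<!--((?:(?!-->).)*)-->")
comment = pattern.search(text)
print(repr(comment.group(1)))  # ' note '

# "'' is not 'a'": a bare negative lookahead happily matches the empty
# string at the end of "a", because no 'a' starts there.
empty = re.search(r"(?!a)", "a")
print(repr(empty.group()), empty.start())  # '' 1
```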

------
tjgq
Also read Russ Cox's writeup on implementing regular expressions [0].
Backtracking can be done efficiently; it's just that most regular expression
engines have suboptimal implementations for it.

[0]
[http://swtch.com/~rsc/regexp/regexp1.html](http://swtch.com/~rsc/regexp/regexp1.html)

~~~
BugBrother
I've cursed over Python's backtracking, at least a few years back. (Why can't
they just use PCRE? :-( Any advantage at all?)

~~~
ori_b
PCRE has the same problems with backtracking.

~~~
BugBrother
As bad? I might have had smarter coworkers at different times... :-)

------
moron4hire
I think the advent of automatic regex match highlighting in text editors is
changing the regex use-case for a lot of people. It certainly did for me. I no
longer see regexes as just "something you use in code to test input". I now use
them as general purpose text editing tools. In a way, it's like templated text
output, with input specified in the same buffer.

I know this has been done forever, but usually only by extreme greybeards in
Vi or Emacs world. The auto-highlighting now makes it possible for everyone to
do it.

So that said, with the ability to restrict regexes to just a selection of
text, it's more about regex golf--the fewest characters, the most productive--
than it is about semantic correctness. If it works for my input, that's all
that matters, because the regex is getting discarded thereafter.

~~~
Pxtl
Yeah, I do all my data-imports from flat files using regex - easy to export
from spreadsheet programs as flat files, then regex them into a bunch of
insert/update statements.

~~~
collyw
Yes, I do plenty of similar things.

Format a load of data using regexes first, then use it hard coded as string to
do quick one off script to update the database. It beats trying to parse Excel
directly, as you never know what data type a cell will return.

------
chernevik
To be picky, it's always what I want, but with a lot of other stuff I don't.

------
blt
Very interesting. I've had the greedy .* "overmatch", like probably almost
everyone who's used regular expressions. Had no idea they are a performance
drain even when giving the right answer though.

I like posts about details of software craftsmanship like this.

------
prohor
A bit of a problem with lazy quantifiers is that they are not so widely
supported outside the Perl world. Therefore I often need to find some tricks to
get similar behavior (e.g. "[^,]*," if a comma is the separator)
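
For illustration, a small Python sketch (the input line is invented) showing that the negated-class trick captures the same field as a lazy quantifier:

```python
import re

line = "alpha,beta,gamma"

# Lazy quantifier: stop at the first comma.
lazy = re.match(r"(.*?),", line)

# Portable alternative: "anything but a comma", then the comma.
negated = re.match(r"([^,]*),", line)

print(lazy.group(1), negated.group(1))  # alpha alpha
```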

~~~
hnriot
Except they are! Java, Python, JavaScript etc. all support lazy quantifiers. In
fact I can't think of a single language that doesn't.

~~~
mkehrt
Except bash, awk, sed and vim. So, everywhere I use regular expressions.

~~~
fmoralesc
vim has \{-} and variants.

------
guynamedloren
I've run into this problem so many times. Everywhere I think I want .* , I
actually want .*? (non-greedy matching). Make a mental note of this. It'll
save you lots of headaches.
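
A short Python sketch of the difference, using a made-up HTML fragment:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs everything up to the last '>'.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first '>' after each '<'.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```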

------
kstenerud
Actually, it only needs to be \[([^,]+),([^\]]+)\] because you're only
going up to a comma in the first capture group and a square bracket in the
second.

~~~
poolunion
That would also match all of "[a] more [b,c]" though.

~~~
kstenerud
And the other regex would match all of "[a more [b,c]". Your regex must be
designed around your expected input.
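
To make the trade-off concrete, here is a Python sketch. `loose` is the regex proposed above; `strict` is a hypothetical variant (not from the thread or the article) that also excludes brackets from both groups:

```python
import re

# The regex proposed above.
loose = re.compile(r"\[([^,]+),([^\]]+)\]")

# Hypothetical stricter variant: neither group may contain ',', '[' or ']'.
strict = re.compile(r"\[([^,\[\]]+),([^,\[\]]+)\]")

text = "[a] more [b,c]"

loose_groups = loose.search(text).groups()
strict_groups = strict.search(text).groups()

print(loose_groups)   # ('a] more [b', 'c')
print(strict_groups)  # ('b', 'c')
```

Which one is right depends on whether inputs like this can occur at all.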

------
NanoWar
The regex fiddle is really useful:
[http://regex101.com/r/qQ2dE4](http://regex101.com/r/qQ2dE4)

------
larubbio
Is the early example in the document correct?

Using an input string of abc123 he claims [a-z]+\d+ will match the entire
string (which I agree with). He then says that [a-z]+?\d+? will only match
abc1. Wouldn't it fail since the non-greedy match on [a-z] would just match
'a' causing the non-greedy match on \d to fail trying to match 'b'?

~~~
simcop2387
No, it'll still match, but because both are non-greedy it could match on just c1
instead of abc123.

~~~
prawks
It could match on c1, but I believe since most (all?) regex parsers parse
left-to-right, it will match the a, look for another a-z character or a digit,
find b, repeat, find c, then find 1 which completes the pattern.
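
This is easy to confirm in a backtracking engine; a minimal check with Python's `re`:

```python
import re

text = "abc123"

# Greedy: both quantifiers take as much as they can.
greedy = re.search(r"[a-z]+\d+", text).group()

# Lazy: the match still starts at 'a' (leftmost wins), the letter run
# is extended only until a digit can follow, and \d+? takes one digit.
lazy = re.search(r"[a-z]+?\d+?", text).group()

print(greedy)  # abc123
print(lazy)    # abc1
```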

------
bthornbury
I once used .* in a crawler. Came back the next day to find much rogue html
amongst the content of my site. I find something like [^{{delimiting
character}}]* to be better

------
gesman
.*? is the solution!

------
mschuster91
For flipping the default "greedy" behaviour, PHP has the "U" flag... dunno
about other implementations though.

------
spb
This is why I like Lua's '-' for non-greedy matching in its pattern
facilities.

