
Regexes: The Bad, the Better, and the Best - ScottWRobinson
https://www.loggly.com/blog/regexes-the-bad-better-best/
======
taco_emoji
> In General, the Longer Regex Is the Better Regex

I'd rather word this as "more specific is better". Like say for a U.S. phone
number (minus area code for simplicity),

    \d{3}-?\d{4}

is better than

    .*-?.*

because it's more specific.

"Longer is better" is only useful for helping _identify_ which regex is
better, not for helping _write_ better regexes.
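A quick Python sketch of the difference (the sample inputs are made up):

```python
import re

specific = re.compile(r"\d{3}-?\d{4}")
vague = re.compile(r".*-?.*")

# The specific pattern only accepts things shaped like a phone number...
assert specific.fullmatch("555-0123")
assert not specific.fullmatch("not a number")

# ...while the vague one happily accepts anything, including garbage.
assert vague.fullmatch("555-0123")
assert vague.fullmatch("not a number")
```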

~~~
oftenwrong
In many environments,

    [0-9]{3}-?[0-9]{4}

is even more specific (and faster) because \d would match other digit
characters outside of [0-9].
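Python 3 is one such environment: on strings, `\d` is Unicode-aware unless you pass `re.ASCII`. A small illustration (using the Arabic-Indic digit three):

```python
import re

# In Python 3, \d matches any Unicode decimal digit by default:
assert re.fullmatch(r"\d", "٣")

# [0-9] (or the re.ASCII flag) pins it down to ASCII digits only:
assert not re.fullmatch(r"[0-9]", "٣")
assert not re.fullmatch(r"\d", "٣", re.ASCII)
```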

~~~
dalore
taking it to the next level:

    [0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]

------
pimlottc
Where "Best" = "most performant". Of course, there are other ways to judge a
regex - readability, specificity, robustness...

In most places I've used regexes, efficiency is the least of my concerns,
though for a company whose product is focused on parsing massive amounts of
logs, focusing on performance does make sense.

------
vezzy-fnord
The RegexBuddy author has a good article on this same phenomenon, dubbed
"catastrophic backtracking":
[http://www.regular-expressions.info/catastrophic.html](http://www.regular-expressions.info/catastrophic.html)
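The failure mode from that article is easy to reproduce. A small Python sketch using the textbook pathological pattern (not the article's log regex); the runtime roughly doubles with each extra character:

```python
import re
import time

# Classic pathological pattern: nested quantifiers over the same character.
# Against a run of a's with no trailing "b", a backtracking engine must try
# every way of splitting the a's between the inner and outer "+" before it
# can report failure.
pattern = re.compile(r"(a+)+b")

for n in (12, 16, 20):
    start = time.perf_counter()
    assert pattern.match("a" * n + "c") is None  # can never match
    print(n, "->", round(time.perf_counter() - start, 4), "s")
```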

------
Twisol
How would the regex below compare in this case? That is, modifying the "bad"
regex to use the lazy-consumption trick on (almost) all of the `.*` patterns.

    /.*? (.*?)\[(.*?)\]:.*/
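For what it's worth, the lazy version doesn't capture the same thing. Sketched below against a made-up syslog-style line (and assuming the "bad" regex is the greedy counterpart, which is my guess at the article's pattern): the leading `.*?` stops at the *first* space instead of the last one, so group 1 comes out different.

```python
import re

greedy = re.compile(r".* (.*)\[(.*)\]:.*")
lazy = re.compile(r".*? (.*?)\[(.*?)\]:.*")

line = "Mar 16 08:12:04 myhost sshd[1234]: Accepted publickey for root"

# The greedy version backtracks to the last space before the bracket:
assert greedy.match(line).groups() == ("sshd", "1234")

# The lazy version stops its leading .*? at the FIRST space, so group 1
# swallows everything up to the bracket -- not the same result.
assert lazy.match(line).groups() == ("16 08:12:04 myhost sshd", "1234")
```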

------
carapace
Okay but think:

If you are searching a very large file for a very few occurrences of the
expected match then this optimization is not so bad.

If you are running line-by-line through a very large log file to extract
_just_ those two pieces of information per line, then throw away the first N
characters in each line (where N is the hopefully-constant length of your
timestamps plus that space char) and start the regex engine at the beginning
of the expected match. Then it doesn't have to waste _any_ time passing over
those chars.
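In Python, that trick is the `pos` argument to a compiled pattern's `match` method. A sketch, where the log format and the 16-character timestamp width are assumptions:

```python
import re

line = "Mar 16 08:12:04 myhost sshd[1234]: Accepted publickey for root"

# Hypothetical fixed-width syslog timestamp: "Mar 16 08:12:04 " is 16 chars.
TIMESTAMP_LEN = 16
pat = re.compile(r"(\S+) (\S+)\[(\d+)\]:")

# pattern.match(s, pos) starts the engine at index pos, so it never spends
# any time scanning over the timestamp characters.
m = pat.match(line, TIMESTAMP_LEN)
assert m.groups() == ("myhost", "sshd", "1234")
```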

Even if the exact details above aren't quite right, the principle is (and is
well-known): Avoid premature optimization! (And the corollary: Measure it.
Profile your code, don't guess - you're probably wrong.)

~~~
thisrod
There's a higher order solution. Read the AWK book, do the exercises, then
ignore blog posts about regexps. Make an exception when it's Russ Cox
demonstrating how often this wheel has been reinvented in square form.

~~~
astangl
Which AWK book? The one by A, W, and K?

~~~
thisrod
Yes.

------
edwintorok
"Awk and grep use the Thompson NFA algorithm which is in fact significantly
faster in almost every way but supports a more limited set of features."

AFAIK the only regex feature that requires backtracking is back-references.
As long as your regex doesn't use them, why doesn't PCRE switch to the more
efficient algorithm, and fall back to the backtracking algorithm only when
you actually need the feature that requires it?

~~~
SerpentJoe
Backtracking is required in a lot of cases. Consider matching the pattern
/^(AA|AB)*$/ against the string "AAAAAAAAB". Before it can come up with the
answer (it doesn't match) the engine has to backtrack all the way from right
to left.

~~~
edwintorok
that can be compiled to a state machine where a decision to switch states is
taken based on current character only, and when all input is consumed you are
either in an accepting or rejecting state. In your case I think it only needs
2 states: state 0 moves to state 1 when it sees an A, and state 1 moves to
state 0 when it sees either A or B. For your string it'll be in state 0 when
it sees a B and thus rejects it.
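That two-state machine, written out as a Python transition table (a hypothetical sketch; missing entries are the implicit "reject immediately"):

```python
# State 0 is the accepting state (we've consumed whole AA/AB pairs so far);
# state 1 means we're one character into a pair.
TRANSITIONS = {
    (0, "A"): 1,
    (1, "A"): 0,
    (1, "B"): 0,
}

def matches(s: str) -> bool:
    """Single left-to-right pass; no backtracking anywhere."""
    state = 0
    for ch in s:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state == 0

assert matches("AAABAAAB")       # (AA)(AB)(AA)(AB)
assert matches("")               # zero repetitions is fine
assert not matches("AAAAAAAAB")  # the string above: rejected in one pass
```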

~~~
thaumasiotes

                       ┌┐
                       ↓│
                   ┌────┴┐
                ┌─→│     │←──┐
                │  └─────┘   │
           ╔════╧═╗       ┌──┴──┐
        ├─→║      ╟───A──→│     │
           ╚══════╝       └─┬───┘
               ↑            │
               └─────A,B────┘
    

Three states if you want to process the whole input. If your model is "reject
when you fail to find an appropriate transition" rather than "reject if, after
processing the string, you're in a reject state", then you don't need the
failure trap and you can do it in two states.

Backtracking is definitely not required, nor helpful.

~~~
hobs
That diagram is quite pretty, did you just hand code it?

~~~
thaumasiotes
Yeah, all manual. :/

[http://unicode-table.com/en/blocks/box-drawing/](http://unicode-table.com/en/blocks/box-drawing/)

[http://unicode-table.com/en/sets/arrows-symbols/](http://unicode-table.com/en/sets/arrows-symbols/)

------
Mithaldu
Using this Perl module you can walk step-by-step through what a regexp engine
actually does given a regexp and a string to match against:

[https://metacpan.org/pod/Regexp::Debugger](https://metacpan.org/pod/Regexp::Debugger)

After installing the module it can be started simply by calling `rxrx` on the
command line.

------
vacri
Surely the simplest optimisation to add would be anchors? Instead of starting
the 'good' line "[12]", start it "^[12]".

As it stands, the 'good' regex will look through the whole line for any
occurrence of a 1 or 2, then start applying the rest of the regex from there.
Putting an anchor in means it only looks at the first character for that 1
or 2.
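A quick illustration with a made-up line:

```python
import re

unanchored = re.compile(r"[12]\d{3}")
anchored = re.compile(r"^[12]\d{3}")

line = "pid 9841 started at 2015-07-01"

# The unanchored pattern scans along the line and eventually finds "2015"
# in the middle; the anchored one gives up after looking at one character.
assert unanchored.search(line).group() == "2015"
assert anchored.search(line) is None
```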

------
Reefersleep
Inspiring post - I'll be sure to put extra thought into how I write my regexes
from here on out.

Does anyone know a good browser-based game that trains you to do regexes well?

------
gshx
Bug in the month matching: instead of [01]\d, which will also match 13-19,
maybe try something like (0[1-9]|[1-9]|1[0-2]).
[https://goo.gl/RMv4x2](https://goo.gl/RMv4x2)
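Checking that alternation in Python, with fullmatch so the whole field has to be a month:

```python
import re

buggy = re.compile(r"[01]\d")
better = re.compile(r"(0[1-9]|[1-9]|1[0-2])")

# [01]\d happily accepts non-months:
assert buggy.fullmatch("13")
assert buggy.fullmatch("00")

# The alternation rejects them, while still taking 1-12 with or without
# a leading zero:
for month in ["1", "9", "01", "09", "10", "12"]:
    assert better.fullmatch(month)
for bad in ["0", "00", "13", "19"]:
    assert better.fullmatch(bad) is None
```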

