
The Greatest Regex Trick Ever (2014) - user_235711
http://www.rexegg.com/regex-best-trick.html
======
wonnage
Regular expressions match regular languages (hence the name). If your language
involves pairs of things (e.g. HTML), it's not regular. Perl hacked support for
this in via backreferences and other extensions, but these are slow and
illegible. Use a proper context-free grammar parser if you need to parse a
context-free grammar, you know?

More broadly, people fear and misunderstand regexes because they have no idea
how they work. It becomes much easier if you understand how they map to
deterministic finite state machines. Recommended reading:
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

Once you understand how they work, you can basically read a regex left to
right and intuitively know all the strings it would match. There is no such
thing as an unmaintainable/illegible basic regex - they're just words with
some placeholders in them - it's when you cram in extended functionality
(which is basically a programming language where all the keywords are single
characters) that shit hits the fan.

~~~
joosters
I use regexes all the time for parsing data on a variety of 3rd-party
websites. Just because they aren't _perfect_ at matching every potential
theoretical situation doesn't mean they shouldn't be used. In practice,
regexes can be a simple and reliable way to grab data out of HTML. Don't
dismiss them out of hand!

Another point, generally overlooked by the theoretical purists, is that HTML
in the wild is rarely correct, and your perfect HTML parser will barf when
trying to process it. Regexes, on the other hand, don't have to care about
exact syntax and can cope with horribly mangled data.

~~~
evilotto
An ordinary regex obviously can't parse html, because html is not regular
(given nested elements and the pumping lemma). But what you can easily do with
a regex is to tokenize html - extract anything that looks like a start tag,
for example. The simple approach will obviously get some things wrong - a tag
inside a comment is meaningless, for example - but for a lot of uses, this
difference simply isn't important.

If the job is to extract all links from a webpage, regexes will do just fine,
and will probably be easier to write and understand than alternate approaches.
(This is absolutely not a fair comparison of course; you could compare writing
a regex engine to writing an html parser. But I digress.)
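
For instance, a minimal Python sketch of that kind of link extraction (the HTML snippet is made up, and the pattern deliberately ignores comments, unquoted attributes, and other corner cases):

```python
import re

html = '<p><a href="https://example.com/a">one</a> <a href="/b">two</a></p>'

# naive link extraction: find href attributes inside <a> start tags;
# good enough for well-behaved markup, wrong for tags inside comments etc.
links = re.findall(r'<a\s[^>]*href="([^"]*)"', html)
print(links)  # → ['https://example.com/a', '/b']
```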

If the job is to determine whether a given webpage is a member of the set that
includes all valid html documents, then a regex is not sufficient.

If the job is to extract a list of syntax tokens from a webpage, a regex is
likely fine.

If the job is to assign semantic meaning to every token in that list, a regex
just won't work.

Either way, the point is to know what you're doing. Much "parsing" of webpages
is not parsing in the formal language sense, and who cares that it isn't
because it doesn't need to be.

~~~
paulmd
No, tasks like "extracting all links from a webpage" are absolutely trivial
using an HTML parser. Run the following XPath query on an HTML object:

    
    
      //a[@class='specified_string']/@href 
    

Yes, you have to understand the XPath syntax to write such expressions, just
like you have to know the language of regexes. Or at least be able to google
them.
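For comparison, here is a sketch of the parser-based version using only Python's standard library; `xml.etree` supports just a subset of XPath (no trailing `/@href`), so the attribute is read off the selected elements instead. The markup is a made-up example and `specified_string` is the placeholder from the query above:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<div><a class="specified_string" href="/x">x</a><a href="/y">y</a></div>'
)
# select <a> elements by class, then read the href attribute
hrefs = [a.get('href') for a in doc.findall(".//a[@class='specified_string']")]
print(hrefs)  # → ['/x']
```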

The answer to "who cares" is "you", because you're the one who's going to
catch hell when your regex fails to capture some hyperlink that uses some
feature of XML beyond your test cases. The one-liner above is
guaranteed to Just Work on all valid XML documents, so why even create such a
monstrosity?

Everyone knows that Regexes Cannot Parse HTML, and yet _people still try it_
because they think they're smarter than Noam Chomsky. The real truth is that
everything looks like a nail to these people, because all they have is a
hammer.

~~~
desas
Because valid documents are rarer than documents where regex parsing is good
enough.

~~~
paulmd
Then you're not talking about parsing HTML/XML, are you? How could you
possibly know which links or syntax tokens are actually going to be displayed
on a page if you feed the browser's parser an invalid document?

There are fault-tolerant HTML parsers like TagSoup that are specifically
designed to handle dirty HTML and spit out a valid document object. If you
have sources that are malformed badly enough that it's still not working, you
can define custom SAX properties to handle them. But a task like that is
certainly a best-effort affair and the interpretation of such a library is no
more or less valid than the interpretation of the browser's parser. It's not a
valid document to start with and nothing can make it so.

If you are only parsing values out of a single specific data template, you
know it's not going to parse as HTML or XML, you know that it's never going to
contain weird values, and you know it's never going to change - then go hog
wild. But it's fundamentally a brittle approach that only holds as long as
those assumptions do. I've made the mistake of believing some of those about
my data and it's bitten me before. And I really question the implicit
assertion that "most html parsing" would fall into that exceedingly narrow
category. Especially after a couple years of feature creep.

Just keep your logic general and normalize your data. Offer a failover to a
fault-tolerant parser in your data layer with a logged warning. This is much
more durable and doesn't silently generate invalid tokens or silently fail to
capture valid tokens. Regexes simply cannot offer the capability to fail
loudly. So once you are no longer actively babysitting your custom regex
parser it could have started failing at any time - how would you even know?
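
A minimal sketch of that failover shape, with the standard library's strict XML parser standing in for a validating parser and `html.parser` standing in for a TagSoup-style lenient one (the function name and sample inputs are hypothetical):

```python
import logging
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Lenient fallback: collect href attributes from <a> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

def extract_links(doc):
    try:
        # strict path: well-formed documents go through a real parser
        tree = ET.fromstring(doc)
        return [a.get('href') for a in tree.iter('a')]
    except ET.ParseError:
        # fail loudly, then degrade gracefully
        logging.warning('document not well-formed; using lenient parser')
        collector = LinkCollector()
        collector.feed(doc)
        return collector.links
```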

------
JoeCoder_
Here is the __TL;DR__. This regex matches Tarzan but not "Tarzan":

    
    
        "Tarzan"|(Tarzan)
    

You can also include more than one case of what you don't want to match. This
one also finds only the cases of Tarzan that don't match the first three
patterns:

    
    
        Tarzania|--Tarzan--|"Tarzan"|(Tarzan)
    

You can even use more complex regexes. This matches all words not in an image
tag:

    
    
        <img[^>]+>|(\w+)
    

And likewise this matches anything not surrounded by <b> tags:

    
    
        <b>[^<]*</b>|([\w\s]+)
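
In Python, for example, the image-tag pattern above can be applied with `re.findall`, keeping only the matches where the capture group fired (the sample text is made up):

```python
import re

text = 'cat <img src="dog.png" alt="dog"> bird'

# group 1 is set only for words matched outside an <img> tag; the
# <img...> alternative matches first and "uses up" the whole tag
words = [w for w in re.findall(r'<img[^>]+>|(\w+)', text) if w]
print(words)  # → ['cat', 'bird']
```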

~~~
ajnin
That's not exactly it: the regexp matches both "Tarzan" and Tarzan, but
capture group 1 will only be set for strings that contain Tarzan without
quotes. So by examining that group after matching you know which case you are
in. (That also means you can only use this "trick" when you can examine
capture groups, i.e. not generally in editors.)
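
In Python, for example, that examination looks like this (the function name is just for illustration):

```python
import re

pattern = re.compile(r'"Tarzan"|(Tarzan)')

def unquoted_tarzan(s):
    # the pattern matches either form, but group 1 is only
    # set when the unquoted alternative matched
    m = pattern.search(s)
    return bool(m and m.group(1))

print(unquoted_tarzan('Tarzan of the jungle'))  # → True
print(unquoted_tarzan('"Tarzan" the movie'))    # → False
```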

------
singingfish
I once used the wonderful perl module
[Regexp::Assemble](https://metacpan.org/pod/Regexp::Assemble)
to produce a regexp to match [every single suburb/town name in
Australia](https://gist.githubusercontent.com/singingfish/d43c884fbac0089d8523/raw/63eeaf88a2e8d896c07da3b2440080233dd48395/regex%2520for%2520every%2520suburb%2520name%2520in%2520australia.txt)
from a csv file downloaded from the post office website. It was blazingly fast
... considering (better than the recdescent parser I'd been previously
experimenting with).

Here's the code that generated the regex:

    
    
        use Text::CSV;               # assumed: $csv below needs initializing
        use Regexp::Assemble;
        my $csv = Text::CSV->new;
        my $ra  = Regexp::Assemble->new;
        while (<$FH>) {
            next if $. == 1;         # skip the csv header row
            $csv->parse($_);
            my @fields = $csv->fields;
            $ra->add($fields[1]);
        }
    
        my $suburbs = $ra->as_string;

~~~
voltagex_
Can you still get that csv file? I thought AusPost had started charging for
everything.

~~~
x0
Woah, do you mean this? If so, wow, never thought my tiny little data
collection would come in handy.
[http://badcunt.club/cant/stop/information/babe/every-australian-postcode.csv](http://badcunt.club/cant/stop/information/babe/every-australian-postcode.csv)

~~~
voltagex_
Might be good for hacks, but it'll go out of date eventually and can't be used
for commercial purposes without incurring the wrath of AusPost. Thanks for
hosting it, though.

------
bza
This technique is unreliable in practice, and the author's discussion is
confused.

First, their explanation doesn't make sense. They're supposing that there's
some determinacy in the order in which a matcher can be expected to examine
the different possible matches. But that's provably not the case: if it were,
then deterministic and non-deterministic finite automata would be
inequivalent.

But the technique in question does seem to require some determinacy as to
which of several alternatives will match against a string. Where does that
determinacy come from? The semantics of the alternation operator (the '|') as
usually formulated don't specify any preference among alternations. For that
reason, POSIX _additionally_ requires that a matcher return the longest
possible match (and if there are several such, the leftmost is what must be
returned). Where you do find an explicit guarantee concerning which of several
different possible ways of matching will be preferred, it's almost certainly
because the engine is aiming at POSIX compliance.

Such compliance has a significant cost, though, as it requires the matcher to
consider _all_ possible matches (in order to find the longest). For that
reason, most regex engines forego strict POSIX compliance and only guarantee
that some match will be returned if one exists, not that that match will be
the leftmost longest. Some engines offer the option of requesting strict POSIX
behavior, but the default will always be to eagerly return the first match
encountered (and recall the point above that there provably can't be a
guarantee about the order in which matches are encountered, in general).

You should never do this in production code unless you're sure that your
matcher is POSIX-compliant.

------
edoloughlin
I almost wrote this off as it seemed to be about how to write unmaintainable
regex soup but the author pulled something quite elegant out of the hat at the
end.

------
leeoniya
this one is great, too. matches all printable ascii characters:

[ -~]

[http://www.catonmat.net/blog/my-favorite-regex/](http://www.catonmat.net/blog/my-favorite-regex/)
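
In Python, for example, the class can be inverted to strip everything that is not printable ASCII (the sample string is made up):

```python
import re

# [ -~] spans the range from space (0x20) to tilde (0x7E),
# i.e. exactly the printable ASCII characters
s = 'caf\u00e9 \t r\u00e9sum\u00e9!'
print(re.sub(r'[^ -~]', '', s))  # → 'caf  rsum!'
```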

~~~
ExpiredLink
English only.

~~~
joosters
Hence the ASCII in the description.

~~~
ExpiredLink
Since accented characters are used in the English language it's not even
'English only'.

~~~
leeoniya
"printable ascii" describes everything that it is and everything that it
isn't; no further clarification is needed.

[https://en.m.wikipedia.org/wiki/ASCII](https://en.m.wikipedia.org/wiki/ASCII)

~~~
ExpiredLink
I will spare us an answer because it wouldn't be printable English.

~~~
tokenizerrr
You are the only one who said "printable English". Everyone else has said
"printable ASCII". Do you perhaps not understand what ASCII is?

------
smegel
> Match Tarzan but not "Tarzan"

Unfortunately it doesn't work.

Let's say I wanted to match a string following Tarzan but not "Tarzan", I will
try his technique:

    
    
        ("Tarzan"|(Tarzan))\s+and JillOfTheJungle
    

Unfortunately this matches both:

    
    
        "Tarzan" and JillOfTheJungle
    

and

    
    
        Tarzan and JillOfTheJungle
    

Or maybe he meant:

> Capture Tarzan but not "Tarzan"

~~~
geocar

        a = /(?:"Tarzan"|(Tarzan))\s+and JillOfTheJungle/;
        m = a.exec(x);              // exec returns null when nothing matched
        matched = !!(m && m[1]);
    

Works fine.

People often forget to solve the problem they need to solve (match x), and
instead work on other things (find a regex to match x).

~~~
jameshart
"People often forget to solve the problem they need to solve (match x), and
instead work on other things (find a regex to match x)."

Completely agree. One common mistake is building a complex regex to match the
elements you want to _find_ from a string, when an easier approach is to split
the string on a simple regex that matches the things you want to throw away.
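
A small Python illustration of that approach (the input line and separators are hypothetical):

```python
import re

line = 'alpha,  beta;gamma ,delta'

# instead of a regex that matches each field, split on a regex
# that matches the separators we want to throw away
fields = re.split(r'\s*[,;]\s*', line)
print(fields)  # → ['alpha', 'beta', 'gamma', 'delta']
```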

------
rntz
This "trick" is simply exploiting a bug in regex implementations.

The regex

    
    
        "Tarzan"|(Tarzan)
    

should match the string

    
    
        "Tarzan"
    

in _two_ ways: first, matching the entire string; and second, matching the
substring "Tarzan" in the whole string "\"Tarzan\"". But most regex
implementations drop extra overlapping matches. I argue this is incorrect
behavior, because it complicates understanding what a regex _means_: you
have to understand the /order/ in which your regular expression matcher
interprets your regular expression, which is an implementation detail. I
conjecture that a DFA-based regex engine would not be able to exhibit this
order-biased behavior, at least not with the standard approach.

However, it's interesting that this "bug" turns out to be a "feature" for the
case of excluding other behavior. I'm not sure what conclusion to draw from
this.

~~~
GhotiFish
Given a string "aaabbb"

what should the results of the regular expression "(aa|aaa)(abbb|bbb)" be?

$1 = ?

$2 = ?

~~~
rntz
I believe it matches in three ways:

- the whole string, grouped as "(aa)(abbb)";

- the whole string, grouped as "(aaa)(bbb)";

- the substring "aabbb", grouped as "(aa)(bbb)".

~~~
GhotiFish
Sorry, I made an assumption there that you were talking about the practical
applications of regex, not the theoretical applications, and I was asking you
to explain how you would practically return multiple matches in... any
environment.

This was a small regex designed to create multiple answers to see how you
resolved the issue, obviously we can engineer regexes that return far more
results. So something's got to give. I don't agree with you that regexes
innately imply all matches are valid.

~~~
rntz
The original article _already_ relies on finding multiple matches, in order to
ignore the matches that don't contain the group that we're interested in.

Python's regex library, for example, can return multiple matches. It has three
functions:

- `re.match`, which checks for a match only at the beginning of the string.

- `re.search`, which checks for the first location in the string that
matches.

- `re.findall`, which finds "all" non-overlapping matches.

I was simply suggesting that the "non-overlapping" constraint in findall is a
"bug", in some sense, because it exposes implementation details of the regex
engine.

But, again, given that it is apparently a _useful_ bug, maybe I am wrong. But
that leaves open the question what the right spec for regex matching is,
anyway.
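
A quick Python illustration of that non-overlapping behavior:

```python
import re

# findall scans left to right and resumes after each match,
# so overlapping candidates are silently dropped
print(re.findall(r'aa', 'aaaa'))  # → ['aa', 'aa'], not the three
                                  #   overlapping occurrences
```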

------
aaronbrethorst
This article desperately needs a tl;dr, but major props for the two regex
tricks at the very top of the page.

------
userbinator
That reminds me of something else with regex which I thought was extremely
clever: implementing an A* search:
[http://realgl.blogspot.com/2013/08/battlecode.html](http://realgl.blogspot.com/2013/08/battlecode.html)

~~~
kragen
that is a very nice hack. this kind of 'do as much work per standard library
call as you can' approach is pretty much the way to go in interpreted
languages like python or octave, which is in part why numpy is so popular —
not only does it allow you to program at a higher level of abstraction, it
also makes your program run more efficiently.

however, i think that technically, that's not an A* search he implemented,
just a breadth-first search. i'm not an expert (i've never even implemented A*
search), so i could be mistaken. i'm interested to hear whether other people
agree.

------
incepted
Meh.

I think an even greater Regexp trick is the regular expression that determines
primality:

[http://stackoverflow.com/questions/3296050/how-does-this-regex-find-primes](http://stackoverflow.com/questions/3296050/how-does-this-regex-find-primes)
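
In Python form, the regex from that link tests a number's unary representation; it matches exactly when the number is not prime (a cute trick, though the backtracking makes it very slow for large inputs):

```python
import re

def is_prime(n):
    # '1?' matches 0 and 1; '(11+?)\1+' matches any composite,
    # since the group finds a divisor of the unary length
    return re.match(r'^1?$|^(11+?)\1+$', '1' * n) is None

print([n for n in range(20) if is_prime(n)])  # → [2, 3, 5, 7, 11, 13, 17, 19]
```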

~~~
metafunctor
The article actually talks about the prime validator regex, and goes on to say
that while it's an awesome trick, it is not the "best ever" because it has
limited scope.

I for one think the trick described in the article is pretty useful, and might
even use it someday (and possibly have already without realizing it).

------
smsm42
Very nice trick. While using foo|(bar) is very simple, somehow I don't see
this approach being used very often, and it looks like it could simplify a
number of things.

------
rbobby
Maybe I'm too old. I tend to think of a regex as either matching or not
matching.

Finding a bit of code that uses a capture to determine whether a match was
found seems like it would easily be confusing/inobvious.

Some pretty clear commenting and it would be ok... maybe.

Also... I wonder how well it would work as part of a larger regex, one that
already uses captures (or non-capturing groups)? The examples are all nice,
short and sweet... but how often do regex based solutions stay short and
sweet? A few maintenance cycles/years and suddenly you've got this funky
regex/capture thing that only Bob understands and he's way too busy to talk to
you for 5 minutes... and once you change things then Bob suddenly finds time
to review your code to complain about how you broke it for such a simple change.
There goes your bonus you told the wife you were sure to get so you could take
her and the kids on vacation. The day after your divorce is finalized, Bob sends
you a fix request to use that improved scheme of yours because the old regex
one isn't flexible enough anymore.

------
forrestthewoods
I thought the answer was going to be "tricking the world into thinking regex
was a good idea". I've always considered regex to have the rare and elusive
"write only" flag. Write only. As opposed to read only. Because once you write
a regex that's it. You will never know what it does ever again.

~~~
TeMPOraL
Just use tests and comments. For instance, in many language it's trivial to
declare regexp in multiple lines like this:

    
    
        regex = "(^[0-9])" //catch the initial digit (group 1)
              + "/"        //skip the following slash
              + "([A-Z]+)" //capture the identifier (group 2)
              ...
    

Regexps are too powerful a tool to ignore. I'd much rather write a simple
regex than a whole screenfull of code, especially when the former does the
work faster (because regex engines are pretty efficient).

~~~
to3m
Another way of doing it is to use variables and helper functions, particularly
if you're going to be creating lots of regexps from similar components. This
is what I usually do in Python. For example, while making no promises about
the value of this code specifically:

    
    
        def any(x): return x+"*"
        def many(x): return x+"+"
        def capture(x): return "("+x+")"
        def lit(x): return re.escape(x) # shorthand
        bol="^"; eol="$"
        ident="[A-Za-z_][A-Za-z0-9_]*";
        
        find_function=(bol+any(" ")+lit("def")+many(" ")+
                       capture(ident)+any(" ")+lit(":")+
                       any(" ")+eol)
    

You can handle more or less stuff this way according to how much you and/or
your readers like regexp syntax. The above is probably further than I'd take
it in practice; I'm familiar with the regular expression syntax, but I'd still
probably at least use something like the `ident' variable just to keep clutter
out of the regexp.

(A nice demonstration of this sort of thing is emacs's rx module (see, e.g.,
[http://emacswiki.org/emacs/rx](http://emacswiki.org/emacs/rx)). I couldn't
find any good non-emacs documentation about this, nor much that would make
sense to people unfamiliar with lisp, but when you're in emacs you can get
help on it using C-h f rx RET.)

~~~
TeMPOraL
I agree. In terms of variables, I think a good strategy is assigning names to
meanings, e.g.

    
    
        PRODUCT_ID = "[0-9]{2}[A-Z]{1,5}"
        ...
        CUSTOMER_ID = "[0-9]{2}[A-Z]{1,5}"
        ...
        ...
        regexp = "(" + PRODUCT_ID + ")" // capture product ID in group 1
               + "something something"  // something something
               + "(" + CUSTOMER_ID + ")"; // capture customer ID in group 2
    

In this example, even though the two constants contain the same regular
expression, they refer to two different concepts. Part of the problem with
understanding regexp-based code is connecting parts of the expression with
what they _mean_. The above strategy addresses this.

I also like assigning names to capture groups I depend on. So instead of,
later in code, asking for e.g. matcher.group(2), I ask for
matcher.group(GROUP_CUSTOMER_ID). Makes for much more readable code.
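
Python, for one, supports this directly with named groups, so the name lives in the pattern itself (the ID formats and the "-" separator here are invented for illustration):

```python
import re

# hypothetical ID formats: same expression, two different meanings
PRODUCT_ID = r'[0-9]{2}[A-Z]{1,5}'
CUSTOMER_ID = r'[0-9]{2}[A-Z]{1,5}'

pattern = re.compile(
    r'(?P<product_id>' + PRODUCT_ID + r')-(?P<customer_id>' + CUSTOMER_ID + r')'
)

m = pattern.search('order 12ABC-34XYZ shipped')
print(m.group('product_id'), m.group('customer_id'))  # → 12ABC 34XYZ
```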

Speaking of Emacs's rx, it's absolutely amazing and I'm sad that I only
discovered it just a few days ago :(. Another similar concept, from the other
side of "code = data" equality is Common Lisp's ( _of course_ it's Lisp again)
CL-PPCRE and its internal representation of regular expressions:

    
    
        * (parse-string "(ab)*")
        (:GREEDY-REPETITION 0 NIL (:REGISTER "ab"))
        
        * (parse-string "(a(b))")
        (:REGISTER (:SEQUENCE #\a (:REGISTER #\b)))
        
        * (parse-string "(?:abc){3,5}")
        (:GREEDY-REPETITION 3 5 (:GROUP "abc"))
    

You can encode any regexp you want as an S-expression, trading off conciseness
for legibility. See [http://weitz.de/cl-ppcre/#create-scanner2](http://weitz.de/cl-ppcre/#create-scanner2)
for more.

------
brey
Very elegant, I like it.

Except I'm now having a major case of semantic satiation for the word
"Tarzan"...

------
chris_wot
I love regexes but they make my head hurt. I really rather badly need to spend
some time properly learning it inside and out.

~~~
GlennS
I first learned them properly by reading through
[http://www.regular-expressions.info/tutorial.html](http://www.regular-expressions.info/tutorial.html).
I'd also recommend the website as a good reference for when you forget things.

------
bro-stick
Unimpressive. The author of this article obviously didn't have a compiler
class where one learns how regexes are basically glorified NFAs that are
deterministically convertible into a much more efficient DFA state machines
(read: PCRE JIT), instead of assuming regexes are processed by O(N^2)
algorithms.

~~~
kragen
'convertible into a much more efficient DFA state machines (read: PCRE JIT)'

pcre supports reduction to dfa and also jit, but not only are they not the
same thing, they are mutually exclusive. also, every regexp engine i've seen
that supports capturing and backreferences uses not worst-case quadratic-time
but actually worst-case exponential-time algorithms, although i'm pretty sure
this isn't actually unavoidable.

may i suggest that the next time you think about posting a comment that begins
with 'Unimpressive. The author of this article obviously didn't', that you
include less than one major technical error per sentence in it.

------
wodenokoto
Generally I find it strange how difficult it is to do a search for everything
except something.

The trick looks elegant, but as the author mentions, it doesn't work in a
text editor, so I wouldn't consider it the greatest.

~~~
ExpiredLink
Many searches that you can easily explain to your grandmother are difficult
with regex.

------
amluto
How does this handle '"Tarzan" Tarzan'?

------
jfb
Convincing the world that they don't exist?

------
logn
More like the author tricked you by changing the problem halfway through the
very long article. Try this instead:

    
    
      (?:(?<!")|(?!Tarzan"))Tarzan
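
A quick Python check of this alternative (sample strings invented): the lookarounds let Tarzan match unless it is both preceded and followed by a double quote.

```python
import re

p = re.compile(r'(?:(?<!")|(?!Tarzan"))Tarzan')

# only the unquoted occurrence matches
print(p.findall('"Tarzan" and Tarzan'))  # → ['Tarzan']
```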

------
jwhite
... and now you have two problems.

------
pkaye
Reminds me of this quote from Jamie Zawinski: "Some people, when confronted
with a problem, think 'I know, I'll use regular expressions.' Now they have
two problems."

~~~
jameshart
Why does this remind you of that quote? Because it's about regular
expressions? Because it seems to me that if you have the problem described in
this post, and you solve it the way the post describes, using regular
expressions, you have _solved your problem_, regardless of what jwz says.

~~~
evilotto
When I'm faced with a problem, I think about using recursion. Then I have two
problems. Then I think about using recursion. Then I have three problems. Then
I ...

------
edward
Parsing HTML with a regex? You should read this answer on Stack Overflow:
[http://stackoverflow.com/a/1732454/84250](http://stackoverflow.com/a/1732454/84250)

