
Regular Expressions – Mastering Lookahead and Lookbehind - filipmandaric
http://www.rexegg.com/regex-lookarounds.html
======
filipmandaric
Most of the time when I mention the topic of regular expressions to other
developers, I hear self-critical commentary like "oh, I'm terrible at
regex", and I rarely meet anyone who loves them. I think they're great though,
if you take the time to understand them. They're something like a Swiss Army
knife for programming.

~~~
Retra
A programmer saying they are terrible at regex is like a mathematician saying
they are terrible at algebra.

~~~
rxhernandez
You are so right! How could you possibly go out of practice in real-time
biomedical algorithms engineering when you might use them one or two times out
of the year?! I must have missed the memo where all those strings are encoded
in a patient's hemodynamic signal.

There are few people more frustrating to algorithms engineers and embedded
engineers than people like you who think you are the only type of programmers
out there or that the programming you do is somehow superior (despite largely
relying on math you learned at a decent secondary school).

~~~
laythea
I agree. If I ever need to use regex, I will dive in, but I don't mainly write
software to process text. I mainly write control software for various things
like subsea oil wells, oil rigs, gas turbines etc. It is enough for me to know
it exists. I try to spread myself thin and broad, and then that gives me the
control of which areas to dive deep, depending on the task at hand.

------
nolliebs180
> It is that at the end of a lookahead or a lookbehind, the regex engine
> hasn't moved on the string. You can chain three more lookaheads after the
> first, and the regex engine still won't move.

Omg, thank you, that is the insight I needed and now I completely get it.
Internet +1 for the day.
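
A minimal sketch of that point in JavaScript (the pattern and sample string are made up for illustration): three lookaheads are chained, all of them are checked at index 0, and the consuming part of the pattern still starts from the beginning of the string.

```javascript
// Three lookaheads in a row; each one is evaluated at the same position (index 0).
const chained = /^(?=.*\d)(?=.*[a-z])(?=.{6,})\w+/;

const chainedMatch = chained.exec("abc123xyz");
// The lookaheads consumed nothing, so \w+ still sees the whole string.
console.log(chainedMatch.index, chainedMatch[0]);

// A lookahead on its own produces an empty match: the engine has not moved.
const emptyMatch = /(?=abc)/.exec("abc");
console.log(emptyMatch.index, JSON.stringify(emptyMatch[0]));
```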

------
sshine
A use-case for lookarounds that I often use is:

    
    
        grep -Po '(?<=...)pattern'
    

Which also cuts out and prints the relevant part of the line. This saves a
trip through cut, awk or perl. (-P is PCRE and -o is print only matched
characters, which the lookarounds aren't a part of.)
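
The same idea outside of grep, as a rough JavaScript sketch. The log line and the `user=` prefix are invented for the example; the point is only that the lookbehind is asserted but does not end up in the match.

```javascript
// Hypothetical input line; "user=" is just an illustrative prefix.
const logLine = "id=42 user=alice role=admin";

// (?<=user=) must precede the match but is not part of it.
const userMatch = logLine.match(/(?<=user=)\w+/);

console.log(userMatch[0]); // alice
```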

~~~
asicsp
I often use `\K` instead, which also helps when you would otherwise need a
variable-length lookbehind

    
    
        $ echo 'foo=5, Bar=3; x1=83, y=120' | grep -oP '\b[a-z]+=\K\d+'
        5
        120
    

further reading: [https://stackoverflow.com/questions/11640447/variable-length...](https://stackoverflow.com/questions/11640447/variable-length-lookbehind-assertion-alternatives-for-regular-expressions/11640500#11640500)
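
For engines without `\K` (JavaScript is one), two possible stand-ins are sketched below on the same input: a capture group, or a variable-length lookbehind, which V8/Node accepts but many other engines do not.

```javascript
const kvInput = "foo=5, Bar=3; x1=83, y=120";

// Option 1: match the prefix too, but keep only capture group 1.
const viaGroup = [...kvInput.matchAll(/\b[a-z]+=(\d+)/g)].map((hit) => hit[1]);

// Option 2: variable-length lookbehind (works in V8; not portable to every engine).
const viaLookbehind = kvInput.match(/(?<=\b[a-z]+=)\d+/g);

console.log(viaGroup);      // [ '5', '120' ]
console.log(viaLookbehind); // [ '5', '120' ]
```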

------
d--b
> our pattern becomes:

> \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*

And then these guys wonder why people hate regexes? The "now you have 2
problems" quote fits that case perfectly.
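
For what it's worth, here is how I read that pattern, as a JavaScript sketch: 6 to 10 word characters, at least one lowercase letter, at least three uppercase letters, and at least one digit. JavaScript has no `\A` or `\z`, so `^` and `$` (without the `m` flag) stand in for them; the sample passwords are made up.

```javascript
// Each lookahead checks one rule from the start of the string; .* then consumes it.
const passwordPattern =
  /^(?=\w{6,10}$)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*/;

const candidates = ["ABCdef1", "abcdef1", "ABCd1", "ABCdefgh", "ABCdef1!"];
const verdicts = candidates.map((pw) => passwordPattern.test(pw));

console.log(verdicts); // [ true, false, false, false, false ]
```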

~~~
pygy_
The regexp syntax was devised for write-only programming at the terminal (at a
time when a terminal was a physical object, not a window in your GUI).

The regular formalism is pretty neat though. There are alternative syntaxes
(e.g. multiline regexps in Python) that are better suited for complex
matchers.

------
paulryanrogers
While useful to some I think advanced RE are like mixing in Perl or playing
code golf with production code. They tend to make code harder to read.

My preference in such cases is for multiple separated or longer REs (which can
be at least split in the surrounding code) and each part named or heavily
commented. Of course it's always worthwhile to consider non-RE solutions if
the problem can be broken down enough.

EDIT: Fixed typo
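
A sketch of what "separated and named" could look like in JavaScript, using a hypothetical password-style rule set. The rule names and the `failedRules` helper are invented for illustration.

```javascript
// One small regex per rule; the key names document what each one checks.
const passwordRules = {
  lengthAndCharset: /^\w{6,10}$/,
  hasLowercase: /[a-z]/,
  hasThreeUppercase: /(?:[A-Z][^A-Z]*){3}/,
  hasDigit: /\d/,
};

// Returns the names of the rules that the candidate does not satisfy.
function failedRules(candidate) {
  return Object.entries(passwordRules)
    .filter(([, rule]) => !rule.test(candidate))
    .map(([name]) => name);
}

console.log(failedRules("ABCdef1")); // []
console.log(failedRules("abc1"));    // [ 'lengthAndCharset', 'hasThreeUppercase' ]
```

A side effect of splitting the rules is that the caller learns which rule failed, not just that something did.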

~~~
filipmandaric
Fair enough, but I really think the benefits of advanced regular expressions
are underappreciated in non-production and even non-application contexts.
Laypeople (and occasionally even developers) are impressed when you show them
how to search through a document or file system using a really complicated
pattern, where it would have taken several iterations of data manipulation to
achieve the same result without using advanced regular expressions.

~~~
Tobba_
It'd help a _lot_ if the grammar was actually readable. Combinations like .*
don't visually "read" like a single unit, and then to make everything worse
you often need a crazy amount of backslashes.

I'm not sure how you could fix that without introducing completely new
characters or color-coding parts of the expression though.

~~~
freedomben
The backslashes for escaping are absolutely _awful_. This is one of the worst
things about Java.

It's much better in languages with regex literals like Ruby and JavaScript.
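
To illustrate the difference in JavaScript (the pattern itself is an arbitrary example): when the regex is built from a string, every backslash has to be doubled, while the literal form is written as-is.

```javascript
// From a string: each regex backslash must itself be escaped.
const fromString = new RegExp("\\b\\d+\\.\\d+\\b");

// As a literal: the pattern appears exactly as the regex engine sees it.
const fromLiteral = /\b\d+\.\d+\b/;

console.log(fromString.source === fromLiteral.source); // true
console.log(fromLiteral.test("pi is 3.14"));           // true
```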

~~~
kbp
It's especially nicer in Ruby (which got it from Perl) where you can use
whatever delimiters you like for regexes, with /abc/, %r"abc", %r{abc},
%r#abc# and so on all being equivalent, so you can just about always pick
something that won't clash with the characters in your pattern (You can even
use spaces as the delimiters, which looks terrible).

------
asicsp
I came to know about this wonderful site when I saw this article -
[https://www.rexegg.com/regex-best-trick.html](https://www.rexegg.com/regex-best-trick.html)

example:

    
    
        $ # all words except those starting with 'c' or 'C'
        $ echo 'Car Bat cod12 Map foo_bar' | grep -ioP '\bc\w+(*SKIP)(*F)|\w+'
        Bat
        Map
        foo_bar
    

for more details: [https://www.rexegg.com/backtracking-control-verbs.html#skipf...](https://www.rexegg.com/backtracking-control-verbs.html#skipfail)
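
In engines without `(*SKIP)(*F)`, such as JavaScript, a rough equivalent is to match the unwanted words in one alternative and capture the wanted ones in the other, then keep only the captures. A sketch on the same input:

```javascript
const wordList = "Car Bat cod12 Map foo_bar";

// Left alternative swallows words starting with c/C; right alternative captures the rest.
const keptWords = [...wordList.matchAll(/\bc\w+|(\w+)/gi)]
  .map((hit) => hit[1])
  .filter((word) => word !== undefined);

console.log(keptWords); // [ 'Bat', 'Map', 'foo_bar' ]
```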

~~~
lifthrasiir
Which is a fancy way to say `grep -iv '^c'`. EDIT: Oh, I missed that the input
was a single line.

I personally feel that control verbs are bad additions to the regexp, even
though I do know that they are not a big addition to the regexp _engine_ itself
(e.g. naturally extended from possessive quantifiers like `a++` or atomic
groups `(?>foo)`). Most uses of such verbs can be expressed with combined
parsers and simpler regexps, in a much simpler and more maintainable way.

~~~
asicsp
Sorry, it is not the same as `grep -iv '^c'`.

The `-o` option outputs only the matching portion; the regex is meant to
extract all words other than those starting with 'c' or 'C'.

Here's a hopefully better example:

    
    
        $ # do something with words not surrounded by quotes
        $ echo 'I like "mango" and "guava"' | perl -pe 's/"[^"]+"(*SKIP)(*F)|\w+/\U$&/g'
        I LIKE "mango" AND "guava"

~~~
lifthrasiir
Oh, you are right. I missed that all words are on the same line. That said,
even the original article mentions that it only moves the captured group to
the entire match; I generally prefer to avoid control verbs altogether,
especially when that costs only one or two lines of additional code that I
can fully control and comprehend.
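
As a sketch of that "one or two extra lines" approach in JavaScript: the quoted strings are matched and returned unchanged, and only the captured bare words are transformed in a replacement callback.

```javascript
const sentence = 'I like "mango" and "guava"';

// Quoted parts match the left alternative (no capture) and are passed through;
// bare words land in group 1 and are uppercased.
const shouted = sentence.replace(/"[^"]+"|(\w+)/g, (whole, word) =>
  word === undefined ? whole : word.toUpperCase()
);

console.log(shouted); // I LIKE "mango" AND "guava"
```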

------
qaq
for those who find regex not very readable:
[https://github.com/VerbalExpressions](https://github.com/VerbalExpressions)

    
        // Create an example of how to test for correctly formed URLs
        var tester = VerEx()
            .startOfLine()
            .then('http')
            .maybe('s')
            .then('://')
            .maybe('www.')
            .anythingBut(' ')
            .endOfLine();

~~~
pygy_
[https://github.com/pygy/compose-regexp.js](https://github.com/pygy/compose-regexp.js)
is another option (800 bytes mingzipped):

    
    
        const {sequence, suffix} = composeRegexp;
        const maybe = suffix("?");
        const oneOrMore = suffix("+");
    
        const urlMatcher = sequence(
          /^/,
          "http",
          maybe("s"),
          "://",
          maybe("www."),
          oneOrMore(/[^ ]/),
          /$/
        );
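
To show roughly what such a composition does under the hood, here is a library-free sketch. The helpers below (`escapeLiteral`, `toSource`, `seq`, `optional`, `repeated`) are hypothetical stand-ins written for this example and are not the compose-regexp.js API.

```javascript
// Escape regex metacharacters so plain strings are matched literally.
const escapeLiteral = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Accept either a RegExp (use its source) or a string (escape it).
const toSource = (part) => (part instanceof RegExp ? part.source : escapeLiteral(part));

const seq = (...parts) => new RegExp(parts.map(toSource).join(""));
const optional = (part) => new RegExp(`(?:${toSource(part)})?`);
const repeated = (part) => new RegExp(`(?:${toSource(part)})+`);

const sketchUrlMatcher = seq(
  /^/,
  "http",
  optional("s"),
  "://",
  optional("www."),
  repeated(/[^ ]/),
  /$/
);

console.log(sketchUrlMatcher.test("https://www.example.com")); // true
console.log(sketchUrlMatcher.test("ftp://example.com"));       // false
```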

------
projektir
Plugging this website as I've found it very useful to learn simple regex with
/ get over my "oh God I don't know Regex":
[https://regexone.com/](https://regexone.com/)

------
petrikapu
Is there a regular expression to match a regular expression?

~~~
ridiculous_fish
No. One way to convince yourself of this is that regexp capture groups must
properly nest: /())(()/ is invalid for example. Regexps famously cannot match
balanced parentheses.
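
A small JavaScript sketch of both halves of that argument. Checking balance needs a counter that can grow without bound, which a classic regular expression cannot keep; and if the practical goal is just "is this a valid regex?", one option is to let the engine's own parser decide. The function names are made up for the example.

```javascript
// Balance check with a depth counter (unbounded state, so not a regular language).
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === "(") depth += 1;
    else if (ch === ")") depth -= 1;
    if (depth < 0) return false; // a ")" appeared before its "("
  }
  return depth === 0;
}

// Validity check by attempting to compile the pattern.
function isValidRegex(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isBalanced("())(()"), isValidRegex("())(()")); // false false
console.log(isBalanced("(a(b))"), isValidRegex("(a(b))")); // true true
```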

------
passthejoe
I used this article about a week ago to help me with a web scrape. Good stuff!

------
rdiddly
That is truly bizarre... I went looking for info on lookaheads just today, and
found myself on that very site, and now it's on HN. It's just that ol' HN
magic I guess.

------
arthur5005
You’ve got a problem you think regex can solve, now you’ve got 2 problems.

~~~
wruza
I wrote a list of json keys that should be taken from a message and
:‘<,’>s/\\(\S\\+\\)\s{0,}/t.\1 = message.\1;\r/g

Hey, did you commit already? Still typing?

~~~
jessaustin
It looks like your quote characters might be messed up there? Anyway, to parse
json on the command line one should just use jq.

~~~
wruza
Single quotes were modified by the HN engine, right. It is not a command line
(thinking of sed?); it is in the middle of source code in my editor. I have a
json-parsed message and t is a target object.

    
    
      a b foo c
    
      t.a = !!message.a;
      t.b = String(message.b);
      t.foo = message.foo;
      t.c = tonum(message.c);
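
For readers who do not use an editor with substitute commands, the same one-shot expansion can be sketched in JavaScript. It only generates the plain `t.x = message.x;` lines; the per-key conversions shown above would still be added by hand afterwards.

```javascript
const keyList = "a b foo c";

// Each run of non-space characters becomes one assignment line.
const generated = keyList.replace(/(\S+)\s*/g, "t.$1 = message.$1;\n");

console.log(generated);
```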

