
The New ‘Absent Operator’ in Ruby’s Regular Expressions - petercooper
https://medium.com/rubyinside/the-new-absent-operator-in-ruby-s-regular-expressions-7c3ef6cd0b99
======
viggity
I'm a huge regex nerd (see: regexpixie.com). But this is a classic example of
people trying to be clever instead of clear. And I would always much prefer to
read clear code than clever code.

Regexes are declarative and they're really good at a narrow set of features -
namely, finding text that matches a pattern. When I see someone using
negative/positive lookaheads, or trying to match a valid date with a regex, I
see someone who is furiously pounding a nail with the fat end of a
screwdriver.

If you want to keep things simple with regular expressions:

* Be liberal with what your pattern matches and use a normal programming language for your complicated conditional logic to filter out crap you don't want

* Don't be afraid to break up the search with multiple regular expressions

* Ignore pattern whitespace and use it to visually break up your pattern. Nobody would agree to debug JavaScript that has been minified, yet people do this all the time with regexes

* For the love of all that is holy, USE NAMED GROUPS. It is a fantastic way to document your intent.

This is confusing: \d{1,2}/\d{1,2}/(\d{4}|\d{2})

This is not:

(?<day>\d{1,2})

/

(?<month>\d{1,2})

/

(?<year>\d{4}|\d{2})
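
As a rough Ruby sketch of the advice above (liberal pattern, extended mode for whitespace, named groups, and the fussy validation left to ordinary code). The names `DATE` and `parse_date` are made up for illustration, and the day/month order simply follows the group names used above:

```ruby
# Extended mode (/x) ignores whitespace inside the pattern,
# so each named group can sit on its own line.
DATE = %r{
  (?<day>   \d{1,2} )  /
  (?<month> \d{1,2} )  /
  (?<year>  \d{4} | \d{2} )
}x

# Hypothetical helper: match liberally, then filter with plain Ruby.
def parse_date(text)
  m = DATE.match(text)
  return nil unless m

  day   = m[:day].to_i
  month = m[:month].to_i
  year  = m[:year].to_i

  # Range checks live in normal code, not in the pattern.
  return nil unless (1..12).cover?(month) && (1..31).cover?(day)

  { day: day, month: month, year: year }
end

parse_date("25/12/2017")  # day 25, month 12, year 2017
parse_date("25/13/2017")  # nil, because month 13 is rejected in code
```

The pattern is deliberately unanchored and does not know about month lengths or leap years; that kind of logic is easier to read and test outside the regex.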

~~~
petercooper
Have you considered doing a sort of regex best practices post? I'd be happy to
promote such a thing or even help get it out. I think a lot of people want
that sort of info.

~~~
falsedan
If I wrote a post, it would be: "read Ch. 12 of _Perl Best Practices_
(2005)[0]".

Yow, $30+ for the ebook! I wouldn't buy it just for that chapter… fortunately,
the guidelines from the book have been codified as Perl::Critic, so if you
don't mind missing out on the gently guiding text which accompanies each rule,
you can browse the Perl::Critic::Policy::RegularExpressions[1] submodule docs.

[0]:
[http://shop.oreilly.com/product/9780596001735.do](http://shop.oreilly.com/product/9780596001735.do)
[1]:
[https://metacpan.org/pod/Perl::Critic::Policy::RegularExpres...](https://metacpan.org/pod/Perl::Critic::Policy::RegularExpressions::ProhibitComplexRegexes)

------
fleshweasel
It seems like the "ASCII puke" concrete syntax for regex patterns doesn't
scale that well. Regular expressions have binary operators, parenthesization,
named groups, lookaheads, etc.-- if you're building a sophisticated regex of
more than 10 characters or so, why not have some kind of an object model for
this stuff so you can have reasonable forms of composition, naming of
intermediate values in construction of a larger pattern, and the ability to
attach modifiers to things without needing to pack more @#$%&*!^ un-Googleable
junk into string literals?
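
For what it's worth, Ruby's `Regexp` objects already allow a limited form of this: small patterns can be bound to names and interpolated into larger ones, and `Regexp.union` builds an alternation while escaping plain strings. A minimal sketch (all constant names here are invented for the example, and this is nowhere near a full object model):

```ruby
# Name the small pieces once...
OCTET = /(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)/
IPV4  = /#{OCTET}(?:\.#{OCTET}){3}/
PORT  = /\d{1,5}/

# ...then compose them into a larger, anchored pattern with named groups.
HOST_PORT = /\A(?<host>#{IPV4}):(?<port>#{PORT})\z/

# Regexp.union escapes metacharacters in plain strings, so "a.b" is literal.
KEYWORD = Regexp.union("if", "else", "a.b")

m = HOST_PORT.match("10.0.0.1:8080")
m[:host]  # => "10.0.0.1"
m[:port]  # => "8080"
```

Interpolating a `Regexp` wraps it in its own group with its own flags, so the pieces keep their meaning when embedded.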

~~~
ebiester
Icon (and SNOBOL before that) had an alternate syntax that was more verbose
but more readable.

    
    
      s := "this is a string"
        s ? {                               # Establish string scanning environment
            while not pos(0) do  {          # Test for end of string
                tab(many(' '))              # Skip past any blanks
                word := tab(upto(' ') | 0)  # the next word is up to the next blank -or- the end of the line
                write(word)                 # write the word
            }
        }
    

I should implement something like this in ruby some day... (I did it in java
long ago, but this was before you just threw things up to github and I have
long since lost it.)
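
Ruby's standard library has `StringScanner` (in `strscan`), which gives a similar step-by-step scanning style, though without Icon's backtracking. A rough equivalent of the word loop above, with `words` being a made-up method name:

```ruby
require "strscan"

# Split a string into blank-separated words by scanning left to right.
def words(text)
  scanner = StringScanner.new(text)
  result = []
  until scanner.eos?                 # test for end of string
    scanner.skip(/ +/)               # skip past any blanks
    word = scanner.scan(/[^ ]+/)     # next word runs up to the next blank or the end
    result << word if word
  end
  result
end

words("this is a string")  # => ["this", "is", "a", "string"]
```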

~~~
sublimeloge
For those who haven't seen the language before, I think it's also useful to
know that expression evaluation in Icon works using a recursive backtracking
algorithm. This means that the most natural way of writing a string scanning
parser (like the one above) more or less automatically gives you a recursive
backtracking parser. Like ebiester, I too have found it to be a nice way to do
certain kinds of simple string parsing.

------
kazinator
TXR has actual regex intersection and complement.

Match one or more digits, but not "3":

    
    
      $ txr
      This is the TXR Lisp interactive listener of TXR 172.
      Use the :quit command or type Ctrl-D on empty line to exit.
      1> [#/\d+&~3/ "1"]
      "1"
      2> [#/\d+&~3/ "3"]
      nil
      3> [#/\d+&~3/ "123"]
      "123"
      4> [#/\d+&~3/ ""]
      nil
    

This is most easily understood by equivalence between regexes and sets of
strings: ~3 means all strings that are not 3. This set includes the "" string.
It includes all strings of length 1 other than "3", and it includes all
strings of length 2 and greater. It's simply the complement of the set _{ "3"
}_. ~R means that whatever set of strings R matches is complemented (w.r.t the
universe set being the set of all possible strings).

Similarly A&B is just set intersection. Regular expression A denotes some set
of strings, and B denotes some set of strings. A&B denotes the intersection of
those sets.
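
For comparison, the set view can be approximated in Ruby with look-aheads, but only for whole-string membership, not for TXR's leftmost-longest search. The helpers `intersect` and `complement` below are invented for this sketch and are not part of Ruby or TXR:

```ruby
# Strings matched in full by both a and b (set intersection).
def intersect(a, b)
  /\A(?=#{a}\z)#{b}\z/
end

# Strings NOT matched in full by a (set complement over all strings).
def complement(a)
  /\A(?!#{a}\z).*\z/m
end

# One or more digits, but not exactly "3".
DIGITS_NOT_3 = intersect(/\d+/, complement(/3/))

# Starts with foo, ends with bar, and contains no abc.
FOO_BAR_NO_ABC = intersect(/foo.*bar/, complement(/.*abc.*/))

DIGITS_NOT_3.match?("1")              # => true
DIGITS_NOT_3.match?("3")              # => false
DIGITS_NOT_3.match?("123")            # => true
DIGITS_NOT_3.match?("")               # => false
FOO_BAR_NO_ABC.match?("foozzzbar")    # => true
FOO_BAR_NO_ABC.match?("fooabcbar")    # => false
```

The results agree with the TXR session for the inputs that are tested as whole strings; the substring-search cases would need extra work.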

Another example: search the string for the leftmost substring which starts
with _foo_ , ends with _bar_ and does not contain _abc_ , and is the longest
possible such string at its given position:

    
    
      1> [#/foo.*bar&~.*abc.*/ ""]
      nil
      2> [#/foo.*bar&~.*abc.*/ "foobar"]
      "foobar"
      3> [#/foo.*bar&~.*abc.*/ "foozzzbar"]
      "foozzzbar"
      4> [#/foo.*bar&~.*abc.*/ "foozabczzbar"]
      nil
      5> [#/foo.*bar&~.*abc.*/ "fooabcbar"]
      nil
      6> [#/foo.*bar&~.*abc.*/ "foobar1234"]
      "foobar"
      7> [#/foo.*bar&~.*abc.*/ "foobarabc"]
      "foobar"
      8> [#/foo.*bar&~.*abc.*/ "3foobarabc"]
      "foobar"
      9> [#/foo.*bar&~.*abc.*/ "3fooabcbarabc"]
      nil
      10> [#/foo.*bar&~.*abc.*/ "3fooabcbarabcfoobar"]
      "foobar"

------
perlgeek
I'm pretty sure you can emulate the absent operator with look-ahead
expressions. Untested: (?=((?!exp).)*$)

That is: a zero-width assertion where at every character until the end exp
doesn't match.

It's a nice idea to include this in the standard, and also illustrates how
convoluted regular expression syntax has become over the years, mostly because
it wasn't designed to be extensible.

~~~
runako
I haven't completely ingested this post yet, but this seems to be incorrect:

> I'm pretty sure you can emulate the absent operator with look-ahead
> expressions. Untested: (?=((?!exp).)*$)

From TFA: "Note that this is not the same as a negative look-behind or
look-ahead — we’ll see how shortly."

~~~
falsedan
The article explains that (?~foo) is not the same as (?!foo), since the latter
fails the match if it matches, but the former succeeds if it can match
anything that's not 'foo' (including 'oo'). The first examples anchor the
patterns, so that this behaviour:

    
    
      "foo" =~ /(?~foo)/ # => 0
    

isn't revealed.
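
A small sketch of the difference anchoring makes, and of how the anchored form compares with the usual look-ahead idiom. It needs a Ruby whose regex engine supports `(?~...)` (2.4.1 or later, as far as I know); the constant names are made up for the example:

```ruby
ABSENT   = /\A(?~foo)\z/          # whole string must not contain "foo"
TEMPERED = /\A(?:(?!foo).)*\z/m   # look-ahead idiom for the same anchored check
                                  # (/m lets . match newlines too)

SAMPLES = ["", "fo", "bar", "foo", "food", "a foo b", "fofoo", "oof"]

# Unanchored, (?~foo) still finds something to match inside "foo".
UNANCHORED_HIT = !("foo" =~ /(?~foo)/).nil?   # => true

# Anchored, both patterns accept exactly the samples that lack "foo".
AGREEMENT = SAMPLES.all? do |s|
  ABSENT.match?(s) == TEMPERED.match?(s) &&
    ABSENT.match?(s) == !s.include?("foo")
end                                            # => true
```

This only checks agreement on a handful of sample strings; it is not a proof that the two forms are interchangeable in every context.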

------
daenney
It seems this has a high potential for bugs though if people aren't aware of
that detail with the `coffee and tv` example. Depending on how well versed you
are with regexes it isn't immediately obvious.

~~~
mst
Honestly, that's true of lots of unanchored constructs - which is why I tend
to try and teach people to always /^...$/ or /\A...\Z/ depending on context.
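
In Ruby specifically, the choice matters because `^` and `$` always anchor at line boundaries, not string boundaries. A quick illustration (constant names invented for the example):

```ruby
LINE_ANCHORED   = /^\d+$/     # anchors to any line within the string
STRING_ANCHORED = /\A\d+\z/   # anchors to the whole string

INPUT = "123\nDROP TABLE users"

LINE_ANCHORED.match?(INPUT)     # => true, the first line alone satisfies ^...$
STRING_ANCHORED.match?(INPUT)   # => false

# \Z (capital) still tolerates one trailing newline; \z does not.
/\A\d+\Z/.match?("123\n")       # => true
/\A\d+\z/.match?("123\n")       # => false
```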

------
smitherfield
The "C-style comments" examples can be made a little easier to read using the
alternate-style Ruby regex literal.

    
    
      %r{/\*(?~\*/)\*/}
      %r{\A/\*(?~\*/)\*/\z}
      %r{\A/\*.*?\*/\z} # incorrect: matches /**/*/
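
A quick check of those patterns against a few inputs, as a sketch (the constant names are made up, and the absent operator needs a Ruby that supports it):

```ruby
ABSENT_COMMENT = %r{\A/\*(?~\*/)\*/\z}   # body may not contain "*/"
LAZY_COMMENT   = %r{\A/\*.*?\*/\z}       # lazy dot, anchored

ABSENT_COMMENT.match?("/* hi */")   # => true
LAZY_COMMENT.match?("/* hi */")     # => true

# Two closers in a row: the anchored lazy version stretches to reach \z.
ABSENT_COMMENT.match?("/**/*/")     # => false
LAZY_COMMENT.match?("/**/*/")       # => true

# Unanchored, the absent form can pull comments out of surrounding text.
COMMENTS = "a /* one */ b /* two */ c".scan(%r{/\*(?~\*/)\*/})
# => ["/* one */", "/* two */"]
```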

------
RodgerTheGreat
My initial impression is that this seems similar to the negation operator in
parsing expression grammars. PEGs are "greedy", so the semantics of
"non-consuming" matches/negations are a bit more straightforward than this
feature appears.

~~~
jballanc
Same here. I wonder when we stop calling these "regular expressions" and start
calling them "parsing expressions"?

~~~
zokier
Perl6 has already taken steps in that direction: they have "rules" and
"grammars", although I also see that they still call regexes regex but try to
avoid the term "regular expression".

> This document summarizes Apocalypse 5, which is about the new regex syntax.
> We now try to call them regex rather than "regular expressions" because they
> haven't been regular expressions for a long time, and we think the popular
> term "regex" is in the process of becoming a technical term with a precise
> meaning of: "something you do pattern matching with, kinda like a regular
> expression". On the other hand, one of the purposes of the redesign is to
> make portions of our patterns more amenable to analysis under traditional
> regular expression and parser semantics, and that involves making careful
> distinctions between which parts of our patterns and grammars are to be
> treated as declarative, and which parts as procedural.

> In any case, when referring to recursive patterns within a grammar, the
> terms rule and token are generally preferred over regex.

[https://design.perl6.org/S05.html](https://design.perl6.org/S05.html)

Of course it is not surprising that the Perl community is pushing the envelope for
regex...

------
pcmonk
Fascinating. As I recall, things like negative look-ahead (or look-behind)
aren't formally regular expressions (i.e. the languages they recognize aren't
generally regular languages). Is this "absent" operator like that too?

~~~
perlgeek
A look-ahead isn't part of regular expression (in the CS sense) syntax, but
since regular expressions are closed under AND, OR and NOT, you can emulate it
with AND.

For example (?=a)b is the same as (a.*)&b if we take & to mean AND.

(But note that look-aheads don't influence the scope of captures in modern
regex implementations, and regular expressions don't even have the notion of
captures; this makes the emulation not practical).

Since I believe you can emulate the absence operator using look-aheads (see
[https://news.ycombinator.com/item?id=13939764](https://news.ycombinator.com/item?id=13939764)),
it should be expressible as a regular language too.

~~~
rntz
It isn't quite that easy to emulate lookahead using intersection (AND). For
(?=a)b you're right, but that's because `a` and `b` both only ever match
strings of length one. `(?=foo)f`, for example, will match the first character
of the string "foobar". But `foo&f` is the empty language: no string is both
"foo" and "f"; they have different lengths!

The trouble with lookahead and lookbehind is that they aren't even describable
in terms of the "language" which the regular expression corresponds to;
rather, they modify how the pattern matches _in the context_ of an overall
string. So they don't quite use the same formalism as intersection, union, and
negation of regular languages.

~~~
bonzini
Apart from the length of the match, (?=foo)f is equivalent to foo.* & f.* (the
extra .* ensures that both of them match foo).
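
A small Ruby check of both points, as a sketch: the look-ahead consumes only one character, while acceptance of a whole string lines up with the intersection of `foo.*` and `f.*`. Intersection is written here as "both anchored patterns match", and the constant names are invented for the example:

```ruby
LOOKAHEAD_MATCH = /(?=foo)f/.match("foobar")
LOOKAHEAD_MATCH[0]            # => "f", only one character is consumed
LOOKAHEAD_MATCH.post_match    # => "oobar"

SAMPLES = ["foobar", "foo", "fob", "f", "xfoo", ""]

# Does the look-ahead form accept a string starting at position 0
# exactly when both foo.* and f.* match the whole string?
SAME_ACCEPTANCE = SAMPLES.all? do |s|
  via_lookahead    = /\A(?=foo)f/.match?(s)
  via_intersection = /\Afoo.*\z/m.match?(s) && /\Af.*\z/m.match?(s)
  via_lookahead == via_intersection
end                           # => true
```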

