The New ‘Absent Operator’ in Ruby’s Regular Expressions (medium.com)
99 points by petercooper 91 days ago | 43 comments



I'm a huge regex nerd (see: regexpixie.com). But this is a classic example of people trying to be clever instead of clear. And I would always much rather read clear code than clever code.

Regexes are declarative and they're really good at a narrow set of features - namely, finding text that matches a pattern. When I see someone using negative/positive lookaheads, or trying to match a valid date with a regex, I see someone who is furiously pounding a nail with the fat end of a screwdriver.

If you want to keep things simple with regular expressions:

* Be liberal with what your pattern matches and use a normal programming language for your complicated conditional logic to filter out crap you don't want

* Don't be afraid to break up the search with multiple regular expressions

* Ignore pattern whitespace and use it to visually break up your pattern. Nobody would agree to debug JavaScript that has been minified, yet people do this all the time with regexes

* For the love of all that is holy, USE NAMED GROUPS. It is a fantastic way to document your intent.

This is confusing: \d{1,2}/\d{1,2}/(\d{4}|\d{2})

This is not:

  (?<day>\d{1,2})
  /
  (?<month>\d{1,2})
  /
  (?<year>\d{4}|\d{2})
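
For instance, in Ruby (just an illustrative sketch) the x flag plus named groups make both the pattern and the captured values self-documenting:

  date = /
    (?<day>   \d{1,2} ) \/
    (?<month> \d{1,2} ) \/
    (?<year>  \d{4} | \d{2} )
  /x

  m = date.match("25/12/2017")
  m[:day]    # => "25"
  m[:month]  # => "12"
  m[:year]   # => "2017"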


Have you considered doing a sort of regex best practices post? I'd be happy to promote such a thing or even help get it out. I think a lot of people want that sort of info.


If I wrote a post, it would be: "read Ch. 12 of Perl Best Practices (2005)[0]".

Yow, $30+ for the ebook! I wouldn't buy it just for that chapter… fortunately, the guidelines from the book have been codified as Perl::Critic, so if you don't mind missing out on the gently guiding text which accompanies each rule, you can browse the Perl::Critic::Policy::RegularExpressions[1] submodule docs.

[0]: http://shop.oreilly.com/product/9780596001735.do
[1]: https://metacpan.org/pod/Perl::Critic::Policy::RegularExpres...


This isn't about negative lookaheads or lookbehinds. It's about negation proper. It is perfectly declarative: the language matched by (?~E) is the complement of the language of strings containing a match of E.
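
A quick way to see it, with the pattern anchored so the whole string is in play (assuming Ruby 2.4.1+, where Onigmo's operator landed):

  "oo"      =~ /\A(?~foo)\z/   # => 0    ("oo" contains no "foo")
  "foo"     =~ /\A(?~foo)\z/   # => nil
  "egofooz" =~ /\A(?~foo)\z/   # => nil  (there's a "foo" inside)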


I've found it difficult to point out the differences between regular expressions as finite automata and regular expressions as the regex/regexp constructs within programming languages, and have given it up as a hobby.

The programming language constructs are useful in many cases but harder to reason about. The finite automata are easier to reason about but limit themselves to state machines. Here the complexity is that the operator operates in the state-machine space wrapped up inside the programming language construct.

Ok, so I guess I did not give it up as a hobby.


It seems like the "ASCII puke" concrete syntax for regex patterns doesn't scale that well. Regular expressions have binary operators, parenthesization, named groups, lookaheads, etc.-- if you're building a sophisticated regex of more than 10 characters or so, why not have some kind of an object model for this stuff so you can have reasonable forms of composition, naming of intermediate values in construction of a larger pattern, and the ability to attach modifiers to things without needing to pack more @#$%&*!^ un-Googleable junk into string literals?


Icon (and SNOBOL before that) had an alternate syntax that was more verbose but more readable.

  s := "this is a string"
    s ? {                               # Establish string scanning environment
        while not pos(0) do  {          # Test for end of string
            tab(many(' '))              # Skip past any blanks
            word := tab(upto(' ') | 0)  # the next word is up to the next blank -or- the end of the line
            write(word)                 # write the word
        }
    }
I should implement something like this in Ruby some day... (I did it in Java long ago, but this was before you just threw things up to GitHub and I have long since lost it.)
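
For reference, a rough Ruby counterpart of the Icon loop above, using StringScanner from the standard library (a sketch, not a faithful port):

  require 'strscan'

  s = StringScanner.new("this is a string")
  until s.eos?                    # test for end of string
    s.skip(/ +/)                  # skip past any blanks
    word = s.scan(/[^ ]+/)        # the next word runs up to the next blank or the end
    puts word if word             # write the word
  end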


Nice notation. Looks like a grammar with a bit more fanciness thrown in. I have a tool that does this kind of stuff: https://github.com/norswap/whimsy/blob/master/doc/autumn/REA...

I should try to implement something like this, to see how hard it would be (I don't think it would be too hard, but I might be overlooking something).


For those who haven't seen the language before, I think it's also useful to know that expression evaluation in Icon works using a recursive backtracking algorithm. This means that the most natural way of writing a string scanning parser (like the one above) more or less automatically gives you a recursive backtracking parser. Like ebiester, I too have found it to be a nice way to do certain kinds of simple string parsing.


The regular expressions in Ruby, Perl and Python have supported "extended syntax" (the x flag) for a long time. For an example, see the first section of http://www.perl.com/pub/2004/01/16/regexps.html (this is Perl; Ruby is very similar).

I think the article would be easier to read if all the regexes were in extended form, but I suppose the author is an expert regex user, so the examples were easy enough for him.

And finally, Perl 6 totally redid text matching with "grammars" (https://docs.perl6.org/language/grammars.html) -- they use a much more readable syntax, nameable groups, etc. It really is quite a wonderful thing; I wish it were available in other languages.


Not so much an expert, but I did Perl for 8 years before 13 years (so far) of Ruby, so a lot of exposure(!). I've not been a fan of extended syntax, but it might be worth giving it a proper go and writing something up if that's helpful, so thanks!


That's why I much prefer to use something like the Parse dialect that comes with Rebol / Red - http://www.rebol.com/docs/core23/rebolcore-15.html

Here's ebiester's Icon example in the Parse dialect:

    s: "This is a string"

    parse s [
        any " "                         ;; skip past any leading blanks (none in this example!
        any [                           ;; repeat ANY while keeps matching (true)
            copy word to [" " | end]    ;; next word up to next blank or EOL
            (print word)                ;; print word
            skip                        ;; skip past blank
        ]
    ]
And here is one translation of falsedan's example:

    ws:     [some [" " | "^-"]]
    sep:    #"^-"
    digits: charset "1234567890"
    lower:  charset [#"a" - #"z"]
    upper:  charset [#"A" - #"Z"]
    cost:   [3 5 digits "." 2 digits]
    quoted-description: [
        copy quot [{'} | {"}]
        some [{\} quot | upper | lower | " "]
        quot
    ]
    unquote-description: [to sep]
    description:         [quoted-description | unquote-description]
    SKU:                 [some [upper | digits]]

    parse/case s [ws cost sep description sep SKU ws]


Don't all the languages which support complex regex also support composing regexes from objects/strings?

e.g.

  /\A\s*\d{3,5}\.\d{2}\t(?:(['"])(?:[\s\w]|\\\1)+\1|[^'"]+)\t[A-Z0-9]+\s*\Z/
would be written as

  $ws                   = qr/\s*/;
  $sep                  = qr/\t/;
  $cost                 = qr/\d{3,5} [.] \d\d/x;
  $quoted_description   = qr/
    (['"])       # opening quote; remember this for later
      (?:        # (non-capturing group, don't read this)
         [\w\s]  #   a word char or space
         |       #   OR
         [\\] \1 #   backslash-escaped quote, same as opening
      )+         # (description can't be empty)
    \1           # closing quote, matching opening
  /x;
  $unquoted_description = qr/[^'"]+/;
  $description          = qr/(?: $quoted_description | $unquoted_description )/x;
  $SKU                  = qr/ [[:upper:][:digit:]]+ /x;

  /\A $ws $cost $sep $description $sep $SKU $ws \Z/x
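
(Ruby is one data point: Regexp objects interpolate straight into literals, so roughly the same decomposition works there too. A sketch, with the description part simplified to a lazy wildcard:)

  ws   = /\s*/
  sep  = /\t/
  cost = /\d{3,5} [.] \d{2}/x
  sku  = /[A-Z0-9]+/

  line = /\A #{ws} #{cost} #{sep} .+? #{sep} #{sku} #{ws} \z/x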


`x` mode and DEFINE can be your friends. Example:

  (?xi)
  (?(DEFINE)
  
    (?<year> \d{4} )  # I'm a comment
  
    (?<brand> honda|toyota )
  
    (?<model> crx|prius )
  )
...later...

  ((?&year)) ((?&brand)) ((?&model))
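
Ruby has no (?(DEFINE)...), but a rough analogue (assuming the usual trick of "defining" each named group with a {0} quantifier so it matches nothing at its definition site, then reusing it via \g<name> subexpression calls) looks like:

  car = /
    (?<year>  \d{4}        ){0}   # defined here, used later
    (?<brand> honda|toyota ){0}
    (?<model> crx|prius    ){0}
    \g<year> \s \g<brand> \s \g<model>
  /xi

  car =~ "1991 honda crx"    # => 0
  car =~ "1991 mazda miata"  # => nil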


There is this project, Verbal Expressions: https://github.com/VerbalExpressions

It appears from time to time on Hacker News. The JavaScript implementation has nearly 10k stars.


I have replaced most regular expressions with irregexes: http://synthcode.com/scheme/irregex/ . They are sadly Scheme-specific, but they do a lot for clarity.

For anything more complex than small things that can be understood in less than 20 seconds, I use a parser generator, be it parsack (a Racket version of Haskell's Parsec) or whichever I have at hand.


Personally, I'm fine with it—but only because regexps really belong to a different domain than people think.

Regexps do not exist to be a self-documenting syntax for writing code that gets read and maintained. If you are going to sit down, write, debug, commit, and PR some code that matches strings, for heaven's sake just write your pattern in BNF and apply a parser generator to it, or use a parser combinator library.

Regexps are intended as a fluent syntax for interacting with data. Regexps exist to be arguments to sed, awk, and vim's :s command. Regexps exist to let you type an SQL query into psql that finds rows with columns matching a pattern. They're meant to be a hand-tool, used by a craftsman during the manual work of analysis that comes before the job is planned.

And as such, regexp syntax features aren't meant to be composed into multi-line monstrosities that do all the work at once; they're meant to let you match chunks, and then pipe that to another regexp that winnows those chunks down, and then another that substitutes one part of each chunk, etc.

If you've ever seen a Perl script written in "imperative mode", where every line is relying on the implicit assignment of its result to the $_ variable, each line doing one more thing to that variable, each little regexp sawing off one edge or patching one hole—that's an example of the proper use of regexps. Such a script is effectively less a "program", and more simply a record of someone's keystrokes at a REPL.

And because of that, I honestly find it a bit strange that modern compiled languages build in "first-class" native support for regexps. They make sense in "scripting" languages like Ruby and Python because those languages can indeed be used for "scripting": writing code in their REPLs to do some manual tinkering, and then maybe saving a record of what you just did in case you need it again. But in languages like Go or Elixir? Why not just give the developer a batteries-included parser-combinator library instead? (If you, as a developer, need to parse regexps to support your users querying your system by passing it regexps, they could still be available from a library. But there's no need for a literal syntax for them in such languages.)

That being said, I wouldn't mind if an IDE for a particular compiled language accepted regular-expression syntax as a sort of Input Method Editor: you'd hit Ctrl+Shift+R or somesuch, a little "Input regexp: " window would pop up over your cursor, and then as you wrote and modified the regexp in the window, the equivalent BNF grammar would appear inside a text-selection at the cursor. That's a good use of regexps: to allow you to fluently, quickly create BNF grammars. As if they were a synthesizer keyboard, with each keystroke immediately performing a function.


TXR has actual regex intersection and complement.

Match one or more digits, but not "3":

  $ txr
  This is the TXR Lisp interactive listener of TXR 172.
  Use the :quit command or type Ctrl-D on empty line to exit.
  1> [#/\d+&~3/ "1"]
  "1"
  2> [#/\d+&~3/ "3"]
  nil
  3> [#/\d+&~3/ "123"]
  "123"
  4> [#/\d+&~3/ ""]
  nil
This is most easily understood by equivalence between regexes and sets of strings: ~3 means all strings that are not 3. This set includes the "" string. It includes all strings of length 1 other than "3", and it includes all strings of length 2 and greater. It's simply the complement of the set { "3" }. ~R means that whatever set of strings R matches is complemented (w.r.t the universe set being the set of all possible strings).

Similarly A&B is just set intersection. Regular expression A denotes some set of strings, and B denotes some set of strings. A&B denotes the intersection of those sets.

Another example: search the string for the leftmost substring which starts with foo, ends with bar and does not contain abc, and is the longest possible such string at its given position:

  1> [#/foo.*bar&~.*abc.*/ ""]
  nil
  2> [#/foo.*bar&~.*abc.*/ "foobar"]
  "foobar"
  3> [#/foo.*bar&~.*abc.*/ "foozzzbar"]
  "foozzzbar"
  4> [#/foo.*bar&~.*abc.*/ "foozabczzbar"]
  nil
  5> [#/foo.*bar&~.*abc.*/ "fooabcbar"]
  nil
  6> [#/foo.*bar&~.*abc.*/ "foobar1234"]
  "foobar"
  7> [#/foo.*bar&~.*abc.*/ "foobarabc"]
  "foobar"
  8> [#/foo.*bar&~.*abc.*/ "3foobarabc"]
  "foobar"
  9> [#/foo.*bar&~.*abc.*/ "3fooabcbarabc"]
  nil
  10> [#/foo.*bar&~.*abc.*/ "3fooabcbarabcfoobar"]
  "foobar"


I'm pretty sure you can emulate the absent operator with look-ahead expressions. Untested: (?=((?!exp).)*$)

That is: a zero-width assertion checking that exp doesn't match at any character position between here and the end of the string.

It's a nice idea to include this in the standard, and also illustrates how convoluted regular expression syntax has become over the years, mostly because it wasn't designed to be extensible.


You propose:

    (?=((?!exp).)*$)
I don't believe you're right, but can't be sure, because you haven't given a method of translating an arbitrary regular expression containing a negation (?~E) into one without.

For example, how would you translate:

    (?~exp)a.*
which asks for "anything not exp, then an `a`, then anything"?

    (?=((?!exp).)*$).*a.*
is wrong, since it doesn't match "aexp".

    (?=((?!exp).)*$)a.*
is also wrong, for the same reason.

    (?=((?!exp).)*a)a.*
is wrong, because it doesn't match "expba".

EDIT: argh, HN formatting rules about asterisks.


I believe he's saying you'd want:

  (?=((?!exp).)*a.*)
I'm a little dubious that full equivalence is maintained (particularly w.r.t. e.g. #match), but my brain isn't up to handling convoluted REs this early.

Edit: Another hangup I have is how would you do something like:

  ((?~E1)|E2)E3
Obviously you can distribute by duplicating E3, but then it's pretty clear what the absent operator buys you.


    (?=((?!exp).)*a.*)
is also wrong; it doesn't match "expba", but

    (?~exp)a.*
does.


  /(?=((?!exp).)*a.*)/ =~ "expba"
  # => 1
for me at least (using #match is also truthy, but the matched string is not what I'd guess the absent operator would give).


Huh! Now I am super confused.


See also the second comment on the original issue:

https://bugs.ruby-lang.org/issues/5588


I haven't completely ingested this post yet, but this seems to be incorrect:

> I'm pretty sure you can emulate the absent operator with look-ahead expressions. Untested: (?=((?!exp).)*$)

From TFA: "Note that this is not the same as a negative look-behind or look-ahead — we’ll see how shortly."


The article explains that (?~foo) is not the same as (?!foo), since the latter fails the match if 'foo' matches at that point, but the former succeeds if it can match anything that's not 'foo' (including 'oo'). The first examples anchor the patterns, so that this behaviour:

  "foo" ~= /(?~foo)/ # => 1
isn't revealed.


Well, it's not the same as a simple positive or negative look-ahead, but that's not what I proposed either.


It seems this has a high potential for bugs, though, if people aren't aware of that detail from the `coffee and tv` example. Depending on how well versed you are with regexes, it isn't immediately obvious.


Honestly, that's true of lots of unanchored constructs - which is why I tend to try and teach people to always anchor with /^...$/ or /\A...\Z/ depending on context.
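
(The difference, for anyone following along: ^ and $ anchor to line boundaries, \A and \z to the ends of the whole string, and \Z also tolerates a single trailing newline. In Ruby:)

  "evil\ninjection" =~ /^injection$/     # => 5   -- matches the second line
  "evil\ninjection" =~ /\Ainjection\z/   # => nil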


Yes, when I was initially investigating it, it tripped me up which is why I wanted to make a point of it.

I know a lot of people on HN have far more experience with regexes at a theoretical level than me, so I'm also hoping any errors are quickly highlighted in comments here for me to correct :-D


The "C-style comments" examples can be made a little easier to read using the alternate-style Ruby regex literal.

  %r{/\*(?~\*/)\*/}
  %r{\A/\*(?~\*/)\*/\z}
  %r{\A/\*.*?\*/\z} # incorrect: matches /**/*/


My initial impression is that this seems similar to the negation operator in parsing expression grammars. PEGs are "greedy", so the semantics of "non-consuming" matches/negations are a bit more straightforward than this feature appears.


Same here. I wonder when we'll stop calling these "regular expressions" and start calling them "parsing expressions"?


Perl 6 has already taken steps in that direction: it has "rules" and "grammars", although I also see that they still call them "regexes" while trying to avoid the term "regular expression":

> This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex rather than "regular expressions" because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

> In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

https://design.perl6.org/S05.html

Of course, it is not surprising that the Perl community is pushing the envelope for regexes...


Fascinating. As I recall, things like negative look-ahead (or look-behind) aren't formally regular expressions (i.e. the languages they recognize aren't generally regular languages). Is this "absent" operator like that too?


The negation (complement) of a regular language is always regular. So in the CS theory sense, these are still regular expressions. However, while regular languages correspond to finite automata, the transformation which negates a NFA is not very easy to describe, which is one reason regular expression matching libraries have rarely included negation proper as an operation. Implementing negation proper using backtracking is somewhat easier, but quite expensive if done naively.
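
To make the easy half concrete, here is a toy sketch (my own illustration, nothing to do with Onigmo's internals) of complementing a DFA by flipping its accepting states; the hard part described above is getting from an NFA to a deterministic machine like this in the first place:

  # Toy DFA over the alphabet {"a", "b"} accepting strings that contain "ab".
  DFA = Struct.new(:start, :accept, :delta) do
    def accept?(str)
      state = str.chars.reduce(start) { |s, c| delta.fetch([s, c]) }
      accept.include?(state)
    end

    def complement                       # swap accepting and rejecting states
      states = delta.keys.map(&:first) | delta.values
      DFA.new(start, states - accept, delta)
    end
  end

  contains_ab = DFA.new(0, [2],
    { [0, "a"] => 1, [0, "b"] => 0,
      [1, "a"] => 1, [1, "b"] => 2,
      [2, "a"] => 2, [2, "b"] => 2 })

  contains_ab.accept?("aab")             # => true
  contains_ab.complement.accept?("aab")  # => false
  contains_ab.complement.accept?("ba")   # => true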


A look-ahead isn't part of regular expression (in the CS sense) syntax, but since regular expressions are closed under AND, OR and NOT, you can emulate it with AND.

For example (?=a)b is the same as (a.*)&b if we take & to mean AND.

(But note that look-aheads don't influence the scope of captures in modern regex implementations, and regular expressions don't even have the notion of captures; this makes the emulation impractical.)

Since I believe you can emulate the absent operator using look-aheads (see https://news.ycombinator.com/item?id=13939764), it should be expressible as a regular language too.


It isn't quite that easy to emulate lookahead using intersection (AND). For (?=a)b you're right, but that's because `a` and `b` both only ever match strings of length one. `(?=foo)f`, for example, will match the first character of the string "foobar". But `foo&f` is the empty language: no string is both "foo" and "f"; they have different lengths!

The trouble with lookahead and lookbehind is that they aren't even describable in terms of the "language" which the regular expression corresponds to; rather, they modify how the pattern matches in the context of an overall string. So they don't quite use the same formalism as intersection, union, and negation of regular languages.


Apart from the length of the match, (?=foo)f is equivalent to foo.* & f.* (the extra .* ensures that both of them match foo).


Hi, I'm the author of the Onigmo regex library.

I'm not sure that negative look-ahead/behind are formally regular expressions. (There is a (Japanese) paper which says that they are formally regular expressions, but I don't understand it.)

However, the absent operator is formally a regular expression.

I translated the main points of Tanaka Akira's paper very roughly. https://github.com/k-takata/Onigmo/issues/87 I hope this helps you to understand the operator.


Negative lookahead/behind is regular in the mathematical sense. Anything that can be achieved by a finite combination of FSMs is, including OR, AND and NOT on their accepting states.

What isn't regular is the stipulation that a group matches the same string as another group. E.g. "same word twice" is not expressible by formal regular languages.
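
(Backreferences are how engines express that, and they are precisely the step outside the regular languages:)

  /\b(\w+) \1\b/ =~ "say hello hello"   # => 4  -- "same word twice"
  /\b(\w+) \1\b/ =~ "say hello world"   # => nil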


This initial paper on the concept seems to dig into that: https://staff.aist.go.jp/tanaka-akira/pub/prosym49-akr-paper... but it's in Japanese and a bit above my pay grade.



