
Regular expressions you can read: a visual syntax and UI
https://medium.com/@savolai/regular-expressions-you-can-read-a-new-visual-syntax-526c3cf45df1
======
kileywm
As someone who has crafted thousands of complex regular expression rules for
data capture, here is my take:

1. This is a fine idea to aid regex newbies in crafting their expressions. I
see this as a gateway rather than a long-term tool. The expressions won't be
optimal (by no fault of the tool), nor will they likely be complete, but
that's not the point. If it helps reduce the barrier(s) to adoption of regular
expressions, then I can heartily support it.

2. To the people who say they use regular expressions only a handful of times
a year, and thus that it's not worthwhile to invest time in learning the
syntax, I offer this: once you know it, you will use it far more often than
you ever expected. Find & replace in text, piping output, nginx.conf editing,
or even the REGEXP() function in MySQL. It's a valuable skill in so many
environments that I expect you will use it weekly, if not daily.

3. Ultimately regular expressions, like everything, are extra difficult until
you know all of the available tools in the toolbox. At that point, you may
realize you wrote an unnecessarily complex expression simply because you
didn't know better.
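
To make the find & replace case in point 2 concrete, here is a quick sketch in Python (the sample data is invented):

```python
import re

# Swap "Lastname, Firstname" into "Firstname Lastname" across many lines --
# the kind of one-off edit where knowing regex pays for itself.
names = ["Doe, Jane", "Smith, John"]
fixed = [re.sub(r"^(\w+), (\w+)$", r"\2 \1", n) for n in names]
print(fixed)  # ['Jane Doe', 'John Smith']
```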

~~~
danappelxx
Can you explain how you edit your nginx configuration files with regex? Just
replacing?

~~~
kileywm
Ah, for Nginx specifically, I was referring to location blocks. Granted,
regex should be used very carefully there for performance reasons, but when
appropriate you can do some really cool pattern matching.
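
For readers unfamiliar with regex location blocks, a minimal sketch (the URL pattern and upstream names here are invented, not from the comment):

```nginx
# Case-insensitive regex match (~*): route product pages like
# /blue-widget-123.html to a dedicated backend.
location ~* ^/[a-z0-9-]+-\d+\.html$ {
    proxy_pass http://product_backend;
}

# Fallback prefix location for everything else.
location / {
    proxy_pass http://default_backend;
}
```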

~~~
danappelxx
Oh, that makes more sense. Thanks for sharing :)

------
bartkappenburg
Our tool[0] for using persuasion principles on your site to increase
conversion had a UX problem when setting things up. We wanted a generic way to
detect what type of page a certain URL is. The most obvious way was to go with
regular expressions (/.*\-.*\-\d+\.html for product pages, for example).

It turned out this was by far the most misunderstood setting, while being one
of the most important ones. The target audience (marketeers) had something to
do with it, but even though Google Analytics and Google Tag Manager are widely
used by them, setting up these expressions is really hard.

We decided to build an internal tool that generates a regular expression based
on examples for which the regex must hold. We called it the regexhelper. It
was so successful that we made it into an external tool[1].

It's not perfect (in terms of generating the most efficient regexes), but it
works fantastically well for our audience of marketeers. We're planning to
open source this as well!

A visual UI along the lines of this idea, applied to the regexes that come out
of our helper, could be beneficial.

[0] [https://www.conversify.com](https://www.conversify.com)

[1] [http://regexhelper.conversify.com/](http://regexhelper.conversify.com/)

------
Drup
If you want readable regexp, just use combinators and your language's variable
declaration facilities. No need for more.

I don't understand why people still insist on using insane syntax for regexps
instead of just... functions (`rep` for repetition, `seq` for sequences,
`opt` for optional, and so on).
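
The combinator idea is easy to sketch in any language; here is a toy Python version (the helper names follow the comment above; everything else is invented):

```python
import re

# Each combinator returns a plain regex string, so composition is just
# ordinary function calls and variables.
def lit(text):      return re.escape(text)
def seq(*parts):    return "".join(parts)
def alt(*parts):    return "(?:" + "|".join(parts) + ")"
def rep(part):      return "(?:" + part + ")*"
def opt(part):      return "(?:" + part + ")?"

# A toy version-number pattern: digits, then any number of ".digits".
digits  = r"\d+"
version = seq(digits, rep(seq(lit("."), digits)))

assert re.fullmatch(version, "1.22.3")
assert not re.fullmatch(version, "1..2")
```

Because every helper returns a valid sub-pattern, quoting is handled once in `lit` and never leaks into the callers.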

~~~
mikegerwitz
It's concise.

Regexps can be documented, split onto multiple lines, and commented in many
languages, be it through string concatenation or formatting modifiers. I write
some complicated regular expressions, and I've found that splitting groups of
expressions onto multiple lines and indenting them handles most of the
problems that my coworkers have with grokking them, and that I have when
returning to them.
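
Python's re.VERBOSE flag, for instance, supports exactly this style of split-and-commented pattern (the date pattern is just an illustration):

```python
import re

# One concise regex at heart, but split onto commented, indented lines.
date = re.compile(r"""
    (?P<year>  \d{4} ) -    # four-digit year
    (?P<month> \d{2} ) -    # two-digit month
    (?P<day>   \d{2} )      # two-digit day
""", re.VERBOSE)

m = date.fullmatch("2016-05-15")
print(m.group("year", "month", "day"))  # ('2016', '05', '15')
```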

I prefer the concise syntax (provided that it's reasonably formatted) for the
same reason that I prefer the concise syntax of sed, ed, and similar: it's
easier to mentally map and reason about symbols than about large blocks of
text. I've been programming for nearly 20 years, and I've found that I much
prefer manipulating mathematical expressions to manipulating large chunks of
code, because the former uses a concise syntax where symbols mean something. I
love such a notation. (In the case of programs, when refactoring, my mind
works in blocks of code as units, as I'm sure most others' do.)

I'm not saying those benefits aren't possible with verbose code---they are.
But just as many prefer a concise mathematical syntax to a verbose program
that does the same thing, I prefer a concise formal definition.

I'm also not implying that you should try to write an entire grammar in a
single regular expression.

~~~
junke
Concise notations are great, and this is why regexps are so widely used, IMO.
I am, by the way, a fan of sed, which is clever enough to give you a choice of
delimiter (s+/+_+g).

On the other hand, there are so many additions to the core formal language,
like backtracking or Larry Wall knows what, that syntax has become cryptic.
Besides, building regexps out of smaller ones is generally a pain with
strings, because you need to quote special regex characters, along with any
character that might interfere with the host language's syntax (e.g. emacs
regexes with four backslashes in a row). I prefer to read actual words, so the
following is fine for me:

    
    
        (defvar *email-regex*
          '(:sequence
            :word-boundary
            
            ;; IDENTIFIER PART
            (:regex "[A-Z0-9._%+-]+")
            
            #\@
    
            ;; DOMAIN
            (:regex "[A-Z0-9.-]+")
            
            #\.
            
            ;; TOP-LEVEL DOMAIN
            (:greedy-repetition 2 nil
             (:char-class (:range #\A #\Z)))
            
            :word-boundary))
    
    

After the recent discussions about Lisp, here is an actual example that can be
used by CL-PPCRE to scan for \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b. The
list structure allows you to compose your regular expression like any other
list, with intermediate functions, etc., without ever having to think about
escaping your characters. When you need the string-based, concise regex, you
wrap it in a ":regex" form and you have the best of both worlds.

~~~
mikegerwitz
> On the other hand, there are so many additions to the core formal language,
> like backtracking or Larry Wall knows what, that syntax has become cryptic.

I'm not going to argue with that, though languages like Perl (which have
extended regular expressions so much that they're not actually regular
expressions anymore) also allow named groups, for example.

I'm not arguing that certain circumstances aren't difficult to understand. In
such cases, I do actually compose expressions from separate ones (e.g.
variables): but they're still a concise syntax.

> (e.g. emacs regexes with four backslashes in a row).

Yes, such cases are unfortunate and confusing.

> I prefer to read actual words, so the following is fine for me:

For me, that took much longer to parse mentally than the equivalent regex. Or,
if you're okay with minor changes and a proper locale:

/\b[\w.%+-]+@[\w\d.-]+\.[A-Z]{2,}\b/

I suspect that someone used to reading the notation you provided would have
opposite results than I do. The reason I find the actual formal notation for
the regex easier is because there's less to keep in memory---all the verbose
extras that I have to strip out when forming my mental image of the regex.

If the regular expression were more complicated, the solution you presented
might not be so bad. I would normally format it like this (if we stick with
the verbose character classes):

/\b [A-Z0-9._%+-]+ @ [A-Z0-9._-]+ \. [A-Z]{2,} \b /

~~~
mikegerwitz
Ah, that didn't format well at all. I intended for this to display:

    
    
      /
        \b
        [A-Z0-9._%+-]+
        @
        [A-Z0-9._-]+
        \.
        [A-Z]{2,}
        \b
      /

------
dottrap
The problem with "regex" is that it left the pure computer science realm of
true regular expressions, and thus lost many of their mathematical properties.

Regexes are then further abused to do things far beyond what true regular
expressions can do, which results in cryptic expressions whose behavior is
implementation dependent instead of bounded by computer science principles.

Lua creator Roberto Ierusalimschy resurfaced and explored the idea of PEGs
(Parsing Expression Grammars) as a better way to do the things that people
have abused regex to do, while keeping it grounded in pure CS principles:
better syntax that makes things easier to express, more powerful behavior,
mathematically grounded complexity (for performance), and more clarity about
what can and cannot be accomplished.

This video presentation from the Lua Workshop explains all of this and more
about why PEGs. [https://vimeo.com/1485123](https://vimeo.com/1485123)

~~~
bsder
The problem with grammars is that they are too verbose.

Regexes are _concise_. This is their strength and weakness.

I can write a regex in a dialog box after initiating a "Find". I can't specify
a grammar like that.

The deeper problem with regexes is _programmers_. Most programmers do not have
the perspective to say "Whoa. This regex is too much. I'm really doing parsing
at this point and should probably switch up to a grammar."

~~~
dottrap
But that isn't even completely true, and you're missing my point about regex
falling outside the domain of "real" regular expressions. Here's a Perl regex
to validate email addresses according to RFC 822 (and it doesn't actually
handle everything):

[http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html](http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html)

    
    
            (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
        )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
        \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
        ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
        \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
        31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
        ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
        (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
        (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
        |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
        ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
        r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
         \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
        ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
        )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
         \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
        )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
        )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
        *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
        |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
        \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
        \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
        ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
        ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
        ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
        :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
        :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
        :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
        [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
        \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
        \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
        @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
        (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
        )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
        ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
        :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
        \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
        \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
        ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
        :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
        ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
        .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
        ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
        [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
        r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
        \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
        |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
        00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
        .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
        ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
        :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
        (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
        \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
        ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
        ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
        ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
        ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
        ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
        \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
        ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
        ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
        :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
        \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
        [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
        ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
        ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
        ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
        ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
        @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
         \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
        ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
        )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
        ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
        (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
        \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
        \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
        "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
        *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
        +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
        .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
        |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
        ?:\r\n)?[ \t])*))*)?;\s*)
    
    

Now here's a Lua LPeg example:

[https://gist.github.com/daurnimator/3044300](https://gist.github.com/daurnimator/3044300)

    
    
        local P = lpeg.P
        local R = lpeg.R
        local S = lpeg.S
        local V = lpeg.V
        local C = lpeg.C

        local CHAR = R"\0\127"
        local SPACE = S"\40\32"
        local CTL = R"\0\31" + P"\127"

        local specials = S[=[()<>@,;:\".[]]=]

        local atom = (CHAR - specials - SPACE - CTL)^1
        local dtext = CHAR - S"[]\\\13"
        local qtext = CHAR - S'"\\\13'
        local quoted_pair = "\\" * CHAR
        local domain_literal = P"[" * ( dtext + quoted_pair )^0 * P"]"
        local quoted_string = P'"' * ( qtext + quoted_pair )^0 * P'"'
        local word = atom + quoted_string

        -- Implements an email "addr-spec" according to RFC 822
        local email = P {
          V"addr_spec" ;
          addr_spec = V"local_part" * P"@" * C(V"domain") ;
          local_part = word * ( P"." * word )^0 ;
          domain = V"sub_domain" * ( P"." * V"sub_domain" )^0 ;
          sub_domain = V"domain_ref" + domain_literal ;
          domain_ref = atom ;
        }
    

If you stay in the realm of "real" (theoretical/CS) regular expressions, then
regex doesn't have to be nasty. But the fact is that most people are not doing
this, and are trying to do things way outside that domain. At that point, all
bets are off, and other tools may be more correct, more appropriate, and more
concise.

Edit: Formatting

~~~
raiph
Perl 6 Rules reframe parsing along the lines you suggest.

[https://en.wikipedia.org/wiki/Perl_6_rules](https://en.wikipedia.org/wiki/Perl_6_rules)
contains the grammar used to define Perl's sprintf string formatting notation:

    
    
      grammar Str::SprintfFormat {
       regex format_token { \%: <index>? <precision>? <modifier>? <directive> }
       token index { \d+ \$ }
       token precision { <flags>? <vector>? <precision_count> }
       token flags { <[\ +0\#\-]>+ }
       token precision_count { [ <[1-9]>\d* | \* ]? [ \. [ \d* | \* ] ]? }
       token vector { \*? v }
       token modifier { ll | <[lhmVqL]> }
       token directive { <[\%csduoxefgXEGbpniDUOF]> }
      }
    

You could use these rules like so:

    
    
      if / <Str::SprintfFormat::format_token> / { ... }
    

Perl 6 Rules unify PEGs, regexes, and closures -- "[a] rule used in this way
is actually identical to the invocation of a subroutine with the extra
semantics and side-effects of pattern matching (e.g., rule invocations can be
backtracked)."

Have you looked at Perl 6 Rules?

------
eganist
I can see the claimed advantages to what's proposed, but I feel like if the
railroad diagram by RegExper could be reversed, that would be a far more
successful visual syntax for regular expressions. Then again, most of my
regex-fu entails building a regex relatively close to what I want and then
repeatedly throwing it at a local instance of RegExper and test strings until
I have something which accomplishes what I'm looking for. I'd definitely fall
outside the "true regex superheroes" category.

Anyway, to simplify what I have in mind for us less-than-experts, it'd be neat
if someone could put together a railroad diagram of a regular expression that
would then be compiled as the regex itself.

That being said, I don't have the presence of mind right now to determine if
two different regexes can result in the same diagram in RegExper. If so, that
kinda thoroughly breaks my idea.

~~~
gwildor
> most of my regex-fu entails building a regex relatively close to what I want
> and then repeatedly throwing it at a local instance of RegExper and test
> strings until I have something which accomplishes what I'm looking for it to
> do.

> I'd definitely fall outside the "true regex superheroes" category.

I think you just gave the definition of a "true regex superhero". The best
regex programmers I know have a similar workflow.

------
consto
I understand that the first email regex is simplified and as a result doesn't
handle oddities such as weird symbols, quotation marks, and IP addresses, but
it should be able to handle modern TLDs. Not only are there names longer than
4 characters, there are also internationalised domain names starting with
xn--.

[https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains](https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains)

Depending how simple you want it, either:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(xn--[A-Z0-9]+|[A-Z]+)\b

or simpler:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z0-9-]+\b

However you could argue that validating email via regex misses the point
entirely. A simple, permissive regex is all you really need assuming you are
actually sending an email to check that the account exists.
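
A sketch of such a permissive check in Python (the exact pattern is my suggestion, not the commenter's):

```python
import re

# Shape check only: something, one "@", something with a dot. The real
# validation is sending a confirmation mail to the address.
PERMISSIVE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

assert PERMISSIVE.match("user@example.com")
assert PERMISSIVE.match("user@xn--bcher-kva.example")  # IDN domains pass too
assert not PERMISSIVE.match("no-at-sign.example.com")
assert not PERMISSIVE.match("two@@example.com")
```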

~~~
malka
Usually, I just check that there is an '@' somewhere in it.

~~~
mioelnir
It is things like `net/mail.ParseAddress()` that I have come to really enjoy
in Go's standard library.

------
zwischenzug
In the past I've used kodos, but as time has gone on I've needed it less and
less:

[http://kodos.sourceforge.net/](http://kodos.sourceforge.net/)

As a result I'm not sold on the idea of such a visualisation; you should be
using regexps all the time, and internalising the rules. When that's not
enough you have to go and read up. I'm not sure such a visualisation will help
that much in those non-regular cases, simply because it won't always be
available to hand.

~~~
nv-vn
>you should be using regexps all the time

Personally, I try to limit my usage as much as possible. Regexes are basically
only useful for things I do in my terminal and for occasionally verifying that
data is well-formed. For practical parsing, it's almost always better (and
runs faster) to use a more robust solution like parser combinators or
lexer/parser generators. Oftentimes, even for things that are seen as perfect
regex use cases (validating data or splitting strings), a parsing solution
will work better -- for example, you can't be sure that all the numbers in
your data are small enough not to overflow your integers using regexes alone
(or at least not without resorting to extremely long and unreadable regexes).
Regexes are a tool that's easy to reach for, but a lot of the time they're a
tool that will end up breaking on you eventually.
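
The range-check point can be made concrete with a small Python sketch (the port example is mine, not the commenter's):

```python
import re

def parse_port(field):
    """A regex checks the shape; the parsed value checks the range."""
    if not re.fullmatch(r"\d{1,5}", field):
        raise ValueError("not a number")
    value = int(field)
    # Expressing 1..65535 purely in regex is possible but unreadable;
    # after parsing it is one comparison.
    if not 1 <= value <= 65535:
        raise ValueError("out of range")
    return value

assert parse_port("8080") == 8080
```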

~~~
zwischenzug
I use :s%/search/replace/g very very often (for example).

~~~
rjeli
for those following along in ex, it should be :%s/foo/bar/g

~~~
zwischenzug
Oops :)

------
Cozumel
Stuff like this, while well intentioned, is ultimately harmful. Regex always
looked like total gibberish to me; then one weekend I sat myself down and
actually learnt it, and I've had no more issues. It's really simpler than it
seems and worth the effort to learn; programs like this just work as a crutch.

~~~
paul_f
The point of this, as the article says, is the rare use of regexps. If you
only have to design a regex twice a year, taking a weekend to relearn the
syntax is too much trouble.

------
callesgg
When I read the "graphical" version I missed a major issue with that email
verifier: it only allows emails in UPPER CASE.

I spotted it directly in the normal one.

That says something.

PS: I do think it looks rather nice.

------
markbnj
I'll be really interested to see others' reactions to this. My first
impression when I glanced over the example construction was not good. I felt
like it really didn't improve comprehension, but just forced me to try to
learn a new way of seeing those symbols. Perhaps a visual regex "IDE" that
completely abstracted the syntax would be a better approach.

~~~
savolai
There appear to be lots of reactions over at reddit.

[https://www.reddit.com/r/programming/comments/4jfuq4/regular...](https://www.reddit.com/r/programming/comments/4jfuq4/regular_expressions_you_can_read_a_new_visual/)

(I am OP / blog post author)

------
kolapuriya
Depends on what you mean by "parse". If all you want is to search a document
that is known to be well-formed, find an element that meets a few criteria,
and grab a value out of that element, you can sometimes get away with using
regex to find a substring that "looks right" without actually parsing the
document. Running your document through an actual parser gives you access to
more information about the structure of the document and the context of the
elements of interest. Actually parsing your input is therefore more robust to
unexpected variations than any of the superficially-cheaper alternatives that
people try.

------
forrestthewoods
My #1 issue with regex is just knowing the damn syntax. Every implementation
is a little bit different.

Is there a good website that lets me select a language/platform/IDE/etc and
cleanly shows all the tools in that particular toolbox?

------
ZenoArrow
What about using this pattern matching visualisation with SNOBOL? I'd suggest
it could be a better fit for this than regex.

[http://langexplr.blogspot.co.uk/2007/12/quick-look-at-snobol.html?m=1](http://langexplr.blogspot.co.uk/2007/12/quick-look-at-snobol.html?m=1)

"The most interesting thing about the language is the string pattern matching
capabilities. Here's a small (and very incomplete) example that extracts the
parts of a simplified URL string:

    
    
       LETTER = "abcdefghijklmnopqrstuvwxyz" 
       LETTERORDOT = "." LETTER
       LETTERORSLASH = "/" LETTER
    
       LINE = INPUT
       LINE SPAN(LETTER) . PROTO "://" SPAN(LETTERORDOT) . HOST "/" SPAN(LETTERORSLASH) . RES
    
       OUTPUT = PROTO
       OUTPUT = HOST 
       OUTPUT = RES
       END

In line 6, the contents of the LINE variable are matched against a pattern.
The pattern contains the following elements:

1. The SPAN(LETTER) . PROTO "://" section says: identify a sequence of letters
followed by "://" and assign it to the variable called PROTO

2. The SPAN(LETTERORDOT) . HOST "/" section says: take a sequence of letters
and dots followed by "/" and assign it to the variable called HOST

3. Finally, the last section takes the remaining letters and slash characters
and assigns them to the RES variable"
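
For comparison, roughly the same simplified-URL match written with named groups in a conventional regex engine (Python here; the test URL is invented):

```python
import re

# PROTO, HOST and RES from the SNOBOL pattern become named groups.
m = re.match(r"(?P<proto>[a-z]+)://(?P<host>[a-z.]+)/(?P<res>[a-z/]+)",
             "http://www.example.com/some/path")
print(m.group("proto"), m.group("host"), m.group("res"))
# http www.example.com some/path
```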

------
JelteF
Any time I want to write some non-trivial regex I use
[https://debuggex.com/](https://debuggex.com/) to check/write it. It is also
great for quickly finding out what a regular expression that someone else
wrote actually does.

------
Annatar
This helped me master regular expressions:

[http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/](http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/)

once I read that, it was AWK forever.

~~~
joepvd
Same here. Funny, as Friedl barely discusses the ERE and awk (DFA)
implementations at any length. It is a shame, as the `more interesting` NFA
implementations have performance issues under some circumstances. I had hoped
the dilemma, and the engineering decision between the features of NFA engines
and the guaranteed performance of DFA engines, would get a more realistic
discussion.

~~~
burntsushi
For a discussion of the "DFA" engines, I'd recommend Russ Cox's article
series.

The NFA implementations you speak of have problems because they use
backtracking and take worst-case exponential time. Most implementations are in
fact not NFAs, since, for example, an NFA is not a powerful enough tool to
resolve backreferences (matching with backreferences is NP-hard).

Both NFAs and DFAs in fact have equivalent computational power, and either can
be used to perform regular expression matching in linear time. There are of
course lots of performance differences in practice.

Friedl's book is great for what it is: a guide to optimizing regexes that
search via backtracking.
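
A tiny illustration of the backtracking blow-up (my example, not from the comment; the input is kept short because the number of attempts grows exponentially with it):

```python
import re

# "(a|aa)+$" can split a run of a's in exponentially many ways; on a
# failing input, a backtracking engine tries every split before giving up.
# An automaton-based engine (RE2, Go's regexp, grep) answers in linear time.
assert re.match(r"(a|aa)+$", "aaaa")                  # matches quickly
assert re.match(r"(a|aa)+$", "a" * 20 + "b") is None  # fails, after ~10,000 tries
```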

------
vatotemking
I use [http://regexr.com/](http://regexr.com/) for this purpose

------
tacos
I like when he got to the hard part and then just stopped writing instead of
doing some actual specification or design.

~~~
savolai
Hi there tacos! The amount of attention this has received seems to indeed
warrant more detail.

I would be happy to hear about what would fulfill your criteria for
"specification or design", so we can hit the spot as we go on. Thanks! -author

