
RegExpBuilder – Create regular expressions using chained methods - jrullmann
https://github.com/thebinarysearchtree/regexpbuilderjs
======
draegtun
Thought this might be of interest; below shows how the examples provided would
look in Rebol:

    
    
        digits: digit: charset "0123456789"
    
        rule: [
            thru "$"
            some digits
            "."
            digit
            digit
        ]
    
        parse "$10.00" rule    ;; true
    
    
        pattern: [
            some "p"
            2 "q" any "q"
        ]
    
        new-rule: [
            2 pattern
        ]
    
        parse "pqqpqq" new-rule    ;; true
    

Rebol doesn't have regular expressions; instead it comes with a parse dialect,
which is a TDPL:
[http://en.wikipedia.org/wiki/Top-down_parsing_language](http://en.wikipedia.org/wiki/Top-down_parsing_language)

Some _parse_ refs:
[http://en.wikibooks.org/wiki/REBOL_Programming/Language_Feat...](http://en.wikibooks.org/wiki/REBOL_Programming/Language_Features/Parse/Parse_expressions)
|
[http://www.rebol.net/wiki/Parse_Project](http://www.rebol.net/wiki/Parse_Project)
| [http://www.rebol.com/r3/docs/concepts/parsing-summary.html](http://www.rebol.com/r3/docs/concepts/parsing-summary.html)

~~~
carlob
Mathematica also has its own string pattern syntax

[http://reference.wolfram.com/language/ref/StringExpression.h...](http://reference.wolfram.com/language/ref/StringExpression.html)

Something like that would be

    
    
        StringExpression[
            "$",
            Repeated[DigitCharacter],
            ".",
            DigitCharacter,
            DigitCharacter
        ]
    

or

    
    
        StringExpression[
            "$",
            Repeated[DigitCharacter],
            ".",
            Repeated[DigitCharacter, {2}]
        ]
    

or

    
    
        StringExpression[
            "$",
            NumberString
        ]
    

and the other is

    
    
        StringExpression[
            Repeated[
               StringExpression[
                   Repeated["p", {1, Infinity}],
                   Repeated["q", {2, Infinity}]
               ],
               {2}
            ]
        ]
    

This can be made more concise since StringExpression has an infix form (~~)
and Repeated can sometimes be replaced by postfix ..

~~~
akater
> Repeated can sometimes be replaced by postfix ..

Always, not sometimes. ;-)

------
tragomaskhalos
There have been many efforts similar to this in many languages, but most of us
seem happy to stick to the more succinct canonical form, supplemented via /x #
comments when things get too hairy
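
As a rough illustration of that commented canonical form, here is the thread's "$10.00" pattern written with Python's `re.VERBOSE` flag, which plays the role of `/x`. The names `DOLLARS` and `is_dollar_amount` are made up for this sketch.

```python
import re

# Same pattern as the "$10.00" example, spread out with re.VERBOSE so that
# whitespace in the pattern is ignored and "#" starts a comment.
DOLLARS = re.compile(r"""
    \$        # literal dollar sign
    \d+       # one or more digits
    \.        # literal decimal point
    \d\d      # exactly two digits
""", re.VERBOSE)


def is_dollar_amount(text: str) -> bool:
    """Return True if the whole string looks like $10.00."""
    return DOLLARS.fullmatch(text) is not None


print(is_dollar_amount("$10.00"))
print(is_dollar_amount("$10.0"))
```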

------
marktangotango
Generally, I find that if one's regexes are so complex that one needs
visualizers or other aids in writing them, one doesn't have a regex problem,
but a parsing problem. The method of parsing by recursive descent can often
lead to much more understandable (if more verbose) "pattern matching".
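
A minimal sketch of what that can look like, in Python. The grammar (arbitrarily nested parentheses around any other characters) is an example chosen here, not something from the linked library; it is the kind of input a classic regular expression can only handle to a fixed depth.

```python
def balanced(text: str) -> bool:
    """Recursive-descent check that the parentheses in text are balanced."""
    pos = 0

    def parse_sequence() -> None:
        # sequence := ( group | any-char-except-paren )*
        nonlocal pos
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                parse_group()
            else:
                pos += 1

    def parse_group() -> None:
        # group := "(" sequence ")"
        nonlocal pos
        pos += 1  # consume "("
        parse_sequence()
        if pos >= len(text) or text[pos] != ")":
            raise ValueError("missing ')'")
        pos += 1  # consume ")"

    try:
        parse_sequence()
    except ValueError:
        return False
    # A stray ")" stops parse_sequence early and leaves input unconsumed.
    return pos == len(text)


print(balanced("a(b(c)d)e"))
print(balanced("(()"))
```

Each grammar rule becomes one function, which is where the extra verbosity and the readability both come from.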

~~~
DenisM
Recursive descent is imperative, while regex is declarative.

Regex may be ugly, but you lose something important when you move from
declarative to imperative.

~~~
jerf
"Recursive descent" has that name precisely because it is not the only parsing
alternative, hence we can not simply call it "parsing".

------
UnoriginalGuy
Looks like Linq (from .Net/C#). Pretty sexy way to write Regular Expressions
if you ask me.

I've "learned" regular expressions multiple times but it just never sticks, I
have no idea why. It certainly doesn't help that there are several different
incompatible syntaxes (so what I remember and think "should" work doesn't).

I'd prefer to write RegX's in this style, however I would pay attention to
performance (not that Regular Expressions are high performance, however I
wouldn't want to see a large performance loss either).

~~~
UK-AL
Regular expressions are high performance if you use automata-style (regular
language) regular expressions, which limits which features you can use.

Modern regular expression engines in a lot of languages actually go beyond
the expressiveness of a regular language. This is what damages performance.

There is no reason why this would reduce performance... if it's not doing
anything crazy.

If anything you're taking work away from it. You're building the tree directly
here, whereas a parser would normally build a tree from the string. But since
this is integrating into the language's RE library, I'm guessing it's writing
that tree as a string, which is then passed into the regular expression
engine, to be turned into a tree again :)
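
To make that round trip concrete, here is a small Python sketch of a chained builder that only assembles a pattern string and hands it to the standard `re` module. `PatternBuilder` and its methods are hypothetical names invented for this example; they are not the API of the linked library.

```python
import re
from typing import Optional


class PatternBuilder:
    """Hypothetical chained builder; NOT the API of the linked library."""

    def __init__(self) -> None:
        self._parts = []

    def literal(self, text: str) -> "PatternBuilder":
        # Escape the text so every character matches itself.
        self._parts.append(re.escape(text))
        return self

    def digits(self, at_least: int = 1,
               at_most: Optional[int] = None) -> "PatternBuilder":
        upper = "" if at_most is None else str(at_most)
        self._parts.append(r"\d{%d,%s}" % (at_least, upper))
        return self

    def source(self) -> str:
        # The builder's structure is flattened back into ordinary regex text.
        return "".join(self._parts)

    def compile(self) -> re.Pattern:
        # The engine then parses that text into its own internal form again.
        return re.compile(self.source())


money = PatternBuilder().literal("$").digits().literal(".").digits(2, 2)
print(money.source())
print(money.compile().fullmatch("$10.00") is not None)
```

Since the builder runs once and the engine only ever sees the final string, matching speed should be the same as for a hand-written pattern with the same text.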

~~~
UnoriginalGuy
I guess it depends on your definition of "high performance."

If a regular expression runs too often, even pre-compiled (as they should be),
you'll want to replace them with code written in the native language. I've
gone in and replaced a one line search/replace written in RegX (compiled),
with just a C-style for() loop over the wchar array, and had the memory usage
drop by near 80% and performance increase by over 60%.

So high performance is all relative. However RegX isn't something I'd describe
that way, even compiled. It is a nice way to write complex string parsing code
quickly however.

~~~
d4n3
> you'll want to replace them with code written in the native language

Probably not true for Javascript (and other scripted languages) - matching
regex uses native and highly optimized regex lib, which will usually be orders
of magnitude faster than implementing this in the language.

~~~
UnoriginalGuy
That isn't relevant in this context as the library linked couldn't be
integrated into JavaScript.

~~~
d4n3
Sorry, which library do you mean? The OP is a javascript library..

I just wanted to point out that regex is much faster in javascript than doing
things 'by hand'.

------
chris-at
Thanks, this is a lot better than writing this (even if the formatting worked
here):

```
(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)
```

~~~
eridal
actually most of the comments seem to imply that whoever wrote that doesn't
fully understand regexp syntax -- or, worse, she expects that whoever reads
it will not

    
    
        /{1,3}                        # 1-3 slashes
        |                             #   or
        [a-z0-9%]                     # Single letter or digit or "%";

~~~
GhotiFish
err... sorry?

[https://www.debuggex.com/r/EpocMU_7Fq_B_p9z](https://www.debuggex.com/r/EpocMU_7Fq_B_p9z)

edit:

wait, I thought about it for a second and I see what you meant. You're not
saying it's wrong, you're saying it's obvious.

I wasn't sure if it was obvious because I wasn't sure if {1,3} was supposed to
be {1-3} and there was a mistake in the expression, or if there was some kind
of unexpected error in the [a-z0-9%] expression.

Because even in this simple example, there is room for error.
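
A quick way to settle that kind of doubt is to try both spellings. The sketch below uses Python's `re` module, so the second half shows Python's behaviour only; other engines may treat `{1-3}` differently.

```python
import re

# {1,3} is a bounded quantifier: between one and three of the preceding atom.
slashes = re.compile(r"/{1,3}")
results = [slashes.fullmatch("/" * n) is not None for n in range(5)]
print(results)  # one entry each for 0, 1, 2, 3 and 4 slashes

# {1-3} is not quantifier syntax in Python's re module, so the braces and
# their contents are matched as literal characters.
literal_braces = re.fullmatch(r"/{1-3}", "/{1-3}") is not None
print(literal_braces)
```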

------
jluxenberg
S-expressions are a natural fit for construction of regular expressions, see
[http://community.schemewiki.org/?scheme-faq-programming#H-1w...](http://community.schemewiki.org/?scheme-faq-programming#H-1w56qpn)

e.g.

    
    
      (: (or (in ("az")) (in ("AZ"))) 
        (* (uncase (in ("az09")))))

~~~
maratd
Regular expressions are a natural fit for construction of regular expressions.

Look, I know it takes a while, but once you get the hang of it, you won't need
any crutches to write regular expressions. The only tool that's really needed
is a way to rigorously test a regular expression to make sure it does what it
needs to do and there are a ton of those around.

~~~
andrewflnr
No, they're really not, as evidenced by all the quoting and meta-character
nonsense you have to deal with. Sure, it's not _too_ difficult to figure out,
most of the time, but I think a solution that puts characters and logic on
different quoting levels will almost always be better from an expressiveness
standpoint (ignoring ecosystem issues).

~~~
thwarted
That burden usually comes from many languages using a string literal to
express the regular expression. Perl, for example, has a regular
expression literal syntax that is part of the language proper (which has the
added benefit that non-dynamic regular expressions can be checked for syntax
at compile time). Python, in contrast, doesn't have a first-class regular
expression literal, but makes it easier to deal with by prefixing the literal
with r or R to create a "raw string" (which exists to avoid excessive
backslash escaping). Some regular expression engines use % as the
meta-character indicator, which is more compatible with C-style "escape
sequences" in double-quoted strings.
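
For the Python case, a short sketch of the difference the `r` prefix makes (the variable names are illustrative only):

```python
import re

# Ordinary string literal: each backslash meant for the regex engine has to
# be doubled so that the string parser leaves one behind.
plain = "\\d+\\.\\d\\d"

# Raw string literal: backslashes are passed through as typed.
raw = r"\d+\.\d\d"

# Both literals produce exactly the same pattern text.
assert plain == raw
print(re.fullmatch(raw, "10.00") is not None)
```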

If you think characters and logic need to be on different quoting levels,
you're not taking the right perspective on regular expressions. \d or \w are
not an escaped d or w, they are their own atoms (or "the keywords of the
language", if you will), distinct from the atoms that match the ASCII
characters 0x64 and 0x77. The thing to remember with regular expressions is
always the first lesson presented: (non-meta) characters match themselves, the
regular expression /a/ matches the letter a. What's implied here, but rarely
said, is that that's not really the letter a in there, but rather an
expression that matches the letter a—it just so happens to also look like the
thing it matches. This distinction is subtle, but important. This can also be
made more evident by using the /x modifier if it's available to spread out the
individual expressions (put space between the keywords).

The primary difference in regular expression languages is often how "logic",
as you call it, is expressed. PCRE considers, for example, [ to be the
character for opening a character class and \\[ to match the byte 0x5b.
Admittedly, this is confusing when switching engines because 1) not every
character matches itself (the expression that matches a character and the
character it matches are not visually the same) and 2) other RE engines have
taken the opposite approach depending on if that engine was meant, by the
author, to have more literal atoms or more logic in its most common use (that
is, you save typing if you mean to match the byte 0x5b more frequently than if
you mean to open a character class).

As for "quoting", you almost NEVER should be using things like PCRE's \Q…\E
(or the quotemeta function) unless you're building regular expressions
dynamically from user-input. quotemeta and friends are not readability tools,
but safety tools.
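
Python's counterpart to quotemeta is `re.escape`. The sketch below shows the safety use case, with a hypothetical helper `count_occurrences` that searches for user-supplied text literally:

```python
import re


def count_occurrences(user_text: str, document: str) -> int:
    """Count literal occurrences of user_text; no character acts as a metacharacter."""
    pattern = re.compile(re.escape(user_text))
    return len(pattern.findall(document))


doc = "a+b and a+b, but not aab"

# Escaped: "+" is just a plus sign, so only the two literal "a+b" runs count.
print(count_occurrences("a+b", doc))

# Unescaped: "a+b" means "one or more a, then b", which finds "aab" instead.
print(len(re.findall("a+b", doc)))
```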

~~~
andrewflnr
I'm using the term "quoting" in the general sense of a marker that some
sequence of symbols is being used as symbols, rather than for their semantic
values.

My perspective on regular expressions is that of a student who was, not two
weeks ago, introduced to the formal version of REs. In this formalism, there
are basically strings and operators on these strings. We don't usually use
quotes, but only because you can usually infer from context which bits are
strings and which are one of the small set of operators. But when we need to
match numbers with possible "+"es (the alternation operator) in front of them,
out come the quotes.

In a typical programming language, we don't have the luxury of expecting the
interpreter to infer things like that from context. Further, it's rather
common to try to match things that would otherwise be used as metacharacters.
This is exactly why quoting, in the general sense, was invented, so we can
tell what's the program and what's the input.

Granted, most of my RE experience is in Python, where everything is just
jammed in a string. There it's obvious that metacharacters and escapes are
just a worse-is-better substitute for quasiquoting. Maybe it's different in
Perl, but I'm skeptical. Strings matching themselves is cool. The problem is
that it's cool enough to prevent you from realizing when you've taken the
metaphor too far.

------
jgalt212
Definitely a debuggable way to write regexes. Whenever I have to maintain a
hairy regex, I like to plot the regex as a railroad diagram.

These web based tools can do it:

[https://www.debuggex.com/](https://www.debuggex.com/)

[http://jex.im/regulex/](http://jex.im/regulex/)

~~~
philjohn
Love it - just visualised the PCRE generated from the EBNF for the N-Triples
RDF serialisation format[1] :)

[https://www.debuggex.com/r/Yxqws81Uif-BGBN8](https://www.debuggex.com/r/Yxqws81Uif-BGBN8)

Important note - this is built up programmatically, it's not just a string
dumped in a parser!

[1] [http://www.w3.org/TR/n-triples/#n-triples-grammar](http://www.w3.org/TR/n-triples/#n-triples-grammar)

~~~
jgalt212
That is one hairy regex. Now the inverse would be even better. You modify the
railroad chart and the regex updates.

~~~
philjohn
Fairly hairy, yes, but if you follow the railroad tracks, it's quite succinct
for what it's doing!

------
dkarapetyan
Generalize just a little bit and you got parser combinators.
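
A rough Python sketch of that generalization, reusing the "pqqpqq" example from earlier in the thread. All names here (`lit`, `seq`, `many`, `times`, `matches`) are invented for the illustration, and the repetition is greedy with no backtracking, which is enough for this pattern but not for every regex.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def lit(s: str) -> Parser:
    """Match the exact string s."""
    def run(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return run


def seq(*parsers: Parser) -> Parser:
    """Match each parser in order."""
    def run(text: str, pos: int) -> Optional[int]:
        for parser in parsers:
            nxt = parser(text, pos)
            if nxt is None:
                return None
            pos = nxt
        return pos
    return run


def many(parser: Parser, at_least: int = 0) -> Parser:
    """Match parser greedily, requiring at least at_least repetitions."""
    def run(text: str, pos: int) -> Optional[int]:
        count = 0
        while True:
            nxt = parser(text, pos)
            if nxt is None or nxt == pos:  # stop on failure or no progress
                break
            pos, count = nxt, count + 1
        return pos if count >= at_least else None
    return run


def times(parser: Parser, n: int) -> Parser:
    """Match parser exactly n times in a row."""
    return seq(*[parser] * n)


def matches(parser: Parser, text: str) -> bool:
    """True if parser consumes the whole text."""
    return parser(text, 0) == len(text)


pattern = seq(many(lit("p"), at_least=1), many(lit("q"), at_least=2))
new_rule = times(pattern, 2)
print(matches(new_rule, "pqqpqq"))
```

Because parsers are ordinary functions, they can also call themselves recursively, which is the step that takes this beyond what a regular expression can describe.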

------
zzzcpan
Regexpes exist to avoid cumbersome code like this, to make it less error
prone. Makes me sad to see so many upvotes.

I get that some people have a hard time understanding regexpes with all the
backtracking and greediness. Yes, syntax is a bit complicated. Maybe a
simplified, predictable default mode could help. But there is no problem with
DSL being used as an abstraction. In fact, we need more DSLs, for everything!

------
psychometry
Now you have three problems.

------
kazinator
Yes, regexes can have other syntactic representations, like:

    
    
        (compound "$" (1+ :digit) "." :digit :digit)
    

Run:

    
    
        $ txr -p "(regex-compile '(compound \"$\" (1+ :digit) \".\" :digit :digit))"
        #/$\d+\.\d\d/

------
epicureanideal
Nice work! I don't know if it'll be ideal for all use cases, but it does add
some readability.

------
otakucode
Now do an example where you create a regex to parse the IMDB movies.list data
file!

------
gcao
Great work! This is very intriguing!

