
Show HN: Using functions to construct Regex in Python - iogf
https://github.com/iogf/crocs
======
avar
In case you aren't aware of this, what you've made is very similar to the
rx.el[1] that's shipped with Emacs since 21.1 (released in 2001).

Many of the comments here are speculating on what this might be used for,
but one could simply look at how rx.el is used in various Emacs modes,
compared to writing raw regexes in string form.

It's not fully comparable, since a large reason to use rx.el in Emacs is
Emacs's nasty regex syntax coupled with having no native quoting construct for
regex, making writing anything an exercise in bashing your backslash key.

But it also makes it easier to programmatically generate regexes, which as
your Python implementation also shows can lead to much clearer code.

1. [http://doc.endlessparentheses.com/Fun%2Frx.html](http://doc.endlessparentheses.com/Fun%2Frx.html)

~~~
bhrgunatha
> compared to writing raw regexes in string form ...

> Emacs's nasty regex syntax

It's also case-sensitive. Writing a regex by hand when there is a mixture of
case sensitivity to recognise is not at all pleasant.

------
reikonomusha
(Separate comment addressing the library itself.)

I find the naming of the classes off.

Why not call X Any?

Why use nouns for some things, and verbs for others? I found Include and
Exclude to be confusing. Inclusion and exclusion are relative to something.
You include something with something else. I think Only and AnyBut would be
better names.

Why call it Seq when it's closer to a Python Range? Seq usually means
sequentiality.

Why not call Times Repeat or Repetition?

It didn't look like this had any compatibility with existing regexes. I can't
parse an existing regex into this library.

It looks like debugging could be a nightmare since it just passes things off
to Python's re library. So if an error happens there, it will be tough to
trace it to your original construction.

All of this class hierarchy building seems ripe for replacement with
algebraic data types, which would probably cut the code down to a fraction
of its size.

~~~
Asooka
I also think the naming is really bad and can be improved. The central problem
is that you're using action words for a descriptive API. You should use
something like "Repeating" instead of "Times". Also, the parameters to Times
are distinctly un-pythonic. I would probably make it provide this API:

    
    
       def Repeating(pattern, *args, min=None, max=None, exact=None):...
    
       Repeating(P) => P*
    
       Repeating(P, a, b) and
       Repeating(P, min=a, max=b) => P{a, b}
    
       Repeating(P, n) and
       Repeating(P, exact=n) => P{n, n}
    
       Repeating(P, min=n) => P{n,}
    
       Repeating(P, max=n) => P{0, n}
    

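For what it's worth, that signature is straightforward to implement. A minimal sketch (a hypothetical helper, not the crocs API; it emits {n} for the exact case, which is equivalent to {n,n}):

```python
import re

def Repeating(pattern, *args, min=None, max=None, exact=None):
    """Hypothetical repetition helper following the signature above."""
    if len(args) == 2:          # Repeating(P, a, b) => P{a,b}
        min, max = args
    elif len(args) == 1:        # Repeating(P, n) => P{n}
        exact = args[0]
    if exact is not None:
        return f'(?:{pattern}){{{exact}}}'
    if min is None and max is None:
        return f'(?:{pattern})*'
    lo = min if min is not None else 0
    hi = max if max is not None else ''
    return f'(?:{pattern}){{{lo},{hi}}}'

assert Repeating('ab') == '(?:ab)*'
assert Repeating('ab', 2, 5) == '(?:ab){2,5}'
assert Repeating('ab', 3) == '(?:ab){3}'
assert Repeating('ab', min=2) == '(?:ab){2,}'
assert Repeating('ab', max=4) == '(?:ab){0,4}'
assert re.fullmatch(Repeating('ab', 2, 5), 'ababab')
```
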
"Exclude" and "Include" suffer from the aforementioned problem of being
procedural words for a declarative syntax. And they're easy to confuse with
one another, having only two differing letters. And they don't make it clear
that they match exactly one character. I would use "AnyBut" and "OneOf".

Why the split between Group and NamedGroup? Just add a name parameter to
Group!

Seq as mentioned by parent is also problematic. Use Range or maybe Between or
FromTo. To me FromTo flows the best, but Range is also ok.

I'm tempted to advocate adding operator overloading, so | would make a pattern
that matches any one of two alternatives, * would be used for specifying
repeats, etc. So you could write

    
    
       Digit = FromTo('0', '9')
       CapitalLetter = FromTo('A', 'Z')
       POCode = CapitalLetter * 2 + Digit * 3
       SmallLetter = FromTo('a', 'z')
       Letter = CapitalLetter | SmallLetter
       Name = CapitalLetter + Letter * (0, Inf)
       FirstName = Group(Name, name='first')
       LastName = Group(Name, name='last')
       FullName = FirstName + ' ' + LastName
    

However, implemented badly that can hurt readability much more than having to
write functions everywhere.
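
A minimal sketch of how such overloading could work (the P class, the FromTo helper, and the tuple-based repetition are all hypothetical, not part of crocs):

```python
import re

class P:
    """Hypothetical pattern wrapper composing regexes via operators."""
    def __init__(self, regex):
        self.regex = regex

    def __add__(self, other):              # concatenation
        other = other if isinstance(other, P) else P(re.escape(other))
        return P(self.regex + other.regex)

    def __or__(self, other):               # alternation
        return P(f'(?:{self.regex}|{other.regex})')

    def __mul__(self, n):                  # repetition: int or (min, max)
        if isinstance(n, tuple):
            lo, hi = n
            hi = '' if hi == float('inf') else hi
            return P(f'(?:{self.regex}){{{lo},{hi}}}')
        return P(f'(?:{self.regex}){{{n}}}')

def FromTo(a, b):
    return P(f'[{a}-{b}]')

Digit = FromTo('0', '9')
CapitalLetter = FromTo('A', 'Z')
SmallLetter = FromTo('a', 'z')
Letter = CapitalLetter | SmallLetter       # '|' rather than 'or'
POCode = CapitalLetter * 2 + Digit * 3
Name = CapitalLetter + Letter * (0, float('inf'))

assert re.fullmatch(POCode.regex, 'AB123')
assert re.fullmatch(Name.regex, 'Alice')
```

Note that Python's `or` cannot be overloaded (it short-circuits on truthiness), so `CapitalLetter or SmallLetter` would silently evaluate to `CapitalLetter`; `|` is the operator that can actually carry alternation.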

~~~
masklinn
> I'm tempted to advocate adding operator overloading, so | would make a
> pattern that matches any one of two alternatives, * would be used for
> specifying repeats, etc. So you could write

For repeat I think using indexing would make more sense though it might look
somewhat unusual:

    
    
        Pattern[:]
    

would be *

    
    
        Pattern[1:]
    

would be + aka {1,}

    
    
        Pattern[5]
    

would be {5}

    
    
        Pattern[m:n]
    

would be {m,n}

It also uses a single operator for all repetition operations which is nice.
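
A small sketch of the indexing idea (the Pat class is hypothetical):

```python
import re

class Pat:
    """Hypothetical pattern type where indexing expresses repetition."""
    def __init__(self, regex):
        self.regex = regex

    def __getitem__(self, item):
        if isinstance(item, slice):
            lo = item.start if item.start is not None else 0
            hi = item.stop if item.stop is not None else ''
            return Pat(f'(?:{self.regex}){{{lo},{hi}}}')
        return Pat(f'(?:{self.regex}){{{item}}}')   # Pat[5] -> {5}

a = Pat('a')
assert a[:].regex == '(?:a){0,}'       # equivalent to a*
assert a[1:].regex == '(?:a){1,}'      # equivalent to a+
assert a[5].regex == '(?:a){5}'
assert re.fullmatch(a[2:4].regex, 'aaa')
```
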

------
reikonomusha
In Common Lisp, the library CL-PPCRE [0] is used for regexes, and has an
"AST"-syntax like this. It supports full Perl-compatible regular expressions
in both the tree syntax as well as the standard string syntax.

The tree syntax actually uses symbols and lists, idiomatic in symbolic
programming settings. This has the hugely convenient property of being
trivially serializable, just as the string representation is. This isn't true
with an unadorned object representation.

The documentation for the tree syntax is here [1]. Some examples from the page
are reproduced below. The PARSE-STRING function isn't used by the user except
for testing. All of the regex scanning and matching functions allow either a
string or a tree.

    
    
        * (parse-string "(ab)*")
        (:GREEDY-REPETITION 0 NIL (:REGISTER "ab"))
        
        * (parse-string "(a(b))")
        (:REGISTER (:SEQUENCE #\a (:REGISTER #\b)))
        
        * (parse-string "(?:abc){3,5}")
        (:GREEDY-REPETITION 3 5 (:GROUP "abc"))
        ;; (:GREEDY-REPETITION 3 5 "abc") would also be OK
        
        * (parse-string "a(?i)b(?-i)c")
        (:SEQUENCE #\a
         (:SEQUENCE (:FLAGS :CASE-INSENSITIVE-P)
          (:SEQUENCE #\b (:SEQUENCE (:FLAGS :CASE-SENSITIVE-P) #\c))))
        ;; same as (:SEQUENCE #\a :CASE-INSENSITIVE-P #\b :CASE-SENSITIVE-P #\c)
        
        * (parse-string "(?=a)b")
        (:SEQUENCE (:POSITIVE-LOOKAHEAD #\a) #\b)
    

[0] [http://weitz.de/cl-ppcre/](http://weitz.de/cl-ppcre/)

[1] [http://weitz.de/cl-ppcre/#create-scanner2](http://weitz.de/cl-ppcre/#create-scanner2)

~~~
kazinator
It's worth mentioning that CL-PPCRE is not a set of bindings to a C
implementation. It is written _in_ Common Lisp, and it's fast.

------
gattilorenz
While the idea is very cool, it seems to have most of the drawbacks of
traditional regexes (i.e. the little syntax quirks that one eventually has
to learn), with none of the benefits (regexes are everywhere, including text
editors, not just a Python thing). It does make them more readable, I guess,
but I'd like to hear other HNers' opinions.

I think for beginners a better approach would be using something like
RegexBuddy ([https://www.regexbuddy.com](https://www.regexbuddy.com) not
affiliated, just found it super useful when I started writing regexes).

~~~
stonewhite
I don't really think this is more readable than the regex itself once it
starts to get complex, like they all do.

Maybe if this were implemented in a prefix-notation language it would
look/read better.

~~~
fiddlerwoaroof
I really like the way Common Lisp's ppcre library works: the functions all
accept either standard strings or an s-expression version of the regular
expression and then, using compiler macros, all invocations with statically-
determinable regular expressions get compiled to some internal representation
at compile-time rather than generating that representation at run-time.

[http://weitz.de/cl-ppcre/#create-scanner2](http://weitz.de/cl-ppcre/#create-scanner2)

------
orf
Interesting, how does this compare to PyParsing[1]? It seems that pyparsing
does a lot of this already. Not that you shouldn't write your own, mind you!

Some improvements could be to use operators to reduce some boilerplate, like
using `__mul__` instead of Times(), i.e. `(X() * 3) * 5`

1. [http://pyparsing.wikispaces.com/](http://pyparsing.wikispaces.com/)

~~~
iogf
Interesting suggestion.

------
nemetroid
I find this:

    
    
        mail = '(?P<name>[a-z][a-z0-1\_\.\-]{1,})'
        hostname = '(?P<hostname>python[a-z]{1,})'
        domain = '(?P<domain>br[a-z])'
        match_mail = f'{mail}\@{hostname}\.{domain}'
    

considerably more readable and easier to verify than the 50-line example in
the link.
The real advantage of parser combinators over regexes is the ability to parse
data structures instead of just capturing regex groups.
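
For what it's worth, the four lines above run as-is (converted here to raw strings to avoid invalid-escape warnings; the sample address is invented):

```python
import re

mail = r'(?P<name>[a-z][a-z0-1\_\.\-]{1,})'
hostname = r'(?P<hostname>python[a-z]{1,})'
domain = r'(?P<domain>br[a-z])'
match_mail = rf'{mail}\@{hostname}\.{domain}'

# Hypothetical address shaped to fit the pattern above.
m = re.fullmatch(match_mail, 'abc@pythonmail.bra')
assert m and m.group('name') == 'abc'
assert m.group('hostname') == 'pythonmail'
assert m.group('domain') == 'bra'
```
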

~~~
klenwell
Or even just code comments, as with this example[0]:

    
    
      import java.util.regex.Pattern;
    
      public class Main {
          public static void main(String[] args) {
              String regexStr = "";
    
          regexStr += "\\b";     //Begin match at the word boundary (whitespace boundary)
              regexStr += "\\d{3}";  //Match three digits
              regexStr += "[-.]?";   //Optional - Match dash or dot
              regexStr += "\\d{3}";  //Match three digits
              regexStr += "[-.]?";   //Optional - Match dash or dot
              regexStr += "\\d{4}";  //Match four digits
          regexStr += "\\b";     //End match at the word boundary (whitespace boundary)
    
              if (args[0].matches(regexStr)) {
                  System.out.println("Match!");
              } else {
                  System.out.println("No match.");
              }
          }
      }
    

[0] [https://codereview.stackexchange.com/questions/47432/comment...](https://codereview.stackexchange.com/questions/47432/commenting-string-matching-regex)

~~~
blibble
those comments are no better than:

    
    
       i++; /* increase i by one */
    

sure, break up a larger regex to reflect its structure and comment those
sections, but something like \d{3} or [-.]? should really not need
commenting

~~~
bpicolo
Regex has enough sigil spam, characters that mean something other than "this
is a letter", that mentally parsing it beyond the trivial case is difficult.
They're ubiquitous, but they sure aren't a perfect UI.

------
ddebernardy
I'm confused. Doesn't Python offer verbose regular expressions complete with
comments?

[https://docs.python.org/3/library/re.html#re.VERBOSE](https://docs.python.org/3/library/re.html#re.VERBOSE)

    
    
        a = re.compile(r"""\d +  # the integral part
                           \.    # the decimal point
                           \d *  # some fractional digits""", re.X)
    

Or is there more at stake than making readable, self-documenting regular
expressions?
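
For reference, the verbose form above compiles and matches as expected; a quick check:

```python
import re

# Same pattern as above, using re.X (re.VERBOSE) so whitespace
# and comments inside the pattern are ignored.
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

assert a.fullmatch('3.14')      # integral part, dot, fraction
assert a.fullmatch('3.')        # the fractional digits are optional
assert not a.fullmatch('.14')   # the integral part is not
```
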

~~~
always_good
It has the same advantages as combinators: the total computation can be broken
down into pieces, tested individually, and then composed and reused,
especially programmatically.

Why use functions when you can have one big main(args: Array<String>) with
inline comments?

btw, a hello-world snippet like /\d+\.\d*/ isn't very scathing criticism of a
library.
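
As a concrete sketch of that composability point (the date fragments below are invented for illustration, not from crocs):

```python
import re

# Fragments, each testable in isolation.
YEAR  = r'(?P<year>\d{4})'
MONTH = r'(?P<month>0[1-9]|1[0-2])'
DAY   = r'(?P<day>[0-2]\d|3[01])'

assert re.fullmatch(MONTH, '12')
assert not re.fullmatch(MONTH, '13')

# Composed and reused programmatically.
DATE = re.compile(rf'{YEAR}-(?:{MONTH})-(?:{DAY})')
m = DATE.fullmatch('2017-09-30')
assert m and m.group('month') == '09'
```
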

------
matthberg
Really cool concept, yet not practical in my opinion. I consider the classes
and functions that replace the regex harder to learn than the regex itself.
Regex offers a standardised, succinct way to query strings, while this
system adds unneeded complexity for the sake of using English words. For
quick uses integrated in other programs I can see the usefulness, yet as a
standalone I'll stick with regex.

~~~
iogf
The idea is to have a set of entities that one can reason about more easily
when implementing and debugging complex filters (regexes). It is a sort of
way of reasoning about filtering results that merely uses regex as an
underlying tool.

One could write the filter in crocs format, compile it to a regex, then use
it in their programs. Using crocs you sort of define your data type, then
apply functions on those data types to fetch your required output.

The crocs framework is a way to give more power to your imagination and
readability to others about what your imagination produces.

------
rajaravivarma_r
I always wanted to do something like this, because I always forget how to do
'look ahead' assertions.

I even named the library 'hu_regex', where 'hu' stands for human, but never
did anything further, as I couldn't think of any good builder pattern for
regex that seemed intuitive.

Anyway, thanks for this. This comment section has introduced some good
libraries as well.

------
rnhmjoj
If you are interested in different ways to write/implement regular
expressions, this talk is worth the watch: [https://begriffs.com/posts/2016-06-27-fast-haskell-regexes.h...](https://begriffs.com/posts/2016-06-27-fast-haskell-regexes.html)

------
agumonkey
I always liked this idea (reminiscent of emacs rx dsl). But I'm not sure I'd
be using it that often...

------
rohitpaulk
A more mature version:
[https://github.com/VerbalExpressions/PythonVerbalExpressions](https://github.com/VerbalExpressions/PythonVerbalExpressions)

~~~
iogf
This one doesn't look to support lookahead/lookbehind, nor does it output
valid inputs for the matches. However, the way you chain the VerEx to build
the patterns is interesting.

------
iogf
I have been adding more docs to:

[https://github.com/iogf/crocs/wiki](https://github.com/iogf/crocs/wiki)

It would be interesting to hear comments :)

------
anentropic
How often do you programmatically generate regexes though?

I tend to find that regex patterns are effectively constants in the program

And if it's just for readability it seems a bit of a sledgehammer

------
carlochess
Hi, what's the difference between this project and a parser combinator?

~~~
yablak
A parser combinator may be more powerful? Not sure.

The python parsec package is really terribly documented, but there's a good
example here:

[http://www.valuedlessons.com/2008/02/pysec-monadic-combinato...](http://www.valuedlessons.com/2008/02/pysec-monadic-combinatoric-parsing-in.html)

EDIT:

The name of the package may have changed; it looks like this is the pypi
package being documented (if not, I'm confused, because pypi's pysec package
is something totally different):

[https://pypi.python.org/pypi/parsec](https://pypi.python.org/pypi/parsec)

------
bane
Regexes aren't really all that hard if you avoid all the crazy
back-reference stuff. I've taught them to non-programmers in about half an
hour and they were able to use them reasonably well right after, with a
couple of reminders.

There are three things you need:

1 - concatenation - basically one thing next to another, 'a' goes next to 'b'
to make 'ab' which matches any string with that in it. Examples:

'xzyqr2321abtwe' - matches

'zyxabc' - matches

'abcxyz' - matches

'ab' - matches

'ba' - doesn't match

'azb' - doesn't match

2 - alternation - one thing or another. The operator for this is the pipe
symbol '|'. So 'a|b' is 'a or b'.

'xzyqr2321abtwe' - matches

'xzyqr2321atwe' - matches

'xzyqr2321btwe' - matches

'zyxabc' - matches

'abcxyz' - matches

'ab' - matches

'azb' - matches

Here's another: 'abc|cab' means 'abc or cab'

'xzyqr2321abtwa' - doesn't match

'zyxabc' - matches

'zyxcba' - doesn't match

'zyxcab' - matches

If it helps, you can use parentheses to group blocks of things for clarity.

'(abc)|(cab)'
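
These alternation examples can be checked directly with Python's re module:

```python
import re

assert re.search(r'a|b', 'xzyqr2321abtwe')     # contains 'a' (and 'b')
assert re.search(r'abc|cab', 'zyxcab')         # contains 'cab'
assert not re.search(r'abc|cab', 'zyxcba')     # neither 'abc' nor 'cab'
assert re.search(r'(abc)|(cab)', 'zyxabc')     # grouped for clarity
```
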

3 - repetition - the operator we want to care about is '*', also called the
_Kleene star_. This means that anything that comes _before_ the star can
repeat 0 or more times.

so 'a*' means a pattern of 'a' 0 or more times.

'xzyqr2321abtwe' - matches

'xzyqr2321btwe' - matches (because of _zero_ or more times)

'xzyqr2321btwaaaaaaaaa' - matches

so if you want to match an 'a' one or more times you can simply use rule #1
(concatenation) and put 'aa*'

'xzyqr2321abtwe' - matches

'xzyqr2321btwe' - doesn't match

'xzyqr2321btwaaaaaaaaa' - matches

It turns out this is such a common regex that some "syntactic sugar" was
invented to make it easier to work with: the '+' symbol, which is used
exactly the same way: 'aa*' = 'a+'

Congratulations, you now know everything you need to make a regex that can
pretty much do anything. So what about all that other line noise one usually
sees in a regex? That's more "syntactic sugar", designed to make certain kinds
of regexes simpler to write.

For example, a regex using the above rules that can match any alphabet letter
from a through f is: a|b|c|d|e|f

This is a lot of typing, so you can use square brackets to build what's called
a "character class". '[abcdef]' = 'a|b|c|d|e|f'

If you have a long string of characters, you can use some more sugar and
simply put a '-' in between the first and last characters '[a-f]' = '[abcdef]'

Here's a more complex example: a regex that can match any lower-case
alphabet character or number: '[0-9a-z]'

It also turns out that a character class matching any character except
newlines is so common that a special operator '.' was created. '.' matches
literally any character (other than a newline).

For example, concatenating it with an "any number" regex gives you '.[0-9]'
which means any character followed by a number.

For repetition, there's also some syntactic sugar, here's a table, these all
go after the thing you want repeated:

* - 0 or more times

+ - 1 or more times

{p} - p times

{p,q} - p to q times

{,q} - 0 to q times

{p,} - at least p times

? - 0 or 1 times
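
Each entry in that table maps directly onto Python's re syntax:

```python
import re

assert re.fullmatch(r'a{3}', 'aaa')         # p times
assert re.fullmatch(r'a{2,4}', 'aaaa')      # p to q times
assert not re.fullmatch(r'a{2,4}', 'a')
assert re.fullmatch(r'a{,2}', 'a')          # 0 to q times
assert re.fullmatch(r'a{2,}', 'aaaaa')      # at least p times
assert re.fullmatch(r'a?', '')              # 0 or 1 times
```
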

Finally, in the list of basic regex stuff, there's "capturing"; this is how
regexes get used to parse strings. It works basically like this: anything
matched inside a pair of parentheses gets "copied" to a special variable.
Where this variable lives depends on your language. It is typically some
1-indexed variable like $1, $2, $3, and the number of the variable
corresponds to the position of its left parenthesis in the regex, counting
from the left.

Example: 'abc(.+)123'

abcd123 - matches, $1 is 'd'

abcf123 - matches, $1 is 'f'

abcqqq123 - matches, $1 is 'qqq'

abc123 - doesn't match

Another example: (abc([0-9]+))

abc123 - matches $1 is abc123 $2 is 123

If you want to suppress this capturing behavior, the opening '(' should be
written '(?:'

(?:abc([0-9]+))
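
In Python, the capture variables are exposed via match.group(); $1 and $2 correspond to group(1) and group(2):

```python
import re

m = re.search(r'abc(.+)123', 'abcqqq123')
assert m.group(1) == 'qqq'                   # $1 in other languages

m = re.search(r'(abc([0-9]+))', 'abc123')
assert m.group(1) == 'abc123'
assert m.group(2) == '123'

m = re.search(r'(?:abc([0-9]+))', 'abc123')  # (?: suppresses capturing
assert m.group(1) == '123'                   # only one group remains
```
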

Try these out at [https://regex101.com/](https://regex101.com/); it does a
nice job of highlighting. It calls the capture variables "groups" and shows
them on the right under "capture information".

There, that's about it. Other advanced techniques like greedy matching,
negative character classes, callbacks, etc. flow pretty nicely from these
basic ideas.

~~~
always_good
imo one of the hardest parts of regex is reasoning about catastrophic
backtracking.

~~~
bane
They can get weird and difficult, but most of the time when I see people
struggling with regexes, it's with really simple stuff. Once you start
getting into backreferences and all sorts of other non-formal
regular-expression features, ending up firmly in regex-land, it's time to
start looking for proper parsing approaches.

