
A human way to define regular expressions in Ruby - vbv
http://krainboltgreene.github.io/hexpress/
======
fendrak
Am I the only person who thinks that things like this are totally unnecessary?
Is learning/reading regular expressions really that difficult for most people?

Here's the subset of regular expressions that has gotten me through nearly all
of the regular expressions I've ever needed to write. As a plus, it has no
dependencies!

* - zero or more of the preceding character/group

+ - one or more of the preceding character/group

? - zero or one of the preceding character/group

$ - end of line

^ - beginning of line

. - any one character

\ - escape the following character (for a literal '$' or '.', for example)

[<some characters>] - one of the given characters

[a-zA-Z0-9] - letters and numbers inside a group can have ranges!

(<something>) - capturing group (anything that matches inside it will be
accessible in the match object)

<thing1>|<thing2> - either the first thing, or the second thing (or the
third, or the fourth...)

This isn't a complete, or even precise, definition, but knowing those things
will get you to the point where you can read and write expressions like this:

^(-|\+)?[0-9]*\.[0-9]+$

which matches things like -.2, 0.123, +0.1, etc. (floating point numbers,
basically). This likely has bugs, since I haven't tested it ;)
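
One way to settle whether the pattern behaves as described is to run it. A minimal Ruby sketch, assuming the sign alternative is written as `\+` (an unescaped `+` there has nothing to repeat, and Ruby refuses to compile it); `FLOAT` and `float?` are names made up for the example:

```ruby
# Hypothetical names; the pattern is the one from this comment,
# with the plus sign inside the group escaped.
FLOAT = /^(-|\+)?[0-9]*\.[0-9]+$/

# True when the text matches the pattern.
def float?(text)
  FLOAT.match?(text)
end

# A few samples: the first three should match, the rest should not.
%w[-.2 0.123 +0.1 12 1. abc].each do |sample|
  puts "#{sample.inspect}: #{float?(sample)}"
end
```

Note that the pattern requires at least one digit after the decimal point, so plain integers like 12 and trailing-dot forms like 1. are rejected.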

~~~
hcarvalhoalves
> which matches things like -.2, 0.123, +0.1, etc. (floating point numbers,
> basically). This likely has bugs, since I haven't tested it ;)

The fact you can't come up with a regex and be sure it's correct answers your
first two questions.

Regexes are incredibly useful, but the syntax is just a bunch of noise. This
library turns it into self-documenting code, which is incredibly valuable.
Novices might prefer showing their regex-fu, but after you've spent a good
chunk of your life reading and writing code, you value clear code more.

~~~
fendrak
Under the hood, what's being generated is still a regex though. How can you be
sure ANY code is correct, aside from testing it?

~~~
misterbwong
True, but your argument can apply to any sort of abstraction (e.g. using C
instead of assembly or even assembly over 1's and 0's).

He's arguing that the nicer syntax is a valuable abstraction over regex's
overly-complicated syntax, not that the syntax will be 100% the same as regex
100% of the time.

------
cthor
I really don't understand why some people are so afraid of regex. Sure, it's
not perfect, but these symbol-less solutions feel to me like one step forward
and two steps back. A more interesting development would be something along
the lines of what Perl 6 is trying to do (if only someone could implement
it).[1]

We should start treating regex like a computer language in its own right,
rather than some second-class citizen that we stuff into a single line without
any delimiting whitespace. Use the /x switch and comment your code as you
would with any other programming language, and you'll find that regex really
isn't that scary.

Take the example pattern. The regex written by a human might look something
like so:

    
    
        m{^
            https? ://        # Protocol
            (?: \w+ \. )?     # Subdomain
            ([\w\-]+) \.      # Domain
            (?: com | org )   # TLD
            /?
        $}x
    

It's certainly easier to parse. Debugging it is also a lot easier. We can see
that this won't match anything with more than one subdomain. We can also see
that it won't match subdomains that have hyphens in them. It also looks at a
glance much more like a URL and less like some arbitrary Ruby code.

[1]:
[http://www.perl6.org/archive/doc/design/apo/A05.html](http://www.perl6.org/archive/doc/design/apo/A05.html)

~~~
lelf
> _something along the lines of what Perl 6 is trying to do (if only someone
> could implement it)_

What, like, perl6?

    
    
        grammar URL {
            rule TOP { <proto> '://' <domain> '/' }
    
            token proto { 'http' 's'? }
            token subdom { \w+ }
            rule domain { <.subdom> [ '.' <.domain> ]? }
        }
    
        say URL.parse('http://lelf.lu/');
    
    
    
        Betty:hacks lelf$ perl6 gr.p6
        ｢http://lelf.lu/｣
         proto => ｢http｣
         domain => ｢lelf.lu｣

~~~
draegtun
And for those interested, a full RFC 3986 (URI) grammar in Perl 6 can be found
here - [https://github.com/ihrd/uri](https://github.com/ihrd/uri)

Here is a direct link to the grammar module -
[https://github.com/ihrd/uri/blob/master/lib/IETF/RFC_Grammar...](https://github.com/ihrd/uri/blob/master/lib/IETF/RFC_Grammar/URI.pm)

------
fishtoaster
This is interesting. I find that regexes are one of the few places where a
comment explaining what a block of code does is generally necessary. Since the
"ruby way" is to aggressively replace comments with method and variable
names, a tool like this is a good way to achieve that goal.

That said, this also makes a huge tradeoff against conciseness. Personally,
I'd prefer

    
    
        # Is foo a valid klingon email address?
        foo =~ /gibberish/
    

Over

    
    
        foo = Hexpress.new.
            start('g').
            maybe('i').
            many('b').
            find('b').
            ...

~~~
krainboltgreene
In another rant about how my library isn't needed, someone helpfully noted you
can actually do comments with the %r{} syntax, as long as you add the x flag:

    
    
        %r{
          ^
          # foo
          bar
          $
        }x
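
In Ruby the inline comments only take effect in extended mode, that is, with the `x` flag after the closing delimiter; without it the whitespace and the `#` text are matched literally. A sketch of the URL pattern discussed elsewhere in this thread written that way (`URL_PATTERN` is a made-up name, and `\A`/`\z` are used here to anchor the whole string):

```ruby
# Extended mode: unescaped whitespace is ignored and "# ..." is a comment.
URL_PATTERN = %r{
  \A
  https?://          # protocol
  (?:\w+\.)?         # optional single subdomain
  ([\w\-]+)\.        # domain (captured)
  (?:com|org)        # TLD
  /?
  \z
}x

match = URL_PATTERN.match("https://www.example.com/")
puts match[1] if match   # prints "example"
```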

------
acjohnson55
Please yes!

I did a little research and this derives from Verbal Expressions, as explained
[1] and implemented [2]. In any case, I'm emphatically in favor. I'm the sort
of programmer that needs to write maybe one regex per month, and I'm tired as
hell of relearning the human-meaningless syntax for regexes, let alone all the
little variations between languages. Not to mention, if it can vastly reduce
the number of symbols I have to escape when matching special symbols, so much
the better.

[1] [http://thechangelog.com/stop-writing-regular-expressions-exp...](http://thechangelog.com/stop-writing-regular-expressions-express-them-with-verbal-expressions/)

[2] [http://verbalexpressions.github.io/](http://verbalexpressions.github.io/)

~~~
krainboltgreene
I really need to link those in the readme, thanks! I also discovered on
releasing this a really old ruby library that predates either of those and I'm
adding that too.

------
jonaphin
Wow. My mind is blown.

It almost feels like what high level programming languages are to assembly
code.

Sure, we can (and should) learn Regex constructs, but it is undeniable that
this library provides an unmatched level of clarity.

Kudos to Krain for coming up with such an elegant solution to the issue of
Regex building/reading.

~~~
eCa
Personally, I would not use this. One of the powers of regex is the similarity
(depending on complexity and structure) between the regex pattern and what a
matching string looks like. By doing it this way, a lot of noise is introduced
that can make parsing more difficult.

I don't find:

    
    
        find { matching { [word, "-"] }.multiple }
    

Clearer than:

    
    
        ([\w\-]+)
    

It might be an alternative way to introduce regexes though.

I would also like to see a more complex regex written this way.

~~~
krainboltgreene
In that example I wanted to show off the matching and multiple syntax.

------
kamaal
Regular expressions were invented precisely to avoid this, because this works
fine only as long as your regular expressions are small and few.

Once that fact changes you will find yourself writing and staring at walls of
text.

Do this enough times and you will wish for a more succinct and powerful way of
expressing such an idiom, only to realize that regular expressions are the
only way to solve the range of problems they were designed to solve.

The easiest analogy I can give you is math. Prior to manipulation of symbols,
math was pretty much text. The whole of math looked like paragraphs of puzzles
and word play. That worked fine when you wanted to do small things like
postulates, axioms and a few things derived from them. To move to a higher
level of abstraction and interplay of concepts we had to get into symbols.

This is something similar.

------
tzury
Python's re.VERBOSE lets you write [1]

    
    
        a = re.compile(r"""\d +  # the integral part
                           \.    # the decimal point
                           \d *  # some fractional digits""", re.X)
    

Instead of

    
    
        a = re.compile(r"\d+\.\d*")
    

But that is as far as it goes for me. From that point, masking regex with an
additional layer of tens of functions is not a wise move.

[1]
[http://docs.python.org/2/library/re.html#re.VERBOSE](http://docs.python.org/2/library/re.html#re.VERBOSE)

~~~
krainboltgreene
It's not meant for simple regex like that. It's meant for complex composed
regular expressions.

~~~
drdaeman
Seems that most languages with built-in regexes lack any operation on them
(like `/foo/ + /bar/` or `/foo/ | /bar/`). I'd blame the standard library,
not praise a third-party one.
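
For what it's worth, Ruby's standard library does cover part of this, just not with operators: `Regexp.union` gives alternation, and interpolating one Regexp into another gives concatenation. A small sketch (the constant names are made up for illustration):

```ruby
# Two small patterns to combine.
FOO = /foo/
BAR = /bar/

# Alternation: Regexp.union builds a pattern that matches either operand.
EITHER = Regexp.union(FOO, BAR)

# Concatenation: interpolating a Regexp embeds it as a non-capturing group.
SEQUENCE = /#{FOO}-#{BAR}/

puts EITHER.match?("xbarx")       # true
puts SEQUENCE.match?("foo-bar")   # true
puts SEQUENCE.match?("bar-foo")   # false
```

Interpolation preserves each sub-pattern's own flags, so pieces can be combined without re-escaping them by hand.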

------
Glyptodon
This completely illustrates what I hate most about ruby: the conflation of
'understandable code' with some sort of insane directive to turn everything
into an English sentence without any regard for what the code actually does.
At its worst it's almost an obsession with enforcing technical ignorance.

That said, regexes can be tricky to parse and making them clearer to the
average person is a worthy goal.

------
rajahafify
I don't understand why there is so much hate for the library. For me, this is
a very Ruby way to tackle the regular expression problems that newbies like me
have.

Not that I'm saying that newbies shouldn't learn regex. But any level of
abstraction would be great. Rails abstracts the complexity of web development.
Hexpress abstracts the complexity of regex. Both are wins in my book.

------
brudgers
Like VerbalExpressions, this is a good idea carried out without full
acknowledgement of the nature of regular expressions. There's no shortcut -
concatenation, union, and Kleene star. A URL operator doesn't replace any of
them.

If there isn't isomorphism between the traditional symbols and the new names,
then the expressions will be limited in expressiveness.

------
Argorak
Same here [1]. This library squats vocabulary in a bad way, e.g.:

#word is (\w+). This works in English, but breaks very early. Still, it might
be correct some of the time. "word" is extremely context-sensitive.

[1]
[https://news.ycombinator.com/item?id=6319584](https://news.ycombinator.com/item?id=6319584)

~~~
krainboltgreene
Good point, and more importantly I should be making use of the special Ruby
matchers.
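
To illustrate the point about `\w`: in Ruby it covers only ASCII letters, digits and underscore, while the POSIX bracket `[[:word:]]` is Unicode-aware. A short sketch with made-up sample text (the constant names are also just for the example):

```ruby
# "café naïve", written with escapes so the string is UTF-8 regardless of file encoding.
TEXT = "caf\u00e9 na\u00efve"

# \w stops at the accented letters, splitting the words.
ASCII_WORDS = TEXT.scan(/\w+/)

# [[:word:]] also matches non-ASCII letters, so the words stay whole.
UNICODE_WORDS = TEXT.scan(/[[:word:]]+/)

p ASCII_WORDS
p UNICODE_WORDS
```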

------
dlau1
It seems like the argument against regexes is that they can get really complex
and unreadable. For something as simple as that url, isn't it easier to just
use a regex?

Seems like in the 'unreadable' regex case, you'll have a boatload of verbose
function calls to construct it.

~~~
krainboltgreene
Library owner here. A huge draw for me to make this library was composability,
not avoiding regex syntax.

I want to share patterns without having to do much.

------
hardwaresofton
So as computer scientists, we're in the business of tradeoffs, right? I think
the conciseness tradeoff for clarity is good. Abstracting away from pure
regular expression syntax is a good thing, I think, because it offers better
AT A GLANCE reading.

Why would that be better? I recently saw Bret Victor's Inventing on Principle
talk on Vimeo, and one of the things he mentions that I really agree with is
that most of us must 'think like computers' to understand what our code will
do. The less we have to do this, the better, because we're terrible computers.

Now am I likely to use something like this? No, mostly because I don't program
in ruby TOO often, and I am pretty well aware of how to use regular
expressions.

------
rzendacott
Here's a link to the VerbalExpressions organization that contains
implementations of a similar DSL for various languages:
[https://github.com/verbalexpressions](https://github.com/verbalexpressions)

~~~
prawks
Very cool. I like the standalone "start_of_line" more than the start("some
string").

~~~
krainboltgreene
We actually match (and beat) the VerbalExpressions standard if you `require
"hexpress/verbal_expression"`

------
jgmmo
Cute, but I prefer just using Rubular to test out regexes.

------
lgrebe
Or instead of learning start maybe with words multiple has either ending you
learn ^?\w+(|)$.

I'd imagine that wrappers such as this one do not enable users to create any
more complex patterns, than ones they could make with a most basic
understanding of regular expressions. Whilst providing neither a base for more
complex regex use nor a community to further an understanding of regex.

~~~
krainboltgreene
Considering the composability I've designed here, I'm not sure how you think it
won't "enable users to create more complex patterns"?

------
nirai
Now we need a human way to write Ruby in Python.

------
swalkergibson
The reason that this library is useful is because I don't have to maintain an
AST in my own feeble mind about what each special character in the regex
means. I can look at the code, at first glance, and immediately know what the
hell is going on.

Also, this.

[https://xkcd.com/208/](https://xkcd.com/208/)

------
krainboltgreene
Library owner here. I've added some replies to points made, but I just want to
note that the readme on GitHub's website was a bit out of date and provides a
few more details now.

------
ryderm
Seems easier to just use a regex. Every programmer should know them already,
so this is just another API to remember. Still kinda cool though, but prob not
so useful.

