
I don't know Regex - dyml
http://ideasof.andersaberg.com/idea/17/i-dont-know-regex
======
danieldk
_Which one is the simplest? I rest my case._

Actually, I like neither. The code is easier to read, but the regex gives a
broader overview. This is something where parser combinators can shine. E.g.,
from Haskell's email-validate:

    
    
      addrSpec = do
      	localPart <- local
      	char '@'
      	domainPart <- domain
      	endOfInput
      	return (EmailAddress localPart domainPart)
    

Source: [http://hackage.haskell.org/packages/archive/email-
validate/1...](http://hackage.haskell.org/packages/archive/email-
validate/1.0.0/doc/html/src/Text-Email-Parser.html#addrSpec)

To end with a positive note: good work on the library! I think it will be
useful for many people who dislike writing regexes.

~~~
waps
Just because it's a regex does not mean you can't document it. There are many
regex tracers that can tell you exactly where a match fails. Plus regexes
condense a lot of information in small spaces, which makes them easier (imho)
to debug. Most other parsing syntaxes are one-offs, and very verbose.

And your average parsing library is not going to be using boyer-moore state
machine parsing like you can easily achieve with regexes. It's complex, terse,
fast, and the code that will be running your match is probably better debugged
than any code you could hope to produce (it's most programmers' understanding
of regexes that could use some debugging). Regexes also just make sense if you
know the theory behind the state machines.

So how about this way of writing the regex:

    
    
      regex = r"""(?x)           # Extended syntax (ignores \n and whitespace, allows comments)
      # Regex to match email addresses
      \b                        # Word boundary
      (?P<username>\w+)         # Username part
      @
      (?P<domain>[\w.]+)        # Domain
      \b                        # Word boundary
      """
    
      # Example use
      import re
      m = re.match(regex, "john@snow")
    
      print(m.group('username'))   # john
      print(m.group('domain'))     # snow
    

I find parser combinators very hard to use. I wrote parser combinator
libraries in C and one in java thinking it'd be easier to use than a parser
generator like ANTLR, and I've since rethought the process. ANTLR studio is
just so useful for writing a parser to example data.

There's also the concern that parsers are strictly more expressive than
regexes. If you need that, then regexes are simply out. However, most parser
generators allow you to easily combine regex(-like) tokenization with parsing.

~~~
danieldk
_Just because it's a regex does not mean you can't document it._

You are certainly right. Especially, if you use a package for automata or
transducers that allows you to apply common automaton operations (union,
intersection, composition, etc.) to combine expressions.

However, that's not how regular expressions are normally used or what the
standard libraries for most languages support. So, people either (1) write
simplified expressions (like yours above) that do not implement the relevant
standard; (2) write unreadable expressions; or (3) 'compose' expressions through
string interpolation, which can become unreadable quite quickly (I've seen
enough of that in production code).

_I wrote parser combinator libraries in C and one in java thinking it'd be
easier to use than a parser generator like ANTLR,_

However, yacc (which I assume you used for C) and ANTLR are hardly the
state of the art in parser combinators. Try parsec or attoparsec sometime.

_There's also the concern that parsers are strictly more expressive than
regexes._

Not only that, (sub-)parsers are fully typed, making it much easier and safer
to combine parsers. E.g., here I know exactly what this parser will give me
(namely a Bar):

    
    
      foo :: Parser Bar
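
For readers who do not use Haskell, the same idea can be roughly sketched in Python with type hints. This is only an illustration: `Parser`, `then`, `pure`, `char`, `some`, `end_of_input` and `addr_spec` are hypothetical names made up here, not the API of parsec, attoparsec or email-validate, and the grammar is deliberately simplified rather than standards-compliant.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class EmailAddress:
    local: str
    domain: str


@dataclass(frozen=True)
class Parser(Generic[T]):
    # run(text, pos) returns (value, new_pos) on success and None on failure.
    run: Callable[[str, int], Optional[Tuple[T, int]]]

    def then(self, make_next: "Callable[[T], Parser[U]]") -> "Parser[U]":
        """Run this parser, then the parser built from its result."""
        def run_both(text: str, pos: int) -> Optional[Tuple[U, int]]:
            first = self.run(text, pos)
            if first is None:
                return None
            value, next_pos = first
            return make_next(value).run(text, next_pos)
        return Parser(run_both)


def pure(value: T) -> Parser[T]:
    """Succeed without consuming input."""
    return Parser(lambda text, pos: (value, pos))


def char(expected: str) -> Parser[str]:
    """Match one literal string at the current position."""
    def run(text: str, pos: int) -> Optional[Tuple[str, int]]:
        if text.startswith(expected, pos):
            return expected, pos + len(expected)
        return None
    return Parser(run)


def some(predicate: Callable[[str], bool]) -> Parser[str]:
    """Match one or more characters that satisfy the predicate."""
    def run(text: str, pos: int) -> Optional[Tuple[str, int]]:
        end = pos
        while end < len(text) and predicate(text[end]):
            end += 1
        return (text[pos:end], end) if end > pos else None
    return Parser(run)


end_of_input: Parser[None] = Parser(
    lambda text, pos: (None, pos) if pos == len(text) else None
)

# Simplified stand-ins for the real local-part and domain grammars.
local = some(str.isalnum)
domain = some(lambda c: c.isalnum() or c == ".")

# The annotation plays the role of "addrSpec :: Parser EmailAddress".
addr_spec: Parser[EmailAddress] = local.then(
    lambda user: char("@").then(
        lambda _at: domain.then(
            lambda host: end_of_input.then(
                lambda _end: pure(EmailAddress(user, host))))))

print(addr_spec.run("john@snow.com", 0))
print(addr_spec.run("notanemailaddress", 0))  # None
```

A static checker such as mypy can then verify that the pieces fit together, which is the property the Haskell signature gives you for free.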

~~~
waps
I don't know Haskell so I haven't used parsec and attoparsec, beyond tutorials
and running a few examples. I read the tutorials, then proceeded to make my
own version in Java that works similarly (also typesafe). It's easy to make it
work, but it always retains the same disadvantage: runtime evaluation.

Runtime evaluation means that your program figures out the structure of the
grammar and builds its internal "recursive" state machine every time it runs (in
some cases every time you evaluate a string). This means that your parser is
effectively running on a very slow, very ad-hoc virtual machine inside your
program. In the case of ANTLR (or yacc), the generated program itself has the
required structure. As a result, yacc parses happen at close to memory transfer
rate (depending on the grammar and the semantic actions, obviously), whereas
with parsec you're lucky to get 10 MiB/s (a similar grammar ran at > 1 GiB/s in
Java with ANTLR).

The difference is large enough that it quickly becomes very hard to ignore.
Plus, I like the fact that the ANTLR syntax is more concise, which makes it
easier to keep the whole parser in your head once you're used to the language.
Furthermore there's the testing application that comes with ANTLR, ANTLR
studio.

~~~
enigmo
Attoparsec is a whole lot faster than Parsec. Orders of magnitude faster for
many parsers.

There is a common problem with most of these parser combinator libraries in
that they do no state machine optimization whatsoever... with the (sole?)
exception of uu-parsinglib.

Applicative parsers, avoiding monadic extensions, readily support optimization
and error checking much like flex/bison, Ragel and ANTLR. The cost of running
these optimizations once per program execution (not parser evaluation) is
fairly minimal and may not be noticeably slower than a precompiled parser..
particularly if the cost is amortized over many parses.

I do wish more of the parser combinator libraries actually did this though.
And there are lex/yacc-like tools for Haskell as well: Alex and Happy
(famously used by GHC to parse Haskell source), and Ragel can be bolted in
when performance is absolutely critical.

------
irahul
Nice work. Personally I will still use the raw regex rather than the method
calls to build the regular expression. As another commenter pointed out, the
example regex is more complex than it should be. It can be reduced to:

    
    
        import re

        pat = re.compile(r'^ \w+ @ [A-Za-z]\w*  \. \w+ $', re.X)
        if pat.match('r@acnt.me'):
            print("woot")
    

I won't bother explaining this regex (too simple). However, if it were
something complex, I would put inline comments:

    
    
        pat = re.compile(r'''^ \w+  # rahul
                        @
                        [A-Za-z]\w*  # thoughtnirvana
                        \.
                        \w+ # com
                        $''',
                        re.X)
    

Notes about the example regex:

    
    
      var regEx = {(?:^)[A-Za-z]([A-Za-z]+|(?:\d+))(@{1,1})[A-Za-z]+(.{1,1})[A-Za-z]+(?:$)}
    

(?:^), (?:$) - These are the same as simply using ^ and $. Anchors never
capture anything, so there is no need to mark them non-capturing.

([A-Za-z]+|(?:\d+)) - What's going on here? You have a capturing group and
within that capturing group, you have the or part marked as non capturing.
What's the intent?

(@{1,1}) - @{1,1} is the same as @. Also, why are you capturing it? I think
you are using parens to make the regex readable. You should use the
IgnorePatternWhitespace option instead:
[http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx](http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx)
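
A quick way to check these notes is to compare the example regex with a trimmed version on a few inputs. The sketch below uses Python's `re` rather than .NET, and the sample strings are made up for illustration. The unescaped dot is kept on purpose so that both patterns accept exactly the same strings; only the group numbering differs.

```python
import re

# The example regex from the post, without the surrounding braces.
original = re.compile(
    r"(?:^)[A-Za-z]([A-Za-z]+|(?:\d+))(@{1,1})[A-Za-z]+(.{1,1})[A-Za-z]+(?:$)"
)

# Same pattern with the redundant groups and {1,1} quantifiers removed.
trimmed = re.compile(r"^[A-Za-z]([A-Za-z]+|\d+)@[A-Za-z]+.[A-Za-z]+$")

samples = [
    "abc@example.com",
    "a123@example.com",
    "ab1@example.com",
    "abc@@example.com",
    "@example.com",
]

for text in samples:
    # Both patterns must agree on every input.
    assert bool(original.match(text)) == bool(trimmed.match(text))
    print(text, bool(trimmed.match(text)))
```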

~~~
masklinn
Seems to me the example regex is the (completely broken) output of the
(completely broken) generator expression below.

~~~
jobigoud
Oh… that would explain a lot. It does have that "generated by a tool" useless
verbosity feel to it.

------
Argorak
All these libraries suffer from the problem of describing regexes in a
non-formal language (English!).

An example: What are .Letters()? [a-zA-Z]? Are diacritics included? The whole
Unicode letter range?

And suddenly, you have to specify that character soup and the example goes to
hell, because it reintroduces most of the complexities of the original regexp.
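
To make the ambiguity concrete, here is a small Python sketch of two plausible readings of "letters". The names `ASCII_LETTERS`, `UNICODE_LETTERS` and `classify` are made up for this example, and `[^\W\d_]` is only a common approximation of "any Unicode letter", not a formal definition.

```python
import re
from typing import Tuple

ASCII_LETTERS = re.compile(r"[A-Za-z]+")
# For str patterns in Python 3, \w is Unicode-aware; [^\W\d_] removes the
# decimal digits and the underscore from it.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")


def classify(word: str) -> Tuple[bool, bool]:
    """Return (matches the ASCII reading, matches the Unicode reading)."""
    return (bool(ASCII_LETTERS.fullmatch(word)),
            bool(UNICODE_LETTERS.fullmatch(word)))


print(classify("resume"))            # (True, True)
print(classify("r\u00e9sum\u00e9"))  # (False, True)
print(classify("abc1"))              # (False, False)
```

A builder method called Letters() has to pick one of these readings, and the caller cannot tell which from the name alone.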

------
BobTurbo
I would like to point out that I am actually the creator of this idea, and not
the author. The author has created a variation in C#, that has some
differences.

The original repository is at:

[https://github.com/thebinarysearchtree/RegExpBuilder](https://github.com/thebinarysearchtree/RegExpBuilder)

I came up with this idea 2 years ago. Some differences I see between my idea
and this C# implementation are:

Or() is confusing by itself. In mine, you pass in objects or strings, such as:

    
    
       var regex = new RegExpBuilder()
         .either(pattern1)
         .or(pattern2);
    
       var regex = new RegExpBuilder()
         .either("sometime")
         .or("soon")
         .or("never");
    

Also, all the special characters are escaped properly (\ is not escaped).

There are shortcuts - you don't have to do

    
    
       .exactly(1).of("hackernews")
    

you can just do:

    
    
       .then("hackernews");
    

In terms of differences between this and VerbalExpressions, VerbalExpressions
is very limited. It cannot represent many quantifiers (e.g., at least 3 of
something), does not have decent ways to group subexpressions, and so on. It
can only represent (in a practical way) about 0.000001 % of regular
expressions, as opposed to RegExpBuilder.
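
For anyone curious how little machinery the either/or chaining, the escaping and an "at least N" quantifier need, here is a rough Python sketch. `TinyBuilder` and all of its methods are hypothetical names invented for this illustration; this is not the RegExpBuilder API or its implementation.

```python
import re


class TinyBuilder:
    """Hypothetical minimal fluent regex builder, for illustration only."""

    def __init__(self):
        self._parts = []         # finished regex fragments, in order
        self._alternatives = []  # literals collected by either()/or_()

    def _flush(self):
        # Close a pending either()/or_() chain as one non-capturing group.
        if self._alternatives:
            self._parts.append("(?:" + "|".join(self._alternatives) + ")")
            self._alternatives = []

    def then(self, literal):
        self._flush()
        self._parts.append(re.escape(literal))  # literals are always escaped
        return self

    def at_least(self, count, literal):
        self._flush()
        self._parts.append("(?:%s){%d,}" % (re.escape(literal), count))
        return self

    def either(self, literal):
        self._flush()
        self._alternatives = [re.escape(literal)]
        return self

    def or_(self, literal):
        if not self._alternatives:
            raise ValueError("or_() must follow either()")
        self._alternatives.append(re.escape(literal))
        return self

    def build(self):
        self._flush()
        return re.compile("".join(self._parts))


regex = (TinyBuilder()
         .then("see you ")
         .either("sometime").or_("soon").or_("never")
         .then(".")
         .build())

print(bool(regex.fullmatch("see you soon.")))   # True
print(bool(regex.fullmatch("see you soonx")))   # False
```

Because every literal goes through `re.escape`, the trailing "." only matches a real dot, and because `or_()` only makes sense after `either()`, it is always clear which alternatives belong together.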

~~~
dyml
You have my support! I read a blog post showing off your RegExpBuilder and I
got inspired to create something similar in C# (as a chance to improve my regex
and coding skills), although there are some things I would love to do
differently from how your lib does it.

Thank you for a great library. After I have reached a stable release with this
C# port, I'd like to create a TypeScript version. I hope you do not have
anything against me writing spin-offs? :)

------
asperous
His example could be simplified to

    
    
        ^ ( [a-z0-9]+ @ [a-z]+ \. [a-z]+ ) $
    

With ignore case and ignore whitespace mode on. I work with Regex a lot so I
find this very readable, set in a universal format, and more concise. I will
gladly concede that the builder would be easier for those that aren't familiar
with regex.

~~~
dyml
Thanks for the comments! I bet I could improve the way the regex is generated,
since I'm not so comfortable working with regex.

I'd also like to add features; a .Not operator would be really useful, and I'd
gladly take a pull request if anyone has an implementation in mind :)

If I receive some signals that others find this library useful and would like
me to add some feature, I'd be more than glad to do so.

~~~
mkching
If I came across the code in the original post, I would be confused as to what
the Or operator applied to. With a regexp, the parentheses make this clear.
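
A small Python illustration of why the scope of an "or" matters (the words used are arbitrary examples): without a group, alternation splits the whole pattern, anchors included.

```python
import re

ungrouped = re.compile(r"^cat|dog$")    # parsed as (^cat) | (dog$)
grouped = re.compile(r"^(?:cat|dog)$")  # both anchors apply to each word

for text in ["cat", "dog", "catalog", "hotdog"]:
    print(text, bool(ungrouped.search(text)), bool(grouped.search(text)))
# cat True True
# dog True True
# catalog True False
# hotdog True False
```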

I would also assume that Exactly(1).Of(".") was meant to match a literal ".".
In PCRE, you can surround a section with \Q...\E to force literal
interpretation, but I believe in .NET you would need to call Regex.Escape.

The overall concept is not a terrible idea, but you should probably become a
little more familiar with regexps before trying to write a library that
creates them. While some things in the sample seem a natural product of being
code generated (e.g. "@{1,1}" instead of simply "@"), the use of "(?:" in many
places is simply not needed.

------
secoif
Unfortunately you're going to encounter regex a lot in your programming career
and this tool won't always be there to save you, so you are going to need to
learn regex one way or the other. You might as well get it over with sooner
rather than later.

This tool is just hindering your progression, and it is yet another abstraction
someone has to learn if they are going to deal with this code. It would make sense
if this was a one-off thing and you'd be saving someone the effort of learning
some weird protocol or syntax, but since regex is so common and most
programmers have just learned to deal with them, you're actually adding more
cognitive load, since now they have to know two things instead of one. Imagine
coming across this in someone else's code and discovering the regex didn't
work as expected. Now I have to debug the regex and figure out whether it's a
bug in the tool, or in my regex, etc…

------
troels
Seems there is a bug, since this:

    
    
        .Exactly(1).Of(".")
    

expands to this:

    
    
        (.{1,1})
    

Which is wrong, as dot is a meta-character. It should be escaped.
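
The effect is easy to reproduce. Here is a minimal Python sketch (the surrounding `[a-z]+` parts and the test strings are assumptions added for the demo) comparing the generated fragment with one built through `re.escape`:

```python
import re

# The fragment as the builder emitted it: "." still means "any character".
generated = re.compile(r"^[a-z]+(.{1,1})[a-z]+$")

# The same fragment with the literal escaped before it is inserted.
escaped = re.compile(r"^[a-z]+" + re.escape(".") + r"[a-z]+$")

for text in ["example.com", "exampleXcom", "examplecom"]:
    print(text, bool(generated.match(text)), bool(escaped.match(text)))
# example.com True True
# exampleXcom True False
# examplecom True False
```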

~~~
masklinn
And {1,1} (twice) is completely unnecessary.

~~~
troels
Yes, as is once, actually.

------
vacri
I've been making good use of
[http://www.regexper.com/](http://www.regexper.com/) since it was linked here.
It's made learning regexes _much_ easier as it gives a clear workflow diagram.

For example, it showed that the horrible email regex in this article had a
couple of errors - the dot before the TLD should be escaped (without the
escape, it's 'any character'), and that group #1 can either be letters or
digits, but not both (when it can be).

It's still not a good regex, since there are characters like hyphens, dots,
and pluses that are valid pre-'@' characters, which both sample regexes fail
to recognise.
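
A rough Python check of the group #1 problem. `original` is the article's regex with the redundant wrappers removed, `mixed` is one possible fix that allows letters and digits together and escapes the dot, and the addresses are made-up examples.

```python
import re

# Group 1 is "letters OR digits", and the dot is unescaped.
original = re.compile(r"^[A-Za-z]([A-Za-z]+|\d+)@[A-Za-z]+.[A-Za-z]+$")

# One possible fix: a single character class, and a literal dot.
mixed = re.compile(r"^[A-Za-z][A-Za-z0-9]+@[A-Za-z]+\.[A-Za-z]+$")

for address in ["abc@example.com", "a123@example.com", "ab1@example.com",
                "first.last+tag@example.com"]:
    print(address, bool(original.match(address)), bool(mixed.match(address)))
# abc@example.com True True
# a123@example.com True True
# ab1@example.com False True
# first.last+tag@example.com False False
```

The last line shows the remaining gap: neither pattern accepts dots or pluses before the '@'.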

~~~
thiht
Personally I use [http://www.debuggex.com/](http://www.debuggex.com/) since it
offers a step by step visualization, a live generation of the diagram, a live
syntax checking of the regex, etc.

~~~
vacri
Just playing around with it now, it's nice how it builds up the regex as you
write it, but I did notice that it doesn't differentiate between '.' (match
one character of any kind) and '\.' (the character for 'dot').

Hrm, on a closer look, it affects all special characters (like ^ and $) and it
does differentiate them, but only by turning them blue - makes it hard to see
the change.

------
deerpig
There are a lot of cool tools for helping write regexes; my favorite is
re-builder mode in Emacs. You write the regex in the minibuffer and see what
matches in the text in the buffer. It makes debugging regexes very easy.

Tools like sed and regexes are compact and very powerful, and they aren't
difficult to learn. I really don't understand the need for this library, which
seems needlessly verbose. And you will still need to be able to read regexes in
other people's code.

It's a nice idea, and good work, but in my opinion it's solving a problem that
doesn't exist.

------
npad
There are implementations of this idea for lots of languages (including C#)
available here:

[https://github.com/VerbalExpressions](https://github.com/VerbalExpressions)

------
gojomo
Last month's entry in this category, "Verbal Expressions":

[https://news.ycombinator.com/item?id=6164276](https://news.ycombinator.com/item?id=6164276)

------
draegtun
Some languages provide alternatives to regexes. For example, Rebol uses a _parse_
dialect instead -
[http://www.rebol.com/docs/core23/rebolcore-15.html](http://www.rebol.com/docs/core23/rebolcore-15.html)

Here is the article's example converted to Rebol's _parse_ dialect (minus
capturing but it's easy to add):

    
    
      ; build some prereqs for parse
      num:         charset [#"0" - #"9"]
      alpha-lower: charset [#"a" - #"z"]
      alpha-upper: charset [#"A" - #"Z"]
      alpha:       union alpha-lower alpha-upper
      alpha-num:   union alpha num 
    
      ; create parse rule block
      simple-email-rule: [
          alpha
          any alpha-num
          #"@"
          some alpha-num
          #"."
          some alpha-num
          end 
      ]
    
      ;
      ; then later...
    
      parse "valid@example.com" simple-email-rule  ; => true
      parse "notanemailaddress" simple-email-rule  ; => false

------
lutusp
With all respect, you're better off using a supportive regex environment that
accepts your regex entries and quickly shows their effect on some example text
you provide -- a builder/tester like this (just an example, there are many
similar ones):

[http://www.arachnoid.com/regex_lab/](http://www.arachnoid.com/regex_lab/)

Philosophically, there are two approaches to making regexes an effective tool
-- expand regex syntax until it's so verbose that there's no possibility for
confusion -- ironically a somewhat confusing tactic as this topic's comments
demonstrate -- or learn native regex in an interactive way that shows its
effect on example text, until you develop an instinct for it. I prefer the
latter.

It's like learning music by keyboard -- shall we paint each keyboard key a
different color and recode sheet music to agree, or shall we use a teaching
method that makes the keyboard gradually seem more natural?

------
Qantourisc
Your email regex is wrong. There are some obscure email addresses that will not
work. For example my.email domain+plus@some.weird3.com

For more see
[http://en.wikipedia.org/wiki/Email_address#Valid_email_addre...](http://en.wikipedia.org/wiki/Email_address#Valid_email_addresses)

~~~
AdrianRossouw
apparently this is the correct fully rfc-compliant email validation regex:

[http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html](http://www.ex-
parrot.com/pdw/Mail-RFC822-Address.html)

~~~
vacri
These guys say that a 430-character regex does the same trick as that
6k-character one (down in the RFC 2822 section)

[http://www.regular-expressions.info/email.html](http://www.regular-
expressions.info/email.html)

------
islon
_Which one is the simplest? I rest my case._

Of course the regex is simpler as everyone who knows regexes will understand
it.

If you don't know regex you should invest time in learning it. It's the same
as saying:

    
    
        I don't know german, look at this german sentence builder, it's so much nicer!
        > builder.firstPersonPronom().verb("like").directObject(new SecondPersonPronom());
        > => "Ich mag dich"

------
daGrevis
First thing to do when writing regexes is to write them on multiple lines.
Also, you should use comments. It will make them more readable and easier to
follow. Also, I want to suggest Zed's Learn Regex The Hard Way.

[http://regex.learncodethehardway.org/book/](http://regex.learncodethehardway.org/book/)

------
kaeluka
You could have named contents:

    
    
      var letter = "[a-zA-Z]";
      var letters = letter + "+";
      ...
    

then your regexp would look like this:

    
    
      var regEx = letter +
        letters +
        "|" + ...

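The same approach can be sketched in Python, under the assumption of a deliberately simplified email grammar. All fragment names (`LETTER`, `LETTERS`, `ALNUMS`, `LOCAL`, `DOMAIN`, `EMAIL`) are made up for this illustration.

```python
import re

# Small named fragments...
LETTER = r"[A-Za-z]"
LETTERS = LETTER + "+"
ALNUMS = r"[A-Za-z0-9]*"

# ...combined into larger ones by plain string concatenation.
LOCAL = LETTER + ALNUMS
DOMAIN = LETTERS + re.escape(".") + LETTERS

# Named groups keep the final pattern self-describing.
EMAIL = re.compile(rf"^(?P<local>{LOCAL})@(?P<domain>{DOMAIN})$")

m = EMAIL.match("valid@example.com")
print(m.group("local"), m.group("domain"))  # valid example.com
print(EMAIL.match("notanemailaddress"))     # None
```

This stays readable for a handful of fragments; with many nested fragments, each one should be wrapped in `(?:...)` before a quantifier or alternation is applied to it.
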
------
roryokane
I learned regexes entirely from the tutorial at [http://www.regular-
expressions.info/tutorial.html](http://www.regular-
expressions.info/tutorial.html). It clearly explained how the regex engine
works so I can simulate it in my head and understand why a given regex does or
doesn’t work. I tried out various regexes in TextMate as I read through the
tutorial – nowadays I would use one of the online sandboxes listed on
[http://stackoverflow.com/tags/regex/info](http://stackoverflow.com/tags/regex/info).
That free tutorial was enough to get me very comfortable with regexes.

I also tested my understanding afterwards with some online exercises chosen
from these lists:

[http://www.emacs.uniyar.ac.ru/doc/em24h/emacs081.htm](http://www.emacs.uniyar.ac.ru/doc/em24h/emacs081.htm)

[http://blogs.msdn.com/ericgu/archive/category/11323.aspx](http://blogs.msdn.com/ericgu/archive/category/11323.aspx)

This reference was handy while doing the exercises: [http://www.regular-
expressions.info/reference.html](http://www.regular-
expressions.info/reference.html)

Knowing regexes has been very helpful to me in general. I have used regexes in
reformatting my code through find and replace, in finding the code that I need
to edit next or that could be causing a certain problem, in writing Apache
config URL rewriting rules, in writing poor man’s language parsers that
assisted me in generating code, in converting raw data into programming
language literals, in understanding user input validation rules, and in other
ways. I think that any serious developer who expects to work with more than
one programming language in their lifetime should understand regular
expressions. Thus, I encourage the OP to try learning regexes, using the
resources linked above.

That said, I agree that regexes could be easier to understand. I rather wish
that Perl 6’s revised, simpler regex syntax
([http://perlcabal.org/syn/S05.html](http://perlcabal.org/syn/S05.html)) were
the universal standard.

If you use regexes a lot, and get mentally strained by the complexity of some
of your bigger ones, consider learning about parsers too, another type of tool
that lets you manipulate text in more powerful ways, with longer but more
readable code than regexes.
[http://kschiess.github.io/parslet/](http://kschiess.github.io/parslet/) is a
simple parsing library to start with if you use Ruby. In fact, Parslet is
rather like a more powerful and more theoretically-sound version of the OP’s
library RegExpBuilder. Like RegExpBuilder, Parslet uses chains of methods with
English names to build parsers.

~~~
roryokane
Here is one possible translation of asperous’s simplified email regex into a
Parslet parser:

    
    
      #!/usr/bin/env ruby
      
      original_regex = /^ ( [a-z0-9]+ @ [a-z]+ \. [a-z]+ ) $/ix
      
      require 'parslet'
      include Parslet
      local_part = match['A-Za-z0-9'].repeat(1)
      letters = match['A-Za-z'].repeat(1)
      domain_part = letters >> str('.') >> letters
      email_parser = local_part >> str('@') >> domain_part
      
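      user_input = "foo#bar.com"
      matches_regex = original_regex.match(user_input)  # nil: the input has no "@"
      # Parslet raises Parslet::ParseFailed on a failed parse instead of returning nil
      matches_parser = begin
        email_parser.parse(user_input)
      rescue Parslet::ParseFailed
        nil
      end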
    

asperous’s regex:
[https://news.ycombinator.com/item?id=6319435](https://news.ycombinator.com/item?id=6319435)

Parslet info:
[http://kschiess.github.io/parslet/](http://kschiess.github.io/parslet/)

