
Validating Email Addresses with a Regex? Do Yourself a Favor and Don’t - olalonde
http://blog.onyxbits.de/validating-email-addresses-with-a-regex-do-yourself-a-favor-and-dont-391/
======
wahern

      For the local part, we only accept an RFC5322 dot-atom. However, for practical purposes,
      we further limit {atext} to alphanumeric characters, hyphens, underscores and plus signs.
      The rationale here being to keep it simple for the software we are validating for (it must
      deal with whatever we allow to pass).
    

_ugh_. The dot-atom production in the RFC 5322 grammar covers only _unquoted_
tokens. Once the local-part is quoted, nothing prevents an e-mail address
from, e.g., having two adjacent dot characters in it. Limiting local-part
components to ASCII is already hard to defend in 2016; adding more needless
constraints on top of that is even worse.

RFC 5322 DOES NOT define the syntax of an e-mail address. It defines the
syntax of embedding e-mail addresses in To:, Cc:, and similar headers. That's
a critical distinction.

If you're using RFC 5322 as your guide, you _must_ handle text quoting
correctly. Otherwise the dot-atom restriction is completely and utterly
arbitrary. It's chosen for basically no defensible reason other than to lend
an air of credibility to a really bad approach to implementing networking
software.
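
To make the quoting point concrete, here's a rough Python sketch that classifies a local-part as dot-atom or quoted-string. The character classes are my reading of atext and qtext in RFC 5322; it's ASCII-only, ignores the obsolete syntax, comments and line folding, and the name `local_part_kind` is made up for illustration. Treat it as a demonstration, not a validator.

```python
import re

# atext as I read RFC 5322 section 3.2.3 (ASCII only, no obsolete syntax).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

# Simplified quoted-string: qtext, quoted-pair, and plain space or tab
# between the double quotes. Folding across lines is not handled.
QUOTED = re.compile(r'"(?:[\x21\x23-\x5b\x5d-\x7e \t]|\\[\x21-\x7e \t])*"')


def local_part_kind(local: str) -> str:
    """Classify a local-part as 'dot-atom', 'quoted-string' or 'invalid'."""
    if DOT_ATOM.fullmatch(local):
        return "dot-atom"
    if QUOTED.fullmatch(local):
        return "quoted-string"
    return "invalid"


print(local_part_kind("john.doe"))
print(local_part_kind("john..doe"))
print(local_part_kind('"john..doe"'))
```

The unquoted `john..doe` fails the dot-atom rule, but the same text inside double quotes passes as a quoted-string. A validator that only knows dot-atom rejects the second form for no reason the RFC gives it.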

FWIW, no finite state machine can parse RFC 5322 constructs correctly, because
comments can nest to arbitrary depth and matching the parentheses takes a
counter or a stack.
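
A minimal sketch of the nesting problem, in Python. The function name `skip_comment` is hypothetical, and it only handles parentheses and backslash escapes (quoted-pair), not the rest of the comment grammar. The thing to notice is `depth`: it has no upper bound, which is exactly what a fixed set of states can't represent.

```python
def skip_comment(text: str, start: int) -> int:
    """Return the index just past the comment that opens at text[start].

    Raises ValueError if text[start] is not "(" or the comment never closes.
    """
    if start >= len(text) or text[start] != "(":
        raise ValueError("no comment at this position")
    depth = 0  # unbounded counter: the part a finite state machine lacks
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # quoted-pair: skip the backslash and the escaped char
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated comment")


header = "(a (nested (deep)) comment) user@example.com"
print(repr(header[skip_comment(header, 0):]))
```

A regex can be written for any fixed nesting depth, but not for all depths at once, so any regex-only approach is wrong for some legal input.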

If you want to write a parser for mailbox addresses in MIME headers, just
follow the algorithm DJB very helpfully explains at
[http://cr.yp.to/immhf/token.html](http://cr.yp.to/immhf/token.html). It's
much more correct; it harms interoperability less (if at all); and it's a
great way to learn how to implement recursive grammars.

That said, finite state machines are very powerful. One could do much worse
than learn a tool like Ragel. Its syntax supports both regular expressions
_and_ state charts. The machines it produces are blazingly fast. And unlike
almost any other tool, it supports restartable parsing, so you can parse data
off the wire without having to buffer all of it, or even any of it. Most
network protocols are line-based or TLV-based precisely because the tooling
sucks for implementing restartable parsers. But Ragel is great even when
you're afforded that convenience.
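
For anyone unsure what "restartable" means here, a hand-written Python sketch follows. This is not Ragel output and not Ragel syntax; the class `NumberListParser` and its toy input format (comma-separated decimal numbers ended by a newline) are invented for illustration. All parser state lives in the object, so `feed()` can be called with whatever chunk the socket happened to deliver, and no input is buffered.

```python
from enum import Enum, auto


class State(Enum):
    START = auto()      # expecting the first digit of a number
    IN_NUMBER = auto()  # inside a run of digits


class NumberListParser:
    """Parses input like "12,345,6\\n" fed in arbitrary chunks."""

    def __init__(self) -> None:
        self.state = State.START
        self.value = 0      # number accumulated so far
        self.numbers = []   # completed numbers
        self.done = False   # True once the terminating newline is seen

    def feed(self, chunk: str) -> None:
        for ch in chunk:
            if self.done:
                raise ValueError("data after end of input")
            if "0" <= ch <= "9":
                self.value = self.value * 10 + int(ch)
                self.state = State.IN_NUMBER
            elif ch in ",\n" and self.state is State.IN_NUMBER:
                self.numbers.append(self.value)
                self.value = 0
                self.state = State.START
                self.done = ch == "\n"
            else:
                raise ValueError(f"unexpected character {ch!r}")


parser = NumberListParser()
for chunk in ["1", "2,34", "5,", "6\n"]:  # chunk boundaries split tokens
    parser.feed(chunk)
print(parser.numbers, parser.done)
```

The chunk boundaries fall in the middle of numbers and the result is the same as if the whole input had arrived at once. Doing this by hand gets tedious fast for a real grammar, which is where a generator like Ragel earns its keep.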

