
Perfect email regex finally found - mildweed
http://fightingforalostcause.net/misc/2006/compare-email-regex.php
======
percept
I've accepted that it's best to treat people like grown-ups and if there's '@'
and '.' and it's retyped then it passes. Someone can easily submit a fake name
or phone number or street address, and e-mail's no different.

If they get it wrong, intentionally or not, then they don't get their receipt,
confirmation, validation link, etc. and I believe in most cases the incentive
is there for them to get it right.

In the rare case where there's some incentive to circumvent the system and
this has some measurable impact on a site, then more validation may be
warranted. Otherwise, why worry about it?

Also: HTML5. ;)

~~~
pornel
Retyped!? Grown-ups can read what they write.

Retyping only makes sense for a password field, which is obfuscated and doesn't
allow copy&paste.

~~~
jeff18
I would estimate about 0.25% of people will make a typo like "@homail.com" or
"@gmial.com"

Multiply that by say, 130,000 people, and you are dealing with 325 people who
don't receive their download, etc. and are not happy!

I think what would be really awesome is a regex that catches these common
typos and warns the user immediately.
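
A rough sketch of that idea, using a lookup table rather than a single regex. The typo list and the name `suggest_email_fix` are made up for illustration; a real list would have to come from your own bounce data.

```ruby
# Hypothetical table of common domain typos; the entries are illustrative only.
COMMON_DOMAIN_TYPOS = {
  "homail.com"  => "hotmail.com",
  "hotmial.com" => "hotmail.com",
  "gmial.com"   => "gmail.com",
  "gamil.com"   => "gmail.com"
}.freeze

# Returns a corrected address to offer the user, or nil if nothing looks wrong.
def suggest_email_fix(email)
  # Split on the last "@"; `at` is empty when the string has no "@" at all.
  local, at, domain = email.rpartition("@")
  return nil if at.empty?

  fixed = COMMON_DOMAIN_TYPOS[domain.downcase]
  fixed && "#{local}@#{fixed}"
end
```

The form would show something like "Did you mean bob@gmail.com?" and still let the user keep what they typed, since the table could be wrong about a legitimate domain.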

~~~
mhartl
True, but you have to balance that against the small but nonzero number of
people put off by an extra text field. Plus, I would find email repetition
more annoying if I didn't always do Cmd-A/Cmd-C/Tab/Cmd-V, and in this case
the repeated field won't catch any errors.

~~~
sjs
The fact that you know the shortcuts for select all, copy, and paste puts you
in the top percentile of users. Most people don't even know that's possible,
and certainly not with keyboard shortcuts.

(The point being that for most users the faster approach that requires less
thinking is to type it twice. Sometimes I do things the "slow" or "long" way
when coding because it doesn't require a mental shift from the task at hand.)

~~~
sesqu
_The point being that for most users the faster approach that requires less
thinking is to type it twice._

Spoken like a touch-typist with a short email address. I feel confident in
claiming that annabelle.t.johnson@woodandplaster.co.uk will be looking for
that copypaste button. It's not like copypaste is a new technological
innovation - it's practically the most used feature of personal computing,
after backspace.

~~~
sjs
Good point. sami.samhuri@gmail.com takes me 3 seconds but might take someone
else 10 seconds. Thanks for the perspective.

------
gojomo
Nothing is finished, nothing is permanent, and nothing is perfect.

In particular, one of the evaluation tests used here is wrong: it requires
failure-to-match on a TLD with a digit in it:

numbersInTLD@domain.c0m

In fact, IDN TLDs will have digits in them. An internet-draft is in the works
to replace RFC1123's IDN-unfriendly implication that digits in TLDs are
illegal:

<http://tools.ietf.org/html/draft-liman-tld-names-02>

~~~
sigzero
Saying "will have" and "in the works" means it isn't the standard yet, so the
current test is valid.

~~~
gojomo
A nice try at pedantry, which I would usually respect, but in the domain of
internet standards with which I am familiar, you are wrong.

The existing specs are in conflict, with the more recent ones (such as IDN)
allowing digits. Internet authorities, including ICANN, have enabled domains
with digits in TLDs; software which is far more foundational than any web-
app's email validation regex has been updated.

Registration of names in some of these digited-TLDs has begun; you can visit
these TLDs with your browser; your users can have functioning email addresses
on these TLDs.

If your app rejects such email addresses because of slavish compliance with
imprecise language in a 31-year-old RFC, you'd be the one violating prevailing
standards, which are a function of more than just formal IETF RFCs.

That the Internet-Draft I referenced may soon become an RFC is just cleaning
up loose ends on a change that's already happened. This final step isn't even
strictly necessary for the _de facto_ standard to have changed by consensus
among practitioners. Plenty of vibrant well-understood standards never reach
RFC status, nor pass through any formal standards body. The standard is
ultimately what people do, not what someone once-upon-a-time decreed.

------
pornel
Not perfect: doesn't support IDN without punycode. Doesn't support IDN TLDs at
all.

Users of <http://موقع.وزارة-الاتصالات.مصر> won't be pleased :)

~~~
maw
At the risk of sounding flip, not supporting punycode sounds like a feature.
Getting internationalization right is clearly important, but punycode as a
means of doing so? It's one of the many things that make me weep for my
industry.

In fact, I often wonder if punycode is a prank that got out of hand.

UTF-8, on the other hand, would have been excellent for this purpose.

~~~
pornel
DNS doesn't allow 8-bit characters, so UTF-8 is not an option. I think
punycode is more efficient than UTF-5 would be.

~~~
maw
True. It doesn't. But as far as I know, there is no deep technical reason why
it can't. The dots in domain names are '\0', and a few characters ('@', in
particular) need to remain reserved, but other than that, what stops UTF-8
from being used?

------
Periodic
It does not look like any of these can detect or were tested against quoted-
local-part addresses. As I understand it, the local part can be quoted to
allow illegal characters to be used, e.g. "John Doe"@example.com

I fully understand that these are not in common use, but they are part of the
RFC and may be in use somewhere.

~~~
Terretta
His test also considers the % sign invalid, yet I've both sent and received
email that required the % for the mail to be routed/relayed correctly (think
"in care of" or c/o).

Granted, that was 17 years ago, but who's to say it's not in use somewhere?

------
eli
Uh, what happens if a new TLD has more than 6 characters?

I don't get this preoccupation with making sure addresses _look_ valid. The
ONLY way to validate an email address is to send it a message.

------
generalk
I never understood the purposes of the "perfectly valid by the RFC" email
regex. You may be able to with 100% accuracy say that something _should_ be an
email address, but you'll never be able to tell if it's a valid account on the
server, or if the server even exists.

------
augustl
Email validation is user input validation; you are protecting the user from
some cases of erroneous input, and yourself from stuff that doesn't even look
like an email.

I find these large catch-all email regexps silly for two reasons:

1. They are hard to write, hard to understand, and hard to maintain.

2. Most importantly, they are difficult to understand for users. "You entered
an invalid email". Now what? the user asks.

This is why e-mail validation should be done in steps. Here is some Rails
pseudocode:

    
    
      validates_format_of :email,
        :with => /@/,
        :message => "Needs to contain an @."
      
      validates_format_of :email,
        :with => /\.[^\.]+$/,
        :message => "Has to end with .com, .org, .net, etc."
      
      validates_format_of :email,
        :with => /^.+@/,
        :message => "Must have an address before the @"
      
      validates_format_of :email,
        :with => /^[^@]+@[^@]+$/,
        :message => "Must be of the format 'something@something.xxx'"
    

Much easier to write, much easier to maintain, and much better error messages
to the users.
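
For anyone outside Rails, roughly the same stepwise idea in plain Ruby. This is only a sketch: the names `EMAIL_STEPS` and `first_email_error` are invented here, and the patterns are the ones above with `\A`/`\z` swapped in for `^`/`$` so they anchor to the whole string rather than to a line.

```ruby
# Each step pairs a pattern with the message shown when the pattern fails.
EMAIL_STEPS = [
  [/@/,                "Needs to contain an @."],
  [/\.[^\.]+\z/,       "Has to end with .com, .org, .net, etc."],
  [/\A.+@/,            "Must have an address before the @"],
  [/\A[^@]+@[^@]+\z/,  "Must be of the format 'something@something.xxx'"]
].freeze

# Returns the message of the first failing step, or nil when every step passes.
def first_email_error(email)
  step = EMAIL_STEPS.find { |pattern, _message| !pattern.match?(email) }
  step && step.last
end
```

Reporting only the first failure keeps the message specific, which is the whole point of splitting the check up.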

------
snprbob86
So many web services do this wrong, that it isn't even worth doing it right:
no one is going to complain that your service doesn't accept their "wacky!
quoted"@email.address

Sometimes, you aren't validating a whole string, you are searching for email
addresses in a sea of text, or an arbitrarily delimited, user-entered list of
contacts.

Support usernames with alphanumerics, dashes, underscores, periods, and plus
signs; Require a single @; Support domains with alphanumerics, dashes, and at
least one period. Screw anyone with something more complex than that. Done
deal.
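
Something like this, as a sketch of those rules and nothing more. The constant and method names are made up, and the character classes are ASCII-only by assumption.

```ruby
# Local part: alphanumerics, dashes, underscores, periods, plus signs.
# Domain: alphanumerics and dashes, with at least one period.
SIMPLE_EMAIL = /[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/

# Whole-string check: the entire input must be one address.
def simple_email?(text)
  /\A#{SIMPLE_EMAIL}\z/.match?(text)
end

# Pulls every address-looking token out of free text or a delimited list.
def extract_emails(text)
  text.scan(SIMPLE_EMAIL)
end
```

The same unanchored pattern serves both uses: wrap it in `\A`/`\z` to validate a field, or `scan` with it to fish addresses out of a contact list with arbitrary delimiters.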

~~~
huherto
I see your point. If somebody is using something weird, they probably have
problems everywhere.

------
andrewcooke
I have an implementation of RFC3696 <http://www.faqs.org/rfcs/rfc3696.html>
(which is the spec for validating emails) in Python here -
<http://www.acooke.org/lepl/api/lepl.apps.rfc3696-module.html>

That is part of Lepl - <http://www.acooke.org/lepl/> - and although it's
implemented in a recursive descent parser, much is compiled to regular
expressions for efficiency. So you get the best of all worlds: regexp
efficiency; parser accuracy; standards based.

A blog post on the compilation to regexps is here -
<http://www.acooke.org/cute/LEPLOptimi0.html>

------
aphyr
It's certainly more concise than my previous favorite,

    
    
          qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]'
          dtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]'
          atom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-' +
            '\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+'
          quoted_pair = '\\x5c[\\x00-\\x7f]'
          domain_literal = "\\x5b(?:#{dtext}|#{quoted_pair})*\\x5d"
          quoted_string = "\\x22(?:#{qtext}|#{quoted_pair})*\\x22"
          domain_ref = atom
          sub_domain = "(?:#{domain_ref}|#{domain_literal})"
          word = "(?:#{atom}|#{quoted_string})"
          domain = "#{sub_domain}(?:\\x2e#{sub_domain})*"
          local_part = "#{word}(?:\\x2e#{word})*"
          addr_spec = "#{local_part}\\x40#{domain}"
          pattern = Regexp.new "\\A#{addr_spec}\\z", nil, 'n'

~~~
billturner
This is the same one I've been using the last couple of years, and I wish I
could remember where I first came across it.

~~~
dkubb
From here maybe? <http://tfletcher.com/lib/rfc822.rb>

------
skoob
"Now you have two problems."

Seriously, I can't think of a single good reason why you would want to check
whether an email address is "valid". What you should be concerned about is
whether or not the address _works_ (and usually, can/does the person who just
signed up actually read and reply to email to that address).

Hypothetically, if an invalid address works (due to bugs in mail systems) --
then it works, and the only problem with accepting such an address is that the
bugs might get fixed. If an address is valid, there's no reason to assume that
it will work or that it belongs to the person who signed up. It isn't even a
good way of detecting typos; transpose two characters in an email address and
it will most likely still pass your validation.

~~~
mvalle
_Seriously, I can't think of a single good reason why you would want to check
whether an email address is "valid"!_

Because it's fun.

------
angelbob
It makes me happy that a contest of this kind is hosted at the domain
"fighting for a lost cause" :-)

------
jemfinch
How in the world did this get 172 points?

After this I almost stopped paying attention: "It's my philosophy that it's
better to accept a few invalid addresses than reject any valid ones, so I'm
shooting for 0 false-positives and as few false-negatives as possible."

But then I looked at the regexps and they miss an absolutely trivial fact:
valid email addresses can end in a dot. "jemfinch@supybot.com." is just as
valid as (more so, in fact, than) "jemfinch@supybot.com".

~~~
pbhjpbhj
>How in the world did this get 172 points?

Things don't have to be well done, nor do you have to agree with them for them
to be worth consideration/stimulating.

------
DanBlake
&*=?^+{}'~@12.34.56.78:2000 really looks strange, even though it's a valid
email.

Good luck trying to register on any site with it though :)

~~~
petercooper
What's the deal with using a port number in an e-mail address? I can't imagine
many systems supporting that. Anyone got any more info on it that doesn't
involve me digging through 101 pages of RFCs? :-)

~~~
tedunangst
I don't think that's valid. I think they misread the RFC. First of all, an IP
address (address-literal) is supposed to be enclosed in [square brackets]. The
colon is only allowed to designate address family, as in "foo@[IPv6:XXXXXX]".

------
fragmede
It thinks baddomain@300.0.0.1 is valid, and thinks decimaldomain@2130706433
(localhost) is not valid. It also thought
user@3ffe:1900:4545:3:200:f8ff:fe21:67cf as invalid.

------
perplexes
Um, UTF-8? Unicode domains? Regexing email is a waste of time.

------
mey
I assume this breaks with the new non-latin TLDs or IPV6

------
bengross
I wrote an article "Validating Email Address in Web Forms - The Hazards of
Complexity" that discusses several comprehensive email validation libraries
and common problems with complex methods of validation.

[http://www.messagingnews.com/onmessage/ben-gross/validating-...](http://www.messagingnews.com/onmessage/ben-gross/validating-email-address-web-forms-hazards-complexity)

------
edw519
As a typical optimizer, I was always trying to reduce my source code a few
more bytes or speed up my processes a few nanoseconds here and there. I was so
proud of myself. Until I tried to maintain my own slickness.

There's a fine line between clever and practical. Why do I have a gut feeling
that this approach is way over that line?

------
ryan-allen
I think a basic token based parser might be more appropriate for thorough
validation of email addresses. I see it similar to trying to match XML or HTML
with regexps, very common but so very easy to break. You can't go too wrong
parsing an address token by token, can you?

------
lazyant
AFAIK a naked IP address as domain like IPInsteadOfDomain@127.0.0.1 doesn't
work (at least in some mail servers like Postfix they are bounced with a "bad
address syntax" error). The address has to be like
IPInsteadOfDomain@[127.0.0.1] and I don't see this case covered.
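
A sketch of checking that case with Ruby's standard `ipaddr` library instead of a regex on digits. The method name `valid_address_literal?` is hypothetical, and requiring the `IPv6:` tag for IPv6 literals is my reading of the bracket syntax, so treat that as an assumption.

```ruby
require "ipaddr"

# Returns true when the domain part is a bracketed address literal such as
# [127.0.0.1] or [IPv6:::1] whose contents parse as an IP address.
def valid_address_literal?(domain)
  # Group 1 is the optional "IPv6" tag, group 2 the address text itself.
  m = /\A\[(?:(IPv6):)?([0-9A-Fa-f:.]+)\]\z/.match(domain)
  return false unless m

  ip = IPAddr.new(m[2]) # raises on out-of-range octets such as 300.0.0.1
  m[1] ? ip.ipv6? : ip.ipv4?
rescue IPAddr::Error
  false
end
```

Letting `IPAddr` do the parsing means an octet like 300 is rejected without any hand-written range checks.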

------
palish
Wow.

    
    
      /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?
      ]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|
      biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|
      pro|travel|mobi|[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.
      [0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i
    

I mean... Why do this? Just, why? It's almost unreadable.

Just write a short 20-line function which validates an email address. Use if
statements. Write comments. Then verify that your algorithm does in fact
handle all corner cases, just like the regexp does. (Your verifications will
be in the form of short, simple unit test case functions.)

To encourage regexp abuse like that is to encourage bad programming.
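
For illustration, roughly what such a function could look like in Ruby. The rules below are deliberately loose assumptions of mine (no quoted local parts, no IP domains, no trailing dot), not an RFC implementation, and `plausible_email?` is an invented name.

```ruby
# Loose plausibility check written as plain conditionals.
def plausible_email?(email)
  # Exactly one "@" separating a local part from a domain.
  parts = email.split("@", -1)
  return false unless parts.length == 2
  local, domain = parts

  # Neither side may be empty, and no whitespace anywhere.
  return false if local.empty? || domain.empty?
  return false if email =~ /\s/

  # Dots may not lead, trail, or double up in the local part.
  return false if local.start_with?(".") || local.end_with?(".")
  return false if local.include?("..")

  # The domain needs at least two non-empty labels.
  labels = domain.split(".", -1)
  return false if labels.length < 2 || labels.any?(&:empty?)

  true
end
```

Each rule is one line with a comment above it, so adding or dropping a corner case is a one-line change with a matching test.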

~~~
Groxx
(late, but...)

Frequently because Perl's regex library is _insanely_ fast. Faster than if
statements + smaller regex / roll-your-own.

(as long as there's no look-ahead / look-behinds. It's still fast then, but
custom functions can sometimes do better.)

------
kapranoff
There's a classic email regex from Jeffrey Friedl (author of "Mastering Regular
Expressions") himself:
<http://www.diablotin.com/librairie/autres/mre/chBB.html>

It's 6,598 bytes long.

------
dkubb
I've often wondered why we don't have any standardized test cases for email
validation. A simple list of known good, and known bad addresses to use as
test cases would really help.

You could almost make a game out of finding valid addresses that are not
matched, or invalid addresses that are false positives, or optimizing
candidate regexps. The list could be continuously growing as new variations
are discovered, and tested against previously submitted candidate regexps.
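
A minimal sketch of what the scoring side could look like. The two sample lists are placeholders I made up, far too small to be a real corpus, and `score_regexp` is a hypothetical name.

```ruby
# Hypothetical sample lists; a real corpus would be much larger and curated.
KNOWN_GOOD = ["user@example.com", "first.last@example.co.uk", "a@example.org"].freeze
KNOWN_BAD  = ["no-at-sign", "@example.com", "user@", "a@b@example.com"].freeze

# Lists the valid addresses a candidate rejects and the invalid ones it accepts.
def score_regexp(candidate)
  {
    rejected_valid:   KNOWN_GOOD.reject { |addr| candidate.match?(addr) },
    accepted_invalid: KNOWN_BAD.select  { |addr| candidate.match?(addr) }
  }
end
```

Returning the offending addresses rather than just counts makes it easy to see which new test case broke a previously submitted candidate.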

------
gry
I've accepted email validation is like security, better left to libraries by
people smarter than I am...TMail with over 2000 test cases.

But we know a valid email address when we see it, right?

\--

Edit: Nicely done, HTML5.

------
dkoch
This one in perl is fun as well:

<http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html>

~~~
fhars
And it is actually mostly correct, unlike the one in the article. (Has anyone
tested it with the new tld for egypt?)

------
mrb
NOT perfect, FAILS to accept valid addresses

Another know-it-all web developer guy who thinks he got his regex right...
These are valid addresses that he rejects: user@ua (.ua = Ukraine) user@km
(.km = Comoros) user@ne (.ne = Niger) Many ccTLDs have MX or A records
pointing to real MTAs.

------
nfnaaron
"Perfect email regex finally found"

Based on the HN title I thought it was going to be an article describing a
post-it found on Fermat's dressing table mirror.

Instead it's a list of mostly correct regexes. As Miracle Max might have
observed, mostly correct is partly imperfect, and partly imperfect is Not
Perfect.

------
avar
Here's one using modern Perl regex features. It's part of the perl regression
tests:
[http://github.com/mirrors/perl/blob/blead/t/re/reg_email.t#L...](http://github.com/mirrors/perl/blob/blead/t/re/reg_email.t#L13)

------
ck2
Well that's one less thing.

Now someone please make the perfect JSON regex decoder :-)

------
andrewvc
How ugly do non-regex based email validation functions look? I've never seen
one, but I've always wondered if that was a more elegant solution.

~~~
alnayyir
Regex is the only real sensible way to validate strings until something better
is found. Even if you just wrote code to do it manually, you'd really just be
writing a verbose and poorly implemented finite state machine that globbed
symbols together, which in the end, would just be inferior to writing a well
tested Regex string.

Regex can be easier to read if you have something do a graphical expansion for
you. Otherwise, it's write-once, read-never.

~~~
adamc
That's just silly. Depending on what's in the string, writing a parser might
be much better than a regex. Lots of parser libraries already out there, too.

~~~
aphyr
Especially when the spec itself is in EBNF.

~~~
_delirium
True, though it's EBNF with a bunch of explanatory text and annotations.

What would be interesting, but I can't find with some googling: Has someone
implemented a parser-generator based on the spec? The ideal would be that the
parser specification looks a lot like the RFC, since then you'd have more
confidence it was actually correct (and it'd be easier to maintain for future
changes).

~~~
twopoint718
There was a long post on the topic of 'regex-vs-parser' for email on reddit a
while back. I hope I'm pointing to the correct person, but he wrote a parser
in Haskell that validates against RFC5322:
<http://hackage.haskell.org/package/email-validate>

------
jholloway
I've been surprised at the number of forms that tell me my email address,
which has a single-letter local part, is invalid.

------
jimfl
Great! Now get to work on the perfect HTML regex.

(I know.)

------
tmountain
Something tells me this might not match admin@παράδειγμα.δοκιμή/Αρχική_σελίδα
(a valid UTF-8 domain name).

------
defdac
Fascination. Then acid reflux.

~~~
pavel_lishin
I believe that the image you're looking for to express your current emotional
state is
[http://uploads.postfarm.net/public/postfarm/uploads-2.0/i/if...](http://uploads.postfarm.net/public/postfarm/uploads-2.0/i/ifrewup.jpg)

------
javathehut
I'm so totally gonna steal this!

