

The regular expression required to validate an email address - bdfh42
http://thefrozenfire.com/data/emailregex.txt

======
ratsbane
But validating the syntax of an email address misses a lot of error
cases, e.g. "someone@hotmal.com" instead of "someone@hotmail.com". Question:
why don't email validation routines ever try to find out if the domain in
question is actually running something on port 25 and if it is, try to find
out if that's a valid account on that server? (I've never done that in an
email validation routine either but I have been thinking about it and
wondering if it would be a good idea.)

~~~
damnfrenchy
Because email servers stopped validating addresses on demand a long time
ago. That was the easiest way to get a list of valid emails for spam. It used
to be that you could even ask for a list of valid emails for a server.

If you want to validate an email address, send an actual email to it and see
if anyone who cares answers it. Don't even expect much of an error if you send
to a bad address; a lot of servers will just send those messages to /dev/null
directly.

~~~
ratsbane
Validating the account on the server is a problem but that doesn't affect the
usefulness of validating the hostname itself as something which can accept
mail. E.g.: I just queried a database of about 40k email addresses in a
production system for "hotmal.com" and found two. "Hotmal.com" isn't running
anything on port 25 so it's likely that these people meant to enter
"hotmail.com." I'm sure if we filtered this entire pool of addresses for
misspellings we'd find a lot more. We could get something approaching 100%
accuracy if we required users to receive an email and click on a link to
verify; perhaps that's not a bad idea, but that often takes the user off of
the page where he or she entered the address and breaks the process and we'd
probably lose a few people. We're not going to get massive improvements in the
process because it's already pretty good but even a small improvement is worth
something.

Also I realize that if an email account is running on a server in someone's
kitchen it may be down at one moment but still able to receive emails so I'm
thinking more of an ajax call from the entry page to a validation routine
which would check for obvious things - the hostname doesn't exist, isn't
running, isn't running a mail server, is similar to known misspellings... and
then alerts the user inline that there MAY be a problem... but still allows
the user to proceed with the address entered.
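
A minimal sketch of the "similar to known misspellings" part of that check, in Python. The `KNOWN_DOMAINS` list, the `suggest_domain` name and the 0.85 cutoff are all assumptions made for illustration, not anything from a real system; it only produces a hint and never rejects the address.

```python
import difflib

# Hypothetical list of popular mail domains; a real site would keep its own.
KNOWN_DOMAINS = ["hotmail.com", "gmail.com", "yahoo.com", "aol.com"]


def suggest_domain(address, known=KNOWN_DOMAINS, cutoff=0.85):
    """Return a likely intended domain, or None if nothing looks off."""
    if "@" not in address:
        return None
    # Everything after the last "@" is treated as the domain.
    domain = address.rpartition("@")[2].lower()
    if not domain or domain in known:
        return None
    # Closest known domain by difflib's similarity ratio, if close enough.
    matches = difflib.get_close_matches(domain, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The page could then show something like "did you mean someone@hotmail.com?" next to the field while still letting the user submit what they typed.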

~~~
ketralnis
Just connecting to port 25 isn't enough to verify this, however. The various
email RFCs allow for temporary failure and retries of most operations,
including connection. Additionally, you'd have to walk down every host listed
in MX for the given domain (since it's not actually the machine "hotmail.com"
that you connect to), which for hotmail.com looks something like
"mail1.msft.net".

Really, the only way to verify an email address is to send mail to it.
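
For what it's worth, walking the MX hosts could be sketched roughly as below. This assumes the MX records have already been looked up by some other means (Python's standard library has no MX resolver), and the names `port_25_open` and `first_reachable_mx` are made up for this example. A `None` result only means nothing answered right now, which, given temporary failures and retries, is a hint and not proof.

```python
import socket


def port_25_open(host, timeout=5.0):
    """True if a TCP connection to host:25 succeeds within the timeout."""
    try:
        with socket.create_connection((host, 25), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or the name did not resolve
        return False


def first_reachable_mx(mx_records, probe=port_25_open):
    """Return the first MX host that answers, or None.

    mx_records is an iterable of (preference, hostname) pairs, as a DNS
    MX lookup would give them; lower preference values are tried first.
    """
    for _preference, host in sorted(mx_records):
        if probe(host):
            return host
    return None
```

The `probe` parameter is there so the walk can be exercised without touching the network.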

~~~
cschneid
The only _foolproof_ way is to email it. Why not have a nice ajax widget which
calls the server and checks if anything is open on the port, then displays
a message saying "there doesn't appear to be an email server at <server>,
please verify there aren't any typos"?

~~~
ratsbane
Thanks - yes, that's just what I was thinking. I've never seen anyone do that
and I wonder why. Any kind of input verification, regular expressions or
whatever is only going to catch some of the error cases but the goal of the
game is to catch as many as possible. The regular expression at the top of
this thread is interesting but I don't think I'd use it - instead use a
simpler regex to catch most obvious errors (and disallow root@localhost, as
someone pointed out) and couple it with an ajax check-the-mail-server thing. I
think I'll try this.
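
A rough sketch of what such a "simpler regex" might look like in Python. The pattern and the `looks_plausible` name are assumptions for illustration; it is deliberately far looser than the RFC grammar. Requiring at least one dot in the domain is what rules out root@localhost.

```python
import re

# One "@", no whitespace, and a domain made of two or more dot-separated,
# non-empty labels. This is intentionally not the full RFC grammar.
SIMPLE_ADDRESS = re.compile(r"[^@\s]+@([^@\s.]+\.)+[^@\s.]+")


def looks_plausible(address):
    """True if the address passes the loose shape check above."""
    return SIMPLE_ADDRESS.fullmatch(address) is not None
```

Anything that passes would then go on to the mail-server check; anything that fails gets an inline warning.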

------
dfox
This is an interesting argument in a discussion about the readability,
usability and so on of regular expressions, but not something you want to use
directly.

Almost nobody wants to validate an email address according to the
SMTP-specified format, because almost anything is a valid SMTP address and
most of the formats supported by the protocol specification are not usable in
the real world (because they do not work or do something completely
unexpected...)

Moreover, to use this monstrosity you need to perform some preprocessing on
the address, because some aspects of an SMTP address cannot be meaningfully
expressed by a regular expression. IIRC, you have to strip comments and
normalize whitespace before you use this. It is probably a good idea to allow
comments and real names in addresses posted to some website (for example
because if you paste an address from some MUA, you will often get it with a
real name and comments), but this will not help you with that.
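
As an illustration of that preprocessing step, Python's standard library can pull the bare address out of a pasted string with `email.utils.parseaddr`. The `extract_address` wrapper is a hypothetical name for this sketch; it only separates the address from the real name and comments and does not validate anything.

```python
from email.utils import parseaddr


def extract_address(pasted):
    """Return the bare address from a pasted string, or None if none found."""
    _real_name, address = parseaddr(pasted)
    return address or None
```

For example, `extract_address("Jane Doe <jane@example.com>")` gives just the address part, which can then be handed to whatever check comes next.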

------
kogir
Note: Run in production IFF you want a DoS opportunity on your hands :)

This is a perfect example of why not to use regular expressions to validate
email addresses. It likely (1) still misses some edge cases, (2) is nearly
impossible to verify, and (3) is exponentially expensive (see 2).

------
tptacek
Probably not (though I get the point of the regex, and it's well taken).
Here's a very readable reference:

<http://cr.yp.to/immhf/addrlist.html>

He also has a "mess822" C library (you could cons up bindings pretty quickly
to your language du jour); knowing the source, it's likely to be most
the most bulletproof validator on the net.

------
sh1mmer
This helpfully validates some valid emails you might not want, e.g.
root@localhost, which is hardly helpful for a web site.

