The regular expression required to validate an email address

ratsbane · on Aug 18, 2008

But validating the syntax of an email address doesn't catch a lot of error cases, e.g. "someone@hotmal.com" instead of "someone@hotmail.com". Question: why don't email validation routines ever try to find out if the domain in question is actually running something on port 25 and, if it is, whether that's a valid account on that server? (I've never done that in an email validation routine either, but I have been thinking about it and wondering if it would be a good idea.)

damnfrenchy · on Aug 18, 2008

Because email servers stopped validating addresses on demand a long time ago. That was the easiest way to get a list of valid emails for spam. It used to be that you could even ask a server for a list of its valid addresses.

If you want to validate an email address, send an actual email to it and see if anyone who cares answers it. Don't even expect much of an error if you try to send an email to a bad address; a lot of servers will just send it to /dev/null directly.

ratsbane · on Aug 18, 2008

Validating the account on the server is a problem but that doesn't affect the usefulness of validating the hostname itself as something which can accept mail. E.g.: I just queried a database of about 40k email addresses in a production system for "hotmal.com" and found two. "Hotmal.com" isn't running anything on port 25 so it's likely that these people meant to enter "hotmail.com." I'm sure if we filtered this entire pool of addresses for misspellings we'd find a lot more. We could get something approaching 100% accuracy if we required users to receive an email and click on a link to verify; perhaps that's not a bad idea, but that often takes the user off of the page where he or she entered the address and breaks the process and we'd probably lose a few people. We're not going to get massive improvements in the process because it's already pretty good but even a small improvement is worth something.

Also, I realize that if an email account is running on a server in someone's kitchen, it may be down at one moment but still able to receive emails later. So I'm thinking more of an ajax call from the entry page to a validation routine which would check for obvious things - the hostname doesn't exist, isn't running, isn't running a mail server, is similar to known misspellings... and then alerts the user inline that there MAY be a problem... but still allows the user to proceed with the address entered.
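The "similar to known misspellings" part can be sketched without touching the network at all. This is only a rough illustration: the KNOWN_DOMAINS list, the suggest_domain name and the 0.85 cutoff are all assumptions made up for the example, not something from a production system.

```python
import difflib

# Hypothetical list of popular mail domains; a real site would maintain its own.
KNOWN_DOMAINS = ["hotmail.com", "gmail.com", "yahoo.com", "aol.com"]


def suggest_domain(address, known=KNOWN_DOMAINS, cutoff=0.85):
    """Return a likely intended domain, or None if nothing looks off.

    The result is advisory only: the caller should warn the user,
    not reject the address.
    """
    _, sep, domain = address.rpartition("@")
    if not sep or not domain:
        return None  # no domain part to compare
    domain = domain.lower()
    if domain in known:
        return None  # exact match, nothing to suggest
    # get_close_matches ranks candidates by similarity ratio (0.0 to 1.0).
    matches = difflib.get_close_matches(domain, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Since the function only ever returns a suggestion, the page can show "did you mean hotmail.com?" and still let the user keep what they typed.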

ketralnis · on Aug 18, 2008

Just connecting to port 25 isn't enough to verify this, however. The various email RFCs allow for temporary failure and retries of most operations, including connection. Additionally, you'd have to walk down every host listed in MX for the given domain (since it's not actually the machine "hotmail.com" that you connect to), which for hotmail.com look something like "mail1.msft.net".
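A sketch of what walking the MX list might look like, for what it's worth. The Python standard library has no MX lookup, so the lookup is passed in as a function (lookup_mx is a hypothetical parameter; a DNS library would have to supply it). The function names are made up for the example. A False result still only means "couldn't connect right now", for the retry reasons above.

```python
import socket


def can_connect(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def domain_may_accept_mail(domain, lookup_mx, probe=can_connect):
    """Return True if any mail host for the domain accepts a connection.

    lookup_mx is a caller-supplied function (an assumption of this sketch)
    that returns a list of (preference, hostname) pairs for the domain.
    """
    # Lower preference values are tried first.
    records = sorted(lookup_mx(domain))
    if not records:
        # With no MX records, SMTP falls back to the domain itself
        # (the "implicit MX" rule in RFC 5321).
        records = [(0, domain)]
    return any(probe(host) for _, host in records)
```

Because probe is also a parameter, the logic can be exercised with fakes and no network access.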

Really, the only way to verify an email address is to send mail to it.

cschneid · on Aug 19, 2008

The only foolproof way is to email it. But why not have a nice ajax widget which calls the server and checks if anything is open on the port, then displays a message saying "there doesn't appear to be an email server at <server>, please verify there aren't any typos"?

ratsbane · on Aug 19, 2008

Thanks - yes, that's just what I was thinking. I've never seen anyone do that and I wonder why. Any kind of input verification, regular expressions or whatever, is only going to catch some of the error cases, but the goal of the game is to catch as many as possible. The regular expression at the top of this thread is interesting but I don't think I'd use it. Instead I'd use a simpler regex to catch the most obvious errors (and disallow root@localhost, as someone pointed out) and couple it with an ajax check-the-mail-server thing. I think I'll try this.
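For the "simpler regex" half, something along these lines might do. It is a deliberately loose sketch, not an RFC-compliant pattern, and SIMPLE_EMAIL and looks_plausible are names invented for the example. Requiring at least one dot in the domain is what rules out root@localhost.

```python
import re

# One "@", no whitespace, and a domain made of at least two
# non-empty dot-separated labels. Deliberately loose.
SIMPLE_EMAIL = re.compile(r"[^@\s]+@(?:[^@\s.]+\.)+[^@\s.]+")


def looks_plausible(address):
    """Return True if the address passes the cheap sanity check."""
    return SIMPLE_EMAIL.fullmatch(address) is not None
```

Every domain label has to end at a dot and cannot contain one, so there is only one way to split the input and the pattern should not backtrack badly on long strings.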

dfox · on Aug 18, 2008

This is an interesting argument in the discussion about readability, usability and so on of regular expressions, but not something you want to use directly.

Almost nobody wants to validate an email address according to the SMTP-specified format, because almost anything is a valid SMTP address and most of the formats supported by the protocol specification are not usable in the real world (because they do not work or do something completely unexpected...)

Moreover, to use this monstrosity you need to perform some preprocessing on the address, because some aspects of an SMTP address cannot be meaningfully expressed by a regular expression. IIRC, you have to strip comments and normalize whitespace before you use this. It is probably a good idea to allow comments and real names in addresses posted to a website (for example, because if you paste an address from some MUA, you will often get it with a real name and comments), but this regex will not help you with that.
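As a rough illustration of that preprocessing step in Python: the standard library's email.utils.parseaddr splits a pasted address into a real name and the bare address, treating a parenthesized comment as the name. extract_address is a name invented for this sketch, and it only extracts; it does not validate anything.

```python
from email.utils import parseaddr


def extract_address(pasted):
    """Return (real_name, address) from text pasted out of a mail client.

    Whitespace runs are collapsed first; parseaddr then separates the
    display name or comment from the address itself. If parsing fails,
    parseaddr returns ('', '').
    """
    collapsed = " ".join(pasted.split())
    return parseaddr(collapsed)
```

The second element of the result is what you would then hand to whatever check you actually use.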

kogir · on Aug 18, 2008

Note: Run in production IFF you want a DoS opportunity on your hands :)

This is a perfect example of why not to use regular expressions to validate email addresses. It likely (1) still misses some edge cases, (2) is nearly impossible to verify, and (3) is exponentially expensive (see 2).

tptacek · on Aug 18, 2008

Probably not (though I get the point of the regex, and it's well taken). Here's a very readable reference:

http://cr.yp.to/immhf/addrlist.html

He also has a "mess822" C library (you could cons up bindings pretty quickly to your language du jour); knowing the source, it's likely to be the most bulletproof validator on the net.

sh1mmer · on Aug 18, 2008

This helpfully validates some valid emails you might not want, e.g. root@localhost, which is hardly helpful for a web site.