
How not to validate email addresses - swanson
http://mdswanson.com/blog/2013/10/14/how-not-to-validate-email-addresses.html
======
kbuck
I've found that the best ways for validating email addresses, in order, are:
checking for the '@' sign, resolving the hostname to the right of the '@' sign
to see if it has MX records (or an A record, since the specification
technically also allows sending mail to the server at the A record), and
sending an email to the address that includes a verification link that the
user must click.

The first two can be done without requiring any additional work for the user,
but people are so used to clicking verification links that they don't really
mind that either.
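
The first two checks could be sketched roughly as below. This is a minimal illustration, not kbuck's actual code: the function names are made up, and the default lookup only tests for an address record through the standard library. A real MX query would need a DNS library, which is left out here.

```python
import socket


def domain_resolves(domain):
    """Return True if the domain has at least one address record.

    Assumption: this only covers the A/AAAA fallback mentioned above.
    Looking up MX records needs a third-party DNS library.
    """
    try:
        socket.getaddrinfo(domain, 25)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def cheap_checks(address, resolves=domain_resolves):
    """Step 1: require an '@' with text on both sides. Step 2: resolve the host."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    return resolves(domain)
```

The `resolves` parameter is a hypothetical hook so the DNS step can be swapped out or stubbed in tests. The third step, the verification email, still has to happen separately.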

~~~
pbreit
And the advice here, which I agree with, is to skip your steps 2 & 3.

~~~
bichiliad
Eh, I don't think having a site that allows people to sign up on behalf of
other people easily is a good thing.

~~~
agency
As someone who owns {commonFirstName}{commonLastName}@gmail.com, I _hate_
services that don't require you to validate the e-mail address you sign up
with. And there's a special place in hell reserved for people who write
services which don't validate the e-mail you sign up with, but do require you
to sign in with your credentials in order to unsubscribe.

~~~
superbaconman
lol@gmail.com whoever owns this, hates their life.

~~~
reactor
Unfortunately no one can own that :) Google doesn't allow email id with less
than 5 chars (to mitigate random spamming)

~~~
elwell
how about asdf@gmail.com?

~~~
inthewind
Surely you mean aoeu@gmail.com?

------
RexM
Interestingly the email subscription box on the page goes against the
validation recommendations made in the post. Tried to subscribe with "Matt
Swanson <matt@mdswanson.com>" and was told it was an invalid email.

(I do understand that it is mailchimp and not the author)

~~~
swanson
I was wondering who signed me up for my own list :)

I like the "push the boundaries" thinking though! There is no reason why I
couldn't add the JS library to the form at the bottom.

------
yid
> People use Gmail's tag-syntax (i.e. matt+whatever@gmail.com) to sign up for
> stuff all the time. Are you allowing those?

Minor nit: this is not anything Gmail invented. This is RFC 5233 --
subaddressing:

[http://tools.ietf.org/html/rfc5233](http://tools.ietf.org/html/rfc5233)

------
robomartin
Having gone up and down this problem a number of times it is my opinion that
the only way to truly evaluate email address validity is with a fairly
elaborate state-machine based approach that provides you with feedback as to
what is wrong in order to decide how to deal with it (or not). Here's one
example:

[https://github.com/dominicsayers/isemail](https://github.com/dominicsayers/isemail)

The regexes floating around out there are horrible.

Validating email addresses doesn't necessarily mean that you affect the user's
experience. I think of it as an opportunity to avoid losing a potential
customer due to a silly mistake. One such example would be a one page sign-up
site where you are trying to collect the email addresses of those interested
in your offering. In this context it is important to try and catch errors. You
have a visitor who wants to keep in touch with you. He or she mistypes the
email address. If you don't detect it you might lose them forever.

Granted, not all errors are detectable. If someone types jeo@example.com vs.
joe@example.com there's precious little you can do about it in terms of
automated detection.

You can accept obviously bad email addresses, store them in your database and
simply tag them as such. This is where ML or human intervention might be able
to fix the problem or choose to discard it. Email list pollution can be dealt
with in other ways, for example, if you use this list to reach out to
prospective customers bad emails will simply bounce.

In the end what is important is to avoid losing real potential customers as
much as possible. I think a little software-based verification along with
giving the user the opportunity to catch the mistake is enough. All the junk
easily falls through the cracks of a multi-stage filter after the fact.

------
fnordfnordfnord
Someone should make this guy King of the Internets.

Swanson: I need you to take a look at address forms. I don't want to enter my
city and state any more after I've given my zip code.

PS Yes, I know that not everyone lives in the US.

PPS Yes, I've heard about some places where a single zip code serves two
cities. Edge cases, there will always be one or two.

~~~
zarify
I don't know about the US, but in Australia post codes (our zip codes)
regularly service more than one town (or principality) when you're rural, and
if you're even vaguely familiar with Australia we have a _lot_ of rural :)

I've seen this handled quite well by a number of sites - you enter your post
code and they'll just drop down a list of all the places it matches. Choose
your town/suburb and you're done!

The ridiculous part is that they'll often ask you to enter your state as well,
which you can derive from the first digit of the post code :/

~~~
jpatokal
Re: deriving state from first digit of post code, that's true 90% of the time
but the exceptions -- some of which are pretty big, eg. the entire Australian
Capital Territory -- will kill you.

[https://en.wikipedia.org/wiki/Postcodes_in_Australia#Austral...](https://en.wikipedia.org/wiki/Postcodes_in_Australia#Australia_States_and_territories)

~~~
zarify
Oh interesting, not having lived in the NT or ACT I didn't know about that
one.

------
pyre
It's worth noting that dealing with email addresses like this affects other
parts of the site. For example, I was signed up with Zappos with a
blah+zappos@example.com email. Everything worked perfectly other than some of
the links in the email, which didn't escape the '+', meaning that it was
interpreted as a space. E.g.

    
    
      "https://example.com/unsubscribe?email=me+example@example.com"
    

vs.

    
    
      "https://example.com/unsubscribe?email=me%2Bexample@example.com"
    

[ On the plus side, Zappos was really responsive, and fixed the issue when I
reported it. ]
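
For illustration, here is a small Python sketch of the difference between the two links above. The base URL comes from the example; the helper name is made up.

```python
from urllib.parse import parse_qs, urlencode

BASE = "https://example.com/unsubscribe"  # URL from the example above


def unsubscribe_link(email):
    """Build the link with the address percent-encoded ('+' becomes %2B)."""
    return BASE + "?" + urlencode({"email": email})


link = unsubscribe_link("me+example@example.com")

# A raw '+' in a query string decodes as a space, which is the bug described.
broken = parse_qs("email=me+example@example.com")["email"][0]

# The encoded link round-trips back to the original address.
fixed = parse_qs(link.split("?", 1)[1])["email"][0]
```

Note that `urlencode` also encodes the '@' as `%40`, which decodes back to '@' just the same.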

------
epynonymous
i think validation should be done via sending an email to the email address on
hand and then requiring the end user to click a link in the email to activate.
for one, this validates that it's the actual user's account and then this also
validates the address form without having to be pre-emptive at the start.
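
A rough sketch of that flow, assuming an in-memory store in place of a real database and leaving the mail-sending step out entirely. All names here are hypothetical.

```python
import secrets

pending = {}       # token -> address awaiting confirmation (stand-in for a table)
confirmed = set()  # addresses whose owner clicked the link


def start_verification(address):
    """Create an unguessable token to embed in the emailed link."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return token


def confirm(token):
    """Handle a click on the link. Each token works at most once."""
    address = pending.pop(token, None)
    if address is None:
        return False
    confirmed.add(address)
    return True
```

The address is never parsed at all here. It counts as valid only once someone with access to the mailbox presents the token.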

~~~
swanson
Yeah it is kind of like authentication vs authorization when managing users.
Just because a user is correctly logged in, does not mean that they should
have access to something in the system. Just because a valid email is entered,
does not mean that the person entering it is the owner of said email. Though
these two concerns are often conflated!

~~~
inthewind
And this could be abused. I sign up to a popular service with someone else's
email address. I could be a nuisance and block them from signing up as
themselves, even if it is temporary.

------
gpvos
I am wondering about the following scenario when you allow _anything_ to be
entered, but do require validation: what if I entered "myaddress@mysite.org,
someone-else@elsewhere.org". The mail will be sent to both addresses, and if
I'm faster than the other guy (probably, because I'm expecting the email), I
can basically sign someone else up for whatever it is. Okay, it's trivially
easy to find out that I did it (but I could use a throwaway address), but you
may want to prevent this scenario anyway to prevent harassment.
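
One possible guard against this, sketched with Python's standard `email.utils.getaddresses`: parse the field as an address list and refuse anything that does not yield exactly one address. The function name is made up, and this is only one way to do it.

```python
from email.utils import getaddresses


def single_address(raw):
    """Return the address if the input holds exactly one, otherwise None."""
    parsed = [addr for _name, addr in getaddresses([raw]) if addr]
    if len(parsed) != 1:
        return None
    return parsed[0]
```

With this check, the two-address input from the scenario is rejected before any mail goes out, while a plain single address passes through unchanged.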

------
languagehacker
The current email validation approach in MediaWiki allows for email addresses
like "foo@bar", instead of "foo@bar.com". This ended up breaking unit tests at
Wikia when we upgraded versions last year.

I originally thought this was a bug. But if you think about it, MediaWiki is
capable of being deployed on an internal network. An internal wiki could
actually interact with email addresses only available on a given intranet
server, and not reachable from a given TLD.

~~~
Xylakant
some NICs or registrars (don't remember exactly) use <user>@<tld> as e-mail.
That's valid, so you can get your scenario even on the wild internet :)

------
bdcravens
With the revelation that the NSA collects email addresses by the millions,
you'd think they could offer a service where they validate the existence of
addresses.

[http://www.washingtonpost.com/world/national-security/nsa-co...](http://www.washingtonpost.com/world/national-security/nsa-collects-millions-of-e-mail-address-books-globally/2013/10/14/8e58b5be-34f9-11e3-80c6-7e6dd8d22d8f_story.html)

------
javanix
Failing to figure out how to send an email to an individual when they sign up
may be a problem.

In the case of validation I tend to look at it as "What is the minimum I can
check for to ensure that I can get the data that I need out of this form?"

Too much and it becomes arduous to sign up, too little and the app ends up
trying to send an order confirmation to "Matt Swanson" instead of
"matt@mdswanson.com"

------
bluedino
I'd check the TLD as well.

Going through the error logs on our mail server, there are a lot of people out
there that get their email address wrong even if you have them type it last.
.cmo instead of .com, transposing letters in their name, spelling their
company name wrong...
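
Rather than rejecting an unknown TLD, one option is to suggest a correction and let the user decide. Below is a rough sketch using Python's `difflib`. The three-entry TLD list and the 0.6 cutoff are illustrative assumptions, not a recommendation.

```python
from difflib import get_close_matches

KNOWN_TLDS = ["com", "net", "org"]  # illustrative only, nowhere near complete


def suggest_tld_fix(address):
    """Return a 'did you mean' address for a near-miss TLD, otherwise None."""
    head, dot, tld = address.rpartition(".")
    tld = tld.lower()
    if not dot or tld in KNOWN_TLDS:
        return None
    matches = get_close_matches(tld, KNOWN_TLDS, n=1, cutoff=0.6)
    return f"{head}.{matches[0]}" if matches else None
```

Because this only ever produces a suggestion, a TLD that is missing from the list (such as .me) is still accepted as typed.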

~~~
inthewind
Isn't this pointless with the wholesale of TLDs?

~~~
inthewind
Also sometimes you don't even need the TLD:

user@myhostname, is a valid email address, and yet it's rejected by a lot of
libraries.

~~~
dfox
It's certainly a valid address, but why would you want to input that as an
email address into a public-facing service that you do not operate?

~~~
inthewind
Perhaps I do operate it. Can be very useful on say a dev box.

------
bichiliad
For those who still want email verification, I believe Mailgun offers a pretty
good API.

[http://blog.mailgun.com/post/free-email-validation-api-for-w...](http://blog.mailgun.com/post/free-email-validation-api-for-web-forms/)

------
tedsanders
Sometimes it's important to only allow emails that work with your other services. I
once signed up for an EA account using an email with a + symbol, and it was a
horrible experience because some parts of their system couldn't handle +
symbols.

~~~
zarify
My favourite has been signing up for a couple of games - I think CubeWorld was
one and some generic MMO I was trying out was another, where the account
creation process handled a '+' email fine but the game client didn't handle it
and gave a login error.

The best part of that was that the MMO account didn't allow me to change my
email address.

------
fmax30
I actually always use this [1] regex for checking emails. It is supposed to
follow the RFC 5322 standard. Why is it wrong to get emails through this?

[1].

    (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

------
Fomite
I _just_ had my email address rejected by a validator, I expect because it was
a user@thing.thing2.domain format - and yes, that company lost a customer,
because it punted them into 'meh, can't be bothered'.

------
Continuous
To add to the conversation. There is a way to check if a mailbox exists using
SMTP. It works on Gmail and several other servers.

Python/PHP Code and explanation is here
[http://www.webdigi.co.uk/blog/2009/how-to-check-if-an-email-...](http://www.webdigi.co.uk/blog/2009/how-to-check-if-an-email-address-exists-without-sending-an-email/).

It was eye opening to understand the underlying SMTP protocol. There are some
pitfalls too as mentioned in the article.

~~~
pjc50
This works almost everywhere, except Microsoft Exchange.

------
cheald
Why exactly is an RFC 2822 parser wrong? It is _by definition_ the right way
to validate an email address.

The RFC specs emails as a CF grammar, not a regular grammar, which is why
validating all possible emails with regexes is so hard. Use a parser and call
it done.

If your goal is "don't prevent a user from signing up" then why validate
emails at all? Why not just accept anything, and whinge at them after you've
already captured them in your system?

------
arb99
This topic comes up again and again on here.

Like others have said most popular web app languages have some email
validation built in (like php has the FILTER_VALIDATE_EMAIL filter).

~~~
desas
I remember looking at the PHP built-in option a year or so ago, there are bugs
filed against that because it's too strict.

------
TomGullen
Who are these people who have whack email addresses trying to sign up anyway?
They must find the web a pretty difficult place to navigate. Has anyone
actually come across a real world scenario where a user has an email address
on the cutting edge of RFC specification? Does anyone capture these email
addresses? I've never come across one before in the wild. Is this a problem
that actually exists?

~~~
eCa
For some it exists:

[http://www.youtube.com/watch?v=JENdgiAPD6c](http://www.youtube.com/watch?v=JENdgiAPD6c)

------
guelo
Also, screw the "Confirm your password" cargo-cult. Email and password that's
it. Send a verification email and you're registered.

Another one, using the stupid asterisk character hiding password input field.
It's user-hostile, especially on mobile. No one is looking over my shoulder,
and if they are I'll take care of it myself, thank you very much.

~~~
adnrw
Your two points are related: because the password is masked and you can't see
what you're typing to spell-check, it's worth asking the user to type the
password again to make sure they spelled it correctly the first time.

That being said, I completely agree with you – I don't think there's much
validity in masking the password field except maybe when it's auto-filled by
the browser.

We have tested turning off the masking on various sites we've developed and in
general users tend to freak out and think the site is insecure as a result.

~~~
chinpokomon
Yup, makes perfect sense to me why that design doesn't go over for a public
accustomed to seeing the password mask. It would be neat if that could be a
preference set in the OS for power users, but I can see that being abused and
I see it as in no way compatible with legacy sites and applications. You couldn't
make it the default, because it drops a degree of security for all users of
the OS. And it wouldn't get adopted widely enough for it to be worth the
effort, being more expensive to support for developers.

------
inthewind
If you designated the input type as email, at least the browser could then
take over the responsibility, not necessarily of the validation, but possibly
of the suggestion that you might have mistyped or have an invalid email
address. Surely that's preferable to every site writing their own code.
Browsers should be web helpers!

------
latraveler
All the points you make are excellent. I've always leaned to the 'greedy'
regex, but never thought to articulate exactly why.
[http://www.radiumcrm.com](http://www.radiumcrm.com)

------
Semaphor
Recently I had a site reject my email address. The only thing I can imagine is
that they used a really old whitelist of TLDs and .me wasn't included yet…

------
vezzy-fnord
Some languages also have useful functions in their standard libraries for
this. I know Python has email.utils.parseaddr().
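
As a quick illustration of what `parseaddr` does with the "Name <address>" form mentioned elsewhere in the thread, here is a minimal sketch. The wrapper function is made up, and the '@' test is just the bare minimum the article argues for.

```python
from email.utils import parseaddr


def extract_address(raw):
    """Pull the bare address out of a 'Name <addr>' string, or return None."""
    _display_name, address = parseaddr(raw)
    return address if "@" in address else None
```

For "Matt Swanson <matt@mdswanson.com>" this returns just the address part, so a form could accept the long form instead of rejecting it.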

------
AsymetricCom
Wow, when it comes time to validate user input, don't! What a clever solution!

~~~
Widdershin
What's your solution?

~~~
AsymetricCom
Let's keep validating user input. Even if it's "hard" i.e. uses a regex...

Sure there's better ways to build a regex than to hard code it into each
method, but nevermind that, let's just accept whatever comes through the pipe
into our barely tested (in production) and highly insecure frameworks, as long
as it contains an '@'.

No matter what solution I propose, It's better than this "nonsolution" because
it's a solution.

~~~
johnzabroski
You are missing the point. Perhaps it will help if I frame the problem for you
differently: scrubbing bad data, and developing policies for minimizing bad
data in your system.

Not all validation needs to happen at the time of data entry.

Your "hard" regex may reject international email addresses. In fact, if your
regex's input isn't converted to Punycode first, you are a fool for even
attempting to use regex, because now your regex will likely fail on all IDNA
inputs.
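
To make the Punycode point concrete, here is a small sketch using Python's built-in `idna` codec, which implements the older IDNA 2003 rules. It converts only the domain part and leaves the local part alone. The function name is hypothetical.

```python
def to_ascii_address(address):
    """Convert the domain part of an address to its ASCII (Punycode) form."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no '@' in address")
    # The stdlib codec follows IDNA 2003; newer rules need a separate library.
    return local + "@" + domain.encode("idna").decode("ascii")
```

An ASCII-only regex applied to "user@bücher.example" would fail on the raw input, while the converted form "user@xn--bcher-kva.example" contains only characters such a pattern expects.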

And what is your test suite going to be?

And what did you "validate" exactly? That you matched your regex? What if the
e-mail isn't active, or the mailbox is full? Outlook 2013 actually has this
really cool feature called MailTips that provides more advanced mailing list
and e-mail address validation and warnings:
[http://blogs.technet.com/b/exchange/archive/2009/04/28/34073...](http://blogs.technet.com/b/exchange/archive/2009/04/28/3407377.aspx)

Suppose when you first signed up the user, they validated their email address,
but now the account seems to be inactive. How do you handle that scenario?
Continuous validation.

And how generally useful is your regex? What are you going to do if the email
came from OCR software output, or screen scraping output? Your ERP may have
the original document it was scanned from. Are you going to not store the bad
e-mail address simply because you wrote some "hard" regex that rejected it?
Not a straightforward question to answer, as it depends on your data model
for storing addresses. You might have a column IsConfirmed.

Here is another example of "continuous validation". Validating mailing
addresses. Most major e-commerce sites allow very liberal input, but scrub the
data in real time or near real time, because the postal service gives
discounts to companies that print "correct" address labels. "Correct" here
could mean "One Post Office Square, Boston, MA, 02109" instead of "1 Post
Office Sq, Boston, MA, 02109". This process is called Address Standardization,
and in areas of the world with rapidly growing economies, often times Address
Standardization vendors are behind, because some "streets" don't have
addresses yet and aren't known to exist in any GPS system. This is common in
many parts of China.

Here is another example of "continuous validation". How Google does spell
checking, as compared to the "fixed validation" in Microsoft Word's spell
checker.

~~~
AsymetricCom
All these other questions are dependent on context and out of scope of the
argument.

The article argument in a nutshell is that validating email is hard, so don't
bother, in fact, let users submit whatever they want including javascript.
Then just check for @ and send it off to your next parser in the chain, in
fact get lots of 3rd party parsers for misc features and send data to them
first. spend effort fixing autocomplete so users can enter data easier that
you will automatically accept. I'm sure this can only improve data quality...

I can imagine that wanting to know all the stupid shit your users submit as an
email is the correct solution in certain contexts, but for a majority of
cases, this article is wrong in everything that it suggests. Admittedly, there
is very little context given.

Perhaps the context is "I don't care about security of my users or my
services, and I will run whatever 3P code on my backend that appears to do the
job of making a webpage look spiffy and easy to use. Once I have 10 Million
(unverified) users, you sell your spaghetti factory and it's no longer your
problem."

After all that, he recommends not letting people use software without a
validated email address. Too bad he never bothers saying how he would get to
that point, only how he would avoid doing the work.

