
The Correct Way to Validate Email Addresses - amk_
https://hackernoon.com/the-100-correct-way-to-validate-email-addresses-7c4818f24643#.cswwiflh9
======
Benjamin_Dobell
The number of websites that try to reject my email address with a + in it, ugh!

Surprisingly, the validation is often done 100% client-side anyway, and simply
modifying the incorrect regex lets my email address through... If I wreaked
havoc on your back-end, then it's your fault for sucking ;)

~~~
ryandrake
Even worse is rejecting my password because it has a + in it! Why do you as a
business care what my random password generator spit out??

Scarier still is when it's a server-side response that rejects my password for
its contents...

~~~
pwg
Caring what characters are in the password heavily implies that the site is
not hashing the plaintext password in any way, and scarier still, may just be
storing the plaintext password as plain text.

Why: Because if they were (at least) hashing it the output from the hash would
be a binary string in which case they would have to be 8-bit clean through to
the DB column where the hash output resided, and then there would be no reason
to care what character was in the input.
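To illustrate the point about hashing, here is a minimal sketch using Python's standard `hashlib.pbkdf2_hmac`. The function name `hash_password`, the salt size, and the iteration count are assumptions chosen for the example, not anything a particular site is known to use. Whatever characters go in, the output is a fixed-length byte string.

```python
import hashlib
import os

def hash_password(password, salt=None):
    """Return (salt, digest) for a password; hypothetical helper for illustration."""
    if salt is None:
        salt = os.urandom(16)  # random per-user salt (size is an assumption)
    # The password is encoded to bytes first, so '+', quotes, '%' and any
    # other character are all just input bytes to the hash.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest

salt, digest = hash_password("pa+ss'%\"word")
# digest is always 32 bytes; digest.hex() is plain ASCII if the column is text.
print(len(digest), digest.hex())
```

Storing `digest.hex()` (or a base64 form) sidesteps the 8-bit-clean concern, since the stored value then only contains ASCII characters.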

Caring specifically about a + in a password also implies that their
authentication might be set up internally as an HTTP endpoint with URL encoding
of a "password=" form variable (because + decodes to a space in URL-encoded
form data, while % is the escape for hex-encoded characters).
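A small sketch of how a + can get mangled on the way through a form-encoded request, using Python's standard `urllib.parse`. In form encoding a literal + decodes to a space, so a body built by naive string concatenation loses it. The helper names `naive_body` and `encoded_body` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

def naive_body(password):
    # Hypothetical buggy client: pastes the raw password into the form body.
    return "password=" + password

def encoded_body(password):
    # urlencode percent-encodes '+' as %2B, so it survives decoding.
    return urlencode({"password": password})

print(parse_qs(naive_body("correct+horse")))    # the '+' comes back as a space
print(parse_qs(encoded_body("correct+horse")))  # the '+' is preserved
```

A site that hits the first case has a bug in its own encoding, and banning the character only hides it.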

Both, of course, imply a lack of proper secure design in their password
handling.

~~~
brianwawok
> Caring what characters are in the password heavily implies that the site is
> not hashing the plaintext password in any way, and scarier still, may just
> be storing the plaintext password as plain text.

I don't think that is true at all.

I may very well want to put a few simple rules I validate serverside, such as

1) No username in password

2) No email in password

3) None of the 100 most common passwords

All of which require me to look at the text for your password, none of which
mean I am storing it in plaintext.
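Rules like these can run on the plaintext at signup time, before the password is hashed and discarded. A rough sketch follows; the function name, the messages, and the tiny `COMMON_PASSWORDS` set are placeholders (a real list would be loaded from a file), and the third rule is read here as an exact match against the list.

```python
# Placeholder list; stands in for a real "100 most common passwords" file.
COMMON_PASSWORDS = frozenset({"123456", "password", "qwerty"})

def password_problems(password, username, email):
    """Return a list of rule violations; an empty list means acceptable."""
    problems = []
    lowered = password.lower()
    if username and username.lower() in lowered:
        problems.append("contains username")
    if email and email.lower() in lowered:
        problems.append("contains email")
    if lowered in COMMON_PASSWORDS:
        problems.append("too common")
    return problems

print(password_problems("alice+2016!", "alice", "alice@example.org"))
```

None of the checks care which characters appear in the password, so they are compatible with accepting +, ', % and friends.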

~~~
Zancarius
> I don't think that is true at all.

I've run into a couple of sites that reject passwords containing ', %, and
other special characters that suggest there may be some truth to it.

If you're scrubbing input as if it's about to be inserted into a database via
SQL, then there are really only two possibilities. Either a) you're running
legacy code that still does the check and does blind escaping (which has its
own set of implications) or b) you really truly are storing passwords as plain
text.

~~~
andreareina
Or they're cargo-culting on decades of experience where special characters are
verboten.

~~~
Zancarius
Hah, good point. Although I do have to wonder: When does it turn into
filtering by force of habit? After all the legacy cruft has long been
forgotten and is no longer maintained?

~~~
Freak_NL
Because the other monkeys will chew you out if you start doing it differently
all of a sudden. Nobody knows exactly why we're doing it the way we are doing
it, but it is complex, and changing it might break something.

It's the _five monkey experiment_ :

[http://johnstepper.com/2013/10/26/the-five-monkeys-experimen...](http://johnstepper.com/2013/10/26/the-five-monkeys-experiment-with-a-new-lesson/)

~~~
mgkimsal
It's not always that.

Let's use the "filtering chars from password" example above. You can't put
some special chars in password field, and you want to change that so it's
doing normal hashing where special chars don't matter.

In a larger org, even changing a practice like that so that it "makes more
sense" can have a big ripple effect.

You have to

* explain to someone else on the team who came up with the original process that it's flawed (and why)

* explain to other dept that they need to update their testing process (and why)

* get support dept to change their language/process

* change outbound messaging in all affected places (perhaps with code you can't touch, involving other teams)

* possibly have a flag that deals with 2 versions of data

Even if your change brings you in to line with normal/safe practices, you may
have to fight multiple inane battles, spend loads of time and political
capital, and at the end of the day, you'll be able to also accept a !"@+'$ in
a password field? Most people will not grasp the bigger issue at play.

~~~
Freak_NL
At a small business convincing colleagues isn't that hard. At a larger
corporation, getting the rest of the team on-board is the job of whoever is in
charge of defining security policies and such. Not allowing characters present
on all keyboards and input devices (!"@+'$) statistically increases the risk
of people picking weaker passwords than possible, and he/she will guide that
change through the proper processes, for example to make sure that any client
software interacting with the backend is aware of the upcoming change. Same as
with any security issue (e.g., the deprecation of SSLv3 ciphers in favour of
newer TLS versions).

If a business can't handle a change like this, something is really broken in
the development pipeline. Granted, that describes a lot of medium sized
companies…

~~~
mgkimsal
Take that one 'request' - possibly initiated by a jr-mid level developer -
and stack it up against the 500 other todo items in the pipeline. You can make
those arguments about "statistically increases the risk of people picking
weaker passwords" - unless this increases a bottom line or comes out of
someone else's budget, this sort of 'bug' is going to be _really_ low down on
the totem pole for all the reasons I mentioned, and a few others.

You can say "the process is broken" but it's also that same process that got
people where they are, puts food on their table, pays for the lifestyle, and
precious few people are willing to ever rock the boat at any company for
anything.

~~~
WorldMaker
Which gets back to some of the cargo cult thing too. It's not uncommon in a
large enterprise to smack up against things like "Years ago we paid a highly
trained Security Consultant a large amount of money to develop our Security
Guidelines, who are you and where are your security credentials to tell us to
do things differently?"

Even worse when that "Security Consultant" is still a retained coworker with a
fiefdom to maintain by war at all costs, a wizened old greybeard whose
seniority will always trump yours, and/or your boss.

------
nicolas_t
The one thing I systematically do in terms of email validation is catch the
common typos of the main providers. So things like gmail.con, hotmai.com,
gmall.com and so on.
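A rough sketch of that kind of typo catching, using `difflib.get_close_matches` from the Python standard library. The `KNOWN_DOMAINS` list and the 0.8 cutoff are assumptions for illustration; a real list would come from the providers your own users actually have.

```python
import difflib

# Assumed list of providers worth correcting towards.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com", "outlook.com"]

def suggest_correction(address, cutoff=0.8):
    """Return a 'did you mean' address, or None if nothing looks like a typo."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return None  # already a known provider, nothing to suggest
    matches = difflib.get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{matches[0]}" if matches else None

for typo in ("bob@gmail.con", "bob@hotmai.com", "bob@gmall.com"):
    print(typo, "->", suggest_correction(typo))
```

The result is meant as a prompt to the user ("did you mean ...?"), not a silent rewrite, since a near-miss domain can still be a real one.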

In 99% of those cases, it prevents someone from entering a wrong email.

We do not do email activation by forcing people to click a link in their email
to validate that they received it since that causes a drop in the funnel and
reduces the amount of revenues (non technical users tend to not come back when
you ask them to go to their emails to verify it). So, in this case, correcting
the typical typos is very important. In our case though the information is
not extremely private so it's less of a problem to do this.

Having people type the email twice doesn't really prevent typos, people copy
and paste. And if you disable paste, then it becomes annoying to users and you
don't want to annoy users during the signup process (plus I hate websites that
mess with paste so I won't be hypocritical and do it).

Lastly, I know that having an email with a local domain name with no TLD is
valid but it'll never be valid in the context we are sending so supporting
them just doesn't make sense.

~~~
gpvos
I like activation emails, because it shows the website cares about being able
to email me. Then again, I'm a technical user.

~~~
nicolas_t
Oh, we do send a welcome email and actually include a link in the email to
subscribe to the newsletter but we don't make it necessary to use the website.

~~~
gpvos
In that case, I'd rather not have the site gather my email address at all.
Just create an account with an arbitrary name (or no account at all) and let
the user use the web site. As soon as an email address is connected to the
account (or the user name itself is an email address), I'd rather have it
verified.

But I can see how there may be other concerns (commercial or not) that
interfere with this.

~~~
nicolas_t
Users remember their emails, they don't always remember their nicknames
especially if they had to select an alternative nickname because the one they
usually chose was already used...

Honestly, it's ruthlessly pragmatic business reasons. Is it ideal? No but in
the end, it's the most painless for most users...

And, also most users on that particular site have little computer experience,
so it does affect how they react and how we work. If it were something
targeting the HN crowd, I'd probably have enforced email validation and would
see a much lower drop in conversion due to it because people on HN are used to
that and do not have a problem with it.

So, validations and UI workflow have to be adapted to your audience and your
business and that's the main point really.

In this particular case, we did experiments with enforcing email validation
for a subset of new signups or even allowing a small number of signups to
signup with a username instead of an email. So we do have data...

Most of the work I do for other customers doesn't touch those areas of their
app, so I only have relevant hands-on experience with that particular site, but I
think it's important to qualify recommendations like "Always validate
emails" or "Always use regexp" and try to think of the best experience for the
kind of users you're targeting.

Sometimes on HN, people talk in absolutes when things are instead very
context-dependent.

------
amk_
TL;DR the odds that the user entered an incorrect-but-valid address are way
higher than that they entered one which will not actually be able to receive
mail. Send a validation email.

~~~
wyldfire
For flip's sake, yes please! Close the loop, fer crying out loud.

I got a popular givenname.familyname@gmail.com address and I frequently get
mail that's meant for other people who share my name. The vast majority of the
time it's the individual themselves who sign up for a service or offering but
there's rarely a validation upfront.

The best emails are the ones who extend full trust to the email recipient over
some account during that first email. Facebook, shame on you.

~~~
lucb1e
> The best emails are the ones who extend full trust to the email recipient
> over some account during that first email. [Random company,] shame on you.

Well what else would you have them do? Have people enter their address and
send letters there to do a password reset?

A confirmation email before full trust is going to do little: a malicious
person would just click that link, right?

~~~
wyldfire
Someone could have just created an account and accidentally put the wrong
email address in. Therefore, putting a link that extends full trust in the
welcome/confirmation email is a design error. The fix for this error is to
require the newly created account holder to put in their password on this
first login. While the account's in this state it shouldn't be possible to
reset the password.
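One way to read that fix, sketched in Python with only the standard library. The class and method names are hypothetical, and persistence, token expiry, and rate limiting are left out. The confirmation link alone proves nothing; the account only becomes verified when the link's token arrives together with the password chosen at signup.

```python
import hashlib
import hmac
import os
import secrets

class PendingAccount:
    """Hypothetical in-memory account that starts out unverified."""

    def __init__(self, email, password):
        self.email = email
        self.salt = os.urandom(16)
        self.password_hash = self._hash(password)
        self.token = secrets.token_urlsafe(32)  # goes into the emailed link
        self.verified = False

    def _hash(self, password):
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), self.salt, 100_000)

    def confirm(self, token, password):
        # Both the emailed token and the signup password are required, so a
        # stranger who merely receives the email cannot take over the account.
        token_ok = hmac.compare_digest(token.encode("utf-8"), self.token.encode("utf-8"))
        password_ok = hmac.compare_digest(self._hash(password), self.password_hash)
        if token_ok and password_ok:
            self.verified = True
        return self.verified

    def can_reset_password(self):
        # No password reset by email until the address has been confirmed.
        return self.verified
```

With this shape, a mistyped address costs the real owner of that mailbox one stray email and nothing more.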

------
jwecker
I always assumed it was more a sanitization issue for security's sake. By
allowing only a simple subset ("common") email address type, you can be
ambivalent about what email server is running and how it reacts to the wide
variety of specially crafted email addresses.

With no validation other than sending the email, you have to know, for
example, what the server would do with an email address that claims to be
@localhost. Now it becomes a problem- or at least a question and concern- for
the backend system. Whether the backend interprets root@localhost as valid and
does exactly what it's told or rejects it due to some configuration- it has
become a backend complication and a DOS attack vector.

A simple policy of only handling a subset- the common class of email
addresses- is one of the things that allows us to have a simple mental model
of what the MTA is supposed to do. The fact that it sometimes caught a typo,
or not, is incidental. "Invalid email" wasn't meant to imply the email address
doesn't fit the spec- it was meant to imply that a particular site or service
has chosen not to accept email addresses like that.

Or at least that's what I assumed :-)

~~~
zAy0LfpBZLC8mAC
> I always assumed it was more a sanitization issue for security's sake.

Sanitization is at best idiotic, at worst creates security problems. There is
no such thing as "bad characters", there only is broken code that incorrectly
encodes stuff. If you ever find yourself modifying user input "for security
reasons" (or really, for any reason at all), you are doing it wrong. The only
sane thing to do is to make sure that the semantics of _every_ _single_
_character_ of your user's input is preserved in whatever data format you need
to represent it in.

~~~
jwecker
An email address isn't a document though, it's a routing command. I don't mean
sanitization in the sense of inserting backslashes. I mean sanitization in the
sense of "we don't allow people to set their email address to a mailbox on
localhost at our mail server."

~~~
zAy0LfpBZLC8mAC
1. Sanitization generally means changing information. As in, "removing bad
characters", that kind of stuff. That's different from validation, which
should result in rejection of bad input, and which can be perfectly fine.
However, more often than not, validation is implemented badly and rejects
perfectly fine input, which is why validation shouldn't be employed more than
necessary either.

2. Rejecting @localhost addresses doesn't really make a whole lot of sense.
People could just enter the public IP address or hostname of the server, or
add a DNS A record under their own domain that points to 127.0.0.1, or an MX
record that points to localhost, or any number of other weird stuff that you
could not possibly validate anyway (if only because it could be changed at any
point later on). Just configure your mail server properly and then send the
damn email, and if it does get sent to root@localhost, and possibly forwarded
to the admin--so what? People obviously could just sign up using your admin's
email address anyway, and that not only at your site, but at millions of sites
out there, you won't be able to stop them. There is nothing particularly
dangerous about receiving unsolicited signup emails or about sending emails to
yourself.

~~~
throwanem
> There is nothing particularly dangerous about receiving unsolicited signup
> emails or about sending emails to yourself.

Depends on what you do with them, in the latter case. There could be an
amplification attack there.

Validating domain parts to a certain extent isn't a bad idea, at least as far
as non-routable domain names and RFC1918 ranges go. I've seen this done
(actually implemented some of it, in fact) at a past employer, who were
basically looking to cover the 90% case in terms of not getting hosed by a
trivial attack. It doesn't take much effort and it makes 4chan's life harder.
What's not to like?

~~~
zAy0LfpBZLC8mAC
> Depends on what you do with them, in the latter case. There could be an
> amplification attack there.

Huh? How would that work?

> Validating domain parts to a certain extent isn't a bad idea, at least as
> far as non-routable domain names and RFC1918 ranges go.

What do you mean by "non-routable domain names" and what do you gain by
checking for RFC1918 ranges?

> I've seen this done (actually implemented some of it, in fact) at a past
> employer, who were basically looking to cover the 90% case in terms of not
> getting hosed by a trivial attack.

Why did you prefer that approach over a robust solution?

My idea of a robust solution: Have one central outbound relay that's
firewalled off from connecting to anywhere but the outside world, make all
servers that need to send email use that relay as a smarthost (so they never
connect to anything but that relay, regardless what the destination address
is), use TLS and SMTP AUTH with credentials per client server to prevent abuse
of the relay by third parties.

> What's not to like?

(a) that it's a lot easier to build a solution that's more robust, (b) it's
extremely likely that your implementation is buggy, thus rejecting valid
addresses, and (c) it's causing a maintenance burden (what happens when the
first people drop IPv4 for their MXes? I'd pretty much bet that you don't
check for AAAA records, so you'd probably suddenly start rejecting perfectly
fine email addresses, thus making the transition to IPv6 unnecessarily harder,
am I right?).

~~~
throwanem
'Non-routable' as in a single label, or as in not resolvable. I don't think it
is unreasonable to consider an address invalid when its domain part cannot be
resolved. Checking for RFC1918 ranges means you don't try to send to another
class of addresses that's never going to be received.
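A rough sketch of the cheap part of such a check, using Python's standard `ipaddress` module. It covers only single-label domains and address literals, and the function name is hypothetical. The "not resolvable" part needs a DNS lookup and is not shown. As pointed out elsewhere in the thread, a DNS record can change after the check, so this filters mistakes more than it stops attackers.

```python
import ipaddress

def is_obviously_undeliverable(domain):
    """True for single-label domains and private or loopback address literals."""
    domain = domain.strip().rstrip(".")
    if domain.startswith("[") and domain.endswith("]"):
        literal = domain[1:-1]
        if literal.lower().startswith("ipv6:"):
            literal = literal[5:]  # address literals tag IPv6 as [IPv6:...]
        try:
            ip = ipaddress.ip_address(literal)
        except ValueError:
            return True  # not a parseable address literal at all
        # is_private covers the RFC1918 ranges among others.
        return ip.is_private or ip.is_loopback
    return "." not in domain  # single label, e.g. "localhost"

for d in ("localhost", "[10.0.0.1]", "[8.8.8.8]", "example.org"):
    print(d, is_obviously_undeliverable(d))
```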

You would lose the bet. The product supported IPv6 from day one.

That is a robust, if somewhat complex, solution for a relatively small volume
of mail. When you're sending ten million messages a day by the end of the
first month, pushing everything into a single relay of any kind is asking for
a lot of trouble.

~~~
zAy0LfpBZLC8mAC
> 'Non-routable' as in a single label, or as in not resolvable. I don't think
> it is unreasonable to consider an address invalid when its domain part
> cannot be resolved.

What exactly do you mean by "cannot be resolved"?

> Checking for RFC1918 ranges means you don't try to send to another class of
> addresses that's never going to be received.

But why check for it? Is that actually a common mistake people make? An
attacker could change the address after you checked it, so it's not going to
help against attackers, is it?

> You would lose the bet. The product supported IPv6 from day one.

Good for you! :-)

> That is a robust, if somewhat complex, solution for a relatively small
> volume of mail. When you're sending ten million messages a day by the end of
> the first month, pushing everything into a single relay of any kind is
> asking for a lot of trouble.

Complex? Certainly less so than implementing validation yourself.

As for scalability: Well, yeah, as described that's more the setup for a
company that's operating various different services, none of which has a high
volume of outbound email (which would be most, even the best startups don't
have ten million signups per day and don't send much email otherwise, and even
that should actually still be manageable with a single server).

But that's trivial to adapt without changing the general approach. First of
all, obviously, you could just add more relay servers and have client servers
select one randomly, that scales linearly. But if you really need to move
massive amounts of email for one service, so that adding an additional relay
hop for each email you send actually adds up to noticeable costs, you can still
use the same approach: Just put the MTA onto the same machine(s) that the
service is running on, into its own network namespace (assuming Linux,
analogous technology exists on other platforms), and firewall it off there so
it cannot connect to your internal network. Potentially you can even just add
blackhole routes for your internal networks/RFC1918 ranges, so you would not
even need a stateful packet filter (though currently you might still need it
due to IPv4 address shortage).

~~~
throwanem
"Cannot be resolved" means NXDOMAIN.

Why assume email addresses only get checked in one place, and not all?

Ten million a day was a milestone. I left that company over a year ago; it
would astonish me to find that figure now exceeded by less than a factor of
twenty. Granted these are mostly not signups. They are outgoing emails
nonetheless, which makes the case germane despite that superficial
distinction.

Your proposed solution sounds pretty expensive in ops resource, to no
obviously greater benefit than the rather simple (well under one dev-day)
option we chose. You seem to feel yours is strongly preferable, but I still
don't understand why.

~~~
pixl97
NXDOMAIN can be a temporary error. The SMTP queuing protocol is designed to be
resilient against DNS failures, internet outages, routing problems, and
temporary mail delivery issues.

~~~
zAy0LfpBZLC8mAC
> NXDOMAIN can be a temporary error.

Unless some DNS server is broken, it actually cannot. NXDOMAIN is an
authoritative answer that tells you that the domain positively does not exist.
Not to be confused with SERVFAIL, which you should get if the DNS resolver ran
into a timeout or got an unintelligible response or whatever, NXDOMAIN should
only occur if the authoritative nameserver of a parent zone explicitly says "I
neither know this zone locally nor have a delegation for it".
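The distinction maps onto the numeric response codes in the DNS header: 0 is NOERROR, 2 is SERVFAIL, 3 is NXDOMAIN. A minimal sketch of how a validator might act on them follows. The function name and the "reject"/"retry" policy labels are assumptions, and how you obtain the rcode depends on your resolver library.

```python
# DNS header RCODE values as defined in RFC 1035.
NOERROR, SERVFAIL, NXDOMAIN = 0, 2, 3

def classify_rcode(rcode):
    """Map a DNS response code to a hypothetical validation decision."""
    if rcode == NOERROR:
        return "ok"
    if rcode == NXDOMAIN:
        return "reject"  # authoritative "this name does not exist"
    return "retry"       # SERVFAIL and anything else: treat as temporary

print(classify_rcode(NXDOMAIN), classify_rcode(SERVFAIL))
```

If broken resolvers that return NXDOMAIN spuriously are a concern, the "reject" branch could be softened to a warning.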

------
tedmiston
Hmm, sorry but I don't buy that the "correct way to validate" is _not_ to
validate the input.

Email addresses aren't a special enough case to be handled differently than
any other user input, which we always validate to both sanitize and show
client-side errors, if nothing else.

Sure, the complete regex is complex, but it _is defined_ and is hardly
unconquerable. Look at Django's `EmailValidator` implementation for example
[0] that is mature and well tested [1].

The author has not convinced me that ignoring validation is the right choice
when options with a scope so thorough exist.

[0]:
[https://github.com/django/django/blob/master/django/core/val...](https://github.com/django/django/blob/master/django/core/validators.py#L170)

[1]:
[https://github.com/django/django/blob/a9215b7c36bff232bcc941...](https://github.com/django/django/blob/a9215b7c36bff232bcc9416309726290dc74a9e8/tests/validators/tests.py#L48)

~~~
wtbob
> Sure, the complete regex is complex, but it is defined and is hardly
> unconquerable.

If it is a _regular_ expression, then it is not able to match all valid email
addresses, because the grammar of email addresses is context-free, and regular
expressions can only match regular grammars. It doesn't matter if it is
defined or not: if it's a true regular expression, then it simply _cannot_
validate email addresses.

(it may, of course, be a context-free expression masquerading as a regular
expression)
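The usual concrete culprit is comments: in the RFC grammar a parenthesised comment may itself contain comments, so matching them means tracking nesting depth, which a true regular expression cannot do for unbounded depth. A small sketch of the counting involved follows. It is an illustration of the nesting problem only, not a full address parser, and `strip_comments` is a hypothetical name.

```python
def strip_comments(text):
    """Remove (possibly nested) parenthesised comments outside quoted strings."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a "quoted string"
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Backslash escapes the next character; keep the pair unless
            # we are inside a comment that is being dropped.
            if depth == 0:
                out.append(text[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes:
            out.append(ch)  # parentheses inside quotes are literal
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth or in_quotes:
        raise ValueError("unterminated comment or quoted string")
    return "".join(out)

print(strip_comments("john(for (nested) sly010)@nowhere.invalid"))
```

The `depth` counter is exactly the piece of state that a finite automaton lacks.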

I wonder if the django validator will choke on perfectly valid email addresses
such as (this)"()<>[]:,;@\\\\\"!#$%&'-/=?^_`{}|
~.a"(is)@(valid)example.org(honest)

I suspect that it will, but of course I could be wrong.

~~~
thedufer
It is true that regular expressions in the CS sense can't parse context-free
grammars. However, PCRE, which is what most programmers are talking about when
they say "regex", can do so. So you're both kinda right, I guess. But you're
being a bit pedantic.

~~~
eyelidlessness
> PCRE, which is what most programmers are talking about

I wish that were the case.
[http://www.regular-expressions.info/refunicode.html](http://www.regular-expressions.info/refunicode.html)

(They used to have a much more useful and concise comparison table but I can't
for the life of me find it.)

~~~
thedufer
Interesting; I've never had to think about how regex interact with Unicode.

I guess what I really meant is that programmers are talking about PCRE in
terms of power, not in terms of exact syntax. In particular, they have
recursive patterns, which are sufficient to pull them up to context-free
grammars.

------
JacobJans
I do a lot of optin email. Here are some examples of bounced emails that
people use to sign up:

* somename@gmail.co

* anothername@yhoo.com

* myemail@hotmial.com

These are very common errors that occur nearly every day. A regex isn't going
to help here. What does help, is a notification that asks people to verify
what they typed –– if the email contains an obvious, common error, such as one
listed above.

~~~
gizmodo59
Asking to retype, though a simple solution, is IMHO asking for a lot.
Consider a user on a mobile phone, where even copy-paste is annoying. Validating
that the mailbox exists and that it does not belong to a provider like
mailinator and then sending a confirmation link to them works. While it's not
perfect, it does address a lot of other concerns without sacrificing user
experience.

~~~
stouset
Also, asking to retype makes sense for a _password_ where you can't visually
verify that you typed what you expected. For an email address, it's pointless
and frustrating. Send an email. If it bounces or doesn't get verified in a
timely fashion, it was a mistake, so delete the account.

~~~
namenotrequired
And the user never discovers their mistake.

~~~
nilved
They'll discover they don't get a validation email.

------
SamReidHughes
There are other situations where you do have to guess whether a string of data
is a "real" email address or if it's some garbage data that turns up in
information systems. "Validating" email addresses ( _without_ emailing
somebody) is a real problem. By the way, maximizing the accuracy of such a
guess can require going against the standards.

------
r1ch
One thing I've found that helps a lot is instant delivery notifications. When
you try to register a new account, our "We've sent a confirmation email"
screen will report within a few seconds if there was a mail delivery error and
allow the user to correct their address. Common typo detection for popular
email domains is also beneficial
([https://github.com/mailcheck/mailcheck](https://github.com/mailcheck/mailcheck))

------
cyberferret
As an aside, I received an email from a government department recently that
had an '&' character as part of the email address. I didn't think that was
valid, but lo and behold, when I checked the specs, it IS indeed a valid
character in an email address.

Just goes to show that assumptions are often wrong, and you have to crack open
the spec document from time to time... [1]

[1] -
[https://tools.ietf.org/html/rfc2822](https://tools.ietf.org/html/rfc2822)

~~~
lucb1e
I have a feeling somewhere on the Internet is the story of the sysadmin who
got the request from that department for an email address with an ampersand.
S/he probably stopped themselves from replying "uh, that doesn't work", tried
it, and then learned it's valid.

------
Ameo
I have a .link domain for my personal email and a lot of sites refuse to let
me register because they don't recognize it as a valid TLD.

Then there's the textbook company that lets me register but refuses to let me
reset my password claiming that I'm trying to enter an "invalid email
address."

~~~
GirlsCanCode
I had a .to domain for a while, and I had an email address
"firstname@firstname.to" where firstname is my first name. I wanted to use it
when I needed to give out emails casually, like if someone from Church or some
other group wanted my email.

I gave up on it, not because of computer validation but because of stupid
people! Nobody would "get" the .to domain and they'd always think there should
be a .com or .gmail.com on the end of it. Especially older people.

So I gave it up.

~~~
toomanybeersies
I'm discovering the same problem with my *.xyz domain name.

~~~
icebraining
As a counterpoint, I was surprised to see we've had very little trouble with
our *.solutions domain, even from clients who are otherwise not very tech-
savvy.

~~~
justusthane
I sometimes have to repeat my *@*.ws email address more than once when reading
it to someone, but I've never had a problem with anyone messing it up or
thinking it's invalid.

------
KennyCason
Fully knowing where this was going to conclude, I still found myself reading
through the "let's build a stats model" part. I just had to check my calendar
to confirm it's not April 1st. :)

------
Xorlev
At this point, our email validity criteria:

.+@.+\..{2,}

That is, at least one character for the inbox, at least one character for the
domain, at least two for the TLD (we assume that TLD-less domains are
undeliverable by us). This ensures we don't allow 'a@a' or 'a@a.a', but do
allow 'a@a.io'.
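For anyone who wants to poke at it, here is that pattern in Python's `re`, assuming it is meant to match the whole string (hence `fullmatch`). The helper name `passes` is made up for the example.

```python
import re

# Something, '@', something, a literal dot, then at least two more characters.
PATTERN = re.compile(r".+@.+\..{2,}")

def passes(address):
    return PATTERN.fullmatch(address) is not None

for candidate in ("a@a", "a@a.a", "a@a.io", "root@localhost", "xyz@abc,com"):
    print(candidate, passes(candidate))
```

Only 'a@a.io' passes out of those five. The pattern is deliberately loose about everything else, so it will also accept plenty of strings that are not deliverable addresses.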

~~~
DHowett
a@[IPv6:2001::1] is, unfortunately for your validation regex, a valid e-mail
address.

[EDIT: I see that you consider TLD-free e-mail addresses undeliverable;
still!]

~~~
bbcbasic

      .+@.+
    

FTW?

~~~
Senji
Enjoy getting a billion e-mails per second written to root@localhost

~~~
bbcbasic
As if. There would be a single confirmation email sent which would not be
clicked.

------
dugluak
How about the common mistake of entering xyz@abc,com instead of .com. A lot of
times I unintentionally make this mistake. If the system doesn't prompt me in
this case then I would never know why I didn't receive any further
communication from it. That's FAIL in my opinion.

------
sly010
If someone's valid email address is

[*\"32f2@13.31.43.11

they are up to no good and I don't want them as my customer.

Also, according to the standard email addresses are supposed to be case sensitive,
since the username part refers to a unix user and unix is case sensitive. I
work with a lot of email address lists originally collected on paper and of
course no one knows that. So as bad as it sounds, part of my sanitization process
is to lowercase everything. No one ever complained. What the standard says and
what people actually do are very different.

~~~
zeveb
> If someone's valid email address

> [*\"32f2@13.31.43.11

> they are up to no good

Well, that one in particular, perhaps — but what about "sam &
jill"@ourfamily.invalid? What about john(for sly010)@nowhere.invalid?

> Also, according to the standard email addresses are supposed to be case
> sensitive, since the username part refers to a unix user and unix is case
> sensitive.

The local-part doesn't refer to a 'Unix user'; it is what it says on the tin:
the local part. It's up to the receiving server to determine what it wants to
do with case. It's the duty of other servers to preserve the email address as
transmitted.

> I work with a lot of email address lists originally collected on paper and
> of course no one knows that. So as bad as it sounds, part of my sanitization
> process is to lowercase everything.

I might use the email address as given, and if that bounces try lowercasing. I
don't know of anywhere that has case-sensitive names, but it's conceivable.

~~~
sly010
I am all for doing things by standards, but to me this is de facto. No one
uses case sensitive emails. If they did all sorts of weird things would
happen: People couldn't log in, but other people could. There would be
multiple accounts when there should be only one, etc. You would have to store
every email in original format and in some canonical format for uniqueness,
etc. It's a nightmare.

If I had to guess, the main reason this apocalypse is not happening already is
that by default MySQL indexes are case insensitive, so dude@dude.com is the
same as dude@DUDE.com, and people who use MySQL never realize this is an
issue.

------
maxerickson
Yes, please do send activation emails (or perhaps a personal confirmation
email if you are establishing contact with someone that wrote down an address
for you).

Those of us with firstnamelastname@commonhost will appreciate not getting
bills and job offers and such.

~~~
JshWright
I've been getting monthly status reports from some guy's Hyundai for months...
The unsubscribe link does nothing... I'm tempted to reset his password and
change the email address.

~~~
kevin_thibedeau
Password resets are the only solution. I did that years ago when someone
signed up for a Facebook account with my address and I kept getting friend
notifications. I let it go for a few years and Facebook was happy to keep an
unvalidated account active the whole time.

------
buro9
The best way to validate an email address:

Send an email that they need to click on (or an email with a code they need to
enter), _OR_ ask the OAuth provider with authority for it to validate it (i.e.
Google OAuth for Google addresses, Windows Live for Microsoft Accounts, etc).

The best way to identify someone with an email address:

Store a canonical version of their email address alongside the user's email.
Use the canonical version when signing-in/identifying and the raw version
originally supplied to send email.

This is the only way to not have duplicate accounts for
firstlast@googlemail.com vs first.last@googlemail.com vs first.last@gmail.com
.

The canonical email is always lowercase, no dots, no + part, no prefix or
suffix columns, known domain aliases are normalised to the most common alias
(googlemail.com > gmail.com).

I wrote a SQL canonical email func recently (in preparation for Persona
shutdown) if anyone is interested:
[https://github.com/microcosm-cc/microcosm/blob/master/db/mig...](https://github.com/microcosm-cc/microcosm/blob/master/db/migrations/20160831112556_canonical_email.sql)
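
As a rough illustration of the canonicalisation idea (not the linked SQL
function), here is a minimal Python sketch. The alias table and the
per-provider rule sets are assumptions made up for this example, and dot and
`+` stripping is applied only to providers listed there, since those rules are
not universal.

```python
# Hypothetical rule tables; extend them only for providers you have verified.
DOMAIN_ALIASES = {"googlemail.com": "gmail.com"}
DOTS_IGNORED = {"gmail.com"}      # providers that ignore dots in the local part
PLUS_TAGS_IGNORED = {"gmail.com"}  # providers that ignore a "+tag" suffix


def canonical_email(raw):
    """Return a canonical form used only for identifying an account.

    The raw address should still be stored and used for sending mail.
    """
    local, sep, domain = raw.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address")
    local = local.lower()
    domain = domain.lower()
    domain = DOMAIN_ALIASES.get(domain, domain)
    if domain in PLUS_TAGS_IGNORED:
        local = local.split("+", 1)[0]
    if domain in DOTS_IGNORED:
        local = local.replace(".", "")
    return f"{local}@{domain}"


for address in ("firstlast@googlemail.com",
                "first.last@googlemail.com",
                "first.last@gmail.com"):
    print(canonical_email(address))
```

All three addresses from the example above map to the same canonical value,
while an address at a domain outside the tables is only lowercased.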

~~~
justinlardinois
> This is the only way to not have duplicate accounts for
> firstlast@googlemail.com vs first.last@googlemail.com vs
> first.last@gmail.com .

Periods being optional in usernames is something that Gmail started, not a
universal rule or part of the specification. Are you going to build into your
system all the possible variations for every email provider?

~~~
buro9
> Are you going to build into your system all the possible variations for
> every email provider?

I've analysed my user database and am only doing the large providers for which
I have high confidence.

------
omarforgotpwd
I first encountered the idea of email validation in Agile Web Development with
Ruby on Rails, which I read in middle school. In the book they give an example
of validating email addresses to show how you could use regex to validate. I
wonder if that contributed to the frustrating problem of developers trying to
validate emails and not accepting valid inputs.

------
gpvos
One more thing that irks me is that some websites capitalize or lowercase the
part before the @. Email servers are allowed to treat that part case-
sensitively, although most don't. (The part after the @ is indeed case-
insensitive.)

------
smallnamespace
Why not go the other way and compute the edit distance to commonly used domain
names, and then prompt the user and ask if it's correct?

E.g. if I type foo@googl.com, it should be pretty likely that I meant
google.com.
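
A minimal Python sketch of that idea, assuming a small hand-picked domain list
and a maximum distance of 2 (both are made-up values for illustration, not
tuned ones):

```python
# Hypothetical list of commonly used domains.
COMMON_DOMAINS = ["google.com", "gmail.com", "yahoo.com",
                  "hotmail.com", "outlook.com"]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest_correction(address, max_distance=2):
    """Return a suggested address, or None if there is nothing to suggest."""
    local, sep, domain = address.rpartition("@")
    domain = domain.lower()
    if not sep or domain in COMMON_DOMAINS:
        return None
    best = min(COMMON_DOMAINS, key=lambda d: edit_distance(domain, d))
    if edit_distance(domain, best) <= max_distance:
        return f"{local}@{best}"
    return None


print(suggest_correction("foo@googl.com"))
```

The result is meant to be shown to the user as a question ("did you mean
...?"), not applied silently, since the unusual domain may well be correct.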

------
bballer
In all my projects I use the same methods for validating an email address: 1)
Does it contain a `@` 2) Split the string on the `@` and make sure that at
least one character exists on both sides of the `@` 3) Send verification
email.

This check is done server side, while on the client I just use an html5
input[type='email'] with a required attribute.

~~~
FungalRaincloud
That's probably reasonable, but you might want to amend 2 to "split the string
on the last '@'". There's no restriction to only '@' before the domain. It's
exceptionally rare that a server even allows '@' in the mailbox name,
fortunately for us.
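
For illustration, a small Python sketch of the "split on the last '@'" check;
the function name is made up for this example:

```python
def split_address(address):
    """Split on the last '@'; return (local, domain) or None.

    Only checks that both sides are non-empty. Anything beyond that is left
    to the verification email.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None
    return local, domain


print(split_address('"odd@local"@example.com'))
print(split_address("no-at-sign"))
```

Using `rpartition` keeps any earlier '@' inside the mailbox name instead of
treating it as the start of the domain.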

------
bigger_cheese
There is someone at my work with a hyphen and an apostrophe in their email
address. Their inbox often gets used for testing things.

------
justinator
In Dada Mail [0], there's quite a few steps to figuring out if the email
address submitted for a mailing list subscription is valid, but most of it can
really be organized under sanitizing the data you receive, which you should be
doing anyways. Yes, we do validation for _form_ of an address client side
(which helps with hitting the server side so much), but we'll do it again
server side. We also look at stats on how many times an address was submitted
before, as well as per ip address over time across all functions of the app (as
well as specifically for subscribing). Oh, and even if it's valid, and "real",
sometimes we don't want to work with it either, ie: it shows up on something
like StopForumSpam. It's actually a ton of work and much to orchestrate.

[0] [http://dadamailproject.com](http://dadamailproject.com)

------
puddintane
I always check whether a validation library already exists in my current
language before attempting validation with custom code - this leads to more
easily maintainable code for future eyes.

For example in PHP I use filter_var function with the FILTER_VALIDATE_EMAIL
[1] - while it's great to know why and how to do a particular thing
programming, it's better to use a time tested library that is maintained by
multiple eyes versus just your own.

[1]
[http://php.net/manual/en/filter.examples.validation.php](http://php.net/manual/en/filter.examples.validation.php)

------
SZJX
Of course the only way to be 100% sure would be to actually test it out by
sending an actual email, wouldn't it? This is just so obvious. I think the point
of email regexes out there is not to make up for the user's silly typing
mistake. It's just more to reject nonsensical/malicious/blatantly false inputs
etc. as the first layer of protection really, and nobody would really spend
tons of time on crafting a "perfect" regex anyways I'm pretty sure, so that's
never been a problem at all.

------
chavesn
All of that statistical analysis was actually a bit silly, because I've never
heard the "typo" argument as a reason for email grammar validation[1]. Sounds
like a straw man. It didn't need to be disproven.

The conclusion is sound (although leaves out a discussion of the whether an
email confirmation field is at least better than nothing).

[1]: _(As a side note, I think the most common explanations for grammar
validation are programmer perfectionism and proactively stopping user garbage,
such as copy-paste errors or intentionally fluffed fields that will result in
a bounced email anyway.)_

------
zimbatm
TLDR; (with my own interpretation)

1\. Email validation regexp should be: /.+@.+/ => Tell the user to enter a
valid email if that doesn't match.

2\. Send a validation email to actually exercise the system

------
mangeletti
Firstly, great article.

Secondly, this guy is so hilarious.

I clicked to his podcast (where he reads Wikipedia pages) at the bottom and
listened to this hilarious episode [https://itunes.apple.com/us/podcast/david-
reads-wikipedia/id...](https://itunes.apple.com/us/podcast/david-reads-
wikipedia/id1052573447?mt=2&i=358964123). His specific style of sarcastic
humor (just like the article - is there a name for this style of humor btw?)
is rare and hilarious.

~~~
zeeveener
You named it for what it is. Sarcasm. Incredibly dry, sometimes monotonous
sarcasm.

------
contingencies
1\. International domain names... you cannot filter characters much because of
this. If you want to filter characters, be damn sure it's careful and precise
filtering.

2\. For immediate feedback, you can reasonably check it's not at a
non-routable IP range, ie. address@127.x.x.x or address@192.168.x.x or similar.
However, this check is best done on your email server (MTA policy). It's
unlikely anyone would enter this without malicious intent so there's no need
to optimize for their use case.
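
A minimal Python sketch of such a check using the standard `ipaddress` module.
The function name is made up, and accepting both the bare and the
square-bracketed form is an assumption of this sketch:

```python
import ipaddress


def domain_is_non_routable_ip(address):
    """True if the part after the last '@' is a private/loopback IP literal."""
    _, _, domain = address.rpartition("@")
    host = domain.strip("[]")
    if host.lower().startswith("ipv6:"):
        host = host[5:]  # drop the "IPv6:" tag used in bracketed literals
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # an ordinary hostname, not an IP literal
    return ip.is_private or ip.is_loopback or ip.is_link_local


print(domain_is_non_routable_ip("address@127.0.0.1"))
print(domain_is_non_routable_ip("address@[192.168.1.10]"))
print(domain_is_non_routable_ip("user@example.com"))
```

This only inspects the literal text of the address; it does not resolve
hostnames, so a name that points at a private range would still pass.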

~~~
developer2
"user@127.0.0.1" isn't even valid. An IP address as a host requires square
brackets a la user@[127.0.0.1] - but this is only by RFC, and _no_ real site
allows signing up with an IP address instead of a hostname.

Email validation is simple. You take the full RFC regex, and remove the
notations permitting square brackets and comments. You wind up with a regex
that accepts exactly 100% of real-world email addresses. If your regex
contains "\\.(com|org|...)", you're doing it wrong.

If you can't figure out how to do the full regex properly, then match against
/^[^@]+@[^.]+\\./ \- or hell, just check for an '@' symbol - as a basic "hrm
that kinda looks like an email address" and send the email. It's really not
difficult. The only problem is naive English-speaking people who only deal
with latin1 thinking that [a-z0-9]+ is somehow the only valid criteria for a
username or hostname.

------
amluto
I can think of one reason to do RFC2822 validation: security. It reduces the
chance that someone can give a bogus email address that makes some SMTP server
on the route misbehave.

------
nxzero
Correct way is to run a regex for [wildcard @ wildcard . Wildcard] then send an
opt-in email real-time as the user is typing additional info. If it bounces
before the user finished the onboard form - alert them to the issue. If it
gets validated, autologin the user. If it bounces or there's zero response by
the time the user completes the form, alert them, ask to type their email again
without access to the prior entry - then give them the opt-in option via text
message.

~~~
jasonkostempski
Polling email on the server is awfully wasteful. I like the "request an
invite" pattern. Get an email address and nothing more, send the "invite",
clear the form, tell them if they don't get an email in 5 minutes, check spam,
still nothing, reenter the email. Don't associate the link with any particular
session, just the entered address so there is no risk of leaking any private
info, if they type in an address that actually belongs to someone else the
only options the other person would have is ignore the email or sign up
themselves. When they finally get their email right they'll get a link to a
place where they can fill in the rest of the info and you can be assured they
gave you the email address they wanted you to have.
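
One way to tie the link to the entered address rather than to a session is a
signed token. The following is a minimal Python sketch under stated
assumptions: the secret, the token layout and the one-day lifetime are all
made up for illustration.

```python
import base64
import hashlib
import hmac
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical server-side secret


def _sign(payload):
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def make_invite_token(email, now=None):
    """Build a token that carries the address and an issue time."""
    issued = int(time.time() if now is None else now)
    payload = f"{issued}:{email}".encode("utf-8")
    body = base64.urlsafe_b64encode(payload).decode("ascii")
    return f"{body}.{_sign(payload)}"


def read_invite_token(token, max_age=86400, now=None):
    """Return the address from a valid, unexpired token, else None."""
    body, sep, signature = token.partition(".")
    if not sep:
        return None
    try:
        payload = base64.urlsafe_b64decode(body.encode("ascii"))
    except ValueError:  # covers bad base64 and non-ASCII input
        return None
    expected = _sign(payload).encode("ascii")
    if not hmac.compare_digest(expected, signature.encode("utf-8")):
        return None
    issued_text, _, email = payload.decode("utf-8").partition(":")
    current = time.time() if now is None else now
    if current - int(issued_text) > max_age:
        return None
    return email


token = make_invite_token("someone@example.com")
print(read_invite_token(token))
```

Because the address travels inside the signed token, the sign-up page that the
link opens needs no session state, and a link sent to a mistyped address
reveals nothing beyond that address.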

------
RomanProofy
What about proofy.io? I have used it myself for verifying email addresses.

------
tedmiston
From the perspective of a user typing their own email address correctly, I use
text substitution on OS X and iOS to never type my full email address. ex fg
--> foo@gmail.com

~~~
homero
fg

------
nbevans
Validating e-mails is pointless - short of sending them a confirmation URL and
waiting.

 _Sanity checking_ them however is often useful for checking form input or
data cleansing.

------
ChuckMcM
I think the 'valid but wrong' email is a more common failure. I just spent a
few weeks with my wife trying to convince some poor person that they had
mistyped their email address when they created their Amazon account. It
reminded me of this xkcd ([https://xkcd.com/1279/](https://xkcd.com/1279/)).

It does seem effective to have someone type the address twice as that can
catch a typo fairly easily.

~~~
jcoffland
I hate it when sites make me type my password in twice like I'm some sort of
idiot. What's really bad is when they disable pasting. This is so stupid
because it makes sure I cannot copy and paste from my password DB. I will
often not use a site for this reason alone.

------
smaili
For those looking for the conclusion:

 _Send your users an activation email._

------
suhith
This keeps coming up on HN, in the end the best thing to do is exactly as the
article says. ACTUALLY SEND the email and see if they get it.

------
vkjv
I like the approach of accepting anything that is a reasonable length with
an `@`, but suggesting corrections for possibly misspelled common domains.

[https://github.com/mailcheck/mailcheck](https://github.com/mailcheck/mailcheck)

If you really need to validate, the only way I know how is to send them an
email and click a link to confirm.

------
kazinator
Sorry, I disagree.

There absolutely is a correct way to lexically validate an e-mail address:
namely, implement a parser for the syntax specified in whatever RFC is the up-
to-date successor of RFC 822.

There is such a thing as incorrect e-mail address syntax: namely, non-RFC-
conforming syntax, whatever that is.

You may reject that, and that's about it.

Please don't reject RFC-conforming e-mail addresses.

~~~
developer2
I agree, with two specific exclusions: notations for IP addresses in square
brackets, and comments. Just removing those two never-encountered-in-real-life
syntaxes reduces the RFC regex down to nothing. Comments are _not_ a real
thing, RFC be damned. And square bracketed IP addresses are never encountered
in the real world, especially with the requirements for PTR records to pass
all the major providers' spam filters.

Anyone trying to provide an email address with a square bracketed IP or a
comment is specifically trying to find an excuse to cause drama when their
email is rejected. Those exceptions aside, fuck anyone who validates email
addresses that don't permit perfectly valid real-world characters like '+', or
who whitelist specific TLDs (.com, .org, .net et al).

~~~
jcranmer
Email addresses are defined and have an interpretation not according to RFC
5322 (which defines how you can write them in a message), but rather RFC 5321
(which defines how to route email addresses). Note that the RFC 5321
definition does not permit CFWS--its existence is merely an artifact in RFC
5322 of being able to insert whitespace (and later comments) anywhere in the
grammar. Any tool which accepts CFWS in an email address that is not reading
an addr-spec field of an RFC 822 mail message is incorrect.

Beyond the issue of comments, I'd advocate rejecting the use of IP address
literals and quoted localparts. Additionally, despite the injunctions on
interpreting local-parts, in practice, it is best to treat email addresses as
case-preserving: you'd consider a@example.com and A@example.com to be the same
email address, but if the user typed in the latter, don't normalize it to the
former.

------
OOPMan
My approach to this is reasonably simple:

1: Does it have an @ 2: Is it at least 5 chars long (E.g. a@b.c)

------
fafournier
There seems to be a problem with the analysis. It seems to assume that
everybody uses qwerty keyboards! What about qwertz, azerty, dvorak or even
colemak... We need to recalculate with data aggregated and weighted by world-
wide populations! :-D

------
teekert
Small anecdote: We recently had a visitor from the US (a physician) who didn't
understand why we didn't need a .com at the end of an email address, she was
exchanging email addresses with a French person whose address ended in .fr :)

------
htor
What is weird is that input[type="email"] rejects some valid emails. Try out
this codepen:

[http://codepen.io/anon/pen/dpoqVQ](http://codepen.io/anon/pen/dpoqVQ)

------
Timucin
I wish the author had put the single sentence from the end up top as a TL;DR
note: send verification emails. I don't agree with it 100%, though, while the
article/author claims that's the 100% right way.

------
chriscampbell
I respect the thought behind this article but if you don't want to build it
yourself, we use the company BriteVerify and it works pretty damn well for
identifying invalid emails.

------
sigi45
His number is probably wrong. I personally wrote/typed my email address wrong
(when you have those two fields).

There are also common pitfalls people are doing like empty spaces at the
beginning and at the end which should be cleaned up before trying to send the
email to " whoever@gmail.com".

There is also an email standard, and when a normal library is able to validate
against it (and there are good free libs out there) then the effort to do so is
similarly minimal but provides an additional safeguard.

What you should also do is make sure that you are not sending unlimited
emails out there. Otherwise you might be misused as a mail relay / spammer.

------
burnbabyburn
tl;dr send a confirmation email.

it's 20 years that people suggest a cool new way to deal with email addresses,
I don't even mind listening anymore! :)

~~~
dugluak
I think it's good to do some kind of validation upfront. It's too much to ask a
user to go through whole process again for a silly typo.

------
mpetrovich
Some people, when confronted with a problem, think "I know, I'll use regular
expressions." Now they have two problems.

— Jamie Zawinski

------
jordanielewski
Personally, I just check '@' presence

------
josh_carterPDX
I'm now going to put non-standard email characters in every form I fill out
just to see how the site handles it.

------
brightball
This is the best comprehensive way that I've found:

[https://github.com/kdisneur/email_checker](https://github.com/kdisneur/email_checker)

It breaks down into 3 parts that can be used either independently or as a
whole: format, MX and SMTP.

~~~
kej
That one fails " "@example.com and test@example, both of which are legal.

Also, since it relies on looking up MX records and then making an SMTP
connection to check if the user exists, why not just send the confirmation
email? You've already done all of the expensive stuff at that point.

~~~
brightball
You haven't had to bother the user yet. Also, shouldn't example.com fail?

~~~
kej
I meant those both fail on the format check, when they are legal formats. You
are correct that they should fail the MX portion.

------
gcb0
even Google fails this.

register a Gmail address and type in the wrong alternative email.

done. no confirmation required. someone with the mistyped email now has a way
to reset your password and take your account.

------
ehnto
I have trouble with my simple .co address.

------
ikeboy
>I know hacking LinkedIn just to make a point about email validation is a bit
extreme, but it is important to back up one’s opinions with data

Hardcore.

------
chenster
Even [http://emailregex.com](http://emailregex.com), a popular regular
expression summary for email, states it can only catch 99.99% of valid
emails. However, I still like to include some basic form of email validation
both on the client and server side, plus the activation email.

~~~
AgentME
>However, I still like to include some basic form of email validation both on
the client and server side

I worry you're more likely to permanently block a set of users with valid
email addresses than prevent a user from making a typo.

~~~
chenster
The key here is "BASIC", something like [a-z]@[a-z] so at least they will have
a at sign. Did you even read?

~~~
AgentME
I can't tell if you're being sarcastic or not, because that's perfectly
proving my point. That validation bans anyone with a number immediately before
the @, which includes my email address.
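
To make that concrete, here is a quick Python check of the quoted pattern
against a looser one suggested elsewhere in the thread. The sample address is
made up:

```python
import re

BASIC = re.compile(r"[a-z]@[a-z]")  # the pattern quoted above
LOOSE = re.compile(r".+@.+")        # "something, an @, something"

address = "agent7@example.com"  # hypothetical address with a digit before the @

print(bool(BASIC.search(address)))  # False: '7' is not in [a-z]
print(bool(LOOSE.search(address)))  # True
```

The same failure applies to a capital letter next to the @, because `[a-z]`
is case-sensitive unless the pattern is compiled with `re.IGNORECASE`.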

~~~
chenster
I was not.

------
chiefalchemist
Great article. Wrong. Question.

The better question: In its current form, is an email address really the
best way to do what it is that's trying to be accomplished? (Hint: It's a fax
machine.)

I mean, if I have a phone number, why can't I have an email number? Okay,
perhaps not the greatest example. But then again, if a phone number can be
switched from one carrier to another, in the second decade of the 21st century
shouldn't "email" get the same consideration?

Instead we're talking about regex or some other wonky validation? In 2016?
That's just silly.

~~~
jones1618
> Why can't I have an email number?

You can if you want (it is a valid email). But, if everyone had a numeric
email it would be a lot harder to catch spam or know if you are sending to a
bot. Sure, spammers can masquerade as a friendly name like
yourmom313@domain.com now but it's easier to spot names you know than
471871731@domain.com.

> Why can't emails be carried over to other domains?

I like the idea in abstract. Practically speaking, though, this would not only
require millions of emails to be purged from every ISP in the world it would
also require some kind of creepy global email registry. That defies the
decentralized spirit of the Web and would instantly become the Holy Grail of
hacker targets.

------
tgarma1234
No. Absolutely beginner level blog post. You would use a third party tool like
[http://www.datavalidation.com/](http://www.datavalidation.com/) or mailgun's
email validation service or BriteVerify etc etc. There are a ton of validation
services now. We are living in a time when trillions of email addresses have
been tried, entered and deployed to. So why reinvent the wheel on your
website? 3rd party services are based not only on parsing the string but also
on literally billions of emails actually deployed through various ESP's to
tell you up front whether the email address entered by a user is correct. I
could use a trashmail email address and it would validate by the OP's
standards. First validate using a third party service and then send the double
opt-in email to get a user click.

~~~
kinkdr
My philosophy is exactly the opposite. Allow anything as email address and
make it as simple as possible for the user to sign up to use my application.

This means, no email validation, no address verification emails, heck, I don't
even have a password confirmation field. One field for name, one for email and
one for password and you are in.

If 10% of the users don't trust me with their real, or even throwaway email
address, so be it. My goal is not to collect email addresses, but to have
users use my application and add some value to their life.

If another 10% of them enter a mistyped email, they will figure it out
eventually and change it if they care.

~~~
Karunamon
eh, password confirmation at least has a valid reason to exist - kinda sucks
when someone's first hit on your site is having to go through a recovery, and
if they didn't provide an email address, they're screwed.

