
Validating email addresses: A journey down RFC5321 [video] - StavrosK
https://www.youtube.com/watch?v=xxX81WmXjPg&ab_channel=FOSDEM
======
jacquesm
I wrote an RFC conforming email address validator. It worked perfectly on the
test set. Then you start feeding it real life email addresses and you find out
that lots of real addresses do not conform to the standard.

So then you have to mess it all up again and make it work so that it validates
anything out there that _might_ be real, and in the process you will end up
failing a lot of your test cases. Very frustrating.

~~~
Arnt
Can you name some examples?

I too wrote one and found lots of addresses that were valid but unusable (try
using a "jacques m"@example.com and see how far you get). But I can't remember
any that were usable in the wild but invalid according to my parser.

~~~
jacquesm
It's been four years and I don't have access to the code or the test set (work
for a customer), sorry. It would have been nice to have that bit open sourced
though; jvde, if you are reading this, hint!

------
jcranmer
Some tips for email validation:

* Just say no to CFWS. CFWS is not legal in an email address, and anyone who tells you otherwise is incapable of reading the standards. CFWS is legal to insert in the written representation of the email address in RFC 5322 and RFC 5321, but it is not a structural part of the email address (you get the email address by deleting all such instances).

* Everything to the right of the @ is either a domain name, an IPv4 address literal embedded in [], or an IPv6 address literal prefixed with IPv6: and embedded in [] (e.g., [IPv6:::1]).

* Enclosing a localpart in quotes does not change the email address. Ergo, you can require that people not use quoted localparts if they don't have to. Actually, in general, you can probably require that people not have quoted localparts at all.

* The only characters truly forbidden from email addresses are C0 (and maybe C1, although the EAI specs are hazy about that). In practice, though, any character that requires quoting to access can probably be excluded from valid email addresses.

* Sending someone an email and telling them to confirm that they received it (e.g., by making the next step be triggered from a link in the email) is the only way to validate that an email address actually works.
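
A minimal Python sketch of the structural tips above, assuming the address has already had any CFWS removed. The function name `classify_address` and its return labels are made up for illustration, and the domain-name branch only checks for non-empty dot-separated labels rather than full hostname syntax:

```python
import ipaddress


def classify_address(address: str) -> str:
    """Classify the part after the last "@".

    Returns "domain", "ipv4-literal", "ipv6-literal", or "invalid".
    """
    # C0 control characters (U+0000 to U+001F) are rejected outright.
    if any(ord(ch) < 0x20 for ch in address):
        return "invalid"

    # Split on the last "@" so a quoted local part containing "@" stays intact.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return "invalid"

    if domain.startswith("[") and domain.endswith("]"):
        literal = domain[1:-1]
        try:
            # Assumption: the "IPv6:" tag is matched case-insensitively.
            if literal.lower().startswith("ipv6:"):
                ipaddress.IPv6Address(literal[len("ipv6:"):])
                return "ipv6-literal"
            ipaddress.IPv4Address(literal)
            return "ipv4-literal"
        except ValueError:
            return "invalid"

    # Stray brackets mean a malformed address literal.
    if "[" in domain or "]" in domain:
        return "invalid"

    # Assumption: a domain name is a run of non-empty, dot-separated labels.
    if all(domain.split(".")):
        return "domain"
    return "invalid"
```

This only says what shape the address has; whether it actually works still needs the confirmation email from the last tip.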

~~~
StavrosK
Can you clarify what CFWS is? I found what the acronym stands for, but not
what it actually _means_.

~~~
jcranmer
Comments and folding whitespace. It's the little things in parentheses (e.g., the
timezone name is usually included in a comment in the Date: header). It's also
the bane of anyone who has ever implemented an email message parser.

~~~
StavrosK
That makes sense, thank you.

------
bhouston
Our strategy: ensure there is a "@" and some text before and after the "@". We
require email validation (click a link in an email), so if they enter a bad
email we catch it on the validation part anyhow.
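
As a rough Python sketch of that check (the function name `looks_like_email` is made up for illustration, and splitting on the last "@" is an assumption, not necessarily what the poster's code does):

```python
def looks_like_email(value: str) -> bool:
    """True if there is an "@" with some text on both sides of it."""
    # rpartition splits on the last "@"; sep is "" when no "@" is present.
    local, sep, domain = value.rpartition("@")
    return bool(sep and local and domain)
```

Anything that passes still has to survive the confirmation link before it is treated as a working address.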

~~~
onion2k
I think this demonstrates _why_ we validate things. There are two reasons - to
ensure the data is correct, and to catch errors that the user has made. On the
face of it they sound the same but they're not. As far as your service is
concerned a 'correct' email address is anything that looks a bit like an email
address. As far as the user is concerned a correct email address is
specifically _their_ email address. Your pattern works well enough for the
first part but not the second.

Any validation pattern that caters for the user should be trying to catch
instances where the user has entered their address incorrectly. That means
validating things like the general pattern ('does it have a @'), checking the
TLD ('is the TLD part of the domain in this huge array'), and catching weird
characters ('did the user really mean to use a ®'). Even if the validation
fails the user should be able to submit their address (this is client side
after all; the user can just disable it), but they should be informed that it
_appears_ there could be a reason to double check it first.

~~~
RandallBrown
It's nearly impossible to validate against TLD since there are arbitrary TLDs
now. I've always checked for something@something.something then you rely on a
verification email. If you want to make sure they typed in _their_ email
correctly, then make them type it in twice.

~~~
onion2k
Your client side validation should be looking for warning signs rather than
blocking the user though. You can use a similarity measure (e.g. Soundex, or an edit distance) to
check against a list of known TLDs and warn the user to check their address if
it's _close_ to a known TLD but not an exact match. Anything that's not close
to a known value should result in a different warning. Nothing should stop the
user from submitting their value, but the form should try to help them get the
right value the first time around. That's what client side validation is for.
It is _not_ for blocking incorrect values (because it's trivial to get around
it).
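
A minimal Python sketch of the "warn, don't block" idea. It swaps in the similarity ratio from the standard library's `difflib` in place of Soundex, and `KNOWN_TLDS` is a tiny hypothetical sample standing in for a real TLD list; the function name and warning texts are also made up:

```python
import difflib
from typing import Optional

# Hypothetical sample; a real form would load the full list of TLDs.
KNOWN_TLDS = ["com", "de", "io", "net", "org", "uk"]


def tld_warning(address: str) -> Optional[str]:
    """Return a warning to show the user, or None if nothing looks off."""
    domain = address.rpartition("@")[2]
    if "." not in domain:
        return "The domain has no dot in it; please double check."
    tld = domain.rpartition(".")[2].lower()
    if tld in KNOWN_TLDS:
        return None
    # Close to a known TLD but not an exact match: suggest the nearest one.
    close = difflib.get_close_matches(tld, KNOWN_TLDS, n=1, cutoff=0.6)
    if close:
        return f"Did you mean .{close[0]} instead of .{tld}?"
    # Not close to anything known: a different, more general warning.
    return f"The ending .{tld} is not one we recognise; please double check."
```

The function only produces text to display next to the field; the form would still accept the value whatever it returns.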

------
jorangreef
If anyone is interested in general MIME validation, not specifically email
addresses, I wrote @ronomon/mime:

[https://github.com/ronomon/mime](https://github.com/ronomon/mime)

The source code has comments from the various RFCs inline. It's not just a
question of reading and knowing the RFCs but also of interpreting them and
balancing them.

The decoder will throw detailed exceptions which are designed to pinpoint
where the message went wrong, and which RFCs were broken by the message.

It also does just-in-time decoding, and decodes only the properties you
access. This is much more efficient if you only need a header or two to reject
spam quickly as part of a front line defense.

Above all, most of the methods are extensively fuzz-tested against independent
reference implementations.

------
sourcesmith
Dominic Sayers has written a fair amount on email address validation and has a
decent library of test cases, which I have found useful in the past.
[https://dominicsayers.wordpress.com/category/design/code/ema...](https://dominicsayers.wordpress.com/category/design/code/email-address-validation/)

~~~
StavrosK
Thanks for that, here's a link to the XML of the actual test cases:

[https://www.pastery.net/rnbakg/](https://www.pastery.net/rnbakg/)

------
filleokus
A comment from the presenter on YouTube:

Unfortunately, after I did some further investigation, it seems that some of
the slides in my presentation here are wrong. I took some information from the
"Email address" Wikipedia page, which turned out to have incorrect info, and
in one I just misunderstood the rules (the "local@domain(comment)" part, which
is invalid). I have since amended the presentation, but the general point of
this video of how to do validation still stands.

~~~
StavrosK
I made another comment yesterday, after more searching: The presentation is
actually correct, as far as I can tell.

Yes, the standards are _that_ convoluted! There's even an "explanatory" RFC
that's just wrong. I have edited the YouTube comment to point that out
somewhat.

------
doxcf434
The problem is that there are many email addresses that are valid but are
likely just abusers. An email address entirely of * and 200 chars long is
valid in the RFC, but clearly not a human.

I settled on < 100 chars and:

`^[\w\.\+\-]+@[\w\-]+\.[\w\-\.]+$`

We'll see how it goes in production :)
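
For illustration, the same rule as a Python sketch, with the escaping inside the character classes simplified. The names `MAX_LENGTH`, `PATTERN` and `accept` are made up, and note that in Python 3 `\w` also matches non-ASCII letters and digits unless `re.ASCII` is passed, which may or may not match the regex flavour used above:

```python
import re

MAX_LENGTH = 100  # addresses must be shorter than this
PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")


def accept(address: str) -> bool:
    """Apply the length limit, then the pattern."""
    # fullmatch avoids "$" quietly accepting a trailing newline.
    return len(address) < MAX_LENGTH and PATTERN.fullmatch(address) is not None
```

For example, `accept("user+tag@example.com")` passes, while `accept("user@gmailcom")` is rejected because nothing matches the required dot in the domain.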

~~~
admax88q
What value does your system get from limiting addresses to 100 characters
and to the given regex?

Why not just allow any input and validate the address by attempting to send to
it? It's really the only way to tell if it's a real address.

What abuse can a person bring on your system by having a 200 char email
address? That should be nothing in terms of server load.

------
jpalomaki
I don't think there's much point in spending lots of effort validating the
email for structural correctness, because you can't verify the email is
really valid without sending a verification message anyway (valid in the sense
that it works and belongs to the specific person).

------
PunchTornado
Just don't validate names or email addresses.

~~~
amelius
Yes, what is the point of validating them?

You don't know if they work anyway, until you successfully send email.

~~~
detaro
Why go as far as trying to send an e-mail, having the user wait for a
confirmation e-mail, asking for it again, waiting again and then at some point
finding their mistake, if you could have immediately prompted them to fix it as
they put in user@gmailcom? Validation doesn't have to catch every mistake to
be useful.

~~~
amelius
Ok, but then a very simple approach would do just fine, where you show a
warning (instead of blocking the user).

