Validating email addresses: A journey down RFC5321 [video] (youtube.com)
71 points by StavrosK 6 months ago | 44 comments

I wrote an RFC-conforming email address validator. It worked perfectly on the test set. Then I started feeding it real-life email addresses and found out that lots of real addresses do not conform to the standard.

So then you have to mess it all up again and make it work so that it validates anything out there that might be real, and in the process you will end up failing a lot of your test cases. Very frustrating.

If you think validating email addresses is frustrating, then you should try sending actual email.

Can you name some examples?

I too wrote one and found lots of addresses that were valid but unusable (try using a "jacques m"@example.com and see how far you get). But I can't remember any that were usable in the wild but invalid according to my parser.

It's been four years and I don't have access to the code or the test set (it was work for a customer), sorry. It would have been nice to have that bit open sourced though. jvde, if you are reading this: hint!

Yes, there are two problems: The standard is rather convoluted, and many SMTP servers don't adhere to the standard. Considering that the email address might just be misspelled or the user might not exist anyway, it doesn't seem very productive to try to do much validation, other than maybe catch the very obvious invalid cases to cut down on your email sending costs somewhat.

The proper way seems to be: do your verification anyway. If the address appears to be invalid, prompt the user once more to check their email address, and if they indicate it is OK, simply send the verification email, assuming the user knows what they are doing. This will catch a very large chunk of accidentally misspelled email addresses and will allow users to override the validator for strange edge cases.

Yes, exactly. If you issue warnings rather than hard errors, you also have a lot of leeway for notifying the user on things like "gamil.com" and other misspellings without being terribly annoying.
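
A minimal sketch of the warn-then-confirm flow described above, assuming a small hand-maintained table of misspelled domains. The names `KNOWN_TYPOS`, `CheckResult`, and `check_address` are hypothetical, and the checks are deliberately loose; nothing here blocks submission.

```python
from dataclasses import dataclass, field

# Hypothetical table of common misspellings; a real list would be longer.
KNOWN_TYPOS = {
    "gamil.com": "gmail.com",
    "gmial.com": "gmail.com",
    "hotmial.com": "hotmail.com",
}


@dataclass
class CheckResult:
    # Human-readable warnings; an empty list means nothing looked suspicious.
    warnings: list[str] = field(default_factory=list)

    @property
    def needs_confirmation(self) -> bool:
        # The caller should ask "are you sure?" once, then accept the answer.
        return bool(self.warnings)


def check_address(address: str) -> CheckResult:
    result = CheckResult()
    # Split on the last "@" so the domain part itself never contains one.
    local, at, domain = address.strip().rpartition("@")
    if not at or not local or not domain:
        result.warnings.append("This does not look like an email address.")
        return result
    suggestion = KNOWN_TYPOS.get(domain.lower())
    if suggestion:
        result.warnings.append(f"Did you mean {local}@{suggestion}?")
    elif "." not in domain:
        result.warnings.append("The domain has no dot; please double-check it.")
    return result
```

If `needs_confirmation` is true, show the warnings once; if the user confirms, send the verification email regardless.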

Some tips for email validation:

* Just say no to CFWS. CFWS is not legal in an email address, and anyone who tells you otherwise is incapable of reading the standards. CFWS is legal to insert in the written representation of the email address in RFC 5322 and RFC 5321, but it is not a structural part of the email address (you get the email address by deleting all such instances).

* Everything to the right of the @ is either a domain name, an IPv4 address literal embedded in [], or an IPv6 address literal prefixed with IPv6: and embedded in [] (e.g., [IPv6:::1]).

* Enclosing a localpart in quotes does not change the email address. Ergo, you can require that people not use quoted localparts if they don't have to. Actually, in general, you can probably require that people not have quoted localparts at all.

* The only characters truly forbidden from email addresses are C0 (and maybe C1, although the EAI specs are hazy about that). In practice, though, any character that requires quoting to access can probably be excluded from valid email addresses.

* Sending someone an email and telling them to confirm that they received it (e.g., by making the next step be triggered from a link in the email) is the only way to validate that an email address actually works.
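
As a rough illustration of the second tip, here is a sketch that classifies whatever sits to the right of the @. The function name `classify_domain_part` is made up for this example, the domain-name branch is intentionally loose (it only checks for non-empty dot-separated labels), and Python's `ipaddress` module may accept forms that the mail RFCs do not, so treat this as an approximation rather than a conforming parser.

```python
import ipaddress


def classify_domain_part(domain: str) -> str:
    """Return 'ipv4-literal', 'ipv6-literal', 'domain', or 'invalid'."""
    if domain.startswith("[") and domain.endswith("]"):
        inner = domain[1:-1]
        try:
            # IPv6 literals carry an "IPv6:" tag inside the brackets.
            if inner[:5].lower() == "ipv6:":
                ipaddress.IPv6Address(inner[5:])
                return "ipv6-literal"
            ipaddress.IPv4Address(inner)
            return "ipv4-literal"
        except ValueError:
            return "invalid"
    # Loose domain check: every dot-separated label must be non-empty,
    # and none of a few obviously wrong characters may appear.
    labels = domain.split(".")
    if all(labels) and not any(c in domain for c in "[]@ "):
        return "domain"
    return "invalid"
```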

Can you clarify what CFWS is? I found what the acronym stands for, but not what it actually means.

Comments and folding whitespace. Comments are the little things in parentheses (e.g., the timezone name is usually included in a comment in the Date: header). It's also the bane of anyone who has ever implemented an email message parser.

That makes sense, thank you.

This StackOverflow reply happens to answer this question exactly. The actual definition of "CFWS" is near the bottom of the text:


It's just an excerpt of the spec, but I link to it instead of to the spec because I think it's easier to read (formatting/colors) and it has links to the spec anyway.

Ahh, thank you, I missed that part of the spec. The "folding" part was what confused me, thanks again.

> Sending someone an email and telling them to confirm that they received it (e.g., by making the next step be triggered from a link in the email) is the only way to validate that an email address actually works.

This is the gold standard but note it is itself prone to false negatives and can even have false positives.

Our strategy: ensure there is an "@" and some text before and after the "@". We require email validation (clicking a link in an email), so if they enter a bad email address we catch it at the validation step anyhow.

The one time I had to validate email addresses without using another library, I settled on /.+@.+\..+/

Basically the same as yours, but validate that a TLD exists.

Technically, this would fail on a few valid email addresses: foo@localhost, foo@.co, or foo@2001:0db8:85a3:0000:0000:8a2e:0370:7334

All those are valid, but we decided that none of those would apply to any of our customers.
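
For anyone who wants to see the behaviour, here is a quick Python rendering of that pattern. It uses `re.search` because a bare JavaScript-style `/.../` test is unanchored; the helper name `passes` is just for this sketch.

```python
import re

# Same pattern as /.+@.+\..+/ : something, "@", something, ".", something.
PATTERN = re.compile(r".+@.+\..+")


def passes(address: str) -> bool:
    # Unanchored search, matching how the pattern behaves without ^ and $.
    return PATTERN.search(address) is not None


for sample in [
    "user@example.com",
    "foo@localhost",
    "foo@.co",
    "foo@2001:0db8:85a3:0000:0000:8a2e:0370:7334",
]:
    print(sample, passes(sample))
```

Only the first sample passes; the other three are rejected because nothing after the @ has the shape "text, dot, text".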

I think this demonstrates why we validate things. There are two reasons - to ensure the data is correct, and to catch errors that the user has made. On the face of it they sound the same but they're not. As far as your service is concerned a 'correct' email address is anything that looks a bit like an email address. As far as the user is concerned a correct email address is specifically their email address. Your pattern works well enough for the first part but not the second.

Any validation pattern that caters for the user should be trying to catch instances where the user has entered their address incorrectly. That means validating things like the general pattern ('does it have a @'), checking the TLD ('is the TLD part of the domain in this huge array'), and catching weird characters ('did the user really mean to use a ®'). Even if the validation fails the user should be able to submit their address (this is client side after all; the user can just disable it), but they should be informed that it appears there could be a reason to double check it first.

It's nearly impossible to validate against TLD since there are arbitrary TLDs now. I've always checked for something@something.something then you rely on a verification email. If you want to make sure they typed in their email correctly, then make them type it in twice.

Your client-side validation should be looking for warning signs rather than blocking the user, though. You can use a similarity measure (e.g., Soundex or an edit distance) to check against a list of known TLDs and warn the user to check their address if it's close to a known TLD but not an exact match. Anything that's not close to a known value should result in a different warning. Nothing should stop the user from submitting their value, but the form should try to help them get the right value the first time around. That's what client-side validation is for. It is not for blocking incorrect values (because it's trivial to get around it).
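
A sketch of that idea, with two caveats: it uses Levenshtein edit distance rather than Soundex (edit distance is easier to reason about for short strings like TLDs), and `KNOWN_TLDS` is a tiny hypothetical subset rather than the real, much longer list. The function returns a warning string or `None`; it never rejects anything.

```python
# Hypothetical subset; the real TLD list is far longer and changes over time.
KNOWN_TLDS = {"com", "net", "org", "io", "de", "uk"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def tld_warning(address: str):
    """Return a warning string, or None if the TLD is a known one."""
    tld = address.rpartition(".")[2].lower()
    if tld in KNOWN_TLDS:
        return None
    close = sorted(t for t in KNOWN_TLDS if edit_distance(tld, t) == 1)
    if close:
        return f"'.{tld}' is close to '.{close[0]}'; did you mistype it?"
    return f"'.{tld}' is not a TLD we recognise; please double-check."
```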

Yep, I do exactly the same. Let the mail servers validate the address.

I do generally the same. It probably accepts some invalid addresses, but at least it's the most realistic filter.

What do you consider "some text"?

I'm assuming .+

If anyone is interested in general MIME validation, not specifically email addresses, I wrote @ronomon/mime:


The source code has comments from the various RFCs inline. It's not just a question of reading and knowing the RFCs but also of interpreting them and balancing them.

The decoder will throw detailed exceptions which are designed to pinpoint where the message went wrong, and which RFCs were broken by the message.

It also does just-in-time decoding, and decodes only the properties you access. This is much more efficient if you only need a header or two to reject spam quickly as part of a front line defense.

Above all, most of the methods are extensively fuzz-tested against independent reference implementations.

Dominic Sayers has written a fair amount on email address validation and has a decent library of test cases, which I have found useful in the past. https://dominicsayers.wordpress.com/category/design/code/ema...

Thanks for that, here's a link to the XML of the actual test cases:


A comment from the presenter on YouTube:

Unfortunately, after I did some further investigation, it seems that some of the slides in my presentation here are wrong. I took some information from the "Email address" Wikipedia page, which turned out to have incorrect info, and in one I just misunderstood the rules (the "local@domain(comment)" part, which is invalid). I have since amended the presentation, but the general point of this video of how to do validation still stands.

I made another comment yesterday, after more searching: The presentation is actually correct, as far as I can tell.

Yes, the standards are that convoluted! There's even an "explanatory" RFC that's just wrong. I have edited the YouTube comment to point that out somewhat.

The problem is that there are many email addresses that are valid but are likely just abusers. An email address entirely of * and 200 chars long is valid in the RFC, but clearly not a human.

I settled on < 100 chars and:


We'll see how it goes in production :)

What value does your system provide by limiting addresses to 100 characters and the given regex?

Why not just allow any input and validate the address by attempting to send to it? It's really the only way to tell if it's a real address.

What abuse can a person bring on your system by having a 200 char email address? That should be nothing in terms of server load.

Take a rational approach but then provide a human-feedback mechanism for the very small number of edge cases that may crop up?

It shouldn't be "let's automate this and hope it goes well in production", i.e., the Google approach. It should be "let's use common sense and manage failure in a way that doesn't piss off customers".

I don't think there's much point in spending lots of effort validating the email address for structural correctness, because you can't verify it is really valid (in the sense that it works and belongs to the specific person) without sending a verification message anyway.

just don't validate names or email addresses

Yes, what is the point of validating them?

You don't know if they work anyway, until you successfully send email.

Why go to the point of trying to send an e-mail, have the user wait for a confirmation e-mail, ask for it again, wait again, and then at some point find their mistake, if you could have immediately prompted them to fix their mistake as they put in user@gmailcom? Validation doesn't have to catch every mistake to be useful.

Ok, but then a very simple approach would do just fine, where you show a warning (instead of blocking the user).

I feel like trying to resolve the MX record for the domain could solve this. You can't send the validation email anyway, unless the domain resolved. But to do that, you actually have to (partly) parse the address...
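
A sketch of what that could look like. The "partly parse" step is just splitting on the last @; the DNS lookup is injected as a callable (`lookup_mx`, a hypothetical name) because the Python standard library has no MX resolver, so a real version would need a DNS library behind it. Note that a domain with no MX record can still receive mail through its address records, so an empty result is a reason to warn, not to reject.

```python
from typing import Callable, Optional


def domain_of(address: str) -> Optional[str]:
    """Return the lowercased domain part, or None if there is no usable '@'."""
    # Split on the LAST "@": a quoted local part may itself contain one.
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return None
    return domain.lower()


def domain_has_mx(address: str, lookup_mx: Callable[[str], list[str]]) -> bool:
    """True if the injected resolver reports at least one MX host."""
    domain = domain_of(address)
    if domain is None:
        return False
    try:
        return len(lookup_mx(domain)) > 0
    except LookupError:
        # Assumption: the injected resolver signals "not found" this way.
        return False
```

In tests the resolver can be a plain function backed by a dictionary, which keeps the check free of network access.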

Wouldn't that be nice. But then how do you propose dealing with trolls in games or community sites?

You can allow basically anything as an email address (exactly one @ symbol, text ahead and after, with at least one . after, something like /^([^@]+@[^@]+\.[^@]+)$/), and still verify that the user is human (captcha or something) and verify the email address (send signup email).
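
The same pattern as a Python sketch, for illustration only (the name `looks_like_email` is made up here). It enforces exactly one @ and at least one dot after it, and nothing else.

```python
import re

# Exactly one "@", text on both sides, and at least one "." after the "@".
LOOSE = re.compile(r"^([^@]+@[^@]+\.[^@]+)$")


def looks_like_email(address: str) -> bool:
    return LOOSE.match(address) is not None


for sample in ["user@example.com", "a@b@c.com", "user@gmailcom", "user@localhost"]:
    print(sample, looks_like_email(sample))
```

Only the first sample is accepted. Whether the address actually works is still left to the signup email.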

In all seriousness though, wouldn't sending a validation email that has them click a link suffice?

Not only would it suffice, it is the only way to tell if an email address is capable of receiving mail. There's no regex that can tell you that.

+ the address is not corrupted and belongs to the user

> just don't validate names

just don\\'t validate names.

Do you have an apostrophe in your name? Then you're hosed, and you get to spend some time figuring out the details.
