Developer time is precious at a startup and supporting <RFC>fan 69™@root while still denying b ob@gmailcom is very, very far down the list of things to do.
In summary: I don't suggest doing 'perfect' email validation to RFC spec. You will save money/devtime and make more of your users happy by not doing it.
When I was validating myself for Amazon Prime Student, I literally had Amazon refuse to accept my student email in the form email@example.com because there were two '.'s in the mailbox portion. I had to send an email to support and it was eventually dutifully fixed.
And that's not an uncommon format for, you know, school emails. And that's an Amazon engineer who should have known.
I imagine there's developers who think "domain.tld" is the only thing valid to put in the domain portion, and that's going to fail with "domain.co.uk", or uncommon TLDs, or other perfectly valid constructs. And sure "it's only x% of the users" but it's a pain in the ass if you're that user. You need to be reasonably permissive.
(but on the other hand "myname@..." is not valid either, and that will fail and cost you money as well... hence leading us back to 'just follow the spec')
Furthermore, for a user, it is trivial to get another email address if the one they have causes issues, so it is not really an accessibility issue either.
For the initial email input, your logic works fine. Once it is applied downstream in a process, it begins to get messy. Someone might do an incorrect email validation that happens to block emails that you have already accepted or which you are importing from a valid source. Someone has already given the example of a login field not allowing them to use the email they signed up with. If such upgrades occur later in a project's life cycle, not only might you have to spend developer time, you may also have a production outage.
Personally, I suggest using some, even if imperfect, validation when gathering the email initially (for the reasons you point out) and then not validating that information any further.
I’m always suspicious when sites cap passwords at < 32 characters, that almost always means it’s being stored in a reversible format someplace - maybe encrypted, maybe obfuscated, or maybe neither (banks).
The sites I really trust don’t care how long your password is because their hash size is fixed. The only real length consideration might be that if a bunch of people send obnoxiously long passwords at the same time and they are using bcrypt or scrypt it might stress the server’s CPU, so they might put an upper limit to prevent that.
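To illustrate the fixed-size point, here is a minimal sketch using Python's standard `hashlib.scrypt`. The cost parameters and the 1024-character cap (`MAX_PASSWORD_LEN`) are illustrative assumptions, not recommendations from anyone in this thread.

```python
import hashlib
import os

# Hypothetical upper bound, only there to stop absurdly long inputs from burning CPU.
MAX_PASSWORD_LEN = 1024

def hash_password(password: str, salt: bytes) -> bytes:
    """Return a 64-byte scrypt digest, whatever the password length."""
    if len(password) > MAX_PASSWORD_LEN:
        raise ValueError("password too long")
    # n, r and p are illustrative cost parameters; tune them for real hardware.
    return hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=2**12, r=8, p=1, dklen=64
    )

salt = os.urandom(16)
short_digest = hash_password("hunter2", salt)
long_digest = hash_password("correct horse battery staple " * 20, salt)
```

Both digests come out at the same length, so the storage schema gives no reason for a 32-character cap.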
e.g. It drives me BONKERS how many systems absolutely reject my single-letter email (~"N@domain.com"), which I created specifically to make it easy and safe to type on mobile devices etc. Others will reject the "+" sign, or underscore, or dot/period, or (brilliantly) two periods or underscores, etc etc etc :=/
Writing code that doesn't need to exist that has a possible failure mode of not letting someone sign up at all is just a bad decision. If you're going to write that code, either go through the effort of getting it completely right or soften the failure mode. If you really think the user is somehow mistyping their email with special characters or an unusual TLD, then you could show them a non-blocking warning message.
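A softened failure mode could look roughly like this. It is only a sketch: the `EmailCheck` type, the `check_email` name and the specific warning heuristics are made up for illustration. The one hard failure is input that has no usable "@" at all; everything else is reported as a warning the user can dismiss.

```python
from dataclasses import dataclass, field

@dataclass
class EmailCheck:
    accepted: bool                       # False only when the input cannot be an address at all
    warnings: list = field(default_factory=list)

def check_email(address: str) -> EmailCheck:
    """Hard-fail only on a missing '@'; everything else becomes a warning."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        return EmailCheck(False, ["An email address needs text on both sides of an '@'."])
    warnings = []
    if "." not in domain:
        warnings.append("The domain has no dot. Is that right?")
    # Characters that are legal only inside quoted local parts; unusual, not forbidden.
    if any(ch in local for ch in ' "(),:;<>[\\]'):
        warnings.append("The part before '@' contains unusual characters. Is that right?")
    return EmailCheck(True, warnings)
```

The form would show `warnings` next to the field and still let the submission through whenever `accepted` is true.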
(Actually RFC5322 already deprecates some syntaxes. For example, "John Hacker"."Ph.D., Esq."@example.com is a deprecated syntax (obs-local-part), because it contains multiple quoted-string components separated by dots.)
> or (brilliantly) two periods or underscores
Two or more consecutive periods is actually disallowed by RFC5322 (unless quoted). firstname.lastname@example.org is a valid address, foo..email@example.com is not. ("foo..bar"@example.com is however)
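As a sketch of just that dot rule (not a full RFC 5322 parser, and `local_part_dots_ok` is a hypothetical name): an unquoted local part may not start or end with a period or contain two in a row, while a quoted one is exempt.

```python
def local_part_dots_ok(local: str) -> bool:
    """Check only the period rules for a local part; nothing else is validated."""
    if len(local) >= 2 and local.startswith('"') and local.endswith('"'):
        # Quoted-string form: periods are unrestricted inside the quotes.
        return True
    return (
        bool(local)
        and not local.startswith(".")
        and not local.endswith(".")
        and ".." not in local
    )
```

With this, `foo.bar` and `"foo..bar"` pass, while `foo..bar` and `.foo` do not.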
If enough people wrote in about not accepting one letter email addresses, then they would likely update the validation.
But if customer service tells users to use another email address in that scenario and the customer does that, then it might not be worth the effort to fix it.
Better to warn that an email doesn’t look right, but let them continue if they want to.
When I want to check postage or whatever, and they require an email address, firstname.lastname@example.org typically doesn't work, but email@example.com does.
I go a little farther. I figure an attentive spammer might figure out that if I use firstname.lastname@example.org to sign up for Amazon, I may have exactly the scheme where *@johnsmith.net will work, so they can just add that to the spam list as a wildcard and pick a new address every time. So instead, I use email@example.com, john102, john103, etc, to try and obscure my strategy and prolong the life of the domain forwarding.
Also unless you're keeping a lookup table you're losing a great benefit of the wildcard. You can tell when a company sells your email, and I have caught a few places doing so. If I get an email from company XYZ to my email firstname.lastname@example.org I know exactly who sold my email and to whom.
> unless you're keeping a lookup table you're losing a great benefit of the wildcard
That's true, I don't keep a lookup table per se, though I do have a deleted items folder that I could look back in. I'm not sure what I would do, though, if I knew what particular company sold my email address? Send them a nastygram they will just ignore? I just block the address and move on.
AFAIU, most bulk spam is targeted at gullible or vulnerable people. The spam is often terrible on purpose.
Sophisticated or targeted attacks are a different category and they may be a good reason to prefer something non-guessable.
At least a few years ago, I noticed a lot of spam to <random first name>@<my domain> -- i.e., completely made-up addresses that I had never used. Since messages sent to those addresses were guaranteed to be spam, I started treating them as free training data for the spam filter.
I don't know if this still happens, though, because I haven't looked.
It seems like an obvious thing to try, but maybe not worth the effort of implementing it, given the high risk of false positives and the low % of people who actually do stuff like this (not to mention they're probably not people who click on ads anyway).
I have found this to work; I hardly receive any spam at all, and do not need any separate spam filter.
Additionally, sending a test email like that might also get the sender placed on a black list for triggering a spam trap inadvertently.
Do you know any site blocking domains with a catchall?
Worse, I personally know at least 5 or 6 people who use a catch-all. It seems like a very poor method to reliably catch spammers.
EDIT: moreover, a service is perfectly within their rights to _internally_ store my email as `email@example.com` if they want - but they should still accept `firstname.lastname@example.org` as the identifier used to login with.
I've received spam emails to at least 70 different +addresses. It is absolutely useful for antispam.
Spammers don't care about the reputation of the company they bought or stole the data from.
Not all email providers support the + notation, so you'd have to run a domain lookup against some hard-coded list.
Also anyone with gmail address can also place dots almost anywhere into the local part, to create another unique address without using a + sign.
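A per-provider canonicalization of the kind discussed here might be sketched as below. The two domain sets are hypothetical hard-coded lists (only gmail.com is included, because that is the provider named in this thread), and `canonical_key` is an invented name. Treating Gmail local parts as case-insensitive is also an assumption of the sketch.

```python
# Hypothetical hard-coded lists; real providers differ and these would need upkeep.
PLUS_TAG_DOMAINS = {"gmail.com"}
DOT_INSENSITIVE_DOMAINS = {"gmail.com"}

def canonical_key(address: str) -> str:
    """Return a key for duplicate detection; never use it as the delivery address."""
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    if domain in PLUS_TAG_DOMAINS:
        local = local.split("+", 1)[0]          # drop the "+tag" suffix
    if domain in DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "").lower()  # dots and case ignored by this provider
    return f"{local}@{domain}"
```

A key like this is only suitable for internal comparison. Mail should still go to, and login should still accept, exactly what the user typed.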
Contrary to popular belief, it is not a gmail feature.
I first heard of the + as destination filtering in the very early 90s at CMU where it was broadly used. Every single email address I've had since then has supported the same (and notably, apart from a test account, I've never used gmail much, so that's not including gmail).
The '+' alias feature is a fairly common configuration, though, so for source labels it's better to either treat all unlabeled messages as spam or else use a more opaque labeling scheme (email@example.com) which doesn't hint at an alternative untracked email address.
That whole setup for tidiness is broken the moment a desired website does not accept an alias in your address, of course.
Why? They don't care about protecting the business interests of wherever they got that address from, and it's not like stripping the plus off will meaningfully increase the success rate.
Don't get clever, just follow the spec.
My email is refused by 0% of ecommerce shops... because I just have a normal email.
Don't be clever, pick a better email.
My email is just "me@<my-last-name>.al" which is just a tiny bit "unusual" - and over the years it got refused by a couple stores because of TLD. And Albania is not Cocos Islands, they're surely not popular with spammers.
If a store believes there's only ".com" gTLD and nothing else (this had really happened to me, some galaxy-brain made a form with a hardcoded ".com" suffix; not even ".net" or ".org" were accepted, unfortunately I don't remember the site) - well, fuck that store, their loss not mine. Worst case, if I really want something they sell, I'll give them a throwaway email - which will contribute to their mail bounces after some time.
 ".al" is a ccTLD for Albania which is not a country of my citizenship or residence. I've picked the domain name as hack - because my first name is Aleksei and my first and middle names form "A.L." initials as well. That, and because all relevant .name domains were already taken.
Think about it this way: either you can get some big brand .com email with no special username and never have an issue, or you can flail around 5% of the time and yell at the clouds.
Should everyone accept your email? Of course! I'm just saying you live in real life, and in real life people suck at building email forms. The problems you run into are on you.
No, the problems they run into are caused by (at best) mediocre developers. They’re entirely to blame. We have specs and standards for a reason.
Instead you can just get a big name .com email and call it a day. Live your life without trying to make some statement about email standards.
No, because it's not 1993, but I absolutely do use the contact forms or bug reporter for any website that doesn't accept my email. Most of them fix it, because it's objectively a bug caused by their non-compliant code.
However. I completely disagree with the conclusion “The problems you run into are on you.”
I didn’t create the problem by having the audacity to be from a different country.
Get a big brand .com email and you'll never run into an issue.
I'd suggest being clever is wasting countless hours to handle your edge case. Or writing your own email validation in the first place.
Isn't email validation a solved problem in that there are services or ready software which provide RFC-compliant validation? If some company is wasting countless hours to do something because of Not Invented Here syndrome, isn't that the same as some company deciding to write cryptography algorithms on their own and reaping what they sow?
That's surprising to me because there is nothing particularly weird about your email address. What exactly do they complain about?
No matter what you will constantly be getting addresses that conform to the spec but cannot actually receive mail.
If you are in the former category, then yes, follow the spec to the letter. If you're in the latter, then screw the precise guidelines of the spec and reject emails that are very unlikely to be valid: no quoted localparts, no IP address literals. In addition, go ahead and say that email is case-insensitive (more precisely, case-preserving).
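The "latter" policy could be sketched roughly as follows. This is deliberately stricter than RFC 5322 and is ASCII-only, so an internationalized domain would have to be converted to its Punycode form before the check. The function names are invented for the example, and rejecting an all-numeric last label (a bare IP such as `user@192.0.2.1`) is an extra assumption on top of what the comment describes.

```python
import re

# Unquoted local part: runs of "atext" characters separated by single dots.
_LOCAL = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*")
# Hostname label: letters, digits, inner hyphens, at most 63 characters.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def accept_for_signup(address: str) -> bool:
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False
    if not _LOCAL.fullmatch(local):       # rejects quoted local parts
        return False
    if domain.startswith("["):            # rejects IP literals such as [192.0.2.1]
        return False
    labels = domain.split(".")
    return (
        len(labels) >= 2
        and not labels[-1].isdigit()      # assumption: also reject bare dotted IPs
        and all(_LABEL.fullmatch(label) for label in labels)
    )

def same_mailbox(a: str, b: str) -> bool:
    """Compare case-insensitively while storing the address as the user typed it."""
    return a.lower() == b.lower()
```

Single-letter local parts, "+" tags and multi-label domains such as `example.co.uk` all pass. Only the constructs named above are turned away.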
The hard part is if you're writing an email client, because you're basically forced to have your hands in both pies.
I always have to tell people, in real life, "it's .co, not .com," just in case - humans do this too.
One of our testers found XSS with email injection (RFC compliant validation passed) in our website.
And we are an e-mail company and should know better :D
Never trust user input!
But the way to prevent injection attacks is not to disallow or sanitize input, it is to escape correctly when interpolating strings in other languages.
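For the HTML case, escaping at the point of interpolation could look like this minimal sketch. `render_greeting` is a hypothetical helper; `html.escape` is from the Python standard library.

```python
import html

def render_greeting(email: str) -> str:
    # Store the address untouched; escape only when it is placed into HTML.
    return f"<p>Confirmation sent to {html.escape(email)}</p>"

page = render_greeting('"<script>alert(1)</script>"@example.com')
```

The address stays exactly as entered in storage, and the markup characters in it are rendered as text instead of being interpreted. Other output contexts (SQL, shell, mail headers) each need their own escaping or parameter binding.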
that works as long as <RFC>fan 69™@root does not write articles for ZDNet
With precious dev time, you can do better by doing less.
I just make folks email me first.
And this wasn’t a new lesson then. But at least we were smart enough to listen to the people who had learned that lesson before us.
It is now over 25 years later, and I’m sad to see that many people seem to be bound and determined to force themselves to re-learn that lesson the hard way.
To adapt from a famous quote: "all email validation logics are wrong, but some of them are useful" ;)
Fundamentally, the problem is that if you’re trying to validate an e-mail address as being correct and you’re not sending an actual e-mail message to that address, then you’re doing it wrong.
We learned this lesson back in 1995, people.
He made a very convincing argument: while an IP address is technically a valid domain, how many legitimate users were seriously using an IP address as their email domain? (Zero.)
Your other reasons for breaking interop between systems/languages are just whimsical and invalid. :)
One part of all this that I’m not aware of the situation around is “8. You can put emojis in the local part.” The HTML spec’s validator is all ASCII. It does remind you to punycode the domain labels, but makes no mention of internationalised local parts, and I’ve never learned about non-ASCII local parts or how well they’re supported. I gather they may require the sender to be capable as well as the receiver, whereas internationalised domain names were made compatible with all systems via punycode.
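For the domain half, the Punycode conversion mentioned above can be tried with Python's built-in `idna` codec. This is a sketch only: the codec implements the older IDNA 2003 rules, and `to_ascii_domain` is a made-up helper name. Nothing equivalent is applied to the local part.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert an internationalized domain to its ASCII (Punycode) form."""
    # Each non-ASCII label becomes an "xn--" label; ASCII labels pass through.
    return domain.encode("idna").decode("ascii")

ascii_domain = to_ascii_domain("bücher.example")
```

The result contains only ASCII, which is why every system can carry internationalized domain names even when it knows nothing about them.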
I couldn't care less if users want to enter undeliverable email addresses, they won't get emails. All that regex is intended to achieve is ensuring that the user hasn't accidentally filled the wrong field (e.g. tried entering their phone number) or mistyped a punctuation mark (foo#bar.com, foo@bar,com)
Strictly speaking, it won't match some valid email addresses, such as those with IPv6 address domains. But if I receive a support ticket complaining that we don't accept email addresses with an IPv6 address as the domain, I'll reply advising that the customer should purchase a domain name or sign up to one of many free email services.
Seeing as the web has long supported Unicode, where are e-mail addresses currently at in that evolution?
Are full Unicode e-mail addresses something that is decently supported today, or still largely theoretical? Is this regex sufficient? What kind of e-mail addresses do people in China most commonly use, for instance?
For the local part, though, it does look like browsers have fallen down, though I’m not particularly familiar with the situation there. Testing it in Firefox to confirm, ascii@υνικοδε validates, but υνικοδε@ascii doesn’t. https://github.com/whatwg/html/issues/4562 seems to be where progress is made from time to time. As usual, it’s not as simple as we might hope.
Baby shoes because of anglosphere programmers that can't fathom people wanting to use their own alphabets and thus forget to support it.
Clearly "anglosphere programmers" fathom it every day when they use UTF-8 almost universally in webpages. Also, you know, things like emoji are pretty popular in the "anglosphere" as well.
It's obvious that the real reason is an ancient e-mail RFC, and that while upgrading webpages to UTF-8 was relatively easy, in that it only needs 2 parties to support it -- the browser and the server -- upgrading e-mail is almost infinitely more complicated, because you have to wait for virtually all email code in the world to be upgraded, since an e-mail address is pretty useless if it doesn't work everywhere.
In other words, it's a coordination problem. Not an ignorance problem.
And unfortunately, Punycode doesn't seem to be a particularly viable stepping-stone/compatibility solution here. E.g. if a user tries to use ドメイン名例@example.com and it fails, you would be asking them to instead type in a seemingly-gibberish firstname.lastname@example.org, which could also conflict with a real e-mail address of that name.
After at least four decades of mostly bad internationalization support, it's no longer accusatory; it's empirical and quite generously worded.
Isn't that a way of saying "while disallowing perfectly valid options"?
There's a distinction to be drawn between the requirements of the actual MTA/MUA/MSA layers and user applications built on top of them. For the latter, considering emails to be invalid if they contain IP literals or quoted localparts is going to be more helpful than harmful (there's less scope for vulnerabilities in doing so). It's just like assuming email addresses are case insensitive: it's inappropriate if you're an MTA, but for everybody else, go ahead and assume they are.
A-ha, but here you're wrong because you've excluded IDNs. This is really why you should not try to be clever.
More importantly, what problem is this even trying to solve? Someone accidentally typing a 300 character domain? If they are intentionally feeding you gibberish they’ll just give you more realistic looking gibberish.
I’m absolutely certain that the 63-character limit for domain labels is never going to change, because it’s hardcoded in enormous amounts of software and hardware, and there’s no even vaguely compelling reason to even attempt to change it. But if such a thing did change, then you’d just add this to the extremely long list of things that needed to be updated.
People who thought TLDs would only ever be up to three characters long were simply wrong from the very start because they didn’t understand what they were dealing with. (As a simple example, .arpa was there from the start.) Understand that this wasn’t a matter of anything changing, it was that some people misunderstood and thought that a convention they observed was in fact a rule.
The problem this sort of validation is solving is weeding out things that are definitely not going to work, as soon as possible, because it’s good to point out problems to users as soon as possible, rather than having something silently fail or only notifying the user about it much later. Syntactic validation isn’t the be-all and end-all of accepting email addresses, but it’s definitely still worthwhile, even though you should generally do other validation based on DNS lookups and/or sending actual emails as well.
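The label-length check is cheap to express. The sketch below assumes the domain is already in ASCII/Punycode form and also applies the commonly cited 253-character limit on the whole name, which is not mentioned in the comment above; `domain_lengths_ok` is a hypothetical name.

```python
def domain_lengths_ok(domain: str) -> bool:
    """Check DNS length limits only: 1-63 characters per label, 253 overall."""
    if domain.endswith("."):
        domain = domain[:-1]              # tolerate a single trailing root dot
    if not domain or len(domain) > 253:
        return False
    return all(1 <= len(label) <= 63 for label in domain.split("."))
```

A check like this only weeds out names that can never resolve. Whether the domain actually exists and accepts mail is still a question for DNS and a confirmation message.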
So what makes it the best?
[Edit: it also assumes you've already parsed out the "real" address from the rest of the text field, which to me makes it a half-validator at most.]
But it would seem to be the best for general-purpose web use, e.g. signing up for a newsletter with an e-mail address that's pretty much guaranteed not to break anything.
Instead of being conservative in output, it's intentionally being conservative in input.
> it also assumes you've already parsed out the "real" address from the rest of the text field, which to me makes it a half-validator at most
I’m confused. The explicit purpose of this stuff is to validate an email address. Not to extract an email address from a freeform text field, which I think is what you’re talking about. Deciding how to do that is a whole ’nother can of worms.
Unless you're developing an app for an intranet, that's not a concern for most people.
This is already a field where there is a lot of misinformation flying around, and a page that merely regurgitates all of that misinformation without the perspicacity to realize that its purported information is internally incoherent is not helpful.
Unfortunately, many websites are configured to reject email addresses that contain a plus character. I've also encountered websites in the past that did accept the + character when creating the account where the email address serves as the user name, but then I could not log in because their login form rejected the + character in the user name.
No idea what sort of security this is supposed to provide.
They probably see it as some sort of security / anti-spam mechanism.
My emails have a tendency to become spam filter bycatch, to the point that when I was job hunting last year I'd have to ring people after I sent them my resumes etc. to confirm they actually received my email.
And when I give people my email address, I usually have to assure them that email@example.com is a legitimate email address and not a joke (it's not actually steve, but you get the point).
Definitely would not recommend using it for your personal address.
If he said he used "firstname.lastname@example.org", then it's possible he has a wildcard MX record for *.example.com, but that's not at all what he said, although perhaps it's what he meant.
Regardless, the question remains unanswered.
I also use greg-*@domain instead of *@domain, since their docs claim that setting up *@domain tends to attract more spam.
It's not cheap from PM, and there are loads of hosting providers that will provide catch-all email for free with your hosting package (but with some usually pretty poor webmail client) or if you use a mail client it should work too.
I like having good webmail and a good mail app and other things, so I pay, but there are plenty of good options available. Sadly, self-hosting an email server is not really an option for a variety of reasons, but you should easily be able to use catch-all e-mail addresses.
At some point, I need to migrate away from google and build out my own personal mail server.
The one exception is Craigslist; if I email someone with my normal email, I never get a response. I always use gmail for that.
It's a freemium model, but I've never needed anything in the paid tier
Sounds like this was in person at a store, though, which is extra weird because it seems unlikely that scammers would be trying to sign up en masse at a physical location (unlike if the form is connected to the internet).
email@example.com thus became firstname.lastname@example.org
I tried logging in, resetting passwords, nothing worked. I had to go to the authorities and make a written request to allow them to interrogate the database by the equivalent of my social security number, and that’s when we realized they just stripped the +.
Much more reliable than the + -thing, which breaks in the weirdest of places.
I had to switch my hosting provider at one point because they stopped supporting catch-all. I have no idea how many "addresses" I've used, since I don't create a specific email for each, so I had to get new hosting (note: this was over 10 years ago)
> I also get a lot of email from idiots who don't know their own address
Holy crap there are a lot of them. I've got one bank sending me the dude's statements. He's also been on some interesting trips, seen all his hotel stays, etc.
The other finally figured it out but his wife still hasn't after more than a decade. It gets really old receiving reminders to service a vehicle I've never owned from a dealership 2000 miles away among other similar crap.
Samsung doesn't accept emails with "samsung" as prefix, so I have email@example.com for them. I have no idea what the logic behind it is.
This allows a person to use any damn thing they want as their email address, provided it works and they can get the email.
Also, humans make mistakes. You should detect spelling errors and typos, then suggest corrections.
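One way to sketch such suggestions is fuzzy matching of the domain against a list of popular ones with the standard library's `difflib`. The `COMMON_DOMAINS` list and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib
from typing import Optional

# Hypothetical list of popular domains; a real one would come from your own user data.
COMMON_DOMAINS = ["gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "icloud.com"]

def suggest_correction(address: str) -> Optional[str]:
    """Return a 'did you mean ...?' candidate, or None if nothing looks off."""
    local, sep, domain = address.rpartition("@")
    domain = domain.lower()
    if not sep or domain in COMMON_DOMAINS:
        return None
    # Keep only the single closest domain above the similarity cutoff.
    matches = difflib.get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=0.8)
    return f"{local}@{matches[0]}" if matches else None
```

The result should only ever be offered as a suggestion. A legitimate but less common domain that happens to resemble a popular one would otherwise be "corrected" into someone else's address.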
Always send the confirmation "did you sign up?" email. Always.
It's hard to be smart with something like names.
In practice, you build your UI for the latter. You add captchas or other friction for the former.
Obviously, this is a different scenario than your bank not accepting your valid (per RFC) email address. Which is why any sort of blanket advice is pretty dumb. Not that I care to aid spammers...
The other scenario might be a site that puts up a "paywall" type thing, where you are forced to enter an email address to gain quick access to something, but doesn't want to bother you with going and verifying an email (e.g. instant discounts, downloading a PDF, etc.). Or in-person email address collection when you buy something in a store. It's never a good idea to collect email addresses of people that have no desire to subscribe to your marketing.
Send that address a confirmation email. Now you've got consensual opt-in and you've somewhat protected yourself from adding a wrong address to your recurring mailing list.
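The confirmation link needs a token that only the mailbox owner will see. A minimal sketch using an HMAC-signed expiry follows; `SECRET_KEY`, the 24-hour `TOKEN_LIFETIME` and the function names are assumptions, and actually sending the message is left out.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(32)   # hypothetical; a real service would load a persistent key
TOKEN_LIFETIME = 24 * 3600             # assumed validity window, in seconds

def _sign(email: str, expires: int) -> str:
    msg = f"{email}|{expires}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def make_confirmation_token(email: str, now=None) -> str:
    """Build the token that goes into the confirmation link."""
    issued = now if now is not None else time.time()
    expires = int(issued) + TOKEN_LIFETIME
    return f"{expires}.{_sign(email, expires)}"

def confirm(email: str, token: str, now=None) -> bool:
    """True if the token was issued for this address and has not expired."""
    expires_text, _, signature = token.partition(".")
    try:
        expires = int(expires_text)
    except ValueError:
        return False
    current = now if now is not None else time.time()
    expected = _sign(email, expires)
    # Constant-time comparison of the presented and expected signatures.
    return current <= expires and hmac.compare_digest(
        signature.encode("utf-8"), expected.encode("utf-8")
    )
```

An address counts as opted in only once `confirm` succeeds, so a mistyped address never lands on the recurring list.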
Prevent abuse with long (seconds) delays between submissions from the client. If the user thinks they did it right, they're waiting on their email inbox anyway; if they immediately realize they made a typo, it'll take 2-3s to fix.
The RFCs were written when manually (not from cron) sending email to another user on your local system was a thing that actually happened. I'm certain you actively want to avoid that now.
If I can send you an email and you can verify that you have access to that email, your email is "valid enough" for me.
Then, the validation is basically "is there an @ and, after it, a dot in there?". I find that after that, every hour spent on improving the validation will just cause more emails to be falsely flagged as invalid and more support requests from people who couldn't sign up with valid emails. It's also code we need to maintain, and any time someone edits the validation logic, they risk breaking sign-ups completely.
So with more "improvements" to the validation, you just cause more problems. Then why do it?
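Taken literally, that check is a couple of lines. This is a sketch of the idea, not anyone's production code, and `looks_like_email` is an invented name.

```python
def looks_like_email(address: str) -> bool:
    """Is there an '@', and a dot somewhere after the last one?"""
    _, sep, domain = address.rpartition("@")
    return bool(sep) and "." in domain
```

It still catches a phone number typed into the wrong field, `foo#bar.com` and `foo@bar,com`, while leaving real deliverability to the confirmation email.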
I hear the reputation arguments, but in practice, it never happened to any of the organizations I worked for.
What happens though very often is naive engineers trying to solve problems the business doesn't have with knowledge they lack...
premature implementation is the source of most evil. :-)