Complexities of e-mail validation logic (netmeister.org)
287 points by Tomte on May 24, 2021 | 385 comments



If your email is <RFC>fan 69™@root I am not going to let you sign up. Sending emails costs money and bouncing emails affects your sender reputation. Also, for every user out there using <RFC>fan 69™@root as their email address, there are going to be thousands of people accidentally entering their email address incorrectly and not getting an alert about it. Yes, you could do fancy shit like checking MX records and whatnot, but come on: I'm not going to build and maintain that infrastructure for the one-in-a-million person who is trying to use that address.

Developer time is precious at a startup and supporting <RFC>fan 69™@root while still denying b ob@gmailcom is very, very far down the list of things to do.

In summary: I don't suggest doing 'perfect' email validation to RFC spec. You will save money/devtime and make more of your users happy by not doing it.


This sounds great but what you think is "common" probably isn't.

When I was validating myself for Amazon Prime Student, I literally had Amazon refuse to accept my student email in the form first.m.last@myschool.edu because there were two '.'s in the mailbox portion. I had to send an email to support and it was eventually dutifully fixed.

And that's not an uncommon format for, you know, school emails. And that's an Amazon engineer who should have known.

I imagine there's developers who think "domain.tld" is the only thing valid to put in the domain portion, and that's going to fail with "domain.co.uk", or uncommon TLDs, or other perfectly valid constructs. And sure "it's only x% of the users" but it's a pain in the ass if you're that user. You need to be reasonably permissive.

(but on the other hand "myname@..." is not valid either, and that will fail and cost you money as well... hence leading us back to 'just follow the spec')
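A "reasonably permissive" check in the spirit of the comment above could look like the following sketch. It is not RFC 5322 validation; the function name `looks_deliverable` and the exact rules (one "@", no whitespace, at least two non-empty domain labels) are assumptions picked for illustration, and it deliberately ignores quoted local parts.

```python
def looks_deliverable(address: str) -> bool:
    """Permissive sanity check; NOT a full RFC 5322 parser (illustrative only)."""
    # Split on the last "@"; if there is no "@" at all, sep is empty.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    # Assumption: whitespace never appears in an address we want to accept.
    if any(ch.isspace() for ch in address):
        return False
    labels = domain.split(".")
    # Require at least two non-empty labels: rejects "root", "gmailcom" and "...".
    if len(labels) < 2 or any(not label for label in labels):
        return False
    return True


for candidate in ["first.m.last@myschool.edu", "user@domain.co.uk", "myname@..."]:
    print(candidate, looks_deliverable(candidate))
```

Nothing here restricts the local part or the TLD, so dots, plus signs, single-letter names and two-letter country codes all pass.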


This is a false dichotomy. Supporting dots in the email address is trivial, following the spec is not.

Furthermore, for a user, it is trivial to get another email address if the one they have causes issues, so it is not really an accessibility issue either.


It depends upon where you are validating email input at.

For the initial email input, your logic works fine. Once it is applied downstream in a process, it begins to get messy. Someone might do an incorrect email validation that happens to block emails that you have already accepted or which you are importing from a valid source. Someone has already given the example of a login field not allowing them to use the email they signed up with. If such upgrades occur later in a project's life cycle, not only might you have to spend developer time, you may also have a production outage.

Personally, I suggest using some, even if imperfect, validation when gathering the email initially (for the reasons you point out) and then not validating that information any further.


I actually run into this all the time with passwords using a password manager. Lots of places will accept the creation of a password that's long/complex/etc but then when you actually try to log in with it it won't accept a long password, won't accept certain characters, will silently truncate it and throw an invalid password error, etc.

Sometimes disabling JavaScript will fix it, sometimes not. I occasionally have to resort to using "I forgot my password" until I figure out what the actual underlying requirements of the passwords are.


Etrade lets you create a 32-character password, but if you enable 2FA you suddenly can't log in because apparently they concatenate the password and the 2FA code and then check the length. So make sure your password is at most 26 characters. (They might've fixed this, but I haven't tried.)


Like GP mentioned, Etrade also does the thing where it accepts the . character on password creation, but not login. That was fun to figure out.


Curious, can you login with 26 characters and your MFA seed to bypass MFA entirely?


Yup! Same thing with the ridiculous verify-my-identity questions. One I encounter all the time is the local community college, which let me use spaces in my answers on creation, but not at entry time. Grrrr.


That happens too often! A lot of places where I try to use a really long password will silently truncate it, but different forms will truncate at different lengths, so what might work for registration might not work for login or changing the password later.

I’m always suspicious when sites cap passwords at < 32 characters; that almost always means the password is being stored in a reversible format someplace: maybe encrypted, maybe obfuscated, or maybe neither (banks).

The sites I really trust don’t care how long your password is because their hash size is fixed. The only real length consideration might be that if a bunch of people send obnoxiously long passwords at the same time and the site is using bcrypt or scrypt, it might stress the server’s CPU, so they might put an upper limit in place to prevent that.
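To illustrate the fixed-hash-size point, here is a minimal sketch. It swaps in PBKDF2 from Python's standard library instead of the bcrypt or scrypt mentioned above, purely so the example has no third-party dependency; the cap `MAX_PASSWORD_BYTES`, the iteration count and the function name are all assumptions, not recommendations.

```python
import hashlib
import os

# Hypothetical generous cap; it exists only to bound CPU work per request.
MAX_PASSWORD_BYTES = 1024


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a fixed-size digest from a password of (almost) any length."""
    data = password.encode("utf-8")
    if len(data) > MAX_PASSWORD_BYTES:
        # Reject loudly instead of silently truncating.
        raise ValueError("password too long")
    # PBKDF2-HMAC-SHA256 always yields 32 bytes here, whatever the input length.
    return hashlib.pbkdf2_hmac("sha256", data, salt, 100_000)


salt = os.urandom(16)
short_digest = hash_password("hunter2", salt)
long_digest = hash_password("correct horse battery staple " * 10, salt)
print(len(short_digest), len(long_digest))
```

Both digests have the same length, so the storage column never needs to know how long the original password was.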


I ran into this with Sony's PlayStation site. I generated a passphrase that the registration form allowed but from then on I was unable to access my account. I went through the same trial as you and found that they were truncating the password to something like 16 characters. That was just this past year, so I'm pretty sure it's still that way.


As a user, I got burned by that several times. Now, when I create a new account somewhere, the first thing I do is log out and try to log back in.


I don't encounter this very often myself. So far the only place I've seen this is Paypal. facepalm


<input type="password" maxlength="xx"> should be illegal. The user has no way to notice whether the input was truncated.


I've run into this with labcorp. Their desktop webapp takes subdomain emails, but their mobile iOS health webpage login thinks a subdomain email is invalid and disables the login button. They also don't let you change your account email so you can never really fix this issue properly.


I get your point, but it ends up pretty arbitrary who picks up what part of spec to implement / which part of spec they deem "common sense".

e.g. It drives me BONKERS how many systems absolutely reject my single-letter email (~"N@domain.com"), which I created specifically to make it easy and safe to type on mobile devices etc. Others will reject the "+" sign, or underscore, or dot/period, or (brilliantly) two periods or underscores, etc etc etc :=/


There are also blacklists for names you can use. My real e-mail is admin@myname.com but Facebook doesn't allow me to use that e-mail, warning me that only personal e-mails are allowed. Paradoxically I ended up using my work e-mail to get around the restriction.


Agreed. Anyone that decides to come up with a "common sense" subset of emails to allow is more likely to accidentally block some perfectly normal addresses than actually accomplish something useful. I remember I once made this point in an HN thread, and I got a very dismissive and overconfident reply by someone who posted a regex that turned out to block my own email address (which had a number immediately before the @ symbol).

Writing code that doesn't need to exist that has a possible failure mode of not letting someone sign up at all is just a bad decision. If you're going to write that code, either go through the effort of getting it completely right or soften the failure mode. If you really think the user is somehow mistyping their email with special characters or an unusual TLD, then you could show them a non-blocking warning message.
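The "soften the failure mode" idea can be sketched as a function that only ever returns warnings and never rejects. The typo table `COMMON_DOMAIN_TYPOS` and the name `signup_warnings` are hypothetical; a real list would have to come from your own signup data.

```python
from typing import List

# Hypothetical, intentionally tiny table of domains that are almost certainly typos.
COMMON_DOMAIN_TYPOS = {
    "gmial.com": "gmail.com",
    "gmail.con": "gmail.com",
}


def signup_warnings(address: str) -> List[str]:
    """Return non-blocking hints; an empty list means nothing looked odd."""
    warnings = []
    _, _, domain = address.rpartition("@")
    suggestion = COMMON_DOMAIN_TYPOS.get(domain.lower())
    if suggestion:
        warnings.append(
            f"Did you mean @{suggestion}? You can continue with the address as typed."
        )
    return warnings


print(signup_warnings("bob@gmial.com"))
print(signup_warnings("me@example.cc"))
```

The caller shows the hints next to the form and still submits whatever the user confirms, so an unusual but valid TLD is never a dead end.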


My email address ends with the .cc TLD, and the number of websites which say "Did you mean to type .ca?" and then refuse to let me continue without changing it drives me similarly batty.


Maybe what we need is a narrower standard that subsets RFC5322 to remove (or at least deprecate) rarely used address syntaxes, such as double-quoted local parts.

(Actually RFC5322 already deprecates some syntaxes. For example, "John Hacker"."Ph.D., Esq."@example.com is a deprecated syntax (obs-local-part), because it contains multiple quoted-string components separated by dots.)

> or (brilliantly) two periods or underscores

Two or more consecutive periods is actually disallowed by RFC5322 (unless quoted). foo.bar.baz@example.com is a valid address, foo..bar@example.com is not. ("foo..bar"@example.com is however)
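The consecutive-dot rule for unquoted local parts can be checked with a small regex built from the RFC 5322 `atext` character set. This sketch covers only the unquoted (dot-atom) form; quoted strings such as "foo..bar" are out of scope, and the function name is made up for the example.

```python
import re

# atext from RFC 5322: letters, digits and these printable specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
# dot-atom-text: one or more atext runs separated by single dots.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_unquoted_local_part(local: str) -> bool:
    """True if `local` is a valid dot-atom (no leading, trailing or doubled dots)."""
    return DOT_ATOM.fullmatch(local) is not None


for local in ["foo.bar.baz", "foo..bar", ".foo", "foo."]:
    print(local, is_unquoted_local_part(local))
```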


I don’t even have anything weird going on with mine other than using my own domain with a less common 2 letter country code TLD, but it gets me rejected from signing up to at least a few services per year


Customer complaints usually determine the spec.

If enough people wrote in about not accepting one letter email addresses, then they would likely update the validation.

But if customer service tells users to use another email address in that scenario and the customer does that, then it might not be worth the effort to fix it.


It’s getting to where if you support @gmail.com and no other you’d still get 80% of signups.

Better to warn that an email doesn’t look right, but let them continue if they want to.


I think it's more likely the marketing department doesn't like people leaving fake addresses.

When I want to check postage or whatever, and they require an email address, a@b.c typically doesn't work, but no@mail.com does.


This logic is why so many Web sites today won't let you use a plus sign in email addresses, which ruins a really nice Gmail feature.


As people said, true spammers know to just strip off the "+" in the email address. This is actually a fun reason to set up your own domain and set up email forwarding for *@example.com to go to your main gmail or whatever account, then the "username" part of the email I just set to the domain of the account I'm signing up for. So I'll use amazon@example.com when signing up at Amazon (or whatever site).


> So I'll use amazon@example.com when signing up at Amazon

I go a little farther. I figure an attentive spammer might figure out that if I use amazon@johnsmith.net to sign up for Amazon, I may have exactly the scheme where *@johnsmith.net will work, so they can just add that to the spam list as a wildcard and pick a new address every time. So instead, I use john101@johnsmith.net, john102, john103, etc, to try and obscure my strategy and prolong the life of the domain forwarding.


Hard truth is you're not worth enough for a spammer to look for that pattern, it's a numbers game and you're just making it harder on yourself.

Also unless you're keeping a lookup table you're losing a great benefit of the wildcard. You can, and I have caught a few places, tell when a company sells your email. If I get an email from company XYZ to my email abc@example.com I know exactly who sold my email and to whom.


I agree that I'm probably not worth the effort, but if this kind of domain wildcard strategy were to become more popular it is entirely feasible for a rudimentary machine learning algorithm to detect its use.

> unless you're keeping a lookup table you're losing a great benefit of the wildcard

That's true, I don't keep a lookup table per se, though I do have a deleted items folder that I could look back in. I'm not sure what I would do, though, if I knew what particular company sold my email address? Send them a nastygram they will just ignore? I just block the address and move on.


I think a typical spammer doesn't care much about such users, but if given a choice they would rather avoid such users.

AFAIU, most bulk spam is targeted at gullible or vulnerable people. The spam is often terrible on purpose.

Sophisticated or targeted attacks are a different category and they may be a good reason to prefer something non-guessable.


I just have an entire domain for the purposes of spam. Anything sent to there ends up in my bulk folder. I use amazon@domain.com so I can tell who sells my email or gets hacked. Never noticed someone trying to send an email to any addresses I haven't previously used.


> Never noticed someone trying to send an email to any addresses I haven't previously used.

At least a few years ago, I noticed a lot of spam to <random first name>@<my domain> -- i.e., completely made-up addresses that I had never used. Since messages sent to those addresses were guaranteed to be spam, I started treating them as free training data for the spam filter.

I don't know if this still happens, though, because I haven't looked.


This is currently happening to my email domain. Gets rejected as it doesn't have a valid hash (recipient name), but the logfiles are full of <3 letters>@mydomain.com and <english_word>@mydomain.com rejections.


Yeah, this is an age-old issue -- in the early 00s, my mom got a domain and used the email <first_initial>@<domain>.com. She gave up battling the deluge of spam after about a year. We looked through the logs, and saw that her next choice of handle was also getting tons of spam, too, because it was also short.


I do the same thing. I use whatever@domain.email. The addresses are temporary if I want them to be and I can automatically lock the senders to a list that is either automatically learned after x days or manually curated. I've seen some 'marketing' mail get filtered but no hacks yet.


I’ve got amazon@domain.com email for my domain and I’ve never created such an account, much less given it out. Without some uniqueness in the username, I’m not sure you can tell a company sold or lost your data.


Spamming is a numbers game. I kinda doubt enough people are using this scheme to make figuring this out worthwhile for a spammer.


I've wondered about this with big companies like Facebook, Google, Amazon, etc. as well as behind-the-scenes spyware/ad firms who are all probably very interested in linking my identity across user accounts, email addresses, device fingerprints, etc. I've hoped that there aren't enough people doing it (yet) for these orgs to find it worth the effort.


There very much are companies doing this and selling it as a service...here's an API that you can query with a piece of contact information to retrieve all sorts of additional information, including hashes of alternate email addresses, mobile device ids, social media profiles, and plenty of other stuff: https://platform.fullcontact.com/docs/apis/enrich/person-ins...


At a certain point -- probably the moment it becomes a business unto itself -- this kind of data collection should be subject to all the same rules we've come up with for credit bureaus. It should be a legal requirement that I can get the entire profile they have built for me.


I was wondering specifically if they have special cases to identify such "personal" email domains and use them for record linkage.

It seems like an obvious thing to try, but maybe not worth the effort of implementing it, given the high risk of false positives and the low % of people who actually do stuff like this (not to mention they're probably not people who click on ads anyway).


Given the sheer amount of money involved, I believe it is likely that there are players in the market who are far more capable than we give them credit for.


I kinda imagine that spammers go for low-hanging fruit. So spammers won’t bother with defeating a catch-all domain forwarding, as it’s unlikely to give them returns. Although a motivated attacker might decide to try to send interesting phishing.


What do you use for email hosting? I've tried to do something similar but most places have a limit for email addresses even with paid plans.


Fastmail does not seem to have any limit listed. I have not tested any extreme case since I just use a wildcard.


well, you could turn it around and use + addresses everywhere, so that any legitimate response must be to one of your + addresses. then treat anything without + as spam.


That makes me want to use an email address of the form +myname@mydomain.com, just to see how websites would handle stripping out everything starting from the +.


The + is also useful for knowing who sold your email address on or was responsible for a data breach. If I start getting spam to <my name>+hulu@gmail.com, then I know I could chase down Hulu on Twitter for an explanation.


I do a similar thing, except that the email is actually hosted at my domain rather than being forwarded, and that I have a list of email addresses that I accept and reject all others; if I receive too much spam at one address, I disable receiving at that address.

I have found this to work; I hardly receive any spam at all, and do not need any separate spam filter.


I've run into at least one domain that blocks their own name in your email; that was fun.


Didn't you find you got a deluge of spam to generic addresses like admin@, info@, offers@ and so on? I tried this, although it was probably about 15 years ago now, and reverted it because I got about the same amount of spam as genuine emails.


Although hardly anyone uses yahoo mail anymore, they actually have this feature built in. Basically email aliases.


Note that you should only do this for maybe 6-18 characters, some sites will test send an email to [30-100 character random string]@example.com and see if it bounces - if it doesn't, it'll suspect that domain to be some spammer with a catch-all email inbox and block it.


That's a terrible approach, plenty of valid, legitimate non-spamming domains use catchalls of arbitrary length for all sorts of reasons.

Additionally, sending a test email like that might also get the sender placed on a black list for triggering a spam trap inadvertently.


That's a worrying strategy because there are many reasons for using a catchall. Example: one email per site to track companies selling personal data, then maybe bounce that single email address.

Do you know any site blocking domains with a catchall?


Yeah, if you have a domain of your own the sensible thing is a catchall, use a different address everywhere and block the ones that spam.


Do you know what sites do that? I have my own domain and I haven't seen anybody do that. The obvious solution is to configure your mail server to only accept usernames before the '@' that adhere to some rule which only you know. Like checking if it is a palindrome or something obscure like this.


I watch multiple corp's mail logs extensively, this is not even remotely a common thing.

Worse, I personally know at least 5 or 6 people who use catch-alls. It seems like a very poor method to reliably catch spammers.


Max local part size is 64 octets. So 100 chars would be out of spec.
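The limit is counted in octets, not characters, so a length check has to look at the encoded bytes. A minimal sketch, assuming UTF-8 encoding and a hypothetical helper name:

```python
MAX_LOCAL_PART_OCTETS = 64


def local_part_within_limit(address: str) -> bool:
    """Check the 64-octet cap on the part before the last '@'."""
    local, _, _ = address.rpartition("@")
    # Count encoded bytes: a non-ASCII character takes more than one octet.
    return len(local.encode("utf-8")) <= MAX_LOCAL_PART_OCTETS


print(local_part_within_limit("a" * 64 + "@example.com"))
print(local_part_within_limit("a" * 100 + "@example.com"))
```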


The best are sites that let you sign up with a ‘+’ but not log in. Zappos used to be the most prominent example.


Reminds me of a patio11 post (which I haven't been able to track down) where he said he gets people signing up with a '+' but then forgetting to include the extra part when they log in later. His login code accepts both versions and increments a counter to track how many people were too smart for their own good.
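A rough sketch of that kind of forgiving lookup is below. It is not patio11's code; the class, the in-memory dictionary and the counter name are all invented for illustration, and the password check that would follow the lookup is left out.

```python
from typing import Dict, Optional


def strip_plus(address: str) -> str:
    """Drop '+tag' from the local part: 'pat+shop@host' -> 'pat@host'."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address
    return local.split("+", 1)[0] + "@" + domain


class LoginLookup:
    def __init__(self, accounts: Dict[str, str]) -> None:
        # Maps the exact address used at signup to an account id.
        self.accounts = accounts
        self.forgot_plus_count = 0

    def find_account(self, typed: str) -> Optional[str]:
        # Exact match first: the signup address is always a valid identifier.
        if typed in self.accounts:
            return self.accounts[typed]
        # Fallback: the user signed up as name+tag@host but typed name@host.
        matches = [
            account_id
            for signup_address, account_id in self.accounts.items()
            if strip_plus(signup_address) == typed
        ]
        if len(matches) == 1:  # refuse to guess when several accounts match
            self.forgot_plus_count += 1
            return matches[0]
        return None


lookup = LoginLookup({"pat+shop@example.com": "acct-1"})
print(lookup.find_account("pat@example.com"), lookup.forgot_plus_count)
```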


I've run across at least one banking site which accepted a password on the sign-up page which was later rejected by the login page. The validation scripts on the login page used a more limited set of permissible special characters which didn't include parentheses. Fortunately it was only a client-side check, so it was relatively simple to bypass it once using developer tools and change the password.


American Express at one point let me set a password over 8 characters, but logging in after only worked if I provided only the first 8.


At one point I know they also weren't case sensitive.


Why would you ever validate the characters of a password on the login page? What a weird thing to do.


I once had a site _silently strip_ the + from signup email. So when I submitted `myname+yoursite@gmail.com` as my email address, they started sending mail to `mynameyoursite@gmail.com`. Madness.


This is common; spammers know the semantics of '+' for gmail and will strip it. You need to assume that it will happen.


GP said the site stripped the “+” only, essentially sending his email to another address entirely. Spammers strip the “+” and whatever follows it, so the spam ends up at the same address.


If spammers are converting from `myname+yoursite@gmail.com` to `mynameyoursite@gmail.com`, they are welcome to spam it as much as they want - it won't get to me.

EDIT: moreover, a service is perfectly within their rights to _internally_ store my email as `myname@gmail.com` if they want - but they should still accept `myname+yoursite@gmail.com` as the identifier used to login with.


I’ve seen sites send emails where the unsubscribe link doesn’t work because the URL contains the email address I signed up with and that email address contains a character that their web server doesn’t play well with.


Interesting. How did that work? Does that mean that they would only create the user account under the + suffix? I imagine they must have had two email fields - the canonical email for login and then a separate notification email?


Why do so many people responding to this seem to assume the plus sign is to fool spammers? Of course it's not useful for antispam. It's mostly meant to make it easier to trace where a (legit) email comes from, for example to set up filters. https://gmail.googleblog.com/2008/03/2-hidden-ways-to-get-mo...


> Of course it's not useful for antispam.

I've received spam emails to at least 70 different +addresses. It is absolutely useful for antispam.

Spammers don't care about the reputation of the company they bought or stole the data from.


Can't you put "." anywhere in your email to use the same address multiple times in Gmail?


Yes, assuming the websites following this logic don't block that too, but then you have to keep track of a mapping of dots to websites yourself instead of it being obvious from what you put after the plus sign.


Or use a password manager.


Spammers aside, I'm interested to know what strategy different saas companies do in regards to users creating an account with + alias - Do you let users create multiple accounts with the same email but different + alias ? Or do you recognize that it's an alias and say that the account already exists ?

Not all email providers support the + notation, so you'd have to run a domain lookup against some hard-coded list.


Why should they care though? Anyone with a catchall can create 7 billion normal looking addresses under their domain. It's not like it would prevent anything.

Also anyone with gmail address can also place dots almost anywhere into the local part, to create another unique address without using a + sign.


Because of spammers creating garbage accounts on your platform


Yeah, but why care about email address format, if it's not going to stop spammers anyway, and you're just risking losing legitimate customers if you mess up trying to mangle part of an address you should not be touching according to the recommendation in the standard?


The email spec considers these different emails. It's not a website's job to worry about how an individual host treats them. Gmail also treats . in the user section as useless but most hosts do not.


> This logic is why so many Web sites today won't let you use a plus sign in email addresses, which ruins a really nice Gmail feature.

Contrary to popular belief, it is not a gmail feature.

I first heard of the + as destination filtering in the very early 90s at CMU where it was broadly used. Every single email address I've had since then has supported the same (and notably, apart from a test account, I've never used gmail much, so that's not including gmail).


I don't really think that's the same. Forgetting the "+" in the validation regular expression is something else than refusing to implement all kinds of extra checks to support very weird and very unused things.


Sites likely prefer your canonical/standard email address over any plus version. It would be easy to trim anything after the plus too I guess and just email you at your normal address


I configured my mail server to use _ as a sub mailbox identifier to stop creeps who block +. I assume they are doing it to make sure their precious spam shows up in my inbox.


OTOH, it being a standardized thing, a spammer would absolutely just strip that plus part off. Better do it secretly like a catch-all.


It's not really all that standardized. The use of a '+' character to indicate an alias or label is merely convention—if you run your own server you can set the separator to any character you wish, or disable the feature altogether. As far as the RFCs are concerned the '+' character is just part of the account name and there is no reason why it cannot be a mandatory part of the account name on any particular server, such that stripping off the '+' and any trailing characters results in an invalid e-mail address, or even someone else's e-mail account. For sending email or using an email address as an account identifier it's definitely incorrect to treat abc+xyz@example.com and abc@example.com as equivalent. The same goes for account names which differ only in capitalization or placement of periods: some servers are case-insensitive and ignore periods in account names (e.g. Google) but these are server-specific traits and compliant email senders should not assume that every server will work the same way.
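In code, the conservative comparison described above might look like this sketch: the domain is compared case-insensitively (DNS names are), while the local part is compared byte for byte, with no '+' stripping, dot removal or case folding. The function name is hypothetical.

```python
def same_address_for_sender(a: str, b: str) -> bool:
    """Conservative equality: only the domain is case-folded."""
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    # The local part is opaque to everyone except the receiving server.
    return local_a == local_b and domain_a.lower() == domain_b.lower()


print(same_address_for_sender("abc+xyz@example.com", "abc@example.com"))
print(same_address_for_sender("abc@Example.COM", "abc@example.com"))
```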

The '+' alias feature is a fairly common configuration, though, so for source labels it's better to either treat all unlabeled messages as spam or else use a more opaque labeling scheme (unique-hash@example.com) which doesn't hint at an alternative untracked email address.
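One way to get an opaque per-site label is to derive it from a private key with an HMAC, so the address reveals neither the site nor a "base" mailbox. This is only a sketch: `SECRET_KEY`, the 12-character truncation and both function names are assumptions, and it presumes a catch-all domain you control.

```python
import hashlib
import hmac
from typing import Iterable, Optional

# Hypothetical key; in practice keep it private and out of source control.
SECRET_KEY = b"replace-with-a-private-key"


def alias_for(site: str, domain: str = "example.com") -> str:
    """Derive a stable, opaque address for one site."""
    mac = hmac.new(SECRET_KEY, site.lower().encode("utf-8"), hashlib.sha256)
    return f"{mac.hexdigest()[:12]}@{domain}"


def site_for(alias: str, known_sites: Iterable[str]) -> Optional[str]:
    """Map an incoming recipient back to the site it was issued to, if any."""
    for site in known_sites:
        if alias_for(site) == alias:
            return site
    return None  # unknown recipient: treat as spam


sites = ["amazon.com", "hulu.com"]
issued = alias_for("hulu.com")
print(issued, site_for(issued, sites), site_for("admin@example.com", sites))
```

Mail to any recipient that `site_for` cannot map back is guaranteed not to come from an address you handed out.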


Subaddressing is standardized in RFC 5233.


For the Sieve Email Filtering Language, yes. Which is not actually part of SMTP. And even in RFC 5233 the specific separator sequence is up to the server; the RFC only specifies queries for ":user", ":detail", and ":localpart" to filter on the different fields independent of the choice of separator.
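Splitting a local part into those three fields is short enough to do by hand once you know the separator your own server uses. The sketch below assumes the split happens at the first occurrence of the separator, which is a server policy choice rather than something every server guarantees; the names are illustrative.

```python
from typing import NamedTuple, Optional


class LocalPartFields(NamedTuple):
    localpart: str         # everything before the last "@"
    user: str              # part before the separator
    detail: Optional[str]  # part after the separator, None if absent


def split_subaddress(address: str, separator: str = "+") -> LocalPartFields:
    """Split using a server-specific separator (assumed: first occurrence wins)."""
    localpart, _, _ = address.rpartition("@")
    user, sep, detail = localpart.partition(separator)
    if not sep:
        # No separator present: the whole local part is the user.
        return LocalPartFields(localpart, localpart, None)
    return LocalPartFields(localpart, user, detail)


print(split_subaddress("ken+sieve@example.org"))
print(split_subaddress("ken_sieve@example.org", separator="_"))
```

This only makes sense for addresses at a server whose separator you know; for anyone else's addresses the local part should stay opaque.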


Can you recommend a library / tool which can extract those reliably?


I think that when using email aliases to identify spam sources, the crucial part is that you can filter the stripped address (as well as any unapproved alias) to be directly identified as spam and then the +alias part becomes a key to properly get into the inbox.

That whole setup for tidiness is broken the moment a desired website does not accept an alias in your address, of course.


> a spammer would absolutely just strip that plus part off

Why? They don't care about protecting the business interests of wherever they got that address from, and it's not like stripping the plus off will meaningfully increase the success rate.


The plus is a standard feature of email, not a Google specific thing like ignoring dots.


gmail != email. Breaking gmail features helps ensure a level playing field.


It also breaks legit non-gmail addresses.


My email address is valid and has been valid for a really long time.. but about 5% of ecommerce shops refuse to accept it.. so they don't get my money.

Don't get clever, just follow the spec.


If 5% of ecommerce shops refuse to accept it, it's likely you being clever.

My email is refused by 0% of ecommerce shops... because I just have a normal email.

Don't be clever, pick a better email.


What's "normal", though? "<8-10latinalphanumerics>@gmail.com?"

My email is just "me@<my-last-name>.al"[1] which is just a tiny bit "unusual" - and over the years it got refused by a couple stores because of TLD. And Albania is not Cocos Islands, they're surely not popular with spammers.

If a store believes there's only ".com" gTLD and nothing else (this had really happened to me, some galaxy-brain made a form with a hardcoded ".com" suffix; not even ".net" or ".org" were accepted, unfortunately I don't remember the site) - well, fuck that store, their loss not mine. Worst case, if I really want something they sell, I'll give them a throwaway email - which will contribute to their mail bounces after some time.

__________

[1] ".al" is a ccTLD for Albania, which is not a country of my citizenship or residence. I picked the domain name as a hack, because my first name is Aleksei and my first and middle names form "A.L." initials as well. That, and because all relevant .name domains were already taken.


Might sound strange but yes, me@<my-last-name>.al is being clever. You found a nice short clean email by buying a domain from Albania and setting up a me@ address. That's clever.

Think about it this way: either you can get some big brand .com email with no special username and never have an issue, or you can flail around 5% of the time and yell at the clouds.

Should everyone accept your email? Of course! I'm just saying you live in real life, and in real life people suck at building email forms. The problems you run into are on you.


> Should everyone accept your email? Of course! I'm just saying you live in real life, and in real life people suck at building email forms. The problems you run into are on you.

No, the problems they run into are caused by (at best) mediocre developers. They’re entirely to blame. We have specs and standards for a reason.


I honestly don't understand what you're trying to say. What's actionable about your view? You going to call up every business that doesn't accept your email and tell them their programmers suck? Businesses like this are never going away. It's a losing battle.

Instead you can just get a big name .com email and call it a day. Live your life without trying to make some statement about email standards.


> You going to call up every business that doesn't accept your email and tell them their programmers suck?

No, because it's not 1993, but I absolutely do use the contact forms or bug reporter for any website that doesn't accept my email. Most of them fix it, because it's objectively a bug caused by their non-compliant code.


I completely agree with the pragmatic response, and not only encourage it, but do this myself too. It is absolutely the correct advice in the situation.

However. I completely disagree with the conclusion “The problems you run into are on you.”

I didn’t create the problem by having the audacity to be from a different country.


We should totally shame developers/businesses who don't accept valid emails, just as we should continue shaming those with insane password policies or insecure practices.


This conversation reminds me of the 0.00001% of people that browse the internet with JS disabled and then complain about how so many sites don't work for them.


If you aren’t accepting very normal email addresses at perfectly valid TLDs, then you are a bad programmer. At least import a list of the new TLDs every ten years.


Of course they're a bad programmer. But we live in real life, where bad programmers exist.

Get a big brand .com email and you'll never run into an issue.


>> Don't get clever, just follow the spec.

I'd suggest being clever is wasting countless hours to handle your edge case. Or writing your own email validation in the first place.


> wasting countless hours

Isn't email validation a solved problem in that there are services or ready software which provide RFC-compliant validation? If some company is wasting countless hours to do something because of Not Invented Here syndrome, isn't that the same as some company deciding to write cryptography algorithms on their own and reaping what they sow?


> about 5% of ecommerce shops refuse to accept it

That's surprising to me because there is nothing particularly weird about your email address. What exactly do they complain about?


I would assume because it's only 2-chars (me) and they're filtering anything <3 as invalid.


Yeah, that's what I would guess as well. But there's a big difference between "follow the [ridiculously complicated] spec to the letter" and "don't do obviously stupid things like filter out email addresses with short names". The latter is good advice, the former not so much IMHO.


For a couple glorious years I had a 2-letter email address at a single-letter .com domain. It was rejected a surprisingly small number of times.


Ah, that's a good assumption. My initial assumption was some sites have a very dumb whitelist of valid email domains. This seems more reasonable (although, also dumb).


Quotes included or not?


Not. Obviously, or the rejection ratio would be a lot higher than 5%.


Your money is likely a minuscule part of the revenue and supporting your email would likely cost more. This was the point, that it is probably clever to choose a validation that covers 99.99% of customer emails rather than cover the whole spec.


If you can show that "just follow the spec" ends up opening up more opportunity than it closes off, then you can convince people. However, when gmail, outlook, etc. do not allow these zany e-mail addresses, you're going to have a hella hard time convincing me of this unless you are in the 1% of spenders.


Do GMail et al actually prevent you from sending to and receiving from these zany addresses? Or merely prevent you from creating one @gmail.com?


Creating one. But when you consider just how many customers are using gmail and outlook addresses, and not to mention, GSuite/fastmail/etc. addresses under custom domains, it makes more sense why rejecting @gmail.com@gmail.com is worth more than allowing some crazy e-mail feature that is effectively not used.


The routing features are obsolete; they go back to the days when lots of email users weren't on the Internet directly and had to use relays. They are still in the spec, yes.


I assume it comes from similar lineage as UUCP paths. Either way, email standards are a bit ridiculous. It needs the kind of rehaul that occurred with HTML5 of looking at what email implementations actually do and pushing them in one direction. I suspect that is not happening ever, so failing that there will probably always be things in the spec that just simply don’t work across everything anymore.


You absolutely should check the MX records, though. It’s easy and catches tons of typos. I was floored by the difference when I implemented this as pre-check before a Stripe Checkout form.
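A minimal sketch of such an MX pre-check, assuming the DNS lookup itself is supplied by the caller. Python's standard library has no MX resolver, so `lookup_mx` is a hypothetical hook (it could wrap a third-party resolver), and `fake_lookup` with its table is a stand-in for demonstration only.

```python
from typing import Callable, Sequence


def domain_of(address: str) -> str:
    """Return the part after the last '@', lowercased, or '' if there is no '@'."""
    if "@" not in address:
        return ""
    return address.rpartition("@")[2].strip().lower()


def has_mail_route(address: str, lookup_mx: Callable[[str], Sequence[str]]) -> bool:
    """True if the address's domain advertises at least one MX host.

    `lookup_mx` is a hypothetical hook: it takes a domain and returns its MX
    host names, or an empty sequence when none are found.
    """
    domain = domain_of(address)
    if not domain:
        return False
    try:
        return len(lookup_mx(domain)) > 0
    except Exception:
        # Resolver failures are treated as "no route" in this sketch. A real
        # form might prefer to let the user through on a timeout.
        return False


# Stand-in resolver; the table below is made up.
FAKE_DNS = {"example.com": ["mx.example.com"]}


def fake_lookup(domain: str) -> Sequence[str]:
    return FAKE_DNS.get(domain, [])


print(has_mail_route("bob@example.com", fake_lookup))
print(has_mail_route("bob@exmaple.com", fake_lookup))
```

A failed check is probably best shown as a warning next to the field, since a domain without MX records can still accept mail through its address records.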


There's also incredibly low stakes in allowing a technically-invalid email address to pass validation. Just use a very permissive pattern (e.g. contains an '@') and be done with it.

No matter what you will constantly be getting addresses that conform to the spec but cannot actually receive mail.


I've found websites that don't allow perfectly valid TLDs, so maybe they could start by not hardcoding .com in their regex. (.email)


The basic rule of thumb I use is this: are you implementing email at the MTA level (needing to build/parse RFC 5321 commands or RFC 5322 blobs directly), or are you using email closer to a "universal internet ID" purpose (i.e., application perspective)?

If you are in the former category, then yes, follow the spec to the letter. If you're in the latter, then screw the precise guidelines of the spec and reject emails that are very unlikely to be valid: no quoted localparts, no IP address literals. In addition, go ahead and say that email is case-insensitive (more precisely, case-preserving).

The hard part is if you're writing an email client, because you're basically forced to have your hands in both pies.
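The application-side policy described above could look roughly like this. It is a sketch under the stated assumptions (reject quoted local parts and IP address literals, compare case-insensitively while storing the address as typed); the function names are made up for illustration.

```python
from typing import Optional, Tuple


def split_address(address: str) -> Optional[Tuple[str, str]]:
    """Split on the last '@'; return None if either side is empty."""
    local, at, domain = address.strip().rpartition("@")
    if not at or not local or not domain:
        return None
    return local, domain


def acceptable_for_app(address: str) -> bool:
    """Application-level policy: no quoted local parts, no IP address literals."""
    parts = split_address(address)
    if parts is None:
        return False
    local, domain = parts
    if '"' in local:            # quoted local part, e.g. "john doe"@example.com
        return False
    if domain.startswith("["):  # address literal, e.g. user@[192.0.2.1]
        return False
    return True


def lookup_key(address: str) -> str:
    """Key for duplicate checks and login. The address as typed is stored separately."""
    return address.strip().casefold()


print(acceptable_for_app("First.Last@Example.com"))
print(lookup_key("First.Last@Example.com"))
```

Keeping the original spelling for sending and the folded key for comparison is what "case-preserving" amounts to in practice.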


Yes, in practice I've found the exact same thing. Either use an email validation service or be more restrictive than the RFC. [1] Also prompting "Did you mean bob@gmail.com?" when the user types "bob@gmaail.com" helps a lot with human input errors. [2]

[1] https://www.mailgun.com/email-validation/

[2] https://www.npmjs.com/package/mailcheck
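For illustration, here is a rough sketch of the "Did you mean" prompt using only the standard library. It is not how the linked package works internally; the domain list and the 0.85 similarity cutoff are assumptions picked for the example.

```python
import difflib
from typing import Optional

# Hypothetical short list of domains worth suggesting; a real list would be longer.
COMMON_DOMAINS = ["gmail.com", "outlook.com", "hotmail.com", "yahoo.com"]


def suggest_correction(address: str) -> Optional[str]:
    """Return a 'did you mean' address, or None if there is nothing to suggest."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return None
    domain = domain.lower()
    if domain in COMMON_DOMAINS:
        return None
    # The cutoff is a guess: high enough to ignore unrelated domains,
    # low enough to catch a single doubled or dropped letter.
    matches = difflib.get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=0.85)
    if not matches:
        return None
    return f"{local}@{matches[0]}"


print(suggest_correction("bob@gmaail.com"))
print(suggest_correction("bob@gmail.com"))
```

The function only returns a suggestion. It never rewrites or rejects the input, so a user whose unusual domain happens to resemble a common one can ignore the prompt and continue.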


Except that as someone with an email at a .co domain, I get really irritated when it asks me "do you mean [mydomain].com?"

I always have to tell people, in real life, "it's .co, not .com," just in case - humans do this too.


Worse, I had services trying to be smart correcting .co to .com


Yep, it's solving for the majority case. As long as it doesn't block signup you can just ignore it.


How do you reconcile your concern for the cost of sending emails with your unwillingness to do super basic validation like checking an MX record?


From where I sit, both of those concerns sit on the same side of the fence. GP argues against spending extensive developer time on validating edge-case emails, in no small part to avoid bounces. Doing MX or other validation so that your own service accepts these edge-case addresses says nothing about whether other systems have put in the same costly and nearly superfluous support, so it likely leads to more bounced emails and, accordingly, degraded trust in their business as a sender.


But why restrict the syntax arbitrarily in the first place? It is not going to catch the common typos anyway. Most typos will just result in a wrong but still syntactically valid email address.


I’ve always wondered if it’s possible to have a valid email address which is also an SQL injection attack, XSS or similar ?


of course it can: https://security.stackexchange.com/a/106996

One of our testers found XSS with email injection (RFC compliant validation passed) in our website.

And we are an e-mail company and should know better :D

Never trust user input!


Probably, since the local name part can be basically anything.

But the way to prevent injection attacks is not to disallow or sanitize input, it is to escape correctly when interpolating strings in other languages.
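As a small sketch of that point, the snippet below stores a hostile-looking address with a bound SQL parameter and escapes it only when it is written into HTML. The address, table name, and markup are made up for illustration.

```python
import html
import sqlite3

# Hostile-looking input, invented for this example.
address = "o'<script>alert(1)</script>@example.com"

# SQL: pass the value as a bound parameter instead of building the query string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", (address,))
stored = conn.execute("SELECT email FROM users").fetchone()[0]

# HTML: escape at output time, when the value is interpolated into markup.
rendered = "<p>Signed up as " + html.escape(address) + "</p>"

print(stored == address)
print(rendered)
```

The stored value is byte-for-byte what the user typed; only the representation changes at each boundary.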


> Sending emails cost money and bouncing emails affects your sender reputation.

that works as long as <RFC>fan 69™@root does not write articles for ZDNet


How about not doing any pre-validation (save for whitespace stripping) and have a validation e-mail (which you should require anyway) take care of any typos?

With precious dev time, you can do better by doing less.


You risk sending a junk message tho, which affects your sender-spam score with other providers.

I just make folks email me first.


I do the validation email, works great. Just be sure to protect the sign up form with some type of bot detection (I use recaptcha, but simpler methods are fine for most sites).


Instead of every developer implementing validation logic, shouldn't we have validation libraries to take care of this?


100% agree. This is especially true if the email address is going to be displayed somewhere, for example; it's generally a good idea to limit email addresses to a subset of what the RFC allows.

To adapt from a famous quote: "all email validation logics are wrong, but some of them are useful" ;)


As the Sr. Internet Mail Administrator for AOL in 1995, we quickly learned that the only real way to validate an e-mail address was to send mail to it. Even that wasn’t guaranteed, but anything less was doomed to unexpected failure and misery.

And this wasn’t a new lesson then. But at least we were smart enough to listen to the people who had learned that lesson before us.

It is now over 25 years later, and I’m sad to see that many people seem to be bound and determined to force themselves to re-learn that lesson the hard way.


I also came to the same conclusion some years ago. Or more specifically, my manager brought me around after I tried arguing that it was worth the time to make sure that users could use an IPv6 address as their domain (the lack of periods after the @ would cause `user@2001:0db8:85a3:0000:0000:8a2e:0370:7334` to fail validation)

He made a very convincing argument: while an IP address is technically a valid domain, how many legitimate users were seriously using an IP address as their email domain? (zero)


You’ll love this SSL cert:

https://[2606:4700:4700::1111]/


This falls into the category of the classic “Stupid shit that most programmers believe”.

Fundamentally, the problem is that if you’re trying to validate an e-mail address as being correct and you’re not sending an actual e-mail message to that address, then you’re doing it wrong.

We learned this lesson back in 1995, people.


Totally agree with this. Trying to be perfect is a good road to paralysis and not getting things done. Software is like people: it's ok to not be perfect, especially if they're always trying hard to be better and doing good things for society.


Yes, you should reject it, because the address is invalid. You can't have unquoted <> in the local part. You can't have spaces there either.

Your other reasons for breaking interop between systems/languages are just whimsical and invalid. :)


I've run into some places where a subdomain email is not ok, which has been pretty annoying. All email validators should be able to at least take first.last+company@subdomain.example.com


Especially when using a country TLD, suffixes like .co.za are appended to the name of the actual ISP or email provider.


I think the best syntax validation technique for email addresses now is found in the HTML spec: https://html.spec.whatwg.org/multipage/input.html#valid-e-ma.... As they say, this is a wilful violation of RFC 5322, because that’s simultaneously too strict, too vague and too lax to be useful. They give a grammar, and the following regular expression implementing it:

  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

Remember that the web is a platform that lives and breathes this stuff. A lot of thought went into this grammar for valid email addresses. This is a good way of filtering out obviously bad stuff while allowing all realistic and sane inputs.

One part of all this that I’m not aware of the situation around is “8. You can put emojis in the local part.” The HTML spec’s validator is all ASCII. It does remind you to punycode the domain labels, but makes no mention of internationalised local parts, and I’ve never learned about non-ASCII local parts or how well they’re supported. I gather they may require the sender to be capable as well as the receiver, whereas internationalised domain names were made compatible with all systems via punycode.
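The regular expression quoted above is written as a JavaScript literal. A possible Python transcription, as a sketch, uses `fullmatch` in place of the `^`/`$` anchors and drops the `\/` escape, which is only needed inside a JavaScript regex literal; the names `HTML_EMAIL_RE` and `is_valid_html_email` are made up here.

```python
import re

# Transcription of the HTML spec pattern; fullmatch() below replaces ^...$.
HTML_EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"                       # local part
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"         # first domain label
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"   # further labels
)


def is_valid_html_email(address: str) -> bool:
    """True if the whole string matches the HTML spec's e-mail grammar."""
    return HTML_EMAIL_RE.fullmatch(address) is not None


for candidate in [
    "first.m.last@myschool.edu",
    "first.last+company@subdomain.example.com",
    "user@localhost",
    "user@[IPv6:2001:db8::1]",
    '"quoted local"@example.com',
]:
    print(candidate, is_valid_html_email(candidate))
```

Each label is one alphanumeric character, optionally followed by up to 61 more characters and a closing alphanumeric, which is where the 63-character label limit comes from.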


I've always just used

  /^.+@.+\..+$/

That is, "Some characters, an @, some more characters, a period, and some more characters"

I couldn't care less if users want to enter undeliverable email addresses, they won't get emails. All that regex is intended to achieve is ensuring that the user hasn't accidentally filled the wrong field (e.g. tried entering their phone number) or mistyped a punctuation mark (foo#bar.com, foo@bar,com)
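For comparison, a sketch of the same permissive check in Python, with `fullmatch` standing in for the `^` and `$` anchors. The function name is made up.

```python
import re

# Something, an @, something, a period, something.
LOOSE_EMAIL_RE = re.compile(r".+@.+\..+")


def looks_like_email(value: str) -> bool:
    """Catch wrong-field and punctuation mistakes; makes no claim about deliverability."""
    return LOOSE_EMAIL_RE.fullmatch(value) is not None


for candidate in ["foo@bar.com", "foo#bar.com", "foo@bar,com", "555-123-4567", "a b@c.d"]:
    print(candidate, looks_like_email(candidate))
```

The last candidate passes even though it contains a space, which shows how little this pattern rules out beyond the mistakes it is aimed at.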

Strictly speaking, it won't match some valid email addresses, such as those with IPv6 address domains. But if I receive a support ticket complaining that we don't accept email addresses with an IPv6 address domain, I'll reply advising that the customer should purchase a domain name or sign up to one of many free email services.


Huh. Interesting this doesn't support international email [1] addresses, e.g. квіточка@пошта.укр or Dörte@Sörensen.example.com.

Seeing as the web has long supported Unicode, where are e-mail addresses currently at in that evolution?

Are full Unicode e-mail addresses something that is decently supported today, or still largely theoretical? Is this regex sufficient? What kind of e-mail addresses do people in China most commonly use, for instance?

[1] https://en.wikipedia.org/wiki/International_email


Internationalised domain names are supported: as I mentioned, the spec explicitly reminds you to do punycode.

For the local part, though, it does look like browsers have fallen down, though I’m not particularly familiar with the situation there. Testing it in Firefox to confirm, ascii@υνικοδε validates, but υνικοδε@ascii doesn’t. https://github.com/whatwg/html/issues/4562 seems to be where progress is made from time to time. As usual, it’s not as simple as we might hope.


> where are e-mail addresses currently at in that evolution?

Baby shoes because of anglosphere programmers that can't fathom people wanting to use their own alphabets and thus forget to support it.


You don't need to be so accusatory or ungenerous about it.

Clearly "anglosphere programmers" fathom it every day when they use UTF-8 almost universally in webpages. Also, you know, things like emoji are pretty popular in the "anglosphere" as well.

It's obvious that the real reason is an ancient e-mail RFC, and that while upgrading webpages to UTF-8 was relatively easy, in that it only needs 2 parties to support it -- the browser and the server -- upgrading e-mail is almost infinitely more complicated, because you have to wait for virtually all email code in the world to be upgraded, since an e-mail address is pretty useless if it doesn't work everywhere.

In other words, it's a coordination problem. Not an ignorance problem.

And unfortunately, Punycode [1] doesn't seem to be a particularly viable stepping-stone/compatibility solution here. E.g. if a user tries to use ドメイン名例@example.com and it fails, you'd be asking them to instead type in a seemingly-gibberish eckwd4c7cu47r2wf@example.com, which could also conflict with a real e-mail address of that name.

[1] https://en.wikipedia.org/wiki/Punycode


> You don't need to be so accusatory or ungenerous about it.

After at least four decades of mostly bad internationalization support, it's no longer accusatory; it's empirical and quite generously worded.


This is a pretty pessimistic take. The real answer lies somewhere between budget and speed. If someone asked me to support non-latin alphabet, I'd have no idea where to start and the amount of people that would use that feature isn't worth the consideration. It's not that I don't fathom it, it's that I don't have time for that shit.


> while allowing all realistic and sane inputs.

Isn't that a way of saying "while disallowing perfectly valid options"?


What's disallowed are a) IP literal addresses and b) localparts that require quoting. These email addresses are highly likely to break many processing steps anyways; I've only ever seen category b in sendmail configs (it can be useful for internal email rerouting purposes).

There's a distinction to be drawn between the requirements of the actual MTA/MUA/MSA layers and user applications built on top of them. For the latter, considering emails to be invalid if they contain IP literals or quoted localparts is going to be more helpful than harmful (there's less scope for vulnerabilities in doing so). It's just like assuming email addresses are case insensitive: it's inappropriate if you're an MTA, but for everybody else, go ahead and assume they are.


> What's disallowed are [...]

A-ha, but here you're wrong because you've excluded IDNs. This is really why you should not try to be clever.


IDN A-labels would still be accepted. Using the U-label is likely to require the same level of support as EAI, because without EAI support, non-ASCII strings are likely to horribly, horribly screw up the lower levels of the stack, and I wouldn't recommend supporting EAI without actually testing to make sure your stack can really handle EAI. (Not to mention EAI localparts being their own can of worms).


No normal user will enter their e-mail address in punycode. This is something your own software should be doing.
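A sketch of doing that conversion in software before any ASCII-only validation runs. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so treat it as an illustration of the idea and not a complete IDNA solution; the function name is made up, and the local part is deliberately left untouched.

```python
def to_ascii_domain(address: str) -> str:
    """Convert the domain of an address to ASCII A-labels; leave the local part alone."""
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("no '@' in address")
    # The built-in "idna" codec handles the per-label "xn--" encoding.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


print(to_ascii_domain("info@bücher.example"))
print(to_ascii_domain("bob@example.com"))
```

The user keeps typing the readable form; only the value handed to the validator and the mail system is converted.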


And yeah, IDNs are covered by that spec reminding you to punycode domain labels first.


How sure are you that the 61 character limit won’t change in some future DNS improvements? People used to think TLDs would only ever be up to 3 characters long.

More importantly, what problem is this even trying to solve? Someone accidentally typing a 300 character domain? If they are intentionally feeding you gibberish they’ll just give you more realistic looking gibberish.


You seem unreasonably hostile.

I’m absolutely certain that the 63-character limit for domain labels is never going to change, because it’s hardcoded in enormous amounts of software and hardware, and there’s no even vaguely compelling reason to even attempt to change it. But if such a thing did change, then you’d just add this to the extremely long list of things that needed to be updated.

People who thought TLDs would only ever be up to three characters long were simply wrong from the very start because they didn’t understand what they were dealing with. (As a simple example, .arpa was there from the start.) Understand that this wasn’t a matter of anything changing, it was that some people misunderstood and thought that a convention they observed was in fact a rule.

The problem this sort of validation is solving is weeding out things that are definitely not going to work, as soon as possible, because it’s good to point out problems to users as soon as possible, rather than having something silently fail or only notifying the user about it much later. Syntactic validation isn’t the be-all and end-all of accepting email addresses, but it’s definitely still worthwhile, even though you should generally do other validation based on DNS lookups and/or sending actual emails as well.


That regular expression fails to validate a bunch of the examples from the article. And also single-word addresses, which are pretty useful if you want to route email locally.

So what makes it the best?

[Edit: it also assumes you've already parsed out the "real" address from the rest of the text field, which to me makes it a half-validator at most.]


Yes, that's the explicit point of it.

But it would seem to be the best for general-purpose web use, e.g. signing up for a newsletter with an e-mail address that's pretty much guaranteed not to break anything.

Instead of being conservative in output, it's intentionally being conservative in input.


Single-word addresses? I presume you mean domain names with only one domain label (like “localhost”). Read the grammar again, they’re supported.

> it also assumes you've already parsed out the "real" address from the rest of the text field, which to me makes it a half-validator at most

I’m confused. The explicit purpose of this stuff is to validate an email address. Not to extract an email address from a freeform text field, which I think is what you’re talking about. Deciding how to do that is a whole ’nother can of worms.


> And also single-word addresses, which are pretty useful if you want to route email locally

Unless you're developing an app for an intranet, that's not a concern for most people.


Hope you're not in PHP, Perl, or Ruby!

http://emailregex.com/


That's a pretty bad page, given that it gives regexes that match very different things for different languages, without a) an explanation of what the differences are, b) any rationale for why you may or may not want to choose between the different versions, let alone c) why different languages "deserve" different versions of the regex.

This is already a field where there is a lot of misinformation flying around, and a page that merely regurgitates all of that misinformation without the perspicacity to realize that its purported information is internally incoherent is not helpful.


I provide my email address with the +companyname suffix on the local part as a way to filter my email into various folders based on the To header contents.

Unfortunately, many websites are configured to reject email addresses that contain a plus character. I've also encountered websites in the past that did accept the + character when creating the account where the email address serves as the user name, but then could not log in because their log in form rejected the + character in the user name.


I got sick of companies rejecting email with "+", and bought a domain to use for email (among other reasons). Now I've got a wildcard entry in DNS, so any valid local part gets routed to my inbox. So instead of "username+company@example.com" I can do "company@example.com".


I ended up giving up on that after one too many websites rejecting my custom domain (which I’m the only one using) on signup. These lazy / ignorant colleagues are annoying -_-‘


I've been using a similar scheme for about 7 years now and have never had my email rejected by a website on signup.


The American Kennel Club rejected mine because the domain was “too similar” to their name. I guess just because it had a “kc” in it? Completely bewildering.


I use this scheme (company@mydomain.com) and one that I remember blocking for this reason is Aliexpress/Alibaba - aliexpress@mydomain.com was rejected so I use ali@mydomain.com.

No idea what sort of security this is supposed to provide.


It happens rarely, but some only accept a very limited number of domains (ie Gmail, Outlook, etc).

They probably see it as some sort of security / anti-spam mechanism.


I use a .xyz domain for my personal email, and I sort of regret it.

My emails have a tendency to become spam filter bycatch, to the point that when I was job hunting last year I'd have to ring people after I sent them my resumes etc. to confirm they actually received my email.

And when I give people my email address, I usually have to assure them that steve@stevetech.xyz is a legitimate email address and not a joke (it's not actually steve, but you get the point).


I host my own email server, and .xyz is one of the 2 or 3 TLDs I went in the config files and manually blocked since nothing but spam comes from it (and lots of it).

Definitely would not recommend using it for your personal address.


Can you explain the DNS part? AFAIK the sender just looks for MX on the domain itself, regardless of local part.


The actual address in the email header should still contain the subdomain though.


The address "company@example.com" doesn't point to a subdomain, though, the only reference to the company is the local part of the address, and so has nothing to do with DNS.

If he said he used "joe@company.example.com", then it's possible he has a wildcard MX record for *.example.com, but that's not at all what he said, although perhaps it's what he meant.

Regardless, the question remains unanswered.


I just set up my mail server to use - rather than +, and don’t encounter this problem.


What provider do you use for email? That does sound nice.


Fastmail supports it. The best part about fastmail is that you can reply from the same address you got the email for. This is useful in customer service scenarios that identify your account based on email address.


The reply part is somewhat new. I had to delete and re-add the wildcard to get it on my old (circa 2013) account, but very nice to have


I use migadu for this.

I also use greg-*@domain instead of *@domain, since their docs claim that setting up *@domain tends to attract more spam.


Huh, how did I not ever hear/find out about this when I was choosing a provider... I think this is the first time I've seen them mentioned on HN, despite searching through quite a few de-googling threads. Will definitely take a closer look!


Another Migadu user here slowly degoogling myself. $19 a year is a bargain for my usage and the features I get.


Also a migadu user, I'm a huge fan and can't speak highly enough of them. Their pricing model is a perfect fit for me and their support address is really quick to respond.



I use ProtonMail and sign up to everything with <service>@<custom-domain> so I can track what they do with my email.

It's not cheap from PM, and there are loads of hosting providers that will provide catch-all email for free with your hosting package (but with some usually pretty poor webmail client) or if you use a mail client it should work too.

I like having good webmail and mail app and other things so I pay, but there are plenty of good options available. Sadly self-hosting email server is not really an option for a variety of reasons, but you should easily be able to use catch-all e-mail addresses.


The paid version of gmail (google workspace/gsuite) offers this as well (they call it "aliases"). I haven't explored the option myself, but I do recall seeing something like this in the admin panel. Whether they charge for it or not is probably something I should look into.

At some point, I need to migrate away from google and build out my own personal mail server.


mailbox.org also provides the functionality to use your own domain and have a wildcard entry, where all emails go into your inbox.


I've tried something similar with Fastmail, and it works out well for the most part. I have run into more than a couple of services which won't accept email addresses not on a whitelisted domain for some reason, and I had to use an @gmail.com address which forwards to my domain.


Out of curiosity, are those popular services? I'm in process of setting up email on my own domain and it would suck having to fallback to Gmail if some service uses an accepted list of domains.


fastmail is reasonably popular. Gmail is bigger, but fastmail is big enough that they cannot be ignored, unlike when I ran my own personal server and often found myself in blacklists without any knowable way to get off.


I've had a couple places not take my .us domain, but almost everything is fine with my .org. The places I've run into that are really picky don't like gmail or other free email providers.

The one exception is Craigslist; if I email someone with my normal email, I never get a response. I always use gmail for that.


https://forwardemail.net/ is fantastic if all you want is to forward domains somewhere else.

It's a freemium model, but I've never needed anything in the paid tier


In the UK, my domain name provider offers free e-mail forwarding for (I think) 10 specific e-mail address, plus a catch-all forwarder for anything else. Works quite well.


I'm on fastmail.


That causes weird behaviour in places, where they assume the bit before the @ is a "username".


I've been using this strategy for years and have not encountered that issue before. That would mean the part before the @ would have to be unique across all domains. That doesn't make any sense. You couldn't have webmaster@domain1.com and webmaster@domain2.com registered for example.


Or ben@gmail.com and ben@hotmail.com couldn't both be registered. This scheme is so obviously flawed I can't imagine it's widely implemented.


I've been using my own domain with wildcard emails for many years now. I've yet to encounter a single scenario of inferred names.


I was unable to provide my email address for a retail rewards program last week because the input field for the domain was a dropdown in their POS. Not the TLD, the entire part of the email after '@'!


Yes this is terrible. On the other hand, if your goal is to prevent people from signing up using disposable domains, the blacklist approach (which I have tried before) is a never ending game of whack a mole.

Sounds like this was in person at a store, though, which is extra weird because it seems unlikely that scammers would be trying to sign up en masse at a physical location (unlike if the form is connected to the internet).


Jeez, wow. How many domains were in that box?


"There are other emails besides gmail and hotmail? Woah!" - the person who thought that was a good idea, probably.


There is absolutely no way that someone who thought building a dropdown for email domain name is a good idea wasn't putting AOL and Yahoo! as the first two options.


Until about a decade ago, this was extremely common in Japan. RIP mobile email, another victim of smartphones in general and the iPhone in particular


I used this wonderful trick to sign up for my government issued eID (it was something else but works for explaining). What they decided to do was simply remove the + and not let me know about it.

my_email+service@foo.bar thus became my_emailservice@foo.bar

I tried logging in, resetting passwords, nothing worked. I had to go to the authorities and make a written request to allow them to interrogate the database by the equivalent of my social security number, and that’s when we realized they just stripped the +.


Fastmail allows for companyname@youraccount.fastmail.com -style addresses. Even for your own domains.

Much more reliable than the + -thing, which breaks in the weirdest of places.


I've been using fastmail for years and didn't know that. Thanks!


Ages ago, back in myspace days, their system would permit + when creating an account, but could not handle this in their forgot password / password reset system. I never was able to delete my account because of this.


Sony's SEN used to have an account creation page that would permit +, but subsequent sign-in interpreted it as a URL-encoded whitespace. No login for you
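This class of bug is easy to reproduce. The sketch below is not Sony's actual code, just an illustration of form encoding: a `+` in an unescaped form value decodes as a space, while a properly percent-encoded value survives the round trip. The sample address is borrowed from elsewhere in this thread.

```python
from urllib.parse import parse_qs, urlencode

address = "my_email+service@foo.bar"

# Bug: the address is pasted into a form-encoded body without escaping,
# so the receiving side decodes '+' as a space.
broken_body = "email=" + address
broken = parse_qs(broken_body)["email"][0]

# Fix: let the library percent-encode the value ('+' becomes %2B).
safe_body = urlencode({"email": address})
fixed = parse_qs(safe_body)["email"][0]

print(repr(broken))
print(safe_body)
print(repr(fixed))
```

A sign-up page and a login page that disagree on this step will accept an address at one and fail to find it at the other.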


lol you should have tried to enter a URL-encoded plus sign, %2B.


All social media accounts are deletable on a long enough timescale.


As it happens, it was eventually done for me:

https://mashable.com/article/myspace-data-loss/


I use Fastmail with my own domain name and unlimited email inboxes, so I use companyname@mydomain.com to sort incoming mail.


I do the same thing and believe it or not I’ve seen websites reject emails with their own name in the email.


I had one do that. When I give the address in person I get "do you work here?"

I had to switch my hosting provider at one point because they stopped supporting catch-all. I have no idea how many "addresses" I've used, since I don't create a specific email for each, so I had to get new hosting (note: this was over 10 years ago)


I recently got a letter from a company's legal department and had to explain the whole thing :D


If you use Gmail here's a fallback option: Gmail ignores "." in the local part. So foo.bar is the same as f.ooba.r to Gmail. Obviously quite limited and more hassle to keep track of.
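From the receiving site's side, this means two differently dotted Gmail addresses reach the same inbox. A sketch of a duplicate-detection key follows, assuming the dot behaviour described above; stripping a `+tag` and limiting the rule to `gmail.com` are additional assumptions of this example, and the names are made up.

```python
GMAIL_DOMAINS = {"gmail.com"}  # assumption: only this domain gets the special treatment


def canonical_key(address: str) -> str:
    """Key for spotting duplicate sign-ups; not a replacement for the stored address."""
    local, at, domain = address.strip().rpartition("@")
    if not at:
        return address.strip().lower()
    local, domain = local.lower(), domain.lower()
    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0]  # drop a +tag, if present (assumption)
        local = local.replace(".", "")  # dots are ignored in the local part
    return f"{local}@{domain}"


print(canonical_key("foo.bar@gmail.com"))
print(canonical_key("f.ooba.r@gmail.com"))
print(canonical_key("first.last@example.com"))
```

Mail should still be sent to the address exactly as the user entered it; the key is only for comparison.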


One of my primary pet peeves with Gmail. It leads to a lot more junk mail arriving in my inbox. My real Gmail address is 'first.m.last', and almost all the spam I get is addressed to 'firstmlast'. Gmail is great at filtering out spam so that I don't see most of it, but if not for their unconventional filtering of recipients, I'd get even less. I also get a lot of email from idiots who don't know their own address and provide mine instead, and literally all of that would bounce without their . handling.


Same here. I send everything that is firstlast@gmail straight to junk.

> I also get a lot of email from idiots who don't know their own address

Holy crap there are a lot of them. I've got one bank sending me the dude's statements. He's also been on some interesting trips, seen all his hotel stays, etc.


Same. I don't have a very common name but there are at least two other people who share it. One has used my GMail address to apply for jobs and for his unemployment benefits. I'm guessing he isn't having much luck with either one.

The other finally figured it out but his wife still hasn't after more than a decade. It gets really old receiving reminders to service a vehicle I've never owned from a dealership 2000 miles away among other similar crap.


I thought it would be nice to have my name without numbers as my gmail, but with all the stories I've heard, I think I'm glad I have the numbers now.


This pattern is often abused by spambots trying to avoid dupe detection, so using it excessively may lead to your login being treated as spam.


I use a catch-all to have a <website>@<mydomain>.com login for every website.

Samsung doesn't accept emails with "samsung" as a prefix, so I have samsun@mydomain.com for them. I have no idea what the logic behind that is.


I use reverse DNS notation for the local part. So that would be "com.samsung@mydomain.com" in my case.


I got sick of + not being accepted and switched to using - for all my aliases, which works everywhere I've tried. It's annoying, but practical (assuming you run your own mail server, or have the ability to manage it client-side).


Plenty of hosted solutions support wildcard - including GSuite and Fastmail.


Ditto, with the same hassles mentioned by you and others, such that I'm actively looking at email services that handle this sort of thing better using approaches such as mentioned below - domain@mydomain style registration addresses.


You can have unlimited handles with fastmail if you're looking for that


I find that a lot of websites don't allow the + sign precisely because of Gmail usage.

