Hacker News
It's Impossible to Validate an Email Address (elliot.land)
61 points by elliotchance on Apr 3, 2016 | 67 comments



What I'm about to say is more general than regex, but can online services please stop trying to validate my email address?

If I gave you an email address that you think is invalid, rest assured I did it for a reason. I'm not an imbecile: I know how to type my address correctly (especially when you make me type it twice). For all the imbeciles who don't know how to type their address correctly, the phone system still works fine.

I may have given you my real email address with a plus-sign for a filter. Don't tell me it's invalid.

I may have given you a fake email address, because I know you're just going to spam me. If you tell me it's invalid, I'll either spend an extra few minutes cooking up a better fake email address, or I'll leave your site.


https://medium.com/i-m-h-o/please-stop-verifying-my-email-ad...

I wrote something about this a while ago. I'm no wordsmith but it fits here.


But, statistically speaking, it probably is an imbecile.

These systems aren't optimizing for someone like you trying to be clever. They're trying to handle the more common cases of someone typing a street address in the wrong field or forgetting the ".com" or putting in an HTTP address or some such.

They're trying to catch that, not every possible pedantically legal string you could throw at it. Maybe they could improve the validator, or maybe they'd rather be spending programmers' time on things with actual business value.

The sum total of imbeciles may well represent a bigger market than the sum total of people like you.


Agreed. I find things like "youdont@needmyemail.com" usually work, and then if they ever bother to read what’s in their database they may get a hint.


kindlysodoff@mailinator.com is nice, and actually works for that one activation link you have to click.


I know it can be frustrating, it's happened to me too, but the reality is that email validation generally isn't for you. It's for the 99% of other people who would greatly appreciate the heads up that they've typed "something@yahoocom" or "Boys@MenFan1@whatever.com" and it's probably not what they meant to type.

Now we can talk about HOW validation is implemented, and I think it would be completely fair to raise a warning: "Hey, did you mean to enter this?" instead of "sorry, nope." when non-trivial addresses are encountered.


But it's extremely rare for a typo to result in an invalid email address, contrived examples notwithstanding.


If you happen to control the web page where the user is entering the email, this little piece of code has been a godsend for us:

https://github.com/mailcheck/mailcheck

I agree with the idea that it's impossible to validate. But, mailcheck takes the approach of seeing if the email is potentially wrong, then prompting the user with what it thinks they meant. It's usually right, but if not, it allows whatever the user wants.

For example, if your user types in "user@gmil.con", it will suggest "user@gmail.com".
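A rough Python sketch of the same suggest-but-don't-block idea. This is not mailcheck's code or its matching algorithm: it swaps in `difflib` from the standard library, and the `KNOWN_DOMAINS` list, the `suggest` name, and the 0.75 cutoff are all assumptions made up for illustration.

```python
import difflib

# Assumed list of popular domains; a real deployment would want a longer one.
KNOWN_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]


def suggest(address, cutoff=0.75):
    """Return a corrected address to offer the user, or None if nothing to suggest."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None  # not enough structure to guess at
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return None  # already a known domain, nothing to fix
    matches = difflib.get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    if not matches:
        return None  # unfamiliar domain, leave it alone
    return f"{local}@{matches[0]}"


print(suggest("user@gmil.con"))     # user@gmail.com
print(suggest("user@example.org"))  # None
```

The caller only ever shows the suggestion as a prompt; whatever the user finally submits is accepted.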


This kind of stuff is great - make suggestions, but allow it to go through even if you think it's wrong. For the last email validation I worked on, there were only 2 absolute blockers - there must be an @ sign, and the domain must have MX records (emails that are technically valid remain useless if we can't send them anything).

There were a number of other checks (being close to yahoo.com or gmail.com or other common email hosts, containing surprising characters, etc) that would trigger warnings, but still allow the check to pass if the user assured us it was correct.
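Roughly, the split between hard blockers and overridable warnings could look like the sketch below. The names (`Verdict`, `assess`, `has_mail_route`) and the specific warning rules are hypothetical, not the code we actually shipped; `has_mail_route` stands in for whatever DNS check you use.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    blockers: list = field(default_factory=list)  # must be fixed before continuing
    warnings: list = field(default_factory=list)  # the user may override these

    def accepted(self, user_confirmed=False):
        if self.blockers:
            return False
        return user_confirmed or not self.warnings


def assess(address, has_mail_route):
    """has_mail_route is a hypothetical callable: domain -> bool."""
    verdict = Verdict()
    local, sep, domain = address.rpartition("@")
    if not sep:
        verdict.blockers.append("address must contain an @ sign")
        return verdict
    if not has_mail_route(domain):
        verdict.blockers.append(f"no mail route found for {domain!r}")
    if "." not in domain:
        verdict.warnings.append("domain has no dot; is that intended?")
    if any(ch.isspace() or ch in "<>()," for ch in local):
        verdict.warnings.append("local part contains surprising characters")
    return verdict
```

With this shape, `assess("vostok@examplecom", ...)` produces only a warning, so the form can ask "are you sure?" and then let it through.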


>>domain must have MX records

Technically, RFC 5321 says MX records aren't required. You may be throwing out a small number of valid addresses.

"If an empty list of MXs is returned, the address is treated as if it was associated with an implicit MX RR with a preference of 0"
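A minimal sketch of that fallback rule, in Python. The two lookup callables are hypothetical stand-ins for real DNS queries (they are not from any particular library), so the logic can be read and tested without touching the network.

```python
def mail_targets(domain, lookup_mx, lookup_address):
    """Return (preference, host) pairs to try for delivery, lowest preference first.

    lookup_mx(domain) -> list of (preference, host) tuples
    lookup_address(domain) -> True if the domain has an address record
    Both are hypothetical callables supplied by the caller.
    """
    records = sorted(lookup_mx(domain))
    if records:
        return records
    # No MX records: treat the domain itself as an implicit MX with
    # preference 0, which only helps if the domain actually resolves.
    if lookup_address(domain):
        return [(0, domain)]
    return []
```

So a check that says "no MX, reject" would be stricter than the RFC; checking that `mail_targets` is non-empty would be closer.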


This advice should probably extend to a variety of form elements.

Like DRM and a lot of other efforts to “control” things, aggressive validators invariably punish people who are just trying to do legitimate things. Don’t piss off your real customers.

I used to live in a town with a 12-letter name, and more than once a form decided that it knew the Universal Sensible Maximum Length of Town Names and wouldn’t let me type the last couple characters. And it usually doesn’t stop there, because once a site is incapable of storing things sensibly it invariably starts to have trouble matching things, giving errors that are just plain stupid (e.g. “this other thing doesn’t match what you entered”, well no shit...).

There is also too much thought put into what constitutes a person’s “name”. Generally, to work across all possible cultures, use a single, very long text field that can contain whatever the person decides to type. After accepting their input as-is, feel free to internally perform parsing logic to try to allow additional database queries but under no circumstances should your page make any assumptions.

The real “you should be fired as database administrator” mistake though is to store modified data without telling the user. This usually happens with passwords; I use a site for months and then one day accidentally hit Return too soon and my password works anyway, meaning they just CLIPPED whatever strong password I entered and stored whatever they felt like (usually 8 characters). NEVER do things like that without telling the user.


It is 2016. We should know better by now than to clip passwords at all. Sure, place a limit of 512 characters on it to prevent abuse, and for security reasons there can be a number of minimal requirements for length and complexity, but please let me be the one to decide if 32 characters is sensible or not.
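In code, the difference is between rejecting and clipping. A small Python sketch, where `MIN_LENGTH` and the function name are assumptions picked for illustration (only the 512 cap comes from the comment above):

```python
MAX_LENGTH = 512  # abuse limit
MIN_LENGTH = 8    # assumed minimum, purely illustrative


def check_password(password):
    """Return the password unchanged, or raise ValueError. Never truncates."""
    if len(password) > MAX_LENGTH:
        raise ValueError(f"password is longer than {MAX_LENGTH} characters")
    if len(password) < MIN_LENGTH:
        raise ValueError(f"password is shorter than {MIN_LENGTH} characters")
    return password
```

The point is that the user hears about the limit up front, and what gets hashed is exactly what they typed.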


> One more interesting tidbit is if you use unique sub-addresses for each of the sites you sign up to you will be able to see when someone, or rather who, sells your email to someone else... Busted!

Can't the spammers simply strip the subaddress/label after '+' ?


Exactly. This technique while noble in intent is very easy to subvert. I tried subaddresses for over a year and still found spam coming in without the tag. I don't know if others have had more success with it but I haven't noticed a difference. I'm pretty sure it's just being stripped.

Perhaps the reverse is more effective. Use a subaddress for all your mail and ignore that without a label coming in.


No, they can't. Or rather, they can, but they'll be wrong. The ‘+’ is not a feature of email addresses; it doesn't mean anything different from any other character. It's just a feature of some destination systems that they ignore everything from ‘+’ when assigning incoming mail to accounts.

Generally, that's configurable; e.g. in postfix it's set by the optional |recipient_delimiter|. Suppose it's set to ‘-’, and you, John Public, sign up for SuspiciousService using the email address <john+public-suspserv@example.com>. Normally, the mail gets delivered to your 'john+public' mailbox. But if Mr Spammer strips everything from ‘+’, he sends to <john@example.com>. And your system knows that that is not only not a valid local mailbox, but also that it matches a plus-stripped address, and therefore consigns the message and its sender to the fiery pits of hell.
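To make that concrete, here is a simplified Python model of delimiter-based delivery. It is not Postfix's actual lookup code; `mailbox_for` and the mailbox set are hypothetical, and the only behaviour modelled is "try the full local part, then the part before the first delimiter".

```python
def mailbox_for(local_part, mailboxes, delimiter):
    """Resolve a local part to a mailbox name, or None if nothing matches."""
    if local_part in mailboxes:
        return local_part
    base, sep, _tag = local_part.partition(delimiter)
    if sep and base in mailboxes:
        return base
    return None


mailboxes = {"john+public"}  # the only local mailbox in the example above

# The address given to the service:
print(mailbox_for("john+public-suspserv", mailboxes, "-"))  # john+public
# What a spammer who strips everything from '+' sends to:
print(mailbox_for("john", mailboxes, "-"))                  # None
```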


It doesn't have to be +. It can be any character you configure. Recent postfix versions even support multiple characters.


I went a step further and bought a .com that I set up with a catchall email forwarder. Most signups get a custom email at that domain. I still get Russian bride emails sent to oilchangeplace@


They can.

They don't.

It's extra effort for them for nearly zero marginal gain.


Some sites however will just ban + on registration. I've seen registration allow + but login disallow (also different password lengths occasionally, wtf?), though I can't think of any offhand.


> It's extra effort for them for nearly zero marginal gain.

I wouldn't say it's nearly-zero gain; by applying a tiny sed expression they obtain a basically unblockable e-mail address.

It's easy to blacklist johndoe+amazon@gmail.com but very few people would be willing / able to blacklist their top-level johndoe@gmail.com. So the spam keeps coming.

Spammers are annoying but the programmers behind them are smart.


I once heard the story of a man who helped Aruba set up their DNS (.aw) in the late 90's. In exchange, as part of his compensation, he asked for an email address at the top-level domain, and received something like js@aw, which is a perfectly functional email address, but trips up a lot of validators.


It isn't a valid address. TLDs must not resolve, so it should be impossible to make a server handle it (yet, it is mostly possible, because most DNS servers do not completely implement the RFCs - still, there's no guarantee it will work on every network).


The first one I found that does resolve: http://ai./

It has an MX record too. There is nothing wrong with this.


It is most definitely wrong, if your definition of wrong includes disallowed / not recommended by ICANN and IAB.

What is wrong with it is, amongst other things, the real-world possibility of colliding with internal hostnames.


Indeed - http://dk./ works too.


    tk has address 217.119.57.22
    cf mail is handled by 0 mail.intnet.cf.
    to has address 216.74.32.107
    io mail is handled by 10 mailer2.io.
    gg has address 87.117.196.80


I heard this story about Ian Goldberg, Anguilla, and the obvious email address.


It might have been Anguilla, the story was told to me a long time ago.


Dick move, though.


If you were to ask me for a regex, I'd say /.+@.+/.

That's the easiest and most accurate way to do it by regex. Sure, some invalid addresses may still get accepted, but that is unavoidable. Even the most thorough validation[0] is going to accept nonexistent addresses.

[0] Except those that validate by sending a mail to it. Sending an email is the only way to be sure.


That is the only correct answer.

    .+@.+
In human language: an email address consists of something, an @, and something.

That's it. The name part can be anything; so don't validate it beyond having a character.

We don't have one-character domains yet, but there is no reason to exclude them. It is already possible to arrange your own top-level domain if your pockets are deep enough (.google), so don't be surprised if some megacorp starts using a top-level domain without any subdomains. Don't validate the domain part beyond having a character.
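For what it's worth, the whole check fits in a few lines of Python. The function name is made up; `fullmatch` is used so the pattern has to cover the entire input, and since `.` does not match newlines by default, multi-line input is rejected too.

```python
import re

# "Something, an @, and something."
LOOSE_EMAIL = re.compile(r".+@.+")


def looks_like_email(text):
    return LOOSE_EMAIL.fullmatch(text) is not None


print(looks_like_email("js@aw"))           # True
print(looks_like_email("12 Main Street"))  # False
print(looks_like_email("user@"))           # False
```

Anything this lets through still has to survive the confirmation email, which is the real test.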


Sending an email is the only way to be sure.

Absolutely this. The check for an '@' and something before/after it is for sanity, and anything beyond that would involve actually trying to use the address.


> If you were to ask me for a regex, I'd say /.+@.+/

How about "one@two@three@four@example.com"?


Any validation is also going to allow "someonewhodoesntexist@nonexistentdomain.com", unless you send an email to it.


Yes there are loads of invalid email addresses it allows. It's not supposed to be perfect.


Yeah, the point of checking for the @ is chiefly to prevent inexperienced users (e.g., elderly) from entering the wrong type of data. For example, entering a line from their postal address instead of their email address.


    # get email addresses
    grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

    # censor email addresses
    sed -r 's/(<)[[:alpha:][:digit:]\._%\+-]+@[[:alpha:][:digit:]\.-]+\.[[:alpha:]]{2,4}(>)/\1--removed--\2/g'
If email doesn't meet those, I drop them on the floor. Then again, I drop email on the floor for lesser reasons.


So you're blocking email from your local .museum, from anyone who has an Irish name like O'Connor, all international TLDs, and most newer TLDs.
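A quick way to see it, as a Python sketch: the same character classes as the grep pattern above, but matched against the whole address rather than used to extract substrings. The sample addresses are invented.

```python
import re

# Same classes as the grep pattern, case-insensitive, whole-string match.
STRICT = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.IGNORECASE)

samples = [
    "user@example.com",
    "curator@local.museum",   # TLD longer than four letters
    "o'connor@example.ie",    # apostrophe in the local part
    "dev@example.software",   # newer gTLD
]

for address in samples:
    verdict = "accepted" if STRICT.fullmatch(address) else "dropped"
    print(f"{address}: {verdict}")
```

Only the first address is accepted; the other three are dropped.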


Yes.


So any new gtld longer than four characters would be dropped on the floor? E.g. .software? If you're trying to censor email addresses from a file, that doesn't work that well. :)


Yes. Actually for the longest time I rejected the connection with "Your name is too awesome!"


This would not allow email addresses with an = in them, which I've had. The point of my overly tolerant regex is exactly that: any more restrictive validation is going to drop some perfectly legal email addresses. Sending an email is the only way to be sure.


Let's recall the famous quote from Jamie Zawinski: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

I was neglecting this quote for a long time, until I started using regular expressions in real projects...


I never really learned how to write regex. I can write simple ones with the use of a tool to help me figure out what I need to write (and a cheatsheet to explain the terms).

I realize they could save me some future potential grief, but I'm usually more concerned with the present actual grief they cause me.

I feel like I'm letting down the side.


If you start writing a lot of Unix pipelines or use Vim extensively you start to get a very strong working knowledge of regular expressions (awk can solve basically any problem, but it creates at least twice as many as it solves :P).


So many routing solutions use regexes. Is that really necessary?


This checks whether or not an email address follows RFC 5322 via parsing vs via regex: https://github.com/jackbowman/email-addresses


Yes! The title is misleading; it's not at all difficult to syntactically validate an email address; it's just not possible using regular expressions (HE COMES).


The article's title would be more complete if it said "by regex".


That's what the article describes, but it's also hard to validate an email address by sending to it if you want time bounds. My mail server implements greylisting, and it's frequently difficult for me to verify my email address on services that send me tokens that expire. Greylisting typically delays by only 10 minutes or so, but there are plenty of times that mail servers can be down for extended periods, or a quota/disc is full, or an intermediate mail router is down, or any number of other problems.


It seems there are weird things you can put in an email address that nobody actually does; as a result, what is used and considered to be an email address has matured. If you create a weird email address, in practice you'll be less able to use it.

The weirder it is the fewer web forms or software you'll successfully put it into.

I think we can just say no, functionally, you cannot put comments or additional @ symbols into your email address. It hasn't worked for long enough, people know you just aren't supposed to do it. I'd be surprised if you were allowed to create such a thing signing up for bing for example. You probably need to be the administrator of some chaotic UNIX server with full DNS, in order to force it to happen at this point.

Even Google Chrome's built in email field validation doesn't allow you to do it.

I shouldn't be expected to jump through the hoops necessary in order to allow "technically valid" email addresses that someone went out of their way to make, when I could more easily suggest they use a normal one.


> I shouldn't be expected to jump through the hoops necessary in order to allow "technically valid" email addresses that someone went out of their way to make, when I could more easily suggest they use a normal one.

I really hope you don't work on anything important if your stance is, "I shouldn't be expected to implement specifications correctly because it's easier to only implement part of it."

Why are we even having this discussion? Implement it correctly once, put it in a library and never worry about it again. You don't have to jump through any hoops; you're only making more work for yourself by implementing the standard incorrectly and then having to deal with customers who think that the IETF standard is more valid than your personal definition of what an email address should be.

If your code is passed down the line and eventually hits someone who writes unit tests for actual valid email addresses then your name is going to come up on the git blame when it fails.


Even Gmail, what I would consider a gold standard, only allows letters, numbers, and periods. So when you're done accusing me of not being someone viable for a position anywhere important maybe you should send a message outlining the same argument to Google.


Huh, Gmail completely implements RFC5321 and you can send and receive email from any valid address, even from their web client, and they validate the email address.

I stand by my position, failing to implement the spec correctly should be considered an error even when Google does it.

If you want a compromise, how about printing "we don't support email addresses with X" if you want to be picky for SPAM detection, simplicity, or something rather than "this email address isn't valid." Google does this for some valid-but-not-accepted addresses on their signup page but it's not sophisticated enough to catch everything.


It's easy to validate that the syntax is correct. The problem lies in what you're trying to do with those addresses. If you're importing a mailing list archive, chances are a syntactical check is the only one you can do, because half the domains for older lists don't exist anymore, and most of the mailboxes won't.

If you want to send email to that address, you're probably going to want something that can suggest gmail as a replacement for gmial. You can also check that the domain exists and has an MX record. If you run your own mail server you can probably even check that the mailbox exists...

If you want emails to be unique, you'll need to apply per-site logic, like Gmail's optional dots, and strip the + segments. That's important if you're combining multiple lists of emails, or importing an existing mailing list for a user.

The gist is the real world is complicated, but you can pretty easily set up something that handles 90% of it.
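For the uniqueness case, a minimal Python sketch of per-site normalisation. Only Gmail's dot and plus handling is modelled; the `dedup_key` name and the rule table are assumptions, and the result is meant as a comparison key, not as the address to send to.

```python
# Assumed per-provider rules; extend per site as needed.
DOTS_AND_PLUS_IGNORED = {"gmail.com"}


def dedup_key(address):
    """Return a key for spotting duplicate addresses. Not for sending mail."""
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    if domain in DOTS_AND_PLUS_IGNORED:
        # Drop the +tag, then the optional dots, then fold case.
        local = local.partition("+")[0].replace(".", "").lower()
    return f"{local}@{domain}"


print(dedup_key("John.Doe+amazon@Gmail.com"))  # johndoe@gmail.com
print(dedup_key("john.doe@example.com"))       # john.doe@example.com
```

Keep the address exactly as entered for delivery and store the key alongside it.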


Wouldn't it be reasonable to have a sanity check that can be bypassed by the user? It is very likely to be a mistake if there's no full stop in the address, but there are exceptions [0]. I would like to see a warning if I accidentally type vostok@examplecom instead of vostok@example.com.

[0] https://mail.gnome.org/archives/evolution-list/2002-January/...


Your first example is a valid email.


I remember reading the opinion that one need only verify that an email address is an email address by answering, "Does it have an '@'?". Yes? Email address. No? Try again.

Perhaps it is nice to ask a question one step above the email verification: how much responsibility does/should the user have, and how much responsibility does/should the designer have, in ensuring the user's email address is valid?


With a regex/parser maybe -- but it's very easy to require "activation" from said email address as a verification.


I added a comment to the author's article. You can construct a regex to process every valid email address except those with nested comments (a feature no one in the real world ever used): https://github.com/dhoerl/EmailAddressFinder


I've been using Mailgun's validator [0] for a while now and have been quite pleased. It catches common typos and validates DNS on the host name.

[0] https://documentation.mailgun.com/api-email-validation.html


For anyone who would like to test their email address validation code, I wrote a fuzzer which can generate syntactically valid addresses (among other things).

https://github.com/nradov/abnffuzzer


Is there ABNF for email addresses in an RFC?


Yes, RFC 2822. Look for the addr-spec rule.

https://tools.ietf.org/html/rfc2822#section-3.4.1


This brought me back to how Postfix has been handling this all these years.

http://www.postfix.org/ADDRESS_VERIFICATION_README.html


I think that spaces are also valid in email addresses. So, even <bilbo brouha@example.com> would be a valid email address in that case...


They are valid within quotes, so that <"b b"@example.com> is valid, but <b b@example.com> isn't.

Almost anything is valid within quotes, but quotes can not appear everywhere.
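As a sketch of that rule in Python, heavily simplified from the RFC grammar (local-part = dot-atom / quoted-string). Comments, folding whitespace, and the obsolete forms are deliberately ignored, and the names here are made up for illustration.

```python
import re

# atext characters allowed in an unquoted local part.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
# dot-atom: runs of atext separated by single dots.
DOT_ATOM = re.compile(rf"{ATEXT}(?:\.{ATEXT})*")
# quoted-string (simplified): anything but a bare quote or backslash,
# or a backslash followed by a printable ASCII character.
QUOTED = re.compile(r'"(?:[^"\\\r\n]|\\[ -~])*"')


def local_part_ok(local):
    return bool(DOT_ATOM.fullmatch(local) or QUOTED.fullmatch(local))


print(local_part_ok('"b b"'))  # True: quoted, so the space is allowed
print(local_part_ok("b b"))    # False: a bare space is not
```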



