Hacker News
Full email validation regex (RFC 2822) (iamcal.com)
54 points by ZeljkoS on Sept 24, 2014 | 60 comments



Please read this: https://nikic.github.io/2012/06/15/The-true-power-of-regular...

RFC5322-compliant regex:

    /
        (?(DEFINE)
            (?<addr_spec> (?&local_part) @ (?&domain) )
            (?<local_part> (?&dot_atom) | (?&quoted_string) | (?&obs_local_part) )
            (?<domain> (?&dot_atom) | (?&domain_literal) | (?&obs_domain) )
            (?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dtext) )* (?&FWS)? \] (?&CFWS)? )
            (?<dtext> [\x21-\x5a] | [\x5e-\x7e] | (?&obs_dtext) )
            (?<quoted_pair> \\ (?: (?&VCHAR) | (?&WSP) ) | (?&obs_qp) )
            (?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)? )
            (?<dot_atom_text> (?&atext) (?: \. (?&atext) )* )
            (?<atext> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+ )
            (?<atom> (?&CFWS)? (?&atext) (?&CFWS)? )
            (?<word> (?&atom) | (?&quoted_string) )
            (?<quoted_string> (?&CFWS)? " (?: (?&FWS)? (?&qcontent) )* (?&FWS)? " (?&CFWS)? )
            (?<qcontent> (?&qtext) | (?&quoted_pair) )
            (?<qtext> \x21 | [\x23-\x5b] | [\x5d-\x7e] | (?&obs_qtext) )

            # comments and whitespace
            (?<FWS> (?: (?&WSP)* \r\n )? (?&WSP)+ | (?&obs_FWS) )
            (?<CFWS> (?: (?&FWS)? (?&comment) )+ (?&FWS)? | (?&FWS) )
            (?<comment> \( (?: (?&FWS)? (?&ccontent) )* (?&FWS)? \) )
            (?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment) )
            (?<ctext> [\x21-\x27] | [\x2a-\x5b] | [\x5d-\x7e] | (?&obs_ctext) )
    
            # obsolete tokens
            (?<obs_domain> (?&atom) (?: \. (?&atom) )* )
            (?<obs_local_part> (?&word) (?: \. (?&word) )* )
            (?<obs_dtext> (?&obs_NO_WS_CTL) | (?&quoted_pair) )
            (?<obs_qp> \\ (?: \x00 | (?&obs_NO_WS_CTL) | \n | \r ) )
            (?<obs_FWS> (?&WSP)+ (?: \r\n (?&WSP)+ )* )
            (?<obs_ctext> (?&obs_NO_WS_CTL) )
            (?<obs_qtext> (?&obs_NO_WS_CTL) )
            (?<obs_NO_WS_CTL> [\x01-\x08] | \x0b | \x0c | [\x0e-\x1f] | \x7f )
    
            # character class definitions
            (?<VCHAR> [\x21-\x7E] )
            (?<WSP> [ \t] )
        )
        ^(?&addr_spec)$
    /x
Also, if you want to validate a mail address, send a mail. There is no other way.


You should probably confirm it contains an @ first. The question is how much validation is necessary prior to sending the email?


There needs to be an @, something in front of it, and something behind it. If you've got that, try mailing to it.
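
A minimal sketch of that check in Python, assuming the last `@` is the one that separates the local part from the domain. The function name `worth_mailing` is made up for illustration; it only decides whether an address is worth a delivery attempt, nothing more.

```python
def worth_mailing(address: str) -> bool:
    """Return True if there is an '@' with something on both sides of it."""
    # rpartition splits on the LAST '@', so a quoted local part that itself
    # contains '@' stays entirely on the left-hand side.
    local, at, domain = address.rpartition("@")
    return bool(local) and bool(at) and bool(domain)


if __name__ == "__main__":
    for candidate in ["claudius", "postmaster@ws", "@example.com", "user@"]:
        print(candidate, worth_mailing(candidate))
```

Note that a bare local mailbox name has no `@` and is turned away by this check, while a dotless domain is let through.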


  $ echo "Hello" | mail claudius


Local email addresses are of course easier to identify. You can just check if the user exists on the server. But that's not much use to websites and other online services.


Not necessarily; "checking if the user exists on the server" is ambiguous. You could check /etc/passwd, but the mail system may use virtual users for "local" delivery, where "local" is defined in this case as not requiring a domain portion. The only way to check whether a user exists, even locally, is to try to send it mail.


That's local mail, not internet mail. Technically you're talking about a local mailbox, not an email address.


That's a context free grammar, you cheater!

<Read the link…>

Okay, so Perl Compatible Regular Expressions can parse context-free grammars. And context sensitive grammars. And who knows what more.

I understand there's a difference between theory and practice, but this is a plain misuse of the word "regular". PCRE should be renamed "Perl Compatible Parsing Facility" or something.


I regularly tested the "email regex du jour" at my previous job whenever these types of articles came up. IIRC, it was against 15+ million known-good email addresses, and probably double that in known bads, and nearly every regex tested had its issues. [edit: we had something like 150,000 distinct active domains, and probably half that many distinct MXes (if you rolled up all the google-biz and microsoft hosted stuff)... if you think getting your email delivered by gmail is difficult, try a school district in Wyoming that appeared to have a 300-baud line connecting it to the world, running an ancient version of Groupware that rejected email according to the weather report as far as we could tell...]

Most people working on the code for that sign-up page (/what have you) have neither the regex-fu necessary nor the understanding of email to write the regex correctly... So you get a lot of shitty regexes (especially at large corporations) that don't support apostrophes or dashes/plus signs in the local parts. And it doesn't matter how good your regex-fu and RFC comprehension abilities are, there are a lot of broken implementations out there, and blocking a subscriber because of their broken system isn't great business.

It took a while, but eventually we switched our signup forms to do a couple of very effective things beyond a very simple address regex:

1) Auto-suggest for common misspellings of our most common domains (gmal.com, yaho.com, etc.).

2) While the "please re-type your email" field kept the user busy for long enough, we did a DNS lookup of the domain, then an MX lookup. If there was a problem with either, we passed an error to the user like "Please double check the domain of your email address..."

3) Check for domains you know have moved. We were B2B, so if you watched your bounces closely, you'd know that asdf.com was moving to hjkl.com, so you could update your existing records, but people have serious muscle memory, and it's worth reminding them on the signup page.

I was working on tying in our bounce database (you are keeping a record of all your bounces, right?) so that automatically flagged domains would prompt the user with an error like "We've been unable to deliver to your email domain recently, if your email address is typed correctly, we recommend using a secondary email address if you have one..."


I worry about people putting things like this on the Internet. Any experienced developer knows it's a joke and that there are better ways to validate e-mail addresses; but there are plenty of inexperienced -- copy-and-paste -- developers out there. A colleague of mine did something similar, for example: he didn't even know what a regular expression was and I could see, as it was a much simpler pattern than this one, that it would fall quite far from the mark.


I think it's highlighting the fact that most email regex validation is wrong. Email is a more complex specification than most people realise.

I don't even think you can parse it using a regular language (though most regex engines go beyond this).


As well as traditional verification, I often use a perl script that takes an email as `ARG[1]`, runs this, and exits with `0` or `1` (for easy cross-language usage) because: 1) I don't like the idea of my frontend giving the impression my software takes garbage; 2) poking around my database and seeing obviously wrong, maliciously entered 'emails' makes the OCD in me flare up. Works well for me.


1. Make sure there is at least one '@'

2. Make sure there is at least one '.'

3. Make sure the entire thing is at least 4 characters long (@, . and two other characters)

4. Resist the temptation for something smarter

5. Send an email with unique link to verify
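
The five steps above can be sketched roughly as follows. This is an illustration under assumptions, not anyone's production code: the function names and the `base_url` parameter are hypothetical, and actually storing the token and sending the mail are left out.

```python
import secrets


def passes_basic_checks(address: str) -> bool:
    """Steps 1-3: at least one '@', at least one '.', at least 4 characters."""
    return "@" in address and "." in address and len(address) >= 4


def make_verification_link(base_url: str) -> tuple[str, str]:
    """Step 5: build a unique, hard-to-guess link to mail to the address.

    Returns (token, link); the caller would store the token next to the
    pending signup and mark the address verified when the link is visited.
    """
    token = secrets.token_urlsafe(32)
    return token, f"{base_url}?token={token}"


if __name__ == "__main__":
    if passes_basic_checks("someone@example.com"):
        # Hypothetical endpoint, for illustration only.
        print(make_verification_link("https://example.com/verify")[1])
```

Step 4 is the absence of any further code.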


> 4. Resist the temptation for something smarter

I worked on a SaaS product with a largely non-technical audience, and we had a frequent issue with people mistyping their email addresses.

We tried several things. Turned out that both confirmation email and asking to type email twice hurt conversion rates badly (in our case - all audiences are different).

However, checking email for potential typos worked really well. We had a small set of rules:

1. Domain is very close to a popular email provider (@gnail.com, or @yaho.com, etc.).

2. Email contains a fragment very close to user name: pol@rodgers.tld for Poul Rodgers.

3. We had universities as customers, and a lot of students would enter "name@university-domain.com" instead of "name@university-domain.edu". We had a special check for it.

Overall, a couple lines of JavaScript helped us to get rid of 97% of mistyped email addresses.
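
The first rule (domain very close to a popular provider) might look something like this in Python. It is only a sketch: the provider list, the 0.8 similarity cutoff, and the name `suggest_domain` are assumptions picked for the example, and the comment above describes JavaScript rather than this code.

```python
import difflib
from typing import Optional

# Assumed list of popular providers; a real form would use its own top domains.
COMMON_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]


def suggest_domain(address: str, cutoff: float = 0.8) -> Optional[str]:
    """Return a corrected address if the domain looks like a typo, else None."""
    local, at, domain = address.rpartition("@")
    if not at:
        return None
    domain = domain.lower()
    if domain in COMMON_DOMAINS:
        return None  # exact match, nothing to suggest
    # get_close_matches ranks candidates by difflib's similarity ratio.
    matches = difflib.get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{matches[0]}" if matches else None


if __name__ == "__main__":
    print(suggest_domain("pol@gnail.com"))
```

The result is meant to drive a "did you mean ...?" prompt rather than a silent rewrite, since the unusual domain may be the correct one.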


Strictly speaking user@localserver is valid, but would fail test 2.


Same for user@{ai,ax,cf,dm,gp,gt,hr,io,kh,km,lk,mq,pa,tt,ua,va,ws,bar,college,host} and probably some other TLDs I didn't bother to check.


True, but is it something you actually want to accept?


Yes. Or no.


Surely if you refuse TLDs (which I guess is what rule 2 does, checks that there's a dot in the domain part and refuses local domains and TLDs), you need at least 5 characters (a@b.c), and since TLDs are at least 2 letters that's 6. Although it'll still let through "@ff.cc"


There doesn't have to be a dot in an email address. a@nl is totally legal, and might even work.


> 2. Make sure there is at least one '.'

This will fail on sending an email to an IPv6 address, which has no '.'.

I know, it is nitpicking, and non-technical users especially would never send an email to an IP address. ;o)


That's a feature, not a bug.


Why? It wouldn't fail on an IPv4 address, so what's the difference?


Because if someone tries to sign up for my site with an IPv6 email, that's a glowing neon sign saying "SUPPORT NIGHTMARE AHEAD"


Agreed. /.+@.+\..+/. Anything "smarter" is almost always overthinking it.


Dot in domain part? That's too restrictive.

A few TLDs have MX records. There's no reason to reject an address like, say, "postmaster@ws" - it's a perfectly valid address that could be actually working.

And why bother validating anything beyond the fact that the string's non-empty and there's an "@" character? Shoot them an email: if they receive it, it's a valid address (no matter how weird it may look); if they don't, well, it's not like something bad happened.


Depends on what you're doing. If it's data capture on some kind of competition page, then a more complicated email regex can increase data quality a lot.

I have built quite a lot of data capture forms, and while no regex or a limited one increased the number of entries, a more complicated regex combined with checking MX records improved conversion rate, because there were fewer junk entries. I used `\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b` (taken from regular-expressions.info).

It has a few false positives and a few false negatives, but overall, it optimises conversion rates, which is what I was being paid for.

YMMV.


> \.[A-Z]{2,4}

That should definitely be {2,}:

* there are a bunch of gTLD with more than 4 characters: http://en.wikipedia.org/wiki/List_of_Internet_top-level_doma... even ignoring geoTLD and brandTLD

* ccIDN are all more than 4 characters since the ACE prefix ("xn--" for IDNA) is already 4 characters all on its own
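
A quick way to see the effect of the `{2,4}` limit, sketched in Python with the pattern quoted above (case-insensitive, as on regular-expressions.info). The names `OLD` and `NEW` are just labels for this example.

```python
import re

# Shared prefix of the pattern quoted above: local part, '@', host labels, final dot.
LOCAL_AND_HOST = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\."

OLD = re.compile(LOCAL_AND_HOST + r"[A-Z]{2,4}\b", re.IGNORECASE)  # TLD of 2 to 4 letters
NEW = re.compile(LOCAL_AND_HOST + r"[A-Z]{2,}\b", re.IGNORECASE)   # TLD of 2 or more letters

if __name__ == "__main__":
    for address in ["user@example.com", "user@example.museum"]:
        print(address, bool(OLD.search(address)), bool(NEW.search(address)))
```

Lifting the upper bound only addresses length. The `[A-Z]` class still does not cover the hyphens that follow the "xn" of an ACE-prefixed label, so punycode TLDs would need a further change to the pattern.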


True. It probably does need a bit of an update.


Ah, I remember this famous question at StackOverflow on this: http://stackoverflow.com/questions/201323/using-a-regular-ex...

The gist is to avoid regular expressions in favor of some third party library like so: http://barebonescms.com/documentation/ultimate_email_toolkit...

This, of course, becomes an issue when you want to do everything in javascript and go down the rabbit hole of regular expressions. Sort of like deciding on a pattern from assumptions as one might make the mistake of doing with names: http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...


/@/ usually does the job for me... hehe


Unfortunately many devs try to be smarter and my something+website_name@gmail.com is usually rejected as incorrect :/


Ironically, after many years this is still being rejected as incorrect by the Google Apps admin dashboard for Groups...


This is a pain because gmail allows you to append anything you want after the + and it will still reach your inbox.

For example - If your email address is johnsmith@gmail.com you can have mail sent to johnsmith+stopspamming@gmail.com and it will go into your inbox.

http://lifehacker.com/144397/instant-disposable-gmail-addres...


I use a slightly more complex variation of that:

   ^.+@.+\..+$
Works wonders. I think when testing the addresses on a sign up form, we got only 0.5% that we couldn't relay to, which was a pretty good hit rate.


It looks to me like you're neglecting email addresses from <http://ai./>. The 'ai' TLD does have an MX record.


Yes; I get told that every time I mention it :-)


I would suspect that anyone whose email did not match that regex would have such a miserable time generally getting it rejected as invalid left right and centre that they would just cave in and get a simpler one that did.


That's pretty good. I learned from Rails Tutorial to use this one:

    /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
It's worked for me when I needed to use it.


Hope you don't have international customers, because http://äöüß.de is a totally legal domain these days.
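
To make that concrete, here is a rough Python transliteration of the pattern quoted above (`\z` becomes `\Z`), plus one possible workaround: converting the domain to its ASCII form before matching. Two assumptions to flag: Python's `\w` is Unicode-aware where Ruby's is not, and the standard library's `idna` codec follows the older IDNA 2003 rules, so this is a sketch rather than a recommendation. `to_ascii_address` is a made-up helper name.

```python
import re

# Python rendering of the Rails Tutorial pattern quoted above.
RAILS_STYLE = re.compile(r"\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\Z", re.IGNORECASE)


def to_ascii_address(address: str) -> str:
    """Re-encode the domain part with the stdlib 'idna' codec (IDNA 2003)."""
    local, _, domain = address.rpartition("@")
    return f"{local}@{domain.encode('idna').decode('ascii')}"


if __name__ == "__main__":
    raw = "user@äöüß.de"
    print(bool(RAILS_STYLE.match(raw)))                    # Unicode domain as typed
    print(to_ascii_address(raw))                           # ASCII-compatible form
    print(bool(RAILS_STYLE.match(to_ascii_address(raw))))  # same address, encoded
```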


Very true. My simplified one on the parent comment of that one makes no attempt to involve Unicode or character ranges for precisely that reason!


My regexp: [^@]+@[^@]+


That fails on the perfectly valid email address: "Foo@bar"@example.com

Although if that's your email address, you deserve everything you get ;)

If you want to verify if an email address is valid, just ask your mail server if it is capable of sending mail to it:

  root@flan:~# /usr/sbin/sendmail -bv '"Foo@bar"@example.com'
  "Foo@bar"@example.com verified
  root@flan:~#


Does it fail? I don't see ^ or $.


The regexp looks for exactly one '@'-character preceded and followed by at least one character that's not an '@'-character. Or, in other words, it does not allow for an email address with more than one '@'-character.


The point of the post was that since the regexp doesn't include start or end of a string it would still succeed:

"Foo@bar"@example.com -> "Foo@bar"


I assumed it was anchored and he missed those off.
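
The anchoring question in this sub-thread can be checked directly. A small Python sketch using the pattern exactly as posted, comparing an unanchored search with a whole-string match:

```python
import re

PATTERN = re.compile(r"[^@]+@[^@]+")
address = '"Foo@bar"@example.com'

# Unanchored: the engine is free to match just a prefix of the input.
unanchored = PATTERN.search(address)

# Anchored at both ends: the whole string must fit, and two '@' cannot.
anchored = PATTERN.fullmatch(address)

if __name__ == "__main__":
    print(unanchored.group() if unanchored else None)
    print(anchored)
```

So both readings are right: as written the pattern "succeeds" by matching only part of the address, and once anchored it rejects the address outright.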


I don't need those “perfectly valid” email addresses. People who create such addresses are potentially dangerous.


This is not really "validating" emails in the sense most people think of it. The RFC is about addressing SMTP envelopes, not entering email addresses. This would not be appropriate for e.g. checking if an address entered in a signup form is "valid." This includes a bunch of things that make no sense and aren't really email addresses (like embedded comments) and meanwhile has no idea that bogus@example.com is not an address that will actually receive mail. The only way to know an address is valid is to email it.

It's mostly a joke. One might want to use this if writing a mail server, but even then...


This isn't even about SMTP envelopes. RFC 2822 is about email headers, so it's even worse. Totally invalid for any real world usage outside perhaps an email client.


My most recent encounter with idiotic email validation is that many apps don't accept anything on a recent TLD. Even f-ing AWS SNS web console didn't let me add a perfectly valid address in a notification topic.


Can somebody explain why it has to be so long? Are there so many special cases?


Have a read of RFC 2822[1] section 3.4

In brief, the address specification may look like the simple "local" @ "domain", but those subparts can be non-regular (i.e., making them hard/impossible for a regular expression engine to parse) or contain a lot of exceptions (e.g., the domain could be google.com, or it could be 12.34.56.78, or localhost, or a number of other things).

[1] https://www.ietf.org/rfc/rfc2822.txt


It's not a regular grammar. Regular expressions were designed to handle regular grammar, and there are cases where something that looks valid is actually invalid: ex@256.255.255.255.

(See the above poster's link: https://nikic.github.io/2012/06/15/The-true-power-of-regular...)
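
For the ex@256.255.255.255 case, range checks like this are usually easier outside the regex. A minimal Python sketch, assuming you want to flag domains that are shaped like a dotted quad but are not a real IPv4 address; the function name is hypothetical, and whether a bare (unbracketed) IP should be accepted as a domain at all is a separate question, since the RFC grammar puts address literals in square brackets.

```python
import ipaddress


def is_bogus_ipv4_lookalike(domain: str) -> bool:
    """True if the domain is four dot-separated numbers but not a valid IPv4 address."""
    parts = domain.split(".")
    if len(parts) != 4 or not all(p.isascii() and p.isdigit() for p in parts):
        return False  # not dotted-quad shaped; treat it as an ordinary hostname
    try:
        ipaddress.IPv4Address(domain)
    except ValueError:
        return True  # e.g. an octet above 255
    return False


if __name__ == "__main__":
    for domain in ["256.255.255.255", "12.34.56.78", "google.com"]:
        print(domain, is_bogus_ipv4_lookalike(domain))
```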


I think this partly overstates the complexity of validating an e-mail address in a registration form or similar. If your aim is only to get a syntactically correct address to which you can try to deliver mail, you don't need to accept stuff like:

* "Name surname" <address@example.com>

* Name surname <address@example.com>

* Group name: Member 1 <one@member.com>, "2, member2"<two@member.com>, three@member.com

* guy@nonpubliclyresolvabledomain

There are many other RFC 2822-valid kinds of addresses that you don't need to accept if you are not writing an e-mail client, SMTP server, or similar.
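
One way to turn away the display-name forms listed above without writing a grammar is to let Python's standard library parse the field and accept it only if nothing but the bare address was there. This is a sketch under that assumption; `is_bare_addr_spec` is a made-up name, and it checks shape only, not deliverability.

```python
from email.utils import parseaddr


def is_bare_addr_spec(field: str) -> bool:
    """True if the input is just an address, with no display name or angle brackets."""
    display_name, addr = parseaddr(field)
    # A bare address parses to an empty display name and comes back unchanged.
    return display_name == "" and addr != "" and addr == field


if __name__ == "__main__":
    for field in [
        "address@example.com",
        '"Name surname" <address@example.com>',
        "Name surname <address@example.com>",
    ]:
        print(repr(field), is_bare_addr_spec(field))
```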


Aka: do not attempt to use this if you are sane.


...and this is why you don't ask someone to write regex to match an e-mail address on an interview.


Surely it's exactly why you might ask that — admittedly as a semi-trick question, which you should probably only direct at experienced people who really ought to recognise it as such.

Email validation is a problem with a lot of plausible answers — many of them wrong — so it has the potential to be quite a good discriminant (depending on whom you're trying to hire, of course).


Neat visualization thereof - https://www.debuggex.com/r/v99uZHQj97Tkgnjy


I tried pasting this into Expresso so I could browse it with its analyzer tool, but it refused.



