

Valid Email Addresses - ry0ohki
http://en.wikipedia.org/wiki/Email_address#Valid_email_addresses

======
eli
It's weird how often email address validation comes up on HN. I think it must
be bike shedding for the web app generation. It's the simplest piece of a
common web app for which there could possibly be any discussion or
disagreement.

Yes, you can have a crazy looking email address. But that doesn't mean you
should. And you have no right to expect a web form will let you enter your
address with nested comments.

On the other hand, a lot of developers seem to confuse validating an email
address against the RFC with confirming that it is the user's true and correct
address. This is not possible without sending the address a message.
Regardless of how good your regex is, it will let many typos through and it
will fail to stop "fake@fake.com". I'd suggest spending time elsewhere.
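
A rough sketch of that division of labor in Python: a deliberately loose syntax check, plus a random token that would be mailed to the address for confirmation. The function names are made up for illustration, and actually sending the message is left out.

```python
import secrets


def looks_plausible(address: str) -> bool:
    """Loose sanity check: something, then an @, then a dotted domain."""
    local, sep, domain = address.rpartition("@")
    return bool(sep) and bool(local) and "." in domain.strip(".")


def new_confirmation_token() -> str:
    """Random token to embed in a confirmation link (hypothetical flow)."""
    return secrets.token_urlsafe(32)


# A syntax check cannot tell a real mailbox from a made-up one.
print(looks_plausible("fake@fake.com"))    # True
print(looks_plausible("Abc.example.com"))  # False
```

Only a click on the mailed link would show that the address really belongs to the user.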

~~~
haberman
> And you have no right to expect a web form will let you enter your address
> with nested comments.

While nested comments are a bit extreme, I'm not a fan of the attitude that a
user has "no right" to use features that are a documented part of the spec.
Just because a feature is uncommon or doesn't seem important to you doesn't
mean it's not important to some small subset of your users.

For example, that's not far from saying that a user has "no right" to put +tag
in their email address (after all barely anyone uses that), but some people
find this extremely valuable.

~~~
saurik
No, nested comments are _not_ part of "the" spec (in a way that would imply
"the e-mail spec"); sure, they are a feature of "a" spec, but that spec is
actually somewhat unrelated to what an e-mail address actually is: they are
just a feature of the MIME specification for header field values.

If you look at the SMTP specification (which, given that it defines the
protocol in charge of using e-mail addresses for actual delivery has a much
better claim to being "the" spec), you will note that you aren't allowed to
use an e-mail address with nested comments in that context, as they have no
meaning to SMTP.

However, you will also find that the rules for what characters are allowed and
which have to be _escaped_ are different, as that's what these specifications
are actually discussing: how to _escape_ an e-mail address for use with
specific transport protocols.

An actual e-mail address? It seems to support pretty much anything followed by
an @ followed by a domain name. It is just that in MIME, if you want to have a
space character you will need to put it in quotes, or if you want a quote you
will need to use a backslash.

The user of your web form, of course, is not typing MIME: there is a box that
they can just type their e-mail address into, and it should probably support
the raw syntax of their actual e-mail address, not a randomly chosen format
required for escaping.
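
To illustrate the "anything, then an @, then a domain name" reading, here is a minimal Python sketch that takes the raw text from a form field and splits it at the last @. The helper name `split_address` is hypothetical, and the sketch assumes the domain itself never contains an @.

```python
from typing import Tuple


def split_address(raw: str) -> Tuple[str, str]:
    """Split a raw address at its last @ into (local part, domain)."""
    local, sep, domain = raw.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not of the form local@domain: {raw!r}")
    return local, domain


print(split_address("foo+bar@example.com"))      # ('foo+bar', 'example.com')
print(split_address("odd @ local@example.org"))  # ('odd @ local', 'example.org')
```

No quoting or backslashes are needed here, because nothing after the last @ can be mistaken for part of the local part.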

To make this more clear, one has to ask: why MIME escaping? Why not require
the user to use HTML attribute escape sequences? That way, if their e-mail
address contains a special character, instead of using quotation marks and
backslash escaping, they'd use entities, like "&quot;".

Honestly, that makes about as much (if not more) sense. Meanwhile, of course,
the user's username and password fields should also be escaped similarly, and
if the user attempts to use a bare < or > they should get a validation
error "please escape your password using RFC1866 (HTML)".

Previous, more detailed versions of this same complaint:

<http://news.ycombinator.com/item?id=4794368>

<http://news.ycombinator.com/item?id=4486872>

~~~
haberman
Thanks for the clarification, I did not realize that there are multiple
parallel RFC tracks that define differing syntax and semantics of email
addresses. Your claim, then, is that all of the complicated syntax defined for
email addresses in RFC2822 and RFC5322 is for the sole purpose of escaping
characters that are significant to MIME? What about "+" -- is it just
convention that most email hosts ignore everything to the right of that, or is
that actually specified somewhere?

~~~
saurik
Yes: that is just convention. In fact, RFC5233 defines an extension to Sieve
(a purposely-not Turing-complete language for filtering e-mail that is
implemented as part of many mail systems) that parses those + addresses; this
is the only e-mail-related standard I've so far come across that mentions this
common feature (and I've read through numerous at this point ;P).

However, it does not define the syntax for + addresses (even so far as to
define the "+"), as + is only a convention (as is the entire concept of having
detailed/sub-addressing at all): it even has various examples, such as
"5551212#123@example.com", that use alternate characters.

> NOTE: Because the encoding of detailed addresses are site and/or
> implementation specific, using the subaddress extension on foreign addresses
> (such as the envelope "from" address or originator header fields) may lead
> to inconsistent or incorrect results.

> Implementations MUST make sure that the encoding method used for detailed
> addresses matches that which is used and/or allowed by the encompassing mail
> system, otherwise unexpected results might occur. Note that the mechanisms
> used to define and/or query the encoding method used by the mail system are
> outside the scope of this document.
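
Since the separator is site-specific, a sketch of sub-address parsing has to take it as a parameter. The Python below is an illustration only: `split_detail` is a hypothetical helper, and splitting at the first occurrence of the separator is an assumption, not something the RFC mandates.

```python
from typing import Optional, Tuple


def split_detail(local_part: str, separator: str = "+") -> Tuple[str, Optional[str]]:
    """Split a local part into (user, detail) at the first separator.

    The separator is a per-site convention, so it is a parameter here.
    """
    user, sep, detail = local_part.partition(separator)
    return (user, detail) if sep else (local_part, None)


print(split_detail("foo+bar"))           # ('foo', 'bar')
print(split_detail("5551212#123", "#"))  # ('5551212', '123')
print(split_detail("plain"))             # ('plain', None)
```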

Also, yes: RFC5322 defines a ton of syntax, and all of that syntax is related
to MIME headers; a "structured header" has particular rules related to
whitespace and is allowed to contain comments, so e-mail addresses included as
part of the address lists used in headers like To and From are going to be
adapted to follow those rules.

FWIW, RFC5322 actually has a SHOULD NOT on the things that make it un-similar
to the SMTP specification. The two specifications really do attempt to use
fairly similar syntax. You thereby are allowed to have comments and crazy
whitespace in weird places in MIME, but "please don't" ;P.

> Comments and folding white space SHOULD NOT be used around the "@" in the
> addr-spec.

The goal really did seem to be, I will happily admit, to have the two
protocols be largely compatible to the extent that they could: the same list
of reserved characters is used by both (as a key example, SMTP also doesn't
allow the ()'s despite not supporting MIME comments). There are some weird
differences, like RFC5321 allowing empty double-quotes as the local part;
although, RFC821 did not seem to have that corner case, so I'm starting to
think this is a bug introduced in RFC2821 (I had read mailing list posts about
this issue a while back, but somehow it wasn't clear from those that it is a
mistake).

I maintain, though, that it is very weird to be forcing this particular escape
sequence set everywhere: when you lift e-mail addresses out of angle addresses
and lists you don't need it anymore, as you can parse the address from the
right unambiguously once you hit the @. Regardless, I do need to emphasize the
statement in one of the earlier versions of my comment that RFC3696 has
recommendations for e-mail address validation, and it includes the MIME
escaping. I thereby doubt that my opinion, to be explicit, is shared by some
of the people who worked on these specifications.

(That said, RFC3696 is weird... it mentions, for example, a limit of 64
characters on a username, but in fact that was just a "minimum maximum" from
SMTP, and SMTP was quite clear that "TO THE MAXIMUM EXTENT POSSIBLE,
IMPLEMENTATION TECHNIQUES WHICH IMPOSE NO LIMITS ON THE LENGTH OF THESE
OBJECTS SHOULD BE USED", while at the same time saying that you must not send
such things; I guess "welcome to Postel" ;P.)

------
neya
There's a reason why most of those valid email addresses are not allowed by
most email providers. As an example, from 1997 to the early 2000s, underscores
were very popular among internet users, and most of them had an underscore in
their e-mail addresses. While applying for a job, they would fill out a form on
the employer's website or in some form application software, where the data
was later tabulated.

Once tabulated, the entire email address would be underlined by many popular
software packages back then, since it was essentially treated as a link. Then,
when recruiters copied and pasted these e-mail addresses into their email
clients, the underscore would be lost and replaced by a space (the recruiters
had no idea why), and thus the applicants would never receive these emails.
Hence, many email providers stopped allowing users to register an underscore
(or any complex character) after receiving many such reports, just to avoid
these hassles.

------
nthitz
I don't care for overaggressive email validators myself, but if you are
registering with my service using an email of
""()<>[]:,;@\\\\\"!#$%&'*+-/=?^_`{}| ~ ? ^_`{}|~.a"@example.org" I'll probably
want you to enter something more reasonable.

Anyway just because it is valid according to the RFC, doesn't mean that it's
actually a valid user's email address.

~~~
jacquesm
I guess I won't be using your service then! Quite frequently I find that
services will not accept my perfectly valid email address as valid. This is on
the whole their loss, since if they can't even properly validate an email
address there are probably more loose ends and I'd rather not find out which.

~~~
digitalsushi
On the whole, it's probably not a loss. It's probably a gain. On the whole,
they validated almost all of the email addresses. They spent their corner case
money instead on making the product awesome, and instead of poor
[]:,;@\\\\\"!#$%&'*+-/=?^_`{}| ~ ? ^_`{}|~.a"@example.org getting a copy,
everyone else got something better.

~~~
fayden
Why not use a library that validates emails? You save time, and you actually
accept valid emails.

I agree that the example with a lot of symbols is over the top, but when a
website doesn't accept foo+bar@host.com, I assume the product will be sub-par
quality-wise. If the author did not follow a rigorous process for something as
simple as email validation, I doubt he'll be more rigorous in other parts of
his project.

~~~
pyre
There are other concerns too. I submitted a bug report to Zappos because their
emailer did not URL-escape '+' in foo+bar@example.com. Resulting in:

    
    
      http://www.zappos.com/?email=foo+bar@example.org
    

which was treated as:

    
    
      http://www.zappos.com/?email=foo%20bar@example.org
    

by the browser. It worked fine when I manually encoded the '+'.
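
The same effect can be reproduced with Python's standard library, as a small sketch: a form-style query parser reads a bare '+' as a space, while percent-encoding the value first keeps it intact. Only the query string is shown; how the original emailer built its links is not known.

```python
from urllib.parse import parse_qs, urlencode

address = "foo+bar@example.org"

# Unescaped: a form-style query parser reads the bare '+' as a space.
print(parse_qs("email=" + address))  # {'email': ['foo bar@example.org']}

# Escaped: urlencode percent-encodes '+' (and '@'), so the value survives.
query = urlencode({"email": address})
print(query)                         # email=foo%2Bbar%40example.org
print(parse_qs(query))               # {'email': ['foo+bar@example.org']}
```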

------
alpb
Here's what an RFC-compliant email address regex looks like:
<http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html>

Hint: God damn long.

------
dutchbrit
For people validating emails in PHP, use the following code:

    filter_var($email, FILTER_VALIDATE_EMAIL);

The actual beast:
[https://github.com/php/php-src/blob/master/ext/filter/logica...](https://github.com/php/php-src/blob/master/ext/filter/logical_filters.c#L499)

~~~
ry0ohki
oh wow had no idea that existed, thank you!

------
unreal37
I had no idea that email addresses could be so complex.
""()<>[]:,;@\\\\\"!#$%&'*+-/=?^_`{}| ~ ? ^_`{}|~.a"@example.org" is valid?

It won't be a popular view around here, but that SHOULDN'T be valid. The spec
needs to change. I won't be making changes to any of the large sites that my
company manages to accept weird characters like brackets, semicolons and
quotation marks as a valid email address. That's just asking for an XSS or SQL
injection attack and other trouble.

Sorry, but the spec is just wrong here.

~~~
mrb
You sound like a lazy developer who doesn't want to learn the proper way to
protect against XSS and SQL injection. If you have this mentality, your code
probably has similar vulnerabilities in pieces of data other than the email
address.
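
For what "the proper way" usually means, here is a minimal Python sketch using only the standard library: bind the value as a query parameter when storing it, and escape it when writing it into HTML. The table name and the sample address are invented for the example, and the address is not claimed to be RFC-valid.

```python
import html
import sqlite3

# An address full of characters that worry people (not claimed RFC-valid).
email = "\"<b>o'brien;</b>\"@example.org"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
# Placeholder binding: the value is never spliced into the SQL text.
conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
stored = conn.execute("SELECT email FROM users").fetchone()[0]

# Escape at output time, when the value is placed into HTML.
safe_html = html.escape(stored)
print(stored == email)  # True
print(safe_html)
```

With both steps in place, the characters allowed in the address no longer matter to the database or the page.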

------
spindritf
> Invalid email addresses

> Abc.example.com (an @ character must separate the local and domain parts)

Well, that also depends on the context because replacing @ with a dot is
exactly what you'd do in a zone file:

    
    
        $ dig soa google.com +short
        ns1.google.com. dns-admin.google.com. 2012113000 7200 1800 1209600 300
    

So it gets even hairier.
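
Going the other way is mechanical. In zone-file notation the first unescaped dot stands in for the @, and a literal dot inside the local part is written with a backslash. A small Python sketch, with `rname_to_email` as a hypothetical helper name:

```python
def rname_to_email(rname: str) -> str:
    """Convert an SOA RNAME such as 'dns-admin.google.com.' to an address."""
    name = rname[:-1] if rname.endswith(".") else rname
    local = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\" and i + 1 < len(name):
            # Backslash escape: keep the next character literally.
            local.append(name[i + 1])
            i += 2
        elif ch == ".":
            # First unescaped dot separates local part from domain.
            return "".join(local) + "@" + name[i + 1:]
        else:
            local.append(ch)
            i += 1
    raise ValueError("no unescaped dot in RNAME")


print(rname_to_email("dns-admin.google.com."))      # dns-admin@google.com
print(rname_to_email(r"first\.last.example.com."))  # first.last@example.com
```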

~~~
duskwuff
Well, yes. But at that point it's no longer an email address.

------
jayferd
Why do email addresses need comments again?

~~~
digitalsushi
It helps you remember to whom you gave the address.
mikec+farmville@digitalsushi.com is a nice comment that tells me "I used this
for farmville; who is now emailing me with it?"

~~~
eli
No, that's not a comment. That plus sign is part of the email address. It's up
to your mail server to figure out how to parse it and stuff it into the
correct inbox.

The RFC allows for actual nested comments to appear within the address. Though
nobody actually uses this and really the RFC is talking about formatting for
email messages in transit, not how you should or shouldn't record your address
on a form.

------
nofinator
Someone should update this McSweeney's list:
[http://www.mcsweeneys.net/articles/e-mail-addresses-it-would...](http://www.mcsweeneys.net/articles/e-mail-addresses-it-would-be-really-annoying-to-give-out-over-the-phone)

~~~
edmond_dantes
A contemporary "Who's on first?" scenario.

