Hacker News
Stop Validating Email Addresses With Your Complex Regex (davidcel.is)
155 points by master_dee 1490 days ago | 201 comments



Why not have a simple validator that works with 99% of users' email addresses, but not make it mandatory that an address passes validation?

"We see that bob@localhost doesn't look like an email address. Are you sure it's right?"

That way you can help users that messed up their email but not prevent all the corner cases. The idea is that most email addresses fall in a very narrow subset of the RFC: user@domain.tld and most people would have entered their email wrong if it didn't match that pattern.
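A minimal sketch of that idea in Python, assuming a deliberately loose pattern (the regex and the function name are illustrative assumptions, not an RFC-compliant parser). The function only produces a warning; the caller is still expected to accept the address.

```python
import re

# Deliberately loose: something@something.something, no whitespace.
# This pattern is an assumption for illustration, not an RFC 5322 parser.
LIKELY_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_warning(address):
    """Return a warning string if the address looks unusual, else None.

    The address is never rejected here; the warning is only a nudge.
    """
    if LIKELY_EMAIL.match(address):
        return None
    return (
        f"We see that {address} doesn't look like an email address. "
        "Are you sure it's right?"
    )
```

If the user confirms, the form submits the address unchanged, so corner cases like bob@localhost still get through.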


I don't think this is good advice.

From a previous startup we saw a ton of signups like, "john@gmail" and the like. Obviously this person will not get a validation email -- and in all likelihood will not be able to log in to his account when he returns. It's best to catch him when he's entering the information.


I hear and understand a lot of the comments on this thread mention that regex saves the user from a typo and such. So I want to vouch for a github project called mailcheck[0] by the Kicksend team that's great.

At Ventata, we used to have the same issues you've all described with people forgetting things like ".com" and "gmial" vs "gmail". Once we started using mailcheck our bounce rate went way down. Now we only get bounces when people deliberately give us faulty addresses, but we aren't really concerned with trying to convert them. They are checking us out and want to stay anonymous, I don't mind that.

That condition aside, Mailcheck is pretty much all you'll ever need.

[0] https://github.com/kicksend/mailcheck


Mailcheck looks pretty interesting - I'd been considering adding some sort of client side validation for email addresses for my personal projects, and this looks pretty solid. There are a few cases that it doesn't catch that I wish it would, but it appears it would catch a fair percentage of simple mistakes.

For example, it catches foo@gmail, but it doesn't catch foo@outlook. Of course, gmail is going to be the more common case. Still, it looks good overall and an improvement over no client side validation.

Edit: Actually, I see you can define a set of domains to check against. Very nice.


The only thing mailcheck doesn't do is check if the domain has MX records, so valid looking domains that you'll never be able to send anything to will pass.

I've tried to help this situation by creating an API for you guys: https://www.emailitin.com/email_validator


Domains without an MX record can still be valid: when there is no MX record, mail servers fall back to querying for an A record.


Or AAAA records.


I've seen a huge number of users do exactly this.

Anyone who says "let the confirmation email handle it" clearly only has HN readers as customers. They type silly stuff, then email support to complain they never received the confirmation.


I love these articles that are posted and the ACCURATE answer is that the article proposes bad advice.

In what world is not validating an email a good thing? It's not like emails vary after a certain complexity is reached. A better article would have been someone documenting a validation regex that approaches perfect without exceeding insane complexity.

Next we'll see articles to not run the Luhn algorithm on credit cards =/.


> In what world is not validating an email a good thing?

Validating an email address is important. The way you do that is send an email to that address. You can't do it with regex, and attempting to do so leaves you open to a variety of flaws.


That is not validating - that is accepting it blindly into your database. You potentially just lost a user and/or customer (or made them unhappy because now they have to register again).

Like I said, when you accept a credit card, normally you validate the formatting of the card before trying to charge money using the info. Emails are very similar.


What? They give you their email. You immediately send them an email with a confirmation link. They confirm, at which point you put their email in your database. You must send them a confirmation link because you shouldn't be sending them email in the future unless you've had them confirm their address and that they want to receive email from you.

You appear (but perhaps I'm misunderstanding you) to be asking a user to enter their email address; checking that against a regex; storing it; and sending email to it. That's bad, don't do that.

> You potentially just lost a user and/or customer (or made them unhappy because now they have to register again)

You're potentially losing customers because their valid email addresses are not validating through your broken regex; or their incorrect email is validating through your regex.

> when you accept a credit card, normally you validate the formatting of the card

Credit cards are trivially easy to check for formatting. You use the Luhn algorithm which tells you if it's possible for that number to be valid or not. This is because there's a strict format for credit cards. There is no such format for email addresses. That's why the only sensible way to check a user's email address is to send a confirmation email to them.
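For reference, a short Python sketch of the Luhn check mentioned above. The function name is an assumption; the check only says whether a number could be valid, not whether the card exists.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore spaces and dashes; keep only the digits.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```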


Using a simple regex isn't a bad thing. 99.99999999% of my users aren't using a bare-TLD address, so the regex .+@.+\..{1,63} is great.

The problem is that people are naive and write incorrect regexes. Also, don't attribute bad programming to me in your comments when you have no idea what regex I use - that's just rude and belligerent.

Emails are trivially easy to check for basic user errors - such as leaving off the TLD or not even providing the domain. Saying otherwise is just being naive again.


I regularly run into email fields that won't accept +. This is just an extension of that thinking.


Then that is a very poor validation attempt. Sidestepping things that have been programmed poorly in the past doesn't make for good programming practices.


Still, there is a difference between validating * @ * . * and trying to check whether parentheses are matched correctly.


Worked on custom mailing scripts for a few larger newsletters which had no regex based validation. It was "add and forget" on every form submission. Of the failed emails, forgetting the ".com/.net/.org" on the email was the most common mistake I saw as well.


I think it's great advice; remember, the advice is: Stop Validating Email Addresses __With Your Complex Regex__. Remember that new TLDs are added, and that john@tld can actually be entirely valid.

If you want to prevent "john@gmail", then use a real RFC-compliant e-mail address parser, and attempt to resolve the domain component MX/A records (and remember, it might be an IP address).

If that fails (or your regex fails, or whatever validation you use), ____SUGGEST____ to the user that the address appears to be invalid. There's no reason for an overzealous registration form to refuse to accept the user's actual e-mail address.


"If that fails (or your regex fails, or whatever validation you use), ____SUGGEST____ to the user that the address appears to be invalid. There's no reason for an overzealous registration form to refuse to accept the user's actual e-mail address."

Bingo. Help people, don't hinder them.


Honestly if the user is signing up with an ip email then you really shouldn't accept it - it may be valid but something fishy is going on for sure.


Why does it matter? If I managed to acquire the 8.8.8.0/24 netblock, I might very well want to use 'user@8.8.8.8' as my e-mail address.

I don't see why it has any material effect on someone requesting e-mail addresses: if it's valid, then it's valid.

This seems to be an example of the misplaced sense of propriety with which people approach validating e-mail addresses -- that somehow, your job isn't just to help the user enter an address, but also to define what is and is not 'reasonable'.

Imagine if companies refused to accept "1 Infinite Loop" as a street address because it's clearly ridiculous. Except that it's Apple's actual street address.


The one user in 10,000 who actually uses an ip address as an email address can surely figure out why they're getting rejected. Of course, he'll probably go channel the Comic Book Guy and write a scathing post on his blog (Worst. Regexp. Ever.), but the normal people who use the service probably won't see that anyway.


Again, what material difference does it make to you?

Overzealously rejecting valid addresses is an application of subjective and inaccurate ideas about what addresses 'should' look like, and ignores the simple fact we've already mutually and formally defined valid address formats via the IETF RFCs.


> Again, why does this matter to you, other than some sort of misplaced sense of authoritarian aesthetics?

Yes! Great Comic Book Guy impression.

Why it matters is that for most smallish companies, you want to get something up that helps your users not do stupid stuff (†), but due to time and resource constraints, you're likely to end up with some kind of 80/20 solution. It'll work well in most cases, and fall down in some others. I would certainly agree with the idea that you not force people, but a nudge is probably going to save you money in increased user retention and fewer support hassles.

† - I once had a person ask why their emails to http://example.com were failing.


> Yes! Great Comic Book Guy impression.

In that case, great junior engineer impression on your part.

Pedantry matters in complex interoperable systems, because otherwise they're not interoperable. This is why we have detailed standards documents on e-mail address formats.

> I would certainly agree with the idea that you not force people, but a nudge is probably going to save you money in increased user retention and fewer support hassles.

A 'nudge' isn't going to come from yet-another-broken-email-validation regexp. There's no need for an 80/20 solution; this just isn't that hard.

> † - I once had a person ask why their emails to http://somesite.com were failing.

That's not a valid e-mail address (as per RFC822).


The crappy regex solution is much faster than what you suggest, and will work 99.9% of the time. The time you spend doing it the right way will reduce your conversion rate because people will view your site as slow. In this case worse is better.


> The time you spend doing it the right way will reduce your conversion rate because people will view your site as slow.

During all those new account registrations they're constantly making?

Does this feel slow to you? https://www.emailitin.com/email_validator

> In this case worse is better.

No, it's not. Every time you exclude a valid e-mail address, you lose or annoy a customer for no reason other than your own lack of understanding of the RFCs.

There are a number of steps that can be taken that don't involve broken regexps that filter out valid addresses. Please stop trying to justify doing it completely incorrectly.


> Does this feel slow to you? https://www.emailitin.com/email_validator

Yes, it did.


I got a response in 55ms. We must have different definitions of 'slow'.


I must have hit a speed bump somewhere then, it took around 2 seconds to look up my email address.


Because those who would sign up with them would be 1 geek with an ip address and 200 scammers who didn't want to pay 10 bucks for a domain.

I imagine I would have trouble renting a car with cash, even though it would allow more customers.


I do see a misplaced sense of propriety here, and it's coming from you. Blocking a few valid email addresses in the service of helping most customers is just a business decision, not a moral issue.

For example: requiring shirt and shoes in a restaurant will block some customers from dining there. But it improves the experience for all the other diners, so it's a net win.


The question is why people are validating the email in the first place.

* to ensure it is deliverable? Well, then you better send them an email.

* to let people know when they misread the labels and put something that was clearly not an email in the email field? A simple check for an at-sign is usually sufficient.

* because some tester opens a ticket saying you can enter an invalid email in the email field? Yeah, that's where most of the complicated regexps come from.


    to ensure it is deliverable? Well, then you better send them an email.
I deal with user support for a site and I'd estimate at least 2% of our new users (>50 people PER DAY) enter wrong email addresses. Not "I forgot to put .com at the end" but "I thought my email was john.doe@gmail.com when it's actually john.doe@yahoo.com", which would pass validation with flying colours. The only real "solution" is to tell a user whether the validation email has been sent yet (to deal with "well maybe I should wait 5 more minutes") and, if it has and they don't have it, to allow them to change their email to their real email address. So many sites (incl. the one I manage) do not allow this, it's crazy.


>if it has and they don't have it allow them to change their email to their real email address

Couldn't an unscrupulous individual use that feature to take over non-activated accounts? An immediate use for that exploit doesn't spring to mind but this makes my spidey sense tingle. What sites allow this?


From my experience a user will almost always remember the password they've just entered, even if they got the email wrong. They should be able to login to a not-yet activated account and be presented with the option to correct the email for the activation email to be sent to. There's no potential for abuse there.


* Because users often miss a character like a dot or an @, and catching that early saves a lot of pain with undelivered confirmation e-mails and so on.


Kicksend has a library for that: https://github.com/kicksend/mailcheck

I actually think that this library functions as a really great client-side validation that won't get you tripped up in trying to be RFC compliant. There's really not anything more that I'd do aside from sending that blessed confirmation email.


> because some tester opens a ticket saying you can enter an invalid email in the email field?

This is the source of 80% of all "bugs" I've fixed over the years.

Another personal favorite: If you enter WWWWWWWWWWWWWWWWWWWW W WWWWWWWWWWWWWWWW for name, it messes up the layout on the display screen.


That's roughly 40 chars? People coming from some regions easily have 20 to 30 chars for the family name alone [1]. That's more or less the length of our test string if they add the given name(s).

[1] http://news.bbc.co.uk/2/hi/africa/5651310.stm


I think the point is that they are using all capital W's which are the widest letter, but real strings of the same length are never that wide.


Thanks, I didn't get it.

If the issue only appeared on all W's I'd guess you could set the bug as minor and discuss if it's worth fixing. If it costs an incredible amount of time to fix it, the problem lies more in defining priorities than in having too much granularity on the testing side.

To go back on your parent post, I'd say validating funky emails is of the same level. The test team should bring up the edge cases, fixing them or not is a matter of priorities.


Because most users couldn't type their own email address, or even a properly formatted email address to save their life.

"My email address is joe.aol or was it aol.com@joe? Wait joeaol@com?"


I'd say nothing of value is lost in that case. These people are very costly to support. Email has been in common use for at least 20 years. They need to step up to the plate and learn at this point.


I agree. In addition to email being pretty damn old at this point, the pronunciation of "@" should resolve the "aol.com@joe/joe@aol.com" issue for all but the most clueless (all but the most likely to cost you in support). That case is particularly egregious.


There are times when it is good business to stand on principle, and then there are times to just help your customers a little bit. Email signups are definitely the latter IMO.


Might I remind you of one of the reasons why Apple has posted record profits over the last decade? It's worth nailing the UX to be as inclusive as possible.

It's also worth considering that one of the biggest generational markets (baby boomers) includes a lot of those people you're telling us to ignore.


Refusing to send email to someone because your software mistakenly rejects their email address is not exactly being as inclusive as possible, though, is it?


Because it takes system resources to deliver email. Furthermore, if people make a simple typo, why go to the extent of attempting to send something to it when it's obvious?


Because it's not obvious. Ask developers to recite the rules for a correct email address, and most of them will get it laughably wrong. I blame the standard, which is far more "featureful" than is actually required, but that is the way it is.


It's featureful because it's old. Who routes email to UUCP any more? Just look at the sections of 5322 that are devoted to "Obsolete Syntax".


Actually, I was just thinking the core error was something else; conflating routing with identity. bob@subgenius.com is a routing instruction, 'J. R. "Bob" Dobbs' is a human identity, and '"J. R. 'Bob' Dobbs" <bob@subgenius.com>' is just a mess.


> 'J. R. "Bob" Dobbs' is a human identity

Is he really though?


This is a really good point.


So don't attempt to register an email with comments in it, or for that matter with an ip host.

We need a much more restrictive standard for emails, but until we have that we have to accept that each site will have a competing not-completely-overlapping set of standards. So make sure your email doesn't attempt to do anything too funny.


I feel about this much the same way I feel about the endless proliferation of Markdown variants; if we could all agree on one simplification, sure, but the current situation where we haven't really stinks. For instance, ask early Gmail users about putting + in the email.

So while I agree with you in principle, in practice it seems infeasible. The differences bite in practice, unfortunately.


Only ticket I've ever seen about email validation was that a webapp was refusing to accept a customer's valid (and already registered/paying money in another system) address.


Testers gotta find bugs. QA needs to eat too, you know.


I worked in a lead-generation agency for some time. We were doing competitions that could produce 50-100K entries per day, where the end result was a marketing email encouraging you to buy something.

So we did lots of multivariate testing, with permutations of regex patterns, MX record validation, sending a confirmation email, client-side only, server-side only, etc.

What we found was that a decent regex pattern (we used \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b, taken from http://www.regular-expressions.info/email.html), with MX record validation and common junk domains blocked (e.g. mailinator.com) produced the largest conversion rates in the follow-up email.

In other words, less validation produced more entries, but they would have been lower quality, which affected our sender reputation and cost more. The confirmation email was awful for conversion. YMMV.
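As a rough Python sketch of the pattern-plus-blocklist part of that setup (the MX lookup is left out). Case-insensitive matching, the blocklist contents beyond mailinator.com, and the function name are assumptions. Note that the {2,4} limit rejects longer TLDs such as .museum, which is one way a pattern like this turns away valid addresses.

```python
import re

# The pattern quoted above; IGNORECASE is an assumption, since the
# character classes only list upper-case letters.
PATTERN = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE
)

# Hypothetical blocklist; only mailinator.com is named in the comment.
JUNK_DOMAINS = {"mailinator.com"}


def accepts(address):
    """True if the whole address matches the pattern and is not blocklisted."""
    if PATTERN.fullmatch(address) is None:
        return False
    domain = address.rpartition("@")[2].lower()
    return domain not in JUNK_DOMAINS
```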


Oh the irony. His suggested Regex is not even correct.

    /.+@.+\..+/i
A trailing dot is valid at the end of a domain name. These domains are said to be fully qualified.

http://www.dns-sd.org/TrailingDotsInDomainNames.html

This also potentially rules out internal deployments.

His other suggestion, /@/, at least, is not harmful, but there are validators based on the RFC that do the job for most platforms.

Validating the address may provide useful feedback to users who accidentally mistyped something, invalidating the premise of his last paragraph.

.

/someone is wrong on the internet.


Amen! Anyone else here use myemail+token@gmail.com when they have to register with their email to find out who is selling them out and to make spam filters easier?

It still amazes me that 70% of the places I attempt using foo+bar@gmail.com call it invalid. And that does not even begin to touch the myriad valid permutations that are "invalid" out there.


I use 33mail.com for the same purpose - it gives you a unique wildcard subdomain (and also a shortened domain, <yoursub>.33m.co) so you can create unique addresses per signup etc., without the problems of + being rejected (or giving away your real email address).

You can then block any address with a click if it's being abused.

If you're feeling generous, enough people using this link will earn me premium features: http://www.33mail.com/rj37w3


Maybe they're on to you and only pretend that they think it's invalid so you'll give them a slightly-less-throwaway address ;)


And likewise, it amazes me that people think that for all their efforts at combating filters, captchas and the like, that the most nefarious of spammers aren't stripping off "+token"s from email addresses.


Spammers tend to be stupid. When I post my email address as "me+tag@mydomain.com" on a certain popular, well-scraped website, I see lots of rejected traffic to "tag@mydomain.com".


I used to use mail@mikeash.com as my primary e-mail address. Enough sites rejected that (due to thinking that "mail" was bogus somehow) that I eventually switched to mike@mikeash.com.


I wish OP had provided some actual A/B registration fall-off data. For me client-side validation is more for catching user error than to do any actual "validation". That's always been the job of confirmation email.

I think there is going to be a very low percentage of users who will register again if they don't receive a confirmation email, unless you're giving away free iPhones. Granted, regex cannot catch most of the user-generated errors, but it can catch a few, which could still increase your registered users.

To that point a better UI/UX (font size, spacing, etc.) might do a better job in lowering typos in email.


I wish I'd done that too, in retrospect. I could obviously have done much more research around my opinions, and there's one use-case in which what I'm advocating simply does not work: when you are _paying_ to send those emails. In that case, yeah, you're gonna wanna do some validation.


My go-to for email validation is /^.+?@.+?\..+?$/

In case I've typed it wrong, that should basically work for anything that contains at least one @ and one dot, in that order, as well as at least one character at beginning, middle and end. It's served me well thus far.

Edit for clarification: The reason I prefer this over just checking for an @ is that if you're just checking for @ a common mistake like "me@hotmail,com" will be considered valid.
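A quick Python check of that pattern, as a sketch (the function name is made up for illustration). It shows the comma typo being caught where a bare @ check would pass, and also that a dotless domain is flagged.

```python
import re

# Same pattern as above, translated to Python syntax.
LOOSE = re.compile(r"^.+?@.+?\..+?$")


def looks_ok(address):
    """True if the address has text, an @, text, a dot, and more text."""
    return LOOSE.match(address) is not None


# A plain "@" check accepts the typo; the pattern does not.
typo = "me@hotmail,com"
at_sign_only = "@" in typo
pattern_result = looks_ok(typo)
```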


My favorite: /.\@.*\../

It should be similar to your version, but it matches only the parts required for email validation (i.e. the "o@example.c" part of foo@example.com).


I like that, much more elegant! My only changes would be to make the middle .* into .+? as that way it requires at least one char, and the ? is for lazy repetition.


Just a top level domain after the @ is a valid email (e.g. foo@com). Things like this are why we end up with massive regular expressions for email validation. That said, if you are only warning the user, but not preventing them from submitting foo@com, then it is probably good enough.


Note that there are valid, probably in-use, e-mail addresses on TLDs, e.g. "username@cx". You may not care enough to support them though.


Yep, that's pretty much what I do too.


I think that the domain part of email addresses could be an IP address. Depending on how IPv6 addresses are displayed there, they won’t contain a dot.

Somewhat artificial, yes.


I'll admit I hadn't considered IPv6 addresses, might have to rethink my trusty regex. Sad, it's served me well for so many years.


IPv4 also has a valid decimal representation. http://1249764136/ will send you to Google!
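For the curious, the conversion can be checked with Python's standard ipaddress module. This is just a sketch of the arithmetic (each octet is one base-256 digit); the helper names are made up.

```python
import ipaddress


def dotted_quad(n):
    """Convert a 32-bit integer to dotted-quad IPv4 notation."""
    return str(ipaddress.IPv4Address(n))


def as_decimal(dotted):
    """Convert dotted-quad IPv4 notation to its single-integer form."""
    return int(ipaddress.IPv4Address(dotted))


print(dotted_quad(1249764136))  # 74.125.227.40
```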


TIL.

I wonder what the security implications of this are.


If using IPs in an email address you are supposed to surround it with square brackets anyway.

I wouldn't add IP address support to an email regex because I'd rather turn away such perverse data anyway. Nobody uses IP-based email addresses.


A square bracket, however, is not a dot (which is for what the original regexp checked).

And I just tried sending an email to me@[my.ip.addr.[0]]. Postfix somehow recognised it was for this local host, but failed because it wasn’t part of the virtual domains I had put into it. ‘perverse’ seems to be somewhat appropriate.

[0] Sorry, too lazy to check the appropriate documentation IP address ranges. 20db?


I thought the new TLDs being worked on didn't need to have dots in them.

Why not just check for x@x?


To use just the TLD, the domain will need to be fully qualified with a trailing dot, "x@tld.". Otherwise, the domain can get confused with just the hostname in the local domain (tld.example.com). Just the hostname is a valid email address but wouldn't need to be accepted by most signup forms.

My guess is that nobody will use just the TLD because of the confusion. And because a lot of software does not support fully qualified names.


As far as I know the new TLDs will still have a dot between the TLD and the actual domain (I could be wrong though), but it's just been pointed out that IPv6 email addresses could fail on my regex. The reason I don't currently use x@y is just to have a better shot at catching typos without being too restrictive.


There is no requirement for any other level in a domain, except for the TLD.


That's exactly what you should do.

^(.+)@(.+)$

max length is 254 according to the RFC I believe, so you can check for that too.
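Something like this in Python, as a sketch. The 254 limit is the figure I believe the RFC gives; note that len() counts characters rather than octets, so treat the length check as approximate for non-ASCII input. The function name is just for illustration.

```python
import re

MINIMAL = re.compile(r"^(.+)@(.+)$")
MAX_LENGTH = 254  # assumed limit, per the comment above


def is_plausible(address):
    """True if there is text on both sides of an @ and the length is sane."""
    if len(address) > MAX_LENGTH:
        return False
    return MINIMAL.match(address) is not None
```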


The bonus with that regex is that it kind of looks like a dead clown.


Bingo. I'm sure there are some HNers with arpanet emails who would appreciate this; it's not just about ipv6!


IIRC, last I checked max length was 320.


I thought this as well, until looking for info one day and stumbling onto this:

http://stackoverflow.com/questions/386294/what-is-the-maximu...

Turns out the 320 value was actually incorrect; it's 256, but the mailbox is wrapped in angle brackets, so 256 - 2 = 254.


This has been an issue since the day I started programming for the web, back somewhere in '95.

It has regularly come up on HN, and pretty much any programming related forum I've used since the mid-90's.

As an industry at the heart of the information society you have to wonder what the hell we are doing wrong if we cannot stop this constant regression into well known bad practices.


I understand the argument re validating email addresses passively (regex, no regex, etc.) vs actively (send an email by SMTP).

What I don't understand with this ever-repeating discussion is why the complexity has to be visible. e.g.

    > <LARGE REGEX>
    > Yeesh. Is something that complex really necessary?
Many functions are complex - we put those in libraries, pushing them under the hood, and move on.

What is so special about parsing email addresses that makes everyone invent their own solution - regex or otherwise?


Plus a Large Regex for mail validation is not supposed to be heavily used. It's supposed to be used once at registration for example. So why would it matter if it's slow/heavy/...


Maybe it is less resource-intensive to actually send an email rather than use a heavy regex to validate the email?


I really don't think so, since sending mail means hitting an email server, while a regex is just some code that has to be run, and it runs on a tiny string (an email address is never really long).

It's also a bother for the user. If you need mail confirmation then do it, but otherwise it should be a RULE OF THUMB to always avoid annoying the user. Thus, avoid mail confirmation.

This article is actually really bad advice. I don't know why it's upvoted so much.


I'm not completely sure that I get annoyed when a web site sends me a confirmation email. It helps me know that the site indeed knows my correct email.


> What is so special about parsing email addresses that makes everyone invent their own solution - regex or otherwise?

A valid email address can contain almost anything; this makes validation via a standard parser mostly useless. As such, developers reach for stricter parsers out of a combination of not comprehending the standards, feeling vague discomfort about letting 'just anything' past data validation, and misplaced concern for users that they believe can't type their own e-mail address.

Add to that the occasional business complaint from the marketing arm about bogus e-mail addresses, and you have people repeatedly solving the problem in slightly different ways, justifying their own divergences from the standard by applying the justification that nobody will use a 'weird' address anyway, and they're actually being helpful.


> misplaced concern for users that they believe can't type their own e-mail address

How is this misplaced? People screw up even the most basic of computer tasks all the time.


1) Because the solutions actually prevent some users from typing their actual e-mail address.

2) There are so many ways to get the e-mail address wrong that it's almost not worth bothering validating the few things that you can validate.

Now, here's what would be an interesting validation method that doesn't actually require sending an e-mail. It requires an RFC-compliant e-mail parser, not a regexp:

- Perform A/MX lookups on the domain part. The domain part can be an IP address, so those get a free pass.

- Connect to the returned MX, issue a MAIL FROM+RCPT TO:

  c> MAIL FROM: test@example.org
  s> 250 2.1.0 Ok
  c> RCPT TO: is_address_valid@example.com
  s> 554 5.7.1 <is_address_valid@example.com>: Relay access denied
  c> RSET [reset the transaction, no e-mail is sent]
- If you get back a permanent 5xx error, the address is invalid. If you get back a 250 Ok, the address is probably valid (it could still be a relay that allows backscatter, in which case it will allow any address on one of its configured domains). If you receive a 4xx, the address may or may not be valid -- graylisters will send 4xx, as will servers that can't currently accept e-mail, etc.

This gives you definitive failure (5xx) and almost-definitive success (250 Ok). It's a cheap DNS lookup + TCP connection that you can begin performing immediately and asynchronously when a user enters their address in a form.

... or just send the user an activation e-mail.
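The reply-code interpretation above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested mail probe: the function names and labels are made up, mx_host is assumed to come from a DNS lookup done elsewhere, and probe() needs network access. smtplib formats the MAIL FROM and RCPT TO commands itself.

```python
import smtplib


def classify_rcpt_reply(code):
    """Map an SMTP reply code for RCPT TO onto the three outcomes above."""
    if 200 <= code < 300:
        return "probably valid"  # could still be a relay accepting anything
    if 500 <= code < 600:
        return "invalid"  # permanent failure
    return "unknown"  # 4xx: graylisting, temporary problems, etc.


def probe(mx_host, address, sender="test@example.org", timeout=10):
    """Ask mx_host whether it would accept mail for address, then reset.

    mx_host is assumed to be the result of an earlier MX/A lookup.
    """
    with smtplib.SMTP(mx_host, 25, timeout=timeout) as smtp:
        smtp.helo()
        smtp.mail(sender)
        code, _message = smtp.rcpt(address)
        smtp.rset()  # reset the transaction, no e-mail is sent
    return classify_rcpt_reply(code)
```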


Hopefully that's not the SMTP syntax you're actually using.

    * There's no space between FROM: and the address in SMTP
    * Email addresses must come between angle brackets
I'd reject (give you a 5xx) that from my mail server for those reasons alone.


> Hopefully that's not the SMTP syntax you're actually using.

I typed it out live. I'm not an SMTP client and I don't have the RFCs memorized.

> I'd reject (give you a 5xx) that from my mail server for those reasons alone.

Postfix accepts it. I haven't checked the RFC to verify your concerns, but assuming they're correct, then my expectation is that postfix is liberal in what it accepts because A) it's a good idea, and B) a real mail transfer agent probably ignored those two minimal rules at some point in the past.


Postfix (and the other big receivers) will ignore it, but will send using the proper RFCs. It's still a good sign of a badly written bulk mail engine, and worth rejecting for.


> It's still a good sign of a badly written bulk mail engine, and worth rejecting for.

No, it might be worth scoring the e-mail with a spam filter, but the MTA shouldn't be overzealously throwing away e-mail.


My MTA doesn't "throw away" email. It sends back a 5xx with an appropriate message that the sending end isn't RFC compliant. Absolutely nothing wrong with that - if it's a legit sender they know I didn't get the mail.


At which point the legit sender does what -- replace their MTA? More likely they just contact their recipient out-of-band (e.g. Gmail) and avoid your over-zealous MTA.

Whether you think you're 'Throwing away' is just semantics. From a user's perspective that's exactly what you're doing.


It's my MTA, I'm the "user". My server, my rules. Just like if you want to come into my house you come in the front door and take your shoes off, not crash through the window in muddy boots.


In that case, I wouldn't say it's a particularly useful anecdote for anyone else.


That won't really work with people who mistype domains, e.g. gmale.com, as that domain may have a catch-all enabled.


So in addition, 'spell check' for likely domains. People probably don't mean to type 'gmale.com' -- but don't prevent them from doing so, if that's what they really meant.
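
A rough sketch of that kind of domain "spell check", using fuzzy matching from Python's standard library. The domain list and the 0.8 cutoff are assumptions picked for illustration, not tuned values; the function only ever suggests, it never rejects.

```python
import difflib

# Hypothetical list; a real site would tune this to its own user base.
COMMON_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com", "outlook.com"]


def suggest_domain(address: str, cutoff: float = 0.8):
    """Return a 'did you mean' address, or None if there is nothing to suggest."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None                      # not even user@domain shaped
    domain = domain.lower()
    if domain in COMMON_DOMAINS:
        return None                      # already a well-known domain
    matches = difflib.get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{matches[0]}" if matches else None


print(suggest_domain("bob@gmale.com"))  # bob@gmail.com
```

If the user really did mean gmale.com, they ignore the suggestion and carry on.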


I think it's mostly just bikeshedding. This is a problem that is both largely unimportant but also common enough that many have encountered it (and thus have an opinion).


Don't bother even reading it. His solution is to "Just send your users an email. The activation email is a practice that’s been in use for years, but it’s often paired with complex validations that the email is formatted correctly. If you’re going to send an activation email to users, why bother using a gigantic regular expression?"

Want to know why it's not more common than the regex "method"? His method has its own host of problems - what if your mail server is down for six hours - will people come back to your site six hours later when they get the email? What flags will get set on your sender account when Gmail gets 100,000 bogus email sends? Do you force your users to "Look in your inbox and click the activation link" for every email address change also? There are others but I've made my point. There's a finite amount of "stuff like this" that users will put up with - you can either put the onus to "get it right" on the user (regex validation for emails), or you can put that onus on your system.

Another argument is always, "If a user can't get their email address entered correctly, I don't want them as a customer". And you can take that multiple ways - technical difficulty entering emails, "challenging" email addresses, etc.


People are far far more likely to get their email address wrong by misspelling their own name or putting @hotmail.com when they meant to put @gmail.com; regex will not protect you from either of these things.

We actually had an email list of ~50k people that had been validated with nothing other than "check there are at least 3 characters in the string", and when we looked at which addresses were bouncing when we sent to them, there were approximately zero that failed because they had omitted the @ or because they were using some weird invalid unicode.

Even the spam bots were submitting valid email addresses.


Spam bots, if there was no check in place to slow them down, would dwarf real people registrations in all systems always. So let's not confuse these two topics - they are different. One part of a system that allows users to register needs to ensure that you have an identifier for a customer and a way to contact that customer, and other techniques try to ensure that you aren't allowing the spammers in the door. Whether you use regex or sending an activation email - neither of those can tell you whether this email address is or is not a spammer.


That's true, but I would posit that you need a registration email anyway (assuming you even care if the email is valid) because even if your regex is perfect there's no way to detect people simply mistyping their email address in a way that is technically valid.

This is going to be your dominant type of failure.


""If a user can't get their email address entered correctly, I don't want them as a customer""

Kinda harsh sentiment considering that we all mistype stuff, especially on sites that disable autocomplete.


In PHP you have default functions that can verify emails :

filter_var('bob@example.com', FILTER_VALIDATE_EMAIL)

more info here : http://php.net/manual/en/function.filter-var.php


And that uses a giant regex.


At least it's a standardized way of doing it. If it turns out to contain an error, it can be fixed for all websites with an update.


The driving force is that you want to correct an invalid email ASAP, preferably in the client with live feedback coloring, etc. Most email services I know don't give you any immediate feedback, and some only give you a basic check that can bounce later. So claiming you just have to check for an @ sign and try sending means there is going to be a huge delay before you know about the error.

Saying the user will just come back and register again is not good. That's like saying if your page is very slow to load, users will just wait for it to load. They don't. They leave and never come back most of the time.

If you can't afford to write the code to help the users fill out the form in a way that will work, fine, that's something you didn't have time for considering the percentage of users it will help/retain. But don't claim it is useless.


Not useless, but validation of the presence of a valid email does not validate the accuracy of the email. It may be valid and still wrong. Therefore the return on investment, and the possible exclusions of valid email doesn't justify the time in most cases. So not useless, but certainly a poor investment of time.


A basic regex is more than enough and you don't even need to deliver a message: just connect to the MX for their domain and check that A) the domain resolves to an MX and B) you can handshake for a 250 OK on the RCPT TO command only, then drop the socket. Done! It's not that slow and you're leveraging the one thing an SMTP server does really well - be RFC822 compliant. It's something that can be delegated out of process anyway (as a promise or RPC etc.) as soon as the email is entered, and resolved when they submit the form. Problem with the email? Then raise it for correction or pass-through... it's probably the same amount of code and half the time for end-to-end delivery testing compared with crafting a bunch of edge-case regexes and praying it works.


How can you validate the MX record for their domain programmatically?


If you really want to do checking of email addresses right on the signup page, include a confirmation field so they have to type it twice.

No. This puts the burden of checking email validity on every user, even perfectly capable valid users. If you're validating for edge cases (mistakes or otherwise invalid addresses), treat it as an edge case and don't annoy users who can type.


> This puts the burden of checking email validity on every user, even perfectly capable valid users.

? Whenever I hit a form which wants me to retype my address, I just triple-click to select the entire address, then middle-click to paste it into the confirmation field.


Meet the airline I booked with yesterday: two email fields, with paste disallowed only on the second one.

:(


How do you even prevent that? You need some seriously broken code to catch a system wide short-cut.


Yet another reason I love lastpass. None of this nonsense anymore.


Perhaps every sign up page should include a "I just want to check it out" button to let you in and demo the product.


Assuming that running the regex is much faster than sending an email, it would probably be much less server load to check the regex and never send X% of emails, unless X is extremely small.

(Looking up and implementing a regex) * 1 + (running the regex) * (every email) + (sending email) * (every valid email) < (sending email * every email)
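
The inequality can be played with in a few lines. This is only a sketch: the cost units and the numbers in the example calls are invented, and it counts server cost only, not the cost of turning away a user whose valid address the regex rejects.

```python
def cost_with_regex(total, rejected, regex_cost, send_cost, setup_cost=0):
    """Left-hand side: one-off setup + regex on every address + send to the rest."""
    return setup_cost + regex_cost * total + send_cost * (total - rejected)


def cost_send_everything(total, send_cost):
    """Right-hand side: send to every address, valid or not."""
    return send_cost * total


def regex_pays_off(total, rejected, regex_cost, send_cost, setup_cost=0):
    """True when validating first is cheaper than just sending."""
    return (cost_with_regex(total, rejected, regex_cost, send_cost, setup_cost)
            < cost_send_everything(total, send_cost))


# Hypothetical units: a regex check costs 1, a send costs 100.
print(regex_pays_off(1000, 50, 1, 100))  # True: 5% rejected covers the regex cost
print(regex_pays_off(1000, 5, 1, 100))   # False: 0.5% rejected does not
```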

Also, this post only considers the signup/activation use case. If you're getting an email for ecommerce to send an order confirmation, you want to know if the email might be invalid before the user completes the order and you try to send it.


This assumes that you get the regex 100% right and never lose a user by rejecting a valid email address. This is much harder than it seems ( http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html ), and it is no guarantee that a valid email address is actually in use, as the article makes clear.

After some very basic checks, e.g. "contains at least 3 chars, one of which is an @", you should Just. Send. The. Email.

Who bothers to type in a complex but invalid email address? The overwhelmingly common failure modes are:

1) Nothing entered at all. The basic check catches this.

2) Deliberate invalid email address. e.g. homer.j.simpson@springfieldnuclear.com - a regex will not catch this.

3) Typo in email address. e.g. john.smith@gmial.com - a regex will not catch this either.

The regex has downsides and complexity, but essentially no benefit.
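
As a sketch, the "very basic check" described above is about one line of code. The function name is made up; the rule is exactly the one quoted (at least 3 characters, one of them an @).

```python
def passes_basic_check(address: str) -> bool:
    """At least 3 characters, one of which is an '@'."""
    return len(address) >= 3 and "@" in address


print(passes_basic_check(""))                      # False: failure mode 1 is caught
print(passes_basic_check("john.smith@gmial.com"))  # True: failure mode 3 slips through
```

Failure modes 2 and 3 pass this check just as they would pass a full RFC regex, which is why the confirmation email still has to do the real work.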


Please don't quote that RFC-822 regexp when arguing this. That's for the contents of mail headers (which can include comments and so on), not an actual valid email address.

A regexp for validating RFC-2821 email addresses is actually fairly simple.


Whichever RFC it is, it is not so simple that everyone gets it right. For instance, a significant percentage of websites don't let you register email addresses containing a plus sign in the name, e.g. john.smith+foo_bar@host.com


Arguably 3 should be covered by user prompt.


You still need to confirm the validity of the email request by sending a confirmation email, so you might as well just check for '@' and let your confirmation system handle the rest.


The OP and a lot of posters here don't seem to understand the problem or the purpose of the solution. The whole point of this is to avoid an unrecoverable error, a bricked account.

You are only trying to catch email addresses that are entered in error at account creation time so that a user will actually get the confirmation email.

The actual problem is that if they enter an email address incorrectly they will create a dead account that they can never log into again. In addition, if they used their favorite user name, or a referral code or any other important consumable when creating the account, then you've effectively blocked that user from even creating a second account.

The real solution is to use validation email to confirm an email address, but to allow them to login to the account even if the email is not yet validated. You won't even have to make them type it in twice. Simply limit them to only being able to edit account information and settings.

Email is validated, users have a window to correct any issues and you've eliminated the unrecoverable error altogether... oh and no regex.


HTML5 has you covered. You can use HTML5 input verification with the following:

   <input type="email">
Of course, if your user is not using an HTML5 compliant browser, then this will be ignored.


The worst case is when the regex is not RFC compliant: for example, I want to use . or + in my mail address and the website doesn't allow me.


... And remember, the password can only be 8-16 characters [A-Za-z0-9] because we wouldn't want you to accidentally cut yourself on some other 'weird' character like a space or underscore or something. ;-)


cough Windows 8


Yeah, and then you wind up sending mail to "joe@hotmailcom" or "liz@gmailc.om", and your users don't get your messages and are sad. Don't validate the local part, do validate the domain.


Especially since this is extremely easy to do, it’s just three DNS queries away (plus one in case of CNAMEs).


So what's wrong if you do a full validation (http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html) ? You as developer or site owner or user don't need to do it by hand or in your head. It is done in a fraction of a second by the computer even if benefits are not the greatest like validating the strength of a password but still. Complaining about it because you don't like it and telling other people not to do it because of your reasons and spending time writing a blog post about it is overkill - like validating the email address with a regexp :)


There's nothing wrong with doing full validation. Just don't use that regexp - it's for RFC822 email addresses, which is how you might see them in an email header, including things like comments.

You want an RFC821 (or more specifically RFC5321 now) email address regexp. See my post here about the email validator I wrote: https://www.emailitin.com/email_validator


You'll hurt your email reputation if you send too many emails that bounce. It's worth checking everything you can before firing off an email. This includes using a decent regex and doing a lookup on the domain to make sure they have an MX record. While technically you can have a mail server with no MX record (it falls back to sending to the A record), you won't find too many mail servers configured that way in the wild. In many cases such as email marketing, protecting your email reputation is far more important than handling the 1 in a million user with an unusual email address or mail server configuration.


Also, instead of just failing to allow the email address, you could warn the user that what they entered doesn't appear to be valid, and that they should double check the address.


Glad I'm not the only one just checking for an '@'. I wouldn't want to prevent users from registering with my sites just because I didn't foresee some funky email address formatting with my regular expression.


So you're forcing me to write @localhost :-(

Edit: Oh wait, on your own sites :-) I'm just slightly annoyed that I have to add a fqdn when doing local development with some apps...


Discussion from previous post https://news.ycombinator.com/item?id=4486108


IF I were to validate by regex, I would put a confirmation for emails that I couldn't validate that read "We are very sorry but your email doesn't appear to be valid, however validating emails is very difficult so it may be our mistake. Can you confirm your email is correct?" And if they don't modify it, accept it as valid. It is an extra step but seems more friendly.


If you want to maintain a high reputation for your MTA's deliverability, ignore this post. Attempting to send hundreds, thousands, or tens of thousands of malformed addresses to domains (some of which will be well-formed) will result in a higher spam score that will ultimately create more work for whoever is managing your mail platform.


I would rather lose a few users through a faulty regexp than lose double digit percentage through an email activation step.


No regexp in the world will tell you if an email address is real.


Of course - I doubt anyone (smart) has ever claimed that to be the case.

However, they can, if implemented correctly, tell you if the email address is syntactically valid.


Why check it's an email at all if you're just going to presume it works?


Think about the repercussions of not using a validation email. Anyone can sign anyone else up. That's even more unacceptable IMHO.


How many emails to asdasdasd@example.com or sdfsdf@gmail.com are you willing to pay for (the nickels add up)? What if someone uses my email address to sign up for your service?


That seems terrible when combined with a username which needs to be unique. User registers with username, email and whatever else. Email is incorrect, they never receive the activation email and cannot register a new account using their preferred username.

Of course there's plenty of ways around that, but this seems to be the most common pattern.


This has come up so often on Hacker News that I decided to create a very simple JSON API for checking email addresses. Free to use for anyone. Performs the right regexp check for email addresses based on RFC-5321 rules (not the oft-quoted but incorrect RFC-822 rules, which are for mail headers), performs MX lookups to ensure mail can be delivered, and performs the same "did you mean" type checks that kicksend's mailcheck performs.

I've included both jQuery and server side example code on the site.

https://www.emailitin.com/email_validator


If I type name@outlok.com instead of name@outlook.com, it says the email is valid - when in fact it is not.


The domain outlok.com has an A record which points to 208.87.35.108 which I can even connect to on port 25 (though I don't perform that test). Until you actually mail that address there's no way to distinguish this from a valid email.


Who says it's not valid?


You collect their email because you wish to send email to them at some point.

Thus, you must send a confirmation email, with a "click to confirm" link in it.

This keeps your email address list clean; it also validates all the email addresses.


I would think the following would be best practice:

1) use LPeg or something similar to validate the actual text of the email (here's some LPeg that parses the headers of an email; certainly one can pull out the email address portion: https://github.com/spc476/LPeg-Parsers/blob/master/email.lua).

2) Take the domain part and do a DNS MX lookup on it (to be pedantic, if that fails, then one should do a DNS A lookup). That will check if the domain is at least valid.
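
Step 2 could look roughly like the sketch below. Python's standard library has no MX resolver, so the two lookup functions are injected; `lookup_mx` and `lookup_a` are hypothetical callables (in practice backed by a third-party DNS library) that return an empty list when no records exist.

```python
from typing import Callable, List, Tuple

MxRecord = Tuple[int, str]  # (preference, hostname)


def mail_hosts(domain: str,
               lookup_mx: Callable[[str], List[MxRecord]],
               lookup_a: Callable[[str], List[str]]) -> List[str]:
    """Return the hosts to try for `domain`, best first; [] means undeliverable."""
    mx = sorted(lookup_mx(domain))       # lowest preference value comes first
    if mx:
        return [host.rstrip(".") for _, host in mx]
    if lookup_a(domain):                 # no MX: fall back to the domain's A record
        return [domain]
    return []
```

An empty result means the domain cannot receive mail at all. A non-empty one only says the domain exists, not that the mailbox does.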


It's an interesting intellectual exercise to build THE email validation regex, but it's shortsighted to inflict your experiment on the public.

While I definitely enjoyed how Friedl's book (http://regex.info/book.html) builds over several chapters to an ever more complex solution, maybe a page long, my takeaway was: don't bother. A friendly UI will help users avoid an obvious mistake, but as other posters have pointed out, the only real validation is, does an email get there?


I wish more sites would use a validation email. I get a good number of emails for people that have my (real) name but use my email address on gmail (since I have the same name) to sign up when they don't want to receive emails from the site. Thus I get to enjoy them. When the site sends a validation email I just ignore it and never hear from them again.

As a side note, be cautious of using such a tactic. I have received their logins, CC and physical address information because of this.


How about this: Add a mailto link that sets the subject to some type of token, they click the link, hit send, app catches the email, sets the correct address accordingly.


This is actually a very good idea. But most users have a throwaway email address for registrations and one for personal use. And for me, for example, if I click a mailto link it will open the mail client with my personal email, so I'll have to copy-paste all that into my Gmail throwaway account, which may not even be open. So I'll have to open the browser, click on........ But it is indeed a good workflow alternative to consider for email verification.


Valid points. I hate mailto links in general, and can totally see the workflow getting messed up by not having your default mail client set to what you actually want.

However, I think most people know how to at least access their email (always logged in), so provided you could get them into their client, with a token, quickly, with a small number of clicks, might be interesting. Of course, it could be spam central.


Forging mail headers is easy enough to do that I wouldn't consider this approach to validate an email address. Sending a validation email is much harder for spammers to bypass.


A good reason to validate email addresses is to prevent SMTP injection.

Depending on how you're sending the mail, it may be possible to insert arbitrary headers and body after a \r\n in the email address field. I know I've built at least one system that is vulnerable to this. Then you can put the body after your special headers and hide the rest of the message (either as an attachment or an HTML comment).

This then makes your signup form into what is effectively an open relay.
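
A minimal guard against that, as a sketch: refuse any address containing line-break characters before it gets anywhere near a mail header. The function name is made up, and this is a floor rather than a complete defence; using a mail library that builds headers for you is still the safer route.

```python
def is_safe_for_header(address: str) -> bool:
    """Reject values that could break out of a single mail header line."""
    return not any(ch in address for ch in ("\r", "\n", "\0"))


print(is_safe_for_header("bob@example.com"))                            # True
print(is_safe_for_header("bob@example.com\r\nBcc: victim@example.org"))  # False
```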


I'm against all of the complex regex as well, having learned through trial and error that it is usually way more trouble than it's worth. That being said, there are many cases where the email address being verified is not the user's email, but maybe someone they are doing business with, and they don't want the system sending a verification email to every email address they are saving with your software.


Is a decent regex with dynamic yellow field coloration (and bolding for accessibility) accompanied by a message like "Your email address is of an unfamiliar format or may contain a typo" too intrusive? Then just allow the user to submit with that email without any automated validation.

It's not 100% idiot-proof, but I'd imagine it would be pretty effective for laypeople and hackers alike.


I don't understand why sites don't just warn the user with a confirmation dialog if their regex doesn't match (e.g. "Your email address looks wrong, are you sure") and allow users to use their potentially invalid email anyway. This avoids the problem of users making obvious mistakes and the problem of users with strange RFC-compliant email addresses being denied.
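
The warn-but-allow flow is easy to sketch. The pattern below is deliberately loose and is an assumption, as are the function name and the two return values; the key property is that the function never returns a rejection.

```python
import re

# Deliberately loose (an assumption): something@something.something
LOOKS_NORMAL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def review_address(address: str, user_confirmed: bool = False) -> str:
    """Return 'accept' or 'confirm'; never reject outright."""
    if LOOKS_NORMAL.fullmatch(address) or user_confirmed:
        return "accept"
    return "confirm"  # UI asks: "Your email address looks wrong, are you sure?"


print(review_address("bob@localhost"))                       # confirm
print(review_address("bob@localhost", user_confirmed=True))  # accept
```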


OP is right about not using regex, but wrong about the "just send it" solution (for reasons outlined by several other posters).

We use this clever (and well-explained) solution from http://my.rails-royce.org/2010/07/21/email-validation-in-rub...


I believe that he is right that you should not rely on regexps to validate emails, but I disagree that you should stop it completely.

First of all, fewer incorrect emails sent means less chance of getting marked as a spam host. Second, it is very easy to catch obvious user errors on the front end and ask the user to correct them. These two are the big ones imho.


I don't validate emails at all. If you want to enter 'a' that's fine but you won't get any emails.


I'm the same. An email address isn't an identity.

Some people use many email addresses and so could create many accounts. With one email address they can still use the '+blahblah' method to sign up unlimited times, unless you prevent that which would annoy people who use it legitimately for filtering.

Some people have a garbage or throwaway email account that they sign up for everything with, and only ever look at to find the confirmation emails.

If people don't want to give you a valid email then there's no reason to be sending them anything.


If you own any domain you can forward <anything>@yourdomain.com to the same inbox. Then sign up for unlimited accounts that way.


Careful now. I'll get you blacklisted as a spammer pretty quick if I fancy with a hole like that.


What do you mean?


For those that use PHP, use the following: filter_var($email, FILTER_VALIDATE_EMAIL);

Regex from the source: https://github.com/php/php-src/blob/master/ext/filter/logica...


Yes, there's a good page here about the different regexes:

http://fightingforalostcause.net/misc/2006/compare-email-reg...


Indeed. I was installing Crunchbang. Our local proxy knows to route our local username to our email addresses in our VPN. Yet, the validation required an @ symbol. It is against our standards to use our @ email internally. Sigh.


In PHP land you can use this handy built in function ( http://php.net/manual/en/filter.examples.validation.php ) which works well enough.


So we should be using a recursive descent parser? The email address format is a context-free grammar, so the "regexes" that would work are using features not included in true regular expressions anyway.


I like the "don't regex emails" but don't like the "unnecessary email validation emails". If you're requiring email validation unnecessarily (ie, you're not PayPal), you're losing a lot of business.


I agree with the author of this blog, but he doesn't address the problem where you want to scrape all the email addresses in a text file. For this situation, I don't see what to use except regexp.


First you verify the email, then you process the emails. What good is a list of email addresses if you haven't verified their authenticity?


Yeah, but if you want to get all the emails in a text like "Please contact me at myfakemail@gmail.com. I already sent an email to contact@mycompany.com, but no one replied...", then you would have to use regexp.
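
For that scraping case, something like the following sketch would do. The pattern is an assumption and intentionally conservative: it only finds plain user@host.tld shapes and will miss quoted local parts and other exotic but valid forms.

```python
import re

# Conservative pattern (an assumption): plain local part, dotted host name.
EMAIL_IN_TEXT = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_addresses(text: str):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_IN_TEXT.findall(text)


sample = ("Please contact me at myfakemail@gmail.com. I already sent an email "
          "to contact@mycompany.com, but no one replied...")
print(extract_addresses(sample))  # ['myfakemail@gmail.com', 'contact@mycompany.com']
```

The trailing period and comma are left out because each dot in the host part must be followed by at least one more host character.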


I would like to make a plea to anybody coding email verifications to please include the ip address of the request, this way we get a sort of warning somebody is trying to impersonate us.


"Feeling ambitious? Then check for the dot too: /.+@.+\..+/i. Anything more is overkill."

I personally prefer: /[^@]+@[^@]+\.[^@]+/

Basically the same except that it will throw an error if someone enters an extra at sign.


Yeah, I basically check for one @ sign and some very basic sanity checking on the right side.

/^[^@]+@[^@]+\.[^@.]+$/
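
A quick way to compare the three patterns quoted in this subthread, sketched in Python (the helper name and dictionary keys are made up). One thing it shows: without anchors, the `[^@]+@[^@]+\.[^@]+` pattern still finds a match inside a string with an extra at sign, so the "rejects extra @" behaviour only holds when the whole string has to match.

```python
import re

DOT_ONLY = re.compile(r".+@.+\..+", re.IGNORECASE)   # "check for the dot too"
NO_EXTRA_AT = re.compile(r"[^@]+@[^@]+\.[^@]+")      # unanchored variant
ANCHORED = re.compile(r"^[^@]+@[^@]+\.[^@.]+$")      # anchored variant


def accepted_by(address: str) -> dict:
    """Report which of the quoted patterns accept `address`."""
    return {
        "dot_only": bool(DOT_ONLY.search(address)),
        "no_extra_at_search": bool(NO_EXTRA_AT.search(address)),
        "no_extra_at_full": bool(NO_EXTRA_AT.fullmatch(address)),
        "anchored": bool(ANCHORED.search(address)),
    }


print(accepted_by("a@b@c.com"))
print(accepted_by("john.smith+foo_bar@host.com"))
```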


As I just noticed, @ is valid in the local part of an email, too, it just has to be properly quoted (using \ or " "). You can send me email at

    "a# b.@c"@[IPv6:2001:4dd0:fc8c::1]
    a#\ b.\@c@[IPv6:2001:4dd0:fc8c::1]
now :-)


For what it's worth, sending a confirmation email is a great way to stop bots as well, skirting the whole Captcha thing.


Out of curiosity, why would you not use input type="email" instead?


Agree, but this is not an example of a regex problem.


Actually monolithic regex are bad in general.


my favorite is google chrome built in email check which is enabled with a required tag.


Proof is in the pudding...


I'd rather let the Regex do the job than the email server .

That's just me.


Good point considering security, but I still disagree. You're much more likely to get it wrong by validating than a mailserver is.


I read sections about email validation in 1998 in moldy thrift store PHP hardcovers. Why is an article that is slightly worse than these historic artifacts on the front page?



