Email Regex that works 99.99% (emailregex.com)
30 points by chenster on Feb 22, 2015 | 61 comments



I believe this falls under the category of "things that may be fun to play around with but should never be used in a real system".

Unfortunately, I bet there are thousands of "real systems" employing regexes like this... How many problems does this solve? Probably zero. How many does (/will) it cause? Probably much more than zero.


Not sure why this has been downvoted, but yes these sort of things do cause more problems than they solve.

Case in point, at work the other day I found a bug in a service I manage. It consists of a front end form (built by one team), which submits data to another system (built by my team), which then passes the data to a third party. The third party was rejecting the data we were trying to send them as the email addresses were apparently invalid. The validation they were doing didn't match the validation the front end form did, so to the user everything seemed fine.


So the problem wasn't that the front end and the third party were each checking whether the email was valid, but that there is no standard way of checking it.

I'm not saying that checking an email address with a regex is a good way of doing it, but there are countless examples of people doing it, and it would help a lot if everyone used one standardized regex for that.


As others have suggested, the best "standardised regex" solution would just be something like:

    /.*@.*/
Some people aren't happy with this because it allows invalid emails to be entered. But the alternative - such as what's being attempted here - is a nightmare!

Even if there was a standardised regex for all emails, which would not break when new TLDs are released, or new unicode characters are supported, or whatever, there would still be no guarantee that someemailaddress@domain.com is actually valid!

Your best bet is to simply allow users to enter anything (with perhaps some minor regex check like I did above), perhaps ask them to enter it twice, and perhaps send an account validation email.
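
A minimal Python sketch of that flow, assuming the loose `.*@.*` check from above, a double-entry comparison, and a random token for the confirmation link. The function name `accept_signup` is hypothetical, and storing the token and sending the mail are left out:

```python
import re
import secrets

# Deliberately loose: any text, an "@", any text (the /.*@.*/ idea above).
LOOSE_EMAIL = re.compile(r".*@.*", re.DOTALL)


def accept_signup(email, email_again):
    """Return a confirmation token if the entry passes the loose checks, else None."""
    if email != email_again:
        return None  # the two entries differ, so one is probably a typo
    if not LOOSE_EMAIL.fullmatch(email):
        return None  # no "@" at all, e.g. a name typed into the wrong field
    # Hypothetical next step: store the token and mail a link containing it.
    return secrets.token_urlsafe(16)
```

The regex only catches obvious slips; whether the address really works is decided by whether the confirmation link gets clicked.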


After extensive research, I have finally come up with a way to improve upon 99.99%. I have come up with a regex which will work for 100% of email addresses, and a significant number of regex engines, as an added bonus! Behold:

    .*@.*
/s


  $ nc localhost 25
  220 uhura.z ESMTP
  MAIL FROM:me
  250 2.1.0 Ok
  RCPT TO:root
  250 2.1.5 Ok
  DATA
  354 End data with <CR><LF>.<CR><LF>
  Look ma, no @!
  .
  250 2.0.0 Ok: queued as 8DB653E0065
  quit
  221 2.0.0 Bye


/bow

.* it is!


Your verbosity is outlandish.

.*

will match all email addresses, and as an additional feature, all other strings too!

As we all know, more functionality with less code is better, therefore, this is clearly superior to all other regexes.


    .+@.+
Should do the trick


I considered that one, but I'm sure there's some system out there which doesn't require an account name ahead of the "@" symbol. We're shooting for 100% accuracy, after all. ;)


In that case, let me improve it further :-)

    /@./
Though I'm not really sure there could be an email without username part.


What regex would you suggest to match all e-mail addresses contained within an arbitrary body of text (as opposed to a single text field where you don't have worry about other text strings)?


The same one. If it has an '@' symbol embedded in it, it might be an email address. The only way to know for certain is to query the server it's attached to.

Of course, if you're really trying to scrape email addresses reliably, looking for the '@' won't work reliably, since people are used to obfuscating their email with [at] or (at) to protect against spam (well, more spam). You will probably need to dive into AI theory to more reliably get email addresses.
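
Short of AI, a rough Python sketch of candidate extraction might look like this. It assumes only two obfuscation styles, `[at]` and `(at)`, and the name `find_candidates` is made up for illustration:

```python
import re

# Assumed obfuscation styles: "[at]" and "(at)", optionally padded with spaces.
# Local part and domain are any run of non-space characters, so this finds
# candidates only; it does not prove an address exists, and trailing
# punctuation is not stripped.
CANDIDATE = re.compile(r"(\S+?)\s*(?:@|\[at\]|\(at\))\s*(\S+)", re.IGNORECASE)


def find_candidates(text):
    """Return possible addresses found in text, normalised to local@domain."""
    return [f"{local}@{domain}" for local, domain in CANDIDATE.findall(text)]
```

Anything it returns still has to be checked against the mail server to know whether it is a real address.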


You know the different languages match different sets of email addresses, right? The reason the Perl ones are so much longer, is that they work for _all_ RFC5322 addresses, where the JS match a subset.


I simply don't validate emails up front anymore. The only thing I check for is if the string contains an @-char, I only do that to be nice if it's left out by accident. Instead of having a monstrous regex pattern in my code I simply email a confirmation link the user must confirm.


That's the thing, the only way to validate an email address anyway is to actually have the user take some action to do so. Otherwise, I would question how important setting up the email thing is in the first place.


Jeez, that's 100 email addresses in a million that it won't work on. Plus it's a pain in the butt. [Edit: Though I suspect that 99.99% figure was made up]

Just get people to enter their email twice (which filters out most mistakes where people are entering their names or somesuch), don't validate it with regex, during the signup process make sure you tell them to expect an email which they must confirm before they are added / before their account is activated. Send a confirmation email with a clickable link. If people don't get it, and the service is important to them, they'll try again or contact you through another means.

(I was involved with the running of a mailing list with well over 1m double-opt-in subscribers. Less than 100 of these turned out to be invalid [Edit: yeah, that's a guess, like the OP's 99.99%], and we dealt with it easily at our end, by properly handling any bounces)


Utterly pointless. An email regex tells you that the email address (probably) conforms to a pattern that means it might be a valid email address (for now, until new weird TLDs emerge and the patterns have to change...), but it has no way of telling you whether that address can actually receive mail. `foo@bar` fails these regular expressions and `foo@bar.invalid` passes them, but neither will receive mail.

As I have told people for many years: if you must do this, check at most if there's an @ and (perhaps) a dot somewhere after the @, which is enough to stop someone who has accidentally put their name in the email address field, or a similar user error. Anything else is a waste of brainwidth and will result in more problems than it solves.
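
As a sketch, that check fits in a few lines of Python without any regex. The function name and the `require_dot` flag are hypothetical; the flag is there because some setups, such as internal mail hosts, use addresses with no dot after the @:

```python
def looks_like_email(value, require_dot=True):
    """Cheap sanity check: an "@" with text on both sides, optionally a dot after it."""
    local, at, domain = value.rpartition("@")
    if not at or not local or not domain:
        return False  # no "@", or nothing on one side of it
    return "." in domain if require_dot else True
```

This catches a name typed into the email field and little else, which is the point.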


The dot isn't mandatory either... There are also local (for example company internal) email services.


Completely useful, as part of a two-step process:

1. Filter with the regex - what's left has a valid format, making step 2 much saner.

2. Extract and validate the domain name - super simple now, because the domain component is known to be sane.

(Optional but good idea 3: Handle exceptions....)

Step 1 is almost always the hardest part, now it's mostly done.


The assumption here is "0. create a regex that does not have false negatives". You seem to be taking the author's word for the "99.9% Works" part - based on what? (Exactly.)


Correctness and utility are two different things, feel free to investigate the former whilst we discuss the latter....


Well thank you, I doubt I would have come to such conclusion by myself. Are you saying that "turn away 0.1% of your customers" is a useful approach? Well, your call, I guess.


Whoa, there: You're shifting your argument. Are you now accepting that it is 99+% correct? Your original objection was as to the correctness of the regex.

If it is 80% good, then, yeah, your underlying assertion is likely good: Too little coverage to be all that useful. But 99+% coverage? For a quick pass filter? That's pretty good, we can work with those numbers.

After all, one could have multiple layers of progressively more accurate but progressively slower filters. Not an uncommon or unusual approach.

Besides, the JS version could be very useful as a quick pass usability check, validating whether the user made a silly mistake. And adding some logic to record the bogus entry, compare it to what they write next, pass it to the server if need be, can all be very useful.


> `foo@bar` fails these regular expressions and `foo@bar.invalid` passes them, but neither will receive mail.

foo@bar could still receive email, if you had a host in your DNS domain named bar, with a user named foo.


It's a very good offline check. If that's not enough, you have to do online checking (DNS, RCPT TO, actual mail with confirmation link, etc.)


It's excellent to extract email addresses from a text.


This would be valuable only with a proper test suite: nothing fancy, just two files with valid and invalid addresses. I don't trust these, and a complex regex is very hard to debug; it's way easier to argue about test cases.
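
A sketch of such a harness in Python, assuming the two files have already been read into lists. The sample addresses and the deliberately flawed pattern are illustrative and not taken from the site:

```python
import re


def check_pattern(pattern, valid, invalid):
    """Return (false_negatives, false_positives) for a candidate email regex."""
    regex = re.compile(pattern)
    false_negatives = [a for a in valid if not regex.fullmatch(a)]
    false_positives = [a for a in invalid if regex.fullmatch(a)]
    return false_negatives, false_positives


# Tiny illustrative lists; a real suite would load them from two files.
VALID = ["user@example.com", "user+tag@example.co.uk", "user@example.museum"]
INVALID = ["no-at-sign", "@example.com"]

# A pattern that assumes TLDs are two or three letters long.
misses, false_hits = check_pattern(r"[^@\s]+@[^@\s]+\.[a-z]{2,3}", VALID, INVALID)
```

Here `misses` contains `user@example.museum`, the kind of rejected valid address a test list turns up and a percentage claim hides.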


Yeah this is not great practice.

We built an app where sending a validation email upfront was not a practical option some time ago, and the best strategy I found for ensuring the email was valid was to lookup the MX records and ask the mailserver for the given domain, by issuing RCPT commands. Many mailservers will just drop connections when RCPT is for someone who doesn't exist or can't be routed to which was a good indicator of a typo or invalid address. And of course if the MX lookup fails the domain is incorrect.

Still wouldn't recommend this method either though really.
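
For illustration only, a probe along those lines could be sketched with Python's standard `smtplib`. The function name, the placeholder sender address and the injectable `smtp_factory` parameter are assumptions, not what the app above used:

```python
import smtplib


def mailbox_accepts(mx_host, address, smtp_factory=smtplib.SMTP):
    """Ask mx_host whether it would accept mail for address, without sending any.

    Returns True on a 250/251 reply to RCPT TO, False on a refusal or a dropped
    connection. Many servers accept every recipient, so True proves little.
    """
    try:
        server = smtp_factory(mx_host, 25, timeout=10)
        try:
            server.helo()
            server.mail("probe@example.invalid")  # placeholder sender (assumption)
            code, _ = server.rcpt(address)
            return code in (250, 251)
        finally:
            server.close()  # drop the connection without sending DATA
    except (smtplib.SMTPException, OSError):
        return False  # refused, dropped, or unreachable
```

A False result can also come from greylisting or rate limiting, which is one more reason to treat the answer as a hint and not as proof.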


Did you remember to fall back on an A and AAAA lookup if there is no MX record? That is what you are supposed to do.
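
A sketch of that fallback in Python. Since the standard library has no MX resolver, the DNS query is passed in as a hypothetical `lookup(domain, record_type)` callback; both that signature and the name `mail_hosts` are assumptions:

```python
def mail_hosts(domain, lookup):
    """Return hosts to try for domain: MX targets first, else the domain itself.

    lookup(domain, record_type) is a hypothetical resolver callback returning a
    list of records; MX records are (preference, host) tuples.
    """
    mx_records = lookup(domain, "MX")
    if mx_records:
        # Lowest preference value is tried first.
        return [host for _, host in sorted(mx_records)]
    # No MX: fall back to the domain's own A/AAAA records (implicit MX).
    if lookup(domain, "A") or lookup(domain, "AAAA"):
        return [domain]
    return []
```

Only an empty result means there is nowhere to deliver mail; a missing MX record alone does not.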


Only way to check email.

Look for an '@', then parse the response log from the provider to know whether the address works.


So in 1 million emails, there are 10,000 valid emails that will be rejected. I guess if you use this regex then you'd better hope your service doesn't become popular.


Your math is off by a factor of 100.
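
Worked out, taking the 99.99% figure at face value:

```latex
\[
10^{6} \times (1 - 0.9999) = 10^{6} \times 10^{-4} = 100
\]
```

So about 100 valid addresses per million would be rejected. A figure of 10,000 would correspond to a regex that works only 99% of the time.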


Assuming that the 99.99 figure is not something the author pulled out of...thin air. In absence of any data, I'm inclined to believe that the figure was gained by this very method.

Edit: Oh look, the submitter has pulled yet another figure out of thin air (author says "three nines, trust me," submitter says "four nines, omgwtfbetterthanslicedbread!!!1!"). Suspiciouser and suspiciouser.


D'oh :-/ Really should have thought more before posting.


Title here has an extra 9, compared to what's on the site. In either case, how would one back up those percentages anyway?


Nice idea. But a blind implementation of the grammar set out in the RFC is not performant. It's better to drop the obsolete syntax rules and folding white space. I have yet to see a user legitimately try to input an email with a comment mid-domain.

I implemented this in Ruby a while back, but I also went the next step and added a DNS check for a MX record. That way you can ensure there's a mail server to receive an email. Heck, I even wrote a blog post about it.

http://stevenkaras.github.io/blog/verifying-email-addresses

We've had pretty good feedback so far, but I've also spotted a few emails that people enter that are clearly fake (e.g. asdf at test.com).


Meh. I wish people would just give up on trying to validate email addresses altogether (except for maybe basic stuff like checking for an @). They'll almost always forget about some edge case.

I use a pretty unusual TLD (.su, the old TLD for the Soviet Union, which still remains in the root zone), and from time to time, I come across a site that won't accept my email address because of that. Most of those sites turn out to be generally crappy though, so not much of a loss…

Also, many sites don't accept + in email addresses, which is annoying as hell if you want to use the address extension feature of Postfix et al.


Doesn't work on the dreaded "I am a terrible but nonetheless valid email address"@example.com.

I echo the advice of everyone else - validate with something very simple like .+@.+ and then by sending an email. Trying to recapture the complexity of the email system via a tool like regular expressions is tilting at windmills. It's like trying to develop a regular expression to determine whether a name is real or not.

https://www.exratione.com/2012/09/what-constitutes-an-accept...


Note that these regexes aren't even matching the same thing, so who knows what these things are matching. Whatever it is, it's probably not 99.99% of the world's emails, and either way nobody is going to check that. Especially nobody is going to parse that Perl beast.

As a tangential anecdote, I always thought it would be interesting to drop a backdoor into some canonical piece of code like this that noobs are bound to copy-paste. It might be the most efficient way to worm your way into the largest number of computers worldwide.


There are two Perl regexes. The beast is for 5.8 (check your Perl version; it's probably above 5.12). The other is basically a BNF grammar and is trivial to parse. The only easier ones are those that throw up their hands and just look for an @ with characters before and after.


These have wildly different behaviour. The .NET and JavaScript ones are even interchangeable (both are valid JS). They will also not match internationalized domain names unless converted to punycode before validation.

IMO you either implement the RFC, or use the absolute dumbest validation possible: 1) there is one '@' character present 2) there is at least one dot on the right side. Anything else will exclude some valid addresses, and you're unlikely to ever hear feedback/complaints from someone who had their sign-up email rejected.
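
On the punycode point above, a small Python sketch of converting the domain before running an ASCII-only pattern. The function name is hypothetical, and it relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules and can raise `UnicodeError` on malformed labels:

```python
def to_ascii_address(address):
    """Convert the domain part of address to punycode, leaving the local part alone."""
    local, at, domain = address.rpartition("@")
    if not at:
        return address  # nothing that looks like a domain to convert
    return f"{local}@{domain.encode('idna').decode('ascii')}"
```

For example, `to_ascii_address("user@bücher.example")` gives `user@xn--bcher-kva.example`, which an ASCII-only regex can then be applied to.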


I haven't bothered with validating an email address for awhile, but I have used https://isemail.info successfully in the past.


In JS the regex is built into the browser: you can leverage HTML5 .checkValidity() on any type="email" input.

https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Form...

Note this is super liberal, so user@domainwithnodots (which is RFC valid, but probably also a user error) is still considered valid.


Any idea on who is responsible for this micro-site?

I find it strange that there's no information on the author, sources, references, attribution, or credits on the page at all (other than the WordPress theme attribution).


Three nines is now the new four nines? No data to back it up? Bah humbug!


How come the .NET version is so small compared to the others?


The various versions aren't equivalent to each other. Some validate 100% according to the RFC (perl), and some make a compromise (JS, PHP, .NET)


What's so special with .NET regex handling ?


The one for Ruby is missing. Anyone?


Why is the Perl regexp so long?


Because it actually validates any RFC-compliant email address, while the others (especially the .NET one) are people's attempts at "good enough" regexes.


Alternatively, don't validate email addresses with a regex because it's pointless. Check that at least an '@' symbol is present and just send a damn email already[0]. Or better, don't send a damn email at all and let them access it because it's pointless to have an extra verification step.

If you really want to do some client-side validation, just keep a basic regex and warn the user if the address doesn't seem right. Or use MailCheck[1].

0. http://davidcel.is/blog/2012/09/06/stop-validating-email-add...

1. https://github.com/mailcheck/mailcheck


> Or better, don't send a damn email at all and let them access it because it's pointless to have an extra verification step

I've heard lots of people say not to bother validating but I always thought it was because you could just confirm with a verification email. Why's this step useless?


I don't get it either. Not doing any verification is just an invitation for trouble. "My password doesn't work, why can't you send me a new one?"


Pointless? Methinks you've never worked with Mandrill or another service that penalizes you for hard/soft bounces.

If you validate emails as complying to the email format used by those services, you can reduce your hard bounces to a bare minimum. If you check for a valid mail server at that domain, you can reduce that to 0. If you connect to their server and check for an error for that mailbox, you can reduce your soft bounces to 0 too, but that requires a lot of legwork to set up.


If you're willing to work with a service that you're paying for AND that penalizes you for erroneous bounces, that's kind of your problem.


It even sounds like you're doing their work.


> Check that at least an '@' symbol is present and just send a damn email already[0].

Checking there's a dot after the @ is probably a good idea. Although it theoretically blocks people who want mail sent to a TLD, that probably isn't a big effective issue.


Or, if you want to do something in between the two, and be reasonably certain it's a real email without having the customer validate, do a quick MX lookup on the domain they supply (cache these on a local bind instance or whatever if you like), and if you really want to go a step further, send a quick HELO and headerset to the remote SMTP server (many libraries will do this for you), and see if it responds with a 250.

Of course, all this takes time, so you always have a tradeoff between speed and validation level, from full-on user-validated email to "enter whatever".


Regexps are not only about validation. E.g. sometimes you want to scan some text for e-mails (to highlight them, protect, scrape or whatever).

Although for most languages if there's nothing built in, there's at least one lib that provides it. I prefer using that over copy-pasting it from some random website.



