I once had an issue where a mobile user's browser would do a Google search every time they typed one of our domains into the URL bar. They weren't doing anything wrong - other domains worked fine. They were using some Android phone from 2013 which had been abandoned by the manufacturer, and it turned out the stock browser was using the public suffix list to decide if the input should be treated as a URL or as a search query. The list was as old as the browser, and since the domain ended in ".link" (added in 2014) it didn't think our domain was a domain. (The workaround was to type out http:// or https:// before the domain, but that's crappy UX.)
If you use this in a server-side app, please use cron or something to keep it updated. If you embed it in a client app, make sure to update it as part of your build process. And if you know your app won't be updated very often, consider having it update separately from the app itself. (Does anyone know if this is hosted on some public CDN somewhere? I assume having a client app fetch directly from publicsuffix.org isn't kosher.)
Edit: My mistake, on https://publicsuffix.org/list/ they do recommend having client apps pull directly from their site, and limiting updates to once per day.
Indeed, there are some differences (https://www.diffchecker.com/tcbbvy7p) with IANA's root domains list: http://www.iana.org/domains/root/db (which only includes TLDs, as the name indicates)
Secondly, maintaining this list is a pain. You have hundreds of two-letter top-level domains, one for each country. Each country has its own NIC in charge of sub-domains, and each NIC has the power to add or delete them (check out https://en.wikipedia.org/wiki/.uk for just one of hundreds of examples). Some "countries" even sell off their top-level domain, like .nu. Then you have .us (https://en.wikipedia.org/wiki/.us), which has wildcard rules like http://vil.stockbridge.mi.us/ where the vil. is fixed and part of the domain and stockbridge.mi is the important domain information. Of course they're always adding more top-level domains: .ninja, .wtf, etc (maybe there is a .etc now?). Then you have all the blogging and hosting platforms that use personalized domains for hosting content. Many are listed in the publicsuffix list, but I'm guessing not all!
I ended up writing my own publicsuffix parser in Perl a few years back for the blekko search engine. The main purpose was to group web pages together by site/owner. There is nothing quite like feeding every URL on the internet through your parser to find bugs and corner cases.
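The matching algorithm itself is small once the list is parsed. Here is a minimal sketch of the public suffix rules (longest matching rule wins, wildcards and `!` exceptions handled), using a tiny hand-written subset of rules rather than the real list:

```python
def public_suffix_len(labels, rules):
    """Number of labels in the public suffix, per the PSL matching rules:
    an exception rule (!) wins outright; otherwise the longest match wins;
    unknown TLDs default to a one-label suffix."""
    best = 1
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if '!' + candidate in rules:
            return len(labels) - i - 1  # exception: drop the leftmost label
        wildcard = '.'.join(['*'] + labels[i + 1:])
        if candidate in rules or wildcard in rules:
            best = max(best, len(labels) - i)
    return best

def registrable_domain(host, rules):
    """Public suffix plus one label, or None if host is itself a suffix."""
    labels = host.lower().rstrip('.').split('.')
    n = public_suffix_len(labels, rules)
    if n >= len(labels):
        return None
    return '.'.join(labels[-(n + 1):])

# Hand-picked subset of rules, purely for illustration:
rules = {'com', 'uk', 'co.uk', 'us', '*.ck', '!www.ck'}
```

With those toy rules, `registrable_domain('www.example.co.uk', rules)` gives `'example.co.uk'`, and the `!www.ck` exception makes `'www.ck'` registrable even though `*.ck` would otherwise swallow it. Grouping pages by owner is then just grouping by this value.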
It also feels a bit like a hack. Would there be a better way to do this, maybe something in the DNS system itself to denote ownership/isolation?
I had many long discussions with package and application maintainers on the topic of how to best provide data to the library. Arguably you need an up-to-date version of the list if you want to access the web securely today.
You can bundle a static version with the library. That's nice for developers and users because it just works after "pip install", but it will get out of date because the library won't update as often as the list changes. Some distros like Debian provide their own mechanism for updating the list.
You can also download behind the scenes and cache. That way the data is always up to date, but users don't like apps that call some web service on start-up, and some people want to use the library offline.
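The download-and-cache approach can be sketched in a few lines, assuming a local cache file and honoring the once-per-day fetch limit that publicsuffix.org asks clients to respect (the cache path here is just illustrative):

```python
import os
import time
import urllib.request

PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"
CACHE = os.path.expanduser("~/.cache/public_suffix_list.dat")
MAX_AGE = 24 * 60 * 60  # publicsuffix.org asks for at most one fetch per day

def cache_is_fresh(path, max_age=MAX_AGE, now=None):
    """True if the cached copy exists and is younger than max_age seconds."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) < max_age
    except OSError:
        return False  # missing cache counts as stale

def load_list():
    """Return the list text, refreshing the cache only when it is stale."""
    if not cache_is_fresh(CACHE):
        os.makedirs(os.path.dirname(CACHE), exist_ok=True)
        with urllib.request.urlopen(PSL_URL) as resp:
            data = resp.read()
        with open(CACHE, "wb") as f:
            f.write(data)
    with open(CACHE, encoding="utf-8") as f:
        return f.read()
```

The offline case falls out for free: as long as a cached copy exists, a failed refresh can be caught and the stale copy used instead.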
If you look at different libraries and apps, each one does it differently. Getting it right is hard. Things would be so much easier if this information came, decentralized, from the DNS system itself.
This list may help to level the playing field between domain grabbers and legitimate domain users.
If you've got a script searching for available domains, you don't really want it wasting time trying dictionary words under *.accident-investigation.aero, as they'll probably all flag as legit; if you use the DNS-lookup-then-whois method, you've wasted a lot of time searching for domains you can't have.
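A bulk availability checker can use the list exactly this way: prune candidates that sit directly under a wildcard rule before doing any DNS or whois work. A toy sketch (the suffix set is a hypothetical hand-picked subset, not the real list):

```python
# Hypothetical subset of wildcard public-suffix rules, with the '*.' stripped:
WILDCARD_SUFFIXES = {'accident-investigation.aero'}

def worth_checking(candidate, wildcard_suffixes=WILDCARD_SUFFIXES):
    """Skip names whose parent is a wildcard suffix: everything under it
    is controlled by the registry, so probing availability is wasted effort."""
    parent = candidate.split('.', 1)[1] if '.' in candidate else ''
    return parent not in wildcard_suffixes
```

So `worth_checking('dictionary.accident-investigation.aero')` is False and the script moves on without a lookup, while `worth_checking('dictionary.aero')` is True and proceeds to DNS/whois.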
> Please don't.
He is correct. Seriously. Don't.
The right way to validate an email address, if that's something you need to do, is by sending it a message with a link and a response code, and seeing whether the user clicks the link or enters the code. This is the only right way. There are many wrong ways. They make people unhappy. They will make you unhappy. Do not use them.
You cannot validate email addresses for shape and form. The sole invariant is that there be at least three characters, one of which is @. Everything beyond that is in the hands of the DNS and the domain's MTA. You cannot predict every fashion in which they will behave. You should not try to.
Have you seen the regex that correctly matches every variant of email address form described in RFC 822? It is five kilobytes long. "Ah," you may now think, "I can use that!" You should not. There are new standards with new variations. The regex is incomplete. It will be incomplete forever. It is a five-kilobyte Perl regex. No one will ever understand it well enough to extend it.
Just send a link and a code. Ask the user to click on the link or give you the code. When you have received the click or the code, you know the email is valid. When you have not, assume it is not. This is the method that works. It is the only method that works. Use this method and be happy. Use any other and be sad. Which you prefer is up to you.
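For what it's worth, the whole mechanism is a handful of lines. A sketch of the send-a-code flow, with an in-memory dict standing in for whatever database a real app would use (all names here are made up):

```python
import secrets
import time

# Pending confirmations: token -> (email, issued_at). A real app persists this.
_pending = {}
TOKEN_TTL = 24 * 60 * 60  # give the user a day to click

def start_confirmation(email):
    """Issue a single-use token to embed in the link/code mailed to the user."""
    token = secrets.token_urlsafe(32)
    _pending[token] = (email, time.time())
    return token  # the app would now email a link containing this token

def confirm(token, now=None):
    """Return the confirmed address if the token is valid and unexpired, else None."""
    now = time.time() if now is None else now
    entry = _pending.pop(token, None)  # pop makes the token single-use
    if entry is None:
        return None
    email, issued = entry
    if now - issued > TOKEN_TTL:
        return None
    return email
```

If `confirm` returns the address, the mailbox demonstrably exists and its owner cooperated; no grammar check can tell you either of those things.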
I used to agree with this viewpoint, but now I'm having some doubts.
Suppose I enter the email address "email@example.com@example.com". It validates according to your rule. Where does it go?
Suppose, furthermore, that I upgrade my servers and they start parsing it differently. They're allowed to do that, right? Maybe my old MTA sent a confirmation email to example.net, and the new one is now sending emails to a user named "firstname.lastname@example.org" at example.com. Haven't I just done something very wrong by allowing a user to trick me into sending emails to an unvalidated address? Isn't this both a violation of the Postel principle ("conservative in what you send to others") and a security hole?
It seems like it would be better for my application to parse the email address and validate what it's doing, and store only unambiguously-parseable addresses.
Or maybe they're not allowed to parse it differently. Maybe there's a single consistent way to parse email@example.com@example.com, and every single email application I might use will get it right. What is this esoteric lore that email applications know that my application cannot? If they can parse it, can't I?
Slight word change on the rule: "...one and only one of which is @...", with characters within quotations ("") not being counted. Quotations need to be considered because an address like "email@example.com"@example.com is valid.
>Suppose, furthermore, that I upgrade my servers and they start parsing it differently. They're allowed to do that, right? Maybe my old MTA sent a confirmation email to example.net, and the new one is now sending emails to a user named "firstname.lastname@example.org" at example.com.
If you upgrade your servers and they parse incorrectly by inserting quotations where there were no quotations entered, then the parsing is bugged. Although the validation was bugged to begin with for allowing two delimiters in an email address.
Furthermore, if anyone uses an email address that eccentric just because it is "technically valid", I'm sure they don't expect it to work most of the time.
I will fully agree with the claim that a regex is the wrong way to validity-check an email, and I will easily believe that implementing the check you describe takes 5 kilobytes of regex. But a rule like "Exactly one un-quoted @ sign" or "Exactly one un-quoted @ sign, and the string on the right needs to be a well-formed domain name" or something is pretty simple in normal code.
"Well-formed domain name" is an existing concept: one or more labels separated by dots, each of which contains only ASCII letters, digits, or hyphens, cannot start or end with a hyphen, and can be at most 63 characters long, with the whole name at most 253 characters. Again, not something I'd do with a regex, but a small number of lines of code.
Although you have me curious how you would check with code without using regex.
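It's just a linear scan. A sketch of the "exactly one un-quoted @, right side a well-formed domain name" check in plain code, no regex (the quoting rules are simplified relative to the full RFC grammar):

```python
def split_addr(addr):
    """Return (local, domain) if addr has exactly one @ outside
    double quotes, with text on both sides; otherwise None."""
    in_quotes = False
    at = None
    i = 0
    while i < len(addr):
        c = addr[i]
        if c == '\\' and in_quotes:
            i += 2  # skip an escaped character inside quotes
            continue
        if c == '"':
            in_quotes = not in_quotes
        elif c == '@' and not in_quotes:
            if at is not None:
                return None  # more than one un-quoted @
            at = i
        i += 1
    if at is None or at == 0 or at == len(addr) - 1 or in_quotes:
        return None
    return addr[:at], addr[at + 1:]

def valid_domain(domain):
    """Label rules: ASCII letters/digits/hyphens, no leading or trailing
    hyphen, at most 63 chars per label and 253 for the whole name."""
    if len(domain) > 253:
        return False
    for label in domain.split('.'):
        if not 1 <= len(label) <= 63:
            return False
        if label[0] == '-' or label[-1] == '-':
            return False
        if not all((ch.isalnum() and ch.isascii()) or ch == '-' for ch in label):
            return False
    return True
```

This correctly splits "email@example.com"@example.com on the second @, rejects a@b@c outright, and is readable enough to extend when the next corner case shows up, which is more than can be said for the five-kilobyte regex.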
No. Read the rule again.
That's not the rule that 'Freak_NL posted, which is why I read your rule as "at least one of which is @".
And, besides, as others have pointed out, "exactly one of which is @" is incorrect. Valid email addresses can have multiple @ characters.
There may be plenty of reasons to support your opinion, but it does not seem consistent with the position you were advocating. I am interested in not validating email addresses, because of your convincing argument that I should not. And now you want me to validate them.
The most sensible approach (in my opinion) is to validate with a minimal regex that just checks for text on both sides of the @.
I've been using a regex that just makes sure there actually is text before and after the @.
Once you get that into a regex format, remember that a quoted dot is different from an unquoted dot, and don't forget to handle comments within email addresses correctly.
There are plenty of reasons why people say not to use regex to validate email addresses.
Wikipedia <https://en.wikipedia.org/wiki/Email_address#Valid_email_addr... > gives the following example:
There are regexes out there that capture all the complexity of valid email addresses today (but who knows if they'll work with, say, next year's additions to the top-level domains?), and you can copy them from StackOverflow if you really want them; but why bother?
Nevermind. `/.+@.+/` it is.