The Public Suffix List (publicsuffix.org)
79 points by dan1234 on Aug 18, 2016 | 40 comments

If you use this for something, please be sure to keep the data fresh! Especially now that new TLDs are being added at a steady clip.

I once had an issue where a mobile user's browser would do a Google search every time they typed one of our domains into the URL bar. They weren't doing anything wrong - other domains worked fine. They were using some Android phone from 2013 which had been abandoned by the manufacturer, and it turned out the stock browser was using the public suffix list to decide if the input should be treated as a URL or as a search query. The list was as old as the browser, and since the domain ended in ".link" (added in 2014) it didn't think our domain was a domain. (The workaround was to type out http:// or https:// before the domain, but that's crappy UX.)

If you use this in a server-side app, please use cron or something to keep it updated. If you embed it in a client app, make sure to update it as part of your build process. And if you know your app won't be updated very often, consider having it update separately from the app itself. (Does anyone know if this is hosted on some public CDN somewhere? I assume having a client app fetch directly from publicsuffix.org isn't kosher.)

Edit: My mistake, on https://publicsuffix.org/list/ they do recommend having client apps pull directly from their site, and limiting updates to once per day.
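That once-per-day refresh can be sketched with a simple mtime-based cache. This is a hypothetical sketch: the cache path and refresh window are my own choices, and a real app would want proper error handling for failed downloads rather than letting the exception propagate.

```python
import os
import time
import urllib.request

PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"
CACHE_PATH = "/tmp/public_suffix_list.dat"  # hypothetical cache location
MAX_AGE = 24 * 60 * 60  # refresh at most once per day, per the site's guidance

def get_psl_text():
    """Return the PSL text, re-downloading only if the cache is stale."""
    try:
        age = time.time() - os.path.getmtime(CACHE_PATH)
        if age < MAX_AGE:
            with open(CACHE_PATH, encoding="utf-8") as f:
                return f.read()
    except OSError:
        pass  # no cache yet; fall through to download
    with urllib.request.urlopen(PSL_URL) as resp:
        text = resp.read().decode("utf-8")
    with open(CACHE_PATH, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

A server-side cron job could simply delete the cache file nightly; client apps get the once-per-day behavior automatically from the age check.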

> Especially now that new TLDs are being added at a steady clip.

Indeed, there are some differences (https://www.diffchecker.com/tcbbvy7p) with IANA's root domains list: http://www.iana.org/domains/root/db (which only includes TLDs, as the name indicates)

This list is used to determine what domains and sub-domains "belong" together, in the sense that they are controlled and/or owned by the same entity. For instance *.google.com is all google, so x.google.com and y.google.com can be trusted to share the same SSL key, safely share javascript (XSS), etc. However x.blogger.com and y.blogger.com are probably two completely separate blogs, people, domains, SSL keys, javascript domains, etc. And you wouldn't want to see x.blogger.com's web pages showing up in search results for y.blogger.com.

Secondly, maintaining this list is a pain. You have hundreds of two letter top level domains, one for each country. Each country with its own NIC in charge of sub-domains. Each NIC with the power to add or delete subdomains (check out https://en.wikipedia.org/wiki/.uk for just one of hundreds of examples). Some "countries" even sell off their top level domain like .nu . Then you have .us (https://en.wikipedia.org/wiki/.us) that has wild card domains like http://vil.stockbridge.mi.us/ where the vil. is fixed and part of the domain and stockbridge.mi is the important domain information. Of course they're always adding more top level domains: .ninja, .wtf, etc (maybe there is a .etc now?). Then you have all the blogging and hosting platforms that use personalized domains for hosting content. Many are listed in the publicsuffix list, but I'm guessing not all!
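The matching behavior described above (longest rule wins, a "*" rule consumes one extra label) can be sketched against a toy rule set. These few hard-coded rules, including the blogger.com entry, are purely illustrative, not the real list's contents; the real list has thousands of entries plus "!" exception rules that this sketch does not handle.

```python
# Toy sketch of public-suffix matching: longest matching rule wins,
# and a "*" rule matches one extra label. Rules here are illustrative.
RULES = {"com", "co.uk", "*.ck", "blogger.com"}

def public_suffix(hostname):
    labels = hostname.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if candidate in RULES or wildcard in RULES:
            return candidate  # first hit is the longest match
    return labels[-1]  # implicit default rule "*": just the TLD

def registrable_domain(hostname):
    """Public suffix plus one label: the part a single entity controls."""
    labels = hostname.lower().split(".")
    n = len(public_suffix(hostname).split(".")) + 1
    return ".".join(labels[-n:]) if len(labels) >= n else None
```

With the toy rules, x.google.com and y.google.com both collapse to google.com, while x.blogger.com and y.blogger.com stay separate, which is exactly the grouping-by-owner behavior the comment above describes.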

I ended up writing my own publicsuffix parser in Perl a few years back for the blekko search engine, mainly to be able to group web pages together by site/owner. There is nothing quite like feeding every URL on the internet through your parser to find bugs and corner cases.

A great and important list.

It also feels a bit like a hack. Would there be a better way to do this, maybe in the DNS system, to denote ownership/isolation?

Yes, it is a hack. I maintain the publicsuffix Python library (https://pypi.python.org/pypi/publicsuffix/).

I had many long discussions with package and application maintainers on the topic of how to best provide data to the library. Arguably you need an up-to-date version of the list if you want to access the web securely today.

You can bundle a static version with the library. That's nice for developers and users because it just works after "pip install", but it will get out of date because the library won't update as often as the list changes. Some distros like Debian provide their own mechanism for updating the list.

You can also download behind the scenes and cache. This way data is always up to date, but users don't like that apps call some web service on start-up and some people want to use the library off-line.

If you look at different libraries and apps, each one does it differently. Getting it right is hard. Things would be so much easier if this information came from the DNS system in a decentralized way.

There's an IETF WG working on this[1]. It's been a while since I checked on the progress, so this might not be up-to-date, but it was DNS-based, with a way to export to a static file like the PSL (I'd assume in order to avoid the performance impact of browsers having to wait for another DNS query to determine cookie scope, etc.)

[1]: https://datatracker.ietf.org/wg/dbound/charter/

Fix how cookies work. From my understanding, that's why this list is needed.

I have always wondered why this information is not stored in the DNS itself.

This is security critical information. DNS is not secure. (Don't say DNSSEC - practically undeployed solutions don't count as solutions.)

I recently found this list and use it in a Drupal module to invoke hooks based on the domain of a URL. I didn't realize how many edge cases there are for these and originally tried to do it with regex.

I do hope they use the same care and due process for public suffix list inclusion as they do with root certificate inclusion (along with the ability to add/remove entries manually). I think this is a great move for better security, but with great power comes great responsibility.

@HN mods: this list could be used to make the domain in parens after the link more accurate, no?

Quick plug for my related project: https://github.com/QA2/public-suffix-metalist - pull requests welcome.

A domain hacker's playground.

Indeed, although I suspect that domain grabbers have their own lists anyway.

This list may help to level the playing field between domain grabbers and legitimate domain users.

It's not the be all end all though. A lot of the domains listed are not usable by most people.

If you've got a script searching for available domains, you don't really want it wasting time trying dictionary words on *.accident-investigation.aero, as they'll probably all flag legit; if you use the DNS-lookup-then-whois method, you've wasted a lot of time searching for domains you can't have.

This list has been maintained for 9 years; if there was any leveling that needed doing, it happened a long time ago.

I was thinking of using this as the basis for some email validation as I got a lot of incorrect email addresses in form posts.

Freak_NL says:

> Please don't.

He is correct. Seriously. Don't.

The right way to validate an email address, if that's something you need to do, is by sending it a message with a link and a response code, and seeing whether the user clicks the link or enters the code. This is the only right way. There are many wrong ways. They make people unhappy. They will make you unhappy. Do not use them.

You cannot validate email addresses for shape and form. The sole invariant is that there be at least three characters one of which is @. Everything beyond that is in the hands of the DNS and the domain's MTA. You cannot predict every fashion in which they will behave. You should not try to.

Have you seen the regex that correctly matches every variant of email address form described in RFC 822? It is five kilobytes long. "Ah," you may now think, "I can use that!" You should not. There are new standards with new variations. The regex is incomplete. It will be incomplete forever. It is a five-kilobyte Perl regex. No one will ever understand it well enough to extend it.

Just send a link and a code. Ask the user to click on the link or give you the code. When you have received the click or the code, you know the email is valid. When you have not, assume it is not. This is the method that works. It is the only method that works. Use this method and be happy. Use any other and be sad. Which you prefer is up to you.
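A minimal sketch of that flow, using hypothetical helper names and an in-memory store (a real app would persist pending codes in a database and plug in an actual mail sender):

```python
import secrets
import time

PENDING = {}  # code -> (email, issued_at); a real app would use a DB
CODE_TTL = 60 * 60 * 24  # codes expire after a day

def start_validation(email, send_mail):
    """Issue a single-use unguessable code and email it to the address."""
    code = secrets.token_urlsafe(16)
    PENDING[code] = (email, time.time())
    send_mail(email, f"Confirm your address: https://example.com/confirm/{code}")
    return code

def confirm(code):
    """Return the email if the code is known and fresh, else None."""
    entry = PENDING.pop(code, None)  # pop: each code works exactly once
    if entry is None:
        return None
    email, issued = entry
    if time.time() - issued > CODE_TTL:
        return None
    return email  # now known to be deliverable and controlled by the user
```

The only thing ever asserted about the address before sending is that the MTA accepted it; deliverability and ownership are proven by the click, not by parsing.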

I prefer to do email suggestion rather than validation - The only validation I have is that it should contain an @. And then use something like http://getmailcheck.org/ to suggest spelling corrections for common domains. We saw a massive reduction in bad emails just by implementing this without having to annoy users with incorrectly implemented validation rules.

> You cannot validate email addresses for shape and form. The sole invariant is that there be at least three characters one of which is @. Everything beyond that is in the hands of the DNS and the domain's MTA. You cannot predict every fashion in which they will behave. You should not try to.

I used to agree with this viewpoint, but now I'm having some doubts.

Suppose I enter the email address "geofft@example.net@example.com". It validates according to your rule. Where does it go?

Suppose, furthermore, that I upgrade my servers and they start parsing it differently. They're allowed to do that, right? Maybe my old MTA sent a confirmation email to example.net, and the new one is now sending emails to a user named "geofft@example.net" at example.com. Haven't I just done something very wrong by allowing a user to trick me into sending emails to an unvalidated address? Isn't this both a violation of the Postel principle ("conservative in what you send to others") and a security hole?

It seems like it would be better for my application to parse the email address and validate what it's doing, and store only unambiguously-parseable addresses.

Or maybe they're not allowed to parse it differently. Maybe there's a single consistent way to parse geofft@example.net@example.com, and every single email application I might use will get it right. What is this esoteric lore that email applications know that my application cannot? If they can parse it, can't I?

>Suppose I enter the email address "geofft@example.net@example.com". It validates according to your rule. Where does it go?

Slight word change to the rule: "..one and only one of which is @..", with characters within quotations ("") not being counted. Quotations need to be considered because this is a valid email address: [0]

`"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com`

>Suppose, furthermore, that I upgrade my servers and they start parsing it differently. They're allowed to do that, right? Maybe my old MTA sent a confirmation email to example.net, and the new one is now sending emails to a user named "geofft@example.net" at example.com.

If you upgrade your servers and they parse incorrectly by inserting quotations where there were no quotations entered, then the parsing is bugged. Although the validation was bugged to begin with for allowing two delimiters in an email address.

Furthermore, if anyone uses an email that is so heavily eccentric just because it is "technically valid" I'm sure they don't expect their email to work most of the time.

[0] https://en.wikipedia.org/wiki/Email_address

Aha. Well, now we're approaching something more complex than /@/, aren't we? :)

I will fully agree with the claim that a regex is the wrong way to validity-check an email, and I will easily believe that implementing the check you describe takes 5 kilobytes of regex. But a rule like "Exactly one un-quoted @ sign" or "Exactly one un-quoted @ sign, and the string on the right needs to be a well-formed domain name" or something is pretty simple in normal code.

"Well-formed domain name" is an existing concept: one or more labels separated by dots, each of which contains only ASCII letters, digits, or hyphens, cannot start or end with a hyphen, and cannot be more than 63 characters long, and no more than 253 characters total. Again, not something I'd do with a regex, but a small number of lines of code.

I'm not the original person you were speaking with - sorry for any confusion.

Although you have me curious how you would check that in code without using a regex.

> It validates according to your rule.

No. Read the rule again.

Did you mean "exactly one of which is @"?

That's not the rule that 'Freak_NL posted, which is why I read your rule as "at least one of which is @".

And, besides, as others have pointed out, "exactly one of which is @" is incorrect. Valid email addresses can have multiple @ characters.

I am not Freak_NL. I am also not interested in participating in the deviancy of people who choose to use email addresses containing quoted strings which themselves contain @s. Anyone who wishes to engage in such self-abuse is entirely free to do so. I don't feel myself constrained to play along.

Well, okay, but you've now stepped away from the goal of allowing all valid email addresses, and are now restricting email addresses that don't fit your personal opinion of what a good email address is, which is at odds with what the actual rules for valid email addresses are.

Your opinion might have a lot of reasons to support it, but it does not seem to be consistent with the position you were advocating. I am interested in not validating email addresses, because of your convincing argument that I should not. And now you want me to validate them.

Where did I claim that as my goal? If somebody just has to be a wiseass, that's fine, but I don't mind making him unhappy, because trying to make him happy along with everybody else will make me very unhappy indeed. If I can keep from making myself unhappy, and keep from making unhappy anyone who doesn't just have to be a wiseass, I'm willing to call it a day.

The Perl regex is also incorrect, since it assumes that comments were stripped beforehand. It also requires you to have backend processing that knows that <a . (kidding) b @ example.com>, <"a.b"@example.com>, and <a.b@example.com> all refer to the same email address. It would help if people knew that you need to look at RFC 821(/2821/5321) to figure out what constitutes a valid email address instead of RFC 822.

Please don't. A lot has been written about email address validation (search HN for a few good discussions); it is horrendously complex to get right. Any mistake or oversight means perfectly valid email addresses won't work, and maintaining your home-grown solution means keeping tabs on any updates to this domain suffix list and other relevant standards.

The most sensible approach (in my opinion) is to validate with this minimal regex:

Essentially just requiring an at-mark between two other bits of text. After that send a confirmation email to see if it works. That is pretty much fool-proof and low maintenance.
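A pattern matching that description ("an at-mark between two other bits of text") would look something like the following. This is my own sketch, not necessarily the exact regex the comment refers to:

```python
import re

# Just "an at-mark between two other bits of text". Anything stricter
# risks rejecting valid addresses; the real check is the confirmation
# email, not the regex.
MINIMAL_EMAIL = re.compile(r"^.+@.+$")

def looks_like_email(addr):
    return MINIMAL_EMAIL.match(addr) is not None
```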

If you absolutely must help the user in preventing any common mistakes, use a library that provides hints in the UI for those presumed errors without actually invalidating the input. There are JavaScript libraries that do this fairly well.

> Essentially just requiring an at-mark between two other bits of text.

I've been using this regex to make sure there actually is text before and after the @


Can there be more than one @ or whitespace? If not you could use:

Filtering out emails with multiple @'s and whitespace seems like an obvious win to help prevent typos and copy-paste errors.
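One commonly seen stricter pattern along those lines is sketched below. As the replies point out, this deliberately rejects rare-but-valid quoted addresses; it is a practical trade-off, not RFC-grade parsing.

```python
import re

# Stricter check: exactly one "@" and no whitespace anywhere.
# Note this rejects technically valid quoted forms such as
# "my address"@example.com -- a deliberate trade-off.
STRICT_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+$")

def probably_email(addr):
    return STRICT_EMAIL.match(addr) is not None
```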

Whitespace in front of the @ is only allowed within quotes, and any quoted part must be separated by dots from unquoted parts. "my address"@example.com is allowed, as is my." ".address@example.com, but not my"address"@example.com.

Once you get that into a regex format, remember that a quoted dot is different from an unquoted dot, and don't forget to handle comments within email addresses correctly.

There are plenty of reasons why people say not to use regex to validate email addresses.

Yes, there are valid e-mail addresses that contain whitespace or multiple @s.

Wikipedia <https://en.wikipedia.org/wiki/Email_address#Valid_email_addr... > gives the following example:

    "very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com

Exactly. It is very much our nature as programmers to attempt to capture this seemingly trivial bit of structured text into seemingly innocuous programming rules, but the rewards of the added complexity are just too meager. Even if an email address validates, it still doesn't mean that you have a working email address. Sending an email is the only way to be sure, and if you are already doing that, then why risk excluding valid email addresses with a decidedly non-trivial regex?

There are regexes out there that capture all the complexity of valid email addresses today (but who knows if they'll work with, say, next year's additions to the top-level domains?), and you can copy them from StackOverflow if you really want them; but why bother?

It is safe in almost all applications to assume that quoted-string local-parts do not constitute a valid email address (ditto for domain literals).

Oh my.

Nevermind. `/.+@.+/` it is.

You are right of course. A typo on my end (fixed).

If you feel you really have to validate an email address without sending an email to said address, don't use a regex. Use something that's meant for parsing, like LPeg (for example: https://github.com/spc476/LPeg-Parsers/blob/ee63df57826fd5ef... which handles the RFC grammar, including comments) or lex/yacc or something other than a regex.
