
The Public Suffix List - dan1234
https://publicsuffix.org
======
derefr
@HN mods: this list could be used to make the domain in parens after the link
more accurate, no?

------
DanielDent
Quick plug for my related project: [https://github.com/QA2/public-suffix-
metalist](https://github.com/QA2/public-suffix-metalist) - pull requests
welcome.

------
randomstring
This list is used to determine which domains and sub-domains "belong" together,
in the sense that they are controlled and/or owned by the same entity. For
instance, *.google.com is all Google, so x.google.com and y.google.com can be
trusted to share the same SSL key, safely share JavaScript (XSS), etc. However,
x.blogger.com and y.blogger.com are probably two completely separate blogs,
people, domains, SSL keys, JavaScript origins, etc. And you wouldn't want to
see x.blogger.com's web pages showing up in search results for y.blogger.com.

Secondly, maintaining this list is a pain. You have hundreds of two-letter top-
level domains, one for each country, each country with its own NIC in charge
of sub-domains, and each NIC with the power to add or delete subdomains (check
out [https://en.wikipedia.org/wiki/.uk](https://en.wikipedia.org/wiki/.uk) for
just one of hundreds of examples). Some "countries" even sell off their top-
level domain, like .nu. Then you have .us
([https://en.wikipedia.org/wiki/.us](https://en.wikipedia.org/wiki/.us)),
which has wildcard domains like
[http://vil.stockbridge.mi.us/](http://vil.stockbridge.mi.us/), where the vil.
is fixed and part of the domain and stockbridge.mi is the important domain
information. Of course they're always adding more top-level domains: .ninja,
.wtf, etc. (maybe there is a .etc now?). Then you have all the blogging and
hosting platforms that use personalized domains for hosting content. Many are
listed in the publicsuffix list, but I'm guessing not all!

I ended up writing my own publicsuffix parser in Perl a few years back for the
blekko search engine. The main purpose was to group web pages together by
site/owner. There is nothing quite like feeding every URL on the internet
through your parser to find bugs and corner cases.

------
cdubzzz
I recently found this list and use it in a Drupal module to invoke hooks based
on the domain of a URL. I didn't realize how many edge cases there are for
these and originally tried to do it with regex.

------
jasonjei
I do hope they apply the same care and due process to public suffix list
inclusion as they do to root certificate inclusion (as well as retaining the
ability to add/remove entries manually). I think this is a great move for
better security, but with great power comes great responsibility.

------
yrro
I have always wondered why this information is not stored in the DNS itself.

~~~
hannob
This is security critical information. DNS is not secure. (Don't say DNSSEC -
practically undeployed solutions don't count as solutions.)

------
profmonocle
If you use this for something, please be sure to keep the data fresh!
Especially now that new TLDs are being added at a steady clip.

I once had an issue where a mobile user's browser would do a Google search
every time they typed one of our domains into the URL bar. They weren't doing
anything wrong - other domains worked fine. They were using some Android phone
from 2013 that had been abandoned by the manufacturer, and it turned out the
stock browser was using the public suffix list to decide whether the input
should be treated as a URL or as a search query. The list was as old as the
browser, and since the domain ended in ".link" (added in 2014) it didn't think
our domain was a domain. (The workaround was to type out http:// or https://
before the domain, but that's crappy UX.)

If you use this in a server-side app, please use cron or something to keep it
updated. If you embed it in a client app, make sure to update it as part of
your build process. And if you know your app won't be updated very often,
consider having it update separately from the app itself. (Does anyone know if
this is hosted on some public CDN somewhere? I assume having a client app
fetch directly from publicsuffix.org isn't kosher.)

Edit: My mistake, on
[https://publicsuffix.org/list/](https://publicsuffix.org/list/) they do
recommend having client apps pull directly from their site, and limiting
updates to once per day.
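For the server-side case, the once-per-day refresh can be as simple as
checking the cache file's mtime before re-fetching. A rough Python sketch (the
cache path is made up, and error handling is minimal):

```python
import os
import time
import urllib.request

PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"
CACHE_PATH = "/var/cache/myapp/public_suffix_list.dat"  # hypothetical path
MAX_AGE = 24 * 60 * 60  # per their guidance: fetch at most once per day

def cache_is_fresh(mtime: float, now: float, max_age: float = MAX_AGE) -> bool:
    """True if a copy fetched at `mtime` is still fresh at time `now`."""
    return (now - mtime) < max_age

def load_psl() -> str:
    # Serve from disk if the cached copy is less than a day old.
    try:
        if cache_is_fresh(os.path.getmtime(CACHE_PATH), time.time()):
            with open(CACHE_PATH, encoding="utf-8") as f:
                return f.read()
    except OSError:
        pass  # no cached copy yet
    # Otherwise fetch a fresh copy and cache it.
    with urllib.request.urlopen(PSL_URL) as resp:
        data = resp.read().decode("utf-8")
    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    with open(CACHE_PATH, "w", encoding="utf-8") as f:
        f.write(data)
    return data
```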

~~~
waldir
> Especially now that new TLDs are being added at a steady clip.

Indeed, there are some differences
([https://www.diffchecker.com/tcbbvy7p](https://www.diffchecker.com/tcbbvy7p))
with IANA's root domains list:
[http://www.iana.org/domains/root/db](http://www.iana.org/domains/root/db)
(which only includes TLDs, as the name indicates)

------
treve
A great and important list.

It also feels a bit like a hack. Would there be a better way to do this,
maybe in the DNS system itself, to denote ownership/isolation?

~~~
avian
Yes, it is a hack. I maintain the publicsuffix Python library
([https://pypi.python.org/pypi/publicsuffix/](https://pypi.python.org/pypi/publicsuffix/)).

I had many long discussions with package and application maintainers on the
topic of how to best provide data to the library. Arguably you need an up-to-
date version of the list if you want to access the web securely today.

You can bundle a static version with the library. That's nice for developers
and users because it just works after "pip install", but it will get out of
date because the library won't be updated as often as the list changes. Some
distros, like Debian, provide their own mechanism for updating the list.

You can also download behind the scenes and cache. This way the data is always
up to date, but users don't like apps calling some web service on start-up,
and some people want to use the library off-line.

If you look at different libraries and apps, each one does it differently.
Getting it right is hard. Things would be so much easier if this information
came, decentralized, from the DNS system.

------
throwanem
A domain hacker's playground.

~~~
dan1234
I was thinking of using this as the basis for some email validation as I got a
lot of incorrect email addresses in form posts.

~~~
throwanem
Freak_NL says:

> Please don't.

He is correct. Seriously. Don't.

The right way to validate an email address, if that's something you need to
do, is by sending it a message with a link and a response code, and seeing
whether the user clicks the link or enters the code. This is the only right
way. There are many wrong ways. They make people unhappy. They will make you
unhappy. Do not use them.

You cannot validate email addresses for shape and form. The sole invariant is
that there be at least three characters, one of which is @. Everything beyond
that is in the hands of the DNS and the domain's MTA. You cannot predict every
fashion in which they will behave. You should not try to.

Have you seen the regex that correctly matches every variant of email address
form described in RFC 822? It is five kilobytes long. "Ah," you may now think,
"I can use that!" You should not. There are new standards with new variations.
The regex is incomplete. It will be incomplete forever. It is a five-kilobyte
Perl regex. No one will ever understand it well enough to extend it.

Just send a link and a code. Ask the user to click on the link or give you the
code. When you have received the click or the code, you know the email is
valid. When you have not, assume it is not. This is the method that works. It
is the only method that works. Use this method and be happy. Use any other and
be sad. Which you prefer is up to you.

~~~
geofft
> _You cannot validate email addresses for shape and form. The sole invariant
> is that there be at least three characters one of which is @. Everything
> beyond that is in the hands of the DNS and the domain's MTA. You cannot
> predict every fashion in which they will behave. You should not try to._

I used to agree with this viewpoint, but now I'm having some doubts.

Suppose I enter the email address "geofft@example.net@example.com". It
validates according to your rule. Where does it go?

Suppose, furthermore, that I upgrade my servers and they start parsing it
_differently_. They're allowed to do that, right? Maybe my old MTA sent a
confirmation email to example.net, and the new one is now sending emails to a
user named "geofft@example.net" at example.com. Haven't I just done something
very wrong by allowing a user to trick me into sending emails to an
unvalidated address? Isn't this _both_ a violation of the Postel principle
("conservative in what you send to others") _and_ a security hole?

It seems like it would be better for my application to parse the email address
and validate what it's doing, and store only unambiguously-parseable
addresses.

Or maybe they're not allowed to parse it differently. Maybe there's a single
consistent way to parse geofft@example.net@example.com, and every single email
application I might use will get it right. What is this esoteric lore that
email applications know that my application cannot? If they can parse it,
can't I?

~~~
Nadya
_> Suppose I enter the email address "geofft@example.net@example.com". It
validates according to your rule. Where does it go?_

Slight word change to the rule: "...one and only one of which is @...", with
characters within quotation marks ("") not being counted. Quotations need to
be considered because this is a valid email address: [0]

`"very.(),:;<>[]\".VERY.\"very@\\\ \"very\".unusual"@strange.example.com`

 _> Suppose, furthermore, that I upgrade my servers and they start parsing it
differently. They're allowed to do that, right? Maybe my old MTA sent a
confirmation email to example.net, and the new one is now sending emails to a
user named "geofft@example.net" at example.com._

If you upgrade your servers and they parse _incorrectly_ by inserting
quotations where there were no quotations entered, then the parsing is bugged.
Although the validation was bugged to begin with for allowing two delimiters
in an email address.

Furthermore, if anyone uses an email address that is so heavily eccentric just
because it is "technically valid", I'm sure they don't expect their email to
work most of the time.

[0]
[https://en.wikipedia.org/wiki/Email_address](https://en.wikipedia.org/wiki/Email_address)

~~~
geofft
Aha. Well, now we're approaching something more complex than /@/, aren't we?
:)

I will fully agree with the claim that a regex is the wrong way to validity-
check an email, and I will easily believe that implementing the check you
describe takes 5 kilobytes of regex. But a rule like "Exactly one un-quoted @
sign" or "Exactly one un-quoted @ sign, and the string on the right needs to
be a well-formed domain name" or something is pretty simple in normal code.

"Well-formed domain name" is an existing concept: one or more labels separated
by dots, each of which contains only ASCII letters, digits, or hyphens, cannot
start or end with a hyphen, and cannot be more than 63 characters long, and no
more than 253 characters total. Again, not something I'd do with a regex, but
a small number of lines of code.
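Something like this, say in Python (hypothetical helper names; it deliberately
ignores quoted local parts and internationalized domains, per the
simplification above):

```python
import string

# Labels may contain only ASCII letters, digits, and hyphens ("LDH").
ALLOWED = set(string.ascii_letters + string.digits + "-")

def is_wellformed_domain(name: str) -> bool:
    """One or more dot-separated LDH labels, <=63 chars each, <=253 total."""
    if not name or len(name) > 253:
        return False
    for label in name.split("."):
        if not 1 <= len(label) <= 63:
            return False
        if label[0] == "-" or label[-1] == "-":
            return False
        if not set(label) <= ALLOWED:
            return False
    return True

def looks_like_address(addr: str) -> bool:
    """Exactly one @ (quoting ignored), well-formed domain on the right."""
    if addr.count("@") != 1:
        return False
    local, domain = addr.split("@")
    return bool(local) and is_wellformed_domain(domain)
```

Note that this rejects "geofft@example.net@example.com" up front rather than
leaving it to whatever the MTA of the day decides.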

~~~
Nadya
I'm not the original person you were speaking with - sorry for any confusion.

Although you have me curious how you would do that check in code without using
regex.

