
How to Download a List of All Registered Domain Names - jwcrux
http://jordan-wright.com/blog/2015/09/30/how-to-download-a-list-of-all-registered-domain-names/
======
jtwaleson
A couple of months ago I processed all the metadata from the Common Crawl project to extract every indexed domain name. That was about 10TB of metadata and resulted in 26 million domain names. EC2 costs to process it were only about $10. If anyone is interested, let me know.

edit: available as torrent here: [https://all-
certificates.s3.amazonaws.com/domainnames.gz?tor...](https://all-
certificates.s3.amazonaws.com/domainnames.gz?torrent)

~~~
jtwaleson
This was actually a fun Saturday afternoon project. I spawned a single
c4.8xlarge (10 Gbit) instance in US-EAST-1 (next to where the Common Crawl
Public Data Set lives in S3) and downloaded the 10TB, spread over 33k files,
through roughly 30 simultaneous 'curl | gunzip | grep whatever-the-common-crawl-url-prefix-
was' pipelines, getting a solid 5 Gbit transfer speed.

The bottleneck was userland CPU, so probably the gzip processes. It took about
5 hours. Cutting off the path and piping through sort and uniq took another
hour or so.
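A rough single-stream sketch of one of those pipelines in Python (the actual Common Crawl URL prefix and metadata layout are elided above, so the filter pattern and field positions here are made-up stand-ins):

```python
import gzip
import io

def extract_matches(gzipped_bytes, needle):
    """Stream-decompress gzipped metadata and keep lines containing `needle`
    (the 'curl | gunzip | grep' stage of the pipeline)."""
    out = []
    with gzip.open(io.BytesIO(gzipped_bytes), "rt") as fh:
        for line in fh:
            if needle in line:
                out.append(line.rstrip("\n"))
    return out

def unique_domains(lines):
    """The final 'cut | sort | uniq' stage: drop the path, keep the
    hostname, and deduplicate."""
    hosts = set()
    for line in lines:
        url = line.split()[-1]  # assumption: the URL is the last field
        host = url.split("/")[2] if "://" in url else url
        hosts.add(host)
    return sorted(hosts)
```

Running ~30 of these in parallel over separate file ranges is what saturates the CPU rather than the network: decompression dominates, exactly as described above.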

The only costs were the price/hour of running the EC2 instance. Network costs
were 0 as you're only transferring data to your instance.

~~~
gerner
In my experience, and I suppose it depends on the data, grep is often the
bottleneck for data pipeline tasks like the one you describe. The silver
searcher
([https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher))
is about 10x faster than grep for tasks like pulling fields out of JSON
files. It's changed my life.

pv (pipe viewer,
[http://www.ivarch.com/programs/pv.shtml](http://www.ivarch.com/programs/pv.shtml))
and top are pretty handy to measure this kind of thing. You should be able to
see exactly which process is using how much CPU, and what your throughput is.

------
nly
A warning about parsing zone files... the grammar is deceptively tricky.

While TLD registries will _probably_ provide you with files in a sane
subset[0] of that specified in RFC 1035, there are a number of things that
will _NOT_ work in general:

- Splitting the file into lines (paren-blocks and quoted strings can span
lines, strings can contain ';', etc.)

- Splitting the file on whitespace (it's significant in column 1 and inside
strings)

- Applying a regex (you'll need lookahead for conditional matching, and it'll
get ugly fast)

Don't go down the road of assuming it's a simple delimited file.
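The first pitfall is easy to demonstrate. Below is a hedged sketch (not a real zone parser) showing how naive line-splitting miscounts records in a perfectly legal zone fragment, because the SOA's paren-block spans six lines and the TXT string contains a literal ';':

```python
ZONE = """\
example.com. 3600 IN SOA ns1.example.com. admin.example.com. (
        2015093001 ; serial
        7200       ; refresh
        3600       ; retry
        1209600    ; expire
        3600 )     ; minimum
www 300 IN TXT "contains a ; semicolon"
"""

def naive_record_count(text):
    """What line-splitting gives you: every non-blank line looks like a record."""
    return sum(1 for l in text.splitlines() if l.strip())

def paren_aware_count(text):
    """Minimal sketch: a record starts only at paren-depth 0, ';' starts a
    comment except inside a quoted string. (Crude: a real parser must also
    ignore parens inside quoted strings, handle escapes, $INCLUDE, etc.)"""
    records, depth = 0, 0
    for line in text.splitlines():
        content, in_quote = [], False
        for ch in line:
            if ch == '"':
                in_quote = not in_quote
            if ch == ";" and not in_quote:
                break  # rest of line is a comment
            content.append(ch)
        content = "".join(content)
        if content.strip() and depth == 0:
            records += 1
        depth += content.count("(") - content.count(")")
    return records
```

The naive count sees seven "records" where there are really two, and that's before abbreviated owner names, `$ORIGIN`, and `$TTL` enter the picture.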

A few references:

[https://www.nlnetlabs.nl/projects/nsd/documentation.html](https://www.nlnetlabs.nl/projects/nsd/documentation.html)

[http://www.verycomputer.com/96_5ad11cc47053d8b0_1.htm](http://www.verycomputer.com/96_5ad11cc47053d8b0_1.htm)

[0] See page 9 of [https://archive.icann.org/en/topics/new-gtlds/zfa-
strategy-p...](https://archive.icann.org/en/topics/new-gtlds/zfa-strategy-
paper-12may10-en.pdf)

~~~
fanf2
Yep.

I use the BIND tool named-compilezone to canonicalize zone files, which lets
me apply simple regex parsing, because I can then assume one record per line,
all fields present, and no abbreviated names. The main disadvantage is that
it's not very fast.
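Once the zone is canonical, the regex really can be that simple. A sketch under that assumption (the sample records below are illustrative; the exact whitespace and field layout of named-compilezone output varies with BIND version and options):

```python
import re

# A canonicalized zone as a tool like named-compilezone might emit it:
# one record per line, every field present, fully-qualified owner names.
CANONICAL = """\
example.com.\t3600\tIN\tSOA\tns1.example.com. admin.example.com. 2015093001 7200 3600 1209600 3600
www.example.com.\t300\tIN\tA\t192.0.2.1
mail.example.com.\t300\tIN\tMX\t10 mx1.example.com.
"""

# owner, TTL, class, type, rdata -- safe only because the input is canonical.
RECORD = re.compile(r"^(\S+)\s+(\d+)\s+(IN)\s+(\S+)\s+(.*)$")

def parse(text):
    """Per-line regex parse of an already-canonicalized zone."""
    rows = []
    for line in text.splitlines():
        m = RECORD.match(line)
        if m:
            owner, ttl, cls, rtype, rdata = m.groups()
            rows.append((owner, int(ttl), rtype, rdata))
    return rows
```

This is the trade fanf2 describes: pay the canonicalization cost up front so the downstream tooling stays trivial.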

~~~
nly
It's probably slow because, by default, it does a bunch of DNS queries to
foreign zones to check that your NS records etc. are good.

------
irfan
I had been downloading the zone file for .PK domains on a daily basis until
they blocked zone transfers. By comparing these daily zone files I managed to
publish statistics [1] and also broke the news about hacked .PK domains [2],
which was picked up by all the leading tech blogs and news agencies.

Currently, I cannot find a way to get the zone file, even by officially
requesting it from the registry manager.

[1]: [https://www.i.com.pk/pknic-domain-registration-
statistics/](https://www.i.com.pk/pknic-domain-registration-statistics/)

[2]: [https://www.i.com.pk/110-pk-domains-managed-by-
markmonitor-g...](https://www.i.com.pk/110-pk-domains-managed-by-markmonitor-
got-hacked-by-turkish-hackers/)

------
vortico
What if someone were to maintain an unofficial list with one domain per line,
freely available as a daily torrent or served directly? Would there be a
rights problem with mirroring and filtering ICANN data?

~~~
scandinavian
A lot of TLDs don't provide zone files unless you are a registrar. They would
probably not be happy if someone put those out to the public. For .com and the
like, they would probably not care as much.

The best public list of domains I have found is the Project Sonar DNS (ANY)
scans. I don't know how they do it, but their scans are pretty complete, at
least for .dk domains, which are the ones I use.

[https://scans.io/study/sonar.fdns](https://scans.io/study/sonar.fdns)

~~~
ndr
It's quite slow to download from that site, I guess there are a lot of HN
readers consuming their bandwidth :)

~~~
scandinavian
Sadly, their download speeds are often bad; they really should provide a
torrent. But it's free and they have some really interesting datasets, so it's
worth the wait.

------
axaxs
FWIW, a TLD zone file does not contain every registered domain name, just
those with DNS records. There are typically a good number of domain names
that are registered but have no records, for reasons such as reserved names,
malicious content takedowns, etc.

~~~
larrys
Exactly. The title "How to Download a List of All Registered Domain Names" is
not correct.

For example, in .com/.net, if a registrar puts a domain "on hold", that pulls
it from the zone file, so it will not appear in any zone file download. This
is in fact a way for a company to keep a domain under wraps:

register the domain name (in .com)

pull the name servers

The name won't be in any zone file, and will be off the radar until it has
nameservers or until someone guesses to check whether it's been registered.

(Note: other registries may work differently, or the same; the above is
specific to .com and .net.)

------
zamalek
Now someone just needs to train an NN to recognize botnets and spam domains.

~~~
ech
_cough_
[https://news.ycombinator.com/item?id=10352001](https://news.ycombinator.com/item?id=10352001)
_cough_

We're in the process of getting all the zone files we can to reduce the
number of DNS requests we have to do, but the real kicker is whois databases;
for example, AFNIC asks for €10K for access to a copy of its database...

~~~
zamalek
Oh I missed that. You're doing good work there!

------
jlgaddis
Interesting, thanks for the pointer.

I've wondered about this previously, as I run my own blacklists for $work's
mail servers, and have thought about how I could slightly "penalize"
brand-new domain names, correlate "spammy" domains with certain nameservers,
and so on.

~~~
dpifke
There's a DNSBL for that:

[http://support-intelligence.com/dob/](http://support-intelligence.com/dob/)

------
ben_utzer
This list would be useful for my attempt at building a list of
parked/squatted domains.

------
canow
Would it be easier to download a list of available domains?

~~~
talideon
No. The list of available domains is the list of all possible domains less the
list of registered domains. The list of registered domains is vastly smaller
than the list of possible domains. The list of available domains would mostly
consist of junk nobody would be interested in.
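The size gap is easy to quantify. Even under conservative, simplified assumptions (labels of at most 10 characters over the 37 usable characters a-z, 0-9, and '-', ignoring the rule that labels can't begin or end with a hyphen), the space of possible names in a single TLD is tens of millions of times larger than the roughly 10^8 .com domains actually registered:

```python
def possible_labels(max_len):
    """Count possible DNS labels of length 1..max_len over 37 characters
    (a-z, 0-9, '-'); a deliberate over-simplification for scale only."""
    return sum(37 ** n for n in range(1, max_len + 1))

# Rough order of magnitude for .com registrations circa 2015 (assumption).
registered_com = 120_000_000

# possible_labels(10) is about 4.9e15: ~40 million possible names
# for every one that is actually registered.
ratio = possible_labels(10) // registered_com
```

And real labels can be up to 63 characters, so this badly understates the gap; almost everything in the "available" list would indeed be junk.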

------
mike-cardwell
What about ccTLDs?

~~~
sneg55
They're available from third-party sources such as [http://domains-
index.com/](http://domains-index.com/)

~~~
mike-cardwell
Yeah. My point was that he wrote an article titled "How to Download a List of
All Registered Domain Names" and then didn't even mention the existence of
ccTLDs. That's like writing an article titled "How to Learn to Speak Every
Language" and then pretending there are only 10 languages in existence.

------
ps4fanboy
I think it's sad how closed this data is.

~~~
talideon
There are very good reasons for this data being closed, not least of which is
that allowing zone transfers by arbitrary individuals is an excellent way of
getting your DNS server DoS'd.

~~~
ps4fanboy
How is that achieved with read only access to the list of registered domains?

~~~
talideon
I'm not sure what you mean. Do you know what a zone transfer is? If you wanted
to get a list of the domains and records published in a nameserver, you would
perform a zone transfer. Because that can amount to quite a bit of information
being transferred, if a nameserver allows unrestricted zone transfers, that's
a vector for a denial of service attack against that nameserver.

If you're a domain registry, your zone files are _huge_. Allowing arbitrary
zone transfers could put massive sustained strain on your DNS infrastructure.
And since only a very small number of nameservers _really_ need to perform
zone transfers against a registry's nameservers, registries are better off
locking down the ability.

If you're running your own nameservers, then it's still worth locking down
zone transfers for similar reasons. At the very least, it gives you a degree
of defence in depth as you're giving attackers less of an opportunity to
gather information on the structure of your network. If they could simply do a
zone transfer to find out all the names in a given zone, they wouldn't have
to do more costly brute-force enumeration to guess at the hosts in the zone.

Take a read of this for why, if you run your own nameservers, you shouldn't
allow arbitrary zone transfers: [http://www.iodigitalsec.com/dns-zone-
transfer-axfr-vulnerabi...](http://www.iodigitalsec.com/dns-zone-transfer-
axfr-vulnerability/)

