How to Download a List of All Registered Domain Names (jordan-wright.com)
170 points by jwcrux on Oct 11, 2015 | 53 comments

A couple of months ago I processed all metadata from the Common Crawl project for all indexed domain names. This was about 10TB of metadata and resulted in 26 million domain names. EC2 costs were only about $10 to process this. If anyone is interested, let me know.

edit: available as torrent here: https://all-certificates.s3.amazonaws.com/domainnames.gz?tor...

This was a fun Saturday afternoon project to do actually. I spawned a single c4.8xlarge (10 Gbit) instance in US-EAST-1 (next to where the Common Crawl Public Data Set lives in S3) and downloaded 10TB spread over 33k files in +/- 30 simultaneous 'curl | gunzip | grep whatever-the-common-crawl-url-prefix-was' pipelines and got a solid 5Gbit transfer speed.

The bottleneck was userland CPU, so probably the gzip processes. Took about 5 hours. Cutting off the path and piping through sort and uniq took another hour or so.

The only costs were the price/hour of running the EC2 instance. Network costs were 0 as you're only transferring data to your instance.
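For anyone who wants to reproduce the "cut off the path, then sort and uniq" step without shell tools, here is a minimal Python sketch. It is not the original pipeline; the function name `unique_hosts` is made up for illustration, and it assumes the input is one absolute URL (with a scheme) per line.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the sorted, de-duplicated hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        # .hostname is lowercased and excludes the port; it is None when the
        # line has no scheme://host part, in which case the line is skipped.
        host = urlsplit(url.strip()).hostname
        if host:
            hosts.add(host.rstrip("."))
    return sorted(hosts)


if __name__ == "__main__":
    sample = [
        "http://Example.com/a/b?x=1",
        "https://example.com/other",
        "http://sub.example.org:8080/",
    ]
    print(unique_hosts(sample))  # ['example.com', 'sub.example.org']
```

Holding every hostname in a set is fine for tens of millions of names on a large instance, but for bigger inputs an external sort (as in the original sort and uniq approach) is probably the safer choice.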

In my experience, and I suppose depending on the data, I've found that grep is often the bottleneck for data pipeline tasks like you describe. The silver searcher (https://github.com/ggreer/the_silver_searcher) is, in my experience, about 10x faster than grep for tasks like pulling out fields from json files. It's changed my life.

pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy to measure this kind of thing. You should be able to see exactly which process is using how much CPU, and what your throughput is.

Both your story and the torrent are fascinating. Thank you for sharing.

Synchronicity :-) Just this week I scraped and analyzed only the homepages from the Alexa top 1 million [1] and majestic million [2]. I used rqworker[3] and a fleet of mini-servers from scaleway[4] to do the scraping. Some results:

- 1.8MM successful scrapes

- 25GB total size (stored in postgres)

- 1 server to host redis and postgres

- 9 physical servers for workers (4x core ARM servers) = 36 total cores

- Peak rate of ~ 100 reqs/second achieved across all workers (36 physical cores total)

- I saw that I could oversubscribe workers to cores by a factor of 2x (72 total workers) to achieve ~75% utilization on each server

All in all it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, robust, and more portable. Next steps are to optimize the database and wrap a simple django app around it for exploring data.

Or maybe push it further and try my hand at these 26MM domains?

[1] - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

[2] - https://blog.majestic.com/development/majestic-million-csv-d...

[3] - http://python-rq.org/

[4] - http://scaleway.com
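A back-of-envelope estimate of what scaling this up would mean, using only the numbers quoted above. It assumes the ~100 reqs/second peak rate could be sustained for a whole run and that every request counts, which is optimistic; `crawl_hours` is a hypothetical helper, not part of the setup described.

```python
def crawl_hours(total_pages, requests_per_second):
    """Hours needed to fetch total_pages at a constant request rate."""
    return total_pages / requests_per_second / 3600


if __name__ == "__main__":
    # 1.8MM homepages at the quoted peak rate.
    print(crawl_hours(1_800_000, 100))             # 5.0
    # The 26MM domains from the Common Crawl list at the same rate.
    print(round(crawl_hours(26_000_000, 100), 1))  # 72.2
```

So the same fleet would need roughly three days at peak rate for the 26MM list, before accounting for failures and retries.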

Please make a torrent.

Thank you very much.

thank you for this!

just out of curiosity, what are you (and the other commenters) planning to use this for?

I myself was trying to download all X509 server certificates on the Internet I could find to:

- gain some insight into the dealings of CAs

- see if private keys were reused across multiple sites

- check the average expiry date of certificates etc.

My intention is to get a complete list of .no (Norwegian) domains, and http://norid.no refuses to give out that list to anyone.

I would like to be able to continuously check the Norwegian IP-space for compromised sites, because it would be interesting to see. Of course doing this on a bigger scale would be cool as well.

This might be useful for that: https://scans.io/study/sonar.ssl

Yes I found that some time afterwards, so I decided not to continue with it ;) I did gather about 1.4M certificates in the process though.

Any way to extract just country specific domains without processing all 10TB of it?

No, but my torrent is just the list of unique domain names and 147MB gzipped.
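Since the torrent is only the de-duplicated list, pulling out one country's names is a single pass over a 147MB file. A minimal sketch, assuming the file is gzipped text with one domain name per line (the exact layout is an assumption) and using a made-up helper name `domains_under_tld`:

```python
import gzip


def domains_under_tld(path, tld):
    """Yield names from a gzipped one-domain-per-line file that end in the given TLD."""
    suffix = "." + tld.lower().strip(".")
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            name = line.strip().lower().rstrip(".")
            if name.endswith(suffix):
                yield name


if __name__ == "__main__":
    # "domainnames.gz" is the file name from the link above; adjust as needed.
    for name in domains_under_tld("domainnames.gz", "no"):
        print(name)
```

Note that this only finds names Common Crawl happened to index, so it is not a complete list for any TLD.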

I am interested.

A warning about parsing zone files... the grammar is deceptively tricky.

While TLD registries will probably provide you with files in a sane subset[0] of that specified in RFC 1035, there are a number of things that will NOT work in general:

- Splitting the file into lines (paren-blocks and quoted strings can span lines, strings can contain ';' etc).

- Splitting the file on whitespace (it's significant in column 1 and inside strings)

- Applying a regex (you'll need lookahead for conditional matching and it'll get ugly fast)

Don't go down the road of assuming it's a simple delimited file.
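To illustrate the pitfalls listed above, here is a small character-level sketch that tracks quotes, backslash escapes, comments and parentheses before deciding where a record ends. It is not a full RFC 1035 parser (no $ORIGIN, $TTL or $INCLUDE handling, no name expansion), and `logical_records` is a hypothetical name chosen for this example.

```python
def logical_records(text):
    """Yield zone-file records with comments dropped and parenthesised
    continuation lines joined into a single string."""
    record = []         # characters of the record being built
    depth = 0           # nesting level of '(' ... ')'
    in_quotes = False   # inside a "quoted string"
    in_comment = False  # after an unquoted ';' until end of line
    escaped = False     # previous character was a backslash

    for ch in text:
        if in_comment:
            if ch != "\n":
                continue        # swallow comment text
            in_comment = False  # the newline itself is handled below
        if escaped:
            record.append(ch)
            escaped = False
        elif ch == "\\":
            record.append(ch)
            escaped = True
        elif in_quotes:
            record.append(ch)   # ';', '(' and newlines are literal here
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            record.append(ch)
            in_quotes = True
        elif ch == ";":
            in_comment = True
        elif ch == "(":
            depth += 1
            record.append(" ")
        elif ch == ")":
            depth = max(0, depth - 1)
            record.append(" ")
        elif ch == "\n":
            if depth:
                record.append(" ")  # newline inside parentheses is just whitespace
            else:
                line = "".join(record).rstrip()
                if line.strip():
                    yield line      # leading whitespace is kept: column 1 matters
                record = []
        else:
            record.append(ch)

    tail = "".join(record).rstrip()
    if tail.strip():
        yield tail


if __name__ == "__main__":
    zone = (
        "example.com. 3600 IN SOA ns1.example.com. admin.example.com. ( ; start\n"
        "    2015101101 ; serial\n"
        "    7200 3600 1209600 3600 )\n"
        'example.com. 3600 IN TXT "v=spf1; not a comment"\n'
    )
    for rec in logical_records(zone):
        print(rec.split(None, 4))
```

A naive split on newlines would turn the SOA record into three fragments and a naive comment stripper would truncate the TXT string at the ';', which is exactly the failure mode described above.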

A few references:



[0] See page 9 of https://archive.icann.org/en/topics/new-gtlds/zfa-strategy-p...


I use the BIND tool named-compilezone to canonicalize zone files, which allows me to apply simple regex parsing, because I can assume one record per line, all fields present, and no abbreviated names. Main disadvantage is it is not very fast.
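Once a zone has been canonicalized this way, the "simple regex parsing" can look roughly like the sketch below. It assumes each line has the layout `name TTL class type rdata` separated by whitespace, which is my reading of the canonical text output; the pattern and the helper `delegated_names` are illustrative, not taken from the comment.

```python
import re

# One canonical record per line: owner name, TTL, class, type, then the rest.
RECORD = re.compile(
    r"^(?P<name>\S+)\s+(?P<ttl>\d+)\s+(?P<rclass>\S+)\s+(?P<rtype>\S+)\s+(?P<rdata>.+)$"
)


def delegated_names(lines):
    """Collect the owner names of all NS records, lowercased and without the trailing dot."""
    names = set()
    for line in lines:
        match = RECORD.match(line.strip())
        if match and match.group("rtype").upper() == "NS":
            names.add(match.group("name").lower().rstrip("."))
    return names


if __name__ == "__main__":
    sample = [
        "example.com.\t172800\tIN\tNS\tns1.example.net.",
        "example.com.\t172800\tIN\tNS\tns2.example.net.",
        "ns1.example.com.\t172800\tIN\tA\t192.0.2.1",
    ]
    print(delegated_names(sample))  # {'example.com'}
```

In a real TLD zone the apex itself also carries NS records, so the TLD name will show up in the result and may need to be filtered out.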

It's probably slow because by default it does a bunch of DNS queries to foreign zones to see if your NS records etc are good.

I had been downloading the zone file for .PK domains on a daily basis until they blocked the zone transfers. Based on a comparison of these daily zone files I managed to publish the statistics [1] and also broke the news about hacked .PK domains [2], which was picked up by all leading tech blogs and news agencies.

Currently, I cannot find a way to get the zone file even by officially requesting the registry manager.

[1]: https://www.i.com.pk/pknic-domain-registration-statistics/

[2]: https://www.i.com.pk/110-pk-domains-managed-by-markmonitor-g...
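The daily comparison described above boils down to set differences between two snapshots. A minimal sketch, assuming each snapshot has already been reduced to a collection of domain names; `compare_snapshots` is a made-up name and this is not the code behind the published statistics.

```python
def compare_snapshots(previous, current):
    """Compare two collections of domain names taken on different days."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),    # newly seen names
        "removed": sorted(previous - current),  # names that disappeared
        "unchanged": len(previous & current),   # count of names in both
    }


if __name__ == "__main__":
    yesterday = ["a.pk", "b.pk", "c.pk"]
    today = ["b.pk", "c.pk", "d.pk"]
    print(compare_snapshots(yesterday, today))
    # {'added': ['d.pk'], 'removed': ['a.pk'], 'unchanged': 2}
```

Spotting changed records (for example swapped nameservers on existing names) would need the same comparison on (name, record) pairs rather than on names alone.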

What if someone were to maintain an unofficial list with one domain per line, freely available as a daily torrent or served directly? Would there be a rights problem with mirroring and filtering ICANN data?

A lot of TLDs don't provide zone files unless you are a registrar. They would probably not be happy if someone put those out to the public. For .com and the like they would probably not care as much.

The best public list of domains I have found is the Project Sonar DNS (ANY) scans. I don't know how they do it, but their scans are pretty complete, at least for .dk domains, which are the ones I use.


It's quite slow to download from that site, I guess there are a lot of HN readers consuming their bandwidth :)

Their download speeds are often bad sadly, they really should provide a torrent. But it's free and they have some really interesting datasets, so it's worth the wait.

I started looking at this a week ago. I wish it provided a RESTful API for this useful database.

Unfortunately, as part of the application you are compelled to sign forms promising that you won't make "significant" parts of the zone file publicly available in any way (at least this was my experience when applying to Verisign for .com and .net zone file access).

I see. Oh well.

Why not just upload it to Github? Then you only need to send the deltas for every update.

I was just thinking of creating a bot that will update the list(s) and push the changes back on GitHub every day.

But first, I definitely have to see if it's against GitHub's terms of service.

I think that GitHub would love the publicity personally. They strike me as being very savvy regards this sort of thing ;)

does GitHub need any publicity? honest question as I've found that anyone who would ever use the functionality GitHub provides is already very aware of git and GitHub.

people shift between services like github and bitbucket and alternatives all the time. Perhaps not often on an individual basis, but at any one time many people are deciding where to put their stuff.

Almost anything that gets the name of a particular service bumped up to the top of someone's consciousness for a little while will shift some of those decisions toward that service.

This is why even the world's most popular brands (Apple, Coke, etc) never stop spending money on marketing / PR :)

isn't it essentially public data though?

Same question here

It could be used to create a competing system. ICANN would never allow this. If someone tried to put this together, I think they would quickly find their access to the data revoked.

FWIW, a TLD zone file does not contain every registered domain name, just those with DNS records. There is typically a good amount of domain names registered but without records, for reasons such as reserved names, malicious content takedowns, etc.

Exactly. The title "How to Download a List of All Registered Domain Names" is not correct.

For example, in .com and .net, if a registrar puts a domain "on hold", that pulls it from the zone file. As such it will not appear in any zone file download. This is in fact a way for a company to keep a domain under wraps. As such:

register domain name (in .com )

pull the name servers

The name won't be in any zone file. And will be off the radar until it has nameservers or until someone guesses to see if it's been registered.

(Note: other registries may work differently or the same; the above is specific to .com and .net.)

Now someone just needs to train an NN to recognize botnets and spam domains.

cough https://news.ycombinator.com/item?id=10352001 cough

we're in the process of getting all the zone files we can to reduce the amount of DNS requests we have to do, but the real kicker is whois databases, for example AFNIC asks for 10K€ to get access to a copy of the database...

Oh I missed that. You're doing good work there!

Interesting, thanks for the pointer.

I've wondered about this previously as I run my own blacklists for $work's mail servers, thinking about how I could slightly "penalize" brand new domain names and such, correlating "spammy" domains with certain nameservers and such.

There's a DNSBL for that:


This list would be useful for my attempt at building a list of parked/squatted domains.

would it be easier to download a list of available domains?

No. The list of available domains is the list of all possible domains less the list of registered domains. The list of registered domains is vastly smaller than the list of possible domains. The list of available domains would mostly consist of junk nobody would be interested in.
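To put a rough number on "vastly smaller": the sketch below counts possible second-level labels under a single TLD, assuming only letters, digits and hyphens are allowed, with no hyphen at the start or end. It ignores registry-specific rules (minimum lengths, reserved names, the restriction on hyphens in positions 3 and 4), so treat it as an order-of-magnitude illustration with hypothetical helper names.

```python
def label_count(length):
    """Number of LDH labels of exactly this length (no leading or trailing hyphen)."""
    alnum, ldh = 36, 37  # a-z plus 0-9, and the same plus '-'
    if length == 1:
        return alnum
    return alnum * ldh ** (length - 2) * alnum


def labels_up_to(max_length):
    """Number of LDH labels of length 1 through max_length."""
    return sum(label_count(n) for n in range(1, max_length + 1))


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, labels_up_to(n))
```

Five-character labels alone already number about 65.6 million under one TLD, more than the 26 million domains in the Common Crawl list mentioned earlier in the thread, and labels can be up to 63 characters long.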

What about CCTLDs?

They are available from a third party source such as http://domains-index.com/

Yeah. My point was, he wrote an article titled "How to Download a List of All Registered Domain Names", and then didn't even mention the existence of cctlds. Which is like writing an article titled "How to learn how to speak every language", and then pretending there are only 10 languages in existence.

ccTLDs are the responsibility of "designated managers" for each sovereign country (see https://tools.ietf.org/html/rfc1591).

I think it's sad how closed this data is.

There are very good reasons for this data being closed, not least of which is that allowing zone transfer by arbitrary individuals is an excellent way of allowing your DNS server to be DOS'd.

How is that achieved with read only access to the list of registered domains?

I'm not sure what you mean. Do you know what a zone transfer is? If you wanted to get a list of the domains and records published in a nameserver, you would perform a zone transfer. Because that can amount to quite a bit of information being transferred, if a nameserver allows unrestricted zone transfers, that's a vector for a denial of service attack against that nameserver.

If you're a domain registry, your zone files are huge. Allowing arbitrary zone transfers could potentially put massive sustained strain on their DNS infrastructure. And thus because only a very small number of nameservers really need to be able to perform zone transfers against their nameservers, they're better off locking down the ability.

If you're running your own nameservers, then it's still worth locking down zone transfers for similar reasons. At the very least, it gives you a degree of defence in depth as you're giving attackers less of an opportunity to gather information on the structure of your network. If they could simply do a zone transfer to find out all the names in a given zone, then they don't have to do more costly brute force enumeration to guess at the hosts in the zone.

Take a read of this for why, if you run your own nameservers, you shouldn't allow arbitrary zone transfers: http://www.iodigitalsec.com/dns-zone-transfer-axfr-vulnerabi...
