edit: available as torrent here: https://all-certificates.s3.amazonaws.com/domainnames.gz?tor...
The bottleneck was userland CPU, so probably the gzip processes. It took about 5 hours. Cutting off the path and piping through sort and uniq took another hour or so.
The only cost was the hourly price of running the EC2 instance. Network costs were zero, as you're only transferring data into your instance.
pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy to measure this kind of thing. You should be able to see exactly which process is using how much CPU, and what your throughput is.
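For what it's worth, here is a rough Python equivalent of that decompress/cut/sort/uniq pipeline (a sketch only: it assumes each line is a hostname plus an optional path, and that the deduplicated set fits in memory, which the on-disk sort in the shell pipeline doesn't require):

    import gzip

    # Stream the gzipped list, strip the path component from each line,
    # and de-duplicate in memory.
    domains = set()
    with gzip.open("domainnames.gz", "rt") as fh:
        for line in fh:
            host = line.strip().split("/", 1)[0]  # "cut off the path"
            if host:
                domains.add(host)

    with open("domains-unique.txt", "w") as out:
        for host in sorted(domains):
            out.write(host + "\n")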
- 1.8MM successful scrapes
- 25GB total size (stored in postgres)
- 1 server to host redis and postgres
- 9 physical servers for workers (4-core ARM servers) = 36 total cores
- Peak rate of ~ 100 reqs/second achieved across all workers (36 physical cores total)
- I found I could oversubscribe workers to cores by a factor of 2x (72 total workers) to achieve ~75% utilization on each server
All in all it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, robust, and more portable. Next steps are to optimize the database and wrap a simple django app around it for exploring data.
Or maybe push it further and try my hand at these 26MM domains?
 - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
 - https://blog.majestic.com/development/majestic-million-csv-d...
 - http://python-rq.org/
 - http://scaleway.com
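To make the redis + python-rq setup above concrete, here is a minimal sketch (the scrape job, Redis host, and queue name are placeholder assumptions, not the poster's actual code):

    import requests
    from redis import Redis
    from rq import Queue

    def scrape(domain):
        # Placeholder job: fetch the homepage and return status + size.
        # Workers must be able to import this function from a shared module.
        resp = requests.get(f"http://{domain}", timeout=10)
        return domain, resp.status_code, len(resp.content)

    # Enqueue from the box running redis/postgres; workers on the ARM
    # servers pull jobs from the same Redis instance.
    q = Queue("scrapes", connection=Redis(host="10.0.0.1"))
    with open("domains.txt") as fh:
        for line in fh:
            q.enqueue(scrape, line.strip())

Each worker box then runs `rq worker scrapes`, one process per worker; the 2x-per-core oversubscription mentioned above just means starting twice as many worker processes as there are cores.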
I myself was trying to download all the X.509 server certificates I could find on the Internet, in order to:
- gain some insight into the dealings of CAs
- see if private keys were reused across multiple sites
- check the average expiry date of certificates etc.
I would like to be able to continuously check the Norwegian IP space for compromised sites, just because the results would be interesting to see. Of course, doing this on a bigger scale would be cool as well.
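For the expiry-date part, a minimal stdlib sketch (it assumes connecting directly to each host on port 443, rather than pulling certificates from CT logs or an existing scan dataset):

    import socket
    import ssl
    from datetime import datetime

    def cert_not_after(host, port=443):
        # Complete a verifying TLS handshake and read the peer certificate.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # notAfter looks like "Jun  1 12:00:00 2025 GMT"
        return datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")

    print(cert_not_after("example.org"))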
While TLD registries will probably provide you with files in a sane subset of that specified in RFC 1035, there are a number of things that will NOT work in general:
- Splitting the file into lines (paren-blocks and quoted strings can span lines, strings can contain ';', etc.)
- Splitting the file on whitespace (it's significant in column 1 and inside strings)
- Applying a regex (you'll need lookahead for conditional matching and it'll get ugly fast)
Don't go down the road of assuming it's a simple delimited file.
A few references:
 See page 9 of https://archive.icann.org/en/topics/new-gtlds/zfa-strategy-p...
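If you do have to consume raw zone files, a proper zone-file parser handles the continuation blocks, quoting, and abbreviated names for you. A minimal sketch with dnspython (just one option, assumed here rather than taken from the comment above):

    import dns.rdatatype
    import dns.zone

    # The parser deals with parentheses, quoted strings, $ORIGIN/$TTL
    # directives, and abbreviated owner names.
    zone = dns.zone.from_file("example.com.zone", origin="example.com",
                              relativize=False)
    for name, ttl, rdata in zone.iterate_rdatas(dns.rdatatype.NS):
        print(name, ttl, rdata)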
I use the BIND tool named-compilezone to canonicalize zone files, which lets me apply simple regex parsing, because I can assume one record per line, all fields present, and no abbreviated names. The main disadvantage is that it is not very fast.
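A rough sketch of that canonicalize-then-parse workflow (zone name and file paths are placeholders):

    import subprocess

    # named-compilezone rewrites the zone as one record per line, with fully
    # qualified owner names and explicit TTL/class, so naive splitting is safe.
    subprocess.run(
        ["named-compilezone", "-f", "text", "-F", "text",
         "-o", "example.com.canonical", "example.com", "example.com.zone"],
        check=True,
    )

    with open("example.com.canonical") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(";"):
                continue
            owner, ttl, rdclass, rdtype, rdata = line.split(None, 4)
            if rdtype == "NS":
                print(owner, rdata)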
Currently, I cannot find a way to get the zone file, even by officially requesting it from the registry manager.
The best public list of domains I have found is the Project Sonar DNS (ANY) scans. I don't know how they do it, but their scans are pretty complete, at least for .dk domains, which are the ones I use.
But first, I definitely have to see if it's against GitHub's terms of service.
Almost anything that gets the name of a particular service bumped up to the top of someone's consciousness for a little while will shift some of those decisions toward that service.
This is why even the world's most popular brands (Apple, Coke, etc) never stop spending money on marketing / PR :)
For example, in .com/.net, if a registrar puts a domain "on hold", that pulls it from the zone file, so it will not appear in any zone file download. This is in fact a way for a company to keep a domain under wraps:
- register the domain name (in .com)
- pull the nameservers
The name won't be in any zone file, and will be off the radar until it has nameservers or until someone guesses to see if it's been registered.
(Note: other registries may work differently (or the same); the above is specific to .com and .net.)
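One way to make that "guess" is an RDAP lookup at the registry, which reports registration status even when the name has no nameservers and therefore never appears in the zone file. A sketch (the Verisign endpoint below covers .com; this is an assumed approach, not something described above):

    import urllib.error
    import urllib.request

    def is_registered(domain):
        # Verisign's public RDAP service for .com: 200 = registered, 404 = available.
        url = f"https://rdap.verisign.com/com/v1/domain/{domain}"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False
            raise

    print(is_registered("example.com"))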
If you're a domain registry, your zone files are huge, and allowing arbitrary zone transfers could put massive, sustained strain on your DNS infrastructure. Because only a very small number of nameservers really need to perform zone transfers against yours, you're better off locking that ability down.
If you're running your own nameservers, it's still worth locking down zone transfers for similar reasons. At the very least it gives you a degree of defence in depth, because you're giving attackers less of an opportunity to gather information on the structure of your network. If they could simply do a zone transfer to find all the names in a given zone, they wouldn't have to do more costly brute-force enumeration to guess at the hosts in it.
Take a read of this for why, if you run your own nameservers, you shouldn't allow arbitrary zone transfers: http://www.iodigitalsec.com/dns-zone-transfer-axfr-vulnerabi...
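An easy self-check is to attempt a transfer against your own server. A minimal sketch with dnspython (replace the server address and zone with your own):

    import dns.query
    import dns.zone

    # A locked-down server should refuse the transfer, which surfaces here as
    # an exception; if it succeeds, every name in the zone is exposed.
    try:
        zone = dns.zone.from_xfr(dns.query.xfr("203.0.113.1", "example.org"))
        print("AXFR allowed, names exposed:", len(zone.nodes))
    except Exception as exc:
        print("AXFR refused or failed:", exc)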
We're in the process of getting all the zone files we can, to reduce the number of DNS requests we have to do, but the real kicker is whois databases: AFNIC, for example, asks for 10K€ for access to a copy of the database...
I've wondered about this previously, as I run my own blacklists for $work's mail servers: how I could slightly "penalize" brand-new domain names, and how "spammy" domains correlate with certain nameservers.
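A sketch of the "penalize brand-new domains" half of that idea, using the registration date from RDAP (the endpoint and the 30-day threshold are illustrative assumptions):

    import json
    import urllib.request
    from datetime import datetime, timezone

    def domain_age_days(domain):
        # Pull the RDAP record and find the "registration" event date.
        url = f"https://rdap.verisign.com/com/v1/domain/{domain}"
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
        for event in data.get("events", []):
            if event.get("eventAction") == "registration":
                created = datetime.fromisoformat(event["eventDate"].replace("Z", "+00:00"))
                return (datetime.now(timezone.utc) - created).days
        return None

    age = domain_age_days("example.com")
    if age is not None and age < 30:
        print("brand-new domain, add a few spam points")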