Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS (waleson.com)
257 points by jtwaleson on Jan 17, 2016 | 55 comments

I have about the same amount of data in a Postgres database as part of the TLS Observatory project [1].

    observatory=> select count(distinct(sha256_fingerprint)) from certificates;

    observatory=> select count(distinct(target)) from scans;

The scanner evaluates both certificates and ciphersuites and stores the results in the DB, so we can run complex analyses [2,3]. There is also a public client [4].

I don't have a good way to provide direct access to the database yet, but if you're a researcher, ping me directly and we can figure something out.

[1] https://github.com/mozilla/tls-observatory

[2] https://twitter.com/jvehent/status/684127067005390848

[3] https://twitter.com/jvehent/status/686938805413232640

[4] https://twitter.com/jvehent/status/687429007680376833

I have something similar, running on a Linux desktop, to analyze SSL certs. The front end is in Go, which picks certs of interest (I wanted all certs with more than one domain) and loads them into a MariaDB database. Here's the code.[1] It's amazing how much work you can do on a modern computer when you actually use it for computing.

[1] https://github.com/John-Nagle/certscan

I see you've reused the zscan DB schema. It's a good choice; we used it to inspire our schema too. We went a bit further because we also store ciphersuites (from cipherscan) and chains of trust, so our DB schema had to be more relational than what zscan uses.

That's awesome. If you are not already doing so, you can download my set from the torrent and include it in your database.


For exporting, pg_dump -F c greatly compresses the data, so cost-wise you might be able to put it on S3 and publish it as a torrent.

Exporting is one possibility, but eventually I'd like to provide read-only SQL access to the database we host. We have a few ideas on how to do this [1], but it's not implemented yet.

[1] https://github.com/mozilla/tls-observatory/issues/92

Perhaps something like a modified PostgREST could work?


The problem isn't so much exposing the data as a REST API as it is allowing for complex queries that may contain various table joins, subqueries or recursive conditions. I only skimmed through the PostgREST documentation, but it doesn't mention joining tables, which is a deal breaker for our use case.

Idea from someone just starting to learn about databases (very green :P):

- People request access and get an API key associated with a given load threshold, or don't use an API key and default to some low threshold

- Anything that SQL EXPLAIN says is over the threshold returns an error

- Successful requests' load costs and execution time (and possibly CPU, if that can be determined) count toward a usage rate limit

- An SQL parser implements the subset of SQL you deem safe and acceptable and forms a last-resort firewall

Obviously this is a complex solution; I'm curious what people's opinions are on whether this would overall be simpler or more difficult in the long run.
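For what it's worth, the EXPLAIN gate from the second bullet might look roughly like this. It is only a sketch: it assumes you have already run `EXPLAIN (FORMAT JSON)` on the submitted query and have the JSON text in hand, and `estimated_cost`, `admit` and `SAMPLE` are made-up names and data for illustration. Note that the planner's cost is an estimate in arbitrary units, not a run time, so the threshold would need tuning.

```python
import json


def estimated_cost(explain_json: str) -> float:
    """Return the planner's estimated total cost from EXPLAIN (FORMAT JSON) text."""
    plans = json.loads(explain_json)
    # PostgreSQL returns a one-element list; its "Plan" node carries "Total Cost".
    return float(plans[0]["Plan"]["Total Cost"])


def admit(explain_json: str, threshold: float) -> bool:
    """Accept the query only if its estimated cost is within the caller's threshold."""
    return estimated_cost(explain_json) <= threshold


# Hypothetical planner output for a sequential scan (values invented for the example).
SAMPLE = (
    '[{"Plan": {"Node Type": "Seq Scan", "Relation Name": "scans", '
    '"Startup Cost": 0.0, "Total Cost": 18334.0, "Plan Rows": 1000000, "Plan Width": 8}}]'
)
```

A caller with a threshold of 20000 would get `admit(SAMPLE, 20000)` accepted, while an anonymous caller with a threshold of 1000 would be refused before the query ever runs.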

> I have about the same amount of data in a Postgres database ...

I'm curious, how fast can one load data into Postgres? Is it possible to import data directly from CSV files?

> I'm curious, how fast can one load data into Postgres?

Hard to answer given the number of variables involved. pg_bulkload[0] quotes 18MB/s for parallel loading on DBT-2 (221s to load 4GB), and 12MB/s for the built-in COPY (with post-indexing, that is, first import all the data, then enable and build the indexes).

> Is it possible to import data directly from CSV files?

Yes, the COPY command[1] can probably be configured to support whatever your *SV format is. There's also pg_bulkload (which should be faster but works offline).

[0] http://ossc-db.github.io/pg_bulkload/index.html

[1] http://www.postgresql.org/docs/current/interactive/sql-copy....

18MB/s sounds rather low. It obviously rather depends on the source of data, format of data (e.g. lots of floating point columns is slower than large fields of text), and whether parallelism is used. But you can relatively easily get around 300MB/s into an unindexed table, provided you have a rather decent storage system.
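To put those rates in perspective, here is a back-of-the-envelope calculation. The 100GB dataset size is an arbitrary assumption for illustration, and 1GB is taken as 1000MB; only the 12, 18 and 300MB/s figures and the 4GB in 221s come from the comments above.

```python
def rate_mb_per_s(size_mb, seconds):
    """Throughput implied by loading `size_mb` megabytes in `seconds`."""
    return size_mb / seconds


def load_hours(size_mb, rate):
    """Hours needed to load `size_mb` megabytes at `rate` MB/s."""
    return size_mb / rate / 3600


# 4GB in 221s is consistent with the quoted figure of roughly 18MB/s.
quoted = rate_mb_per_s(4000, 221)

# Hypothetical 100GB dataset at the three rates mentioned in this subthread.
for rate in (12, 18, 300):
    print(f"{rate:>3} MB/s: {load_hours(100_000, rate):.2f} h")
```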

>Is it possible to import data directly from CSV files?

Yup! http://www.postgresql.org/docs/current/static/sql-copy.html

Our dataset is not loaded from an external source, it is generated by scanners.

But to answer your question: yes, postgres can load data from csv files: http://stackoverflow.com/questions/2987433/how-to-import-csv...

On a side note, I recently discovered https://scans.io/ where you can find pretty much all of the data that I collected as well. Might be interesting.

Censys (https://www.censys.io/) is also from them and it's a search frontend for a quick lookup in their data. It can come in real handy.

You might find the processing tips on the Project Sonar wiki useful:


Project Sonar is one of the primary contributors to scans.io. The DAP utility is handy for parsing raw x509 certificates and generating JSON output.

Sort uses only a fixed amount of memory, you can sort files larger than memory, but for such situations where you have only a few tens of millions of distinct values you can just use a python dictionary and it works even faster. While sort would shuffle data around a lot, the memory dictionary would just hold a key and a count as it gobbles the logs. It works because it is a special case of sorting where there are relatively few different values relative to the count of the whole list.

'sort | uniq' is another special case of this, and it is much better to replace that with 'sort -u'

the 'sort' in 'sort | uniq' doesn't know you are going to be throwing away all the duplicate data.

If anyone is wondering, here is an implementation of the python approach i have lying around:

  #!/usr/bin/env python2
  import sys
  from collections import defaultdict
  c = defaultdict(int)
  for line in sys.stdin:
      c[line] += 1
  top = sorted(c.items(), key=lambda (k,v): v)
  for k, v in top:
      print v, k,

Just for fun, here's a version using `Counter` from the same `collections` module which makes that blissfully simple:

    #!/usr/bin/env python2
    import sys
    from collections import Counter

    for pair in Counter(sys.stdin).most_common():
        print pair

This is even easier to do with awk.

    awk -e '!a[$0]++'
This also preserves the original input order, which is a nice property.
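A Python take on the same idea, as a minimal sketch (`dedupe_keep_order` is just a name picked for this example): keep a set of lines already seen and pass each line through only the first time it shows up.

```python
def dedupe_keep_order(lines):
    """Yield each distinct line once, in first-seen order (like awk '!a[$0]++')."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line
```

Used as a filter, it would be something like `sys.stdout.writelines(dedupe_keep_order(sys.stdin))`. As with the awk version, memory grows with the number of distinct lines, not with the size of the input.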

Don't you mean a python set? But yes, for use cases containing many duplicates where the result easily fits in memory, that is probably the fastest.

Fun fact: they are nearly the same implementation. See: http://markmail.org/message/ktzomp4uwrmnzao6

As one would generally expect, the backing store of most hashsets is little more than a hashmap with zero-sized/no values.

In fact, that's exactly how Rust's standard library hashset is implemented since rust supports zero-sized types "in userland" (and unit `()` is a ZST):

    pub struct HashSet<T, S = RandomState> {
        map: HashMap<T, (), S>
    }

a set replaces 'sort -u' or 'sort | uniq'. A dictionary replaces 'sort | uniq -c'
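Spelled out in Python, that correspondence might look like the sketch below. The function names are made up for the example, and the results are "in spirit" only: `sort` orders by locale, while Python's `sorted` compares code points.

```python
from collections import Counter


def unique_sorted(lines):
    """Roughly `sort -u`: the distinct lines, in sorted order."""
    return sorted(set(lines))


def counts_sorted(lines):
    """Roughly `sort | uniq -c`: (line, count) pairs, sorted by line."""
    return sorted(Counter(lines).items())
```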

Or `sort -u` instead of `sort | uniq`.

It will not help with data transfer pricing, but for CPU/VM time, spot instances can be amazing value for these short-lived projects: typically 1/8 of the on-demand price. Always take care not to set your bid higher than the on-demand price, as wild fluctuations can happen. Also, if you are afraid of losing your work, there is an API you can query from within the VM that tells you 2 minutes ahead that it's going to get killed. Finally, price is per ZONE, so there are zones in the same region that people use less.
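A rough sketch of polling that termination notice from inside the instance, assuming the `spot/termination-time` path of the instance metadata service and assuming the service accepts plain GET requests (instances set up to require session tokens would need an extra step). `http_get` and `termination_time` are names invented for this example; the `fetch` parameter exists only so the logic can be exercised off EC2.

```python
import urllib.error
import urllib.request

# Instance metadata path that reports the scheduled spot termination time.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def http_get(url, timeout=2.0):
    """Return (status, body) for a plain GET, mapping HTTP errors to their status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, ""


def termination_time(fetch=http_get):
    """Return the termination timestamp string, or None if none is scheduled."""
    status, body = fetch(TERMINATION_URL)
    # The endpoint answers 404 until the instance has been marked for termination.
    if status == 200 and body.strip():
        return body.strip()
    return None
```

A worker would call `termination_time()` every few seconds and checkpoint its progress as soon as it returns something other than `None`. Outside EC2 the default `http_get` will time out or raise a connection error, which this sketch does not handle.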

The key to the low cost seems to be that he needed to process 10TB. You get 10TB "data in" free, per month. Had it been 10TB more, or if he needed to run more than once a month, or if he needed to get that 10TB back out, the bill is around $920.

Edit: Inbound might be unlimited free. The calculator did show me an inbound total a few times, but I can't reproduce it now.

You may have used an out-of-date calculator, or it's for specific cases. AWS inbound traffic has been free since 2011.


Pretty sure it's unlimited.

Appears that way. Still $920 if you had wanted to extract the 10TB back out though.

Yeah... that is one of the really unfortunate lock-ins with AWS. Hopefully they will add data export to their "Snowball" product, but they don't really have a lot of incentive to.

I haven't used it, but I do see that they have both data in and data out on their Snowball pricing page [1]. Data in is free, but data out is $0.03/GB (plus $200 per job). So it would cost a minimum of $500 to use Snowball to transmit 10TB.

Still, it does appear to reduce the price of getting data out of Amazon compared to using the internet.

[1] https://aws.amazon.com/importexport/pricing
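The $500 figure above can be reproduced with a couple of lines, taking 10TB as 10,000GB. The prices are the ones quoted in the comment, not checked against current AWS pricing, and `snowball_export_cost` is just an illustrative name.

```python
def snowball_export_cost(gigabytes, per_gb=0.03, per_job=200.0):
    """Data-out cost in dollars: a per-GB charge plus a flat fee per job."""
    return gigabytes * per_gb + per_job


ten_tb = snowball_export_cost(10_000)  # 10,000 * 0.03 + 200
```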

Great write-up; really interesting that the CPU ended up being the bottleneck in this experiment! Regarding the cost of sending this data out of AWS, did you run into any issues there using rsync? IIRC rsync copies the data over TCP, so wouldn't this end up being expensive as well? Generally, though, that was my favorite part of the experiment!

My use case reduced 10TB to only a couple of GB after processing. Downloading that was very cheap.

How much did storing the data on S3 cost where you said, "However, the data is on S3" or was it there for such a transient time that it didn't cost much? Bandwidth costs in/out of S3 too?

Edit: Actually I read the S3 parts again, it sounds like the CommonCrawl project pays the S3 costs, I think, since it looks like you're using their domain data?

The results of the Common Crawl project are hosted on AWS Public Data Sets, so it's not in my account. https://aws.amazon.com/datasets/

I see. Without CommonCrawl paying for S3 (or maybe AWS eats that cost to help the public), this would be an expensive project.

Actually, on the page linked in your parent post, it says

> AWS is hosting the public data sets at no charge for the community

Can you comment on how many additional domains you mined compared to, for example, the 1M domains from the Alexa top 1M?

$ cat alexa myset myset | sort | uniq -u | wc -l


0.77M of Alexa top 1M were not in my list.

$ cat alexa alexa myset | sort | uniq -u | wc -l


I mined 25,842,205 additional domain names.
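For anyone puzzled by the `uniq -u` trick: listing one file twice guarantees each of its lines appears at least twice, so `uniq -u` drops them and only lines unique to the other file survive (assuming that other file has no internal duplicates). The same set difference in Python, as a small sketch with made-up domain names:

```python
def only_in_first(first, second):
    """Items in `first` but not in `second` (what `cat A B B | sort | uniq -u` prints)."""
    return set(first) - set(second)


# Hypothetical miniature lists standing in for the two files.
alexa = ["a.com", "b.com", "c.com"]
myset = ["b.com", "d.com", "e.com"]

missing_from_myset = only_in_first(alexa, myset)  # like: cat alexa myset myset | ...
additional_mined = only_in_first(myset, alexa)    # like: cat alexa alexa myset | ...
```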

Did you consider using the gTLD zone files (from the respective registries) and the ccTLD zone files found @ http://viewdns.info/data/? A much bigger initial dataset than 25M domains right there?

No, getting access will probably take a couple of days (or, in the case of viewdns, more than $100) and thereby take all the fun out of the project. If you know of any other way to get the list I'd be happy to hear it though!


Feel free to grab a copy of our domain list. The "All domains with NS records" is the one you want. Has 191 million in it.

Wow! That's awesome!

Amazing, thanks!

A shortcut to getting com, net, info, org, us, sk, and biz is to give premiumdrops.com $24.95/mo. You can get these for free from the TLD operators, but it takes a few weeks of snail mail (last I checked). The gTLD access via CZDAP is free, but takes a few days for approvals to process.

https://czds.icann.org/en It has been largely automated now so you can request access to the files with one click vs having to sign and email hundreds of forms. Approval seems to be automatic for most of them.

man.. the internet really is full of crappy domains...

(and yes.. now i see that you mentioned it in the article.. took me time to get there)

The Sonar FDNS set contains about 1.4 billion host names (50m+ domains). The FDNS set is seeded from TLD zones, CZDAP, PTR lookups (RDNS), SSL/TLS scans, and HTTP link extraction. It updates every two weeks: https://github.com/rapid7/sonar/wiki/Forward-DNS

Sorry, but is this golang concurrent networking pattern correct:

    func main() {
        ch := make(chan string)
        for i := 0; i < MAX; i++ {
            go fetchCert(ch)
        }
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            line := scanner.Text()
            ch <- line
        }
    }

All goroutines receive on the same channel! Instead a new goroutine should be launched for each net conn. One should be able to spawn 1000s (or 1Ms) of conns and avoid ulimits using buffered chans, waitgroups, timeouts, or counters...

This pattern is correct (but has a flaw). It is a simple worker pool. The first available worker will grab the first piece of work from the channel and process it.

If you set MAX to 1000, you will have 1000 workers — and simultaneous connections.

The flaw is that when the last piece of work gets taken from the channel, the program will end, thus the last pieces of work that at the time are being processed, will get canceled. You could mitigate this by using a second channel, that the workers will access at the end of their work, thus ensuring that it will close only when the last worker finishes its work.

The in-article version has a time.Sleep(2 * time.Second) after the scan loop. Not exactly reliable (waitgroups or channel signaling would be better) but better than nothing.

As you can see from almost all commands / snippets in the article, I took the pragmatic approach for this project ;)

Well, that whole file can be improved on (for instance: analyzeDomains and analyzeDomain can be combined into one; the range/close operators can be used on a channel), but the pattern itself (spawning a certain number of workers instead of one for each job) is decent for certain cases. For this case, you may be right (I haven't tested it), since a) not much data is transferred, and b) he was running this on an AWS instance instead of a low-end machine.

However, at a certain point, you may encounter bandwidth issues, timeouts, and the like due to local network congestion; that pattern has its uses there. I've tried writing a downloader that downloads every file it's given at once, and it went about as well as one would expect.

Thanks for the replies! Indeed, multiple goroutines can receive on a single global channel to create a simple worker pool in a fan-out configuration. Analogously, a second fan-in channel can be used to merge the parallel computation, with the caveat that the channel should be closed properly to make sure all tasks are complete.
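The fan-out/fan-in shape with a proper wait can be sketched in Python as an analog of the Go pattern, not a translation of the article's code. `run_pool` is a name invented here: one queue plays the shared work channel, a second collects results, one `None` sentinel per worker stands in for closing the channel, and `join()` stands in for a waitgroup.

```python
import queue
import threading


def run_pool(items, worker, num_workers=4):
    """Process every item with a fixed pool of threads and wait for all of them."""
    tasks = queue.Queue()    # fan-out: all workers receive from this one queue
    results = queue.Queue()  # fan-in: all workers send their results here

    def loop():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                return
            results.put(worker(item))

    threads = [threading.Thread(target=loop) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)       # one sentinel per worker, like closing the channel
    for t in threads:
        t.join()              # like WaitGroup.Wait(): in-flight work gets to finish
    return [results.get() for _ in range(results.qsize())]
```

Because the main thread joins every worker before returning, the last items being processed are not cut off when the input runs out, which is the flaw discussed above. The sketch assumes `None` is never a real work item and that `worker` does not raise.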

Inspired now to start a "go-saturate" library for measuring max net capacity...
