
Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS - jtwaleson
http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html
======
jvehent
I have about the same amount of data in a Postgres database as part of the TLS
Observatory project [1].

    
    
        observatory=> select count(distinct(sha256_fingerprint)) from certificates;
          count  
        ---------
         1239943
    
        observatory=> select count(distinct(target)) from scans;
          count  
        ---------
         6483386
    

The scanner evaluates both certificates and ciphersuites and stores the results
in the DB, so we can run complex analyses [2,3]. There is also a public
client [4].

I don't have a good way to provide direct access to the database yet, but if
you're a researcher, ping me directly and we can figure something out.

[1] [https://github.com/mozilla/tls-observatory](https://github.com/mozilla/tls-observatory)

[2] [https://twitter.com/jvehent/status/684127067005390848](https://twitter.com/jvehent/status/684127067005390848)

[3] [https://twitter.com/jvehent/status/686938805413232640](https://twitter.com/jvehent/status/686938805413232640)

[4] [https://twitter.com/jvehent/status/687429007680376833](https://twitter.com/jvehent/status/687429007680376833)

~~~
mtrn
> I have about the same amount of data in a Postgres database ...

I'm curious, how fast can one load data into Postgres? Is it possible to
import data directly from CSV files?

~~~
masklinn
> I'm curious, how fast can one load data into Postgres?

Hard to answer given the number of variables involved. pg_bulkload[0]
quotes 18MB/s for parallel loading on DBT-2 (221s to load 4GB), and 12MB/s for
the built-in COPY (with post-indexing, i.e. importing all the data first, then
enabling and building the indexes).

> Is it possible to import data directly from CSV files?

Yes, the COPY command[1] can probably be configured to support whatever your
*SV format is. There's also pg_bulkload (which should be faster but works
offline).

[0] [http://ossc-db.github.io/pg_bulkload/index.html](http://ossc-db.github.io/pg_bulkload/index.html)

[1] [http://www.postgresql.org/docs/current/interactive/sql-copy.html](http://www.postgresql.org/docs/current/interactive/sql-copy.html)
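
For illustration, a minimal sketch of that kind of CSV load from Go (the
article's language), streaming rows through the driver's COPY support. It
assumes the lib/pq driver, and the table, column, and file names are entirely
hypothetical, so treat it as an outline rather than anyone's actual loader:

    package main

    import (
        "database/sql"
        "encoding/csv"
        "io"
        "log"
        "os"

        "github.com/lib/pq"
    )

    func main() {
        db, err := sql.Open("postgres", "dbname=observatory sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        f, err := os.Open("certs.csv") // hypothetical input file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        txn, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        // pq.CopyIn wraps COPY ... FROM STDIN; table and columns are made up.
        stmt, err := txn.Prepare(pq.CopyIn("certificates", "sha256_fingerprint", "target"))
        if err != nil {
            log.Fatal(err)
        }

        r := csv.NewReader(f)
        for {
            record, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            if _, err := stmt.Exec(record[0], record[1]); err != nil {
                log.Fatal(err)
            }
        }
        if _, err := stmt.Exec(); err != nil { // flush the buffered rows
            log.Fatal(err)
        }
        stmt.Close()
        txn.Commit()
        // Per the post-indexing advice above: CREATE INDEX only after the load.
    }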

~~~
anarazel
18MB/s sounds rather low. It obviously depends on the source of the data, the
format of the data (e.g. lots of floating point columns is slower than large
text fields), and whether parallelism is used. But you can relatively easily
get around 300MB/s into an unindexed table, provided you have a reasonably
decent storage system.

------
jtwaleson
On a side note, I recently discovered [https://scans.io/](https://scans.io/)
where you can find pretty much all of the data that I collected as well. Might
be interesting.

~~~
metafex
Censys ([https://www.censys.io/](https://www.censys.io/)) is also from them;
it's a search frontend for quick lookups in their data. It can come in really
handy.

------
visarga
Sort uses only a fixed amount of memory, so you can sort files larger than
memory. But in situations like this, where there are only a few tens of
millions of distinct values, you can just use a Python dictionary, and it's
even faster. While sort would shuffle data around a lot, the in-memory
dictionary just holds a key and a count as it gobbles up the logs. It works
because this is a special case of sorting: the number of distinct values is
small relative to the length of the whole list.
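
Not the commenter's Python, but in the article's own Go the same idea is only
a few lines, with a map playing the role of the dictionary (a sketch,
equivalent to `sort | uniq -c` whenever the distinct values fit in memory):

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        counts := make(map[string]int)
        scanner := bufio.NewScanner(os.Stdin)
        // One pass over the input: no shuffling of data around,
        // just a key and a count per distinct value.
        for scanner.Scan() {
            counts[scanner.Text()]++
        }
        fmt.Println("distinct values:", len(counts))
    }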

~~~
jtwaleson
Don't you mean a Python set? But yes, for use cases with many duplicates,
where the result easily fits in memory, that is probably the fastest.

~~~
bacr
Fun fact: they are nearly the same implementation. See:
[http://markmail.org/message/ktzomp4uwrmnzao6](http://markmail.org/message/ktzomp4uwrmnzao6)

~~~
masklinn
As one would generally expect, the backing store of most hashsets is little
more than a hashmap with zero-sized/no values.

In fact, that's exactly how Rust's standard library HashSet is implemented,
since Rust supports zero-sized types "in userland" (and unit `()` is a ZST):

    
    
        pub struct HashSet<T, S = RandomState> {
            map: HashMap<T, (), S>
        }
    

[http://doc.rust-lang.org/src/std/collections/hash/set.rs.html#112-114](http://doc.rust-lang.org/src/std/collections/hash/set.rs.html#112-114)

------
jnsaff2
It will not help with data transfer pricing, but for CPU/VM time, spot
instances can be amazing value for short-lived projects like this one:
typically 1/8 of the on-demand price. Always take care not to set your bid
higher than the on-demand price, as wild fluctuations can happen. Also, if
you are afraid of losing your work, there is an API you can query from within
the VM that tells you 2 minutes ahead that it's going to get killed. And since
price is per ZONE, there are zones in the same region that people use less.
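
For instance, a sketch of that termination check from inside the instance.
The path below is the EC2 instance metadata endpoint for the spot termination
notice; treat the details as something to verify against AWS's docs rather
than a guarantee:

    package main

    import (
        "io/ioutil"
        "log"
        "net/http"
        "time"
    )

    func main() {
        // The metadata service returns 404 on this path until a termination
        // is scheduled, then a timestamp roughly 2 minutes in the future.
        const url = "http://169.254.169.254/latest/meta-data/spot/termination-time"
        for {
            resp, err := http.Get(url)
            if err == nil {
                if resp.StatusCode == http.StatusOK {
                    when, _ := ioutil.ReadAll(resp.Body)
                    resp.Body.Close()
                    log.Printf("termination at %s: checkpoint work now", when)
                    return
                }
                resp.Body.Close()
            }
            time.Sleep(5 * time.Second)
        }
    }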

------
tyingq
The key to the low cost seems to be that he needed to process exactly 10TB.
You get 10TB "data in" free per month. Had it been 10TB more, had he needed to
run more than once a month, or had he needed to get that 10TB back out, the
bill would have been around $920 (at the roughly $0.09/GB egress rate,
10,240 GB × $0.09 comes to about $920).

Edit: Inbound might be unlimited free. The calculator did show me an inbound
total a few times, but I can't reproduce it now.

~~~
jtwaleson
Pretty sure it's unlimited.

~~~
tyingq
Appears that way. Still $920 if you had wanted to extract the 10TB back out
though.

~~~
mej10
Yeah... that is one of the really unfortunate lock-ins with AWS. Hopefully
they will add data export to their "Snowball" product, but they don't really
have a lot of incentive to.

~~~
MichaelBurge
I haven't used it, but I do see that they have both data in and data out on
their Snowball pricing[1] page. Data in is free, but data out is $0.03/GB
(plus $200 per job). So it would cost a minimum of about $500 to use Snowball
to transmit 10TB (10,240 GB × $0.03 ≈ $307, plus the $200 job fee).

Still, it does appear to reduce the price of getting data out of Amazon
compared to using the internet.

[1]
[https://aws.amazon.com/importexport/pricing](https://aws.amazon.com/importexport/pricing)

------
magicmu
Great write-up; really interesting that the CPU ended up being the bottleneck
in this experiment! Regarding the cost of sending this data _out_ of AWS, did
you run into any issues there using rsync? IIRC rsync copies the data over
TCP, so wouldn't this end up being expensive as well? Generally, though, that
was my favorite part of the experiment!

~~~
jtwaleson
My use case boiled 10TB down to only a couple of GB after processing.
Downloading that was very cheap.

------
workitout
How much did storing the data on S3 cost, where you said "However, the data is
on S3"? Or was it there for such a transient time that it didn't cost much?
Were there bandwidth costs in/out of S3 too?

Edit: Actually, I read the S3 parts again; it sounds like the CommonCrawl
project pays the S3 costs, since it looks like you're using their domain data?

~~~
jtwaleson
The results of the Common Crawl project are hosted on AWS Public Data Sets, so
it's not in my account.
[https://aws.amazon.com/datasets/](https://aws.amazon.com/datasets/)

~~~
workitout
I see. Without CommonCrawl paying for S3 (or maybe AWS eats that cost to help
the public), this would be an expensive project.

~~~
xrstf
Actually, on the page linked in your parent post, it says

> AWS is hosting the public data sets at no charge for the community

------
yazriel
Can you comment on how many additional domains you mined, compared (for
example) to the 1M domains from the Alexa top 1M?

~~~
jtwaleson
Listing one file twice means none of its lines can be unique, so `uniq -u`
(print only unique lines) keeps just the lines exclusive to the other file:

    $ cat alexa myset myset | sort | uniq -u | wc -l
    773733

0.77M of the Alexa top 1M were not in my list.

    $ cat alexa alexa myset | sort | uniq -u | wc -l
    25842205

I mined 25,842,205 additional domain names.

~~~
howaboutit
Did you consider using the gTLD zone files (from the respective registries)
and the ccTLD zone files found @
[http://viewdns.info/data/](http://viewdns.info/data/)? A much bigger initial
dataset than 25M domains right there?

~~~
jtwaleson
No. Getting access would probably take a couple of days (or, in the case of
viewdns, cost more than $100) and thereby take all the fun out of the project.
If you know of any other way to get the list, I'd be happy to hear it though!

~~~
adamseabrook
[http://meanpath.com/freedirectory.html](http://meanpath.com/freedirectory.html)

Feel free to grab a copy of our domain list. The "All domains with NS records"
is the one you want. Has 191 million in it.

~~~
jtwaleson
Wow! That's awesome!

------
fitzwatermellow
Sorry, but is this Go concurrent networking pattern correct:

    
    
        func main() {
            ch := make(chan string)
            for i := 0; i < MAX; i++ {
                go fetchCert(ch)
            }
            scanner := bufio.NewScanner(os.Stdin)
            for scanner.Scan() {
                line := scanner.Text()
                ch <- line
            }
        }
    
    

All goroutines receive on the same channel! Instead a new goroutine should be
launched for each net conn. One should be able to spawn 1000s (or 1Ms) of
conns and avoid ulimits using buffered chans, waitgroups, timeouts, or
counters...

~~~
andmarios
This pattern is correct (but has a flaw). It is a simple worker pool: the
first available worker grabs the next piece of work from the channel and
processes it.

If you set MAX to 1000, you will have 1000 workers, and thus 1000 simultaneous
connections.

The flaw is that when the last piece of work is taken from the channel, the
program ends, so the pieces of work still being processed at that moment get
cancelled. You could mitigate this with a second channel that the workers
signal on when they finish their work, ensuring the program exits only after
the last worker is done (a sketch of an equivalent fix follows below).
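
A minimal sketch of such a fix, using close() and a sync.WaitGroup rather
than the second channel described above (fetchCert here is a hypothetical
stand-in for the article's function):

    package main

    import (
        "bufio"
        "log"
        "os"
        "sync"
    )

    const MAX = 1000 // number of workers, i.e. simultaneous connections

    // fetchCert is a hypothetical stand-in for the article's per-domain work.
    func fetchCert(domain string) {
        log.Println("would fetch cert for", domain)
    }

    func main() {
        ch := make(chan string)
        var wg sync.WaitGroup
        for i := 0; i < MAX; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // range exits once the channel is closed and drained
                for domain := range ch {
                    fetchCert(domain)
                }
            }()
        }
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            ch <- scanner.Text()
        }
        close(ch) // signal: no more work coming
        wg.Wait() // block until the last in-flight fetch finishes
    }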

~~~
Liru
The in-article version has a time.Sleep(2 * time.Second) after the scan loop.
Not exactly reliable (waitgroups or channel signaling would be better) but
better than nothing.

~~~
jtwaleson
As you can see from almost all commands / snippets in the article, I took the
pragmatic approach for this project ;)

