

Navigating the WARC file format - zbowling
http://commoncrawl.org/navigating-the-warc-file-format/

======
pronoiac
Wait, Common Crawl is part of Automattic? I had no idea!

~~~
teraflop
It's not. The example shows them crawling a page from
102jamzorlando.cbslocal.com, which is hosted by WordPress.com. Apparently
Automattic inserts that recruitment ad into the headers of every site they
host. (At least, all the ones I've checked so far.)

