
Triv.io donates URL index to Common Crawl - LisaG
http://commoncrawl.org/common-crawl-url-index/
======
soult
Is it just me or is the data file not available for free despite being in the
Amazon Public Dataset S3 bucket?

Edit: The problem seems to be fixed now.

~~~
srobertson
Let me double-check that for you. Were you using a valid aws-id and secret?

~~~
soult
No, usually you can download them without sending any aws-id if they are in
the Public Datasets S3 bucket, e.g.

    wget https://s3.amazonaws.com/aws-publicdatasets/common-crawl/crawl-001/2008/06/19/0/1213886083018_0.arc.gz

------
rb2k_
What I'd love to see: a simple list of domains. No information about content,
no full URLs, just the domain name.

~~~
soult
Extracting such a list from the generated index only takes a small script and
a few hours to download the 200+ GB index file. That is a lot less than the
slightly bigger script and the months or years it would previously have taken
to download and process 80+ TB of arc files to extract all domains.
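
For illustration, a minimal sketch of such a script, assuming you have
already decoded the index entries into a plain-text file with one full URL
per line (urls.txt here is hypothetical):

    # hypothetical urls.txt: one full URL per line, decoded from the index
    awk -F/ '{ print $3 }' urls.txt | sort -u > domains.txt

This just takes the host part of each URL (the third "/"-separated field of
http://host/path) and dedupes it.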

Anyway, if you want a copy of the domains from the index file, just send me
an email at the address in my profile.

~~~
thefreeman
Where is the index file located in the S3 bucket?

~~~
srobertson
s3://aws-publicdatasets/common-crawl/projects/url-index/url-index.1356128792
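
If anonymous HTTPS access works the same way as for the arc file above,
something like this should fetch it (a sketch; I'm assuming the S3 path maps
straight onto the public HTTPS endpoint):

    wget https://s3.amazonaws.com/aws-publicdatasets/common-crawl/projects/url-index/url-index.1356128792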

~~~
thefreeman
thank you kindly!

------
brianr
Nice work, triv.io!

