
102TB of New Crawl Data Available - LisaG
http://commoncrawl.org/new-crawl-data-available/
======
rwg
I really wanted to love the Common Crawl corpus. I needed an excuse to play
with EC2, I had a project idea that would benefit an open source project
(Mozilla's pdf.js), and I had an AWS gift card with $100 of value on it. But
when I actually got to work, I found the choice of Hadoop sequence files
containing JSON documents for the crawl metadata absolutely maddening and
slammed headfirst into an undocumented gotcha that ultimately killed the
project: the documents in the corpus are truncated at ~512 kilobytes.

It looks like they've fixed the first problem by switching to gzipped WARC
files, but I can't find any information about whether or not they're still
truncating documents in the archive. I guess I'll have to give it another look
and see...
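
If the crawler flags truncation in the records themselves, it's easy to
spot-check. Here's a rough sketch using the warcio library (an assumption on
my part that cut-off responses carry the standard WARC-Truncated header):

    # Rough sketch: count truncated responses in one Common Crawl WARC file.
    # Assumes cut-off records carry the standard WARC-Truncated header.
    from warcio.archiveiterator import ArchiveIterator

    truncated = total = 0
    with open('sample.warc.gz', 'rb') as stream:  # any CC-MAIN-*.warc.gz file
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            total += 1
            if record.rec_headers.get_header('WARC-Truncated'):
                truncated += 1
                print(record.rec_headers.get_header('WARC-Target-URI'))

    print(f'{truncated} of {total} responses flagged as truncated')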

~~~
Aloisius
I'd have to check the last crawl settings, but I believe the last crawl was
set to truncate at 1 MB (response body size, so that could be 1 MB
uncompressed or 1 MB compressed depending on what the source web server sent
out).

At one point I tried out a 10 MB limit, but we try to limit crawls to webpages
and few are that big; occasionally we'd hit sites on ISDN-speed connections
that would slow the whole thing down.

For the next crawl, we'll mark which pages are truncated and which aren't (an
oversight in the last crawl) so at least you can skip over them.

Also, hopefully you'll find the new metadata files to be a little clearer. We
switched over to the same format the Internet Archive uses, and it contains
quite a bit more data (xpath truncated paths for each link, for instance).
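
For anyone curious what that looks like in practice, here's a rough sketch of
pulling the per-page link data out of a WAT file with the warcio library (the
exact JSON key names shown are an assumption and may differ between crawls):

    # Sketch: read the link metadata out of a WAT file.
    # 'sample.warc.wat.gz' is a placeholder file name; key names may vary.
    import json
    from warcio.archiveiterator import ArchiveIterator

    with open('sample.warc.wat.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'metadata':
                continue
            envelope = json.loads(record.content_stream().read())['Envelope']
            html_meta = (envelope.get('Payload-Metadata', {})
                                 .get('HTTP-Response-Metadata', {})
                                 .get('HTML-Metadata', {}))
            for link in html_meta.get('Links', []):
                print(link.get('url'))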

------
boyter
I love Common Crawl, but as I commented before, I still want to see a subset
available for download, something like the top million sites. Ideally a few
tiers of data, say 50GB, 100GB, and 200GB.

I really think a subset like this would increase the value, as it would allow
people writing search engines (for fun or profit) to suck a copy down locally
and work away. It's something I would like to do for sure.

~~~
LisaG
There will be news about a subset sometime next month!

~~~
hkmurakami
would love to have even smaller subsets (like 5GB) that students can casually
play around with too, to practice and learn tools and algos :) (if it's not
too much trouble!)

~~~
Aloisius
You can fetch a single WARC file directly, for example:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz

They are around 850 MB each.

The text extracts and metadata files are generated off individual WARC files,
so it is pretty easy to get the corresponding sets of files. For the above it
would be:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wat/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wat.gz

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wet/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wet.gz
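
If you'd rather not set up Hadoop just to grab a single file, here's a
minimal download sketch with boto3 (assuming the aws-publicdatasets bucket
still allows unsigned, anonymous reads):

    # Minimal sketch: download one WARC file anonymously from the public bucket.
    # Assumes aws-publicdatasets still permits unsigned (anonymous) GETs.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    s3.download_file(
        'aws-publicdatasets',
        'common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/'
        'CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz',
        'sample.warc.gz')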

~~~
ccleve
Is there any way to get incrementals? It would be extremely valuable to get
the pages that were added/changed/deleted each day. Some kind of daily feed of
a more limited size.

~~~
froo

      s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/
    

That should get you about 90% on your way.

------
kohanz
I'm curious to hear how people are using Common Crawl data.

------
danso
Very cool...though I have to say, CC is a constant reminder that whatever you
put on the Internet will basically remain in the public eye for the perpetuity
of electronic communication. There are ways to remove your (owned) content
from archive.org and Google...but once some other independent scraper catches
it, you can't really do much about it.

~~~
bollacker
I think about this from George Santayana's perspective: "Those who cannot
remember the past are condemned to repeat it." I feel like we need our past
recorded (good, bad, AND ugly). It keeps us civil and humble.

------
rb2k_
Is there an easy way to grab JUST a list of unique domains?

That would be a great starter for all sorts of fun little weekend experiments.
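
In the meantime, a rough first pass over a single downloaded WARC file would
look something like this sketch (assuming the warcio library; the file name
is a placeholder):

    # Rough sketch: collect the unique hostnames seen in one WARC file.
    from urllib.parse import urlparse
    from warcio.archiveiterator import ArchiveIterator

    domains = set()
    with open('sample.warc.gz', 'rb') as stream:  # placeholder file name
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header('WARC-Target-URI')
            if uri:
                domains.add(urlparse(uri).netloc.lower())

    for domain in sorted(domains):
        print(domain)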

------
ma2rten
It would be great if Common Crawl (or anyone else) also released a
document-term index for its data. If you had an index, you could do a lot more
things with this data.
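
For small-scale experiments, a toy version of such an index can be built from
the WET text extracts. A sketch (assuming the warcio library; the tokenization
is deliberately naive and the file name is a placeholder):

    # Toy sketch of a document-term (inverted) index over a WET text extract.
    import re
    from collections import defaultdict
    from warcio.archiveiterator import ArchiveIterator

    index = defaultdict(set)  # term -> set of document URIs
    with open('sample.warc.wet.gz', 'rb') as stream:  # placeholder file name
        for record in ArchiveIterator(stream):
            if record.rec_type != 'conversion':  # WET text records
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8', 'replace')
            for term in set(re.findall(r'[a-z0-9]+', text.lower())):
                index[term].add(uri)

    print(len(index), 'terms indexed')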

------
ecaron
Anyone have a good understanding of the difference between this and
[http://www.dotnetdotcom.org/](http://www.dotnetdotcom.org/)? I've seen Dotbot
in my access logs more than CommonCrawl, so I'm more inclined to believe they
have a wider - but not deeper - spread.

------
recuter
Anybody want to take a guess at what percentage these 2B pages represent of
the total surface web, at least? I can't find reliable figures; the numbers
are all over the place. 5 percent?

------
GigabyteCoin
Can anyone give me a quick rundown on how exactly one gains access to all of
this data?

I have heard about this project numerous times, and am always dissuaded by the
lack of download links/torrents/information on their homepage.

Perhaps I just don't know what I'm looking at?

~~~
wpietri
Did you try this?

[http://commoncrawl.org/get-started/](http://commoncrawl.org/get-started/)

I haven't tried that one, but I've poked at others in the Amazon Common
Datasets collection:

[http://aws.amazon.com/datasets](http://aws.amazon.com/datasets)

If you're already familiar with using Amazon's virtual servers, it's pretty
straightforward.

I also note that the Common Crawl project publishes code here:

[https://github.com/commoncrawl/commoncrawl](https://github.com/commoncrawl/commoncrawl)

------
DigitalSea
I've yet to find an excuse to download some of this data to play with. I have
a feeling my ISP would send around a bunch of suits to collect the bill in
person if I ever went over my 500GB monthly limit by downloading 102TB of
data, haha. I would still like to download a subset of the data; from what
I've read, that kind of idea is already in the works. I just can't possibly
think of what I would do with it, perhaps a machine learning based project.

~~~
msoad
I'm on Comcast and download around 3TB/month with no problem. But seriously,
why would you download big data to work with locally? It's cheaper and faster
to do it in the 'cloud'!

------
sirsar
_We have switched the metadata files from JSON to WAT files. The JSON format
did not allow specifying the multiple offsets to files necessary for the WARC
upgrade and WAT files provide more detail._

Where can I read more about this?

~~~
ldng
Section "Resources" of the post you haven't read ?

~~~
sirsar
No, I mean the difference between the filetypes.

------
iamtechaddict
Is there a way we can access the data (a small subset, say 30-40 GB) without
having an AWS account (it requires a credit card, and as a student I don't
have one)?

~~~
wodow
Some of the older data (2009) is available on archive.org:
[https://archive.org/details/commoncrawl](https://archive.org/details/commoncrawl)

~~~
iamtechaddict
Thanks a lot. It'll be very helpful, I'm sure.

------
kordless
Ah, distributed crawling. What a great idea. :)

------
csmuk
Well that would take 3.5 years to download on my Internet connection!

------
manismku
That's great and cool stuff.

