It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...
At one point I tried out a 10 MB limit. The thing is, we try to limit crawls to webpages, and few are that big, but occasionally we'd hit sites on ISDN-speed connections that would slow down the whole thing.
For the next crawl, we'll mark which pages are truncated and which aren't (an oversight in the last crawl) so at least you can skip over them.
Also, hopefully you'll find the new metadata files to be a little clearer. We switched over to the same format the Internet Archive uses, and it contains quite a bit more data (truncated XPath paths for each link, for instance).
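For reference, the WARC spec already has a WARC-Truncated header for exactly this, and if we go that route, skipping the cut-off records from Python (e.g. with the warcio library) would look roughly like this (the filename is just a placeholder):

    from warcio.archiveiterator import ArchiveIterator

    # Rough sketch: iterate a gzipped WARC and skip records flagged as truncated.
    # Assumes truncation ends up marked with the standard WARC-Truncated header.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            if record.rec_headers.get_header('WARC-Truncated'):
                continue  # payload was cut off mid-fetch, skip it
            url = record.rec_headers.get_header('WARC-Target-URI')
            body = record.content_stream().read()
            # ... process the complete document here ...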
I really think a subset like this would increase the value, as it would allow people writing search engines (for fun or profit) to pull a copy down locally and work away. It's something I would like to do for sure.
While it's nice to have generalist search engines, it would be even better to be able to unbundle the generalist search engines completely. Verticals such as the following would be nice:
1) Everything Linux, Unix, or both
2) Everything open-source
3) Only news & current events
4) Popular culture globally and by country
5) Politics globally and by country
6) Everything software engineering
7) Everything hardware engineering
8) Everything maker community
9) Everything financial markets
10) Everything medicine / health (sans obvious quackery)
Maybe make a tool that lets the community create subset-creation recipes, small scripts that parse out data of a certain type, which the community then forks and improves over time (a rough sketch of what a recipe could look like is below).
The ship for creating a generalist search engine has sailed, but specialist search engines are total greenfield.
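To make the recipe idea concrete, it could be as small as a filter function over the text extracts. Rough Python sketch (using the warcio library to read the WET text-extract files; the keyword list and vertical are made up just to show the shape):

    from warcio.archiveiterator import ArchiveIterator

    # Made-up recipe for a "Linux/Unix" vertical: keep a page when its URL or
    # extracted text mentions any of these terms. A real recipe would be much
    # smarter, but the point is that it's small enough to fork and improve.
    KEYWORDS = ('linux', 'unix', 'bsd', 'posix', 'kernel')

    def wanted(url, text):
        haystack = (url + ' ' + text).lower()
        return any(k in haystack for k in KEYWORDS)

    def run_recipe(wet_path, out):
        # WET files store the plain-text extracts as 'conversion' records.
        with open(wet_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'conversion':
                    continue
                url = record.rec_headers.get_header('WARC-Target-URI') or ''
                text = record.content_stream().read().decode('utf-8', 'replace')
                if wanted(url, text):
                    out.write(url + '\n')  # or copy the whole record into the subset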
Seriously - they give you an easy way to create these subsets yourself. That is a much better solution than them trying to anticipate the exact needs of every potential client.
There is definitely a benefit in using the community to identify valuable subsets and then individually putting your energy toward building discovery/search products around a given subset.
They are around 850 MB each.
The text extracts and metadata files are generated from individual WARC files, so it is pretty easy to get the corresponding sets of files. For the above it would be:
s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/
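Once you have the key for a WARC file, the matching metadata (WAT) and text-extract (WET) keys should be just a rename away, something like this (rough Python sketch; the segment id and filename in the example are made up):

    def companion_keys(warc_key):
        # Derive the WAT and WET keys from a WARC key, assuming the wat/ and
        # wet/ directories mirror warc/ and the filenames gain a .wat / .wet suffix.
        if not warc_key.endswith('.warc.gz'):
            raise ValueError('expected a .warc.gz key')
        wat = warc_key.replace('/warc/', '/wat/').replace('.warc.gz', '.warc.wat.gz')
        wet = warc_key.replace('/warc/', '/wet/').replace('.warc.gz', '.warc.wet.gz')
        return wat, wet

    # Made-up segment id and filename, purely to show the shape of the mapping:
    warc = ('common-crawl/crawl-data/CC-MAIN-2013-20/segments/1234567890123/'
            'warc/CC-MAIN-20130520000000-00000-ip-10-0-0-1.ec2.internal.warc.gz')
    print(companion_keys(warc))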
That would be a great starter for all sorts of fun little weekend experiments.
I have heard about this project numerous times, and am always dissuaded by the lack of download links/torrents/information on their homepage.
Perhaps I just don't know what I'm looking at?
I haven't tried that one, but I've poked at others in the Amazon Common Datasets collection:
If you're already familiar with using Amazon's virtual servers, it's pretty straightforward.
I also note that the Common Crawl project publishes code here:
Where can I read more about this?