

A Look Inside Our 210TB 2012 Web Corpus - LisaG
http://commoncrawl.org/a-look-inside-common-crawls-210tb-2012-web-corpus/

======
mark_l_watson
Check out the Common Crawl contest winning projects from the linked page -
some very good work, and a good source of ideas and techniques:
[http://commoncrawl.org/the-winners-of-the-norvig-web-data-
sc...](http://commoncrawl.org/the-winners-of-the-norvig-web-data-science-
award/)

Some good stuff!

~~~
wicknicks
I loved the inter-lingual web page linkage visualization project. Any idea why
Traitor won the contest? It seems very similar to the standard "build an
inverted index with MapReduce" problem, or am I missing something?
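(For context on what that pattern involves: the map phase emits (term, doc_id)
pairs and the reduce phase merges them into per-term postings lists. A toy
Python sketch with made-up documents, just to illustrate the shape of it:)

    # Toy inverted-index MapReduce: emit (term, doc_id) pairs, then merge them
    # into per-term postings lists. Illustrative only; the documents are made up.
    from collections import defaultdict

    def map_phase(docs):
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):
                yield term, doc_id

    def reduce_phase(pairs):
        postings = defaultdict(list)
        for term, doc_id in pairs:
            postings[term].append(doc_id)
        return postings

    docs = {"d1": "common crawl web corpus", "d2": "web search engine"}
    print(dict(reduce_phase(map_phase(docs))))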

~~~
mark_l_watson
Perhaps Traitor won because it is such a good example of using MapReduce over
the Common Crawl data? I agree that the inter-lingual one was a cool project.

------
Aloisius
Link to the PDF mentioned:
[https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-
iZN...](https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-
iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit)

------
sylvinus
Common Crawl is awesome. I wonder how complex it would be to run a Google-like
frontend on top of it, and how good the results would be after a couple days
of hacking...

~~~
boyter
Very, and probably not very good (compare Gigablast to Google as an example of
why it's hard). Not to take anything away from Common Crawl, but crawling is
often one of the easier things to build when creating a search engine. A
crawler can be as simple as

while (listofurls not empty) { url = next(listofurls); fetch(url); add discovered urls to listofurls; }

Doing it on a large scale, over and over, is a harder problem (which Common
Crawl does for you), but it's not too difficult until you hit scale or want
real-time crawling.
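To make that concrete, here is a toy breadth-first version of that loop in
Python (assuming the requests library; it deliberately ignores robots.txt,
politeness delays, URL normalisation and everything else that makes real
crawling hard):

    # Toy breadth-first crawler, illustrative only.
    import re
    from collections import deque
    import requests

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)
        seen = set(seed_urls)
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            fetched += 1
            # naive link extraction; a real crawler would parse the HTML properly
            for link in re.findall(r'href="(https?://[^"]+)"', resp.text):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen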

Building an index on 210 TB of data, however... Assuming you use
Sphinx/Solr/Gigablast, you are going to need about 50 machines to deal with
this amount of data with any sort of redundancy. That's just to hold a basic
index, not including "pagerank" or anything (Gigablast is a web engine, so it
might have that built in, not sure). That also isn't factoring in the rankers
you need to make it a web search engine, spam/porn detection, and all of the
other stuff that goes with it. Then you get into serving results: unless your
indexes are in RAM you are going to have a pretty slow search engine, so add a
lot more machines to hold the index for common terms in memory.
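As a rough back-of-envelope (every ratio below is an assumption for
illustration, not a measurement):

    # Rough index-sizing sketch; all ratios are assumed, not measured.
    raw_tb = 210                  # raw size of the 2012 corpus
    index_ratio = 0.25            # assume the inverted index is ~25% of raw size
    replication = 2               # two copies for redundancy
    index_tb = raw_tb * index_ratio * replication      # ~105 TB of index

    disk_per_node_tb = 2          # assumed usable index disk per node
    nodes_on_disk = index_tb / disk_per_node_tb        # ~50 nodes, disk-resident

    ram_per_node_gb = 128         # assumed RAM per node
    nodes_in_ram = index_tb * 1024 / ram_per_node_gb   # far more nodes to hold it all in RAM
    print(nodes_on_disk, nodes_in_ram)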

If someone is keen to do this, however, here is a list of articles/blogs which
should get you started (I wrote this originally as an HN comment which got a
lot of attention, so I made it into a blog post):
[http://www.boyter.org/2013/01/want-to-write-a-search-engine-...](http://www.boyter.org/2013/01/want-to-write-a-search-engine-have-some-links/)

~~~
asgard1024
Actually, not so simple. Sure, you can do simple crawling easily, but the hard
part is extracting meaningful data from it. It's very easy to get stuck in
loops on many sites, for instance. Protocol violations abound - some sites
serve binaries as text/html.

What I heard about a smaller search engine was that web crawling is usually
augmented with manually added rules for various sites to prevent spoiling the
database. Not a trivial task at all.

Doing queries is IMHO algorithmically much better understood, because it's a
constrained problem. But extracting information from the real world, with all
the PHP and HTML "hackers" out there, is not so easy.
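As a small example of the kind of defensiveness required, here is a fetcher
sketch that refuses to trust the declared type blindly (the size cap and
NUL-byte check are illustrative heuristics, not a standard recipe):

    # Defensive fetch: verify what actually came back before treating it as HTML.
    import requests

    def fetch_html(url, max_bytes=2_000_000):
        resp = requests.get(url, timeout=10, stream=True)
        ctype = resp.headers.get("Content-Type", "").lower()
        if "html" not in ctype:
            return None                   # declared as something other than HTML
        body = resp.raw.read(max_bytes + 1, decode_content=True)
        if len(body) > max_bytes:
            return None                   # suspiciously large for a single page
        if b"\x00" in body[:1024]:
            return None                   # NUL bytes: likely a binary served as text/html
        return body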

~~~
Sven7
Which is one of the main reasons Google has no serious competition in search,
except possibly in China.

It is also why the rate of innovation in search isn't moving as fast as it
could be.

If Google opened up (unlimited) web API access to their search interface to,
say, a large city for a year or two, people would really get a taste of what
innovation in search looks like.

And of course it would be in Google's interest, because search as a platform
or marketplace is where the future of Google really lies. All the other
advertising-empire-defending distractions like Android, Chrome and YouTube are
really sideshows.

------
rgrieselhuber
Is there something, other than funding, preventing a more regular, open-
sourced crawl of the web?

~~~
LisaG
Limited resources are the only reason. We are working on a subset crawl of ~3
million pages that will be published weekly starting two weeks from now. But
doing the full crawl takes a lot of time, effort and money.

~~~
boyter
Is that really worth it, though? I can crawl 3 million pages in less than 24
hours without any real effort on my part. Or are you going to provide 3
million of the most useful pages? Depth-first or breadth-first crawl?

~~~
LisaG
We do think it is worth it to avoid duplicative efforts.

Suppose you crawl 3 million pages and you pay for the compute and storage
costs. Then the next person who wants crawl data goes through the same effort
and pays the same costs. Doesn't it make much more sense to have a common pool
of open data that everyone can use? Even if the effort and costs are low, they
are not zero.

For the smaller frequent crawl, we are working with Mozilla and we will do the
top pages (top according to Alexa).

~~~
boyter
Fair point, and it makes sense. If you publish the rank along with the data
itself, that would be very useful. Perhaps have a few sets of data? 3 million
top pages, 3 million deep pages, etc...

Personally I would like to see around 20-100 million pages or whatever is
about 500-1000GB. That's enough data to work with on a local machine and serve
up some meaningful results assuming you want to build a search engine or just
do some deep analysis of the web.

------
spimmy
What do you mean by "open"? Can the data be used for startups and other
commercial purposes?

~~~
Aloisius
Yes! Startups, commercial companies, etc. can all use the data for free. The
terms of use basically say don't do anything illegal with it, plus a few other
things, but they shouldn't affect the vast majority of uses.

Actually, a video on a startup that uses Common Crawl data is getting posted
tomorrow.

------
natch
How does one get set up to access the s3:// links their blog posts reference?
I do realize these point to Amazon S3 buckets, but how to get at them?

~~~
WestCoastJustin
Just replace 's3://' with 'https://s3.amazonaws.com/'. You can use this link
[1], but it looks like most of them are returning "Access Denied", so you
would likely need to log in with your AWS credentials to access them.

[1] [https://s3.amazonaws.com/aws-
publicdatasets/](https://s3.amazonaws.com/aws-publicdatasets/)
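If you'd rather use the S3 API than plain HTTPS, a boto3 sketch along these
lines should work once credentials are configured (the bucket name comes from
the link above; the prefix is a guess and will vary by crawl):

    # List objects in the public-datasets bucket via the S3 API.
    # Requires configured AWS credentials; the prefix is illustrative.
    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="aws-publicdatasets",
                              Prefix="common-crawl/", MaxKeys=20)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # Download a single object once you know its key:
    # s3.download_file("aws-publicdatasets", "common-crawl/<key>", "local-file")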

------
danso
The tables of TLD frequency on page 4 of the stats report are interesting,
though they leave me a bit confused about how the crawler actually crawls and
when it stops: [https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-
iZN...](https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-
iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/view?sle=true)

Table 2a purports to show the frequency of SLDs:

  Rank  Second-level domain      Pages        Fraction
     1  youtube.com              95,866,041   0.0250
     2  blogspot.com             45,738,134   0.0119
     3  tumblr.com               30,135,714   0.0079
     4  flickr.com                9,942,237   0.0026
     5  amazon.com                6,470,283   0.0017
     6  google.com                2,782,762   0.0007
     7  thefreedictionary.com     2,183,753   0.0006
     8  tripod.com                1,874,452   0.0005
     9  hotels.com                1,733,778   0.0005
    10  flightaware.com           1,280,875   0.0003

If I'm reading this correctly, it seems that the crawler managed to hit a huge
number of YouTube video pages... but only a fraction of them. I couldn't find
a total YouTube video count, but YouTube's own stats page says 200 million
videos alone have been tagged with Content ID (identified as belonging to
movie/TV studios).

In any case, it's surprising not to see Wikipedia on there. English Wikipedia
has 4+ million articles, so it should rank ahead of thefreedictionary.com.

~~~
wicknicks
Good crawlers typically avoid Wikipedia links, to cut down the number of HTTP
requests hitting the wiki servers (and keep costs down), especially because
Wikipedia makes whole database dumps available for download through a
separate, cheaper channel:
[http://en.wikipedia.org/wiki/Wikipedia:Database_download](http://en.wikipedia.org/wiki/Wikipedia:Database_download)

~~~
gojomo
Yes and no.

Some crawlers are most interested in the freshest versions of the most-linked
articles, or in the exact HTML presentation at Wikipedia.

The monthly full raw wikitext dumps don't provide that.

And, Wikipedia's serving plant is pretty efficient, with bandwidth only being
a small portion of their costs. They can afford some crawling... and
correspondingly, their /robots.txt is pretty open.

Good crawlers seeking just the bulk text shouldn't try to grab the whole thing
as fast as possible via the standard web URLs... but other good crawlers may
want or need to visit discovered Wikipedia links, and doing so at a measured
pace should be OK.
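A measured pace can be read straight out of robots.txt; here is a minimal
sketch using Python's standard urllib.robotparser (the user agent and URLs are
placeholders):

    # Check robots.txt and honour any Crawl-delay before fetching.
    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    agent = "MyResearchCrawler"
    page = "https://en.wikipedia.org/wiki/Web_crawler"
    if rp.can_fetch(agent, page):
        delay = rp.crawl_delay(agent) or 1   # fall back to a one-second delay
        time.sleep(delay)
        # ... fetch the page here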

~~~
greglindahl
blekko attempted to implement crawling a local copy, and it was a PITA. We'd
rather crawl the real thing with a crawl-delay of 1. Best would be if the
Wikimedia Foundation made a .html dump available.

