
Using Common Crawl to play Family Feud - fulmicoton
https://fulmicoton.com/posts/commoncrawl/
======
rspeer
In my work with the Common Crawl, I agree with one thing the author states
uncertainly: just download it.

Everyone tells you that you should compute statistics on the Common Crawl by
running a distributed task across a lot of AWS machines, leaving the data out
there on S3 and moving your code to it. But this only seems to be cost-
effective if your development time is worth nothing and if you get it right
the first time (hahaha of course you won't).

As long as you pay a reasonable rate for bandwidth, you should just buy a 5TB
hard drive and download it. Now you have a lot less system-wrangling to do and
you don't have to pay Amazon for your mistakes.

------
known
147,640,618 domain names can be downloaded from
[https://www.verisign.com/en_US/channel-resources/domain-regi...](https://www.verisign.com/en_US/channel-resources/domain-registry-products/zone-file/index.xhtml)

------
auvi
> The web contains hundreds of trillions of webpages, and most of it is
> unindexed.

Any reference for this number?

~~~
fulmicoton
Google's "howsearchworks":
[https://www.google.com/intl/gl/insidesearch/howsearchworks/t...](https://www.google.com/intl/gl/insidesearch/howsearchworks/thestory/index.html)

