

Ask HN: Google web corpus + big data analysis cluster? - pasteurquadrant

I'm thinking about starting a developer-friendly service that frequently crawls the web at massive scale and pairs the crawl with a big data analysis cluster (such as Storm) that supports a variety of languages.

Obviously walled gardens like Facebook and Twitter would be off limits at the beginning, but if the service gains traction, it's possible that companies would want to be crawled.

I was trying to analyze the Common Crawl recently, and the process of getting set up is nontrivial. This service would let more people analyze the web more easily.

I would welcome any feedback.
======
cblock811
Some quick questions:

1) How frequently would you crawl?

2) What do you define as 'massive scale'? I'm sure it will grow, but what is
a reasonable range you're working with?

3) Which languages will you be supporting?

