Hacker News | pina's comments

I would use this for my people search application; it would bring costs down significantly. The economics work out the same either way: if you want to be comprehensive, you have to crawl the entire web to find the documents your vertical search application is interested in. You could restrict yourself to a whitelist of sites, but then you would no longer be comprehensive.

So building a good vertical search engine is really hard: you first have to crawl the entire web to find documents that look like the ones you are interested in. At 100 billion documents of 20 KB each:

Bandwidth: 100 billion documents * 20 KB each = roughly 1.8 PB. Downloading that at wire speed on a gigabit link (about 100 MB/s effective) would take over 230 days. And of course processing, analysis, etc. would take additional compute time on top of that.

It would cost over $1M a month to process all web data on AWS. This assumes we go at full wire speed and ignore politeness entirely.
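A quick sanity check on those numbers (the 20 KiB average page size and ~100 MB/s effective gigabit throughput are my assumptions, read off the figures above):

```python
# Back-of-the-envelope check of the crawl estimate:
# 100 billion documents at ~20 KiB each, downloaded at the
# effective throughput of a saturated gigabit link (~100 MB/s).
DOCS = 100e9
PAGE_BYTES = 20 * 1024        # assumed average page size
THROUGHPUT = 100e6            # bytes/sec, effective gigabit wire speed

total_bytes = DOCS * PAGE_BYTES
pebibytes = total_bytes / 1024**5
days = total_bytes / THROUGHPUT / 86400

print(f"corpus: {pebibytes:.1f} PiB, download time: {days:.0f} days")
# corpus: 1.8 PiB, download time: 237 days
```

So the "over 230 days" figure holds even before adding politeness delays, retries, or any per-document processing.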

If Yahoo were to open its crawl, we could simply write our specific applications on top of it, without needing to download the entire web onto our own servers.


You might want to check out Nutch / Wikia Search ( http://re.search.wikia.com/about/get_involved.html ). They let you download their copy of the web ( http://search.isc.org/download/ ). It's not exactly Yahoo quality, but it might be enough to get started on some smaller-scale projects.


