

Ask HN: Good service provider for running CPU intensive algorithms - digamber_kamat

We have an algorithm that we need to run for a couple of months. It is a very CPU-intensive task. You might ask what kind of algorithm it is. Let us assume that we want to index the whole internet; it's something like that.

Can anyone suggest a service provider who can give us the infrastructure to run this? Is cloud-based infrastructure a good option?
======
messel
DeWitt Clinton mentioned some techniques for building something on the order
of an effective web crawler (internet indexing), but I'm not sure what its
limiting factors were (CPU, memory, number of threads/bots, asynchronous
bots).

here's the link:
<http://www.google.com/buzz/dclinton/KuXDg9P8Q8z/Jesse-Stay-A-few-points-of-clarification-to-your>

check DeWitt's first comment:

DeWitt Clinton - I'm not even sure how to respond, mostly because I can't even
guess how you came up with the number 3, or how you can say anyone "owns" the
web.

There are currently hundreds, no, thousands, of companies, universities, and
individuals that have indexed large portions of the web_. Getting the basics
isn't even that hard. No exaggeration: you could build a perfectly suitable
crawl for an sgapi of your own in a few weeks.
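
To make "the basics aren't hard" concrete, here's a bare-bones breadth-first
crawler sketch in Python. The seed URL and page cap are placeholders, and a
real crawl would also need robots.txt handling, politeness delays, and
persistent storage:

    # Bare-bones breadth-first crawler; illustrative only.
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=100):
        seen = {seed}
        frontier = deque([seed])
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            try:
                page = urllib.request.urlopen(url, timeout=10)
                html = page.read().decode("utf-8", "replace")
            except Exception:
                continue  # dead link, timeout, non-HTML, etc.
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:
                absolute = urljoin(url, href)
                if (urlparse(absolute).scheme in ("http", "https")
                        and absolute not in seen):
                    seen.add(absolute)
                    frontier.append(absolute)
            yield url

    if __name__ == "__main__":
        for page_url in crawl("http://example.com/"):
            print(page_url)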

The hardest bit is probably the node mapping, but fortunately, that's open
source:

<http://code.google.com/p/google-sgnodemapper/>
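
The core idea of the node mapper is easy to sketch even though the real rules
table is large: collapse the many spellings of one profile URL into a single
canonical node ID. Here's a toy Python analogue; the actual sgnodemapper is a
JavaScript library with far more per-site rules, and the sites listed below
are made up for illustration (the sgn:// form follows the Social Graph API
convention):

    # Toy node mapper: collapse profile-URL variants to one node ID.
    # The real google-sgnodemapper is a JavaScript library with large
    # per-site rules; the rules table below is made up for illustration.
    from urllib.parse import urlparse

    KNOWN_SITES = {"twitter.com", "flickr.com", "digg.com"}  # hypothetical subset

    def to_node_id(url):
        parts = urlparse(url.lower())
        host = parts.netloc
        if host.startswith("www."):
            host = host[4:]
        if host not in KNOWN_SITES:
            return None
        ident = parts.path.strip("/").split("/")[0]
        if not ident:
            return None
        # sgn:// is the Social Graph API's canonical node URI scheme.
        return "sgn://%s/?ident=%s" % (host, ident)

    # Both spellings map to sgn://twitter.com/?ident=bradfitz
    assert to_node_id("http://www.twitter.com/bradfitz/") == \
           to_node_id("https://twitter.com/BradFitz")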

Even high quality crawlers are available under open source licences. Nutch,
for example:

<http://lucene.apache.org/nutch/>

And I highly recommend this book on the fundamentals of web crawl and
indexing:

<http://www.amazon.com/gp/product/0136072240/>

Storing the node graph can be done on any number of open source or commercial
backends. If I wanted to do it myself, I'd probably host it all on EC2 at
Amazon. The cost wouldn't even be prohibitive, depending on how deep and broad
you want the crawl to go, and for sgapi stuff, you don't need to go that big
to be useful.
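
For the EC2 route, spinning up a batch of high-CPU workers is only a few
lines with the boto library. A minimal sketch, assuming your crawler is baked
into an AMI; the image ID, key pair name, and instance count are
placeholders:

    # Sketch: launch a few high-CPU EC2 workers with the boto library.
    # The AMI ID and key pair name are placeholders; c1.xlarge is
    # Amazon's "high-CPU" instance type.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")  # creds from environment

    reservation = conn.run_instances(
        "ami-00000000",          # placeholder: image with your crawler baked in
        min_count=4,
        max_count=4,
        key_name="crawler-key",  # placeholder key pair
        instance_type="c1.xlarge",
    )
    for instance in reservation.instances:
        print(instance.id, instance.state)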

_No one has ever indexed, or ever will index, the whole web

~~~
digamber_kamat
That was a really useful reply, messel. In our case, however, the crawler
code is already written; the only thing we need is powerful infrastructure to
run it on.

We certainly considered all the factors you mention while implementing our
crawler. Unlike typical crawlers, though, ours has a purely research-oriented
goal, quite different from what most people have done.

------
hapless
BlueLock vCloud Express offers 4- and 8-CPU VMs in a self-service "cloud,"
much like EC2. The big VMs are expensive, but still cheaper than buying 8x
"high-CPU" Amazon instances. Also, memory is not tied to CPU -- you can have
an 8-way VM with 256MB of RAM, if you really want it.

<http://www.bluelock.com/bluelock-cloud-hosting/bluelock-vcloud-express/>

