

Collaborative Map-Reduce in the browser - igrigorik
http://www.igvita.com/2009/03/03/collaborative-map-reduce-in-the-browser/

======
timf
" _how hard would it be to assemble a million people to contribute a fraction
of their compute time?_ "

The BOINC project has done it; they've seen 1 million+ computers. And they even
have an installation barrier, which is different from what you're suggesting
(their software is robust and easy to install, but you still have to do it).

One thing BOINC and BOINC projects do well is establish non-monetary
incentives, whether it be competitions, fancy graphs, etc. That's something to
solve; I'm not sure enlisting just your social network (manually, with a URL)
is going to be enough to cut it if you want thousands of participants (unless
you are particularly "influential", I guess).

Or maybe this is something a legion of mechanical turkers would be interested
in?

~~~
hendler
In 2001, when I was traveling in Asia, I installed SETI@home (now a BOINC
project) on every computer I could in internet cafés.

Google Gears is also a pretty good way to do this yourself, since there's a DB
and a background process - although full apps in the client change the
security model, as with Google Native Client:

<http://code.google.com/contests/nativeclient-security/>
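
A minimal sketch of the Gears background-process piece, as I remember the old
WorkerPool API (Gears is long deprecated, so treat the exact signatures as an
approximation):

    // Parent page: spin up a Gears worker and hand it a toy job.
    // (API shape recalled from the old Gears WorkerPool docs.)
    var pool = google.gears.factory.create('beta.workerpool');

    pool.onmessage = function (messageText, senderId) {
      console.log('worker replied: ' + messageText);
    };

    // createWorker takes the worker's full source as a string.
    var workerId = pool.createWorker(
      'var wp = google.gears.workerPool;' +
      'wp.onmessage = function (messageText, senderId) {' +
      '  var n = parseInt(messageText, 10);' +
      '  wp.sendMessage(String(n * n), senderId);' +  // toy "map" step
      '};'
    );

    pool.sendMessage('7', workerId); // logs: worker replied: 49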

Nice post. Oh, and I like the rainbow butts iconography. :)

------
ryanwaggoner
I knew I'd seen something like this before...

<http://www.pluraprocessing.com>

Launched on HN (where else) a few months ago:

<http://news.ycombinator.com/item?id=347359>

------
lecha
Here are some more "business ideas" for your enjoyment:

- Buy tons of those fancy interactive visual advertisements, embed the worker
into them, and perform map-reduce jobs in the browsers of unsuspecting users.

- Run some of the analytics/batch processing for a popular social network on
your users' CPUs.

- Have a popular site? Sell its audience's CPUs just like one sells
impressions via AdSense.

------
raghus
_Google's server farm is rumored to be over six digits (and growing fast),
which is an astounding number of machines, but how hard would it be to
assemble a million people to contribute a fraction of their compute time?_

Maybe Google could put an optional thingy into Chrome so that users' computers
can be part of its server farm?

~~~
hendler
Gears and Native Client are likely intended for exactly this.

<http://code.google.com/contests/nativeclient-security/>

------
henryl
I realize the author wasn't proposing that something like this could be a
business, but humor me:

I had this idea a few years back, with a business model that paid publishers
for CPU cycles gathered from a JavaScript or Flash widget. We hoped to then
sell this service to data-intensive industries. We decided it wasn't feasible.

We need to weigh the CPU cycles gained from this regime against the bandwidth
and CPU cycles lost to the hundreds of web, queue, and data servers needed to
run this model. IMO it is unlikely that this model will pay off once you
consider things like network latency, and trade-offs like job size (larger
jobs are better) vs. job completion probability (smaller jobs are better).
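
To make that trade-off concrete, here's a toy back-of-envelope model (every
number, and the exponential visit-time assumption, is mine, purely for
illustration): expected useful work per job peaks at some intermediate job
size.

    // Toy model: fixed per-job overhead vs. odds a visitor finishes the job.
    // Assumes visit lengths are roughly exponential -- an illustration only.
    function expectedUsefulWork(jobSec, overheadSec, meanVisitSec) {
      var pComplete = Math.exp(-jobSec / meanVisitSec);      // chance the tab stays open
      return (jobSec * pComplete) / (jobSec + overheadSec);  // useful sec per total sec
    }

    // Sweep job sizes to look for the sweet spot (5s overhead, 60s mean visit).
    for (var s = 5; s <= 120; s += 5) {
      console.log(s + 's job -> ' + expectedUsefulWork(s, 5, 60).toFixed(3));
    }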

Even if the potential for viability were there, it isn't clear that there is a
market for something like this. Large-scale computing challenges obviously
exist, and a lot of people are making money with solutions like cloud
computing, but these problems typically involve proprietary data sets, using
proprietary or industry-standard (good ol' apps like MySQL) software. Chopping
up your sensitive data and sending it en masse to the public to be processed
in JavaScript instead of C++ doesn't exactly fit client needs.

~~~
mchadwick
I had actually tried to do this exact thing. Problem A is that you can't
execute very much in a client's browser at any one time. Okay, so you make the
jobs smaller and fetch more of them. Problem B is that, by the time you've
pulled the data from disk, shipped it to the browser and back, put it back to
disk, and done the same cycle for the reduce, it ends up being cheaper to
stream 64MB HDFS blocks around EC2.
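
For reference, the whole per-job cycle looks roughly like this (endpoint paths
and the trivial map function are made up), and every iteration pays two HTTP
round trips on top of the server-side disk I/O:

    // Sketch of the per-job cycle: fetch input, compute, post the result.
    function getJSON(url, cb) {
      var xhr = new XMLHttpRequest();
      xhr.open('GET', url, true);
      xhr.onload = function () { cb(JSON.parse(xhr.responseText)); };
      xhr.send();
    }

    function postJSON(url, data, cb) {
      var xhr = new XMLHttpRequest();
      xhr.open('POST', url, true);
      xhr.setRequestHeader('Content-Type', 'application/json');
      xhr.onload = cb;
      xhr.send(JSON.stringify(data));
    }

    function workLoop() {
      getJSON('/job/next', function (job) {                      // round trip #1
        var result = [];
        for (var i = 0; i < job.input.length; i++) {
          result.push(job.input[i] * job.input[i]);              // toy map step
        }
        postJSON('/job/' + job.id + '/result', result, workLoop); // round trip #2
      });
    }
    workLoop();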

That doesn't even get into verifying results from an untrusted client.

Here's my half-implemented proof of concept from a while back that runs on
AppEngine: <http://github.com/markchadwick/emarer/tree/master>

Slightly different implementation, but the same idea (which I think is a very
cool idea!).

~~~
henryl
You can extend the amount of processing time available on the client by
storing intermediate results in window.name, which gives you up to 2MB of
semi-persistent storage.
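
A minimal sketch of that trick (assuming your intermediate results survive a
JSON round trip):

    // window.name survives page navigations within the same tab,
    // so it can hold ~2MB of serialized intermediate results.
    function saveIntermediate(results) {
      window.name = JSON.stringify(results);
    }

    function loadIntermediate() {
      try {
        return window.name ? JSON.parse(window.name) : [];
      } catch (e) {
        return []; // some other page may have written non-JSON here
      }
    }

Worth noting that any page loaded in the tab can read or clobber it, so it's
scratch space, not storage.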

------
sam_in_nyc
Have fun making sure clients don't send you invalid data. You'll need some
sort of voting system where several clients compute the same piece, and you
make sure the results all match up. Even then, you can't be 100% sure of the
results.
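
A minimal server-side sketch of that voting idea (the quorum size and names
are made up):

    // Accept a chunk's result only when `quorum` identical answers arrive.
    // Results are serialized so deep-equal answers compare as strings.
    function makeVerifier(quorum) {
      var votes = {}; // chunkId -> { serializedResult: count }
      return function vote(chunkId, result) {
        var key = JSON.stringify(result);
        var tally = votes[chunkId] = votes[chunkId] || {};
        tally[key] = (tally[key] || 0) + 1;
        return tally[key] >= quorum ? result : null; // null = not yet confirmed
      };
    }

    // usage: require agreement from 3 independent clients
    var vote = makeVerifier(3);
    vote('chunk-42', [1, 4, 9]); // null
    vote('chunk-42', [1, 4, 9]); // null
    vote('chunk-42', [1, 4, 9]); // [1, 4, 9] -- confirmed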

------
piramida
How is this related to map-reduce besides the method names? It has a single
point of failure (the server), the nodes have no logic to split jobs further,
and on top of that there's the painfully slow JavaScript engine...

------
jacktang
My current work might be related to this field: we created Firefox add-ons and
let the browsers work for us.

------
moonpolysoft
Sorry, but this is basically grid computing with a slightly different client.
As pointed out many times before, most interesting problems right now are
IO-bound. It turns out that data locality is the most important thing in
processing extremely large datasets. That is the key insight of the map-reduce
paper and the linchpin of the success or failure of all the distributed map-
reduce frameworks that have sprung from it.

Most startups and small-scale companies that would see value in leveraging a
system like this simply don't have the right processing profile to make
something like this worth their while. I'm sure if you graphed CPU time per
byte of data, you'd find a sweet spot where a service like this would speed up
jobs rather than slow them down.
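
A rough way to see where that sweet spot sits (all numbers are placeholders,
not measurements): offloading only wins when the compute time you ship out
exceeds the time spent moving the bytes there and back.

    // Break-even test: is the compute you offload worth the transfer cost?
    function worthOffloading(cpuSecPerByte, bytes, netBytesPerSec) {
      var computeSec  = cpuSecPerByte * bytes;
      var transferSec = (2 * bytes) / netBytesPerSec; // there and back
      return computeSec > transferSec;
    }

    console.log(worthOffloading(1e-6, 64e6, 1e6)); // false: IO-bound, keep it local
    console.log(worthOffloading(1e-4, 64e6, 1e6)); // true: CPU-heavy enough to ship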

As it happens, most companies with a high CPU-time-per-byte ratio are either
financial firms or pharma, most of whom not only have their own infrastructure
but would rather close up shop than see their proprietary code out in the wild
for competitors to analyze.

And there are already plenty of clients out there for running Fourier
transforms on possible SETI signals.

