

MapRejuice - Distributed Client Side Computing (Node Knockout Entry) - Klonoar
http://nodeknockout.com/teams/anansi

======
jhuckestein
Thanks for your feedback, guys. This was our first shot. Next we need to
figure out:

\- What kind of problems lend themselves well to our framework? Most MapReduce
algorithms crunch huge loads of data. Our framework is more useful for long-
running computations on freely available data.

\- How does the technical side look? I'm thinking Node as stateless master
servers, Hypertable for storage, and RabbitMQ for flexible persistent job
queues. The client runner has some bigger problems, like circumventing the
same-origin policy (for pulling data from, e.g., graph.facebook.com or
Twitter) and streaming in large data.

\- What to do about confidentiality. All computations will, by virtue of the
framework, be visible to the workers. That may not be ideal in some cases.

\- How to handle worker failure and incorrect results. In our system, workers
are unreliable, untrustworthy and slow. That is very different from other
implementations ;) (one possible mitigation is sketched after this list)
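
On that last point, the obvious approach is to run each job redundantly and
only trust a strict majority. A rough sketch of what I mean (function and
variable names here are made up for illustration, not our actual API):

    // Rough sketch, not our real API: run each job on several workers
    // and accept the result only if a strict majority of them agree.
    function majorityResult(results) {
      var counts = {};
      var best = null;
      for (var i = 0; i < results.length; i++) {
        var key = JSON.stringify(results[i]);
        counts[key] = (counts[key] || 0) + 1;
        if (best === null || counts[key] > counts[best]) {
          best = key;
        }
      }
      // Strict majority required; otherwise the caller would
      // re-queue the job for more workers.
      if (best !== null && counts[best] > results.length / 2) {
        return JSON.parse(best);
      }
      return null;
    }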

That said, if anyone here runs a lot of MapReduce tasks or is generally
interested in the subject, drop us a line at team@maprejuice.com or leave us
your email address at <http://maprejuice.com>. Next we would like to take a
number of real-world tasks and try to calculate the correct results on our
"cluster" :)

~~~
endtime
Here's something I tried running (index generation):

    
    
        // map: emit one (word, docID) pair per word in the document
        map = function (docID, text) {
          var results = [];
          var words = text.split(' ');
          for (var i = 0; i < words.length; i++) {
            results.push({key: words[i], value: docID});
          }

          // signal completion and hand back the results
          this.done = true;
          this.results = results;
        };

        // reduce: count occurrences per docID, then sort docIDs by frequency
        reduce = function (word, docIDs) {
          self.postMessage("starting reduce for distributed relevance-sorted index generation");
          var counts = {};
          var results = [];
          for (var i = 0; i < docIDs.length; i++) {
            var docID = docIDs[i];
            if (docID in counts) {
              counts[docID]++;
            } else {
              counts[docID] = 1;
              results.push(docID);
            }
          }

          // most-frequent docIDs first
          results.sort(function (a, b) { return counts[b] - counts[a]; });

          this.done = true;
          this.results = results;
        };

        data = {
          1: "the quick brown fox jumps over the lazy dog while the slow yellow fox stumbles around the yard",
          2: "on the ning nang nong where the cows go bong and the monkeys all say boo",
          3: "lions and tigers and bears oh my , i suppose next we'll see monkeys and maybe a fox"
        };
    

Didn't seem to work, but maybe I did something wrong.

~~~
Klonoar
Assuming you entered it in the format we take it in right now (separated out,
etc), it's just very possible that we're hanging on the backend. We aren't
able to touch the thing for a week while it's being judged, so we've had to
periodically kill off things to keep it alive.

It's not our desire either, but give us time to deal with it and we'll show
you some rockin' stuff. ;)

~~~
endtime
The first time it timed out. After I posted that I tried again, and it let me
create the job and then start it... but then the result was an empty object.
This made me think I screwed up my JS, but it's pretty simple and I did test
each function individually, so I think it's correct.

Anyway, looking forward to you guys getting the kinks ironed out. Very cool
idea.

~~~
Klonoar
Yeah, your problem just hasn't run yet, it would seem. We didn't have time to
implement running multiple problems at once - currently they enter a queue and
wait in line for their turn to get distributed. :(

Glad to know you at least got it in there, though!

------
il
This is great tech, but I'm worried it might set a precedent for rogue sites
stealing my computing power/CPU cycles anytime I visit them.

~~~
Klonoar
Yeah, we're still musing on this; it's certainly a known issue. Besides the
obvious method of offering an "opt out" extension (like Google does with their
ad/tracking stuff), there are more ethical pieces to consider, e.g. loading
on mobile, where people may be paying for data consumption. We shouldn't be
running up their data usage (or potentially draining their battery in the
background) for arbitrary reasons.

As of right now it doesn't even fire on mobile devices for this reason.
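
Conceptually it's just a guard in the include script before anything spawns.
Something like this (the user-agent sniffing and cookie name here are
illustrative, not our exact implementation):

    // Illustrative guard, not our exact code: skip spawning the worker
    // on mobile devices or when the user has opted out via cookie.
    function shouldSpawnWorker() {
      var isMobile = /Android|iPhone|iPad|iPod|BlackBerry|Opera Mini/i
        .test(navigator.userAgent);
      var optedOut = document.cookie.indexOf('maprejuice_optout=1') !== -1;
      return !isMobile && !optedOut;
    }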

tl;dr Good point, we're aware, and there'll be ways to deal with it. Must
sleep now.

------
milkshakes
why pay in eyeballs when you can pay in clock cycles? this could be a much
cooler way to monetize a site than ads

~~~
shiftb
Exactly! This was one of the thoughts we had. Can you imagine someone like
Farmville implementing it? Lots of potential revenue.

Would also be great to donate cycles to educational or non-profit teams.

~~~
teaspoon
Plura Processing does something like that -- offering their distributed
computing client for integration into Flash games, and kicking $2.60 back to
the developer per month of compute time:
<http://www.pluraprocessing.com/games/index.html>

I don't know them personally, so I'm curious how that model is working out for
them.

------
speek
Here's another distributed client-side computing framework ->
<http://github.com/revis/Really-Cloudy/>

It hasn't been touched since December '09, but it looks neat too.

------
rarestblog
The graph on the main page of MapRejuice shows ~300 jobs per minute. What is
"a job"? Is it one iteration of "map" or "reduce"? So, if we assume you're
doing word count, can you count only up to 300 words per minute? On a site
with 6.5M visitors?

~~~
jhuckestein
A job is one run of map or reduce on one key.

In the case of word count, during the map phase, that means that 300 input
files are analyzed per minute. During the reduce phase that indeed means that
only 300 unique word frequencies per minute are computed.

This is a purely academic consideration, though. Currently we don't have any
running jobs; the clients are idling. When idling, the clients request a new
job every 10 seconds, which is longer than a typical reduce run would take.
We're also throttling the load the workers put on the server (workers only
spawn 50% of the time on the include script) because we don't want to ruin our
hackety-hack server. Latency between the workers and the server, and between
the server and our free CouchOne instance, makes up a large portion of the
running time of smaller jobs.

This was just an experiment. Once (and if) we figure out the problems I
mentioned earlier we can start building to scale :)
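
If you're curious, the client behavior boils down to something like the
sketch below (the endpoint and function names are simplified placeholders,
not our actual code):

    // Simplified sketch of the worker lifecycle. The include script only
    // spawns a worker 50% of the time to throttle server load; idle
    // clients then poll for a new job every 10 seconds.
    function runJob(job) {
      // placeholder: hand the job off to a Web Worker for execution
    }

    if (Math.random() < 0.5) {
      var poll = function () {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', 'http://maprejuice.com/job/next', true);
        xhr.onreadystatechange = function () {
          if (xhr.readyState !== 4) return;
          var job = xhr.responseText ? JSON.parse(xhr.responseText) : null;
          if (job) runJob(job);
          setTimeout(poll, 10000); // idle clients retry every 10s
        };
        xhr.send();
      };
      poll();
    }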

~~~
rarestblog
Thank you for the answer.

Did you ever do, or would it be possible for you to do, a benchmark on a
relatively large corpus of how fast you can count, say, 10 million words? Or
100 million?

I'm really curious how well this method works relative to the usual MapReduce
on clusters. For comparison: a single home PC does about 2.5 million word
counts per minute using local files and mergesort instead of the network.

This is why I asked: 300 words per minute using 6.5 million randomly
available nodes vs. 2.5 million words per minute using one home node looks
like a huge waste. Would it be possible to do a fairer comparison?

~~~
jhuckestein
Yes, we could benchmark that. Most likely our framework would be orders of
magnitude slower. There are a couple of reasons for this:

\- Data transfer is slow. Streaming the data in is not possible on all
browsers.

\- There is a very high fixed cost for each single computation, several times
bigger than the time it takes the average map/reduce function to run. This
constant per-step overhead is hard to minimize.

\- A single map or reduce computation is further chunked down into multiple
iterations of the same function. This avoids the "slow script" warning that
browsers show for long-running tasks. The technique is necessary, but it adds
further overhead to the computation (a sketch follows below).
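
The chunking looks roughly like this (simplified illustration; the real
scheduling is a bit more involved):

    // Simplified illustration of chunked execution: process a slice of
    // the input, then yield to the browser with setTimeout so the
    // "slow script" dialog never fires.
    function runChunked(items, fn, done) {
      var CHUNK = 200; // items per slice; tuning this is part of the overhead
      var i = 0;
      (function step() {
        var end = Math.min(i + CHUNK, items.length);
        for (; i < end; i++) {
          fn(items[i]);
        }
        if (i < items.length) {
          setTimeout(step, 0); // yield, then continue with the next slice
        } else {
          done();
        }
      })();
    }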

As a result, our framework might not be well suited to many traditional
MapReduce scenarios like word count. Better-suited tasks are ones that would
carry a lot of overhead even if I ran them on my own cluster, i.e. any task
that involves pulling in data from a third party that is not local to the
computation. For example, counting word frequencies across all Wikipedia
pages would be vastly faster on our cluster than it could ever be on a single
machine, even with only a few workers.

We still need to figure many things out. The opportunities here are the
availability of idle processing power and the possibility to sandbox
computations in browsers. We need to find out exactly what the best way is to
exploit those opportunities.

------
yesbabyyes
This has the potential to reach many more computers than *@home.

Great idea, great execution!

~~~
Klonoar
Thanks! We're pretty happy with what we've got, but there's still a lot more
planned.

You're right about the potential to reach more; that's actually where the
idea came from. We always thought that *@home had too many barriers (having
to download/install/etc). This is fairly automatic, and takes almost no
integration effort on anyone's part other than one line of HTML. ;D
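
For the curious, that one line is just a script include along these lines
(the URL and filename here are illustrative):

    <script src="http://maprejuice.com/worker.js"></script>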

