

On the Computational Complexity of MapReduce - panarky
http://jeremykun.com/2014/10/05/on-the-computational-complexity-of-mapreduce/

======
gracenotes
Nice article. I got too far into it before realizing I was essentially tricked
into reading an academic paper :) There seems to be a tradeoff between clear
presentation of an idea and the formalization required to usefully build off
of it/publish it. Abstractions which do both (e.g. Turing machines) get more
reuse, I think.

The sublogarithmic result is constructive, though, and my understanding of it
is like this: We can split up a decision problem (take in a string, give a
yes/no answer) over many workers by splitting up the input string into sqrt(n)
chunks of size sqrt(n) each and using a shared working space. The first worker
handles the first chunk with a blank working space, the second worker handles
the second chunk with the finished working space from the first one, etc. until
the last worker handles the last chunk, with the second-to-last's working
space, and gives a yes/no answer. This can be parallelized into a single
MapReduce round by instead having every worker process their chunk using every
possible configuration of the shared working tape. This sounds like a lot of
work, but if the shared working area is sublogarithmic in size, calculating such a
table fits within polynomial time and sqrt(n) space. This means that one final
"reduce" step can take the working space from the first input chunk, pass it
through the second input chunk in constant time, and so on until it can make a
decision.
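
To make that concrete, here is a toy sketch of the table-composition trick (my
own illustration, not from the article): the "working space" is just a
three-state counter for the number of 'a's mod 3, but the same composition
works for any configuration set small enough to enumerate:

    import math

    STATES = [0, 1, 2]          # every possible "working space" configuration
    START = 0
    ACCEPTING = {0}             # accept iff the count of 'a's is divisible by 3

    def step(state, symbol):
        # one-way pass over the input: update the working space per symbol
        return (state + 1) % 3 if symbol == 'a' else state

    def map_chunk(chunk):
        # MAP: simulate this chunk from *every* possible starting
        # configuration and record where each one ends up.
        table = {}
        for start in STATES:
            cur = start
            for sym in chunk:
                cur = step(cur, sym)
            table[start] = cur
        return table

    def reduce_tables(tables):
        # REDUCE: thread the initial configuration through the per-chunk
        # tables in order; each chunk now costs a single table lookup.
        cur = START
        for table in tables:
            cur = table[cur]
        return cur in ACCEPTING

    text = 'ab' * 50 + 'a'                     # 51 'a's, and 51 % 3 == 0
    k = math.isqrt(len(text)) or 1             # ~sqrt(n) chunk size
    chunks = [text[i:i + k] for i in range(0, len(text), k)]
    print(reduce_tables([map_chunk(c) for c in chunks]))   # True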

I can't think of any practical sublogarithmic space problems off the top of my
head but evidently there are some good ones in graph theory. If you tried to
implement the above in practice, you could imagine that the final reduce step
could be intelligent about combining the processing stages so workers don't
need to calculate _every_ single configuration.

What is unclear to me is whether all Turing machines deciding languages in
sublogarithmic space can be chunked in this way. The usual definition of an
input read head is that you can go left or right whenever you want, but having
access to the entire input means that the workers can't be cleanly chunked.
You could say the input is one-way like a DFA/PDA, but is there a result that
these are equivalent with sublogarithmic scratch space?

------
chubot
So they are talking about how many MapReduce rounds it takes to express
certain graph algorithms. (A single MapReduce round is of course O(n log n)
with respect to the number of records.)

I wonder if this context is practical/realistic. One paper I thought was
interesting is GraphChi, a graph processing framework that runs on a single
machine:

http://graphlab.org/projects/graphchi.html

Graphs can be encoded very efficiently; each edge is just a pair of integer
node IDs, plus maybe a weight or some other small payload data. A graph of,
say, 7 billion nodes (one for everyone in the world) will often fit in memory
on a single machine these days, depending on the number of edges. GraphChi uses
disk, so you have even more space, if you are willing to constrain the model
(MapReduce is of course already heavily constrained, especially for graphs).
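
Rough back-of-envelope (the numbers here are mine, purely for illustration):
with 64-bit node IDs and no payload, an edge is 16 bytes, so even 10 billion
edges is only on the order of 150 GiB:

    # assumptions: 64-bit node IDs, no edge payload, 10 billion edges
    # (illustrative numbers, not taken from any particular dataset)
    edges = 10 * 10**9
    bytes_per_edge = 2 * 8                          # two 64-bit node IDs
    print(edges * bytes_per_edge / 2**30, "GiB")    # ~149 GiB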

The raw source data is probably larger than a single machine can hold, but you
could just run a MapReduce to copy it out into a bunch of (ID, ID) pairs, do
your hairy graph algorithms on a single machine, then export the result back to
your data store
with a MapReduce join.
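
As a rough sketch of that first extract step (written Hadoop Streaming style;
the record layout with "user_id"/"friend_ids" fields is made up for
illustration):

    #!/usr/bin/env python3
    # Sketch of the "copy the graph out as (ID, ID) pairs" step as a Hadoop
    # Streaming mapper. The input format (one JSON record per line with
    # hypothetical "user_id" and "friend_ids" fields) is invented here.
    import json
    import sys

    for line in sys.stdin:
        record = json.loads(line)
        src = record['user_id']
        for dst in record['friend_ids']:
            # one tab-separated edge per output line; the resulting edge list
            # is what you'd pull onto a single machine for the graph algorithm
            print(f'{src}\t{dst}')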

You could use either something like GraphChi or just an arbitrary C/Java
program working on a big graph in memory.

My hunch is that this is a lot more tractable than sequential MapReduce... I'd
be interested in hearing if people have big graph problems where you couldn't
use such an approach.

~~~
chillacy
MR is a pretty inflexible paradigm, and doesn't really suit graph problems
well at all. Most people I know prefer to use Spark (or in the case of graph
problems, Giraph). Of course this is for huge graphs that won't fit on a
machine (petabytes). If it fits on a machine it's far easier to not muck
around with distributed computing.

~~~
ihodes
Do you know of a group using Spark on petabytes (or even terabytes) of
data?

~~~
TallGuyShort
The scalability of Spark is certainly not as established as that of Hadoop.
It's entirely anecdotal as I'm not in a position to share verifiable details,
but I know of several installations that run Spark on several-hundred-node
clusters. They're certainly well into terabytes, if not in the neighborhood of
petabytes.

