
Progress in graph processing - mrry
https://github.com/frankmcsherry/blog/blob/master/posts/2015-12-24.md
======
heavenlyhash
I hugely appreciate the whole section that's a "rant" on synchronicity.

Algorithms should always produce the same results on the same inputs. There. I
said it.

Relaxing constraints on large graph processing or AI systems because A)
threads are hard and B) we're tossing in random values anyway is a mistake
that has made research in these areas far too hard to reproduce and reliably
quantify, and that in turn makes progress very difficult.

We can engineer better than this. Build stochastic systems with deterministic
seeds. Pass the seeds down in initialization trees if necessary, and never
refer to a global random source from another thread: providing a whole system
with enough "random"ness for stochastic behaviors while keeping it
deterministic and reproducible is 100% possible. When going concurrent, build
dataflows, and wherever streams merge, sort the results before passing them
on. Determinism AND parallelization _are_ possible.
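To make the "seed tree plus sorted merge" idea concrete, here's a minimal Python sketch (the function names and structure are my own illustration, not from any particular library): a root seed deterministically derives one child seed per worker, each worker uses its own `random.Random` instance rather than the global module, and the merge point sorts before passing results on.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker(seed, n):
    # Each worker gets its own RNG seeded from the parent;
    # no thread ever touches the global random module.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def run(root_seed, workers=4, n=3):
    # Seed tree: derive one child seed per worker from the root seed.
    root = random.Random(root_seed)
    seeds = [root.getrandbits(64) for _ in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(worker, seeds, [n] * workers))
    # Merge point: sort before passing on, so scheduling order
    # cannot leak into the output.
    return sorted(x for part in parts for x in part)

# Same seed -> identical output, regardless of thread timing.
assert run(42) == run(42)
```

The same output falls out of every run with the same root seed, while different root seeds still give independent stochastic behavior for farming out across a cluster.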

As a personal anecdote, my first forays into ML were littered with rand()
calls, and in retrospect it's head-smashingly obvious that this was a Bad
Idea. I spent huge amounts of time re-running the same code, then making a
one-character change to a constant or a sign somewhere, then re-running it
dozens of times again just to be sure the change had any predictable impact
at all. Had I made everything deterministic, the effect would have been
obvious immediately, and I could have iterated much faster. It would also
have been far easier to farm the work out to a CI/lab cluster with different
(but consistent) rand seeds to gather meaningful results in bulk. I could
have saved hundreds of hours of my attention (and sanity), even on small
projects.

Once these things become a bar for minimum viable quality in research, we can
start catching, debugging, and improving -- stuff like the "kingmaker
scheduler" problem the author describes can be made to vanish.

If only we had a video of Ballmer chanting "Determinism! Determinism!
Determinism!"...

~~~
GFK_of_xmaspast
> As a personal anecdote, my first forays into ML are covered with rand()
> calls, and in retrospect, it's head-smashingly obvious that this was a Bad
> Idea.

It's a perfectly fine idea, except that, like you said, you needed to keep
better control over your seeds, and you probably also could have benefited
from some kind of test harness. (Not to mention it sounds like your metrics
were a bit underpowered.)

------
yzh
We are building a graph processing library on GPUs:
[http://gunrock.github.io/](http://gunrock.github.io/) Our publication
explains why our library runs faster, in one word: load balancing, since
fundamentally, graph problems are irregular problems. If you look at the
roadmap section, we are moving towards dynamic graphs and are also planning
to try out asynchronous methods using MIS/coloring. Please try it out; any
kind of contribution is welcome.

------
anonymousDan
In terms of system support for graph algorithms that don't fit the "think
like a vertex" model, check out the Arabesque paper from SOSP 2015
("Arabesque: a system for distributed graph mining").

~~~
glxc
A comprehensive survey of "think like a vertex" frameworks was recently
published in CSUR:

[http://dl.acm.org/citation.cfm?id=2818185](http://dl.acm.org/citation.cfm?id=2818185)

[http://arxiv.org/abs/1507.04405](http://arxiv.org/abs/1507.04405)

------
sdenton4
Awesome article.

My take on this is that 'interesting' graph problems tend to be NP-(.*), or
reduce to a handful of well-known standards. Some of these standards, like
breadth-first traversal, run quickly and are therefore standard interview
questions. Other standard solutions, like PageRank, are effectively
eigenvalue problems, which take N^3 time to solve exactly. N^3 is generally
too slow for large-scale production systems, though, so we end up
introducing some randomness and battering the algorithm into an almost-
linear-time approximation that can run on a thousand machines no problem;
this is where Pregel comes from.
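The "eigenvalue problem beaten into a near-linear iteration" point can be seen in power iteration for PageRank, which costs O(edges) per pass instead of the O(N^3) of a direct eigensolver. A minimal Python sketch (toy adjacency list and parameter values are my own illustration):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration: repeatedly push rank along edges until the
    rank vector approximates the dominant eigenvector."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:
                # Dangling node: spread its rank uniformly.
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Each pass is just one sweep over the edges, which is exactly the per-superstep shape that Pregel-style "think like a vertex" systems distribute across machines.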

As someone who loves playing with graph algorithms, I would absolutely love to
be wrong in this world-view, and learn that mine eyes have been carefully
shielded from the beautiful variety of graph problems that aren't in one of
these three classes... So I'll make a resolution to try to keep my eyes open
for surprises in 2016.

------
taliesinb
I thought this article was great fun to read. If only this were the modern
scientific voice, instead of bloodless passive-voice humble-bragging.

I also liked the sequence of posts explaining timely data flow and its
implementation in Rust:
[https://github.com/frankmcsherry/blog/blob/master/posts/2015...](https://github.com/frankmcsherry/blog/blob/master/posts/2015-09-29.md)

