

The impact of fast networks on graph analytics - ICGog
http://www.cl.cam.ac.uk/research/srg/netos/camsas/blog/2015-07-08-timely-pagerank-part1.html

======
xtacy
I wonder: when the NSDI authors (or rxin) say a computation is communication-
bound, what do they mean?

Is it network bandwidth or latency?

I suspect it's latency: If you're bottlenecked on latency, the barrier-
synchronised nature of many jobs (due to shuffles) lowers network utilisation
to the extent that many of the smart network scheduling algorithms the NSDI
paper refers to don't work at all.

If it's latency, it also makes sense that a framework that's closer to bare-
metal (a highly tuned implementation) can squeeze more utilisation out of a
cluster, lowering end-to-end job times. I wonder if the JVM intrinsically
prevents some hardware-specific optimisations due to its memory model.

~~~
ms705
We will go into this a bit in part 2 of the blog post, but the bottom line is
that it doesn't look like GraphX is bottlenecked on barrier-sync latency in
this computation. In fact, the iterations in GraphX are quite long and hardly
use the network at all, so we're not sure if there's much fine-grained
synchronization going on.

That said, leaving the implementation details aside, latency is definitely a
big deal, but 10G does help there, too: the latency for sending a fixed-size
message can be a lot lower on an idle 10G network than on an idle 1G network.
If we're talking about very small synchronization messages, then maybe there
isn't much of a gain (network stack overhead dominates), but techniques like
our destination-oriented edge processing help reduce the need for very fine-
grained synchronization (for this computation at least). The only barrier-
synchronization necessary in our fast 10G implementation is at the point at
which no more updates are to be sent by _any_ worker (this only happens once
per iteration).
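
To give a rough feel for why destination-oriented processing avoids fine-grained messages, here is an illustrative sketch (made-up types, not the actual timely dataflow code from the post): each worker combines the rank contributions of its local edges per destination before anything goes out on the wire, so a destination sees one aggregated update per iteration rather than one message per edge.

```rust
use std::collections::HashMap;

// Illustrative sketch only: combine PageRank contributions by destination
// vertex before sending, so each destination gets one aggregated update per
// iteration instead of one message per edge.
fn aggregate_updates(
    edges: &[(u32, u32)],        // (source, destination) pairs held by this worker
    ranks: &HashMap<u32, f64>,   // current rank of each source vertex
    out_degree: &HashMap<u32, u32>,
) -> HashMap<u32, f64> {
    let mut acc: HashMap<u32, f64> = HashMap::new();
    for &(src, dst) in edges {
        let share = ranks.get(&src).copied().unwrap_or(0.0)
            / out_degree.get(&src).copied().unwrap_or(1) as f64;
        *acc.entry(dst).or_insert(0.0) += share;
    }
    acc // one combined value per destination, ready to be exchanged
}

fn main() {
    let edges = vec![(1, 2), (1, 3), (3, 2)];
    let ranks: HashMap<u32, f64> = [(1, 1.0), (3, 1.0)].into_iter().collect();
    let degrees: HashMap<u32, u32> = [(1, 2), (3, 1)].into_iter().collect();
    println!("{:?}", aggregate_updates(&edges, &ranks, &degrees));
}
```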

You're quite right, however, that a lot of the work on network scheduling for
big data computations mentioned in the NSDI paper operates at the coarse-
grained level of some kind of 'flow' notion. This would indeed be very hard to
disambiguate in our implementation (part 2 will show this in more detail); I'm
not convinced that these algorithms would help timely dataflow at all.

~~~
xtacy
Thanks, yes, I realised it wasn't really barrier-sync latency after I wrote
the comment. :)

I remember that the NSDI paper actually made an Amdahl's-law-like argument
(they give it a new name) and did something along the lines of "let's just
eliminate time waiting on the network from the total runtime, which makes the
network infinitely fast."

Coming back to the post: if it's CPU overhead, shouldn't Java be pretty
competitive with C/C++/Rust for common computations? There might be a lot of
other things going on that affect how much one can squeeze out of the CPU
(GC/object sizes, time spent in reflection/serialisation, maybe?).

It would be great to look at (a) the number of instructions that the Java and
Rust implementations execute, and (b) the instructions-per-cycle (IPC) issued
(or its inverse, the CPI) in both cases. If it's memory sync that's slowing
down Java, then Java's CPI must be (edit) _higher_ than Rust's.
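
For concreteness, CPI is just cycles divided by retired instructions (both of which perf stat will report) and IPC is its inverse; the numbers below are entirely made up, only to show the arithmetic:

```rust
// Entirely made-up counter values, just to show the CPI/IPC arithmetic:
// a memory-stall-heavy run retires fewer instructions per cycle.
fn main() {
    let runs = [
        ("implementation A", 8.0e11_f64, 4.0e11_f64), // (label, cycles, instructions)
        ("implementation B", 8.0e11_f64, 8.0e11_f64),
    ];
    for (label, cycles, instructions) in runs {
        let cpi = cycles / instructions; // cycles per instruction
        println!("{label}: CPI = {cpi:.2}, IPC = {:.2}", 1.0 / cpi);
    }
}
```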

~~~
ms705
Yep; the NSDI paper is correct in that there's hardly any time spent waiting
on the network (as our traces in part 2 will show). However, that is not to
say that the network being faster _cannot_ help: if computation and
communication are perfectly overlapped, then "blocked time analysis" (term
from the NSDI paper) would not show any potential improvement, but faster
communication can still improve the overall runtime (e.g., by reducing busy
polling, or because crucial updates arrive sooner).
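
As a toy illustration of the distinction (the times are invented): blocked-time analysis only credits the network with the time tasks actually spent blocked on it, which bounds the predicted gain even though overlapped communication still takes wall-clock time.

```rust
// Invented numbers, only to illustrate the blocked-time argument.
fn main() {
    let total_runtime = 100.0_f64;       // seconds, measured end to end
    let blocked_on_network = 2.0_f64;    // seconds tasks spent waiting on the network

    // Blocked-time analysis: best case if the network were infinitely fast.
    let predicted_best_case = total_runtime - blocked_on_network;
    println!(
        "best-case runtime: {predicted_best_case} s ({:.1}% improvement)",
        100.0 * blocked_on_network / total_runtime
    );
    // A small predicted improvement does not mean the network is irrelevant:
    // with good overlap little time is *blocked*, yet faster delivery of
    // updates can still shorten the computation that depends on them.
}
```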

The CPI number investigation is quite a good idea -- we in fact already have
these numbers for the Rust-based timely dataflow, but I'll have a look to see
how hard it'd be to get them for GraphX/Spark.

~~~
xtacy
Yep, that's right. Looking forward to the CPI numbers!

------
yzh
You guys should check out our high-performance GPU graph processing library,
Gunrock:
[http://gunrock.github.io/gunrock/](http://gunrock.github.io/gunrock/). We are
working on a multi-GPU distributed version now.

~~~
ms705
Interesting -- though a quick scan of the evaluation data sets suggests that
none of them are as large as the Twitter one (1.1B edges) or the uk-2007-05
one (3B edges) that we (and other distributed graph processing systems) use.
Presumably this is due to memory limitations on the GPU?

------
anonymousDan
So what exactly is it about the Naiad implementation that makes it so much
better than Spark/GraphX? Is it just Rust vs Java, or is there something more
fundamental about the model?

~~~
frankmcsherry
It's a good question. Part two goes into this a bit more (the post was too
long). But the performance seems to be due to a few things:

1. Rust/C#/etc. vs Java. Stuff just goes faster with less effort.

2. The model is more "programmable". Rather than selecting from pluggable, pre-
fab algorithms, with certain trade-offs already made for you, you are able to
write your own code where you need it. The system runs your code and really
not that much else (see the sketch below).
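
As a caricature of point 2 (a hypothetical sketch, not the actual timely dataflow API): the framework mostly just routes records to whatever per-record logic you wrote, rather than asking you to pick from a menu of fixed algorithms.

```rust
// Hypothetical sketch of "the system runs your code and not much else":
// the worker just moves records and applies whatever logic the user supplied.
fn run_worker<F: FnMut(u32, f64)>(updates: Vec<(u32, f64)>, mut user_logic: F) {
    for (vertex, value) in updates {
        user_logic(vertex, value); // the framework adds very little on top
    }
}

fn main() {
    let mut ranks = vec![0.0_f64; 4];
    // The "algorithm" is just a closure you write, e.g. accumulating rank updates.
    run_worker(vec![(1, 0.5), (2, 0.25), (1, 0.25)], |v, x| ranks[v as usize] += x);
    println!("{ranks:?}");
}
```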

Stay tuned for when we actually push that content out.

~~~
xjia
It is interesting to see C# outperform Java. What exact behaviors of the
languages are contributing to the speed-up?

