

Massively Parallel Graph processing on GPUs - mpweiher
http://mapgraph.io

======
thompsonbry
Chiming in here. This is Bryan Thompson, the PI for the MapGraph project. We
have new results for multiple GPUs: up to 29 billion traversed edges per
second on 64 K20 GPUs, on a graph of 4.3 billion directed edges. The multi-GPU
code base does not yet expose the full MapGraph / GAS API, but we will be
fixing that soon. We will be extending the capabilities of the platform in a
number of directions, including allowing multiple partitions per GPU so you
can operate on graphs larger than GPU RAM without having to scale out, and so
you can operate on huge graphs on GPU clusters.

GraphLab (a purely in-memory system) is much faster than GraphChi (a disk-
based system using the parallel sliding window algorithm). Both GraphLab and
GraphChi use "vertex cuts," which amount to 2D partitioning, and that is what
we use in the multi-GPU version of MapGraph. It is also what the world-record
holder for BFS traversal, the Blue Gene/Q, uses.

We used the synchronous engine and that specific version of GraphLab because
that is what existed and worked (without segfaults) when we collected the data
for the SIGMOD paper in the summer of 2013. We did cross-check the performance
of GraphLab with the GraphLab developers.

One of the main reasons that parallel CPU graph algorithms slow down is that
they saturate the memory bandwidth of the bus. GPUs have 10x or more memory
bandwidth: the K20 has 204 GB/s, the K40 has 288 GB/s, and the Pascal GPU
(2016) will have over 1 TB/s of memory bandwidth and 24 GB of RAM.
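As a rough illustration of why bandwidth is the ceiling, here is a back-of-envelope bound on traversal rate for a bandwidth-bound BFS kernel. The 16 bytes-per-edge figure is a hypothetical assumption for the sketch, not MapGraph's measured traffic:

```python
# Back-of-envelope: a bandwidth-bound BFS moves a few words of memory
# traffic per traversed edge. Assuming ~16 bytes per edge (a made-up
# but plausible figure; the real number depends on the data layout),
# the peak traversal rate is roughly bandwidth / bytes_per_edge.

def peak_teps(bandwidth_gb_s: float, bytes_per_edge: int = 16) -> float:
    """Upper bound on traversed edges per second for a bandwidth-bound kernel."""
    return bandwidth_gb_s * 1e9 / bytes_per_edge

for name, bw in [("K20 (204 GB/s)", 204.0), ("K40 (288 GB/s)", 288.0)]:
    print(f"{name}: ~{peak_teps(bw) / 1e9:.1f} GTEPS ceiling")
```

By this crude estimate a single K20 tops out around 12-13 GTEPS, which is why multi-GPU scaling matters for bigger numbers.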

I've listed the MapGraph Publications below. See
[http://mapgraph.io](http://mapgraph.io) for more details.

\- MapGraph: A High Level API for Fast Development of High Performance Graph
Analytics on GPUs - Zhisong Fu, Bryan Thompson, Michael Personick (SYSTAP,
LLC). GRADES 2014, June 22, 2014, Snowbird, Utah, USA. Copyright 2014 ACM
978-1-4503-2982-8.
DOI:[http://dx.doi.org/10.1145/2621934.2621936](http://dx.doi.org/10.1145/2621934.2621936)

\- Parallel Breadth First Search on GPU Clusters - Z. Fu, H.K. Dasari, M.
Berzins, B. Thompson. SCI Technical Report, SCI Institute, University of Utah.
July 29, 2014.
URL:[http://www.sci.utah.edu/publications/Fu2014a/UUSCI-2014-002....](http://www.sci.utah.edu/publications/Fu2014a/UUSCI-2014-002.pdf)

------
lmeyerov
I was looking at this today and yesterday. It bodes well: these results look
right for GPUs because they're on par with Duane Merrill's GPU BFS work.

Less clear is whether MapGraph beats existing single node HPC systems. In
particular, instead of comparing to GraphLab, I'd love to see comparisons to
GraphChi or TurboGraph!

What matters here is that huge graphs are now feasible on a single node, and
once these techniques get combined into distributed ones, most real-world
graphs can be processed in real time!

(If you like building this kind of bleeding edge stuff in performance, ML,
and/or infoviz, we're hiring at graphistry.com :)

------
m_mueller
I've only glanced over your paper quickly, so I'm not sure whether I missed
something: did you analyze why your 24-thread CPU version did so much worse in
a few cases than the single-threaded version? Did you use OpenMP without a
defined thread affinity on a multi-socket system?
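For reference, the standard OpenMP 4.0 environment variables are the usual way to pin threads on a multi-socket box; without them, threads can migrate across sockets and a 24-thread run can lose to a single thread on NUMA traffic. The benchmark binary name below is made up:

```shell
# Hypothetical invocation -- './graph_benchmark' stands in for whatever
# CPU baseline is being measured. OMP_PLACES/OMP_PROC_BIND pin each
# thread to a core so cross-socket memory traffic doesn't swamp the run.
export OMP_NUM_THREADS=24
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# ./graph_benchmark <args>
echo "pinned: $OMP_PROC_BIND to $OMP_PLACES"
```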

~~~
erichocean
It's using GraphLab 2.0, not 2.3 (no idea why), and they also forced
synchronous mode on GraphLab, hence the less than stellar scaling.

On at least the problems I use GraphLab for, synchronous mode is not only a
lot slower, it converges WAY more slowly as well.

It's important when doing comparisons to not rig the game, and it's not clear
to me that the authors took pains to keep things fair.

~~~
supersonick
We are big fans of GraphLab. We used the version that was available when we
collected the performance data on GraphLab - last summer. The asynchronous
engine was seg-faulting which is why we used the synchronous engine. Lots of
details about this in the SIGMOD paper:

[http://mapgraph.io/papers/MapGraph-SIGMOD-2014.pdf](http://mapgraph.io/papers/MapGraph-SIGMOD-2014.pdf)

And of course MapGraph is open-source, so anyone can make their own
comparisons. :-)

------
ranty
Not being that au fait with big data stuff, can someone tell me what graph
processing is _for_?

~~~
floatrock
One current gold-rush topic is ASIC-resistant bitcoin mining.

Bitcoin works by generating a particular kind of SHA256 hash -- you take some
fixed inputs (transactions) and a random input you can change (a nonce), and
together they need to SHA256-hash into a special value, e.g. one with 5
leading zeros.

This is known as a "proof of work" system -- it has the interesting property
that it is _quick to verify but difficult to produce_ (you can't predict a
SHA256 hash). Mining bitcoin is basically brute-forcing tons of random inputs
until you find one that produces a special SHA256 hash, which you then
advertise to the network, and the network rewards you with libertarian disney
dollars.
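A toy sketch of that quick-to-verify / hard-to-produce asymmetry (counting leading zero hex digits rather than bits, to keep it simple):

```python
import hashlib

def mine(data: bytes, difficulty: int = 2) -> int:
    """Brute-force a nonce so that sha256(data + nonce) starts with
    `difficulty` zero hex digits -- slow to find, instant to verify."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(data + str(nonce).encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(data: bytes, nonce: int, difficulty: int = 2) -> bool:
    """Checking a claimed nonce takes exactly one hash."""
    digest = hashlib.sha256(data + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = mine(b"transactions...")
print(verify(b"transactions...", nonce))  # True
```

Each extra zero digit multiplies the expected mining work by 16, while verification stays one hash -- that asymmetry is the whole game.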

Now, you can make a custom chip (ASIC) that is optimized to churn out as many
SHA256 hashes as possible -- far more and far faster than a CPU can. In fact,
ASIC mining is the only cost-effective way to mine bitcoin these days. Thus,
bitcoin mining is quickly becoming about who can buy the most chips and power
them with the cheapest electricity, encouraging massive centralized data farms
-- a Bad Thing if you believe in the Decentralize All The Things! promise of
bitcoin.

A better proof-of-work system is one that shares the quick-to-verify-
difficult-to-solve property of specially-formed SHA256 hashes, but one that
can't be specialized for with an ASIC. Then more people could participate on
commodity hardware and you would (theoretically) have less centralization (or
at least encourage better commodity hardware instead of better specialized
SHA256 chips).

Anyways, back to graph processing... one class of problem that is difficult to
build specialized ASICs for, yet is still quick-to-verify-but-difficult-to-
solve, is finding cycles of a specific length (say 42) in a large graph.
Solving such a problem is basically memory-hard, so an ASIC can only get you
so far (although people have proposed ASIC farm architectures that try, e.g.
with shared central memory accessed by many ASIC cores, which starts to sound
a bit like a GPU...)

~~~
lmeyerov
We're interested in day-to-day business needs. Databases and compute engines
here handle real-time transaction processing or data analysis for graph data.
E.g., logistics, recommendation systems, social graphs (users, employees,
companies), payments (fraud or segmentation), IT systems/traffic (performance
& security), knowledge management (manufacturing, legal), etc.

~~~
floatrock
...wait, so you mean you use this stuff for like real work instead of
intellectual exercises in creative incentives for burning electricity? ;)

~~~
lmeyerov
Data art mostly:
[https://twitter.com/Graphistry/status/433689227906142208/pho...](https://twitter.com/Graphistry/status/433689227906142208/photo/1)

------
tromp
"MapGraph is up to two orders of magnitude faster than parallel CPU
implementations on up to 24 CPU cores and has performance comparable to a
state-of-the-art manually optimized GPU implementation."

That doesn't bode well for my $1000 GPU Speed Parity Bounty at
[https://github.com/tromp/cuckoo](https://github.com/tromp/cuckoo) based on my
belief that the Cuckoo Cycle graph-theoretic proof-of-work algorithm is more
suited to CPUs.

~~~
Bakkot
I don't know. This is admittedly not something I know much about, but from
poking around a bit, it looks like the Gather-Apply-Scatter model - or any
other vertex-centric model - is not well suited to finding cycles of specific
length. Compare eg GraphChi's triangle-counting code [1], which specifically
notes the difficulties arising from having to maintain and work with large
adjacency lists. I don't see a way around that for vertex-centric models.

That said, I haven't actually read your miner, so I don't know what it's doing
or if something similar can be done for GAS-based frameworks. If someone who
knows more wants to chime in, I'd be delighted to read more about this.

[1] [https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/trianglecounting.cpp](https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/trianglecounting.cpp)
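To make the vertex-centric framing concrete, here is a minimal sketch of BFS in a Gather-Apply-Scatter style -- the kind of algorithm GAS is good at, since each phase touches only per-vertex state rather than whole adjacency lists (this is an illustration of the model, not MapGraph's actual API):

```python
# Minimal GAS-flavored BFS: frontier vertices scatter (dist + 1) along
# their edges; each target applies the minimum proposed distance. No
# step ever needs a vertex's full adjacency list held as mutable state,
# which is exactly what cycle-finding seems to require.

INF = float("inf")

def gas_bfs(adj, source):
    """adj: dict mapping vertex -> list of neighbors (undirected)."""
    dist = {v: INF for v in adj}
    dist[source] = 0
    frontier = {source}
    while frontier:
        nxt = set()
        for u in frontier:           # scatter phase
            for v in adj[u]:
                if dist[u] + 1 < dist[v]:   # gather/apply: keep minimum
                    dist[v] = dist[u] + 1
                    nxt.add(v)
        frontier = nxt
    return dist

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(gas_bfs(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```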

~~~
tromp
Most time in the miner is spent identifying which edges have an endpoint of
degree 1, and thus cannot be part of any cycle. This degree counting sounds
like it should be right up the Gather-Apply-Scatter alley...
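For readers following along, that degree-based trimming looks something like this sequential sketch -- gather degree counts, then scatter away edges that can't be on a cycle, and repeat (hedged: this is an illustration of the idea, not the actual Cuckoo Cycle miner's code):

```python
from collections import defaultdict

def trim(edges):
    """Repeatedly drop edges with a degree-1 endpoint: such edges cannot
    lie on any cycle. Each round mirrors a GAS pass -- 'gather' counts
    per-vertex degrees, 'scatter' removes the dead edges."""
    edges = set(edges)
    while True:
        degree = defaultdict(int)
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        dead = {(u, v) for u, v in edges if degree[u] == 1 or degree[v] == 1}
        if not dead:
            return edges
        edges -= dead

# A triangle with a dangling tail: only the triangle can carry a cycle.
print(trim([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]))
# -> {(1, 2), (2, 3), (3, 1)}
```

The per-round work is embarrassingly parallel over edges, which is why it sounds like a fit for a GAS engine.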

