

One trillion edges: graph processing at Facebook scale [pdf] - mrry
http://www.vldb.org/pvldb/vol8/p1804-ching.pdf

======
Smerity
If anyone is interested in graph processing at scale, the largest Web Data
Commons Hyperlink Graph[1] (created from Common Crawl[2] data) spans 3.5
billion web pages and 128 billion hyperlinks. To my knowledge, it's the
largest freely available real-world graph dataset.

Numerous graph systems have been built that can scale to this data, frequently
on just a single powerful node, including FlashGraph[3] (which uses SSDs for
high-throughput storage), Frank McSherry's single-laptop Rust
implementation[4], and Dato's GraphLab Create[5].

As Frank McSherry points out at [6], there's a lot of room to improve the
state of the art in graph computation systems, especially as current academic
evaluations are broken.

(where "broken" means that a single-threaded graph system written in Rust on a
laptop can beat benchmarks set by clusters of high-performance machines)

[1]: http://webdatacommons.org/hyperlinkgraph/

[2]: http://commoncrawl.org/

[3]: https://github.com/icoming/FlashGraph

[4]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html

[5]: https://twitter.com/CommonCrawl/status/623615774909857792

[6]: http://blog.commoncrawl.org/2015/04/evaluating-graph-computation-systems-performance-and-scale/

~~~
shoo
That post from Frank McSherry is excellent. His earlier post [1] is also worth
reading.

I was curious to read more about the Hilbert ordering of edges, and the delta-
encoding. There is a related article from Yzelman & Bisseling titled "A cache-
oblivious sparse matrix–vector multiplication scheme based on the Hilbert
curve"[2] which discusses Hilbert ordering combined with a delta-encoded
version of the usual CSR encoding for sparse matrices [3].

> A Hilbert curve thus can be projected on any [2^m * 2^m] matrix, which in
> turn can embed the sparse matrix A, thus imposing a 1D ordering on its
> nonzeroes.
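
For reference, the usual CSR matvec loop looks like this (a minimal sketch of
plain CSR, my own illustration; the Yzelman & Bisseling scheme instead
traverses the nonzeros in Hilbert order, with their coordinates
delta-encoded):

    import numpy as np

    def csr_matvec(indptr, indices, data, x):
        # Row i's nonzeros sit at positions indptr[i]:indptr[i+1];
        # `indices` holds their column ids, `data` their values.
        y = np.zeros(len(indptr) - 1)
        for i in range(len(indptr) - 1):
            for k in range(indptr[i], indptr[i + 1]):
                y[i] += data[k] * x[indices[k]]
        return y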

The largest sparse matrix considered for the experimental results from this
article was "wikipedia-20070206" with 45,030,389 nonzeros (aka weighted
edges). Ignoring the weights, this is about 1/3000th of the size of the 128
billion edge graph that McSherry mentions in his post.

Yzelman & Bisseling also measure the additional preprocessing time required to
convert their sparse matrices to the Hilbert-ordered delta-encoded CSR
scheme -- roughly 30x the time of a single matrix-vector product
evaluation [4]. Clearly this approach is only going to pay off if one plans to
do a lot of analysis on the same matrix/graph without modifying it. McSherry
mentions a similar issue, but points out that he could preprocess the graph
using his approach faster than he could download the data, so preprocessing
speed wasn't really an issue.

[1] http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html

[2] https://lirias.kuleuven.be/bitstream/123456789/400709/1/yzelman11c-pp.pdf

[3] sparse matrices are a fairly obvious representation of directed graphs
with edge weights; I searched for an article about sparse matrices as I'm
more familiar with linear algebra than graph processing.

[4] edit: perhaps a crude, intuitive way to think about "30x matvec products"
in the context of graph algorithms is "about one pagerank evaluation": each
pagerank iteration does a matvec product plus a bit of other work, and
McSherry talks about 10 or 20 iterations per pagerank evaluation.
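
To make the footnote concrete, here's a minimal pagerank sketch (my own
illustration, not code from any of the linked posts; it assumes P is a
column-stochastic link matrix with no dangling nodes):

    import numpy as np

    def pagerank(P, alpha=0.85, iters=20):
        # Each iteration is one matvec plus a teleport correction --
        # the "matvec product and a bit of other stuff" above.
        n = P.shape[0]
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = alpha * (P @ r) + (1 - alpha) / n
        return r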

~~~
Smerity
Agreed re: the quality of Frank McSherry's posts - they're all great fun and
push the technical envelope!

I'm glad you brought up Hilbert ordering. It's fascinating, but I rarely see
it used; Z-order curves usually appear in its place - simpler and more
intuitive, but also less effective. Does anyone know of large-scale databases
with Hilbert ordering out of the box?

For anyone who hasn't seen them before, or wondered why they help with
ordering, I highly recommend "Portrait of the Hilbert curve"[1] and "Spatial
indexing with Quadtrees and Hilbert Curves"[2]. The tl;dr is that they enable
efficient multi-dimensional indexing. Given a query like "X < 40 and Y > 80",
most databases need to run "X < 40" and then filter the results for "Y > 80"
- Hilbert curves let you jump straight to the query's intersection and only
then perform minor filtering.
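
To make that concrete, here's a minimal sketch of the classic iterative 2D
Hilbert mapping (my own illustration, not code from either post): points that
are nearby in (x, y) tend to get nearby values of d, which is why a 2D range
query maps to a handful of contiguous index ranges.

    def xy2d(n, x, y):
        # Map (x, y) on an n x n grid (n a power of two) to its
        # distance d along the Hilbert curve.
        d = 0
        s = n // 2
        while s > 0:
            rx = 1 if x & s else 0
            ry = 1 if y & s else 0
            d += s * s * ((3 * rx) ^ ry)
            if ry == 0:                  # rotate/flip the quadrant
                if rx == 1:
                    x, y = n - 1 - x, n - 1 - y
                x, y = y, x
            s //= 2
        return d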

As far as large-scale use of Hilbert ordering goes, Google uses it internally
for much of its geographical work. They've even open-sourced a Java library
specialised in multi-dimensional indexing based on Hilbert curves, called
Uzaygezen[3].

[1]: http://corte.si/%2Fposts/code/hilbert/portrait/index.html

[2]: http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves

[3]: http://google-opensource.blogspot.com/2008/08/uzaygezen-multi-dimensional-indexing.html

~~~
jandrewrogers
The primary advantage of Hilbert curves relative to Morton (Z) curves is that
they are marginally more efficient at linearizing multidimensional point
indexes, e.g. for storage on spinning disk.
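
For contrast with the Hilbert mapping sketched above, the Morton/Z index is
plain bit interleaving, which is why it's so much simpler to implement (again
my own illustration):

    def morton2(x, y, bits=16):
        # Bit i of x lands at position 2i, bit i of y at 2i + 1.
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z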

However, the reason that Hilbert curves are relatively rare in real systems
these days is that (1) the optimality of Hilbert curves is largely based on
the assumption of indexing on sequential storage media, and (2) for the many
other applications of structures describable by space-filling curves, a
Hilbert description is often inferior in practice. For example, it is a
suboptimal curve to use if you are implementing spatial analysis operators in
a distributed system.

If you were designing a spatial indexing system from scratch today for modern
applications, you would use neither Hilbert nor Morton/Z curves, though they
are adequate shortcuts for sufficiently narrow applications. Substantially
better curves exist, but they have much more complex descriptions.

~~~
cbsmith
What would you go with for spatial indexing today? R*-tree? k-d tree? BVH?

------
Xyik
I thought Facebook was located at 1 Hacker Way, not Hacker Lane

------
anotherangrydev
> we were able to execute PageRank on over a trillion social connections in
> less than 3 minutes per iteration with only 200 machines

Colour me skeptical; that's around 28M edges per second per machine.

~~~
cbsmith
Let's reframe that as ~1700 cycles per edge.

Seem more plausible?

These are 16-core machines with 10GbE links. Each core is processing maybe 2
million edges per second with highly data-parallel instructions. Ignoring
hyperthreading, if these are 3GHz cores (pretty conservative), that means you
are burning ~1500 cycles to process an edge. You've got 2 FP ALUs and 2
integer ALUs in each core, not to mention the AGU. You've got prefetchers and
completion units that are designed to handle 4-6 instructions per cycle, and
that's not for nothing. You can get a _lot_ of work done in one core with
just 1000 cycles, let alone 1500.
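
Spelling out that arithmetic (a quick sanity check; the 3GHz clock is my
assumption, the rest are the paper's figures):

    # ~1e12 edges / 180 s spread over 200 machines x 16 cores
    edges, seconds = 1e12, 180
    machines, cores, clock_hz = 200, 16, 3e9
    per_core = edges / seconds / (machines * cores)  # ~1.7M edges/s/core
    print(per_core, clock_hz / per_core)             # ~1700 cycles per edge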

So I'm not so sure why that would seem incredible.

~~~
anotherangrydev
Their test machines have 16 cores. They're running 2.9 Gops/core. Plausible,
yeah. I'm just not used to seeing real systems achieve almost 100%
efficiency; it sounds really good.

~~~
cbsmith
To provide perhaps more perspective, here's a paper using 2004 technology:
http://www.leonidzhukov.net/papers/ParallelPageRank-2004.pdf

With 70 machines & 140 processors, they did pagerank over 6.6 billion edges in
35 minutes. With 285% of the machines, you'd expect them to do it in 12.25
minutes. Moore's Law should make each machine about 160x more powerful, and
with 300x the edges they are finishing in 19 minutes. So we've improved by 2x
over an 11-year-old software stack that we've long since abandoned for better
approaches, using a hardware stack that is much more optimized for SIMD
computations than the 11-year-old hardware.

~~~
anotherangrydev
I'm reading that one, thanks.

I really don't know the PageRank algorithm, but I was really impressed by
their results: a trillion of "something" in 3 minutes. I thought that just
reading that much data would take a lot longer, but apparently not.

Really nice stuff.

------
logicrime
I kinda hate these papers that just humble-brag about clusterized setups
without providing any abstract insights. This doesn't bring me any closer to
understanding graph data, but I'm now ready to begin installation of a
multi-million-dollar cluster of machines and storage.

The bit about k-means was interesting, but the rest was an irrelevant bore.

~~~
Hytosys
Imagine being one of the five authors of this paper, browsing this comments
section. It's natural to distance yourself from people when you're behind a
keyboard, but for fuck's sake — these are your peers. If you're going to be
critical, do it with attention and care.

~~~
lcswi
The people were not critized, the paper was.

