Numerous graph systems have been built that can scale to data of this size, often on just a single powerful node, including FlashGraph (which uses SSDs), Frank McSherry's single-laptop Rust implementation, and Dato's GraphLab Create.
As Frank McSherry points out, there's a lot of room to improve the state of the art in graph computation systems, especially as the current academic evaluations are broken.
(where "broken" means that a single-threaded graph system written in Rust on a laptop can beat benchmarks set by clusters of high-performance machines)
I was curious to read more about the Hilbert ordering of edges and the delta-encoding. There is a related article by Yzelman & Bisseling titled "A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve", which discusses Hilbert ordering combined with a delta-encoded version of the usual CSR encoding for sparse matrices.
> A Hilbert curve thus can be projected on any [2^m * 2^m] matrix, which in turn can embed the sparse matrix A, thus imposing a 1D ordering on its nonzeroes.
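To make that 1D ordering concrete, here is the common textbook conversion (a sketch of the general technique, not the paper's implementation) from a nonzero's (row, column) position to its distance along the Hilbert curve; `n` is the 2^m grid size:

```python
def _rot(n, x, y, rx, ry):
    """Rotate/reflect a quadrant so the recursion lines up."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) in an n x n grid,
    where n is a power of two. Sorting edges (src, dst) by this key
    yields the Hilbert ordering discussed above."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rot(n, x, y, rx, ry)
        s //= 2
    return d
```

Sorting an edge list by `xy2d(n, src, dst)` is what gives consecutive edges nearby source *and* destination vertices, which is the cache-locality payoff.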
The largest sparse matrix considered for the experimental results from this article was "wikipedia-20070206" with 45,030,389 nonzeros (aka weighted edges). Ignoring the weights, this is about 1/3000th of the size of the 128 billion edge graph that McSherry mentions in his post.
Yzelman & Bisseling also measure the additional preprocessing time required to convert their sparse matrices to the Hilbert-ordered delta-encoded CSR scheme -- roughly 30x the time for a single matrix-vector product evaluation. Clearly this approach only pays off if one plans to run many analyses on the same matrix/graph without modifying it. McSherry mentions a similar issue but points out that he could preprocess the graph using his approach faster than he could download the data, so preprocessing speed wasn't really an issue.
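A minimal sketch of the delta-encoding idea (my own illustration, not the paper's exact scheme): once the nonzeros/edges are sorted along the curve, consecutive 1D indices are close together, so storing the gaps instead of the absolute indices yields small values that compress well.

```python
def delta_encode(sorted_indices):
    """Store gaps between consecutive (sorted) 1D curve indices."""
    out, prev = [], 0
    for idx in sorted_indices:
        out.append(idx - prev)
        prev = idx
    return out

def delta_decode(gaps):
    """Recover the absolute indices by a running sum over the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

For example, `delta_encode([3, 4, 7, 15])` returns `[3, 1, 3, 8]`; in a real encoder the small gaps would then go through a variable-length byte code.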
Since sparse matrices are a fairly obvious representation of directed graphs with edge weights, I searched for an article about sparse matrices, as I am more familiar with linear algebra than with graph processing.
edit: perhaps a crude, intuitive way to think about "30x matvec products" in the context of graph algorithms would be "about one PageRank", as each PageRank iteration does a matvec product plus a bit of other work, and McSherry talks about 10 or 20 iterations per PageRank evaluation.
I'm glad you brought up Hilbert ordering. It's fascinating but I rarely see it. I usually see Z-order curves in their place - simpler and more intuitive but also less effective. Does anyone know of large scale databases with Hilbert ordering out of the box?
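For comparison, the Z-order (Morton) key is just bit interleaving, which is why it is so much simpler than the Hilbert mapping (a sketch assuming 16-bit non-negative coordinates):

```python
def morton(x, y):
    """Z-order (Morton) key: interleave the bits of x and y."""
    z = 0
    for i in range(16):
        z |= ((x >> i) & 1) << (2 * i)       # x bits in even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits in odd positions
    return z
```

The cost of that simplicity is the long "jumps" the Z curve makes between quadrants; the Hilbert curve's better locality comes precisely from avoiding those jumps.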
For anyone who hasn't seen them before, or wondered why they help with ordering, I highly recommend "Portrait of the Hilbert curve" and "Spatial indexing with Quadtrees and Hilbert Curves". The tl;dr is that they enable efficient multi-dimensional indexing. Given a query like "X < 40 and Y > 80", most databases need to run "X < 40" and then filter the results for "Y > 80"; Hilbert curves let you find the query's intersection directly and only then perform minor filtering.
As far as large-scale use of Hilbert ordering goes, Google uses it internally for much of its geographical work. They even open-sourced a Java library specialised in multi-dimensional indexing based on Hilbert curves, called Uzaygezen.
However, the reason that Hilbert curves are relatively rare in real systems these days is that (1) the optimality of Hilbert curves is largely based on the assumption of indexing on sequential storage media, and (2) for the many other applications of structures describable by space-filling curves, a Hilbert description is often inferior in practice. For example, it is a suboptimal curve to use if you are implementing spatial analysis operators in a distributed system.
If you were designing a spatial indexing system from scratch today for modern applications, you would use neither Hilbert nor Morton/Z curves, but they are adequate shortcuts for sufficiently narrow applications. Substantially better curves also have much more complex descriptions.
Colour me skeptical: that's around 28M edges per second per machine.
PageRank isn't doing much other than a load, a few += operations, and a store. The main reason it is slow is because memory is far away, but if you lay out the edges intelligently your rank data is usually in the L3 and then computers go fast.
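To make "a load, a few += operations, and a store" concrete, here is a minimal edge-at-a-time PageRank iteration (an illustrative Python sketch over an edge list; the 0.85 damping factor and the names are conventional assumptions, not from the thread):

```python
def pagerank_iteration(num_nodes, edges, ranks, out_degree, alpha=0.85):
    """One PageRank iteration over an edge list of (src, dst) pairs."""
    # Each node's outgoing contribution, computed once per node.
    contrib = [alpha * r / d if d else 0.0
               for r, d in zip(ranks, out_degree)]
    new_ranks = [(1.0 - alpha) / num_nodes] * num_nodes
    for src, dst in edges:
        # The per-edge hot loop really is load, +=, store. Whether
        # contrib[src] and new_ranks[dst] hit L3 depends entirely on
        # the edge ordering, which is the point made above.
        new_ranks[dst] += contrib[src]
    return new_ranks
```

With a Hilbert-ordered edge list, consecutive iterations of that loop touch nearby `src` and `dst` entries, so the rank arrays stay cache-resident.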
The PageRank implementation on Dato's GraphLab Create when run on the Web Data Commons Hyperlink Graph (128 billion edges) does 3 billion edges a second on 16 nodes, which is 187 million edges per second per machine.
Given that communication overhead quickly becomes an issue for most of these systems, and that this graph has more edges, 28 million edges per second per machine seems quite reasonable.
Seem more plausible?
These are 16-core machines with 10GbE links. Each core is processing maybe 2 million edges per second with highly data-parallel instructions. Ignoring hyperthreading, if these are 3GHz cores (pretty conservative), that means you are burning ~1500 cycles to process each edge. You've got 2 FP ALUs and 2 integer ALUs in each core, not to mention the AGU. You've got prefetchers and completion units that are designed to handle 4-6 instructions per cycle, and that's not for no reason. You can get a lot of work done in one core with just 1000 cycles, let alone 1500.
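Back-of-envelope for the numbers above (assuming the 28M edges/s/machine figure and 16 cores at 3 GHz; rounding per-core throughput up to 2M edges/s gives the ~1500-cycle figure):

```python
edges_per_sec_per_machine = 28e6   # the figure questioned upthread
cores = 16
clock_hz = 3e9                     # conservative clock assumed above

per_core = edges_per_sec_per_machine / cores   # ~1.75M edges/s per core
cycles_per_edge = clock_hz / per_core          # ~1700 cycles per edge
```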
So I'm not so sure why that would seem incredible.
With 70 machines and 140 processors, they did PageRank over 6.6 billion edges in 35 minutes. With 285% of the machines, you'd expect them to do it in 12.25 minutes. Moore's Law should make each machine about 160x more powerful, and with 300x the edges they are finishing in 19 minutes. So we've improved by about 2x over an 11-year-old software stack that we've long since abandoned for better approaches, on a hardware stack that is much more optimized for SIMD computation than the 11-year-old hardware.
I really don't know the PageRank algorithm, but I was really impressed by their results: a trillion of "something" in 3 minutes. I thought that just reading that much data would take a lot longer, but apparently not.
Really nice stuff.
The bit about k-means was interesting, but the rest was an irrelevant bore.
It is there to help you better understand frameworks for working on graph data.
If you see it as a humble brag that is an irrelevant bore, you aren't the intended audience.
Also, the idea that talking about a computer cluster at VLDB is a humble-brag seems more indicative of a lack of knowledge on your part than of a lack of "abstract insights".
Irrelevant bore indeed.
I disagree with the tone of the original comment, but not with the sentiment. Just having a large installation does not make it interesting. However, Google's large systems almost always push the boundaries of the field -- MapReduce, GFS, Spanner, and DistBelief have been a joy to read about.