Google Pregel: a system for large-scale graph processing

helwr · on June 22, 2010

here is a brief summary

Pregel defines a sequence of supersteps (execution cycles) which are executed in parallel on the entire (partitioned) graph.

The supersteps are executed by worker threads, synchronized by global synchronization points and coordinated by a master. Pregel keeps vertices and edges on the machine that performs computation and uses network only for message passing.

Within a superstep each vertex executes the same user-defined function. There is no message passing between vertices during the superstep but the vertices will recieve their messages in the next cycle only.

There are no remote reads, or shared memory concepts like locks, mutexes and semaphores - it is a "pure message passing model".

Vertex can modify its state, or that of its outgoing edges, receive messages sent to it in the previous superstep, send messages to other vertices (to be received in the next superstep) or even change the topology of the graph.

Vertex state machine has two states Active/Inactive. a vertex is activated by a message received from a previous superstep. A vertex deactivates itself by "voting to halt". When all vertices are deactivated and there are no messages in transit the algorithms terminates.

The output of Pregel program is a set of outputs emitted by vertices, could be a directed graph or a set of disconnected vertices or aggregate stats.

The authors mention Valiant's "Bulk Syncronous Parallel model" as the main inspiration behind this model (I've found only ACM link: http://portal.acm.org/citation.cfm?id=79181)

----------------------

PageRank in Pregel:

class PageRankVertex

: public Vertex<double, void, double> {

public:

virtual void Compute(MessageIterator* msgs) {

if (superstep() >= 1) {

double sum = 0;

for (; !msgs->Done(); msgs->Next())

sum += msgs->Value();

MutableValue() =0.15 / NumVertices() + 0.85 sum; }

if (superstep() < 30) {

const int64 n = GetOutEdgeIterator().size();

SendMessageToAllNeighbors(GetValue() / n); } else {

VoteToHalt(); } } };

kiwidrew · on June 22, 2010

I'd love to find a PDF of the full paper; we hear so much about MapReduce, and this is essentially the map-reduce approach as applied to graphs.

The best reference about Pregel I've been able to find so far is a one-page announcement, published last year: http://jloxim.mimuw.edu.pl/redmine/attachments/11/p6-malewic...

nerme · on June 22, 2010

MapReduce is already being applied to graphs.

Spectral graph theory, ie, the construction and manipulation of graphs that are described by matrices, has been around for quite some time. Two types of matrix representations are typically used. Adjacency matrices and Laplacian matrices.

Page Rank is an example of this. It is equivalent to finding the centrality of a graph, by computing the primary eigenvector of the adjacency matrix of the graph.

MapReduce is used for accomplishing the difficult task of computing the needed eigenvector of a multi-billion dimensional matrix. Instead of attempting an exact computation of the primary eigenvector, the power method is used. The power method is an iterative approach that quickly approaches an approximation of the eigenvector. It comes down to looping over a matrix multiplied by a vector until you get the precision you're looking for.

The cool thing about the power method is that you don't REALLY need to get your data in to some sort of linear algebraic representation of a matrix, you can just use more efficient data structures to mimic matrix multiplication.

Since matrix multiplication is done a row and a column at a time, you can split it up in a parallel manner. The map function handles the computation of each row of the resulting vector, and the reduce function puts it back together again, ready for the next iteration of the power method.

As for PreGel, I didn't really understand the description of it.

I have always had a lot of gripes with publications related to algorithms. There are very rarely any code snippets. I feel at least that the authors should go over their algorithms in some sort of pseudo-code.

I realize that academic papers are written for a specific audience in mind, an audience that is used to seeing a bunch of algorithms described in linear algebraic notation, but I feel it does a huge disservice to spreading knowledge. I mean, I'm somewhat versed in mathematics, albeit at an undergrad level, but I rarely know what these articles are talking about in any specific detail, at least to the point where I could reproduce their approach in functioning computer code.

I've even been published in ACM, and I can't follow the majority of papers in the journal!

This isn't just applicable to computer science, rather all mathematics in general. The notations used are so arcane that even the simplest of ideas can seem daunting. It's probably why so many people are turned off by mathematics early on in their academic careers.

It took me a good week of brushing up on linear algebra and reading through various descriptions and white papers to figure out what was going on with Page Rank. It really isn't that complicated of an algorithm in the end... you're just constructing an MxM matrix where rows and columns are considered web pages, getting those rows to add up to 1 while making sure there are no elements equal to 0, computing an eigenvector, and then scaling the eigenvector down to make each element add up to 1. However, you read the Wikipedia entry on Page Rank and it is borderline indecipherable!

Why isn't there a "somewhat complex math explained in normal language with pretty pictures and graphs" textbook or website out there? Are all mathematicians incapable of communicating their art form to the world at large?

ajb · on June 22, 2010

Well, there's "Proofs without words" (http://books.google.co.uk/books?id=Kx2cjyzTIYkC) and its sequel (http://books.google.co.uk/books?id=wMpwQgAACAAJ). I'm not sure that's exactly what you meant, though.

ableal · on June 22, 2010

Digging around a bit, without much luck - there's the research blog post of a year ago: http://googleresearch.blogspot.com/2009/06/large-scale-graph... (some comments there point to similar work.)

Snippets:

so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code.

spoiler: The seven bridges of Königsberg — inspiration for Leonhard Euler's famous theorem that established the basics of graph theory — spanned the Pregel river.

Maro · on June 22, 2010

Is the PDF available for download somewhere?

gregbayer · on July 29, 2010

Should be available here: http://portal.acm.org/citation.cfm?id=1807184