How to Partition a Billion-Node Graph [pdf] (microsoft.com)
125 points by espeed on Jan 5, 2017 | 14 comments



That looks like a pre-print or review copy, and is missing the authors' names and other citation information.

The paper is: L. Wang, Y. Xiao, B. Shao and H. Wang, "How to partition a billion-node graph," 2014 IEEE 30th International Conference on Data Engineering, Chicago, IL, 2014, pp. 568-579. doi: 10.1109/ICDE.2014.6816682

Here's the (potentially paywalled) link to IEEE Xplore: http://ieeexplore.ieee.org/document/6816682/


Here's a direct link to the ICDE Conference version: https://www.graphengine.io/downloads/papers/Trinity.Partitio...


Even better!

Mods, maybe swap the links?


This paper by Fabio Checconi and Fabrizio Petrini of IBM Watson is a rare look into design details:

Traversing Trillions of Edges in Real-time: Graph Exploration on Large-scale Parallel Machines

http://www.odbms.org/wp-content/uploads/2014/05/g500-ipdps14...


OK, now I'd like to see this run on a VLSI ASIC place-and-route system.

Originally these graph partitioning algorithms were largely designed to solve that problem.


Can't download it.

What's the problem? Machines with 32G of RAM are commonplace; you can probably just fit it all into memory. The important bits, at least.


I think it's fair to say there is still value in extending the state of the art in graph partitioning algorithms.

There are two assumptions implicit in your statement. First, that you have hardware available with large amounts of RAM (32GB would not be enough for many of the graphs I work with, though in fairness 128GB+ almost always would be). Second, that RAM is the graph application's only bound (rather than CPU, network, etc.).

I'm not personally convinced that either assumption is sound (but then, as a PhD student nearing the end of his thesis investigating graph partitioning ... check my bias :-) ).


The examples mentioned are the web (50B nodes, 1000B links) and the Facebook social graph (0.8B nodes, 100B links).

You might fit quite a bit into a beefy 3TB RAM server, but it might not be the best RAM-to-CPU ratio for graph algorithms.
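
As a rough back-of-envelope check (my own sketch in Python, assuming a plain CSR layout with 8-byte offsets and 8-byte neighbor IDs, which is an assumption, not the paper's format): the Facebook-sized graph fits in 3TB, the web-sized one doesn't.

    # Rough CSR footprint: one offset per node + one neighbor ID per (directed) edge.
    # 8 bytes per entry is an assumed layout; compressed formats would shrink this.
    def csr_bytes(nodes, edges, id_bytes=8, offset_bytes=8):
        return nodes * offset_bytes + edges * id_bytes

    for name, nodes, edges in [("web", 50e9, 1000e9), ("facebook", 0.8e9, 100e9)]:
        print(name, round(csr_bytes(nodes, edges) / 1e12, 1), "TB")

    # web 8.4 TB       -> well past 3 TB
    # facebook 0.8 TB  -> fits, with room left for per-edge metadata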


You can compute non-trivial statistics about the entire web graph (roughly |V|=3B, |E|=100B) on a single high-end server these days. Speaking from personal experience, a machine with > 64 cores and 1TB of RAM can compute connectivity, MSTs, and partition this graph in minutes. Are there graphs out there (with the exception of Facebook's and Google's private graphs) that are so large as to be intractable without resorting to external-memory/distributed computing?
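
To illustrate why this is mostly a memory question rather than an algorithmic one, here's a minimal single-machine connectivity sketch using union-find (plain Python, my own toy example, not the setup described above): once the edge list is resident in RAM, counting connected components is a single linear pass.

    # Minimal union-find connectivity sketch; edges is an iterable of (u, v) pairs.
    # Illustrative only -- a real run at |E|=100B would use a compiled, parallel version.
    import array

    def connected_components(num_nodes, edges):
        parent = array.array('q', range(num_nodes))

        def find(x):
            # Find the root, then compress the path behind us.
            root = x
            while parent[root] != root:
                root = parent[root]
            while parent[x] != root:
                parent[x], x = root, parent[x]
            return root

        components = num_nodes
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
        return components

    # Toy graph with two components: {0, 1, 2} and {3, 4}.
    print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # -> 2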


LinkedIn + Outlook + Dynamics?


Sorry, I should clarify: publicly accessible graphs. The hyperlink graphs are the largest I've found.

http://webdatacommons.org/hyperlinkgraph/index.html#toc1



Extrapolating from a few data sets at that link, the road network for the entire US is probably on the order of 100M nodes and around 250-500M edges.

I would imagine real biological data isn't quite there yet, but it's certainly not for want of trying. Simulations might get pretty close. Blue Brain was aiming for 100M neurons and (presumably) a few orders of magnitude more connections, and I think there's another simulation floating around with 10^11 neurons (but presumably less biophysical accuracy).


Obviously, graph partitioning is a useful and practical problem whether or not there are publicly accessible graphs that require more RAM than an EC2 X1 instance gives you.

But yes, assuming your data is "big data" when it can easily be processed on a dev workstation or beefy server is very much in vogue right now.



