

Processing Billion-Node Graphs on an Array of Commodity SSDs - gk1
http://highscalability.com/blog/2015/5/19/paper-flashgraph-processing-billion-node-graphs-on-an-array.html

======
Smerity
I cannot recommend FlashGraph strongly enough. FlashGraph was one of the first
graph computation engines to enable near trivial analysis of the Web Data
Commons Hyperlink Graph - 3.5 billion web pages and 128 billion links.

For smaller graphs, analysis using FlashGraph is hilariously quick.

If you're interested in how this is achieved, refer to [1]. From memory, Da
Zheng said he created FlashGraph primarily because he wanted to prove how
efficient the underlying storage system was. Full details on FlashGraph are in
the paper at [2] (though FlashGraph runs far faster now!).

Note: I'm a data scientist at Common Crawl, which produces the dataset that
the Web Data Commons Hyperlink Graph is based upon, and the main developer of
FlashGraph, Da Zheng, wrote a guest post for us on this very topic [3], so I'm
rightfully biased towards thinking this is an amazing project!

[1]: http://www.cs.jhu.edu/~zhengda/sc13.pdf

[2]: https://www.usenix.org/system/files/conference/fast15/fast15-paper-zheng.pdf

[3]: http://blog.commoncrawl.org/2015/02/analyzing-a-web-graph-with-129-billion-edges-using-flashgraph/

------
mrry
An interesting comparison point: a single core on a late-2014 MacBook Pro can
achieve runtimes for the same graph that are within a factor of 4 for WCC (461
seconds for FlashGraph versus 1700 seconds for the laptop).

http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
(previously on HN: https://news.ycombinator.com/item?id=9001618)

There are also results for PageRank on that graph, which make the difference
more pronounced. FlashGraph runs PageRank in 2041 seconds (I'm assuming for 30
iterations, per Section 4 of the paper), whereas the laptop takes 46000
seconds for 20 iterations.
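
For a sense of what the single-machine baseline looks like, here is a minimal
single-threaded PageRank sketch in Python (my own toy illustration of the
approach, not McSherry's actual Rust code): each iteration is one streaming
pass over the edge list.

```python
def pagerank(edges, n, iterations=20, alpha=0.85):
    """Toy single-threaded PageRank over an edge list.

    edges: list of (src, dst) pairs with 0 <= src, dst < n.
    Returns a list of n rank values summing to ~1.
    """
    ranks = [1.0 / n] * n
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    for _ in range(iterations):
        # One sequential pass over the edges: push rank along each link.
        contrib = [0.0] * n
        for src, dst in edges:
            contrib[dst] += ranks[src] / out_degree[src]
        # Redistribute mass from dangling nodes (no out-links) uniformly.
        dangling = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        base = (1.0 - alpha) / n + alpha * dangling / n
        ranks = [base + alpha * c for c in contrib]
    return ranks
```

The COST argument is essentially that this per-edge inner loop, run from a
compressed edge stream on one core, already outpaces many distributed
systems once their coordination overheads are counted.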

~~~
Smerity
Absolutely spot on - FlashGraph and Frank McSherry's COST work have really
pushed the envelope on efficient large-scale graph analysis.

Frank McSherry wrote a "call to arms" for the broader graph community at [1].
His main point is that academics generally compared their work only against
existing distributed graph processing systems, celebrating any improvement,
while remaining unaware of the significant overheads the distributed approach
itself introduces. Both Frank's work (run on a single laptop) and FlashGraph
(run on a single powerful machine) run far faster than the distributed
systems, with very few disadvantages.

Note: I'm a data scientist at Common Crawl and Frank's graph computation
discussion article was a guest post at our blog.

[1]: http://blog.commoncrawl.org/2015/04/evaluating-graph-computation-systems-performance-and-scale/

------
nojvek
I wish there were a deeper technical explanation of how they made such a
difference with a file system layer. Would this also improve relational DB
perf?

~~~
corysama
Searching for "set-associative file system" brings up their publications

http://www.ncbi.nlm.nih.gov/pubmed/24402052

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881961/

https://github.com/icoming/FlashGraph
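
The "set-associative" part refers to how their storage layer (SAFS) organizes
its page cache. A toy sketch of the idea in Python (my own illustration with
made-up parameters, not the actual SAFS code): pages hash to one of many small
sets, and each set evicts independently, so threads touching different sets
never contend on a single global LRU lock.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Toy set-associative page cache (the idea, not SAFS's implementation)."""

    def __init__(self, num_sets=64, ways=8):
        self.num_sets = num_sets
        self.ways = ways
        # One small, independently locked LRU per set in a real system;
        # OrderedDict stands in for the per-set LRU here.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def get(self, page_id, load_page):
        s = self.sets[hash(page_id) % self.num_sets]
        if page_id in s:
            s.move_to_end(page_id)      # hit: mark most-recently-used
            return s[page_id]
        data = load_page(page_id)       # miss: read from the SSD array
        s[page_id] = data
        if len(s) > self.ways:
            s.popitem(last=False)       # evict LRU within this set only
        return data
```

Compared with one big shared LRU, the eviction decision is slightly less
global, but the lock contention on many-core machines drops dramatically.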

------
jheriko
i can't help but wonder if this has been tested vs. just letting the os
virtualise memory for you... i'm assuming yes.

