GraphScope: A One-Stop Large-Scale Graph Computing System (github.com/alibaba)
66 points by sighingnow on Feb 3, 2021 | 25 comments



GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. It makes multi-stage processing of large-scale graph data on compute clusters simple by combining several pieces of Alibaba technology for graph analytics, interactive queries, and graph neural network (GNN) computation with the vineyard store, which offers efficient in-memory data transfers.

We just released version 0.2.0. Along with the release, we launched a public JupyterLab service where you can try it in your browser: https://try.graphscope.app

GitHub: https://github.com/alibaba/graphscope (stars are welcome :)

Website: https://graphscope.io

Documentation: https://graphscope.io/docs

Any comments and contributions from the community are welcome!


If you have a small graph, would it be unreasonable to use GraphScope to analyze it - what if it was a small graph that you expected to grow big later? What if you have a graph inside a graph db - I guess GraphScope gives you extra performance by allowing you to distribute processing for tasks where that would make sense - any other benefits like that?


> if you have a small graph would it unreasonable to use graphscope to analyze it - what if it was a small graph that you expected to grow big later?

You are right. GraphScope is built for processing very large graphs, but it should work just as well for small graphs. If you expect your graphs to grow too big to be handled on a single machine in the future, I think it is a good idea to start with GraphScope even while they are small.

Stay tuned: we have plans to make GraphScope nicer to use for smaller graphs without a k8s cluster.

> What if you have a graph inside a graph db - I guess GraphScope gives you extra performance by allowing you to distribute processing for tasks that would make sense - any other benefits like that?

Besides extra performance, there could be a few other benefits depending on your scenario.

1. Many graph analytics / GNN tasks take a lot of computing resources. A single task can take hours to run even with 1000 CPU cores. It makes sense to run such tasks on other machines/systems, so that they do not burden the graph db and affect its quality of service.

2. Full integration with Python makes data analytics more flexible. For example, you can use numpy, pandas and mars (https://github.com/mars-project/mars) alongside GraphScope with zero-copy data sharing, thanks to our storage engine vineyard (https://github.com/alibaba/libvineyard).

3. Besides distributed processing, extra performance can also come from the efficient graph layout in memory, and from other optimizations at the compiler and runtime levels. GraphScope is ~100x faster on Gremlin queries, and even more on graph analytics algorithms like PageRank, compared with graph dbs like JanusGraph.
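
To make the "graph layout in memory" point a bit more concrete, here is a minimal single-machine sketch of one common compact layout, compressed sparse row (CSR), with PageRank running over it. This is only an illustration in plain Python, not GraphScope's actual layout or code; the function names `build_csr` and `pagerank` and the toy edge list are made up for this example.

```python
def build_csr(num_nodes, edges):
    """Pack a directed edge list into two flat CSR arrays (offsets, targets)."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1
    # offsets[v]..offsets[v + 1] is the slice of `targets` holding v's out-neighbors.
    offsets = [0] * (num_nodes + 1)
    for v in range(num_nodes):
        offsets[v + 1] = offsets[v] + out_degree[v]
    targets = [0] * len(edges)
    cursor = offsets[:-1]  # next free slot for each source node (a fresh list)
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def pagerank(num_nodes, offsets, targets, damping=0.85, iterations=50):
    """Power-iteration PageRank over the CSR arrays (hypothetical helper)."""
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in range(num_nodes)
                       if offsets[v] == offsets[v + 1])
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        nxt = [base] * num_nodes
        for v in range(num_nodes):
            start, end = offsets[v], offsets[v + 1]
            if end > start:
                share = damping * rank[v] / (end - start)
                for i in range(start, end):
                    nxt[targets[i]] += share
        rank = nxt
    return rank


# Toy graph, invented for illustration only.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
offsets, targets = build_csr(4, edges)
ranks = pagerank(4, offsets, targets)
```

The inner loop only walks two flat arrays sequentially, which is the kind of access pattern that tends to be much friendlier to caches than chasing per-vertex objects or going through a database storage layer.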


Super interesting and timely for us. Curious to plug it into our stack and see how it plays with our GPU visuals! I'm especially optimistic b/c of the pydata bindings.

Can you share a bit more about the k8s-less design plan + timeline? (apt/conda/docker/...?)

And any intuition for something like efficient Arrow/Parquet dataframe ingest/export? We're especially focused on Dask interop.


Thanks for your interest in GraphScope!

We do have a concrete plan for k8s-less deployment and we already have an issue [1] to track that. That will be available before the end of March 2021.

To simplify environment setup, we will release a docker image for end users, but running without docker will work as well (it requires building from source).

GraphScope uses vineyard [2] as the storage layer for in-memory graph data structures. Currently, the graph type (aka ArrowPropertyFragment in GraphScope) uses a set of Arrow tables and arrays under the hood.

GraphScope supports a `to_vineyard_dataframe` method on the computation context [3]. We also have a plan for integration between vineyard and dask (it may be delivered in March as well). At that point, interop with dask should be straightforward.

[1]: https://github.com/alibaba/GraphScope/discussions/113

[2]: https://github.com/alibaba/libvineyard

[3]: https://graphscope.io/docs/reference/context.html#graphscope...


This is great! How does it compare to blazegraph and how does it compare to redisgraph?

As far as I can tell, it's not for doing parallel graph analysis (i.e. GAS) on the GPU, is that right? (It's hard to tell, because I'm also seeing that this is designed to be used for GNNs.) But if I wanted to spawn lots of workers on a single machine, I could do that?


> How does it compare to blazegraph and how does it compare to redisgraph?

GraphScope is not a graph database, and it cannot handle updates to graphs. The graphs are loaded into memory and are "immutable", but queries on GraphScope are typically faster than on graph databases, and GraphScope scales very well - you can launch a larger session from your Python notebook to handle a bigger graph or run a more complex algorithm.

> As far as I can tell, it's not for doing parallel graph analysis (i.e. GAS) on the GPU, is that right? (It's hard to tell, because I'm also seeing that this is designed to be used for GNNs.) But if I wanted to spawn lots of workers on a single machine, I could do that?

Internally, we have an early version running with GPU support. However, it does not really seem cost-effective to run algorithms with lots of random memory accesses on sparse data structures like large graphs on GPUs. There could be a ~10x speed-up, but because GPU memory is typically smaller than main memory, the graphs that fit on a GPU would not be too slow to process on CPUs either.

For the GNN part, we focused on sampling, which is currently done on CPUs too. After the samples are fed to TensorFlow, TensorFlow can use GPUs depending on the configuration.


Thank you, this is really helpful. I think clarifying those limitations somewhere early on the website could save people a lot of time in figuring out whether this is a good choice for them. It sounds like this might not be best for realtime graph analytics, but would be great for very large static graphs.


Stay tuned. We have a plan to build a graph storage engine that can work side by side with GraphScope.


It'd be interesting to see how this compares with other graph systems. For example, Neo4j, JanusGraph?


Good question! I am from the team that built GraphScope.

Neo4j and JanusGraph are graph databases, whereas GraphScope is not.

For end users, using GraphScope feels like using a Python library like NetworkX (https://networkx.org/), and you can do graph analytics, Gremlin queries, GNN sampling, etc. in a single place. But unlike NetworkX, which does not support parallelization, GraphScope handles the actual processing of (big) graphs in parallel on a distributed k8s cluster.


How does the GraphScope graph backend compare to the previous one built by Alibaba - Euler? https://github.com/alibaba/euler/wiki


The learning engine in GraphScope (aka GLE) has a similar programming interface and functionality to Euler. They both support graph neural network training, e.g., GraphSAGE, GCN, GAT, etc. However, there are many differences under the hood.

1. The programming model is different. Euler provides a message-passing style API to define new graph models, while GLE provides a sampling API first, and abstracts a batch of seed nodes or edges (named "ego") and their receptive fields (multi-hop neighbors) as an "EgoGraph", which can be turned into an "EgoTensor" as features.

2. Euler implements operations on graphs as TensorFlow ops, while GLE takes a more flexible design and isn't coupled to a specific machine learning framework. On top of the sampling interface, developers can build their GNN models using TensorFlow, PyTorch or any other machine learning framework, and can even use it for non-GNN tasks.

3. Most importantly, Euler is just for graph learning. In GraphScope, the learning engine can work together with the analytical engine and the interactive query engine. Real-world cases are usually complicated: the problem is not just a GNN training task, and different dedicated systems may be involved for different kinds of workloads, which means users need to move and transform data back and forth between systems. In GraphScope, engines for different purposes share the same graph and live in the same Jupyter notebook to deliver one-stop large-scale graph computation. That is more user-friendly for data scientists.
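
For readers unfamiliar with the "seed nodes plus receptive field" idea in point 1, here is a rough sketch of fixed-fanout multi-hop neighbor sampling in plain Python. It is not the learning engine's API: `sample_ego_graph`, the `fanouts` parameter and the toy adjacency dict are hypothetical names invented for this illustration.

```python
import random


def sample_ego_graph(adjacency, seed, fanouts, rng):
    """Return one list of sampled node ids per hop, starting from `seed`.

    Hypothetical helper: hops[0] is the seed ("ego"), hops[k] is the sampled
    k-hop receptive field.
    """
    hops = [[seed]]
    for fanout in fanouts:
        frontier = []
        for node in hops[-1]:
            neighbors = adjacency.get(node, [])
            if neighbors:
                # Sample with replacement so each node yields exactly `fanout` ids,
                # which keeps every hop a fixed-size, tensor-friendly block.
                frontier.extend(rng.choices(neighbors, k=fanout))
        hops.append(frontier)
    return hops


# Toy undirected graph, invented for illustration only.
adjacency = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
rng = random.Random(42)
ego = sample_ego_graph(adjacency, seed=0, fanouts=[2, 2], rng=rng)
```

Because the output is just lists of node ids per hop, nothing in it depends on a particular ML framework; the per-hop ids can be used to look up features and build whatever tensors the model needs.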


alibaba/euler is a lightweight library built specifically for GNN sampling.

GraphScope, by contrast, is built for many kinds of graph tasks, such as Gremlin queries, graph analytics and GNN sampling.

You can check this example to get an idea of what GraphScope can do. https://nbviewer.jupyter.org/github/alibaba/GraphScope/blob/...

The graphs in GraphScope are backed by vineyard (https://github.com/alibaba/libvineyard). That enables GraphScope to have multiple specifically optimized runtimes (written in C++, Rust and Python) for different tasks that share the distributed graph data in memory efficiently.


Does this offer similar functionality to GraphLab? It's sad how badly that project has deteriorated. Yet for some algorithms, e.g. computing PageRank, the difference compared to NetworkX is still night and day.


Yes, we intend to cover the functionality provided by GraphLab, but with better performance (see https://github.com/alibaba/libgrape-lite/blob/master/Perform..., we are actually 10x~50x faster...).

We also provide the ability to run Gremlin queries on graphs as well as GNNs with TensorFlow, neither of which is provided by GraphLab.


I have always wondered what common use cases such systems address. How large are the graphs typically? Are graphs really so big that they don't fit on a single machine?


When you are dealing with payments data (x paying y) and a few months of transaction volume, the graphs are very large. Companies such as NPCI or Alipay deal with this kind of data.

Some of the use cases for building such graphs are getting node embeddings for fraud prevention, link prediction, etc.


Indeed, link prediction and fraud detection are two of the many things we support internally at Alibaba.

Transaction graphs can be huge. Web graphs are huge. There are also many other huge graphs and use cases, like spotting irregularities / intrusions in network traffic graphs in Alibaba Cloud. Some bioinformatics algorithms also require the ability to process big graphs.


GraphScope addresses computations on extremely large graphs, e.g., the friendship network of all Facebook users, the hyperlink relations between all webpages around the world, and so on.

Typically the graph may have billions of nodes and tens of billions of edges. Graph data at that scale generally cannot fit on a single machine to run algorithms like SSSP or PageRank. A single machine also usually doesn't have enough cores for the computation, e.g., an interactive query couldn't return within milliseconds. That is why we need a distributed graph processing system for such big graphs.
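
As a rough back-of-envelope check on those sizes, the sketch below estimates the memory needed for the bare topology of such a graph in a CSR-style layout. The concrete numbers (2 billion nodes, 20 billion edges, 8-byte ids and offsets) and the helper name `csr_topology_bytes` are assumptions picked for illustration, not measurements from GraphScope.

```python
def csr_topology_bytes(num_nodes, num_edges, id_bytes=8, offset_bytes=8):
    """Bytes for one direction of CSR: an offsets array plus a neighbor-id array."""
    return (num_nodes + 1) * offset_bytes + num_edges * id_bytes


# Assumed sizes, chosen only to match "billions of nodes, tens of billions of edges".
nodes = 2_000_000_000
edges = 20_000_000_000

topology = csr_topology_bytes(nodes, edges)
topology_gib = topology / 2**30  # roughly 164 GiB under these assumptions
print(f"topology only: {topology_gib:.1f} GiB")
```

That figure covers out-edges only. Storing in-edges as well, vertex and edge properties, and per-vertex algorithm state (distances, ranks, message buffers) can each add a comparable amount, which is where a single machine's memory usually runs out.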


Way back when I was working on parallel graph engines, graphs with trillions of edges were pretty typical. The kinds of data models that produce graphs that large are pretty diverse.


It's a great project, so why do you think you need a "Sign in and star GraphScope" dark pattern on the sandbox page?


Sorry that it made you feel that way. It is just that the colleague who built the landing page thought it was a good idea to ask for stars there.


Did you benchmark GraphScope with GraphX or Giraph? How does it compare to these two?


We don't have a benchmark comparing the analytical engine in GraphScope (aka GAE) with GraphX/Giraph. But we have evaluated the performance of GAE's underlying engine, libgrape-lite [1], with the LDBC Graph Analytics Benchmark, and it achieves higher performance than state-of-the-art systems [2].

[1]: https://github.com/alibaba/libgrape-lite

[2]: https://github.com/alibaba/libgrape-lite/blob/master/Perform...



