GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simple by combining several important pieces of Alibaba technology for analytics, interactive queries, and graph neural network (GNN) computation, together with the vineyard store, which offers efficient in-memory data transfers.
We just released version 0.2.0. Along with the release, we launched a public JupyterLab service where you can try it in your browser: https://try.graphscope.app
If you have a small graph, would it be unreasonable to use GraphScope to analyze it? What if it was a small graph that you expected to grow big later? What if you have a graph inside a graph db? I guess GraphScope gives you extra performance by allowing you to distribute processing for tasks where that would make sense - any other benefits like that?
> If you have a small graph, would it be unreasonable to use GraphScope to analyze it? What if it was a small graph that you expected to grow big later?
You are right. GraphScope is built for processing very large graphs, but for small graphs, it should work just as well. If you expect your graphs to grow too big to be handled on a single machine in the future, I think it is a good idea to start with GraphScope even when it is small.
Stay tuned, we have plans to make GraphScope nicer to use for smaller graphs without a k8s cluster.
> What if you have a graph inside a graph db? I guess GraphScope gives you extra performance by allowing you to distribute processing for tasks where that would make sense - any other benefits like that?
Besides extra performance, there could be a few other benefits depending on your scenario.
1. Many graph analytical / GNN tasks take a lot of computing resources. A single task can take hours to run even with 1000 CPU cores.
It makes sense to run such tasks on other machines/systems without adding too much burden to a graph db, to avoid affecting its quality of service.
2. Full integration with Python makes it more flexible to do data analytics. For example, you can leverage the abilities provided by numpy, pandas and mars (https://github.com/mars-project/mars) alongside GraphScope with zero-copy, thanks to our storage engine vineyard (https://github.com/alibaba/libvineyard) - see the sketch after this list.
3. Besides distributed processing, extra performance can also come from the efficient graph layout in memory and other optimizations at the compiler and runtime level. GraphScope is ~100x faster on Gremlin, and even faster on graph analytical algorithms like PageRank, compared with graph dbs like JanusGraph.
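To make point 2 a bit more concrete, here is a rough sketch of what the Python-side flow can look like. The file paths, labels, builder-style loading calls, and selector strings below are illustrative only and may not match the 0.2.0 API exactly:

```python
import graphscope

# Launch a GraphScope session on the k8s cluster (parameters are illustrative).
sess = graphscope.session(num_workers=2)

# Build a property graph from files the cluster can access
# (paths, labels, and loading calls here are hypothetical).
graph = sess.g()
graph = graph.add_vertices("/path/to/person.csv", label="person")
graph = graph.add_edges("/path/to/knows.csv", label="knows")

# Run a built-in analytical algorithm; the result is a "context" object.
ctx = graphscope.pagerank(graph)

# Pull the result into familiar PyData structures for further analysis.
df = ctx.to_dataframe({"id": "v.id", "rank": "r"})  # pandas DataFrame
ranks = ctx.to_numpy("r")                           # numpy array
print(df.sort_values("rank", ascending=False).head())

sess.close()
```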
Super interesting and timely for us. Curious to plug it into our stack and see how it plays with our GPU visuals! I'm especially optimistic b/c of the pydata bindings.
Can you share a bit more about k8s-less design plan + timeline? (apt/conda/docker/...?)
And any intuition for something like efficient Arrow/Parquet dataframe ingest/export? We're especially focused on Dask interop.
We do have a concrete plan for k8s-less deployment, and we already have an issue [1] to track it. That will be available before the end of March 2021.
To simplify the environment setup process we will release a docker image for end users, but running without docker will be OK as well (it requires building from source).
GraphScope uses vineyard [2] as the storage layer for in-memory graph data structures. Currently, the graph type (aka. ArrowPropertyFragment in GraphScope) uses a set of arrow tables and arrays under the hood.
GraphScope supports a `to_vineyard_dataframe` method on the computation context [3]. We also have a plan for an integration between vineyard and dask (which may be delivered in March as well). At that point, interop with dask would be straightforward.
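Roughly, it looks like this (the selector strings below are illustrative and may differ by version):

```python
# Assuming `ctx` is the context returned by an analytical app (e.g. pagerank),
# the result can be materialized on the client as a pandas DataFrame, or kept
# distributed in vineyard for other vineyard-aware consumers.
df = ctx.to_dataframe({"id": "v.id", "rank": "r"})            # pandas, on the client
vdf = ctx.to_vineyard_dataframe({"id": "v.id", "rank": "r"})  # stays in vineyard
```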
This is great! How does it compare to blazegraph and how does it compare to redisgraph?
As far as I can tell, it's not for doing parallel graph analysis (i.e. GAS) on the GPU, is that right? (It's hard to tell, because I'm also seeing that this is designed to be used for GNNs.) But if I wanted to spawn lots of workers on a single machine, I could do that?
> How does it compare to blazegraph and how does it compare to redisgraph?
GraphScope is not a graph database, and it cannot handle updates to graphs. The graphs are loaded into memory and are "immutable", but queries on GraphScope are typically faster than in graph databases, and GraphScope scales very well - you can launch a larger session from your Python notebook to handle a bigger graph or run a complex algorithm.
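For example, scaling up is mostly a matter of asking for a bigger session (a minimal sketch; other resource-related k8s options exist too, and names vary by version):

```python
import graphscope

# A modest session for a small graph or quick exploration...
sess = graphscope.session(num_workers=2)

# ...and a larger one when the graph or the algorithm outgrows it.
big_sess = graphscope.session(num_workers=16)
```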
> As far as I can tell, it's not for doing parallel graph analysis (i.e. GAS) on the GPU, is that right? (It's hard to tell, because I'm also seeing that this is designed to be used for GNNs.) But if I wanted to spawn lots of workers on a single machine, I could do that?
Internally, we have an early version running with GPU support. However, it does not really seem cost-effective to run algorithms with lots of random memory accesses on sparse data structures like large graphs on GPUs. There could be a ~10x speed-up, but because memory on GPUs is typically much smaller than main memory, a graph that fits on a GPU would not be too slow to process on CPUs either.
For the GNN part, we focus on sampling, which is currently done on CPUs too. After the samples are fed to TensorFlow, TensorFlow can utilize GPUs depending on the settings.
Thank you, this is really helpful. I think clarifying those limitations somewhere early on in the website could save people a lot of time in figuring out whether this is a good choice for them. It sounds like this might not be best for realtime graph analytics, but would be great for very large static graphs.
Good question! I am from the team that built GraphScope.
Neo4j and JanusGraph are graph databases, whereas GraphScope is not.
For end users, using GraphScope feels like using a Python library such as networkx (https://networkx.org/), and you can do graph analytics, Gremlin queries, GNN sampling, etc. in a single place. But unlike NetworkX, which does not support parallelization, the actual processing of (big) graphs in GraphScope is handled in parallel on a distributed k8s cluster.
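As a rough illustration of the "single place" point (the loading calls, paths and method names here are illustrative and may differ from the 0.2.0 API):

```python
import graphscope

sess = graphscope.session(num_workers=4)

# Load a property graph (paths and labels are hypothetical).
graph = sess.g()
graph = graph.add_vertices("/path/to/person.csv", label="person")
graph = graph.add_edges("/path/to/knows.csv", label="knows")

# Graph analytics, networkx-style "call an algorithm on the graph":
ctx = graphscope.pagerank(graph)

# Interactive Gremlin queries on the same in-memory graph:
interactive = sess.gremlin(graph)
print(interactive.execute("g.V().count()").one())

sess.close()
```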
The learning engine in GraphScope (aka. GLE) has a similar programming interface and functionality to Euler. They both support graph neural network training, e.g., GraphSAGE, GCN, GAT, etc. However, there are many differences under the hood.
1. The programming model is different. Euler provides a message-passing style API to define new graph models, while GLE provides a sampling API first, and abstracts a batch of seed nodes or edges (named 'ego') and their receptive fields (multi-hop neighbors) as an "EgoGraph", which can be turned into an "EgoTensor" as features (see the toy sketch after this list).
2. Euler implements operations on graphs as TensorFlow ops, while GLE takes a more flexible design and doesn't couple to a specific machine learning framework. On top of the sampling interface, developers can build their GNN models using TensorFlow, PyTorch, or any other machine learning framework, and even use it for non-GNN tasks.
3. Most importantly, Euler is just for graph learning. In GraphScope, the learning engine can co-work with the analytical engine and with interactive queries. Real-world cases are usually so complicated that the problem is not just a GNN training task; different dedicated systems may be involved for different kinds of workloads, which means users need to move and transform data back and forth between systems. In GraphScope, engines for different purposes share the same graph and live in the same Jupyter notebook, delivering one-stop large-scale graph computation. That is more user-friendly for data scientists.
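To give a feel for point 1, here is a toy, framework-free sketch of the idea behind an "EgoGraph" - a batch of seed nodes ('egos') plus their sampled multi-hop neighborhoods. This only illustrates the concept; it is not GLE's actual API:

```python
import random

def sample_ego_graph(adj, seeds, fanouts):
    """For each seed ('ego'), sample a fixed number of neighbors per hop.

    adj:     dict mapping node -> list of neighbor nodes
    seeds:   list of seed nodes in the current batch
    fanouts: neighbors to sample at each hop, e.g. [10, 5] for 2 hops
    Returns a list of hop layers: [seeds, hop-1 neighbors, hop-2 neighbors, ...].
    """
    layers = [list(seeds)]
    frontier = list(seeds)
    for fanout in fanouts:
        nxt = []
        for v in frontier:
            nbrs = adj.get(v, [])
            nxt.extend(random.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(nxt)
        frontier = nxt
    return layers

# Toy usage: 2-hop receptive fields for a batch of two seeds.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(sample_ego_graph(adj, seeds=[0, 3], fanouts=[2, 1]))
```

The layers of nodes (and their features) collected this way are what get packed into dense "EgoTensor"-style batches for the downstream model.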
The graphs in GraphScope are backed by vineyard (https://github.com/alibaba/libvineyard), which enables multiple specifically optimized runtimes (written in C++, Rust and Python) for different tasks to share the distributed graph data in memory efficiently.
Does this offer similar functionality to GraphLab? It's sad how badly that project has deteriorated. Yet for some algorithms the difference compared to NetworkX is still night and day, e.g. computing PageRank.
I have always wondered what common use cases such systems address. How large are the graphs typically? Are graphs really so big that they don't fit on a single machine?
When you are dealing with payments data (x paying y) and transaction volumes over a few months, those are very large graphs; companies such as NPCI or Alipay deal with these kinds of data.
Some of the use cases for building such graphs are getting node embeddings for fraud prevention, link prediction, etc.
Indeed, link prediction and fraud detection are two of the many things we support internally at Alibaba.
Transaction graphs can be huge. Web graphs are huge. There are also many other huge graphs and use cases, like spotting irregularities/intrusions in network traffic graphs in Alibaba Cloud. Some bioinformatics algorithms also require the ability to process big graphs.
GraphScope addresses computations on extremely large graphs, e.g., the friendship network of all Facebook users, the hyperlink relations between all webpages around the world, and so on.
Typically such a graph may have billions of nodes and tens of billions of edges. Obviously that graph data cannot fit on a single machine to run algorithms like SSSP or PageRank, and a single machine usually doesn't have enough cores for the computation, e.g., an interactive query couldn't return within milliseconds. That's why we need distributed graph processing systems for such big graphs.
Way back when I was working on parallel graph engines, trillions of edges was pretty typical. The kind of data models that produce graphs that large are pretty diverse.
We don't have a benchmark comparing the analytical engine in GraphScope (aka. GAE) with GraphX/Giraph. But we have evaluated the performance of the underlying engine of GAE (libgrape-lite) with the LDBC Graph Analytics Benchmark, and it achieves performance comparable to or better than the state-of-the-art systems [2].
GitHub: https://github.com/alibaba/graphscope (stars are welcome :)
Website: https://graphscope.io
Documentation: https://graphscope.io/docs
Any comments and contributions from the community are welcome!