The thing is this benchmark uses a small data set and no concurrent writes. This means that the data set can be trivially cached in memory on every single instance.
In this benchmark there is theoretically no need for different database nodes to even communicate with each other after the initial data sync. It is embarrassingly parallel.
In fact it is somewhat alarming that going from 2 to 5 nodes doesn't at least double the performance, given that there should be absolutely zero need for coordination between servers.
The writes for Dgraph actually happen in O(log N) time where N = number of subjects for a given predicate; and a background job then syncs the posting lists to disk later. So, that performance wouldn't be any worse than what we observe here.
However, to do writes like you would do them in production would require you to sync a mutation log to disk, before applying it in memory and returning an OK. This particular operation bounded by disk I/O would then become the bottleneck, and not help with doing benchmarking for the database. Which is why we didn't make writes part of this particular benchmark.
Dgraph does concurrent writes very well. To determine that, you can run Dgraph batch loader. It can load up the entire 21M triples from the blog post, in 20mins on an n1-standard GCE instance.
https://github.com/dgraph-io/dgraph#loader-performance
When jumping from 2 to 5 nodes, we did want to ensure that the nodes have to communicate to run the queries, as they would in a real-world scenario; which is why the performance wouldn't just double.
I think you're putting a lot of emphasis on the size of the data. While that is surely important, the size itself isn't the only thing which proves scalability. It's about the concurrency, throughput and latency -- and determining that they do improve when you add more hardware power to the cluster.
Thanks for the info. I guess my follow up question is why doesn't Dgraph end up caching all the data on every node when it has the memory to do so? Is this a planned feature that you just haven't implemented yet?
It depends upon the memory available. We have a background goroutine which checks how much memory we're using and evicts posting lists if we exceed a predefined usage. If we have a lot of memory then we just won't evict the posting lists, which means they'll be in RAM.
All this "caching" happens close to the disk level, just above RocksDB. We don't cache any results -- to me caching is cheating -- and we want to build something truly low latency. A user can easily add caching if they want to decrease the load on Dgraph, but it's not something I think we should be doing at the db level.
While some data is cached at an instance level by rocksDB, it wouldn't be cached across instances. So, there is a need for coordination/communication among the servers for every request.
They are orthogonal to a point :) Thats why people want to see a real world dataset and queries. If it's performance is 100 times less than an alternative I don't care if it can scale out. Also if it can scale out to 10 nodes it tell's me little about how it will perform with 100 nodes. If It takes 30 sec for complex query on a production dataset I really don't care if I can run 1000 30 sec. queries in parallel if I need that query to run in 50ms on avg and to never run more than 200ms
Thanks for your comments! I think people feel that the data set wasn't big enough to prove scalability. While I disagree, I think this is a great opportunity as well -- to hear from people. We can actually bring up the entire 1.9B triples from Freebase and run some queries on them -- get some more real world numbers. So, tell us what to run here:
You should run the whole of Freebase, or really the biggest thing you can get your hands on. Hope you do. I'd love something that can comfortably handle billion order graphs!
> Freebase is an online collection of structured data which includes contributions from many sources including individual and user-generated contributions. Currently, it has 1.9 Billion RDF N-Triples worth 250GB of uncompressed data. On top of that, this dataset is over 95% accurate with a complex and rich real world schema. It is an ideal data set to test the performance of Dgraph. We decided not to use the entire data set as it wasn’t necessary for our goal here.
I don't follow. Wasn't using as much data as possible exactly what scaling is about?
Not exactly. Given enough time and a lot of hard disk, you can put the entire 1.9B triples into any database -- that doesn't prove that it scales. Also, if the database uses a single global mutex lock for all reads and writes, again it's not a scalable architecture.
Scaling is about throughput and latency. This post is trying to determine whether adding more machine power actually lets the db perform better.
If you are in graph benchmarking you should really have a good look at http://ldbcouncil.org/. Especially the fact that this is a read only benchmark. Plus your dataset is tiny. So all you showed is that if there is no contention and all data fits on the local machine there is no communication overhead and near perfect scaling. That is not a benchmark that is hard to beat, at all.
OK if you load this dataset into Postgres it will be about 10-20X this speed for the queries performed in this test. What is the point of benchmarking a graph database on the tiny dataset and using queries that can be run easily on RDBMS? Isn't whole point of Graph Databases that they can perform well on queries that would not run well on RDBMS and handle datasets that RDBMS would choke on?
The last one, which involves Kevin Bacon, and returns 2.4 million entities is an interesting one, that might choke a lot of application layer powered graph datastores running on top of some RDBMS.
Not the entire point. Relational databases have schemas that are not well suited to graph structures, and SQL is also a barrier. A better question is whether you need to implement your own storage backend or whether something like Postgres would suffice.
OK then what type of graph would be hard to store in RDBMS?
There are obviously queries that one would want to run on graph data that RDBMS will not run efficiently (but the queries in this test are actually trivial for RDBMS). In what way is SQL a barrier compared to learning query language used by some graph database that is not applicable to anything else you do?
"A better question is whether you need to implement your own storage backend or whether something like Postgres would suffice". You could just use a more established Graph Database that actually performs well on complex graphs?
Here's a thing that most RDBMS don't do well but graph DBs do: find me nodes connected to at least X nodes of type T where those nodes have attribute A and are also connected to node N. Now duplicate that filter a potentially arbitrary number of times.
The issue here is that the number of joins explodes, and depending on your schema you may be doing lots of self joins.
An additional complication is if your dataset is too big for a single host. In Postures you shard, but that is manual and has significant cost.
In DGraph you lose some performance but (hopefully) if you know something about your queries you can optimize the distribution function to minimize cross node queries. This is a pretty hard problem to generalize, but even a partial solution is good.
Yours is a great example of what graph DBs should be good at, but many self-styled graph DBs out there at the moment are not. Graph DB means to me only two things: index-free edge traversal and scalable built-in graph operations. While these would seem to be necessary and sufficient criteria to distinguish a graph DB, some instead use only the criteria of the GGP and equivocate graph DBs with schema-free DBs, which should be orthogonal axes of database features.
OrientDB would indeed be an example of something that's only a "graph database" because marketing said it should be.
Last time I asked how to import an actual graph into OrientDB, a marketing person of theirs pointed me at a Java API for writing extensions to their code.
Dgraph is aimed at minimizing network calls. In fact, the network calls are directly proportional to the complexity of the query, not the number of results. Which means the queries would maintain their latency even as you add more machines to the cluster.
Right, but unless you know the queries in advance there is always the risk of pathological queries that thrash the network.
Naive example: in the movie dataset, if you partition by node type and have actors on one server and films on another a query like "find me all films with actors names starting with M who also starred in films with actors starting with N" will perform horribly, but if you partition by actor and film name it will be OK.
Titan (and I think most distributed Graph DBs) use pluggable distribution strategies and default to random to try to combat this problem.
I agree with you that there are queries where Graph DBs perform better than RDBMS (and you provide a good example of such query) so it would be really cool to see appropriate benchmarks. Also would be nice to see benchmarks vs more established graph dbs.
Yes. But I think that this is showing the scalability of the network and query stacks in DGraph, not other kinds of scalability.
Both are important. As someone who sometimes has very large graphs, I'm more interested in this benchmark than absolute performance: I'm happy to take a performance hit if it means I can scale out.
It doesn't. That paper proposes a data layer called Grail on top of an RDBMS, which is exactly what I described in my comment. An RDBMS may perform well at storing graph-like structures, but is not ergonomically suited to be used directly by humans for that purpose.
Most graph databases are not optimized for speed. Specific indexes for optimization are missing, while they have been researched by computer scientists. As you said, using a random RDBMS will probably be faster.
Having that said, currently the big benefit of graph databases is ease of use.
As for the authors of Dgraph, keep it up! Your tech looks promising and I'm excited to see more and more Open Source graph databases in the market. Would love to compare notes, shoot me a message.
In this benchmark there is theoretically no need for different database nodes to even communicate with each other after the initial data sync. It is embarrassingly parallel.
In fact it is somewhat alarming that going from 2 to 5 nodes doesn't at least double the performance, given that there should be absolutely zero need for coordination between servers.