In this benchmark there is theoretically no need for different database nodes to even communicate with each other after the initial data sync. It is embarrassingly parallel.
In fact it is somewhat alarming that going from 2 to 5 nodes doesn't at least double the performance, given that there should be absolutely zero need for coordination between servers.
However, doing writes the way you would in production requires syncing a mutation log to disk before applying the mutation in memory and returning an OK. That disk-I/O-bound operation would become the bottleneck and wouldn't tell us much about the database itself, which is why we left writes out of this particular benchmark.
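For context, that production write path can be sketched as a minimal write-ahead log: fsync the mutation to disk first, then apply it in memory and acknowledge. This is an illustrative toy, not Dgraph's actual implementation; the class and file names are made up.

```python
import os

class Store:
    """Toy write path: append mutation to a log, fsync, then apply in memory."""
    def __init__(self, log_path):
        self.log = open(log_path, "ab")
        self.data = {}

    def write(self, key, value):
        # Durability first: the mutation must hit disk before we acknowledge.
        record = f"{key}={value}\n".encode()
        self.log.write(record)
        self.log.flush()
        os.fsync(self.log.fileno())  # this fsync is the disk-I/O bottleneck
        # Only then apply in memory and return an OK.
        self.data[key] = value
        return "OK"

store = Store("mutations.log")
print(store.write("name", "Kevin Bacon"))  # prints "OK"
```

Benchmarking this path would largely measure fsync latency rather than the database's query engine.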
Dgraph handles concurrent writes very well. To see that, you can run the Dgraph batch loader: it loads the entire 21M triples from the blog post in 20 minutes on an n1-standard GCE instance.
When jumping from 2 to 5 nodes, we did want to ensure that the nodes have to communicate to run the queries, as they would in a real-world scenario, which is why the performance wouldn't just double.
I think you're putting a lot of emphasis on the size of the data. While that is surely important, the size itself isn't the only thing which proves scalability. It's about the concurrency, throughput and latency -- and determining that they do improve when you add more hardware power to the cluster.
All this "caching" happens close to the disk level, just above RocksDB. We don't cache any results -- to me caching is cheating -- and we want to build something truly low latency. A user can easily add caching if they want to decrease the load on Dgraph, but it's not something I think we should be doing at the db level.
They are close to orthogonal concerns. For example, I'll take lower absolute performance if I can scale out.
Performance itself can be optimized separately, and usually that optimization also applies to the scaled-out version.
Doing about 30M ops/sec on a Macbook Air.
As for the authors of Dgraph, keep it up! Your tech looks promising and I'm excited to see more and more Open Source graph databases in the market. Would love to compare notes, shoot me a message.
I don't follow. Wasn't using as much data as possible exactly what scaling is about?
Scaling is about throughput and latency. This post is trying to determine whether adding more machine power actually lets the db perform better.
Under reasonable constraints. Throughput and latency on a cacheable dataset are moot for big data.
If you're interested, here's a list of complex queries that we've run:
The last one, which involves Kevin Bacon and returns 2.4 million entities, is an interesting one that might choke a lot of application-layer graph datastores running on top of some RDBMS.
The issue here is that the number of joins explodes, and depending on your schema you may be doing lots of self-joins.
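To make the self-join point concrete, here's a toy sketch using sqlite3 with a made-up `acted_in` edge table: each traversal hop over the graph becomes another self-join on the same table, so an N-hop query joins the table N+1 times.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE acted_in (actor TEXT, film TEXT)")
con.executemany("INSERT INTO acted_in VALUES (?, ?)", [
    ("Kevin Bacon", "Apollo 13"),
    ("Tom Hanks", "Apollo 13"),
    ("Tom Hanks", "Cast Away"),
    ("Helen Hunt", "Cast Away"),
])

# A two-hop traversal (actors reachable from Kevin Bacon via two shared
# films) needs three self-joins on the same edge table.
rows = con.execute("""
    SELECT DISTINCT c2.actor
    FROM acted_in a                           -- Bacon's films
    JOIN acted_in b  ON b.film  = a.film      -- hop 1: his co-stars
    JOIN acted_in c1 ON c1.actor = b.actor    -- their films
    JOIN acted_in c2 ON c2.film  = c1.film    -- hop 2: actors two hops out
    WHERE a.actor = 'Kevin Bacon'
""").fetchall()
print(rows)  # includes ('Helen Hunt',), who never co-starred with Bacon directly
```

Each additional hop multiplies the intermediate result set, which is exactly where RDBMS-backed graph layers tend to choke.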
An additional complication is if your dataset is too big for a single host. In Postgres you shard, but that is manual and has significant cost.
In Dgraph you lose some performance, but (hopefully), if you know something about your queries, you can optimize the distribution function to minimize cross-node queries. This is a pretty hard problem to generalize, but even a partial solution is good.
I'm not aware of any schema-free databases marketing themselves as graph DBs. I'm sure there are some, though.
There is a distinction between graph databases and graph processing frameworks (GraphX etc.), but I don't think that's what you mean.
Last time I asked how to import an actual graph into OrientDB, a marketing person of theirs pointed me at a Java API for writing extensions to their code.
Naive example: in the movie dataset, if you partition by node type and put actors on one server and films on another, a query like "find me all films with actors whose names start with M who also starred in films with actors starting with N" will perform horribly; but if you partition by actor and film name, it will be OK.
Titan (and I think most distributed Graph DBs) use pluggable distribution strategies and default to random to try to combat this problem.
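The trade-off between the two placements can be sketched with two hypothetical distribution functions; this is purely illustrative and not how Titan or Dgraph actually place data. A hash partitioner spreads load evenly but scatters related vertices, while a name-aware one co-locates them at the risk of hot spots:

```python
import hashlib

NUM_NODES = 3

def random_partition(vertex_name: str) -> int:
    """Hash-based (default-style) placement: even load, no locality."""
    h = hashlib.md5(vertex_name.encode()).hexdigest()
    return int(h, 16) % NUM_NODES

def name_partition(vertex_name: str) -> int:
    """Attribute-aware placement: co-locate by first letter, so a query
    over 'actors starting with M' touches a single node."""
    return (ord(vertex_name[0].upper()) - ord("A")) % NUM_NODES

actors = ["Meryl Streep", "Michael Caine", "Nicole Kidman"]
print({a: random_partition(a) for a in actors})  # typically scattered
print({a: name_partition(a) for a in actors})    # both M-names on one node
```

With `name_partition`, the "starts with M" half of the query stays local; with `random_partition`, it fans out to every node.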
Both are important. As someone who sometimes has very large graphs, I'm more interested in this benchmark than absolute performance: I'm happy to take a performance hit if it means I can scale out.
The latest research disagrees with your statement:
SQLGraph: An Efficient Relational-Based Property Graph Store; SIGMOD 2015.
Previous hackernews discussion: https://news.ycombinator.com/item?id=11101013
That said, currently the big benefit of graph databases is ease of use.