DGraph – Scalable, Distributed, Low-Latency, High-Throughput Graph Database (dgraph.io)
130 points by pythonist on Mar 20, 2016 | hide | past | favorite | 47 comments

Wikidata did a comprehensive analysis of Graph DBs [0], and settled on BlazeGraph with TitanDB coming a close second.

Notably, there are quite a few omissions, DGraph and Cayley [1] being two of them. Interestingly, both are developed by Googlers. Cayley is used by Kythe.io [2], a Google project that kind of competes with srclib [3] by SourceGraph.

Cayley has a native JavaScript interface, which makes it an interesting choice for Node.js-based apps.

At work, we settled on TitanDB, primarily because it supports DynamoDB/Cassandra for storage and ElasticSearch. Most of the graph DBs rely on some storage engine or the other underneath: Cayley supports LevelDB, for instance, whereas TitanDB supports BerkeleyDB apart from the aforementioned DynamoDB and Cassandra.

[0] https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXik...

[1] https://github.com/google/cayley

[2] https://kythe.io

[3] https://srclib.org

Neither Cayley nor TitanDB is a native graph database. In fact, Cayley supports many storage engines, including MongoDB etc. This is because both are graph layers, and the data maintenance is done via the real database underneath. This has the benefit of being easier to build, but it doesn't perform as well when it comes to query latency.

DGraph, OTOH, is a native graph database. We do use RocksDB, but the data distribution and maintenance is done by us. It's optimized to minimize the number of network calls, keeping them linear in the complexity of the query, not the number of results. This is of incredible value when you're running real-time queries and serving results directly to the end user. Query latency therefore isn't much affected by a high fan-out of intermediate results and should remain low, while providing high throughput.

In fact, the entire HN traffic is being served by one GCE n1-standard-4 instance right now, using all 4 cores really well :-).


I have questions, I'd be glad if you could answer them:

1. DGraph v0.2 isn't production ready?

2. DGraph API doc is missing?

3a. Does DGraph support bulk loading of RDFs only? No support for graphSON?

3b. Does DGraph support incremental loading of a Graph? Or just bulk loads?

4. Is 'distribution' achieved by maintaining a copy of entire Graph data across all instances? Or is data distributed too?

5. What's with the UID generation? Is it to establish a partitioning scheme?

1. I wouldn't term anything up until 1.0 as production ready.

2. API doc? Basically, there's only one endpoint, called /query. All the queries just go through that. There's a wiki page with some test queries to get you started.

3a. Yes, with the 2 phase loader. Only RDFs are supported right now, nothing else. https://github.com/dgraph-io/dgraph#distributed-bulk-data-lo...

3b. Yes, with mutations. https://github.com/dgraph-io/dgraph#queries-and-mutations

4. It's truly distributed. The data is actually sharded, with each shard containing part of the data and served by a separate instance. The bulk loader instructions generate 3 instances.

5. To keep the queries, data storage and data transfer efficient, we assign a uint64 ID to all entities. UID assignment is that operation.
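To make answer 5 concrete, here is a minimal sketch of what "assigning a uint64 ID to every entity" could look like. The `UidAssigner` class and its sequential counter are assumptions for illustration only; the thread does not say how DGraph actually generates UIDs.

```python
class UidAssigner:
    """Hypothetical helper: maps external entity names to compact uint64 IDs."""

    MAX_UINT64 = 2**64 - 1

    def __init__(self):
        self._uids = {}      # entity name -> uid
        self._next_uid = 1   # assumption: 0 is reserved to mean "no entity"

    def assign(self, name: str) -> int:
        # Re-use the existing UID if this entity was seen before.
        if name in self._uids:
            return self._uids[name]
        if self._next_uid > self.MAX_UINT64:
            raise OverflowError("uint64 UID space exhausted")
        uid = self._next_uid
        self._uids[name] = uid
        self._next_uid += 1
        return uid


assigner = UidAssigner()
alice = assigner.assign("alice")
bob = assigner.assign("bob")
alice_again = assigner.assign("alice")  # same entity, same UID
```

Once every entity is a fixed-width integer, edges can be stored and shipped between servers as plain lists of integers rather than strings, which is presumably the efficiency gain the answer refers to.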

I take it that it's a self-funded project? Good luck, and I hope you hit production soon. Do you have a roadmap for us to keep track of?

A few Qs abt the storage layer:

DGraph supports replication too, in case a node fails...?

Given your description, I take that you've implemented a custom data distribution protocol on top of rocksdb? Do you have plans to extract this 'distributed rocksdb' out to its own implementation? How would something like this compare to actordb.com and/or rqlite?

Thx again.

We have funding now, which will be made public soon. So, we have enough to keep us going for a bit, and can focus solely on the engineering challenges.

DGraph would support high availability, which means all our shards would be replicated 3x across servers, so in case one server fails, the shards would still be available for querying and mutations. In addition, shard movements to other servers would happen so the replication factor remains the same. We aim to achieve this using (Etcd's) RAFT protocol, by version 0.4.

RocksDB is just a medium for us to have something between the database and disk. All the data arrangement, handling, movement etc. happens above RocksDB. So, no there's no "distributed rocksdb" here.

For me the demo doesn't work, CORS violation while trying to access http://dgraph.xyz/query. (EDIT: manually accessing it sends me in a cloudflare(?) captcha, that might mess with the query?)

Cloudflare does that when you are hitting it from a suspicious VPN IP address. There may be other reasons why it doesn't like your IP.

hmm.. we have the CORS allowed, so this shouldn't be the issue. It's possibly the cloudflare captcha issue.

Re Wikidata, it's worth noting that they discarded TitanDB because it was assumed dead after the Datastax acquisition of Aurelius, which turned out not to be the case after all.

Also, the choice of BlazeGraph whiffed of politics. Wikidata (or at least one of the primary developers) seems to have been courted by the BlazeGraph people, to the point where Wikidata prematurely abandoned their research spreadsheet. This was at a time when there was hardly any public info/documentation about BlazeGraph, and its pedigree seemed completely unknown/untested.

The roadmap for Titan is unclear to say the least. No communication, then a release drops, then nothing again. Hmm.

I agree with your point about the Wikidata decision process. Some links about this if anyone is interested: https://news.ycombinator.com/item?id=11201943

BlazeGraph seems like a reasonable product (now). But I don't like seeing this "Wikidata evaluated products and chose Blazegraph" thing - they started an evaluation process.

AFAICT, Titan is effectively dead, being replaced by Datastax Graph. Likewise, Blazegraph is about graph computation (think Spark's GraphX, OLAP) rather than simple reads/writes (OLTP). Systems can try to straddle both, but the benchmarks should be different.

Quickly looking at this project suggests it's on the OLTP side, not OLAP, so apples/oranges. Claiming more performance than a GPU compute engine would need some real benchmarking ;-)

"Blazegraph is about graph computation (think Spark's GraphX, OLAP)"

This is definitely incorrect. Wikidata uses it exclusively for read/write queries.

The (GPU accelerated) graph processing in BlazeGraph is new, and it's pretty unclear to me if it actually does the (GraphX) style processing. The examples[1] are all query-based.

It does look interesting though!

[1] https://www.blazegraph.com/product/gpu-accelerated/

Blazegraph has multiple backends. Their GPU compute engine vs. their classic DB engine are different ones. And yes, I doubt they use the same approach as GraphX, considering they've published how they don't (MapGraph) and GPUs require way more work. Lumping all these systems together doesn't make sense: reads aren't writes, and there's a whole world of graph computations. Benchmarks matter.

That said, I'd be skeptical of a GPU system for being great at writes, and I'd be skeptical of a non-GPU one computing better than one on GPUs.

There's quite a lot of activity on http://github.com/thinkaurelius/titan for a dead project. Admittedly, the issues are piling up, but the code seems to be actively worked on.

Hi, I'm one of the creators of Sourcegraph and srclib. Just wanted to clarify that srclib doesn't compete with Kythe. They tackle different issues and we're even considering using Kythe as a low-level language library.

This is really exciting. I've been hoping for a robust, distributed open source Graph database ever since I first played with Freebase (which clearly had some amazing secret sauce, long-since purchased by Google). The engineer behind DGraph has worked on Google Knowledge Graph, the spiritual successor to Freebase, and obviously understands the space incredibly well: https://twitter.com/manishrjain

This looks excellent!

Some questions because I need something like this:

What does "distributed" mean in this context? Can the graph size be larger than the storage on a single node? If so, how is it partitioned (I think Titan was randomly partitioned)?

Has any thought been given to in-graph processing (PageRank etc)?

The data gets sharded, and can be served by different servers; all talking to each other to respond to queries. This is how v0.2 works, which is what the demo is running right now.

A typical RDF triple is (subject, predicate, object). We shard the data based on predicates, so all the RDFs corresponding to one predicate are on one server. This allows us to find, say, the list of friends of X really quickly, with a single lookup.
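The predicate-based sharding described here can be sketched roughly as follows. This is an illustration under stated assumptions, not DGraph's implementation: the `Cluster` class, the CRC32-modulo placement rule, and the cluster size of 3 are all made up for the example.

```python
import zlib
from collections import defaultdict

NUM_SERVERS = 3  # assumed cluster size


def server_for(predicate: str) -> int:
    # Assumption: a stable hash of the predicate picks the server,
    # so every node agrees on where a predicate lives.
    return zlib.crc32(predicate.encode("utf-8")) % NUM_SERVERS


class Cluster:
    """Hypothetical in-memory stand-in for a set of servers."""

    def __init__(self):
        # One dict per server: predicate -> subject -> list of objects.
        self.servers = [
            defaultdict(lambda: defaultdict(list)) for _ in range(NUM_SERVERS)
        ]

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.servers[server_for(predicate)][predicate][subject].append(obj)

    def lookup(self, subject: str, predicate: str) -> list:
        # A single server holds everything for this predicate,
        # so one lookup answers "friends of X".
        shard = self.servers[server_for(predicate)]
        return list(shard.get(predicate, {}).get(subject, []))


cluster = Cluster()
cluster.add("X", "friend", "A")
cluster.add("X", "friend", "B")
cluster.add("X", "name", "Xavier")
friends_of_x = cluster.lookup("X", "friend")
```

The trade-off is visible in the sketch: a query touching k predicates contacts at most k servers regardless of how many results come back, but a very popular predicate concentrates its data on one server.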

I think PageRank should be relatively straightforward for DGraph, provided we have edges in the right direction. We don't automatically generate a reverse edge, it has to be provided, if needed for queries.

"We don't automatically generate a reverse edge"

Wait, then you don't actually have relationships as first class citizens? Does every "thing" know what is connected to it? I mean, where do you draw the line between graph databases and an object database with one way links?

The problem with automatically generating reverse edges is that it causes data explosions and duplications -- for e.g., if you're adding facts like X -- IS_A --> Human; then automatically generating the reverse would cause a Human -- Reverse(IS_A) --> X; which would list all the humans on the planet. Whichever machine serves this list would immediately run into memory issues, not to mention, any query using such a relationship is going to be very slow.

DGraph uses a type schema to understand the relationships an entity can have. So, you can have an entity of type A, where A has relationships R1, R2, and R3, and you can then deduce that the entity has relationships R1, R2, and R3. Of course, each entity can be of multiple types (e.g. Tom Hanks is an actor and a director).
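A small sketch of the schema-based deduction described above. The type names, relationship names, and the `relationships_of` helper are hypothetical; they only illustrate looking up an entity's possible relationships from its types instead of scanning the data.

```python
# Assumed schema: type -> relationships that entities of this type can have.
TYPE_SCHEMA = {
    "Actor": {"acted_in", "name"},
    "Director": {"directed", "name"},
}

# Assumed type assignments; an entity may have several types.
ENTITY_TYPES = {
    "Tom Hanks": ["Actor", "Director"],
}


def relationships_of(entity: str) -> list:
    """Union of the relationships of all types the entity belongs to."""
    rels = set()
    for type_name in ENTITY_TYPES.get(entity, []):
        rels |= TYPE_SCHEMA[type_name]
    return sorted(rels)


tom_hanks_rels = relationships_of("Tom Hanks")
```

With the relationship names known up front, only the servers that hold those predicates need to be contacted.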

This approach is very scalable, because it avoids unnecessary scans over the distributed database to find all the relationships an entity can have. Instead, it uses a schema to deduce such information, and then hits the right servers to get the data.

Wait, I don't understand. If I create a relationship "User 1" -[:LIKES]->"Chocolate Ice Cream". Does the "Chocolate Ice Cream" know which users liked it or not?

My takeaway is that "Chocolate Ice Cream" would not know about "User 1" unless an explicit relationship "Chocolate Ice Cream" -[:liked_by]->"User 1" is created
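That reading can be sketched with a toy edge store keyed by (subject, predicate). The store and helper names are assumptions for illustration, not DGraph's API; the point is only that a reverse lookup returns nothing until the reverse edge is written explicitly.

```python
from collections import defaultdict

# Assumed toy store: (subject, predicate) -> list of objects.
edges = defaultdict(list)


def add_edge(subject: str, predicate: str, obj: str) -> None:
    # Only the forward direction is stored; no reverse edge is generated.
    edges[(subject, predicate)].append(obj)


def neighbors(subject: str, predicate: str) -> list:
    return list(edges.get((subject, predicate), []))


add_edge("User 1", "LIKES", "Chocolate Ice Cream")
before = neighbors("Chocolate Ice Cream", "liked_by")  # nothing stored yet

# The reverse direction exists only once it is provided explicitly.
add_edge("Chocolate Ice Cream", "liked_by", "User 1")
after = neighbors("Chocolate Ice Cream", "liked_by")
```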

I'm guessing not.

This isn't unusual - it's somewhat analogous to the derived (inferred) relationship thing in RDF-style graph databases (eg, Sydney is-in Australia, Australia is-in Oceania, therefore Sydney is-in Oceania).

This sounds like a great idea, but in practice doesn't always work so well. You end up with an explosion of relations, some of which are completely useless.

I'm not opposed to this decision being a choice.

Since you mentioned RDF - are you planning SPARQL?

Not a huge fan of SPARQL, but OTOH I don't know GraphQL at all so I can't comment sensibly on a comparison.

Up until v1.0, we're only looking at GraphQL. After that, we'll consider adding other languages depending upon the demand.

All the graph database traversals I've seen are fairly simple (Friend of a friend, Movies starring X).

Are they a good choice for turn-by-turn navigation, and answering questions (given a traffic dataset) like: "What has been the quickest route between A and P, departing at 8am on a Monday morning?"

We've examples of some pretty deep Graph queries (8 levels), for e.g. the last query here: https://github.com/dgraph-io/dgraph/wiki/Test-Queries

Finding the shortest route between A and B is on the roadmap. This would mean adding a weight to the edge. Navigation adds yet another interesting factor, which means we need to store multiple weights per edge, depending upon time of the day, day of the week, or day of the year etc. This would be an interesting challenge to solve; probably by v1.0.
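As a generic illustration of "multiple weights per edge, depending upon time of the day", here is a sketch of Dijkstra's algorithm with time-dependent edge weights. Nothing here is DGraph functionality: the graph, the peak/off-peak buckets, and the travel times are invented, and the approach assumes that departing later never lets you arrive earlier (the toy data may not respect this right at bucket boundaries).

```python
import heapq

MINUTES_PER_DAY = 24 * 60

# Assumed toy road network: node -> list of (neighbor, {bucket: travel minutes}).
GRAPH = {
    "A": [("B", {"peak": 30, "offpeak": 10}), ("C", {"peak": 15, "offpeak": 15})],
    "B": [("P", {"peak": 10, "offpeak": 10})],
    "C": [("P", {"peak": 20, "offpeak": 20})],
    "P": [],
}


def bucket(minute_of_day: int) -> str:
    # Assumption: rush hour runs from 07:00 to 10:00.
    return "peak" if 7 * 60 <= minute_of_day < 10 * 60 else "offpeak"


def quickest(graph, src, dst, depart_minute):
    """Return the travel time in minutes from src to dst, or None if unreachable."""
    best = {src: depart_minute}          # earliest known arrival per node
    heap = [(depart_minute, src)]
    while heap:
        now, node = heapq.heappop(heap)
        if node == dst:
            return now - depart_minute
        if now > best.get(node, float("inf")):
            continue                     # stale heap entry
        for neighbor, weights in graph[node]:
            # The edge weight depends on when we enter the edge.
            arrive = now + weights[bucket(now % MINUTES_PER_DAY)]
            if arrive < best.get(neighbor, float("inf")):
                best[neighbor] = arrive
                heapq.heappush(heap, (arrive, neighbor))
    return None


morning = quickest(GRAPH, "A", "P", 8 * 60)   # departs 08:00, during peak
midday = quickest(GRAPH, "A", "P", 12 * 60)   # departs 12:00, off-peak
```

In this toy data the best route changes with the departure time: via C in the morning, via B at midday. A real dataset would add day-of-week or day-of-year buckets, as the comment suggests.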

Your landing page is missing any kind of "evidence" that it is scalable, low-latency or high-throughput.

Also, if you are sharding on predicate you will end up in big trouble. Predicates in most RDF datasets are not at all evenly distributed, tending more towards extreme value distributions. e.g. in UniProt the most common predicate has 2,419,000,171 occurrences, the least 1!

Also if you are going to benchmark can I suggest the rather good LDBC ones[1]. Even if for marketing reasons you don't want them public they are good to show where you can improve.


Data is sharded across machines by predicate. So, in this case, the 2.4 billion occurrences only lead to 2.4G * 8 bytes (uint64 ID) ~ 20GB of data for us, on one machine, which is pretty manageable. Furthermore, each predicate could be further sharded to fit on multiple machines; such would be the case with, say, the friend predicate in Facebook.
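The back-of-the-envelope estimate can be checked directly. This sketch counts only one 8-byte UID per occurrence, as the estimate above does; keys, subject IDs, and storage overhead are deliberately ignored, which is an assumption of the example.

```python
# Occurrences of the most common predicate in UniProt, as quoted in the thread.
occurrences = 2_419_000_171
BYTES_PER_UID = 8  # one uint64 per stored object

total_bytes = occurrences * BYTES_PER_UID
size_gb = total_bytes / 10**9    # decimal gigabytes
size_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_bytes:,} bytes = {size_gb:.2f} GB = {size_gib:.2f} GiB")
```

That comes to roughly 19.4 GB (about 18 GiB), consistent with the "~ 20GB" figure.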

It's hard to "prove" on a landing page, without going into design details, that you truly are those things. We do have a demo, with 21 million RDFs from real world data from Freebase, so you could play with the database, and get a feel for it.

We'll look into LDBC. Thanks for the pointer.

Is a Graph DB suitable for use cases like products/homes/cars etc where users mostly do "and" queries to narrow down the results set? If so, is it faster than traditional SQL DB?

Graph DBs are great for intersection queries (AND queries). In fact, DGraph is designed to do those really fast; and supporting that via GraphQL is in our roadmap.


In general Graph DBs are great when you have many "kinds" of things, which would require many many tables in traditional databases, and lots of interlinking. Those scenarios are ideal for Graphs, because many different kinds of things can be interrelated to each other easily, and be queried seamlessly. In other words, the schema for graphs is very fluid.

Maybe this exists already and I just haven't found it but I would love to find a tool that visually/graphically helps me visualize and test build data structures for a graph database. For example modeling the relationship from a host to VM to OS to app to network etc.

Check out the Assimilation Project [1]. It facilitates automatic infrastructure discovery and uses the Neo4j graph database for visualization / querying. More info about the project here [2].

[1]: http://assimilationsystems.com/

[2]: http://linux-ha.org/source-doc/assimilation/html/index.html

Our users (graphistry.com) do this with our visualizer. You don't really need a heavy-duty graph database for that part, stuff like SQL or even lighter weight things are more normal.

It gets fun for us when we help visualize a full enterprise (hundreds of thousands of users, devices, apps..), and even more so when event data enters the picture. We do the former with our GPU tech, and push the latter to generic big data systems like Spark or Splunk that should already be in place before this becomes worthwhile.

My company offers a tool to search, edit and visualize graph data: http://linkurio.us/

You should try it out :)

I actually have you guys open on a tab somewhere. I'll check it out at some point.

This looks very promising!

Are you planning to add filtering, supported by indexes? Seems a bit useless for production use if you can't filter a query by predicate, or even sort/limit. You could layer something like Elasticsearch on top of it, but then you lose all the graph support.

Any thoughts on enforcing schemas?

Yes, here's the roadmap: https://github.com/dgraph-io/dgraph/issues/1

We do filter by predicates, and have plans to do sort and limit, including count. DGraph is designed to accommodate all these use cases really well.

The idea is that you shouldn't need much on top of DGraph, because the query latency should be low; maybe Memcached. But, in the graph world, a single edge change can affect many queries, and so we've designed DGraph to provide low latency for arbitrarily complex queries.

The schemas would be enforced via GraphQL, i.e. the query language, not via the storage mechanism. This allows you to change the definition of types of entities, without changing the corresponding data. This is incredible for both backwards compatibility, and quick iterations.

Thanks. I didn't see any examples of predicates in the GraphQL, other than basic graph edge traversal. Do you have any?

Does DGraph use indexes to back the predicates? If so, do indexes have to be pre-declared? To explain, we layer ElasticSearch on top of our current document store because we don't want to manage an "index schema" manually; we have tons of different apps on top of the same document store, and they have all sorts of ad-hoc queries where it would simply be easier to (like ES does) index everything rather than require the app to declare a schema.

Not sure I understand your schema explanation. We're talking about schemas for validating document data, yes?

We don't yet have intersection, which is what you're looking for. But, it's in the roadmap, and as I mentioned, the design is built around efficient intersections (all posting lists are sorted lists of uint64s).

DGraph supports arbitrarily complex queries, so we try not to optimize for any particular query. Hence, you don't need to specify indices. We automatically generate posting lists, which store all the objects for a given (subject, predicate). E.g., all the friends of X are stored in a single value, which is a sorted list of UIDs (uint64).
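To show why sorted posting lists make intersections cheap, here is a minimal sketch of the standard two-pointer merge over two sorted UID lists. The UIDs and the `posting_lists` layout are invented for the example, and this is the textbook technique rather than DGraph's own code.

```python
def intersect_sorted(a: list, b: list) -> list:
    """Intersect two ascending lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # a[i] cannot appear in b anymore
        else:
            j += 1   # b[j] cannot appear in a anymore
    return out


# Assumed posting lists: (subject UID, predicate) -> sorted object UIDs.
posting_lists = {
    (1, "friend"): [3, 5, 8, 13],
    (2, "friend"): [5, 6, 8, 21],
}

# "Friends that entity 1 and entity 2 have in common."
common = intersect_sorted(posting_lists[(1, "friend")], posting_lists[(2, "friend")])
```

Because both inputs are already sorted, no hashing or extra index is needed, and the result comes out sorted as well, so it can feed straight into a further intersection.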

When we do sorting, we'll provide a way to specify which predicates to sort by. That'd be something that the user would have to provide; that we can't automatically deduce. For e.g., movie sorting by year of release etc.

Re: schema, have a look at GraphQL type system: https://facebook.github.io/graphql/#sec-Type-System This explains how types can be specified, and enforced while inputting data; and retrieving data.

I think I see what you mean — sorry, I don't really think in triplets and to me a predicate is usually an algebraic term, not a data model term.

Without value indexes, how would your data model optimize for queries like (year >= 1975) or (title like "%thing%")?

If I understand you correctly, the client would pass a schema along with each mutation? It's something we considered and quickly discarded for our own data layer, because it's not a good data model. Every client now has to know how to express its current schema, which puts too much of a burden on the client.

For example, we have a lot of microservices that all share the same data and schema, and to avoid boilerplate we'd have to build client glue (in 3 different languages) to let them share a schema.

So, for filters like year >= 1975, that'd be part of our sorting feature push, which would also help with this sort of filter. Also, string matching / name search, is in our roadmap as well.

The clients don't need to pass the schema on every mutation. DGraph could be provided the "schema" part of GraphQL when run, so all the servers have a common knowledge of what's the schema -- and the clients can assume that knowledge, without having to be repetitive.

So, no your clients in different languages wouldn't have to share any schema.

Sounds good, thanks!

It has been a while since I've seen so many buzzwords in one HN topic.
