
DGraph – Scalable, Distributed, Low-Latency, High-Throughput Graph Database - pythonist
http://dgraph.io/
======
ignoramous
Wikidata did a comprehensive analysis of Graph DBs [0] and settled on
BlazeGraph, with TitanDB a close second.

Notably, there are quite a few omissions. DGraph and Cayley [1] being two of
those. Interestingly, both are developed by Googlers. Cayley is used by
Kythe.io [2], a Google project that kind of competes with srclib [3] by
SourceGraph.

Cayley has a native JavaScript interface, which makes it an interesting choice
for Node.js-based apps.

At work, we settled on TitanDB, primarily because it supports
DynamoDB/Cassandra for storage and ElasticSearch. Most graph DBs rely on one
storage engine or another underneath -- Cayley supports LevelDB, for
instance, whereas TitanDB supports BerkeleyDB in addition to the
aforementioned DynamoDB and Cassandra.

[0] [https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXik...](https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0)

[1] [https://github.com/google/cayley](https://github.com/google/cayley)

[2] [https://kythe.io](https://kythe.io)

[3] [https://srclib.org](https://srclib.org)

~~~
mrjn
Neither Cayley nor TitanDB is a native graph database. In fact, Cayley
supports many storage engines, including MongoDB. This is because both are
graph layers: the data maintenance is done by the real database underneath.
That makes them easier to build, but it also hurts query latency.

DGraph, OTOH, is a native graph database. We do use RocksDB, but the data
distribution and maintenance are done by us. It's optimized to minimize the
number of network calls, keeping them proportional to the complexity of the
query, not the number of results. This is of incredible value when you're
running real-time queries and serving results directly to end users. Query
latency therefore isn't much affected by a high fan-out of intermediate
results and should remain low, while throughput stays high.

In fact, the entire HN traffic is being served by one GCE n1-standard-4
instance right now, using all 4 cores really well :-).

~~~
detaro
For me the demo doesn't work: CORS violation while trying to access
[http://dgraph.xyz/query](http://dgraph.xyz/query). (EDIT: accessing it
manually sends me to a Cloudflare(?) captcha; that might mess with the query?)

~~~
paulftw
Cloudflare does that when you are hitting it from a suspicious VPN IP address.
There may be other reasons why it doesn't like your IP.

------
simonw
This is really exciting. I've been hoping for a robust, distributed open
source Graph database ever since I first played with Freebase (which clearly
had some amazing secret sauce, long-since purchased by Google). The engineer
behind DGraph has worked on Google Knowledge Graph, the spiritual successor to
Freebase, and obviously understands the space incredibly well:
[https://twitter.com/manishrjain](https://twitter.com/manishrjain)

------
nl
This looks excellent!

Some questions because I need something like this:

What does "distributed" mean in this context? Can the graph size be larger
than the storage on a single node? If so, how is it partitioned (I think Titan
was randomly partitioned)?

Has any thought been given to in-graph processing (PageRank etc)?

~~~
mrjn
The data gets sharded, and can be served by different servers; all talking to
each other to respond to queries. This is how v0.2 works, which is what the
demo is running right now.

A typical RDF triple is (subject, predicate, object). We shard the data by
predicate, so all the RDF triples for one predicate live on one server. This
lets us find, say, the list of friends of X really quickly, with a single
lookup.
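The grouping property described above can be sketched in Python (invented names, not DGraph's actual mechanism; DGraph itself is written in Go): because triples sharing a predicate are stored together, a single-predicate lookup touches exactly one shard.

```python
# Hypothetical sketch of predicate-based sharding: every triple with
# the same predicate is assigned to the same server.

def shard_for(predicate, num_servers):
    # A real system would likely use a shard directory rather than a
    # bare hash, but the grouping property is the same.
    return hash(predicate) % num_servers

triples = [
    ("alice", "friend", "bob"),
    ("alice", "friend", "carol"),
    ("bob",   "friend", "dave"),
]

shards = {}
for subj, pred, obj in triples:
    shards.setdefault(shard_for(pred, 4), []).append((subj, pred, obj))

# All "friend" triples landed on one shard, so "friends of X" is a
# single lookup on a single server.
assert len(shards) == 1
```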

I think PageRank should be relatively straightforward for DGraph, provided we
have edges in the right direction. We don't automatically generate a reverse
edge; it has to be provided if needed for queries.

~~~
maxdemarzi
"We don't automatically generate a reverse edge"

Wait, then you don't actually have relationships as first class citizens? Does
every "thing" know what is connected to it? I mean, where do you draw the line
between graph databases and an object database with one way links?

~~~
mrjn
The problem with automatically generating reverse edges is that it causes data
explosion and duplication. E.g., if you add a fact like X --IS_A--> Human,
then automatically generating the reverse would create Human --Reverse(IS_A)
--> X, which would list all the humans on the planet. Whichever machine serves
this list would immediately run into memory issues; not to mention, any query
using such a relationship would be very slow.

DGraph uses a type schema to understand the relationships an entity can have.
So if an entity is of type A, and type A declares relationships R1, R2, and
R3, you can deduce that the entity can have R1, R2, and R3. Of course, each
entity can be of multiple types (e.g. Tom Hanks is both an actor and a
director).

This approach is very scalable because it avoids unnecessary scans over the
distributed database to find all the relationships an entity can have.
Instead, it uses the schema to deduce that information, then hits only the
right servers to get the data.
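A toy illustration of that deduction (invented names, not DGraph's API): the union of the predicates declared by an entity's types tells you which servers could possibly hold its data, with no cluster-wide scan.

```python
# Hypothetical type schema: each type declares the predicates its
# entities can have.
schema = {
    "Actor":    ["acted_in", "name"],
    "Director": ["directed", "name"],
}

# An entity can carry multiple types.
entity_types = {"tom_hanks": ["Actor", "Director"]}

def predicates_for(entity):
    # Union of predicates over all of the entity's types; unknown
    # entities have no deducible predicates.
    preds = set()
    for t in entity_types.get(entity, []):
        preds.update(schema[t])
    return preds

assert predicates_for("tom_hanks") == {"acted_in", "directed", "name"}
```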

~~~
maxdemarzi
Wait, I don't understand. If I create a relationship "User 1"
-[:LIKES]->"Chocolate Ice Cream". Does the "Chocolate Ice Cream" know which
users liked it or not?

~~~
yazaddaruvala
My takeaway is that "Chocolate Ice Cream" would not know about "User 1" unless
an explicit relationship "Chocolate Ice Cream" -[:liked_by]->"User 1" is
created.

------
pbowyer
All the graph database traversals I've seen are fairly simple (Friend of a
friend, Movies starring X).

Are they a good choice for turn-by-turn navigation, and answering questions
(given a traffic dataset) like: "What has been the quickest route between A
and P, departing at 8am on a Monday morning?"

~~~
mrjn
We have examples of some pretty deep graph queries (8 levels), e.g. the last
query here: [https://github.com/dgraph-io/dgraph/wiki/Test-Queries](https://github.com/dgraph-io/dgraph/wiki/Test-Queries)

Finding the shortest route between A and B is on the roadmap. This would mean
adding a weight to each edge. Navigation adds yet another interesting factor:
we'd need to store multiple weights per edge, depending on the time of day,
day of the week, or day of the year. This would be an interesting challenge to
solve; probably by v1.0.
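None of this exists in DGraph yet (it's roadmap material above), but weighted shortest path is classically Dijkstra's algorithm; a minimal, self-contained sketch with invented road data:

```python
import heapq

def shortest(graph, src, dst):
    # graph: {node: [(neighbor, weight), ...]}; returns the minimum
    # total edge weight from src to dst (standard Dijkstra).
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[node]:
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return float("inf")

roads = {"A": [("B", 5), ("C", 2)], "B": [("D", 1)],
         "C": [("B", 1), ("D", 7)], "D": []}
assert shortest(roads, "A", "D") == 4   # A -> C -> B -> D
```

Time-dependent navigation would make the weight a function of departure time rather than a constant, which is what makes it the harder problem described above.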

------
jerven
Your landing page is missing any kind of "evidence" that it is scalable, low-
latency, or high-throughput.

Also, if you are sharding on predicate you will end up in big trouble.
Predicates in most RDF datasets are not at all evenly distributed, tending
more toward extreme value distributions: e.g., in UniProt the most common
predicate has 2,419,000,171 occurrences, the least common just 1!

Also, if you are going to benchmark, can I suggest the rather good LDBC
ones [1]. Even if for marketing reasons you don't want them public, they are
good for showing where you can improve.

[1][http://www.ldbcouncil.org/](http://www.ldbcouncil.org/)

~~~
mrjn
Data is sharded across machines by predicate. So, in this case, the 2.4
billion occurrences amount to only 2.4 billion * 8 bytes (uint64 ids) ~ 20GB
of data for us, on one machine, which is pretty manageable. Furthermore, each
predicate could be sharded further to fit on multiple machines; that would be
the case with, say, the friend predicate at Facebook.
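As a quick back-of-the-envelope, the arithmetic above checks out:

```python
# UniProt's most common predicate, with each object id stored as a
# uint64 (8 bytes) in a posting list on one machine.
occurrences = 2_419_000_171
bytes_total = occurrences * 8      # one uint64 per occurrence
gigabytes = bytes_total / 1e9      # decimal GB
assert 19 < gigabytes < 20         # roughly 20GB, as claimed
```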

It's hard to "prove" those things on a landing page without going into design
details. We do have a demo with 21 million RDF triples of real-world data from
Freebase, so you can play with the database and get a feel for it.

We'll look into LDBC. Thanks for the pointer.

------
bikamonki
Is a Graph DB suitable for use cases like products/homes/cars etc where users
mostly do "and" queries to narrow down the results set? If so, is it faster
than traditional SQL DB?

~~~
mrjn
Graph DBs are great for intersection queries (AND queries). In fact, DGraph is
designed to do those really fast, and supporting them via GraphQL is on our
roadmap.

[https://github.com/dgraph-io/dgraph/issues/1](https://github.com/dgraph-io/dgraph/issues/1)

In general, graph DBs are great when you have many "kinds" of things, which in
a traditional database would require many, many tables and lots of
interlinking. Those scenarios are ideal for graphs, because many different
kinds of things can be interrelated easily and queried seamlessly. In other
words, the schema for graphs is very fluid.

~~~
newman314
Maybe this exists already and I just haven't found it, but I would love a tool
that helps me visually model and test data structures for a graph database --
for example, modeling the relationships from a host to a VM to an OS to an app
to a network, etc.

~~~
jvilledieu
My company offers a tool to search, edit and visualize graph data:
[http://linkurio.us/](http://linkurio.us/)

You should try it out :)

~~~
newman314
I actually have you guys open on a tab somewhere. I'll check it out at some
point.

------
lobster_johnson
This looks very promising!

Are you planning to add filtering, backed by indexes? It seems a bit useless
for production use if you can't filter a query by predicate, or even
sort/limit. You could layer something like Elasticsearch on top, but then you
lose all the graph support.

Any thoughts on enforcing schemas?

~~~
mrjn
Yes, here's the roadmap: [https://github.com/dgraph-io/dgraph/issues/1](https://github.com/dgraph-io/dgraph/issues/1)

We do filter by predicates, and have plans to do sort and limit, including
count. DGraph is designed to accommodate all these use cases really well.

The idea is that you shouldn't need much on top of DGraph, because the query
latency should be low -- maybe Memcached. But in the graph world a single edge
change can affect many queries, so we've designed DGraph to provide low
latency for arbitrarily complex queries.

Schemas would be enforced via GraphQL, i.e. the query language, not via the
storage layer. This lets you change the definitions of entity types without
changing the corresponding data, which is great for both backwards
compatibility and quick iteration.

~~~
lobster_johnson
Thanks. I didn't see any examples of predicates in the GraphQL, other than
basic graph edge traversal. Do you have any?

Does DGraph use indexes to back the predicates? If so, do indexes have to be
pre-declared? To explain, we layer ElasticSearch on top of our current
document store because we don't want to manage an "index schema" manually; we
have tons of different apps on top of the same document store, and they have
all sorts of ad-hoc queries where it would simply be easier to (like ES does)
index everything rather than require the app to declare a schema.

Not sure I understand your schema explanation. We're talking about schemas for
validating document data, yes?

~~~
mrjn
We don't yet have intersection, which is what you're looking for. But it's on
the roadmap, and as I mentioned, the design is built around efficient
intersections (all posting lists are sorted lists of uint64s).

DGraph supports arbitrarily complex queries, so we try not to optimize for any
particular one. Hence, you don't need to specify indices. We automatically
generate posting lists, which store all the objects for a given (subject,
predicate). E.g., all the friends of X are stored in a single value: a sorted
list of UIDs (uint64).
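Why sorted posting lists make AND queries cheap can be sketched as a linear merge (illustrative Python, not DGraph code): intersecting two sorted uid lists takes one pass over each, with no separate index.

```python
def intersect(a, b):
    # Merge-intersect two sorted uid lists in O(len(a) + len(b)).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Invented posting lists for two (subject, predicate) pairs.
friends_of_x = [3, 7, 11, 42, 99]   # objects of (X, friend)
likes_scifi  = [2, 7, 42, 100]      # objects of (sci-fi, liked_by)

assert intersect(friends_of_x, likes_scifi) == [7, 42]
```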

When we do sorting, we'll provide a way to specify which predicates to sort
by. That's something the user would have to provide; we can't deduce it
automatically. E.g., sorting movies by year of release.

Re: schema, have a look at the GraphQL type system:
[https://facebook.github.io/graphql/#sec-Type-System](https://facebook.github.io/graphql/#sec-Type-System).
It explains how types can be specified and enforced while inputting and
retrieving data.

~~~
lobster_johnson
I think I see what you mean -- sorry, I don't really think in triples, and to
me a predicate is usually an algebraic term, not a data-model term.

Without value indexes, how would your data model optimize queries like
(year >= 1975) or (title like "%thing%")?

If I understand you correctly, the client would pass a schema along with each
mutation? It's something we considered and quickly discarded for our own data
layer, because it's not a good data model. Every client now has to know how to
express its current schema, which puts too much of a burden on the client.

For example, we have a lot of microservices that all share the same data and
schema, and to avoid boilerplate we'd have to build client glue (in 3
different languages) to let them share a schema.

~~~
mrjn
Filters like year >= 1975 would be covered by our sorting feature push. String
matching / name search is on our roadmap as well.

The clients don't need to pass the schema on every mutation. DGraph can be
given the "schema" part of GraphQL at startup, so all the servers share a
common knowledge of the schema -- and clients can assume that knowledge
without having to repeat it.

So, no, your clients in different languages wouldn't have to share any schema.

~~~
lobster_johnson
Sounds good, thanks!

------
eddd
It has been a while since I've seen so many buzzwords in one HN topic.

