

Data Modeling in Graph Databases: Interview with Jim Webber and Ian Robinson - ancatrusca
http://www.infoq.com/articles/data-modeling-graph-databases
InfoQ spoke with Jim Webber and Ian Robinson of the Neo Technology team (co-authors of the book Graph Databases) about data modeling efforts and best practices when using graph databases for data management and analytics.
======
joe_the_user
Well, I'm glad to get text with InfoQ rather than a video.

Still, " _Relational databases are fine things, even for large data sets, up
to the point where you have to join. And in every relational database use case
that we’ve seen, there’s always a join — and in extreme cases, when an ORM has
written and hidden particularly poor SQL, many indiscriminate joins._ "

It seems like the overall argument is for (what I see as) a step backward from
the declarative model to a lower level imperative model. "You never know what
memory your implicit declarations will allocate, better do everything in
explicit c-like loops as your data expands."

It's almost like an argument for a return to the world of "hardware is
expensive, people are cheap" and for all I know that's what's happening with
really big data. But it seems a bit sad to present it as a step forward.

~~~
maxdemarzi
You must have missed something, because that is the opposite of what is being
said here. The graph model is more declarative than the so-called relational
model.

In the graph, this node here represents Bob, this one Alice, and Alice is
Bob's manager: (alice)-[:MANAGES]->(bob). The query costs you one traversal
from one node to another, a minuscule cost regardless of size. It's O(1):
your cost stays the same no matter how many employees you have.

In a relational database, you have an employees table with an indexed FK_ID
column for the manager, and the lookup is an O(log(n)) operation. As your
employees table gets bigger, your cost increases.

Take a look at slides 5-8 from this presentation =>
[http://bit.ly/1iN6Y60](http://bit.ly/1iN6Y60)

and the "what we put in cache" image =>
[http://maxdemarzi.com/2012/08/13/neo4j-internals/](http://maxdemarzi.com/2012/08/13/neo4j-internals/)

Then it will all make sense.

~~~
batbomb
It's an O(log(n)) operation, but we aren't talking log base 2 of 100,000,
since it's usually a B+tree, not just a B-tree. In practice, you don't
usually end up with a B+tree more than 4 levels deep.

So, we are usually talking 3, at most 4 buffer reads for the index, and
possibly an extra read to an additional buffer or disk if you need non-indexed
data, assuming the number of managers is much smaller than the number of
employees. Follow the 5-minute rule (you should have enough memory for
anything you might touch every 5 minutes) and in practice you will probably
get similar performance to a graph database.
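To make the depth claim concrete, here is a back-of-the-envelope calculation. The fanout of 200 per internal page is an assumption (a plausible figure for ~8 KB pages with small keys; the real number depends on the engine and key size):

```python
import math

# Rough depth estimate for a B+tree index. fanout=200 is an assumed
# branching factor, not a figure from any particular database engine.
def btree_depth(rows, fanout=200):
    return max(1, math.ceil(math.log(rows, fanout)))

print(btree_depth(100_000))      # -> 3
print(btree_depth(100_000_000))  # -> 4: even 100M rows stays ~4 levels deep
```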

You'd be surprised what you can do with recursive common table expressions in
SQLite, Postgres, and Oracle too.
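For instance, a recursive common table expression can walk an arbitrarily deep management chain in plain SQLite. The table and data below are made up for illustration:

```python
import sqlite3

# Recursive CTE walking a reporting tree in SQLite.
# Schema and rows are invented for this example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES (1, 'Alice', NULL), (2, 'Bob', 1), (3, 'Carol', 2);
""")

# Everyone in Alice's reporting tree, however many levels deep:
rows = con.execute("""
    WITH RECURSIVE reports(id, name) AS (
        SELECT id, name FROM employees WHERE manager_id = 1
        UNION ALL
        SELECT e.id, e.name FROM employees e
        JOIN reports r ON e.manager_id = r.id
    )
    SELECT name FROM reports
""").fetchall()
print([name for (name,) in rows])  # -> ['Bob', 'Carol']
```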

------
glesica
Speaking from personal experience, so YMMV, the trickiest thing about moving
from a relational or document DB mindset to a graph DB mindset is remembering
that you can store information implicitly in the structure of the graph.

So, as a very simple example, you don't have a Comment node with attributes
for the person who wrote the comment and the article the comment is associated
with. You just have edges pointing back to those things. Nowhere in the
comment, or even in the edges, is there anything that looks like an ID or
foreign key.

Unlike a document DB, however, you don't have weirdness once you have
something like co-authorship. Just point to both authors, no need to duplicate
the data or set up some kind of pseudo foreign key. Once you get the hang of
it, it's a really elegant way to store data.
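A minimal sketch of that idea (my own toy model, not any graph database's API): the comment carries no author_id or article_id; membership in the edge lists *is* the relationship, and co-authorship is just a second edge.

```python
# Toy property-graph model: relationships live in the structure,
# not in foreign-key-style attributes on the nodes.

class GraphNode:
    def __init__(self, label, **props):
        self.label, self.props = label, props
        self.out = []  # outgoing edges: (relationship_type, target_node)

    def connect(self, rel, target):
        self.out.append((rel, target))

alice = GraphNode("Person", name="Alice")
bob = GraphNode("Person", name="Bob")
article = GraphNode("Article", title="Graph Databases")
comment = GraphNode("Comment", text="Nice article!")

# Co-authorship: just point to both authors, no duplication,
# no pseudo foreign key anywhere in the node properties.
article.connect("AUTHORED_BY", alice)
article.connect("AUTHORED_BY", bob)
comment.connect("ON", article)
comment.connect("WRITTEN_BY", alice)

authors = [t.props["name"] for rel, t in article.out if rel == "AUTHORED_BY"]
```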

~~~
3pt14159
Elegant to store, hell to really query. Continuing with your example of
comments, it turns every query you have to do into a map-reduce job, and
simple, fast things that used to be easy end up being a pain.

Certainly useful when you really need a graph, but I don't find that it is a
cure-all.

~~~
hugofirth
I would generally agree that you should only go the graph dbms route when you
actually need to. However I would attach the caveat that a lot more people
need to than you might think.

Almost any domain where you want to do some kind of 'deep' traversal or
non-trivial pattern matching is going to benefit hugely from a native graph
data model.

------
salmonellaeater
" _The problem with a join is that you never know what intermediate set will
be produced, meaning you never quite know the memory use or latency of a query
with a join. Multiplying that out with several joins means you have enormous
potential for queries to run slowly while consuming lots of (scarce)
resources._ "

Anyone with even a meagre understanding of databases will put indexes on join
columns. If the data model is complete, then the joined columns will be
modelled as foreign keys (conceptually the same as a relationship in a graph
DB) which force indexes. I think they are talking more about problems with
ORMs, where the ORM might construct unexpected queries that don't hit indexes.
This is an ORM problem, not a relational DB problem.

One of the major promises of relational DBs was that you could write code
describing what you wanted, and the DBMS would figure out how to efficiently
find it for you. This promise was derailed by the push to merge relational
models with object-oriented models (i.e. ORMs), but it's not dead. What we
need is a more powerful SQL, one that doesn't require boilerplate and can do
things like recursion (making every database a graph database). We need a SQL
that makes the application code seem like boilerplate. We need the equivalent
of type inference for joins; let me say A join D and the DBMS infers I mean
A->B->C->D and figures out the cardinality of the result. We need result sets
that are graphs instead of one-list-fits-all. These are all things that can be
modelled in an RDBMS without losing its expressiveness.

~~~
ak39
Agreed.

You said: "What we need is a more powerful SQL, one that doesn't require
boilerplate and can do things like recursion (making every database a graph
database)."

What about SQL's common table expressions (CTEs)? Not powerful enough?

For those interested in SQL's recursive capabilities:
[http://en.wikipedia.org/wiki/Hierarchical_and_recursive_quer...](http://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL)

------
arafalov
People do interesting things on top of graph databases. I find Structr (
[http://structr.org/](http://structr.org/) ) a very interesting approach to a
rich CMS/WCM (based on Neo4j).

------
k__
Last thing I heard was that graph DBs don't scale well.

Is there any information about this?

I wanted to build a system with tagged content and thought about using a
graph DB. (Soft-)realtime queries etc.

~~~
hugofirth
It depends what you mean by 'well'.

Neo4j offers master-slave replication for efficient scaling of reads.
Horizontal scaling of graph databases often involves partitioning, which is a
hard problem and an active area of research.

I would say this however:

\- If your data and query workload is a natural fit for the graph model, then
the speedup you get offsets much of the advantage offered by horizontal write
scalability in other DBMSs.

\- A single Neo4j instance can store and query a great deal of data (in
personal testing I have imported low 100s of millions of nodes, and I am
given to understand it can go much further still). For many use cases this is
sufficient.

~~~
k__
Well, this sounds nice. This number of nodes is more than sufficient for my
needs. The problem would probably be the reads: "give me everything that is
tagged X, Y and Z", "give me everything that is tagged A, B, X and G", etc.
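Conceptually that multi-tag query is a set intersection over the items reachable from each tag node. A toy sketch (tag and item names invented, with a plain dict standing in for the tag nodes' edge lists):

```python
# Multi-tag query as set intersection: each tag "node" knows its items.
tagged = {
    "X": {"post1", "post2", "post3"},
    "Y": {"post2", "post3"},
    "Z": {"post3", "post4"},
}

def tagged_with(*tags):
    """Items carrying every one of the given tags."""
    return set.intersection(*(tagged[t] for t in tags))

print(tagged_with("X", "Y", "Z"))  # -> {'post3'}
```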

But I will look into Neo4j, thanks :)

~~~
hugofirth
Obviously I don't know the specifics of the data you are going to be
modelling, but I would suggest thinking of many of your tag 'properties' as
part of the topology of the graph.

For instance ( _warning_ contrived example ahead) if you wanted to say "Give
me all people that live in Germany" then Germany would be a node (and Lives_In
a relationship) rather than a property on each individual person node.

Graph databases are optimised for _thinking_ about data in this way. So you
might start your query at the node with the label Country and the name
property Germany, then follow all of its connected Lives_In relationships.
This obviously considers far fewer nodes than looping through all nodes with
the label Person.
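In toy Python terms (a contrived sketch to match the contrived example, not Neo4j's API), the query anchors at the Germany node and never looks at unrelated Person nodes:

```python
# "Germany as a node": start at the country node and walk its incoming
# LIVES_IN edges, instead of scanning every Person for a country property.

class Node:
    def __init__(self, label, name):
        self.label, self.name = label, name
        self.incoming = []  # incoming edges: (relationship_type, source_node)

germany = Node("Country", "Germany")
for person in ("Hans", "Greta", "Jim"):
    germany.incoming.append(("LIVES_IN", Node("Person", person)))

# Cost tracks the number of residents, not the total number of Person nodes:
residents = [src.name for rel, src in germany.incoming if rel == "LIVES_IN"]
```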

~~~
k__
Yes, I was thinking about doing it like this.

This will probably be the most flexible way.

