
SQLGraph: An Efficient Relational-Based Property Graph Store (2015) [pdf] - espeed
http://research.google.com/pubs/archive/43287.pdf
======
ivan_ah
Could someone summarize the results? How did they represent the graph
structure and attributes? As normalized SQL or as json blobs?

Also which RDB did they use? I saw mention of Berkeley DB, but can we infer
their schema would perform well with stock Postgres or MySQL?

Ideally, there should be code so we can reproduce independently.

~~~
espeed
The paper details its model for the graph structure and attributes, and it
shows results for the different approaches they evaluated -- in the end, they
propose a novel approach of using rows/columns for adjacency lists and JSON
blobs for properties/attributes.

The paper references 3 relational DBs: DB2, Oracle, and PostgreSQL...

    
    
      Most modern relational databases such as DB2, Oracle or
      Postgresql have features to support both relational and 
      nonrelational storage within the same database engine,  
      making it possible to perform an empirical comparison of 
      the utility of relational versus non-relational storage 
      structures for property graphs.
    

Presumably they used PostgreSQL since it's the only open-source RDBMS
referenced, and it supports CET translations, which they used.

The Berkeley DB reference is regarding the Titan backend datastore they used.
Presumably they chose BDB over one of Titan's distributed backeds like
Cassandra or HBase since BDB is single server so it's a more apt comparison to
the other DBs.

------
arthursilva
This is fairly interesting. Postgres has a lot of momentum going on. I guess
there's not stopping someone from building a better neo4j on top of it.

~~~
espeed
Interesting paper indeed.

On the last page of the paper, Table 8 includes Gremlin->SQL CET translations,
i.e. Common Table Expressions:
[http://www.postgresql.org/docs/current/static/queries-
with.h...](http://www.postgresql.org/docs/current/static/queries-with.html) ,
[https://en.wikipedia.org/wiki/Hierarchical_and_recursive_que...](https://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL)
.

Note the paper uses TinkerPop/Gremlin 2 step names, and since then
TinkerPop/Gremlin 3 has been released
([http://tinkerpop.apache.org](http://tinkerpop.apache.org)), which includes
many enhancements so the current Gremlin 3 names may differ a little from the
ones they reference in column one of the table.

P.S. If you want to see some of the new crazy shit you can do in Gremlin 3,
check out Marko's talk and paper from GraphDay a few weeks ago:

Quantum Processes in Graph Computing
([https://www.youtube.com/watch?v=qRoAInXxgtc](https://www.youtube.com/watch?v=qRoAInXxgtc))

"Quantum Walks with Gremlin"
[http://arxiv.org/pdf/1511.06278v1.pdf](http://arxiv.org/pdf/1511.06278v1.pdf)

~~~
dspillett

        Common Table Expressions
    

Be careful to watch how your chosen DB processes CTEs and what it means for
performance in your queries.

In postgres they are "optimisation fences", something which there is
significant resistance to changing[1][2], where-as elsewhere (at least in SQL
Server) the query planner and engine can optimise across CTE boundaries
(pushing search predicates back into them where possible). This can make a
massive difference to the performance of some queries, potentially turning a
full index scan (or worse, multiple scans) into a small number of seeks.

 _As always: test with realistic data sizes /patterns to make sure things work
as you are expecting them to._

[1] see [http://blog.2ndquadrant.com/postgresql-ctes-are-
optimization...](http://blog.2ndquadrant.com/postgresql-ctes-are-optimization-
fences/)

[2] also
[https://news.ycombinator.com/item?id=7023907](https://news.ycombinator.com/item?id=7023907)
for further discussion

------
zkhalique
I always thought that the main advantage of graph databases is the O(1) lookup
instead of the O(log n) lookup in a traditional foreign-key relationship.
There is no need to search an index, since the pointer already records the
location of the adjacent document. On disk, the seeking time would be roughly
equivalent if the database is heavily used, since reads/writes would be
batched as the disk head moves over the disk, or for other types of drives
some similar technique would be used.

~~~
lobster_johnson
The problem with a "pointer" like this is that you then can't move stuff
around as easily. SQL databases frequently reorganize the physical
representation of rows to reclaim space used by deleted data, for example.
Surely the same would apply to graph/object databases.

SQL databases _could_ conceivably optimize foreign key constraints behind the
scenes by replacing/augmenting the referencing column with the physical
location of the foreign row, but what do you do when the target row moves
around in the physical store? You'd have to hunt down all the references and
replace them. If anything can point to anything, that could have a cascading
effect.

You could have conceivably have an indirection here: Instead of pointing to
the physical location of another row, you pointed to the physical location of
a special intermediate file that itself could contain the physical location. I
don't know of any database that uses any tricks like these.

~~~
phpnode
> The problem with a "pointer" like this is that you then can't move stuff
> around as easily

You can, it just requires one further level of pointer indirection. Rather
than pointing to the physical location of the data, you store a pointer to an
offset in an array of pointers.

Alternatively you can point directly if you're prepared to update those
pointers when you move things around. Luckily a graph database is ideally
suited for this kind of operation because it makes discovering those pointers
exceptionally simple and efficient.

~~~
lobster_johnson
Yes, that was the solution I sketched out in my last paragraph. Do any
databases actually do this?

~~~
phpnode
The array of pointers thing is standard/common but I'm not aware of any which
use the features of the graph itself to avoid pointer indirection. Typically
the edges are stored in some kind of tree structure anyway so I'd guess that
the savings aren't worth the added complexity when the thing does a _lot_ of
pointer lookups besides the initial one. But I'm probably going to experiment
with it in my own db project.

------
snaky
It would be nice to compare it with AllegroGraph

~~~
espeed
AllegroGraph is an RDF store, not a property graph. This paper compares
property graph implementations.

~~~
jerven
The important bit is that it is a graph database. Property or rdf is a small
difference not that important compared to all the other details that go into
designing a graph database.

------
maxdemarzi
This is over a year old, and using ancient versions of software. It would be
nice if they showed their testing code not just some horrific and unreadable
gremlin queries. Taking 1.6 seconds to create a new node in neo4j and 2.5 to
update it? Yeah, I don't think that's anywhere close to legit.

Never met a vendor benchmark that wasn't faulty in some way...
[http://maxdemarzi.com/2015/10/16/benchmarks-and-
supercharger...](http://maxdemarzi.com/2015/10/16/benchmarks-and-
superchargers/)

~~~
espeed
To be fair, the SIGMOD'15 conference was in June 2015 so the paper is only ~6
mos old. While it is based on Gremlin 2 rather than Gremlin 3, most all the
code in the paper is SQL CET statements -- there's very little Gremlin. In
Table 8 on the last page they show their code for the Gremlin->SQL CET
translations, and only a few tweaks are needed to update the step names in
column one to Gremlin 3.

~~~
timwilliate
If the paper is only 6 months old, why would the authors use an ancient
version of Neo4j?

~~~
espeed
Titan 0.4 [1] and Neo4j 1.9.4 [2] were the versions released in the Fall of
2013 so maybe that's when they did the work.

[1]
[https://github.com/thinkaurelius/titan/releases/tag/0.4.0](https://github.com/thinkaurelius/titan/releases/tag/0.4.0)

[2]
[https://github.com/neo4j/neo4j/releases/tag/1.9.4](https://github.com/neo4j/neo4j/releases/tag/1.9.4)

