

No More Joins - SilverStripe and OrientDB - reactor
http://www.silverstripe.org/no-more-joins-silverstripe-and-orientdb/

======
maxdemarzi
The title of the post is throwing people off, because it's actually the
opposite. Graph databases like OrientDB and Neo4j "Pre-Join" everything. Using
nodes connected by relationships, every single "record" in your database knows
what is connected to it without having to do a table or index scan/lookup.
It's all pre-joined, so you avoid the pain of having to JOIN dynamically when
you run your queries.

After years of MS SQL Server, switching to a graph database (I use Neo4j but
it's the same concept) requires a mental mode change, but once you get past
the "graph epiphany" you'll never want to go back.

~~~
bilbo0s
This.

Graph DBs don't do away with JOINs...

they just do the JOIN on INSERT.

~~~
cwyers
Two questions about that.

1) Wouldn't that have roughly the same performance impact on
insert/update/etc. as building indexes? Possibly even more, as building an
index (without foreign key constraints) only affects one table, whereas this
would imply updating all tables with which this table has relationships?

2) Doesn't this require you to know all of the relationships your data has at
insert/update, rather than at select?

~~~
RyanZAG
1) No. It stores the id/address of the 'row' you are linking too, so there is
no performance impact on insert/update/etc.

2) Yes, very much so.

EDIT: For a bit more clarity on how/why you'd use it.

Graph DBs are very good for something like a social network. You would store
each person as an object/document and then you could have a simple array of
ids for all that person's friends.

Then when you retrieve the person from the database, the database can
automatically retrieve all of the friends very quickly as there is no need to
do further searching - it already knows where all the friends are stored as
their id/address is already in the document.

In a relational DB, you would have a table of friend->friend and would need to
do a one-many join across that table.

~~~
cwyers
Okay. I'm not sure I understand why that form of linking would be
significantly faster than JOINs when you are properly indexing everything (or
significantly faster on INSERT/UPDATE, for that matter).

To the second, it sounds like if you're willing to give up a substantial
amount of the flexibility that SQL gives you, you can achieve performance
gains. Isn't that true of any RDBMS, though? If you go through and denormalize
heavily, you can make similar tradeoffs (maybe not the exact same ones, so
there could be cases where a graph DB would be better, but it's hardly the
blanket "joins are bad" that seems to be implied by the original post).

~~~
RyanZAG
It's faster, but it's not remarkably faster, no. The main place it becomes
faster is deep joins - if you want the friend of the friend of the friend. If
using joins that looks like LEFT OUTER JOIN bla a1 LEFT OUTER JOIN bla a2 LEFT
OUTER JOIN bla a3 etc. If you've got much experience with SQL you should
realize the performance impact of that. So graph dbs are best suited to highly
networked problems.

Denormalize is something different, you're misunderstanding something I've
said. With a graph DB, you still only have 1 copy of each person - you are
just storing the 'joins' or ids/addresses of the 'rows' you want to join with
ahead of time.

I certainly did not say that joins are bad, I was just answering your
questions as to how a graph db differs and why it would perform better in
certain specific circumstances. If you're in those circumstances then a graph
DB is very useful.

Another nice part of a graph db is if you have a large collection of different
objects in your programming language that all interact with each other. Those
interactions can be stored directly in-line and allow for an ORM layer to be
more efficient when retrieving the whole collection.

~~~
dragonwriter
> The main place it becomes faster is deep joins - if you want the friend of
> the friend of the friend. If using joins that looks like LEFT OUTER JOIN bla
> a1 LEFT OUTER JOIN bla a2 LEFT OUTER JOIN bla a3 etc.

If you want all friends out to some specified distance _n_ where _n_ > 1,
recursive CTEs are a much more sane solution that anything involving _n_
levels of OUTER JOINs.

And if you just want specifically friends-of-friends-of-friends by joins, you
want three levels of _INNER_ JOINS.

> With a graph DB, you still only have 1 copy of each person - you are just
> storing the 'joins' or ids/addresses of the 'rows' you want to join with
> ahead of time.

More accurately, graph DBs store addresses (not ids/addresses) while RDBMS
store ids (Foreign Keys) of the records you want to join with ahead of time.
This means that to get from _one_ record to a related record takes an
additional level of indirection in an RDBMS compared to a graph database, but
that doing so from many similar structured records to their equivalent linked
records is more efficient in an RDBMS.

> Another nice part of a graph db is if you have a large collection of
> different objects in your programming language that all interact with each
> other. Those interactions can be stored directly in-line and allow for an
> ORM layer to be more efficient when retrieving the whole collection.

With a Graph DB instead of an RDBMS, you obviously wouldn't haven a ORM
(Object- _Relational_ Mapping) layer in any case.

At more than one level out

~~~
RyanZAG
Recursive CTEs are not what one would traditionally call 'relational' and many
relational databases certainly don't have them.

Some distributed graph databases do store ids and not addresses - there is no
actual requirement that they be addresses, only that they be efficient to
retrieve. So not actually more accurate.

People still call object-db mappers 'ORMs' even when dealing with document
databases such as Google's bigtable or MongoDB. It's just an acronym at this
point. [1][2]

[1] - [http://mongomapper.com/](http://mongomapper.com/)

[2] - [https://code.google.com/p/mongo-java-
orm/](https://code.google.com/p/mongo-java-orm/)

EDIT: By 'id' I'm referring to some unique hash or key that can be used to
locate a document. By 'address' I'm referring to an actual on-disk location
that can be directly loaded. Maybe my definitions clash with yours, I don't
think this stuff is formally defined.

~~~
dragonwriter
> Recursive CTEs are not what one would traditionally call 'relational'

I'm not a big fan of tradition for its own sake.

> and many relational databases certainly don't have them.

The only major SQL-based RDBMSs I can think of off the top my head that don't
support recursive CTEs in their current version are SQLite and MySQL --
Postgres, Firebird, DB2, Oracle, and MS SQL Server all do.

------
buckbova
I've spent over a decade living in relational databases . . . currently senior
db architect. I like the joins. I work and think in sets of data and not
objects.

But, I do spend a good deal of time transforming this data for application
consumption, so I suppose this kind of database would cut down that time.

~~~
mtdewcmu
Sets are the most elemental way to abstract data. Not all data are graphs, but
all data are sets. A graph is a set of vertices and a set of edges. So there's
no loss of generality in using sets.

------
lectrick
"OrientDB does away with the concept of a join table, one of the most
significant bottle necks in relational databases."

I'm having a hard time not reading this as "joins are too hard for me to
understand, so I love the idea of a no-join database"

~~~
jhh
what's a "join table" even? This speaks to a lack of understanding of how
relational databases should best be used.

~~~
chromaton
I believe they're referring to tables that have just two columns, both of
which are foreign keys to other tables. They're used to create many-to-many
relationships.

~~~
Pfiffer
I usually call them staple tables.

------
rosenjon
Aside from whether or not this method of linking data increases performance,
there is a huge cost in relational databases to the complexity created by
many-to-many relationships. Even if the performance were the same, the
dispersion of data makes it a nightmare to pull together a hierarchical object
out of many different database tables, and then to deduplicate the data that
is cross-joined (ie in the case of a user with multiple shipping
addresses...join user on address => you get the same user record for every
instance of address, which then has to be consolidated in code). Then do that
times 5-10 other tables and you have yourself a very messy query that is very
slow.

These problems can be solved with KV stores, but at the cost of human readable
databases. Likewise, with ORMs, you sacrifice performance and direct
understanding of how the database is being queried.

There are many advantages to a document-graph model like orientdb for these
reasons.

~~~
mtdewcmu
_> in the case of a user with multiple shipping addresses...join user on
address => you get the same user record for every instance of address, which
then has to be consolidated in code_

One solution is to use two queries: one to get the user with the user's key,
then a second to get the addresses that match that user key.

If you want to get all the user fields and all the addresses in one query, you
can do a join with a group by on user and aggregate the addresses. In
Postgres, you can aggregate into an array with array_agg(). More recent
versions have json_agg() to aggregate into json.

Under no circumstances should you have to deduplicate data in code. That's
what the database is for.

These queries can get tricky to write, but I view them as enjoyable puzzles,
like writing regular expressions. The payoff for writing the right query is
that it's faster than fetching too many records and deduplicating in client
code, and the SQL is short and declarative.

------
bni
"OrientDB does away with the concept of a join table, one of the most
significant bottle necks in relational databases."

join table? Does he mean that you are able to join tables in queries? This is
awesome! It might be a bottleneck but its also on of RDBMS best features. I
also would revisit the bottleneck claim with todays SSDs and fast CPUs.

"...support for inheritance in OrientDB is useful ... to avoid joining tables
to mimic class inheritance."

I have never used joins to "mimic class inheritance" and dont understand why
you would ever want to do such a thing. Can someone enlighten me?

~~~
nsxwolf
I think they are talking about a junction table.

I also think they're talking about a class table inheritance pattern, where
you have a base table that stores the data for the base/abstract class, then
more tables to store the data for classes that extend the base class. You
could use junction tables to model that relationship.

[http://en.wikipedia.org/wiki/Junction_table](http://en.wikipedia.org/wiki/Junction_table)

------
gregwebs
OrientDB has some great explanations on their github wiki. Here is their
thoughts on why references are better than joins:
[https://github.com/orientechnologies/orientdb/wiki/Tutorial:...](https://github.com/orientechnologies/orientdb/wiki/Tutorial:-Relationships#the-
problem-with-joins)

And this page gives an overview of relationships in general with OrientDB:
[https://github.com/orientechnologies/orientdb/wiki/Concepts#...](https://github.com/orientechnologies/orientdb/wiki/Concepts#wiki-
Referenced_relationships)

Personally I have found the relational model limiting for performance in
certain use cases and limiting in terms of mental overhead in others. I have
used MongoDB a lot, however only having embedding to model relationships is
also very limiting.

The promise of OrientDB is a documunt store with support for embedding (and
class inheritance), but also being able to use references rather than joins.

What embedding and references have in common is that you have to do more up
front work to define your relationships. You can think of it as freezing the
possible queries available in a relational model to the subset that you
actually use. I think this is a great default for most applications.

However, it is not what an analyst wants for doing ad-hoc queries, which SQL
databases are great at. And even application creators often later decide that
they want to do some ad-hoc analysis of data sitting in the database. This is
where polyglot persistence should shine and you can have multiple databases
that index the data in different ways.

------
djur
"As many relational databases such as MySQL do not have native support for
inheritance this concept is approximated by joining normalised tables which
together have all the attributes necessary to build a subclass object."

This suggests to me that their problem is not JOIN but a dogged insistence on
a particular style of object-relational mapping. If you're having to do
complex joins to retrieve a single persisted object you're probably coupling
your objects and the database too closely.

That said, a customizable CMS is a natural fit for both document databases and
graph databases. All of the ones I've seen built on relational DBs have
extremely generic schemas that allow users to build their own quasi-schemas on
the fly, at which point you've sacrificed pretty much every benefit the
relational model brings.

------
mmgutz
Good timing. I was waiting until a big player bought into OrientDB.
SilverStripe may not be as big as Drupal, Joomla or WP but they have a large
user base.

------
m_mueller
Since the storage model is plain Javascript objects - what's the situation
concerning offline sync on mobiles (as in, run a lightweight db server on the
device and let it replicate with the db in the cloud)?

------
corresation
I think we've read this story before. It turns out it is a tragedy.

