
Comparing Database Types - ironcutter
https://www.prisma.io/blog/comparison-of-database-models-1iz9u29nwn37
======
taffer
> Relational databases get their name from the fact that relationships can be
> defined between tables.

This is a widespread misconception. Relational databases get their name from
relations in the mathematical sense[1], i.e. sets of tuples containing facts.
The basic idea of the relational model is that logical predicates can be used
to query data flexibly without having to change the underlying data
structures.

The basic paper by Codd[2] is really worth reading and describes, among other
things, the problems of hierarchical and network databases that the relational
model is meant to solve.

[1]
[https://en.wikipedia.org/wiki/Finitary_relation](https://en.wikipedia.org/wiki/Finitary_relation)

[2]
[https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf)

~~~
tlarkworthy
What is the relation? The table? (I.e. the tuples describing the rows). Or is
it the joins? (In which case the article is correct). The columns?

~~~
woolcap
One way to think about it is with a mathematical relation, like 'X > Y'. A
relational database relation representing this relation would consist of a
header tuple, <X, Y>, and a set of tuples whose values satisfy the relation,
such as <10, 2>, <8, 3>, <9, 4>. In more common terms, the rows of this table
would contain pairs of numbers in which the value of the X attribute is
greater than the value of the Y attribute. This table describes the
relation(ship) of certain pairs of numbers.
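
As a loose SQL rendering of that example (the relational model predates SQL
and the mapping isn't exact; names and types here are just illustrative):

    -- every row is a pair satisfying X > Y
    CREATE TABLE greater_than (
      x integer NOT NULL,
      y integer NOT NULL,
      CHECK (x > y)
    );
    INSERT INTO greater_than (x, y) VALUES (10, 2), (8, 3), (9, 4);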

"Each tuple in a relation represents an _n_ -ary relationship...among a set of
_n_ values..., and the full set of tuples in a given relation represents the
full set of such relationships that happen to exist at some given time--and,
mathematically speaking, that's a relation."[1]

[1] Chris Date, _Database in Depth_ , page 46

~~~
talaketu
This points to the dual view of a relation - intensional vs extensional. The
beauty of the relational model of data is the morphism: by evaluating
relational operations on the extensional model (data) we can answer questions
corresponding to the intensional model (concept).

For example, given the extensional data ALBUM(TITLE, ARTIST) corresponding to
the intension "the albums, each with a title and artist", we can compute "the
eponymous albums, each with a title" via EPONYMOUS_ALBUM = ALBUM where (TITLE
= ARTIST).

We started with some data for a relation corresponding with a concept, and
were able to operate on the data to produce a new relation - data
corresponding to a new concept.
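
Roughly the same idea in SQL (a loose sketch - relational algebra and SQL
don't map exactly, and the column types are just assumptions):

    CREATE TABLE album (
      title  text NOT NULL,
      artist text NOT NULL
    );
    
    -- "the eponymous albums, each with a title"
    CREATE VIEW eponymous_album AS
      SELECT title FROM album WHERE title = artist;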

------
woolcap
> Relational databases get their name from the fact that relationships can be
> defined between tables.

Relational databases get their name from the mathematical concept of a
relation, used by the Relational Model, "an approach to managing data using a
structure and language consistent with first-order predicate logic, first
described in 1969 by English computer scientist Edgar F. Codd, where all data
is represented in terms of tuples, grouped into _relations_." [1][2] (emphasis
added)

Recommended Reading: Database In Depth, by Chris Date.

[1]
[https://en.wikipedia.org/wiki/Relational_model](https://en.wikipedia.org/wiki/Relational_model)

[2]
[https://en.wikipedia.org/wiki/Relation_(database)](https://en.wikipedia.org/wiki/Relation_\(database\))

~~~
reilly3000
I did a MOOC on relational algebra that made me much more productive in SQL
and better appreciate the gravity of what RDBMS really offer. Understanding
relational algebra helps demystify the magic of query planners and grok why
they can both add and reduce latency depending on the use case.

~~~
edmundsauto
Mind sharing the course? Sounds useful.

~~~
reilly3000
[https://www.coursera.org/learn/data-manipulation](https://www.coursera.org/learn/data-manipulation)

------
pjungwir
Codd's 1979 paper "Extending the Relational Model" [1] is really interesting,
especially the second half. The first half is about nulls and outer joins, and
I think that steals everyone's attention. But the second half basically gives
a way to turn your RDBMS into a graph database by (among other things) letting
you query the system catalog and dynamically construct your query based on the
results. This would never work with today's parse-plan-optimize-execute
pipelines, but it's a really cool idea, and I've certainly often wished for
something like it. I'd love to know if anyone has followed up on these ideas,
either in scholarship or in built tools.

[1]
[https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...](https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqlserver.programming/codd1979.pdf)

------
einpoklum
The document completely overlooks Columnar databases, which are focused on
analytics and are much faster than (most, not all) general-purpose DBMSes.
See:

[https://en.wikipedia.org/wiki/Column-oriented_DBMS](https://en.wikipedia.org/wiki/Column-oriented_DBMS)

and

[https://www.slideshare.net/arangodb/introduction-to-column-o...](https://www.slideshare.net/arangodb/introduction-to-column-oriented-databases)

or get:

[http://www.nowpublishers.com/article/Details/DBS-024](http://www.nowpublishers.com/article/Details/DBS-024)

Examples:

* MonetDB

* SAP Hana

* Actian Vector (formerly Vectorwise)

* Oracle In-Memory

~~~
kthejoker2
Also Druid, HBase, Vertipaq (engine behind PowerBI), Redshift, Azure SQL DW,
etc

Columnar compression is a really interesting engineering problem

~~~
shrumm
ClickHouse is another favourite

~~~
einpoklum
Clickhouse is a columnar system, yes, but is not a full-fledged DBMS.
Specifically, I don't think it can join tables.

------
muydeemer
The article reminds me of Stonebraker and Hellerstein's "What Goes Around
Comes Around", which describes how the database world goes in cycles (can be
found here:
[https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape...](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf))

------
gibsonf1
The discussion of graph DBs completely misses the semantic RDF graph approach
and how greatly it differs from the property graph model (which is discussed).
What matters is not having to maintain a custom schema for each application
(one that doesn't communicate with any other app), and instead using standard
ontologies with known relationships and classes that allow interoperability
between systems (Linked Data Platform - Solid).

~~~
planck01
Do you know of any successful semantic RDF graph databases, I guess with OWL
support? Because I personally don't. If not, it probably is rightfully too
much of an academic niche to be discussed in the article.

~~~
kthejoker2
Stardog, MarkLogic, Virtuoso, AllegroGraph, and RDF4J all have commercial
applications, but yeah in general semantic RDF is dying on the vine.

------
AlphaWeaver
This article comes from the team at Prisma, who are doing some really cool
work building "database schema management and design with code" tools. They're
working on a new version of their library right now (Prisma 2) and are
regularly giving updates to the community and providing test versions.

Most everything they make is open source and really well designed. Would
recommend checking it out!

~~~
616c
I am curious about Prisma 2 because I tried to build a server-side API with v1
as a novice to graph systems and it became an unwieldy nightmare. Partially my
fault for wanting to do it without the SaaS they provide, but trying to build
with something complicated, plus Apollo on the frontend with a skilled FE dev,
got me so confused that I put it off.

~~~
nikolasburk
Prisma 1 indeed has a couple of quirks that we're currently ironing out with
Prisma 2 (or the "Prisma Framework" as we now call it). Would love to hear
from you whether the new version actually solves your pain points!

Feel free to reach out to me: burk@prisma.io or @nikolasburk on the Prisma
Slack [https://slack.prisma.io](https://slack.prisma.io)

------
muydeemer
Just a quick remark on graph DBs. Titan, which is mentioned in the article as
an example of a graph DB, is dead. Its successor is JanusGraph
([https://github.com/JanusGraph/janusgraph](https://github.com/JanusGraph/janusgraph)).

~~~
planck01
I am surprised Dgraph isn't mentioned as an example. It is the most starred
graph db on Github, and I think it is the best one in terms of performance and
scalability.

~~~
mdaniel
Strange that they have to have such a non-standard license, when they go out
of their way to mention Apache 2 several times: [https://github.com/dgraph-io/dgraph/blob/master/LICENSE.md](https://github.com/dgraph-io/dgraph/blob/master/LICENSE.md)

Contrast that with Orient, who also have an Enterprise version, and they just
straight-up say "Apache 2, no drama"
[https://github.com/orientechnologies/orientdb/blob/develop/l...](https://github.com/orientechnologies/orientdb/blob/develop/license.txt)

We had an absolutely miserable experience trying to get Janus to behave
rationally, and _thus far_ have had zero drama with Orient; we skipped dgraph
because it does not appear to work with Gremlin, meaning one must use vendor-
specific APIs to use dgraph.

Their client reminds me of the days before ORM: write a big string literal and
send it to the server: [https://github.com/dgraph-io/dgraph4j#running-a-query](https://github.com/dgraph-io/dgraph4j#running-a-query)

~~~
mrjn
> write a big string literal and send it to the server

To be honest, that describes almost all query languages, including SQL and
Cypher. Dgraph's QL is based on GraphQL[1]. When starting Dgraph, we had a
choice to go with Gremlin or Cypher. But, we didn't like either choices --
Gremlin tends to be slower to execute, because of how they treat the queries,
i.e. graph walks, with iteration over results to generate the next set of
queries; something we explicitly avoid [2]. And both of them returned lists of
things.

GraphQL, on the other hand, allows a subgraph to be returned. One can go from
a subgraph to a list of things (lossy), but can't go the other way around,
because you lose relationships (knowledge about how they are connected).

That's what led us to the path of GraphQL. Over time we modified it to what we
have now, GraphQL+-. Meanwhile, over these last 3 years, GraphQL has become so
much more popular than either Gremlin or Cypher, that now Neo4j is a founding
part of the GraphQL Foundation, just like Dgraph is.

Now, Gremlin is still relevant in the graph space. So, we're planning to build
Gremlin support. But, before that, we're working on building native support
for the official GraphQL spec. That should let one integrate into the fast-growing
GraphQL ecosystem.

[1]: [https://graphql.org/](https://graphql.org/) [2]: [https://blog.dgraph.io/post/why-google-needed-graph-serving-...](https://blog.dgraph.io/post/why-google-needed-graph-serving-system/)

------
tabtab
I'd like to see "dynamic relational" implemented. It's conceptually very
similar to existing RDBMS and can use SQL (with some minor variations for
comparing more explicitly). You don't have to throw away your RDBMS experience
and start over.

And you can incrementally "lock it down" so that you get RDBMS-like
protections when projects mature. For example, you may add required-field
constraints (non-blank) and type constraints (must be parsable as a number,
for instance). Thus, it's good for prototyping and gradually migrating to
production. It may not be as fast as an RDBMS for large datasets, though. But
that's often the price of dynamism. (A fancy version could allow migrating
or integrating tables to/with a static system, but let's walk before we run.)

[https://stackoverflow.com/questions/66385/dynamic-database-s...](https://stackoverflow.com/questions/66385/dynamic-database-schema#46202802)

Some smaller university out there can make a name for themselves by
implementing it. I've been kicking around doing it myself, but I'd have to
retire first.

~~~
asah
I've done this (commercially!) with PostgreSQL - just start with a single
table, with one JSON field, and as you want performance, integrity, etc, add
expression indexes, break out frequently used expressions into columns etc. On
large tables, obviously there's a cost for this reorganization but you can
partition the data first, and only reorg the most recent data (e.g. range
partitioning by time).

[https://www.google.com/search?q=expression+index+postgres](https://www.google.com/search?q=expression+index+postgres)

[https://www.postgresql.org/docs/10/ddl-partitioning.html](https://www.postgresql.org/docs/10/ddl-partitioning.html)
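
For the range-partitioning part, a minimal sketch using PostgreSQL 10+
declarative partitioning (the table and column names here are made up):

    CREATE TABLE events (
      id         uuid NOT NULL,
      created_at timestamptz NOT NULL,
      payload    jsonb NOT NULL
    ) PARTITION BY RANGE (created_at);
    
    -- only the most recent partition needs to be reorganized later
    CREATE TABLE events_2019_q4 PARTITION OF events
      FOR VALUES FROM ('2019-10-01') TO ('2020-01-01');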

~~~
zawerf
I am trying this out and I am still on the edge of whether I like it or not.

Create a table with a json column:

    
    
      CREATE TABLE Doc (
        id UUID PRIMARY KEY,
        val JSONB NOT NULL
      );
    

Then later it turns out all documents have user_ids so you add a check
constraint and an index:

    
    
      ALTER TABLE Doc ADD CONSTRAINT check_doc_val CHECK (
        jsonb_typeof(val)='object' AND
        val ? 'user_id' AND
        jsonb_typeof(val->'user_id')='string'
      );
      CREATE INDEX doc_user_id ON Doc ((val->>'user_id'));
    

I think the postgres syntax for this is pretty ugly. And if you also want
foreign key constraints you still have to move that part of the json out as a
real column (or duplicate it as a column on Doc). I am not sure it's even
worth it to have postgres check these constraints (vs just checking them in
code).

I am also a little worried about performance (maybe prematurely). If that
document is large, you will be rewriting the entire json blob each time you
modify anything in it. A properly normalized schema can get away with a lot
less rewriting?

~~~
takeda
Just a note about using uuid as a primary key. Typically you will use a b-tree
index, which likes to keep things sorted. So something like a serial number
works best, because it is already sorted and new values get appended at the
end. Otherwise inserting a new row will cause writes all over the b-tree,
which will hurt performance if you do a lot of inserts.

If you really want to use uuid and care about performance, you might prefix it
with something that's increasing, like a date, or perhaps (did not try it) use
a hash index (needs PG 10+).
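
A sketch of the hash-index option (PG 10+ made hash indexes crash-safe). Note
that a PRIMARY KEY constraint still requires a b-tree, so this only helps if
you can live with a plain uuid column and equality lookups (the table name is
hypothetical):

    CREATE TABLE doc2 (
      id  uuid NOT NULL,
      val jsonb NOT NULL
    );
    
    -- hash indexes only support equality, which is all a uuid lookup needs
    CREATE INDEX doc2_id_hash ON doc2 USING hash (id);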

~~~
zawerf
(We're getting way off topic) but I think the problem with auto increment is
that it can't be sharded easily since multiple shards can increment to the
same value. If you then try to go back to random ids you're now stuck with 8
bytes which will conflict once every billion items or so. I guess it's pretty
extreme premature optimization but I think UUID is nicer for future-proofing
at the cost of some performance. (I would love to see benchmarks to know
exactly how much performance I am giving up though)

By the way uuidv1 is already prefixed by a timestamp! But unfortunately it
doesn't use a sortable version of the time so it doesn't work for clustering
the ids into the same page. I think it was really designed for distributed
systems where you would want evenly distributed ids anyway.

~~~
pas
In MySQL/MariaDB/Percona InnoDB Galera every writeable replica has an auto
increment offset.

~~~
takeda
Same in postgres[1], and I'm willing to guess every relational database has a
way to do it.

[1] [https://www.postgresql.org/docs/current/sql-createsequence.h...](https://www.postgresql.org/docs/current/sql-createsequence.html)

------
bryanlarsen
The description of flat-file database seems too restrictive. In my experience,
flat files with fixed record lengths and no delimiters were far more common
than variable-length delimited formats like CSV. File sizes were often much
larger than computer memory size, so random read & write was necessary.

~~~
kps
Yes. The origin of the flat-file database is fixed-format unit record
equipment¹, predating computers. COBOL is essentially a language designed for
processing fixed-format files.

¹
[https://en.wikipedia.org/wiki/Unit_record_equipment](https://en.wikipedia.org/wiki/Unit_record_equipment)

------
rainyMammoth
What about time series databases that are fairly common nowadays?

~~~
manigandham
Time-series is more of a specific use case for data that has a primary time
component (like sensor metrics). You can store it in any database, although
the common ones are usually some sort of key/value or relational with specific
features for time-based queries.

Hbase/Bigtable/DynamoDB/Cassandra are key/value. InfluxDB is key/value.
Timescale is an extension to Postgres.
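
For example, TimescaleDB's approach looks roughly like this (a sketch; assumes
the timescaledb extension is installed, and the table/column names are made
up):

    CREATE EXTENSION IF NOT EXISTS timescaledb;
    
    CREATE TABLE sensor_metrics (
      time      timestamptz NOT NULL,
      device_id text        NOT NULL,
      value     double precision
    );
    
    -- turns the plain table into a time-partitioned hypertable
    SELECT create_hypertable('sensor_metrics', 'time');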

~~~
jnordwick
The big-time TS databases (Sybase, KDB, Informix Datawarehouse) are column-
based, not key/value or traditional relational row-oriented. The ones you list
are all lower-tier, trying to shoehorn a time field onto another model.

~~~
manigandham
Those are still relational databases, just with column-oriented/column-store
tables. I don't see how the storage layer changes the database type. For
example, MemSQL has both rowstores and columnstores. Postgres 12 has pluggable
storage with column-store (zedstore).

~~~
shakkhar
> I don't see how the storage layer changes the database type.

It does, because it leads to other types of optimization. LittleTable [0], for
example, keeps data that is adjacent in the time domain adjacent on disk. So
querying large amounts of data that are close to each other is efficient even
on slow (spinning) disks. Vertica [1] does column compression, which allows it
to work with denormalized data (common in analytics workloads) efficiently.

In an ideal world, you could have a storage layer sitting below a perfect
abstraction; orthogonal to higher levels. In the real world, column-based and
row-based are two completely different categories serving very different use-
cases.

[0] [https://meraki.cisco.com/lib/pdf/trust/lt-paper.pdf](https://meraki.cisco.com/lib/pdf/trust/lt-paper.pdf) [1]
[http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf](http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf)

~~~
manigandham
The underlying database _type_ hasn't changed. LittleTable is a relational
database (it's the first sentence in the paper). Vertica is also a relational
database.

Storage is an implementation detail. Optimizations are improvements to
performance. Neither affects the fundamental data model, which in relational
databases is relational algebra over tuple-sets.

~~~
jnordwick
Timeseries is the data model and that is, for the upper end, synonymous with
column-oriented. In my top comment, I mean timeseries/column-oriented (there
are other series besides time, but they fit the same data model).

The top TS databases are more than just storage too. You need a query language
that can exploit the ordering column-oriented gives you that the row-oriented
relational doesn't.

On the lower end (e.g., TimescaleDB), they're trying to fit a timeseries model
onto a row-oriented architecture, which is a poor fit.

~~~
manigandham
Time-series is definitely not synonymous with column-oriented. The data model
is separate from the storage layer, which is separate from the use-case.

You're talking about _relational_ databases (which is the formal type)
designed for large-scale analytics using column-oriented storage and
processing and supporting a time-series use-case.

Storage and querying just for time-series specifically is more about product
features than the underlying type. For example, here's Pinterest doing the
same on HBase: [https://medium.com/pinterest-engineering/pinalyticsdb-a-time...](https://medium.com/pinterest-engineering/pinalyticsdb-a-time-series-database-on-top-of-hbase-946f236bb29a)

------
nudpiedo
All of them are in fact graph databases; they just didn't realize it and ended
up treating the implementation as the design, for many reasons specific to the
context in which they were created. I think we should think more often like
mathematicians and a little bit less like "hackers".

~~~
TheMiller
I think this is a mischaracterization. The relational model which motivated
relational DBMSs is based on predicate logic. Mappings to graphs are obvious,
but are not the organizing principle. This was one of the strengths of the
relational model, encouraging a more flexible view of the data than graph
databases had previously offered. In a complex relational schema, you can
discover and work with all kinds of implicit graphs that were not originally
intended by the schema design.

------
_Understated_
I'm curious... what did the author mean by this:

> Legacy database types represent milestones on the path to modern databases.
> These may still find a foothold in certain specialized environments, but
> have mostly been replaced by more robust alternatives for production
> environments.

I didn't notice anything that went into any detail about legacy database
types.

Any idea what the author means by a "Legacy Database"?

~~~
tempguy9999
He actually tells you in the article, straight after (flat file,
hierarchical...)

~~~
_Understated_
Aww man. I am a dumbass... I never equated that section covering things like
network databases and such with legacy databases.

Dunno how I missed it :(

*Must read slower...

~~~
tempguy9999
We've all done it :) No worries!

------
honkycat
There is a great chapter in "Designing Data Intensive Applications" about this
very subject

------
bryanrasmussen
Again a renaming that makes what the article is actually about less clear.

------
victor106
Can anyone here point to a resource that gives a comprehensive treatment (use
cases, pros and cons etc) of all the types (Nosql, NewSQL, relational,
timeseries) of databases being used today?

------
thekhatribharat
_[Shameless Plug]_ A summary of the ever-growing NoSQL and NewSQL market:
[https://medium.com/open-factory/nosql-newsql-a-smorgasboard-...](https://medium.com/open-factory/nosql-newsql-a-smorgasboard-market-e8cbce4ae8a9)

------
minitoar
How would you categorize something like ClickHouse or Interana or Druid?
Columnar I guess, but then the description of Column-family in the article
doesn't match up with my experience of how those work.

~~~
imchairmanm
Hello, author here. That's a good question and something I had a hard time
sorting out as I worked on this.

I think those fall into a different category confusingly sometimes called
column-oriented databases. They're primarily used for analytic-focused tasks
and get their name from storing data by column instead of by row (all data in
a single column is stored to disk together).

I didn't include those as a separate category here because they're basically
relational databases with a different underlying storage strategy to allow for
easier column-based aggregation and so forth.

My colleague shared this article [1] with me, which definitely helped inform
how I distinguished between the two in my head.

[1] [http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...](http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html)

~~~
minitoar
That makes sense, they really are just relational databases optimized for
certain tasks, with corresponding limitations e.g. they don't support
arbitrary joins.

~~~
barrkel
There's nothing intrinsic about not supporting joins, in a columnar store;
it's just that you lose a huge amount of the linear scanning performance if
you have to do joins for each value. Most columnar stores I've used (primarily
Impala, SparkSQL and Clickhouse) all support joins, but they materialize one
side of the join as an in-memory hash table, which limits the allowable size
of the join, and is a cost multiplier for a distributed query. I believe per
the docs that MemSQL can mix and match row-based with columnar more easily,
but joins are always going to be really slow compared to the speed you can get
from scanning the minimum number of columns to answer your question.

~~~
hodgesrm
The ClickHouse team is working on merge joins which will supplement the
currently supported in-memory hash join mechanism. It's not a panacea as you
point out, especially on distributed tables. That said it will help with a
number of important use cases such as those that require joining a fact table
against very large dimension tables.

------
intellix
Skipped through it looking for an answer but didn't see it: where are unions
in Prisma?! Was looking for some big reveal about an underlying choice that
enables what everyone is begging for

~~~
matthewmueller
I've been mapping out union types at Prisma.

Are you in our Slack? I'm @mattmueller at
[https://prisma.slack.com](https://prisma.slack.com).

I'd love to chat with you to better understand your use cases, so we can make
sure we're designing it for you.

------
dehrmann
The column-family databases mentioned (Cassandra, HBase) are both just fancy
key-value stores that add semantics for separate tables and cell-level data so
you're not rolling it yourself.

------
neop1x
No Elastic among the examples, even though it's highly popular and nice :(
Great article overall, though!

------
CMCDragonkai
There are also column-oriented or array databases like MonetDB and Rasdaman.

~~~
zbentley
And the elephant in the room: Cassandra.

------
sourcepath
What happened with Prisma being all about graphql?

~~~
nikolasburk
GraphQL is a really important _use case_ for Prisma. That is, using Prisma as
the "data layer" on top of your database when implementing a GraphQL server
(e.g. using Apollo Server). However, it's _not the only use case_ since you
can effectively use it with any application that needs to access a database
(e.g. REST or gRPC APIs). We actually wrote a blog post exactly on this topic:
[https://www.prisma.io/blog/prisma-and-graphql-mfl5y2r7t49c/](https://www.prisma.io/blog/prisma-and-graphql-mfl5y2r7t49c/)

You can also find examples for the various use cases here:
[https://github.com/prisma/prisma-examples/tree/prisma2](https://github.com/prisma/prisma-examples/tree/prisma2)

Please let me know if that clarifies it or if you have more questions! :)

------
galaxyLogic
What happened to Object-Oriented Databases?

~~~
AtlasBarfed
Document databases kind of killed them, I would speculate. Since JSON
serializes to and from objects so much better than (ugh) XML, the object-
relational impedance mismatch is gone (well, a lot of it).

------
chrisweekly
I didn't RTFA, but based on titles alone, isn't "Object Database" missing from
the list?

~~~
marcosdumay
Aren't those a special case of hierarchical databases?

------
marknadal
What a lovely article!

It should be emphasized that graph databases can do all other types of
databases (relational, document, key/value, etc.) as you can see demonstrated
in this article ([https://gun.eco/docs/Graph-Guide](https://gun.eco/docs/Graph-Guide)).

This makes graphs a superior data structure.

If you think about the math, any document is a trie, and tables are a matrix.
Both trees and matrices can be represented as graphs. But not all graphs can
be represented as a tree or matrix.

This gets even more fun when you get into hypergraphs and bigraphs, which are
totally possible with property graph databases where nodes have type!

~~~
danenania
“It should be emphasized that graph databases can do all other types of
databases (relational, document, key/value, etc.)”

Not to knock graph dbs, but isn’t the reverse also true?

~~~
mumblemumble
Yes. And it may even be the best way to do it. For example, here's a paper
where the authors come up with a schema and transpiler for doing a Gremlin-
queryable graph DB in PostgreSQL, and find that it outperforms Neo4j and
Titan:

[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43287.pdf)

~~~
gbear0
That's interesting, but kinda makes sense since it would be optimizing
specific access patterns by translating it to relational models rather than
using the normal graph walking algs to find relations.

As an anecdote for one project, while trying to speed up some neo4j queries
myself, I decided to model a binary tree structure in the nodes (child/parent
relations) and then compared the query times for using the simple cypher
queries vs cypher queries with some embedded lib functions that would walk the
tree exactly the way I wanted. The times were much faster for something that I
hadn't even optimized much code wise.

The test got me thinking that if I could have a way of declaring more info
about how the relationships are related, then maybe we could automatically
have the db use more appropriate algs for a more appropriate data structure
for certain node types. I think that's similar to what's happening here, it's
automatically mapping out the simple graph relations to structured relational
db tables. I hope in the future we'll be able to provide more input to that as
well (or at least I haven't seen something like this yet). Let me annotate my
schema to specify my parent/child relationship as a tree, or my word map nodes
form a trie but the leaves should be some other type. Why can't we think of a
db as having multiple datatypes beyond just a kv-store, or table, or graph?

~~~
mumblemumble
It's been a while since I read the paper in detail, but, IIRC, it _is_ using
normal graph walking algorithms. They're just implemented in SQL.

That it's implemented on top of a relational database seems like a red herring
to me. The relational model just defines operations on sets of tuples. A graph
is just a particular kind of thing you can construct with sets and tuples.

From there, the query planner and execution engine take over, and an incumbent
RDBMS's query planner and execution engine are supported by decades and
decades worth of accumulated dark knowledge on how to optimize execution plans
and efficiently traverse large datasets in the presence of a hierarchical
memory model.

By contrast, Neo4j (to take an example) has a steeper hill to climb. Both in
terms of not having had to spend decades trying to compete with Oracle, and in
terms of being implemented in a less-than-ideal language for chasing raw
performance.

------
kristoff_it
> To store data, you provide a key and the blob of data you wish to save, for
> example a JSON object, an image, or plain text. To retrieve data, you
> provide the key and will then be given the blob of data back. The database
> does not evaluate the data it is storing and allows limited ways of
> interacting with it.

Definitely not a good description of Redis, even though they cite it as the
first example of a Key-Value DB.

------
nailer
> Relational vs. Document

Tabular vs Document. Having relations is orthogonal to the shape of your data.
There are document databases with relations - RethinkDB was pretty popular.
Mongo sadly doesn't have them but will probably eventually get them too.

~~~
takeda
The adjective "relational" in relational database comes from mathematical
relations, i.e. sets of tuples - the data in tables.

It's a common misconception that it comes from foreign keys.

~~~
nailer
That's interesting - it seems to both be backed by and conflict with a lot of
[https://en.wikipedia.org/wiki/Relational_model](https://en.wikipedia.org/wiki/Relational_model)
but maybe that's wrong.

I'd still avoid the word 'relational' though - obviously many people will
assume 'relational' refers to DB relations rather than tuples (assuming you're
right about 'relations' meaning tuples, a lot of the Wikipedia contributors
included).

------
Nican
I am really tired of articles that talk about the different types of
databases. People can make a graph database act like a relational database,
and vice-versa. Computers, in the end, are just Turing machines. Just pay
attention that the query you are executing is actually taking the optimal
approach.

I wish more time would be spent talking about the underlying algorithms that
the different query languages use to accomplish the tasks. It is important for
developers to understand the execution complexity of queries, and how data is
distributed across a cluster.

For example, I am usually surprised when people talk about "web-scale", but
they do not understand the difference between a "merge-join" and a "hash-
join". Or when people do not realize that a sort requires the whole result set
to be materialized and sorted.

~~~
threeseed
This is non-sensical.

Take a highly nested JSON document and try and implement it in a relational
database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL.
Or good luck traversing a graph structure in MongoDB when you should've used
Neo4J. So in order to have a performant database you do need to "pay
attention" and ensure that your access patterns suit the database type.

Also Web Scale is about any approach used at the bigger web companies. And the
type of people who use the term are not the same people who would be sitting
there optimising SQL queries. So I wouldn't conflate the two.

~~~
judofyr
> This is non-sensical.

Your examples are kinda proving yourself wrong though.

> Take a highly nested JSON document and try and implement it in a relational
> database. You would have an O(1) lookup in MongoDB and a O(n) lookup in
> MySQL.

In PostgreSQL (which is not MySQL, but it's a relational database) you can
create an index for a column which contains JSON and the lookup for a nested
field becomes O(1).
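
For instance, something along these lines (a rough sketch; the table and JSON
shape are assumed, and GIN is just one of several index options):

    -- index the whole jsonb column for containment queries
    CREATE INDEX doc_val_gin ON doc USING gin (val jsonb_path_ops);
    
    -- served by the index instead of a sequential scan
    SELECT * FROM doc WHERE val @> '{"user_id": "42"}';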

> Or good luck traversing a graph structure in MongoDB when you should've used
> Neo4J.

Facebook implemented the social graph on top of MySQL:
[https://www.facebook.com/notes/facebook-engineering/tao-the-...](https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920/).

> So in order to have a performant database you do need to "pay attention" and
> ensure that your access patterns suit the database type.

I think the point of the parent poster was that you don't need to ensure that
your access pattern suits the database _type_, but that it fits well with the
_features_ the database provides. If you're looking at the database landscape
today, talking about a "database type" isn't really saying much at all.

~~~
threeseed
That's because you took my examples and distorted them.

I specifically mentioned MongoDB and MySQL. Having a JSON type is basically a
mini document database since it uses a different/enhanced query language and
is often not relational with the rest of the schema.

I also specifically mentioned MongoDB and Neo4J. Yes, you could implement a
graph in MySQL, but as your paper notes you will need to front it with
Memcache in order to cache the graph traversal lookups. With a graph database
you generally don't need such a cache.

