Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: Why are relational DBs are the standard instead of graph-based DBs?
233 points by kirillrogovoy 18 days ago | hide | past | favorite | 175 comments

I've been recently exposed to some informational systems where all the domain data was modeled as one graph instead of a set of inter-related tables.

I worked with RDBs (primarily Postgres) for 5+ years and I cannot say that it ever felt wrong, but the more I think about modeling data as graphs, the more it makes me confused why it's not the default way.

Graphs seemed to be: (1) Easier to scale (both storage-wise and complexity-wise). (2) Closer to how we model the world in our minds, hence easier to reason about. (3) Easier to query (felt more like GraphQL than SQL if it makes any sense).

The way I see it, there are two major ways to connect singular entities in a data model: 1. Lists (aka tables) that allow you to sort, filter, and aggregate within a set of entities of the same kind. 2. Relations (aka graph edges or foreign keys) to connect singular entities of different kinds.

... And I can imagine relational DBs being List-first Relation-second, and graph DBs being the opposite. But maybe that's too much of a simplification.

Anyway, looking back at different domains I worked with, it felt like I had spent much more time working with relations than with lists.

Another signal: I have an intern developer, and it took him 1 minute to understand the basics of how graphs work, but then I spent two hours explaining why we needed extra tables for many-to-many relations and how they worked.

Any thoughts? What am I missing? Are RDBs the default way mostly due to historical reasons?

Discussion on this topic that I could find: https://news.ycombinator.com/item?id=27541453

This exact debate took place in the early 70s. There were three major database models: relational, network, and hierarchical. Network and hierarchical had quite a bit of success, technically and as businesses. IMS was (is?) an IBM product based on the hierarchical model.

Network databases, which seem quite similar to graph databases, were standardized (https://en.wikipedia.org/wiki/CODASYL).

Both the hierarchical and network models had low-level query languages, in which you were navigating through the hierarchical or network structures.

Then the relational model was proposed in 1970, in Codd's famous paper. The genius of it was in proposing a mathematical model that was conceptually simple, conceivably practical, and it supported a high-level querying approach. (Actually two of them, relational algebra and relational calculus.) He left the little matter of implementation as an exercise to the reader, and so began many years of research into data structures and algorithms, query processing, query optimization, and transaction processing, to make the whole thing practical. And when these systems started showing practical promise (early 80s?), the network model withered away quickly.

Ignoring the fact that relational databases and SQL are permanently entrenched, an alternative database technology cannot succeed unless it also supports a high-level query language. The advantages of such a language are just overwhelming.

But another factor is that all of the hard database research and implementation problems have been solved in the context of relational database systems. You want to spring your new database technology on the world, because of its unique secret sauce? It isn't going anywhere until it has a high-level query language (including SQL support), query optimization, internationalization, ACID transactions, blob types, backup and recovery, replication, integration with all the major programming languages, scales with memory and CPUs, ...

(Source: cofounder of two startups creating databases with secret sauces.)

This kind of history is so important, and even this fairly recent history seems kind of hard to access. I am curious how you learned it? (Books? Articles? Finding people involved to interview? usenet posts?)

I learned almost none of it in my formal computer science education in the laste 90s -- maybe because at that time it was recent enough that "middle-aged prime of their career" people still remembered it, it didn't seem necessary to teach it to newcomers as "history".

I wonder how much post-1950 history is taught in current undergrad CS programs, like if a "databases" course will include a summary of any of the material you summarize.

Lived it. I remember taking my first course on databases, in 1977, and feeling like I was hopelessly and permanently behind because it had been seven years since Codd's paper.

I taught an intro database course recently, and only touched on some of that history. It wasn't the main point, and it was an undergrad course.

But you point out a real problem. There is so much reinvention by people who simply don't know what was done previously.

* I have seen research I did in the 70s/80s get redone in the last 20 years, with no awareness that the problem had already been solved.

* I saw a horrifying takedown at an academic conference, (late 80s?) Someone on the original System R research team was pretty furious at a young researcher for being completely oblivious to the fact that he had repeated some System R work. (He blamed the conference reviewers as much as the author.)

My late professor used to lament quite a lot about work being redone. Not a week went by without one or two papers/submissions/talks/works being mentioned with an immediate followup like “this is the same as this paper from the 70/80/90” or “this is the same as this paper by that author”. I found it quite sad, both because of the apparent double work and because this whole catalogue of computer science material went unused because of the tirades instead of their being a constructive way to share it.

Although I very well can imagine the frustration, as I have seen the same things with libraries being reinvented, I think there is value in revisiting some problems in a different context. Maybe the problems are just not that of a big deal anymore in newer languages/eco systems even though they are fundamentally the same problem as in the 70s.

You would think the CS discipline, of all disciplines, would be able to expose its history in an easily searchable format.

Cobblers children tend to go without shoes maybe?

I suspect some of it is just the joy of solving (or trying to solve the problem) means the unsexy work of digging through hundreds or thousands of old papers gets forgotten.

And looking at things with a fresh view (which isn’t going to happen after looking at all those old papers) does have value.

I find the freshest view is typically in the foundational papers (not necessarily in computer science, but generally). Always go back and look.

Please allow me a snarky comment: Coddler's children.

Such a format would not matter, because most programmers are anti-research, anti-math, and anti anything other than shipping Agile software.

Programmers are not computer scientists any more than civil engineers are material researchers.

Software "engineering" isn't. It's all about processes and very little about standardization of process and verification of outcomes.

Found one in the wild.

Ironically, if there were a database that would allow querying for this type of things... What's the State of the Art on X ? Where are we on Y ? Has the problem in paper Z been solved ?

I think in a lot of cases, it's partly a question of where the material was (on paper, not necessarily in libraries, possibly at obscure conferences). That's the case for a various things I'd like to refer to, and the system I used for physics research was never written up anyway. In this context and example is information on Logica's RAPPORT, allegedly the first commercial, portable RDB, which had some significance in UK research support.


Databases are interesting because they were the forefront of computing when it was invented, and they still are today. The computer science sorting algorithms and trees (and trie) go directly to how good (for all the different kinds of definitions of good) a database is.

MongoDB threw out a ton of history in order to invent a different wheel, and it's faster for interesting reasons, but it's also for a different era than single monolith computing (aka Oracle DB). That doesn't mean MongoDB is appropriate for all situations, but that a fundamental precept of computing has changed. Used to be, there was room for maybe five computers in the whole world. Today, my computer has multiple virtual computers inside of it and they are treated as cattle, not pets. The work being redone is because (sometimes) the work has been invalidated by newer experiments.

We're going to have to "redo" a lot work once dark matter is found as well.

MongoDB is a classic example of the innovator's dilemma. In its first incarnation it had no ACID transactions, no joins, no Geo-Spatial indexes, no inverted indexes, no SQL query language etc.

What it did have (which no relational database had at the time) was JSON as a coin of the realm and a distributed database model. Run the clock forward and MongoDB now competes on equal footing with relational databases for all those features but it still has the key competitive advantages of JSON as a native format and a distributed architecture.

Of course there is value in revisiting a problem. But you should know that you're doing it, and be able to say how your work relates to the earlier work.

(Not denying that reimplementing things can be fun ...)

I think some of it is pure exposure. Developing a new interest in a topic can be inspirational to build things and without adequate access to prior art or knowing the right search terms can leave you lost and ignorant.

I wonder if this is where technological evolution will stagnate, when most people research an already solved and proven topic and nothing new happens, I saw some science YouTuber say development is still accelerating, wonder for how long until the next breakthrough!

I think this problem comes down to two core issues: discoverability and terminology.

You're going to be lucky if a paper from the 70s or 80s is available in a searchable database at all. That means someone bothered to scan it in, and OCR it since then. Even for the few papers that are searchable, they are old enough that they probably won't catch anyone's eye unless they are desperate.

Of course then there's also the problem of knowing what to search for. Programmers love to invent, reinvent, and re-reinvent terminology. It's only gotten worse with every other developer running a blog trying to explain complex ideas in simple terms.

The entire field of ML is a perfect example of this. I remember talking to my father about all sorts of new developments in ML back in the early 2010s, and I was quite surprised when he told me that he learned a lot of the things I was talking about back in the 80s just named a bit differently.

In most cases it ends up being a question of how much time you can put into any given problem. If I spend two weeks to find a paper that would have taken me a week to reinvent, then am I really ahead? If the knowledge wasn't important to enough make it into textbooks/classes/common knowledge then attempting to find it is akin to searching for a particular needle in a pile of needles.

I have never come across a popular CS paper that was not available on the web, for what it’s worth. Maybe some of the lesser known papers are lost, but all of the important ones, such as Codd’s writing, are very easily accessible with simple search engine searches.

The important and popular ones are absolutely available, but those are usually important because they have entered the realm of "common knowledge," at least in a particular sub-field. These are going to be at the top of the list when it comes to digitizing useful historic records. It's fairly easy to OCR a PDF, so as long as someone with some time decided "hey, this might be useful" then you'll probably be able to find it.

If you're doing databases then you've almost certainly been exposed to Codd's work, if not through his papers and books, then at least through textbooks and lectures. There are countless blogs, lecture series, and presentations that will happily direct you there.

The challenge is that there's also a mountain of work that never really got much popularity for whatever reason. Say a paper was ahead of it's time, or was released with bad timing, or simply kept the most interesting parts until the end where few people might have noticed. It's these sort of gems that are hard to find. It's hard to even know how many of these there are, because they are by definition not popular enough for most people to know about them.

>But you point out a real problem. There is so much reinvention by people who simply don't know what was done previously.

I've often wondered if and where the boundaries of human knowledge will run into this very problem. Mostly we rely on the concept of reductionism and hope that will generalize and scale but it rarely does. We often have different theories at different scales for different things we to want model and explain with science in general. If it's found that these ideas don't connect and we have to forever add more and more generalized nuggets of knowledge, models, etc for specific case, even if they are reductionist in nature, we may get to a point where there's just so much information the act of discovery of prior art may be more work than simply redoing the experiment against the ultimate judge (reality) and see if it flies.

I used to work with a handful of bioinformaticists in the era of the affordable sequencing boom and there was a lot of debate about what to do with all the sequence data. Do they archive it and pay the cost to manage, index, and make it searchable or does it make more sense to simply pay the couple thousand to rerun the experiment and assure yourself that no process flaws were introduced in a prior paper, that you eventually agree with the prior work, and so on. There's almost a little bit of built in reproducibility in this unintended act of duplicating prior work.

Software is especially bad because it's such a relatively new domain and is rarely done rigorously so people just run through cycles over and over again, assuming something people did before was wrong or that the environment changed and that approach was just inappropriate for it's time period. Plus, people in this industry seem to be driven to do innovative new work so they have no interest in looking backwards even if they should.

I wrote a little on this subject, of research that is ignored or forgotten. You might find it of interest.


You can read this article by Michael Stonebraker about the evolution of database systems: https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape.... It's the first article in the Red Book[1].

[1] http://www.redbook.io/

We left the exact same comment :)

History of science, as a scholarly discipline, is unfortunately rather decoupled from science (or engineering) itself. Practitioners of a field are rarely as interested in the field's history as they probably should. History of CS in particular has not attracted much academical interest even among science historians, possibly due to its relative youth and technical nature.

It's a good thing that the two are separate. You really want good scientific advances to be put in textbooks, and the rest should be left for history books. You really don't want to have the scientific debate get muddled in pointless historical fights like in heterodox economics where people are still debating what mArX rEaLlY mEaNt centuries later. Scientists should read textbooks that reflect the current state of the art, not 40 years old papers that will necessarily be flawed.

That only works if you assume a linear model of scientific progress. Those who don't know history are doomed to repeat it, and this is extremely evident in the field of software engineering (granted, SE is not science and not even a good example of an engineering discipline) where old wisdoms are forgotten and then reinvented every five years.

There is quite a lot about DBMS history in the Red Book and references therein: http://www.redbook.io/index.html

For the history of data model proposals in particular, there is this paper: https://scholar.google.com/scholar?cluster=73661829057771494...

So the info is definitely out there if you are interested in that sort of thing.

The Association for Computing Machinery[1] (usually "ACM") seems to be nearly invisible now. It was the forum for advancing the state of the art in those days.

Its archives are primary sources for this history. The database debate was carried out mostly in ACM SIGMOD, the "Special Interest Group for Management of Data".

1. https://www.acm.org/

I was taught exactly this history in my college DB course in Slovenia. I even had to know the years when each model was proposed and by whom.

Some references to the state of the data world in the 70s is found Codd’s initial paper itself: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf.

I strongly recommend everyone pick up basic set theory notation so that you can read seminal papers like these. There is so much amazing information in them, yet they’re treated as if they’re incomprehensible. You can understand the relational model with undergraduate-level discrete math and set theory knowledge.

These are the papers that formed our industry. Let’s not forget them.

> This kind of history is so important, and even this fairly recent history seems kind of hard to access. I am curious how you learned it?

This sort of touches upon the general inadequacy of CS education in general.

The way I learned all this history is by digging it up myself, that was in early 2000s - old journals, papers, listening to people who witnessed all this themselves.

That's the cornerstone skill that should be taught - research and the understanding that no single course can give you much in actual knowledge, only skill to acquire it.

> I wonder how much post-1950 history is taught in current undergrad CS programs, like if a "databases" course will include a summary of any of the material you summarize.

My databases course (circa 2014 or so) covered the gist of this; we didn't spend much time talking about database systems before it, but we read the Codd paper and discussed how various companies began working on implementations over the next couple of decades.

One downside of both hierarchical and network (graph) models that is not mentioned either in the post or parent comment is, "access path dependence". Network model allows multiple paths but basically just forestalls the issue.

As to sibling comment question about how to keep up on this stuff, the preceding info was in the "tar pit" paper that has been discussed a lot here on HN. So I'd say, your already doing it. Continuing education is part of any profession and this is probably a better way than some.

> One downside of both hierarchical and network (graph) models that is not mentioned either in the post or parent comment is, "access path dependence". Network model allows multiple paths but basically just forestalls the issue.

Yes. I think that this access path independence is an essential part of the relational query approach, (of which SQL is one example), being high-level. And once you have such a language, query optimization becomes essential.

> the "tar pit" paper that has been discussed a lot here on HN

Did you mean Mosley and Marks' "Out of the Tar Pit" http://curtclifton.net/papers/MoseleyMarks06a.pdf?

Yes, thanks for providing the link.


> Ignoring the fact that relational databases and SQL are permanently entrenched, an alternative database technology cannot succeed unless it also supports a high-level query language. The advantages of such a language are just overwhelming.

You seem to be ignoring the success of highly scalable managed NoSQL databases that has been all the rage when building high TPS services. What is your opinion on those?

They work for a very specific use case. And if you need to expand that use case, or address others, good luck. Relational database systems are complicated for good reasons. Schemas, which do cause problems, are there for good reasons, and the alternative problems obtained by going without schemas are just not worth it, nearly all the time.

I think a lot of the attraction of NoSQL databases is that they are seductive for people who don't know about the problems inherent in working with shared, persistent, long-lived data, (i.e. years). A NoSQL system lets such people get started easily. They start down the path, and then they run into dragons, and muggers, and hostile aliens. This is a really good description of what I mean: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never....

The NoSQL crowd often switches jobs before they run into these problems. 90% of NoSQL use is just resume driven development. Clearly some valuable cases exist but most startups are not among those.

You'd think that the recent increase in popularity of type systems would prompt some: "aren't type systems schemas?" moments for some of these cases.

Haha! I love reading comments like these - check out TypeDB (vaticle.com/typedb)

disclaimer: work there

Not OP but my guess is NoSQL wins were data isn't very relational or traditional RDBMS and ACID don't fit. One could even argue that file systems and key value stores are NoSQL solutions that predate RDBMS. Yet a lot of businesses just need something reliable, roughly relational, and with a community of talent.

All that said, the world of successful solutions is not mutually exclusive or limited to only one winner.

In my experience (having worked with both for a decade or so), NoSQL is valuable when the desired performance characteristics and consistency needs diverge wildly between ‘relations’/entities. You run across similar with file systems and database boundaries.

Have a lot of data you need stored where there is limited relation between them directly, and attempting to be transactionally consistent across the boundaries there and the rest of the system is going to very expensive? Then doing some part of the data in a nosql db and the rest in a ACID compliant database is probably a good idea.

Similarly, putting 50GB vm images into a nosql db is probably pretty silly when it should probably be on a proper filesystem, or at most copied out and written back periodically.

And putting a bunch of data that is tightly coupled together and needs to be transactionally updated to ensure your dataset doesn’t turn to gibberish in a NoSQL backend is going to be a nightmare.

> It isn't going anywhere until it has a high-level query language (including SQL support), query optimization, internationalization, ACID transactions, blob types, backup and recovery, replication, integration with all the major programming languages, scales with memory and CPUs

I would seriously argue that all of these are necessary for a database to be useful. The sad fact is that no graph databases in existence truly act as graph database, without some weird caveat like "you must have schemas".

At some point, the graph is simplified to a tree.

The tree becomes lumber.

The lumber becomes a table.

At that point, the data on the table become is suitable for human consumption.

For more on this topic, a good read is “what hoes around comes around”, by Michael Stonebraker (creator of Postgres, among other achievements): https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape...

To add to this, the debate was won in practice because the data persists, but the set of things you might want to do with it is mutable.

Hierarchical and Network databases assume you can anticipate future uses of your data.

> an alternative database technology cannot succeed unless it also supports a high-level query language. The advantages of such a language are just overwhelming

Neo4j's Cypher query language is beautiful to work with!

The databases that dominated the industrial scene prior to the emergence of relational designs—IBM’s IML, GE’s IDS, the CODASYL systems—were described as following a “network” model, which didn’t have much conceptual distance from what today we’d call a “graph” model.

They were largely replaced with relational systems for more or less exactly the reason Codd laid out in his classic paper[1]: If efficient processing of logical operations depends on a predetermined physical structure of data, the range of practical applications for your database is severely constrained.

That suggests that the proper niche for graph databases contains those applications with a predefined set of highly recursive operations, which is more or less where we find them today.

[1] http://db.dobo.sk/wp-content/uploads/2015/11/Codd_1970_A_rel...

Yet a relational database in 5th normal form is equivalent to a RDF-Like first order Triple graph.

A lot of network databases store graphs as pointer-bogs which are hard to process and query, but that does not diminish the virtues of the graph model itself or preclude a representation based on joins.

I'd argue that a 5th normal form graph representation contains a lot less implementation detail and is a lot more flexible than a regular old buch of tables.

> Yet a relational database in 5th normal form is equivalent to a RDF-Like first order Triple graph.

Not quite. The definition of 5NF depends on real-world constraints on valid combinations of attributes in the database; it has nothing to do with any pre-defined data model such as RDF. There can be "wide", horizontally-modeled tables (quite unlike the RDF model) which are in 5NF simply because no real-world constraints apply to the data beyond those that are implied by the candidate keys.

Wow. First time I’ve heard of a normal form above 3rd. Guess I need to read up.

FWIW, there's very little practical reason to normalize your DB beyond the third normal form (or technically the Boyce-Codd normal form, which your DB will probably already meet if it's in 3NF)

Could you recommend some resources for reading more/learning about Nth normal forms?

I assume you mean IBM's IMS.

I don't think it was replaced for lack of flexibility.

AFAIKT it was replaced (by those who could replace it. There are many that still use it. Like banks) because IBM didn't let it progress, basically froze it in the cobol era, tied it with their hardware, and continued to cash-in as long as they could.

Unlike some sister comment, I think it was conceptually great... Only the tooling and ecosystem were/are terrible.

I worked with a hierarchical database system in the late 80s. Nightmare. <g>

As others have said, graph databases push foreign keys, joins, referential integrity, and migrations to the application layer. Right now the combinations of databases and software just can’t handle that with nearly the sophistication of Postgres + an open source ORM.

Facebook famously uses a graph database called TAO [1] with an ORM called Ent sitting on top of it. Everything lives on that - internal and external applications. Every dev at FB writing backend code uses it. Can’t talk too much and how these work since they’re internal and proprietary, but looks like we open sourced a Go ORM that’s Ent-inspired that shows the basics [2]. It’s definitely a weird mental model to get used to, but the combination of deep business logic in schema definitions, complex joins, and insane query performance make it super powerful. We’re definitely one of the bigger non-SQL shops, with thousands of developers using this every day as their primary data store.

[1] https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...

[2] https://entgo.io/docs/getting-started/

> graph databases push foreign keys, joins, referential integrity, and migrations to the application layer

All graph databases? With just a bit of googling I can find ones that support migrations through ETL with foreign keys translated to edges at that point, and claim to have transactions with immediate consistency—even some claiming ACID.

Joins are not even a thing in graph databases; being able to directly query across edges obviates the entire concept.

> Joins are not even a thing in graph databases; being able to directly query across edges obviates the entire concept.

Joins are the central thing in graph databases, even if you don't think of it in those terms or materialize some of the join operations. Being able to directly query across edges is a logical operation, it doesn't imply the underlying implementation. While the simple "node/edge" model in memory seems easy, it also has pathologically poor locality so most graph databases use less direct representations (traversed with join operations under the hood) with much higher average performance.

You could make the same argument for any columnar store. Getting fields from one is fast, though, independent of the table they belong to. Joining together row-based tables has more moving parts and much more runtime cost.

> Joins are not even a thing in graph databases; being able to directly query across edges obviates the entire concept.

Nah, you do need joins if you don't only want to walk the graph from some well known roots. E.g. queries about properties of connected subgraphs

O(1) traversal, and filtering edges is more intuitive than joining.

And also not O(1) nor intuitive for any non trivial query.

You don't need to have an explicit `JOIN` operator in your language, but any query that goes beyond unconditional tree walking will benefit from using joins under the hood.

Depends on what’s under the hood, though. A lot of these seem to be document or columnar stores.

Well, still. Columnar and document stores still profit from performing joins under their hood.

Whenever you are combining two partial query results into a coherent whole answer you need to perform some form of join.

For example you could view a MongoDB map-reduce step with a subquery as a naive indexless scan join.

But there is no other query, except if you count each and every traversal originating from each record. The top-level query can traverse edges to reach other vertices, so first it gets a result set, then for each result, fields across relations, etc.

In traditional join sense, a RDBMS first gets two or more result sets, then filters and matches one, both or neither to appropriate rows in the other one. It is a different concept entirely.

Suffice to say, one of these is conceptually easier to parallelize and distribute.

You seem to be stuck on the query semantic level. I can highly recomend actually building a graph database, and using it in anger in a project.

You'll realize that all these paradigms are more related than they appear on the surface, that the obvious and naive thing is also often prohibitively slow and limited, and that NoSQL doesn't mean NoJOIN ;)

I don’t care about implementation details. I care about performance and team velocity; the former is already implemented and the latter comes from being able to just define a graph data model and get a free GraphQL API for frontend developers in return.

> the former is already implemented

Yeah, with joins, which this entire discussion is about.

> the latter

Is completely tangential.

You don’t think there is a slight difference in how those two very different ”joins” behave when the underlying storage is entirely non-local (columnar graph) or row-local (RDBMS)?

Nah, you do need joins if you don't only want to walk the graph from some well known roots. E.g. queries about properties of connected subgraphs

In an graphdb the engine would deduct that by the way of the query language (no explicit joins).

Yes, and then do a join ;)

I'm not arguing for explicit joins, which are an artifact of SQL and not even the relational model e.g. Datalog as a explicit join free relational language.

> graph databases push foreign keys, joins, referential integrity, and migrations to the application layer.

This isn't a fundamental limitation right?

no. not at all. this whole discussion is really about marketing terms. there is absolutely no reason why one can't apply relational operators to graphs.

its just that sql doesn't normally support cyclic evaluation. that's really only the only fundamental difference in a 'graph database'

that said, recursion is really handy and SQL is a really crap environment for doing anything but the most basic of evaluations.

I'd contest your points 2 and 3 from a business application perspective.

Relational DB's resemble ledgers from a business perspective. Most business apps and nearly all financial apps think in terms of ledgers of transactions. Go one to two degrees of separation from a financial transaction in any web app these days and the users of that system will want to line up other activity/transactions in the app with the financial ledger. Then they want to ask questions and have reports on things like: how many users for that customer? how many purchases this month for that customer? what did they purchase? etc.

I think its historical and historical also implies the vast majority of systems aren't likely to change if there is some-other-tech that may be easier to work with. Its a safer decision to pick well known tech because you'll know the trade-offs, support issues, hiring parameters, etc. from the history of all that came before you. To reverse that point - picking some esoteric language/db solution increases the risk of a product/project failing, not because of the technology but all the other factors that surround that technology.

Yep and just look at all the people using Excel. They're not saying 'oh if only this was a diagramming tool where I could draw lots of lines between boxes instead'.

Lots of people struggle to use Excel at all beyond simple non-relational lists. And even then they do tons of raw copying and manual work to generate the data they need. I would not say that Excel’s prevalence is somehow a sign that tables are natural to people.

I think Excel's prevalence is a sign that tables are natural to people (to the degree we can say a data structure of any kind is natural, but, well) -- the struggles come in, I think, from

- trying to use it with data where tables are a poor fit, or

- trying to use it with data that can't be reasonably represented in more than two or three tables

I still use spreadsheets for simple "databases," but I can't think offhand of any time I've a spreadsheet that requires more than two interlinked tables/sheets.

1. "Easier to scale" Scaling is such a weired topic, its like that joke about teenage sex, everybody always talks and is concerned about it but how many people really have to scale? And generally if you have to scale you have money and money can make any system work

2. "Closer to how we model the world in our minds, hence easier to reason about" I don't think i agree. I would argue, most peoples mental model of data is strongly inspired by tables. Thus, the big success of excel.

3."Easier to query" ok maybe I am just too used to SQL databases but i would very strongly disagree there. SQL or using any ORM its super easy to query. Of course some joins might get complicated. But if you always have to run a lot of complicated joins you can probably work on the data model, or use caching.

> And generally if you have to scale you have money and money can make any system work

That's the thing, this is not always true. It's true that generally you can throw money and hardware at a problem, but at some point you just reach the fundamental limit of the technology, at which point throwing money at the problem doesn't work.

In fact, being in a position where you can throw money at the problem is _brilliant_, as spending money is easy. So the point of _scalable_ technologies is to enable you to scale just by throwing money at the problem.

Case in point, at my old company we had this _massive_ Pg database running in RDS. It did not scale. Performance was terrible, and made the user experience shit. We were basically using the biggest instance size we could, with a few replicas powering read only endpoints. So at some point we migrated parts of it to Dynamo, and poof all of our problems went away, including operational ones.

One of the primary issues with graph databases is that they don't scale no matter how much money you throw at them. Many organizations quickly hit a very hard scaling wall with graph databases, and no amount of effort will resolve it. Money doesn't make hard computer science problems disappear.

The US military has been willing to write a blank check to anyone that builds a graph database that actually scales for decades. No one has collected on that.

It is complicated but there are good reasons for it.

The historical advantage relational databases have over graph databases is that the former forces a more restrictive representational structure onto the data model, which doesn't sound like an advantage at first. Restricting the ways in which a data model can be organized and traversed makes it straightforward to implement highly effective performance optimizations inside the database engine. Optimization is a tradeoff, by making one type of operation on a data model faster you often make another type of operation slower -- there is no free lunch. A database where all possible relationships in the data must be optimized is a database where little optimization is possible. The restrictions tacitly placed on relational database representations allow SELECT operations to be heavily optimized in a way that isn't possible if you want graph-like data model traversals to be fast and efficient.

From the perspective of database internals, relational databases are optimized around SELECT performance and graph databases are optimized around JOIN performance. The former is intrinsically much easier to optimize. It turns out that shoe-horning data models into a relational database almost always has qualitatively better performance than using a more flexible graph database, and performance traditionally matters a great deal in databases.

At a technical level, scaling ad hoc join operations -- the core operation of a graph database -- is famously extremely difficult. Ironically, most graph databases use data structures and algorithms that are tacitly optimized for the assumptions of a relational database implementation, that aren't trying to be good at graph-like things. You generally don't see graph databases that were designed from first principles to be a graph database; their internals are typically that of a relational database that supports join recursion.

We figured out very early on how to optimize the hell out of relational databases. To this day we have been unable to build graph databases that are similarly optimized, partly because the computer science is much more difficult.

Graph DB's have the relationships and the data in one lump. With Table based databases you can have relationships between tables stored in tables and obtain your graph structure that way. You can also change your graph structure by changing the data in the relationships table. This was a big problem in ISAM databases, if your schema changed then you had to spend a lot of time rebuilding the whole graph (as I was told anyway - I came in at the beginning of SQL databases), I imagine this is a problem with graph databases to?

The advantage of relational was also you could store reference data in one table - not all through the graph, another problem with ISAM. Also often when reporting you don't really know the relationships you want to report on, if this is in a hard wired graph then extracting that data and putting it in the form you want is hard, this is the power of joins - joins allow you to make the relationship you want at query time.

Graphs assume you know the relationships and they're fixed in time, this is seldom the case when building a relational DB, thats why joins come in handy - you can restructure your tables without deleting the data, or not much deleting. You can add views in to, to present the data in the old way and so on.

There's lots more, but that is the big ticket items.

> joins allow you to make the relationship you want at query time

And the price you pay for that is speed. Direct arc traversal is always O[1]. Querying is O[all_over_the_map] and all traversal in an RDBMS is essentially querying.

But your point is well taken: Hardcoded arcs move the speed problem to update time. However modern graph databases use a number of techniques to mitigate this problem like indirect pointers and reified relation objects. Yes, this slows things down to something like O[2] (which is of course the same thing as O[1] but I'm pointing out that it's about half as fast) but it preserves traversal speed predictability (minimizing traversal variance) while also allowing update flexibility.

At the same time modern RDBMSs optimize candidate key traversal queries to the point that they're essentially O[1] too so it's kind of a potato/potahto situation.

'Nodes' are represented by the ID column [1] (aka single-column PK) in the respective table. 'Hyperedges' are tables that tie together 2 (or more) IDs, for example the classic ParentChild table. Table attributes can be seen as mini ID-Attr 'edges'. Adding a FK column to a 'node' table is a shortcut for relations that are binary and 1:N. In practice, the relational model is encoding graphs quite directly, perhaps with a funny terminology.

At the query language level, a graph query language has to somehow represent the notion of a list + sort/filter/aggregate too. Haven't seen convincing improvements over the relational model, and arguably there are none: the concept of list + sort/filter/aggregate is fundamental.

The use-case where graph dbs might have an edge is when querying recursive relations, but recursive query constructs are conceivable in a table-first approach as well. Possibly also creating / querying relations (tables) on-the-fly without the 'CREATE TABLE ... / JOIN ON ...' ceremony is handy

I wouldn't look at SQL in particular for a modern instantiation of a relational algebra query language. Modern relational algebra libraries like dplyr in R ecosysyem are better.

I have not found yet uses for graph dbs. But I am curious to see why people find graph dbs "easier", preferably with concrete examples.

[1] Arguably 'primary key' terminology is confusing, as it may denote either an ID column (node) or a multi-dimensional FK tuple (hyperedge), so I prefer to avoid using it.

This might be my limited knowledge but I've always thought of relational systems as a kind of graph. The foreign keys denote edges connecting the table/nodes. At least that's how I think of them, this might be complete non-sense though.

I have not the exact but a similar sense especially when we go higher normal forms.

I’d happy if someone could elaborate more one that, but to me it seems like if we do the higher normal form things eventually we get to a place very similar to tables being graph node/edge types and the table rows being the insances of nodes/edges. Even the join table in lower normal forms is clearly like an edge type, but ofc with tables I think it sucks more, because when you join tables it costs a lot, but in case of graphs the information is local to the graph node (table row) and there is no complexity boom issue.

Also, if we’d take it really far we could say every SQL Table shall only have 1 row meaning one node/object. I don’t think it’s partical, just a though lt experiment, wondering about its meaning.

It would be slightly more straightforward to interpret a relational system as a hypergraph as a relation can typically be between more than 2 things. It does seem like you can more or less interpret most of these as a directed graph, though compound keys might make things somewhat tricky.

A relation can be a graph, but a graph need not be relational from my understanding. It has something to do with the higher mathematical what'sit's I haven't quite read up enough on but it's on my to-do list.

[not a theorist]

I think the relational _schema_ is a graph, but not the incremental data stored within. The nodes are entity types, not instances, and thus recursion is irrelevant.

So, kinda? But a lot of categorical differences.

If you want to hear my take on it after intensively working with dozens of different databases in real-world projects over the past ten years:

It is because the additional flexibility of graph databases creates the responsibility to manage the explosion of edge cases, while constraints add safety, documentation of intent and reliability.

And while special functions for graph traversal can seem elegant and nifty, I have not come across more than maybe a handful of cases ever that could not have been solved by a recursive CTE or equivalent, slighter complex query in relational databases.

Reason #1: Relational DBs got a simple query language, which boosted their popularity in 1980s. MS Access made it a funny game, and you could get it going in under an hour.

I had experience with Wikidata, and it was way harder. The logic of the language is impossible to figure out by reading the examples. (My request was simple like "get region X, find all nested entities and their population".) I'm not familiar with other query languages to Graph DBs, but if this is state of the art, I'd avoid it.

Reason #2: Graphs are not easier to reason about. When you have all properties as optional, and every connection (edge) gets a set of properties too, things become much harder.

"John is Frank's father" in that database would be a very complex structure: entity A which is John with properties such as "belonging to a class of" humanity (which is another entity), human own first name, last name, and finally a "child" property, linking to Frank, who's got same set of attributes.

That's hard to put in your head. With relational DB for this example, you have to imagine just a 4*2 table (3 if we count the header).

I think these 2 factors make development with graph DBs much harder, and developers resort to them only when absolutely necessary, and keeping only a single type of relations in graph (like road graph, or parent-child relationship).

You don't use properties to point to other entities/nodes in a property graph. That is what edges are for. So you could have an edge of type father, or maybe a "parent" type edge and the properties on that edge denote further details like whether it is biological or legal, et cetera.

You're right, I don't remember the terminology well. And yes, I remember edges having these properties as well.

This made me think that graphs are essentially more complex.

With a simple relational DB what you have essentially is a table/mapping/dictionary-like structure. (Even when storage is columnar, the API is still behaving like a table/map/dict.)

With a graph, if we only keep the info about edges, it also becomes a key=>value map, where key is node A & B list/set, and values are edge properties.

But as soon as we try storing info about nodes, this requires a second table.

Wikidata has to make yet another table, with types of edges and nodes.

I think they're the default because the relational model is either the most, or perhaps even the only logical data model with a solid theoretical foundation that makes it suitable to solve a wide range of database problems.

While there are some genuine problems that require extending or adjusting the model (e.g., temporal databases), all the problems people normally complain about relational databases are implementation, not model problems.

A common misunderstanding I find is that people think the relational model does not scale, but that makes no sense. The relational model is a logical model meant to, among other things, provide independence of the logical representation of data from its physical storage. A loose but hopefully useful analogy would be that the relational model is like arithmetic, while any given database product is like a calculator. At some point, you'll hit the limits of a calculator and will get an error as a result to an operation. That won't cause you to say "arithmetic does not scale!". Instead, you'll probably try a different calculator, or perhaps even end up implementing a new piece of software to let you handle that particular computation.

It is similar with the relational model. You may hit the limits of a specific implementation (can't scale beyond X cores, can't transparently partition data across multiple nodes, etc.), but that's an implementation limitation. The relational model has nothing to say about hardware, so those are not limitations in the model.

I think it's good pragmatic engineering to use the best tool for the job, and that includes using non-relational databases (nitpick: I don't think any of the mainstream databases considered relational are a good representation of the model. SQL, in particular, is very bad at the job. But we're probably stuck with calling them "relational databases") when they're more suitable for the task at hand. However, the important consideration is that the relational model itself can be used as a foundation to represent whatever data you need to represent, including graphs. It's just that there may be no suitable implementation for your needs right now.

One last thing, when you say Relations are like graph edges or foreign keys, I wonder if perhaps you're misunderstanding the relational model? The 'relational' in it is not about linking from one entity to another. It's about a relation from a set of attributes and types, to a set of values for those. In SQL-parlance: you can have a relational database with a single table, no need for foreign keys (even to itself) for it to be relational.

> However, the important consideration is that the relational model itself can be used as a foundation to represent whatever data you need to represent, including graphs

Yes and no. There are plenty of data-modelling scenarios that, while theoretically possible, are utterly impractical, when using Codd's model[1]: For example, neither of Codd's then-descriptions of relational-algebra and relational-calculus can be used to express a recursive query through a graph relation[2] - and it took until 1999 for ISO SQL to add support for CTEs to facilitate recursive-queries, before that the only way to recursively traverse a graph in an off-the-shelf table-based RDMBS was to either use a CURSOR or dump the entire table into memory.

[1] Only 5th Normal Form of course: I'm in the school-of-thought that 6th Normal Form is a regressive step - or a poor man's substitute for sparse-table support.

[2]Using Codd's definition of "relation" to mean "table" in SQL. I wish he had chosen a better name than that because 90% of the people working with databases without the benefit of a formal database-theory education all think "relation" refers specifically to a SQL foreign-key constraint - which is a very different thing.

Yes, I agree with all you say, especially the choice of ‘relation’ as the name.

The fact is[0] that even though Codd’s contributions effectively created the field, they were also vague in some key aspects.

[0]: that may come across as arrogant. It should be expanded to “in the humble opinion of this informed practitioner, it seems the fact is…”

There are also other types of data that don't lend themselves well to Codd's model either, such as structured hierarchical documents, especially those containing irregular data[1]. Whilst you can model a hierarchical document tree (like those) in a relational database, it isn't pretty and is impractical to query due to the insane amount of joins required to reconstruct the document (i.e. you'd need to join on every table that represents every type of element - and if you don't know exactly what tables/element-types are used in a document then I don't think there's any workable solution for that?).

Then consider non-hierarchical markup documents ( https://en.wikipedia.org/wiki/Overlapping_markup ) which has elements that span two or more arbitrary points in a document. Fortunately I've never encountered this problem yet so I've never had to put much thought into it.

[1]e.g. think of XML, JSON, and YAML files used for web-application configuration or a SOAP request message (but not those XML/JSON files being abused as containers for tabular data).

I think this is a reason NoSQL became so widespread. If we stored every piece of data relationally, databases would have an insane number of tables, resulting in huge numbers of joins, and coupled with the difficulty in running migrations, would become unmaintainable.

Every data model is shoehorned into the relational model via normalization which involves a ton of judgement calls about how the data will be used at the time the schema is defined. And then this becomes extremely difficult to change.

It’s a big trade-off between schema flexibility and performance.

I’ve always felt we would be better off querying at a higher level (say the logical db schema) and then the physical db schema is managed for us behind the scenes. But then there is also a further level up at arguably the purest representation of data as triplets, which is why Datalog is called out so much.

> I’ve always felt we would be better off querying at a higher level (say the logical db schema) and then the physical db schema is managed for us behind the scenes

But that's *exactly* what an RDBMS does: it handles the low-level record data-structures on-disk and presents to you a relatively cushy set of tables that you can design how you like them.

...otherwise please clarify what you mean by "logical db schema" vs. "physical db schema"?

Your talk of there being a "physical" design harkens back to those myriad "4GL" database platforms but those are all completely obsolete (despite what Progress's sales people will tell you). Y'see, database application developers haven't needed to concern themselves with physical records on-disk since dBase stopped being cool around 1990 (and the poor sods still using 4GL systems through the 1990s and even into the 2000s failed to see the industry shifts towards open databases (hence the "Open" in ODBC) so I don't feel too sympathetic for their cause.


> I think this is a reason NoSQL became so widespread.

I attribute it to too many CS/SE grads who didn't take database-theory electives and so never got to properly grok SQL (never mind how SQL is horribly, horribly mis-designed!) but they got hired by cool startups but for 90% of their use-cases the notion of a database behaving like memcachd is very, very enticing.

You may have noticed that the hype around "traditional" NoSQL engines (like Redis, MongoDB, CouchDB, etc) started receding probably about 3-4 years ago (and PostgreSQL is cool again!) after all those companies that built their entire business around a NoSQL database suddenly find themselves being mature businesses that now need to hire normal business-people who need to get that business data into Excel, or an OLAP warehouse, or otherwise get real-time views of data in a tabular form - those are all things that will make upper-management decide to hire people to transition the company off NoSQL and back to RDBMS.

This isn't anything new: a couple of decades ago there was (and still is) a popular in-process key-value (i.e. NoSQL) database called Berkeley DB[1] that had plenty of use-cases and applications as a more flexible alternative to some stock dBase library - or as the years went by: Microsoft JET (aka Access MDB). Though in this case, the funny thing is that dBase is also "NoSQL" because application programmers have to query data by manually iterating over fixed-size records at known offsets in a one-file-per-table system - it's really primitive.

[1]Now also owned by Oracle, natch: https://en.wikipedia.org/wiki/Berkeley_DB

> ...otherwise please clarify what you mean by "logical db schema" vs. "physical db schema"?

Logical schema is an ER diagram, physical schema is how that is translated to a DB-specific DDL, not to be confused with the physical data storage layer.

Think about all the different ways you can model a list in SQL. E.g. jsonb, array, hstore, 1-M FKs, M-M join table, and all the different levels of normalization.

> I attribute it to too many CS/SE grads who didn't take database-theory electives

Completely agree! But it also makes you realize how the relational model is a poor fit for a lot of data models too - which again many are unaware of. It's not perfect, but its a nice sweet spot for now.

I think we are at the other end of another pendulum swing of databases though. But it's hard to innovate when the prerequisite for success of any new database remains as SQL compatibility.

I’ve got an idea: “BQL+BDL” (“Better Query Language” + “Better Data Definition Language”) which would basically be a superset of SQL, but better. Like TypeScript is to JavaScript. And implement it as a schema-aware translation layer between the user and RDBMS-specific SQL dialects.

(Yes, this reeks of https://xkcd.com/927/ - but the aim isn’t to abstract-away different RDBMS’s dialect’s features, but instead to massively improve QoL)

Ping me if you’re interested! My email address is “[my-first-name]@[my-first-name].me” (and my first name is “Dai”)

EdgeDB and EdgeQL are doing something in this direction. SQL with GraphQL for joining solves some problems. Built on top of PSQL.

I think PSQL can only take you so far though, I think we will move to something new eventually.

For most apps you want to watch a query for changes. SQL is a pain for that. If you can get inside the query tree though, you can do some caching and materialization to help. And I want to run my db in my browser too.

The big thing that graph dbs provide is transitive traversals of join relationships.

The problem with graph dbs is trying to return something that is not a graph. Like a count. Or derived information. And which graph model do you use? There’s more than one. Lots of information is very poorly modeled in graph dbs. Temporal organization, for example.

Ultimately, graphs are a way to use relations. But relations allow you much more flexibility to associate information (subject to the issue of transitive relationship traversal).

Mixed graph-relational is perfectly reasonable. Reasonable start here: [https://age.apache.org/]

Out of curiosity, why do you say that no mainstream examples are a good representation of the relational model?

Mostly because of SQL. Among other problems, it’s not set-oriented: there are duplicates in SQL, but neither in relational calculus, nor in algebra. Column order has meaning in SQL but not in the relational model (another consequence of SQL not being based on sets). Another deviation is that SQL does not support column-less tables. On a rich-enough database (e.g. PostgreSQL) you can implement them yourself[0], but then that’s yet another incompatible across databases (one may think support for the empty table, which can only be one, is only of theoretical interest, but considering the model is based on set theory, it’s like not having a standard representation for the empty set).

[0]: https://blog.jooq.org/creating-tables-dum-and-dee-in-postgre...

I think the language the developer writes their queries in is most important. It is very easy and logical to express what you want in SQL. Say we want to show a list of cities with more than a million people. Easy:

    SELECT name FROM cities WHERE population>1000000
GraphQL queries on the other hand always look like gibberish to me.

Good thing that GraphQL traditionally (despite it's name) isn't really the query language of graph databases.

Same query in Cypher (the language used by e.g. Neo4J, RedisGraph):

    MATCH (n:Country) WHERE n.population > 1000000 RETURN n.name

How a graph DB index the data to support this query?

Usually there are indices on the properties of specific labels (e.g. for the .population of :Country), in graph databases with labeled nodes (most of them), so in general pretty much the same as in a relational database.

Is that the shortest way to write it in Cypher?

SQL needs 48 chars:

    SELECT name FROM cities WHERE population>1000000
Cypher needs 57 chars:

    MATCH (n:cities) WHERE n.population>1000000 RETURN n.name
I would be very hesitant to use a language that needs 19% more code to express the same thing.

This example doesn't demonstrate any of the power of Cypher, the whole point is matching a pattern than corresponds to a traversal, which as it happens is much much more elegant than recursive SQL

Would be nice though to have an example that does demonstrate the power of Cypher.

SELECT p.ProductName FROM Product AS p JOIN ProductCategory pc ON (p.CategoryID = pc.CategoryID AND pc.CategoryName = "Dairy Products")

JOIN ProductCategory pc1 ON (p.CategoryID = pc1.CategoryID) JOIN ProductCategory pc2 ON (pc1.ParentID = pc2.CategoryID AND pc2.CategoryName = "Dairy Products")

JOIN ProductCategory pc3 ON (p.CategoryID = pc3.CategoryID) JOIN ProductCategory pc4 ON (pc3.ParentID = pc4.CategoryID) JOIN ProductCategory pc5 ON (pc4.ParentID = pc5.CategoryID AND pc5.CategoryName = "Dairy Products");


MATCH (p:Product)-[:CATEGORY]->(l:ProductCategory)-[:PARENT*0..]->(:ProductCategory {name:"Dairy Products"}) RETURN p.name

So basically allow 'named joins' in SQL and we are done.

GraphQL has absolutely nothing to do with graph databases nor SQL databases. It’s a poorly named schema for making RPC calls and designating the response structure.

Exactly. GraphQL is merely REST 2.0. That's all it is. Whether it's even an improvement over REST 1.0 is debatable.

Tape. SQL databases emerged when data was stored on tape.

join table1, table2 where table1.id = table2.customer_id

type operations would have a tape for table1 in one drive, and a tape for table2 in the other drive. Things like fixed length records emerged to make it possible to fast forward the tape a specific number of inches to the point where the next record would begin, facilitating non-linear access.

Once that model was completely baked into the tooling, it didn't go away when the data moved to HDs then SSDs. The paradigms have outlived the hardware.

It's a bit like the save icon still being a floppy disk.

That's wrong. Since its inception, on of the main points of the relational mode is physical data independence.

No assumptions on how or where the data is physically stored is made by the model. Besides, it seems more plausible to me that Codd would be working on machines with disks, not tapes, by the time he proposed the model. See this, for example: https://www.theregister.com/2013/11/20/ibm_system_r_making_r...

So my understanding is relational databases were born on disks, not on tapes, but that distinction doesn't really matter to it. Additionally, it is the physical independence of the model that let all implementations based on it adapt when SSDs (and newer) storage systems arrived.

Finally, saying "the paradigms have outlived the hardware" doesn't make sense in this context. To repeat: the relational model was proposed precisely to isolate logical database design from hardware details. As a paradigm, it has been independent of hardware from conception, so of course it would outlive any specific hardware incarnation.

I’m no historian, but indexed disk-based structures - upon which relational databases were built – have been around a very long time. For example, IBM brought out ISAM in the mid 1960’s, roughly contemporary with the publication of the relational model. I could be wrong, but it doesn’t seem likely to me somehow that that those guys had tape foremost in mind.

Data locality still mattered in mechanical disks as it avoided seeking, and data locality still matters in SSD as a read brings in a whole block instead regardless of how much you need.

Tape was before ISAM iirc - things like merge-sort were big then, and the spinning tapes you'd see in those old black and white movies from the 60's. ISAM needed disks, and so did RDB's

do you have a reference for this?

There is huge value in graph theory that RDBs are almost completely incapable of putting to work. I think it's partly a historical accident that RDBs arrived when they did and I don't think graph theory was well enough understood by the computer scientists of the day. Relational theory and SQL were a better fit for the languages, operating systems, and big money problems of the day, which were mostly about business operations – accounting, inventory management, manufacturing – and to a lesser degree science and mathematics.

Today we have very different languages (while COBOL and FORTRAN still exist they haven't influence modern languages much), operating systems built on different ideas, and, perhaps most importantly, networks.

I've been waiting most of my career for something to take over from RDBs and SQL, something that supports the ideas of graph theory as well as strong typing, composability, and so forth. And no, GraphQL is most definitely not it.

Think about the the practical value of queries. Queries are the main point of DBs. Now think about most applications. Are graph queries useful for CRUD? .... no. So that is why.

relational is good for lookups and adjacency queries, and thats the main query semantic in CRUD


Hmm, why aren't graph queries useful for CRUD? I don't get it.

C — create a new node (possibly with some edges).

R — get one or a list of nodes fitting some criteria (including being connected to other nodes).

U — update a given node and possible its edges, and also possible some adjacent nodes.

D — drop the node and its edges.

Which of these is conceptually more difficult with graph DBs?

All of them? It's a lot more to think about than in a relational DB where you only have to reckon with a 'node', not any of its connections. (or, if you have some tables related by key, just processing everything under that key.)

I assume "CRUD" here means "boring business use" such as "customers and their order history" or "customers who brought a car in for servicing" or "customers and their cable TV subscriptions". The customers aren't connected to each other in any social/graph way, the database just stores them as a list while employees CRUD one customer's details at a time. What benefit from a graph DB there?

CRUD = create, read, update, delete

Yes, that's what Kirilrogovoy said that I was replying to. Yo can can do those things in a graph database, so I'm speculating that the original commentor didn't mean "CRUD" as literally "create read update delete" but in the sense of "most uses of databases are unexciting". Crud doubles as a lightly rude word, covering low-end drudge work. "another cruddy web database today, I'm sick of this".

When someone needs to store today's office visitors, they don't need to know who shared the visitor log on Facebook and how many people liked it, or how many visitors removed from Kevin Bacon they all are, it just needs a list with a bit more structure than CSV, and that kind of use dominates what people do with databases.

Let me ask my first architect's favorite question:

What does <a graph database> add that I can't get out <the relational one> I already use, and does <that thing> matter in the context I'm building in enough to justify the cost of migration/maintenance?

In a surprising number of contexts, the answer is nothing, in which case you stop, or something, and no, which is also a stop.

The Architect's point of view should take into account the relative availability of talent and cost of materials to get the job done. This is part of what makes them the Aspirant Developer's mortal enemy.

This way of formulating a question though, despite it's frustrating effectiveness at short-circuiting new-shiny for new-shiny sake, also excells for getting juniors on the road to becoming technical experts.

Sorry, just thought I was being helpful.

the delta between a graph DB and a relational DB is graph can formulate queries such as X is connected to Y, and stuff that is some kind of walk over the graph. None of that is relevant for CRUD apps.

Releational can model a graph but can only answer queries like X is an immediate neighbour of Y. It lacks the walks (SQL recursive is kinda there but not really).

The primary purpose of a DB is to serve queries. But graph DB's queries are not so useful for most "lookup" type workloads that dominate business. So a graph DB is overkill, and when you get into the nitty gritty, cannot do many performance optimizations as a result. So a graph DB is a worse relational DB. But immediate relations are what is interesting to business.

The relational model includes the relational algebra.

The relational algebra means that the order of operations do not change the results.

That's a huge advantage for query optimization. Relational database systems maintain metadata that allows pruning poor orders of operations.

The other advantage of the algebra's lack of implicit ordering is that arbitrary orderings can be added on top of it, e.g. multi-version concurrency control without changing the underlying algebraic logic.

This is a great topic. I have been working on TAO for many years. I am not very familiar with other graph databases; I assume fundamentally they are more or less the same. Here are some differences between RDB and a graph db IMHO, 1. sharding and transaction boundary 2. Secondary index support 3. How “join” works (eg. give me a list of my friends who follows Justin Bieber)

You’re right that graph db is very easy to use a lot of the times.

TAO uses MySQL doesn't it? Would that make it more of a service built on top of MySQL, rather than a unique graph database?

It does, which one can argue it’s an implementation detail. The differences I mentioned above eg sharding, transaction boundary, secondary indices, etc. should be generally applicable to a non-relational db and not unique to TAO. I could be wrong though; as I know little about other graph dbs.

It would probably be more productive to compare RDB and Graph with certain workload examples (OLTP, OLAP, joins, scans, etc.).

> Are RDBs the default way mostly due to historical reasons?

It sure helps, because you got literally generations of developers trained on how to best utilise them.

Additionally, you can use ORMs, GraphQL adaptors or similar abstractions to build graphs on top of relational storage, and keep your existing infrastructure, which makes hybrid setups a lot more attractive than graph-first ones for non-startup environments.

Relational databases are very significantly graph-like in structure, after all, they're filled with foreign keys when decomposed, and foreign keys build graphs, hence why ER diagrams aren't tables. The main issue is that the language used isn't intended for selecting pieces of your data layout in their original structure, it's intended to use that original structure to return table-like projection results that can be expressed through math a la abstracted matrix operations essentially (relational calculus, etc.).

The reason "graphs" seems easier is because people tend to ignore the data models for intuiting relationships, like "An Arm has a Hand," which SQL can handle pretty well, but can't return in a nested way. In some respects, this suggests SQL is missing a new syntax layer more than anything. That said, SQL, the back-end model, does fall short when modeling non-directed graphs IMO, though I'm not sure it's easy to express non-directed graph queries in GraphQL either. (And I don't have the experience with Graph DBs to know too much about what they offer.)

That said, I don't think it should take two hours to explain many to many tables. In graph terms, they're an expression of the requirement that you need a record to indicate every pairwise combination (edge) that exists. So to me that implies the mental version of "Graph" M2M was probably incomplete, as M2M is a very graph-centric concept.

Generally: because complexity never goes away, it just finds a different spot to be solved in. The RDBMS is mostly 'just there' from the perspective of the average programmer, and solving a problem related to databases is more of an ORM library issue than a programmer's issue. Let's not forget that most software programming tends to be various CRUD incarnations and UI fluff, not complex problem solving and scientific engineering. Due to the massive scale and volume of this type of work, the available databases and mindshare is just not there for most other types to succeed.

A difference can be found in mostly-frontend data storage and retrieval applications were storing information is actually abstracted into a SaaS RESTful API or GraphQL API and you never get to talk to the underlying database. Another one is where a library or framework requires something like ElasticSearch or MongoDB to work. This isn't because they are inherently 'better', but because that's just what the README.md in the repo happened to say when a developer came along to make use of the framework or library that fulfilled some generic functionality.

If you think about it, most semi-complex implementation details in general software you encounter turn out to be low-quality re-implementations of state machines, graphs, hashmaps and printf. A lot of the world runs on this type of stuff and makes a lot of money from it. (doesn't mean it's high quality software or that sound engineering choices were made - sometimes it's just standard components, availability of people and technology and a time-money tradeoff)

I think that you're correct in your assessment of relational vs graph-like structures: it's closer to our data domains we model and think of, more flexible, etc. We may be seeing something similar in the ML world where things are moving from tabular-dominant data to being able to process graphs more natively. A table is just a very structured graph after all!

SQL is the standard because, as others have pointed out, it's so entrenched and also builds upon a solid theoretical foundation. And given its dominance, it has been optimised and performed extremely well until recently, where data complexity is catching up again.

Recent noSQL databases won't take over SQL because of the lack of schema/typing. They do scale nicely, but aren't as constrainable as SQL, which is a feature (compare building a large software in Python vs Rust or Java) that enforces safety and good abstractions. There are some newer DBs which are combining strict schemas with NoSQL, which is promising!

Disclaimer: I work on TypeDB (vaticle.com/typedb) which is a native ERA (entity-relation-attribute) model with strict typing via the schema.

I don't mean to get off topic, but I'm surprised the intern - presuming they're are a CS major or similar - didn't already know both. These tools are simply a reflection of data structures. Are such things no longer part of 100 or 200 level CS?

I was wondering the same, I did a couple of 100 database courses and some 200 and 300 courses. Just had a look at my old alma mater and seems you can get an IT degree now without doing any DB course - very sad.

I really believe this has to do with reliability and performance. Intuitively speaking I'd agree with you that graphs are more higher-level concepts.

Also performance.

You may be looking at highly scaled distributed databases with thousands of concurrent users.

What about the base case of one large dataset being scanned linearly or by complex joins. In its basic form an RDBMS table is like an array of structs, other types of DBs are like lists.

You wont see much performance difference in normal applications, but when there is lots of work to be done RDBMS will shine.

Side question: is there a good rule of thumb for when an application can start benefitting from using a database? I'm working on a project whose internal data structures are starting to resemble a database, and I've been considering moving from just using language-native data structures to an in-memory sqlite instance, but I'm not sure what the tradeoffs are.

On the most basic level databases solve the ACID problem, namely that you typically want data operations to be:



Isolated transactions


A database may also provided better options for querying, though you'd have to use a pretty odd data model if that's a true bottleneck.

Reasons to use a database:

* to save data * to query data with SQL

Also note that hierarchical databases saw a revival in the 90s as web was taking over and web data was mostly hierarchical. A few groups started works on building hierarchical database systems.

That didn't go anywhere. Review a few years later show all those buggy, crashed frequently and worst -- slower than just storing the hierarchical data into an RDBMS.

Relational databases proposed unique combination of an easily optimizable abstract model, and memory efficient implementation that won the day.

Since the optimizable abstraction becomes more and more popular, it will dominate until we find another model that enjoy both high-level abstraction and ease of optimization.

RDB's and the theory behind them were simply elucidated earlier in Codd's paper laying out the fundamentals of relational algebra. I still haven't found "the Ur-Paper" for graph based DBMS's.

They are good enough for general purpose databases. That's very valuable when you're in early stages of development and didn't yet consolidate your business logic. Good enough is good enough.

Tangential question: does anyone have experience using a triplestore database in production? Or is using RDF in any way not related to marking up web pages for SEO?

Because until relatively recently, storage was the most expensive architectural component and relational databases optimized for storage.

How do NoSQL / graph databases support foreign keys or joins? Do they have the concept of referential integrity?

The point of a DBMS is to do this for you, via the SQL table declarations. AIUI, NoSQL / graph databases cannot do this at the system level and needs to be done by an application-level framework.

Some (many?) SQL shops turn off some of the referential integrity checks on their production transactional workloads anyway, making that an application responsibility.

I have not heard of any place doing this for many years.

Thank you, I agree!

I don't think you can enforce a constraint on which nodes are supposed to have which edges, etc.

I'm curious: is there a reason why you're putting NoSQL and graph databases into the same bucket, other than them both having the aforementioned limitation?

There are many different kinds of NoSQL db's. Graph db's are just one type of NoSQL db.

(disclaimer: I am VP of Engineering at RelationalAI where we are building a graph database that uses the relational model)

Thanks, this is a great question with many technical, social, and commercial aspects to it.

TLDR: the relational model has a super power for data management systems: it decouples the logical from the physical representation and will eventually always win. There are technical reasons why it was hard until recently to build a graph database based on the relational model.

Database were not always relational: In the 1960s databases actually had a navigational paradigm and used a hierarchical or network data model (not unlike some current graph databases).

The 1970s saw the rise of relational database management systems with early proofpoints of Ingres and System R. The important improvement here was that the physical organization of data is separated from its logical organization in relations. This is the super power of the relational model. This innovation led to an explosion of commercial activity with Oracle, DB2, Sybase (licensed to become Microsoft SQL Server) and some more. Many of these are now still industry giants.

The 90s was a big hype of objected-oriented programming and some got the idea that database management systems should be following the object-oriented model as well. This was mostly a catastrophic failure and instead systems based on the relational model kept improving and won in the end.

In the 2000s there was a large emphasis on scalability to large numbers of users and data, and the development of NoSQL systems started. Most of these did not follow the relational model. These are the key value stores, document databases etc. Key value stores addressed the problem of poor scalability but compromised on the data model and transactions. Document databases have better locality of data and made schema changes easier. They all had something in common: compromise on the relational model to gain an advantage over existing relational systems. However, systems based on the relational model kept improving the meantime and have gradually started to gain market (or mind) share again (eg Aurora, Snowflake, Spanner, CockroachDB).

In my opinion, graph databases are next. Graph databases identified a weakness, which in this case is modelling and the inability of current relational systems to handle graph structured data well. Graph data involves many joins and often recursive computations, which current commercial SQL relational systems do not do well. However, the new graph databases are not superior universally. For example, take TPC-H (OLAP - analytical) or TPC-C (OLTP - transactional), put that on a graph database and you'll typically see pretty terrible performance even though the data can easily be modeled as a graph. Several popular graph database systems do not even scale well beyond a single node.

I think you are absolutely right that graph data models are easier to work with. Starting from an ER diagram (or similar) it's not straightforward to go to tables. Assuming that your ER model is a good model, you're grouping stuff into tables based on functional dependencies. Tables here are a collection of relations, eg for an _order_, the SQL table might include relations to the customer, order date, etc. It does not include relations to products included in the order, because these have a different primary key. This is difficult for users to understand.

Predictably, the relational model is catching up though with graph database systems. A major research innovation in databases from recent years are better join algorithms, specifically for joins involving many relations, self-joins and skewed data. They're called worst-case optimal join algorithms (WCOJ) and several early prototype systems have shown promising results with these.

Based on these ideas, RelationalAI ( https://docs.relational.ai/ , https://twitter.com/RelationalAI ) is building a graph database management system based on the relational model. Presumably, this is the relational model utilizing its super power again and demonstrating that relational models will always win and innovate to incorporate legit limitations of previous systems.

Hey Martin,

I'm a huge fan of LogicBlox and RelationalAI, and the research done at both had a huge impact on me as an undergrad!

While you're here, could you provide some insight on why you folks went with arbitrary width tuples for your datamodel? Limiting youself to triples seems to drastically ease the implementation of WCOJ algorithms without much loss of expressivity, and provides compatibility with existing RDF research.

E.g. in our system we materialize the 3!(6) indices required, which allows us to pick variable/intersection orderings on the fly without a planner.

Thanks, nice to hear!

With triples you need to reify hyper edges into nodes (a relation is just a hyper edge). This can lead to pretty awkward models, for example for temporal data. Most graph databases (eg Neo4J) have the same limitation and as a result struggle with temporal data or other higher-dimensional data.

We prefer to support general relations (hyper edges) and face the complications of query planning. We really want a general relational system that decouples the complications in query processing from the way you need to model your data and write your queries

This gets particularly important for what we do besides graphs: we also aim to support machine learning workloads (we have some references to papers on our website).

We still do automatic index selection, but indeed it is important that you do not create too many of these unnecessarily if you relations with arity bigger than triples.

Thanks for the insight! Having the best and brightest of the west coast database departments at your disposal allows for some different tradeoffs I guess ^^'!

I'm actually surprised that RelationalAI and Materialize Inc haven't merged yet, for ultimate world domination.

It's a shame though that both LogicBlox and RelationalAI historically only target large enterprise customers.

It forces us commoners that want to get out of the tarpit too to bolt relational operators to our existing languages. :D

Stay tuned for that :). RelationalAI is developed as a multi-tenant service that has a pay-as-you-go model. This should make it possible for everybody to use the product. I expect we will have some free tier options too, but we have not decided on the exact model for that yet. We are starting to onboard individual preview users already.

We're good friends with Materialize.io and use similar techniques for general incremental view maintenance, but their product is quite a bit different: their purpose is to design for SQL. RAI offers a different programming model to capture complex application and business logic (the language for that is called Rel, and docs are public).

Can't wait!

Looking at RelationalAI documentation, it's fairly clear that you designed a more powerful and "cleaner" query language than SQL, good for graph-oriented tasks and without many of the traditional limitations and compromises found in typical databases.

But this is the front end. In the back end that writes files, elaborates query execution plans and executes concurrent transactions, what price are you paying on the performance side? What are the unavoidable costs of graph queries?

This is really hard to answer in a short comment, but I'll try to highlight a few things:

- We use compute/storage separation as in Snowflake and Aurora. This means that data is stored in object storage (for example S3) and computing resources are transient. This is the de facto standard now for new database management systems. Systems that are not designed for object storage are gradually going to be less relevant. The Snowflake papers are a great resource on this ( http://pages.cs.wisc.edu/~yxy/cs839-s20/papers/snowflake.pdf ) . It's also a recurring topic in the CMU database talks. Our memory management is based on LeanStore and Umbra.

- The database is entirely versioned with an immutable data structure, so read-only scaling does not require locking: you can provision new compute instances that work on a snapshot of the database that is 100% guaranteed to be consistent. This means that you can run an analytical job on your production database with a 100% guarantee that the performance of your primary system is not impacted. For concurrent writes we aim to use incremental maintenance techniques.

- All our data is indexed so that the WCOJ algorithms can do their job. Indexes maintenance can be expensive, so we have write-optimized data structures for that (B-epsilon trees - http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf ). The performance of those is really good, and they can be compacted in the background. The combination of our graph normal form (narrow relations) and WCOJ addresses the index selection problem in a novel way.

- We use new compiler architecture ideas to make everything live and demand-driven. Our framework for this is Salsa, which is open source ( https://www.youtube.com/watch?v=0uzrH2Ee494 and https://github.com/RelationalAI-oss/Salsa.jl ).

- WCOJ are our work horse join algorithm. We have different query evaluation techniques, including JIT compilation and vectorization. For queries that do not require worst-case optimal joins the plans are similar to more conventional systems due to optimizations applied to the plan (talk: https://www.youtube.com/watch?v=C_mBEq_o4HE )

Graph workloads have a few interesting properties: 1) Analytical queries are often recursive. Systems like Neo4J do not really address this in general because the query language does not support general recursion. Instead, it offers bindings to algorithms implemented in Java. 2) Queries often do joins across many relations. 3) There is lots of skew.

(2) and (3) are handled by WCOJ. For (1) we use algorithms similar differential dataflow. Differential Dataflow already has great proof points for graph analytical queries (mainly created by Frank McSherry)

In the end, of course all that matters are the actual benchmarks. We are developed those for standard relational workloads (like TPC-H) and graph workloads (for example LDBC SNB, graph analytical workloads). It's a little too early for us to claim results at scale, but we have isolated proof points that the design works.

Graphistry gets used for all sorts of graphdb + non-graphdb data projects, which has led me to think of it sociotechnically on 2 dimensions:

- Approachable semantics. Graph-y queries like untyped entity 360 views are way easier to write in graph langs. This matters for programmer adoption. But tabular systems can be simpler to implement for doing great on all sorts of other things, so other tricky tasks on SQL have gotten industry weight over decades, while the market rewarded graph query langs more for doing great on graph tasks... and just passable on non-graph.

- Diverging performance for sweet spots. Graph DBs get optimized for graph queries that SQL DBs struggle at, which matters for big enterprise contracts => most revenue. There are competing graph workloads to optimize for -- handling heavy concurrent small read/write queries, big whole graph compute, fancy iterated symbolic reasoning, horizontal scaling, GPU acceleration -- and while in theory you can combine most, reality is even with VC money someone like TigerGraph gets great scaling but not say GPU speed/efficiency of cuGraph, and neo4j wins at usability. SQL engines will have better specializations for HA, GIS, time series, etc. In theory those can be special indexes in graph DBs (I think neo4j uses lucene for text indexes?), and vice versa within SQL. But practically, it takes a lot of time and $$$, so compounding engineering years fueled by enterprise $ is playing out.

So in both cases, we see SQL taking compounding benefits for general purpose use, while the niche graph space goes deeper in its sweet spot. Neo4j and maybe TigerGraph might be 5-10 years away from being great general databases. With the rise of open source for specialized index types and ecosystem adapters, this is interesting to consider!

So does SQL eat graph before graph eats SQL, or more timely, will the graph DB niche grow or shrink? IMO it is a quite interesting time in this space.

In particular, the increase in data scales, rise of NLP knowledge graphs, and need to fuel AI/automation systems with things like 360-views of entities has grown the modern need for graph DBs. Still a small % of general DB use cases, but great wrt current enterprise/tech DB market size (1-10B?) and probably 2-3X that in 5 years. Separately, the new ubiquity of large & high-fidelity operational data (clicks, logs, transactions, sensor reads, ..) have led to a corresponding ubiquity of behavior and relationship questions. Graph intelligence includes working directly from log and SQL DB systems, which are 100X-1000X bigger than the graph db market. That has been fueling our corner of graph intelligence (viz, BI+automation, AI) space.

Anyways, this adds up to Graphistry and basically all of our graph partners are hiring (we do intl remote!), and I encourage folks looking for their next thing to check all of us out :)

Could it be similar to imperative vs functional languages? Imperative style being closer to the hardware instructions and having the first-mover advantage in gaining popularity, while functional style being more abstract, but trailing behind the imperative languages in adoption.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact