I designed and maintain several graph benchmarks in the Linked Data Benchmark Council, including workloads aimed at databases [1]. We place no restrictions on implementations; they can use any query language such as Cypher, SQL, etc.
In our last benchmark aimed at analytical systems [2], we found that SQL queries using WITH RECURSIVE can work for expressing reachability and even weighted shortest path queries. However, formulating an efficient algorithm yields very complex SQL queries [3], and their execution requires a system with a sophisticated optimizer such as Umbra, developed at TU Munich [4]. Industry SQL systems are not yet at this level, but they may attain it sometime in the future.
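For a flavor of what this looks like, here is a minimal reachability sketch in plain SQL, assuming a hypothetical person_knows_person(src, dst) edge table (the actual benchmark queries in [3] are far more involved):

WITH RECURSIVE reachable (person) AS (
    SELECT dst FROM person_knows_person WHERE src = 42
  UNION
    SELECT e.dst
    FROM reachable r
    JOIN person_knows_person e ON e.src = r.person
)
SELECT * FROM reachable;

Using UNION (rather than UNION ALL) deduplicates rows, so the query terminates even if the knows graph contains cycles.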
Another direction to include graph queries in SQL is the upcoming SQL/PGQ (Property Graph Queries) extension. I'm involved in a project at CWI Amsterdam to incorporate this language into DuckDB [5].
Take a look at RDFox. Unlike Postgres, where you can only derive a column from another column once (you can't derive a column from an already-derived column), RDFox can derive unlimited columns, or even objects, through rules, and it does so incrementally. It's kind of like unlimited chaining of materialized views.
Their claim to fame is progressive incremental computation (each node is an actor responding to events), and I'm not sure how a relational DB could do that and match the latencies. That use case is pretty much pattern matching, forensics, and stuff like that.
It's also worth mentioning the idea of WITH ITERATIVE. This is a proposed SQL syntax extension that allows formulating iterative computations. For many problems (including graph traversals), this makes queries easier to write and execute.
> We derive new iterative CTE variants from the simple loop-based, operational semantics of SQL:1999’s WITH RECURSIVE.
Quine seems to be an interesting project but I'm not too familiar with it. Its main feature, evaluating complex multi-way join queries on incoming streaming data, exists in the relational world as the Materialize database which leverages differential dataflow for computation.
Quine uses Cypher so expressing path queries can be done with the concise Kleene-star syntax, e.g. (p1:Person)-[:knows*]-(p2:Person).
Materialize is getting support for WITH RECURSIVE and WITH MUTUALLY RECURSIVE (their own SQL extension that fixes some issues of WITH RECURSIVE):
I implemented pretty much exactly this setup once.
For what it's worth, there are a lot of footguns with this approach. You need to be careful about infinite cycles and things like that. You also just do not get the ergonomics of a query language that is meant for graph data. Once you get it all set up and experience the issues you will no doubt be able to build whatever you want, but there's a learning curve.
In the end we switched to using Neo4j and were able to build new features a lot more quickly than we would've been able to on Postgres.
It's also worth mentioning that there are many ways to store graph data, and using an "adjacency list" as is done here may or may not be the best way for your use case. Neo4j, I think, uses a different approach where nodes directly store info about the edges they are connected to, so you don't even need to hit an index to follow an edge.
There are still ways to go wrong. Try writing a query which explores a graph from a starting node, building up a path from the start to each node as a string. Even using UNION you will get an infinite loop, because every visit to a node in the loop has a different path, so it makes a unique row.
Ah, yes, I flubbed -- this one requires aggregation as you go. It's tricky. Ideally you could have a decent aggregation data structure for the path that lets you select a distinct sequence (path) of middle points. One could write such an aggregation function, I think. Lacking that you have to limit the length of all paths between any pair of points to the longest possible path (which is the number of points minus one), which is annoying, and then you still have to fix up each resulting path to remove any repeating suffix.
sqlite> .schema
CREATE TABLE edge (src text, dst text);
sqlite> with recursive paths (src, dst, len, path) as (
...> select src, dst, 0, src || '->' || dst from edge
...> union
...> select p.src, e.dst, p.len + 1, p.path || '->' || e.src
...> from paths p join edge e on p.dst = e.src
...> where p.len < (select count(*) from (select src from edge union select dst from edge)))
...> select * from paths;
Hawaii|San Francisco|0|Hawaii->San Francisco
San Francisco|New York|0|San Francisco->New York
San Francisco|Houston|0|San Francisco->Houston
Houston|New York|0|Houston->New York
Houston|Hawaii|0|Houston->Hawaii
Hawaii|Houston|1|Hawaii->San Francisco->San Francisco
Hawaii|New York|1|Hawaii->San Francisco->San Francisco
San Francisco|Hawaii|1|San Francisco->Houston->Houston
San Francisco|New York|1|San Francisco->Houston->Houston
Houston|San Francisco|1|Houston->Hawaii->Hawaii
Hawaii|Hawaii|2|Hawaii->San Francisco->San Francisco->Houston
Hawaii|New York|2|Hawaii->San Francisco->San Francisco->Houston
San Francisco|San Francisco|2|San Francisco->Houston->Houston->Hawaii
Houston|Houston|2|Houston->Hawaii->Hawaii->San Francisco
Houston|New York|2|Houston->Hawaii->Hawaii->San Francisco
Hawaii|San Francisco|3|Hawaii->San Francisco->San Francisco->Houston->Hawaii
San Francisco|Houston|3|San Francisco->Houston->Houston->Hawaii->San Francisco
San Francisco|New York|3|San Francisco->Houston->Houston->Hawaii->San Francisco
Houston|Hawaii|3|Houston->Hawaii->Hawaii->San Francisco->Houston
Houston|New York|3|Houston->Hawaii->Hawaii->San Francisco->Houston
Hawaii|Houston|4|Hawaii->San Francisco->San Francisco->Houston->Hawaii->San Francisco
Hawaii|New York|4|Hawaii->San Francisco->San Francisco->Houston->Hawaii->San Francisco
San Francisco|Hawaii|4|San Francisco->Houston->Houston->Hawaii->San Francisco->Houston
San Francisco|New York|4|San Francisco->Houston->Houston->Hawaii->San Francisco->Houston
Houston|San Francisco|4|Houston->Hawaii->Hawaii->San Francisco->Houston->Hawaii
sqlite>
Aha, nice, i didn't think of just limiting it that way! The need to postprocess this is annoying but only very mildly; i think it takes a fairly simple window function.
I had a go at changing my query to only produce one route to each node, by excluding rows in the recursive step which go to a node for which we already have a destination, but i couldn't find a way to do it which PostgreSQL would accept, because you can't use the recursive table in an outer join or a subquery.
The PostgreSQL documentation has some advice about dealing with cycles, but i haven't worked up the mental energy to absorb it:
Well, it's a nice trick but it's O(N^2) in terms of output size because you have N items and paths up to N-1 long, so you have up to N ~N-length paths. If your data is guaranteed to have loops then this could be annoying.
The PG thing works trivially because you can use an array for the path and inspect the array in each iteration. That's harder to do in SQLite3, but... you could: just format a string like '%->' || src || '->%' then use it with `like()` to check if the path you're adding `src` to already has it.
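Roughly, the recursive step of the query above could be guarded like this (a sketch; it appends e.dst so each hop adds the newly reached node, and it assumes node names don't contain the LIKE wildcards % or _):

with recursive paths (src, dst, len, path) as (
  select src, dst, 0, src || '->' || dst from edge
  union
  select p.src, e.dst, p.len + 1, p.path || '->' || e.dst
  from paths p join edge e on p.dst = e.src
  -- skip this edge if its destination already appears somewhere on the path
  where ('->' || p.path || '->') not like ('%->' || e.dst || '->%')
)
select * from paths;

With the cycle check in place, the length cap is no longer needed for termination.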
You started the thread by disagreeing with “you need to be careful”, and the OP is pointing out that your thread conclusion is “you need to be careful”.
No, it's not Postgres-specific advice. Using UNION ALL in recursive queries is a mistake if you have cycles in your data. This follows from first principles and the standard.
There is some theoretical work on Datalog you might be interested in. Recursive queries over sets are guaranteed to terminate if they don't contain negations / aggregations, and if they have a finite Herbrand universe, which, very roughly speaking, means that no functions can return new constants or tuples of unbounded length.
E.g.: recursively generating the set x := x + 1 won't terminate because x goes to infinity, and you get infinite tuples. x := abs(x) is fine, since you get up to two tuples per input tuple (-1 and 1, for example).
Anyway, if you ask for the query to build up all paths (of unbounded length), it won't terminate. If you asked for shortest paths, (bounded by the number of nodes in the source database) it would.
Interesting. Most of that makes sense I think, though a bit outside my normal sphere.
I was struggling to see why aggregates were excluded, so to speak, but I guess it's because they would have to be repeatedly evaluated? I.e., compute the sum, add it to the output, and now you might have to recompute the sum? Perhaps I'm being dense here; left my thinking cap at home.
Piggybacking on your Neo4J reference....can anyone suggest any resources that would be good for someone from the SQL world who is very uninformed on graph databases to quickly get their head around the key ideas, but even more importantly, how to choose a platform among the offerings (licensing being I think a key issue, I've heard Neo4J can get expensive)...assume a large, "social media" scale system, to be safe.
Neo4j has a self-hosted community edition that is free, though you need to update to Enterprise to expand your DB beyond one node. It's worth noting though that scaling from a single node to a cluster is going to have some pretty huge performance issues whether you are using SQL or Neo4j to store your graph. The power of graph databases is that it should be very inexpensive to follow an edge in the graph but when that leads to a network hop you lose a lot of performance.
If you are comfortable fitting your data onto one box, then development speed is probably more important than other factors and I would just try a few databases out and especially pay attention to how good the libraries are in your programming language of choice. Neo4j for instance had high quality libraries in Java but the experience in Python was not as good.
If you have a lot of data or need next-level performance, I would start by doing some research on the various ways to store graph data and then pick a DB that supports the approach you want to take.
As someone who hated being in the SQL world and couldn’t figure out why until years later when I found out about graph databases, here’s one big shift:
Graph databases store true relationships.
The R in RDBMS and the concept of relationships by primary keys are lies (imo). They lead people away from what true relationships can be. Databases like Neo4j are not about doing any sort of PK or join or merge-before-query. Looking up by connection is the fastest way to find information. If your RDBMS had one table for phone numbers, one for email addresses, one for first names, one for last names, one for street, and so-on it would take huge amounts of time to query to find all the detail for a single person. With a graph database it takes a small amount of time to find the first bit of info (let’s say phone number since it’s easily sorted), and then every other bit of linked data is essentially free.
Apologies in advance; the R for relational is with respect to the tuple, that is, the unordered items which are serialized within a row in a table, and not the keys which link different tables. The lie you find does not originate in the definition.
As someone who has wanted to love graph DBs: stick to narrowly focused, top-down queries where index-free adjacency lets you avoid touching the bulk of your data, and stay away from bulk queries that may need to visit everywhere multiple times, which is unfortunately where things get interesting.
Effectively, moving from RDB to a graph database just shifts relationship resolution from query time to insert/update time. Sometimes this can be an advantage. I’d even wager usually.
I've been using EdgeDB in a greenfield project and loving it. Some rough edges (forgive the pun) but fantastic for prototyping / smaller projects, and it could be a good candidate for large projects. I like it better than when I used Neo4j a few years back for a client project. Much better license, and I really like that it's built on top of Postgres.
> how to choose a platform among the offerings (licensing being I think a key issue, I've heard Neo4J can get expensive)
Safest bet (knowing nothing else about your needs): Build at least three prototypes, preferably rapidly and in parallel. Only after doing so will you be able to ask the right questions that speak accurately to your specific needs and therefore be able to map actual requirements to the different graph databases.
I’d be curious what other experienced graph database people would say about their first bigger projects. Looking back, my first graph projects involved significant iteration and several restarts. And these were still necessary after having significant background knowledge about the algorithms, offerings, and trade offs.
There are many people in roles/orgs that don’t have the humility or organizational cover to admit that these early projects are largely experimental learning experiences.
> ...assume a large, "social media" scale system, to be safe.
This broad assumption is going to cost you.
Instead, I’d encourage you to reframe this assumption in more specific terms such as read/write patterns; need for consistency and availability; support expectations; your talent pool; and lots more. Estimates can be on a log10 scale: 1, 10, 100, 1000, and so on.
I do love PostgreSQL, and I often reach for this approach when working with hierarchical data, but a word of warning: Recursive queries will not scale to handle _very_ large edge tables. The recursive query will always consider _all_ edges, even when the outer query seeks only a single node's relationships. The solution is to build and index denormalized n-th order relationship tables.
Others have already pointed out the issues of cycles.
It's not even that it always considers all edges, but that you can't exit the search early when a condition is met. In other words, you have to first find all the paths and then, outside the CTE, filter to the shortest. We push the node filter into the CTE by wrapping it in a function.
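A rough sketch of that wrapping trick, with made-up names (edges(src, dst), hops_to); the target filter lives inside the CTE, so the traversal at least stops expanding past the target instead of enumerating every path first:

CREATE FUNCTION hops_to(start_node bigint, target_node bigint)
RETURNS int
LANGUAGE sql STABLE AS $$
  WITH RECURSIVE walk (node, depth) AS (
      SELECT start_node, 0
    UNION
      SELECT e.dst, w.depth + 1
      FROM walk w
      JOIN edges e ON e.src = w.node
      WHERE w.node <> target_node   -- don't expand beyond the target
        AND w.depth < 100           -- safety bound for pathological graphs
  )
  SELECT min(depth) FROM walk WHERE node = target_node;
$$;

-- e.g. SELECT hops_to(1, 42);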
> The solution is to build and index denormalized n-th order relationship tables.
This sounds much more performant but also more difficult to maintain.
"you can't exit the search early when a condition is met" - I have a DAG traversal written in a recursive CTE, and I can bail early out just fine when my traversal predicates no longer match. Not sure why I'd have to do that outside the (recursive part of the) CTE?
Obviously maintaining a flattened version is going to perform better in queries, but you're trading maintenance (write) time for query time. Or materialization time if you decide to do it lazily.
Postgres will walk both paths from A to D in the recursive CTE and then you can filter them afterwards to keep only the shortest.
You can't use aggregate functions within the recursive part of the CTE, so you can't GROUP BY your starting node and stop once you find the first path. There isn't a way to compare across multiple paths or iterations of the for-loop.
Ok but that's a little different than saying you can't cut the traversal short.
If I'm traversing ancestors (let's say) until I find one that doesn't satisfy a condition, it'll bail out then. I get that this doesn't serve all use cases, but it isn't a small thing either.
Cycles are not a problem: just use `UNION`, not `UNION ALL`.
I myself build transitive closure tables that I update incrementally as needed (sometimes in a deferred way), so that many of the recursive queries I do can be very fast, but I only build transitive closures for the smallest portion I can (basically group nesting).
> Recursive queries will not scale to handle _very_ large edge tables
What do you consider large? We have a 12-year-old telco application with a few million objects in Neo4J, doing fault impact calculations... Would PostgreSQL handle that easily nowadays?
My hunch is that a few million objects is pretty tiny these days - you could probably write a script to import all of them into PostgreSQL in a few minutes and try it out yourself to see how it behaves.
Not a very realistic example; you need to be requesting some actual fields across nodes and doing some filtering on at least strings and dates, maybe geo areas as well.
I'm sorry, could you spell it out? What exactly does "recursive query will always consider _all_ edges" mean? A table scan? I'd be very grateful if you could give some pseudocode or point to a doc page.
I think GP means that it has to completely expand the recursive part for every branch before any where condition on the edge nodes can be applied. Graph databases presumably can optimize this.
I've found recursive queries to be difficult to scale in real-world queries past a few million edge nodes. We've denormalized several tree relationships so that the edge is connected both to its parent and to the root of its tree in several cases.
Thanks! Sounds like you're saying that brute-force recursive algorithms without backtracking or early termination aren't a good match for path finding. That's not a surprise.
I'm using a recursive algorithm on Postgres to find trees in a graph, but only where I'm specifically interested in all of the nodes.
It's a good tool as long as your queries are simple, and your graph isn't too large (which is probably the case 99% of the time). And if that allows you to have one less tool to deal with, go for it!
But I've tried to push it to its limits (using PostGIS, my nodes being land plots that were connected to each others — the biggest query used joins, lateral, recCTE, windows… you name it) and ran into many issues: can't branch out early, hard to optimize, no support for complex queries (writing anything beyond BFS and DFS is a pita). Yet, in the end I accepted the struggle (and the long queries) because it was so nice to work with my data directly where it was kept :)
Using the built-in PostGIS topology tools? Or something more custom? I'm curious as I just started digging into and using PostGIS for a land parcel use case. I've wondered about the graph structure of parcels adjacent to others, rather than just a big unordered table of geom objects.
Just using the geography stuff and doing joins on ST_Intersects. I couldn't guarantee that my data formed a partition of the plane, which is necessary for topology iirc.
Fun was had, and headaches too. The biggest speedup I got was computing the graph online (creating node/vertex tables from the geometries) and then doing the recursive CTE on that table.
Over time I've come to take this a step further. I feel people reach for a graph database way too early. Most use cases that most people have will work fine in a standard relational setup. And that brings with it the benefits of using technologies like Postgres that have stood the test of time. So I treat it as I do any performance consideration: first demonstrate that you actually need to go down the path less traveled, don't just assume you need to do so.
Of course a relational DB can act like a graph DB. It's just not as efficient due to how things are stored and queried. It would be great to have a graph DB plugin (and I found one: https://github.com/apache/age).
Having first class pagerank and shortest path functions in tinkerpop gremlin vs having to roll it yourself with recursive CTEs feels like a graph DB win.
Johan wrote the original Neo4j implementation as a graph layer on top of Postgres, random trivia. Then Java released NIO, he got excited and tried replacing the PG engine with one built for graphs from scratch.
As a C# guy, I’ve figured out how to build in-memory graphs with generics and DSL/fluent queries. I’m working on a blog entry about it because it’s such a powerful skill to learn and leverage. No data store required. I can serialize/deserialize the data structure to/from binary or JSON.
I personally prefer native graph databases such as Memgraph[1]. Graph databases have their advantages and disadvantages. Sometimes you need an RDBMS, sometimes you need a NoSQL solution. So I pick my stack based on the situation and the problem I need to solve.
“Native” is a misleading or misused term in the context of graph databases because a graph cannot be mapped in a maximally consistent way with the underlying hardware (the von Neumann architecture represents data in a sequential manner and it is much faster to access it this way rather than randomly).
There are generally two families in the graph database world: those which use underlying traditional tables of nodes and many-to-many edges; and index-free adjacency which just means each node in the graph knows the memory address of its connections (other side of the edges).
Distributed graphs necessarily end up using the former because it’s difficult if not impossible for a node to know the memory address of its connection when that crosses a physical boundary. So typically index-free adjacency graphs have a master-slave setup with multiple read replicas but a single one to write to.
So with a “native graph” you don’t rely on potentially expensive join operations to find neighbors of neighbors and can traverse complex paths easily.
This is spot on. Though, in Memgraph’s context, there is a combination of lock free techniques and raw pointer representation that can work exceptionally well for a large variety of use cases, and even better so compared to others on streaming / event driven / large volume of writes. Using raw pointers, there are also optimization opportunities when storing or rebalancing the graph to maximize the number of adjacent nodes in a cache line so benefiting from sequential representation too.
Small nitpick but I wish the author just made `data` a JSONB field rather than VARCHAR. That really shows the power of Postgres, being able to operate as a "NoSQL graph DB" with little to no hacking.
Hey, author here! Thanks for the feedback! I went back and forth with having `data` being a JSONB field or VARCHAR, but ended up with VARCHAR to show that `nodes` don't have to be anything crazy. Really `nodes` is just a standard table you'd see in any old SQL database, and what makes it a "graph" is the `edges`/relationship table.
Andy Pavlo mentioned something similar, where he advocated using relational databases to cover most graph database use cases. He also made a bet that if graph databases spawn a legit use case by 2030, he would change his CMU profile picture.
We materialize the transitive closure of our graph and use that to very quickly traverse the graph.
Our data is a directed graph (an organization tree) so the materialization code (accomplished via triggers) isn't too bad.
The result is that instead of WITH RECURSIVE we just join against our closure table which is in the form `(organization, ancestor)`. Example type of query: "Is Alice an administrator in any of Bob's organizations or in any ancestor organizations", and if that's true it would give them the ability to perform X administrative action. Actually queries are more nuanced but the core Hard Thing :tm: is traversing the graph and it's really easy with our closure table.
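A minimal sketch of that shape, with made-up table names (memberships(person, organization, role) and closure(organization, ancestor)); the real queries carry more nuance:

-- All orgs that count for Bob: the ones he belongs to, plus their ancestors.
WITH relevant_orgs AS (
  SELECT organization FROM memberships WHERE person = 'bob'
  UNION
  SELECT c.ancestor
  FROM memberships m
  JOIN closure c ON c.organization = m.organization
  WHERE m.person = 'bob'
)
SELECT EXISTS (
  SELECT 1
  FROM memberships a
  JOIN relevant_orgs r ON r.organization = a.organization
  WHERE a.person = 'alice' AND a.role = 'administrator'
);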
We have a foreign key constraint in place that prevents any org with children from being deleted. So first, all children must either be orphaned or moved to another parent. We have a trigger on updates that will rematerialize an org's closure rows and then do so for all their descendants. That puts all the former children into a correct state. Then when we delete the organization, we have foreign keys from the closure table to the organizations table and we allow cascading deletes.
Our closure table itself contains "evidence" that enables correctness constraints. So a row can only exist if a) it represents a direct parentage (ancestor is the direct parent) or b) ancestor is a parent of my ancestor. This is all enforced via foreign keys, so if the foreign keys are set up correctly an errant row isn't possible.
Behind all this is a proptest (property based testing is awesome for stuff like this). We generate arbitrary org forests, perform random mutations (changing parents, deleting orgs, gaining children, etc) and then assert that the closures table is accurate. We have very high confidence that the data is correct because of this test suite.
This is nice but it misses what's fundamentally special about "real" graph databases - the way they store data.
Native graph databases like Neo4J use index-free adjacency. This means query durations are proportional to the amount of the graph searched, -not- the size of the data stored.
Only remotely relevant, but I have recently come across Gremlin, a sort of “SQL for graphs”. Does anyone have some experience with it they can share? I only used it on an in-memory graph database, and I found it quite great so far.
Tinkerpop/Gremlin is not really SQL-like at all. My experience with it has not been fun: it's difficult to reason about, has obscenely bad documentation, poor tooling, idiosyncratic library support, and simple data access patterns are hard.
I mirror the sibling comment's experience: it was a tire fire; also, if one comes from an n-tier application mental model, it's further bad news because the actual query's compute runs _client side_ -- there is no such thing as "connect to Gremlin 'database,' submit query text, have it run queries" it's rather "boot up Gremlin in your appserver, run queries so long as your appserver can connect to every one of the data stores it references, credentials included"
Very interesting. Relatedly, I've been refining simple lightweight ways to use a regular SQL database to execute seminaive datalog queries. Path reachability in a graph is the canonical example. One could use stored functions to implement the outer loop also.
There are some other fun related applications like graph rewriting using SQL, etc.
I've done this on many occasions, but I actually only go with a single nodes table.
Schema is like this
id
type: string (or enum)
data: jsonb
indexed: jsonb (but this one gets a GIN index)
toId: nullable foreign key to this same table
fromId: nullable foreign key to this same table
createdAt: date
updatedAt: date
So if I had a post with a comment on it, the data would look something like this
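Something like this, with made-up values (assuming the table is called nodes; createdAt/updatedAt left to defaults):

-- Two node rows and one edge row: nodes leave toId/fromId null,
-- and the edge row points from the comment (fromId) to the post (toId).
INSERT INTO nodes (id, type, data, indexed, "toId", "fromId") VALUES
  (1, 'post',       '{"title": "Hello world"}', '{"slug": "hello-world"}', NULL, NULL),
  (2, 'comment',    '{"body": "Nice post!"}',   '{"authorId": 7}',         NULL, NULL),
  (3, 'comment_on', '{}',                       '{}',                      1,    2);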
Yes, but the downside of your schema is that the distinction between nodes and edges is somewhat muddled.
Presumably, all nodes would have nulls in both fromId and toId, but the schema doesn't enforce that. The schema also allows linking edges to other edges.
Don't you think this is a bit too much flexibility if the intention is to model a graph?
We used the "WITH RECURSIVE" basically the moment it shipped in Postgres for the original CoreOS Clair[0] data model which was a graph of software versions and vulnerabilities.
It scaled well compared to a naive graph abstraction implemented outside the database, but when performance wasn't great, it REALLY wasn't great. After running that version in production for a couple years, we ended up throwing it out and remodeling the domain to be less graph-y to achieve more consistent performance.
I've since worked on SpiceDB[1], which scales much better by taking the traditional design approach for graph databases and only treating Postgres as a triple-store. IME, there's no short-cut: if you need a graph, you probably want to use a database optimized for graph access patterns. Most general-purpose graph databases are full of optimizations for common traversals that are uncommon operations in relational databases.
A few suggestions based on my prior experience with the same task in a production system:
1. Both tables (nodes and edges) MUST HAVE one more column, "label", and you should then create partitions based on the distinct values of the "label" column. This greatly helps queries run faster, as the search plan is narrowed when you include it in the WHERE clause and PG knows which partitions to hit.
2. Instead of one big fat primary key column in each table, consider having different unique indexes on each partition.
3. The edges table can afford to have a "data" column of type TEXT or perhaps JSONB. PG stores it in a different on-disk layout and compression works great here. We used it to cache results for previous/next node data, or to precompute results for the extra hops that are required on a frequent basis.
4. Use PG procedures to hide the complexity of reusable DFS/BFS queries.
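A minimal DDL sketch of the partitioning idea from points 1 and 2 (table and label names are made up for illustration):

CREATE TABLE nodes (
    id     bigint NOT NULL,
    label  text   NOT NULL,
    data   jsonb
) PARTITION BY LIST (label);

CREATE TABLE nodes_person   PARTITION OF nodes FOR VALUES IN ('person');
CREATE TABLE nodes_document PARTITION OF nodes FOR VALUES IN ('document');

-- Per-partition unique indexes instead of one big primary key (point 2):
CREATE UNIQUE INDEX ON nodes_person (id);
CREATE UNIQUE INDEX ON nodes_document (id);

-- Including the label in the WHERE clause lets the planner prune partitions:
-- SELECT * FROM nodes WHERE label = 'person' AND id = 42;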
Postgres is powerful. I've implemented graph DB, mapreduce, huge sparse matrix math (which was cool), and ofc a KV store with it. With actual performance benefits over the alternatives. It's not very ergonomic for those, but it works.
That said, I've had so many jobs where the team decides to fit our whole data model into a fully recursive graph, and it's been a huge mistake every time regardless of whether they use Postgres or a dedicated graph DB. Just because you can do it (you always can!) doesn't mean you should. Start with just a traditional n-hierarchy schema and only look at modeling things as a graph if you're sure it's necessary. Usually it's not. Sometimes you'll have a mostly non-graph schema with a couple of graph tables for the things that really need that.
The reason graph DBs exist is mainly to provide easier syntax for recursive queries (which aren't really recursive because they're tail-recursive, which means iterative rather than what most people think of as recursive).
Any questions?
It's not clear how one might make SQL syntax for recursive queries simpler. Clearly we can have syntactic sugar for "dereferencing" relationships, so one could say `SELECT child-->name FROM parent_child WHERE ...;`, and we could have aggregation `SELECT array_agg(child-->name) FROM parent_child WHERE ...;`, and `SELECT child--->name FROM parent_child WHERE ...;` to say "recurse all the way through the child relation", but if you want more than one column then things get tricky.
It doesn't handle all cases or all queries, but one method I've used to deal with graph data in relational databases is to materialize paths and cache them on the node (this only really works well with DAGs).
So, the path key could look something like: bob.ted.lisa.bill (obviously, use ids instead of names here)
Then you can index this in various ways to make LIKE queries fast.
Want to know everyone that rolls up to ted?
select * from blah where path like 'bob.ted%'
Or, how many people report to ted, but aren't under lisa?
select count(*) from blah where path like 'bob.ted%' and path not like 'bob.ted.lisa%'
And something a bit more complex that would typically be considered the domain of a graph database, like "What are the average number of direct reports each manager in my organization has?"
select avg(direct_reports) from (select (select count(*) from blah as sub where sub.path like '%' || m.name || '%'
  and array_position(string_to_array(sub.path,'.'), m.name) = array_position(string_to_array(sub.path,'.'), sub.name) - 1) as direct_reports from blah as m) as per_person where direct_reports > 0
So, even with that degree of complexity, you're still avoiding recursion, though you might be getting into O(n^2) territory if you don't have an index that will handle '%' || ... || '%' well. You can expand on this with various techniques depending on the application.
It's certainly no replacement for a graph database, but for basic stuff it's very fast, handles huge volumes, and doesn't rely on any sort of recursion at the database layer (unless you materialize in a trigger or something).
The actual materialization code can be tricky, one technique to deal with that is to use the author's technique or something similar to have a recursive CTE in a view. When a graph change is made, you can scope the change based on the path's hierarchy and update the paths from the recursive view, that way you only hit the performance penalty at write time, not read.
Ok, can the table and query design in this article actually scale to millions of rows? With hundreds of concurrent queries? I have generally become a skeptic of doing anything fancy on an RDBMS.
any suggestions for getting nice graph-DB-style functionality out of a Postgres schema, where you've got the nodes and edges of a labelled property graph spread across a bunch of different tables? (AGE seems to want to create a separate graph object that lives alongside your other tables, and toy examples like this article just dump everything into special tables)
in my particular case i don't actually have a lot of data so i'm not looking for performance, just want to save developer time.
I can see how this possibly won't scale compared to dedicated graph dbs.
I'm interested in what the fundamental difference is in those graph dbs because rdbms seem so close I don't see why it isn't just provided as an extra layer on top.
As described in the article, any graph can already be built with an RDBMS, so why, for example, does Neo4j need to be standalone and can't possibly be a layer on top of Postgres to leverage existing relationships? Any good articles on that?
In native graph databases like Neo4J query times are proportional to the amount of the graph searched, rather than increasing with the overall size of the data. That's the clincher.
I don't know off the top of my head but I'd assume that one could adjust the storage strategy and other performance tweaking to optimize for the expected use case.
While the basic technique is sound, SQL recursive queries' no-side-effects character means a huge number of basic graph traversal strategies (e.g.: pruning, shortest path first/Dijkstra) are beyond reasonable engineering.
In addition to that, Postgres is difficult as a computational platform - the execution strategy for recursive queries is not necessarily intuitive, and their performance is hard to assess (for example: how soon will you run out of "stack"?)
We accomplish something similar with SchemafreeSQL. It currently has a MySQL 8 back-end, but we're working on other SQL back-ends. This demo uses PlanetScale as the DB back-end, Fly.io handles the SFSQL endpoint, a Netlify function handles the query, and the cytoscape.js graph network library handles the UI.
Thanks !! We had hoped to use PS for our serverless backend and also offer a dedicated PS option. But we ran into some issues with Vitess. I was just thinking today I wanted to get back into it and see if we can resolve these issues. We had to go with Aurora.
The issues were on deletes. This demo is read only so it works well with a PS backend. My experience with your CS has been amazing despite these issues!!
Postgres has commands to easily list conventional tables nicely. While I can see this would work, there are no native Postgres commands to list the graph.
In my experience Postgres works really well with a recursive CTE inside a materialized view for graphs. The latest version can detect cycles. I used to add a distance column as well, to show the number of hops between any 2 nodes.
I'm hoping Postgres adds automatic incremental materialized view maintenance soon; it will then be perfect as a graph database.
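A rough sketch of that setup, assuming a hypothetical edges(src, dst) table (the depth cap is a crude stand-in for the CYCLE clause available in newer Postgres versions):

CREATE MATERIALIZED VIEW reachability AS
WITH RECURSIVE hops (src, dst, distance) AS (
    SELECT src, dst, 1 FROM edges
  UNION
    SELECT h.src, e.dst, h.distance + 1
    FROM hops h
    JOIN edges e ON e.src = h.dst
    WHERE h.distance < 32            -- cap depth so cyclic data still terminates
)
SELECT src, dst, min(distance) AS distance   -- number of hops between two nodes
FROM hops
GROUP BY src, dst;

-- REFRESH MATERIALIZED VIEW reachability;   -- until incremental maintenance lands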
All in all this seems to be "graph databases, only the simplest parts", and yeah, sure, SQL can be coerced into traversing an acyclic directed graph, but that doesn't make for a very efficient nor interesting solution for the general case.
Also calling a graph database a serialized tree structure seems quite a stretch.
For most non-performance-critical cases, Postgres is a good choice. However, when dealing with more complex queries such as pattern matching (e.g., triangle counting) that require immediate results, a graph database outperforms Postgres.
This is a neat data modeling exercise, but not sure why you'd prefer this approach over Neo4j.
It's much simpler to query in native graph database land (though it's not exactly SQL, it's pretty close to be honest). And there are performance / reliability implications of forcing postgres into graph land.
"However, recursive SQL queries can be expected to perform comparably for 'find immediate descendants' queries, and much faster for other depth search queries, and so are the faster option for databases which provide them, such as PostgreSQL,..."
Can someone explain or point towards an explanation, what the performance impact of such a query is? How does it scale with number of edges or depth of recursion? Is there a special smart way of indexing this to be efficient?
If you have indexes on next_node and previous_node, it'll perform okay.
The size of the table (# of edges) won't actually matter *that* much performance wise with btree indexes. The performance will mostly depend on how deep you want to go with your recursion. The more nodes you check per query, the slower it'll be. Luckily PG will keep the frequently accessed nodes and index pages in the buffer pool, so it'll perform close enough to an in-memory graph database for most use cases that access the same data often - while not requiring your whole dataset to fit into memory all the time.
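Concretely, assuming an edges table with previous_node and next_node columns as described in the article, that's just:

CREATE INDEX ON edges (previous_node);
CREATE INDEX ON edges (next_node);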
MongoDB's current graph search mechanism works the same way.
I think actually if you are using incremental ids and you don't delete edges, you could try BRIN indexes for the next/previous node columns. This could still get you pretty good performance with much lower storage space requirements than a "traditional" index, so you could theoretically store a pretty large graph (thinking billions of edges) with minimal disk space.
If you store a list of edges like (startId, endId) with an index on startId, then every time you want to follow an edge will cost O(log(n)) where n is the number of edges in your graph.
You can get O(1) performance on similar queries using a different storage strategy (by storing edges as pointers to the other nodes in memory) but SQL is not the right tool for that, you would be better off with a true graph DB.
>> While Postgres is not generally thought of when working with graph data structures, it is perfectly capable to store and query graph data efficiently.
Some scalability and performance metrics would be helpful here.
And that’s why graph databases. Practically, you always need to limit the recursion depth anyway (you can’t show infinite depth in a UI), so you might be able to make do with Postgres.
whoa.... postgres is moving quickly and taking jobs away from other databases. Maybe we need to hold up for 6 months on any new uses for postgres until we all have some time to figure this out.
[1] https://ldbcouncil.org/benchmarks/snb/
[2] https://www.vldb.org/pvldb/vol16/p877-szarnyas.pdf
[3] https://github.com/ldbc/ldbc_snb_bi/blob/main/umbra/queries/...
[4] https://umbra-db.com/
[5] https://www.cidrdb.org/cidr2023/slides/p66-wolde-slides.pdf