This is a great slide deck and directly follows what I learned about this over the years. In nearly every case I think you want a flattened table of the hierarchy. I hadn't heard the name closure table before so thanks.
Not sure how old that slide deck is, but SQLite has definitely supported recursive common table expressions for years now.
In some ways it is more permissive with syntax allowed in the recursive portion than Postgres.
Recursive CTEs are almost never what you want. They're the database equivalent of pointer-chasing in code. If you design your schema such that you need a recursive CTE to query it, be prepared for bad performance unless your CTE only ever does a handful of iterations over a handful of rows.
I like paths for representing hierarchy, but closure tables can also be a good idea, depending on what you're modelling and how you query it.
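For reference, a minimal sketch of the pattern in question: a recursive CTE chasing parent pointers through a plain adjacency list (table and column names are hypothetical):

    -- Hypothetical adjacency list: each row stores a pointer to its parent.
    -- create table nodes (id bigint primary key, parent_id bigint references nodes (id));
    with recursive subtree as (
        select id, parent_id, 0 as depth
        from nodes
        where id = 42                            -- root of the subtree we want
      union all
        select n.id, n.parent_id, s.depth + 1    -- one "pointer chase" per level
        from nodes n
        join subtree s on n.parent_id = s.id
    )
    select * from subtree;

Each iteration of the union all is another pass through the index on parent_id (assuming you have one), which is exactly the pointer-chasing behaviour described above.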
> They're the database equivalent of pointer-chasing in code.
That's the general case, but more specific CTE queries can be optimized, e.g. by adding database indexes. Recent versions of Postgres have greatly improved with respect to not making CTEs overly inefficient.
I said recursive CTEs. CTEs being an optimization barrier is a different issue, and often a desirable one in Postgres given its lack of optimizer hints. Hence the AS MATERIALIZED / AS NOT MATERIALIZED keywords, etc.
I'm a big fan of the nested set representation. It hits the sweet spot of compactness and expressivity - queries for parents, children, ancestors and descendants are simple integer comparison and pretty fast with the right indices.
The only downside is that this requires updates of a large number of rows whenever the tree changes (not all rows though). If you have a sharded table that could be problematic.
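For the curious, the descendant query under nested sets is just an interval containment test; a sketch, assuming the usual lft/rgt columns (names hypothetical):

    -- Every descendant's (lft, rgt) interval nests strictly inside its ancestor's.
    select child.*
    from tree parent
    join tree child on child.lft > parent.lft and child.rgt < parent.rgt
    where parent.id = 42;

With btree indices on lft and rgt these become plain range scans.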
I feel that a discussion of ltree is incomplete without discussion of manipulating the tree. e.g. splicing, cutting, moving, deleting branches [1] etc. Also, how ltree uses Gist indexes [2].
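For example, moving a whole branch under a new parent is a single statement using ltree's nlevel and subpath functions; a sketch with a hypothetical table and paths:

    -- Reparent everything at or below 'Top.Old.Branch' underneath 'Top.New'.
    update nodes
    set path = 'Top.New' || subpath(path, nlevel('Top.Old.Branch') - 1)
    where path <@ 'Top.Old.Branch';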
Careful with CTEs in older Postgres. Until the most recent versions they were an optimisation fence (predicate push-down would not pass through, meaning the CTE would often result in an index or table scan, potentially multiple times in the recursive case). This can be a huge problem over large data, though obviously in many circumstances with small enough data it is still efficient enough.
That concern doesn't really apply to recursive CTEs though.
Or rather, recursive CTEs inherently require that you select only what you need.
A little different from the non-recursive case of "I'll just use a CTE for the same functionality but better readability", which could deceptively bite you prior to PostgreSQL 12.
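Since PostgreSQL 12 the fence can be controlled explicitly; a quick sketch against a hypothetical events table:

    -- Ask the planner to inline the CTE so the user_id predicate can push down.
    with recent as not materialized (
        select * from events where created_at > now() - interval '1 day'
    )
    select * from recent where user_id = 123;

Writing AS MATERIALIZED instead restores the old fencing behaviour when that's what you actually want.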
tl;dr: I agree, and healthy human organizations should never scale beyond ~5 degrees of hierarchy, which is totally manageable via basic recursive JOINs in an RDBMS without fancy stored procedures or graph theory.
I like to use Dunbar's Number (100-250) to approximate the levels of hierarchy in human organizations. The idea is that these organizations are most efficient when organizational layers don't exceed ~150 elements, due to the implementation details of the human brain.
Basically, you can compute log_150(N) to get a very rough idea of how complex the organization of N people should be. This works for small startups and entire countries. Of course, startups should probably get comfortable with the idea of teams well before hitting >100 employees. Teams can then scale into departments (with new subteams), and once there are many departments, add a regional layer, a strategic/executive layer, and so on.
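To make that concrete: log_150(10,000) ≈ 1.8, so a 10,000-person company should need roughly two layers of grouping above individuals, while log_150(330,000,000) ≈ 3.9, or about four layers, for the present-day US population.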
One interesting fact is that the USA population has roughly increased by a multiple of Dunbar's number since its organizational structure was codified in its Constitution. Perhaps time for another look?
I would argue that Dunbar's Number is the wrong number to use for this. At least not by just naively dropping it in.
Remember, it represents the total number of stable social relationships a person can maintain. If you're looking to allow your employees to have personal lives, you'll want to leave ample room for their family and friends.
Maybe an important question to ask is, how much of your employees' social-emotional carrying capacity is it appropriate to consume? If 10%, then 15-25 is your number. If 20%, then 30-50 is your number.
I would argue that there's a reason for the general size of military formations: a change of size of about half an order of magnitude per level. Much more than that makes it difficult to know what those subordinate orgs are actually doing. So your product team might be like 8 people. The next level up is 3-5 product teams. Then 3-5 of those. And so forth. It actually scales with remarkably few levels of management. It also allows space for free-form connections between people on other teams.
It's definitely a big ballpark, but I think Dunbar's Number is a good place to start. If you have managers spending 10% or less of their lives on management, I don't think the organization will be very healthy. Management should be a high commitment, high compensation role.
It's also definitely an upper limit rather than a lower limit. Big bureaucracies with many layers of management and small teams can work well, but no one can really individually manage 1,000 subordinates.
> One interesting fact is that the USA population has roughly increased by a multiple of Dunbar's number since its organizational structure was codified in its Constitution. Perhaps time for another look?
I have had that exact same idle thought.
In 1813, each of the 182 US Representatives represented on average ~40,000 Americans. Today, each of our 435 Reps stands for about 760,000 people. That's over an order of magnitude of growth. To keep the same rate of representation, we'd need over 8,000 Representatives, which is clearly too large a body to get anything done.
So we're probably well beyond the point where we could benefit from a large House of Subrepresentatives and then a smaller House of Superrepresentatives that aggregate them.
This is wise, but in the healthcare field, there are some pretty huge trees of things that you need to deal with sometimes. I've been involved with building out a structure a lot like MeSH[0] and some disease trees similar to ICD. Some of my implementations I would definitely do differently now because both the tools and my experience have improved. MeSH's "addresses" even match the ltree syntax, so it would probably make a lot of sense to use that.
This is cool - I hadn't seen the materialized view + CTE trick before, or the ltree extension.
I've implemented this in the past using the classic adjacency list, materialized path or nested sets mechanisms - all by using Django Treebeard, which offers a very robust implementation of all three of them: https://django-treebeard.readthedocs.io/en/latest/
I'm actually using materialized views and CTEs with PhotoStructure (where hierarchies are arguably the cornerstone of the product).
It's pretty funny that I landed on this implementation, given that I spent a couple years building https://github.com/ClosureTree/closure_tree (one of the most popular acts-as-hierarchy ActiveRecord gems), as it (unsurprisingly) uses closure trees.
When CTEs aren't available, closure trees are nice, but boy howdy, does that closure tree table get gigantic with deeper graphs.
If CTEs are available, closure trees don't even come close in performance or simplicity.
Peaches 'n' Stink [1] uses ltree for nested comments. All the PostgreSQL is here [2]. The first query that uses ltree is on line 452 [3].
It was hard to decide on the ltree "path" format. There's a "path" value for each comment and it is what sets up the hierarchy. What I ended up using was this (root level comments):
1.aaa
1.aab
1.aac
1.aad
...
1.aaz
1.aba
1.abb
The "1" is the post ID. Sub-comments under the first comment above would be:
1.aaa.aaa
1.aaa.aab
1.aaa.aac
I used this "path" format because if I sort alphabetically, the comments come out in the correct "vertical" order to display chronologically.
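That makes fetching a whole thread in display order a one-liner; a sketch (table and column names hypothetical, not the actual Peaches 'n' Stink schema):

    -- All comments on post 1, already in display order: each child's path
    -- extends its parent's, so sorting by path interleaves replies correctly.
    select * from comments where path <@ '1' order by path;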
Most datasets have relationships forming a graph, but few need to be analyzed as a graph. Graph databases specialize in the latter use case, with large datasets.
But you could always traverse graphs with recursive SQL — it’s just less pleasant, and perhaps less performant — but often that’s all you need (and often people confuse having a graph relationship with needing a graph-specialized database)
Postgres can handle a lot of generic 'graph' use cases. Unless you have very specialized needs, focusing on complex algorithms and very big network data, it's not at all clear that you'll need more than that.
And even then, it really depends on what you're trying to do.
My use case a few years ago was such that _every_ dedicated graph engine I tried would just OOM on the test set of ~10M nodes, which I attributed to insufficient pruning. Postgres was able to handle it, and still worked on the production set of ~1.5B nodes. The SQL took some tuning, and wasn't _pretty_, but it did _work_ where nothing else did...
This is use case dependent, but one of my favorite methods for implementing e.g. threaded comments is closure tables:
create table post_paths (
    -- one row for every ancestor/descendant pair in the thread,
    -- typically including a depth-0 row linking each post to itself
    descendant_id bigint not null references posts (id) on delete cascade on update cascade,
    ancestor_id   bigint not null references posts (id) on delete cascade on update cascade,
    depth         int    not null,
    primary key (descendant_id, ancestor_id)
);
At the cost of O(n^2) rows per hierarchy, it will allow you to do some pretty nifty things, most notably being able to get a count of all descendants of a parent in O(n) time. Updating entries may be painful, though.
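For instance, the descendant count mentioned above is a single aggregate over the table defined earlier (assuming depth-0 self-rows are stored, and that you add an index on ancestor_id):

    -- Count every descendant of post 42; depth > 0 skips the self-row.
    select count(*) from post_paths where ancestor_id = 42 and depth > 0;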
I like using Modified Preorder Tree Traversal. In Django apps there's a pretty good plugin [1]. Makes working with lots of hierarchical data a breeze. Good admin integration as well. Hook it up to a Quasar q-tree component on the front end and things get really fancy [2].
If you use Postgres + Redshift on AWS (via foreign data wrapper or dblink), it's worth noting that AWS has released recursive CTE support in Redshift. But with that said, 9 levels of UNION ALL ought to be enough for anyone.
Dunno how close it is to being ready for production, but it lets you use Cypher in PostgreSQL, which is a pretty damn nice language for doing many of the things you'd use recursive CTEs for. I don't much care for Neo4j (the "home" database for the Cypher language) but Cypher is good.
I spent some time looking for this specifically to translate Datalog or Prolog to SQL. There’s theoretical work out there but I didn’t find anything plug and play.
I use ltree with UUIDs. What's really annoying is that I needed to replace the '-' characters of the UUIDs with '_' in order to store the hierarchy in the ltree column.
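Presumably something like this (a sketch; gen_random_uuid() is built into PostgreSQL 13+, pgcrypto before that):

    -- ltree labels only allow [A-Za-z0-9_] here, so swap the hyphens out.
    select replace(gen_random_uuid()::text, '-', '_');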
ltree looks interesting. I've used MSSQL's HierarchyID extensively for this sort of thing. Are they similar or is there something else like HierarchyID for PostgreSQL?