Hacker News new | past | comments | ask | show | jobs | submit login
Hierarchical Structures in PostgreSQL (2020) (hoverbear.org)
248 points by belter on June 25, 2021 | hide | past | favorite | 50 comments



A great overview of the pros and cons of different approaches is given in https://de.slideshare.net/billkarwin/models-for-hierarchical...


The author of the deck has a book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

It’s one of the best books for data model design. Highly recommend.


This is a great slide deck and directly follows what I learned about this over the years. In nearly every case I think you want a flattened table of the hierarchy. I hadn't heard the name closure table before so thanks.


Not sure the age on that slide deck but Sqlite has definitely supported recursive table expressions for years now. In some ways it is more permissive with syntax allowed in the recursive portion than Postgres.


Yes, many RDBMS's have gotten support for recursive CTE's as of recently. It's definitely not a niche feature anymore.


Recursive CTEs are almost never what you want. They're the database equivalent of pointer-chasing in code. If you design your schema such that you need a recursive CTE to query it, be prepared for bad performance unless your CTE only ever does a handful of iterations over a handful of rows.

I like paths for representing hierarchy, but closure tables can also be a good idea, depending on what you're modelling and how you query it.


> They're the database equivalent of pointer-chasing in code.

That's the general case, but more specific CTE queries can be optimized, e.g. by adding database indexes. Recent versions of Postgres have greatly improved wrt. not making CTE's overly inefficient.


I said recursive CTEs. CTEs being an optimization barrier is a different issue - and often desirable with Postgres with its lack of optimization hints. Hence with not / materialized etc.


Yeah, and it's a straightforward standardized way to solve this.

But I still think it's useful to know the flattening idea. I've seen it show up other places like map reduce. It's a useful general idea imo.


I'm a big fan of the nested set representation. It hits the sweet spot of compactness and expressivity - queries for parents, children, ancestors and descendants are simple integer comparison and pretty fast with the right indices.

The only downside is that this requires updates of a large number of rows whenever the tree changes (not all rows though). If you have a sharded table that could be problematic.



As there is no context, FYI for others: The video link is the presentation just above.


If this topic interests people, take a look at Joe Celko's book called Trees and Hierarchies: https://www.amazon.com/Hierarchies-Smarties-Kaufmann-Managem...


Available on what was formerly O'Reilly Safari: https://learning.oreilly.com/library/view/joe-celkos-trees/9...


I feel that a discussion of ltree is incomplete without discussion of manipulating the tree. e.g. splicing, cutting, moving, deleting branches [1] etc. Also, how ltree uses Gist indexes [2].

Reference: [1] http://patshaughnessy.net/2017/12/14/manipulating-trees-usin... [2] http://patshaughnessy.net/2017/12/15/looking-inside-postgres...


There's a good chapter on this in SQL Antipatterns, it covers four or five different approaches in depth.

I always just use a parent_id and recursive cte - it's plenty fast even without materialised views.

https://pragprog.com/titles/bksqla/sql-antipatterns/


Careful with CTEs in older postgres. Until the most recent versions they were an optimisation fence (predicate push-down would not pass through, meaning the CTE would often result in an index or table scan, potentially multiple times in the recursive case. This can be a huge problem over large data, though obviously in many circumstances with small enough data it is still efficient enough.


That concern doesn't really apply to recursive CTEs though.

Or rather, recursive CTEs inherently require that you select only what you need.

A little difference that the non-recursive case of "I'll just use a CTE for same functionality but better readability" which can deceptively bite you prior to PostgreSQL 12.


Most hierarchies are shallow in practice. Keep it simple.


tl;dr: I agree and healthy human organizations should never scale beyond ~5 degrees of hierarchy, which is totally manageable via basic recursive JOINs in a RDBMS without fancy stored procedures or graph theory.

I like to use Dunbar's Number (100-250) to approximate the levels of heirarchy in human organizations. The idea is that these organizations are most efficient when organizational layers don't exceed ~150 elements, due to the implementation details of the human brain.

Basically, you can do log_{150}(N) to get a very rough idea of how complex the organization of N people should be. This works for small startups and entire countries. Of course, startups should probably get comfortable with the idea of teams well before hitting >100 employees. Teams can then scale into departments (with new subteams), and once there are many departments, add regional layer, strategic/executive layer, and so on.

One interesting fact is that the USA population has roughly increased by a multiple of Dunbar's number since its organizational structure was codified in its Constitution. Perhaps time for another look?


I would argue that Dunbar's Number is the wrong number to use for this. At least not by just naively dropping it in.

Remember, it represents the total number of stable social relationships a person can maintain. If you're looking to allow your employees to have personal lives, you'll want to leave ample room for their family and friends.

Maybe an important question to ask is, how much of your employees' social-emotional carrying capacity is it appropriate to consume? If 10%, then 15-25 is your number. If 20%, then 30-50 is your number.


I would argue that there's a reason for the general size of military formations: a change of size of about half an order of magnitude per level. Much more than that for sub ordinate organizations masks it difficult to know what those orgs are actually doing. So your product team might be like 8 people. Next level up us 3-5 product teams. The 3-5 of those. And so forth. It actually scales with remarkably few levels of management. It also allows space for free form connections between people on other teams.


It's definitely a big ballpark, but I think Dunbar's Number is a good place to start. If you have managers spending 10% or less of their lives on management, I don't think the organization will be very healthy. Management should be a high commitment, high compensation role.

It's also definitely a upper limit rather than lower limit. Big bureaucracies with many layers of management and small teams can work well, but no one can really individually manage 1000 subordinates.


> One interesting fact is that the USA population has roughly increased by a multiple of Dunbar's number since its organizational structure was codified in its Constitution. Perhaps time for another look?

I have had that exact same idle thought.

In 1813, each of the 182 US Representatives represented on average ~40,000 Americans. Today, each of our 435 Reps stands for about ~760,000 people. That's over an order of magnitude growth. To keep the same rate of representation, we'd need to have over 8,000 Representatives, which is clearly too large a body to get anything done.

So we're probably well beyond the point where we could benefit from a large House of Subrepresentatives and then a smaller House of Superrepresentatives that aggregate them.


Well isn’t that why we have limited federal government and defer most decisions to the states?


Who should then defer to county/parish, who then defer to city, who then defer to neighborhood


Right, that scales better than 8000 representatives in the house.


This. Throughout my years it was very common to see juniors trying to model static website dropdown menus as recursive database tables.

In practice if a menu has more than 3 levels the user experience will suffer. I keep them at 2 submenus max, possibly 1 if I can get away with it.

Not to mention most menus are not dynamic, meaning they can be just a JSON file or simple HTML.


This is wise, but in the healthcare field, there are some pretty huge trees of things that you need to deal with sometimes. I've been involved with building out a structure a lot like MeSH[0] and some disease trees similar to ICD. Some of my implementations I would definitely do differently now because both the tools and my experience have improved. MeSH's "addresses" even match the ltree syntax, so it would probably make a lot of sense to use that.

[0] https://www.nlm.nih.gov/mesh/intro_trees.html


I feel your pain. I consulted in a hospital management software and they had trees spanning 6+ levels deep and it had to be dynamic too.


This is cool - I hadn't seen the materialized view + CTE trick before, or the ltree extension.

I've implemented this in the past using the classic adjacency list, materialized path or nested sets mechanisms - all by using Django Treebeard, which offers a very robust implementation of all three of them: https://django-treebeard.readthedocs.io/en/latest/


I'm actually using materialized views and CTE with PhotoStructure (where hierarchies are arguably the cornerstone of the product).

It's pretty funny that I landed on this implementation, given that I spent a couple years building https://github.com/ClosureTree/closure_tree (one of the most popular acts-as-hierarchy ActiveRecord gems), as it (unsurprisingly) uses closure trees.

When CTE isn't available, closure trees are nice, but boy howdy, does that closure tree table get gigantic with deeper graphs.

If CTE is available, closure trees don't even come close in performance and simplicity.

(Hint: materialized paths should use a unique separator: ASCII 0x1F is applicable: https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_...)


Peaches 'n' Stink [1] uses ltree for nested comments. All the PostgreSQL is here [2]. The first query that uses ltree is on line 452 [3].

It was hard to decide on the ltree "path" format. There's a "path" value for each comment and it is what sets up the hierarchy. What I ended up using was this (root level comments):

1.aaa

1.aab

1.aac

1.aad

.

.

.

1.aaz

1.aba

1.abb

The "1" is the post ID. Sub-comments under the first comment above would be:

1.aaa.aaa

1.aaa.aab

1.aaa.aac

The reason why I used this "path" format is because if I sort alphabetically then the comments will be in the correct "vertical" order to display chronologically.

[1] https://www.peachesnstink.com

[2] https://github.com/ferg1e/peaches-n-stink/blob/master/db/ind...

[3] https://github.com/ferg1e/peaches-n-stink/blob/master/db/ind...


This is the kind of thing I thought graph databases were good at. postgres extensions for this would be very interesting.


Most datasets have relationships forming a graph, but few need to be analyzed as a graph. Graph databases specialize for the latter usecase, with large datasets.

But you could always traverse graphs with recursive SQL — it’s just less pleasant, and perhaps less performant — but often that’s all you need (and often people confuse having a graph relationship with needing a graph-specialized database)


Postgres can handle a lot of generic 'graph' use cases. Unless you have very specialized needs, focusing on complex algorithms and very big network data, it's not at all clear that you'll need more than that.


And even then, it really depends on what you're trying to do.

My use case a few years ago was such that _every_ dedicated graph engine I tried would just OOM on the test set of ~10M nodes, which I attributed to insufficient pruning. Postgres was able to handle it, and still worked on the production set of ~1.5B nodes. The SQL took some tuning, and wasn't _pretty_, but it did _work_ where nothing else did...


This is use case dependent, but one of my favorite methods for implementing e.g. threaded comments is closure tables:

  create table post_paths (
      descendant_id bigint not null references posts (id) on delete cascade on update cascade,
      ancestor_id bigint not null references posts (id) on delete cascade on update cascade,
      depth int not null,
      primary key (descendant_id, ancestor_id)
  );
At the cost of O(n^2) rows per hierarchy, it will allow you to do some pretty nifty things, most notably being able to get a count of all descendants of a parent in O(n) time. Updating entries may be painful, though.


I like using Modified Preorder Tree Traversal. In Django apps there's a pretty good plugin [1]. Makes working with lots of hierarchical data a breeze. Good admin integration as well. Hook it up to a Quasar q-tree component on the front end and things get really fancy [2].

[1] https://django-mptt.readthedocs.io/en/latest/

[2] https://next.quasar.dev/vue-components/tree#basic


If you use Postgres + Redshift on AWS (via foreign data wrapper or dblink) it's worth noting that AWS has released recursive CTE support in Redshift. But with that said 9 levels of UNION ALL ought to be enough for anyone.


Is any one aware of a language (or dsl) which will compile to recursive CTEs?

It'd be helpful to see queries at the domain level to understand how they fit into a problem; rather than this weird embedding of some implicit idea.


There's the AGE plugin:

https://age.apache.org

Dunno how close it is to being ready for production, but it lets you use Cypher in PostgreSQL, which is a pretty damn nice language for doing many of the things you'd use recursive CTEs for. I don't much care for Neo4j (the "home" database for the Cypher language) but Cypher is good.


I spent some time looking for this specifically to translate Datalog or Prolog to SQL. There’s theoretical work out there but I didn’t find anything plug and play.


Not exactly plug and play, but for BigQuery there's now Logica https://opensource.googleblog.com/2021/04/logica-organizing-...


Thanks, this is great — and it claims experimental support for SQLite which is where I want it the most.


This seems a lot cleaner than the suggestions on Stack Overflow!

https://stackoverflow.com/questions/54907495/postgresql-recu...

Edit: Sorry for being unclear. By "this" I meant the ltree solution that's listed second in the blog post.


The first suggestion of the post is exactly the same as the one of Stack Overflow.


I use ltree with UUIDs. What's really annoying is that I needed to replace the '-' characters of the UUIDs to '_' in order to store the hierarchy in the ltree column.


ltree looks interesting. I've used MSSQL's HierarchyID extensively for this sort of thing. Are they similar or is there something else like HierarchyID for PostgreSQL?


Pro tip: please don't use comic sans (or derivative) for serious blog posts.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: