
Tree structure query with PostgreSQL - dpeck
http://truongtx.me/2014/02/28/tree-structure-query-with-postgresql/
======
NhanH
I've been playing with some more advanced feature in postgres (lately, mostly
json and jsonb related), and one of the main aspect I'm concern about is the
potential pitfalls -- specifically performance pitfalls. Mostly because I feel
like there are too many magics around, and I have no idea how much overhead
for any action is.

The advanced features allow you to have more than one way to do things,
sometimes in a slightly more concise way (I've been tempting to use jsonb type
in place of simple many-to-many tables if one of the values are just always be
string). It's not that I'm expecting my app to be _web scale_ or anything, but
I'd rather not having to redesign everything one a query choke on a million
records. And most of the docs tend to be light on performance characteristics.

As performance is one of the most important aspect in database system in
general (which I'd say on par with consistency, and way a head of features),
it'd be nice to have docs focus more on not just how to use features, but when
it shouldn't be used.

~~~
sergiosgc
Your concern is valid. Whenever complexity is abstracted away you may suddenly
introduce an exponential order algorithm in an otherwise simple looking query.

Having said so, Postgresql has actually surprised me very little throughout
time. The planner's "reasoning" is quite comprehensible. Inevitably I trigger
a slow plan here and there but, once identified, correction is usually
straightforward. My overall opinion is that postgresql has quite solid
theoretical foundations and it shows.

What I mean is that, in advanced stuff, there'll always be some "magic", and
that postgresql has the least amount of black box magic I've ever seen on a
RDBMS (I'm looking at you, Oracle).

(Postgresql has been my non-clustered database of choice since about 2004. I'm
definitely biased.)

------
ThePhysicist
Interesting article! By the way, SQLite also has support for recursive queries
(and allows even slightly more advanced stuff than Postgres, e.g. ordering of
results), which you can use to do many amazing things.

I recently wrote a blog article about this, where I use recursive common table
expressions solve the "eight queens" problem in SQL (using both Postgres and
SQLite):

[http://www.andreas-dewes.de/en/2015/queens-of-the-data-
age-a...](http://www.andreas-dewes.de/en/2015/queens-of-the-data-age-abusing-
common-table-expressions-to-solve-the-eight-queens-problem-in-sql/)

------
chx
I am not sure of performance. What I do know is that Vadim Tropashko's SQL
Design Patterns has a few ideas on how to encode a materialized path see
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.643&rep=rep1&type=pdf)
. Also, bojanz and me have ideas here, see
[http://bojanz.wordpress.com/2014/04/25/storing-
hierarchical-...](http://bojanz.wordpress.com/2014/04/25/storing-hierarchical-
data-materialized-path/) and I know that these are as fast as they can be.

~~~
wolfhumble
This chapter seems to give more detail on "Nested Intervals Tree Encoding":
[https://vadimtropashko.files.wordpress.com/2011/07/ch5.pdf](https://vadimtropashko.files.wordpress.com/2011/07/ch5.pdf)

Via a comment from what seems to be from Vadim Tropashko, here:
[https://communities.bmc.com/docs/DOC-9902](https://communities.bmc.com/docs/DOC-9902)

The book itself seems to be found here (where "trees in sql" is found in
chapter 5): [http://rampant-
books.com/book_0601_sql_coding_styles.htm](http://rampant-
books.com/book_0601_sql_coding_styles.htm)

I have no connection to Vadim Tropashko, just interested in knowing more about
"Nested Intervals Tree Encoding".

~~~
chx
That ch5 pdf is from the book I mentioned, I didn't know Mr. Tropashko made it
available for free, great! (We have conversed briefly in 2006 when I was
designing the menu tree system for Drupal 6 and he sent me the draft of that
chapter.)

------
Gracana
That's the method that Disqus used as well.[1] Later on they used a
materialized path solution [2], but I remained impressed that the recursive
query worked so well at their scale, enough so that I used the same method in
a personal project that had a hierarchical comment system. Happily, it wasn't
much of a hack to do this in Django... the raw query support is quite good;
you don't have to sacrifice the ORM to do something like this.

[1] [http://cramer.io/2010/05/30/scaling-threaded-comments-on-
dja...](http://cramer.io/2010/05/30/scaling-threaded-comments-on-django-at-
disqus/)

[2] [http://cramer.io/2012/04/08/using-arrays-as-materialized-
pat...](http://cramer.io/2012/04/08/using-arrays-as-materialized-paths-in-
postgres/)

~~~
pjungwir
To toot my own horn, I wrote a blog post that takes their approach and extends
it to sort sibling comments by upvotes, like on HN or Reddit:

[http://illuminatedcomputing.com/posts/2014/09/postgres-
cte-f...](http://illuminatedcomputing.com/posts/2014/09/postgres-cte-for-
threaded-comments/)

It's trickier than you might expect!

~~~
Gracana
Oh awesome, my solution to that is pretty similar. I also added a couple
window functions in the final SELECT statement to get each comment's sibling
count and position, which my application code uses to do pagination. It's
definitely very puzzle-y stuff. Time consuming, but fun when you get a nice
solution!

------
hobs
I have never had great performance with recursive queries in SQL Server, does
postgre perform better in this respect?

I have seen a few posts on HN regarding recursive CTEs recently and I shy away
from them because it seems like an additional invocation for each different
recursive lookup and a lot of the queries can be rewritten and potentially
optimized using more traditional approaches.

~~~
fleetfox
Take a look at this: [http://cramer.io/2010/05/30/scaling-threaded-comments-
on-dja...](http://cramer.io/2010/05/30/scaling-threaded-comments-on-django-at-
disqus/)

~~~
hobs
To summarize for anyone reading this thread, and thank you fleetfox for
posting this:

"Yes, you heard that right, the SQL statement is 5x more performant at the
database level alone."

I will have to do more testing and see what I can accomplish.

------
acjohnson55
I haven't done nearly enough research to back this point of view, but it seems
to me that if you really need to make a recursive query over a data tree, you
should maybe reconsider how that tree is being modeled. The closure table
model [1] is simple to query non-recursively, and unless your tree is
pathological, it doesn't add that mcuh space overhead. I suppose though that a
built-in recursive query might not be so bad though, if it's just a whole
bunch of indexed row lookups within one query.

Perhaps there are situations where recursive queries really are the only
solution. I'd love to hear some examples.

[1]
[http://technobytz.com/closure_table_store_hierarchical_data....](http://technobytz.com/closure_table_store_hierarchical_data.html)

~~~
jsherer
How do you determine if your tree is pathological? I.e., what types of tree
structures would you find the closure table model to be pathological enough to
warrant other modeling options?

~~~
acjohnson55
Good question -- so the most pathological tree of _n_ nodes would be one with
depth of _O(n)_. If this was the case, the closure table would be of size
_O(n^2)_. In tree applications I can think of--like an org chart or site
comments--the depth is _O(log(n))_ , in which case the closure table is just
_O(n)_. Note that this same kind of pathological tree would be hell for a
recursive query too, just trading space for time.

------
jperras
For trees that have few inserts/deletes and a large number of reads (like most
threaded comments), I'd probably go with a modified preorder tree traversal
(MPTT) implementation.

With an MPTT structure you can query the entire tree (or subsets of said tree)
with a single non-recursive SQL query, at the expense of having to use
multiple statements to perform inserts and/or deletions. If you're really
concerned about insert performance (which can be an issue for large trees),
there are always hybrid approaches where you use adjacency lists for
maintaining newly inserted nodes and then run an out-of-band operation to
rewrite the relevant MPTT tree values.

*note: MPTTs used to be all the rage maybe 6-8 years ago, and it's like everyone forgot about them. What's up with that?

------
scg
PostgreSQL has a built-in ltree contrib datatype. Wouldn't that work just as
well?

~~~
gtaylor
ltree isn't going to be a one-size-fits-all if you need more than one or two
simple sorting conditions. Also, since ltree is really a glorified char field,
you end up sorting by character ordinals instead of full values.

Of course you can slice and dice the ltree string up, but at what point do you
just use a CTE?

------
bitL
How fast is this comparing to nested sets? Some time ago I did my own nested
sets implementation in Oracle and for 100k+ records the path from a leaf to
root as well as all sub-tree nodes retrieval took seconds with WITH and
milliseconds with nested sets (2-3 orders of magnitude faster).

------
databass
Recursive CTEs are a cool feature, but it's a shame the syntax is so awful.
You really have to work to understand what the query is trying to do.

Other query languages that make recursion much more bearable are SPARQL (on
RDF data), Cypher (Neo4j) or Datalog (Datomic, Cascalog, etc). Would love to
see more of those query languages being used.

------
bifel
I think it's a shame that many database courses don't even mention advanced
features like CTEs or Window functions.

It makes me cringe anytime I see a statement with complicated nested self
joins to solve a problem that should not exist because most current database
systems implemented the relevant standards years ago.

------
JohnHaugeland
Better off with Closure Trees (aka Closure Tables.)

Start at page 48.

[http://www.slideshare.net/billkarwin/sql-antipatterns-
strike...](http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back)

