

Managing Hierarchical Data in MySQL (or really any database) - mattculbreth
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

======
znbailey
The author fails to mention another very common approach that does not suffer
(as badly) from the nested set approach. This strategy is called "materialized
path".

In the materialized path strategy, each node/row has its full path in the tree
stored in a column and that path is "materialized" or "derived" from a depth-
first traversal to that node. In conjunction with the adjacency list approach
this could yield much better results in most situations than a nested set
model.

Taking the tree used in the article, the paths would look like (using '/' for
path delimiter but you can use whatever you like):

    
    
      electronics
      electronics/televisions
      electronics/televisions/tube
      electronics/televisions/lcd
      electronics/televisions/plasma
      electronics/portable electronics
      electronics/portable electronics/mp3 players
      electronics/portable electronics/mp3 players/flash
      electronics/portable electronics/cd players
      electronics/portable electronics/2 way radios
    

Alternatively it would make more sense to base the materialized path off of an
immutable field such as the PKey of the record (in the same order as the paths
above, from the article):

    
    
      1
      1/2
      1/2/4
      1/2/5
      1/2/6
      1/3
      1/3/7
      1/3/7/10
      1/3/8
      1/3/9
    

This has the following advantages:

    
    
      * Node insertion/deletion overhead is low / does not require updating other nodes unlike nested set model
      * Query for subnodes is easy: "select * from tree where path like 'electronics/televisions'". To get only direct subnodes add a parent_id specifier.
      * Getting the full path for a node is easy: it's built into the node itself
      * Node depth is easy: count the number of path "segments"
    

Downsides:

    
    
      * Moving sub-tree requires updating all sub-node paths
    

Hope this helps someone!

Cheers, -Zach

~~~
NyxWulf
Materialized path is a good tool for trees that won't be deeply nested, and
for tables that won't grow particularly large. However, because the path is
stored in a string, read operations are not particularly efficient. It's a
good tool to know, but it doesn't scale as well as Nested Set to very large
datasets. That being set, it's much less complicated to implement than Nested
Sets and is more efficient than adjacency list for many types of data.

~~~
snprbob86
How are reads not particularly efficient? If you use rooted paths in your
lookup, then even a "LIKE '/a/b/c%'" query for all decedents will effectively
utilize the index. I think that this would be good for deeply nested trees
also. As Zach implied, the down side of this approach is moving subtrees.
Unless you have a very volatile tree structure, this should be perfect for
most uses.

~~~
NyxWulf
Because B-Tree indexes perform orders of magnitude better on smaller lookup
values, like say an integer, than they do on large (and even worse variable
length) strings. There are a number of factors that contribute to this, but
two big ones are the raw computation time it takes to compare two strings is
much larger than the comparison of an integer that matches the register size
of the machine. Second, the depth of the B-Tree is dependent on the key size
used for the lookup.

As I said above, if you are using the materialized path for the type of
problem it's best at solving, the speed differences won't matter so much. But
that's primarily because the tree's aren't particularly deeply nested and/or
that tree table itself isn't overly large. So in essence you are trading
computational complexity for ease of use on smaller sets of data. In many
cases that's exactly what's needed. On the other hand, if you will be modeling
very large trees, or will have a huge number of them, nested sets are more
efficient in terms of encoding and storage, as well as lookups and retrievals.
The down side is that nested sets are more complex to setup and work with, and
make understanding the structure of your trees more difficult.

IMHO, it's important not to fall into the trap that one
technology/tool/solution/data structure will solve all of your problems. It's
good to know the pro's and con's of different solutions and which problems
they are most efficient at solving.

------
rjurney
Do yourself a HUGE favor, and just use L-Tree for PostgreSQL instead:
<http://www.postgresql.org/docs/current/static/ltree.html>

~~~
tom_b
Knowing more than one approach for dealing with hierarchies in relational dbs
is a good idea. I've used both nested sets and proprietary approaches (things
like CONNECT BY in Oracle).

Thanks for the link to the PostgreSQL ltree, I wasn't aware of it.

Also the article mentions Joe Celko's books, which I recommend to everyone who
wants to build up their SQL skills.

~~~
rjurney
I agree. I implemented a pre-calculated nested tree, as in the article, based
on an existing PHP implementation but in Perl. In benchmarking it, I found
that it was faster than I thought to update the entire, large tree - but that
the L-Tree was going to be a much better solution overall. L-Tree has that
low-level C, Russian hacker edge going for it :) But your mileage may vary.

The nice thing about the pre-calc stuff is that it is very approachable, and
you can build it yourself. I found doing so very educational and encourage
anyone who is curious to play around.

As someone else mentioned - we also looked at rolling our own materialized
path trees - and did an implementation of this as well, but in the end L-Tree
'just works' out of the box, and is wicked fast compared to a higher level
implementation... so we went with it. Its great. I'm not a Postgres fanatic or
anything - I like MySQL and Postgres, and lots of NoSQL databases too, but
Postgres is great at this in particular.

------
dabeeeenster
This is quite a neat solution, but it has a few detracting points, most
notably mass updates when you want to modify the tree.

It also seems slightly irrelevant; surely it's better just to read the
structure into memory when the application starts and just work on the data
structure in memory?

~~~
derobert
Reading the entire structure into memory brings all kinds of problems with it,
such as knowing when to throw out your in-memory copy because the database
changed. It is also problematic if the tree is large.

Reading the entire tree into memory can be effective if it never changes and
is reasonably sized. But if that's the case, why are you storing it in a
database? Why not just a data file?

~~~
rjurney
In practice... in many applications you can cache parts of the tree in a
session, because it won't change in a single login. If its not too big.

------
jessep
Here's a comparison of two django apps implementing various ways of doing
hierarchies: [http://www.qompr.com/charts/63%3Bdjango-hierarchical-tree-
da...](http://www.qompr.com/charts/63%3Bdjango-hierarchical-tree-data/)

------
nradov
What a mess. IBM DB2, Oracle, and MS SQL Server all support native storage of
hierachical data (as XML). MySQL needs to catch up.

------
leej
good article <http://www.sitepoint.com/article/hierarchical-data-database/>

------
smithjchris
I've written an implementation of nested sets for at least 3 DB engines over
the years (SQL Server, MySQL and PostgreSQL).

Some of that article is plainly dangerous and I wouldn't recommend following
it without some extensive research elsewhere. SQL for Smarties is the usual
reference.

Firstly they seem to insist on MyISAM which is just crap. I don't need to cite
any references on that.

Secondarily, they use LOCK TABLE whilst modifying the data. That basically
blocks other writes yet doesn't give any transactional capabilities. Due to
the denormalised nature of the nested set implementation, you NEED
transactions to ensure operations are ACID compliant or you risk tree
corruption.

Thirdly, they do not discuss in detail the performance constraints of nested
sets i.e. it is purely a read-optimised. When nested sets start growing in
node count, the modification operations do not scale as a significant chunk of
the tree needs to be renumbered inside a transaction.

~~~
NyxWulf
I've done the same, and I agree with your post entirely. There are a few
tricks to making nested sets perform reasonably well on large data sets.

I've tested this model on trees that contain up to 100 million items within
the scope of a single tree, and I have a pretty good feel for how well it
would perform beyond that.

\- The first and most important consideration is what defines the scope of a
tree. For example, if you are building a discussion board, and using nested
sets (I've done this) you would want each post/reply set to be a separately
scoped tree. That decision alone vastly reduces much of the write contention.

\- Next, if the tree is going to be heavily used, store the tree in a separate
table with each row fixed width and the primary key of the item it refers to.
Databases are much more efficient at reads and writes when rows are fixed
width and do not contain nullable or variable length data.

\- Additionally it's important to maintain the data for the adjacency list
within your structure, so that if you ever have errors or corruptions with
your tree, you can rebuild it correctly.

\- Next if you need to insert some type of sub-tree into a larger tree. You
can compute the sub-tree and then insert it into the larger tree in a single
operation, instead of doing multiple inserts that lock the entire tree. This
is a relatively difficult optimization and something I don't typically build
until it becomes necessary.

\- Spend some time working through the math of how Nested Sets work. It's
relatively simple, and really understanding it will help you implement
efficient bulk operations on your trees.

\- It's important to get a write lock on the tree before you get your left and
right values for the tree. If you get your left and write values before you
get exclusive access to the data you are working with, someone else could have
done an insert while you were building your tree, and when you do your insert,
you will have corrupted the entire tree.

\- Finally if you are interested in this topic, I cannot recommend Joe Celko's
books highly enough. SQL For Smarties has a thorough discussion of the
different models. Additionally he has a separate book that only deals with
Hierarchical data models, and he delves further into the various issues and
implementation tricks with these data structures.

~~~
smithjchris
Some excellent suggestions there. Many thanks :-)

