
Recursive Postgres Queries - tta
http://blog.timothyandrew.net/blog/2013/06/24/recursive-postgres-queries/
======
chrisfarms
I've implemented similar designs in the past, but found ltree[1] to be much
more useful/expressive when it comes to needing to fetch, order and manage
lists of hierarchical data.

[1]
[http://www.postgresql.org/docs/9.3/static/ltree.html](http://www.postgresql.org/docs/9.3/static/ltree.html)
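
A minimal sketch of what that looks like (the table and label paths here are
made up for illustration):

    CREATE EXTENSION IF NOT EXISTS ltree;

    -- Each row stores its full position in the tree as a label path.
    CREATE TABLE categories (
        id   serial PRIMARY KEY,
        path ltree NOT NULL  -- e.g. 'survey1.cat2.subcat5'
    );
    CREATE INDEX ON categories USING gist (path);

    -- All direct and indirect children of 'survey1.cat2', index-assisted.
    SELECT * FROM categories
    WHERE path <@ 'survey1.cat2'
    ORDER BY path;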

~~~
joevandyk
How do you handle ordering of list elements? (say I have a "position" column
that subcategories need to be sorted by)

------
batbomb
I've dealt with recursive queries many many times. For legacy production
systems I stick to Oracle's CONNECT BY statement and half joins to keep
performance okay. Most of the tables I deal with typically have 80 million
rows with a fanout of 1:2.5 and an average depth of ~3.

A better solution, if you can't afford extra result sets coming back from the
DB, is to create a temporary table, insert results into it as you go, and
perform the additional inserts against that table, effectively taking
advantage of shared memory on the server side. The second-best option is to
flatten the rows out into a closure table.
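
For what it's worth, a rough sketch of the closure-table shape (names are
illustrative, not from the article, and it assumes a plain nodes table
alongside it):

    -- One row per (ancestor, descendant) pair, including each node
    -- paired with itself at depth 0; maintained on insert/move.
    CREATE TABLE node_paths (
        ancestor   integer NOT NULL,
        descendant integer NOT NULL,
        depth      integer NOT NULL,
        PRIMARY KEY (ancestor, descendant)
    );

    -- Fetching an entire subtree is then a plain join, no recursion.
    SELECT n.*
    FROM nodes n
    JOIN node_paths p ON p.descendant = n.id
    WHERE p.ancestor = 42
    ORDER BY p.depth;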

------
just_hobbyst
I don't know Ruby, but does Benchmark.ms clear buffers and flush the cache?
If not, the second query has a large advantage; sometimes you can get similar
performance improvements without changing the query at all...

~~~
tta
Good point! I've flipped the queries around.

The difference is smaller, but still fairly significant (30x).

Thanks for pointing this out.

------
psychometry
This is absurd. Is all this just to eliminate the N+1 query problem? There
are already solutions for this in any ORM.

When loading a survey into an ActiveRecord object, why not just preload the
associations? You can then iterate through the children and subchildren of
the survey in Ruby. Either of these would do the trick:

    Survey.includes(categories: { subcategories: :questions }).find(1)  # one query
    Survey.preload(categories: { subcategories: :questions }).find(1)   # four queries

No need for convoluted Postgres-only queries.

~~~
nieve
This is a somewhat bad example of when to use recursive queries in PostgreSQL,
but they're useful in a lot of places where ActiveRecord's preload can't hack
it, trees included. Examples:

* Recursive queries can update, insert, or delete rows (see the sketch after this list). This is a huge win over doing it in ActiveRecord with large data sets. I've done updates and inserts in under an hour that testing with ActiveRecord indicated would quite literally take days. It's not really black magic, but you have to actually learn SQL beyond simple selects to use this properly.

* Pg's query optimizer with a recursive query and internal data handling performs far better than anything ActiveRecord can portably produce (or produce at all), even if you're lucky enough not to have to retrieve most of your results. Preloading is fine for small data sets, but by the time you hit a few hundred thousand rows you're going to start feeling it. This is one of those places where SQL as set handling and ActiveRecord as ORM diverge more than just conceptually.

* Pulling up rows more than a few steps away from the parents (especially through other tables) tends to move a lot of unnecessary data over the network with preloading, and can produce ridiculous joins.

* "WITH RECURSIVE" can do things Activerecord can't without multiple costly queries round-tripping their results to the client, getting processed there, and fed back into more queries. [http://www.postgresql.org/docs/current/static/queries-with.h...](http://www.postgresql.org/docs/current/static/queries-with.html)

------
sync
Super cool, but it seems like this is throwing technology at a problem that
could be solved in a much simpler way.

Since you will rarely (if ever?) be looking up a single question in a survey,
storing the whole set as a JSON array-of-arrays makes much more sense. Then
you can look up the set of questions with a dead-simple SELECT.
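
Something like this, say (schema purely illustrative):

    -- Denormalize the whole question tree onto the survey row as JSON.
    CREATE TABLE surveys (
        id        serial PRIMARY KEY,
        questions json NOT NULL  -- nested arrays of categories/questions
    );

    -- The full set of questions is then a single-row lookup.
    SELECT questions FROM surveys WHERE id = 1;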

~~~
joevandyk
I'm assuming you'd want to link an answer to a particular question. And have
referential integrity.

JSON is great if you don't need foreign keys to catch integrity problems.

------
btilly
This is basically a tree-like data structure.

Did they look at
[http://www.ibase.ru/devinfo/DBMSTrees/sqltrees.html](http://www.ibase.ru/devinfo/DBMSTrees/sqltrees.html)
and see if it can be adapted to their problem?

~~~
ibotty
Nested sets are not generally faster. For anything other than "get all direct
and indirect children", they are slower than common table expressions
(recursive queries), at least in databases that support CTEs.
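
For reference, that one query nested sets excel at, using the usual lft/rgt
columns from the linked write-up (the nodes table is hypothetical):

    -- All direct and indirect children of node 42 (including the node
    -- itself) is a simple range scan over the lft/rgt bounds.
    SELECT child.*
    FROM nodes parent
    JOIN nodes child ON child.lft BETWEEN parent.lft AND parent.rgt
    WHERE parent.id = 42
    ORDER BY child.lft;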

------
RyanZAG
Please, please, never do this.

Programming like this in SQL is a maintainability nightmare. Can you even
imagine how difficult it would be to modify this to add in new features? I've
worked on projects where people have done 'clever' tricks like this, and it
always ends in a terrible, awful mess. The SQL itself is also non-portable,
which makes it even worse.

Follow the advice of Justin in the comments there and denormalize instead,
which basically means caching the results of deep tree queries at the top
level, where they can be easily and simply understood.

EDIT: If you don't like the idea of denormalizing, another simple alternative
is to create a separate lookup table/model which you can use to easily map
the items back to the correct parent.
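
A sketch of that lookup-table idea (names are invented for the example):

    -- Maintained on write: every question mapped straight to its survey,
    -- so reads never have to walk the tree.
    CREATE TABLE survey_questions (
        survey_id   integer NOT NULL REFERENCES surveys (id),
        question_id integer NOT NULL REFERENCES questions (id),
        PRIMARY KEY (survey_id, question_id)
    );

    SELECT question_id FROM survey_questions WHERE survey_id = 1;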

~~~
nadaviv
I make heavy use of SQL in my daily work. A lot of our business logic lives in
the database as views, triggers and various functions.

I actually find that taking this logic away from the app code and into the
database is one of the best architectural decisions we took.

I think the primary issue some people have with doing that is that they don't
know databases well enough, and consider them a dumb place to store data and
nothing more. That is what makes it difficult to maintain.

One thing that I do agree with is that the SQL code isn't portable, but it's
the same as your Ruby/PHP/NodeJS/whatever code not being portable. I don't
see that as an issue.

~~~
btilly
From a software engineering perspective, this is entirely backwards.

Databases offer a great way of looking at your data.

However, they have very poor stories around version control, rollout,
rollback, separating metadata from data, and creating multiple parallel
environments for production, QA, and every dev engineer. Yes, I know that
database vendors claim otherwise, but compared to what is standard for most
programming environments these days, it is a major limitation.

But it doesn't stop there. If you are lucky enough to get a busy site, you
will find that the database is the inevitable bottleneck that can't simply be
scaled by putting machines in parallel. The more work you push down to that,
the sooner you'll hit that bottleneck, and the harder it is to fix once you
do. When you've experienced, as I have, the joys of periodic outages because
the database choked at a million dynamic pages/day, you'll value moving logic
out of the database, the ability to put caching in front of it, etc.

Now, to all of this the database vendors have a trump card: what happens if
multiple applications access the database? If your business logic is there,
then you're fine; otherwise you're reimplementing that logic multiple times!
The thing is that in the 15 years I've seen people making that argument, I've
not encountered many organizations opening up a business-critical
transactional database to random applications. And in practice it is easier
and saner to put an RPC layer on top of your existing app, in which case you
get the business logic that way.

I guess I should note somewhere that I spent several years of my life with my
primary job being to write fancy SQL. I was also the go-to person on the
software development side for "get us to scale the database better". I do not
criticize databases out of a lack of knowledge of how to use them; I
criticize them based on a large amount of experience with how to make the
most effective use of their capabilities.

~~~
jacques_chester
In general, "ordinary" software has these advantages for the simple reason
that it's ephemeral. The software starts with a clean universe each time you
kill it and relaunch. That makes lots of things much, much easier.

Databases don't get to do that. The data is persistent between instances,
between sessions, and between copies. It never goes away, and the database
must treat every other consideration as secondary to it. If you want to
change the object model of a Ruby program, you change the source and
relaunch. If you want to change the schema of a database, you need to give
the database instructions on what to do with all the existing data that is
now inconsistent.

If you look at the hoops people jump through in always-on software, as opposed
to "restart with #e33cf4" software, it's similarly gnarly.

Relational databases are kissing cousins with heavy-duty type systems. They
make you do your homework and eat your vegetables to forestall classes of
error that lots of people have never encountered and so discount to zero.
Then it's decided that they aren't agile and, fwoosh, out they go.

~~~
btilly
Absolutely.

Which is why I find it works well to have code flow from dev to production,
and databases flow the other way, with all changes starting as well-tested
patches that are then applied in production and propagated elsewhere.

