
Understanding Recursive Queries in PostgreSQL - fdr
https://www.cybertec-postgresql.com/en/recursive-queries-postgresql/
======
Dowwie
This post is only scratching the surface of what can be achieved with a..
fully armed, and operational battle station.. powered by recursive SQL in
postgres. Simply adding a parent_id column to a table (and a few constraints)
can facilitate all sorts of useful network topologies. I doubt that a
recursive query against a DAG is going to give better performance than a query
against the same DAG in a graph database, but the relational query may be
sufficient for your work without the cost of investing in a graph database.

I think that people need to move beyond simple recursive sql example blog
posts and share the more challenging examples. Recursive queries can become
challenging quickly.
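
As a sketch of that parent_id pattern (table names and data here are invented;
SQLite is used so the snippet is self-contained, but Postgres accepts the same
WITH RECURSIVE query):

```python
import sqlite3

# Hypothetical self-referencing table, as described above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE categories (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES categories(id),
        name TEXT NOT NULL
    );
    INSERT INTO categories VALUES
        (1, NULL, 'electronics'),
        (2, 1,    'computers'),
        (3, 2,    'laptops'),
        (4, 1,    'cameras');
""")

# Recursive CTE: seed with the root, then repeatedly join children
# onto the rows produced so far.
rows = con.execute("""
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM categories WHERE id = 1
        UNION ALL
        SELECT c.id, c.name, s.depth + 1
        FROM categories c JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth, name
""").fetchall()
print(rows)  # [('electronics', 0), ('cameras', 1), ('computers', 1), ('laptops', 2)]
```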

~~~
e1g
In Postgres, there is a data type ltree[0] which can materialize the full path
in a DAG and help do relatively complex searches. Tree coordinates are
typically stable, so it's easier and faster to recalculate the path on writes
than on reads. We use parent_id + ltree to search hundreds of thousands of
relationships in milliseconds.

ltree is no match for CTEs in power, but it is perfect for a simple domain like
the one used in the article.

[0]
[https://www.postgresql.org/docs/current/ltree.html](https://www.postgresql.org/docs/current/ltree.html)
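
ltree itself is Postgres-only, but the write-time idea generalizes: keep a
materialized-path column computed once on insert, then answer subtree reads
with prefix matches. A toy sketch (invented schema, SQLite for
self-containment - real ltree adds richer operators such as `@>` and lquery
matching):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES nodes(id),
        name TEXT NOT NULL,
        path TEXT NOT NULL  -- materialized path of ids, e.g. '1.2.3'
    )
""")

def insert_node(name, parent_id=None):
    # Compute the full path once, at write time.
    cur = con.cursor()
    prefix = ""
    if parent_id is not None:
        (parent_path,) = cur.execute(
            "SELECT path FROM nodes WHERE id = ?", (parent_id,)).fetchone()
        prefix = parent_path + "."
    cur.execute("INSERT INTO nodes (parent_id, name, path) VALUES (?, ?, '')",
                (parent_id, name))
    node_id = cur.lastrowid
    cur.execute("UPDATE nodes SET path = ? WHERE id = ?",
                (prefix + str(node_id), node_id))
    return node_id

root = insert_node("root")   # path '1'
a = insert_node("a", root)   # path '1.2'
b = insert_node("b", a)      # path '1.2.3'

# Reads become prefix matches instead of recursion:
descendants = con.execute(
    "SELECT name FROM nodes WHERE path LIKE ? ORDER BY path",
    (f"{root}.%",)).fetchall()
print(descendants)  # [('a',), ('b',)]
```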

~~~
Tostino
I 2nd this recommendation. If nothing else, it helps to have fewer recursive
queries, which are still treated as optimization barriers when you need to do
complex queries joining to your DAG. So a nice indexable ltree is much better
for performance when your queries get complicated.

------
koeng
I use recursive queries in SQL to search persistent suffix arrays (mainly when
there is a LOT of data). You can string-search hundreds of gigabytes of data
within ~0.004s with SQLite on a cheap SSD and pretty much no RAM - my goal is
to be able to search and align against all the genetic information we have so
far (about ~9 trillion base pairs in total).

If anyone knows how to increase INSERT speeds on Postgres, I would appreciate
it. I can get about 100K rows per second with COPY or by putting the inserts
in a transaction. I have tried mirrored NVMe drives and more RAM with ZFS, as
well as dropping some unimportant indexes, but I'm still a little limited.
Any thoughts?

~~~
mulmen
Why ZFS? I thought copy-on-write filesystems were discouraged for database
workloads. I know it is possible but are you getting other benefits from ZFS?

~~~
koeng
I have a single physical server for running all my virtual machines. ZFS is
nice for everything else, and it just so happens that my database virtual
machine takes up most of the SSD space on that computer. I guess I should have
a dedicated machine if I really need to improve insert speeds though.

~~~
mulmen
Ah ok, pragmatic. Makes sense.

Have you done things like disable constraints and indexing during loading?

~~~
koeng
I've disabled all constraints, but I haven't removed indexing during loading.
I'll look into doing that and adding the indexes back afterwards.
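
For what it's worth, the drop-indexes-then-bulk-load-then-reindex pattern
looks roughly like this (toy schema and data, SQLite for a self-contained
example; on Postgres the equivalent would be DROP INDEX, COPY, CREATE INDEX):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE kmers (seq TEXT, pos INTEGER)")
con.execute("CREATE INDEX kmers_seq ON kmers (seq)")

rows = [("ACGT%06d" % i, i) for i in range(50_000)]

# Drop the index, bulk insert inside a single transaction, then
# rebuild the index once at the end instead of row by row.
con.execute("DROP INDEX kmers_seq")
with con:  # one transaction for the whole batch
    con.executemany("INSERT INTO kmers VALUES (?, ?)", rows)
con.execute("CREATE INDEX kmers_seq ON kmers (seq)")

(count,) = con.execute("SELECT COUNT(*) FROM kmers").fetchone()
print(count)  # 50000
```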

~~~
mulmen
I don’t often work with Postgres itself but I do work with a derivative
(Redshift) at $dayjob.

Indexes don’t exist. Constraints are present in the syntax and the planner
relies on them but they are never enforced. Our ETL layer deals with checking
constraints at load time. Load performance is good. I’ve never run a
rows/minute metric because it’s never been a problem.

~~~
mulmen
Coming back to this just to provide a datapoint. I just loaded 10m rows into a
table in less than one minute using Redshift. The data is encoded (compressed)
at insert time but is not sorted until a vacuum runs.

------
fabian2k
Without benchmarking it, my first instinct would still be to design the schema
in a way to avoid recursive queries, if possible. I'm thinking of stuff like a
simple tree, e.g. categories with subcategories arbitrarily nested. In that
case I'd probably just store all the parents as well, not just the immediate
category. That is more work in the insert/update case, but makes queries that
include all children trivial.
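
That design is essentially a closure table: alongside the categories, store
one row per (ancestor, descendant) pair and maintain those rows on writes. A
toy sketch (invented names and data, SQLite for self-containment) showing why
the "all children" query becomes trivial:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    -- One row per (ancestor, descendant) pair, maintained on insert/update.
    CREATE TABLE category_ancestors (ancestor_id INTEGER, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'top'), (2, 'mid'), (3, 'leaf');
    INSERT INTO category_ancestors VALUES (1, 2), (1, 3), (2, 3);
""")

# "All children of 'top'" needs no recursion at all:
kids = con.execute("""
    SELECT c.name FROM categories c
    JOIN category_ancestors a ON a.category_id = c.id
    WHERE a.ancestor_id = 1 ORDER BY c.name
""").fetchall()
print(kids)  # [('leaf',), ('mid',)]
```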

~~~
hansvm
> Without benchmarking it, my first instinct would still be to design the
> schema in a way to avoid recursive queries, if possible.

In my experience that's a good idea. The query planner has a really hard time
with recursive queries.

------
barrkel
Recursive queries are better mentally modelled as iterated queries building up
a result set. If you think of recursion as tracing a tree, or in trivial
cases, a linked list, recursive queries are a breadth-first traversal.

You better not traverse too deep (iterate too much), or you lose the data
locality benefits that relational databases are tuned for, and all you end up
saving is the cost of round trips to the database.

Many tree structures are better modelled as prefix pattern matching on string
paths. Extracting a whole subtree with "path LIKE 'foo/bar/baz/%'" on an
indexed path column is much faster than following a series of parent links
with a recursive query, more and more so the deeper your subtree gets. This
denormalizes the tree structure into the path column - you can no longer make
structural edits with small updates - but it's much faster for queries.
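
A self-contained illustration of that trade-off (invented data, SQLite;
whether LIKE can actually use the index depends on the database's
prefix-matching optimization rules):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER, path TEXT);
    CREATE INDEX nodes_path ON nodes (path);
    INSERT INTO nodes VALUES
        (1, NULL, 'foo'),
        (2, 1,    'foo/bar'),
        (3, 2,    'foo/bar/baz'),
        (4, 3,    'foo/bar/baz/qux'),
        (5, 1,    'foo/other');
""")

# Whole subtree under foo/bar/baz as a single prefix scan:
like_rows = con.execute(
    "SELECT id FROM nodes WHERE path LIKE 'foo/bar/baz/%'").fetchall()

# The same subtree via parent links needs an iterated/recursive query:
cte_rows = con.execute("""
    WITH RECURSIVE sub(id) AS (
        SELECT id FROM nodes WHERE id = 3
        UNION ALL
        SELECT n.id FROM nodes n JOIN sub s ON n.parent_id = s.id
    )
    SELECT id FROM sub WHERE id != 3
""").fetchall()
print(like_rows, cte_rows)  # [(4,)] [(4,)]
```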

~~~
breatheoften
I have some tree data represented in jsonb fields -- is it possible to build a
materialized view on top of recursive traversal of a jsonb data structure --
to extract a view like this ...?

owner_id, element_path, element_id

where element_path would be something like an LTree -- to allow easily
querying over the set of traversals defined in the jsonb?

Is it a reasonable way to represent the tree in a convenient form for querying
purposes while also allowing data updates to stay simple?
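
One hedged sketch of the extraction step: in Postgres the defining query of
such a materialized view would be a recursive CTE over jsonb_each, but the
shape of the output is easier to show in Python (the field names here - id,
children - are invented, as is the owner_id):

```python
import json, sqlite3

doc = {"id": "root", "children": [
    {"id": "a", "children": [{"id": "a1", "children": []}]},
    {"id": "b", "children": []},
]}

def flatten(owner_id, node, prefix=""):
    """Yield (owner_id, element_path, element_id) rows from a nested doc."""
    path = f"{prefix}.{node['id']}" if prefix else node["id"]
    yield (owner_id, path, node["id"])
    for child in node["children"]:
        yield from flatten(owner_id, child, path)

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tree_paths (owner_id INT, element_path TEXT, element_id TEXT)")
con.executemany("INSERT INTO tree_paths VALUES (?, ?, ?)", flatten(42, doc))

# Traversals in the jsonb become prefix queries over the flat view:
rows = con.execute("""
    SELECT element_id FROM tree_paths
    WHERE element_path LIKE 'root.a%' ORDER BY element_id
""").fetchall()
print(rows)  # [('a',), ('a1',)]
```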

~~~
barrkel
I did approximately that, but with actual columns, not jsonb, and with a
normal view, not a materialized view. Performance sucked on the view because
of all the lookups.

So you could create the materialized view which builds the path, and then use
an index on the path for subtree selection. However you might find refreshing
the materialized view slower than you'd like.

The only real answer here is to build it, load it up with a representative
sample of data (or a set scaled down for a smaller test machine), and find out
what the latency and throughput are like for your case. Find out what the
ground truth is for you, today, on today's hardware. My tests were from 4+
years ago.

------
yeswecatan
A well-written, simple introduction to something I've been curious about but
never took the time to dig into. Thank you!

------
remus
A few years ago the company I was working for ran an advanced SQL course for a
few people who were interested. I was skeptical going in, but the chap who ran
the course was the real deal.

One particularly memorable example he showed was a real world implementation
of Dijkstra's shortest path algorithm using recursive CTEs. While it was very
impressive I did spend most of the demo thinking "I'm so glad I don't have to
maintain that"!
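
For the curious, shortest paths via a recursive CTE usually look less like
textbook Dijkstra and more like Bellman-Ford: enumerate candidate paths, then
take the minimum cost per destination. A toy, definitely-not-the-instructor's
version (invented graph, SQLite for self-containment):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT, cost REAL);
    INSERT INTO edges VALUES
        ('a','b',1), ('b','c',2), ('a','c',5), ('c','d',1), ('b','d',6);
""")

# Enumerate every path out of 'a', then keep the cheapest per destination.
# The hop cap guards against cycles blowing up the recursion.
rows = con.execute("""
    WITH RECURSIVE paths(dst, cost, hops) AS (
        SELECT dst, cost, 1 FROM edges WHERE src = 'a'
        UNION ALL
        SELECT e.dst, p.cost + e.cost, p.hops + 1
        FROM edges e JOIN paths p ON e.src = p.dst
        WHERE p.hops < 10
    )
    SELECT dst, MIN(cost) FROM paths GROUP BY dst ORDER BY dst
""").fetchall()
print(rows)  # [('b', 1.0), ('c', 3.0), ('d', 4.0)]
```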

------
elchief
How does a Recursive CTE run, line by line?

[https://stackoverflow.com/questions/3187850/how-does-a-recursive-cte-run-line-by-line](https://stackoverflow.com/questions/3187850/how-does-a-recursive-cte-run-line-by-line)

------
doersino
Recursive SQL is fun:
[https://excessivelyadequate.com/posts/sierpinksy.html](https://excessivelyadequate.com/posts/sierpinksy.html)

------
snicker7
Datalog is a recursive query language. One of the very few alternatives to
SQL.

