Hacker News
SQLite now allows multiple recursive SELECT statements in a single recursive CTE (fossil-scm.org)
398 points by thunderbong 39 days ago | 75 comments



This means you can use SQLite as a graph database. The draft documentation for this feature talks about that in more detail here: https://sqlite.org/draft/lang_with.html#rcex3

Also relevant: the draft changelog for the next version (due out in December): https://sqlite.org/draft/changes.html
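To make the graph-database angle concrete, here's a toy reachability query using Python's bundled sqlite3 module. This uses a single recursive SELECT, which already works in released SQLite; the new feature adds multiple recursive SELECTs in one CTE (e.g. to follow more than one edge type at once). Table and column names here are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE edge(src INTEGER, dst INTEGER);
    INSERT INTO edge VALUES (1,2),(2,3),(3,4),(5,6);
""")

# Every node reachable from node 1. UNION (not UNION ALL)
# deduplicates rows, which also stops cycles from looping forever.
reachable = db.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
""").fetchall()
print([n for (n,) in reachable])  # -> [1, 2, 3, 4]
```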


While you can use recursive common table expressions to perform graph queries, certain queries like breadth-first traversal are not very efficient for anything but trees (not even DAGs). See my post on the SQLite forum from the other day for more details:

https://sqlite.org/forum/forumpost/cd5ff0a1d4?t=h


IMHO fixing this is more a job for the query planner/optimizer than for the SQL language itself. I wonder if there's a more general optimization that would help with this and other similar graph-related tasks...


It isn't a query planner/optimizer problem.

Generalized graph traversals require each query to have additional data structures that grow with table size. This becomes an expensive proposition as tables get large, especially if you have many concurrent queries. Some databases designed for graph processing have internals and storage models that make it more efficient to resource manage this extra state. A SQL database, which has other priorities, would not implement this.


Still cannot use window functions in recursive queries, e.g. `select row_number() over ()` to enumerate and distinguish paths along the edges of a graph. This works in PostgreSQL, but is an error in SQLite, just tested with preview 3.34.0 2020-10-20. This greatly diminishes the utility of graph querying, because all but the most trivial queries end up dragging window functions in.
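For the record, a minimal reproduction with Python's bundled sqlite3 (behavior may vary by SQLite version; SQLite's docs say aggregate and window functions are disallowed in the recursive SELECT, while PostgreSQL accepts the equivalent query):

```python
import sqlite3

db = sqlite3.connect(":memory:")
try:
    # row_number() inside the *recursive* SELECT of a recursive CTE.
    db.execute("""
        WITH RECURSIVE t(n, rn) AS (
            SELECT 1, 1
            UNION ALL
            SELECT n + 1, row_number() OVER () FROM t WHERE n < 3
        )
        SELECT * FROM t
    """).fetchall()
    window_in_recursive_ok = True
except sqlite3.OperationalError:
    # SQLite rejects this at prepare time.
    window_in_recursive_ok = False
print(window_in_recursive_ok)  # -> False
```

Note that a window function in the *outer* SELECT over the CTE's result is fine; only the recursive part is restricted.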


Have you suggested this feature (along with a use case) on the SQLite forum?


It's not my responsibility. Hipp knows he's out of compliance with the standard.


I don't know anything about graph databases, but ... why do we want to use SQLite as one? Doesn't the graph database community have something better to offer?

SQL has a knack for doing really easy things well, and moderately complicated things badly. I would assume by default that anything involving graphs should not involve SQL. Recursive CTEs to me scream "you're about to spend hours debugging something trivial".


> I would assume by default that anything involving graphs should not involve SQL.

That would be a questionable assumption. SQL databases are a widely tested tool, and SQL itself can allow you to augment your "graph" with constraints and semantics that many graph-focused systems have trouble with. CTE's, while not entirely trivial, are not overly complex; they're not something you'd spend "hours" debugging.


I don't feel that way at all. I think a lot of that feeling comes from the fact that everyone's so dependent on ORMs they just rarely write SQL. Once you start doing it more you realize how powerful and elegant it can be.


> Doesn't the graph database community have something better to offer?

Probably not in the same kind of resource-constrained way, no. Whilst it might be a suboptimal graph database, it's also not going to require 2GB of RAM to run...


An interesting upcoming graph database is oxigraph. It's written in Rust and might be able to cope well in 2GB.

https://github.com/oxigraph/oxigraph/


Ooh, ta, that looks handy, especially with supporting SPARQL. If it can load my 1.1GB TTL, I'll be beyond happy.


I’d love to learn more about your use case... I think SPARQL and RDF are pretty interesting technologies, but haven’t seen many real applications using them.


My current use case is that I'm working on something which is basically things in groups (which may also be in groups) and seeing if things like "items in group A which are linked to group C via linkages to items in group B" are easier with SPARQL + graph database than they are with Postgres.

(Previous use case was converting music industry data into RDF and providing query interfaces on top. That 1.1GB TTL is a CWR file converted to RDF as a stress tester.)


> "items in group A which are linked to group C via linkages to items in group B"

Relational databases are actually quite convenient for such cases because you can model each "group" of items via a database table, including potentially an associative table which provides "linkage" between items in the database. Graph-based models are generally more limited than that, e.g. RDF and SPARQL are limited to simple triples which link a "source" and a "target" entity (both of which are essentially non-typed) according to a fixed "predicate". You can sort of materialize triples and endow them with extra information, but it gets clunky.
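A toy sketch of the associative-table modeling described above, in Python's sqlite3 (all table and column names are invented): groups, membership, and item-to-item links as plain tables, and the "A linked to C via B" question as a join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grp(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE membership(item_id INTEGER REFERENCES item(id),
                            grp_id  INTEGER REFERENCES grp(id));
    CREATE TABLE link(src_item INTEGER REFERENCES item(id),
                      dst_item INTEGER REFERENCES item(id));
    INSERT INTO item VALUES (1,'a1'),(2,'b1'),(3,'c1');
    INSERT INTO grp  VALUES (1,'A'),(2,'B'),(3,'C');
    INSERT INTO membership VALUES (1,1),(2,2),(3,3);
    INSERT INTO link VALUES (1,2),(2,3);   -- a1 -> b1 -> c1
""")

# "Items in group A which are linked to group C via an item in group B":
rows = db.execute("""
    SELECT ia.name
      FROM item ia
      JOIN membership ma ON ma.item_id = ia.id AND ma.grp_id = 1       -- A
      JOIN link l1       ON l1.src_item = ia.id
      JOIN membership mb ON mb.item_id = l1.dst_item AND mb.grp_id = 2 -- B
      JOIN link l2       ON l2.src_item = l1.dst_item
      JOIN membership mc ON mc.item_id = l2.dst_item AND mc.grp_id = 3 -- C
""").fetchall()
print(rows)  # -> [('a1',)]
```

The associative tables can carry extra columns (timestamps, link types, weights), which is the "endow triples with extra information" part that gets clunky in plain RDF.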


> Relational databases are actually quite convenient for such cases

I believe that someone may have done this in a nice way but every time I've encountered it (3 times thus far), it's always ended with complex SQL and tables being bent out of shape to try and keep performance.

> RDF and SPARQL are limited to simple triples which link a "source" and a "target" entity according to a fixed "predicate"

But can also infer transitive relations based on those predicates - `A canSee B`, `B canSee C` => `A canSee C` - which is handy when you're trying to discover those relationships in your data.


> But can also infer transitive relations based on those predicates - `A canSee B`, `B canSee C` => `A canSee C` - which is handy when you're trying to discover those relationships in your data.

You can do this sort of inference in a view if you use relational databases. (A view is a sort of "virtual" table based on the result of some database query. Many databases can also materialize views for improved performance, though this can make it a bit challenging to manage updates.)
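A sketch of the view idea using the canSee example from above, in SQLite via Python (SQLite allows a recursive CTE inside an ordinary, non-materialized view; names are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE can_see(who TEXT, whom TEXT);
    INSERT INTO can_see VALUES ('A','B'),('B','C');

    -- The transitive closure of can_see, exposed as a view.
    CREATE VIEW can_see_closure AS
    WITH RECURSIVE closure(who, whom) AS (
        SELECT who, whom FROM can_see
        UNION
        SELECT c.who, e.whom
          FROM closure c JOIN can_see e ON c.whom = e.who
    )
    SELECT who, whom FROM closure;
""")

rows = db.execute(
    "SELECT who, whom FROM can_see_closure ORDER BY who, whom").fetchall()
print(rows)  # -> [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

The inferred `A canSee C` row appears without being stored anywhere; the view re-derives it on each query.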


> You can do this sort of inference in a view if you use relational databases.

You'd need to use recursive queries though, I think, and there be dragons.

> [materialized views] can make it a bit challenging to manage updates

We're not using them with Postgres because refreshing materialized views is very much a blunt hammer and it'd cause more hassle than it would solve. Which is annoying because they've been great when I've used them previously.


I'd describe SPARQL and RDF as general interfaces focused on interop, not "technologies" per se. You could use any technology to provide a SPARQL/RDF endpoint, including a relational database.


If we are just going by resource usage, I'm sure that RedisGraph (as it's "just" a Redis module) would also fit the bill.


Isn't that keeping everything in memory?


It seems it can persist to disk just like normal Redis stuff, but it only works with in-memory graphs - tricky if the graph is too big to fit, I suppose.

cf https://github.com/RedisGraph/RedisGraph/issues/152


It's keeping everything in memory (index and entities) with disk persistence, but doesn't have a big memory overhead when running with little data.


> why do we want to use SQLite as one

Because SQLite is ubiquitous. It's simply everywhere. And it doesn't need a dedicated server to run.


I would also add another good reason: it's probably going to be the most reliable, best-tested part of any stack you decide to use.


My experience with graph databases so far has been that they are tuned more for nice schemas and queries than for graph analysis.

I've actually been considering migrating away from neo4j to sql with recursive queries for performance reasons.


> SQL has a knack for doing really easy things well, and moderately complicated things badly. I would assume by default that anything involving graphs should not involve SQL

A schema may have only one, or at any rate comparatively very few, instances of recursion.

In this case, using two separate data stores would be too much overhead; if the CTE doesn't do anything particularly fancy, that is, if it just retrieves records recursively associated via keys, it's not inherently hard to debug (actually, there isn't much to debug).

A case in point is GitLab's user management, which dropped MySQL partly because of its lack of CTEs (at the time).


Some of the early RDF Graph Stores were indeed based on SQL databases (Jena RDB/SDB) but you're correct in that they didn't scale particularly well.

There are pure graph stores out there, AllegroGraph, OpenLink Virtuoso (although this is a strange hybrid of SQL + Graph technologies) and others - and for more advanced graph query constructs like path finding there are optimisations that are difficult/not well supported in SQL.


Yes why not? If you have a data model where parts of the data is graph-like, and other classic relational, why would you not want to store the entire thing in the same database? If your domain is completely graph-oriented, and you have complex requirements on your database to support graph-oriented operations, it would be one thing. But evaluating use of a well known and mature SQL database is no bad starting point.


> Yes why not? If you have a data model where parts of the data is graph-like, and other classic relational, why would you not want to store the entire thing in the same database?

SQL is an implementation of relational algebra, and databases that implement it are relational databases for storing relational data.

Pointing at the name of a thing isn't an argument, I admit it, but asking "what could go wrong if we want to use a system tailored to relational data to deal with non-relational data?" strikes me as the sort of question that might turn out to have a compelling answer 6 months in when it is too late to easily change the database.

If I were involved in a project, I would be pushing very hard for the data to come out of the relational DB using "SELECT * FROM table WHERE conditions", with the clever graph things starting after that. If the data needs to be read from disk using clever graph-based algorithms, then use a clever, graph-based database. There are a bunch out there according to Wikipedia.


Graph data is not necessarily "non-relational". A lot of real-world data just can't be described as "purely graph-like" (no complex semantics or constraints) or "purely relational" (no recursive queries, etc.); often it's convenient to merge both feature sets. This is exactly what modern SQL with CTE's allows.


> Pointing at the name of a thing isn't an argument, I admit it, but asking "what could go wrong if we want to use a system tailored to relational data to deal with non-relational data?" strikes me as the sort of question that might turn out to have a compelling answer 6 months in when it is too late to easily change the database.

This would suggest that pretty much every existing RDBMS has made some very bad decisions since JSON/XML types, arrays, and all sorts of other non-relational features have long been supported.

Given the nature of SQLite you probably aren't dealing with petabytes of data.


If the project involves spending 6 months writing code involving navigating graphs, then SQLite probably isn't the right choice. If it's one or two queries, that's a different situation


Sure, but (and I am apologetic to be labouring on this point) if you have a small one-off blob of data that needs to be processed as a graph, SQL is the worst language to do the processing and a database is the worst place to be doing it. Get it into memory, use a programming language that supports recursive functions.

This is the basic argument here. If the situation is delicate enough we need the database to be doing graph operations, why would we pick SQLite? If it isn't, why would we pick SQL over Python?

I don't think it is bad that SQLite supports this, just ... what are the circumstances where this is a good idea? Is there a reason to do graph-style algorithms in SQL?


The problem that prompted me to make this extension to SQLite was a web-page request, so it needs to be processed with low latency. I initially tried loading the graph into memory and processing it that way. But there are approx 100K nodes in the graph, only a few dozen of which are relevant to the answer. It took a lot of time to pull in 100K nodes from disk. Running the query entirely in SQL is not only much less code to write, debug, and maintain, it is also much faster.


Thanks for adding it. I learned window functions using SQLite, looking forward to learn recursive queries next.


> SQL is the worst language to do the processing and a database is the worst place to be doing it. ... Is there a reason to do graph-style algorithms in SQL?

I don't think it's such a far-off fit from the relational database problem. Any time you have a table with a relation to itself, there's the potential to do graph-style algorithms on that data.

> If the situation is delicate enough we need the database to be doing graph operations, why would we pick SQLite?

Perhaps it has nothing to do with the situation being "delicate" and it is just a simple matter of it being less work and fewer lines of code to use a graph-style query in SQL, rather than re-implementing the graph algorithms in your application code or having to bring in an entirely new database system just to process one query.


I agree. There is nothing wrong relationally with self-referencing tables. Support for traversing such data models may have been poor in the past but with the support added, what is the problem?


I write SQL for a living and am a part-time woodworker (not good at all). Sometimes I get pieces of wood cut by industrial machines at the shop and sometimes I use my manual saw. I don't have the money or space to set up a professional wood-cutting machine in my workshop. A hand-held saw is not the best for cutting wood, but sometimes it just makes sense. You are right, graph database things need to be done in a graph database, but speaking for myself I have never set up or used one. It is easier (not "right") for me to learn recursive statements than to set up and learn to use a graph database.


It's an unlikely comparison in the first place; who is considering, on one hand, using SQLite right on the client device, and on the other, maybe spinning up a graph database that will almost certainly not just run on the client device? I submit that it's next to nobody. You can run SQLite on a server, but nearly nobody does that either (and not without reason). The idea that recursive CTEs are some sort of arcane technique that's impossible to debug is also not my experience with them at all.


Very nice. Graph following for the win. I wrote an inventory control system about 10 years ago that achieved this in a very hacky way (read: code outside SQLite). The requirement was to be able to track items, track users of items, track how much time an individual user had used an item, and to track maintenance and repairs of items. This would let you "charge back" wear and tear on things to the most responsible party, and provide lifetime information in order to buy replacements in a timely way. All that hackiness would be a single statement with this change.


I loved everything about that thread.

Slightly off-topic, but I want to point out how well Richard deals with a slightly rude commenter [1]. Keeps it classy and productive while still calling out the bad behavior.

[1] https://fossil-scm.org/forum/forumpost/7817eb709d?t=h


The commenter's comment [1] is absolutely fine, though.

Hipp's response is also fine! The only thing that wouldn't be fine is if someone inappropriately reacted to that comment as an insult.

[1] https://fossil-scm.org/forum/forumpost/7840137009?t=h


It's possible I'm misreading the tone of "has ways to go" and "niche volunteer-based SCM".


If anyone comes across as rude in that exchange it's Richard. Maybe it's just his style (I don't have any experience to judge by except this comment), but he comes off as very abrupt here.


I don't like SQLite because of this: https://www.sqlite.org/datatype3.html

> The type affinity of a column is the recommended type for data stored in that column. The important idea here is that the type is recommended, not required. Any column can still store any type of data.

I simply don't want weird rows (that violate the _expressly declared_ type of a SQL table) to be merrily allowed into persistent storage.


> Why on earth would I want weird rows (that violate the _expressly declared_ type of a table) to be merrily allowed into persistent storage?

Some problems are easy to solve given a column of type VARIANT: 1. simple sparse columns instead of a multi-type EAV table, 2. a BIG Table style schema that is easily range partitioned.

The main reason in SQLite is that most SQL dialects just work. SQLite becomes a universal desktop workbench for SQL. This flexibility also applies to things like delimited identifiers [1].

The problem is not that SQLite uses VARIANT datatypes, the problem is that once a datatype is specified it is not enforced by the engine. SQLite could/should add a PRAGMA to enable Domain Integrity much like it does with Foreign Keys (PRAGMA foreign_keys = ON;).

[1] https://sqlite.org/lang_keywords.html
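A small demonstration of both halves of that point in Python's sqlite3: the declared type really is only an affinity, and a `CHECK (typeof(...))` constraint is one per-column way to enforce it today, in lieu of the hypothetical domain-integrity pragma suggested above (the pragma itself does not exist).

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Affinity is a recommendation: a non-numeric TEXT value is stored
# as text even in a column declared INTEGER.
db.execute("CREATE TABLE loose(n INTEGER)")
db.execute("INSERT INTO loose VALUES ('not a number')")
stored = db.execute("SELECT n, typeof(n) FROM loose").fetchone()
print(stored)  # -> ('not a number', 'text')

# A CHECK on typeof() rejects the same value.
db.execute("CREATE TABLE enforced(n INTEGER CHECK (typeof(n) = 'integer'))")
try:
    db.execute("INSERT INTO enforced VALUES ('not a number')")
    check_enforced = False
except sqlite3.IntegrityError:
    check_enforced = True
print(check_enforced)  # -> True
```

Note the CHECK runs after affinity conversion, so `INSERT INTO enforced VALUES ('123')` would still pass (the string converts cleanly to an integer first).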


I remember hearing something along the lines of the odd typing in SQLite being vestigial from its early days when it was closely aligned with Tcl.


I'm not sure of the original inspiration, but SQLite's core storage types with variable-length encoding are the right way to implement a row store or a data serialization format, in my opinion.

Fixed length data types in SQL were a premature optimization that led to verbose syntax like LONG VARCHAR FOR BIT DATA [1] and incompatibilities tied to internal implementation details. There is no reason why more specific constraints can't be specified on top of the core storage types.

One can argue that SQLite is sorely missing an exact NUMERIC type but that holds true for most/all language runtimes as well. Datetime is a bit weird too but the format of the DB file is a thing of beauty.

[1] https://db.apache.org/derby/docs/10.1/ref/rrefsqlj30118.html


Anyone have a recommended resource for understanding the practical application of recursive CTEs / Graph DBs?


Look for a good reference about handling trees in SQL. There are a number of books on the whole area, and many smaller more focused articles online.

The most common place I've encountered recursive CTEs being massively useful in real world work is working with company responsibility & reporting hierarchies (line management, regulatory supervision, "spans of control" reporting, ...) where there is not a fixed number of levels.

When there are a small and fixed number of levels (for instance every drone has a supervisor and senior supervisor and you don't monitor connections above that) there are often other solutions that people find easier to understand and/or perform faster but for deeper trees or those where the level structure is any less static, recursive CTEs are a godsend.

Other examples include nested taxonomies (nature, book or other publication filing, nested tags for categorisation), properly threaded discussion records, family trees (though these are graphs rather than trees unless you impose some strict limits on what you count), representing file-system like structures in the DB, ...

Graphs get trickier (trees are a subset of graphs), as you may need to consider multiple paths to the same node, infinite loops, and other complications, so I suggest looking into trees first and then expanding your research.
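A minimal sketch of the variable-depth hierarchy case in Python's sqlite3 (schema invented): everyone under a given manager, with their depth below the starting point, regardless of how many levels the tree has.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employee(id INTEGER PRIMARY KEY, name TEXT,
                          manager_id INTEGER REFERENCES employee(id));
    INSERT INTO employee VALUES
        (1,'ceo',NULL),(2,'vp',1),(3,'mgr',2),(4,'dev',3),(5,'dev2',3);
""")

# Everyone under 'vp', however deep the reporting chain goes.
rows = db.execute("""
    WITH RECURSIVE under(id, name, depth) AS (
        SELECT id, name, 0 FROM employee WHERE name = 'vp'
        UNION ALL
        SELECT e.id, e.name, u.depth + 1
          FROM employee e JOIN under u ON e.manager_id = u.id
    )
    SELECT name, depth FROM under ORDER BY depth, name
""").fetchall()
print(rows)  # -> [('vp', 0), ('mgr', 1), ('dev', 2), ('dev2', 2)]
```

UNION ALL is fine for a tree; for graph-shaped data, swapping in UNION deduplicates visited rows, which is one simple guard against the infinite loops mentioned above.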


Going out on a limb here because your question can be interpreted in multiple ways and I only feel qualified to answer one of them:

The most common example for graphs in DBs would be a graph representing the friends/contacts of users. It's one of these "you'll know when you need it" solutions.

Sometimes your data structure is a graph, and sometimes you want that graph to live in a database.

And in even rarer cases you want to search that graph in your database in some recursive manner.


What about representing states? Not sure about how graph dbs could perform on that scenario. There is a recent post talking about the topic and I'm still thinking about the best solution to store flows or similar. https://news.ycombinator.com/item?id=24764303


Recursive Common Table Expressions (CTE) are hard to understand until you've worked through at least one real-world example. The SQLite documentation on the WITH Clause has a couple of simple examples [1] for an Org Chart and a Family; both represent hierarchical data structures.

The Fossil discussion involves the SCM's representation of a File System, probably the most common hierarchical data structure we regularly encounter. Hierarchical data structures are notoriously hard in SQL. Recursive CTEs are not easy but they are useful and efficient for hierarchical data.

[1] https://www.sqlite.org/draft/lang_with.html#hierarchical_que...


Another good writeup is included in the SQL Anywhere documentation [1] which has a good example of a parts explosion problem [2]. I don't know if the syntax works in SQLite but it might be a good exercise to get it to work.

[1] http://dcx.sap.com/1200/en/dbusage/ug-commontblexpr.html

[2] http://dcx.sap.com/1200/en/dbusage/parts-explosion-cte-sqlug...


I've used them (in PostgreSQL) to evaluate the merging of "parent" and "child" values - e.g. one row can be the parent, and if the child has NULL for a value it takes it from the parent (though I don't remember much of the details). This can go deeper than just the first parent - to the parent of the parent too.

I was able to make a query like this in PostgreSQL, but forgot the details of how I did it (though once you sit down, you get it eventually).
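A sketch of that inheritance pattern (here in SQLite via Python rather than PostgreSQL, but the CTE is the same shape; schema and names invented): for each row, climb toward the root until a non-NULL value is found.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node(id INTEGER PRIMARY KEY,
                      parent_id INTEGER REFERENCES node(id),
                      val TEXT);
    -- Rows 2 and 3 have no val of their own and inherit from row 1.
    INSERT INTO node VALUES (1,NULL,'root-val'),(2,1,NULL),
                            (3,2,NULL),(4,1,'own-val');
""")

rows = db.execute("""
    WITH RECURSIVE lookup(start_id, cur_id, val) AS (
        SELECT id, id, val FROM node
        UNION ALL
        -- Keep climbing only while the value is still NULL.
        SELECT l.start_id, p.id, p.val
          FROM lookup l
          JOIN node c ON c.id = l.cur_id
          JOIN node p ON p.id = c.parent_id
         WHERE l.val IS NULL
    )
    SELECT start_id, val FROM lookup
     WHERE val IS NOT NULL
     ORDER BY start_id
""").fetchall()
print(rows)
# -> [(1, 'root-val'), (2, 'root-val'), (3, 'root-val'), (4, 'own-val')]
```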


In the past I used recursive CTEs in Postgres to calculate the BOM of a product. Products consisted of "parts", and each "part" had a cost and could consist of any number of other "parts" (subassemblies). This was for a homemade ERP resembling software.
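A toy version of that parts-explosion idea in Python's sqlite3 (schema invented): multiply quantities down the assembly tree and sum up the cost of every part.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE part(id INTEGER PRIMARY KEY, name TEXT, cost REAL);
    CREATE TABLE bom(parent INTEGER, child INTEGER, qty INTEGER);
    INSERT INTO part VALUES (1,'bike',0),(2,'wheel',0),
                            (3,'spoke',1.5),(4,'rim',20);
    -- A bike has 2 wheels; each wheel has 32 spokes and 1 rim.
    INSERT INTO bom VALUES (1,2,2),(2,3,32),(2,4,1);
""")

# Explode part 1 (the bike), carrying the multiplied quantity down,
# then roll the per-part costs back up.
total = db.execute("""
    WITH RECURSIVE explode(part_id, qty) AS (
        SELECT 1, 1
        UNION ALL
        SELECT b.child, e.qty * b.qty
          FROM bom b JOIN explode e ON b.parent = e.part_id
    )
    SELECT sum(e.qty * p.cost)
      FROM explode e JOIN part p ON p.id = e.part_id
""").fetchone()[0]
print(total)  # -> 136.0  (64 spokes * 1.5 + 2 rims * 20)
```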


While the end result looks like a graph, I think that still qualifies as a standard relational DB use case though?

A graph database usually means you’re storing the graph nodes and edges as entities, so you can traverse the graph easily from any direction, and express different kinds of relationships.


Yeah, wasn't trying to say this was an example of graph. Maybe it is slightly outside what would be considered "standard" relational DB use case? My impression of recursive CTE's in RDBMS is that if you need speed or any more advanced graph-like features, you should move to an actual graph db. So far I haven't taken on any projects that warranted that, though it does interest me.

Either way my intention here was to simply provide an example of a real-world recursive CTE use case.


One application not mentioned in this thread is traversing trees, like in the old forums where you had many levels of categories/subcategories. AKA hierarchical queries. SQLite already has support for this[0]. It seems they are supporting multiple recursive SELECTs now.

[0] https://sqlite.org/lang_with.html


Yes, chapter 2 in Martin Kleppmann's Designing Data-Intensive Applications.



If I'm not mistaken, this brings SQLite up to parity with Datalog.


Can you please expand on this. Thanks


Unless you're using Clojure engines with Datomic pull syntax, SQL engines (most? all?) can't pull out nested results without serialization nastiness


Does this mean it’ll be easy to do graph transitive closure computations in sqlite now?


Do you mean iterating on an adjacency matrix dimension-many times to get the full path length between any two nodes?

I have had to run my recursive queries external to the database (MySQL, SQLite), this change looks to bring it into the database. Excellent news!


Transitive closure will tell you if a path between two nodes exists, not the length of the shortest path between two nodes.
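The difference is just what you keep per row, though: if the recursive CTE carries a path length, a GROUP BY at the end yields shortest-path lengths alongside reachability. A sketch in Python's sqlite3 (names invented; the `len < 10` bound is a crude guard against cycles):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE edge(src INTEGER, dst INTEGER);
    -- Two routes from 1 to 4: 1-2-3-4 (length 3) and 1-4 (length 1).
    INSERT INTO edge VALUES (1,2),(2,3),(3,4),(1,4);
""")

rows = db.execute("""
    WITH RECURSIVE walk(node, len) AS (
        SELECT 1, 0
        UNION
        SELECT e.dst, w.len + 1
          FROM edge e JOIN walk w ON e.src = w.node
         WHERE w.len < 10          -- crude depth bound for cyclic data
    )
    SELECT node, min(len) FROM walk GROUP BY node ORDER BY node
""").fetchall()
print(rows)  # -> [(1, 0), (2, 1), (3, 2), (4, 1)]
```

The set of nodes returned is the transitive closure from node 1; the `min(len)` column is the shortest-path length to each.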


You're technically correct, but afaik there isn't really a way to compute transitive closure without also computing shortest paths along the way, modulo constant factors?


You'll compute paths along the way but won't know whether they're the shortest. Odds are good, though, that you'll have found something good enough for most purposes.


If all you do is iterate over the adjacency matrix, sure, but wouldn't Floyd-Warshall be the same time complexity? All I'm saying is that you get both pieces of information for essentially the same price, modulo constant factors.


Even if it is, I imagine it's not going to be that efficient if it isn't optimized for that use case.


I solved a task involving transitive closures in last year's Advent of Code using SQLite:

https://github.com/nikeee/advent-of-code-2019/blob/master/06...

How could this be improved with the new feature of SQLite?


Nice, I was about to pull RedisGraph in to a small project using SQLite to do two graph queries.



