Once I started thinking about joins in terms of spatial dimensions, things got a lot easier to reason about. I like to think of the inner join in terms of the scene from Stargate where they describe how gate addressing works. Assume you have a contrived example with 3 tables describing the position of an off-world probe in each dimension - X, Y and Z. Each table looks like:
CREATE TABLE Dim_X (
    EntityId int,
    Value float
)
Establishing a point in space for a given entity (regardless of how you derive that ID) is then a matter of:
SELECT Dim_X.Value AS X, Dim_Y.Value AS Y, Dim_Z.Value AS Z
FROM Dim_X, Dim_Y, Dim_Z
WHERE Dim_X.EntityId = Dim_Y.EntityId --Join 1
AND Dim_Y.EntityId = Dim_Z.EntityId --Join 2
AND Dim_X.EntityId = @MyEntityId --The entity we want to find the 3D location of
You will note that there are 2 inner joins used in this example. That is the bare minimum needed to construct a 3 dimensional space. Think about taking 3 cards and taping them together with 2 pieces of rigid metal tape. You can make a good corner of a cube, even if it's a bit floppy on one edge. Gravity doesn't apply inside the RDBMS, so this works out.
This same reasoning can be extrapolated to higher, non-spatial dimensions. Think about adding time into that mix. In order to find an entity, you also now need to perform an inner join on that dimension and constrain on a specific time value. If you join but fail to constrain on a specific time, then you get a happy, comprehensive report of all the places an entity was over time.
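A minimal sketch of that, assuming a hypothetical Dim_T table shaped like the others but with an ObservedAt column instead of Value:
SELECT Dim_X.Value AS X, Dim_Y.Value AS Y, Dim_Z.Value AS Z
FROM Dim_X, Dim_Y, Dim_Z, Dim_T
WHERE Dim_X.EntityId = Dim_Y.EntityId --Join 1
AND Dim_Y.EntityId = Dim_Z.EntityId --Join 2
AND Dim_Z.EntityId = Dim_T.EntityId --Join 3: the time dimension
AND Dim_X.EntityId = @MyEntityId
AND Dim_T.ObservedAt = @AtTime --Omit this constraint and you get the full history of positions instead of one point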
The other join types are really minor variations on these themes once you have a deep conceptual grasp (i.e. can "rotate" the schema in your mind).
Playing around with some toy examples can do wonders for understanding. I sometimes find myself going back to the Cartesian coordinate example when stuck trying to weld together 10+ dimensions in a real-world business situation.
Your example raises a question I've never found a good answer to:
Why use `JOIN` (and `INNER JOIN` and friends) at all when you can, IMO, write the exact same join more accurately with WHERE clauses (using the old `*=` operators for outer joins), and then just list all the tables in the FROM clause?
My brain always hurts trying to parse complex FROM clauses with various JOINs whereas reading the equivalent equations in WHERE clauses is way more straightforward to me.
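For reference, here is a two-table lookup from the Dim example upthread written both ways; for plain inner joins the two forms are equivalent and most optimizers plan them identically:
--Comma join, conditions in WHERE
SELECT Dim_X.Value AS X, Dim_Y.Value AS Y
FROM Dim_X, Dim_Y
WHERE Dim_X.EntityId = Dim_Y.EntityId
AND Dim_X.EntityId = @MyEntityId
--ANSI JOIN syntax, join condition in ON
SELECT Dim_X.Value AS X, Dim_Y.Value AS Y
FROM Dim_X
INNER JOIN Dim_Y ON Dim_X.EntityId = Dim_Y.EntityId
WHERE Dim_X.EntityId = @MyEntityId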
I have had this issue, and solved it by learning that join happens from left to right. `A left join B left join C` is equivalent to ((A left join B) left join C).
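Most dialects also let you write that grouping explicitly, which makes the left-to-right order visible (contrived tables and key columns here):
SELECT *
FROM (A LEFT JOIN B ON B.a_id = A.id)
     LEFT JOIN C ON C.b_id = B.id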
("The result of the natural join [R ⋈ S] is the set of all combinations of tuples in R and S that are equal on their common attribute names... it is the relational counterpart of the logical AND operator.")
⋈ amounts to a Cartesian product with a predicate that throws out the rows that shouldn't be in the result. Lots of SQL makes sense if you think of joins this way.
Good point, but not exactly correct. If R is a relation on A x B and S is a relation on B x C, then the composition SR = {(a, c) | aRb and bSc for some b in B}. The join, by contrast, uses (a, b, c) in the comprehension.
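Spelling the two out side by side (same A, B, C as above), the composition drops the shared attribute while the join keeps it:
S \circ R = \{\, (a, c) \mid \exists b \in B : (a, b) \in R \wedge (b, c) \in S \,\}
R \bowtie S = \{\, (a, b, c) \mid (a, b) \in R \wedge (b, c) \in S \,\}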
True, but I think slightly pedantic for this context. I think there have been proposed relational operators which produce the result you describe - a join, then remove the join key fields from the result.
If we are in a pedantic mood: also, a relational tuple is not exactly the same as a mathematical tuple; relational tuples take different forms depending on whose system you are following (e.g. whether the fields are ordered or not).
Just endeavoring to make true statements. A relation on sets A and B is a subset R of the Cartesian product of A and B. E. F. Codd had something else in mind with his definition of join. Do you think he got that wrong?
That's true, but I think the benefit of treating a Cartesian product as fundamental is that it lets you stop thinking about loops at all, or which one is the inner or outer loop, or of it as an iterative process at all.
Boxing all that up into the Cartesian product is a really useful concept, and the whole idea of relational algebra is to find convenient formalisms for relational operations, so it seems like it deserves a separate mention.
1,000%. I learned discrete/set math in college and then later SQL on the job. Helping a few analysts move beyond Excel skills to learn SQL was interesting. They struggled with some things that clicked quickly for me because I immediately had the "oh, these are set operators" intuition. Stuff like when to use an outer join vs a cross join, or how to control the cardinality of output rows in fancy ways (vs slapping `DISTINCT ON` on everything).
I passed on the set understanding where it explained an unintuitive SQL behavior and I hope it helped 'em.
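As a made-up illustration of that outer-vs-cross distinction (hypothetical customers and orders tables):
--LEFT OUTER JOIN: one row per matching order, plus a NULL-padded row for customers with no orders
SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
--CROSS JOIN: every customer paired with every order, matching or not (|customers| * |orders| rows)
SELECT c.name, o.total
FROM customers c
CROSS JOIN orders o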
Absolutely. One of the biggest disservices SQL does is getting people thinking of relations as "tables" with "rows", which ends up making them think in an iterative, sequential, tabular model about something they'd be better off thinking about more abstractly.
The whole relational model clicked for me a whole lot more once I started thinking of each tuple as a factual proposition (customer a's name is X, phone number is Y), and then all the operations in the relational algebra start to look more like "how would it be best to ask questions about this subject?"... "I'm interested in facts about..."
For several days I've been having trouble finding good resources on _implementation_ of query execution/planning (predicates, existing indices <<especially composite ones - how to locate them during planning, etc.>>, joins, etc.).
Google is spammed with _usage_.
Does anybody have some recommendations at hand?
ps. the only one I found was CMU's Database Group resources, which are great
I'd also recommend "How Query Engines Work" for a good practical exercise on the topic: https://andygrove.io/how-query-engines-work/ . I think one should already have a high-level overview of what individual parts make up a query engine and what their purpose is, as it only touches on that lightly.
The SQLite optimizer is really primitive. You generally don't want to be using it as a template if you can avoid it; it's good for OLTP but can't handle many-way joins well.
Generally, this is a specialist topic, so you're unlikely to find e.g. good textbooks. It depends a bit on how deep you want to go and what specific parts you're interested in, though. (Query execution is pretty much entirely separate from planning, for one.)
The best join optimizer paper I know of is still the original Selinger paper (https://courses.cs.duke.edu/compsci516/cps216/spring03/paper...). It doesn't support outer joins and there are more efficient techniques by now, but anyone looking at a System R-like optimizer can read this and feel right at home. (There is also the Volcano/Cascades school of optimizers, which is rather different from System R, but I've found even less information on it.)
As others have said, Postgres' source and documentation is good. In particular, src/backend/optimizer/README contains a number of interesting things that you won't find a lot of other places. (The Postgres optimizer doesn't always use the latest fancy designs, but it's generally very polished and pretty easy to read.)
I can also echo Andy Pavlo's courses (at CMU), they're pretty much the only ones who will explain this stuff to you online. The “Building Query Compilers” PDF is rather incomplete and there's a lot of stuff I never cared for in there, but it contains several key papers from Moerkotte et al if you actually want to implement the modern stuff (DPhyp etc.). In general, most of the modern System R-like stuff (how to efficiently deal with outer joins, how to deal with interesting orders, how to deal with large queries) comes from the groups of Moerkotte and/or Neumann; all of these things had solutions before them, but less efficient and/or powerful and/or elegant.
Finding an applicable index isn't hard (you generally just try to see if it hits a predicate—a so-called “sargable predicate”, for SARG = Search ARGument). Estimating selectivity is hard. Estimating selectivity through joins is perhaps the hardest problem in optimizers, and nobody has truly solved it. This was one of the things I never really got to; there are so many hard subproblems. Like, for something that sounds really simple but isn't, what do you do if you have predicates A=x AND B=y AND C=z and you happen to have indexes on (A,B) and (B,C) (with selectivity/cardinality information) and want a selectivity for all three combined? There are papers that literally require you to build a “second-order cone programming” solver to solve this problem :-)
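A quick sketch of sargability, with a hypothetical table T and an index on (A, B):
--Sargable: the index on (A, B) can seek on A and then range-scan B
SELECT * FROM T WHERE A = 5 AND B > 10
--Not sargable: wrapping the column in an expression hides it from the index
SELECT * FROM T WHERE A + 1 = 6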
It seems content farms, farming keywords with just fluff, have taken over Google. Can't find how to do anything anymore, just pages and pages of people talking about doing something.
You could try your query on a different search engine. I've had good luck with kagi.
It’s out of date and probably has less detail than I remember, but I got a lot out of “Inside Microsoft SQL Server 7.0”, which does deep dives into storage architecture, indices, query planning, etc. from an MSSQL-specific POV. The book was updated for SQL Server 2005 and 2008. Obviously the book also has a ton of stuff about MS-specific features and admin functionality that’s not really relevant, and I’m sure there are better resources out there, but I’ve found the background from that book has helped me understand the innards of Oracle, MySQL, and Postgres in the years since.
The 14th way is “multi-way joins”, also called “worst-case optimal joins”, which is a terrible name.
It means that instead of joining tables two at a time and materializing the temporary results along the way (eating memory), you join 3 or more tables together without the temporary intermediates.
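The canonical illustration is the triangle query over hypothetical edge tables R(a, b), S(b, c), T(c, a); a pairwise plan has to materialize one (possibly huge) intermediate join before the last condition can prune it, while a worst-case optimal join enforces all three conditions at once:
SELECT *
FROM R, S, T
WHERE R.b = S.b --a pairwise plan starts with one of these...
AND S.c = T.c
AND T.a = R.a --...and only applies the final condition after building a large intermediate result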
Negating inputs (set complement) turns the join's `AND` into a `NOR`, as Tetris exploits.
The worst case bounds don't tighten over (stateless/streaming) WCOJ's, but much real world data has far smaller box certificates.
One thing I didn't see is whether Dovetail join allows recursive queries (i.e., arbitrary datalog with a designated output relation, and the user having no concern about what the engine does with all the intermediate relations mentioned in the bundle of horn clauses that make up this datalog query).
Do you happen to know if it supports such queries?
Excellent! We need more articles like this that demonstrate the subtleties of the relational model to primarily app-level developers. The explanations and explorations in terms of functional programming are both concise and compelling.
Another missed chance to educate on the N+1 area. A join on an unclustered index is still N+1; it’s just N+1 on disk instead of N+1 over the network and disk.
There's a big performance difference between creating the Cartesian product and then filtering on the condition, and producing the matching rows directly. An inner join with equi-join conditions can produce the matches directly (e.g. via a hash or merge join); any non-equi join conditions actually have to be evaluated against candidate rows.
That's true, but that is an implementation detail. In abstract terms, you can think of an inner join like that.
Also, while what you say is true in general for modern DB's, there are some implementations like old Oracle versions where the only way to create the effect of an inner join was in terms of a Cartesian product.
Great article. Side note: his normalization example reminded me how I used to design tables using a numeric primary key thinking they were more performant than strings. But then I’d have a meaningless id which required a join to get the unique value I actually wanted. One day I realized I could use the same unique key in both tables and save a join.
I still like to have a unique id field per table. It helps with logging, and it doesn't care about multi-field "real" keys.
However, I keep a unique index on the string value and, more importantly, point integrity constraints to it, mainly for readability. It's way easier to read a table full of meaningful strings than one full of numerical ids or uuids.
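A rough sketch of that layout with made-up tables, where the surrogate id stays for logging but the foreign key points at the readable unique value:
CREATE TABLE country (
    id   int PRIMARY KEY,            --surrogate id, handy for logging
    code varchar(2) NOT NULL UNIQUE  --meaningful natural key, e.g. 'US'
)
CREATE TABLE app_user (
    id           int PRIMARY KEY,
    name         varchar(100) NOT NULL,
    country_code varchar(2) NOT NULL REFERENCES country (code) --readable without a join to country
)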
This is admittedly a bit pedantic, but in E.F. Codd's original paper, he defined "relation" as the relation between a tuple (table row) and an attribute (table column) in a single table - https://en.wikipedia.org/wiki/Relation_(database). I'm not sure of the author's intent, but the user table example (user, country_id ) might imply the relationship between the user table and the country table. It's a common misconception about what "relational" means, but tbh I'm fine with that since it makes more sense to the average developer.
If you ever need to join sets of data in code, don't use nested loops - O(n^2). Use a map / dictionary. It's one of the few things I picked up doing Leetcode problems that I've actually needed to apply irl.
I think understanding that “relation” means the relationships of attributes to a primary key is a crucial, foundational concept. Without it, you can’t properly understand a concept like normalization, why it arises, and when it applies. You can’t even fully grasp the scope of “data corruption” (most developers have an extremely poor understanding of how broad the notion of corruption is in databases).
Yeah. Many database tables are too large for an in memory hash join.
My comment was a not very well fleshed out tangential remark on the value of practicing DS&A problems. I know a lot of devs hate Leetcode style interviews. I get it. It's not fun. But contrary to what some people say, I have run into a fair number of situations where the practice helped me implement more efficient solutions.
Venn diagrams are a terrible way to describe join types (and, tbh, I don’t understand why Wikipedia has these) because they make it look like you're applying a 1:1 relationship. In an M:N relationship, "artefacts" (for lack of a better word) of both tables would appear multiple times, and the Venn diagram obscures this fact.
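A tiny made-up example of what the circles hide: with two orders and two shipments for the same customer, the join emits four rows, repeating every row from both sides:
SELECT o.order_id, s.shipment_id
FROM orders o
JOIN shipments s ON s.customer_id = o.customer_id
--2 orders x 2 shipments for one customer => 4 result rows, not the simple "overlap" a Venn diagram suggests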
> because they make it look like you're applying a 1:1 relationship.
People can form different mental models of the same abstraction, so I see what you are saying.
I've never seen it that way because "Venn diagrams do not generally contain information on the relative or absolute sizes (cardinality) of sets." (see https://en.wikipedia.org/wiki/Venn_diagram).
And, more meta: it is an innocent, slightly incorrect statement; stuff like that should not be downvoted, just replied to with a correction. Save downvotes for outright malicious posts.
This is a nice way of explaining a concept, and could probably be applied to any complex topic. As someone who learns best by analogy, concepts usually "click" for me once I've associated them with a few other concepts. So I appreciate the format, and would personally enjoy seeing more explainers like this (and not just about database topics).
Obviously I'd only be able to say for sure if I was the downvoter but I sometimes observe lighter text on comments which make a reference without calling out the reference. It's similar to using an acronym without defining it, though possibly more confounding if it's a niche enough reference. In this case, I did not get the reference and would have wondered why the parent commenter expected that title to have already been used.
At best, such a comment is referencing something topical which most readers will get (and ostensibly be entertained by); at worst, it's a distracting non sequitur. It generally ties back to a community preference that comments are curiosity-satisfying before entertaining.
Wouldn’t an allusion to 20th century American poetry fall into the category of “curiosity satisfying”? Given they were not the only person who got the reference it feels kind of arbitrary to say this is frivolous entertainment when another person in the community (in this case, me) found it curious and also wondered if there was an allusion there.
If it was an intentional allusion, then it may actually add depth/meaning to the conversation but we may not know since it was already downvoted…
It could be if it was called out as such. Without the explicit callout, one runs the risk of it “going over the readers’ heads” so to speak. Anyway, I’m not intending to justify any behavior; just offering my interpretation based on past observations.