Joins 13 Ways (justinjaffray.com)
531 points by foldU on July 3, 2023 | 76 comments



Once I started thinking about joins in terms of spatial dimensions, things got a lot easier to reason with. I like to think of the inner join like the scene from Stargate where they were describing how gate addressing works. Assume you have a contrived example with 3 tables describing the position of an off world probe in each dimension - X, Y and Z. Each table looks like:

  CREATE TABLE Dim_X (
    EntityId int,
    Value float
  )
Establishing a point in space for a given entity (regardless of how you derive that ID) is then a matter of:

  SELECT Dim_X.Value AS X, Dim_Y.Value AS Y, Dim_Z.Value AS Z
  FROM Dim_X, Dim_Y, Dim_Z
  WHERE Dim_X.EntityId = Dim_Y.EntityId --Join 1
  AND Dim_Y.EntityId = Dim_Z.EntityId --Join 2
  AND Dim_X.EntityId = @MyEntityId --The entity we want to find the 3D location of
You will note that there are 2 inner joins used in this example. That is the bare minimum needed to construct a 3 dimensional space. Think about taking 3 cards and taping them together with 2 pieces of rigid metal tape. You can make a good corner of a cube, even if it's a bit floppy on one edge. Gravity doesn't apply inside the RDBMS, so this works out.

This same reasoning can be extrapolated to higher, non-spatial dimensions. Think about adding time into that mix. In order to find an entity, you also now need to perform an inner join on that dimension and constrain on a specific time value. If you join but fail to constrain on a specific time, then you get a happy, comprehensive report of all the places an entity was over time.
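Sketched out (assuming a hypothetical Dim_T table shaped like the others, and a hypothetical @PointInTime parameter), that might look like:

  SELECT Dim_X.Value AS X, Dim_Y.Value AS Y, Dim_Z.Value AS Z, Dim_T.Value AS T
  FROM Dim_X, Dim_Y, Dim_Z, Dim_T
  WHERE Dim_X.EntityId = Dim_Y.EntityId
  AND Dim_Y.EntityId = Dim_Z.EntityId
  AND Dim_Z.EntityId = Dim_T.EntityId
  AND Dim_X.EntityId = @MyEntityId
  AND Dim_T.Value = @PointInTime --Omit this constraint to get the full history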

The other join types are really minor variations on these themes once you have a deep conceptual grasp (i.e. can "rotate" the schema in your mind).

Playing around with some toy examples can do wonders for understanding. I sometimes find myself going back to the cartesian coordinate example when stuck trying to weld together 10+ dimensions in a real-world business situation.


I now understand joins less than I did before.


It's more a description of how dimensions work in OLAP and it's incomplete without measures.


I miss MDX (language behind MS Analysis Service).


This reminds me a lot of HyperDex[0], which hashes values into a multidimensional hyperspace based on their attributes for indexing purposes.

[0] https://dbdb.io/db/hyperdex



That Wikipedia article, and its offshoot pages, are so awfully out of date...

But yes, OLAP.


This is like, a hypernormalization of the data though, isn't it? I think in standard BCNF you'd just leave your table as

  CREATE TABLE EntityPosition (
    EntityId int,
    X float,
    Y float,
    Z float
  )
It does remind me of data warehouse stuff though, given we're working with aggregates and piecing together bits of various dimensions.


> …a hypernormalization of the data …

Well, yeah - it’s an example to get the point across, not an exercise in finding the right level of normalization.


Your example raises a question I've never found a good answer to:

Why use `JOIN` (and `INNER JOIN` and friends) at all when you can write the exact join more accurately, IMO, with WHERE clauses (and using "*=" operators), and then just list all the tables in the FROM clause?

My brain always hurts trying to parse complex FROM clauses with various JOINs whereas reading the equivalent equations in WHERE clauses is way more straightforward to me.
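For concreteness, the two styles being contrasted (hypothetical tables); for plain inner joins these produce identical results:

  -- explicit JOIN syntax:
  SELECT * FROM a JOIN b ON b.a_id = a.id

  -- implicit join: tables listed in FROM, join predicate in WHERE:
  SELECT * FROM a, b WHERE b.a_id = a.id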


I have had this issue, and solved it by learning that join happens from left to right. `A left join B left join C` is equivalent to ((A left join B) left join C).
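A sketch of why the grouping matters (hypothetical tables); the second form groups the joins differently and can return different rows:

  -- default grouping: (A LEFT JOIN B) LEFT JOIN C
  SELECT * FROM A
  LEFT JOIN B ON B.a_id = A.id
  LEFT JOIN C ON C.b_id = B.id

  -- explicit grouping: A LEFT JOIN (B JOIN C); an A row whose B has no
  -- matching C now gets NULLs for both B and C
  SELECT * FROM A
  LEFT JOIN (B JOIN C ON C.b_id = B.id) ON B.a_id = A.id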


I think you are saying that every join is a variation of cross join. Did I understand you correctly?


Great. Now you've made me want to re-watch Stargate!


The 0th would be "it's an operator in relational algebra."

https://en.m.wikipedia.org/wiki/Relational_algebra

("The result of the natural join [R ⋈ S] is the set of all combinations of tuples in R and S that are equal on their common attribute names... it is the relational counterpart of the logical AND operator.")

⋈ amounts to a Cartesian product with a predicate that throws out the rows that shouldn't be in the result. Lots of SQL makes sense if you think of joins this way.


In the functional interpretation of relational theory, a join is function composition; I'm surprised that one was left out.


Good point, but not exactly correct. If R is a relation on A x B and S is a relation on B x C then SR = {(a, c) | aRb and bSc for some b in B}. The join uses (a, b, c) in the comprehension.


True, but I think slightly pedantic for this context. I think there have been proposed relational operators which produce the result you describe - a join, then remove the join key fields from the result.

If we are in a pedantic mood, also, a relational tuple is not exactly the same as a mathematical tuple; relational tuples take different forms depending on whose system you are following (e.g. whether the fields are ordered or not).


Just endeavoring to make true statements. A relation on sets A and B is a subset R of the Cartesian product of A and B. E. F. Codd had something else in mind with his definition of join. Do you think he got that wrong?


The cross-product plus predicate is referenced in "A join is a nested loop over rows".


That's true, but I think the benefit of treating the Cartesian product as fundamental is that it lets you stop thinking about loops at all: which one is the inner or outer loop, or the whole thing as an iterative process.

Boxing all that up into the Cartesian product is a really useful concept, and the whole idea of relational algebra is to find convenient formalisms for relational operations, so it seems like it deserves a separate mention.


1,000%. I learned discrete/set math in college and then later SQL on the job. Helping a few analysts move beyond Excel skills to learn SQL was interesting. They struggled with some things that clicked quickly for me because I immediately had "oh, these are set operators" intuition. Stuff like when to use an outer join vs a cross join, or how to control cardinality of output rows in fancy ways (vs slapping `DISTINCT ON` on everything).

I passed on the set understanding where it explained an unintuitive SQL behavior and I hope it helped 'em.
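For example (hypothetical tables): a semi-join written with EXISTS returns at most one row per customer, whereas joining directly against an orders table duplicates customers with multiple orders and tempts people to slap DISTINCT on the result:

  SELECT c.id, c.name
  FROM customers c
  WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)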


Absolutely. One of the biggest disservices that SQL does is getting people thinking of relations as "tables" with "rows", which ends up making them think in an iterative, sequential, tabular model for something that I think they'd be better off thinking more abstractly about.

The whole relational model clicked for me a whole lot more once I started thinking of each tuple as a factual proposition (customer a's name is X, phone number is Y), and then all the operations in the relational algebra start to look more like "how would it be best to ask questions about this subject?"... "I'm interested in facts about..."


For several days I've been having trouble finding good resources on _implementation_ for query execution/planning (predicates, existing indices (especially composite ones: how to locate them during planning, etc.), joins, etc).

Google is spammed with _usage_.

Does anybody have some recommendations at hand?

ps. the only one I found was CMU's Database Group resources, which are great


There's an entire 700-page book about this you can access for free here:

"Building Query Compilers"

https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.p...

EDIT: You also might get use out of TUM's course "Database Systems on Modern CPU Architectures"

The 2020 edition contains all the video lectures:

https://db.in.tum.de/teaching/ss20/moderndbs/?lang=en


I'd also recommend "How Query Engines Work" for a good practical exercise in the topic: https://andygrove.io/how-query-engines-work/ . I think one should already have a high-level overview of what individual parts make up a query engine and what their purpose is, as it only touches on that lightly.


I own the book and would agree -- only reason I didn't suggest this is that the areas on query-planning and optimization are a bit thin.

IMO it's great if you want a holistic view of building a query engine start-to-finish with just enough detail to build a basic implementation.

I probably learned more from the ~100 pages in this book than I did from most other sources.


I don't know if this has the level of detail you're looking for, but you might want to check https://www.sqlite.org/optoverview.html and https://www.sqlite.org/queryplanner.html.


The SQLite optimizer is really primitive. You generally don't want to be using it as a template if you can avoid it; it's good for OLTP but can't handle many-way joins well.


My go-to pointer for this is to read the Postgres docs (and / or source, which is also super readable).

https://www.postgresql.org/docs/current/planner-optimizer.ht...


Generally, this is a specialist topic, so you're unlikely to find e.g. good textbooks. It depends a bit on how deep you want to go and what specific parts you're interested in, though. (Query execution is pretty much entirely separate from planning, for one.)

The best join optimizer paper I know of is still the original Selinger paper (https://courses.cs.duke.edu/compsci516/cps216/spring03/paper...). It doesn't support outer joins and there are more efficient techniques by now, but anyone looking at a System R-like optimizer can read this and feel right at home. (There is also the Volcano/Cascades school of optimizers, which is rather different from System R, but I've found even less information on it.)

As others have said, Postgres' source and documentation are good. In particular, src/backend/optimizer/README contains a number of interesting things that you won't find a lot of other places. (The Postgres optimizer doesn't always use the latest fancy designs, but it's generally very polished and pretty easy to read.)

I can also echo Andy Pavlo's courses (at CMU); they're pretty much the only ones that will explain this stuff to you online. The “Building Query Compilers” PDF is rather incomplete and there's a lot of stuff I never cared for in there, but it contains several key papers from Moerkotte et al if you actually want to implement the modern stuff (DPhyp etc.). In general, most of the modern System R-like stuff (how to efficiently deal with outer joins, how to deal with interesting orders, how to deal with large queries) comes from the groups of Moerkotte and/or Neumann; all of these things had solutions before them, but less efficient and/or powerful and/or elegant.

Finding an applicable index isn't hard (you generally just try to see if it hits a predicate—a so-called “sargable predicate”, for SARG = Search ARGument). Estimating selectivity is hard. Estimating selectivity through joins is perhaps the hardest problem in optimizers, and nobody has truly solved it. This was one of the things I never really got to; there are so many hard subproblems. Like, for something that sounds really simple but isn't, what do you do if you have predicates A=x AND B=y AND C=z and you happen to have indexes on (A,B) and (B,C) (with selectivity/cardinality information) and want a selectivity for all three combined? There are papers that literally require you to build a “second-order cone programming” solver to solve this problem :-)
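To make that last example concrete (hypothetical schema, just restating the setup in SQL):

  CREATE INDEX idx_ab ON t (A, B);
  CREATE INDEX idx_bc ON t (B, C);

  SELECT * FROM t WHERE A = 1 AND B = 2 AND C = 3;
  -- idx_ab yields a selectivity estimate for A AND B, idx_bc for B AND C;
  -- combining the two without double-counting the shared B predicate is
  -- the hard part described above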


I just tried adding 'relational algebra' in front of my 'query planning' query, and at a glance it skews more towards implementation.


It seems content farms, farming keywords with just fluff, have taken over Google. I can't find how to do anything anymore, just pages and pages of people talking about doing something.

You could try your query on a different search engine. I've had good luck with kagi.


It’s out of date and probably has less detail than I remember, but I got a lot out of “Inside Microsoft SQL Server 7.0”, which does deep dives into storage architecture, indices, query planning, etc. from an MSSQL-specific POV. The book was updated for SQL Server 2005 and 2008. Obviously the book also has a ton of stuff about MS-specific features and admin functionality that’s not really relevant, and I’m sure there are better resources out there, but I’ve found the background from that book has helped me understand the innards of Oracle, MySQL, and Postgres in the years since.


I find the Advanced Databases Course from CMU an excellent resource. https://15721.courses.cs.cmu.edu/spring2023/schedule.html

You might want to look into academic papers, e.g., T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, in VLDB, 2011 https://www.vldb.org/pvldb/vol4/p539-neumann.pdf


The 14th way is “multi-way joins”, also called “worst-case optimal joins”, which is a terrible name.

It means instead of joining tables two at a time and dealing with the temporary results along the way (eating memory), you join 3 or more tables together without the temporary results.
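The canonical illustration (the standard textbook case, not necessarily RelationalAI's Dovetail specifics) is the triangle query:

  -- triangle query over R(a,b), S(b,c), T(a,c):
  SELECT R.a, R.b, S.c
  FROM R
  JOIN S ON S.b = R.b
  JOIN T ON T.a = R.a AND T.c = S.c
  -- any pairwise plan materializes one of R⋈S, S⋈T, R⋈T first, which
  -- can be far larger than the final result; a worst-case optimal join
  -- binds a, then b, then c across all three tables at once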

There is a blog post and short video of this on https://relational.ai/blog/dovetail-join and the original paper is on https://dl.acm.org/doi/pdf/10.1145/3180143

I work for RelationalAI, we and about 4 other new database companies are bringing these new join algorithms to market after ten years in academia.


Justin also has a post on WCOJ that's really solid:

https://justinjaffray.com/a-gentle-ish-introduction-to-worst...


Negating inputs (set complement) turns the join's `AND` into a `NOR`, as Tetris exploits.

The worst case bounds don't tighten over (stateless/streaming) WCOJ's, but much real world data has far smaller box certificates.

One thing I didn't see is whether Dovetail join allows recursive queries (i.e., arbitrary datalog with a designated output relation, and the user having no concern about what the engine does with all the intermediate relations mentioned in the bundle of horn clauses that make up this datalog query).

Do you happen to know if it supports such queries?


Excellent! We need more articles like this that demonstrate the subtleties of the relational model to primarily app-level developers. The explanations and explorations in terms of functional programming are both concise and compelling.


Another missed chance to educate on the N+1 problem. A join on an unclustered index is still N+1; it’s just N+1 on disk instead of N+1 over the network and disk.


"Another missed chance to talk about X problem I find interest in and would bloat the article"


An inner join is a Cartesian product with conditionals added.


There's a big performance difference between creating the Cartesian product and then filtering on the condition, versus evaluating the join directly. An inner join with equi-join conditions can produce the matching rows directly (e.g. via a hash or merge join); any non-equi join conditions actually have to be evaluated against candidate rows.


That's true, but that is an implementation detail. In abstract terms, you can think of an inner join like that.

Also, while what you say is true in general for modern DBs, there are some implementations, like old Oracle versions, where the only way to create the effect of an inner join was in terms of a Cartesian product.


This dude's blog is fire. Very nice explanations of complex database topics.


Justin Jaffray is a gem


Very nice explanation!

> The correct way to do this is to normalize the table

This is true for transactional dbs, but in data warehouses it's widely accepted that some degree of denormalization is the way to go


Great article. Side note: his normalization example reminded me of how I used to design tables using a numeric primary key, thinking they were more performant than strings. But then I’d have a meaningless id which required a join to get the unique value I actually wanted. One day I realized I could use the same unique key in both tables and save a join.

Simple realization. Big payoff
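A sketch of the realization (hypothetical tables): carry the meaningful unique value itself as the key in both tables, and the lookup join disappears.

  CREATE TABLE statuses (code varchar PRIMARY KEY);
  CREATE TABLE orders (
    id int PRIMARY KEY,
    status_code varchar REFERENCES statuses (code)
  );

  -- the status is readable without joining back to statuses
  SELECT id, status_code FROM orders;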


I still like to have a unique id field per table. It helps with logging, and it doesn't care about a multi-field "real" key.

However, I keep a unique index on the string value and, more importantly, point integrity constraints to it, mainly for readability. It's way easier to read a table full of meaningful strings than one full of numeric ids or UUIDs.


A couple things:

This is admittedly a bit pedantic, but in E.F. Codd's original paper, he defined "relation" as the relation between a tuple (table row) and an attribute (table column) in a single table - https://en.wikipedia.org/wiki/Relation_(database). I'm not sure of the author's intent, but the user table example (user, country_id) might imply the relationship between the user table and the country table. It's a common misconception about what "relational" means, but tbh I'm fine with that since it makes more sense to the average developer.

If you ever need to join sets of data in code, don't use nested loops - O(n^2). Use a map / dictionary. It's one of the few things I picked up doing Leetcode problems that I've actually needed to apply irl.


I think understanding that “relation” means the relationships of attributes to a primary key is a crucial, foundational concept. Without it, you can’t properly understand a concept like normalization, why it arises, and when it applies. You can’t even fully grasp the scope of “data corruption” (most developers have an extremely poor understanding of how broad the notion of corruption is in databases).


I think it depends on the data.

If it's presorted by the join variable then rolling the loop is faster. Also, if the index is too big for memory, then it might be faster to loop.


Yeah. Many database tables are too large for an in memory hash join.

My comment was a not very well fleshed out tangential remark on the value of practicing DS&A problems. I know a lot of devs hate Leetcode style interviews. I get it. It's not fun. But contrary to what some people say, I have run into a fair number of situations where the practice helped me implement more efficient solutions.


I think the blog author uses it the same way Codd does, that's why he's talking about two columns from the same table.



I always thought a Venn diagram was a good representation, but I think I was wrong.

Edit: Why did I get down voted? :)


I would also like to know why you are getting downvoted.

Even wikipedia uses a Venn diagram to explain JOIN https://en.wikipedia.org/wiki/Join_(SQL) .

Not trying to use an argument from authority but just pointing out that this is not unheard of.


Venn diagrams are a terrible way to describe join types (and, tbh, I don’t understand why Wikipedia has these) because they make it look like you're applying a 1:1 relationship. In an M:N relationship, „artefacts“ (for lack of a better word) of both tables would appear multiple times, and the Venn diagram obscures this fact.
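A minimal illustration (made-up rows): if R and S each contain two rows with k = 1, the inner join returns four rows; no two-circle Venn diagram conveys that multiplication.

  -- R: (1, 'r1'), (1, 'r2')    S: (1, 's1'), (1, 's2')
  SELECT * FROM R JOIN S ON R.k = S.k
  -- 4 result rows: every R row paired with every matching S row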


> because it makes it look like applying an 1:1 relationship.

People can form different mental models of the same abstraction so I see what you are saying

I've never seen it that way because "Venn diagrams do not generally contain information on the relative or absolute sizes (cardinality) of sets." (see https://en.wikipedia.org/wiki/Venn_diagram).


I've always found Coding Horror's Venn diagram depiction of SQL joins quite enlightening:

https://blog.codinghorror.com/a-visual-explanation-of-sql-jo...


Because a Venn diagram does not describe join mechanics well. A Venn diagram does, however, describe the UNION, INTERSECT, and EXCEPT parts of SQL.

https://blog.jooq.org/say-no-to-venn-diagrams-when-explainin...

And more meta: it is an innocent, slightly incorrect statement; stuff like that should not be downvoted. Reply with a correction, and save downvotes for outright malicious posts.


This is a nice way of explaining a concept, and could probably be applied to any complex topic. As someone who learns best by analogy, concepts usually "click" for me once I've associated them with a few other concepts. So I appreciate the format, and would personally enjoy seeing more explainers like this (and not just about database topics).


I think the next article could be about group by.

Like (only intuitively so far...)

A group by from A rows to B rows - is a map-reduce job - is an AxB linear transformation matrix from your linear algebra course - is...
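Even the basic shape supports the map-reduce reading (hypothetical orders table): GROUP BY maps N input rows onto one output row per key, and the aggregates are the reduce step.

  SELECT customer_id, COUNT(*) AS orders, SUM(total) AS revenue
  FROM orders
  GROUP BY customer_id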


Can join also be thought of as unification? Similar to type checking.

https://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/le...


Fantastic post, I always enjoy reading about different computation mental models.

And under "A join is a…join", there's a typo in the partial order properties. It currently reads:

    1. Reflexivity: a≤b,
And I'm pretty sure it should be

    1. Reflexivity: a≤a,
instead (i.e., every element is ≤ to itself).
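For reference, the three defining properties of a partial order:

    1. Reflexivity:  a ≤ a
    2. Antisymmetry: a ≤ b and b ≤ a  implies  a = b
    3. Transitivity: a ≤ b and b ≤ c  implies  a ≤ c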


You are correct! I will fix it when I get home, thank you for the correction!


Very useful. Something like this but for Flink’s fancy temporal, lateral, interval, and window joins would be great too.


   1. cross join
   2. natural join
   3. equi join
   4. theta join
   5. inner join
   6. left outer join
   7. right outer join
   8. full outer join
   9. left semi join
  10. right semi join
  11. left anti semi join
  12. right anti semi join
  13. ???


13. lateral join :)
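For anyone who hasn't met it, a sketch of what LATERAL adds (hypothetical tables): the subquery on the right may reference columns from tables to its left, here yielding the top three orders per customer.

  SELECT c.name, o.total
  FROM customers c,
  LATERAL (
    SELECT total
    FROM orders
    WHERE customer_id = c.id
    ORDER BY total DESC
    LIMIT 3
  ) o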


The title of this post sounds like an advertisement for a cult


Very cool work!


I imagine the title ‘thirteen ways of looking at a join’ was taken?


Dunno why this was downvoted; I came here to make a similar comment about the possible Wallace Stevens reference.


Obviously I'd only be able to say for sure if I was the downvoter but I sometimes observe lighter text on comments which make a reference without calling out the reference. It's similar to using an acronym without defining it, though possibly more confounding if it's a niche enough reference. In this case, I did not get the reference and would have wondered why the parent commenter expected that title to have already been used.

At best, such a comment is referencing something topical which most readers will get (and ostensibly be entertained by); at worst, it's a distracting non sequitur. It generally ties back to a community preference that comments are curiosity-satisfying before entertaining.


Wouldn’t an allusion to 20th century American poetry fall into the category of “curiosity satisfying”? Given they were not the only person who got the reference it feels kind of arbitrary to say this is frivolous entertainment when another person in the community (in this case, me) found it curious and also wondered if there was an allusion there.

If it was an intentional allusion, then it may actually add depth/meaning to the conversation but we may not know since it was already downvoted…


It could be if it was called out as such. Without the explicit callout, one runs the risk of it “going over the readers’ heads” so to speak. Anyway, I’m not intending to justify any behavior; just offering my interpretation based on past observations.


Humor on HN has to thread a needle. I take a lot of shots even though they often don’t work out. As an AI language model it’s a risk I take.



