
Say no to Venn diagrams when explaining SQL joins (2016) - alecco
https://blog.jooq.org/2016/07/05/say-no-to-venn-diagrams-when-explaining-joins/
======
obeattie
Honestly, if it were not for Venn diagrams, I may have never learned how SQL
joins work. It might not be perfect, but if you'd tried to make 16-year-old me
understand relational algebra in order to grok how the joins in his Django
site were working, he'd have given up before he began.

That's not to say that other ways of illustrating things don't have merit, but
broad statements like "say NO to Venn diagrams" just aren't helpful. If people
find Venn diagrams a useful tool (and their proliferation indicates they do),
I don't see the harm. They didn't stop me going deeper on stuff like
relational algebra when I was ready to.

~~~
hinkley
The thesis is Say no to Venn diagrams for _Joins_.

The first section is about how Venn works fine for AND and OR because it’s set
theory and Venn works just fine for examples (maybe not every problem though).

Joins are not sets. They are Cartesian products.

~~~
y4mi
I'll be honest: if I didn't know SQL already, I would have no idea what his
diagrams are trying to convey...

I do know at this point that he is technically right... But being 100% correct
doesn't really matter at the beginning.

And Venn diagrams are right enough to start working with joins

~~~
hinkley
That may be a fair critique. Should SQL be successful only to mathematicians
and computer scientists? (Who else would even know about set theory or Venn a
priori?)

~~~
darkpuma
In my underfunded rural American elementary school, they taught us Venn
diagrams. Not set theory per-se, but they taught us to read and create Venn
diagrams.

I don't know if that's representative, but my gut instinct is that Venn
diagrams are very accessible to the general public.

------
0xfffafaCrash
This is silly. They're a cheatsheet for those who know how SQL joins basically
work and what they represent actually makes perfect sense if you don't
misrepresent what sets the circles refer to.

Two tables (A & B) are somehow being joined (in the sql sense, by some query)
ON a column from each table expressing the same type of id key.

Consider the union of values for those columns in either table. That is the
set containing everything in either circle in the Venn diagram. Then consider
the values which appear just in A or just in B -- those, respectively are the
non-overlapping sides. Then consider the intersection (values which appear as
values in both A & B), that is the middle part shared by both circles.

The highlighting on the Venn diagram shows which sets of values for the key
could be made to appear in the resulting join for each type of query (and,
yes, by extension indicating whether some other non-null data could also be
coming from a given table in the cases where the given value is present for
the table with the ON'ed column in question)

That's perfectly valid for a Venn diagram. The representation is meaningful
and concise.

If you don't understand the fundamental nature of what is happening in a SQL
JOIN I can see why you may need a more detailed/verbose diagram for each type
of join, but the diagrams don't purport to be about the sets of rows from a
table in the result and anyone who grasps the basic principle behind the joins
won't struggle with the Venn diagrams because they strike the nail on the head
with respect to the differences between each type of join without the need for
verbosity & redundancy.

~~~
mabbo
Because it's the wrong model.

Take two tables of 1m rows each and do a cross join on them. The Venn diagram
model says you would have at most 2m output rows. But that's wrong. You would
have 1t rows output. When when you're using a normal join on a specific value
matching, if the columns are unsorted, that's 1t comparisons potentially.

The right mental model leads to the right intuition of what's going under the
surface. And it's all well and good to say "If you don't understand the
fundamental nature of what is happening in a SQL JOIN..." But that is exactly
who these diagrams are intended for! People who don't understand yet and are
trying to.

Teaching people with the right model means they don't have to unlearn the
wrong model later when questions get more complicated.

~~~
0xfffafaCrash
I have yet to see a cross join represented by a Venn diagram, so this strikes
me as a red herring on the topic of whether Venn diagrams for the joins that
are usually represented with them make sense.

~~~
ryanmonroe
The author is arguing for a mental model in which all joins are just cross
joins with a filter added, and that Venn Diagrams are a bad representation
because they depict set differences, which is not what SQL joins are.

Say tbl_1 has 5 rows, and column "A" is the numbers 1 to 5. Then tbl_2 has the
numbers 1 and 2, each repeated 5 times. Now how many rows are in the result of
the query below?

    
    
        select tbl_1.A from tbl_1 left join tbl_2 using(A)
    

The fact that the result has 13 rows (more rows than either of the tables!)
seems like it would be pretty surprising to someone using the left-join venn
diagram as a mental model. The Venn diagram depicts a left-join as a set
difference, which it clearly is not.

~~~
0xfffafaCrash
Diagrams and notation serve similar symbolic roles. Just because you can
represent a real number, say 5, in a complex number's notation (5 + 0i) and
you can't represent a complex number as a real number doesn't mean that you
should always use the notation for complex numbers when trying to represent
different real numbers. You typically want to use a concise representation
that gets the message across without adding extra noise. Things can always be
abstracted further into a more generalist, but more verbose representation.
Complex numbers can be represented as quarternions, and so on. For the same
reason, it makes little sense to use the existence of a JOIN that doesn't use
ON (and is frankly not as commonly used) as a reason to complicate the
representation of the 7 JOIN queries that do use ON. If translating real
numbers into a notation that represents the complex numbers helps you anchor
your understanding of complex numbers, great, but I wouldn't buy that our
natural representation of real numbers is deficient as a result. I also don't
buy that using Venn diagrams to represent JOINs that use ON are "PLAIN WRONG!"
because there are things that they can't represent and aren't used to
represent.

~~~
ryanmonroe
The fact that I don't use "on" is just syntax, it doesn't change the query.
The following query is equivalent.

    
    
        select tbl_1.A 
        from tbl_1 
        left join tbl_2 
          on tbl_1.A = tbl_2.A
    

I don't think this is a special complex example. It's about as simple as you
can get.

~~~
0xfffafaCrash
This was incorrect:”The Venn diagram depicts a left-join as a set difference,
which it clearly is not.”

It is in fact a set difference and the sets in question are values for the
column being ONed on in each table (as I explained in the top level comment)

My reply was with respect to these being cross joins with filters. That’s just
one way of thinking about a JOIN that actually adds notational complexity (and
does not actually represent how your database would be doing the join because
it would be unnecessarily expensive computationally). Thinking of it in terms
of a set difference of the values in thr ON column is arguably the better way
of thinking of it if you are interested in how such a join should be
implemented. Alternatively you can interpret the venn diagrams as just the
filtration if you prefer interpreting as a cross join with filters.

~~~
ryanmonroe
f you have one set with 5 elements and another with 10, the set difference
cannot have 13 elements.

\-------

In anticipation of some possible technical replies, let me be more explicit:

It isn't a set difference, no matter how you view it.

Viewpoint 1: The original lists of objects are not actually sets because the
elements are not distinct. From this viewpoint, it isn't a set difference
because they aren't even sets to begin with.

Viewpoint 2: There is an implicit "row number" feature of each object so in
that sense the objects are distinct and do form sets. From this viewpoint, it
still isn't a set difference. If you have one set with 5 elements and another
with 10, the set difference cannot have 13 elements.

------
40four
I love this premise! I have always hated the venn diagram approach. When I
started programming, I never found them useful. I agree it is the wrong model
to teach what is happening with joins. It think it produces more confusion.

The author's approach is a better visual model. Using boxes makes so much more
sense, since we are dealing with rows of data. The circles of a venn diagram
don't represent a result set visually, and i think that's why I always found
it confusing.

The the main point I think should be expressed to really explain what's
happening in a join is about null values. If one table has nulls in the
matching set, which ones do we want to keep, or discard? The box approach
shows this beautifully, while a venn approach doesn't really make this
obvious.

The best visual representation I've ever seen is similar to the author's, but
animated.
[https://news.ycombinator.com/item?id=16983007](https://news.ycombinator.com/item?id=16983007)

------
mabbo
I remember being a third year CS student, studying for some database exam
coming up, and a friend of mine (a wickedly smart second year student)
explained joins as Cartesian products to me. Took her all of 60 second to turn
my entire brain over. Blew my mind.

I had been _using_ SQL but not _understanding_ it. Mostly because of poor
previous explanation via things like Venn diagrams- the wrong model lead me to
think incorrectly and make mistakes.

With the right model in my head, the exam was dead easy. I don't even know
that she realized how impactful that was to me. So on the off chance you're
reading HN, thanks Wen!

~~~
alexis_fr
So in a sense she drew a Wen diagram?

------
Vogtinator
I really don't see the point here.

If I turn the "correct" diagrams sideways, it looks very much like a venn
diagram again.

~~~
0xfffafaCrash
Yep: [https://i.imgur.com/cgYkQbS.png](https://i.imgur.com/cgYkQbS.png)

I'm drawing the conclusion that some people just need to see arbitrary
concrete values to be able to understand what's going on.

~~~
johnNash05
This image isn't opening on my end. Can you please share in another format.
You can use
[https://www.coolutils.com/TotalImageConverter](https://www.coolutils.com/TotalImageConverter)

------
quietbritishjim
Here's the key point I think the article misses: Very nearly always (I mean
really really nearly always), when you do a left/right/inner join, the
condition you use for the join is a unique (usally primary) key on one side
and a foreign key on the other side. So although in theory you start with the
full cartesian join and then filter some of the results out (picking out of
n×m entries), in practice you are only left with at most max(n,m) entries. So
something like a Venn diagram is much less wrong than it is for the outer join
(but still a _bit_ wrong).

What I would like to see is inner/left/right joins explained with colours used
to indicate value (as the article already does) but in the form of entries in
a PivotTable (as the article used only for outer join), perhaps with faded
entries for those that were filtered out. Most of all, it would be great to
see this with non-unique values in _both_ tables, as rare as this admittedly
is. This could be followed up by the more usual primary key / foreign key
combination. At that point it would be visually clear that only max(n,m)
entries are left, so you could finish up by showing those evil Venn diagrams
after all, with the disclaimer that you understand that there's a full n×m
combination under the hood in theory.

------
scarface74
I once interviewed a QA person for a very heavy data centric project.

He didn’t know SQL well, but he described a left join problem perfectly by
drawing a Venn diagram to explain the data that needed to be retrieved. When I
was about to finish up the interview, he was so curious about the answer he
insisted that I explained the answer so “he could sleep at night”.

I gave him a thumbs up just because he knew the concept and displayed
intellectual curiosity. It was a great choice.

------
Groxx
Since _by far_ the most errors I see in SQL queries are places where Venn
diagrams do not accurately describe SQL behavior... yeah, I'm gonna have to
agree, though I'm not sold on the alternative here. Like 90%+ of the flaws I
run across in queries are due to multiple matching rows on the right side,
without `distinct` or an aggregation of some kind.

1:1 matches on rows are trivial. Venn diagrams can describe these too. But you
really can't let people develop an intuition based on that, early on, or they
kinda get stuck and it takes a while to unravel.

------
nkozyra
I think being overly concerned with an accurate representation of a concept is
less important than conveying information.

A Venn diagram may not be a perfect analog for joins but they relay the
concept quite well.

~~~
okl
I don't think that is true. If you base your understanding on an inaccurate
representation, then you may later on run into problems trying to fit new
information into your inaccurate thought-model. For example explaining
electric circuits with water analogies (like described here:
[https://learn.sparkfun.com/tutorials/voltage-current-
resista...](https://learn.sparkfun.com/tutorials/voltage-current-resistance-
and-ohms-law/voltage)) cannot account for EMF.

~~~
paulgb
Every model or analogy has this problem, but it doesn't make models and
analogies useless for learning.

~~~
nkozyra
Exactly. Abstractions are lossy. The goal here is comprehension, I don't see
how the models here differ in that regard.

------
TravelAndFood
Really well done, but I agree with one of the comments in there: need to add
patterns to the solid colors; it would differentiate for color blind folks and
in general make the relationships easier to grasp.

~~~
daveFNbuck
I tried pasting the cross join and inner join into a colorblindness simulator
[1] and the only issue I found is that the A and the 2 in the cross join look
similar if you have monochromacy. Looking at the monochromatic version, I
don't think this really affects the clarity of the cross join diagram.

The inner join is already perfectly distinguishable by how dark the squares
are, and anything other than monochromacy preserves the distinctions that the
color adds in the cross join, unless the site I used is misleading.

[1] [https://www.color-blindness.com/coblis-color-blindness-
simul...](https://www.color-blindness.com/coblis-color-blindness-simulator/)

------
your-nanny
I have always found the use of Venn Diagrams for this purpose utterely
confusing, and even wrong headed. Bravo.

------
zoomablemind
Many ways to skin a cat...

SQL used to have an intriguing aura of exotic knowledge back in time when SQL
was still a Language to highlight on a resume and to have an RDBMS server
meant spending thousands.

So Venn diagrams were just adding to such aura. But it's a visual aid and is
usually supplemented with examples. If it doesn't ring a bell, some hands-on
or detailed explanation helps crack it.

Personally, I believe JOIN should first be touched upon when introducing
normalization. It just makes sense: break tables apart then join them
together. Here one could also grasp a utility of an index on join column.

Then progress to more advanced use of joins - outer join (needs a concept of
NULL), union (basically a synthetic set) etc.

Finally, a cross join - as one of those cases that may lock your terminal if
you don't know what you're doing. Of course learning Cartesian product concept
here, if not yet.

As anything in Computer, the hands-on aspect usually clears thing much better
then science-first. The science would ease in much better once the utility is
grasped.

Teaching means helping someone to learn, not making them understand your way
of teaching. So different aids may work for different audiences.

------
tarabanga
Either the author is not proposing an alternative, or the alternative is
formulae and poorly formatted screenshots, or my ublock origin settings are
blocking the perfect method to explain SQL joins in 100x100 pixels.

------
iiioooppp
Why does SQL have both WHERE and ON, plus various kinds of JOINs? It seems to
me that it would be simpler to have just have one JOIN which represents the
Cartesian product, plus WHERE for filtering - and then the DB engine could
apply optimizations so that queries would actually run as fast as in existing
SQL. Am I missing something?

~~~
barrkel
The bit your missing is the awkwardness of expressing outer joins. The way
Oracle does it is an existence proof of its awkwardness.

The 'FROM' clause defaults to Cartesian product for multiple tables. You can
then specify your join conditions in the where clause, but it makes outer
joins harder to write - Oracle does them with special (+) operator, but it's
clumsy.

It doesn't (shouldn't) make much difference at all to the database whether you
specify your join condition after the `ON` or in the `WHERE` - if it doesn't
hoist `WHERE` conditions into the join clause, and potentially change the join
order because of it, it's missing big optimization opportunities.

~~~
craigds
Can you expand on how oracle does it?

I can't imagine any reasonable way to do outer joins in a where clause.

For inner joins the grandparent is correct and it should never matter whether
you use ON or a where clause.

~~~
barrkel
[https://stackoverflow.com/questions/22678643/old-style-
oracl...](https://stackoverflow.com/questions/22678643/old-style-oracle-outer-
join-syntax-why-locate-the-on-the-right-side-of-the)

Of course it's weird and ugly. I don't think there's anything to recommend it
over ANSI join syntax.

------
gregoriol
Thanks for the link, it is so true, well done!

------
jmmcd
A lot of people are saying hang on, they're useful for explaining. But my
experience when teaching relational databases is that they give the _illusion_
of understanding. I'd rather have the students know that they do not know (and
work harder) than look at a Venn diagram, think they get it, and move on.

------
burnstek
What a silly little article. Venn diagrams are an extremely illustrative and
useful way to visualize joins.

------
roywiggins
This makes me glad I learned SQL first with relational algebra. Probably that
wouldn't work for a lot of people, but as a math major it made the basic
concepts pretty clear.

------
xivzgrev
Lol I used Venn diagrams to explain joins just a week or two ago. It does the
trick but doesn’t account for many to one joins

------
haroldl
“A JOIN is really a cartesian product (also cross product) with a filter.” It
always boils down to this for me.

------
laszlokorte
I have always been wondering why Venn diagrams are used for illustrating
joins...

~~~
k__
Seemed like a good idea. It certainly helped me to get inti the topic.

------
coldtea
> _Honestly, if it were not for Venn diagrams, I may have never learned how
> SQL joins work. It might not be perfect, but if you 'd tried to make
> 16-year-old me understand relational algebra in order to grok how the joins
> in his Django site were working, he'd have given up before he began._

The whole point of the article is that there are different visualizations that
are much better than Venn diagrams (in fact Venn diagrams are totally
misleading), and they don't require any relational algebra.

That said, if someone doesn't know what a cartesian product is (high school
level math), perhaps they shouldn't be allowed near a database...

~~~
obeattie
>if someone doesn't know what a cartesian product is (high school level math),
perhaps they shouldn't be allowed near a database...

This is such an intellectually snobbish position to take. I can tell you that
I didn’t know this term when I first interacted with a database, and if
someone had told me the above, I would have just felt stupid and given up.

The fact you learned what a Cartesian product was at high school shouldn’t
preclude someone who hasn’t from trying to build stuff and experiment with
tech. Not everyone learns things the same way, or in the same order, and
that’s okay.

~~~
coldtea
> _This is such an intellectually snobbish position to take._

Intellectually snobbish is "everyone should listen to Mahler and read
Epictetus in the original language".

This is just basic professional requirement...

> _The fact you learned what a Cartesian product was at high school shouldn’t
> preclude someone who hasn’t from trying to build stuff and experiment with
> tech._

Just not in any capacity where people depend on their output...

~~~
cortesoft
You didn't say "should not be allowed near a production database" you said
"database"

That is very different. This was a 16 year old kid trying to mess around and
learn stuff, he should be able to be around databases all he wants. I wouldn't
hire him as a DBA, but he should be able to learn without having to wait for
more advanced math.

~~~
coldtea
> _You didn 't say "should not be allowed near a production database" you said
> "database". That is very different._

And that's what I meantt, so since you agree that's "that's very different",
we're on the same page.

> _a 16 year old kid trying to mess around and learn stuff, he should be able
> to be around databases all he wants_

Well, it's not like a comment on HN is going to stop them.

~~~
cortesoft
Well, you were replying to someone who said they learned SQL for their Django
site when they were 16, and you admonished them for not knowing Cartesian
products and using a database... so it was certainly not clear that you meant
a production database.

