
Cost of a Join - DerpyBaby123
https://www.brianlikespostgres.com/
======
jjeaff
I love a good writeup backed up with some performance benchmarks.

But I'm a little befuddled as to why the three possible solutions to putting a
"status" column in your Products table were:

        1. Add a status_id column to the product table and reference a new status table
        2. Add a status_id column to the product table and let the app define the mapping of what each status_id is
        3. Add a status column of type text to the product table

What about :

        4. Create an enum column with your possible statuses.

If you ever reach more than 65k possible statuses, you've got a different
problem.
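Option 4 is simple in Postgres; a minimal sketch, with hypothetical table and status names:

```sql
-- Option 4: a native enum type (hypothetical names)
CREATE TYPE product_status AS ENUM ('draft', 'active', 'discontinued');

ALTER TABLE products
    ADD COLUMN status product_status NOT NULL DEFAULT 'draft';
```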

Perhaps the status example is just hypothetical? But the author says they
would use option 1 in this case, so what gives? It seems crazy to create a
separate table to join against just for 'status'.

~~~
__s
I was using enums, but then I found out that you can't alter an enum inside a
transaction. Very painful in a knex environment, where up/down migrations all
run within a transaction.
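For reference, the operation in question, with a hypothetical `product_status` enum (before Postgres 12 this errors out inside a transaction block; since 12 it is allowed, though the new value can't be used in the same transaction):

```sql
BEGIN;
-- Pre-Postgres 12: ERROR: ALTER TYPE ... ADD cannot run inside a transaction block
ALTER TYPE product_status ADD VALUE 'archived';
COMMIT;
```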

~~~
yen223
If knex doesn't provide facilities for running migrations outside of a
transaction, you're going to have an awful time.

Useful operations like CREATE INDEX CONCURRENTLY can't be run in transactions.
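A sketch of what that failure looks like, assuming a hypothetical `products` table:

```sql
BEGIN;
-- ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_products_status ON products (status_id);
```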

------
Terr_
The submitted URL is not stable; it just links to the front page. So for the
benefit of later visitors, here's the direct link:

[https://www.brianlikespostgres.com/cost-of-a-join.html](https://www.brianlikespostgres.com/cost-of-a-join.html)

~~~
DerpyBaby123
Thanks!

------
jaggederest
Hmm, I think these results are somewhat suspect. Postgres isn't even really
hitting its stride yet at a million rows. You have to start looking at 100GB+
database sizes; otherwise almost everything fits in working memory, or at
least the indexes do.

My experience has been that joins are, in fact, cheap in larger databases,
but they do scale with the size of the tables involved, so you should be
cautious about excessive joining.

In the extreme case, I wouldn't make a table for each attribute unless there
were some overriding reason (at which point you've essentially rediscovered
column-oriented databases). I do sometimes normalize up to fifth normal form
(5NF) or domain-key normal form (DKNF), though, which I think is
substantially beyond what most people designing or revamping a database would
do.

~~~
Jupe
Normalize till it hurts... Denormalize till it works :)

~~~
lessclue
Golden words. This is the cycle that keeps repeating in complex production
systems.

------
zmmmmm
Great to see this analysis. Especially with the proliferation of ORMs that
create relationship tables like confetti, it is very useful (and comforting)
to know that needing to join a large number of tables is not, in and of
itself, a performance disaster. It does not alleviate all my concerns with
join-happy designs, but it is definitely good to know.

------
asavinov
There are two general (potential) problems with the use of (multiple) joins:

* run time: performance disaster

* design time: conceptual chaos

Both are analyzed in [1], which essentially argues that the relational join
should be considered harmful and describes a join-free (column-oriented)
approach. This approach has been implemented in [2], a library for batch and
stream data processing, and in [3], a library for data analysis. Their common
distinguishing feature is that they rely on _column operations_: data
processing is described as a graph of operations over columns, as opposed to
set operations like join.

[1]
[https://www.researchgate.net/publication/301764816_Joins_vs_...](https://www.researchgate.net/publication/301764816_Joins_vs_Links_or_Relational_Join_Considered_Harmful)
Joins vs. Links or Relational Join Considered Harmful

[2] [https://github.com/asavinov/bistro](https://github.com/asavinov/bistro) -
A general-purpose data analysis engine radically changing the way batch and
stream data is processed

[3] [https://github.com/asavinov/lambdo](https://github.com/asavinov/lambdo) -
A column-oriented approach to feature engineering. Feature engineering and
machine learning: together at last!

(Disclaimer: I am an author)

------
protomyth
I really don't understand the reasoning behind the 3 choices:

 _1. Add a status_id column to the product table and reference a new status
table_

 _2. Add a status_id column to the product table and let the app define the
mapping of what each status_id is_

 _3. Add a status column of type text to the product table_

Option 3 is unacceptable under any circumstances, and option 2 really makes
no sense from a database perspective (unexplained values in a database are
just wrong).

Choice 1 is pretty normal, but I am really skeptical of the idea that any app
needs to join with that status table to get anything but the flags, values,
state, or labels associated with a status. Using a status table as the
starting point of a join is just wrong and signals very poor data management
in the relationship between the app folks and the database folks. I can see
using a text field to look up and store the id, but really, the mechanism
that populates your status table (with whatever flags, values, state, or text
is associated with each status) should also generate the equivalent code or
config for the app.
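One hedged sketch of that generation step, assuming a hypothetical `status(status_id, label)` table: emit the app-side constants straight from the table, so the mapping cannot drift from the database.

```sql
-- Generate app-side constants from the status table (hypothetical names)
SELECT format('STATUS_%s = %s', upper(label), status_id)
FROM status
ORDER BY status_id;
```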

Indexes on status are often painful because a status column generally has
only a small number of distinct values, and an index that doesn't cut down
the number of rows significantly is often a problem. My big rule about this
is _if the app asks for a status field on an object, expect someone (probably
your report writer) to want all the objects with the same value of that
status field_. A typical example is open invoices. You really need to look at
where the index is sparse and whether it helps in each situation. This is
especially true if you track statuses through time (e.g. itemId, statusId,
effBegUTC, effEndUTC)[1]. It is definitely worth the time to go over work
tables, views, and each index with someone with real experience (yes, a good
DBA).
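For the open-invoices case, one way to keep such an index useful in Postgres is a partial index, sketched here with hypothetical table and column names:

```sql
-- Index only the rows reports actually scan, e.g. open invoices
CREATE INDEX idx_invoices_open
    ON invoices (customer_id)
    WHERE status = 'open';
```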

1) If you are mostly looking to get all the items currently in a single
status, then leave effBegUTC out of the index and always put a single
sentinel value in effEndUTC for current rows (e.g. 9999-12-31). An index on
(effEndUTC, statusId) is better than (effBegUTC, effEndUTC, statusId).
Whether to include itemId in the index really depends on the optimizer and
how covering indexes are handled.
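A sketch of that index and the current-status lookup it serves, with hypothetical names:

```sql
-- "Current" rows carry the sentinel end date, so the lookup stays narrow
CREATE INDEX idx_item_status_current
    ON item_status (eff_end_utc, status_id);

SELECT item_id
FROM item_status
WHERE eff_end_utc = '9999-12-31' AND status_id = 3;
```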

------
cuchoi
Would love to see an analysis of how these results depend on the width
(number of columns) of the tables involved.

