
The myth of “joins don't scale” - forrestbrazeal
https://blog.dbi-services.com/the-myth-of-nosql-vs-rdbms-joins-dont-scale/
======
crazygringo
Unfortunately, the author/article seems to _completely_ miss the meaning of
"joins don't scale".

 _Obviously_ indexed joins on a single database server scale just fine;
that's the entire selling point of an RDBMS!

The meaning of "joins don't scale" is that they don't scale _across_ database
servers, when your dataset is too big to fit in a single database instance.
Joins scale across _rows_ , they don't scale across _servers_.
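To make the distinction concrete, here's a rough sketch (two in-memory SQLite databases standing in for two separate servers; all table and column names are hypothetical) of what happens once the tables live on different machines: there is no cross-server JOIN, so the join logic moves into the application.

```python
import sqlite3

# Two in-memory databases stand in for two separate database servers
# ("shards"). Schema and names here are purely illustrative.
users_db = sqlite3.connect(":memory:")
orders_db = sqlite3.connect(":memory:")

users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "alice"), (2, "bob")])

orders_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                      [(10, 1, 9.50), (11, 1, 5.25), (12, 2, 20.00)])

# No cross-server JOIN exists: the application queries one server, then
# issues a second query against the other and stitches the rows together.
rows = orders_db.execute(
    "SELECT user_id, SUM(total) FROM orders GROUP BY user_id").fetchall()
ids = [r[0] for r in rows]
placeholders = ",".join("?" * len(ids))
names = dict(users_db.execute(
    f"SELECT id, name FROM users WHERE id IN ({placeholders})", ids))
report = {names[uid]: total for uid, total in rows}
print(report)  # {'alice': 14.75, 'bob': 20.0}
```

Everything an RDBMS does for you in a single-server join (ordering, batching, consistency) becomes application code here, which is the real content of "joins don't scale across servers".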

Now a lot of people don't realize how insanely powerful single-server DBs can
be. A lot of people who assume they need to architect for a multi-server DB
don't realize they could get away with a hugely-provisioned single-server
SSD-backed indexed database with a backup, and still have performant queries.

But if you're absolutely sure you need to be planning to run at close to
Twitter or Facebook scale one day, then yep, you'd better architect from the
beginning not to use joins. And then whether you pick an RDBMS or NoSQL
solution is mostly a tooling issue.

~~~
yelloweyes
reality is 99% of companies out there will be just fine with a single db
server

all this nosql fad is simply because SQL is hard and newbies would just rather
call it old and slow because they can't be bothered to learn it

~~~
AtlasBarfed
Your single server is a CA service.

If you want CP or AP, then you need a "fad" database.

~~~
deleuze
single server sql databases are not "CA"; it doesn't make sense to talk about
the cap theorem like this.

------
m0xte
It’s not a myth. In the trite examples provided, with tiny datasets, they
scale. In stark reality, when you’ve got 5000+ qps of dog shit queries with
several joins each and buggered table statistics (thanks, optimiser!) coming
in on realistic datasets, joins are a damage amplifier.

Really, all a join does is repeat work, in the most costly way possible, that
should have been done up front once. But doing it up front incurs the cost of
knowing what you’re looking for before you do it.

~~~
yelloweyes
you either need a relational model or you don't. if you try to build a complex
enterprise application with nosql all that's gonna happen is you end up
reinventing an RDBMS

there are not many companies in the world that can afford to reinvent shit

~~~
m0xte
I entirely disagree with the first point on the basis that most enterprise
applications are CRUD with a poorly contrived search interface. Both of which
are somewhat better serviced by anything other than SQL.

The second one however is on the mark.

------
htfy96
This article only tested two pages, each with millions of rows. In corporate
development (and probably the original meaning of "joins don't scale" in the
era of Oracle/MSSQL), it's usually the opposite: you're joining half a dozen
tables, each with only a little data, but their mappings and underlying
relationships are usually not explicitly defined, making joins across
multiple tables super slow (the RDBMS can only plan the query from basic
statistics). In other words, joins now scale with table size, but are still a
challenge in terms of #tables.
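A toy illustration of the many-small-tables case (SQLite, hypothetical schema): six chained lookup tables joined in one query, with ANALYZE supplying the statistics the planner otherwise has to guess.

```python
import sqlite3

# Six tiny chained lookup tables, joined in a single query. With this many
# tables the planner's join-order search space grows quickly, and without
# statistics it must rely on rough built-in estimates.
db = sqlite3.connect(":memory:")
for i in range(6):
    db.execute(
        f"CREATE TABLE t{i} (id INTEGER PRIMARY KEY, next_id INTEGER, label TEXT)")
    db.execute(f"INSERT INTO t{i} VALUES (1, 1, 'row-{i}')")

query = """
    SELECT t5.label
    FROM t0 JOIN t1 ON t1.id = t0.next_id
            JOIN t2 ON t2.id = t1.next_id
            JOIN t3 ON t3.id = t2.next_id
            JOIN t4 ON t4.id = t3.next_id
            JOIN t5 ON t5.id = t4.next_id
"""
print(db.execute(query).fetchall())  # [('row-5',)]

# ANALYZE collects the per-table statistics the planner uses to pick a
# join order; the plan steps can then be inspected:
db.execute("ANALYZE")
for step in db.execute("EXPLAIN QUERY PLAN " + query):
    print(step)
```

This doesn't reproduce a bank's stored procedures, of course; it's only meant to show where the planner's information comes from when many small tables meet in one statement.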

~~~
chadcmulligan
> but their mappings and underlying relationship are usually not explicitly-
> defined

wouldn't this be an issue no matter what you're using?

I'm not sure what you mean by "joins scale with table size"? There are few
times you'd join more than, say, 5 or 6 tables.

~~~
htfy96
> There are few times you'd join more than say 5 or 6 tables.

This is only true in newer tech where the data relationships aren't
complicated. Any stored procedure in a bank would contain joins across no
fewer tables than that.

~~~
chadcmulligan
Really? Twenty years as a database consultant and 5 or 6 tables would be a
typical upper limit, you could perhaps get up to 10 or so if you include a lot
of lookup tables for codes. I've never worked in banks though, so what would
you be looking at there typically?

~~~
Akronymus
I work at a leasing company, and we somewhat frequently have around 5-7 joins
when getting data such as where to send how much money for work done at
another company.

The most I saw was "only" 10 though.

------
DaiPlusPlus
I get the impression that (crucially: outside of Facebook/Google-scale
applications) the strongest advocates for No-SQL databases are great software
devs who just had a poor experience with SQL databases as juniors or students
- or who never developed the necessary experience and skills to make the most
of a SQL RDBMS because it was outside their field (e.g. game devs). They’d be
missing things like relational-algebra and relational-calculus - or not know
SQL at all - or lack the DBA skills to define the best indexes - or not even
know what a cardinality-estimate in an execution plan is. If you throw all of
those topics at someone and also tell them they only have a week to build the
thing, it’s no wonder people turn to No-SQL.

I’m not putting those people down either - I would have opted for No-SQL
myself if I wasn’t already familiar with RDBMS theory and management. What I
am criticising is the large amount of manual management needed when you do
run an RDBMS, because that is what puts people off. Things like automatic
index creation and pruning, nightly index rebuilds and statistics
recomputation, etc. arguably SHOULD be part of the RDBMS out-of-the-box and
enabled by default. To my knowledge only IaaS databases have features like
that (it’s what justifies their subscription pricing). I believe some on-prem
database server products will generate recommendations, but they won’t
implement them of their own accord.

Another “approachability” issue I have is the design of the SQL language:
it’s unnecessarily verbose, and it confuses beginners because its basis in
relational-calculus rather than relational-algebra throws people off (“SELECT
is evaluated _after_ the WHERE clause?!”).

SGML and XML were replaced in many applications by the faster and lighter
JSON format; can we do the same for SQL? Unfortunately there’s too much
institutional inertia behind SQL for any of the DBMS vendors to go it alone,
and beyond trivial queries all of the SQL implementations are incompatible
with each other anyway.

In summary:

1. RDBMS servers should have an “automatic DBA” mode enabled by default
(because if it’s not the default then the exact people who need it
(unsophisticated DB devs) won’t be using it, and the people who don’t want it
will just turn it off anyway).

2. SQL needs to be replaced with something better: less verbose but just as
expressive, with SQL’s “gotchas” removed (implicit locks, comparisons to
NULL), and with multiplicity assertions on joins, implicit joins when
dereferencing a foreign-key, a lean towards relational-algebra instead of
relational-calculus, and support for column macros: this is desperately
needed for CRUD applications!

3. Thank you for coming to my TED Talk.
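One of the gotchas from point 2 is easy to demonstrate (SQLite shown here, but the behaviour is standard SQL):

```python
import sqlite3

# The NULL-comparison gotcha: "= NULL" never matches anything. Instead of
# raising an error, it silently returns zero rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER, note TEXT)")
db.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, None)])

# Looks reasonable, matches nothing: NULL = NULL evaluates to NULL, not true.
print(db.execute("SELECT id FROM t WHERE note = NULL").fetchall())   # []

# What was almost certainly meant:
print(db.execute("SELECT id FROM t WHERE note IS NULL").fetchall())  # [(2,)]
```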

~~~
ladberg
I totally agree with this. I'm relatively new to SQL but have recently had to
start messing around with it. In all other CS domains I know exactly what will
happen under the hood for any given action, but for SQL I have zero clue.

I would love it if I could just write queries and it would automatically add
an index or do something else for performance if needed, and possibly report
back to me if a query I wrote has particularly bad performance.

~~~
nine_k
> _In all other CS domains I know exactly what will happen under the hood for
> any given action, but for SQL I have zero clue._

With experience, you'll learn to appreciate the _lack_ of need to say how to
do something, and the ability to just specify _what_ needs to be done. Read
more about declarative vs imperative.

Knowing exactly what happens under the hood costs a lot of mental resources
(this is why assembly is considered low-level), and is often an illusion. A
proven tool which knows a particular area instead of you is immensely valuable
(think about compilers, for instance).

> _it would automatically add an index_

Adding an index has performance costs, storage costs, etc. Just indexing every
permutation of columns is prohibitively expensive. Then, columns with low
variation need different indexing approaches, or sometimes even refactoring of
the schema.

Software that suggests which indexes would benefit a particular query does
exist. Consideration is still needed.

~~~
DaiPlusPlus
> Adding an index has performance costs, storage costs, etc. Just indexing
> every permutation of columns is prohibitively expensive. Then, columns with
> low variation need different indexing approaches, or sometimes even
> refactoring of the schema.

To be clear, I’m not suggesting a DBMS should always add a new index every
time it encounters a query that’s slow due to a missing index. I’m saying
that DB servers should look at query trends, see where real wall-clock time
is being wasted, and act accordingly.

The performance cost of an extra index on insert/update/delete queries is
only a concern once you reach the point where you need to manually manage
indexes anyway - but for most database tables with fewer than a few million
rows there’s no reason for the server not to automatically add indexes on
columns used in predicates. It won’t be perfect, but it’s better than
allowing slow queries to stay slow forever.
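A toy sketch of that idea (SQLite; the heuristic, table, and index names are all hypothetical - a real server would weigh query trends and index costs, as discussed below):

```python
import sqlite3

# Naive "automatic DBA": if a query's plan is a full-table scan, create an
# index on the predicate column before running it.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT)")
db.executemany("INSERT INTO events (kind, payload) VALUES (?, ?)",
               [("click", "p%d" % i) for i in range(1000)])

def run_with_auto_index(query, params, table, column):
    # EXPLAIN QUERY PLAN rows carry a human-readable detail string in the
    # last column; a plain "SCAN" with no index means a full-table scan.
    plan = " ".join(row[3] for row in
                    db.execute("EXPLAIN QUERY PLAN " + query, params))
    if "SCAN" in plan and "USING INDEX" not in plan:
        db.execute(f"CREATE INDEX IF NOT EXISTS auto_{table}_{column} "
                   f"ON {table} ({column})")
    return db.execute(query, params).fetchall()

q = "SELECT id FROM events WHERE kind = ?"
run_with_auto_index(q, ("click",), "events", "kind")

# The next plan for the same query uses the auto-created index.
plan_after = " ".join(row[3] for row in
                      db.execute("EXPLAIN QUERY PLAN " + q, ("click",)))
print(plan_after)
```

This obviously ignores everything that makes the problem hard (write amplification, low-cardinality columns, index explosion); it's only meant to show the shape of the feedback loop.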

~~~
no-s
> The performance cost of an extra index for insert/update/delete queries is
> only a concern when you get to the point where you do need to manually
> manage indexes - but for most database tables with less a few million rows
> there’s no reason for the server not to automatically add indexes on
> columns used in predicates. It won’t be perfect, but it’s better than
> allowing slow queries to always be slow.

You'd be surprised. The thing that really drives index explosion is the
number of distinct queries. IME, inspecting and reducing distinct queries is
one of the easiest ways to optimize, and it can typically be automated. I've
migrated systems from millions of distinct queries down to hundreds or
dozens. However, the objective of the optimization is typically driven by
requirements external to the database and may vary over time, which makes it
difficult to align with an optimizer.

~~~
Akronymus
Distinct as in doing different things or distinct as in prefer batching? My
English seems to be failing me here.

------
sradman
It is 2020. If you're comparing the performance of PostgreSQL and DynamoDB in
terms of GROUP BY analytics, you have probably missed the Column Store
revolution.

Now if these two authors had actually focused on the trade-offs between these
two architectures in their shared sweet spot, operational row stores, we
might have gained some insight.

