The issue with scaling the database is that SQL joins (generally) do not scale under CAP.
So, alas, you either go SQL in the early stages and then need to do considerable engineering later to down-convert to, say, Cassandra or DynamoDB.
Or you accept reduced database language sugar and complexity up front (no joins, limited indexes/views, or an architecture with explicit sharding) in exchange for a more scalable database approach.
There's basically no magic sauce for scalable SQL. Frankly most of the people telling you otherwise are selling varying degrees of snake oil.
As people will point out, SQL databases on modern hardware scale pretty freaking large, so you can get a lot of mileage out of putting off the "true scalability refactor". The good news is that by then you should know which queries and data actually need full scaling; you aren't guessing ahead of time.
SQL is just a language. Any techniques which can be used scalably (client-side joins for instance) could also be used by an implementation supporting the SQL language. Perhaps your argument holds for some of the more well-known RDBMSes out there, but I don't think SQL necessarily has to be unscalable in the general case.
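To make the "client-side join" point concrete, here is a minimal sketch of a hash join done in the client rather than in the database. The table names and row shapes are illustrative assumptions; in practice each list would come from a bulk read against a separate store.

```python
# Hypothetical sketch of a client-side hash join: fetch both sides
# from the database(s), then join them in application code.
def client_side_join(users, orders):
    """Join order rows to user rows on user_id, entirely in the client."""
    # Build a hash table from the smaller side (the "build" phase).
    users_by_id = {u["id"]: u for u in users}
    # Probe it with each row from the larger side (the "probe" phase).
    return [
        {**order, "user": users_by_id[order["user_id"]]}
        for order in orders
        if order["user_id"] in users_by_id
    ]

# Illustrative data standing in for two bulk reads:
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 3}]
print(client_side_join(users, orders))
```

The same strategy works regardless of whether the stores speak SQL, which is the point: the join technique is independent of the query language.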
Let's take the general case of a join: two tables that are keyed/indexed by fundamentally different data.
To distribute data, you need a data distribution scheme. In Cassandra and various distributed hash tables (the typical approach), it is a consistent hash function. But even if you distribute using natural ordering, the same problem exists:
The data you are joining is going to be on different nodes on a row-by-row basis. Hashing will produce this by the design of the hash function. Natural ordering will do this because different key datatypes will order differently.
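A toy model shows why: two tables keyed by fundamentally different data hash independently, so the rows a single join needs scatter across nodes. This is a simplified stand-in for a consistent hash ring (a real ring uses virtual nodes and token ranges); the node names and keys are made up.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key):
    # Toy stand-in for a consistent hash ring: hash the key,
    # map the digest onto one of the nodes.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Two tables keyed by fundamentally different data:
user_key = "user:42"           # users table, keyed by user id
order_key = "order:2021-07-04" # orders table, keyed by order date
print(node_for(user_key), node_for(order_key))

# Because the two keys hash independently, the rows a join needs
# end up on different nodes in the general case, and every joined
# row pair potentially costs a cross-node fetch.
```

With natural (range) ordering instead of hashing, the placement function changes but the conclusion doesn't: the two key spaces order differently, so the matching rows still land on different nodes.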
In the case of large-scale distribution across a LOT of machines (which is where you invariably end up once you exhaust the options in big iron), that means a huge amount of network traffic, with each retrieval needing to be resolved for consistency if you want AP. If you rely on CP, then your join depends on SO MANY nodes correctly communicating that you become extremely exposed to network partitions, retries, etc.
Thus you either shard your data so everything a join touches lives on the same machine (but your joins are then necessarily subsets of the overall data: only the shard), or you don't, and prepare for extremely bad performance, unreliable performance, or approximations of correctness.
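The "shard so the join stays local" option can be sketched like this: shard *both* tables by the same key (here `user_id`), even when that is not the second table's natural key. Shard counts, table names, and row shapes are illustrative assumptions.

```python
# Hypothetical sketch: co-locating users and their orders on one shard
# by sharding both tables on user_id, so a users-orders join never
# leaves a single machine.
NUM_SHARDS = 4

def shard_for(user_id):
    return user_id % NUM_SHARDS

shards = [{"users": [], "orders": []} for _ in range(NUM_SHARDS)]

def insert_user(user):
    shards[shard_for(user["id"])]["users"].append(user)

def insert_order(order):
    # Shard the order by its user_id, not by its own natural key
    # (order id, date, ...), so it lands next to its user.
    shards[shard_for(order["user_id"])]["orders"].append(order)

insert_user({"id": 7, "name": "Ada"})
insert_order({"order_id": 1, "user_id": 7})

# The join is purely local to one shard:
s = shards[shard_for(7)]
joined = [(o, u) for o in s["orders"] for u in s["users"]
          if u["id"] == o["user_id"]]
print(joined)
```

The cost is exactly the caveat above: a join keyed by anything *other* than `user_id` (say, orders by date across all users) is back to touching every shard.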