A gentle-ish introduction to worst-case optimal joins (justinjaffray.com)
92 points by foldU on June 1, 2022 | 12 comments



Neat. Now this leaves me wanting more: how do you get closer to that bound? And won't the constant factors eat you up?

Still, interesting work. In pure analytic workloads you typically avoid joins like the plague and lay out the data "pre-joined" in a sense (usually this requires a custom query engine rather than using a relational model as-is), so you can avoid jumping around the data, which is what kills you with joins.

Usually when joining, one relation is read in order while the other is spread all over; if you keep them physically close you can save some seeks, but that reduces the generality of the joins (you also need a priori knowledge of the cardinalities for it to make sense).

Thankfully, most real-world data has tighter constraints that you can leverage. Theoretical models usually don't contemplate these, since they concern themselves with a wide swath of cases, but if you assume those constraints hold, you can typically build faster structures that have ugly (theoretical) worst cases when the assumptions break, but that in practice you never see.


One interesting observation about 'modern' sorting algorithms that ship with programming languages is that they're mostly an amalgamation of various classical sorting algorithms, with tests and heuristics to dynamically switch between them depending on the case. Cases like: few elements, elements already sorted with some out-of-order elements appended, elements sorted in reverse, detecting worst-case behavior of quicksort and switching to heapsort instead, exploiting exit conditions of prior recursion steps to skip checks, etc. These kinds of meticulous, opportunistic optimizations give us a chance to get theoretically ideal performance in both the worst- and best-case scenarios, and I hope we can apply the same methodology to many more algorithms in the future. Database query optimizers sort of do this, but they typically make such decisions ahead of time, so they are prone to getting stuck with a bad plan. (Obviously this is highly dependent on your engine of choice.)
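
For illustration, here's a minimal sketch in Python of that kind of dispatching (not how any particular standard library actually implements it): handle tiny and already-sorted inputs up front, use quicksort in the common case, and bail out to heapsort when a depth limit suggests quicksort is hitting its bad case.

    import heapq

    def hybrid_sort(a):
        # Tiny inputs: insertion sort wins on constant factors.
        if len(a) <= 16:
            _insertion_sort(a, 0, len(a) - 1)
        # Already sorted, or exactly reversed: detect and exit early.
        elif all(a[i] <= a[i + 1] for i in range(len(a) - 1)):
            pass
        elif all(a[i] >= a[i + 1] for i in range(len(a) - 1)):
            a.reverse()
        else:
            # Depth limit ~2*log2(n): past it, quicksort is degrading.
            _introsort(a, 0, len(a) - 1, 2 * len(a).bit_length())

    def _introsort(a, lo, hi, depth):
        while hi - lo > 16:
            if depth == 0:
                # Worst-case quicksort detected: heapsort this range instead.
                chunk = a[lo:hi + 1]
                heapq.heapify(chunk)
                a[lo:hi + 1] = [heapq.heappop(chunk) for _ in range(len(chunk))]
                return
            depth -= 1
            p = _partition(a, lo, hi)
            _introsort(a, lo, p - 1, depth)
            lo = p + 1                   # handle the right half iteratively
        _insertion_sort(a, lo, hi)       # small leftover run: insertion sort

    def _partition(a, lo, hi):
        mid = (lo + hi) // 2
        a[mid], a[hi] = a[hi], a[mid]    # middle element as pivot, moved to the end
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        return i

    def _insertion_sort(a, lo, hi):
        for i in range(lo + 1, hi + 1):
            x, j = a[i], i - 1
            while j >= lo and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x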


Increasingly, databases can get do-overs on bad plans. A micro-level example might be Dataflow splitting particularly expensive tasks so as to better parallelize them: https://cloud.google.com/blog/products/gcp/no-shard-left-beh...

Spark will re-run the query planner after doing the map stage of a map-reduce operation so as to leverage the information collected: https://databricks.com/blog/2020/05/29/adaptive-query-execut...

Both examples involve changing parameters of a lower-level plan rather than just the join operators, but the industry is certainly moving in this direction.


Most systems that perform expensive scans should be creating or updating a count-min sketch histogram of the data, to allow schedulers and query planners to do a better job.

https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

https://duckduckgo.com/?t=ffab&q=count-min+sketch+site%3Avld...

https://www.semanticscholar.org/search?q=count-min%20sketch%...
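
To make the idea concrete, here's a toy count-min sketch in Python; the width, depth, and hashing choices are illustrative, not tuned recommendations. A scan can feed every value it sees through add(), and the planner can later call estimate() for a cheap frequency guess that only ever overestimates.

    import hashlib

    class CountMinSketch:
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, key):
            data = str(key).encode()
            for row in range(self.depth):
                digest = hashlib.blake2b(data, salt=row.to_bytes(8, "little")).digest()
                yield row, int.from_bytes(digest[:8], "little") % self.width

        def add(self, key, count=1):
            for row, col in self._cells(key):
                self.table[row][col] += count

        def estimate(self, key):
            # Collisions can only inflate a cell, so the minimum across rows is
            # the tightest guess, and it never underestimates.
            return min(self.table[row][col] for row, col in self._cells(key))

    # A scan could update the sketch as a side effect of reading each value...
    cms = CountMinSketch()
    for join_key in ["a", "b", "a", "c", "a"]:
        cms.add(join_key)
    # ...and the planner can later ask for approximate frequencies.
    assert cms.estimate("a") >= 3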


I thought a lot of optimizers used tabulated cardinality statistics to estimate the cost of various join algorithms.


They do, but joins still cause unpredictable jumps in the data scan, even if perfectly optimized.

Think of an analytic query over a star schema: maybe the query needs 10 or 20 joins to align all the data for each of the fact tables, and it scans most of the data (some aggregate report over the last year of data), but those joins don't reduce the cardinality much.

If you look at the data access pattern, it's scattered all over the place for each joined row. You end up needing to read a lot of data, possibly many times, to complete the query.
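
As a toy Python sketch of that access pattern (table names and sizes made up): the fact scan itself is sequential, but every foreign-key lookup jumps to an unrelated spot in the dimension data.

    import random

    # Dimension "tables", laid out independently of the fact table.
    customers = {i: ("EU" if i % 2 else "US") for i in range(100_000)}
    products = {i: "product-%d" % (i % 50) for i in range(20_000)}

    # Fact table: read in sequential order, but its foreign keys point anywhere.
    facts = [(random.randrange(100_000), random.randrange(20_000), random.random())
             for _ in range(200_000)]

    report = {}
    for cust_fk, prod_fk, amount in facts:  # sequential scan over the facts
        region = customers[cust_fk]         # random jump into dimension 1
        product = products[prod_fk]         # random jump into dimension 2
        key = (region, product)
        report[key] = report.get(key, 0.0) + amount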


Pretty much all modern commercial DBs do use summary (or sampled?) statistics of the cardinalities. The alternative is rule-based rewriters, which were driven purely by heuristics, and AFAIK they justifiably went out of use decades ago.


Not sure I understand. For analytics/warehouse stuff you tend to pre-join things to get a very denormalised fact table (which is static, so denormalised is fine). This makes it very fast, maybe optimal, for what you're doing. It's a solved problem, really. Sure, it doesn't support general queries, but that's OK. Am I missing something?


In pure analytics, most queries need to read most rows, but not all columns. Typical relational queries just touch a few rows of the dataset (think CRUD).

That's why, if you force analytics into a relational DB, you end up with a star schema; but the fact tables are not physically aligned, so you need to join them, which causes the query scan to jump around the data a lot.

Analytic engines keep the data aligned across columns but stored on a per-column basis, so full scans likely only need to look at nearby data, helping prefetchers (both CPU and disk).

This avoids joins but makes inserts a bit more annoying. Multi-valued data is usually also kept close and compact so the scan doesn't jump around, but that explanation is a bit too long for a comment.
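
As a rough illustration (toy Python, field names made up), the same aggregate touches very different amounts of data in a row layout versus a column layout:

    from array import array
    from dataclasses import dataclass

    @dataclass
    class FactRow:  # row store: the whole record travels together
        order_id: int
        region: str
        amount: float

    rows = [FactRow(i, "EU" if i % 2 else "US", float(i)) for i in range(100_000)]

    # Row-oriented scan: pulls every field of every row through the cache,
    # even though the query only needs `amount`.
    total_from_rows = sum(r.amount for r in rows)

    # Column-oriented layout: `amount` is one dense, contiguous array of doubles,
    # so the scan only touches nearby data (friendly to CPU and disk prefetchers).
    amount_column = array("d", (r.amount for r in rows))
    total_from_columns = sum(amount_column)

    assert total_from_rows == total_from_columns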


Column-based storage is for regular tables; a DW doesn't need that because everything is pre-joined, except for the dimension tables.

The dimension tables are tiny IIRC, and there should (?) be only one fact table. You don't join between fact tables anyway (do you?)

I feel you may be mixing up a few things but it's not my strong point here.


Yeah, I may have messed up the terminology here. Depending on the dataset, the dimension tables can be close to the fact table in cardinality; for true flexibility, it degenerates into a fact table with just a record id, and dimension tables with a FK to the fact table and a single column.

If the DB is a true columnar store, there's no need to do that, but most are record based.


Well, that escalated quickly. Nice article!




