
Cost-Based Optimizer in Apache Spark 2.2 - dmatrix
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
======
hkothari
If you are interested in the code behind this, I wrote an overview last month
covering the functionality, with links to the code that backs the
improvements they talk about:
[http://hydronitrogen.com/spark-220-cost-based-optimizer-explained.html](http://hydronitrogen.com/spark-220-cost-based-optimizer-explained.html)

There's a fair amount of overlap, but where the databricks article explains
the techniques with charts and high level explanations, I go over the code
instead.

------
elvinyung
On this topic, I really like the Join Order Benchmark paper:
[http://www.vldb.org/pvldb/vol9/p204-leis.pdf](http://www.vldb.org/pvldb/vol9/p204-leis.pdf)

It basically shows that most cost-based optimizers are pretty bad at
cardinality estimation, and that the errors compound as queries join more
tables.
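
To make the compounding concrete, here's a toy back-of-the-envelope sketch (the numbers are made up for illustration, not taken from the paper): optimizers typically assume predicates are independent and multiply their selectivities, so correlated predicates get underestimated, and that misestimate is raised to a power as it flows through a chain of joins.

```python
# Toy illustration (not any optimizer's actual code) of how an
# independence-assumption cardinality estimate goes wrong and compounds.

n_rows = 1_000_000

# Two predicates that each match 10% of rows, but are perfectly
# correlated: the same 10% of rows satisfy both.
sel_a = 0.1
sel_b = 0.1

# A typical optimizer assumes independence and multiplies selectivities:
estimated = n_rows * sel_a * sel_b   # ~10,000 rows estimated

# In reality the predicates overlap completely:
actual = n_rows * sel_a              # 100,000 rows actually survive

error = actual / estimated           # ~10x underestimate for one misestimate

# Push k such misestimates through a k-join plan and the error
# multiplies -- the paper's central observation:
k = 4
compounded = error ** k              # ~10,000x off after 4 joins

print(f"estimated={estimated:.0f} actual={actual:.0f} "
      f"error={error:.1f}x compounded={compounded:.0f}x")
```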

------
ris
Still catching up with postgres, which added multivariate column statistics
(CREATE STATISTICS) in 10 :)

Not that this isn't a great development in itself...

------
makmanalp
What's cool about these statistics-based approaches is that you mostly don't
even need fully up-to-date statistics, just decent overall stats, unless you
have an insane amount of churn. Meaning you can get query speedups without
per-insert overhead: you choose when to pay that cost by running ANALYZE.
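
In Spark 2.2 concretely, that looks something like the following (syntax per the Spark SQL docs; the table and column names here are hypothetical):

```sql
-- The CBO is off by default in 2.2 and enabled via a config flag:
SET spark.sql.cbo.enabled=true;

-- Table-level stats (row count, size in bytes), collected when you choose:
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Column-level stats (distinct count, min/max, null count) that feed
-- cardinality estimation for filters and joins:
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount;
```

Until you re-run ANALYZE, the optimizer keeps planning against the last stats it collected, which is exactly the "slightly stale but decent" trade-off described above.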

Very neat stuff from the databricks team!

~~~
trhway
For non-latency-sensitive queries you can also run dynamic sampling.
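
The idea behind dynamic sampling can be sketched in a few lines (a toy model, not any database's actual implementation): instead of relying on precomputed stats, probe a random sample of the table at planning time and extrapolate the predicate's selectivity.

```python
# Toy sketch of dynamic sampling: estimate a predicate's selectivity
# at query time from a random sample instead of precomputed statistics.
import random

random.seed(42)  # deterministic for the example

# Hypothetical "table": one integer column with a skewed distribution
# (value 1 appears ~90% of the time).
pool = [1] * 90 + list(range(2, 12))
table = [random.choice(pool) for _ in range(100_000)]

def sampled_selectivity(rows, predicate, sample_size=1_000):
    """Estimate the fraction of rows matching `predicate` from a sample."""
    sample = random.sample(rows, sample_size)
    return sum(predicate(r) for r in sample) / sample_size

pred = lambda v: v == 1
est = sampled_selectivity(table, pred)
true_sel = sum(pred(v) for v in table) / len(table)

print(f"sampled={est:.3f} true={true_sel:.3f}")
```

The cost is an extra scan over the sample at plan time, which is why it only pays off when planning latency doesn't matter.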

