
Debugging a failing test case caused by query running “too fast” - rxin
https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html
======
stordoff
Vaguely reminds me of a bug I thought I had in a Python sorting routine
(sorting a set of database entries by date/time) - all of the tests passed
until I removed some debug print statements, after which it ostensibly stopped
sorting and just returned the original order.

I eventually realised that the routine was fine, but my test data was being
generated quickly enough that time.time() (IIRC) returned identical values for
all of the dummy records (with the print statements, there was just enough of
a delay for there to be a few milliseconds between each one).
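A minimal sketch of that failure mode (hypothetical records, not the original code): Python's `time.time()` has limited resolution, so records generated in a tight loop can all get the same timestamp, and since Python's sort is stable, ties keep their original insertion order, which looks exactly like "sorting stopped working".

```python
import time

# Generate dummy records quickly; consecutive time.time() calls can
# return the exact same float on a fast machine.
records = [{"id": i, "ts": time.time()} for i in range(1000)]

duplicates = len(records) - len({r["ts"] for r in records})
print(f"{duplicates} records share a timestamp with another record")

# A stable sort by timestamp leaves tied records (possibly the whole
# list) in their original order -- the sort routine is fine, the keys
# just aren't distinct.
by_time = sorted(records, key=lambda r: r["ts"])
```

Adding print statements between iterations is enough of a delay to make the timestamps distinct again, hiding the problem.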

------
tomrod
Very nice!

I've been consistently impressed with Databricks' approachable blog. This
particular post spawned a nice discussion around database design with my son,
who has taken a lot of recent interest in all things technological. Keep up
the good work.

------
nevi-me
Interesting read! I have a service that was in SAS, and we've been translating
it to run in Spark, but one of the killer issues we identified was latency:
without understanding what was holding up the computation, there would be
increasing pauses of a few seconds, sometimes reaching nearly a minute, during
execution. This is on a single machine, and at that time we wouldn't notice
any resource utilisation. No disk writes, CPU nearly at 0.00, etc.

I keep coming back with every new Spark version to see if the problem has gone
away (I wrote it at 2.0.0, so I mean every minor and patch release since). I
looked up what I could online about optimisation in Spark, and applied that.

The business people got tired of us wasting time trying to optimise, and
forced us down the lines of SAP HANA and other proprietary marketing hoohah
because we need a product that's real-time.

I hope the upcoming version of Spark at least helps reduce latency, perhaps
through improvements in the whole-stage code-gen.

------
ww520
That is some insane performance. I thought memory bandwidth would be a
limitation, moving 2~4TB of data through memory. Then I saw it was a cross
join of 1M numbers X 1M numbers. 1 million 4-byte ints is just 4MB, which fits
comfortably in the L1/L2/L3 caches. And the output of the 1M x 1M cross join
is thrown away; just a counter is incremented. So the 1T results were never
pushed through memory.

A cross join is just two nested loops, iterating one array over the other.
With 40 cores, each handles 25 billion of the 1 trillion iterations. Assuming
each iteration takes 10 CPU cycles, a 6GHz core can handle 600M
iterations/second. 25B / 600M = ~41 seconds to run the whole thing.
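The back-of-envelope arithmetic above checks out (using the same assumptions as the comment: 40 cores, ~10 cycles per iteration, and a hypothetical 6GHz clock):

```python
# Sanity-check the parent comment's numbers.
n = 1_000_000
input_bytes = n * 4                        # 4-byte ints: 4 MB, cache-resident
total_iters = n * n                        # 1e12 pair comparisons
cores = 40
iters_per_core = total_iters // cores      # 25 billion per core

clock_hz = 6e9                             # assumed 6 GHz
cycles_per_iter = 10                       # assumed cost per comparison
iters_per_sec = clock_hz / cycles_per_iter # 600M iterations/s per core

seconds = iters_per_core / iters_per_sec
print(f"input size: {input_bytes / 1e6:.0f} MB")
print(f"estimated wall time: {seconds:.1f} s")  # ~41.7 s, so 1 s is suspicious
```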

Yes, 1 second is too fast.

Awesome that they figured out it was the JVM optimizing away the
side-effect-free computation.

------
tyingq
Interesting. Most of the traditional RDBMS implementations have some kind of
delay function: MySQL has SLEEP(), Postgres has pg_sleep(), MSSQL has
WAITFOR DELAY, and so on. I guess Spark doesn't have one.

These are also handy to test for SQL injection issues without screwing
anything up.
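SQLite has no built-in sleep either, but a quick way to see how these timing probes behave is to register one as a user-defined function (a sketch; the `sleep` UDF name is my own choice, not a standard SQLite function):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Register a sleep() UDF, mimicking MySQL's SLEEP() / Postgres's pg_sleep().
conn.create_function("sleep", 1, lambda s: time.sleep(s))

start = time.monotonic()
conn.execute("SELECT sleep(0.2)").fetchone()
elapsed = time.monotonic() - start

# A measurable delay confirms the expression actually executed --
# the same signal a time-based injection test relies on.
print(f"query took {elapsed:.2f} s")
```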

~~~
comex
> These are also handy to test for SQL injection issues without screwing
> anything up.

They're also handy to _exploit_ SQL injection issues, in cases where you can't
see the output of the query, but can measure how long it takes to execute :)

------
AtlasLion
Awesome seeing Ala and Bogdan already producing awesome stuff at Databricks.

