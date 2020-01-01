
Joining a billion rows 20x faster than Apache Spark (snappydata.io)
Am I reading this correctly? The testbed was a single laptop? A big part of Spark is the distributed in-memory aspect, so I'm not sure I understand why any of these numbers mean anything.

This paper is a must read: https://pdfs.semanticscholar.org/6753/959eed800e9fad9e330daa...

People keep stumbling upon the same thing over and over: the ability to scale has significant overhead.

I'm not sure why you got downvoted. It's a valid point and it makes intuitive sense.

Is there a clear-cut answer as to whether one should choose a distributed solution or not? It seems to me that if you're at the terabyte scale, choosing a non-distributed solution is asking for trouble. A quick search indicates the largest HDD you can buy is around 8TB.

The question is more: what do you want as a result? Suppose you search your 8TB database of molecules for the 1000 molecules most similar to a given one. You have 16 cores, so you cut the 8TB into 500GB chunks, continuously preload 1GB of molecules per core, accumulate 16*1000 candidate molecules, and merge at the end. You can do it on a single system, and you are working with a TB-size dataset.

It means that the size of the dataset is not the only factor. You also need to take into account the operations performed on each "element/document", the size of the intermediate datasets, the size of the final results, and some more stuff (encoding, etc.).
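
To make the chunked search above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `similarity` function is a hypothetical stand-in score on plain numbers, not a real molecular similarity, and the chunks are processed one after another, whereas the per-chunk step is independent and could be handed to one core each.

```python
import heapq
from itertools import islice


def similarity(a, b):
    # Hypothetical stand-in score: higher means more similar.
    return -abs(a - b)


def chunks(iterable, size):
    # Yield successive lists of at most `size` items, so only one
    # chunk has to be held in memory at a time.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def top_k_in_chunk(chunk, query, k):
    # Independent per-chunk work: keep only the k best candidates.
    return heapq.nlargest(k, chunk, key=lambda m: similarity(m, query))


def top_k_chunked(items, query, k, chunk_size):
    # Accumulate k candidates per chunk, then merge once at the end.
    candidates = []
    for chunk in chunks(items, chunk_size):
        candidates.extend(top_k_in_chunk(chunk, query, k))
    return heapq.nlargest(k, candidates, key=lambda m: similarity(m, query))
```

The intermediate state is only k candidates per chunk, which is why the full dataset never has to fit in memory at once.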

What is the algorithm used to join the tables? Is it a hash join on `id` and `k` or using the fact that the ids are sorted and using a kind of galloping approach?
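
For readers unfamiliar with the two options, here is a rough Python sketch of each. This is not what the article's engine does (that is exactly the open question); it only illustrates the difference. It assumes both inputs are lists of `(key, value)` pairs, and for the galloping variant that they are sorted by unique key. The function names are made up for the example.

```python
from bisect import bisect_left


def hash_join(left, right):
    # Build a hash table on the left side, then probe it with the right side.
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]


def gallop(keys, target, start):
    # Return the first index >= start whose key is >= target:
    # exponential probing to bracket the position, then binary search.
    step = 1
    lo = hi = start
    while hi < len(keys) and keys[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    return bisect_left(keys, target, lo, min(hi, len(keys)))


def galloping_merge_join(left, right):
    # Assumes both inputs are sorted by key and keys are unique per side.
    lkeys = [k for k, _ in left]
    rkeys = [k for k, _ in right]
    out = []
    i = j = 0
    while i < len(lkeys) and j < len(rkeys):
        if lkeys[i] == rkeys[j]:
            out.append((lkeys[i], left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lkeys[i] < rkeys[j]:
            i = gallop(lkeys, rkeys[j], i)  # skip ahead on the left side
        else:
            j = gallop(rkeys, lkeys[i], j)  # skip ahead on the right side
    return out
```

The hash join needs memory for the build side but does not care about ordering; the galloping version needs no extra table and can skip long runs of non-matching keys, but only works if the sort order is already there.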

In case the author reads this: I can't read well with that font, unless I zoom in all the way. Doesn't happen with anything else (Win10, 14in laptop, Chrome)

The font in the embedded gists or the font on the page?

Likely the font on the page.

A web design QA note for all: thin fonts (e.g. 300-400 weight) used as a body font work fine on macOS due to better font rendering, but do not work well on Windows.

Will look into this

