
Joining a billion rows 20x faster than Apache Spark - plamb
http://www.snappydata.io/blog/joining-billion-rows-faster-than-apache-spark
======
banachtarski
Am I reading this correctly? The testbed was a single laptop? A big part of
Spark is the distributed in-memory aspect, so I'm not sure I understand why
any of these numbers mean anything.

~~~
quantumhobbit
This paper is a must read:
[https://pdfs.semanticscholar.org/6753/959eed800e9fad9e330daae43f81b7a48017.pdf](https://pdfs.semanticscholar.org/6753/959eed800e9fad9e330daae43f81b7a48017.pdf)

People keep stumbling on the same finding over and over: the ability to scale
out comes with significant overhead.

~~~
edw
Back in '10, I needed a three or four node Hadoop cluster just to match the
performance I was getting using a spare Mac mini in development mode when I
was doing a lot of work in Cascalog, which is based on Cascading.

Most problems are not Big Data problems. The size a problem must reach before
it qualifies as a Big Data problem grows every day with the availability of
machines with ever more cores and memory. `sed`, `awk`, `grep`, `sort`,
`join`, and so forth are some of the least appreciated tools in the Unix
toolbox.
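
A toy sketch of that last point (my own example, not from the thread):
`join(1)` merges two files on a shared key, provided both are sorted, which is
essentially a merge join with no cluster required:

```shell
# Two tiny "tables" keyed on column 1; join(1) requires sorted input.
printf '1 alice\n2 bob\n3 carol\n' > users.txt
printf '1 100\n3 300\n' > orders.txt
sort -k1,1 -o users.txt users.txt
sort -k1,1 -o orders.txt orders.txt

# Inner join on the first field of each file.
join users.txt orders.txt
```

The same pattern scales surprisingly far, since `sort` falls back to an
external merge sort on inputs larger than RAM.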

People want to think they have Big Data problems, but they probably just have
plain old normal-data problems. I have had to unwind the ridiculous,
heavyweight Big Data solutions to normal-data problems that "kids today" love.

If you don't work for Netflix or Google or Facebook or insert maybe a hundred
other companies here, you probably do not have a Big Data problem.

~~~
makapuf
Amen. Also put as: "too big for Excel" is not Big Data. See also
[https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)

~~~
edw
Amen right back at ya! (I love the O'Reilly book cover.) I highly recommend
people read your blog post. And there's also this classic:

[https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

I'll also take this opportunity to plug Make and Drake for manipulating data
in a replicable way:

[https://bost.ocks.org/mike/make/](https://bost.ocks.org/mike/make/)

[https://github.com/Factual/drake](https://github.com/Factual/drake)

If you're processing data using tools that cannot trace their ancestry
directly to some time before 1985, you're probably wasting your own and your
colleagues' time.

~~~
makapuf
Just for clarification: I'm not the original blogger. +10 for the other link
and for using Make! I don't know Drake, however.

------
filereaper
I apologize in advance, but whenever people claim to use an in-memory big-data
system, how exactly does this end up working?

You can only stuff so much into memory. You can scale up vertically in terms
of memory, but unless you buy a massive big-iron POWER box, you scale out
horizontally. And with each of these in-memory appliances, what happens when
you need to spill out to disk?

In essence, why should one bother with these in-memory appliances as opposed
to buying boxes with fast SSDs instead? Sure, you spill to disk, but do you
take that big a hit compared to the enormous cost of keeping everything in
memory?

~~~
stingraycharles
I think there are many use cases. Fraud detection, risk analysis in finance,
weather simulations, etc. These don't need to spill out to disk and are a
perfect use case for these systems.

A friend of mine works for a company that does high-speed weather analysis to
make predictions for energy brokers, to predict prices of wind / solar energy
on the market. They use these kinds of systems extensively because of the
speed and volatility of the data. Fascinating stuff.

~~~
rodionos
You can also measure cloud oktas from satellite imagery if you want to get
fancy in terms of solar-energy supply-side forecasting:
[https://axibase.com/calculating-cloud-oktas/](https://axibase.com/calculating-cloud-oktas/)

------
usgroup
Lol, I was hoping it was a combination of awk and paste :)

That always makes me chuckle.

Honestly though ... Jenkins + bash + cloud storage, and you'll be surprised at
how many big data problems you can solve with a fraction of the complexity.
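
To illustrate the kind of job meant here (a hypothetical sketch; the data and
the aggregation are made up, and in a real setup the input would be fetched
from cloud storage with e.g. `gsutil cat` or `aws s3 cp` rather than created
locally):

```shell
# A batch step Jenkins could schedule: per-key totals from CSV logs,
# using nothing but awk and sort.
printf 'a,1\na,2\nb,5\n' > logs.csv

# Sum column 2 grouped by column 1, then sort for a stable report.
awk -F, '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' logs.csv \
  | sort > daily_totals.txt

cat daily_totals.txt
```

A Jenkins job wrapping a script like this gets you scheduling, logging, and
retries without any cluster machinery.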

~~~
makapuf
Pardon my ignorance, but what would you use Jenkins for? Scheduling?

~~~
tetha
Jenkins in such a setting gives you two good things: i) scheduling, and ii)
access control: the ability to let random dude X trigger computations Y, Z,
and A without being able to change said computations.

------
EGreg
This seems like impressive stats about a relational database technology. But
the scrolling on their website doesn't work on mobile. So in grand HN
tradition, I left and now tell you all about it here, instead of the main
point of their invention :)

~~~
franciscop
It worked for me, but the browser nav didn't hide, which I recognize as
messing around with absolute/fixed positioning and/or overflows. I'd recommend
using media queries to show a simple site on mobile and leaving all the fancy
stuff they are surely doing on desktop for the desktop only.

Edit: on a second check, it might have to do with that nav that moves the
whole page down.

~~~
plamb
Appreciate these comments; the site did not go through much testing before
being deployed. Overflow was modified to eliminate horizontal scrolling on
mobile, but it looks like there were some vertical issues as well. We will get
this fixed.

------
Loic
What is the algorithm used to join the tables? Is it a hash join on `id` and
`k`, or does it exploit the fact that the ids are sorted, using a kind of
galloping approach?

~~~
jagsr123
Yes, it is a hash join.
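
For readers who want the idea in miniature: a hash join builds an in-memory
table from the smaller input, then streams the larger input and probes the
table. A sketch in `awk` (fitting, given the comment upthread), which is of
course an illustration and not SnappyData's actual implementation:

```shell
printf '1 x\n2 y\n' > dim.txt          # small table: key -> value
printf '1 a\n1 b\n2 c\n' > fact.txt    # large table, streamed row by row

# NR==FNR is true only while reading the first file: build the hash table.
# For the second file, probe the table and emit each matching row.
awk 'NR==FNR { h[$1] = $2; next }
     $1 in h { print $0, h[$1] }' dim.txt fact.txt
```

This prints `1 a x`, `1 b x`, `2 c y`: one pass over each input, with lookups
against the hashed side.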

~~~
Loic
I will need to dig into the implementation of the hash function; it must be a
nice read, as the speed shows it is definitely well optimized! Thank you.

------
alexchamberlain
Python 2.7 can do it in 0.0867 usec (Intel i7):

    
    
        $ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
        10000000 loops, best of 3: 0.0867 usec per loop
    

(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)
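
The trick being timed is Gauss's closed form, sum(1..n) = n*(n+1)/2, so the
billion-element loop becomes constant-time arithmetic. The same sum in plain
shell arithmetic, as a sketch of my own; note the intermediate product is
about 10^18 and still fits in a signed 64-bit integer:

```shell
# Gauss's closed form: 1 + 2 + ... + n == n*(n+1)/2, evaluated in O(1).
# For n = 10^9, n*(n+1) is ~1e18, within signed 64-bit range.
n=1000000000
echo $(( n * (n + 1) / 2 ))   # prints 500000000500000000
```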

------
marknadal
Great article, actually. Typical HN comments on performance optimizations are
complaints like "this isn't a real-world use case" or things like that. Most
of them miss that comparing baseline performance metrics across two systems is
still genuinely interesting in and of itself, and acts as a huge learning
catalyst for understanding what is going on. I think this article did a great
job of making an honest comparison and discussing what is going on, so props
to the team! (We did something similar, where we compared cached read
performance against Redis and were 50x faster, here:
[https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2GB-Atom-CPU](https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2GB-Atom-CPU) ).

~~~
banachtarski
The problem is what "baseline" means. For example, a multithreaded program run
on a single thread will always be slower than a single-threaded one, by
definition: it has to do extra work to coordinate the threads. Obviously, this
doesn't mean we avoid multithreaded code.

In this case, the software being tested was explicitly written to manage the
coordination of data across many nodes, so why is the definition of "baseline"
a single laptop? Seems specious.

~~~
marknadal
Yes, but that is exactly why I think these types of articles and discussions
are useful: people who understand what is going on often assume that others do
too, but for many people it all looks like magic.

How many people out there (genuine question here) assume the opposite of what
you know, or are ignorant of it? How many people, when they hear
"multithreaded", associate that with being faster?

Now take the people who know there is overhead in splitting and dividing work
across threads... do they, because they have this knowledge, also "see
everything as a nail because they have a hammer"? Do they forget that
sometimes the right solution is to simply run a single-threaded operation, not
parallelize everything?

I think there are interesting merits to all of that, even if it means
"hyperbolic" articles or clichéd, unrealistic tests. They challenge our
thinking, our assumptions, our approach. And then, separately, there should be
articles/discussions on real-world tests and use cases.

~~~
user5994461
You guys should realize that this is a commercial company promoting its own
product. They're just doing the test that makes them look good.

------
Bedon292
I know it's just a benchmark for comparison, and it is awesome. I love seeing
cool comparisons like this, but why should I care that this particular
benchmark is faster than Spark? What sort of analytics will be affected by
this improvement, and will it actually save me time on real-world use cases?

~~~
luckydata
It's literally impossible to test every possible workload for a tool like
Spark, so... what's the point of asking? Stand up a testing cluster, run some
of your jobs, and you'll get the answer.

~~~
Bedon292
I am not asking about every workload. I was just curious about an example
workload where this benchmark matters.

------
supergirl
Why would you choose values between 1 and 1000 for the right side? Why not
1000 values between 1 and 1 billion?

------
zzleeper
In case the author reads this: I can't read well with that font unless I zoom
in all the way. It doesn't happen with anything else (Win10, 14in laptop,
Chrome).

~~~
plamb
The font in the embedded gists or the font on the page?

~~~
minimaxir
Likely the font on the page.

A web design QA note for all: thin fonts (e.g. 300-400 weight) as a body font
may work fine on macOS due to better font rendering, but they do not work well
on Windows.

~~~
maaaats
Is it better on Mac? Whenever I boot into Win10, I'm struck by how crisp text
looks compared to the Mac.

~~~
mtanski
Probably the Retina display; the high PPI makes fonts much easier to read.

