
Setting a new world record in CloudSort with Apache Spark - rxin
https://databricks.com/blog/2016/11/14/setting-new-world-record-apache-spark.html
======
devonkim
It's not clear _how much_ of these improvements come from reductions in pricing
rather than from algorithms and design decisions. They've documented things like
using Netty for network latency, avoiding GC, and getting better with Spark,
but it'd be interesting if the team could go back and run the benchmark on
the same infrastructure as their 2014 benchmark, for a code-vs-code comparison
that separates engineering improvements from economies of scale.

~~~
rxin
I definitely agree that it'd be great to decouple the software improvements
from the drops in cloud pricing. In reality it is pretty difficult, because the
Nanjing U/Alibaba team also spent a lot of time optimizing the software
specifically for the AliCloud environment, and those optimizations might not
apply on Amazon EC2, the environment of the 2014 record.

This is a great task for a rigorous academic paper!

Disclaimer: I wrote the blog post.

~~~
wsmith
Is Spark faster than MemSQL?

~~~
gopalv
MemSQL is a transactional database (system of record).

Spark is a way of processing data, ideally stored in a system of record
(Hive/HDFS/S3/MemSQL etc).

They're not the same.

~~~
wsmith
There are similarities. A database is also a way of processing data.

For the kinds of processing both Spark and MemSQL do (e.g. join operations), is
Spark faster than MemSQL?

------
embiggen
Meanwhile, Google was sorting petabytes in under a minute on their clusters 6+
years ago. We've still got a long way to go in OSS land to compete with the
big boys.

~~~
fhoffa
This post tells the "History of massive-scale sorting experiments at Google":

- https://cloud.google.com/blog/big-data/2016/02/history-of-massive-scale-sorting-experiments-at-google

When I asked why BigQuery doesn't do these sorts, the answer came straight
from the post: "Nobody really wants a huge globally-sorted output. We haven’t
found a single use case for the problem as stated."

These accomplishments are awesome nevertheless!

Disclaimer: I'm Felipe Hoffa, and I work for Google
(http://twitter.com/felipehoffa).

~~~
andrioni
Do you think you could ask someone and find out the cluster sizes they used
for those sorts? They mention "With the largest cluster at Google under our
control", but it would be more interesting to have an idea of actual numbers,
even if just an order of magnitude.

~~~
fhoffa
I could ask - but then I wouldn't be able to publish unpublished numbers on my
own (if I want to keep my job).

:)

------
flukus
A price record, not a performance one.

Also, seeing how expensive it is to sort 100TB ($144), you have to wonder why
it wouldn't be better to do it on your own hardware.

~~~
bsg75
Databricks is a cloud service. Publishing an on-premise approach would not
benefit their business.

~~~
flukus
Which also makes it more of an ad than anything.

~~~
rxin
I'd argue this (performance/cost) is exactly the right metric to measure. One
of the biggest benefits of the cloud is elasticity. In the on-premise world, one
would have to provision based on peak demand, and most of the time the cluster
utilization rate is pretty low.

~~~
flukus
But it's a cost for every iteration in the cloud vs. a one-off cost on premise.
It doesn't take very long for on premise to get cheaper.
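
A rough sketch of that break-even arithmetic, for anyone who wants to plug in
their own numbers. Only the $144 per 100TB run comes from this thread; the
function name, the $100,000 hardware cost and the $20 per-run on-premise cost
are made-up placeholders, and the model ignores utilization, depreciation and
staffing entirely.

```python
import math


def runs_to_break_even(upfront_cost, cloud_cost_per_run, onprem_cost_per_run=0.0):
    """Smallest number of runs at which owning hardware is strictly cheaper.

    Hypothetical model: on premise costs `upfront_cost` once plus
    `onprem_cost_per_run` per run; the cloud costs `cloud_cost_per_run` per run.
    Returns None if on premise never becomes cheaper.
    """
    saving_per_run = cloud_cost_per_run - onprem_cost_per_run
    if saving_per_run <= 0:
        return None  # each on-premise run costs at least as much as a cloud run
    # On premise wins once runs * saving_per_run exceeds the upfront cost.
    return math.floor(upfront_cost / saving_per_run) + 1


# $144 per run is the figure quoted in this thread; the other two numbers
# are illustrative assumptions, not measurements.
print(runs_to_break_even(upfront_cost=100_000,
                         cloud_cost_per_run=144,
                         onprem_cost_per_run=20))
```

How quickly on premise wins depends almost entirely on the assumed upfront and
per-run costs, which is exactly what the two sides here disagree about.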

~~~
rxin
The one-off cost is very high, and there is a higher ongoing maintenance cost as
well. Most organizations are moving to the cloud because it generally makes
more economic sense.

~~~
flukus
I'm really not seeing most organizations moving to the cloud. It only makes
sense economically for the smallest companies/organisations that can't afford
maintenance costs. Most bigger ones won't risk the loss of their core data and
a lot can't go to the cloud for various reasons like legality, speed and
redundancy.

~~~
rxin
Take a look at the customers featured at re:Invent. Computing as a "utility" is
the future, and the future is arriving fast.

------
iaw
I got excited, and then I saw that this was for sorting, not storage...

~~~
falaki
Apache Spark is agnostic to the storage layer.

~~~
iaw
The title has been changed; the original post didn't reference CloudSort and
I'm not sure if it included Apache Spark either. It was something along the
lines of "New record, 1TB at $1.44".

Hence my confusion.

