

Spark Breaks Previous Large-Scale Sort Record - metronius
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

======
chubot
FWIW, in 2011, Google wrote that they achieved a PB sort in 33 minutes on 8000
computers, vs. 234 minutes on 190 computers with 6080 cores reported by Spark
here.

[http://googleresearch.blogspot.com/2011/09/sorting-
petabytes...](http://googleresearch.blogspot.com/2011/09/sorting-petabytes-
with-mapreduce-next.html)

~~~
deeviant
I'm not sure why you list Google as using "8000 computers" and Spark using
"190 computers with 6080 cores".

Quoting two different metrics for two comparable things seems to imply something. Were
Google's machines single-core?

~~~
chubot
I'm just writing down exactly what they reported. They used different metrics.

Certainly it would be interesting to have an apples to apples comparison. But
the computers aren't the only thing that is relevant -- we also need to know
about the networking hardware.

------
discardorama
It's interesting, but not earth-shattering. The "10x fewer nodes" means
nothing on its own; how powerful are the new nodes? What's the network? Do they use
SSDs? Etc.

They also tuned their code to this specific problem:

" _Exploiting Cache Locality: In the sort benchmark, each record is 100 bytes,
where the sort key is the first 10 bytes. As we were profiling our sort
program, we noticed the cache miss rate was high, because each comparison
required an object pointer lookup that was random..... Combining TimSort with
our new layout to exploit cache locality, the CPU time for sorting was reduced
by a factor of 5._ "

I would love to see MR and Spark compete on the exact same hardware
configuration.
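
As an aside, here is a toy Python sketch of the layout idea quoted above: keep the
100-byte records packed in one contiguous buffer and sort offsets keyed on the 10-byte
prefix, instead of building one object per record. This is only an illustration of the
concept, not the actual Spark implementation (which works on raw byte arrays inside the
JVM):

    import os

    RECORD = 100   # bytes per record in the sort benchmark
    KEY = 10       # the sort key is the first 10 bytes

    # Toy data: a single packed buffer of n records.
    n = 1000
    buf = os.urandom(n * RECORD)

    # Sort record offsets keyed by the 10-byte prefix; full records are
    # never copied or deserialized into per-record objects.
    offsets = sorted(range(0, n * RECORD, RECORD),
                     key=lambda off: buf[off:off + KEY])

    # offsets now lists the records in key order, e.g. the smallest key is:
    smallest_key = buf[offsets[0]:offsets[0] + KEY]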

~~~
saryant
The article says exactly what they ran on: EC2 i2.8xlarge instances, which have
32 cores, 8 x 800GB SSDs, and 244GB of RAM.

~~~
discardorama
I read that. But how does that compare with the nodes they're comparing
against ("10x fewer nodes")?

~~~
rxin
The old entry had 10Gb/s full-duplex networking (40 nodes/rack, 160Gbps rack-to-spine,
2.5:1 oversubscription), 64GB of RAM, and 12 x 3TB SATA disks.

The network is probably the most important factor here, and both setups have
comparable networks.

~~~
discardorama
Since each node was handling roughly 500GB of data but had only 244GB of memory, I
think disk speed may have been the more critical factor. Their nodes used SSDs; the
older nodes used spinning rust. The seek times alone will be a killer.

~~~
rxin
Not sure why you mentioned seek time. In large-scale distributed sorting, I/O
is mostly sequential.

~~~
tlipcon
If only that were true -- the shuffle is typically seek-bound when the
intermediate data doesn't fit into cache (plenty of papers show this pretty
conclusively).

~~~
rxin
Hi Todd,

Except that in the MR 2100-node case, the entire dataset fit in memory :)

~~~
lmeyerov
Doesn't that speak to his point? Either smaller memory and more nodes, or more
memory and fewer nodes. Why not do apples to apples? (This feels like the
benchmarketing going on in browsers, which, at this point, is largely
meaningless.)

Edit: on the other hand, this is an endorsement of the current wave of "per
node performance stinks, let's avoid rewriting software for an extra year or
two by throwing SSDs at it." Great for hardware vendors!

~~~
rxin
No, it doesn't. The old record used 2100 nodes, so the entire dataset actually fit
in memory. There shouldn't be much seeking happening even in the MR 2100-node case. In
Spark's case, the data actually doesn't fit in memory.

Also, this was primarily network-bound. The old record had 2100 nodes with 10Gbps
network.
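
Rough back-of-the-envelope numbers, using only the node counts and RAM sizes quoted in
this thread (and ignoring OS and framework overhead):

    # Illustrative memory arithmetic for the two setups discussed here.
    dataset_tb = 100

    mr_nodes, mr_ram_gb = 2100, 64        # old MR record: 2100 nodes x 64GB
    spark_nodes, spark_ram_gb = 206, 244  # Spark 100TB run: 206 nodes x 244GB

    print(mr_nodes * mr_ram_gb / 1000.0)        # ~134 TB of RAM -> 100TB fits
    print(spark_nodes * spark_ram_gb / 1000.0)  # ~50 TB of RAM  -> 100TB does not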

~~~
discardorama
As per the specification of this test, the data has to be committed on disk
before it is considered sorted. So even if it all fits in memory, it has to be
on disk before the end.

So you have 100TB of disk read, followed by 100TB of disk write, all on HDDs.
That's about 100GB/node; and since Hadoop nodes are typically in RAID-6, each
write has an associated read and write too.

This does not even include the intermediate files, which (depending on how the
kernel parameters have been set) _could_ have been written to disk. A typical
dirty_background_ratio is 10, so after about 6GB of dirty pages, pdflush will kick
in and start writing to the spinning disk.
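
A rough sketch of those two figures, assuming the 64GB nodes of the old record and the
default kernel setting (illustrative only):

    # When background writeback kicks in on a 64GB node.
    ram_gb = 64                     # RAM per node in the old record
    dirty_background_ratio = 10     # typical kernel default, in percent
    print(ram_gb * dirty_background_ratio / 100.0)  # ~6.4 GB of dirty pages

    # Per-node disk traffic for the 2100-node run: 100TB read + 100TB written.
    print(2 * 100e12 / 2100 / 1e9)  # ~95 GB of disk traffic per node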

~~~
rxin
Yes, but the final output is written sequentially. We were discussing random
access, which only applies to the intermediate shuffle files.

Maybe you can email me offline; I can tell you more about the setup and how
Spark / MapReduce work w.r.t. it.

------
panarky
The 100 terabyte benchmark used 206 Spark nodes, compared with 2100 Hadoop
nodes.

Going up to 1 petabyte, the Hadoop comparison adds more nodes, 3800, while the
Spark benchmark actually reduced the number of nodes to 190.

Does Spark scale well beyond ~200 nodes, or does the network become the
bottleneck?

In any case, it's an impressive result considering that they didn't use
Spark's in-memory cache.

~~~
Lanzaa
I believe the network had become a bottleneck. As per the article:

> [O]ur Spark cluster was able to sustain ... 1.1 GB/s/node network activity
> during the reduce phase, saturating the 10Gbps link available on these
> machines.

If the network is the bottleneck, it makes sense to reduce the number of nodes
to reduce the amount of network communication.
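
The quoted figure works out as follows (a one-line conversion, for illustration):

    # Why 1.1 GB/s/node saturates a 10 Gbps NIC.
    print(1.1 * 8)  # bytes/s -> bits/s: ~8.8 Gbps, close to the 10 Gbps
                    # line rate once protocol overhead is accounted for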

~~~
rxin
The job actually scales quite linearly; i.e., running it on 200 nodes
roughly doubles the throughput of 100 nodes.

------
showerst
For the curious, the (max) price of those instances is $6.82/hr, so 206 * 6.82
* (23/60) = $538.55 -- if they did it with non-reserved instances in US East.

If they used reserved instances in US East, it drops to $181.

Obviously there are lots of costs involved besides the final perfect run, but
it's an interesting ballpark.

~~~
sp332
You have to put spaces around your * 's to keep HN from italicizing
everything.

~~~
showerst
Oops, edited. Thanks!

------
rxin
Thanks for sharing this. I'm the author of this blog post. Feel free to ask me
anything.

~~~
chad_walters
Your post mentions "single root IO virtualization" as a factor in maximizing
network performance. I am wondering what impact this had on your sort. Do you
have data for runs where you didn't enable it?

~~~
rxin
It was part of the enhanced networking. Without enhanced networking, we were
getting about 600MB/s, vs 1.1GB/s with.

------
gtrubetskoy
The strength of Hadoop isn't so much speed as the fact that it's been around for a
while and there is a pretty impressive and fairly mature set of projects making up the
Hadoop ecosystem, from YARN to Hive. There are still many issues to
resolve, and this evolution will continue for decades to come.

The TB sort benchmark is pretty useless to me - I am much more concerned with
stability and a vibrant community (which means people, the software they write,
and institutions using Hadoop in production).

The last time I tinkered with Spark (over a year ago) it was so buggy as to be
next to useless, but perhaps things have changed.

Still - the idea that there is some sort of a revolutionary new approach that
is paradigm-shifting and is way better than anything before should be viewed
with extreme skepticism.

The problem of distributed computing is not a simple one. I remember tinkering
with the Linux kernel back in the mid-nineties, and 20 years later it still
has a ways to go.

Twenty years from now the tool for this sort of thing might or might not be Hadoop;
we don't know. But I will not take seriously anything or anyone claiming that the
"next best thing" is here in 2014.

~~~
metronius
1. Cloudera left M/R for Spark, and Mahout left M/R for Spark. The Spark community
will be huge soon.

2. Yes, Spark was/is buggy.

3. For me, Spark really is a paradigm shift, a next-generation framework compared
to M/R.

~~~
gtrubetskoy
Spark _requires_ Hadoop to run, so this whole Spark vs Hadoop debate makes no
sense whatsoever.

There is a place for arguing about how effective Map/Reduce is, but it's been known
for years that M/R is neither the only nor the best general-purpose algorithm for
solving all problems. More and more tools these days do not use M/R, Spark
included, and Spark is certainly not the first tool to provide an alternative
to M/R. AFAIK Google abandoned M/R years ago.

I just don't understand this constant boasting about Spark, it seems very
suspicious to me.

~~~
nchammas
> Spark _requires_ Hadoop to run

This is not correct. Spark uses the Hadoop Input/Output API, but you don't
need any Hadoop component installed to run Spark, not even HDFS.

You can -- and many companies do -- run Spark on Mesos or on Spark's
standalone cluster manager, and use S3 as their storage layer.
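
For example, a minimal PySpark sketch of that kind of deployment might look like the
following. The master URL and bucket names are hypothetical, and it assumes AWS
credentials are configured and the S3 filesystem classes are on the classpath, but no
HDFS or other Hadoop installation:

    from pyspark import SparkConf, SparkContext

    # Point at a standalone Spark master (no YARN, no HDFS).
    conf = (SparkConf()
            .setMaster("spark://master.example.com:7077")
            .setAppName("s3-wordcount"))
    sc = SparkContext(conf=conf)

    # Read input straight from S3 and write results back to S3.
    lines = sc.textFile("s3n://my-bucket/input/*")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("s3n://my-bucket/output/wordcount")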

> this whole Spark vs Hadoop debate makes no sense whatsoever

If we talk about Hadoop as an ecosystem of tools, then yes, it doesn't make
sense to frame Spark as a competitor. Spark is part of that ecosystem.

But if we talk about Hadoop as Hadoop 1 MapReduce or as Hadoop 2 Tez, both of
which are execution engines, then it very much makes sense to pit Spark
against them as an alternative execution engine.

Granted, Hadoop 1 MapReduce is pretty old compared to Spark, and Tez is still
under heavy development, but these are alternatives and not complements to
Spark.

(Note: In Hadoop 2, MapReduce is just a framework that uses Tez as its
underlying execution engine.)

> I just don't understand this constant boasting about Spark, it seems very
> suspicious to me.

Suspicious how?

I think Spark's elegant API, unified data processing model, and performance --
all of which are documented very well in demos and benchmarks online -- merit
the excitement that you see in the "Big Data" community.

------
ddlatham
Most recent results I can see to compare to (Google, Yahoo, Quantcast):
[https://www.quantcast.com/inside-
quantcast/2013/12/petabyte-...](https://www.quantcast.com/inside-
quantcast/2013/12/petabyte-sort/)

------
gane5h
Going on a tangent here: this benchmark highlights the difficulty of sorting
in general. Sorts are necessary for computing percentiles (such as the
median.) In practical applications, an approximate algorithm such as t-digest
should suffice. You can return results in seconds as opposed to "chest
thumping" benchmarks to prove a point. :)

I wrote a post on this: [http://www.silota.com/site-search-blog/approximate-
median-co...](http://www.silota.com/site-search-blog/approximate-median-
computation-big-data/)
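
For a concrete sense of what that looks like, here is a small sketch assuming the
third-party Python tdigest package and its update()/percentile() calls (the exact API
may differ by version). Because t-digests can be merged, each partition can build its
own digest and the results can be combined afterwards:

    import random
    from tdigest import TDigest   # third-party package; pip install tdigest

    digest = TDigest()
    for _ in range(100000):
        # Stream values in one at a time; memory use stays small and bounded.
        digest.update(random.gauss(0, 1))

    # Approximate percentiles without ever sorting the full dataset.
    print(digest.percentile(50))  # ~0.0, the approximate median
    print(digest.percentile(99))  # ~2.33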

~~~
sonoffett
Perhaps I misunderstand your comment, but you actually don't need to sort to
compute a median (see the O(n) median-of-medians algorithm [1]).

[1]
[http://en.wikipedia.org/wiki/Median_of_medians](http://en.wikipedia.org/wiki/Median_of_medians)
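
A minimal sketch of selection without a full sort (deterministic quickselect using the
median-of-medians pivot; illustrative, not production code):

    def select(values, k):
        """Return the k-th smallest element (0-indexed) in O(n) worst-case time."""
        values = list(values)
        while True:
            if len(values) <= 5:
                return sorted(values)[k]
            # Median-of-medians pivot: median of each group of 5, then recurse.
            groups = [values[i:i + 5] for i in range(0, len(values), 5)]
            medians = [sorted(g)[len(g) // 2] for g in groups]
            pivot = select(medians, len(medians) // 2)
            lows = [x for x in values if x < pivot]
            equal = [x for x in values if x == pivot]
            if k < len(lows):
                values = lows
            elif k < len(lows) + len(equal):
                return pivot
            else:
                k -= len(lows) + len(equal)
                values = [x for x in values if x > pivot]

    data = [7, 1, 9, 3, 5, 8, 2, 6, 4]
    print(select(data, len(data) // 2))  # 5, the median, found without sorting data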

------
coldcode
No matter the circumstances, sorting 100 TB or 1 PB of anything is
impressive, especially doing it in the time it takes me to eat lunch.

------
vinay_ys
Where can I find the source code & instructions on how to reproduce this
benchmark?

------
metronius
What made the biggest difference in performance between Spark and MapReduce?

------
xxcode
Does this mean that Spark is the new God? If so, then Databricks will be the next
Cloudera. Cloudera is probably a $10B+ company.

Good job.

