Project Tungsten: Bringing Spark Closer to Bare Metal (databricks.com)
59 points by mateiz on April 28, 2015 | 11 comments



And on a related note, congratulations to Matei Zaharia for winning the ACM Best Dissertation Award for his work on Spark: http://awards.acm.org/doctoral_dissertation/


Many congrats to Matei! Well deserved.


This is really exciting and important, Matei; can't wait to see it in action.

* Matei Zaharia is the creator of Apache Spark


I'm not sure I completely buy their claim that network/disk IO is no longer the bottleneck in many situations. Their recent NSDI paper (https://kayousterhout.github.io/trace-analysis/) is compute-bound for a 20-node cluster, but they never evaluate it on a larger cluster with a fixed data size. The availability of on-demand cloud resources would potentially make it easier to solve the bottleneck by just increasing the cluster size.


It really depends on what you are doing. These days, assuming the code is well designed and highly efficient (not that common), you typically bottleneck on effective memory bandwidth or network bandwidth. To the extent this is not the case, it frequently is due to relatively poor efficiency of implementation in other parts of the system (extremely common).

Given a set of cheap SSDs and a good I/O scheduler, very few workloads should bottleneck on disk, because the disks have more available bandwidth than the network. If you are bottlenecking on disk, it usually means the I/O scheduling is poor. The operating system I/O scheduler is quite poor for many use cases, which is why I/O-intensive apps write their own.

Network can be inexpensively and easily saturated at 10 GbE these days, even when doing something ugly like parsing JSON streams. Unless there is a bottleneck elsewhere in the system, like memory bandwidth or occasionally CPU, this is where I typically see bottlenecks in real systems. However, switch fabrics do not scale infinitely even if you know what you are doing, so there is that, and bandwidth does not always scale linearly (though things like graph analysis are much closer to effectively linear in sophisticated implementations than you see using naive algorithms).

Data motion is always the bottleneck. Right now, moving from RAM to CPU or from machine to machine is almost always the bottleneck if you are doing everything else right. Many popular software designs and implementations for "big data" have many other bottlenecks that are not strictly necessary, so that throws off expectations e.g. poor I/O scheduling saturating disk or poor JSON parsing burning up CPU or poor use of threading wasting a lot of CPU time.
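To make the relative magnitudes concrete, here is a rough back-of-envelope sketch in Scala. The throughput figures are illustrative assumptions (a single 10 GbE link, a few SATA SSDs, naive per-core JSON parsing), not measurements from this thread or from the paper.

  // Back-of-envelope bandwidth comparison. All figures are rough,
  // assumed orders of magnitude, not benchmarks.
  object BandwidthSketch {
    def main(args: Array[String]): Unit = {
      val networkBps   = 10e9 / 8       // one 10 GbE link, roughly 1.25 GB/s
      val ssdBps       = 4 * 500e6      // four SATA SSDs, roughly 2 GB/s aggregate
      val jsonParseBps = 8 * 150e6      // 8 cores of naive JSON parsing, roughly 1.2 GB/s

      // With numbers in this range, a handful of SSDs outruns a single
      // 10 GbE link, and JSON parsing on one box lands in the same
      // ballpark as the network, so which resource saturates first is a
      // question of implementation quality rather than disk vs. network
      // in the abstract.
      println(f"network:      ${networkBps / 1e9}%.2f GB/s")
      println(f"ssd (x4):     ${ssdBps / 1e9}%.2f GB/s")
      println(f"json parsing: ${jsonParseBps / 1e9}%.2f GB/s")
    }
  }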


For most workloads, increasing the number of machines does not increase the amount of data sent over the network, so the ratio of computation to network bandwidth stays the same. As a result, increasing the cluster size doesn't make workloads any more I/O bound.

At some point, a larger cluster will be more bandwidth-constrained because of oversubscription, but given the network utilizations we saw (<5% at the median in Figure 5), a cluster would have to have pretty high oversubscription for the network to become the bottleneck.

The one caveat here is workloads like matrix computations, where the data sent over the network increases with the number of machines.
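As a rough sketch of that scaling argument (the 1 TB shuffle size below is an assumption picked purely for illustration): in a plain shuffle the total bytes crossing the network stay essentially fixed as machines are added, so per-machine network volume shrinks in step with per-machine compute.

  // Sketch of how shuffle traffic scales with cluster size for a fixed
  // dataset. The 1 TB figure is an assumed example, not from the paper.
  object ShuffleScalingSketch {
    def main(args: Array[String]): Unit = {
      val mapOutputBytes = 1e12   // fixed total map output to shuffle (assumed 1 TB)

      println("machines  per-machine net (GB)  total net (GB)")
      for (n <- Seq(20, 100, 500)) {
        // Each machine holds mapOutputBytes / n; roughly (n - 1) / n of that
        // is destined for other machines and crosses the network.
        val perMachine = mapOutputBytes / n * (n - 1.0) / n
        val total      = mapOutputBytes * (n - 1.0) / n
        println(f"$n%8d  ${perMachine / 1e9}%21.1f  ${total / 1e9}%14.1f")
      }
      // Total network volume stays near 1 TB regardless of cluster size, so
      // a bigger cluster does not make the shuffle more network-bound. Dense
      // matrix workloads are the exception noted above: their total
      // communication grows with the number of machines.
    }
  }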


Interesting. What about e.g. map-reduce workloads with large fan-in for the reduce step?


Author of the blog post here. Kay already pointed out that increasing cluster size usually reduces network utilization. In addition to that, there are a few challenges with just getting a bigger cluster from "on-demand" resources:

1. Obviously, it costs more to get a bigger cluster.

2. You might not be able to get as big of a cluster as you wanted, since "cloud" is not an infinite resource.

3. Some workloads are not embarrassingly parallel. For some (graph, matrix, join), more parallelism -> more communication (see the sketch below).
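As one concrete illustration of point 3: a broadcast join ships a full copy of the small table to every executor, so its network cost grows linearly with cluster size even though the input data is fixed. The 2 GB table size below is a hypothetical figure chosen for the example.

  // Sketch: broadcast-join traffic vs. cluster size. The 2 GB table is an
  // assumed example; the point is only the linear growth with executors.
  object BroadcastJoinSketch {
    def main(args: Array[String]): Unit = {
      val smallTableGB = 2.0   // dimension table broadcast to every executor (assumed)

      for (executors <- Seq(20, 100, 500)) {
        // Every executor receives its own copy of the table, so total
        // traffic is executors * table size.
        val totalGB = executors * smallTableGB
        println(f"$executors%4d executors -> ~$totalGB%.0f GB of broadcast traffic")
      }
    }
  }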


Are there JIRA issues related to these ongoing projects? I'd be interested to see the current thoughts on LLVM code gen and the "Spark Binary Map."



That's a clever name. Somebody at Spark has apparently used a TIG welder before.



