
Apache Spark: A Unified Engine for Big Data Processing - akashtndn
http://cacm.acm.org/magazines/2016/11/209116-apache-spark/fulltext
======
sandGorgon
Spark's new programming model is built around GraphFrames... which offer a
Pandas/DataFrames-esque mental model.

Much nicer IMHO

[https://databricks.com/blog/2016/03/03/introducing-graphframes.html](https://databricks.com/blog/2016/03/03/introducing-graphframes.html)
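
Rough idea of the API in Scala (an untested sketch; assumes the graphframes
package is on the classpath, and the column names follow the GraphFrame
convention):

    import org.apache.spark.sql.SparkSession
    import org.graphframes.GraphFrame

    val spark = SparkSession.builder.appName("gf-sketch").getOrCreate()
    import spark.implicits._

    // Vertices need an "id" column; edges need "src" and "dst".
    val vertices = Seq(("a", "Alice"), ("b", "Bob")).toDF("id", "name")
    val edges = Seq(("a", "b", "follows")).toDF("src", "dst", "relationship")

    val g = GraphFrame(vertices, edges)
    g.inDegrees.show()            // degree metrics come back as a DataFrame
    g.find("(x)-[e]->(y)").show() // motif queries are DataFrames too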

~~~
ap22213
I was really looking forward to using Datasets. Unfortunately, I've had much
better results sticking with RDDs.

I migrated to 2.0 a few months ago and have been banging my head ever since.
RDDs are very straightforward: you get a distributed collection, and you
apply functions to it. It's (almost) completely explicit about what's going on.

I've been trying to do things with Datasets that are super simple with RDDs,
but the Dataset / SQL interface hides too many details. I really tried to
give them a fair shot, but I ended up falling back to the RDD interface.

However, in my case, I'm dealing with 50+ TiB of data, so understanding how
memory and processing are being used is very important. It's probably less
important for casual users.
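
To give a feel for the contrast, here's a toy Scala sketch (illustrative
only, not my actual workload):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("rdd-vs-ds").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    // RDD: a plain distributed collection; each step is an explicit
    // transformation that maps one-to-one onto what the cluster does.
    val rddSum = sc.parallelize(1 to 100)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .reduce(_ + _)

    // Dataset: the same logic, but Catalyst may rewrite or fuse the
    // plan, which is exactly the detail-hiding I'm describing.
    val dsSum = spark.range(1, 101).as[Long]
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .reduce(_ + _)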

~~~
rxin
Reynold from Databricks and the Apache Spark project here.

Would you mind shooting me an email at rxin at databricks.com so I can better
understand the issues you're running into?

~~~
snnn
Hi, glad to see you here. We were using Spark for logistic regression training
but have switched to MPI because of the 2GB problem. I think LR was Spark's
killer feature; will you make it better? Thanks.

~~~
sandGorgon
Oh wow... I didn't even know about these limitations (and in fact, they are
not very Google-able).

Can you talk about these limitations and your experience?

~~~
snnn
[https://issues.apache.org/jira/browse/SPARK-139](https://issues.apache.org/jira/browse/SPARK-139)

[https://issues.apache.org/jira/browse/SPARK-1476](https://issues.apache.org/jira/browse/SPARK-1476)

[https://issues.apache.org/jira/browse/SPARK-6235](https://issues.apache.org/jira/browse/SPARK-6235)

You'll hit this bug when your model size is larger than 2GB.

BTW, recomputation of RDDs may result in duplicated accumulator updates, so
please do not use accumulators in your trainer for gradient summation; there's
a sketch of the trap after these links. They know about it, but they've said
they won't fix it.

[https://issues.apache.org/jira/browse/SPARK-732](https://issues.apache.org/jira/browse/SPARK-732)

[https://issues.apache.org/jira/browse/SPARK-5490](https://issues.apache.org/jira/browse/SPARK-5490)
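
A minimal sketch of the trap (names here are made up, and the fix shown is
just one safe pattern):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("acc-caveat").getOrCreate()
    val sc = spark.sparkContext
    val grads = sc.parallelize(1 to 1000000).map(_.toDouble)

    // Risky: an accumulator updated inside a transformation. If the
    // stage is recomputed (lost executor, speculative task), the
    // updates can replay and double-count the sum (SPARK-732).
    val acc = sc.doubleAccumulator("gradSum")
    grads.map { g => acc.add(g); g }.count()

    // Safer: make the sum the result of the action itself; action
    // results are computed exactly once per job.
    val gradSum = grads.reduce(_ + _)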

------
mydpy
If you follow the Spark community closely, you won't find any new information
here. However, as a software consultant, I find this to be a useful
nontechnical overview to send to partners and clients.

------
techno_modus
> The RDD _programming model_ provides only distributed collections of objects

It is OK for a non-technical, high-level overview, but RDD should probably be
referred to as a _data model_ (how data is represented and managed), while
map-reduce is a _programming model_ (how data is processed).

~~~
kod
No, the RDD interface has methods on it that directly relate to how data is
processed.

~~~
techno_modus
You are right, but what I meant is that the _delta_ between Spark and Hadoop
is the RDD (Resilient Distributed _Dataset_), which is first of all a data
structure. Thus Spark and Hadoop have the same data processing model,
implemented over different data representation models (and hence the
performance gain).

~~~
marmaduke
> delta

Why can't we use the word "difference"? Delta, mathematically, suggests some
space and a measure.

------
lobster_johnson
Question for those of you who use Spark: is it possible to use it with Go,
Rust, Nim, or C++?

I'm not a fan of the languages Spark supports, and I'm also trying to reduce
the amount of context switching these days.

~~~
perturbation
It would be possible in principle (especially if you can write a Python module
bridge that wraps your C++ / Rust / etc. code), or something you can load in
Java. But all of the RDD and SparkContext APIs are only available in
Java / Scala / Python without a significant amount of work. It's probably not
worth it to use another language and wrap it that way.

I would recommend using Scala (for the static typing) or Python (numpy and
other libraries are very useful with Spark). They're not hard to pick up,
especially when using pyspark + ipython for prototyping locally.
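
One partial escape hatch, for what it's worth: RDD.pipe streams each partition
through an external process over stdin/stdout, so a Go/Rust/C++ binary can do
the per-record work. A rough Scala sketch (the binary name is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("pipe-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Each element is written to the subprocess's stdin as one line;
    // each line the subprocess prints becomes an element of the result.
    val out = sc.parallelize(Seq("alpha", "beta", "gamma"))
      .pipe("./native_filter") // hypothetical binary; must exist on workers
    out.collect().foreach(println)

You lose the typed API and pay line-based serialization both ways, though,
which is part of why it's usually not worth the trouble.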

------
mpweiher
Funky: _10 to 20 local disks, for approximately 1GB/s to 2GB/s of disk
bandwidth_

The new MacBook Pros apparently have north of 2GB/s of disk bandwidth for
their internal SSD.

[https://9to5mac.com/2016/11/01/the-late-2016-entry-level-13-macbook-pro-has-a-ridiculously-fast-ssd/](https://9to5mac.com/2016/11/01/the-late-2016-entry-level-13-macbook-pro-has-a-ridiculously-fast-ssd/)

~~~
supergirl
Yeah, but can you store the same amount of data on one SSD as on 10 disks?

~~~
snovv_crash
This. If you can fit it on your soldered-in SSD, I'm not sure it really counts
as "Big Data".

~~~
RBerenguel
Well, O(100GB) is where I start thinking "big data". Processing 100 GB on a
single "normal" machine is slow (depending on the workload, sure). Between 10
and 100 GB I usually use a cluster, since it's less hassle (even if I could
process it locally with some tweaks or patience). Less than 10 GB I usually
run locally, unless I'm already computing something else.

~~~
matteuan
100GB can be handled without problems by MySQL or Postgres on a single node
(though maybe not a laptop). You need to tune your database and make the right
design decisions, but it's still less work than setting up a distributed
system.

~~~
EdwardDiego
Depends on how you're setting it up.

      aws s3 cp --recursive bunch-o-data s3://some-bucket/
      spark-ec2 --region eu-west-1 --identity-file s.pem --key-pair=spark --instance-type m3.2xlarge --slaves 40 -v 1.5.2 launch my-cluster

is significantly easier, in my experience, than getting PG to work well at the
100GB scale. spark-ec2 is a script that ships with Spark to make it easy to
set up a cluster on AWS.

~~~
nchammas
As of Spark 2.0, spark-ec2 no longer ships with Spark [0]; it has been moved
here [1].

[0] [http://spark.apache.org/releases/spark-release-2-0-0.html#removals](http://spark.apache.org/releases/spark-release-2-0-0.html#removals)

[1] [https://github.com/amplab/spark-ec2](https://github.com/amplab/spark-ec2)

