

Introducing DataFrames in Spark for Large Scale Data Science - rxin
http://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
I'm the author of this blog post. We are very excited about this API and think it will become the common interchange format for data in Spark. It also has some neat features (such as code generation and predicate pushdown) that make it very useful for Big Data.

Feel free to ask me anything.
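
To give a quick taste of the API, here is a minimal PySpark sketch (the file path and column names are made up for illustration):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-demo")
    sqlContext = SQLContext(sc)

    # load a DataFrame from JSON (hypothetical path), then filter,
    # group and count; the optimizer applies predicate pushdown etc.
    df = sqlContext.jsonFile("hdfs:///data/people.json")
    df.filter(df.age > 21).groupBy("city").count().show()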
======
super_sloth
Has anyone had any good experiences with Spark?

I put several weeks into moving our machine learning pipeline over to Spark,
only to find I kept hitting a race condition in their scheduler.

After doing a bit of searching, it seems this is actually a known issue
[https://issues.apache.org/jira/browse/SPARK-4454](https://issues.apache.org/jira/browse/SPARK-4454)
and there's been a fix on their github for a while:
[https://github.com/apache/spark/pull/3345](https://github.com/apache/spark/pull/3345)
and yet in that time two releases have swung by, bringing a tonne of features.

I ultimately had to drop Spark because I wasn't confident about putting it
into production (the random OOMs and NPEs during development weren't great
either). Does anyone have any positive experiences?

~~~
rjurney
Spark is less mature than Hadoop, so you will run into issues like this. In my
experience, advocating for a bug to get fixed often results in it getting
fixed... on a several-month timeline. This happened with Avro support in
Python: I advocated for the patch and someone supplied it in the next version
of Spark.

Lemme tell you though... as someone who has used Hadoop for 5+ years... not
waiting 5-10 minutes every time you run new code is worth the trouble. Despite
more problems owing to immaturity, or Spark just 'doing less for you' in terms
of data validation than tools like Pig/Hive, if you can get your stuff running
on Spark, development is joyous. You just don't have to wait very long during
development anymore.

I feel like 5 years of my life were delayed 10 minutes at a time. That did
terrible things to my coding that I'm just starting to get over. With Spark I
am 10x as productive, and I am limited by my thinking, not the tools.

PySpark in particular is really great.

~~~
threeseed
Seriously, ./spark-shell is a godsend for development.

And I love the fact that you can press Tab and get autocompletion of methods.

------
elliptic
Spark the platform seems awesome. I'm somewhat less convinced by MLlib - I'm
not sure there are as many use cases for distributed machine learning as
people seem to think (and I would bet that a good many of the companies using
distributed ML don't really need it). I've seen a lot of tasks that could be
handled by simpler, faster algorithms on large workstations (you can get 250 GB
of RAM from AWS for around $4.00/hr). I'd love to hear counterarguments, though!

~~~
ogrisel
While fitting the algorithm itself might not often benefit from partitioned
data, I see two upsides to using Spark for predictive modeling.

First, it makes it easy to do the feature extraction and model fitting in the
same pipeline, which makes it possible to cross-validate the impact of the
hyper-parameters of the feature extraction step. Feature extraction generally
starts from a collection of large, raw datasets that need to be filtered,
joined and aggregated (for instance a log of user clicks, sessionized by user
id over temporal windows, then geo-joined to GIS data via a geoip resolution
of the IP address of the user agent). While the raw datasets of clicks and
geographical databases might be too big to be processed efficiently on a
single node, the resulting extracted features (e.g. user session statistics
enriched with geo features) are typically much smaller and could be processed
on a single node to build a predictive model. However, Spark RDDs make it
natural to trace provenance, and hence trivial to rebuild downstream models
when tweaking the upstream operations used to extract the features. The native
caching features of Spark make that kind of workflow very efficient with
minimal boilerplate (e.g. no manual file versioning).
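
A rough PySpark sketch of that kind of pipeline (all paths and column names are hypothetical):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="feature-extraction")
    sqlContext = SQLContext(sc)

    clicks = sqlContext.jsonFile("hdfs:///logs/clicks")    # large, raw
    geo = sqlContext.parquetFile("hdfs:///ref/geoip")      # large, raw

    # filter, join and aggregate the raw logs into per-user features
    sessions = (clicks
                .filter(clicks.event == "click")
                .join(geo, clicks.ip == geo.ip)
                .groupBy("user_id", "region")
                .count())

    # the extracted features are much smaller than the inputs; caching
    # them lets you cheaply rebuild downstream models while tweaking
    # the upstream steps
    sessions.cache()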

Second, while the underlying ML algorithm might not always benefit from
parallelization in itself, there are meta-level modeling operations that are
both CPU-intensive and embarrassingly parallel, and can therefore benefit
greatly from a compute cluster such as Spark. The canonical cases are
cross-validation and hyper-parameter tuning.
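
For instance, distributing a hyper-parameter grid search is straightforward. This is a sketch only: load_training_set and train_and_score are hypothetical helpers, and the training set is assumed small enough to broadcast to every worker.

    from itertools import product
    from pyspark import SparkContext

    sc = SparkContext(appName="grid-search")

    # hypothetical helper: a training set small enough to broadcast
    data = load_training_set()
    bc_data = sc.broadcast(data)

    grid = [{"C": c, "gamma": g}
            for c, g in product([0.1, 1.0, 10.0], [0.01, 0.1])]

    def evaluate(params):
        # hypothetical helper: fit on the broadcast data and return a
        # cross-validated score for this hyper-parameter combination
        return params, train_and_score(params, bc_data.value)

    # each grid point is independent, hence embarrassingly parallel
    results = sc.parallelize(grid, len(grid)).map(evaluate).collect()
    best_params, best_score = max(results, key=lambda pair: pair[1])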

------
rxin
I'm one of the authors of the blog post as well as this new API. Feel free to
ask me anything.

~~~
djcater
Are there any timelines for when this (and Spark in general) will fully
support ORC files (including predicate pushdown)?

~~~
pwendell
Very likely in Spark 1.4. Hortonworks has been helping out with this; we just
need some internal refactoring of the API to make it work.

------
rjurney
Hey... don't downvote Reynold Xin, the author of the post, as a dupe when he
says AMA.

~~~
rxin
Somehow my comment was removed ... :(

~~~
rjurney
Re-make it.

------
homerowilson
Replicating this benchmark in R on my laptop, it runs in about a quarter of a
second. Seems like a pretty trivial benchmark?

    library(data.table)

    x = data.table(a = sample(10, 10e6, replace = TRUE),
                   num = sample(100, 10e6, replace = TRUE))
    t1 = proc.time(); x[, sum(num), by = a]; print(proc.time() - t1)

       user  system elapsed
      0.209   0.032   0.245

~~~
rxin
The example was mostly a toy. The power really comes when you get
interactivity for small data and big data alike. Using this, you can scale up
to TBs of data on a cluster and still get results relatively fast, which is
not something you can do with R.

I don't expect to beat R at small scale yet. There is some low-hanging fruit
for single-node performance. For example, even for single-node data, we incur
a "shuffle" to do the data exchange in aggregations. This is done so that both
single-node and distributed programs go through the same code path, which
helps catch bugs. If we want to optimize more for single-node performance, we
can have the optimizer remove the shuffle operation in the middle and just run
the aggregations. Then this toy example will probably finish in the 100ms
range.
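
For reference, here is a rough PySpark equivalent of the benchmark above (a sketch against the 1.3-era API; the same code runs unchanged on a cluster against much larger data):

    import random
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="toy-aggregation")
    sqlContext = SQLContext(sc)

    # 10M rows with the same shape as the data.table example above
    rows = sc.parallelize(range(10 * 1000 * 1000), 8).map(
        lambda i: Row(a=random.randint(1, 10), num=random.randint(1, 100)))
    df = sqlContext.createDataFrame(rows)

    df.groupBy("a").sum("num").show()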

------
eranation
Naive question - how is this different from Spark SQL and things like Project
Zeppelin?

------
jyotiska
This is great news! Where can I see the source for this?

~~~
zodvik
Since this will be released in Spark 1.3, you can track development in the 1.3
branch on GitHub:
[https://github.com/apache/spark/tree/branch-1.3](https://github.com/apache/spark/tree/branch-1.3)

------
zodvik
Will the DataFrame API work with Spark Streaming?

~~~
rxin
There are some ways you can integrate the two. E.g. Streaming allows you to
apply arbitrary RDD transformations, and thus you can pass a physical plan
generated by a DataFrame into Streaming.
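
For example, here is a sketch of one way to bridge them today by turning each micro-batch RDD into a DataFrame (the socket source and word-count logic are just for illustration):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-dataframes")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches
    sqlContext = SQLContext(sc)

    lines = ssc.socketTextStream("localhost", 9999)
    words = lines.flatMap(lambda line: line.split(" ")) \
                 .map(lambda w: Row(word=w))

    def process(rdd):
        # build a DataFrame for this micro-batch and query it
        if not rdd.isEmpty():
            sqlContext.createDataFrame(rdd).groupBy("word").count().show()

    words.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()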

We will work on better integration in the future too.

