Introducing DataFrames in Spark for Large Scale Data Science (databricks.com)
138 points by rxin on Feb 17, 2015 | 41 comments

Has anyone had some good experiences with Spark?

I put several weeks into moving our machine learning pipeline over to Spark, only to find I kept hitting a race condition in their scheduler.

After doing a bit of searching, it seems this is actually a known issue https://issues.apache.org/jira/browse/SPARK-4454 and there's been a fix on their GitHub for a while: https://github.com/apache/spark/pull/3345 and yet in that time two releases have gone by, bringing a tonne of features.

I ultimately ended up dropping Spark because I wasn't confident about putting it into production (the random OOMs and NPEs during development weren't great either). Does anyone have any positive experiences?

Spark is less mature than Hadoop, so you will run into issues like this. In my experience, advocating for the bug to get fixed often results in it getting fixed... on a several month timeline. This happened with Avro support in Python. I advocated for the patch and someone supplied it in the next version of Spark.

Lemme tell you though... as someone that has used Hadoop for 5+ years... not waiting 5-10 minutes every time you run new code is worth the trouble. Despite more problems owing to immaturity, or just Spark 'doing less for you' in terms of data validation than other tools like Pig/Hive, if you can get your stuff running on Spark... development is joyous. You just don't have to wait very long during development anymore.

I feel like 5 years of my life were delayed 10 minutes at a time. That did terrible things to my coding that I'm just starting to get over. With Spark I am 10x as productive, and I am limited by my thinking, not the tools.

PySpark in particular is really great.

Seriously ./spark-shell is a godsend for development.

And I love the fact you can press Tab and get autocompletion of methods.

Hey - sorry you had a bad experience. That bug was filed as a "minor" issue with only one user ever reporting it, so it didn't end up high up in our triage. We didn't merge the pull request because it was not correct; however, we can just add our own fix for it if it's affecting users. In the future, if you chime in on a reported JIRA, that will escalate it in our process.

Sorry, didn't mean to come across as completely negative.

I appreciate the work that's gone into Spark, and it's clearly well designed. Developing with Spark after coming from a Hadoop background was a very refreshing experience.

No worries. Hopefully you'll reconsider using it!

We've recently adopted Spark SQL and our queries run 5-200x faster than with Hive.

Our experience with Spark Streaming, on the other hand, has been mixed. Our streaming app runs stably most of the time (up to 4 days in some cases), but we still see the occasional failure, sometimes with no exception or stack trace indicating what failed.

Our goal is to have a 24/7 streaming service, and Spark has gotten us close to that. There are just a couple of unexplained errors standing in our way.

I'm building distributed ML on top of Spark and have found it to be good overall. I've had to work out issues with partitioning and mini-batching, but I've had a good time so far. The DataFrame initiative will certainly help things. The JVM ecosystem needs a scientific environment like Python's (pandas, SciPy, ...). The potential is there with Scala, as we're seeing here today.

Our experience was as a Python shop who was backed into a corner to use Apache Pig for our Hadoop batch jobs.

We decided to rewrite some of those jobs from Pig to PySpark, and though there was a little bit of a learning curve and some sharp edges, the development experience is so much better than Pig that my team is generally happy with the switch.

PySpark is really compelling for Pig/Python shops. If it weren't for Pig on Spark, I'd fear for Pig's future.

Spark is pretty fantastic from our perspective. People just think about it in terms of a faster Hadoop MR but it is so much more. The APIs and integration with external systems are so much easier and more intuitive to use.

It really is Hadoop 2.0.

We've been using it in production for about half a year now and it's been great. It especially shines when you have iterative algorithms where you, e.g., first need to group by something, then process that a bit, then split it out again and process it a bit more, etc. This kind of task is just so much faster in Spark than in MapReduce, it doesn't even compare. Also their API is much nicer, and so are the underlying ideas (RDDs in particular). I think it will mostly replace MR in the future. If you are starting a new project now, I see few reasons not to use Spark if it fits the bill.

I've used Spark quite successfully for a few small jobs and I love it. I'm sure robustness will improve over time (haven't been bothered by any immaturity myself), but as a system that supports a variety of Big Data processing styles like batch, streaming, graph, ML, etc. so well and is clearly bridging the gap between distributed systems and more traditional analysis languages, I think it's very exciting.

Btw, there is a new PR by Josh Rosen to fix SPARK-4454 here: https://github.com/apache/spark/pull/4660

I've been successfully using Spark in production since 0.7, across three or four significantly different projects.

I don't think I could bring myself to ever write another Hadoop job.

I do, mostly using GraphX, ran a pretty big EC2 cluster using Spark 1.2 (hundreds of cores) and had no significant issues.

Spark the platform seems awesome. I'm somewhat less convinced by mllib - I'm not sure there are as many use cases for distributed machine learning as people seem to think (and I would bet that a good deal of companies that use distributed ML don't really need it). I've seen a lot of tasks that could be handled by simpler, faster algos on large workstations (you can get 250 GB RAM from AWS for like $4.00/hr). I'd love to hear counterarguments, though!

While fitting the algorithm might not often benefit from partitioned data, I see two upsides to using Spark for predictive modeling.

First, it makes it easy to do the feature extraction and model fitting in the same pipeline, making it possible to cross-validate the impact of the hyper-parameters of the feature extraction part. Feature extraction generally starts from a collection of large, raw datasets that need to be filtered, joined and aggregated (for instance a log of user clicks, sessionized by user id over temporal windows, then geo-joined to GIS data via a geoip resolution of the IP address of the user agent). While the raw datasets of clicks and geographical databases might be too big to be processed efficiently on a single node, the resulting extracted features (e.g. user session statistics enriched with geo features) are typically much smaller and could be processed on a single node to build a predictive model. However, Spark RDDs make it natural to trace provenance, and hence trivial to rebuild downstream models when tweaking upstream operations used to extract the features. The native caching features of Spark make that kind of workflow very efficient with minimal boilerplate (e.g. no manual file versioning).

Second, while the underlying ML algorithm might not always benefit from parallelization in itself, there are meta-level modeling operations that are both CPU-intensive and embarrassingly parallel, and can therefore benefit greatly from a compute framework such as Spark. The canonical cases are cross-validation and hyper-parameter tuning.
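The meta-level parallelism is easy to picture even without Spark: each hyper-parameter combination is an independent task. A toy sketch using a thread pool as a stand-in for cluster executors; the `score` function here is a made-up objective, not a real model fit.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical objective standing in for "fit a model and score it";
# on a cluster, each call would be one Spark task.
def score(params):
    alpha, depth = params
    return -((alpha - 0.1) ** 2) - ((depth - 3) ** 2)

# Full grid of hyper-parameter combinations.
grid = list(itertools.product([0.01, 0.1, 1.0], [1, 3, 5]))

# Every grid point is independent, so they can all run at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(score, grid))

best = grid[scores.index(max(scores))]
```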

The benefit of Spark and related systems is you get a flexible infrastructure that can handle a wide range of tasks reasonably well. You pay once for infrastructure, training, devops, and so on.

You can optimise any particular use case to perform better than Spark, but then you are going to incur the above costs for every project you create.

I'm one of the authors of the blog post as well as this new API. Feel free to ask me anything.

Have there been any changes to the in-memory columnar caching used by SchemaRDDs in 1.2? I noticed some problems with that: for example, if a SchemaRDD with cols [1,2,3] on parquet files [X,Y,Z] is cached, and then I create a new one with a subset of the cols, say [1,2], on the same files [X,Y,Z], the new SchemaRDD's physical plan would refer to the files on disk instead of an in-memory columnar scan. I'm wondering if DataFrames handle this differently, and the implications for caching.

For some context: in our case, loading a reasonable set of data from HDFS can take up to 10-30 mins, so keeping a cached copy of the most recent data with certain columns projected is important.

One useful feature would be to consider neighboring rows, for example for differentiating a time series. Do you plan to add an efficient method for that?

How does the API communicate between the client and the server? Any interest in talking to it from R? (I'd be happy to help)

Are there any timelines for when this (and Spark in general) will fully support ORC files (including predicate-pushdown)?

Very likely in Spark 1.4. Hortonworks has been helping out with this; we just need some internal refactoring of the API to make it work.

Are DataFrames RDDs with a new DSL?

In a way, yes. It is a little bit more than that, because DataFrames internally are actually "logical plans". Before execution, they are optimized by an optimizer called Catalyst and turned into physical plans.

Normal RDDs won't benefit from this optimisation, only DataFrames? Is that because using this new DSL allows Spark to more precisely plan what needs to happen for DataFrames?

I guess this means DataFrames should be used all the time in the future, or will there still be a reason to use plain RDDs in the future?

You guys are doing great work !

Indeed, DataFrames give Spark more semantic information about the data transformations, and thus can be better optimized. We envision this becoming the primary API users use. You can still fall back to the vanilla RDD API (after all, a DataFrame can be viewed as RDD[Row]) for stuff that is not expressible with DataFrames.

Could you give an example of something that could not be expressed with DataFrames? Would e.g. tree-structured data be a bad fit for DataFrames, since it doesn't fit well with the tabular nature?

Hey... don't downvote Reynold Xin, author of the post, as a dupe when he says AMA.

Somehow my comment was removed ... :(

Re-make it.

A replica of this benchmark on my laptop running R completes in about 1/4 second. Seems like a pretty trivial benchmark?


library(data.table)
x = data.table(a=sample(10, 10e6, replace=TRUE), num=sample(100, 10e6, replace=TRUE))
t1 = proc.time(); x[, sum(num), by=a]; print(proc.time() - t1)

   user  system elapsed
  0.209   0.032   0.245

The example was mostly a toy example. The power really comes when you get interactivity for small data and big data. Using this, you could scale up to TBs of data on a cluster and still get results relatively fast, which is not something you can do with R.

I don't expect to beat R at small scale yet. There are a few pieces of low-hanging fruit for single-node performance. For example, even for single-node data, we incur a "shuffle" to do data exchange in aggregations. This is done to ensure both single-node programs and distributed programs go through the same code path, to catch bugs. If we want to optimize more for single-node performance, we can get the optimizer to remove the shuffle operation in the middle and just run the aggregations. Then this toy example will probably be done in the 100ms range.

Actually, I just realized the main difference: if you load the data from memory, it is much faster, which is the case with R.

Naive question: how is this different from Spark SQL and things like Project Zeppelin?

This is great news! Where do I see the source for this?

Since this will be released in Spark 1.3, you can track the development in the branch-1.3 branch on GitHub: https://github.com/apache/spark/tree/branch-1.3

Will the DataFrame API work with Spark Streaming?

There are some ways you can integrate the two. E.g., Streaming allows you to apply arbitrary RDD transformations, and thus you can pass a physical plan generated by a DataFrame into Streaming.

We will work on better integration in the future too.
