Announcing Apache Spark 1.4 (databricks.com)
154 points by rxin on June 11, 2015 | 45 comments

For anyone who wants to pick up Spark basics: Berkeley (Spark was developed at Berkeley's AMPLab), in collaboration with Databricks (the commercial company started by Spark's creators), just started a free MOOC on edX: https://www.edx.org/course/introduction-big-data-apache-spar...

(If you wonder what Spark is, in a very unofficial nutshell: it is a computation / big data / analytics / machine learning / graph processing engine on top of Hadoop that usually performs much better and has an arguably much easier API in Python, Scala, Java and now R.)

It has more than 5000 students so far, and the Professor seems to answer every single Piazza question (a popular student / teacher message board).

So far it looks really good (It started a week ago, so you can still catch up, 2nd lab is due only Friday 6/12 EOD, but you have 3 days "grace" period... and there is not too much to catch up)

I use Spark for work (Scala API) and still learned one or two new things.

It uses the PySpark API, so there is no need to learn Scala. All homework labs are done in an IPython notebook. Very high quality so far, IMHO.

It is followed by a more advanced Spark course (Scalable Machine Learning), also by Berkeley & Databricks.


(Not affiliated with edX, Berkeley or Databricks; I just thought this was a good place for a PSA to those interested.)

The academic work behind Spark by Matei Zaharia (creator of Spark) won him a PhD dissertation award for 2014 from the ACM (http://www.acm.org/press-room/news-releases/2015/dissertatio...)

Spark also set a new record in large-scale sorting (beating Hadoop by far): https://databricks.com/blog/2014/11/05/spark-officially-sets...

* EDIT: typo in "Berkeley", thanks gboss for noticing :)

I would love to learn about Spark, but as someone who lives in a third-world country I hate edX; I am in love with Udacity and Coursera instead. Where I live we don't get much monthly traffic, but we can download everything we want between 1am and 6am, so there is no simple way to download a course from edX and use it later. I wish it was on Udacity or Coursera. Is there any torrent for the course material?

I'm doing the Spark course. edX has a download button on the videos, and you can download PDF files for the lectures. The rest, like the embedded quizzes, I just screenshot or save as PDF for posterity.

Are you sure you can't download? Or maybe they've changed it recently.

Yes, I am aware of the download button, but consider that every course is ~50 distinct videos, and given our download window you will agree that downloading is extremely painful. Why don't they just offer the whole material (at least the videos) the way Udacity does?

You can download the lectures using the edx-downloader: https://github.com/shk3/edx-downloader

> It is followed by a more advanced spark course (Scalable Machine Learning)

Is it really more advanced regarding Spark? The requirements state explicitly that no prior Spark knowledge is required.

Cool, I stand corrected. Thanks.

"... on top of Hadoop".

Can safely remove this part. Hadoop not required.

Hadoop isn't required, and Spark only runs better if you can fit the data in memory.

Spark does micro-batch processing, whereas Hadoop traditionally does batch processing. Hadoop YARN is different now, and even old Hadoop can supposedly be as fast if you can fit the data into memory, according to a meetup I attended.

There's also Apache Flink, by data Artisans.

I've been struggling to set it up correctly on my Debian machine. Are there Debian packages or a concise tutorial? I've found some things on the web, but certain things do not match mine and I'm lost...

Thanks for the detailed info and context. Just signed up for my first edX course.

Thanks! I've been following the course and so far it's been awesome!

Thanks for the plug. I have signed up for the class as well and it's great!

As someone who doesn't know what Apache Spark is, this article reads like it could have been randomly generated.

Apache Spark is a general-purpose distributed data processing and caching engine. It is an evolution of MapReduce concepts into more general "directed acyclic graph" processing, which is very flexible for defining and executing data processing work on a distributed cluster. It's got some similarities to PrestoDB, Apache Drill and/or Apache Storm (although not quite the same).

It also has some nice data mining libraries, a library for handling streaming data, some connectivity to external data sources and a library for accessing data stored in its generic "data frames" via SQL. "Data frames" are just an abstraction for a dataset, but they are distributed, and in-memory and/or persistent.

Personally, I like to think of it as an engine for data analysis/processing and queries, but different in that it is not really a "database" as you would traditionally consider one. It's almost as if you took the SQL data processing engine out of your database and made it really flexible.

Edit: Also, all the functionality of Apache Spark is programmatically accessible in Java, Scala and Python, or through SQL with their Hive/thrift interface.
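To make the "build a graph first, run it later" idea a bit more concrete, here is a toy sketch in plain Python. It is not Spark's API or implementation; the `Node` class and its methods are hypothetical names, and it only models a linear chain rather than a full DAG. The point is just that transformations record work while an action triggers it.

```python
class Node:
    """One step in a toy lineage chain: a source or a transformation of a parent."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent   # upstream node, or None for a source
        self.fn = fn           # function applied to the parent's items
        self.data = data       # raw items, only set on a source node

    # Transformations only record a new node; nothing is computed yet.
    def map(self, f):
        return Node(parent=self, fn=lambda items: (f(x) for x in items))

    def filter(self, pred):
        return Node(parent=self, fn=lambda items: (x for x in items if pred(x)))

    # An action walks back to the source and actually runs the recorded steps.
    def collect(self):
        if self.parent is None:
            return list(self.data)
        return list(self.fn(self.parent.collect()))


source = Node(data=[1, 2, 3, 4])
pipeline = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = pipeline.collect()   # work happens only here
```

In real Spark the recorded graph is also what lets the engine split work across a cluster and recompute lost pieces, which this toy does not attempt.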

There's an About page on that same site, available in the top navigation:


The release notes: https://spark.apache.org/releases/spark-release-1-4-0.html

Another major change is that it supports Python 3 now. https://issues.apache.org/jira/browse/SPARK-4897

They've integrated Tungsten / native sorting into shuffle and observed some decent speedups:

* https://issues.apache.org/jira/browse/SPARK-7081

* https://github.com/apache/spark/pull/5868#issuecomment-10183...

However, I guess reduceByKey (and friends) don't benefit yet.

Their SGD implementation still uses TreeAggregate ( https://github.com/apache/spark/blob/e3e9c70384028cc0c322cce... ) so I wonder when they're planning to add some of the "Parameter Server" stuff (e.g. perhaps butterfly mixing or Kylix http://www.cs.berkeley.edu/~jfc/papers/14/Kylix.pdf )

I'm excited about SparkR, even though R is shunned in the field of big data. Between that and dplyr (which inspired the SparkR syntax) for data manipulation and sanitation, it should be much easier to write sane, reproducible code and visualizations for big data analysis. (the Python/Scala tutorials for Spark gave me a headache)

SparkR appears to have strong integration with RStudio, which is big news: http://blog.rstudio.org/2015/05/28/sparkr-preview-by-vincent...

R is absolutely not shunned in big data. It is very popular.

There is a reason Microsoft acquired Revolution Analytics.

R on Spark is great, but the biggest issue in my view is R's runtime licensing. Isn't it GPL? Am I worried for nothing?

I too have been mystified by R's licensing. I actually don't see how anyone can ship a commercial product using R in its current form. At the very least you're in a legal gray area; at worst you are involuntarily open sourcing your product. Not that there's anything wrong with open sourcing a product, but I think there's an enormous potential issue that could foul up a lot of people down the track. The best discussion I have seen about this pretty much ends up in uninformed speculation. For now, my policy is "explore and prototype in R, build the real system in something else". Fortunately, the flaws and limitations of R as a language make this a sensible choice for a host of other reasons as well.

Not sure that R is _shunned_ in big data, as much as there are better solutions once you get to a certain level of big.

It will be interesting to see how all the R libraries play with Spark. There are bound to be some hiccups there.

My interpretation is that it will convert DataFrames to normal data.frames when necessary. Unfortunately, this removes the performance efficiency of Spark.

Since SparkR currently only supports aggregation, its usability is slightly limited. Future versions will apparently have MLlib support, which should alleviate that.

Does anyone know if there's a guide to integrating Spark between a realtime write only database and a historical database?

I've looked into using Spark Streaming, but I can't work out how you could seamlessly transition data from a streaming batch to the historical db in a reasonably tight time period.

I'd be willing to pay for training if it came to it, but I don't think I'm using the right search terms.

Check out MemSQL's Community Edition for this very use case. We shipped MemSQL with an open-source multi-threaded, bi-directional connector to Spark.

The DB has two storage engines: in-memory row tables, and on-disk column tables for efficient compression and permanent retention. Then, it becomes an easy task of INSERT/SELECT...FROM to move data from memory to disk very quickly.

Do this: http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-ti...

Then at the end instead of writing to HBase you can write JDBC to do the insert.

JDBC works great if you have a large RDD that you want to persist in one go.

But if you are writing row by row you will need to implement your own batching algorithm and connection pooling to get any decent performance.

You could always use mapPartitions to open one connection per partition.
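For what it's worth, a minimal sketch of the one-connection-per-partition idea with batched inserts. It uses Python's built-in sqlite3 as a stand-in for a JDBC connection; the `events` table, `DB_PATH` and `BATCH_SIZE` are assumptions for illustration and do not come from the linked article.

```python
import sqlite3
from itertools import islice

DB_PATH = "/tmp/events_sketch.db"   # hypothetical target database
BATCH_SIZE = 500                    # hypothetical batch size


def write_partition(rows, db_path=DB_PATH, batch_size=BATCH_SIZE):
    """Write one partition's rows with a single connection and batched inserts.

    `rows` is an iterator of (key, value) tuples, which is the shape of input
    Spark passes to a function given to foreachPartition or mapPartitions.
    Returns the number of rows written.
    """
    conn = sqlite3.connect(db_path)   # opened once per partition, not per row
    written = 0
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS events (key TEXT, value INTEGER)")
        rows = iter(rows)
        while True:
            batch = list(islice(rows, batch_size))   # at most batch_size rows
            if not batch:
                break
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
            conn.commit()                            # one commit per batch
            written += len(batch)
    finally:
        conn.close()
    return written
```

With a real RDD of tuples this would be wired in as `rdd.foreachPartition(write_partition)`. A single SQLite file is only a placeholder here; a real target would be a database server that handles concurrent writers from many partitions.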

Nice link. Spark jobs run Java code anyway, so why not write to JDBC with the results of processing. Makes perfect sense.

May I ask, why do you want to integrate Spark in the middle of the two? I am seeing Spark used more for distributed processing/caching data rather than being a conduit for data movement from one system to another.

You have a realtime write only database and you want to update a historical database from that write only database? Or do you just want to join data across the two sources on the fly? Those are two pretty different use cases.

Based on what you're asking, you might find these two articles interesting:

- http://blog.confluent.io/2015/03/04/turning-the-database-ins...

- http://lambda-architecture.net/

Well, maybe I'm totally off on this, but it's more that I'd like to be able to run analytics which include real-time data without having any notable pauses. I'm willing to look at anything in terms of getting the data from the real time capture into the historical database as long as the spark queries "just work".

Sorry, I think maybe "integrating between" was the wrong way to phrase it.

On the other hand, I mean there's clean up and preprocessing I want to do on data that goes into the historical dataset, so hey, why not do that clean up/processing with Spark?

I've seen Lambda Architecture before, but it seems like it's kinda gone dark and unless I just totally overlooked it, I don't think there was a "Hey, this is the way to do it guys!"

Not sure if you have used it but Spark is exceptionally good at data movement.

In fact that is what a lot of people initially started using it for (as a replacement for Hive/Pig). You can write SQL against HCatalog tables, do some transformation work then write the results out to a different system. We have hundreds of jobs that do just this.

Well, I guess that is the power of it being so general purpose. I have used Spark more for analytics (and Spark SQL) but not extensively for ETL. What you're saying makes sense, you're still using Spark as an execution/computation engine, just writing the plumbing code to use it like an intermediary ETL tool.

While Spark is not intended for ETL per se, when I need to copy data from S3 to HDFS, I just use sc.textFile and saveAsTextFile on the resulting RDD; in most of my use cases it does it pretty fast.
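Roughly, that copy looks like the sketch below. It assumes an existing SparkContext is passed in as `sc`; the helper name `copy_text` and both URIs in the example call are hypothetical.

```python
def copy_text(sc, src, dst):
    """Copy line-oriented text from src to dst using an existing SparkContext.

    textFile is a SparkContext method; saveAsTextFile is a method on the
    RDD it returns.
    """
    lines = sc.textFile(src)     # lazily describes the source as an RDD of lines
    lines.saveAsTextFile(dst)    # action: writes the lines out as part files


# Example call with a live SparkContext `sc` (hypothetical bucket and path):
# copy_text(sc, "s3n://my-bucket/logs/", "hdfs:///data/logs/")
```

Note this only suits line-oriented text, and the output is a directory of part files rather than a byte-for-byte copy of the original files.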

But Spark is mostly a computation engine replacing MapReduce (plus a standalone cluster management option), not an ETL tool.

I would look into other tools, such as https://projects.apache.org/projects/sqoop.html but I'm sure you know it already.

Sqoop does the extraction but not the transformation part of ETL, and it is only used for bulk moves, not iterative ones.

I know it's cool to bash MongoDB, but it is really nicely integrated with Spark, extremely quick for write workloads (3000 writes/second on our slow drives, and that's inside the RDD map) and doesn't flinch even when asked to write 1 billion rows in quick succession. One thing that is really nice is that, being schemaless, you don't have to worry about setting up table structures beforehand.

You can look at Cassandra which is historically known for exceptional write performance.

All the Spark links & hype I see is Cassandra Cassandra Cassandra, can you point me to a link or place that talks about hooking Spark to MongoDB?

You can use Spark's JDBC RDD with Apache Phoenix; it's pretty fast: https://phoenix.apache.org/performance.html

You can even go one better and use the Spark integration directly:


Somehow I always get confused between Spark and Storm! Can someone explain the difference between the two (use cases etc.) as if explaining to a five-year-old? Thanks!

Storm = Streaming data processing, written in Clojure and previously used at Twitter until they replaced it with Heron.

Spark = Streaming (technically micro-batch) and batch data processing, written in Scala and used very widely.
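A plain-Python sketch of the per-record versus micro-batch distinction, with no Storm or Spark involved. All function names are hypothetical, and batches are cut by record count to keep it short, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size records from an iterable."""
    stream = iter(stream)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


def process_per_record(stream, handle):
    """Per-record style: handle each record individually as it arrives."""
    return [handle(record) for record in stream]


def process_micro_batch(stream, handle_batch, batch_size):
    """Micro-batch style: run a small batch job over each chunk of the stream."""
    return [handle_batch(batch) for batch in micro_batches(stream, batch_size)]


per_record = process_per_record(range(5), lambda x: x * 2)
per_batch = process_micro_batch(range(5), sum, batch_size=2)
```

The trade-off this hints at: per-record handling can react sooner, while micro-batches let the same batch-oriented code run on streaming input at the cost of waiting for each batch to fill.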

Is support for User Defined Aggregation Functions (regarding DataFrames) slated for 1.5?

Too bad the website is so hard to read.

Time for that site to join contrastrebellion.com
