(If you're wondering what Spark is, in a very unofficial nutshell: it is a computation / big data / analytics / machine learning / graph processing engine that runs on top of Hadoop, usually performs much better, and has arguably a much easier API in Python, Scala, Java and now R)
It has more than 5000 students so far, and the professor seems to answer every single question on Piazza (a popular student/teacher message board).
So far it looks really good. (It started a week ago, so you can still catch up; the 2nd lab isn't due until Friday 6/12 EOD, and you get a 3-day "grace" period... so there is not too much to catch up on.)
I use Spark for work (Scala API) and still learned one or two new things.
It uses the PySpark API, so there is no need to learn Scala. All homework labs are done in an iPython notebook. Very high quality so far, IMHO.
It is followed by a more advanced Spark course (Scalable Machine Learning), also by Berkeley & Databricks.
(not affiliated with edX, Berkeley or Databricks, just thought it's a good place for a PSA to those interested)
The academic paper that originated Spark, by Matei Zaharia (creator of Spark), earned him an ACM PhD dissertation award in 2014 (http://www.acm.org/press-room/news-releases/2015/dissertatio...)
Spark also set a new record in large-scale sorting (beating Hadoop by far): https://databricks.com/blog/2014/11/05/spark-officially-sets...
* EDIT: typo in "Berkeley", thanks gboss for noticing :)
Are you sure you can't download them? Or maybe they've changed that recently.
Is it really more advanced regarding Spark? The requirements state explicitly that no prior Spark knowledge is required.
You can safely remove this part; Hadoop is not required.
Spark does micro-batch processing, whereas Hadoop traditionally does batch processing. Hadoop with YARN is different now, and even with old Hadoop, if you can fit the data into memory it can supposedly be just as fast, according to a meetup I've attended.
There's also Apache Flink by data Artisans.
It also has some nice data mining libraries, a library for handling streaming data, some connectivity to external data sources and a library for accessing data stored in its generic "data frames" via SQL. "Data frames" are just an abstraction for a dataset, but they are distributed, and in-memory and/or persistent.
Personally, I like to think of it as an engine for data analysis/processing and queries, but different in that it is not really a "database" like you would traditionally consider one. It's almost as if you took the SQL data processing engine out of your database and made it really flexible.
Edit: Also, all the functionality of Apache Spark is programmatically accessible in Java, Scala and Python, or through SQL via their Hive/Thrift interface.
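As a rough illustration of that DataFrame/SQL access, here is a minimal PySpark sketch (the file path and column names are made up for the example):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-sql-demo")
    sqlContext = SQLContext(sc)

    # Load a JSON file into a distributed DataFrame (the path is hypothetical)
    df = sqlContext.read.json("hdfs:///data/events.json")

    # The same dataset, queried two ways: through the DataFrame API...
    df.groupBy("country").count().show()

    # ...or through plain SQL against a registered temporary table
    df.registerTempTable("events")
    sqlContext.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()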
Another major change is that it supports Python 3 now.
However, I guess reduceByKey (and friends) don't benefit yet.
Their SGD implementation still uses TreeAggregate ( https://github.com/apache/spark/blob/e3e9c70384028cc0c322cce... ) so I wonder when they're planning to add some of the "Parameter Server" stuff (e.g. perhaps butterfly mixing or Kylix http://www.cs.berkeley.edu/~jfc/papers/14/Kylix.pdf )
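(For context, treeAggregate is basically a multi-level aggregate that merges partial results through intermediate executors instead of pulling everything back to the driver at once. A rough PySpark sketch of the idea for summing gradients; the toy data and gradient function are made up, this is not MLlib's actual Scala code:)

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="tree-aggregate-demo")

    # Toy dataset of (features, label) pairs and a plain numpy weight vector
    data = sc.parallelize([(np.random.rand(3), np.random.randint(2)) for _ in range(1000)])
    w = np.zeros(3)

    def gradient(point, w):
        # Hypothetical per-example gradient (squared loss, purely for illustration)
        x, y = point
        return (x.dot(w) - y) * x

    # treeAggregate sums partial gradients through a 2-level tree of combiners,
    # so the driver only has to merge a handful of partial results per iteration.
    grad_sum = data.treeAggregate(
        np.zeros(3),
        lambda acc, p: acc + gradient(p, w),  # seqOp: fold one point into a partition sum
        lambda a, b: a + b,                   # combOp: merge partition sums
        depth=2)
    w -= 0.01 * grad_sum / data.count()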
SparkR appears to have strong integration into Rstudio, which is big news: http://blog.rstudio.org/2015/05/28/sparkr-preview-by-vincent...
There is a reason Microsoft acquired Revolution Analytics.
Since SparkR currently only supports aggregation, that limits its usability slightly. Future versions will apparently have MLlib support, which should alleviate that.
I've looked into using Spark Streaming, but I can't work out how you could seamlessly transition data from a streaming batch to the historical db in a reasonably tight time period.
I'd be willing to pay for training if it came to it, but I don't think I'm using the right search terms.
The DB has two storage engines: in-memory row tables, and on-disk column tables for efficient compression and permanent retention. Then, it becomes an easy task of INSERT/SELECT...FROM to move data from memory to disk very quickly.
Then at the end, instead of writing to HBase, you can use JDBC to do the insert.
But if you are writing row by row you will need to implement your own batching algorithm and connection pooling to get any decent performance.
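For what it's worth, the usual pattern for this in Spark Streaming is foreachRDD plus foreachPartition: open one connection per partition and insert in batches instead of row by row. A hedged sketch; the socket source, psycopg2 driver, connection string and events_history table are all placeholder assumptions, not part of any product mentioned above:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stream-to-history-demo")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)      # placeholder source
    rows = lines.map(lambda l: tuple(l.split(",")[:3]))  # naive parsing for the sketch

    def save_partition(records):
        # One connection per partition and a single batched insert,
        # instead of one connection and one INSERT per row.
        import psycopg2  # assumed driver; any DB-API connector works the same way
        conn = psycopg2.connect("dbname=history user=spark")
        cur = conn.cursor()
        cur.executemany("INSERT INTO events_history VALUES (%s, %s, %s)", list(records))
        conn.commit()
        conn.close()

    rows.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

    ssc.start()
    ssc.awaitTermination()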
You have a realtime, write-only database and you want to update a historical database from that write-only database?
Or do you just want to join data across the two sources on the fly? Those are two pretty different use cases.
Based on what you're asking, you might find these two articles interesting:
Sorry, I think maybe "integrating between" was the wrong way to phrase it.
On the other hand, there's cleanup and preprocessing I want to do on the data that goes into the historical dataset, so hey, why not do that cleanup/processing with Spark?
I've seen the Lambda Architecture before, but it seems like it's kinda gone dark, and unless I just totally overlooked it, I don't think there was ever a "Hey, this is the way to do it, guys!" moment.
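(Something like this minimal sketch is the kind of cleanup I mean; the paths and column names are made up:)

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="cleanup-demo")
    sqlContext = SQLContext(sc)

    raw = sqlContext.read.json("hdfs:///incoming/events/")  # hypothetical landing area

    cleaned = (raw
        .dropna(subset=["user_id", "ts"])  # drop incomplete records
        .dropDuplicates(["event_id"])      # dedupe on a made-up key
        .filter("ts > 0"))                 # throw away obviously bad timestamps

    # Write the cleaned batch out for the historical store (Parquet as one option)
    cleaned.write.parquet("hdfs:///warehouse/events_clean/")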
In fact, that is what a lot of people initially started using it for (as a replacement for Hive/Pig). You can write SQL against HCatalog tables, do some transformation work, then write the results out to a different system. We have hundreds of jobs that do just this.
But Spark is mostly a computation engine replacing MapReduce (plus a standalone cluster management option), not an ETL tool.
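Roughly like this sketch (assuming a Hive metastore; the table names and JDBC URL are placeholders, not our actual jobs):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="etl-style-job")
    hc = HiveContext(sc)  # picks up table definitions from the Hive/HCatalog metastore

    # SQL against an existing (hypothetical) Hive table, plus some transformation work
    daily = hc.sql("""
        SELECT country, to_date(ts) AS day, COUNT(*) AS events
        FROM raw_events
        GROUP BY country, to_date(ts)
    """)

    # Write the result out to a different system, e.g. over JDBC
    # (URL and target table are placeholders; the JDBC driver must be on the classpath)
    daily.write.jdbc("jdbc:postgresql://reports-db/analytics", "daily_event_counts")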
I would look into other tools, such as https://projects.apache.org/projects/sqoop.html but I'm sure you know it already.
You can look at Cassandra, which is historically known for exceptional write performance.
Spark = Streaming (technically micro-batch) and batch data processing, written in Scala and used very widely.
Time for that site to join contrastrebellion.com