Samza and Storm mostly focus on streaming, while Spark and MapReduce traditionally deal with batch. Spark leverages its core competency in batch processing and treats streams as sequences of mini-batches, effectively treating everything as batch.
And I imagine in the following snippet, the author is referring to Apache Flink, among other projects:
> One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.
My understanding is that Structured Streaming also treats everything like batch, but it can recognize that the code is being applied to a stream and do some optimizations for low-latency processing. Is this what's going on?
The longer answer is that this is about how to logically think about the semantics of the computation using a declarative API; the actual physical execution (e.g. incrementalization, record-at-a-time processing, batching) is then handled by the optimizer.
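To make that concrete, here's a minimal PySpark sketch (the path, schema, and column names are invented for illustration): the same declarative query is defined once and run over a static DataFrame and over a stream, and Spark decides how to physically execute each one.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: read a static directory of JSON events (path is made up).
batch_df = spark.read.json("s3://my-bucket/events/")

# Streaming: the same source read as an unbounded table.
stream_df = spark.readStream.schema(batch_df.schema).json("s3://my-bucket/events/")

# One query definition works for both; only the source/sink differ.
def counts_by_type(df):
    return df.groupBy("event_type").agg(F.count("*").alias("n"))

# Batch execution: runs once over the static data.
counts_by_type(batch_df).show()

# Streaming execution: the optimizer incrementalizes the same query.
(counts_by_type(stream_df)
    .writeStream
    .outputMode("complete")
    .format("console")
    .start())
```

The point is that nothing in `counts_by_type` knows whether it is being run on a bounded or unbounded input; that decision lives entirely in the execution layer.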
The key feature for me is Machine Learning functions in R, which otherwise lacks parallelizable and scalable options. (Without resorting to black magic, anyways)
This is just one advantage of the more traditional models that Spark includes out of the box. Just because neural nets are awesome doesn't mean there aren't big advantages to the older stuff. Also, often it's not the algorithm that matters, it's how you prepare the data and how much data you have in hand. After all is said and done, in many cases the fancy model has similar performance to the boring model.
Very happy to see Spark developing yet more abilities to make ML on large datasets seamless and (relatively) painless.
FWIW, my interest here is what becomes possible for sub-second computations. We do interactive analytics, and GPUs become exciting for that, including in the distributed case :) Similar story for stuff like VR, real-time ML, etc.
Also this is attractive to me because recursion.
I assume results such as this are due to various optimizations under the Tungsten rubric (code generation, manual memory management), which rely on the sun.misc.Unsafe API.
We will be writing a deep-dive blog post in the next week or two to talk more about this idea.
Excellent, I'm glad that the "big data" world is starting to look at database literature in terms of how it does execution, as there is much to be learned.
Most of these systems are extremely inefficient (looking at you Hadoop), when they don't really have to be. Efficient code generation should be table stakes for any serious processing framework IMO.
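If you want to see the code generation at work, a quick way is to look at the physical plan. A hypothetical sketch (output format varies by Spark version; operators fused by whole-stage codegen show up as WholeStageCodegen nodes or starred operators):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codegen-peek").getOrCreate()

# A trivial aggregation that Tungsten can compile into a single generated function.
df = (spark.range(0, 1000 * 1000)
      .selectExpr("id", "id % 10 AS bucket")
      .groupBy("bucket")
      .count())

# Prints the logical and physical plans; the physical plan shows which
# operators were fused and compiled down to JVM bytecode.
df.explain(True)
```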
In practice, Spark seems to perform reasonably well on smaller in-memory datasets and on some larger benchmarks under the control of Databricks. My experience has been pretty rough for legitimately large datasets (can't fit in RAM across a cluster) -- mysterious failures abound (often related to serialization, fat in-memory representations, and the JVM heap).
The project has been slowly moving toward an improved architecture for working with larger datasets (see Tungsten and DataFrames), so hopefully this new release will actually deliver on the promise of Spark's simple API.
* distributed machine learning tasks using their built-in algorithms (although note that some of them, e.g. LDA, just fall over with not-even-that-big datasets)
* as a general fabric for doing parallel processing, like crunching terabytes of JSON logs into Parquet files, or doing random transformations of the Common Crawl
As a developer, it's really convenient to spin up ~200 cores on AWS spot instances for ~$2/hr and get fast feedback as I iterate on an idea.
So, real-world use cases? Any MR use case should be doable in Spark. There are plenty of companies using Spark to create analytics from streams, and some are using it for its ML capabilities (sentiment analysis, recommendation engines, linear models, etc.).
I apologize if my comment isn't as specific as you're looking for, but I know of people who use it for exactly the scenarios I've outlined above. We are probably going to use it as well, but I don't have a use case to share just yet (at least nothing concrete at the moment). Hopefully this gives you some idea of where Spark fits.
Netflix has users (say 100M) who have been liking some movies (say 100k). The question is: for every user, find movies they would like but have not seen yet.
The dataset in question is large, and you have to answer this question with data regarding every user-movie pair (that would be 1e13 pairs). A problem of this size needs to be distributed across a cluster.
Spark lets you express computations across this cluster, letting you explore the problem. Spark also provides you with a quite rich Machine Learning toolset, among which is ALS-WR, which was developed specifically for a competition organised by Netflix and got great results.
The code is very straightforward and it is fast.
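For anyone curious what that looks like, here's a hedged PySpark sketch using the ALS implementation in spark.ml (the input path and column names are invented, and `recommendForAllUsers` needs a reasonably recent Spark):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("movie-recs").getOrCreate()

# Expect a ratings table with (userId, movieId, rating) columns.
ratings = spark.read.parquet("s3://my-bucket/ratings/")

als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    rank=50,                     # size of the latent factor vectors
    regParam=0.1,                # the "weighted regularization" knob
    coldStartStrategy="drop",
)
model = als.fit(ratings)

# Top 10 recommended movies for every user.
recommendations = model.recommendForAllUsers(10)
recommendations.show(truncate=False)
```

The training and the recommendation step are both distributed across the cluster, which is what makes the 1e13-pair scale tractable.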
I had hundreds of gigabytes of JSON logs with many variations in the schema and a lot of noise that had to be cleaned. There were also some joins and filtering that had to be done between each datapoint and an external dataset.
The data does not fit in memory, so you would need to write some special-purpose code to parse it, clean it, and do the join, all without making your app crash.
Spark makes this straightforward (especially with its DataFrame API): you just point to the folder where your files are (or an AWS/HDFS/... URI) and write a couple of lines to define the chain of operations you want to do and save the result in a file or just display it. Spark will then run these operations in parallel by splitting the data, processing it and then joining it back (simplifying).
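Something along these lines (paths, columns, and the lookup table are made up for illustration): read the messy JSON, clean and filter it, join against the external dataset, and write Parquet, all through the DataFrame API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-cleanup").getOrCreate()

# Spark infers a unified schema across the schema variations in the files.
logs = spark.read.json("s3://my-bucket/raw-logs/*.json.gz")

# External dataset to join against, e.g. a user lookup table.
users = spark.read.parquet("s3://my-bucket/users/")

cleaned = (logs
    .filter(F.col("user_id").isNotNull())           # drop noisy records
    .withColumn("ts", F.to_timestamp("timestamp"))  # normalize types
    .join(users, on="user_id", how="left"))

# The read, the transformations, and the write are all run in parallel.
cleaned.write.mode("overwrite").parquet("s3://my-bucket/clean-logs/")
```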