
Does anyone have a good comparison of Flink and Spark, especially from a use case perspective?

Most of the ones I have found are light on actual contrast detail.




The benchmarking study done by Yahoo is fairly comprehensive and quantitatively assesses the different streaming frameworks (including Flink and Spark Streaming): https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...

We also did a podcast about it if you're interested in digging deeper -- http://softwareengineeringdaily.com/2016/02/03/benchmarking-...


Thanks for sharing that. I was recently looking at a different comparison of the _batch_ capabilities of Flink and SS [1], which found that Flink was faster at TeraSort. I'm curious why SS appears to get higher throughput than Flink in the streaming case but lower in the batch case.

[1] http://www.slideshare.net/ssuser6bb12d/a-comparative-perform...


I don't think SS gets higher streaming throughput than Flink. That was an assumption written in the Yahoo! streaming benchmark without an actual experiment.


Flink comes from a more database-oriented background. It grew out of a research project at TU Berlin [1].

I believe Flink tried to focus on its query language and optimization, as you probably would in a database setting. In contrast, Spark is sometimes described as a batch processing system that provides a near-real-time experience by chopping the stream into small micro-batches.

[1] http://stratosphere.eu/
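To make the contrast concrete, here's a minimal sketch of the same trivial pipeline in both APIs, as two separate programs (localhost:9999 is a placeholder source, and the class names are mine). Spark Streaming chops the stream into timed micro-batches, while Flink pushes records through the operators continuously:

    // Spark Streaming: the socket stream is discretized into 1-second
    // micro-batches, each executed as a small batch job.
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkMicroBatch {
      public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));
        ssc.socketTextStream("localhost", 9999)
           .filter(line -> !line.isEmpty())
           .print();
        ssc.start();
        ssc.awaitTermination();
      }
    }

    // Flink: records flow through the same logical pipeline one at a time;
    // there is no batch interval to pick.
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkContinuous {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.socketTextStream("localhost", 9999)
           .filter(line -> !line.isEmpty())
           .print();
        env.execute("continuous");
      }
    }

The user-facing code is nearly identical; the difference is the execution model underneath (and, in Spark's case, having to pick a batch interval).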


That was the research project that many contributors came from.

Most of the streaming tech and developments in Flink are very disconnected from that and have little to do with database tech any more, actually.


The differences are mainly around the batch-centric vs. streaming-centric models and execution.

Here, for example, is a video walkthrough by MapR: https://www.mapr.com/blog/apache-spark-vs-apache-flink-white...


Another major difference is that Flink's scheduler doesn't have a notion of data locality the way Spark's does. If you want to use data local to the node, you have to query whatever you're storing your data in (e.g. HDFS) and filter out the items that aren't on the node.
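For what it's worth, that workaround can at least be done with Hadoop's standard FileSystem API. A minimal sketch (the /data/events path is hypothetical) that finds which byte ranges of a file have a replica on the current node:

    import java.net.InetAddress;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalBlocks {
      public static void main(String[] args) throws Exception {
        String localHost = InetAddress.getLocalHost().getHostName();
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/events"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          // Keep only the byte ranges with a replica on this node.
          if (Arrays.asList(block.getHosts()).contains(localHost)) {
            System.out.printf("local range: offset=%d length=%d%n",
                block.getOffset(), block.getLength());
          }
        }
      }
    }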


That works a bit differently in Flink and Spark.

Inside a dataflow program, the scheduler tries to schedule tasks as locally as possible.

For the inputs to a streaming program (for example Kafka partitions), there is currently no locality consideration, but the locality there changes throughout the program's lifetime anyway (brokers change leadership and rebalance).

Flink's DataSet API assigns data to tasks after scheduling, and that assignment does respect locality. The lazy assignment also makes it possible to handle large numbers of small files, for example.
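To sketch the idea (illustrative toy code, not Flink's actual classes): tasks are deployed first, and each running task then pulls input splits from a central assigner that hands out a split with a local replica when one is available:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LazySplitAssigner {
      static class Split {
        final String path;
        final List<String> hosts; // nodes holding a replica of this split
        Split(String path, List<String> hosts) { this.path = path; this.hosts = hosts; }
      }

      private final Map<String, Deque<Split>> byHost = new HashMap<>();
      private final Deque<Split> remaining = new ArrayDeque<>();

      LazySplitAssigner(List<Split> splits) {
        for (Split s : splits) {
          remaining.add(s);
          for (String h : s.hosts) {
            byHost.computeIfAbsent(h, k -> new ArrayDeque<>()).add(s);
          }
        }
      }

      // Called by a task that is already running on `host`.
      synchronized Split nextSplit(String host) {
        Deque<Split> local = byHost.getOrDefault(host, new ArrayDeque<>());
        while (!local.isEmpty()) {
          Split s = local.poll();
          if (remaining.remove(s)) return s; // local split not yet handed out
        }
        return remaining.poll(); // fall back to any remaining (remote) split
      }
    }

Because splits are handed out on demand rather than pre-assigned, a job with millions of tiny files doesn't need a per-file scheduling decision up front.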


Hmm, I didn't know that about the DataSet API. When I looked at the scheduler's code it didn't seem to have any notion of data locality except for colocation of vertices. I'll take a look at the DataSet API's code though, thanks!


If you're interested in Apache Beam (Dataflow), then Flink seems to me the best candidate to become an open source runner. Spark 2.0 may change things, though.
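For reference, switching runners in Beam is a one-line change on the pipeline options. A minimal sketch with the Java SDK (the file paths are placeholders; the runner class comes from the beam-runners-flink artifact):

    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class BeamOnFlink {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        options.setRunner(FlinkRunner.class); // swap runners without touching the pipeline
        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.read().from("/tmp/input.txt"))
         .apply(TextIO.write().to("/tmp/output"));
        p.run().waitUntilFinish();
      }
    }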



