Thanks for sharing that. I was recently looking at a different comparison of the _batch_ capabilities of Flink and SS [1], which found that Flink was faster than SS at terasort. I'm curious why SS appears to get higher throughput than Flink in the streaming case but lower throughput in the batch case.
I don't think SS gets higher streaming throughput than Flink. That was an assumption written in the Yahoo! streaming benchmark without an actual experiment.
Flink comes from a more database-oriented background. It grew out of a research project at TU Berlin [1].
I believe Flink tried to focus on query language and optimization, as you probably would in a database setting. In contrast, Spark is sometimes described as a batch processing system that provides a real-time experience by intelligently partitioning the work.
Another major difference is that Flink's scheduler doesn't have a notion of data locality like Spark's. If you want to use data local to the node, you have to query whatever you're storing your stuff in (HDFS) and filter out the items that aren't on the node.
Inside a data flow program, the scheduler tries to schedule as local as possible.
For the inputs to a streaming program (for example Kafka partitions), there is currently no locality consideration, but the locality there changes throughout a program's lifetime anyway (brokers change leadership and rebalance).
Flink's DataSet API does assign data to tasks after scheduling. That assignment actually respects locality. Because it is lazy, it also makes it possible to handle large numbers of small files, for example.
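To make the idea of lazy, locality-aware assignment concrete, here is a toy Python sketch. It is not Flink's code: the class name `LazySplitAssigner`, the method `next_split`, and the host and split names are all made up for illustration. It only assumes that each input split knows which hosts store its data, and that an already-scheduled task asks for its next split at runtime.

```python
from collections import defaultdict, deque


class LazySplitAssigner:
    """Toy model (hypothetical, not Flink's implementation): hands out input
    splits on request, preferring splits stored on the requesting host."""

    def __init__(self, splits):
        # splits: iterable of (split_id, hosts); hosts lists where the data lives.
        self._unassigned = {}
        self._by_host = defaultdict(deque)
        for split_id, hosts in splits:
            self._unassigned[split_id] = tuple(hosts)
            for host in hosts:
                self._by_host[host].append(split_id)

    def next_split(self, host):
        """Return (split_id, is_local) for a task on `host`, or None when done."""
        local = self._by_host[host]
        while local:
            split_id = local.popleft()
            # Skip splits that another task already took via a different replica.
            if split_id in self._unassigned:
                del self._unassigned[split_id]
                return split_id, True
        if self._unassigned:
            # No local data left: fall back to any remaining (remote) split.
            split_id = next(iter(self._unassigned))
            del self._unassigned[split_id]
            return split_id, False
        return None


if __name__ == "__main__":
    assigner = LazySplitAssigner([
        ("part-0", ["node-a", "node-b"]),
        ("part-1", ["node-b"]),
        ("part-2", ["node-c"]),
    ])
    for host in ["node-b", "node-a", "node-c", "node-a"]:
        print(host, assigner.next_split(host))
```

Because tasks pull splits one at a time instead of receiving a fixed list up front, a fast task simply asks again sooner, which is one reason this style copes well with many small files.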
Hmm, I didn't know that about the DataSet API. When I looked at the scheduler's code it didn't seem to have any notion of data locality except for colocation of vertices. I'll take a look at the DataSet API's code though, thanks!
If you're interested in Apache Beam (dataflow), then Flink seems to me the best candidate to become an open-source runner. Spark 2.0 may change things though.
Most I have found are light on actual detail about the contrasts.