
Does anyone have a good comparison of Flink and Spark, especially from a use case perspective?

Most of the ones I have found are light on actual contrast detail.




The benchmarking study done by Yahoo is fairly comprehensive and quantitatively assesses the different streaming frameworks (including Flink and Spark Streaming): https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...

We also did a podcast about it if you're interested in digging deeper -- http://softwareengineeringdaily.com/2016/02/03/benchmarking-...


Thanks for sharing that. I was recently looking at a different comparison of the _batch_ capabilities of Flink and SS [1], which found that Flink was faster at TeraSort. I'm curious why SS appears to get higher throughput than Flink in the streaming case but lower in the batch case.

[1] http://www.slideshare.net/ssuser6bb12d/a-comparative-perform...


I don't think SS gets higher streaming throughput than Flink. That was an assumption written in the Yahoo! streaming benchmark without an actual experiment.


Flink comes from a more database-oriented background. It grew out of a research project at TU Berlin [1].

I believe Flink tried to focus on its query language and optimization, as you probably would in a database setting. In contrast, Spark is sometimes described as a batch processing system that provides a near-real-time experience by chopping the stream into small micro-batches.

[1] http://stratosphere.eu/
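To make the contrast concrete, here's a minimal sketch of the same trivial pipeline in both APIs, as two separate programs (localhost:9999 is a placeholder source, and the class names are mine). Spark Streaming chops the stream into timed micro-batches, while Flink pushes records through the operators continuously:

    // Spark Streaming: the socket stream is discretized into 1-second
    // micro-batches, each executed as a small batch job.
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkMicroBatch {
      public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));
        ssc.socketTextStream("localhost", 9999)
           .filter(line -> !line.isEmpty())
           .print();
        ssc.start();
        ssc.awaitTermination();
      }
    }

    // Flink: records flow through the same logical pipeline one at a time;
    // there is no batch interval to pick.
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkContinuous {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.socketTextStream("localhost", 9999)
           .filter(line -> !line.isEmpty())
           .print();
        env.execute("continuous");
      }
    }

The user-facing code is nearly identical; the difference is the execution model underneath (and, in Spark's case, having to pick a batch interval).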


That was the research project that many contributors came from.

Most of the streaming tech and developments in Flink are very disconnected from that and have little to do with database tech any more, actually.


The differences are mainly around the batch-centric vs. streaming-centric models and execution.

Here, for example, is a video walkthrough by MapR: https://www.mapr.com/blog/apache-spark-vs-apache-flink-white...


Another major difference is that Flink's scheduler doesn't have a notion of data locality the way Spark's does. If you want to use data local to the node, you have to query whatever you're storing your data in (e.g. HDFS) and filter out the items that aren't on the node.
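For what it's worth, that workaround can at least be done with Hadoop's standard FileSystem API. A minimal sketch (the /data/events path is hypothetical) that finds which byte ranges of a file have a replica on the current node:

    import java.net.InetAddress;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalBlocks {
      public static void main(String[] args) throws Exception {
        String localHost = InetAddress.getLocalHost().getHostName();
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/events"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          // Keep only the byte ranges with a replica on this node.
          if (Arrays.asList(block.getHosts()).contains(localHost)) {
            System.out.printf("local range: offset=%d length=%d%n",
                block.getOffset(), block.getLength());
          }
        }
      }
    }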


That works a bit differently in Flink and Spark.

Inside a dataflow program, the scheduler tries to schedule tasks as locally as possible.

For the inputs to a streaming program (for example Kafka partitions), there is currently no locality consideration, but the locality there changes throughout the program's lifetime anyway (brokers change leadership and rebalance).

Flink's DataSet API assigns data to tasks after scheduling, and that assignment does respect locality. The lazy assignment also makes it possible to handle large numbers of small files, for example.
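To sketch the idea (illustrative toy code, not Flink's actual classes): tasks are deployed first, and each running task then pulls input splits from a central assigner that hands out a split with a local replica when one is available:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LazySplitAssigner {
      static class Split {
        final String path;
        final List<String> hosts; // nodes holding a replica of this split
        Split(String path, List<String> hosts) { this.path = path; this.hosts = hosts; }
      }

      private final Map<String, Deque<Split>> byHost = new HashMap<>();
      private final Deque<Split> remaining = new ArrayDeque<>();

      LazySplitAssigner(List<Split> splits) {
        for (Split s : splits) {
          remaining.add(s);
          for (String h : s.hosts) {
            byHost.computeIfAbsent(h, k -> new ArrayDeque<>()).add(s);
          }
        }
      }

      // Called by a task that is already running on `host`.
      synchronized Split nextSplit(String host) {
        Deque<Split> local = byHost.getOrDefault(host, new ArrayDeque<>());
        while (!local.isEmpty()) {
          Split s = local.poll();
          if (remaining.remove(s)) return s; // local split not yet handed out
        }
        return remaining.poll(); // fall back to any remaining (remote) split
      }
    }

Because splits are handed out on demand rather than pre-assigned, a job with millions of tiny files doesn't need a per-file scheduling decision up front.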


Hmm, I didn't know that about the DataSet API. When I looked at the scheduler's code it didn't seem to have any notion of data locality except for colocation of vertices. I'll take a look at the DataSet API's code though, thanks!


If you're interested in Apache Beam (Dataflow), then Flink seems to me the best candidate to become an open source runner. Spark 2.0 may change things, though.
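For reference, switching runners in Beam is a one-line change on the pipeline options. A minimal sketch with the Java SDK (the file paths are placeholders; the runner class comes from the beam-runners-flink artifact):

    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class BeamOnFlink {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        options.setRunner(FlinkRunner.class); // swap runners without touching the pipeline
        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.read().from("/tmp/input.txt"))
         .apply(TextIO.write().to("/tmp/output"));
        p.run().waitUntilFinish();
      }
    }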



