
Apache Flink - theBashShell
https://github.com/apache/flink
======
meritt
Apache Flink, Flume, Storm, Samza, Spark, Apex, and Kafka all do basically the
same thing. I feel like this is a bit overboard. And this is before we talk
about the non-Apache stream-processing frameworks out there.

* Apache Flink is an open source stream processing framework

* Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

* Apache Storm is a distributed stream processing computation framework

* Apache Samza is an open-source near-realtime, asynchronous computational framework for stream processing

* Apache Spark is an open-source distributed general-purpose cluster-computing framework.

* Apache Apex is a YARN-native platform that unifies stream and batch processing.

* Apache Kafka is an open-source stream-processing software platform

~~~
mxmxm
Not quite. All of these are open-source projects but they are very different
in many aspects:

* Apache Flink

Sophisticated stream processing framework with a focus on robustness (managed
memory) and correctness (exactly-once semantics).

* Apache Flume

Tailored towards log data.

* Apache Storm

First stream processing framework. Legacy.

* Apache Samza

Only used at LinkedIn. Tied to Hadoop's YARN.

* Apache Spark

Only great in batch processing.

* Apache Apex

Dead project. Tied to Hadoop's YARN.

* Apache Kafka

A distributed message queue with simple stream processing built on top via the
Confluent Platform.

~~~
dankohn1
Nice list. There are also non-Apache projects like NATS, hosted by CNCF.

Here are the 19 streaming and messaging projects and products that CNCF is
tracking: [https://landscape.cncf.io/category=streaming-messaging&forma...](https://landscape.cncf.io/category=streaming-messaging&format=card-mode&grouping=category)

------
lenticular
Controversial opinion here, but all of these distributed streaming
architectures are massively overused. They certainly have their place, but you
probably don't need them. I see it all the time with ML work. You wind up
using a cluster to overcome the memory inefficiency of Spark, when you could
have just used a single machine. For example, I've built huge graph clustering
models on a single machine just by being smart about memory consumption. The
same job on Spark would have needed an enormous and expensive cluster.

~~~
bunderbunder
This has been my experience, too. I worked at just one place that had a
_really_ good handle on high-volume, high-velocity streaming data, and they
didn't use Flink or Storm or Kafka or anything like that. They mostly just
used the KISS principle and a protobuf-style wire format.[1]

There is definitely a point where these sorts of scale-out-centric solutions
are unavoidable. Short of that point, though, they're probably best avoided.

[1]: (It's truly amazing how many CPU cycles you can reclaim just by removing
branch instructions from your message deserialization code.)
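
To illustrate the footnote, here is a minimal sketch, assuming a hypothetical fixed-width record (the layout and the names `RECORD`, `encode`, and `decode_all` are made up for illustration and are not the format described above). Because every field sits at a known offset, the decoder never has to inspect a tag or a length prefix to decide what to read next. Python itself still branches internally, so this only shows the shape of the idea; the CPU savings would show up in a compiled language.

```python
import struct

# Hypothetical layout: little-endian, no padding.
#   Q = 8-byte unsigned timestamp
#   I = 4-byte unsigned sensor id
#   d = 8-byte float value
RECORD = struct.Struct("<QId")


def encode(timestamp, sensor_id, value):
    """Pack one record into exactly RECORD.size bytes."""
    return RECORD.pack(timestamp, sensor_id, value)


def decode_all(buf):
    """Decode a buffer of back-to-back records in fixed strides.

    There is no per-field decision about what comes next: each record
    is RECORD.size bytes and each field is at a fixed offset.
    """
    return list(RECORD.iter_unpack(buf))


if __name__ == "__main__":
    payload = encode(1, 2, 3.5) + encode(4, 5, 6.25)
    for record in decode_all(payload):
        print(record)
```

The trade-off is flexibility: a fixed layout cannot skip optional fields or evolve as easily as a tagged format.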

~~~
mikhailfranco
See McSherry et al., _"Scalability! But at what COST?"_

[https://www.usenix.org/system/files/conference/hotos15/hotos...](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf)

------
meatyapp
What's the easiest way to get started with trying Apache Kafka/Spark/Flink on
the cloud? If I want to try out Redis there's RedisLabs, CloudAMQP for
RabbitMQ, Compose for Postgres/Redis/RabbitMQ, offerings like Google Cloud
SQL/MemoryStore and AWS RDS/ElastiCache, etc. Where do I go for some easy
Apache deployments?

~~~
deepsun
Google Cloud has "one-click" installation of integrated third-party solutions.
I've never tried it, though.

------
CSDude
AWS lets you upload Flink programs to process Kinesis streams, and Google
Cloud also has support for Apache Beam.

~~~
rehevkor5
Yeah, or any other input. I don't think it's tied explicitly to Kinesis. This
is definitely easier than other ways to deploy! EMR also has Flink as an
option.

------
_ZeD_
"Prerequisites for building Flink: Unix-like environment (we use Linux, Mac OS
X, Cygwin) Java 8 (Java 9 and 10 are not yet supported)"

_Sigh_. Is it too much to ask for proper cross-platform support? And when the
hell will they add support for recent versions of Java?

~~~
rsynnott
Pretty much all of these Hadoop-adjacent things are more or less Linux only,
and certainly unix-y-thing-only. What other platform do you want to run it on?

~~~
_ZeD_
Uhh, I don't know... maybe Windows?

Heck, Hadoop itself runs fine on Windows 10
[[http://hadoop.apache.org/docs/current/hadoop-project-dist/ha...](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html)]

------
ckdarby
Does anyone have experience running Flink in production?

If you have an article or slides about using it, please send links.

~~~
wenc
Not sure why the Flink GitHub repo got linked. Here's a list of companies
using it in production, with links to details.

[https://flink.apache.org/poweredby.html](https://flink.apache.org/poweredby.html)

