
Introducing Complex Event Processing (CEP) with Apache Flink - stsffap
http://flink.apache.org/news/2016/04/06/cep-monitoring.html
======
scott_s
For those interesting in CEP in streaming systems, I invite you to read a
paper by my colleague, Martin Hirzel, "Partition and Compose: Parallel Complex
Event Processing":
[http://hirzels.com/martin/papers/debs12-cep.pdf](http://hirzels.com/martin/papers/debs12-cep.pdf)

His work has more of a declarative, regex-inspired take, and is implemented in
SPL, the language for IBM Streams. His prototype was adapted into the product,
and is a part of Streams:
[http://www.ibm.com/support/knowledgecenter/SSCRJU_4.1.0/com....](http://www.ibm.com/support/knowledgecenter/SSCRJU_4.1.0/com.ibm.streams.toolkits.doc/spldoc/dita/tk$com.ibm.streams.cep/op$com.ibm.streams.cep$MatchRegex.html)

------
ignoramous
For anyone wondering what CEP is... (from the blog) "in contrast to
traditional DBMSs where a query is executed on stored data, CEP executes data
on a stored query."

And

"Apache Flink with its true streaming nature and its capabilities for low
latency as well as high throughput stream processing is a natural fit for CEP
workloads."

Anyone looking at generic low latency, high throughput systems can look at
Disruptor by LMAX [0]. Spring's Reactor framework supports it out-of-the-box.

[0]
[https://news.ycombinator.com/item?id=3173993](https://news.ycombinator.com/item?id=3173993)

~~~
sgdread
Some CEPs even have query languages which look very like SQL (Esper [1]).
But... it's different: 1) it's push model vs pull in database (you toss data
in meatgrinder and see outcome on other side as stuff goes in vs going to data
and picking what you need); 2) CEPs usually keep track of data only in
specified windows (time/distance); you explicitly have to manage manual data
windows; this means queries to join everything with everything between two
data streams will result in memory explosion on large data sets; 3) CEPs
trigger evaluation on message inserts/removals - this means you can't trigger
re-evaluation at will without any stimulus event and you can't do joins with
absence of something; you can't simply ask CEP to fire an event if no new
events happened in last 5 minutes - that needs some query trickery.

[1]
[http://www.espertech.com/products/esper.php](http://www.espertech.com/products/esper.php)

------
haddr
I'm watching closely Apache Flink recently and it seems to me that it's more
universal that Apache Spark, and with less overhead. Today if somebody asked
me for general purpose stream processor engine I would recommend Flink. Of
course Spark has many advantages too, but unless you don't need explicitly
need Spark's RDD, then go for Flink.

What I'm still missing is better support for machine learning libraries, but I
suppose it's a matter of time.

~~~
jkot
There is a quite good technical description of difference between Spark and
Flink

[http://blog.madhukaraphatak.com/](http://blog.madhukaraphatak.com/)

~~~
slagfart
Here's the full link: [http://blog.madhukaraphatak.com/introduction-to-flink-
for-sp...](http://blog.madhukaraphatak.com/introduction-to-flink-for-spark-
developers-flink-vs-spark/)

------
x0rg
For those interested in a comparison between Spark and Flink, a recent article
from our blog here: [https://tech.zalando.com/blog/apache-showdown-flink-
vs.-spar...](https://tech.zalando.com/blog/apache-showdown-flink-vs.-spark/)

------
nxzero
Very possible it's obvious and I'm just not seeing it, but are there examples
of (static) streams and related code the maybe processes to get a sense of the
workflow, data, code, etc.?

~~~
stsffap
As input you can basically use any Flink DataStream<T>. There are multiple
ways to ingest this data into your Flink program. The most common way is to
use a Flink connector for Kafka or RabbitMQ. Or you can also write your own
Flink source function. This repository [https://github.com/tillrohrmann/cep-
monitoring](https://github.com/tillrohrmann/cep-monitoring) contains a
complete working example for the monitorig use case. Instead of reading from
an external source, we've defined a custom source to generate the input data.
The output data is written to stdout for demonstration purposes.

------
oh_sigh
From my shallow perusal, this seems like it is strongly connected with
reactive style programming? Is this a library to enable that style of
programming, or is there something different here?

------
vardump
What kind of throughput can I expect?

How many millions of events per second per contemporary Intel CPU core?

~~~
jtagx
I took the code from the blog post, set the parallelism to one and measured
how many events the data generator is pushing into the CEP pattern matcher of
Flink.

My CPU: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz Thoughput on one core ~ 650k
events / second. Code: [https://github.com/rmetzger/cep-
monitoring/tree/throughput](https://github.com/rmetzger/cep-
monitoring/tree/throughput)

------
jamesblonde
To install flink on aws or gce, check out karamel.io

