
Discovering Anomalies in Real-Time with Apache Flink - GeneticGenesis
https://mux.com/blog/discovering-anomalies-in-real-time-with-apache-flink/
======
js8
Since I maintain a pretty large ETL (batch) application for a living, I am
genuinely curious about this. How do you handle failure in event-processing
systems? I mean, in batch it's simple: if there is a record (event) that
causes an unexpected failure (or the program fails for another reason, for
example it runs out of space), we just restart the batch.

But in event processing, unless you can afford to skip events, how do
you deal with that sort of thing, especially if the processing needs to keep
track of internal state between events?

I read about event-sourcing, which kinda is a solution to that, but add
checkpoints and you have pretty much batch processing again.
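
To make the checkpoint idea concrete, here is a minimal sketch in Python. It is not how Flink or any particular framework does it; the class and function names are made up for illustration. It assumes the events sit in a replayable log (a plain list here), and that the processor saves its state together with its log offset, so a restart rewinds to the last checkpoint and replays only the events after it:

```python
import copy


class CountingProcessor:
    """Hypothetical stateful processor: keeps a running count per event key."""

    def __init__(self):
        self.state = {}
        self.offset = 0              # index of the next log entry to read
        self._checkpoint = (0, {})   # (offset, state) saved together

    def checkpoint(self):
        # State and offset are saved as one unit so they can never disagree.
        self._checkpoint = (self.offset, copy.deepcopy(self.state))

    def restore(self):
        self.offset, saved_state = self._checkpoint
        self.state = copy.deepcopy(saved_state)

    def process(self, event):
        self.state[event] = self.state.get(event, 0) + 1
        self.offset += 1


def run(log, processor, checkpoint_every=3, fail_once_at=None):
    """Process the whole log, optionally simulating one crash at an offset."""
    failed = False
    while processor.offset < len(log):
        try:
            if fail_once_at == processor.offset and not failed:
                failed = True
                raise RuntimeError("simulated crash")
            processor.process(log[processor.offset])
            if processor.offset % checkpoint_every == 0:
                processor.checkpoint()
        except RuntimeError:
            # Rewind to the last checkpoint; later events are replayed.
            processor.restore()
    return processor.state


log = ["a", "b", "a", "c", "b", "a", "a"]
clean = run(log, CountingProcessor())
recovered = run(log, CountingProcessor(), fail_once_at=5)
```

Here `recovered` ends up equal to `clean`, because the two events processed between the last checkpoint and the crash are rolled back and replayed exactly once. A record that fails deterministically would fail again on replay, so that case needs separate handling that this sketch does not cover.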

~~~
mtrn
Spot on. Can you recommend some good articles or books dealing with these
kinds of problems?

My current understanding is that event-based (in a very broad sense) systems
are hard to "replay" in the case of a failure (an error in the data or just a
bug), unless you build additional machinery, which decreases robustness. In
contrast, the task of making a batch-processing system perform fast is easy
and much better defined.

~~~
urlgrey
I highly recommend reading Tyler Akidau's article titled "The world beyond
batch: Streaming 101":
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

And its follow-up post:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

------
hackerboos
"The Apache Flink project is a relative newcomer to the stream-processing
space. Competing Open-Source platforms include Apache Spark, Apache Storm, and
Twitter Heron."

Can someone explain why Apache are creating projects that compete with each
other? Why not focus on one?

~~~
lvh
Apache often houses existing projects, sometimes becoming a home for
formerly-proprietary software that gets thrown over the wall. Remember Google Wave?
That's Apache Wave now. Apache Storm started out as just Storm, open sourced
after a Twitter acquisition. Twitter Heron was open sourced by Twitter. Flink
is a fork of Stratosphere. Et cetera :) So: ASF is doing no such thing,
they're just providing a framework for open source projects to function in.

------
urlgrey
I'm the author of this Mux blog post and would love to take any questions or
comments, as well as suggestions for future posts. Thank you for your
interest!

~~~
oxtopus
Can you elaborate on the "novel anomaly-detection algorithms" used here?

~~~
urlgrey
Sure! We evaluated several anomaly-detection tools & libraries. They included:

tried-and-true statistical methods like probability density functions

Yahoo EGADS anomaly-detection library

Numenta HTM neural-network anomaly-detection library

We ruled out HTM due to AGPL licensing concerns. It's an interesting product,
but wasn't a good fit for us at this point in time. EGADS and other basic
statistical methods can actually get you pretty far.
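
As an illustration of the "basic statistical methods" category, here is a minimal Python sketch. It is not Mux's algorithm and not the EGADS API; the class name, the threshold of 3 standard deviations, and the warm-up of 10 samples are all assumptions. It fits a running Gaussian to a metric with Welford's online algorithm and flags values that fall far out in the tails:

```python
import math


class RunningGaussian:
    """Online mean and variance via Welford's algorithm (one pass, O(1) memory)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; undefined for fewer than two points.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomaly(self, x, threshold=3.0, min_samples=10):
        # Assumed policy: never flag during warm-up or when variance is zero.
        if self.n < min_samples or self.std == 0.0:
            return False
        return abs(x - self.mean) / self.std > threshold


def detect(stream, threshold=3.0):
    """Return the indices of values flagged as anomalous."""
    model = RunningGaussian()
    flagged = []
    for i, x in enumerate(stream):
        if model.is_anomaly(x, threshold):
            flagged.append(i)
        else:
            model.update(x)   # learn only from values considered normal
    return flagged


# Made-up error rates: steady around 1%, with one spike to 25%.
rates = [0.010, 0.012, 0.011, 0.009, 0.010, 0.013, 0.011,
         0.010, 0.012, 0.009, 0.011, 0.250, 0.010]
flagged = detect(rates)
```

With this input only the spike at index 11 is flagged. Excluding flagged values from the model keeps one outlier from inflating the variance and masking the next one, at the cost of adapting slowly to a genuine shift in the baseline.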

~~~
bradknowles
And what do you do with this video anomaly information?

~~~
urlgrey
Mux (https://mux.com/) collects performance metrics for video delivery &
playback on the websites & apps of our customers. These metrics feed into our
real-time alerting system. If the error rate for a customer property (site)
or video title is exceptionally high, an alert is triggered. Mux customers
can configure alert notifications to be sent
to Slack & email, and view a history of alerts in the Mux web dashboard. We
also offer the ability to view breakdowns of playback failures, video start-up
times, and more through our dashboard. This can be helpful for diagnosing
playback issues related to specific browsers, geographies, ISPs, and more.
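
A rough Python sketch of the per-property error-rate check described above. The event shape, the 60-second window, the 20% threshold, and the minimum view count are all assumptions for illustration, not Mux's actual settings. It groups playback events into fixed (tumbling) windows per property and reports the windows whose error rate is too high. For simplicity it runs over a finished list of events instead of a live stream:

```python
from collections import defaultdict


def error_rate_alerts(events, window_seconds=60, max_error_rate=0.2, min_views=5):
    """events: iterable of (timestamp_seconds, property_id, is_error) tuples.

    Returns a sorted list of (window_start, property_id, error_rate) for every
    window that has enough views and an error rate above the threshold.
    """
    # (window_start, property_id) -> [views, errors]
    counts = defaultdict(lambda: [0, 0])
    for ts, prop, is_error in events:
        window_start = int(ts // window_seconds) * window_seconds
        bucket = counts[(window_start, prop)]
        bucket[0] += 1
        bucket[1] += int(is_error)

    alerts = []
    for (window_start, prop), (views, errors) in sorted(counts.items()):
        # min_views avoids alerting on a window with only one or two plays.
        if views >= min_views and errors / views > max_error_rate:
            alerts.append((window_start, prop, errors / views))
    return alerts


# Made-up data: site-a has 1 error in 10 views, site-b has 3 errors in 5 views.
events = [(i, "site-a", i == 0) for i in range(10)]
events += [(i, "site-b", i < 3) for i in range(5)]
alerts = error_rate_alerts(events)
```

Only `site-b` produces an alert here, for the window starting at 0. A fixed threshold is the simplest possible rule; a statistical baseline per property could replace the `max_error_rate` comparison.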

------
falsedan
I'd like to see more about how they used Flink, and less about their system
architecture (which is covered in great detail, right up until the data is
processed with Flink).

~~~
urlgrey
Great suggestion! We'll have a follow-up post soon that goes into greater
detail on the mechanics of how Mux uses Flink.

------
dtjon
We ditched Spark Structured Streaming for Flink for a Kafka consumer,
processing 3B events per day. It's been extremely stable so far, and half the
cost of the Spark cluster.

~~~
SureshG
Could you provide more details about your cluster setup? How big is your cluster?

