

Twitter’s new R package for anomaly detection - astletron
http://www.r-bloggers.com/twitters-new-r-package-for-anomaly-detection/

======
jcr
This is blogspam and a duplicate. The original url is here:

[https://blog.twitter.com/2015/introducing-practical-and-robu...](https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series)

And the previous discussion is here:

[https://news.ycombinator.com/item?id=8846205](https://news.ycombinator.com/item?id=8846205)

~~~
Hansi
I wouldn't call it blogspam since r-bloggers.com is an opt-in aggregator. But
yes a link to the original makes more sense.

~~~
Dzidas
What's the point of crediting the middleman? Hacker News and reddit do the
same (or even better); r-bloggers just rips off the content. Been there, done
that.

~~~
Hansi
That's not what I said. My point was that labelling it blogspam doesn't seem
correct, because it's opt-in and a preferred source of articles for many in
the R community. And as I said, the original should have been linked.

------
tempodox
This reminds me how much of a hole there is in my knowledge about statistics
and such. I built myself a Twitter client that sucks users' geolocations into
a DB so I can do all kinds of analyses on their movements. Makes me wish we
had Statistics classes back in school. That should come right after learning
the ABC.

~~~
jebus989
Even the original Twitter blog seems to have gone out of its way to make this
seem more complex than it is...

Decomposition of the time series is done with STL (the stl function in the
stats package), and this is the first part of what they call "Seasonal Hybrid
ESD (S-H-ESD)" (sounds impressive, right?). The rest apparently just involves
taking the point with the maximum absolute difference from the detrended
sample mean, measured in standard deviations, removing it, and repeating until
you have your collection of x outliers. If they wanted to, this could be
explained in a few sentences, and the underlying code is really simple [0],
but for whatever reason it's been written up as advanced analytics, as if
decomposing a time series were a major challenge.

[0]
[https://github.com/twitter/AnomalyDetection/blob/master/R/de...](https://github.com/twitter/AnomalyDetection/blob/master/R/detect_anoms.R)
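
The remove-and-repeat step described above can be sketched in a few lines of Python. This is only an illustration of the comment's description, not the package's code: it assumes the input is already detrended and deseasonalized (no STL is done here), it uses the plain mean and standard deviation as the comment does, and it skips the significance test that a full generalized ESD procedure would apply to decide how many candidates really count as anomalies. The name `iterative_outliers` is made up for this example.

```python
from statistics import mean, stdev

def iterative_outliers(values, max_outliers):
    """Repeatedly remove the point farthest from the mean, in standard deviations.

    Returns a list of (original_index, value, score) tuples in removal order.
    """
    remaining = list(enumerate(values))  # keep original indices alongside values
    removed = []
    for _ in range(max_outliers):
        if len(remaining) < 3:
            break  # too few points left for a meaningful deviation
        data = [v for _, v in remaining]
        mu, sigma = mean(data), stdev(data)
        if sigma == 0:
            break  # all remaining points are identical
        # Position of the largest absolute deviation in the remaining sample.
        pos = max(range(len(remaining)), key=lambda i: abs(remaining[i][1] - mu))
        idx, val = remaining.pop(pos)
        removed.append((idx, val, abs(val - mu) / sigma))
    return removed

# Hypothetical residuals left over after removing trend and seasonality.
residuals = [0.1, -0.2, 0.0, 5.0, 0.1, -0.1, -4.0, 0.2]
for idx, val, score in iterative_outliers(residuals, 2):
    print(idx, val, round(score, 2))
```

Recomputing the mean and standard deviation after each removal is the point of the loop: a large outlier inflates both, which can hide a second, smaller one.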

~~~
dtjones
While the computation might be relatively simple, it's still necessary to be
aware of the literature and use the proper academic description for the methods.

------
naftaliharris
It's interesting that they chose to write their anomaly detection code in R,
which is typically used in offline, post hoc analysis mode. It seems
reasonable to suppose that the ability to discover an anomaly like "service X
is failing" in real time is more valuable than discovering it a week later.

In order to monitor important time series with this code, they would
presumably need to run it every n minutes on the entire time series, or at
least the recent part of it. Seems an anomaly detection system operating on
streaming data might make more sense.

Perhaps their real time anomaly detection system uses simpler logic on
streaming data?
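
As a rough idea of what "simpler logic on streaming data" could look like, here is a minimal rolling z-score detector in Python. It is a guess at the general shape of such a system, not anything Twitter has described: the class name, the window size, the warm-up count of five points, and the threshold are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev

class RollingZScoreDetector:
    """Flag a point if it is far from the mean of a recent window of normal points."""

    def __init__(self, window=30, threshold=3.0):
        self.history = deque(maxlen=window)  # oldest points fall off automatically
        self.threshold = threshold

    def update(self, value):
        """Process one new observation; return True if it looks anomalous."""
        is_anomaly = False
        if len(self.history) >= 5:  # wait for a few points before judging
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        # Only learn from points that look normal, so one spike does not
        # widen the band for the points that follow it.
        if not is_anomaly:
            self.history.append(value)
        return is_anomaly

detector = RollingZScoreDetector(window=10, threshold=3.0)
stream = [10, 11, 10, 9, 10, 11, 10, 9, 10, 50, 10]
flags = [detector.update(x) for x in stream]
print(flags)
```

Each update costs time proportional to the window size rather than the whole series, which is the appeal over rerunning a batch job every n minutes. The trade-off of skipping flagged points is that a genuine, lasting level shift would keep being flagged until the detector is reset.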

------
codewithcheese
Does anyone know if these R packages (AnomalyDetection, BreakoutDetection) are
meant to be used on large-scale data, or are they more intended for lab work?

~~~
dtjones
Doesn't look like they are set up to run in parallel, but most R stuff isn't
unless a package has explicit integration with one of the distributed
libraries such as doParallel.

Would be interesting to see this package hooked up to streaming data and to
monitor its performance.

~~~
bazzargh
I was wondering if the algorithm could be added to Etsy's Skyline, which does
anomaly detection on streaming data, based on a basket of algorithms (see
[https://github.com/etsy/skyline/blob/master/src/analyzer/alg...](https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py)).
Our own data is periodically bursty, and because Skyline doesn't apply STL
like this code does, we have data that looks anomalous all the time.
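
To illustrate why removing the periodic component first helps with bursty data, here is a crude Python sketch. It is not STL and not part of Skyline or AnomalyDetection: it simply subtracts the median of each position within the cycle, and it assumes the period is known in advance and the series covers at least one full cycle. The function name `remove_seasonality` is hypothetical.

```python
from statistics import median

def remove_seasonality(values, period):
    """Subtract the per-phase median: a crude stand-in for a seasonal decomposition."""
    # Median of all points that share the same position within the cycle.
    phase_medians = [median(values[phase::period]) for phase in range(period)]
    return [v - phase_medians[i % period] for i, v in enumerate(values)]

# A series that bursts to 20 every fourth point, with one real anomaly at index 9.
values = [1, 1, 1, 20] * 5
values[9] = 15
residuals = remove_seasonality(values, 4)
print(residuals)
```

On the raw series the regular bursts to 20 are the most extreme values, so a threshold on raw values would flag them every cycle. After the adjustment the bursts become zeros and only index 9 stands out.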

