[dupe] Twitter’s new R package for anomaly detection (r-bloggers.com)
62 points by astletron on Jan 8, 2015 | 15 comments



This is blogspam and a duplicate. The original URL is here:

https://blog.twitter.com/2015/introducing-practical-and-robu...

And the previous discussion is here:

https://news.ycombinator.com/item?id=8846205


I wouldn't call it blogspam, since r-bloggers.com is an opt-in aggregator. But yes, a link to the original makes more sense.


What's the point of crediting the middleman? Hacker News and Reddit do the same (or even better); r-bloggers just rips off the content. Been there, done that.


That's not what I said. My point was that labelling it blogspam doesn't seem correct, because it's opt-in and a preferred source of articles for many in the R community. And as I said, the original should have been linked to.


This reminds me how much of a hole there is in my knowledge of statistics and the like. I built myself a Twitter client that sucks users' geolocations into a DB so I can do all kinds of analyses on their movements. It makes me wish we'd had statistics classes back in school; that should come right after learning the ABCs.


Even the original Twitter blog seems to have gone out of its way to make this seem more complex than it is...

Decomposition of the time series is done with STL (the stl function in the stats package), and this is the first part of what they call "Seasonal Hybrid ESD (S-H-ESD)" (sounds impressive, right?). The rest apparently just involves taking the maximum absolute difference from the detrended sample mean, in terms of standard deviations, removing that point, and repeating until you have your collection of x outliers. If they wanted to, this could be explained in a few sentences, and the underlying code is really simple [0], but for whatever reason it's been written up as advanced analytics, as if decomposing a time series were a major challenge.

[0] https://github.com/twitter/AnomalyDetection/blob/master/R/de...
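To make that concrete, here's a rough sketch of the procedure as I described it. Treat it as a caricature, not Twitter's actual code: the real S-H-ESD uses robust statistics (median/MAD) and a significance test rather than a plain mean/SD cutoff, and simple_esd_outliers is just a name I made up.

    # Sketch only: decompose with STL, then repeatedly strip the most
    # extreme residual, measured in standard deviations from the mean.
    simple_esd_outliers <- function(x, freq, k = 10) {
      fit <- stl(ts(x, frequency = freq), s.window = "periodic")
      resid <- as.numeric(fit$time.series[, "remainder"])
      idx <- seq_along(resid)
      out <- integer(0)
      for (i in seq_len(k)) {
        z <- abs(resid - mean(resid)) / sd(resid)   # deviation in SDs
        worst <- which.max(z)
        out <- c(out, idx[worst])                   # record the outlier
        resid <- resid[-worst]                      # remove it and repeat
        idx <- idx[-worst]
      }
      sort(out)
    }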


While the computation might be relatively simple, it's still necessary to be aware of the literature and to use the proper academic description for the methods.


Time series analysis comes quite late in most statistics syllabi. I did quite a bit of statistics at school, and it wasn't until my third year of Mathematics & Statistics at university that we touched time series data. (Although I think it could have been taken as a second-year module.)


There's a TED talk you may enjoy in which the speaker (I believe Arthur Benjamin) argues that statistics should be taught before calculus. There's a bit of a dependency on calculus in statistics (e.g. the derivation of linear regression makes no sense without differential calculus), but I find that I use linear algebra and statistics far more in my work, daily life, and reading about topics that interest me. There was no mention of those branches of math in my high school; I had to learn about them on my own, and I feel they're more valuable. I don't think you need to drop trigonometry and calculus to make room, though. I started my education in New Zealand, and I believe that by 7th form (the final year of high school) students there have done both statistics and calculus, and more linear algebra than students in the US have.


I have a talk in which I argue that if you can program, you have a huge advantage in learning statistics because you can simulate random processes and tinker with them to get an intuitive sense of the statistics involved.

At Data Driven NYC: https://www.youtube.com/watch?v=AfSM45ncAT8

Keynote at Strata+Hadoop World 2014: https://www.youtube.com/watch?v=5Dnw46eC-0o
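A tiny example of the kind of tinkering I mean: simulate the sampling distribution of a mean directly, with no theory required.

    # Build intuition for the central limit theorem by brute force.
    set.seed(42)
    means <- replicate(10000, mean(rexp(30, rate = 1)))
    hist(means, breaks = 50,
         main = "Means of 10,000 samples of Exp(1), n = 30")
    abline(v = 1, col = "red")  # the true mean of Exp(1)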


I just signed up for a statistics class at my local university. I will officially be an undergrad student again on Jan 20, and I intend to learn much much more this time.

Luckily my employer encourages learning, and it helps that the class is mostly during lunch.

Keep on learning!


It's interesting that they chose to write their anomaly detection code in R, which is typically used in an offline, post hoc analysis mode. It seems reasonable to suppose that the ability to discover an anomaly like "service X is failing" in real time is more valuable than discovering it a week later.

To monitor important time series with this code, they would presumably need to run it every n minutes on the entire time series, or at least on the recent part of it. An anomaly detection system operating on streaming data seems like it might make more sense.
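Something like this, say. A hypothetical sketch using the package's AnomalyDetectionVec on a trailing window of minute-level data; check_recent is just a name I made up:

    library(AnomalyDetection)

    # Run the detector over only the recent part of the series.
    check_recent <- function(series, period = 1440) {  # minutes per day
      window <- tail(series, 2 * period)               # last two "days"
      res <- AnomalyDetectionVec(window, max_anoms = 0.02,
                                 period = period, direction = "both")
      res$anoms
    }
    # ...then schedule check_recent() every n minutes via cron or similar.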

Perhaps their real time anomaly detection system uses simpler logic on streaming data?


Does anyone know whether these R packages (AnomalyDetection, BreakoutDetection) are meant to be used on large-scale data, or are they more intended for lab work?


It doesn't look like they are set up to run in parallel, but most R stuff isn't, unless a package has explicit integration with a parallel backend such as doParallel.

It would be interesting to see this package hooked up to streaming data and to monitor its performance.
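If someone wanted to push it toward scale, the obvious first step is an embarrassingly parallel fan-out over independent series, e.g. with doParallel. A sketch, where series_list is a hypothetical list of numeric vectors:

    library(doParallel)  # also attaches foreach and parallel

    cl <- makeCluster(4)
    registerDoParallel(cl)

    # One detector run per series, executed in parallel.
    results <- foreach(s = series_list,
                       .packages = "AnomalyDetection") %dopar% {
      AnomalyDetectionVec(s, max_anoms = 0.02,
                          period = 1440, direction = "both")$anoms
    }
    stopCluster(cl)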


I was wondering if the algorithm could be added to Etsy's Skyline, which does anomaly detection on streaming data using a basket of algorithms (see https://github.com/etsy/skyline/blob/master/src/analyzer/alg...). Our own data is periodically bursty, and because Skyline doesn't apply STL the way this code does, we have data that looks anomalous all the time.
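To illustrate why that matters, a quick made-up simulation: a series with a repeating daily burst trips a plain 3-sigma rule on every burst, but barely at all once STL strips the seasonal component.

    set.seed(1)
    daily <- c(rep(0, 140), rep(20, 4))   # quiet "day" ending in a short burst
    x <- rep(daily, 10) + rnorm(1440)     # ten days of data
    # plain 3-sigma rule: every burst is flagged as anomalous
    raw_flags <- sum(abs(x - mean(x)) / sd(x) > 3)
    # after STL removes the repeating burst, almost nothing is flagged
    rem <- as.numeric(stl(ts(x, frequency = 144),
                          s.window = "periodic")$time.series[, "remainder"])
    stl_flags <- sum(abs(rem - mean(rem)) / sd(rem) > 3)
    c(raw = raw_flags, deseasonalized = stl_flags)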



