The servers that accumulate errors aren't always the ones with the highest spikes, as shown in the link. However, they do consistently stay above the others in error counts. So my answer is this:
Since your time series is quantized, the integral of errors over time is simply a sum over the time frame. I would recommend multiple time windows, like "10 seconds, 3 minutes, 1 hour, 12 hours". Do this for all your servers.
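To make the "integral is a sum" point concrete, here is a minimal sketch. It assumes one error count per second per server; the names `WINDOWS_S`, `error_area` and `error_areas` are made up for illustration.

```python
from typing import Dict, Sequence

# Window lengths in seconds, taken from the examples above (an assumption,
# tune these for your own services).
WINDOWS_S: Dict[str, int] = {"10s": 10, "3m": 180, "1h": 3600, "12h": 43200}


def error_area(errors_per_second: Sequence[int], window_s: int) -> int:
    """Sum of errors over the most recent `window_s` samples.

    With one sample per second, the integral of the error curve over the
    window is just this sum.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return sum(errors_per_second[-window_s:])


def error_areas(errors_per_second: Sequence[int],
                windows: Dict[str, int] = WINDOWS_S) -> Dict[str, int]:
    """Error area for every configured window, keyed by window name."""
    return {name: error_area(errors_per_second, w) for name, w in windows.items()}
```

You would call `error_areas` once per server and compare the results window by window.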
Now you have a normalized 2D graph with respect to time. We can calculate the mean of all servers' error areas as well as the standard deviation. To reduce CPU load, run a modified Thompson tau test only on values outside ±2σ. For normally distributed data about 95.4% falls within ±2σ, so you would only be checking ~4.6% of your data (~2.3% if you only look at the high side).
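A rough sketch of the prefilter step only, assuming the error areas are roughly normally distributed across the fleet. The modified Thompson tau test itself is not shown; `coverage_within` and `flag_candidates` are hypothetical names.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List


def coverage_within(k: float) -> float:
    """Fraction of a normal distribution within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


def flag_candidates(areas: Dict[str, float], k: float = 2.0) -> List[str]:
    """Servers whose error area is more than k sigma away from the fleet mean.

    Only these candidates would be handed to the more expensive outlier test.
    """
    values = list(areas.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # every server looks identical, nothing to flag
        return []
    return sorted(name for name, v in areas.items() if abs(v - mu) / sigma > k)
```

`coverage_within(2)` comes out at about 0.954, which is where the "check only a few percent of the data" saving comes from.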
When a server fails an outlier test (in other words, is detected as an outlier), the historical data can also be investigated. Is the machine doing gradually worse? If so, the machine could be removed from service until a memory and CPU test can be run. You can also track how many outliers occur per hour; crossing a threshold could indicate faulty machines.
My answer also assumes that the machines carry equal load. I'm sure this is not the case, in which case you will have to normalize errors by the number of requests. This shouldn't be a problem.
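The normalization could look something like this (a sketch; `error_rate` is a made-up helper, and treating an idle server as 0.0 is my assumption):

```python
def error_rate(errors: int, requests: int) -> float:
    """Errors per request, so busy and idle servers can be compared."""
    if errors < 0 or requests < 0:
        raise ValueError("counts must be non-negative")
    # Assumption: a server that saw no requests is treated as error-free.
    return errors / requests if requests else 0.0
```

A server with 50 errors out of 10,000 requests then looks the same as one with 5 errors out of 1,000.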
Both systems are certainly in place, though they serve different business use cases. RAD isn't used in real-time for operational decisions and Kepler isn't used for anything that Data Science uses RAD for.
Have you guys thought at all about adapting a reinforcement learning model for building an online model to detect outliers?
Ultimately we would like all of our systems to use RL to interpret user feedback to determine parameters, actions, etc...
Brilliant question by the way!
Sounds like you guys have thought this through really well.
I have seen many time-series analyses posted here lately, and all seem to use arbitrary window sizes.
Edit: PS, thanks for posting this and being around for questions.
We've got an experimental streaming version going that hasn't been set loose on any services yet, but it can get much higher granularity metrics ~ 10s (faster if we cared to).
edit: Forgot to add that we did try other window sizes, as long as 30 minutes, but we found that longer windows allowed the past to influence the current decision too much. With 30 minute windows, if a server had spiked in the past we were aggressive about calling it an outlier. Furthermore, if it had been inlying and had only just become an outlier, the long window killed our time to detect, which is an important metric for us.
I had always figured timed filters like "above X error rate for Y occurrences in Z time" may work. Threshold filters can be much more complex. I'm curious whether you tried different types of threshold filters and whether they worked or not.
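For reference, a minimal sketch of the kind of timed filter I mean. `TimedThreshold` and its parameters are made up; timestamps are assumed to be in seconds and to arrive in increasing order.

```python
from collections import deque
from typing import Deque


class TimedThreshold:
    """Fires when the error rate exceeds X at least Y times within Z seconds."""

    def __init__(self, rate_limit: float, min_hits: int, window_s: float) -> None:
        self.rate_limit = rate_limit  # X
        self.min_hits = min_hits      # Y
        self.window_s = window_s      # Z
        self._hits: Deque[float] = deque()

    def observe(self, timestamp: float, error_rate: float) -> bool:
        """Record one sample and report whether the filter is firing."""
        # Forget breaches that have aged out of the window.
        while self._hits and timestamp - self._hits[0] >= self.window_s:
            self._hits.popleft()
        if error_rate > self.rate_limit:
            self._hits.append(timestamp)
        return len(self._hits) >= self.min_hits
```

With `TimedThreshold(0.05, 3, 60)`, three samples above 5% inside a minute fire the filter, and it clears once those breaches are more than a minute old.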
Imagine that service A depends on service B, and service B begins experiencing an issue. The outlier detection system that service A has set up will see the entire cluster surge upwards on errors. This would trip a traditional threshold-based approach, but the last thing we'd want to do is terminate servers in service A. In this case the entire cluster moving upwards sets the context that there is a larger scale problem going on (nothing is outlying), whereas if only a few servers move upwards (become outliers) we have a localized issue we can fix.
Meanwhile service B through either outlier or threshold alerts is paged and dealing with their issue.
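To illustrate the difference between an absolute threshold and a cluster-relative check, here is a toy sketch. It uses a median/MAD score as a stand-in, not the clustering we actually run; `absolute_alerts`, `relative_outliers`, `k` and `min_scale` are hypothetical.

```python
from statistics import median
from typing import Dict, List


def absolute_alerts(rates: Dict[str, float], limit: float) -> List[str]:
    """Traditional threshold: every server above `limit` alerts."""
    return sorted(s for s, r in rates.items() if r > limit)


def relative_outliers(rates: Dict[str, float], k: float = 5.0,
                      min_scale: float = 0.005) -> List[str]:
    """Servers far above the cluster median, measured in MAD units.

    `min_scale` keeps a very tight cluster from flagging tiny differences.
    """
    med = median(rates.values())
    mad = median(abs(r - med) for r in rates.values())
    scale = max(mad, min_scale)
    return sorted(s for s, r in rates.items() if (r - med) / scale > k)


# Service B is failing: all of service A's servers surge together.
surge = {"a1": 0.20, "a2": 0.21, "a3": 0.19, "a4": 0.20}
# Localized problem: only a4 misbehaves.
local = {"a1": 0.01, "a2": 0.01, "a3": 0.02, "a4": 0.20}
```

On `surge` the absolute threshold alerts on all four servers while the relative check flags none; on `local` the relative check flags only `a4`.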
As a side note, we do something similar to time-sensitive thresholds when doing threshold-based alerting on differences between the observed data and a double exponential smoothing fit. Parameter selection is an issue at times in this case. We talk about this a bit in an older blog post on Stream Starts per Second (SPS): http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-str...
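For anyone unfamiliar with the idea, here is a bare-bones sketch of alerting on the residual against a double exponential smoothing (Holt's linear trend) fit. It is not our production code; `holt_forecasts`, `residual_alerts` and the `alpha`/`beta` defaults are placeholders.

```python
from typing import List, Optional, Sequence


def holt_forecasts(series: Sequence[float], alpha: float = 0.5,
                   beta: float = 0.5) -> List[Optional[float]]:
    """One-step-ahead forecasts from double exponential smoothing.

    forecasts[i] is the prediction for series[i] made from earlier points.
    The first two entries are None because they only seed level and trend.
    """
    if len(series) < 2:
        raise ValueError("need at least two points")
    level = series[1]
    trend = series[1] - series[0]
    forecasts: List[Optional[float]] = [None, None]
    for x in series[2:]:
        forecasts.append(level + trend)
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def residual_alerts(series: Sequence[float], threshold: float,
                    alpha: float = 0.5, beta: float = 0.5) -> List[int]:
    """Indices where |observed - forecast| exceeds `threshold`."""
    forecasts = holt_forecasts(series, alpha, beta)
    return [i for i, (x, f) in enumerate(zip(series, forecasts))
            if f is not None and abs(x - f) > threshold]
```

On `[0, 1, 2, 3, 4, 5, 20, 7, 8, 9]` with a threshold of 5 this flags index 6 and also the two points after it, because the fit needs a few steps to recover from the spike. That is the parameter selection problem in miniature.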
Hope that adds a little context for you. :)
I would say that we don't care about root cause when the event happens; just get the server out of the pool. RCA can be done post mortem.
I would reduce dimensionality down to 2D: errors per time. In that case, we have a great deal of statistical tools at hand. They also avoid the hand-waving of N-dimensional cluster detection, where only the machine has any idea how errors are being detected. And having something like 1000 dimensions is just tremendously slow compared to integral analysis of errors with respect to time.
How do service owners get this data? By manual inspection of graphs like the ones shown earlier?
If you meant in the more mathematical sense, we perform our clustering in a normalized Euclidean space.
We throttle automatic terminations so that it doesn't drop an entire cluster at once. Yet to cause an outage, fingers crossed!
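The throttle idea can be sketched like this. It is a toy version, not our implementation; `TerminationThrottle` and its limits are invented for illustration.

```python
from collections import deque
from typing import Deque


class TerminationThrottle:
    """Allows at most `max_terminations` per sliding window of `window_s` seconds."""

    def __init__(self, max_terminations: int, window_s: float) -> None:
        self.max_terminations = max_terminations
        self.window_s = window_s
        self._times: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Return True (and record the termination) if one may proceed now."""
        # Drop terminations that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) < self.max_terminations:
            self._times.append(now)
            return True
        return False
```

With `TerminationThrottle(2, 600)`, a third outlier detected within ten minutes is left running until an earlier termination ages out of the window.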
We generally run into two classes of errors: 1) Software bugs, which follow the process outlined above. 2) Issues with AWS... an example being some virtual servers running on hardware experiencing a network issue. If they're terminated and replaced by the ASG, the new ones generally spin up on good hardware and we've avoided the issue. Rare, but it does happen at our scale.
I will say that the outlier score and confidence score NuPIC produces would have been incredibly useful by the time this was brought to its end users. Definitely worth a look for anyone looking to do realtime stream processing for anomaly detection.
It would be much better for it to be doing this sort of outlier detection - a gradual increase in error rate to 3% should not trigger a critical alert, whereas a big jump in error rates should trigger an alert quickly.
Has anyone implemented a system like this?
For example, if network TCP retransmits are throwing it off, we probably just want the system to kill it and let the autoscaling group bring up another server. If it's memory usage, we probably want to page someone.
Note: my thoughts are included, as I thought they might be of interest to anyone looking into this problem.
We have considered modeling the distribution from which the data is typically drawn and then calculating likelihood of newly observed data. Some of the approaches that we use to detect anomalies on stream starts per second (SPS) now depend on these services. Same software package, slightly different solution.
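As a toy version of the likelihood idea, assuming the historical data is well described by a single Gaussian (a big assumption for real traffic); `fit_baseline`, `log_likelihood` and the cutoff are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence


def fit_baseline(history: Sequence[float]) -> NormalDist:
    """Fit a normal distribution to historical observations."""
    return NormalDist.from_samples(history)


def log_likelihood(model: NormalDist, x: float) -> float:
    """Log density of a new observation under the fitted model."""
    if model.stdev == 0:
        raise ValueError("history has no variance")
    z = (x - model.mean) / model.stdev
    return -0.5 * z * z - math.log(model.stdev * math.sqrt(2 * math.pi))


def is_unlikely(model: NormalDist, x: float, min_log_likelihood: float) -> bool:
    """True when the observation is less likely than the chosen cutoff."""
    return log_likelihood(model, x) < min_log_likelihood
```

A new point far from the history gets a very low log-likelihood, and the cutoff plays the role that a threshold plays elsewhere.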
A colleague of mine (Chris) implemented a data tagger which allows users to annotate data that is typically fed into this system. We have plans to have the backend automatically swap out the algorithm based on performance against their tagged data.
We've written about SPS here: http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-str...
You could also have a manual report button the user can click on. And if you want to get really advanced, have a system that learns from those manual reports so that it can later warn when there is a high probability of a "problem" occurring.
Great question, and I'm glad others are thinking along these lines -- we fully believe that this is important for bridging the gap between service owners and our analytics.
Instead of a line graph, try different opacities. So you have one line per server, going left-to-right, shading from white (for no errors) to whatever color (for the maximum number of errors overall). And perhaps a dot per server and a single line indicating overall status.
Possibly a startup opportunity to offer an easier option with error logs + machine learning.