
Outlier Detection in SQL - vive-la-liberte
https://www.periscopedata.com/blog/outlier-detection-in-sql.html
======
thom
I've recently been doing a bunch of stuff with sports stats, which involves
lots of GIS data. This sort of thing comes up a lot - trying to find a
player's 'territory' based on coordinates of their actions in a game, without
including outlying events that cause you to overestimate the area.

There's a concept in animal behaviour called a 'home range' which is more or
less the same thing - GPS attached to tigers in the wild etc. Some of the
algorithms there are quite interesting, from simply drawing a bounding box
around the the data points, to working out the probability density, to things
like LoCoH, which sort of recursively build up convex hulls from nearest
neighbours.

All of these things are pretty much possible in SQL to one degree of
performance or another. But ultimately I'm fascinated by things like SQL
Server's R support - you can get far simpler, more natural implementations of
these things in R (or indeed in custom aggregates or functions in other
languages). I think in the long term, database engines that offer this sort of
extensibility are going to thrive for analytics work, be they SQL based or
otherwise.

~~~
sqldba
I hope it works out but they need to do a lot of work on it between these CTPs
and RTM. They've made a critical mistake in the way they've marketed and
documented it.

* They call every use a "data scientist". This is fun and assuredly deserved for R experts but is going to scare a hell of a lot of potential users away.

* The installation notes give you a lot of steps but zero insight into why you're doing them. Install these two packages; okay but what exactly are they doing? We need to know this kind of thing. R people might know...

* There's very little troubleshooting documentation either on the installation/initial setup (and there are a LOT of bugs in getting it going; trust me I spent most of last week digging through them) but also once it's operational how it's going to be tracked and managed within the context of everything else that's going on in the SQL instance. Like the memory pools, wait states, resource governor pools, etc. We kind of need to know this stuff. Otherwise we see a server that gets reported as slow, we know it uses R, but we don't know where to look to determine whether R is causing it or not and what we can do about it.

* AFAIK it's single-threaded. Considering most places have super-downsized their CPUs and are going for massively threaded servers these days (which is arguable given Microsoft's 2012+ per-core licensing model; but I'm talking about servers in very big enterprises that have unlimited licensing agreements), that's not going to end well. It's very possible this will be fixed before RTM though.

* And finally, because they've bought revolution and rebranded it as Microsoft R a few days ago, and now it's split into SQL and non-SQL products, there's going to be a lot of confusion and a long wait for appropriate training materials to catch up for all of us non-R users.

* I haven't been impressed with the current tutorial materials. As a non-R user we're going to need a lot more to really understand how to use it. I'm sure there are fantastic R resources available but if you want to understand it with a Microsoft SQL Server background then it's a slightly different story...

~~~
IndianAstronaut
Can you easily fit things like glm across a view or large table with the SQL
Server R addon?

------
cwyers
Is it just me or is Periscope allergic to saying what databases it supports? I
can find nothing on the website, and I am really hesitant to sign up for more
info once companies started actually calling my phone after I did that.

Also, none of those queries will work on MySQL.

~~~
jpitz
They support Postgres, Microsoft SQL Server, Redshift, and Oracle I believe.

~~~
ajones
I know that they also support MySQL. We have Periscope connected to a MySQL
database.

------
et2o
I would like to know about more sophisticated techniques for outlier
detection. These are stat 101 level. Z scores? You can get into a lot of
trouble assuming a normal distribution.

What are credit card companies doing? What's the best way to combine multiple
variables that are predictive of an event for outlier detection? Is there a
simple framework to automate reporting of these events in real-time?

One way to model this might be to treat the outcome ("Is this event an
outlier?") as a 0/1 variable and use one of the many ways to model that type
of data–random forests, logistic regression, neural networks, etc. The problem
is that this isn't really "outlier" detection anymore.

~~~
huac
Anomaly detection is very different across different domains. For CC fraud /
risk, you have discrete transactions so the problem is one of classification,
and generally approached with supervised learning.

I don't know what you mean by combining multiple variables. Do you mean
analysis methods that work with multiple variables (instead of 1-dimensional
z-scores) or do you mean methods that combine multiple variables into 1, to
reduce input dimensions (i.e. principal component analysis)

Because data and data reporting platforms are so different across companies,
there's no 'simple framework' to do reporting. You probably want something
like [https://github.com/etsy/skyline](https://github.com/etsy/skyline).

You also describe an ensemble method for outlier detection, which is what
Skyline uses. I want to note that there is no reason to consider ensembles
"not" outlier detection.

~~~
et2o
I meant using multiple variables to categorize outlier events. What was shown
here are also techniques to categorize discrete events ("Does this day cross
some threshold?"). I guessed supervised learning methods.

------
numlocked
These are some cool SQL tricks! I like it.

The big caveat with the standard deviation technique is that it assumes a
normal distribution. Many datasets are not actually distributed normally
(power-law, Poisson, beta, etc, etc) and so the technique won't work. It's a
much harder problem to 'generically' detect outliers without knowledge of the
underlying distribution.

I don't have any idea how to do it (though a former colleague came up with
nice idea of building a histogram and searching for values that occurred after
some number of empty bins, implying an outlier). Is there an accepted state-
of-the-art for general-purpose outlier detection? Or is that such a broad
question as to be meaningless?

~~~
rodionos
Other than normal distribution assumption, there is another assumption that
doesn't hold true for most time series related to human activity, nature, or
scheduling. If heteroscedasticity is present, you cannot use the same standard
deviation for the entire series. A more practical approach is to compute
variance for each calendar period separately.

Here's an example - expected variance for the number of SWIFT payments
processed during non-banking hours is 0. Transaction counter greater than 0 is
an outlier.

------
darkxanthos
Standard deviations assume normal distributions... but ultimately when you
look for some two standard deviations from the mean, you're just look for an
event that only occurs 95% of the time. A technique that works regardless of
distribution is percentiles. Postgesql supports these now as well.

~~~
sqldba
I wish they went into more detail within the context of what we're discussing.
I've looked at the wiki pages and my eyes glazed over.

Trying to pick out "bad" stats from the gigabytes of information I gather from
the servers I manage is always a top-of-the-list thing that I want to do.

But when we're talking about 95%; well I mean it's fine for showing on a graph
but how can you run alerting for something like that?

------
tmandry
Maybe I'm just unfamiliar with some of the Postgres syntax.. but shouldn't
there be a GROUP BY in the first query?

~~~
hglaser
Thanks! Fixed.

------
sqldba
More please.

~~~
huac
try using moving statistics, looking for outliers in more than 1 dimension,
and so on

