
Outlier Detection Techniques (2010) [pdf] - olooney
https://archive.siam.org/meetings/sdm10/tutorial3.pdf
======
graycat
The OP has a problem: the algorithms don't give any significant, meaningful
results. E.g., we can run one of the algorithms, and it can report that there
is an outlier or not, and in either case we have no idea whether the answer is
any better than reading tea leaves.

For sorting numbers, an _algorithm_ is appropriate, e.g., heap sort, once we
have checked the algorithm and shown that it actually does sort. For outlier
detection, we have no such simple criterion that the _algorithm_ did anything
meaningful.

For a first step, we can consider the framework of statistical hypothesis
testing, with Type I error, Type II error, and the probability of Type I
error. If we can do some derivations so that we know the probability of Type I
error, then we have a meaningful way to know how seriously to take the result.

E.g., suppose we are receiving points one at a time. Our null hypothesis is
that the points are independent and have the same distribution. Then a Type I
error is saying that a point violates the null hypothesis, i.e., is an
_outlier_, when in fact it does not. A Type II error is saying that the point
satisfies the null hypothesis when it does not.

Generally we want to be able to adjust, set, and know the probability of Type
I error. Then we have an outlier technique with some meaning. If, in addition,
e.g., via the classic Neyman-Pearson result, we can show that for the
probability of Type I error we have selected we get the lowest possible
probability of Type II error, then we have the best possible test.
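As a concrete sketch of this (not from the slides; the particular null and
alternative distributions here are illustrative): test H0: x ~ N(0, 1) against
H1: x ~ N(3, 1). The likelihood ratio is monotone in x, so by Neyman-Pearson
the most powerful test at level alpha rejects when x exceeds the 1 - alpha
quantile of the null distribution.

```python
from statistics import NormalDist

# Illustrative choice of null and alternative; any shifted Gaussian works.
null = NormalDist(0.0, 1.0)
alt = NormalDist(3.0, 1.0)

alpha = 0.05                 # chosen probability of Type I error
c = null.inv_cdf(1 - alpha)  # rejection threshold, roughly 1.645

def is_outlier(x: float) -> bool:
    """Reject H0 (flag x as an outlier) when x exceeds the threshold.
    By construction, P(reject | H0) = alpha."""
    return x > c

# Probability of Type II error (missing a true outlier) under H1:
beta = alt.cdf(c)
```

The point is that alpha is set by us and known exactly, so we know how
seriously to take a flagged point; beta then tells us how often a real outlier
from this particular alternative would slip through.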

The OP is too eager to work with a Gaussian assumption. Too commonly in
practice, that assumption is absurd. Instead, we should be able to work under
a _distribution-free_ assumption: the null hypothesis assumes that all the
data have the same distribution, but we don't know what that distribution is.
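One simple distribution-free sketch (my example, not from the slides): under
the null hypothesis the points are i.i.d. from some unknown common
distribution, so by exchangeability the rank of a new point among n previous
points is uniform. Flagging a new point only when it exceeds all n previous
points then has Type I probability exactly 1/(n + 1), whatever the common
distribution is.

```python
def exceeds_all(history: list[float], x: float) -> bool:
    """Distribution-free outlier flag: reject the null hypothesis that x
    and the history are i.i.d. from an unknown common (continuous)
    distribution when x is larger than every point seen so far.
    Under the null, P(Type I error) = 1 / (len(history) + 1)."""
    return all(x > h for h in history)

# With 19 prior points, flagging the maximum has Type I probability
# 1/20 = 0.05, with no Gaussian (or any other parametric) assumption.
history = [float(i) for i in range(19)]
```

Such rank-based rules trade power for validity: the Type I probability is
known and exact without knowing the distribution, which is the property argued
for above.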

~~~
jeroenjanssens
I spent some time thinking about this. See
[https://github.com/jeroenjanssens/phd-thesis](https://github.com/jeroenjanssens/phd-thesis)

~~~
curiousgal
Tangent: that is _the_ most visually pleasing PhD thesis I have ever come
across! How did you create it, if I may ask?

~~~
throwawaylolx
It's just a regular LaTeX template.

~~~
curiousgal
Which one??

------
lbj
Although I'm usually the last to recommend ML/deep learning, this seems better
solved with neural nets.

------
sophistication
2010 is pre-deep learning revolution/hype, so these slides are outdated. See
e.g. [https://arxiv.org/abs/1709.01907](https://arxiv.org/abs/1709.01907)
[https://arxiv.org/abs/1810.01403](https://arxiv.org/abs/1810.01403)

~~~
tw1010
Have you actually tried to implement outlier detection in production on any
serious level? Most who do realize pretty quickly that deep learning is way
overkill and finicky, and pretty soon head for resources like this.

~~~
sophistication
The method in the first paper is in production, AFAIK. Also, outdated ≠
obsolete.

