For sorting numbers, an algorithm such as heap sort is appropriate once we have checked the algorithm and shown that it actually does sort. For outlier detection, we have no such simple criterion that the algorithm did anything meaningful.
For some first progress, we can consider the framework of statistical hypothesis testing, with its Type I and Type II errors. If we can do some derivations so that we know the probability of Type I error, then we have a meaningful way to know how seriously to take the result.
E.g., suppose we are receiving points one at a time. Our null hypothesis is that the points are independent and have the same distribution. Then a Type I error is saying that a point violates the null hypothesis, i.e., is an outlier, when it does not. A Type II error is saying that the point satisfies the null hypothesis when it does not.
Generally we want to be able to adjust, set, and know the probability of Type I error. Then we have an outlier technique with some meaning. If in addition, e.g., via the classic Neyman-Pearson result, we can show that for the probability of Type I error we have selected we get the lowest possible probability of Type II error, then we have the best possible test.
The OP is too eager to work with a Gaussian assumption. Too commonly in practice, that assumption is absurd. Instead, we should be able to work with a distribution-free assumption, that is, a null hypothesis that says all the data have the same distribution, but we don't know what that distribution is.
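To illustrate that a distribution-free test with a known Type I error probability is possible, here is a minimal sketch (my own toy example, not anything from the slides): flag a new point as an outlier if it exceeds everything seen so far. Under the null hypothesis of i.i.d. points from an unknown continuous distribution, the new point is equally likely to land at any rank, so the Type I error probability is exactly 1/(n+1) with no distributional assumption at all.

```python
import random

def rank_outlier_flag(history, x):
    """Flag x as an outlier if it exceeds every point seen so far.

    Under the null hypothesis that all points are i.i.d. from some
    unknown continuous distribution, x is equally likely to fall at
    any of the len(history) + 1 rank positions, so the probability
    of a Type I error is exactly 1 / (len(history) + 1) -- no
    assumption about the distribution itself.
    """
    return all(x > h for h in history)

# Check the Type I error rate by simulation under the null.
random.seed(0)
n, trials = 19, 100_000
false_alarms = 0
for _ in range(trials):
    pts = [random.random() for _ in range(n + 1)]  # null: all i.i.d.
    if rank_outlier_flag(pts[:n], pts[-1]):
        false_alarms += 1
rate = false_alarms / trials
print(rate)  # should be close to 1 / (n + 1) = 0.05
```

The power of such a rank test against interesting alternatives is a separate, harder question, which is exactly the point of the thread.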
> The OP has a problem: The algorithms don't give any significant, meaningful results.
> The OP is too eager to work with a Gaussian assumption.
demonstrate strong mischaracterization and are supremely superficial. If I may be a little blunt, given your specialization and comment history I have come to expect better.
Compared to the real estate of the slide deck that was spent on non-parametric methods, Gaussian was just a passing mention. This slide deck is supposed to be a resource for getting acquainted with the topic, not a technical publication. Of course there are no proofs, but there's a list of references at the end which I presume do have the proofs. Although I have not personally verified each of them, from the description of the techniques in the slides it's not that hard to imagine how one can give performance guarantees in a non-parametric setting for many of them; for example, the nearest-neighbor-based methods are bog standard in this respect.
Yes, probability distributions exist; generally, though, we can't find them. One of the places we can find some probability distributions is the Poisson process, and, as I posted recently here, there are some qualitative, "axiomatic" criteria and also a limit theorem for justifying a Poisson assumption. So we can get the distribution of times between arrivals, the distribution of the number of arrivals in, say, one hour, and more. The formulas for combinations and permutations permit knowing some more distributions. Still, it's tough to assume Gaussian.
Sorry, having statistics assume Gaussian for real data, e.g., when looking for outliers, and especially when looking for outliers, is a sore point with me. "Especially" because even if a Gaussian looks good on, i.e., eyeballs near, the average and standard deviation, or even passes a goodness-of-fit test, looking for outliers means considering the tails of the Gaussian, and that is just where, with only some real data, we are in a lot of doubt.
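A quick simulation makes the tail point concrete (my own illustration, not from the slides): a standardized Student-t with 4 degrees of freedom has the same mean (0) and standard deviation (1) as a standard normal, so it can "eyeball near" Gaussian on those summaries, yet it lands beyond 3 sigma several times more often than the Gaussian nominal rate of about 0.27%.

```python
import random
import math

def std_t4(rng):
    """Standardized Student-t with 4 degrees of freedom.

    Built as Z / sqrt(V/4) with V chi-square(4), then divided by
    sqrt(2) since Var(t_4) = 4 / (4 - 2) = 2.  Mean 0, sd 1, but
    much heavier tails than a standard normal.
    """
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(4))
    return (z / math.sqrt(v / 4.0)) / math.sqrt(2.0)

rng = random.Random(2)
n = 200_000
exceed = sum(abs(std_t4(rng)) > 3.0 for _ in range(n)) / n
print(exceed)  # Gaussian nominal rate beyond 3 sigma is about 0.0027
```

So a "3 sigma" outlier rule calibrated on the Gaussian assumption badly mis-states the false-alarm rate on such data, even though the first two moments match.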
Getting a hypothesis test from those algorithms will usually mean doing a derivation to find the probability of Type I error, and such a derivation is commonly difficult or not done yet. Sometimes in practice we have more information that permits a good estimate of the probability of Type I error; e.g., for some graveyard humor, wait ten years and see how many patients died.
The rest of what you say is pretty standard stuff and should be elementary knowledge for anyone working seriously in this domain (*). Am I missing something?
> Getting a hypothesis test from those algorithms will usually mean doing a derivation to find the probability of Type I error, and such a derivation is commonly difficult or not done yet
Sure it's difficult; that doesn't mean it's not been done. This is bread and butter in ML and non-parametric stats. In fact, Type I is easy, and Type I alone is nowhere near enough. Setting up an RNG (random number generator) will satisfy the requirements on Type I error trivially. The key is Type II, or in other words the key is the proof that the proposed method is better than the bespoke RNG; stated yet another way, if power is abysmal it counts for squat.
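The RNG point can be made completely concrete with a few lines (a deliberately useless "test", written here just to illustrate the argument): reject with probability alpha while ignoring the data entirely. Its Type I error is exactly alpha by construction, but its power against every alternative is also alpha, i.e., no better than a biased coin flip.

```python
import random

def rng_test(data, alpha, rng):
    """A 'test' that ignores the data and rejects with probability alpha.

    Type I error is exactly alpha by construction, but so is the
    rejection rate under any alternative -- zero usable power.
    Controlling Type I error alone says nothing about usefulness.
    """
    return rng.random() < alpha

rng = random.Random(3)
trials = 100_000
# The data argument is irrelevant: the rejection rate is ~alpha
# whether the sample comes from the null or from an alternative.
rate = sum(rng_test([1.0, 2.0, 99.0], 0.05, rng) for _ in range(trials)) / trials
print(rate)  # close to 0.05 no matter what the data look like
```

This is why a Type I guarantee by itself is a floor, not a result: the interesting claim is always about power.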
(*) Gratuitous assumptions of Gaussianity and incorrect applications of the CLT do not qualify as "working seriously in this domain" for me.
For power, I did mention Neyman-Pearson.
In one paper I wrote on such things, multi-dimensional and distribution-free (the paper was published in Information Sciences), I did some derivations, not easy to do, for the probability of Type I error, i.e., the significance of the test, and used a classic result of S. Ulam, which LeCam called "tightness", to show that the power of the test was better than that of a trivial test. Ulam's result is in P. Billingsley, Convergence of Probability Measures. Billingsley was long at U. Chicago.
Erm, that was exactly the last paragraph of my comment.
In ML, distribution-free multivariate tests have a slightly different flavor. The way those work is that you compare the empirical expectations of all functions in a class (typically a Hilbert space). This is reminiscent of Cramér-Wold. Separability properties of reproducing kernel Hilbert spaces make it tractable to compare these.
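A minimal sketch of that idea, in the style of the kernel MMD two-sample statistic (my own toy version with an RBF kernel, not a reference implementation): the (biased) empirical squared MMD compares the means of the kernel feature maps of the two samples in the RKHS, and is zero in the population exactly when the two distributions agree, for a characteristic kernel like the Gaussian RBF.

```python
import random
import math

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel on the real line."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=0.5):
    """Biased empirical squared Maximum Mean Discrepancy.

    Equals ||mean feature map of xs - mean feature map of ys||^2
    in the RKHS of the kernel: the 'compare empirical expectations
    of all functions in the class' idea, made computable.
    """
    m, n = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (m * m)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (n * n)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy

rng = random.Random(4)
same  = mmd2([rng.gauss(0, 1) for _ in range(200)],
             [rng.gauss(0, 1) for _ in range(200)])
shift = mmd2([rng.gauss(0, 1) for _ in range(200)],
             [rng.gauss(2, 1) for _ in range(200)])
print(same, shift)  # shift should be clearly larger than same
```

In practice the null distribution of the statistic is calibrated by permutation, which again keeps the procedure distribution-free.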
As I mentioned, I got an, admittedly weak, power result using Ulam's tightness. Yes, it's weak, but it's quite general.
If I return to the mathematical statistics of hypothesis testing, I'll look into what you mentioned.
From what you mentioned, it appears that some of ML is actually being mathematical. I continue to believe that computer science is "out of gas" and needs some new directions, and those just about have to come from, with, and in pure/applied math, applied to problems newly important because of computing, but, still, pure/applied math.
BTW, when is the launch? All the best.