

Detecting anomalous senators - reustle
http://jeroenjanssens.com/2013/11/24/stochastic-outlier-selection.html#detecting-anomalous-senators

======
jeroenjanssens
A brief explanation of the algorithm can be found at the top of the page:
[http://jeroenjanssens.com/2013/11/24/stochastic-outlier-sele...](http://jeroenjanssens.com/2013/11/24/stochastic-outlier-selection.html) ;)

~~~
adamtj
That's a really good writeup. Some of the details elude me though. I don't
understand why it's called "stochastic". Is it because of the binding
probability matrix? How is that not just a normalization step?

Am I correct in thinking that what SOS does is to find the datapoint at the
position of lowest flux density, where flux is generated by datapoints and
falls off according to some Gaussian-like function of the dissimilarity
metric?

In other words, it's like you've got 100 senators standing in a
172-dimensional senate chamber, positioned according to their votes, and they're
all shouting. SOS finds the one who suffers the least hearing loss? Except
that it's not exactly like sound. With sound, you could have two big outliers
shouting directly into each other's ears. They're definitely outliers, but
they're still going deaf. The Gaussian is somewhat like specifying a minimum
distance so that halving the distance doesn't double the flux past a certain
point. And it's nice and continuous instead of a hard cutoff. Anyway, is that
an accurate way to think about it?
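
To make the falloff comparison concrete, here is a minimal Python sketch that contrasts a sound-like inverse-square falloff with a Gaussian of the dissimilarity. The function names and the fixed `sigma=1.0` are assumptions for illustration only; they are not taken from the SOS code, and the technical report is the place to check how the variance is actually chosen.

```python
import math

def gaussian_affinity(d, sigma=1.0):
    """Gaussian falloff of a dissimilarity d; bounded above by 1 at d = 0."""
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

def inverse_square(d):
    """Sound-like falloff; grows without bound as d approaches 0."""
    return 1.0 / (d * d)

# Keep halving the distance and compare the two falloff functions.
for d in (2.0, 1.0, 0.5, 0.25, 0.125):
    print(f"d={d:<6} gaussian={gaussian_affinity(d):.4f} "
          f"inverse_square={inverse_square(d):.1f}")
```

Each halving of the distance quadruples the inverse-square value, while the Gaussian levels off toward 1, which is the "minimum distance" effect described above.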

But then, instead of just summing the rows of the affinity matrix and
normalizing like I'd expect, you normalize rows so they sum to 1 (the binding
probability matrix) and multiply row values. My intuition tells me that'll
give the same ordering as summing and normalizing. Is that true? Is
multiplying the rows of the binding probability matrix different in an
important way?
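
One way to probe that intuition is to compute both scores on a small example and compare the rankings. The sketch below is an assumption-laden simplification, not the SOS implementation: the six points are hypothetical, a single fixed `sigma` is used for every point, and the outlier probability is taken to be the product over j != i of (1 - b_ji), which is my reading of the technical report. The "sum" alternative is likewise just one interpretation of the question (summing the same column of binding probabilities).

```python
import math

def affinity_matrix(points, sigma=1.0):
    """Gaussian affinities from squared Euclidean distances; diagonal is 0."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A

def binding_matrix(A):
    """Normalize every row of the affinity matrix so that it sums to 1."""
    B = []
    for row in A:
        total = sum(row)
        B.append([a / total for a in row])
    return B

def outlier_probabilities(B):
    """For point i: product over all j != i of (1 - b_ji)."""
    n = len(B)
    return [math.prod(1.0 - B[j][i] for j in range(n) if j != i)
            for i in range(n)]

def column_sums(B):
    """Alternative score for point i: sum over j of b_ji (low = outlying)."""
    n = len(B)
    return [sum(B[j][i] for j in range(n)) for i in range(n)]

# Hypothetical data: five points close together and one far away.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.8), (1.0, 0.5), (0.8, 1.0), (4.0, 4.0)]
A = affinity_matrix(points)
B = binding_matrix(A)
p_out = outlier_probabilities(B)
col = column_sums(B)

# Most outlying first under each score.
rank_product = sorted(range(len(points)), key=lambda i: -p_out[i])
rank_sum = sorted(range(len(points)), key=lambda i: col[i])
print("ranking by product:", rank_product)
print("ranking by sum:    ", rank_sum)
```

Whether the two rankings always agree is exactly the open question; trying other point sets and values of `sigma` is a cheap way to look for a counterexample.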

I really like the role that the Gaussian plays. That's a new tool for my
toolbox, thanks!

~~~
jeroenjanssens
Thanks! I hope you understand that I had to be brief in the blog post. Since
you seem so interested, I recommend you read the technical report. Do let me
know if you have any questions after that. If you want, you could even
clone the SOS repository and change the Python code to test out your
hypothesis!

------
ErikAugust
Does this mean: Detecting senators with anomalous voting patterns?

And if so, anomalous to what? Party line?

(By the way, killer visualization and design across the board...)

~~~
jeroenjanssens
Thanks :-)

The algorithm is unsupervised, so it doesn't take any labels (i.e., party)
into account. A senator is considered to be an anomaly when their 172 votes are
too dissimilar from those of all the other senators.

------
dirtyvagabond
Fantastic article. And nice demo of Drake. ;-)

------
danso
So I read through the D3 thing...and I still don't understand...what is
considered "anomalous" here?

[http://bl.ocks.org/jeroenjanssens/7608890](http://bl.ocks.org/jeroenjanssens/7608890)

~~~
jeroenjanssens
The brief explanation of SOS (which starts at the top of the blog post) uses a
toy dataset. This toy dataset has 6 data points. Each data point has two
features. In other words, we have 6 data points in a 2-dimensional space.

As for the roll call voting data, the dataset has 103 data points, where each
data point has 172 features. The outlier probability of a senator depends on
the location of its corresponding data point in the 172-dimensional space.
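
To illustrate the shapes involved, here is a small Python sketch. The data is random and purely hypothetical (it is neither the toy dataset from the blog post nor the real roll call votes), and the 0/1 vote encoding and the Euclidean dissimilarity are assumptions for illustration. The point is only that n data points give an n-by-n dissimilarity matrix, however many features each point has.

```python
import math
import random

def dissimilarity_matrix(data):
    """Pairwise Euclidean distances between rows; symmetric, zero diagonal."""
    n = len(data)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(data[i], data[j])
            D[i][j] = D[j][i] = d
    return D

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins with the same shapes as the two datasets.
toy = [[rng.random() for _ in range(2)] for _ in range(6)]
votes = [[rng.choice((0.0, 1.0)) for _ in range(172)] for _ in range(103)]

for name, data in (("toy", toy), ("votes", votes)):
    D = dissimilarity_matrix(data)
    print(f"{name}: {len(data)} x {len(data[0])} data "
          f"-> {len(D)} x {len(D[0])} dissimilarities")
```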

