Detecting anomalous senators (jeroenjanssens.com)
47 points by reustle on Nov 27, 2013 | hide | past | favorite | 10 comments


A brief explanation of the algorithm can be found at the top of the page: http://jeroenjanssens.com/2013/11/24/stochastic-outlier-sele... ;)


That's a really good writeup. Some of the details elude me, though. I don't understand why it's called "stochastic". Is it because of the binding probability matrix? How is that not just a normalization step?

Am I correct in thinking that what SOS does is to find the datapoint at the position of lowest flux density, where flux is generated by datapoints and falls off according to some Gaussian-like function of the dissimilarity metric?

In other words, it's like you've got 100 senators standing in a 172 dimensional senate chamber, positioned according to their votes, and they're all shouting. SOS finds the one who suffers the least hearing loss? Except that it's not exactly like sound. With sound, you could have two big outliers shouting directly into each other's ears. They're definitely outliers, but they're still going deaf. The gaussian is somewhat like specifying a minimum distance so that halving the distance doesn't double the flux past a certain point. And it's nice and continuous instead of a hard cutoff. Anyway, is that an accurate way to think about it?

But then, instead of just summing the rows of the affinity matrix and normalizing like I'd expect, you normalize rows so they sum to 1 (the binding probability matrix) and multiply row values. My intuition tells me that'll give the same ordering as summing and normalizing. Is that true? Is multiplying the rows of the binding probability matrix different in an important way?

I really like the role that the gaussian plays. That's a new tool for my toolbox, thanks!
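The mechanics being discussed above can be sketched in a few lines of NumPy. This is a simplified illustration, not the repository's implementation: it assumes Euclidean dissimilarity and a single fixed sigma, whereas the actual SOS tunes a per-point sigma via a perplexity parameter.

```python
import numpy as np

def sos_sketch(X, sigma=1.0):
    """Simplified Stochastic Outlier Selection (SOS).

    Simplification: a single fixed sigma is used here; the real
    SOS sets a per-point sigma via a perplexity parameter.
    """
    # Pairwise squared Euclidean dissimilarities
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian affinities; a point has no affinity with itself
    A = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Binding probabilities: each row normalized to sum to 1
    B = A / A.sum(axis=1, keepdims=True)
    # Outlier probability of point j: the product over i of
    # (1 - b_ij), i.e. the chance that no other point binds to it
    return np.prod(1.0 - B, axis=0)

# Five points: a tight cluster plus one far-away point
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
print(sos_sketch(X))  # the last point gets the highest outlier probability
```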


Thanks! I hope you understand that I had to be brief in the blog post. Since you seem so interested, I recommend you read the technical report. Do let me know if you have any questions after that. If you want, you could even clone the SOS repository and change the Python code to test out your hypothesis!


Nice write up and graphics! If you don't mind me asking, what software did you use to create these?


Thanks! For the graphics I used TikZ, which is a LaTeX package. The TikZ code for all the figures in my Ph.D. thesis is available on GitHub: https://github.com/jeroenjanssens/phd-thesis/tree/master/fig... and can be compiled using tikz2pdf: https://github.com/jeroenjanssens/tikz2pdf You can see a whole lot of other examples at http://www.texample.net/tikz/examples/all/

TikZ allows for high-quality graphics, but if you're not into LaTeX, the learning curve can be quite steep!


So I read through the D3 thing...and I still don't understand...what is considered "anomalous" here?

http://bl.ocks.org/jeroenjanssens/7608890


The brief explanation of SOS (which starts at the top of the blog post) uses a toy dataset. This toy dataset has 6 data points. Each data point has two features. In other words, we have 6 data points in a 2-dimensional space.

As for the roll call voting data, the dataset has 103 data points, where each data point has 172 features. The outlier probability of a senator depends on the location of its corresponding data point in the 172-dimensional space.


Does this mean: Detecting senators with anomalous voting patterns?

And if so, anomalous to what? Party line?

(By the way, killer visualization and design across the board...)


Thanks :-)

The algorithm is unsupervised, so it doesn't take any labels (i.e., party) into account. A senator is considered to be an anomaly when its 172 votes are too dissimilar from those of all the other senators.
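For concreteness, here is a hypothetical encoding of roll-call votes as vectors. The senator names and the yea/nay/not-voting coding are illustrative assumptions, not the post's actual data; the point is that only pairwise dissimilarities between vote vectors enter the computation, never party labels.

```python
import numpy as np

# Hypothetical encoding (illustrative only): yea = 1, nay = -1, not voting = 0
votes = {
    "Senator A": [1, 1, -1, 1],
    "Senator B": [1, 1, -1, 1],
    "Senator C": [-1, -1, 1, -1],
}
X = np.array(list(votes.values()), dtype=float)

# Pairwise Euclidean dissimilarities; no party labels are involved
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(D)  # A and B voted identically, so D[0, 1] == 0
```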


Fantastic article. And nice demo of Drake. ;-)



