
Incredibly interested in your work here. For small-dimensional problems (or problems whose features can be engineered to be small-dimensional), ensemble methods like random forests and bagging are incredibly useful.

But for high-dimensional text problems that are pure classification, I tend to rely simply on 1NN classifiers against a single centroid of each target category's training data (and there tend to be many categories). I've spent a lot of time with NMF, for its potential as an incredibly interesting data-exploration tool ("There's a pronoun cluster! There's a Spanish cluster! There's a 404 Error axis!") or as a low-dimension projection step. I've even spent a good amount of time implementing the algorithm in a number of memory-efficient ways.
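
(For concreteness, the centroid trick I mean is essentially nearest-centroid classification. A minimal Python sketch, assuming documents are already vectorized, say as tf-idf rows; the function name and inputs here are just for illustration:)

    # Sketch of "1NN against per-category centroids" with cosine similarity.
    # x: vector for the new document; centroids: one mean vector per category.
    import numpy as np

    def nearest_centroid(x, centroids, labels):
        # cosine similarity against each category centroid
        sims = [x @ c / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-12)
                for c in centroids]
        return labels[int(np.argmax(sims))]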

Could you expand a bit on how you used NMF for these problems in practice (similar to how a sparse autoencoder captures reduced-dimensional features en route to supervised learning), or how others used ensemble methods?



Afraid it's been a while, and I wasn't really at the core of the project design - if you're REALLY interested, look up _Anomaly Detection Using Nonnegative Matrix Factorization_ and contact Michael W. Berry (who, I assume, still teaches at the University of Tennessee, Knoxville).

The main idea, though, is to generate a term-by-document matrix (count words, maybe throw out stopwords, normalize counts), then do Math to factor your matrix (approximately) into two: term-by-feature and feature-by-document. When you want to classify a new document, you can use its contents (more terms) to calculate a feature vector.
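
(In modern Python terms, a rough sketch with scikit-learn - not what we used; we were in Matlab. Note scikit-learn uses the transposed convention, documents as rows, so you get document-by-feature and feature-by-term instead. The corpus here is a placeholder:)

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = ["first training document", "second training document"]  # your corpus

    vec = TfidfVectorizer(stop_words="english")  # count, drop stopwords, normalize
    X = vec.fit_transform(docs)                  # document-by-term matrix

    model = NMF(n_components=2, init="nndsvd")   # k = number of latent features
    W = model.fit_transform(X)                   # document-by-feature
    H = model.components_                        # feature-by-term

    # Feature vector for a new document, computed from its terms:
    w_new = model.transform(vec.transform(["some new document"]))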

(The math typically involves random initialization followed by iterative refinement; other work in the field covers the specifics.)
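
(For intuition, the classic multiplicative-update scheme of Lee & Seung looks roughly like this toy version, minimizing the Frobenius error ||V - WH||; real implementations add smarter initialization, regularization, and stopping criteria:)

    import numpy as np

    def nmf(V, k, iters=200, eps=1e-10, seed=0):
        # V: nonnegative dense term-by-document array
        rng = np.random.default_rng(seed)
        m, n = V.shape
        W = rng.random((m, k))                    # term-by-feature
        H = rng.random((k, n))                    # feature-by-document
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)  # ratios of nonnegative terms,
            W *= (V @ H.T) / (W @ H @ H.T + eps)  # so entries never go negative
        return W, H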

The matrices are "nonnegative" because, conceptually, features are a _positive_ thing: you can't say that a certain term makes something less a member of a feature cluster (only more).

The tricky part is figuring out how to map features to things which are semantically interesting to your application. I don't want to comment too much on the state of that: it's been five years and I honestly forget exactly what we did there, it was all done in Matlab (which I'd never used before), and there's probably more recent work in the field. But if you fiddle with it manually, you can come up with your matrices and essentially have a nice little classifier.
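
(One quick way to eyeball what a feature "means" is to print its heaviest terms. Continuing the hypothetical scikit-learn sketch above, where vec is the vectorizer and H the feature-by-term matrix:)

    # Top-weighted terms per feature, to see what each one captures
    terms = vec.get_feature_names_out()
    for i, row in enumerate(H):
        top = row.argsort()[::-1][:8]
        print(f"feature {i}:", ", ".join(terms[j] for j in top))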



