
Incredibly interested in your work here. For small-dimensional problems (or problems whose features can be engineered to be small-dimensional), ensemble methods like random forests and bagging are incredibly useful.

But for high-dimensional text problems that are pure classification, I tend to rely simply on 1NN classifiers against a single centroid of each target category's training data (and there tend to be many categories). I've spent a lot of time with NMF, for its potential as an incredibly interesting data-exploration tool ("There's a pronoun cluster! There's a Spanish cluster! There's a 404 Error axis!") or as a low-dimension projection step. I've even spent a good amount of time implementing the algorithm in a number of memory-efficient ways.
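
(For concreteness, the centroid trick I mean is essentially nearest-centroid classification. A minimal Python sketch, assuming documents are already vectorized, say as tf-idf rows; the function name and inputs here are just for illustration:)

    # Sketch of "1NN against per-category centroids" with cosine similarity.
    # x: vector for the new document; centroids: one mean vector per category.
    import numpy as np

    def nearest_centroid(x, centroids, labels):
        # cosine similarity against each category centroid
        sims = [x @ c / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-12)
                for c in centroids]
        return labels[int(np.argmax(sims))]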

Could you expand a bit on how you used NMF for these problems in practice (similar to how a sparse autoencoder captures reduced-dimensional features en route to supervised learning), or how others used ensemble methods?



Afraid it's been a while, and I wasn't really at the core of the project design - if you're REALLY interested, look up _Anomaly Detection Using Nonnegative Matrix Factorization_ and contact Michael W. Berry (who, I assume, still teaches at the University of Tennessee, Knoxville).

The main idea, though, is to generate a term-by-document matrix (count words, maybe throw out stopwords, normalize counts), then do Math to factor your matrix (approximately) into two: term-by-feature and feature-by-document. When you want to classify a new document, you can use its contents (more terms) to calculate a feature vector.
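
(In modern Python terms, a rough sketch with scikit-learn - not what we used; we were in Matlab. Note scikit-learn uses the transposed convention, documents as rows, so you get document-by-feature and feature-by-term instead. The corpus here is a placeholder:)

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = ["first training document", "second training document"]  # your corpus

    vec = TfidfVectorizer(stop_words="english")  # count, drop stopwords, normalize
    X = vec.fit_transform(docs)                  # document-by-term matrix

    model = NMF(n_components=2, init="nndsvd")   # k = number of latent features
    W = model.fit_transform(X)                   # document-by-feature
    H = model.components_                        # feature-by-term

    # Feature vector for a new document, computed from its terms:
    w_new = model.transform(vec.transform(["some new document"]))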

(The math typically involves random initialization followed by iterative refinement; other work in the field covers the specifics.)
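
(For intuition, the classic multiplicative-update scheme of Lee & Seung looks roughly like this toy version, minimizing the Frobenius error ||V - WH||; real implementations add smarter initialization, regularization, and stopping criteria:)

    import numpy as np

    def nmf(V, k, iters=200, eps=1e-10, seed=0):
        # V: nonnegative dense term-by-document array
        rng = np.random.default_rng(seed)
        m, n = V.shape
        W = rng.random((m, k))                    # term-by-feature
        H = rng.random((k, n))                    # feature-by-document
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)  # ratios of nonnegative terms,
            W *= (V @ H.T) / (W @ H @ H.T + eps)  # so entries never go negative
        return W, H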

The matrices are "nonnegative" because, conceptually, features are a _positive_ thing: you can't say that a certain term makes something less a member of a feature cluster (only more).

The tricky part is figuring out how to map features to things which are semantically interesting to your application. I don't want to comment too much on the state of that: it's been five years and I honestly forget exactly what we did there, it was all done in Matlab (which I'd never used before), and there's probably more recent work in the field. But if you fiddle with it manually, you can come up with your matrices and essentially have a nice little classifier.
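
(One quick way to eyeball what a feature "means" is to print its heaviest terms. Continuing the hypothetical scikit-learn sketch above, where vec is the vectorizer and H the feature-by-term matrix:)

    # Top-weighted terms per feature, to see what each one captures
    terms = vec.get_feature_names_out()
    for i, row in enumerate(H):
        top = row.argsort()[::-1][:8]
        print(f"feature {i}:", ", ".join(terms[j] for j in top))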



