Wikipedia: Latent Semantic Mapping
WWDC 2011 talk, now available: "Latent semantic mapping: exposing the meaning behind words and documents"
Latent Semantic Mapping lets you answer questions like:

* What language is this document written in?
* Is this document about sports?
* Is this news article similar to 50 news articles that I previously marked as "highly interesting"?

You can do all of these tasks in the original document space instead of the latent space, but the advantage of the latent space is that it can capture patterns across the entire corpus. The mapping itself is learned without labels, i.e. by unsupervised learning.

In particular, if I have only 100 training examples (e.g. 10 examples of spam and 90 examples of ham), I will learn a better classifier by first applying LSM and then training my classifier than by training my classifier directly on the original documents. In the former case, unsupervised learning detects patterns over the entire corpus, which I can then exploit to discriminate between spam and ham. In the latter case, I can only use features from the 100 labeled documents, so it is more difficult to generalize. (A sketch of this two-stage setup is below.)
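Here is a minimal sketch of that two-stage setup in Python. It uses scikit-learn's TfidfVectorizer plus TruncatedSVD (classic latent semantic analysis) as a stand-in for LSM, and every document, label, and dimension count below is an invented assumption for illustration:

```python
# Sketch: learn a latent space from the whole corpus (unsupervised), then
# train a classifier on only the few labeled examples, in that latent space.
# TruncatedSVD over TF-IDF is classic LSA, used here as a stand-in for LSM;
# all documents and labels below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# In practice this would be thousands of unlabeled documents.
corpus = [
    "cheap meds, buy now and win a prize",
    "free prize waiting, click now",
    "meeting moved to noon tomorrow",
    "minutes from yesterday's meeting attached",
    "lunch plans for the team this week",
    "limited offer: cheap prize, act now",
]
labeled_docs = corpus[:4]   # the small labeled subset (the "100 examples")
labels = [1, 1, 0, 0]       # 1 = spam, 0 = ham

# Unsupervised stage: fit TF-IDF + SVD on the *entire* corpus; labels unused.
vectorizer = TfidfVectorizer().fit(corpus)
svd = TruncatedSVD(n_components=2).fit(vectorizer.transform(corpus))

# Supervised stage: train on latent features of the labeled documents only.
X_labeled = svd.transform(vectorizer.transform(labeled_docs))
clf = LogisticRegression().fit(X_labeled, labels)

# Classify an unseen document in the latent space.
new_doc = svd.transform(vectorizer.transform(["win cheap meds now"]))
print(clf.predict(new_doc))
```

The point of the split is that the SVD step never sees the labels, so it can benefit from all of the unlabeled text, while the classifier only has to separate two classes in a low-dimensional space.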
But MS Research has some serious heavyweights on board, and IMO it is an excellent institution.
What I've found is that clustering gets a lot more reasonable if you know how many clusters there are.

K-Means requires this information up front, but afaict agglomerative techniques don't: you can instead stop merging once clusters get too far apart, i.e. cut the dendrogram at a distance threshold. I wonder why this tool's agglomerative clustering method requires the number of clusters as an argument.

Since lsm operates in a transformed space, other commonly used stopping criteria, like a cluster-distance threshold, may not be as convenient for the user to express (see the sketch below).
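For contrast, here is a quick sketch of the two stopping criteria using scikit-learn's AgglomerativeClustering, not the lsm tool itself, on made-up 2-D points standing in for documents in a latent space:

```python
# Contrast two stopping criteria for agglomerative clustering.
# Uses scikit-learn on synthetic 2-D blobs; the lsm tool is not involved,
# and the threshold value below is an arbitrary illustrative choice.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Three tight blobs standing in for documents in a latent space.
points = np.vstack([
    rng.normal(loc=c, scale=0.1, size=(20, 2))
    for c in [(0, 0), (3, 0), (0, 3)]
])

# Criterion 1: fix the number of clusters, as lsm's interface requires.
fixed_k = AgglomerativeClustering(n_clusters=3).fit(points)

# Criterion 2: no k at all; stop merging once clusters are farther apart
# than a distance threshold. This is the criterion that is awkward to pick
# in a transformed space, where the user has no feel for the distances.
thresholded = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0
).fit(points)

print(fixed_k.n_clusters_, thresholded.n_clusters_)
```

On well-separated data like this, both criteria recover the same three clusters; the difference is which quantity the user must guess, and in a transformed space a cluster count is usually easier to guess than a distance.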