Wikipedia: Latent Semantic Mapping
WWDC 2011 talk, now available: "Latent Semantic Mapping: Exposing the Meaning behind Words and Documents"
You can perform all of these tasks in the original document space rather than the latent space, but the latent space has the advantage of capturing patterns across the entire corpus. This is a form of unsupervised learning.
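To make "latent space" concrete, here is a minimal numpy sketch of the underlying idea (an LSA-style truncated SVD of a term-document matrix, not Apple's LSM API): the corpus, the vocabulary, and the two-topic structure below are all hypothetical illustrations.

```python
import numpy as np

# Toy corpus: two "spam"-flavored and two "meeting"-flavored documents.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday meeting notes attached",
]

# Build a vocabulary and a term-document count matrix A (terms x docs).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# Truncated SVD: keep only k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Each row of doc_latent is one document's coordinates in the latent space.
doc_latent = (np.diag(s[:k]) @ Vt[:k]).T  # shape (n_docs, k)

def cos(a, b):
    # Cosine similarity between two latent vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents about the same topic land close together in the latent space,
# even when they do not share every word.
print(cos(doc_latent[0], doc_latent[1]))  # same-topic pair
print(cos(doc_latent[0], doc_latent[2]))  # cross-topic pair
```

The key point is that the SVD is computed from the whole corpus with no labels at all; the latent coordinates exist before any classification task is defined.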
In particular, if I have only 100 training examples (say, 10 examples of spam and 90 examples of ham), I will learn a better classifier by first applying LSM and then training in the latent space than by training directly on the original documents. In the former case, unsupervised learning detects patterns across the entire corpus, which I can then use to discriminate between spam and ham; in the latter case, I can only use features from the 100 labeled documents, so it is harder to generalize.
* What language is this document?
* Is this document about sports?
* Is this news article similar to 50 news articles that I previously marked as "highly interesting"?
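The spam/ham scenario above can be sketched as follows. This is a hypothetical illustration, not Apple's LSM framework: it builds a latent space from a few labeled toy documents, "folds in" a new document by projecting it onto the latent axes, and classifies it by cosine similarity to the labeled examples.

```python
import numpy as np

# A small labeled corpus (hypothetical examples).
labeled = [
    ("buy cheap pills now", "spam"),
    ("cheap watches buy now", "spam"),
    ("monday meeting agenda", "ham"),
    ("notes from monday meeting", "ham"),
]

docs = [d for d, _ in labeled]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def bow(text):
    # Bag-of-words count vector; out-of-vocabulary words are ignored.
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1
    return v

# Term-document matrix and its truncated SVD.
A = np.stack([bow(d) for d in docs], axis=1)  # terms x docs
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def fold_in(text):
    # "Folding in": project a new document into the existing latent space.
    return (Uk.T @ bow(text)) / sk

def classify(text):
    # Label the query by its nearest labeled document in the latent space.
    q = fold_in(text)
    best, best_sim = None, -np.inf
    for d, label in labeled:
        v = fold_in(d)
        sim = q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12)
        if sim > best_sim:
            best, best_sim = label, sim
    return best

print(classify("cheap pills"))        # nearest neighbors are spam examples
print(classify("meeting on monday"))  # nearest neighbors are ham examples
```

In a real corpus the SVD would be computed over many unlabeled documents, so the latent axes capture corpus-wide patterns that the handful of labeled examples alone could not reveal; the same fold-in-and-compare step also answers the "similar to my 50 interesting articles" question.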