
The lsm command for Latent Semantic Mapping - lars512
http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man1/lsm.1.html
======
lars512
Latent semantic mapping is a technique which takes a large number of text
documents, maps them to term frequency vectors (vector-space semantics), and
performs dimensionality reduction into a smaller semantic space. This then
lets you determine how similar in meaning different documents are. You can use
this for a variety of tasks.
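
The pipeline described above can be sketched in a few lines of NumPy. This is only an illustration of the underlying idea, not Apple's lsm tool; the toy corpus and the choice of k=2 latent dimensions are arbitrary:

```python
# Sketch of latent semantic mapping with plain NumPy: build a
# term-frequency matrix, reduce it with a truncated SVD, then compare
# documents by cosine similarity in the reduced ("semantic") space.
import numpy as np

docs = [
    "the cat sat on the mat",
    "a cat and a dog",
    "stocks fell as markets slid",
    "markets rallied and stocks rose",
]

# Term-frequency vectors (vector-space semantics).
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# Dimensionality reduction via SVD, keeping k latent dimensions.
U, s, Vt = np.linalg.svd(tf, full_matrices=False)
k = 2
latent = U[:, :k] * s[:k]   # each row: one document in the semantic space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents on the same topic end up closer together in the latent space.
print(cosine(latent[2], latent[3]))   # two finance documents
print(cosine(latent[0], latent[2]))   # pet document vs. finance document
```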

Wikipedia: Latent Semantic Mapping
<http://en.wikipedia.org/wiki/Latent_semantic_mapping>

WWDC 2011 talk, now available: "Latent semantic mapping: exposing the meaning
behind words and documents" <https://developer.apple.com/videos/wwdc/2011/>

~~~
tswicegood
I would really be interested in some use-cases. The examples they give are
fairly limited.

~~~
bravura
Classification (e.g. spam detection) and document categorization, as well as
clustering similar documents.

You can do all these tasks in the original document space, instead of in the
latent space, but the advantage of the latent space is that it can capture
patterns across the entire corpus. The dimensionality reduction step is a form
of _unsupervised learning_.

In particular, if I have only 100 training examples (e.g. 10 examples of spam
and 90 examples of ham), I will learn a better classifier if I first use LSM
and then train my classifier, than if I train my classifier over the original
documents. In the former case, unsupervised learning detects patterns over the
entire corpus, which I use to discriminate between spam and ham. In the latter
case, I can only use features from the 100 labeled documents, so it is more
difficult to generalize.
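
A rough sketch of that setup, in plain NumPy rather than lsm (the toy term-frequency matrix and the nearest-centroid rule are my own illustrative choices): fit the SVD on the whole corpus, labeled and unlabeled rows alike, then classify using only the labeled rows.

```python
import numpy as np

# Toy corpus: rows are term-frequency vectors. Only two rows are
# labeled; the unlabeled rows still help shape the latent space,
# which is the point of doing unsupervised learning first.
tf = np.array([
    [3, 1, 0, 0],   # labeled "spam"
    [0, 0, 2, 3],   # labeled "ham"
    [2, 2, 0, 0],   # unlabeled
    [0, 1, 3, 2],   # unlabeled
    [2, 1, 0, 0],   # query document to classify
], float)

# Fit the SVD on the *entire* corpus (labeled + unlabeled + query).
U, s, _ = np.linalg.svd(tf, full_matrices=False)
latent = U[:, :2] * s[:2]

# Trivial classifier over the labeled rows: nearest centroid,
# where each "centroid" is just the single labeled example here.
centroids = {"spam": latent[0], "ham": latent[1]}
query = latent[4]
label = min(centroids, key=lambda c: np.linalg.norm(query - centroids[c]))
print(label)
```

With more data, each centroid would be the mean of many labeled rows in the latent space; the structure of the example stays the same.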

More examples:

* What language is this document?

* Is this document about sports?

* Is this news article similar to 50 news articles that I previously marked as "highly interesting"?

------
wooster
We used this, when I was at Apple, to make the Parental Controls web content
filter (which I worked on), among other things. It works surprisingly well.

------
spitfire
I just can't ever see Microsoft shipping something like this to every user.
This sort of quiet progress is why I like Apple. Sure, they
highlight the glossy stuff, but below the surface there's so much blood and
guts progress.

~~~
jules
While Microsoft may not be shipping an LSM program to every user, they are
doing a ton of scientific research, much more than Apple. For example, googling
"latent semantic analysis microsoft research" turns up several research papers
on the topic by Microsoft. They do cutting edge research comparable to a good
university on a large number of topics like programming language design,
compilers, machine learning, distributed computing, graphics, automated
theorem proving, etc.

~~~
spitfire
Microsoft Research does some amazing things. But, like Xerox PARC, they never
seem to ship anything. The Kinect is the exception; that's been a reasonable
success.

------
pepijndevos
So is anything like this available on other platforms? Because it's way faster
than <http://classifier.rubyforge.org/>, even with rb-gsl installed. I'd love
it for generating related posts on my Jekyll blog.

------
yters
How have you used this? Looks pretty interesting.

------
samg_
I've been playing with some clustering stuff in my free time for the past few
months.

What I've found is that the problem seems to get a lot more reasonable if you
know how many clusters there are.

K-Means requires this information, but afaict agglomerative techniques don't.
I wonder why this tool's agglomerative clustering method requires the number
of clusters as an argument.

~~~
microtherion
You're right that agglomerative clustering (unlike K-means) does not
inherently need to know the # of clusters in advance. However, it still needs
some sort of termination criterion, and # of clusters is one possible
criterion.

Since lsm operates in a transformed space, other commonly used criteria like
cluster distance may not be as convenient for the user to express.

------
codeape
Is it available on Linux?

~~~
rozim
I've used this LDA code <http://code.google.com/p/plda/> on multiple systems;
however, it's not really as well packaged as the Apple code seems to be.

