Hacker News
The lsm command for Latent Semantic Mapping (apple.com)
74 points by lars512 on June 24, 2011 | 18 comments

Latent semantic mapping is a technique which takes a large number of text documents, maps them to term frequency vectors (vector-space semantics), and performs dimensionality reduction into a smaller semantic space. This then lets you determine how similar in meaning different documents are. You can use this for a variety of tasks.
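To make the pipeline above concrete, here is a minimal pure-Python sketch: count terms per document, find a small number of latent directions, then compare documents by cosine similarity in the reduced space. It is an illustration under assumptions, not what Apple's lsm does internally: the toy documents, the helper names (`term_frequency_matrix`, `latent_basis`, `project`) and the use of power iteration in place of a full SVD are all made up for this example.

```python
import math
import random
from collections import Counter


def term_frequency_matrix(docs):
    """Map each document to a term-count vector over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[float(c[term]) for term in vocab] for c in counts]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cosine(u, v):
    denom = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / denom if denom else 0.0


def latent_basis(rows, k, iters=200, seed=0):
    """Approximate the top-k right singular vectors of the document-term
    matrix A by power iteration on A^T A, deflating against earlier vectors."""
    rng = random.Random(seed)
    basis = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in rows[0]]
        for _ in range(iters):
            for b in basis:  # stay orthogonal to directions already found
                proj = dot(v, b)
                v = [x - proj * y for x, y in zip(v, b)]
            doc_scores = [dot(row, v) for row in rows]        # A v
            v = [dot(col, doc_scores) for col in zip(*rows)]  # A^T (A v)
            norm = math.sqrt(dot(v, v))
            if norm == 0.0:  # k exceeds the rank of the matrix
                break
            v = [x / norm for x in v]
        basis.append(v)
    return basis


def project(row, basis):
    """Coordinates of a term vector in the reduced (latent) space."""
    return [dot(row, b) for b in basis]


# Hypothetical toy corpus: two documents about cats, two about markets.
docs = [
    "the cat sat on the mat",
    "a cat sat with a kitten",
    "stocks fell as markets closed lower",
    "markets rallied while stocks rose",
]
vocab, rows = term_frequency_matrix(docs)
basis = latent_basis(rows, k=2)
latent = [project(row, basis) for row in rows]

raw_same = cosine(rows[0], rows[1])          # similarity in term space
latent_same = cosine(latent[0], latent[1])   # same pair, latent space
latent_diff = cosine(latent[0], latent[2])   # cat document vs. market document
print(raw_same, latent_same, latent_diff)
```

In this toy corpus the two cat documents share only two words, so they look fairly dissimilar in the raw term space, while in the two-dimensional latent space they land on the same axis and the market documents land on the other.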

Wikipedia: Latent Semantic Mapping http://en.wikipedia.org/wiki/Latent_semantic_mapping

WWDC 2011 talk, now available: "Latent semantic mapping: exposing the meaning behind words and documents" https://developer.apple.com/videos/wwdc/2011/

I would really be interested in some use-cases. The examples they give are fairly limited.

Classification (e.g. spam detection) and document categorization, as well as clustering similar documents.

You can do all these tasks in the original document space, instead of in the latent space, but the advantage of the latent space is that it can capture patterns across the entire corpus. Because finding those patterns requires no labels, this step is unsupervised learning.

In particular, if I have only 100 training examples (e.g. 10 examples of spam and 90 examples of ham), I will learn a better classifier if I first use LSM and then train my classifier, than if I train my classifier over the original documents. In the former case, unsupervised learning detects patterns over the entire corpus, which I use to discriminate between spam and ham. In the latter case, I can only use features from the 100 labeled documents, so it is more difficult to generalize.

More examples:

* What language is this document?

* Is this document about sports?

* Is this news article similar to 50 news articles that I previously marked as "highly interesting" ?
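The spam/ham argument above can be sketched in a few lines. This is a toy illustration under assumptions, not a benchmark and not Apple's implementation: the documents are invented, the latent space comes from a simple power iteration rather than a full SVD, and the "classifier" is just nearest labeled example by cosine similarity. The point it shows is that a query sharing no words with any labeled example is undecidable in the raw term space, but becomes classifiable once the latent space has been learned from the whole corpus, unlabeled documents included.

```python
import math
import random
from collections import Counter


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cosine(u, v):
    denom = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / denom if denom else 0.0


def latent_basis(rows, k, iters=200, seed=0):
    """Top-k latent directions via power iteration on A^T A with deflation."""
    rng = random.Random(seed)
    basis = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in rows[0]]
        for _ in range(iters):
            for b in basis:
                proj = dot(v, b)
                v = [x - proj * y for x, y in zip(v, b)]
            scores = [dot(row, v) for row in rows]
            v = [dot(col, scores) for col in zip(*rows)]
            norm = math.sqrt(dot(v, v))
            if norm == 0.0:
                break
            v = [x / norm for x in v]
        basis.append(v)
    return basis


# Hypothetical data: two labeled examples, four unlabeled documents.
labeled = [("cheap pills", "spam"), ("meeting agenda", "ham")]
unlabeled = [
    "cheap pills viagra offer",
    "viagra offer cheap",
    "meeting agenda lunch tomorrow",
    "lunch tomorrow meeting",
]
query = "viagra offer"  # shares no words with either labeled example

corpus = [text for text, _ in labeled] + unlabeled
vocab = sorted({word for doc in corpus for word in doc.lower().split()})


def raw_vector(doc):
    counts = Counter(doc.lower().split())
    return [float(counts[term]) for term in vocab]


# The latent space is learned from ALL documents, labeled or not.
basis = latent_basis([raw_vector(doc) for doc in corpus], k=2)


def latent_vector(doc):
    return [dot(raw_vector(doc), b) for b in basis]


def similarity_to_labels(vectorize):
    """Cosine similarity of the query to each labeled example."""
    q = vectorize(query)
    return {label: cosine(q, vectorize(text)) for text, label in labeled}


raw_scores = similarity_to_labels(raw_vector)
latent_scores = similarity_to_labels(latent_vector)
print("raw:", raw_scores)
print("latent:", latent_scores)
```

The unlabeled documents are what link "viagra" and "offer" to "cheap" and "pills"; without them the query has nothing in common with the labeled spam example.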

Well, for example, how about sorting out a lot of PDF documents I have in a folder called papers/? I do use Mendeley now, but there are some leftovers from before that I really don't want to sit and sort through (not to mention the fact that I probably have multiple copies of some of them).

Curriculum review committees could reduce redundancy and fill gaps by reviewing course documents.

We used this, when I was at Apple, to make the Parental Controls web content filter (which I worked on), among other things. It works surprisingly well.

I just can't ever see Microsoft shipping something like this available to every user. This sort of quiet progress is why I like Apple. Sure they highlight the glossy stuff, but below the surface there's so much blood and guts progress.

While Microsoft may not be shipping an LSM program to every user, they are doing a ton of scientific research, much more than Apple. For example, googling "latent semantic analysis microsoft research" turns up several research papers on the topic by Microsoft. They do cutting-edge research comparable to a good university on a large number of topics like programming language design, compilers, machine learning, distributed computing, graphics, automated theorem proving, etc.

Microsoft Research does some amazing things. But like Xerox PARC they never seem to ship anything. Except for the Kinect, that's been a reasonable success.

Actually, MS SQL Server has shipped with a set of data mining algorithms for some time now.



afaik these are only in the paid versions.

No, MS doesn't ship this stuff with Windows.

But MS Research has some heavyweight chops on board, and IMO is an excellent institution.

So is anything like this available on other platforms? Because it's way faster than http://classifier.rubyforge.org/, even with rb-gsl installed. I'd love it for generating related posts on my Jekyll blog.

How have you used this? Looks pretty interesting.

I've been playing with some clustering stuff in my free time for the past few months.

What I've found is that the problem seems to get a lot more reasonable if you know how many clusters there are.

K-Means requires this information, but afaict agglomerative techniques don't. I wonder why this tool's agglomerative clustering method requires the number of clusters as an argument.

You're right that agglomerative clustering (unlike K-means) does not inherently need to know the # of clusters in advance. However, it still needs some sort of termination criterion, and # of clusters is one possible criterion.

Since lsm operates in a transformed space, other commonly used criteria like cluster distance may not be as convenient for the user to express.
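A small sketch of the two termination criteria discussed above, assuming average linkage and Euclidean distance. Both choices, the function names and the toy points are assumptions for illustration; nothing here is taken from the lsm tool itself.

```python
import math


def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, n_clusters=None, max_distance=None):
    """Repeatedly merge the two closest clusters until a criterion is met.

    Stops when only n_clusters remain, or when the closest pair of clusters
    is farther apart than max_distance. At least one must be given.
    """
    if n_clusters is None and max_distance is None:
        raise ValueError("need a termination criterion")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        distance, i, j = min(
            (average_linkage(clusters[i], clusters[j], points), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if max_distance is not None and distance > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-D points: a tight group, a pair, and an outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.3, 5.0), (9.0, 0.0)]

by_count = agglomerate(points, n_clusters=3)
by_distance = agglomerate(points, max_distance=1.0)
print(by_count)
print(by_distance)
```

Picking `max_distance=1.0` only works here because the scale of the coordinates is known. In a latent space the user has no obvious feel for what a "large" distance is, which is the convenience argument for asking for a cluster count instead.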

Is it available on Linux?

I've used this LDA code http://code.google.com/p/plda/ on multiple systems; however, it's not really as well packaged as the Apple code seems to be.
