

Latent Semantic Analysis Tutorial - rouli
http://www.puffinwarellc.com/index.php/news-and-articles/articles/33.html?showall=1

======
beagle3
A state-of-the-art implementation that uses random projection to stay
reasonably accurate while running hundreds to thousands of times faster:
<http://radimrehurek.com/gensim/>

Random projection is something you should be aware of if you do any kind of
high-dimensional modeling. It _is_ magic.
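
A minimal NumPy sketch of why it works (the matrix sizes, seed, and target
dimension below are made up for illustration): multiplying by a random
Gaussian matrix approximately preserves pairwise distances, per the
Johnson-Lindenstrauss lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matrix: 500 "documents" in 10,000 dimensions (made-up sizes).
X = rng.random((500, 10000))

# Random projection: multiply by a random Gaussian matrix to drop from
# 10,000 dimensions to 300. Pairwise distances are approximately
# preserved with high probability.
k = 300
R = rng.standard_normal((10000, k)) / np.sqrt(k)
X_low = X @ R  # shape (500, 300)

# Distances before and after projection stay close:
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(X_low[0] - X_low[1])
print(d_orig, d_proj)
```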

~~~
vedrana
Also, random indexing, which in its theoretical essence is the same thing:
<http://www.sics.se/~mange/random_indexing.html>
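
A hedged sketch of the incremental flavor of that idea (the dimension,
non-zero count, and window size are made up): each context gets a sparse
random index vector, and word vectors are accumulated token by token.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, nnz = 300, 10  # index-vector dimension and non-zero count (made up)

def index_vector():
    # Sparse ternary random vector: a few +1/-1 entries, the rest zero.
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nnz, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=nnz)
    return v

# Accumulate context vectors incrementally, one token at a time; no
# global co-occurrence matrix and no batch SVD step, unlike plain LSA.
tokens = "the cat sat on the mat".split()
index = {w: index_vector() for w in set(tokens)}
context = {w: np.zeros(dim) for w in set(tokens)}
for i, w in enumerate(tokens):
    for j in (i - 1, i + 1):  # a one-word window on each side
        if 0 <= j < len(tokens):
            context[w] += index[tokens[j]]
print(context["cat"][:5])
```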

------
textminer
Truncated SVD has been a wonderful tool for "cleaning up" pairwise cosine
similarity data: for text document comparisons, for graph/network building (a
visual representation of entities represented by documents, embedded in
something like Gephi/Sigma.js/D3), and for item-based recommendation systems.

The biggest problem I then run into involves choosing a "k" (the number of
dimensions kept in your truncation). I've had some thoughts about training
this unsupervised method (providing labeled data for what "oughta" be the top
nearest neighbors for a particular entity, and optimizing toward that) or
building an ensemble method on top of many truncated SVD vector spaces
(though the combination method is unclear to me: pick kNN from a linear
combination of each model's outcomes? Pick the intersection of each method's k
nearest neighbors?)
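
For concreteness, here is a small sketch of the "truncate, then take cosines"
pipeline using scikit-learn's TruncatedSVD (which accepts SciPy sparse input).
The documents and the k value are placeholders; picking k for a real corpus is
exactly the open question above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "latent semantic analysis of text"]

# Sparse tf-idf term-document matrix.
X = TfidfVectorizer().fit_transform(docs)

# Truncate to k dimensions; choosing k is the hard, mostly empirical part.
k = 2  # placeholder; real corpora often use a few hundred
svd = TruncatedSVD(n_components=k, random_state=0)
X_k = svd.fit_transform(X)

# Pairwise cosine similarities in the reduced ("cleaned up") space.
print(cosine_similarity(X_k))
```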

To novices looking at this tutorial: NumPy is a wonderful tool for small toy
examples, but at a certain scale you will depend heavily on the sparse matrix
formats provided by SciPy. (That and random projections should curb any memory
problems you'll run into for most vector-space-based problems, short of
operating at Google/Yahoo scale or churning through terabytes of logging
data.)
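
As a quick illustration of why that matters (sizes made up), a mostly-zero
matrix stored in SciPy's CSR format only pays for its non-zero entries:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero term-document matrix stored densely wastes memory;
# CSR stores only the non-zero entries.
dense = np.zeros((1000, 5000))
dense[0, 10] = 1.0
dense[3, 42] = 2.0

sparse = csr_matrix(dense)
print(dense.nbytes)        # ~40 MB for the dense array
print(sparse.data.nbytes)  # a handful of bytes for the non-zeros
```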

~~~
ninjin
Visualising multi-dimensional data is indeed tricky. The best tool I have
found so far for word representations similar to LSA has been t-SNE:

<http://homepage.tudelft.nl/19j49/t-SNE.html>

In some cases what I get isn't all that much better than simply using PCA, but
overall t-SNE is superior, although it is dreadfully slow... Below is a link
to an implementation used for text, and I can highly recommend the original
paper on t-SNE:

<https://github.com/turian/textSNE>
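
The linked textSNE repo is one option; as a hedged sketch of the same idea,
scikit-learn also ships a TSNE estimator, and a common trick for the slowness
is to reduce with PCA first. The data below is a random stand-in for word
vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((200, 50))  # stand-in for 200 word vectors of dimension 50

# Reduce with PCA first, since t-SNE is slow in high dimensions.
X_pca = PCA(n_components=30).fit_transform(X)

# t-SNE down to 2-D for plotting; perplexity is a knob worth tuning.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print(X_2d.shape)  # (200, 2)
```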

------
elchief
FYI

What are the differences among latent semantic analysis (LSA), latent semantic
indexing (LSI), and singular value decomposition (SVD)?

<http://stats.stackexchange.com/questions/4735/what-are-the-differences-among-latent-semantic-analysis-lsa-latent-semantic-i>

------
kleiba
Also: "LSA was patented in 1988 (US Patent 4,839,853) by Scott Deerwester,
Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum
and Lynn Streeter."

<https://en.wikipedia.org/wiki/Latent_semantic_analysis>

~~~
nilsbunger
So the patent is expired now, right?

------
jackhammer2022
Also, here is a nice paper to get started with LSI:
<http://www2.denizyuret.com/ref/berry/berry95using.pdf>

