

Christmas Carol and other eigenvectors - spiffytech
http://evelinag.com/blog/2014/12-15-christmas-carol-and-other-eigenvectors/index.html#.VI9OjOpGjUY

======
jordigh
I wish everyone would stop saying that PCA is about eigenvectors and
eigenvalues... it's about _singular_ vectors and _singular_ values. The best
way to compute PCA is _NOT_ to compute the eigenvalues of A'*A. There are
specialised SVD algorithms that should be used instead:

[https://en.wikipedia.org/wiki/Singular_value_decomposition#C...](https://en.wikipedia.org/wiki/Singular_value_decomposition#Calculating_the_SVD)

Randomised methods have also gotten popular in recent years:

[https://code.google.com/p/redsvd/](https://code.google.com/p/redsvd/)

[http://arxiv.org/abs/0909.4061](http://arxiv.org/abs/0909.4061)

This is one case where the machine learning community takes a numerical
analysis idea and by giving it a different name also loses some of the
insights associated with the other name. Normally I don't care if they call it
"training" instead of "line search" or "features" instead of "dimensions", but
PCA should be using better algorithms. Instead of telling everyone to use the
inferior route of generic eigenvalue solvers, point them at the better
specialised methods.

I know that finding eigenvalues of A'A is "good enough", and I guess
"eigenvalue" is already a big enough word for most programmers, but there's no
reason not to put the better methods in specialised libraries and make those
libraries available. If nothing else, LAPACK and ARPACK have bindings for
almost every language.
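To make the contrast concrete, here's a minimal numpy sketch (toy data of my
own, not from the article). On a small well-conditioned matrix the two routes
agree, but forming A'A squares the condition number, which is exactly why the
direct SVD is preferred:

```python
import numpy as np

# Toy stand-in for a letter-pair frequency matrix: 6 samples, 3 features.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
A = A - A.mean(axis=0)                      # centre each column

# Preferred route: SVD of the centred data matrix itself.
# Columns of Vt.T are the principal axes; S**2/(n-1) are the PC variances.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
pc_variances = S ** 2 / (A.shape[0] - 1)

# "Textbook" route: eigenvalues of the covariance matrix A'A/(n-1).
# Same answer on easy data, but forming A'A squares the condition number.
evals = np.linalg.eigvalsh(A.T @ A / (A.shape[0] - 1))[::-1]

print(np.allclose(pc_variances, evals))     # True: the two routes agree here
```

With badly scaled data the eigenvalue route can lose roughly half the
available precision, while the SVD operates on A directly.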

------
michaelbarton
N-grams are also regularly used in bioinformatics. For example, you can split
a genome sequence into words of length k, which we call 'k-mers'. You can use
a k-mer frequency table as a heuristic wherever performing a full sequence
analysis is too computationally expensive. For example, if two genomes A <-> B
have a smaller Euclidean distance between their respective k-mer tables than
A <-> C, you might assume that A and B are more closely related in
evolutionary distance. This is just one example; k-mers are widely used
throughout the bioinformatics field. For instance, the Wikipedia entry for
Velvet has a description of how they can be used for genome assembly:

[http://en.wikipedia.org/wiki/Velvet_assembler](http://en.wikipedia.org/wiki/Velvet_assembler)
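A minimal sketch of the distance heuristic described above, with toy sequences
of my own invention (plain Python, no bioinformatics library assumed):

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=3):
    """Frequency table of overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def euclidean(a, b):
    """Euclidean distance between two k-mer frequency tables."""
    keys = set(a) | set(b)
    return sqrt(sum((a[key] - b[key]) ** 2 for key in keys))

genome_a = "ATGCGATGCA"
genome_b = "ATGCGATGCC"   # one substitution away from genome_a
genome_c = "TTTTAAAACC"   # completely different composition

d_ab = euclidean(kmer_counts(genome_a), kmer_counts(genome_b))
d_ac = euclidean(kmer_counts(genome_a), kmer_counts(genome_c))
print(d_ab < d_ac)        # True: A looks closer to B than to C
```

No alignment is performed, which is the point: counting k-mers is linear in
sequence length, so it works as a cheap pre-filter before full analysis.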

~~~
dave_sullivan
Have you by any chance looked at applying NLP techniques, like generating
sentence vectors with recurrent neural networks, to this type of data? N-gram
models have been used successfully in NLP for a while, but they don't know
about any context outside the current n-gram.
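A toy illustration of that window limit (my own example, not from the thread):
unigram counts can't distinguish two sentences that bigrams can, and any fixed
n is similarly blind to structure beyond its own window.

```python
from collections import Counter

def ngrams(tokens, n):
    """Bag of overlapping n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()

# Unigrams see only a bag of words; bigrams capture adjacent order,
# but nothing beyond a window of n tokens.
print(ngrams(s1, 1) == ngrams(s2, 1))  # True: indistinguishable
print(ngrams(s1, 2) == ngrams(s2, 2))  # False: bigrams tell them apart
```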

Recent work has involved creating word and sentence vectors that capture much
richer dynamics, and this is where a lot of the excitement is coming from in
the deep learning world wrt text (of any language, maybe even computer code,
as it turns out).

I've been wanting to try applying these machine learning techniques to this
type of problem, whether for clustering (which was your example),
classification, regression, or sample generation (of, e.g., novel but
probable sequences).

If you or anyone here has an interest in this sort of thing on the bio side,
please contact me (email in profile).

------
mercurial
Coming from OCaml, I'm jealous of what I see of the F# standard library. It
looks like there is a lot of nice built-in stuff. The different case
conventions make it a bit jarring, though.

------
Houshalter
This is pretty neat. I never would have thought that just _letter pairs_ would
correlate with authorship.

