
Nonnegative matrix factorization - feross
https://blog.acolyer.org/2019/02/18/the-why-and-how-of-nonnegative-matrix-factorization/
======
Radim
Timely! We just added a super-efficient implementation of Online Non-negative
Matrix Factorization to Gensim.

Jupyter notebook tutorial:

[https://github.com/RaRe-Technologies/gensim/blob/nmf_speedups/docs/notebooks/nmf_tutorial.ipynb](https://github.com/RaRe-Technologies/gensim/blob/nmf_speedups/docs/notebooks/nmf_tutorial.ipynb)

What's unique about this implementation is that it's streamed, i.e. can
process arbitrary amounts of data in _constant RAM_. It's also heavily
optimized for sparse inputs (~text corpora, not just dense images), just like
the optimized sparse SVD, LDA, FastText etc in Gensim.
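
For a flavour of the API, a minimal sketch (assuming a current Gensim; the three-document corpus is made up for illustration):

    from gensim.corpora import Dictionary
    from gensim.models.nmf import Nmf

    docs = [
        ["graph", "trees", "nodes"],
        ["nodes", "edges", "graph"],
        ["cats", "dogs", "pets"],
    ]
    dictionary = Dictionary(docs)
    # Any re-iterable stream of bag-of-words vectors works here, so the
    # corpus can live on disk and never be held in RAM all at once.
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    nmf = Nmf(corpus=corpus, id2word=dictionary, num_topics=2,
              passes=5, random_state=0)
    print(nmf.show_topics())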

~~~
VHRanger
Just want to say thanks for existing, gensim!

------
FabHK
Interesting.

The problem of finding a low-rank approximation (in the Frobenius norm) to a
given matrix is well understood and can be solved in reasonable time
(truncated SVD, as stated in the article).

However, the particular characteristic of this NMF approach seems to be the
requirement W, H >= 0 (and that's what makes it harder), but it's not clear
what purpose that constraint serves. The original paper linked [1] mentions
that it makes interpretation easier (the columns of W can be interpreted as
facial features, a form of "eigenface"), but it's not clear to me why you
couldn't take a vector with negative elements and then normalize it to [0, 1].

Final note: The article contains quite a few paragraphs from the paper more or
less verbatim.

[1] [https://arxiv.org/abs/1401.5226](https://arxiv.org/abs/1401.5226)

~~~
stabbles
For example: you measure concentrations of `p` gases at `n` different places.
You collect the results in a matrix A where A_ij is the concentration (a value
between 0 and 1) of gas `i` at station `j`.

Now you factorize A ≈ B * C in the sense that ||A - B * C|| is small in your
favourite matrix norm. Here B is of size `p × k` with normalized columns and C
is `k × n` and of course k is very small.

Then the columns of B form an approximate basis for the column space of A --
so B compresses the data of A well.

Now, if B is also element-wise nonnegative, you _might_ interpret its columns
as concentrations (values between 0 and 1 due to the normalization). And you
could plot them.

It's not mathematical, but it's also not a crazy interpretation.
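
Here's that picture as a runnable sketch, using scikit-learn's NMF as a
stand-in (the data is random and purely illustrative):

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    p, n, k = 6, 50, 2            # p gases, n stations, k latent "sources"
    A = rng.random((p, n))        # nonnegative concentrations in [0, 1]

    model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
    B = model.fit_transform(A)    # p x k, entries >= 0
    C = model.components_         # k x n, entries >= 0

    # Normalize the columns of B so each reads as a concentration profile;
    # fold the scaling into C so the product B @ C is unchanged.
    scale = B.sum(axis=0)
    B, C = B / scale, C * scale[:, None]
    print(np.abs(A - B @ C).max())  # worst-case reconstruction error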

------
cpdomina
NMF works great for unsupervised audio source separation, e.g. [1]. See [2][3]
for a good overview of the field.

[1] [http://librosa.github.io/librosa/generated/librosa.decompose.decompose.html](http://librosa.github.io/librosa/generated/librosa.decompose.decompose.html)

[2] [https://hal.inria.fr/hal-01631185/document](https://hal.inria.fr/hal-01631185/document)

[3] [https://sci-hub.tw/10.1002/9781119279860.ch8](https://sci-hub.tw/10.1002/9781119279860.ch8)
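
For a sense of how little code that separation takes, here's a sketch of the
librosa route from [1] ("audio.wav" is a placeholder path; assumes librosa
and NumPy are installed):

    import numpy as np
    import librosa

    y, sr = librosa.load("audio.wav")    # placeholder path to a recording
    S = np.abs(librosa.stft(y))          # nonnegative magnitude spectrogram

    # Factor S ~= components @ activations: each component is a spectral
    # template, and each activation row says when that template is active.
    components, activations = librosa.decompose.decompose(S, n_components=8)
    print(components.shape, activations.shape)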

~~~
jamesb93
If you're interested in this stuff, a team I'm close to is working on exactly
these problems.

[http://www.flucoma.org](http://www.flucoma.org)

[https://pure.hud.ac.uk/en/publications/stationarytransient-audio-separation-using-convolutional-autoenco](https://pure.hud.ac.uk/en/publications/stationarytransient-audio-separation-using-convolutional-autoenco)

------
_eigenfoo
NMF (and matrix factorization techniques more generally) is a really
underrated machine learning technique. If you have a decent intuition for
linear algebra, matrix factorizations are straightforward to understand,
fairly easy to interpret, and can work wonders in applications far beyond
recommender systems.

As the blog post mentions, NMF has potential applications to text mining,
which I've tried out on Reddit posts here:
[https://eigenfoo.xyz/reddit-clusters/](https://eigenfoo.xyz/reddit-clusters/)
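
The basic recipe is short. A sketch with scikit-learn (assuming a recent
version; the four "documents" are invented):

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the stock market fell as rates rose",
        "investors sold stocks on rate fears",
        "the team won the final game",
        "a late goal decided the game",
    ]
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)          # documents x terms, nonnegative

    nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for i, row in enumerate(nmf.components_):
        top = terms[row.argsort()[::-1][:3]]   # top words per topic
        print(f"topic {i}: {', '.join(top)}")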

------
darawk
Can anyone here provide an intuitive explanation for why you'd prefer NMF over
SVD or Truncated SVD? It seems to be simply another way of accomplishing
something very similar, but with this non-negativity constraint. But it's not
obvious to me why that would be useful, or produce a better factorization. Can
anyone shed any light on that?

My sense, from reading about it and understanding it to the extent that I do,
is that it's simply an alternative factorization. There's no real reason to
prefer one over the other in general, but since they're similar yet
non-equivalent, you can try both and see which works better.

------
stared
I like the thread "A parellel between LSA and pLSA"
[https://stats.stackexchange.com/questions/40723/a-parellel-between-lsa-and-plsa](https://stats.stackexchange.com/questions/40723/a-parellel-between-lsa-and-plsa),
which shows the differences between PCA-style matrix factorization, a naive
non-negative matrix factorization (with an L2 cost function), and pLSA
(log-loss cost function).

The latter is a precursor to Latent Dirichlet Allocation, one of the key
models for topic modeling.
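
Both cost functions are easy to experiment with: scikit-learn's NMF exposes
an L2 (Frobenius) objective and a generalized KL objective, the latter being
the pLSA-like variant. A sketch on random, purely illustrative data:

    import numpy as np
    from sklearn.decomposition import NMF

    X = np.random.default_rng(0).random((20, 10))

    # Frobenius (L2) objective: the "naive" NMF from the linked thread.
    nmf_l2 = NMF(n_components=3, beta_loss="frobenius", solver="mu",
                 max_iter=500, random_state=0).fit(X)

    # Generalized Kullback-Leibler objective: the pLSA-like variant.
    nmf_kl = NMF(n_components=3, beta_loss="kullback-leibler", solver="mu",
                 max_iter=500, random_state=0).fit(X)

    print(nmf_l2.reconstruction_err_, nmf_kl.reconstruction_err_)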

------
waynecochran
How is this significantly different from Principal Component Analysis (PCA)?
The image processing examples essentially do what a Karhunen-Loève transform
(KLT) or a set of eigenfaces does.

~~~
stabbles
Well... the nonnegative matrix factorization builds a basis for the column
space where all entries of the basis vectors are nonnegative. This might be
meaningful if the entries are thought of as some nonnegative quantity like
weight or energy.

In the SVD, the basis vectors for the column space can have entries of either
sign.
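
A quick numerical check of that sign difference, sketched with scikit-learn
on random nonnegative data:

    import numpy as np
    from sklearn.decomposition import NMF, TruncatedSVD

    X = np.random.default_rng(0).random((30, 12))  # element-wise nonnegative

    svd = TruncatedSVD(n_components=3, random_state=0).fit(X)
    nmf = NMF(n_components=3, max_iter=500, random_state=0).fit(X)

    print("SVD min entry:", svd.components_.min())  # typically negative
    print("NMF min entry:", nmf.components_.min())  # >= 0 by construction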

------
jph00
BTW, if you're interested in a code-first walkthru of NMF at a higher level,
we cover it in this Jupyter notebook and video:
[https://nbviewer.jupyter.org/github/fastai/numerical-linear-algebra/blob/master/nbs/2.%20Topic%20Modeling%20with%20NMF%20and%20SVD.ipynb](https://nbviewer.jupyter.org/github/fastai/numerical-linear-algebra/blob/master/nbs/2.%20Topic%20Modeling%20with%20NMF%20and%20SVD.ipynb)

------
hatmatrix
One point that I don't find raised much in the NMF literature is how
rotational ambiguity (non-uniqueness) is dealt with systematically - is the
exploration of factor rotations just not necessary?

~~~
gubbrora
From my understanding the vectors don't really matter so much. We care more
about the subspace they span (because we want to project our data into it). So
any rotation is as fine as any other.

~~~
gubbrora
On further thought I think this is wrong. A rotation can push entries of the
factors below zero, so the nonnegativity constraint does restrict which
rotations are allowed.
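
A tiny NumPy sketch of that point: rotating a nonnegative factorization
leaves the product W @ H unchanged, but the rotated factors can pick up
negative entries (the 45° angle here is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.random((5, 2))               # nonnegative factors
    H = rng.random((2, 4))

    theta = np.pi / 4                    # rotate the latent space by 45 degrees
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    W2, H2 = W @ R, np.linalg.inv(R) @ H
    print(np.allclose(W @ H, W2 @ H2))   # True: the product is unchanged
    print(W2.min(), H2.min())            # typically negative entries appear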

