Nonnegative matrix factorization (acolyer.org)
102 points by feross on Feb 18, 2019 | 19 comments

Timely! We just added a super-efficient implementation of Online Non-negative Matrix Factorization to Gensim.

Jupyter notebook tutorial:


What's unique about this implementation is that it's streamed, i.e. it can process arbitrary amounts of data in constant RAM. It's also heavily optimized for sparse inputs (~text corpora, not just dense images), just like the optimized sparse SVD, LDA, FastText, etc. in Gensim.
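The constant-RAM idea behind online NMF can be sketched in plain NumPy (a hypothetical toy, not Gensim's actual API or update rule): keep only the current chunk plus small sufficient statistics, and refresh the factor W from those statistics as chunks stream by.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 20, 3

# Global factor W plus sufficient statistics A = sum_t H_t H_t^T and
# B = sum_t V_t H_t^T; their sizes never depend on the corpus size.
W = np.abs(rng.standard_normal((n_features, n_components)))
A = np.zeros((n_components, n_components))
B = np.zeros((n_features, n_components))

def process_chunk(V, W, A, B, inner_iters=50, eps=1e-9):
    """Consume one chunk of column vectors V (n_features x chunk_size)."""
    # Solve for the chunk's nonnegative codes H with W held fixed
    # (multiplicative updates for ||V - W H||_F^2).
    H = np.abs(rng.standard_normal((W.shape[1], V.shape[1])))
    for _ in range(inner_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    # Accumulate statistics, then refresh W from them; the ratio of
    # nonnegative matrices keeps W nonnegative.
    A += H @ H.T
    B += V @ H.T
    W *= B / (W @ A + eps)
    return W, A, B

# Stream the data chunk by chunk: memory use depends only on the chunk
# size and the factor shapes, not on how many chunks arrive.
for _ in range(30):
    V = np.abs(rng.standard_normal((n_features, 8)))
    W, A, B = process_chunk(V, W, A, B)
```

The data matrix itself is never materialized; only the 20x3 factor and the 3x3 / 20x3 statistics persist between chunks.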

Just want to say thanks for existing, gensim!


The problem of finding a low-rank approximation (in Frobenius norm) to a given matrix is well understood and solvable in reasonable time (truncated SVD, as stated in the article).

However, the particular characteristic of this NMF approach is the extra requirement W, H >= 0 (which is what makes the problem harder), and it's not clear what purpose it serves. The original paper linked [1] mentions that it makes interpretation easier (the vectors of W can be interpreted as facial features, a form of "eigenface"), but it's not clear to me why you couldn't take a vector with negative elements and normalize it to [0, 1].

Final note: The article contains quite a few paragraphs from the paper more or less verbatim.

[1] https://arxiv.org/abs/1401.5226
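One way to see why rescaling an SVD basis into [0, 1] doesn't rescue interpretability: for a strictly positive matrix, the leading singular vector has entries of a single sign (Perron-Frobenius), so every later singular vector, being orthogonal to it, is forced to mix signs. A small NumPy check on toy data (my own example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((12, 8))  # strictly positive data matrix

# Truncated-SVD-style basis for the column space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# U[:, 0] is (up to sign) the Perron vector of X X^T: all entries one
# sign. Any later column is orthogonal to it and therefore must
# contain both positive and negative entries.
u2 = U[:, 1]
print(u2.min() < 0 < u2.max())  # True: mixed signs are unavoidable
```

And shifting or rescaling those mixed-sign vectors into [0, 1] changes the product U S V^T, so you'd be destroying the very approximation the factorization was computed for, not just re-labeling it.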

For example: you measure concentrations of `p` gases at `n` different places. You collect the results in a matrix A where A_ij is the concentration (a value between 0 and 1) of gas `i` at station `j`.

Now you factorize A ≈ B * C in the sense that ||A - B * C|| is small in your favourite matrix norm. Here B is of size `p × k` with normalized columns and C is `k × n` and of course k is very small.

Then the columns of B form an approximate basis for the column space of A -- so B compresses the data of A well.

Now, if also B is element-wise positive, you might interpret the columns as concentrations (values between 0 and 1 due to normalization). And you could plot them.

It's not mathematical, but it's also not a crazy interpretation.
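A toy NumPy version of this setup (all names and numbers made up): build column-normalized profiles B and nonnegative mixing weights C, and the product A automatically consists of valid concentrations.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 5, 100, 2  # gases, stations, latent sources

# Source "profiles": each column sums to 1, so its entries read as
# fractional gas concentrations.
B = rng.random((p, k))
B /= B.sum(axis=0)

# Mixing weights: each station is a convex combination of the sources.
C = rng.random((k, n))
C /= C.sum(axis=0)

A = B @ C  # observed concentrations, entries in [0, 1]

# Columns of B are a nonnegative basis for the column space of A:
# every observation is an additive blend of the k source profiles.
assert (A >= 0).all() and (A <= 1).all()
```

Running NMF on A would try to recover factors of exactly this interpretable form, which an unconstrained factorization has no reason to produce.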

Some quantities are non-negative in their nature, such as energy, intensity, number of occurrences etc. When modeling data that is assumed to be a sum of such non-negative terms (i.e. the terms cannot cancel out), it can make sense to enforce non-negativity.

The non-negative constraint is so that components will only add together. One can never subtract a feature, or add a negative feature.

For instance, complex sounds can be decomposed into many smaller sound "atoms". Usually this is done on a spectrogram representation split into small sections of time. With NMF, the weight matrix encodes how much of each atom is added to form the sound mixture. The weights can then be used for classification, similarity or synthesis.

Allowing a negative amount of a sound atom would make interpretation harder. What does it mean for an anti-sound to be present? Destructive interference of waveforms is possible, but it's quite rare for it to encompass a whole "atom" in a realistic acoustic scenario. And when not combined with its inverse, an "anti-sound" just sounds like a sound to the human ear, so its contribution is not always negative either. NMF interpretation is straightforward in comparison.

Intuitively, the purpose of the positive coefficients constraint is to allow for the interpretation as a mixture (admixture?) of components. This constraint gives the nice properties of W.

In principle there is no reason why you wouldn't allow negative coefficients in H, but if you want W to learn additive components, then you need the non-negativity constraint.

NMF works great for unsupervised audio source separation, e.g. [1]. See [2][3] for a good overview of the field.

[1] http://librosa.github.io/librosa/generated/librosa.decompose...

[2] https://hal.inria.fr/hal-01631185/document

[3] https://sci-hub.tw/10.1002/9781119279860.ch8

If you're interested in this stuff, a team I'm close to is working on such problems.



NMF (and matrix factorization techniques more generally) is a really underrated machine learning technique. If you have a decent intuition for linear algebra, matrix factorizations are straightforward to understand, fairly easy to interpret, and can work wonders in applications far beyond recommender systems.

As the blog post mentions, NMF has potential applications to text mining, which I've tried out here on Reddit posts: https://eigenfoo.xyz/reddit-clusters/

Can anyone here provide an intuitive explanation for why you'd prefer NMF over SVD or Truncated SVD? It seems to be simply another way of accomplishing something very similar, but with this non-negativity constraint. But it's not obvious to me why that would be useful, or produce a better factorization. Can anyone shed any light on that?

My sense, from reading about it and understanding it to the extent that I do, is that it's simply an alternative factorization. There's no real reason to prefer one over the other, but since they're similar yet non-equivalent, you can try both and see which one works better.
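In that spirit, "try both" is cheap, e.g. with scikit-learn (assuming it's available; the parameters below are just illustrative). By Eckart-Young, truncated SVD attains the optimal rank-k Frobenius error, so NMF can only match or exceed it; the real question is whether the nonnegative factors are worth that gap.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((100, 40))  # any nonnegative data matrix

# Unconstrained low-rank fit: optimal rank-5 Frobenius error.
svd = TruncatedSVD(n_components=5, random_state=0)
X_svd = svd.fit_transform(X) @ svd.components_

# Nonnegative low-rank fit: interpretable factors, error >= SVD's.
nmf = NMF(n_components=5, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(X)
X_nmf = W @ nmf.components_

err = lambda R: np.linalg.norm(X - R) / np.linalg.norm(X)
# SVD is provably no worse in reconstruction; NMF pays a (usually
# small) price for the additive, nonnegative decomposition.
assert err(X_svd) <= err(X_nmf) + 1e-9
```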

I like this thread "A parellel between LSA and pLSA" https://stats.stackexchange.com/questions/40723/a-parellel-b... showing differences between PCA matrix factorization, some naive non-negative matrix factorization (with L2 cost function) and pLSA (log-loss cost function).

The latter is a precursor to Latent Dirichlet Allocation, one of the key models for topic modeling.

How is this significantly different from Principal Component Analysis (PCA)? The image processing examples essentially do what a Karhunen-Loeve Transform (KLT) or a set of Eigenfaces does.

Well... nonnegative matrix factorization builds a basis for the column space where all entries of the basis vectors are nonnegative. This might be meaningful if the entries are thought of as some nonnegative quantity like weight or energy.

In the SVD the basis vectors for the column space consist of entries that are not necessarily positive.

You can think of it as "PCA where all the principal component entries and loadings must be nonnegative".

It's a harder optimization problem to solve, though, due to that constraint. The connection is especially clear when looking at projected least squares algorithms for NMF: plain PCA minimizes the Euclidean reconstruction loss, while NMF algorithms alternate gradient descent steps on that loss with projection steps that force everything to stay nonnegative.
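For concreteness, the other classic workhorse is the Lee-Seung multiplicative update, which keeps the factors nonnegative implicitly (each update multiplies by a ratio of nonnegative matrices) rather than via an explicit projection. A toy NumPy sketch, not tuned:

```python
import numpy as np

def nmf_mu(X, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for ||X - W H||_F^2, W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    for _ in range(iters):
        # Ratios of nonnegative matrices: nonnegativity is preserved
        # automatically, no clipping step needed.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((30, 20))

# Reproduce the algorithm's own random init to measure starting error.
rng0 = np.random.default_rng(0)
W0 = rng0.random((30, 5))
H0 = rng0.random((5, 20))
err_init = np.linalg.norm(X - W0 @ H0)

W, H = nmf_mu(X, k=5)
err = np.linalg.norm(X - W @ H)
# The updates are (in theory) monotonically non-increasing in the loss.
assert err <= err_init
```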

BTW, if you're interested in a code-first walkthru of NMF at a higher level, we cover it in this Jupyter notebook and video: https://nbviewer.jupyter.org/github/fastai/numerical-linear-...

One point that I don't find raised much in the NMF literature is how rotational ambiguity (non-uniqueness) is dealt with systematically - is the exploration of factor rotations just not necessary?

From my understanding the vectors don't really matter so much. We care more about the subspace they span (because we want to project our data into it). So any rotation is as fine as any other.

On further thought I think this is wrong. Large rotations can cause coefficients to become negative, so there are constraints.
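That matches a quick sanity check: for any invertible R, W H = (W R)(R^{-1} H), so the product is invariant under such transformations, but rotating nonnegative factors generally pushes entries below zero, which the constraint rules out. A toy example with hand-picked matrices:

```python
import numpy as np

# Hand-picked nonnegative factors (rank 2).
W = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
H = np.array([[1., 0., 2.],
              [0., 1., 1.]])

# Any invertible R leaves the product unchanged: W H = (W R)(R^{-1} H).
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W2 = W @ R
H2 = np.linalg.inv(R) @ H

assert np.allclose(W2 @ H2, W @ H)
# ...but the rotated factors pick up negative entries, so this rotation
# is not a valid alternative NMF solution for the same product.
assert (W2 < 0).any()
```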

