Jupyter notebook tutorial:
What's unique about this implementation is that it's streamed, i.e. can process arbitrary amounts of data in constant RAM. It's also heavily optimized for sparse inputs (~text corpora, not just dense images), just like the optimized sparse SVD, LDA, FastText etc in Gensim.
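Roughly, the streamed usage looks something like this (a minimal sketch; it assumes the `gensim.models.nmf.Nmf` class from recent Gensim versions and a made-up toy corpus):

```python
# Minimal sketch of the streamed NMF in Gensim (assumes gensim >= 3.7,
# where gensim.models.nmf.Nmf is available; the toy documents are made up).
from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf

docs = [["sparse", "text", "corpus"],
        ["another", "short", "document"],
        ["text", "document", "example"]]

dictionary = Dictionary(docs)
# Bag-of-words corpus; in a real streamed setup this would be any
# re-iterable object that yields one document at a time from disk.
corpus = [dictionary.doc2bow(doc) for doc in docs]

nmf = Nmf(corpus=corpus, id2word=dictionary, num_topics=2, passes=5, random_state=42)
print(nmf.show_topics())
```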
The problem of finding a low-rank approximation (in Frobenius norm) to a given matrix is well understood and solvable in reasonable time (truncated SVD, as stated in this article).
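For comparison, that truncated-SVD low-rank approximation is only a few lines of numpy (a toy sketch on a random matrix):

```python
# Rank-k approximation via truncated SVD: optimal in Frobenius norm
# (Eckart-Young theorem). Plain numpy, no sparsity tricks.
import numpy as np

A = np.random.rand(100, 50)
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A

print(np.linalg.norm(A - A_k, "fro"))         # approximation error
```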
However, the particular characteristic of this NMF approach seems to be that we require W, H >= 0 (and that's what makes it harder), but it's not clear why, or what purpose it serves. The original paper linked mentions that it makes interpretation easier (the vectors of W can be interpreted as facial features, a form of "eigenface"), but it's not clear to me why you couldn't take a vector with negative elements and then normalize it to [0, 1].
Final note: The article contains quite a few paragraphs from the paper more or less verbatim.
Now you factorize A ≈ B * C in the sense that ||A - B * C|| is small in your favourite matrix norm. Here B is of size `p × k` with normalized columns and C is `k × n` and of course k is very small.
Then the columns of B form an approximate basis for the column space of A -- so B compresses the data of A well.
Now, if also B is element-wise positive, you might interpret the columns as concentrations (values between 0 and 1 due to normalization). And you could plot them.
It's not mathematical, but it's also not a crazy interpretation.
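Concretely, something like this sketch with scikit-learn's NMF (random non-negative data; the column normalization is folded back into C so the product is unchanged):

```python
# Sketch with scikit-learn's NMF (not the streamed Gensim one):
# factor A (p x n) as B (p x k) times C (k x n), both non-negative,
# then rescale so each column of B sums to 1 and reads as "concentrations".
import numpy as np
from sklearn.decomposition import NMF

p, n, k = 200, 50, 4
A = np.abs(np.random.rand(p, n))

model = NMF(n_components=k, init="nndsvd", max_iter=500)
B = model.fit_transform(A)        # p x k, non-negative
C = model.components_             # k x n, non-negative

scale = B.sum(axis=0)             # normalize columns of B ...
B = B / scale
C = C * scale[:, None]            # ... and push the scale into C so B @ C is unchanged

print(np.linalg.norm(A - B @ C))  # reconstruction error
```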
For instance, complex sounds can be decomposed into many smaller sound "atoms". Usually this is done on a spectrogram representation, split into small sections of time. With NMF, the weight matrix tells you how much of each atom is added to form the sound mixture.
The weights can then be used for classification, similarity or synthesis.
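A rough sketch of that pipeline, assuming librosa for the spectrogram and scikit-learn's NMF (the file name is just a placeholder):

```python
# Rough sketch: decompose a magnitude spectrogram into spectral "atoms" (W)
# and their time-varying activations (H). Assumes librosa and scikit-learn;
# "mixture.wav" is a placeholder file name.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("mixture.wav", sr=None)
S = np.abs(librosa.stft(y))            # magnitude spectrogram, freq x time

model = NMF(n_components=8, init="nndsvd", max_iter=400)
W = model.fit_transform(S)             # spectral atoms (freq x components)
H = model.components_                  # activations over time (components x frames)

# H can now feed classification, similarity, or resynthesis of individual atoms.
```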
Allowing a negative amount of a sound atom to be added would make interpretation harder. What does it mean for an anti-sound to be present?
Destructive interference of waveforms is possible, but it's quite rare for it to cancel a whole "atom" in a realistic acoustic scenario. And when a waveform isn't combined with its inverse, it just sounds like an ordinary sound to the human ear, so its contribution isn't really negative either. NMF interpretation is straightforward in comparison.
In principle there is no reason why you couldn't allow negative coefficients in H, but if you want W to learn additive components, then you need the non-negativity constraint.
As the blog post mentions, NMF has potential applications to text mining, which I've tried out here on Reddit posts: https://eigenfoo.xyz/reddit-clusters/
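For anyone curious, the usual pattern for NMF on text looks roughly like this (a toy sketch with scikit-learn; not the code behind the linked post):

```python
# Toy topic extraction with TF-IDF + NMF (scikit-learn); the posts are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

posts = [
    "the model trained on the new dataset",
    "guitar pedals and amp settings for recording",
    "dataset preprocessing and model evaluation",
    "recording vocals with a cheap condenser mic",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)            # sparse doc-term matrix

nmf = NMF(n_components=2, init="nndsvd")
doc_topics = nmf.fit_transform(X)              # document-topic weights
terms = vectorizer.get_feature_names_out()

for topic in nmf.components_:                  # print top words per topic
    top = topic.argsort()[-4:][::-1]
    print([terms[i] for i in top])
```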
My sense, from reading about it and understanding what's going on to the extent that I do, is that it's simply an alternative factorization. There's no real reason to prefer one over the other, but since they're similar yet not equivalent, you can try both and see which one works better.
The latter is a precursor to Latent Dirichlet Allocation, one of the key models for topic modeling.
In the SVD the basis vectors for the column space consist of entries that are not necessarily positive.
It's a harder optimization problem to solve, though, due to that constraint. The connection is especially clear when looking at the projected least squares algorithms for NMF: plain PCA minimizes the Euclidean reconstruction loss, while NMF algorithms alternate gradient descent steps on that loss with projection steps that force everything to stay non-negative.
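A tiny numpy sketch of that alternating projected-gradient idea (fixed step size, purely for illustration):

```python
# Projected gradient NMF sketch: alternate gradient steps on W and H for the
# Euclidean loss ||A - W H||^2, clipping negatives to zero after each step.
# Fixed step size for brevity; real implementations choose it adaptively.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((60, 40))
k, lr, iters = 5, 1e-3, 2000

W = rng.random((A.shape[0], k))
H = rng.random((k, A.shape[1]))

for _ in range(iters):
    R = W @ H - A
    W = np.clip(W - lr * (R @ H.T), 0, None)   # gradient step on W, project onto >= 0
    R = W @ H - A
    H = np.clip(H - lr * (W.T @ R), 0, None)   # gradient step on H, project onto >= 0

print(np.linalg.norm(A - W @ H, "fro"))        # reconstruction error after training
```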