
Sparse matrix representations in SciPy (2017) - heydenberk
https://heydenberk.com/blog/posts/sparse-matrix-representations-in-scipy/
======
goerz
A good sparse matrix implementation is key to any serious numerics. SciPy's
implementation is very nice and quite complete, and a foundation for the
entire scientific Python ecosystem. Unfortunately, though, many of the low-
level sparse operations are not implemented very efficiently, which has led
some projects to roll their own implementations; see e.g. this discussion
from the QuTiP project:
[https://github.com/qutip/qutip/issues/850#issuecomment-38400...](https://github.com/qutip/qutip/issues/850#issuecomment-384005667)

It would be really nice if _all_ of the sparse linear algebra in SciPy could
be heavily optimized (to a similar extent as e.g. Intel optimizes their sparse
operations in MKL), so that the entire ecosystem could benefit from that. This
is probably something that would require some corporate support, but given how
many data science and finance companies use Python for their workflow these
days, it might be a wise investment.
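For readers who haven't used it, a minimal sketch of the scipy.sparse API being discussed (matrix size and density here are arbitrary, chosen just for illustration):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Build a 1000x1000 matrix with ~1% nonzeros, stored in CSR format,
# which backs most of SciPy's sparse linear algebra.
A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
x = np.ones(1000)

# Sparse matrix-vector product: only the stored nonzeros are touched.
y = A @ x
print(A.nnz, y.shape)
```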

~~~
timr
This is true across the board. NumPy, Tensorflow, Pytorch...no library has
anywhere near 100% coverage of core functionality in sparse tensors. It's just
sort of generally assumed that if you're using sparse tensors, you're on your
own.

Tensorflow is better than most, in that it at least _implements_ most ops for
sparse tensors, even if it doesn't have efficient implementations for things.
I was recently frustrated to find that PyTorch doesn't implement a lot of
basic functionality for sparse tensors at all -- it just throws an exception.

~~~
yvdriess
TensorFlow is indeed the one-eyed king here. We found that with large
SparseTensors, CPU training jobs are actually faster than GPU nodes. An
individual job is faster on GPU, but SparseMatMul just blows up GPU memory use
to the point that only one train job fits in its memory. SparseMatMul is
inefficient on CPU as well, but you can keep getting some decent speedup by
cranking up the parallel train jobs per node.

I did have great success using Netflix Vectorflow. I roughly gained 2x
performance for my use case. It does leave some other performance on the
table (no vectorized minibatches, etc.). If you are training on huge sparse
datasets with a shallow feed-forward network, Vectorflow is a good alternative.

------
joe_the_user
There seems to be no end of sparse formats, potential sparse formats and
algorithms for dealing with them.

But it also seems like for a given sparse matrix in a given situation, there's
no guarantee that there's an algorithm for handling it. The whole thing
requires deep experience or black magic.

~~~
perimo
To get the best performance you typically want to know something about the
sparsity structure of your problem. Do most nonzeros fall near the diagonal?
Or do they clump into dense submatrices?

If you don't have that information, you can look at how your matrix will be
accessed. Do you need fast access to random rows, or columns? Do you need to
write new values? Are those values already in the sparsity structure?

All of these questions can lead you to choosing the right sparse format, but
it does take some experience to know where to look.
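A sketch of how those access-pattern questions map onto SciPy's formats (the matrix contents here are made up for illustration):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.random((500, 500))
dense[dense < 0.99] = 0.0  # keep roughly 1% of entries as nonzeros

csr = sparse.csr_matrix(dense)  # rows stored contiguously: fast row access
csc = sparse.csc_matrix(dense)  # columns stored contiguously: fast column access
lil = sparse.lil_matrix(dense)  # list-of-lists: cheap incremental writes

row = csr.getrow(10)   # cheap in CSR (contiguous slice of the data array)
col = csc.getcol(10)   # cheap in CSC, expensive in CSR
lil[3, 7] = 1.0        # cheap in LIL; doing this on CSR triggers a
                       # SparseEfficiencyWarning about changing the structure
```

The usual workflow is to build in COO or LIL, then convert once (`.tocsr()` / `.tocsc()`) for the access pattern the computation needs.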

~~~
timr
That's true, but mostly you just want to be able to store massive sparse
matrices without filling your RAM with zeros (or spending tons of time
reading/writing zeros from disk). Any implementation will do, as long as it
works.
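To put rough numbers on the "filling your RAM with zeros" point, here is a back-of-the-envelope comparison (sizes and density are arbitrary assumptions, not from the comment):

```python
from scipy import sparse

n = 10_000
# A dense float64 matrix of this size would take n*n*8 bytes, ~800 MB.
dense_bytes = n * n * 8

# The same shape with 0.1% nonzeros in CSR costs only the three
# underlying arrays: data + indices + indptr.
A = sparse.random(n, n, density=0.001, format="csr", random_state=0)
sparse_bytes = A.data.nbytes + A.indices.nbytes + A.indptr.nbytes

print(dense_bytes, sparse_bytes)  # sparse is orders of magnitude smaller
```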

~~~
srean
Storage is a minor worry. The main thing is whether the storage makes the
frequent operations efficient. The tuple (index_position, value) is reasonably
storage efficient but abysmal for linear algebraic operations. So one cannot
decouple storage efficiency concerns from compute concerns too much.
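A rough illustration of why the triple format is compact but poor for linear algebra: SciPy's COO format stores exactly those (row, col, value) triples, and converting to CSR groups the nonzeros by row so that a matvec becomes a cache-friendly sweep. This example is a sketch, not from the original comment:

```python
import numpy as np
from scipy import sparse

# COO: three parallel arrays of (row, col, value) triples.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])
coo = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# CSR stores the same nonzeros grouped by row, with an indptr array
# marking where each row starts -- that grouping is what makes
# row-wise traversal (and hence matvec) efficient.
csr = coo.tocsr()
print(csr.indptr)  # [0 2 3 4]: row 0 holds 2 nonzeros, rows 1 and 2 one each

x = np.array([1.0, 1.0, 1.0])
print(csr @ x)     # [3. 3. 4.]
```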

I think this is one of those shibboleths that separate folks who are experts
in ML/data-science at scale from those who are coming more from an
enterprise/DBA background. I often have to convince folks why it's not a great
idea to write sparse matrix multiplication in SQL.

~~~
timr
_"Storage is a minor worry."_

For many domains, it's a big worry. I routinely encounter large, sparse
datasets where I/O and storage dominate other concerns. Big sparse tensors are
worse than the "slowness" associated with a suboptimal sparse vector math
implementation.

 _"I think this is one of those shibboleths that separate folks who are
experts in ML/data-science at scale vs those who are coming more from a
enterprise/DBA background."_

I've been doing ML work for a long time. This is actually a shibboleth that
separates folks who mostly do image classification from other kinds of ML
people.

~~~
srean
"Minor" in the sense that it's relatively easy to solve if you're not
constrained by the compute needs.

Interesting comment regarding image classification. I would have assumed they
are the ones who don't have to worry about sparsity much. Physics, and image
and video are some of the places you encounter large and dense matrices.
Lapack's wet dream :)

