
Fast Differentiable Sorting and Ranking - etaioinshrdlu
https://arxiv.org/abs/2002.08871
======
kxyvr
Something that I didn't know, and may help others understand this paper
better, is that there's a way to define the sorting of a vector through
creative use of the Birkhoff–von Neumann theorem:

[https://en.wikipedia.org/wiki/Doubly_stochastic_matrix](https://en.wikipedia.org/wiki/Doubly_stochastic_matrix)

which is better explained here:

[https://cs.stackexchange.com/questions/4805/sorting-as-a-
lin...](https://cs.stackexchange.com/questions/4805/sorting-as-a-linear-
program)

where the sorting operation is defined as a linear program. Evidently, this
has been known for at least half a century. That said, if a solution to a
linear program can be found in a way that's differentiable, then the operation
of sorting can be made differentiable as well. This appears to be the trick in
the paper, and they also appear to have a relatively fast way to compute this
solution, which I think is interesting.
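
To make that concrete, here is a minimal sketch of the sorting-as-LP idea (my own illustrative code, not from the paper; the vector theta and the staircase weights rho are made up). The LP's optimum is attained at a vertex of the Birkhoff polytope, i.e. at a permutation matrix, which is exactly the Birkhoff–von Neumann argument:

    import numpy as np
    from scipy.optimize import linprog

    theta = np.array([0.3, -1.2, 2.5, 0.0])   # vector to "sort" (made-up values)
    n = len(theta)
    rho = np.arange(n, 0, -1).astype(float)   # decreasing weights n, n-1, ..., 1

    # Maximize rho^T P theta over doubly stochastic P (flattened row-major).
    c = -np.outer(rho, theta).ravel()          # linprog minimizes, so negate

    # Doubly stochastic constraints: every row and every column of P sums to 1.
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0       # row i sums to 1
        A_eq[n + i, i::n] = 1.0                # column i sums to 1
    b_eq = np.ones(2 * n)

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    P = res.x.reshape(n, n)                    # vertex optimum: a permutation matrix
    print(np.round(P) @ theta)                 # theta sorted in decreasing order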

~~~
Mmrnmhrm
While the roots have been known for a long time, my impression is that the key
paper that started this line of thought was Marco Cuturi's NIPS 2013 paper
"Sinkhorn Distances", which is, IMHO, a very nice read.

~~~
kxyvr
Certainly I may be missing something, but it seems like the advance in this
series of papers is that they figured out a way to calculate a differentiable
solution to the sorting problem quickly, whereas it was already known that a
differentiable solution existed, no?

------
GistNoesis
>"While our soft operators are several hours faster than OT, they are slower
than All-pairs, despite its O(n^2)complexity. This is due the fact that, with
n= 100, All-pairs is very efficient on GPUs, while our PAV implementation runs
on CPU"

The paper is interesting, but I'm not yet sure of its practical uses.

The trick I use in practice when I need a differentiable sort is usually a
pre-sort step which involves thresholding (i.e. selecting, with sparsity, only
values greater than a certain score: usually either a constant, a fraction of
the best score, or the Kth score). Then pay the quadratic price with n = 10 or 20.
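
A rough sketch of that recipe (my own illustrative code; the top-k selection stands in for whichever thresholding rule you use, and the soft rank is the naive all-pairs relaxation):

    import torch

    def soft_rank_topk(scores, k=20, temperature=0.1):
        # Pre-sort step: keep only the k best scores (hard, sparse selection;
        # gradients flow only to the surviving entries).
        top_vals, top_idx = torch.topk(scores, k)
        # All-pairs soft rank on the small subset:
        # rank_i ~= 1 + sum_{j != i} sigmoid((v_j - v_i) / T), rank 1 = largest.
        diff = (top_vals.unsqueeze(0) - top_vals.unsqueeze(1)) / temperature
        soft_ranks = 0.5 + torch.sigmoid(diff).sum(dim=1)   # +1 offset minus the 0.5 self term
        return soft_ranks, top_idx

    scores = torch.randn(1000, requires_grad=True)
    ranks, idx = soft_rank_topk(scores, k=20)   # quadratic cost, but only on k = 20 items
    ranks.var().backward()                      # gradients reach the selected scores only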

I don't see a case where the relative rank of garbage results would really
matter. When n gets bigger and you don't want to ignore bad results, quantile
approximations usually suffice.

In the applications they cite:

The smart use (via cross-validation) of a threshold by the Huber loss in
section 6.4 works better 2 out of 3 times in their own graphs.

The other use cases are where order matters (for example, section 6.3, where
the rankings are given as input). If n is low you pay the quadratic cost; if n
is high you usually need to process samples a subset at a time for memory
reasons and use comparison losses (triplet loss...). So this is relevant only
in the sweet spot in between, if you need exact calculations.

------
etaioinshrdlu
I find this paper super cool, and it's highly unintuitive that an operation as
discrete as sorting can be done entirely with smooth functions, and
efficiently to boot.

However, I must admit that I do not fully grasp the implications of this paper.
Why do we really need differentiable sorting for deep learning in the first
place? What new possibilities open up as a result? My best uneducated guess is
that the gradients produced by differentiable sorting are more informative
than those of regular piecewise sorting, and this allows gradient descent to
progress faster, therefore training faster. (Think about how an entire
analytic function can be completely determined from any small neighborhood.
Are these sorting functions analytic too?) My intuition tells me that the
derivatives produced with this technique allow the optimizer to see true
gradients across classes.

Are higher order derivatives also meaningful here?
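
On the "more informative gradients" point: the issue with the hard operators is that ranking is piecewise constant, so its derivative is zero almost everywhere. A tiny numeric illustration (made-up numbers):

    import numpy as np

    def hard_rank(x):
        return np.argsort(np.argsort(-x)) + 1        # 1 = largest

    x = np.array([0.2, 1.5, -0.7])
    print(hard_rank(x))                               # [2 1 3]
    print(hard_rank(x + np.array([1e-3, 0, 0])))      # still [2 1 3]: zero "gradient"

The smoothed operators replace these flat plateaus with slopes, which is where the extra gradient information comes from.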

~~~
billconan
A few months ago, I played with a Julia differentiable programming framework,
and I thought what if I make a differentiable virtual machine and use unsorted
and sorted numbers as training data. will it learn a sorting algorithm,
something similar to deep Turing machine. My conclusion was I can't ....

~~~
mpoteat
The million dollar question is if it's possible to construct a theory of
computation where the input language itself is automatically differentiable,
and where the execution semantics are also so. Perhaps a continuous spatial
automata that has been proven to be Turing complete.

------
quotemstr
Huh. I've only read the paper superficially, but it definitely looks cool. I
wouldn't have thought to implement sorting by geometric projection onto an
unfathomably huge polytope, then optimizing the projection by transforming it
into isotonic optimization ([1], apparently?). I'm not sure my geometry-fu is
strong enough to properly understand the details of the approach.

I do have one question though: what is the resulting algorithm actually
"doing" when analyzed as a conventional sorting algorithm and not a geometric
operation?

[1]
[https://en.wikipedia.org/wiki/Isotonic_regression](https://en.wikipedia.org/wiki/Isotonic_regression)
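
On the isotonic-optimization piece: the standard solver there is the pool adjacent violators (PAV) routine (the "PAV implementation" the paper benchmarks is this kind of algorithm), which is short enough to sketch. A plain numpy version (not the authors' code):

    import numpy as np

    def pav(y):
        # Pool Adjacent Violators: least-squares fit of a non-decreasing sequence to y.
        # Adjacent blocks that violate monotonicity are merged and replaced by their mean.
        means = list(np.asarray(y, dtype=float))
        counts = [1] * len(y)
        i = 0
        while i < len(means) - 1:
            if means[i] > means[i + 1]:
                total = means[i] * counts[i] + means[i + 1] * counts[i + 1]
                counts[i] += counts[i + 1]
                means[i] = total / counts[i]
                del means[i + 1], counts[i + 1]
                if i > 0:
                    i -= 1          # the merge may create a new violation to the left
            else:
                i += 1
        return np.repeat(means, counts)

    print(pav(np.array([3.0, 1.0, 2.0, 5.0, 4.0])))   # [2.  2.  2.  4.5 4.5]

As far as I understand the paper, computing the soft sort/rank boils down to sorting the input once and then running something like this on the result, which is where the O(n log n) complexity comes from.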

------
formalsystem
Skimming papers like this makes me think maybe we spend too much time computing
stuff in discrete spaces vs. continuous ones.

This textbook covers CS theory using real numbers instead of integers.

[https://www.amazon.com/Complexity-Real-Computation-Lenore-
Bl...](https://www.amazon.com/Complexity-Real-Computation-Lenore-
Blum/dp/0387982817/ref=sr_1_2?keywords=real+computation&qid=1582431445&sr=8-2)

~~~
gumby
> Skimming papers like this makes me think maybe we spend too much time
> computing stuff in discrete spaces vs. continuous ones.

I would go one step further and argue that we shouldn't teach kids discrete
math first, but rather continuous math instead.

Sure, you have discrete digits and toys, but Piaget (and his student Papert)
observed that kids begin pouring water between different containers in the bath
before they can count integers, and from that they develop an understanding
that objects of different shapes can have the same volume, along with concepts
of partial filling, ratios, etc.

The human scale world is continuous more than it is discrete.

~~~
threatofrain
The current K-12 curriculum is, from one perspective, all about preparing kids
for 3 years of calculus. From this pedagogical perspective, discrete narratives
are there to be a stepping stone into continuous narratives. Some people, like
Gilbert Strang, believe that there's way too much emphasis on calculus and not
enough on algebra.

------
cs702
Based on an initial read, this looks like a significant breakthrough to me.

I wonder if the techniques developed by the authors could make it feasible to
take _other_ piecewise linear/constant algorithms (which until now have been
considered "non-differentiable" for practical purposes) and turn them into
differentiable algorithms.

Think beyond sorting and ranking.

~~~
currymj
It's not completely unprecedented, because there were other ways of getting
equivalent results before (the Sinkhorn-based optimal transport approach
cited, for one), which have been used for all kinds of interesting tasks. The
contribution is that this does so more efficiently.

~~~
cs702
Agreed. That's what I meant when I wrote "for practical purposes" above...
although in hindsight I could have articulated it better. Thanks!

------
jeremysalwen
I have always thought of the LambdaRank objective
([https://www.microsoft.com/en-us/research/publication/from-
ra...](https://www.microsoft.com/en-us/research/publication/from-ranknet-to-
lambdarank-to-lambdamart-an-overview/)) as mapping the scores to a probability
distribution over rankings, or as they call it "projections onto the
permutahedron".

------
goldenkey
This is immensely useful for my ongoing attempt to produce a compact,
lookup-table-less perfect hash generator for big datasets using ML.

Naively one might think: why not just use a standard loss, a point-to-point
metric like mean squared error? But this is deeply flawed, because it requires
assigning each sample to a specific natural number, effectively shrinking the
solution space by a factor of n!. In practice, believe me, I have tried: the
network never converges, because the mapping is entirely arbitrary and has
nothing to do with the samples.

To remedy this, we need an innovation in loss functions / mathematics. The
loss function for the output of the net needs to be a set function [1]. This
set function should measure the distance between the _Set_ of outputs of the
net and the Set {1,2,...,n-1,n}. This is different from KL and all the other
standard loss metrics, because we do not care about the point-to-point
mappings, and we have no ability to histogram or compute the probability
distribution, since those are non-differentiable operations.

On sorting: TensorFlow has a differentiable sort, but it is a hack that simply
propagates the loss backwards to the position the original data was in before
it ended up in its sorted position. A loss of dist(sort(Y_pred),[1,n])
provides better results but still fails for large datasets due to the
fakeness of the sort derivative.
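
A sketch of that loss as I understand the description (variable names are mine; tf.sort's gradient just scatters back through the hard permutation, which is the "fakeness" being referred to):

    import tensorflow as tf

    def sorted_uniform_loss(y_pred):
        # Push the *sorted* outputs toward 1, 2, ..., n, so the loss only cares
        # about the multiset of values, not which sample produced which value.
        n = tf.size(y_pred)
        target = tf.cast(tf.range(1, n + 1), y_pred.dtype)
        return tf.reduce_mean(tf.square(tf.sort(y_pred) - target))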

I have a hunch that there is a mathematical way to measure some arithmetic
quality to optimize for that is maximized when the output set is the discrete
uniform distribution.

The mean, standard deviation, and other statistical measures are terrible
identifiers; in fact, by statistical theory, we would need n moments for a
dataset of size n to uniquely identify the distribution... so scratch those
off the list [2].

So there are two ways to achieve this milestone in ML:

1) a truly differentiable distance metric between two sets d(S,T)

2) a differentiable measure of ideal dispersion / density that forces the
output set S to converge to the discrete uniform distribution (this one is
more specific to the perfect-hash problem).

Perhaps this sort is the key to doing #1 generically so we can have a new type
of NN based on the output Set instead of the specific points. It is late here
but I am excited to hear feedback.

[1]
[https://en.wikipedia.org/wiki/Set_function](https://en.wikipedia.org/wiki/Set_function)

[2]
[https://en.wikipedia.org/wiki/Hausdorff_moment_problem](https://en.wikipedia.org/wiki/Hausdorff_moment_problem)

~~~
bionhoward
Is the loss you're talking about like the fused Gromov-Wasserstein distance?

We hit permutation-invariance issues like the ones you're describing in some
atomistic simulations, because the atoms need to be permutable if you want to
use the same model for chemistry as for protein folding/docking. The FGW
algorithm from e.g.
[https://arxiv.org/pdf/1811.02834.pdf](https://arxiv.org/pdf/1811.02834.pdf)
[https://tvayer.github.io/materials/Titouan_Marseille_2019.pd...](https://tvayer.github.io/materials/Titouan_Marseille_2019.pdf)
relaxes the invariance issue by adding a feature distance to the Euclidean
distance.

Higher-order distance matrices are a neat trick, but they blow up VRAM past
10-50k atoms; if you did it in mixed precision with newer GPUs it could scale
damn far. The problem is, the distance between distance matrices assumes the
target and source items are matched, so you get into iterative-closest-point
alignment, and pretty soon you're just reinventing RMSD.
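
For concreteness, the "distance between distance matrices" trick looks roughly like this (toy shapes, my own sketch; this comparison is exactly the O(n^2)-memory part):

    import torch

    def pairwise_dist(x):                    # x: (n_atoms, 3) coordinates
        return torch.cdist(x, x)             # (n_atoms, n_atoms) distance matrix

    def distance_matrix_loss(pred, target):
        # Compares internal geometry, so it is invariant to rigid motions, but NOT
        # to permuting the atoms: row i of pred is still matched to row i of target.
        return torch.mean((pairwise_dist(pred) - pairwise_dist(target)) ** 2)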

It would be cool for molecular stuff to have fast permutation-invariant,
set-based loss functions using transport theory, but this might be better
handled with a model-free approach (just let the AI figure out the loss
function itself).

------
salty_biscuits
Reminds me of an older result by Brockett, about using dynamical systems to do
these types of problems, that I've always found really interesting:

[https://ieeexplore.ieee.org/document/194420](https://ieeexplore.ieee.org/document/194420)

------
rsp1984
I find this super interesting but unfortunately don't have enough background
in Deep Learning to recognise what this would be used for. To train a model
that knows how to sort stuff (probably not)? Would someone have mercy and
ELI5?

~~~
nestorD
Deep learning models require that all of their components are differentiable in
order to fit their parameters.

This means that most building blocks for neural networks are basic linear
algebra and not much else (I am simplifying; nowadays we have access to a
surprisingly large array of operations).

This paper gives you two new building blocks, a sorting function and a ranking
function. The ranking function might have direct applications for recommender
systems.
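
As a toy illustration of the recommender-systems point (soft_rank here is a naive stand-in for whichever differentiable ranking implementation you use, e.g. the one from this paper):

    import torch

    def naive_soft_rank(x, temperature=0.1):
        # Placeholder: O(n^2) all-pairs relaxation; swap in a faster implementation.
        return 0.5 + torch.sigmoid((x.unsqueeze(0) - x.unsqueeze(1)) / temperature).sum(1)

    def ranking_loss(pred_scores, target_ranks, soft_rank=naive_soft_rank):
        # Train directly on a rank-based objective; with the hard rank the
        # gradient would be zero almost everywhere.
        return torch.mean((soft_rank(pred_scores) - target_ranks) ** 2)

    scores = torch.randn(5, requires_grad=True)          # e.g. a model's item scores
    target = torch.tensor([1.0, 3.0, 2.0, 5.0, 4.0])     # desired ranks (1 = best)
    ranking_loss(scores, target).backward()              # gradients flow into the scores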

------
brianpgordon
I feel like this is over my head, but if the problem is that sorting a vector
produces non-differentiable kinks in the output then why not just run a simple
polynomial regression over it and differentiate _that_?

------
breatheoften
Just skimmed the abstract — on a practical level — can this be used to better
train a global ranking function given subsets of example ranked data?

