
Ask HN: Is there a differentiable measure on the equality of two sets? - goldenkey
I am doing some ML and I have two equal sized sets of numbers, A and B. B is actually a discrete uniform distribution - [0, n].

The net I am writing needs to take A as input and generate B as its output. The specific pointwise mapping a -> b is irrelevant; any injective function is sufficient.

I first wrote a custom loss function equal to the sum of the mean and std deviation minus expected. This worked somewhat, but not well enough, because those two measures aren't enough to uniquely identify a distribution.

I don't think the Kullback–Leibler divergence will work because it enforces an ordering / specific mapping.

Is there some differentiable measure that can be run over f(A), aka the predictions, to ensure it is equal to B?

I cannot find anything about loss functions that are for sets rather than point-wise mappings.
======
SamReidHughes
Having little experience in ML:

Sort each set into ascending order, treat the results as (n+1)-dimensional
vectors, and use Euclidean distance between them.

Googling, you could look at section 2.1 of
[https://papers.nips.cc/paper/3708-ranking-measures-and-loss-functions-in-learning-to-rank.pdf](https://papers.nips.cc/paper/3708-ranking-measures-and-loss-functions-in-learning-to-rank.pdf)
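A minimal numpy sketch of the sort-then-compare idea (in an autodiff framework such as PyTorch, `torch.sort` is differentiable with respect to the values, so the same construction gives a trainable loss):

```python
import numpy as np

def sorted_l2_loss(pred, target):
    # Sort both sets, then compare them elementwise with squared
    # Euclidean distance. Sorting makes the loss invariant to the
    # order in which the elements are produced.
    p = np.sort(np.asarray(pred, dtype=float))
    t = np.sort(np.asarray(target, dtype=float))
    return float(np.sum((p - t) ** 2))
```

In the OP's setup the target would be the sorted vector [0, 1, ..., n], and the loss is zero exactly when f(A) hits every target value once.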

------
PaulHoule
Can't you have input and output vectors that look like

[0,1,0,1,1]

for the set (1,3,4)?
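That is, a multi-hot membership vector. A tiny sketch of the encoding (`set_to_indicator` is a hypothetical helper name):

```python
def set_to_indicator(s, n):
    # Encode a set of integers drawn from {0, ..., n-1} as a binary
    # membership vector of length n.
    v = [0] * n
    for k in s:
        v[k] = 1
    return v
```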

~~~
goldenkey
Yes but the size of the sets is going to be enormous. Would like to avoid use
of one-hot encodings.

That still doesn't solve the issue of allowing any injective mapping, though
- solving for a specific one can be done through the standard loss functions -
but the density of solutions is sparse, on the order of 1/n! where n is the
output set size.

------
shoo
i'm confused. can you define what spaces A, B and f(A) live in? what space
does f(a) live in, for a in A?

edit:

> two equal sized sets of numbers, A and B. B is actually a discrete uniform
> distribution - [0, n]

> The net I am writing needs to take A as input and generate B as its output.
> The specific pointwise mapping a -> b is irrelevant, any injective
> function is sufficient.

> I don't think the Kullback–Leibler divergence will work because it enforces
> an ordering / specific mapping.

I don't see how KL divergence enforces an ordering / specific mapping.

I'm going to assume that you want to learn a function f[Theta] that accepts
sets as inputs and maps them to output sets, where Theta are some parameters
you're going to fit. Further, to make things more plausibly continuous, i'm
going to assume that f[Theta] accepts distributions as inputs and maps them to
output distributions.

Let X be some reasonable space. Let A and B be distributions over X, i.e. A, B
in D(X).

Let's say we want to learn some function f[Theta] : D(X) -> D(X) such that
f(A) = B .

Aside: it seems a bit weird to be doing ML with a single training example (A,
B) -- I'm assuming you have some very heavy structural constraints on
`f[Theta]`, and/or a bunch of other training data {(A, B)_i}_{i=1...M} ....

If we tried to define loss using KL divergence, it might look something like

loss(f(A), B) = sum_{y in X} f[Theta](A)(y) log( f[Theta](A)(y) / B(y) )

as long as f[Theta](A)(y) is sufficiently smooth in your parameters theta, the
expression for the loss might be reasonably differentiable.

So, how to interpret f[Theta](A)(y) ?

Assuming X is discrete, the dumbest representation of `f` I can think of would
be to let theta be a matrix of weights addressed by pairs of indices from X*X
. So the input distribution A could be interpreted as a vector of weights
indexed by X (say with some constraints about non-negativity and summing to
1), and y itself is just another index from the space X, so

f[Theta](A)(y) := matvec(Theta, A)_y = sum_{x in X} theta_{y, x} A_x

Then the loss function would look something like

sum_{y in X} (sum_{x in X} Theta_{y, x} A_x) log( (sum_{x in X}
Theta_{y, x} A_x) / B(y) )

So it's (1) definitely something you could evaluate (in principle) and
differentiate, and (2) doesn't force some pre-defined mapping between points
-- you can still set the weights Theta_{y, x} to whatever you want.
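as a sanity check, here is that construction in numpy (assuming the columns of Theta are themselves distributions, so that matvec(Theta, A) stays a distribution; the eps terms just guard against log(0)):

```python
import numpy as np

def kl_loss(theta, a, b, eps=1e-12):
    # f[Theta](A)(y) = sum_x theta[y, x] * a[x], i.e. a matrix-vector
    # product. If each column of theta is non-negative and sums to 1,
    # the output is a distribution whenever `a` is.
    fa = theta @ a
    # KL divergence of f[Theta](A) from B
    return float(np.sum(fa * np.log((fa + eps) / (b + eps))))
```

note that any permutation matrix theta drives the loss to zero, which is exactly the "any injective mapping" freedom the OP asked for.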

~~~
shoo
more coherently:

given a pair of vectors x and y representing our input and target sets
(encoded as distributions)

find a function f from some space that best solves: f(x) = y

if we require x, y, f(x) to be distributions, we might restate this as a
minimisation problem:

    
    
      given input x and target y s.t.:
        0 <= x <= 1 and sum(x) = 1
        0 <= y <= 1 and sum(y) = 1
      constraints:
        for any z s.t. 0 <= z <= 1 and sum(z) = 1 we require 0 <= f(z) <= 1 and sum(f(z)) = 1
      minimise
        d( f(x), y )
    

where d( . , . ) is some appropriate distance or divergence, such as KL
divergence, squared error, and so on.

in the simple case where we additionally require that f be a linear operator
A, and pick the L^2 norm as our loss function, we get a constrained
minimisation problem to solve for a matrix A that minimises ||Ax - y||^2
subject to all our constraints.
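a rough sketch of solving that constrained problem by projected gradient descent (the projection here is a crude clip-and-renormalise rather than an exact simplex projection, and the step size and iteration count are arbitrary):

```python
import numpy as np

def fit_stochastic_matrix(x, y, steps=2000, lr=0.5):
    # Minimise ||A x - y||^2 over column-stochastic matrices A
    # (non-negative entries, each column summing to 1).
    n = len(x)
    A = np.full((n, n), 1.0 / n)  # start from the uniform map
    for _ in range(steps):
        grad = 2.0 * np.outer(A @ x - y, x)  # d/dA of ||A x - y||^2
        A = A - lr * grad
        # crude projection back toward the constraint set
        A = np.clip(A, 0.0, None)
        A = A / (A.sum(axis=0, keepdims=True) + 1e-12)
    return A
```

since A x is a convex combination of A's columns, any target distribution y is reachable, so the fitted map should drive the residual essentially to zero here.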

