
Building a Content-Based Search Engine IV: Earth Mover's Distance - deepideas
http://www.deepideas.net/building-content-based-search-engine-earth-movers-distance/
======
SteveJS
There is a paper arguing that the Earth Mover's distance (a.k.a. Wasserstein-1) gives better
convergence for GANs:
[https://arxiv.org/pdf/1701.07875.pdf](https://arxiv.org/pdf/1701.07875.pdf)

I tried it out on one of the Udacity deep learning assignments, using the
Wasserstein loss functions built into TensorFlow. I was unsuccessful in my
limited use: the discriminator always ‘won out’ rather than the two networks
finding a saddle point. I eventually got my project to work without it, and
did not go back to compare against just swapping the EM loss back in.
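
For reference, the core of that paper's objective is tiny; here is a minimal
sketch in TensorFlow (the function names are my own, and it omits the weight
clipping / gradient penalty that the WGAN papers use to keep training stable):

```python
import tensorflow as tf

# Minimal sketch of the WGAN losses from Arjovsky et al. (2017).
# `real_scores` / `fake_scores` are the critic's outputs on real and
# generated batches; weight clipping or a gradient penalty (omitted
# here) is still needed in practice for the Lipschitz constraint.

def critic_loss(real_scores, fake_scores):
    # The critic maximizes E[f(real)] - E[f(fake)], so minimize the negation.
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores)

def generator_loss(fake_scores):
    # The generator tries to make the critic score fakes highly.
    return -tf.reduce_mean(fake_scores)
```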

------
nl
EMD is something that really needs more, better implementations.

The one that everyone uses from Python isn't the easiest thing to install,
doesn't have a great API and isn't easy to extend.

I think Gensim recently added it, but it appears to use the same backend
solver.

Edit: this is a better article on EMD anyway:
[https://markroxor.github.io/gensim/static/notebooks/WMD_tuto...](https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html)

Edit 2: I forgot that Textacy has an implementation built on spaCy. It still
uses the same backend solver, but the API is nice ([https://chartbeat-
labs.github.io/textacy/api_reference.html#...](https://chartbeat-
labs.github.io/textacy/api_reference.html#textacy.similarity.word_movers))
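
For what it's worth, the Gensim call is a one-liner once word vectors are
loaded. A sketch (the vector file path is illustrative, and depending on the
Gensim version you'll need pyemd or POT installed as the backend solver):

```python
from gensim.models import KeyedVectors

# Load pretrained word vectors (the path is illustrative).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

# Word Mover's Distance between the two token lists.
print(vectors.wmdistance(doc1, doc2))
```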

~~~
cup-of-tea
It needs the Hungarian algorithm to solve it, and that's not the easiest
algorithm to implement. In fact, it's by far the hardest algorithm I've
implemented (I can't exactly remember why). I wrote it in Common Lisp and
worked on the performance quite a bit. It's still an O(n^3) algorithm, though.
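
If a SciPy dependency is acceptable, scipy.optimize.linear_sum_assignment
solves the underlying assignment problem for you. A sketch for the special
case where EMD reduces to assignment, i.e. two equal-size point sets with
uniform weights (general EMD is a transportation problem and needs a
min-cost-flow style solver):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd_uniform(xs, ys):
    """EMD for two equal-size point sets with uniform weights,
    where the transportation problem reduces to assignment."""
    cost = cdist(xs, ys)                      # pairwise ground distances
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cost[rows, cols].mean()

xs = np.random.rand(50, 3)
ys = np.random.rand(50, 3)
print(emd_uniform(xs, ys))
```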

------
amorroxic
Sentence similarity was my use case for WMD too. I arrived at a siamese setup
in Keras with a Wasserstein + KL loss (I have a known vocabulary, and feed in
both the word vector sequences and their LDA distributions as input).
Post-training, the cosine distance between encodings of such sequences looks
pretty decent, with one issue I've spotted: WMD really seems to prefer roughly
the same number of valid tokens in both sentences, which is not how the real
world looks. Eager to see results of EM distance between image feature
vectors, cheers.
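
For anyone who hasn't built one, the general shape of a siamese encoder in
Keras is sketched below. The sizes, the LSTM encoder, and the plain cosine
output are my own guesses, not the Wasserstein + KL setup described above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

seq_len, emb_dim, enc_dim = 30, 300, 128  # illustrative sizes

def build_encoder():
    # Encode a sequence of word vectors into a fixed-size vector.
    inp = layers.Input(shape=(seq_len, emb_dim))
    return Model(inp, layers.LSTM(enc_dim)(inp))

encoder = build_encoder()  # one encoder, shared by both branches

left = layers.Input(shape=(seq_len, emb_dim))
right = layers.Input(shape=(seq_len, emb_dim))

# Dot with normalize=True gives the cosine similarity of the two encodings.
sim = layers.Dot(axes=1, normalize=True)([encoder(left), encoder(right)])

siamese = Model([left, right], sim)
siamese.compile(optimizer="adam", loss="mse")
```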

------
rkwasny
The problem with WMD/EMD is that solving it is very slow.

Currently it's infeasible to implement even a simple real-time search engine
using it.
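
One common workaround, from the original WMD paper (Kusner et al. 2015), is to
prune candidates with the much cheaper relaxed-WMD lower bound and only run
the exact solver on the survivors. A rough numpy sketch, assuming uniform word
weights:

```python
import numpy as np

def rwmd_lower_bound(x_vecs, y_vecs):
    """Relaxed WMD lower bound: let each word move all of its mass to
    its nearest neighbor in the other document. O(n*m) instead of the
    O(n^3) exact solver, so it's cheap enough to prune search candidates.
    Assumes uniform word weights (1/n per word)."""
    # Pairwise Euclidean distances between the two sets of word vectors.
    dists = np.linalg.norm(x_vecs[:, None, :] - y_vecs[None, :, :], axis=-1)
    # Each one-sided relaxation is a lower bound; take the tighter one.
    return max(dists.min(axis=1).mean(), dists.min(axis=0).mean())
```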

------
kk58
WMD seems to be the best metric to use for similarity.

~~~
barbegal
WMD as in Word Mover's Distance?

