Hacker News new | comments | ask | show | jobs | submit login
Building a Content-Based Search Engine IV: Earth Mover's Distance (deepideas.net)
70 points by deepideas 9 months ago | hide | past | web | favorite | 7 comments

There is a paper that indicates Earth mover, aka Wasserstein-1 is better for convergence of GANs: https://arxiv.org/pdf/1701.07875.pdf

I tried it out on one of the Udacity Deep learning assignments using the Wasserstein loss functions built into tensorflow. I was unsuccessful in my limited use. The discriminator always ‘won out’ rather than the combo finding a saddle point. I eventually got my project to work without it, and did not go back to compare against just swapping EM back in.

EMD is something that really needs more, better implementations.

The one that everyone uses from Python isn't the easiest thing to install, doesn't have a great API and isn't easy to extend.

I think Gensim recently added it, but I think they use the same backend solver.

Edit: this is a better article on EMD anyway: https://markroxor.github.io/gensim/static/notebooks/WMD_tuto...

Edit 2: I forget Textacy has an implmentation built on Spacy. Still uses the same backend solver, but the API is nice (https://chartbeat-labs.github.io/textacy/api_reference.html#...)

It needs the Hungarian algorithm to solve it and it's not the easiest algorithm to implement. In fact, it's by far the hardest algorithm I've implemented (I can't exactly remember why). I wrote it in Common Lisp and worked on the performance quite a bit. It's still an O(n^3) algorithm, though.

Sentence similarity were my explorations with WMD too, reached a setup in Keras with a siamese configuration, Wasserstein + KL loss (have a known vocabulary and feeding both word vector sequences as well as their LDA distributions as input). Post training cosine distance between encodings of such sequences look pretty decent - with one issue I've spotted though: WMD really seems to like about the same number of valid tokens in both sentences which is not how real world looks like - eager to see results of EM distance between image feature vectors, cheers.

The problem is WMD/EMD is that solving it is very slow.

Currently it's unfeasible to implement even simple real time search engine using it.

Wmd seems to be the best metric to use for similarity

WMD as in word movers distance?

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact