
Query2vec: Search query expansion with query embeddings - eggie5
https://bytes.grubhub.com/search-query-embeddings-using-query2vec-f5931df27d79
======
vector_spaces
As a frequent user in Albuquerque, my personal experience with GrubHub search
is that it returns very strange results and that categorizations are
unreliable. For example, a restaurant whose name contains my query (for a
query like "Chinese" or "Thai") may not even appear in the results for that
query, and relevant results will be buried beneath completely irrelevant ones
(a taco place and an ice cream joint might appear above a Thai restaurant in
the above example).

Further, restaurants will be miscategorized, or will e.g. have the word
"sushi" in the name and yet not appear in the sushi category (not sure if
there's actually a sushi category in the app; just using that as an example).

~~~
eggie5
Author here: we hope to fix this w/ the techniques described in the post! The
search engine will first collect a high-recall set of candidates, which is
then passed to a high-precision ranker. This should help with the cuisine and
dish query misclassifications you describe. It will also introduce semantic
understanding into the query engine, meaning if you type "French" it will
give you French cuisine, not "French fries".
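A two-stage pipeline like this can be sketched roughly as follows. All
restaurants, fields, and scoring rules here are invented for illustration;
this is not GrubHub's actual retriever or ranker:

```python
# Toy two-stage search: a cheap high-recall candidate fetch,
# then a precise re-rank.

RESTAURANTS = {
    "Thai Palace":   {"cuisine": "thai", "rating": 4.6},
    "Bangkok Bites": {"cuisine": "thai", "rating": 4.1},
    "Taco Town":     {"cuisine": "mexican", "rating": 4.8},
}

def retrieve_candidates(query_cuisine):
    # Stage 1 (high recall): cast a wide net; precision is not the goal yet.
    return list(RESTAURANTS)

def rank(query_cuisine, candidates):
    # Stage 2 (high precision): exact cuisine matches first, then by rating.
    def score(name):
        r = RESTAURANTS[name]
        return (r["cuisine"] == query_cuisine, r["rating"])
    return sorted(candidates, key=score, reverse=True)

results = rank("thai", retrieve_candidates("thai"))
```

With this split, the cuisine query "thai" surfaces both Thai restaurants
ahead of the (higher-rated but irrelevant) taco place.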

~~~
throwawayhhakdl
This seems like a pretty good write up, but I do wonder if it’s necessary? I
use grub hub and Uber eats fairly often. If I want to search for a restaurant
I would hope it’s just doing a match on exactly what I search with minor
spelling checks.

If I’m looking for a dish or cuisine, why not display it in a visual fashion
like Uber eats? There’s usually not that many options to actually sift
through.

For example the first graphic of the embedding space has a very clear mixture
of restaurant names and food items. Feels a bit messy.

------
dnc
"The difference between this and word2vec is that the primitive atomic unit we
are embedding isn’t a word, but a search query. Therefore this isn’t a
language model, but a query model."

If a primitive embedding unit is a search query instead of a word, then I
assume that a query vector model is trained on a limited dictionary of
queries. I wonder if that implies that the trained vector model can encode
only search queries that are already present in its dictionary? If not, I
think it would be interesting to know more about how the closed dictionary
problem was solved.

~~~
eggie5
Correct, the model has a fixed vocabulary set at train-time. However, we can
benefit from the fact that the distribution of queries in most search systems
follows the Zipfian distribution. That means we can capture the vast majority
of queries that are ever issued in a fixed vocabulary. Daily retraining helps
us pick up new queries outside of the vocabulary.
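To illustrate the Zipfian point, here is a toy sketch with an invented query
log showing how a fixed top-k vocabulary can still cover most query traffic:

```python
from collections import Counter

# Invented query log with a Zipf-like skew: a small head of frequent
# queries accounts for most of the traffic.
query_log = (["pizza"] * 50 + ["sushi"] * 25 + ["thai"] * 12 +
             ["burgers"] * 6 + ["pho"] * 3 + ["khachapuri"] * 1)

counts = Counter(query_log)
vocab = {q for q, _ in counts.most_common(3)}  # fixed vocab: top-3 queries

covered = sum(n for q, n in counts.items() if q in vocab)
coverage = covered / len(query_log)  # fraction of traffic the vocab handles
```

Here a vocabulary of just three queries covers roughly 90% of the issued
traffic, which is why a closed vocabulary plus periodic retraining is workable.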

~~~
mlthoughts2018
How is your model fundamentally different from doc2vec though? The idea of
swapping a vocab of words for a vocab of tokens, sentences, paragraphs,
queries, articles, pages, etc., has been around and used in research and
production virtually since the same time as word2vec itself.

One of the most common “off the shelf” search solutions is to train an
embedding for X (where X is whatever you want to search) using some implicit
similarity or triplet loss for examples that have natural positive labels
(like due to proximity in the word2vec case) plus random negative sampling,
and use the embedding in an exact or approximate nearest neighbor index.
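That recipe can be sketched in a few lines. The query strings, embedding
dimensionality, and margin below are invented placeholders, and the
nearest-neighbor lookup is exact brute force rather than an ANN index:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings for three queries.
emb = {q: rng.normal(size=8) for q in ["thai", "pad thai", "oil change"]}

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Push the positive at least `margin` closer to the anchor
    # than the negative.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# "thai" / "pad thai" form a natural positive pair (e.g. co-click data);
# "oil change" plays the randomly sampled negative.
loss = triplet_loss(emb["thai"], emb["pad thai"], emb["oil change"])

def nearest(query_vec, k=2):
    # Exact nearest-neighbor search by Euclidean distance.
    return sorted(emb, key=lambda q: np.linalg.norm(emb[q] - query_vec))[:k]
```

In production the brute-force `nearest` would be replaced by an approximate
index, but the loss and the sampling scheme are the part that varies per problem.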

In fact, it’s even very “off the shelf” to use Siamese networks for multi-
modal embeddings, for example simultaneously learning embeddings that put
semantically similar queries and food images into the same ANN vector space.

I think the blog post is very cool, don't get me wrong, but no part of this is
novel (if that's what you were going for; I can't tell). This exact end-to-end
pipeline has been used for image search, collaborative filtering (customer &
product embeddings), various recommender systems where the unit of embedding
is something like “pages” or “products” or “blog posts” or other product
primitives for whatever type of business, in production across several
different companies I’ve worked for.

~~~
eggie5
I agree:

In the IR community you will see image2image search implemented by passing an
image through a headless CNN and then doing ANN search in the embedding space.
That can lead to the cross-modal hashing you suggested. I think you also
touched on LTR techniques w/ siamese networks: for example, we can use a
siamese network to learn the difference between two embeddings.

In the RecSys community, for collaborative filtering, you can take interaction
data between users and items and decompose that matrix into two parts using
SVD, yielding the user and item embeddings. Also in the RecSys community you
see people doing item2vec (what I did) for the item embeddings and then
aggregating those into the user representation (embeddings).
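The SVD route can be sketched as follows, with an invented user-item
interaction matrix:

```python
import numpy as np

# 3 users x 4 items; 1 means the user interacted with the item. Invented data.
R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Decompose the interaction matrix; truncate to rank k for dense embeddings.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_emb = U[:, :k] * s[:k]   # one k-dim embedding per user (rows)
item_emb = Vt[:k, :].T        # one k-dim embedding per item (rows)

# Dot products of user and item embeddings approximate the original matrix.
approx = user_emb @ item_emb.T
```

The rank-k product `user_emb @ item_emb.T` reconstructs `R` approximately,
which is exactly the "decompose into two parts" framing above.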

Maybe you will grant me the novelty of this method in making the connection
between queries and converted restaurants as a similarity metric and then
applying it to the IR use case of search query expansion? Should the post be
more clear that I am not claiming to have invented item2vec or ANN
(approximate nearest neighbors)?

~~~
mlthoughts2018
Your final paragraph is reasonable, and I do think the post would benefit from
more clearly talking about this.

The “trick” is to find some naturally occurring property that indicates a pair
of examples has the positive label: two users selecting a given product, two
text queries leading to clicks on the same image, two different blog posts
sharing a keyword, etc. Combine this with an efficient way to sample
acceptable negative pairs that do not express the trait indicating a positive
label.

That + deciding how to approach the loss function (e.g. cosine similarity,
euclidean distance, distance in a hash space, triplet loss, should it have a
margin) is the whole trick of the problem.

The black box that produces embeddings (DNN, doc2vec, sparse vector from
tfidf-like features, whatever) and the nearest neighbor piece are standard
plug and play.

~~~
eggie5
thanks, I think we are in agreement.

------
gourabmi
Excellent post. Is this a candidate for the
[https://sigir.org/sigir2020/](https://sigir.org/sigir2020/) industry track?

~~~
eggie5
Author here. Thanks for the recommendation. I actually got the inspiration for
this idea from chatting w/ industry colleagues at SIGIR. At the last SIGIR, in
Paris, I spoke to some people from Apple who are using a technique like this
for App Store search.

------
bayonetz
Neat! eggie5 and friends, I did something related for Caviar that you might be
interested in: [https://developer.squareup.com/blog/caviars-word2vec-
tagging...](https://developer.squareup.com/blog/caviars-word2vec-tagging-for-
menu-item-recommendations/)

~~~
eggie5
Although your post is orthogonal to what we present in this work, I think it
still merits discussion.

The high-level concept of a product is important for any commerce company w/
an unbounded and unstructured product catalog. In this case, the concept of a
Dish is a very valuable primitive for Caviar and GH to understand.

You propose an interesting technique for dish tagging:

1. Train word2vec on menu item text
2. Generate menu item embeddings by aggregating the respective word vectors
3. Curate a dish (label) set
4. Generate embeddings for dish labels by aggregating word vectors
5. Tag menu items w/ their dish label by NN search
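A toy sketch of that tagging pipeline, with hand-made two-dimensional word
vectors standing in for trained word2vec output (the words, labels, and
vector values are all invented):

```python
import numpy as np

# Hand-made 2-d "word vectors"; similar food words are placed near each
# other by construction, mimicking what word2vec would learn.
word_vec = {
    "tuna":  np.array([1.0, 0.1]),
    "roll":  np.array([0.9, 0.2]),
    "sushi": np.array([1.0, 0.0]),
    "beef":  np.array([0.1, 1.0]),
    "taco":  np.array([0.0, 0.9]),
    "tacos": np.array([0.0, 1.0]),
}

def embed(text):
    # Aggregate (average) word vectors for a phrase.
    vecs = [word_vec[w] for w in text.lower().split() if w in word_vec]
    return np.mean(vecs, axis=0)

dish_labels = ["sushi", "tacos"]                 # curated label set
label_emb = {d: embed(d) for d in dish_labels}   # label embeddings

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def tag(menu_item):
    # Tag a menu item with its nearest dish label by cosine similarity.
    v = embed(menu_item)
    return max(dish_labels, key=lambda d: cos(v, label_emb[d]))
```

With these vectors, "tuna roll" lands nearest the "sushi" label and "beef
taco" nearest "tacos", which is the whole tagging trick.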

Then once you have this high-level concept of dish you can drive all kinds of
interesting product innovations:

* trending Asian noodle dishes
* tacos in San Diego
* it also helps w/ the item sparsity problem in matrix factorization

Do I understand?

I'm going to try this on our next learning Friday.

We have taken a different route for dish tagging. First we have a graphical
model tag some items w/ low coverage but high precision, and then we feed that
into a classifier to generalize and provide full coverage. It's like a
generative model fed into a discriminative model.

~~~
misterman0
>> items w/ low coverage but high precision

Newbie NLP dev here. What's this concept of coverage vs precision that you
speak of? It sounds intriguing.

~~~
sdnlafkjh34rw
Precision is the fraction of the items you tag that are tagged correctly
(TP / (TP + FP)); coverage is the fraction of the items that should be tagged
that you actually manage to tag (TP / (TP + FN)), i.e. recall. A
high-precision, low-coverage tagger labels only a few items but is rarely
wrong about them.
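In code, with invented confusion counts for a low-coverage, high-precision
tagger:

```python
# Invented counts: 80 correct tags, 5 wrong tags, 120 items left untagged.
tp, fp, fn = 80, 5, 120

precision = tp / (tp + fp)  # of the items we tagged, how many were right
coverage = tp / (tp + fn)   # of the items deserving a tag, how many we tagged
```

Here precision is about 0.94 while coverage is only 0.40: a tagger that is
trustworthy but touches a minority of the catalog, which is why a second,
generalizing classifier is fed with its output.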

------
lootsauce
This is a fantastic post. Would love to see an even more detailed walk through
some code examples, and discussion of development to production of these
models. Do you have any other resources you would suggest?

~~~
lootsauce
For example, what ANN do you use in production, and what are the reasons for
using it? Another question: how much data does it take to make these models
work? Can they be primed effectively with little data by doing transfer
learning from a more general model?

~~~
eggie5
Hi, I'm the author of the post. We actually just presented this work and more
at PyData NYC, where we shared more implementation details. Here are the slides:
[https://www.slideshare.net/AlexEgg1/discover-
yourlatentfoodg...](https://www.slideshare.net/AlexEgg1/discover-
yourlatentfoodgraphwiththis1weirdtrick-pydata-nyc-2019)

But to answer your question here: for production ANN we have Annoy integrated
into our backend as a service. Annoy was an easy choice bc it checked our box
for JVM support.

For training the models we have endless amounts of behavioral data, so we
didn't even need to look at transfer learning. For this query2vec example, it
was trained on 1 year of queries, which takes 15 min/epoch on an AWS p2 GPU
instance. We do all our preprocessing (heavy normalization) in PySpark.

------
manigandham
Neat tech write-up, but ironically Grubhub has just lost 50% of its market cap
due to poor earnings, where the company has basically admitted its business
model is indefensible and borderline unsustainable.

~~~
eggie5
Author here. Thanks for the comment. This is a tough business to be in, but
very exciting at the same time.

I'm trying to effect change by making meaningful contributions to our search
pipeline by applying novel techniques from information retrieval literature.

For example, with this query expansion technique, we can serve search queries
in small markets (without a lot of restaurants) more completely by showing
something instead of nothing. Or in big markets, we can serve obscure tail
queries by finding something similar. This helps increase the overall recall
of the search system, making it better!
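A toy sketch of that expansion step; the query strings and vectors below are
hand-made stand-ins for trained query2vec embeddings:

```python
import numpy as np

# Invented query embeddings: the food queries are placed near each other,
# the unrelated query far away.
query_emb = {
    "khachapuri":    np.array([0.9, 0.1, 0.0]),
    "georgian food": np.array([1.0, 0.0, 0.0]),
    "cheese bread":  np.array([0.8, 0.2, 0.1]),
    "oil change":    np.array([0.0, 0.0, 1.0]),
}

def expand(query, k=2):
    # Expand a tail query with its k nearest neighbors in embedding space.
    v = query_emb[query]
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    others = [q for q in query_emb if q != query]
    return sorted(others, key=lambda q: cos(query_emb[q], v), reverse=True)[:k]
```

If "khachapuri" returns nothing in a small market, its expansions
("georgian food", "cheese bread") can be searched instead, so the user sees
something rather than nothing.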

Don't lose faith in GH yet!

~~~
manigandham
I'm happy to read about the tech. Not so much about the business but as long
as you're working on what you like then who am I to say anything. Good luck!

------
nautilus12
Curious, what's the underlying persistence? Are you actually using a graph
database?

~~~
eggie5
The latent graph structure is encoded into the embeddings. Each node in the
graph is a vector, and you traverse the graph by performing NN search in the
embedding space.

