
Show HN: A search engine that lets you 'add' words as vectors - jakek
http://insightdatascience.com/blog/thisplusthat_a_search_engine_that_lets_you_add_words_as_vectors.html
======
gjm11
Hmm. It's a lovely idea but I find the results uninspiring. (Which isn't
surprising; it's a very difficult problem. But I was hoping to be amazed.)
Here are the examples I tried (all of them, no cherrypicking):

daughter + male - female -> { The Eldest (book), Songwriter, Granddaughter }

(Hopeless; should have had "son" in there)

pc - microsoft + apple -> { Olynssis The Silver Color (Japanese book), Burger
Time (arcade game), Phantasy Star (series of games) }

(Hopeless; should have had "Mac" in there)

violin - string + woodwind -> { clarinet, oboe, flute }

(OK)

mccartney - beatles + stones -> { Rolling Stone (magazine), carvedilol
(pharmaceutical), stone (geological term) }

(Poor; should have had Jagger or Richards or something of the kind in the top
few results)

sofa - large + small -> { relaxing, asleep, cupboard }

(Poor; I'd have hoped for "armchair" or something of the kind)

~~~
juxtaposicion
Some of this is that the underlying model is insufficiently trained, but some
of it is disambiguation. Disambiguation in text is a very, very hard problem.

Some of your examples do better when the terms are disambiguated: for Paul
McCartney - Beatles + Rolling Stones, Mick Jagger is in the 3rd spot.
([http://www.thisplusthat.me/search/Paul%20McCartney%20-%20Bea...](http://www.thisplusthat.me/search/Paul%20McCartney%20-%20Beatles%20%2B%20Rolling%20Stones))
Change stone -> Rolling Stones.

Thank you for the comment!

~~~
minikomi
Japan - Tokyo + History = Kyoto was what I was expecting, but I guess the
corpus isn't quite there yet. This was the answer:
[http://en.wikipedia.org/wiki/Atpara_Upazila](http://en.wikipedia.org/wiki/Atpara_Upazila)
At least it's geographic!

~~~
mjfl
Hitler -German +Italian = Sofiene Zaaboub apparently. Was hoping for
Mussolini.

------
nemo1618
Lisp + JVM = Clojure

I'm sold. This is really cool! (Though it's worth noting that a Google search
with the same terms returns the exact same result...)

~~~
ccanassa
I tried Lisp + Java and got Objective C instead.

------
leephillips
I was really excited by this writeup, so I tried it. Four test queries
returned nothing that seemed useful or even relevant:

fluid dynamics + electromagnetism : expected magnetohydrodynamics, got
Maxwell’s equations and classical mechanics (not useful);

verse + 5 - rhyme : expected blank iambic pentameter, Shakespeare, etc.; got
nonsense;

writer + American + Russian + Great - Nobel Prize : expected Nabokov, got
Meirkhaim Gavrielov + 1 nonsense result;

plant + illegal - addictive : expected cannabis, chronic, etc.; got “Plants”
(thanks) and “Nuclear Weapon” (?!?) and some Hungarian village.

EDIT: I thought maybe I wasn't being sufficiently imaginative, so I tried
"Nixon + Clinton - JFK" and got nothing that looked interesting. Then I
noticed that the "Nixon" part of my query was "disambiguated" to something
like "non_film", and the word "Nixon" was just stripped out. I think this
thing is just broken.

~~~
juxtaposicion
It's not perfect; the real limiting factor is the volume of text. For the
research paper behind word2vec, getting accurate associations for common words
(king, man, woman, etc.) required a training corpus of news text on the order
of a _billion_ words. That's roughly the size of all of Wikipedia, but the
text in Wikipedia has many more rare words, which spreads the number of
training examples per word rather thin. To address that, I'm thinking of
expanding the project to use Common Crawl data
([http://commoncrawl.org/](http://commoncrawl.org/)) to dramatically increase
the available supply of text.

------
doctoboggan
Hey juxtaposicion, fascinating work. I have many questions so I am just going
to shoot them rapid fire.

What is the dimensionality of each word vector and what does a word's position
in this space "mean"? What is this dimensionality determined by? Have you
tried any dimensionality reduction algorithms like PCA or Isomap? It would be
interesting to find the word vectors that contain the most variation across
all of wikipedia. Have you tried any nearest-neighbor search methods other
than a simple dot product, such as locality-sensitive hashing?

I guess most of those questions are about the word2vec algorithm, but you are
probably in a good place to answer them after working with it. Anyway, cool
work, and I'm glad you did it in Python so I can really dig in and understand
it.

~~~
juxtaposicion
Hey doctoboggan, awesome questions!

>> What is the dimensionality of each word vector and what does a word's
position in this space "mean"? What is this dimensionality determined by?

Each dimension is roughly a new way that words can be similar or dissimilar.
I've got 1000-dimensional vectors, so words can be similar or dissimilar in
only one thousand 'ways'. Associations like 'luxury', 'thoughtful', 'person',
'place' or 'object' are learned (roughly speaking). Of course, real words are
far more diverse, so this is an approximation. The number of dimensions is
configurable, and in theory more dimensions means more contrast is captured,
but you also need more training data. In practice, the number 1000 is chosen
because that maxes out the memory on my large-memory machine. That said, the
word2vec paper shows good results with 1000D, so it doesn't seem to be a bad
choice.
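
To make the add/subtract-and-rank idea concrete, here's a minimal (untested)
NumPy sketch -- the vocabulary and vectors below are made-up stand-ins for the
trained embeddings, so the output is only illustrative:

    import numpy as np

    # Toy stand-ins for trained 1000-dimensional word2vec embeddings.
    rng = np.random.default_rng(0)
    vocab = ["king", "queen", "man", "woman", "apple"]
    vectors = rng.standard_normal((len(vocab), 1000))
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    row = {w: i for i, w in enumerate(vocab)}

    def analogy(positive, negative, top_n=3):
        """Add the 'positive' vectors, subtract the 'negative' ones, then
        rank the whole vocabulary by cosine similarity to the result."""
        query = sum(vectors[row[w]] for w in positive)
        query = query - sum(vectors[row[w]] for w in negative)
        query /= np.linalg.norm(query)
        scores = vectors @ query  # cosine similarity (rows are unit length)
        order = np.argsort(-scores)
        return [(vocab[i], float(scores[i])) for i in order[:top_n]]

    # With real embeddings this would put 'queen' near the top.
    print(analogy(positive=["king", "woman"], negative=["man"]))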

>> Have you tried any dimensionality reduction algorithms like PCA or Isomap?

Yes! I've tried out PCA, and some spectral biclustering using the off-the-shelf
algorithms in scikit-learn. I only played around with this for an hour or so
and got discouraging results. Nevertheless, the word2vec papers actually show
that this works really well for projecting France, USA, Paris, DC, London,
etc. onto a two-dimensional plane where the axes roughly correspond to
countries & capitals -- exactly what you'd hope for! I wasn't able to
replicate that, but Tomas Mikolov was!
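
The projection itself is only a couple of lines of scikit-learn; here's an
(untested) sketch with random stand-in vectors rather than my trained ones:

    import numpy as np
    from sklearn.decomposition import PCA

    # Stand-in (n_words, 1000) embeddings; real vectors come from word2vec.
    words = ["France", "Paris", "USA", "Washington", "UK", "London"]
    vectors = np.random.randn(len(words), 1000)

    # Project to 2D; with well-trained vectors the country -> capital offsets
    # come out roughly parallel, which is the figure in the word2vec paper.
    coords = PCA(n_components=2).fit_transform(vectors)
    for word, (x, y) in zip(words, coords):
        print("%-12s %+.3f %+.3f" % (word, x, y))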

>> It would be interesting to find the word vectors that contain the most
variation across all of wikipedia.

Hmm, interesting indeed! I'm not sure how I'd go about measuring 'variation'
-- would this amount to isolating word clusters and finding the most dense
ones? Something like finding a cluster with a hundred variations of the word
'snow' (if you're Inuit)? I'd be willing to part with the raw vector database
if there's interest.

>> Have you tried any nearest-neighbor search methods other than a simple dot
product, such as locality-sensitive hashing?

Only a little bit, although I'm very interested in finding a faster approach
than computing the whole damn dot product (see:
[https://news.ycombinator.com/item?id=6720359](https://news.ycombinator.com/item?id=6720359)).
I worry that traditional locality-sensitive hashes, kd-trees, and the like
work well for 3D locations, but miserably for 1000D data like I have here.
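
For reference, the standard random-hyperplane hashing trick for cosine
similarity is roughly the following (untested sketch); whether bucketing on
these signatures actually beats the brute-force scan at 1000D is exactly what
I'm unsure about:

    import numpy as np

    def srp_signatures(vectors, n_bits=64, seed=0):
        """Signed random projections: bit i of a row's signature is the sign
        of its dot product with the i-th random hyperplane. Rows pointing in
        similar directions (high cosine similarity) share most bits."""
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((vectors.shape[1], n_bits))
        return (vectors @ planes) > 0  # boolean (n_rows, n_bits) array

    # Candidate neighbours are rows whose signatures have a small Hamming
    # distance to the query's signature; only those get an exact dot product.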

I should reiterate that most of the hard work revolves around the word2vec
algorithm, which I used but didn't write. It's awesome; check it and the
papers out here:
[https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/)

Whoa, that was a lot. Thanks!

~~~
dhammack
For 2D visualization, t-SNE is an excellent tool. I've used it with word2vec,
and you can see clusters of similar words:

[https://raw.github.com/dhammack/Word2VecExample/master/visua...](https://raw.github.com/dhammack/Word2VecExample/master/visualizations/figure_11.png)

[https://raw.github.com/dhammack/Word2VecExample/master/visua...](https://raw.github.com/dhammack/Word2VecExample/master/visualizations/figure_13.png)

[https://raw.github.com/dhammack/Word2VecExample/master/visua...](https://raw.github.com/dhammack/Word2VecExample/master/visualizations/figure_5.png)

And more in
[https://github.com/dhammack/Word2VecExample/tree/master/visu...](https://github.com/dhammack/Word2VecExample/tree/master/visualizations)
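
If you want to try it yourself, the core of it is only a few lines of
scikit-learn; this sketch uses random stand-in vectors, so swap in real
word2vec vectors to get meaningful clusters:

    import numpy as np
    from sklearn.manifold import TSNE

    # Stand-in embeddings; in practice these come from a trained word2vec model.
    words = ["cat", "dog", "paris", "london", "guitar", "violin"]
    vectors = np.random.randn(len(words), 1000)

    # Perplexity must be smaller than the number of points for a toy example.
    coords = TSNE(n_components=2, perplexity=2, init="random").fit_transform(vectors)
    for word, (x, y) in zip(words, coords):
        print("%-10s %+8.2f %+8.2f" % (word, x, y))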

------
juxtaposicion
Harvard - Boston + Silicon
[http://www.thisplusthat.me/search/Harvard%20-%20Boston%20%2B...](http://www.thisplusthat.me/search/Harvard%20-%20Boston%20%2B%20Silicon)

~~~
benmanns
Myspace - bands + success
[http://www.thisplusthat.me/search/Myspace%20-%20bands%20%2B%...](http://www.thisplusthat.me/search/Myspace%20-%20bands%20%2B%20success)

~~~
joshschreuder
I did 'Facebook - users' and got Myspace as the second result :)

~~~
zamalek
Facebook - Exploitation + Privacy = Blog

Kinda makes sense.

------
SandB0x
I saw what you wrote about your dot product speed issue. Did you try using
NumPy's einsum function?
[http://docs.scipy.org/doc/numpy/reference/generated/numpy.ei...](http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html)

It's really fast for this kind of stuff. Happy to give details about how to
use it if you need.

~~~
juxtaposicion
Hey SandB0x, thanks for the advice! I actually started off by using numpy.dot,
which is precisely what's needed. The problem is that I need it to go even
faster (it currently takes a few seconds), but that function is already heavily
optimized and uses Intel MKL to accelerate the math. In fact, my Cython
implementation would be slower than numpy.dot were it not for some embedded
logic that breaks out of the dot product halfway through the calculation. As I
compute row * row, going element by element, if the running sum of those
products gets to be really negative (indicating that in the dimensions
multiplied so far the two rows are highly dissimilar), I stop calculating the
rest of the row, thereby saving me from computing the full dot product. This is
cheating, obviously, but it's 3-4x faster than numpy.dot. So, because there's
branching logic in my implementation of the dot product, I can't express it in
terms of Einstein summations.
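
In pure Python the idea looks roughly like the sketch below (untested; the
real thing is Cython, and the threshold and chunk size here are made-up
placeholders):

    import numpy as np

    def early_abort_dot(row, query, abort_below=-5.0, check_every=100):
        """Accumulate the dot product chunk by chunk, but bail out early if
        the running sum drops below a threshold -- i.e. the two vectors
        already look too dissimilar to make the top results."""
        total = 0.0
        for start in range(0, len(row), check_every):
            stop = start + check_every
            total += float(np.dot(row[start:stop], query[start:stop]))
            if total < abort_below:
                return total  # approximate: remaining dimensions are skipped
        return total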

------
emehrkay
Interesting. I've played around with words as vectors (with values) and the
cosine similarity algorithm
([http://en.wikipedia.org/wiki/Cosine_similarity](http://en.wikipedia.org/wiki/Cosine_similarity)).
This is very cool stuff. I wonder how they're doing it in real time; it's
heavy number crunching.

~~~
juxtaposicion
Thanks! The number crunching is indeed very intense. The full body of vectors
has a million rows and a thousand columns, which fills up all of the 10GB of
available memory. When you punch in a query, it adds or subtracts the
requested vectors and takes an approximate dot product between every row in
the table and the search query. This is about 10^9 operations. I'm only
interested in the most similar (high cosine similarity) dot products, so to
speed things up I wrote a Cython dot product implementation that aborts the
calculation if the sum starts to look like it'll be very dissimilar,
essentially skipping lots of bad guesses. This speeds things up by a factor of
~5 or so. I'm debating offloading this computation to the GPU, which would be
perfect for it.

Edit. In case you're interested in the source:
[https://github.com/cemoody/wizlang](https://github.com/cemoody/wizlang)
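
The query path boils down to something like this (rough NumPy sketch with
assumed shapes; the actual inner loop is the early-exit Cython version
described above, this just shows the brute-force shape of the scan):

    import numpy as np

    def top_k_similar(table, query, k=10):
        """Brute-force scan: dot the normalised query against every row of
        the embedding table and return the indices of the k best scores."""
        scores = table @ (query / np.linalg.norm(query))
        top = np.argpartition(-scores, k)[:k]       # unordered top k
        return top[np.argsort(-scores[top])]        # sorted best-first

    # table: ~1,000,000 x 1,000 floats; one scan is ~10^9 multiply-adds.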

~~~
sillysaurus2
Since the aim is accuracy rather than throughput, would more memory help?

------
donretag
Interesting concept, but how will it work with more dynamic content? You can
train the model on a fairly static corpus such as Wikipedia, but what if your
content changes with greater frequency?

Since MapReduce is used, perhaps the model is already being trained in small
batches, making incremental updates possible.

~~~
juxtaposicion
Hi, creator here (Chris Moody). Great question. The underlying algorithm,
word2vec
([https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/)),
isn't built for streaming data, which means that at the moment it assumes a
fixed number of words from the beginning of the calculation. Unfortunately,
until the state of the art advances to accepting streaming data, the whole
corpus will have to be rescanned to accommodate dynamic content. Furthermore,
word2vec doesn't scale past OpenMP on single-node, shared-memory machines. So
while I did use MapReduce, it's just for cleaning and preprocessing the text,
not for training the vectors, which is the hard part.

So there's some exciting work to be done in parallelizing and streaming the
word2vec algorithm!

------
logn
daft punk - repetitive + lyrics == La Roux

nice work!

------
axblount
I guess we all just need a little more LeAnn Rimes.
[http://www.thisplusthat.me/search/the%20world%20-%20violence...](http://www.thisplusthat.me/search/the%20world%20-%20violence%20%2B%20love)

------
est
Sounds like this paper from Google

[http://www.technologyreview.com/view/519581](http://www.technologyreview.com/view/519581)

For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that
is similar to ‘queen’.
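
That analogy is easy to try yourself with gensim and a pretrained word2vec
model (the model file name below is just a placeholder):

    from gensim.models import KeyedVectors

    # Load pretrained embeddings, e.g. the GoogleNews vectors distributed
    # alongside the word2vec project (file name is a placeholder).
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # 'king' - 'man' + 'woman' lands near 'queen'.
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))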

------
jeorgun
Is it just me, or do almost half of my searches return `Dvbc' for no apparent
reason?

[http://www.thisplusthat.me/search/Saturn%20-%20Rings%20%2B%2...](http://www.thisplusthat.me/search/Saturn%20-%20Rings%20%2B%20Spot)

[http://www.thisplusthat.me/search/Chrome%20%2B%20open%20sour...](http://www.thisplusthat.me/search/Chrome%20%2B%20open%20source)

[http://www.thisplusthat.me/search/Unix%20%2B%20Open%20Source](http://www.thisplusthat.me/search/Unix%20%2B%20Open%20Source)

------
toolslive
Does this relate to Latent Semantic Indexing?
[http://en.wikipedia.org/wiki/Latent_semantic_indexing](http://en.wikipedia.org/wiki/Latent_semantic_indexing)

~~~
dhammack
Kind of. LSI is a dimensionality-reduction method, which means it takes really
sparse "bag of words" vectors and compresses them in a way that approximates
the original structure. Word2vec gives each word a dense (non-sparse) vector
based on the hidden units of a neural network language model. The model is
trained to predict the middle word from its surroundings, which makes it learn
how words are used in context.
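
A quick way to see the LSI side of that in code (sketch only, using
scikit-learn's TruncatedSVD on bag-of-words counts; the documents are made up):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat",
            "dogs and cats and more cats",
            "stock markets fell sharply today"]

    # LSI: compress sparse bag-of-words counts into a small dense space
    # that approximates the original co-occurrence structure.
    counts = CountVectorizer().fit_transform(docs)            # sparse docs x vocab
    doc_vectors = TruncatedSVD(n_components=2).fit_transform(counts)
    print(doc_vectors.shape)                                  # (3, 2)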

------
CurtMonash
Sounds like another go-round of 1990s (& early 2000s) concept search --
Excite, Northern Light, etc.

And it sounds really close to what I was trying at Elucidate.

------
dhammack
Hey, nice work! Can you explain the "comma-delimited list" functionality a bit
more? It seems (awesomely) similar to a hack I did a while back with Word2Vec
that would pick out the word that didn't belong in a list.

My hack:
[https://github.com/dhammack/Word2VecExample](https://github.com/dhammack/Word2VecExample)

------
Danieru
Fun bug: handheld - sony + nintendo = {Wii, Wii, Snes}

I was hoping for the DS or Gameboy but expecting at least something handheld.

~~~
emehrkay
I would guess that the "handheld" keyword is big with the Wii because of the
new control scheme it introduced, which was the main (sometimes only) thing
talked about with regard to the system.

------
grishma
Interesting. Currently it generates garbage for a lot of queries, but some
stuff is kinda fun. Forrest Gump - comedy + romance gives Pulp Fiction (!),
As Good as It Gets (match), and Polar Express (?). Avatar - action + comedy
gives The Office (haha!)

------
yetanotherphd
I know people like to keep things positive, but this is completely useless.
Apart from a few cherry-picked examples, subtracting words makes no sense most
of the time, and there is no clear advantage to their method when it comes to
adding words.

------
jboynyc
This is neat, and I found a few queries that added interesting results.
However, I tried

    
    
        Slavoj Žižek - Jacques Lacan - Hegel
    

which yielded an internal server error, probably due to the diacritics not
being encoded properly.

------
cocoflunchy
Bug report: using some non-ascii characters crashes the server (for example é
or É).

~~~
juxtaposicion
Thanks! I'll have to spend more time sanitizing the input.

------
zhemao
Albert Einstein - Smart = Niels Bohr, Werner Heisenberg, Wolfgang Pauli

Ouch, that's cold

------
breck
Neat stuff juxtaposicion.

Seems like this is how Numenta's AI works:
[http://www.youtube.com/watch?v=iNMbsvK8Q8Y](http://www.youtube.com/watch?v=iNMbsvK8Q8Y)

------
akennberg
Stanford - American + Canadian = University of Toronto

I think it should be Waterloo.

~~~
omaranto
Maybe because you thought CS and the algorithm thought some other field, say
math.

------
whistlerbrk
Works for me:

[http://www.thisplusthat.me/search/Dick%20Cheney%20-%20evil%2...](http://www.thisplusthat.me/search/Dick%20Cheney%20-%20evil%20%2B%20good)

~~~
fredsanford
Try this slight refinement...

The results were... Interesting.

[http://www.thisplusthat.me/search/Dick%20Cheney%20-evil%20%2...](http://www.thisplusthat.me/search/Dick%20Cheney%20-evil%20%2Bdumb%20%2Bgood)

------
somberi
Fantastic work, and it's relevant to something we are working on in this
space. Thanks.

On a lighter note, I tried "sarah palin + sexy" and got John McCain, Hillary
Clinton and Mitt Romney.

------
pit
Also interesting to try something like:

sleep - sleep

[http://www.thisplusthat.me/search/sleep%20-%20sleep](http://www.thisplusthat.me/search/sleep%20-%20sleep)

------
bocanaut
[http://www.thisplusthat.me/search/Germany%20-%20Fun](http://www.thisplusthat.me/search/Germany%20-%20Fun)
Germany - Fun = USA

:)

------
corobo
Hey this is pretty cool!

superman - male + female:

    
    
      - Lex Luthor (hmm..)
      - Superman's pal Jimmy Olsen (haha, what?)
      - Wonder Woman (That'll do it!)

------
ppymou
Great writeup. Curious, are there clear advantages that the vector
representation has over graph models (FB graph search, Google Knowledge
graph)?

------
SergeyHack
The default example "justin bieber - man + women" was OK, but I found a better
one: "justin bieber - women + man".

------
Lucy_karpova
What are the use cases for this fancy feature? I'm thinking of an e-advisor
for fun, but what are the real-life, serious use cases?

------
iLoch
ThisPlusThat.me - fast + slow...

Just kidding! :)

You could also say...

ThisPlusThat.me - another rant + something cool

Thanks for posting this, very interesting work!

------
iamchmod
I thought this one was good "Stanford - Red + Smart" = Berkeley

------
elwell
Server apparently wasn't ready for HN frontpage load

------
epaga
Pretty impressive for my first try.

iPad - cool -> Windows Phone

------
dlsym
reddit - dumb

Expected: HN, Got: Digg

