

A Word Is Worth a Thousand Vectors - legel
http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

======
ggchappell
This is interesting stuff. I recall that at one time Google seemed to be
heading in somewhat similar directions with Google Sets (now sadly gone -- I
miss it).

I know that the author is looking squarely at use cases along the lines of a
recommendation engine that would replace a human expert. But personally, I
think it might be more interesting to examine things the algorithm can do that
humans would find difficult or unintuitive. Sure, king - man + woman = queen
is a very significant achievement; it's also obvious, to a human. Now, what
can this algorithm come up with that is worthwhile, but that I would not find
so obvious?
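
(For readers who haven't tried it: the analogy really is plain vector
arithmetic. A minimal sketch with gensim, assuming a pretrained word2vec
binary on disk -- the filename below is illustrative:)

    # Sketch only: assumes a pretrained word2vec binary; filename is illustrative.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # most_similar() adds/subtracts the stored word vectors and returns the
    # nearest words, so this computes roughly king - man + woman.
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=3))
    # typically something like [('queen', ...), ...]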

A couple of little comments:

> The algorithm eventually sees so many examples that it can infer the gender
> of a single word, ....

Do we really want to say that? Perhaps we should say that the algorithm is
eventually able to make inferences that people would make based on knowledge
of the gender of words -- which is not quite the same thing. (And again, I
ask: what useful inferences can the algorithm make that humans would _not_
make so quickly?)

> Despite the impressive results that come with word vectorization, no NLP
> technique is perfect. Take care that your system is robust to results that a
> computer deems relevant but an expert human wouldn't.

It should be noted that the "no NLP technique is perfect" idea applies just
as well to the NLP techniques used by human brains.

~~~
Retra
It sounds like you are concerned that these inferences are over-fitting the
training data. But it is important to note that once you have a highly trained
model -- maybe one that can't produce anything novel but does extremely well
at doing what humans agree on -- it is probably very easy to relax your
process to produce novel results through iteration. At least then you know
you've built a system robust enough to get things right in a sterile setting.

On the other hand, demanding novelty from a system that can't even get the
basics right isn't likely to give you useful novelties so much as it will give
you random mish-mash.

To find a local maximum, you'll want to overshoot your target slightly and
then back up slowly.

------
jerf
It occurs to me the most-likely singularity won't be when humanity is wiped
off the face of the Earth, Skynet-style, or forcibly absorbed by encroaching
grey goo, but when you realize that you and every other human on the planet
went bankrupt in the same week because suddenly we are all getting targeted
with absolutely, utterly, completely irresistible targeted advertising
beyond our human ability to resist, and we've spent every dollar we have or
can get access to... and still the ads are coming. Thus ends Humanity, in a
Tantalus-ian hell of infinitely targeted ads to which we no longer have the
wherewithal to respond with purchases, spending our remaining days in
unbounded consumerist ennui....

~~~
hellbanner
Have you seen "Alternatives to the Singularity"?
[https://docs.google.com/presentation/d/1B75jindDAWsm8lBHPl4y...](https://docs.google.com/presentation/d/1B75jindDAWsm8lBHPl4yT6u6yi-IQMGimLcq8zWkW7Q/present?slide=id.i0)

------
Radim
Very nicely written -- as usual for Chris :)

One minor nitpick: near the end, Chris recommends LSH for similarity
retrieval. This may be a bad idea. That implementation seems to perform very
poorly: [benchmarks](https://github.com/erikbern/ann-benchmarks/pull/5#issuecomment-111750051)

As is often the case, simpler algorithms have fewer moving parts, and thanks
to cache locality they can even outperform algorithms that are superior in
big-O terms (see "bruteforce" in that same benchmark graph -- that's a simple
linear scan over the database! Observe how it's faster than most of the fancy
approximate algos).

Note that these benchmarks are run specifically on real-world vectors
(100-dimensional GloVe word vectors trained on 2 billion tweets), so they're
highly relevant here.
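
To make the "bruteforce" baseline concrete: for a few hundred thousand
100-dimensional vectors, exact nearest-neighbour search is essentially one
matrix-vector product. A rough numpy sketch (not the benchmark's actual code;
all names here are illustrative):

    import numpy as np

    # vectors: one row per word, e.g. 100-dimensional GloVe embeddings.
    def normalize(vectors):
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.clip(norms, 1e-12, None)

    def brute_force_top_k(vectors, query, k=10):
        """Exact top-k cosine neighbours via a single dense scan."""
        sims = vectors @ query                  # one matrix-vector product
        top = np.argpartition(-sims, k)[:k]     # unordered top-k indices
        return top[np.argsort(-sims[top])]      # sorted by similarity

    # Toy usage with random data standing in for real GloVe vectors.
    vecs = normalize(np.random.randn(400000, 100).astype(np.float32))
    neighbours = brute_force_top_k(vecs, vecs[0])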

------
hyperbovine
A thousand-dimensional vector, no?

~~~
sp332
Equivalently, a thousand one-dimensional vectors.

~~~
sirseal
No.

~~~
sp332
What's the difference?

~~~
bl
I'll give a try...

Imagine three, 4-by-1 vectors, each "one-dimensional". Twelve total scalars,
each vector with four rows and one column. Arrange these three vectors side by
side and merge them into a single 4-by-3 matrix. This matrix is "two-
dimensional".

Now, let's imagine five such matrices, each 4-by-3. Stack the five matrices
one on top of the other. We currently have a 4x3x5 matrix. This matrix, which
contains 60 scalars, is "3-dimensional".

Repeat a similar exercise 997 more times and you have a 1000-dimensional
matrix.

Compare that matrix to this: 1000 of our original 4-by-1 vectors arranged side
by side, which gives a 4x1000 matrix, which is simply a "two-dimensional"
matrix with 4000 elements.

~~~
sp332
A vector with four rows and one column is a four-dimensional vector. A one-
dimensional vector can be described with a single number, a 1x1 matrix if you
like.

~~~
bl
Oops. You're correct: a 2x1 vector is two-dimensional, a 3x1 vector is three-
dimensional, etc. <Trying to remember the terminology from linear algebra 15
years ago.> Each element of the mx1 vector represents a magnitude along an
orthogonal dimension ('scalars' for a set of 'basis vectors'). So then a
1000x1 vector would be "thousand-dimensional"; each element represents a
magnitude along an axis. But is this strictly equivalent to 1000 single-
dimensional vectors? eli173 suggests not, and I agree.

In constructing my incorrect answer in the grandparent comment, my thought
process was being guided by the way Matlab/numpy treat these items (and I
think I'm on solid ground that Matlab/numpy treat them differently because
mathematicians consider them differently). The built-in functions operate
_very_ differently (if they work at all) for

    size(A) = (m,1)

and

    size(A) = (m,n≠1)

So there may be 1000 numbers floating in the ether, but conceptually they're
not the same. Multiplying a 1000x1 vector by a 1xp vector has a completely
different result than multiplying one thousand 1x1 vectors by that same 1xp
vector.
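
(A quick numpy illustration of that last point -- just the shapes, nothing
deeper:)

    import numpy as np

    p = 5
    v = np.random.randn(1000, 1)   # one 1000x1 column vector
    w = np.random.randn(1, p)      # a 1xp row vector

    # A 1000x1 vector times a 1xp vector is an outer product: a 1000xp matrix.
    outer = v @ w
    print(outer.shape)             # (1000, 5)

    # One thousand 1x1 "vectors" times that same 1xp vector just gives a
    # thousand rescaled copies of w, each of shape (1, p).
    copies = [np.array([[x]]) @ w for x in v.ravel()]
    print(len(copies), copies[0].shape)   # 1000 (1, 5)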

Only many hours later do I realize that the original submission title
might've been wordplay on the phrase "a picture is worth a thousand words",
so my brain is _not_ reliable today. I shall refrain from spewing
more-likely-than-not incorrect statements concerning linear algebra.

