
King – man + woman is queen; but why? - stared
http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html
======
antiquark
For a D&D "alignment chart", set x axis to [illegal -- legal] and y axis to
[evil -- good]. Then start typing in words. Some surprises there!
[https://lamyiowce.github.io/word2viz/](https://lamyiowce.github.io/word2viz/)

~~~
antiquark
Preliminary findings!

The most illegal thing is "heroin"

The most legal thing is "CEO"

The most good thing is "teacher"

The most evil thing is "lucifer"

~~~
openfuture
Murder is more legal than money

Priests are about as legal and rich as criminals, same for nuns wrt. janitors

'Sad' is rich and legal, 'happy' is poor and illegal. Same delta with 'power'
and 'money'.

~~~
nitrogen
Also interesting to use "peasant" -> "ruler" as the Y axis, while leaving the
X axis as "she" -> "he". It shows a possible gender bias in language, with all
of the male words being higher on the peasant-ruler scale than female words.
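
For anyone who wants to reproduce these axis plots outside the browser, here's a rough sketch using gensim and numpy (the model file name and the word lists are just examples; any pre-trained word2vec model would do):

    import numpy as np
    from gensim.models import KeyedVectors
    
    # Load a pre-trained word2vec model (file name is only an example).
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)
    
    def axis(neg, pos):
        # Unit vector pointing from the `neg` word towards the `pos` word.
        direction = vectors[pos] - vectors[neg]
        return direction / np.linalg.norm(direction)
    
    x_axis = axis("she", "he")          # left = she-like, right = he-like
    y_axis = axis("peasant", "ruler")   # bottom = peasant-like, top = ruler-like
    
    for word in ["king", "queen", "prince", "princess", "duke", "duchess"]:
        v = vectors[word]
        print(word, float(v @ x_axis), float(v @ y_axis))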

------
sauravjain
I really like these posts by Sanjeev Arora :

1) [http://www.offconvex.org/2015/12/12/word-embeddings-1/](http://www.offconvex.org/2015/12/12/word-embeddings-1/)

2) [http://www.offconvex.org/2016/02/14/word-embeddings-2/](http://www.offconvex.org/2016/02/14/word-embeddings-2/)

For a more theoretical explanation :

[https://arxiv.org/abs/1502.03520](https://arxiv.org/abs/1502.03520)

~~~
stared
I learned about them just a few days before publishing this post (and I link
to the second one, in the "technicalities" section).

Though some statements about analogies should be taken with a grain of salt; see Tal Linzen, Issues in evaluating semantic spaces using word analogies,
[https://arxiv.org/abs/1606.07736](https://arxiv.org/abs/1606.07736).

------
stephencanon
I've never found the "vector space" of word2vec remotely satisfying. In order
to form a vector space, you need to also be able to make sense of scalar
multiplication, and you need to be closed under arbitrary linear combinations.
What is 2king? What is 3king - 2green + 0.5brutality? You can kind of make
sense of this for adjectives, but it really breaks down with nouns.

~~~
rm999
You may already know this part, but for anyone who doesn't: vector space
models very often use cosine distance to make comparisons instead of
Euclidean distance. In this model, you can visualize your vectors as points on
a unit hypersphere, where the distance between two vectors is how far apart
they are on the sphere. x+y finds the point between them; x-y finds the point
between x and the antipode of y (or, perhaps more intuitively, pushes x away
from y). 2*x doesn't have any additional meaning (it's the same as x), but
x+2y is the equivalent of finding a point that is proportionally closer to y
than to x (I think it ends up twice as close to y as to x, but I didn't do the
math). Edit: to be clear, this paragraph is very hand-wavy on the math and is
just meant to create an accurate-enough visual.
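
In code, that picture looks roughly like this (a deliberately hand-wavy numpy sketch, not gensim's actual implementation; `vectors` stands for any word -> numpy array lookup):

    import numpy as np
    
    def unit(v):
        return v / np.linalg.norm(v)
    
    def most_similar(query, vectors, exclude=()):
        # Return the vocabulary word whose direction is closest to `query`;
        # cosine similarity ignores length, so only direction matters.
        q = unit(query)
        best_word, best_sim = None, -1.0
        for word, vec in vectors.items():
            if word in exclude:
                continue
            sim = float(unit(vec) @ q)
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word, best_sim
    
    # king - man + woman, hopefully landing near "queen":
    # query = vectors["king"] - vectors["man"] + vectors["woman"]
    # print(most_similar(query, vectors, exclude={"king", "man", "woman"}))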

My intuition of why things like king - man + woman work is that the points
in the vector space model happen to form a well-behaved manifold with
smooth meaning changes. It's not very principled, but it does work.

I wrote a series of blog posts with a coworker about doing this with music:

[https://tech.iheart.com/mapping-the-world-of-music-using-machine-learning-part-1-9a57fa67e366](https://tech.iheart.com/mapping-the-world-of-music-using-machine-learning-part-1-9a57fa67e366)

[https://tech.iheart.com/mapping-the-world-of-music-using-machine-learning-part-2-aa50b6a0304c](https://tech.iheart.com/mapping-the-world-of-music-using-machine-learning-part-2-aa50b6a0304c)

Instead of mixing nouns and adjectives, we do things like mixing songs and
artists and radio stations etc. In the second post we show how Nirvana - Kurt
Cobain + Female Vocalist works remarkably well. I've studied empirically why
this worked, and the best I could come up with is that the high dimensional
space we created had a very dense set of points in the region of popular
western music that led to a smooth manifold.

~~~
CPLX
Those posts are really interesting.

------
DonaldFisk
Why is this approach preferred to treating words as symbols connected to other
symbols? E.g.

    
    
       (king (genl headOfState)
             (sex male))
       (woman (genl person)
              (minimumAge 18)
              (sex female))
    

What are the advantages of storing words as floating point vectors? I can see
how inputting huge amounts of text into a simple algorithm might be less
labour intensive than manually building (or using a more complex algorithm to
build) a dictionary. However, at least you can use your dictionary for other
purposes, and its contents are readily verifiable by non-technical people.

~~~
barrkel
This is ridiculously simplistic - take woman, for example. The distinction
between woman and girl isn't nearly as simple as age. There's a sorites
paradox in trying to find a dividing point of age, but actually woman connotes
a bunch of other concepts probabilistically. It suggests independence and
maturity versus childishness and cuteness, while girl is often used in the
context not of children but of women as objects of courtship - girls on a
night out, girlfriend, etc. And a poem or song may play with the multiple
meanings and ambiguities - where is Britney's Not a Girl in this symbolic
analysis?

The only way you can get symbols to work is with weights. Follow this to its
logical conclusion and I think you'll end up with a system isomorphic with the
vector approach, with dimensions representing something like symbols.

~~~
DonaldFisk
It was a very simplified example. The Cyc database contains more realistic
examples. Yes, I'd seriously consider augmenting symbols with numeric data.
You could even store the vector obtained from word2vec as a property of the
symbol if that's found to be useful in capturing nuance, but you can't store
symbols in floating point vectors.

~~~
barrkel
Here's another example.

    
    
        The fat cat is sitting on a mat.
    

Only someone who has spent too much time away from people could possibly think
that the meaning of this sentence is that somewhere, there is a fat cat
sitting on a mat. It's a direct allusion to books used to teach reading to
young children. It's a reference to simple sentences and simple words.

And even within the world of the sentence, it only has a vague meaning. What
exactly makes the cat fat? Is it neutered or lazy or overfed? Why is it
sitting on a mat - is the mat outside a door, is it waiting to be let in? What
kind of a cat is it - could it be a big, dangerous cat? The sentence is laden
with signifiers and unknowns. Western children are taught to consider
sentences like these in the abstract, but it's not a natural way of thinking,
because it's not practical in a life lived connected to the world.

Abstract hypotheticals are the hallmark of more disconnected concerns, and we
teach our children this early, in part using silly, deliberately vague
sentences like these, and discouraging curious questions that might resolve
the ambiguities.

An AI system that's designed to handle abstract sentences like these is not
one designed to understand human language, because humans don't reason like
this unless they're thinking analytically - and even then, they do so
blinkered with biases and errors.

~~~
homarp
>Only someone who has spent too much time away from people could possibly
think that the meaning of this sentence is that somewhere, there is a fat cat
sitting on a mat.

or someone who did not grow up using the English language.

~~~
czep
I grew up with English, and when I read that sentence, to me it means that
there is a fat cat sitting on a mat. Not sure what point was being made here.

------
dottrap
I wonder, if you could train on just Disney movie scripts, would

King – man + woman = Princess?

Or maybe: King – man + woman = villain?

(because of characters like Cruella, Maleficent, Ursula, etc.)

------
ANaimi
Here's word2vec as REST API (sign in to interact from the browser):
[https://algorithmia.com/algorithms/nlp/Word2Vec](https://algorithmia.com/algorithms/nlp/Word2Vec)

~~~
stared
I didn't know that one. Is it Google's word2vec dataset?

~~~
ANaimi
Yep, based on Google News. Model is 1.5GB when zipped.

------
MichailP
Can someone explain a bit more about the vector space model for words (and
documents)? I first saw that approach in Prof. Erik Demaine's lecture on
algorithms [1], and also here. It's fascinating how linear algebra and vector
spaces pop up in unexpected places.

[1]
[https://courses.csail.mit.edu/6.006/spring11/lectures/lec01.pdf](https://courses.csail.mit.edu/6.006/spring11/lectures/lec01.pdf)

~~~
stared
For documents there are various approaches. I would suggest using Latent
Dirichlet Allocation; see my links here:
[https://pinboard.in/search/u:pmigdal?query=lda](https://pinboard.in/search/u:pmigdal?query=lda).
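
For a taste of what that looks like in practice, here's a minimal gensim sketch (the toy documents are made up; real use needs proper tokenization and far more text):

    from gensim import corpora, models
    
    documents = [
        "the king rules the kingdom",
        "the queen rules the kingdom",
        "neural networks learn word vectors",
        "word embeddings capture word meaning",
    ]
    texts = [doc.lower().split() for doc in documents]
    
    dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words counts
    
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, topic in lda.print_topics():
        print(topic_id, topic)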

------
sandworm101
And here I thought this was a statement about the strange rules of succession
in modern western monarchy. Riddle me this: why are women who marry kings made
queens, while men who marry queens are made only princes?

~~~
qbrass
Because, like a deck of cards, king trumps queen in the ranking of titles and
they want to place the person who married into the family below the one who
was born into it.

~~~
sandworm101
Close, but I would say that "queen" covers two different jobs: being a female
monarch, and the older role of 'wife to the king'. Some queens are monarchs
and some are only wives, with no right to rule on their own (i.e. they are not
daughters of any previous monarch). Kings are only ever kings by right.

~~~
true_religion
I believe there is such a thing as the king-consort.

------
GotAnyMegadeth
Interesting how, in the diagram at the top, all of the female words are less
than or equal on the queen axis to their male counterparts.

------
ilaksh
It's interesting, but ultimately to really 'under' 'stand' language in a deep
way, the systems will need representations based on lower-level (possibly
virtual) sensory inputs.

That is one of the main enablers for truly general intelligence, because it's
based on this common set of inputs over time, i.e. the senses. The domain is
sense and motor output, and this is a truly general domain.

It's also a domain that is connected to the way the concepts map to the real
physical world.

So when the advanced agent NN systems are put through their paces in virtual
3D worlds by training on simple words, phrases, commands, etc. involving
'real-world' demonstrations of the concepts, then we will see some next-level
understanding.

~~~
ilaksh
See
[http://www.goertzel.org/papers/PostEmbodiedAI_June7.htm](http://www.goertzel.org/papers/PostEmbodiedAI_June7.htm)
or start with
[http://courses.media.mit.edu/2004spring/mas966/Harnad%20symbol%20grounding.pdf](http://courses.media.mit.edu/2004spring/mas966/Harnad%20symbol%20grounding.pdf)

------
hauleth
As a Pole you should know better. It isn't true in all cases, e.g. Jadwiga was
king, not queen.

------
candiodari
I've always found it funny what these machine learning insights mean for how
humans think.

You have people who focus on grammar and spelling. But word embeddings collect
their insights by taking any sequence of 5 words, taking out the middle word,
and jumbling up the result (technically they express it in a way that ignores
order: 1 bit per word, where 1 means the word is in the sentence and 0 means
it's not, so the sequence of the words in the input to the network is
completely independent of their sequence in the sentence). And they understand
that "king is to man as queen is to woman" and lots of other things.

When going deeper you quickly start to realize a few things: in 90% of
sentences the sequence of words does not matter. No, not even whether "not"
appears before or after the verb (and thus refers to the subject or the object
of the sentence). Word sequence doesn't matter. Which noun you place an
adjective next to is semantically important - really important - and every
English (or any language, I imagine) teacher will hammer the point home again
and again. And yet ... it almost never matters, in the sense that getting it
wrong will not cause something dumber than a human to misinterpret the
resulting sentence. So why do we care? Social reasons (i.e. to fuck other
humans, or more generally, to get them to do stuff for us).

It's a weird thing that keeps coming back in machine learning. Humans think
their reasoning is high level. Yet algorithms that keep track of maybe 2 or 3
variables per individual can predict the actions of crowds with uncanny
accuracy. Tens to hundreds of thousands of people, each believing they're
individuals who think about what they're doing, make not just the same
decision, but with an enormous probability come to that decision within
minutes of each other.

I am reminded of a quote by Churchill. Humans appear smart individually, but
it's a trick, an impression, a facade, almost an illusion. If they act in a
group, that intelligence is utterly gone, and they almost always act in dumb
ways in large groups, even when they are acting alone. Intelligence is 95% a
parlor trick used in conversation, to make friends or to mate, like a
peacock's feathers, and only 5% or less something we actually use to act. So
its purpose, from a species' perspective, is not at all to act intelligently,
merely to appear intelligent to others. The second thing is that everyone,
even if they are smart and reason correctly about the world around them, will
still act stupid. Without someone to impress, you could have a triple Nobel
Prize and you won't act it. So intelligence doesn't work in an individual, and
it doesn't work in most groups. It only works in groups where the group's
interaction has people impressing each other with what they did, with some
sort of reward being given for that.

~~~
cfmcdonald
How can it be semantically important yet not matter? There are in fact
languages where word order is very free, but English is not one of them. E.g.
"Bob shot Alice" is very different from "Alice shot Bob". Which noun you place
an adjective next to doesn't matter? So "The dumb teacher taught the student"
means the same thing as "The teacher taught the dumb student"?

~~~
candiodari
No, it means that in the majority of cases the context is clear regardless of
the word order, so a neural network can learn the meaning despite jumbling the
word order.

"attacks mouse cat"

"attacks cat mouse"

"cat attacks mouse"

"cat mouse attacks"

"mouse cat attacks"

Only "cat mouse attacks" could mean something that's even slightly different.

