
King – Man + Woman = King? - flo_hu
https://medium.com/p/king-man-woman-king-9a7fd2935a85
======
JD557
At first I thought "What's the big deal? You also remove the query from
recommender systems, for example. It's an obvious uninteresting result!".

But then I read the linked article [Nissim 2019] and it all became much
clearer with the example in the title: "Man is to Doctor as Woman is to
Doctor". If you remove "Doctor" from the results, you'll get "Nurse" instead,
not because the dataset/society/... is biased, but because you inserted the
bias in your model!

With this small "optimization" (and others presented in the article, such as
the "threshold-method" and hand-picked results from the Top-N words), it's
trivial to use the analogy method to show that any dataset is biased, since
you are filtering the unbiased results!

On a side note, I wish articles like this were more popular. I find that it's
really easy to use AI techniques "the wrong way" and there's a lack of
articles pointing to common pitfalls (and, to make things worse, there are a
lot of blog posts doing things wrong, which only validates bad methodology).

[Nissim 2019]:
[https://arxiv.org/pdf/1905.09866.pdf](https://arxiv.org/pdf/1905.09866.pdf)

~~~
tgb
The other reason to not exclude King is that it's pretty likely that Queen is
the closest word to King. Then King - X + Y = Queen is just saying that X and
Y are close to each other, not an interesting result.

~~~
derefr
From my perspective (linguistic anthropology), they’re not actually all that
close. Most historical “queens” (which have that label applied to them by
modern English speakers) were not rulers (and we have a separate term, “queen
regnant”, for that) but rather the gender dual to the male “royal consort.” It
was only in recent history that you see examples of “equal-opportunity”
monarchies that could have either a male or female monarch of equal power, and
thus usages of “queen” to denote those monarchs.

Thus—given that we’re defining words based on their centroids of usage in a
historical corpus—if a woman is a monarch of a kingdom, “king” is a tighter
historical fit to describe her role than “queen” is.

~~~
tareqak
One example:
[https://en.wikipedia.org/wiki/Jadwiga_of_Poland#Coronation_(...](https://en.wikipedia.org/wiki/Jadwiga_of_Poland#Coronation_\(1384\))

------
Retric
The first term may be more accurate than you might think. Elizabeth I of
England ruled as Elizabeth Rex where Rex is Latin for "King" and Regina is the
female term.

Technically, in England, Queen is an ambiguous term: queen regnant is the
actual ruler, queen dowager is the widow of a king, etc. But, again, several
European and other female rulers used the male term.

Hatshepsut crowned herself Pharaoh and maintained an elaborate legal fiction
of maleness, because ya know she was in charge and could do what she wanted.

Hungary had two female kings: Mary of the House of Anjou and Maria Theresa.

Etc etc.

~~~
flo_hu
Good point! I would see this rather as yet another argument for why you should
simply give the actual output of the NLP algorithm.

So if people actually do the calculation King-Man+Woman and it comes closest
to King, then they should report "King-Man+Woman~=King" and not
"King-Man+Woman=Queen" (only because that's what they expected).

~~~
Sean1708
To be honest, I think the idea that we should expect ML algorithms to give a
single, certain answer is misguided. I would expect the output from this
algorithm to be "King - Man + Woman = King (90%), Queen (83%), Prince (70%)"
or something like that, i.e. a list of answers with some measure of how "good"
those answers are. Then again, I work in a field that doesn't really have
categorical answers so maybe I'm missing something obvious.

~~~
flo_hu
That's pretty much correct. You would typically calculate a vector for "King-
Man+Woman" and then do a query on this based on a cosine distance (or similar
measure) over the entire vocabulary.

The query would give you a ranked list of the closest word vectors with scores
that indicate how good the match is.
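That pipeline can be sketched in a few lines. A minimal sketch, assuming numpy, with made-up 2-d toy vectors standing in for real 300-d embeddings (the words and numbers here are invented purely so the ranking mirrors the behaviour the article describes):

```python
import numpy as np

# Made-up 2-d "embeddings"; real word2vec vectors are ~300-d and learned
# from a corpus. Chosen so the ranking mirrors the article's observation.
vocab = {
    "king":  np.array([1.0, 0.00]),
    "queen": np.array([0.7, -0.70]),
    "man":   np.array([0.5, 0.50]),
    "woman": np.array([0.5, 0.45]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranked_query(target, exclude=()):
    """Rank the whole vocabulary by cosine similarity to `target`."""
    scores = [(w, round(cosine(target, v), 3))
              for w, v in vocab.items() if w not in exclude]
    return sorted(scores, key=lambda p: -p[1])

target = vocab["king"] - vocab["man"] + vocab["woman"]

print(ranked_query(target))                     # honest report: "king" wins
print(ranked_query(target, exclude={"king", "man", "woman"}))  # now "queen"
```

With these toy numbers, the honest ranked list puts "king" first; only after silently dropping the input words does "queen" rise to the top, which is exactly the filtering the article objects to.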

------
slx26
I don't know. The problem here might be human expectations. As the article
says, word2vec sometimes is "sold" with these kinds of analogies that seem so
_intuitive_. So we come to expect the system to behave intuitively. And in
many cases it does.

But, what are you really doing there? Well, you are operating with vectors.
And what do those vectors represent? Not the semantics about the words
themselves. The vectors encode the _context_ in which they appear. They encode
what other words tend to surround them.

And there are more limitations. The most obvious one is polysemy. But even
without polysemy, different words might have more or less "diffuse"
representations. The more technical, precise and uncommon a term is, the less
"diffuse" its vector is. King and queen, man and woman are far from precise,
technical or uncommon terms. And therefore, there's a lot of "diffusion".
Again, the problem here is our expectations versus the results we see in
practice. But the models are what they are. And they are not magic.

~~~
SiempreViernes
The titular example is very often _sold_ as something very close to the
semantics; so the problem is indeed in the expectations, expectations created
by how word2vec embeddings are often presented as vectors that magically
understand the semantics.

------
derEitel
What happens if you get the latent space of king, do no algebra and return the
outcome with king being excluded? In case it's queen, then the author is
correct and these examples are highly misleading. In case it's something like
prince, Lord or ruler of the seven kingdoms the latent space algebra would be
suitable imho.

Also, if we think about it in terms of decision manifolds, it seems the
distance between queen and king is too large for the simple - man + woman to
have an effect. Why not scale that subtraction, so it leads to a change in
predicted class without removing king? But of course finding a justifiable
weight would be hard..

~~~
philh
In one model I found online[1], the closest word to "king" was "kings" (0.71)
and the second closest was "queen" (0.65).

[1] [http://bionlp-www.utu.fi/wv_demo/](http://bionlp-www.utu.fi/wv_demo/)
(making sure to select the English model)

~~~
derEitel
If we eliminate king from the result list, I would assume its plural is
removed as well. Bummer, that would mean the latent space algebra in this
example has no effect whatsoever..

~~~
Majromax
> If we eliminate king from the result list, I would assume its plural is
> removed as well.

In fact, the plural isn't removed. You can see the effect by analogizing A:B
:: A:?.

For man:king :: man:?, you get [kings, queen, monarch, crown_prince] as the
top 4. For man:king :: woman:?, the results are [queen, monarch, princess,
crown_prince], with 'kings' as #6.

Of course, your model may vary.

------
mr_crankypants
A related thing that amuses me: One of the (less popular†) benchmarks for
embedding models is their performance on a TOEFL word analogy test. At least
as of the last time I perused the literature, which was a few years ago, there
was only one algorithm for which I've seen anyone report really high scores on
this test. For all the crowing about how great Word2Vec is at finding
analogies, it wasn't SGNS or CBOW. Not GloVe, either. It was boring, unsexy,
30-year-old, doesn't-even-have-a-strong-theoretical-foundation latent semantic
analysis.

† Because it doesn't measure performance at something people typically want to
do with these models in real life - there's a strong "party trick" component
to the word analogy stuff.

~~~
Der_Einzige
Topic models are going to get blown the hell out by UMAP and other
manifold-based dimensionality reduction algorithms.

~~~
mr_crankypants
I haven't been following the literature too closely lately - have there been
promising results in using it for topic modeling or semantic vector modeling?

My own sense is, the real problem that the space needs to contend with isn't
that the math isn't elaborate enough. It's that we're still waiting for
someone to come up with a really good way to deal with polysemy, and with
multi-word phrases that form a single semantic unit.

~~~
Der_Einzige
Re: promising results

Yes, there have been.
[https://github.com/lmcinnes/umap_paper_notebooks](https://github.com/lmcinnes/umap_paper_notebooks)
The author of UMAP shows impressive results with 3 million word vectors

For dealing with multiple meanings, it's already been done. Sense2vec resolves
most of these issues, and a WordNet-integrated version of word2vec or the
newer "XLNet" would be state of the art by a long shot, but no one seems to
want to implement it, so the world waits longer for good NLP models, I guess...

~~~
mr_crankypants
Sense2vec isn't really what I'm waiting for.

It relies on word sense disambiguation, which tends to be one of those very
language-specific things, and so I'd expect (but haven't verified) that, like
other techniques that rely on language-specific bits, it wouldn't work as well
on most non-English text. And the most interesting polysemy problems aren't to
do with part of speech. They're things like "apple-as-in-food" vs "apple-as-
in-computer", or figuring out that "The Big Apple" doesn't have anything to do
with either of those. What would be _really_ interesting is dealing well with
jargon, slang, and terms of art.

As far as those notebooks, is there one in particular I should be looking at?
I might have missed something, but the stuff I saw basically just
demonstrated, "Hey, we can handle a lot of training data really fast." What
I'd be more interested in seeing is, "Hey, plug us into your document
classification pipeline and your performance (as in accuracy) metrics won't
know what hit them."

edit: For a more concrete example of what I'd like to see, and going back to
the analogy task: The holy grail I'm looking for isn't "king - man + woman =
queen". It's more like "software engineer = programmer", and also "software
engineer != software + engineer".

~~~
Der_Einzige
You'll find that if you run UMAP on a large corpus (the same size as your
original word embeddings), the embeddings it generates (especially if you feed
it any labels, as UMAP supports semi-supervised and supervised dimensionality
reduction) should outperform those generated, I'd even wager, by modern
transformers. If they don't, then they'll be maybe 2% worse in exchange for a
lot of speed improvement, even on the currently single-threaded implementation
of UMAP.

Oh, and you can use UMAP to concatenate tons of vector models together, along
with all other side data, for super-loaded embeddings.

~~~
visarga
So, do you run UMAP on the PMI matrix or on precomputed word embeddings? Seems
like UMAP requires dense vectors as input.

------
program_whiz
A subtle point not really mentioned or brought out: these vectors are 300
dimensional, and once you perform an "analogy" of "X - Y + Z = W", it's almost
certain that W does not exactly match anything in your vector space (the
probability of overlap is almost certainly zero). That means you must have an
algorithm for choosing which non-matching word to pick. The point of the
article is that if you pick strictly the closest one (in terms of euclidean
distance in the space), then you usually end up back where you started. There
are many measures that could be used, and who is to say that distance along
all dimensions is equally important?

In 3d space for example (much simplified from 300d), things that are
separated along a vertical axis are usually very different in kind from
things separated even by a great distance on the horizontal axis (100m below
you is likely to be much different from 100m in front of you).

Or imagine a skyscraper: most of the "sameness" lies in a small horizontal
region (one block, say), but a vast vertical region (100 stories).

This is a simplified analogy to word vectors, but the point is that even if
two words are "close" in 300d space, if we don't understand what those
dimensions mean, we can't say which one is more likely to be "similar" for a
specific pair of words (King/Queen vs Mouse/Cat). Using euclidean or cosine
similarity may or may not be relevant for one particular case.
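The metric-choice point can be made concrete with a toy numpy sketch. The vectors below are invented specifically so the two common measures disagree: for the same target, euclidean distance and cosine distance pick different nearest neighbours.

```python
import numpy as np

# A query vector and two candidate "words", chosen (artificially) so that
# euclidean distance and cosine distance disagree on the nearest neighbour.
target = np.array([1.0, 0.0])
candidates = {
    "same_direction_far": np.array([2.0, 0.0]),       # aligned, but distant
    "different_direction_near": np.array([0.9, 0.5]), # nearby, but off-axis
}

def nearest(metric):
    """Return the candidate word minimizing the given distance metric."""
    return min(candidates, key=lambda w: metric(target, candidates[w]))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(nearest(euclidean))        # "different_direction_near"
print(nearest(cosine_distance))  # "same_direction_far"
```

So before arguing about which word "won" an analogy query, it matters which metric was used to decide the winner.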

------
logfromblammo
If the system returns the same result word that was in the query, doesn't that
just mean that the word doesn't have the intrinsic meanings you thought it
did?

Besides that it's actually "King - Man + Woman = Queen [Regnant]", which
broken down into smaller pieces should be "King - Man = Monarch", "Monarch +
Woman = Queen [Regnant]". Then you also have the complicating factor of
"Monarch + Wife = Queen [Consort]" and "Monarch + Husband = Prince [Consort]".
It seems obvious from this, and "Husband - Man = Spouse" and "Wife - Woman =
Spouse" that "Monarch + Spouse = Consort".

This allows a little thought experiment. If the king marries a man, you have a
king and a prince. That's fine; no ambiguity there. If the queen [regnant]
marries a woman, you have a queen and a queen. This is confusingly ambiguous.
Do you rename the queen [regnant] to king, or do you rename the queen
[consort] to princess [consort]?

------
konz
Related submission from 42 days ago: Differences between the word2vec paper
and its implementation
[https://news.ycombinator.com/item?id=20089515](https://news.ycombinator.com/item?id=20089515)

------
dang
A related thread from 2017:
[https://news.ycombinator.com/item?id=13346104](https://news.ycombinator.com/item?id=13346104)

------
martindbp
I started playing around with word embeddings in SpaCy for finding synonyms,
antonyms or any kind of word similarity that could be used to cluster a list
of words into different conceptual groups. So far, the results don't seem to
correlate very well with what I would consider "similar". Hopefully combining
it with tools like WordNet and lemmatization + Levenshtein distance can get me
closer to something useful though.

~~~
nerdponx
Word vectors like this only work if the two words you care about appear in
similar contexts in the corpus it was trained on.

So the assumption is that words from similar context should be similar. But
you're always going to miss out on some words that are similar but do not
appear in similar contexts.

~~~
feanaro
If the words are really similar in meaning, you should still be able to arrive
at a useful result using some graph manipulations.

The words might not appear in a directly shared context, but given their
semantic similarity, they should share more contexts of distance 1 (or
something along those lines) than an arbitrary pair of words.

------
peapicker
The whole "martini - gin + whiskey" cartoon isn't great, as
"cocktail" really is the best answer it could give. (A manhattan would also
have to switch from dry vermouth to sweet vermouth, and change the garnish
from an olive or lemon peel to a cherry, and traditionally it would not use
just any whiskey, but use rye specifically.)

~~~
dllthomas
A martini _can_ be sweet vermouth (in its early days that was actually more
common), but your point overall is valid.

~~~
crazygringo
But then from what I learned it’s not a Martini, it’s a Martinez! And yes it
did come first. :)

[https://en.m.wikipedia.org/wiki/Martinez_(cocktail)](https://en.m.wikipedia.org/wiki/Martinez_\(cocktail\))

~~~
dllthomas
Apparently it's all quite a bit less clear than I remembered it. _shrug_

------
Der_Einzige
Has the author made sure he's using the exact same set of word embeddings as
utilized by the authors of that paper? Are they trained on exactly the same
corpus with the same parameters? Do the md5 hashes match? This could be
cheating... Or it could be a case of model mismatch. I'm left without a clear
reason to prefer one explanation over the other.

~~~
habitue
If the embedding relationships aren't stable over different trainings, then
they aren't really meaningful right? It's just noise in that case

------
tgb
I've pondered making a video game based off Word2Vec. User collects words from
conversations with NPCs and had to combine them to get new words which they
use in conversation or maybe as physical items to cross barriers, etc. Not
quite sure how it would work except that it would probably devolve into just
random guessing. Still, I'd love to see it tried.

~~~
duckmysick
While not based on Word2Vec, there's a puzzle game that uses words to craft
and manipulate rules of a physical world: Baba Is You.
[https://hempuli.com/baba/](https://hempuli.com/baba/)

------
zerubeus
Stop publishing to medium for God’s sake!!!

~~~
anchpop
Medium wants me to start a $5/month subscription to read this article. If I
pay, does the author see any of that money? Or is it just Medium capitalizing
off their writing?

~~~
flo_hu
I removed my blog post from medium's distribution. So it should now be freely
accessible! [https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935a85](https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935a85)

------
ajuc
BTW sometimes in history that equation was true. For example in medieval
Poland Jadwiga was formally elected a king (król) not a queen (królowa). Only
several years later her husband became a king and she got demoted to a queen.

------
DoctorOetker
A bit off topic: does anyone know where I can easily download pretrained,
simultaneously trained, separate word and context vectors? That is, before
they are added together to get a single set of vectors?

------
scotty79
So

King - tomato + potato = Queen

?

~~~
flo_hu
No. That's of course still "King" :)

(but sure, one could also pick queen, prince, royal from the list...)

Just tested it here:
[http://vectors.nlpl.eu/explore/embeddings/en/calculator/#](http://vectors.nlpl.eu/explore/embeddings/en/calculator/#)

And it gave me 0.63 King, 0.6 Prince etc...

~~~
scotty79
So = Prince, because you should exclude King, similarly to how you exclude it
to get Queen in the original example.

------
ajuc
Fixing "he - doctor" : "she - doctor" to "he - doctor" : "she - nurse"

is kinda sexist.

------
manuka
Who thinks here that:

"King – (Man + Woman)"

and

"(King – Man) + Woman"

must have different outcomes? I do. No wonder Word2Vec sucks! Words are not
elements of vector space :)

~~~
pure-awesome
Subtraction is not associative. One cannot just move parentheses around like
that with numbers either:

5 - (2 + 3) =/= (5 - 2) + 3

The same holds for elements of a vector space.
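The same counterexample carries over componentwise to vectors; a quick numpy check:

```python
import numpy as np

a = np.array([5.0, 1.0])
b = np.array([2.0, 2.0])
c = np.array([3.0, 3.0])

# Moving the parentheses changes the sign on c, so the results differ
# whenever c is nonzero.
left = a - (b + c)   # subtract the sum
right = (a - b) + c  # subtract b, then add c back

print(np.array_equal(left, right))  # False
```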

~~~
raverbashing
You are correct

I was about to say you were wrong, but you are correct, and it is a bit
unintuitive why - [https://www.quora.com/Is-vector-subtraction-
associative](https://www.quora.com/Is-vector-subtraction-associative)

~~~
EForEndeavour
Maybe I'm missing the point here, but what's so unintuitive about subtraction
(vector or scalar) not being associative? Counterexamples seem easy to find.
For example, 1 - 1 - 1 is not the same as 1 - (1 - 1).

------
raverbashing
The author seems to have just learned that most machine learning algos are not
"free lunch" but there's a good amount of poking and tuning and some
"cheating" needed.

It's part of the learning curve, and I spent quite some time learning the
basics.

~~~
flo_hu
I'm fine with the free lunch thing. But here the cheating is done on the level
of how people present the capabilities of the tool. If you ask the algorithm
how "SHE is to LOVELY as HE is to X", the reported answer (Bolukbasi 2016) was
"BRILLIANT", which in this case suggests a heavy gender-bias. But what the
algorithm actually gives for X is: "LOVELY". The authors just picked the
10th example in the list without clearly stating it.

~~~
yorwba
> The authors just picked the 10th example in the list without clearly
> stating it.

That's not an accurate description of what Bolukbasi et al (2016) [0] did. In
particular, they do not list _x_ close to _lovely + he - she_ and then pick
arbitrarily from that list. Instead, they explicitly reject that approach (see
appendix A), because they're looking for pairs of words that are maximally
gendered. They do that by finding _x_ and _y_ such that the angle between _x -
y_ and _she - he_ is minimized. Since the task they're solving is different,
you can't fault them for getting different results.

[0] [https://arxiv.org/abs/1607.06520](https://arxiv.org/abs/1607.06520)
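The criterion described above (find the pair x, y whose difference best aligns with she - he) can be sketched with toy vectors. Everything below is invented for illustration and is not taken from the paper:

```python
import numpy as np
from itertools import permutations

# Invented toy vectors, not from any trained model.
vecs = {
    "she": np.array([0.0, 1.0]),
    "he":  np.array([0.0, -1.0]),
    "nurse":  np.array([0.5, 0.8]),
    "doctor": np.array([0.5, -0.8]),
    "lovely": np.array([0.9, 0.1]),
    "brilliant": np.array([0.85, -0.1]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

direction = vecs["she"] - vecs["he"]
words = [w for w in vecs if w not in ("she", "he")]

# Pick the ordered pair (x, y) whose difference x - y best aligns with
# she - he: the "maximally gendered" pair under this criterion.
best = max(permutations(words, 2),
           key=lambda p: cos(vecs[p[0]] - vecs[p[1]], direction))
print(best)  # ('nurse', 'doctor')
```

Note this search never forms a ranked list around lovely + he - she at all, which is the distinction being made between the two tasks.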

~~~
flo_hu
Ok, thanks a lot for bringing this up! I will have a closer look at that.

------
johnchristopher
[https://miro.medium.com/max/700/1*tFmq8FUI8SzxKKqo_kgmpQ.jpe...](https://miro.medium.com/max/700/1*tFmq8FUI8SzxKKqo_kgmpQ.jpeg)
I think the AI should spend more time in a gender equality classroom (and
probably its maker too). :)

~~~
roel_v
...

Did you even read the article? Its whole point is that sensationalist
conclusions like yours are wrong.

