For a more theoretical explanation:
Though, some statements about analogies should be taken with a grain of salt, see: Tal Linzen, Issues in evaluating semantic spaces using word analogies, https://arxiv.org/abs/1606.07736.
The most illegal thing is "heroin"
The most legal thing is "CEO"
The most good thing is "teacher"
The most evil thing is "lucifer"
Priests are about as legal and rich as criminals, same for nuns wrt. janitors
'Sad' is rich and legal, 'happy' is poor and illegal. Same delta with 'power' and 'money'.
Apparently being a secretary is more moral than being a priest.
Also, note the values on the x axis; only "spirit", "true" and "faith" are more good than evil within this dataset, and only slightly so. "Allah" is associated with being illegal?
Sentimental is Lawful Good
A Kitsch is Lawful Evil, and
Evokes/Evoking is Chaotic Evil
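For anyone wondering how a word ends up at a given spot on a "good vs. evil" or "legal vs. illegal" axis, here is a minimal sketch. It assumes each axis is the difference between two pole words and that a word's coordinate is its projection onto that axis. The 3-dimensional vectors and the helper names (toy, axis, project) are made up for illustration; real embeddings have hundreds of dimensions, and the original plots may have been produced differently.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
toy = {
    "good":    [1.0, 0.0, 0.2],
    "evil":    [-1.0, 0.0, 0.2],
    "legal":   [0.0, 1.0, 0.1],
    "illegal": [0.0, -1.0, 0.1],
    "teacher": [0.7, 0.5, 0.3],
    "heroin":  [-0.4, -0.9, 0.2],
}

def axis(pos, neg):
    """Unit vector pointing from the `neg` pole word to the `pos` pole word."""
    diff = [p - n for p, n in zip(toy[pos], toy[neg])]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(word, unit_axis):
    """Scalar coordinate of `word` along a unit axis (a plain dot product)."""
    return sum(w * a for w, a in zip(toy[word], unit_axis))

good_evil = axis("good", "evil")
legal_illegal = axis("legal", "illegal")

for word in ("teacher", "heroin"):
    x = project(word, good_evil)
    y = project(word, legal_illegal)
    print(f"{word}: good/evil={x:+.2f}, legal/illegal={y:+.2f}")
```

With these toy numbers "teacher" lands on the good and legal side and "heroin" on the evil and illegal side, but that is baked into the invented vectors, not evidence about any real model.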
My intuition for why things like king - man + woman work is that the points in the vector space model happen to form a well-behaved manifold with smooth changes in meaning. It's not very principled, but it does work.
I wrote a series of blog posts with a coworker about doing this with music:
Instead of mixing nouns and adjectives, we do things like mixing songs and artists and radio stations etc. In the second post we show how Nirvana - Kurt Cobain + Female Vocalist works remarkably well. I've studied empirically why this worked, and the best I could come up with is that the high dimensional space we created had a very dense set of points in the region of popular western music that led to a smooth manifold.
For differences - see this section: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html#....
I also mention that multiplying a word vector by a factor (for PMI compression) results in a word of similar meaning that is just more characteristic (bear in mind that for other models it can be related e.g. to word frequency or other properties).
But you are right that there are some problems with the linear structure. Some of them were brought to my attention by Omer Levy (a researcher in this subject). I think the article that summarises it best (or rather: shows empirically that it does not always work as intended) is:
- Tal Linzen, Issues in evaluating semantic spaces using word analogies, https://arxiv.org/abs/1606.07736
(Also, by "scalar product" I meant "scalar multiplication"--the product of a scalar and a vector, not the dot product. It's pretty clear how to make some sense out of the dot product, but it's pretty hard to make consistent sense out of scalar multiplication. Apologies for being unclear.)
"Essentially, all models are wrong, but some are useful." - George Box
I guess you know this, but for others: London-Paris makes sense, but has a different type to London and Paris (it's a vector, not a position) and while 2*London doesn't make sense, 2*(London-Paris) does make sense (it's a vector with the same direction but twice the length).
Such a system, with two distinct types -- positions and vectors, with vectors being the differences between positions -- is called an affine space. You can identify positions with vectors by picking a distinguished origin, but then you don't get the type-safety that forbids ridiculous expressions like 2*London.
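To make the type-safety point concrete, here is a small sketch of an affine space in Python. The class names Position and Vector and the rough coordinates for the two cities are my own illustrative choices, not anything from the article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vector:
    """A displacement: can be added to other vectors and scaled."""
    x: float
    y: float

    def __add__(self, other):
        if isinstance(other, Vector):
            return Vector(self.x + other.x, self.y + other.y)
        return NotImplemented

    def __mul__(self, k):
        if isinstance(k, (int, float)):
            return Vector(self.x * k, self.y * k)
        return NotImplemented

    __rmul__ = __mul__  # allows 2 * v as well as v * 2

@dataclass(frozen=True)
class Position:
    """A point: only supports position - position and position + vector."""
    x: float
    y: float

    def __sub__(self, other):
        if isinstance(other, Position):
            return Vector(self.x - other.x, self.y - other.y)
        return NotImplemented

    def __add__(self, other):
        if isinstance(other, Vector):
            return Position(self.x + other.x, self.y + other.y)
        return NotImplemented

# Rough, rounded coordinates, purely illustrative.
london = Position(0.0, 51.5)
paris = Position(2.0, 49.0)

step = london - paris        # a Vector
doubled = 2 * step           # fine: scaling a displacement
back = paris + step          # a Position again (equal to london)

try:
    2 * london               # Position defines no multiplication
except TypeError as err:
    print("rejected:", err)
```

Because Position deliberately defines no multiplication, 2 * london raises a TypeError, while 2 * (london - paris) is an ordinary Vector.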
What is a secondmeter? A voltvoltgram?
I'm glad it can be useful. But, I agree it seems to leave a lot lacking.
However, something like the dot-product here does make sense, since you can use it to determine similarities of vectors.
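A quick sketch of what that looks like, assuming cosine similarity (the dot product of length-normalised vectors) as the measure. The three vectors are invented toy values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy vectors, for illustration only.
london = [0.9, 0.1, 0.3]
paris = [0.8, 0.2, 0.3]
banana = [-0.1, 0.9, -0.4]

print(cosine(london, paris) > cosine(london, banana))

# Scaling a vector by a positive factor changes its length, not its direction,
# so the cosine is unaffected (up to floating-point rounding).
scaled_london = [2 * x for x in london]
print(abs(cosine(scaled_london, paris) - cosine(london, paris)) < 1e-12)
```

Note that cosine similarity ignores positive scaling entirely, which is one reason scalar multiplication of a word vector is harder to interpret than the dot product.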
The engine does have an emergent notion of the relationship, which is the whole point.
Similarly, add a woman to a person, and you just have two people. One being a woman.
I get that there is an answer that seems fun... But there is not a deep meaning to the math.
This is like the games where "send + more = money". Fun. But is there really something illuminating?
Similarly with woman + person, the concept of woman is femininity. It makes sense if you think in terms of concepts.
Yes, and some of those meanings can be closer to the common understanding of reality or the weight of each notion or its probability than others.
Which is also why we can solve riddles and don't get lost in their infinite similar possibilities.
It's not supposed to be english, it's a query language.
Consider, king + woman = queen + man. Which looks neat, but is not a universal truth. It could be concubine, for example.
So, is queen + man also concubine?
Again. I'm glad this works for some things. But really just shows which words are often used together. It does not show any good rationale for their meanings. Unlike math, where 1 + 1 equals 2. Possibly in different encodings. But not just from convention of often being used together.
Precisely. Models which use this space do not propose strong equality (==). Rather, they would output a series of probabilities, and choose the most likely. Stating king + woman = queen + man is somewhat disingenuous; what should be said (mathematically) is something like the following: the word lying closest to the vector vec('king') + vec('woman') - vec('man') is 'queen'.
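In code, that statement is roughly the following. This is a minimal sketch with hand-made 3-dimensional vectors and a hypothetical helper called nearest; a real model would have a large vocabulary and learned vectors. Excluding the query words from the candidates is a common convention, and it matters: without it, the nearest word is often one of the inputs.

```python
import math

# Hypothetical toy embeddings; dimensions loosely mean (royalty, male, female).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(positive, negative):
    """Rank vocabulary words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(emb.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, emb[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, emb[w])]
    # Skip the query words themselves.
    candidates = [w for w in emb if w not in positive and w not in negative]
    return sorted(((w, cosine(target, emb[w])) for w in candidates),
                  key=lambda pair: pair[1], reverse=True)

ranking = nearest(positive=["king", "woman"], negative=["man"])
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

The output is a ranking with scores, not an equality: "queen" comes first here only because it is the closest of the remaining candidates.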
To suggest that a NN can't learn something about the meaning of words from a large corpus of text is unsubstantiated, I believe. The statement above suggests they do, I would say. I would not be too surprised if a sufficiently complex NN could 'learn' the concept of gender with a corpus of English text to a decently high degree of accuracy, simply based on vestigial features left from French and Old English.
So king + woman = queen + man is better described as:
Masculine monarch feminine person = feminine monarch masculine person
A bit of reordering of adjectives and it is exactly the same. Even monarch is the wrong word, because you seem to be getting hung up on nouns, when these are all actually a bunch of chained adjectives. Perhaps "regality + nobility + rulery". English is a bad language to describe this, because we tend to noun and verb our adjectives regularly.
That's totally fine, because actual words don't define universal truths either.
Queen could be a band, a transvestite, an actual queen, and several other things besides.
It's still useful -- you can classify millions of pictures in a meaningful way much faster than before.
That is, x+y has meaning and use in most maths. Here, it seems primarily use.
That said, I definitely agree with you. An English speaker may find any of these reasonable:
1. King - man = expensive clothing
2. King - man = prince
3. King - man = queen
What is "king"? What is "-"? What is "man"?
If a king is a dressed up wealthy man, and you remove the man, you have wealthy clothes? Or, does removing the man mean degrading the king back into a boy? Or, does removing a man mean adding a woman? Wait -- what if a king is more than a dressed up wealthy man? Should we include his home? Do we need to subtract the home? How do you subtract a home? Is the king minus a man a prince if the king was a beggar when he was young? ... death of the universe ...
Like you said, there's a combinatoric explosion here. Maybe this example is akin to trying to model each and every trajectory of all 10^23 particles in a gas. It looks like these scientists are stepping back, and looking at the big picture, instead, trying to find something more akin to PV = NkT
The fundamental problem is that linearity is probably not that accurate in representing relationships between words, but I don't think this is the goal.
(king (genl headOfState)
(woman (genl person)
"The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning."
"As of 2015, I pity the fool who prefers Modus Ponens over Gradient Descent." - Tomasz Malisiewicz 
Superlong version: https://plato.stanford.edu/entries/logic-ai/
The author of the computer vision blog post doesn't seem to know much about symbolic AI. Some of the comments point this out.
I'm busy on other things at the moment, but I intend to develop a rule-based system some time soon. I can rule out a neural network straight away because there's no data available, the rules are explicit, well documented, and have to be followed, and the system has to justify its reasoning.
That is not to say I wouldn't consider using a neural network for a perception task.
The only way you can get symbols to work is with weights. Follow this to its logical conclusion and I think you'll end up with a system isomorphic with the vector approach, with dimensions representing something like symbols.
The fat cat is sitting on a mat.
And even within the world of the sentence, it only has a vague meaning. What exactly makes the cat fat? Is it neutered or lazy or overfed? Why is it sitting on a mat - is the mat outside a door, is it waiting to be let in? What kind of a cat is it - could it be a big, dangerous cat? The sentence is laden with signifiers and unknowns. Western children are taught to consider sentences like these in the abstract, but it's not a natural way of thinking, because it's not practical in a life lived connected to the world.
Abstract hypotheticals are the hallmark of more disconnected concerns, and we teach our children this early, in part using silly, deliberately vague sentences like these, and discouraging curious questions that might resolve the ambiguities.
An AI system that's designed to handle abstract sentences like these is not one designed to understand human language, because humans don't reason like this unless they're thinking analytically - and even then, they do so blinkered with biases and errors.
or someone who did not grow up using the English language.
Or, you know, you can use a cut-off point, like 18 or so.
Predicate calculus is an entirely parallel system that only has a representation in a strict subset of human language. A reverse mapping is hopeless and misguided.
I feel like this line of thought, that you can box up words with really concrete meanings that you can then reason about with logic systems is a kind of trap for people who've spent too much time in an analytic frame of the world. It's one where the smarter you are, the further down the road you can get without realising it's a dead end. At best, such a system could only augment.
It will totally be fine for 99% of applications. Marginal returns.
Peter Norvig, On Chomsky and the Two Cultures of Statistical Learning, http://norvig.com/chomsky.html
Also, we want something that is automatic. It means easily adjustable to other contexts (and languages), inferring information about neologisms (e.g. semantic meaning of emoji), etc.
If my recollection is correct, Peter Norvig changed his views around the time he joined Google. Google have a particular way of doing things which doesn't include symbolic processing.
It should be possible to infer using a symbolic approach, or more simply just provide a new definition.
I've met some really really brilliant people who've been banging their head against that particular wall since the 1980s. The MIT AI Lab crew, for example, poured untold brainpower into symbolic inference. There was the whole "expert systems" movement. This all failed miserably, in disgrace, because nobody could ever get it to work, and "AI" became a dirty word. Later, there was Cyc, which was hyped on and off throughout the 90s. http://www.cyc.com/ After that, there were people who tried to reason over RDF tuples, which didn't work either: http://www.shirky.com/writings/herecomeseverybody/semantic_s...
This idea pops up every 5 or 10 years and wastes a generation of brainpower. It never works. And let me be clear: There have been some terrifyingly brilliant people who were convinced that it ought to work, and who spent years of their life on it.
Meanwhile, any joker who can code up Bayes' theorem or singular value decomposition can get some results in a couple of weeks. Probability and statistics get results (as do more advanced numeric techniques). Logical deduction fails. I'm not even sure I could explain why. But I encourage you to think long and deeply on what Norvig has written on this subject. Or just buy Norvig's two AI textbooks (written before and after he discovered the joys of probability, basically), do some exercises, and compare the results you get.
Statistics isn't sufficient for fully understanding natural language (for that, a computer would have to go out and experience the world in the first person). But it is necessary.
The idea of breaking meaning into an inventory of discrete components like this is at least as old as Hjelmslev, I think even Saussure touches on it.
Part of it seems to me to be that you're breaking down words into words. Even if you write it in uppercase and call it a primitive, you don't have to spend too much time on cognitive linguistics to know that there's really nothing primitive about MAN or MONARCH...
See also: Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, http://karpathy.github.io/2015/05/21/rnn-effectiveness/, and try to replicate it with any formal semantics (good luck!).
Additionally, formal systems rarely account for actual language use, where some things are technically correct but sound weird, and others are incorrect yet prevalent (and a root of language evolution). See also: char2char translation (which accommodates e.g. neologisms or typos).
The original Altavista Babelfish (which used SYSTRAN) used rule-based machine translation. It has been replaced by Bing Translator, but my recollection of the Babelfish was that it was accurate enough to be usable, and better than Google Translate was when it was first released. Google Translate has improved a lot recently. My only problem with it is that it doesn't understand the meaning of the words it's translating.
Neural networks are perfectly suitable for perception, e.g. image recognition. No argument there.
I think that's a fallacy that will haunt AI forever (or, more likely, will be the definitive civil rights struggle ca. March 25, 2035 6:25:45am to March 25, 2035 6:25:48am)
We tend to move the goalpost whenever AI makes advances. Where many people would have considered chess a pretty good measure of at least some aspect of intelligence, it seems like mundane number crunching once you know how it works.
It may be that we really mean consciousness when we say "intelligence", although if we ever find an easy formulation that creates a perfect "illusion" of consciousness, it may end up having some strong effects on people's conception of themselves that I don't necessarily want to witness.
If there are symbols, they're in the dimensions of the vector; but words only probabilistically suggest meaning, they don't categorically denote it.
With a symbolic approach, the right thing is to use a different symbol for each meaning, and disambiguate based on context (which you could identify either statistically or by using rules).
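A rule-based version of that can be sketched in a few lines. Everything here is a made-up illustration: the SENSES inventory, the cue words, and the overlap-counting rule (a much simplified relative of dictionary-overlap methods) are assumptions, not anyone's actual system.

```python
# Hypothetical sense inventory: one symbol per meaning, each with cue words.
SENSES = {
    "queen": {
        "queen/monarch": {"king", "throne", "crown", "reign"},
        "queen/band": {"album", "concert", "guitar", "freddie"},
        "queen/chess": {"pawn", "bishop", "checkmate", "board"},
    }
}

def disambiguate(word, sentence):
    """Pick the sense symbol whose cue words overlap most with the sentence."""
    context = set(sentence.lower().split())
    senses = SENSES[word]
    # On a tie, max() keeps the first sense listed in the inventory.
    return max(senses, key=lambda s: len(senses[s] & context))

print(disambiguate("queen", "the queen played a concert with a new guitar"))
print(disambiguate("queen", "the queen took my bishop and then checkmate"))
```

The weak spot is the one raised elsewhere in this thread: someone has to write and maintain the inventory and the cue words by hand.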
In particular, words are not repositories of meaning. They allude to concepts; new concepts are created and get forgotten on a regular basis. The connection between words and concepts waxes and wanes over time, and even the very timeline of a connection's strength can be used for allusion: using language to represent concepts that can only be coherently mapped by using previously-stronger allusions conveys a sense of being old-fashioned, while the reverse conveys future-thinking. Using allusions that are stronger within a milieu conveys social signalling information about group membership. Etc.
There's no way a human-maintained database is going to capture the subtlety here on anything like a timely basis. There's no universal truth, everyone's map is a little bit different, and the map is changing all the time.
People who speak the same language are able to communicate perfectly well, almost all of the time, across continents and centuries. New words are rare compared to existing vocabulary and often soon disappear from use. New concepts can readily be mapped onto existing vocabulary. People are able to learn other languages and improve their own with the help of dictionaries and grammar books. Things like humour, cryptic crosswords, and social signalling are edge cases. And things like deception are unrelated to language understanding at the semantic level.
We're discussing the best way for computers to understand natural language and communicate with people, and in practice that's either going to use unambiguous language or it's going to need human help or verification.
People who speak the same language are still prone to misunderstandings.
We may be overestimating our success in communication. Heck, I'm not even sure if my understanding of soon equals yours. Is it a century, a few decades, a few years? And why is it so different from the meaning when I use it to answer when lunch will be ready?
The former doesn't exist in human languages (we'd otherwise have gotten rid of the lawyers long ago), and the latter is infeasible. There is another way.
Wouldn't that mean it was an actual full AI?
There is work being done to merge the formal/structural semantics and distributional semantics. I have been working on that for over a year at my startup (not necessarily having successes, mind).
The benefit of using distributional semantics (word embeddings, etc) is GPU processing and libraries. My logical form parsing library uses dynamic programming and chart parsing with a LOT of tree pruning to parse a simple sentence, while distributed semantics merely require me to multiply 2 matrices/vectors together - something GPUs excel at.
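To illustrate the "just multiply" part: with unit-length rows, scoring a query against the whole vocabulary is a single matrix-vector product. The sketch below uses plain Python lists and invented 2-dimensional vectors so it runs anywhere; on a GPU the same operation would be handed to a numeric library instead.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

# Hypothetical toy vocabulary and embedding matrix (one row per word).
vocab = ["cat", "dog", "car"]
E = [normalize(row) for row in [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]]]

query = normalize([0.9, 0.05])   # made-up vector for some query word
scores = matvec(E, query)        # one similarity score per vocabulary word
best = vocab[max(range(len(vocab)), key=scores.__getitem__)]
print(best, [round(s, 3) for s in scores])
```

There is no parse tree, no pruning, and no branching in the hot loop, which is exactly the shape of work GPUs are good at.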
I'm quite happy with a hybrid symbolic/statistical approach, which could be useful for disambiguating words and phrases in context.
I think the trick is to avoid generating the parse tree in the first place.
Today's hardware is pessimized for symbol and list processing. They can't be done on GPUs, and CPUs work better on contiguous data.
King – man + woman = Princess?
King – man + woman = villain?
(because of characters like Cruella, Maleficent, Ursula, etc.)
That is one of the main enablers for truly general intelligence because it's based on this common set of inputs over time, i.e. senses. The domain is sense and motor output and this is a truly general domain.
It's also a domain that is connected to the way the concepts map to the real physical world.
So when the advanced agent NN systems are put through their paces in virtual 3d worlds by training on simple words, phrases, commands, etc. involving 'real-world' demonstrations of the concepts then we will see some next-level understanding.
You have people who focus on grammar and spelling. But word embeddings collect their insights by taking any sequence of 5 words, taking out the middle word, and jumbling up the result (technically, they express it in a way that ignores order: 1 bit per word, where 1 means the word is in the sentence and 0 means it's not, so the sequence of the words in the input to the network is completely independent of their sequence in the sentence). And they understand that "king is to man as queen is to woman" and lots of other things.
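Here is a sketch of that input encoding, following the description above rather than any particular implementation (real word2vec-style training differs in the details, e.g. how context vectors are combined). The function name cbow_examples and the sample sentence are my own.

```python
def cbow_examples(tokens, vocab, half_window=2):
    """Build (multi-hot context, centre word) pairs from sliding 5-word windows."""
    index = {w: i for i, w in enumerate(vocab)}
    examples = []
    for i in range(half_window, len(tokens) - half_window):
        # Two words on each side of position i, with the middle word removed.
        context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
        multi_hot = [0] * len(vocab)
        for w in context:
            multi_hot[index[w]] = 1   # 1 = word is present; order is discarded
        examples.append((multi_hot, tokens[i]))
    return examples

tokens = "the king spoke to the queen".split()
vocab = sorted(set(tokens))           # ['king', 'queen', 'spoke', 'the', 'to']
examples = cbow_examples(tokens, vocab)

for bits, centre in examples:
    print(bits, "->", centre)
```

The first window ("the king _ to the") becomes [1, 0, 0, 1, 1] with target "spoke"; shuffling the four context words would produce exactly the same input.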
When going deeper you quickly start to realize a few things: in 90% of sentences the sequence of words does not matter. No, not even if "not" appears before or after the verb (and thus refers to the subject or object of the sentence). Word sequence. Doesn't matter. Which noun you place an adjective next to. It is semantically important. Really important. Every English (or any language, I imagine) teacher will hammer the point home again and again. And yet ... it almost never matters, in the sense that getting it wrong will not cause something dumber than a human to misinterpret the resulting sentence. So why do we care? Social reasons (i.e. to fuck other humans, or more generally, to get them to do stuff for us).
It's a weird thing that keeps coming back in machine learning. Humans think their reasoning is high level. Yet algorithms that keep track of maybe 2 or 3 variables per individual can predict the actions of crowds with uncanny accuracy. Tens to hundreds of thousands of people, each believing they're individuals and thinking about what they're doing, not only take the same decision, but with an enormous probability will come to that decision within minutes of each other.
I am reminded of a quote by Churchill. Humans appear smart individually, but it's a trick, an impression, it's a facade, almost an illusion. If they act in a group, said intelligence is utterly gone, and they almost always act in dumb ways in large groups, even when they are acting alone. Intelligence is 95% a parlor trick used in conversation, to make friends, or to mate, like a peacock's feathers, and only 5% or less something we actually use to act. So its purpose, from a species' perspective, is not at all to act intelligent, merely to appear intelligent to others. Second is that everyone, even if they are smart and correctly reason about the world around them, will still act stupid. Without someone to impress, you could have a triple Nobel prize, you won't act it. So intelligence doesn't work in an individual, and it doesn't work in most groups. It only works in groups where the interaction of the group has people impressing each other with what they did, with some sort of reward being given for that.
"attacks mouse cat"
"attacks cat mouse"
"cat attacks mouse"
"cat mouse attacks"
"mouse cat attacks"
Only "cat mouse attacks" could mean something that's even slightly different.
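From the point of view of an order-ignoring model, all of these are literally the same input. A tiny sketch, using a word-count bag as the stand-in for such a model's view of a sentence:

```python
from collections import Counter
from itertools import permutations

words = ["cat", "attacks", "mouse"]

# Every possible ordering of the three words.
orderings = [" ".join(p) for p in permutations(words)]

# The bag-of-words view: word counts only, no positions.
bags = {frozenset(Counter(o.split()).items()) for o in orderings}

print(len(orderings), "orderings,", len(bags), "distinct bag")
```

So whatever distinguishes "cat attacks mouse" from "mouse attacks cat" is invisible to a pure bag-of-words representation; a model of that kind can only lean on which reading is more plausible.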