
Wittgenstein’s theories are the basis of all modern NLP - ghosthamlet
https://towardsdatascience.com/neural-networks-and-philosophy-of-language-31c34c0796da
======
idoubtit
In what sense are these theories a "basis" to NLP? Did they have any
influence? Do they bring any practical contributions? I suspect a slight
similarity between popular domains (Wittgenstein and NLP) was contrived into
an article that seems very light on the W part.

The "Wittgenstein’s theories" invoked here amount to just "the meaning of a
word is its use in the language". If such a plain concept were all of
Wittgenstein’s theories, he would be long forgotten.

For centuries, dictionaries have presented words through one or several
explanations as well as quotes and examples. 150 years ago, Émile Littré wrote
a wonderful French dictionary that contains 80,000 words and about 300,000
literary quotes. He knew no word has a simple and permanent meaning, and that
one needs to know many real world contexts to get a fine view on a word.

~~~
yesenadam
Yeah, it's a weird kind of philosophy clickbait or something. An only slightly
less valid alternate subheading might be "Football is the basis of all modern
NLP"[1]

 _For Wittgenstein_ \- in philosophy, one always refers to the _early_ or
_later_ Wittgenstein; they're totally different, and saying 'for Wittgenstein'
without specifying which one doesn't make a lot of sense. The early (
_Tractatus_ ) one had a picture theory of language.[0]

[0]
[https://en.wikipedia.org/wiki/Picture_theory_of_language](https://en.wikipedia.org/wiki/Picture_theory_of_language)

[1] "One day when Wittgenstein was passing a field where a football game was
in progress the thought first struck him that in language we play _games_ with
_words_. A central idea of his philosophy, the notion of a ‘language-game’,
apparently had its genesis in this incident." \- Norman Malcolm, _Ludwig
Wittgenstein: A Memoir_

I can't resist a more entertaining extract from that book:

"My wife once gave him some Swiss cheese and rye bread for lunch, which he
greatly liked. Thereafter he would more or less insist on eating bread and
cheese at all meals, largely ignoring the various dishes that my wife
prepared. Wittgenstein declared that it did not much matter to him _what_ he
ate, so long as it was always the _same_. When a dish that looked especially
appetizing was brought to the table, I sometimes exclaimed 'Hot Ziggety!' \- a
slang phrase that I learned as a boy in Kansas. Wittgenstein picked up this
expression from me. It was inconceivably droll to hear him exclaim 'Hot
Ziggety!' when my wife put the bread and cheese before him. ...

One of Wittgenstein's favourite phrases was the exclamation, 'Leave the
_bloody_ thing _alone_!' He delivered this with a most emphatic intonation and
mock solemnity of expression. It had roughly the signification that the thing
in question was adequate and one should not try to improve it. He used it on a
variety of occasions: one time meaning that the location of his bed was
satisfactory and it should not be moved; another time, that the mending that
my wife had done on a jacket of his was sufficient and that she should not try
to make it better."

~~~
mbrock
That sandwich anecdote reminded me to post a link to this essay about
_Tractatus_ as a demonstration of the "autistic cognitive model _par
excellence_."

[http://autisticsymphony.com/wittgenstein.html](http://autisticsymphony.com/wittgenstein.html)

"Ludwig Wittgenstein was almost certainly autistic. Several notable
psychiatrists, such as Christopher Gillberg in A Guide to Asperger Syndrome,
have written extensively about the evidence backing this assertion."

~~~
yesenadam
Thanks! That was very interesting. I don't remember reading anywhere that LW
had Asperger's (which is what that writer means by 'autistic' I guess), but it
seems to fit perfectly. I've just tried to get to know him without labelling.
I was close friends with people like Nietzsche and Kierkegaard, but never felt
friends with LW hehe. I didn't get far into the _Tractatus_ , probably just a
few lines. But I've enjoyed everything else he wrote, always thought-
provoking. A fascinating character, with a great sense of humour--he had a
friend with whom he used to exchange picture postcards with very silly
irrelevant things written on them. Wish I could remember what book I saw those
in. Monk, maybe. Also, he was born extremely rich, but believed money corrupts
people, so gave it all away to his siblings. When asked why them, since they
were already rich, he said 'They're already corrupted'...

(Hmm, it just struck me, maybe that explains some of Thoreau too? He seemed
incapable of friendship in the normal sense, said no to everything, preferred
solitude, seemed utterly unlike other people, his 'duties to himself'
overruled all others, etc.)

~~~
YellowSuB
Someone actually published a case report on it in European Child & Adolescent
Psychiatry.

You can read it here
[https://www.researchgate.net/publication/12521251_Did_Ludwig...](https://www.researchgate.net/publication/12521251_Did_Ludwig_Wittgenstein_have_Asperger's_syndrome)

------
mlucy
It's really difficult to overstate how important embeddings are going to be
for ML.

Word embeddings have already transformed NLP. Most people I know, when they
sit down to work on an NLP task, first use an off-the-shelf library to turn
the text into a sequence of embedded tokens. They don't even think about it;
it's just the natural first step, because it makes everything so much easier.

In the last couple years, embeddings for other data types (images, whole
sentences, audio, etc.) have started to enter mainstream practice too. You can
get near-state-of-the-art image classification with a pretrained image
embedding, a few thousand examples, and a logistic regression trained on your
laptop CPU. It's astonishing.

(Note: I work on [https://www.basilica.ai](https://www.basilica.ai) , an
embeddings-as-a-service company, so I'm definitely a little bit biased.)
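The embedding-plus-logistic-regression recipe above can be sketched end to end. This is a toy: the 2-d points below are hand-faked stand-ins for real pretrained image embeddings (which would have hundreds of dimensions), and the cluster centers and learning rate are arbitrary.

```python
import math
import random

random.seed(0)

# Toy stand-ins for pretrained image embeddings: two 2-d clusters, one per
# class. In practice these vectors would come from a pretrained model.
data = ([((random.gauss(1.0, 0.3), random.gauss(1.0, 0.3)), 1) for _ in range(50)]
        + [((random.gauss(-1.0, 0.3), random.gauss(-1.0, 0.3)), 0) for _ in range(50)])

# Plain logistic regression trained by gradient descent on the log loss.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(100):
    for (x1, x2), y in data:
        p = 1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        g = p - y                      # gradient of log loss w.r.t. the logit
        w[0] -= lr * g * x1
        w[1] -= lr * g * x2
        b -= lr * g

acc = sum(((w[0] * x1 + w[1] * x2 + b > 0) == (y == 1))
          for (x1, x2), y in data) / len(data)
print(f"training accuracy: {acc:.2f}")
```

Because the embeddings do the heavy lifting, the classifier on top can be this simple and still separate the classes.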

~~~
madhadron
What I find particularly neat are the non-Euclidean embeddings, such as
hyperbolic spaces to generate hierarchies.
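For a feel of why hyperbolic space suits hierarchies, here is the Poincaré-ball distance in plain Python; the sample points are arbitrary, chosen only to contrast the center with the boundary of the ball.

```python
import math

def poincare_distance(u, v):
    """Distance in the Poincaré ball model of hyperbolic space:
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    return math.acosh(1 + 2 * diff2 / ((1 - nu2) * (1 - nv2)))

# The same Euclidean gap (0.1) costs little near the origin...
near = poincare_distance((0.0, 0.0), (0.1, 0.0))
# ...but is a long way near the boundary, which is what gives trees room to
# spread out: hierarchies embed with low distortion.
far = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(near, far)
```

Roots of a hierarchy sit near the origin, leaves near the boundary, where there is exponentially more "room".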

~~~
minkzilla
Do you know of any good resources for learning about such things, for someone
with only a cursory knowledge of Word2Vec?

~~~
Radim
"Implementing Poincaré Embeddings" (hyperbolic embeddings implemented in
Gensim):

[https://rare-technologies.com/implementing-poincare-embeddings/](https://rare-technologies.com/implementing-poincare-embeddings/)

------
akozak
Figuring out how to process context is important for NLP, no question.

But I think this is probably wrong on Wittgenstein. I'm pretty sure his entire
point in the Philosophical Investigations was that "meaning" is exactly NOT
probabilities of symbol co-occurrence, or just names of objects in the world.
Symbols acquire meanings from their use by humans. Accounting for context in
NLP via probabilities of occurrence might be useful in better reproducing
language, but we should be careful not to say that this is the essence of
meaning and language.

~~~
whatshisface
> _Accounting for context in NLP via probabilities of occurrence might be
> useful in better reproducing language, but we should be careful not to say
> that this is the essence of meaning and language._

Yes, and the article actually includes evidence in favor of this and against
its own conclusion. It mentions that vector "cat" is closer to vector "dog"
than vector "dog" is to vector "dogs," which makes sense if you interpret it
as a measure of appearance in sentences but no sense at all if you force it
into the mold of "the meaning of words."
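The parent's point can be made concrete with toy co-occurrence vectors. The counts below are invented, chosen to mimic the situation described: "cat" and "dog" share contexts, while "dogs" turns up in different sentences (plural agreement and so on).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Made-up counts over three context features.
vec = {"cat": [9, 1, 0], "dog": [8, 2, 0], "dogs": [1, 0, 9]}

print(cosine(vec["cat"], vec["dog"]))    # high: shared contexts
print(cosine(vec["dog"], vec["dogs"]))   # low, despite near-identical meaning
```

The geometry tracks *contexts of use*, so "dog" and "dogs" can land far apart even though no one would say they differ much in meaning.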

~~~
akozak
Yes - but for me it was this paragraph:

> And it’s now quite clear where the Wittgenstein’s theories jump in: context
> is crucial to learn the embeddings as it’s crucial in his theories to attach
> meaning. In the same way as two words have similar meanings they will have
> similar representations (small distance in the N-dimensional space) just
> because they often appear in similar contexts. So “cat” and “dog” will end
> up having close vectors because they often appear in the same contexts: it’s
> useful for the model to use for them similar embeddings because it’s the
> most convenient thing it can do to have better performances in predicting
> the two words given their contexts.

I am actually fine with saying that this approach is useful and convenient,
and that we can fairly call measuring probabilities of co-occurrence measuring
"context" in some sense.

But "context" for Wittgenstein in his account of meaning was clearly _not_
word or symbol occurrences. It was a much broader view of the way that
language fits in with human intentions and behavior and the wide variety of
uses for a word. I hate to quote Wikipedia, but from the PI article:
"Wittgenstein argues that definitions emerge from what he termed "forms of
life", roughly the culture and society in which they are used."
[https://en.wikipedia.org/wiki/Philosophical_Investigations#M...](https://en.wikipedia.org/wiki/Philosophical_Investigations#Meaning_and_definition)

~~~
kolbe
I'm in total agreement. It's especially confusing that the author would go
through the bother of invoking Wittgenstein when it seems like he meant the
exact opposite.

------
jeromebaek
The author has seriously misunderstood Wittgenstein's contributions to
philosophy of language.

>And it’s now quite clear where the Wittgenstein’s theories jump in: context
is crucial to learn the embeddings as it’s crucial in his theories to attach
meaning.

Yes, Wittgenstein said context is important for meaning, but that is hardly
his unique or even most important contribution to philosophy of language.
Wittgenstein's real contribution is in showing that meaning cannot be pinned
down like butterflies under glass -- that meaning spontaneously arises in each
playthrough of a language-game, and that any effort to find a "canonical",
"authoritative" definition is grasping at an illusion.

But word embeddings try to do almost exactly what Wittgenstein says is an
illusion -- trying to pin down a canonical n-dimensional vector for each word.
To correspond with Wittgenstein's theory, there cannot exist any fixed mapping
from a word to a vector. Perhaps each vector could be dynamically changing in
an in-principle uncomputable way. But to get there we are going to need a lot
more advances than the current state of the art in NLP.

~~~
visarga
> Perhaps each vector can be dynamically changing in a by principle
> uncomputable way.

The BERT language model does dynamic (contextual) embeddings and is state of
the art in NLP.

[https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)

~~~
jeromebaek
I don't think we are using the same definition of the word "dynamic" here.

------
mlthoughts2018
One interesting concept I read in Wittgenstein was the idea of decomposing a
word into its constituent parts. I’ll use the term broom for it because that
was the classic example and also the motivation for David Foster Wallace’s
novel “Broom of the System.”

So you take “broom” and you could decompose it into “handle” and “bristles”.
But then you could decompose it more, by recursively decomposing “handle” into
“grains of wood” and “bristles” into “pieces of fiber” (or whatever).

You keep doing this ad infinitum, I guess on down to the summation of a bunch
of quarks or whatever.

The question of interest to Wittgenstein was where does this process bottom
out. What would it mean, either physically or semantically, to have a word
identifying a concept that could not be broken down into further constituent
parts.

Wittgenstein was interested in this for the philosophy of language. But I got
interested in it by thinking about the decomposition as a mathematical
operator,

D(“broom”) = {“handle”, “bristles”}

and then asking what it could mean if this operator D had an “eigenvector”
with an “eigenvalue” of 1, so that Dx = x for some non-decomposeable word x.

In some ways, you can see how it could relate to things like word2vec and
embedding representations if you could represent a decomposition operator, and
define a hierarchical relationship of words as an ordering of how to more or
less specifically decompose a word’s representation.
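The operator view can be sketched directly. The dictionary below is entirely made up; a word x with D(x) = {x} plays the role of the "eigenvector with eigenvalue 1" where decomposition bottoms out.

```python
# A toy version of the decomposition operator D from the comment.
D = {
    "broom":    {"handle", "bristles"},
    "handle":   {"wood"},
    "bristles": {"fiber"},
    "wood":     {"wood"},    # non-decomposable: D(x) = {x}
    "fiber":    {"fiber"},   # non-decomposable: D(x) = {x}
}

def decompose_fully(word):
    """Apply D repeatedly until only fixed points of D remain."""
    pending, atoms = {word}, set()
    while pending:
        part = pending.pop()
        if D[part] == {part}:
            atoms.add(part)
        else:
            pending |= D[part] - atoms
    return atoms

print(decompose_fully("broom"))   # {'wood', 'fiber'}
```

Whether any *real* vocabulary has such fixed points is exactly the question the comment raises; here they exist only because the toy dictionary declares them.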

~~~
NotAnEconomist
I've always wondered if 'exists' is something like that -- and hence why it
can't be a property. (Well, if you believe that Kant guy.)

You can sort of think of all objects -- broom, bristles, quarks, etc -- as
being codata that decomposes to some version of "existence existing", an
interference pattern of some fundamental object self-interacting.

------
atrudeau
These older word embedding models (word2vec, GloVe, LexVec, fastText) are
being superseded by contextual embeddings (
[https://allennlp.org/elmo](https://allennlp.org/elmo) ) and fine-tuned
language models (
[https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
). These contextual models can infer that "bank" in "I spent two
hours at the bank trying to get a loan" is very different from "The ocean bank
is where most fish species proliferate."
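The mechanism can be illustrated with a toy that is nothing like real ELMo/BERT weights: the 2-d vectors below are hand-set, but a single attention-style averaging step already gives "bank" a different vector in each sentence.

```python
import math

# Hand-set 2-d static embeddings: first axis ~ "finance", second ~ "water".
EMB = {
    "loan":  (1.0, 0.0),
    "money": (0.9, 0.1),
    "ocean": (0.0, 1.0),
    "fish":  (0.1, 0.9),
    "bank":  (0.5, 0.5),   # ambiguous in isolation
}

def contextualize(word, sentence):
    """One attention-style step: softmax over dot-product scores, then a
    weighted average of the sentence words' vectors."""
    q = EMB[word]
    scores = [sum(a * b for a, b in zip(q, EMB[w])) for w in sentence]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return tuple(sum(e / z * EMB[w][i] for e, w in zip(exps, sentence))
                 for i in range(2))

finance_bank = contextualize("bank", ["money", "loan", "bank"])
river_bank = contextualize("bank", ["ocean", "fish", "bank"])
print(finance_bank)   # leans toward the "finance" axis
print(river_bank)     # same word, leans toward the "water" axis
```

Real contextual models stack many such layers with learned projections, but the core move is the same: the token's vector is a function of its neighbors.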

------
nostrademons
It's interesting how different this is from 10 years ago, when Chomsky's
theories were the basis of all modern NLP, or even 5 years ago, when most NLP
used a hybrid of formal grammars + embeddings. I remember attending a tech-
talk on part-of-speech tagging in 2011; the state-of-the-art then was a
probabilistic shift-reduce parser where the decision to shift vs. reduce at
each node was done by a machine-learned classifier.
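For those who haven't seen one, a transition-based parser of that style looks roughly like this. The `decide` function is a hand-written stand-in for the learned shift/reduce classifier, and its tiny "grammar rules" are made up for the single example sentence.

```python
def decide(stack, buffer):
    """Stand-in for the machine-learned classifier: the real systems scored
    SHIFT vs. the arc actions from features of the stack and buffer."""
    if len(stack) < 2:
        return "SHIFT"
    if stack[-2] == "the":    # toy rule: determiners attach rightward
        return "LEFT_ARC"
    if not buffer:
        return "LEFT_ARC"     # toy rule: attach the subject to the verb
    return "SHIFT"

def parse(tokens):
    stack, buffer, arcs = [], list(tokens), []
    while buffer or len(stack) > 1:
        if decide(stack, buffer) == "SHIFT":
            stack.append(buffer.pop(0))
        else:                 # LEFT_ARC: stack[-1] becomes head of stack[-2]
            arcs.append((stack[-1], stack.pop(-2)))
    return arcs

print(parse(["the", "cat", "sleeps"]))
# [('cat', 'the'), ('sleeps', 'cat')] -- (head, dependent) pairs
```

Swap the hand-written rules for a classifier over stack/buffer features and you have the shape of the 2011-era systems the parent describes.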

~~~
ppod
Wittgenstein emphasized meaning as context and usage before Chomsky, but the
actual method was first properly investigated by structural linguists such as
JR Firth and Zelig Harris, who was Chomsky's supervisor. Good articles here:

[https://en.wikipedia.org/wiki/Distributional_semantics](https://en.wikipedia.org/wiki/Distributional_semantics)

[https://aurelieherbelot.net/research/distributional-semantics-intro/](https://aurelieherbelot.net/research/distributional-semantics-intro/)

------
lettergram
For those interested, I recently wrote a guide on using neural networks for
NLP[1].

I wrote the guide with the explicit goal of helping people understand NLP
(sentence classification) without needing to understand the math.

I cover word embeddings:

[https://austingwalters.com/word-embedding-and-data-splitting/](https://austingwalters.com/word-embedding-and-data-splitting/)

As well as FastText:

[https://austingwalters.com/fasttext-for-sentence-classification/](https://austingwalters.com/fasttext-for-sentence-classification/)

Hope someone finds it useful.

[1] [https://github.com/lettergram/sentence-classification](https://github.com/lettergram/sentence-classification)

------
kolbe
I am really struggling to find where Wittgenstein fits into any of this at
all.

>And it’s now quite clear where the Wittgenstein’s theories jump in: context
is crucial to learn the embeddings as it’s crucial in his theories to attach
meaning.

That's not at all clear to me. The crucial part of W's tome is that two
sentient beings are knowingly engaging in a game where they have 'agreed' on
meanings. My guess from reading Philosophical Investigations is that W would
think NLP possible only in formal settings like law, where all players of the
game know the rules quite well, and the program could be trained as if it were
a player in that game.

~~~
sp332
I think the point is that the only way to learn what a word means is to see
how it is used. Trying to define a word from some kind of first principles,
dictionary-style, is not going to be very effective. The best way for a
computer to learn what words mean is to analyze a lot of real-world data.

I would love for a computer to be able to ask questions, or at least surface
marginal cases for more training, but that seems to be a very uncommon feature
at least in these toy examples.
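The "meaning from use" recipe above is, at its simplest, just counting contexts. A minimal sketch, with an invented three-sentence corpus:

```python
from collections import Counter

corpus = [
    "the dog chased the cat",
    "the dog bit the mailman",
    "the cat chased the mouse",
]

window = 2                      # how many neighbours count as "context"
cooc = Counter()
for sentence in corpus:
    toks = sentence.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                cooc[(w, toks[j])] += 1

# Each word's row of counts is a crude distributional representation:
# words used in similar contexts end up with similar rows.
print(cooc[("cat", "chased")], cooc[("mailman", "chased")])
```

Methods like word2vec are, loosely, compressed and smoothed versions of exactly this table.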

~~~
akozak
The issue isn't that this approach won't be useful in building systems we can
interact with linguistically. The problem is in describing the system as
having _learned a meaning_.

It might seem pedantic or like something only philosophers of language would
care about. But it gets to the core of how we should talk and think about the
nature of AI as NLP gets more and more sophisticated.

~~~
sp332
Well it may not be very satisfying, but Wittgenstein's point is that there
isn't anything more to understand about the meaning of words than the ability
to _use_ words effectively.
[http://existentialcomics.com/comic/268](http://existentialcomics.com/comic/268)

------
andybak
I really wish NLP didn't have two common meanings.

~~~
xyproto
Are you thinking about no-light perception and nonlinear programming?

~~~
YellowSuB
or Neuro-Linguistic Programming

~~~
xyproto
Or National Library of the Philippines

------
libertas
I would think that the Tractatus would be more useful to an AI. But
Wittgenstein's remarkable ability to shift the paradigm and overextend into a
meta level of analysis seems similar to the way AlphaZero and Leela play
chess. The tools W uses to understand perception have a more probabilistic and
irrational nature than the tools he uses in his previous work. As if he
realized that human communication cannot be considered a closed and finite
system; hence I cannot see how his ideas are implemented in these
applications, yet.

~~~
sswaner
Yes, and considering the Tractatus as a framework for a closed and finite set
of linguistic rules, such as a domain-specific language, has great
applicability.

For example, I flipped open my copy (yes I keep a copy on my desk) and opened
to 4.122: rules to indicate internal and external relations between objects.
Almost reads like a system requirements document.

~~~
akozak
A conception of language he famously threw out in the Philosophical
Investigations!

~~~
sswaner
Yes, because humans don't operate with a limited, context-explicit vocabulary.
But your point doesn't destroy the value of the Tractatus for other purposes.

------
southerndrift
>As human beings speaking English it is quite trivial to understand that a
“dog” is an “animal” and that is more similar to a “cat” than to a “dolphin”
but this task is far from easy to be solved in a systematic way.

Are they? A dog can be trained like a dolphin, unlike a cat. In the context of
training, dogs are more similar to dolphins.

~~~
AaronFriel
Yes, I suspect you had to cherry-pick a dimension in which dog and dolphin are
closer than dog and cat. Conventional wisdom holds that dogs are closer to
cats than to dolphins, but that, too, is modeled by word vector embeddings.

In the metric space, the distance between dog and cat might be lower than dog
and dolphin in many dimensions, but higher in this specific one. A general
distance function will have to take all of the dimensions into account, not
just those cherry-picked. So the conventional wisdom _and_ your personal
belief are both accounted for, and in the context of training "dog" and
"dolphin" might be more similar.

I still suspect that's not actually true, and I'd be really surprised if a
survey of users found dog and dolphin to be closer than dog and cat in _any_
dimension.

~~~
sswaner
What about the dimension of people who train animals to assist humans in
complex tasks? Not familiar with too many drug-sniffing cats...

~~~
darkpuma
> Not familiar with too many drug-sniffing cats...

Get them some catnip.

------
perfmode
Can someone ELI5 the term "embedding"?

~~~
mlucy
A word embedding transforms a word into a series of numbers, with the property
that similar words (e.g. "dog" and "canine") produce similar numbers.

You can have embeddings for other things, such as pictures, where you would
want the property that e.g. two pictures of dogs produce more similar numbers
than a picture of a dog and a picture of a cat.

~~~
perfmode
Ah. Sounds like a vector space. How does one select a basis?

~~~
leereeves
It is indeed a vector space. You don't really choose a basis; an ML tool like
word2vec [1] does. And like most advanced applications of ML, exactly how it
works is a mystery.

1:
[https://en.wikipedia.org/wiki/Word2vec](https://en.wikipedia.org/wiki/Word2vec)

> The reasons for successful word embedding learning in the word2vec framework
> are poorly understood. Goldberg and Levy point out that the word2vec
> objective function causes words that occur in similar contexts to have
> similar embeddings (as measured by cosine similarity) and note that this is
> in line with J. R. Firth's distributional hypothesis. However, they note
> that this explanation is "very hand-wavy" and argue that a more formal
> explanation would be preferable.

------
KasianFranks
Inaccurate. This is absurd. Epigraphy is the basis of all modern NLP/NLU. Add
computational epigraphy, neuroscience, linguistics, and cognition. Ref:
Word2Vec is based on an approach from Lawrence Berkeley National Lab.

"Google silently did something revolutionary on Thursday. It open sourced a
tool called word2vec, prepackaged deep-learning software designed to
understand the relationships between words with no human guidance. Just input
a textual data set and let underlying predictive models get to work learning."

“This is a really, really, really big deal,” said Jeremy Howard, president and
chief scientist of data-science competition platform Kaggle. “… It’s going to
enable whole new classes of products that have never existed before.”
[https://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/](https://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/)

Spotify seems to be using it now:
[http://www.slideshare.net/AndySloane/machine-learning-spotify-madison-big-data-meetup](http://www.slideshare.net/AndySloane/machine-learning-spotify-madison-big-data-meetup) pg 34

But here's the interesting part:

Lawrence Berkeley National Lab had been working on an approach more detailed
than word2vec (in terms of how the vectors are structured) since 2005, as can
be seen at the bottom of their patent:
[http://www.google.com/patents/US7987191](http://www.google.com/patents/US7987191)
The Berkeley Lab method also seems much more exhaustive, using a
Fibonacci-based distance decay for proximity between words, such that vectors
contain up to thousands of scored and ranked feature attributes beyond the
bag-of-words approach. They also use filters to control the context of the
output. It was also made part of search/knowledge-discovery tech that won a
2008 R&D 100 award
[http://newscenter.lbl.gov/news-releases/2008/07/09/berkeley-lab-wins-four-2008-rd-100-awards/](http://newscenter.lbl.gov/news-releases/2008/07/09/berkeley-lab-wins-four-2008-rd-100-awards/)
& [http://www2.lbl.gov/Science-Articles/Archive/sabl/2005/March/06-genopharm.html](http://www2.lbl.gov/Science-Articles/Archive/sabl/2005/March/06-genopharm.html)
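To illustrate what a "Fibonacci-based distance decay" could look like, here is a guess at the general shape of the idea; this is an illustration only, not the patented Berkeley Lab method.

```python
# Weight a context word's contribution by the inverse of the Fibonacci
# number of its distance from the target word, so nearby words count for
# much more than distant ones.
def fib(n):
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def decay(distance):
    return 1.0 / fib(distance)

toks = "the quick brown fox jumps over the lazy dog".split()
target = toks.index("fox")
weights = {w: decay(abs(i - target)) for i, w in enumerate(toks) if i != target}
print(weights)
```

Any decreasing function of distance would do; the Fibonacci variant just falls off faster than linear but slower than exponential.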

A search company that competed with Google called "seeqpod" was spun out of
Berkeley Lab using the tech but was then sued for billions by Steve Jobs
[https://medium.com/startup-study-group/steve-jobs-made-warner-music-sue-my-startup-9a81c5a21d68#.jw76fu1vo](https://medium.com/startup-study-group/steve-jobs-made-warner-music-sue-my-startup-9a81c5a21d68#.jw76fu1vo)
and a few media companies
[http://goo.gl/dzwpFq](http://goo.gl/dzwpFq)

We might combine these approaches, as there seems to be something fairly
important happening here in this area. Recommendations and sentiment analysis
seem to be driving the bottom lines of companies today, including Amazon,
Google, Netflix, Apple et al.

~~~
f00_
Really we are building on the shoulders of giants (calculus, linear algebra,
statistics), but it seems like the modern use of recurrent neural networks
crystallized in the 80s with the publication of Parallel Distributed
Processing by David Rumelhart, James L. McClelland, and the PDP Research Group
(which included Geoffrey Hinton), which discussed backpropagation and
recurrent neural networks, even providing a handbook with code samples.

Jeffrey Elman (with others) wrote a successor to the PDP books called
Rethinking Innateness: A Connectionist Perspective on Development (1997).

His paper Finding Structure in Time (1990) adapted backpropagation to take
time into account, backpropagation through time (BPTT):

[https://crl.ucsd.edu/~elman/Papers/fsit.pdf](https://crl.ucsd.edu/~elman/Papers/fsit.pdf)

[https://en.wikipedia.org/wiki/Jeffrey_Elman](https://en.wikipedia.org/wiki/Jeffrey_Elman)

>Elman's work was highly significant to our understanding of how languages are
acquired and also, once acquired, how sentences are comprehended. Sentences in
natural languages are composed of sequences of words that are organized in
phrases and hierarchical structures. The Elman network provides an important
hypothesis for how neural networks - and, by analogy, the human brain - might
be doing the learning and processing of such structures.

[https://web.stanford.edu/group/pdplab/pdphandbook/handbookch...](https://web.stanford.edu/group/pdplab/pdphandbook/handbookch8.html)

>Here we briefly discuss three of the findings from Elman (1990).

>The concept ‘word’ is actually a complicated one, presenting considerable
difficulty to anyone who feels they must decide what is a word and what is
not. Consider these examples: ‘linedrive’, ‘flagpole’, ‘carport’, ‘gonna’,
‘wanna’, ‘hafta’, ‘isn’t’ and ‘didn’t’ (often pronounced “dint”). How many
words are involved in each case? If more than one word, where are the word
boundaries? Life might be easier if we did not have to decide where the
boundaries between words actually lie. Yet, we have intuitions that there are
points in the stream of speech sounds that correspond to places where
something ends and something else begins. One such place might be between
‘fifteen’ and ‘men’ in a sentence like ‘Fifteen men sat down at a long table’,
although there is unlikely to be a clear boundary between these words in
running speech.

> Elman’s approach to these issues, as previously mentioned, was to break
> utterances down into a sequence of elements, and present them to an SRN. In
> his letter-in-word simulation, he actually used a stream of sentences
> generated from a vocabulary of 15 words. The words were converted into a
> stream of elements corresponding to the letters that spelled each of the
> words, with no spaces. Thus, the network was trained on an unbroken stream
> of letters. After the network had looped repeatedly through a stream of
> about 5,000 elements, he tested its predictions for the first 50 or so
> elements of the training sequence.
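The SRN's defining trick, feeding the previous hidden state back in alongside the input, fits in a few lines. The weights below are fixed toy numbers rather than anything trained by BPTT; the point is only that the hidden state depends on sequence order.

```python
import math

def srn_step(x, h, W_in, W_rec):
    """One step of an Elman simple recurrent network: the new hidden state
    depends on the current input *and* the previous hidden state."""
    return [math.tanh(sum(W_in[i][j] * x[j] for j in range(len(x)))
                      + sum(W_rec[i][k] * h[k] for k in range(len(h))))
            for i in range(len(h))]

W_in = [[0.5, -0.3], [0.2, 0.8]]    # arbitrary fixed input weights
W_rec = [[0.1, 0.4], [-0.2, 0.3]]   # arbitrary fixed recurrent weights

A, B = [1.0, 0.0], [0.0, 1.0]       # two one-hot "elements"

h = [0.0, 0.0]
for x in (A, B):
    h = srn_step(x, h, W_in, W_rec)
h_after_ab = h

h = [0.0, 0.0]
for x in (B, A):                    # same elements, opposite order
    h = srn_step(x, h, W_in, W_rec)
h_after_ba = h

# The hidden state encodes the sequence, not just the last element:
print(h_after_ab != h_after_ba)     # True
```

Training such a network to predict the next element is what lets word and letter boundaries emerge from an unbroken stream, as in the Elman simulations quoted above.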

Schmidhuber (with Hochreiter) developed the LSTM, LeCun developed the CNN, the
ideas were refined and processing capabilities developed, and Hinton revived
these connectionist ideas, leading up to ImageNet in 2012.

~~~
KasianFranks
Let's also not forget the Computational Theory of the Mind.

