
Why Deep Learning Cannot Be Applied to Natural Languages Easily - jonbaer
https://www.linkedin.com/pulse/google-hyping-why-deep-learning-cannot-applied-easily-berkan-ph-d
======
ma2rten
No offense, but this person has no idea what they are talking about. Google
published a paper that explains exactly what they are doing and provides
metrics as well [1,2].

It's true that neural networks cannot be easily applied to natural languages,
but there are less obvious ways of applying them (namely embeddings, LSTMs,
attention), and provided you have enough data and computational resources,
they give so much better results than any other method that combining them
with anything else no longer helps.

[1] [https://arxiv.org/abs/1609.08144](https://arxiv.org/abs/1609.08144)

[2] [https://arxiv.org/abs/1611.04558](https://arxiv.org/abs/1611.04558)
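A minimal sketch of the first two of those ideas (toy sizes, randomly initialized weights, numpy only — nothing like Google's actual system): discrete word IDs are first mapped into a continuous vector space by an embedding table, then an LSTM cell consumes the vectors one step at a time and produces a fixed-size continuous summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table: 10-word vocabulary, 4-dimensional continuous vectors.
vocab_size, emb_dim, hidden = 10, 4, 3
embedding = rng.standard_normal((vocab_size, emb_dim))

# Weights for one LSTM cell (standard gate equations, random initialization).
W = rng.standard_normal((4 * hidden, emb_dim + hidden))
b = np.zeros(4 * hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One LSTM step: input, forget, output gates and candidate update."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Run a "sentence" of discrete word IDs through embedding lookup + LSTM.
sentence = [3, 1, 7]
h, c = np.zeros(hidden), np.zeros(hidden)
for word_id in sentence:
    h, c = lstm_step(embedding[word_id], h, c)

# h is now a fixed-size continuous representation of a discrete sequence.
print(h.shape)
```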

EDIT: I also don't want to give the impression that Deep Learning can solve
every NLP problem. We are still far away from passing the Turing test. It's
true as well in my opinion that Google's Machine Translation is oversold. It's
best at what it's trained and evaluated for: translating individual sentences
from news sources.

EDIT 2: There are some tasks where traditional methods can work better, e.g.
text classification on long documents. That's mainly because deep learning
methods are too expensive computationally.

~~~
stared
Seconding it with:
[https://arxiv.org/abs/1610.03017](https://arxiv.org/abs/1610.03017) Fully
Character-Level Neural Machine Translation without Explicit Segmentation.

In general, RNNs are especially fit for discrete sequences. And their
continuous representation is actually an advantage (so they can see that two
words are similar or analogous). BTW: see my draft on word2vec:
[http://p.migdal.pl/2016/12/30/why-do-word2vec-analogies-work.html](http://p.migdal.pl/2016/12/30/why-do-word2vec-analogies-work.html)
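The word2vec analogy effect mentioned above can be illustrated with hand-built 2-D vectors (real word2vec embeddings are learned and typically 100-300 dimensional; these toy axes are purely illustrative): vector arithmetic on continuous representations recovers relations between discrete words.

```python
import numpy as np

# Hand-built toy "embeddings": axis 0 ~ gender, axis 1 ~ royalty.
vecs = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([ 0.0, -1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy: king - man + woman ~ ?
query = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # queen
```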

~~~
lucidrains
This paper still consistently blows my mind whenever it is brought up. The
future is clearly headed toward complete end-to-end learning systems, maybe
even with learning-to-learn methods.
[https://youtu.be/x1kf4Zojtb0?t=1h4m54s](https://youtu.be/x1kf4Zojtb0?t=1h4m54s)

------
babakd
The argument the author uses is invalid. NNs are equally strong at modelling
sparse signals, provided that they can be mapped into a continuous space,
which is commonly referred to as an 'embedding'.

The premise of the article is valid though, in that NLP is a hard problem. The
reason is partly because NLP is ill-defined; how do you define language
understanding?

NNs are very effective at learning mappings of y=f(X), given enough examples.
One of the reasons that they're so effective at modelling speech, vision,
translation, etc., is that such mappings exist in high volumes. Because of the
above-mentioned ambiguity of NLP, it's harder to come up with such pairs for
'understanding' a language. How do you come up with a dataset of sentences and
their 'meaning'? Probably the best you could do is to map them to some action.
And critics will readily disregard such attempts as 'not really NLP'.

~~~
letitgo12345
I think he's arguing that the current NN approach for NLP is not going to lead
to embeddings that are going to make revolutionary progress in NLP.

And there have been attempts to ascribe semantics to natural language from
text (e.g. CCG grammars). The datasets are not as big as for vision, though.
But I'm not convinced that we need such explicit datasets to be able to solve
this problem.

~~~
MrQuincle
I would not be sure that mapping of discrete linguistic objects to a
continuous space is necessary. Why can't we handle the original space?

There are just a lot of things that have to be figured out still.

\+ Different time scales. There is semantics on a sentence level, while there
is also semantics on a plot level. It's convenient to know key elements from
the start of a story if you want to understand the plot. LSTMs are a perfect
starting point.

\+ When to stop learning. The so-called stability-plasticity dilemma. Our
ability to pay attention to what matters might be tightly linked to our
capability to forget vast bodies of texts that we just read. Current NNs do
not seem to forget correctly. This was the rationale behind ART and ARTMAP
(Grossberg) and might enter AI mainstream again soon.

\+ Grammar constructions. Some aspects of grammar seem simpler than computer
vision, where we also have a lot of structure in the environment: things can
be inside other things, balanced on top of other things, temporarily occluded
by other things, etc. Other aspects seem more complicated, like the
pleasantness of a poem. My gut feeling is that some of
this gets spilled over from (a) structure in other modalities and (b)
idiosyncrasies from our generative system (vocal cords, etc.). In other words,
our grammatical preferences might be sampled not only from listening and
reading.

\+ Emphasis.

Just a few things that might lead to interesting NNs. Contrary to the author I
think they are definitely in line with current research.

------
vonnik
He makes a good point about continuous functions. But actually neural nets are
quite good at handling discrete elements like words, likes, atoms in a
molecule, etc.

Deep learning is behind enormous advances in NLP[0], just as it has set
accuracy records in almost every field of machine perception, from vision to
audio.

Neural word embeddings like those produced by word2vec[1] make for very useful
feature vectors when fed into other neural nets.

The headline of this post should be that NLP is harder than, say, image
processing. In fact, for non-specialists, none of it is easy, because tuning
hyperparameters is hard.

The kind of NLP that tries to reproduce human-level sentences and
understanding is simply a more complex problem, given the plasticity of
language.

[0] [https://arxiv.org/abs/1611.04558](https://arxiv.org/abs/1611.04558) [1]
[https://deeplearning4j.org/word2vec](https://deeplearning4j.org/word2vec)

~~~
fnl
Good point. I think the author quite clearly meant continuous with respect to
a whole sentence or even text, not single (sparse) words, though. Take German,
for example: a sentence that ends in a verb has its whole structure defined by
that last word, as in "Wir haben heute etwas über Neuronale Netzwerke
gelernt." ("We learned something about neural networks today."). This is in
part also reflected in particularly problematic languages not being
"projective". It gets even worse with long-range semantic cross-references,
like anaphora. And the latest Google Translate is still notoriously bad at
German, at least [1]. Therefore, I think the article has a valid point, and
I'd really like to hear some actual points where he is wrong before dismissing
it (though, yes, I agree it is rather shallow, and at best a good thought
starter).

[1] Google Translate on the German Wikipedia entry for Weihnachten (X-mas):
[https://translate.googleusercontent.com/translate_c?depth=1&...](https://translate.googleusercontent.com/translate_c?depth=1&nv=1&rurl=translate.google.com&tl=en&u=https://de.wikipedia.org/wiki/Weihnachten)

~~~
nl
Why do you say German is _notoriously bad_? I don't speak German, but that
linked page reads fine to me, and their neural translation system has state-
of-the-art performance for both English->German and German->English at
least[1]. I would have thought that the well-structured nature of German makes
it reasonably easy to translate, assuming you have a NN architecture with
sufficient memory range.

[1] [https://arxiv.org/abs/1611.04558](https://arxiv.org/abs/1611.04558)

~~~
fnl
The translation isn't "only nonsense", and in many cases it is even
grammatically correct, but it is mostly semantically wrong, or rather weird.
In any case, it is a far cry from a human translation. So I wouldn't agree
that it's "quite good". And this is the easy case: EN->DE works even worse.

------
humbledrone
The article doesn't mention "embedding" even once. How can an argument about
discontinuities in language space leave out things like word2vec, which are
designed precisely to make things continuous?

~~~
empath75
He sort of does, and then says that the information loss from that process
will make it ineffective for NLP.

I don't think anyone would deny that 'true' translation requires some kind of
general intelligence that understands what is being translated, but it seems
to be the case that a 'dumb' translation works well enough for a great many
use cases regardless.

He's really just making the Chinese room argument. We have a computer
shuffling symbols around according to some rule set, that doesn't know what
they mean. I don't think it really matters, though, if it produces a
reasonably accurate translation.

~~~
deong
In a way, he's making an even stronger argument than Searle did in his Chinese
Room. Even Searle would probably admit that it's possible in principle for the
computer in the chinese room argument to fool people into thinking it's
intelligent. Searle just objects to the idea that it could ever really _be_
intelligent in the way a human is. To Searle, the human _obviously_ isn't
just running some algorithms on the input text to produce output.

I think the counter to Searle's argument isn't really that it doesn't matter
as long as the result is close enough. The counter to that is that we don't
understand how human intelligence works either. Searle is simply assuming that
it's "magic" (or less condescendingly, some sort of metaphysical process) that
can't be simulated by an algorithmic machine. I think it's far more likely that
intelligence is physical and we just don't understand the machinery than it is
that it's mystical and cannot in principle ever be understood.

For this article, all that is seemingly unnecessary. He's just saying they
won't work well enough to even fake it convincingly. Which is very nearly
falsifiable just by running today's algorithms.

~~~
westoncb
Searle's Chinese Room is addressing something different from intelligence;
it's concerned with what happens /within/ the intelligent mind, whereas inputs
and outputs are the only things that matter here. To be more specific, in the
question of whether deep learning can be used to generate and/or comprehend
natural language, we are not concerned with whether the algorithm is conscious
of what it's doing as long as the results are good.

~~~
visarga
In order to be conscious it has to be more than a reactive or feedforward
system. It has to loop back on itself, like RNNs, and hold internal state.

------
LukeB42
"Text, a sequence of words, is not a byproduct of a statistical process, it is
a byproduct of a cognitive process."

That cognitive inference process is what we've formalised as probability
theory.

Whenever you do /anything/ your brain may be selecting from a probability
distribution over things that can be done immediately.

As for text as continuous data, just chuck it into GloVe, word2vec, LexVec, or
fastText. Given enough training you could model the velocity of concepts as
they're being introduced to the dataset/model.

Also, on the whole, shallowish learning can be applied to natural languages
pretty easily. Keras includes a memory-network example (an LSTM with an
autoencoder) that averages around 98% accuracy on the bAbI 10k Q/A task.

~~~
aligajani
_Whenever you do /anything/ your brain may be selecting from a probability
distribution over things that can be done immediately._

There is no concrete evidence that the brain does math. If the brain did
select things from a probability distribution, then why isn't everyone a math
genius?

No one really knows how it works.

~~~
circuithunter
There's actually quite a bit of evidence suggesting that brains, both
behaviorally and mechanistically, are Bayesian [0].

As for your second point, assuming that humans are Bayesian, there are many
reasons why people would have variability in their mathematical ability,
including different priors and differences in the ability to estimate
posteriors.

[0]
[https://scholar.google.com/scholar?q=brain+bayesian&hl=en&bt...](https://scholar.google.com/scholar?q=brain+bayesian&hl=en&btnG=Search&as_sdt=1%2C22&as_sdtp=on)
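The "Bayesian brain" claim above just means that a prior belief is updated into a posterior by evidence via Bayes' rule. A tiny sketch with made-up numbers (the word pair and the probabilities are purely illustrative):

```python
# Hypothesis H: an ambiguous sound was the word "bat" rather than "pat".
prior = 0.5                 # P(H): no preference before hearing the signal
p_signal_given_bat = 0.8    # likelihood of the acoustic signal under "bat"
p_signal_given_pat = 0.2    # likelihood of the acoustic signal under "pat"

# Bayes' rule: P(H | signal) = P(signal | H) P(H) / P(signal)
evidence = prior * p_signal_given_bat + (1 - prior) * p_signal_given_pat
posterior = prior * p_signal_given_bat / evidence
print(posterior)  # 0.8 — the evidence shifted belief strongly toward "bat"
```

Differing priors and differing skill at estimating likelihoods would then produce exactly the kind of variability in inference that the parent comment describes.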

------
scotchmi_st
> Another way to look at this problem is by data size. Let's assume we are
> training a NN using a data set of 10,000 pages text. The total number of
> pages of all possible knowledge in the world (ever) is an incalculable
> number, but let's assume it is a sextillion: 1,000,000,000,000,000,000,000.
> Then the question is, how much of the sextillion pages of data can be
> handled by training a NN using 10,000 pages? The answer would obviously be
> too small compared to the whole.

> On the other hand, grammar rules and ontological semantics mastered by the
> human brain can handle the entire sextillion (since those pages were written
> by human). If you know how to read and write, the entire sextillion will be
> understandable to you. This is the horrifying truth between the capabilities
> of the human brain versus the current state of neural networks.

This bit especially doesn't make any sense. I'm a human who has been reading
all my life, does that mean I understand every grammar rule or ontological
semantic ever created? Of course not! I barely understand all of them in my
own language! My 'neural network' (brain) would need a bit more 'training'
(studying) before that could happen. Even more so if all I had read in my life
were 10,000 pages (which may well be true).

------
lucky1988
I do not think that deep learning can be applied to learning languages. It is
based on an algorithm. Language and speech are too complex to be learned in
such a way. I am studying a few foreign languages and cannot imagine how that
can be possible. For example, machine translation cannot beat a human one. I
have tried to translate a few things with the help of online machine
translators, but still had to contact
[https://www.translateshark.com/spanish.html](https://www.translateshark.com/spanish.html)
to make a proper translation.

------
21
Just a datapoint regarding the supposed hyping of the recent Google Translate
NN switch.

Last year I tried translating some everyday text from Turkish to English.
Complete garbage; you could barely understand what the text was even about.

I tried it now, albeit with different texts, and there's a world of
difference. Now you can actually understand what the text is saying, even if,
compared to other languages, I would still classify the Turkish->English
translation as awful. Another big difference is that the resulting English
text has relatively good grammar, as opposed to the previous version, which
was a broken English word soup.

------
GCA10
I'd be so much happier if paragraph six didn't call for "putting every symbol
in allocated sluts."

~~~
Rotten194
I was wondering if that was a piece of jargon I didn't recognize or a very
unfortunate typo...

------
tshadley
> Neural networks (NNs), recently referred to as deep learning, only work
> "effectively" with data that is produced from a process of a continuous
> function.

I think the author is overlooking the fact that images are not continuous
functions but Deep-Learning image-recognition systems have been very
successful representing discontinuities in images as hierarchical visual
abstractions. In the same way, Deep Learning with recurrent neural networks
should be able to learn discontinuities in symbol streams as hierarchical
language abstractions, given enough data.

------
stevehiehn
I don't have a deep understanding, but isn't this where the word2vec
algorithm comes into play? Isn't the idea to make one pass dedicated to
finding relationships between words first?

------
jayajay
Attention (see Google papers) seems like a pretty promising way of dealing
with countable states. In quantum mechanics, we change the question we are
asking. Instead of "which countable eigenstate are we in?", we ask "how much
of our current state is _in_ the Nth eigenstate?". We just expand our current
state over the basis of interest. The coefficient of the state of interest is
a continuous variable. This is what attention does for neural networks.
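That expansion-over-a-basis analogy can be sketched in a few lines of numpy (toy random values, standard dot-product attention — not any particular Google model): the softmax turns similarity scores against discrete candidate states into continuous coefficients that sum to one.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
query = rng.standard_normal(4)        # the current state
states = rng.standard_normal((3, 4))  # three discrete candidate "basis" states

# "How much of our current state is in the Nth candidate?" — one continuous
# coefficient per discrete candidate, obtained from dot-product similarity.
weights = softmax(states @ query)

# The attended output is a continuous mixture of the discrete candidates.
context = weights @ states
print(weights.sum())  # 1.0
```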

