
Are There Deep Reasons Underlying the Pathologies of Deep Learning Algorithms? [pdf] - Jach
http://goertzel.org/DeepLearning_v1.pdf
======
Teodolfo
The author clearly doesn't understand the Szegedy et al. result and isn't
really saying anything interesting. Pretty much ALL machine learning image
classifiers suffer from the pathologies Szegedy et al. describe, so calling
them "deep learning pathologies" is absurdly misleading. Logistic regression
has them too, as do almost all algorithms that learn to classify images.
Human visual systems can also be confused by sensory input that (to us) looks
like some object; we are just much better at integrating evidence from other
sources to suppress the mistaken neurons.

~~~
adorable
I guess you are referring to
[http://cs.nyu.edu/~zaremba/docs/understanding.pdf](http://cs.nyu.edu/~zaremba/docs/understanding.pdf)

For those interested, the papers and discussion are part of this weekly
collection of AI-related news and resources:
[https://aiweekly.curated.co](https://aiweekly.curated.co)

~~~
Houshalter
He's also referring to this follow-up paper, which attempts to explain the
issue and shows that it exists in other models too:
[http://arxiv.org/abs/1412.6572](http://arxiv.org/abs/1412.6572)
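
For intuition, here's a minimal numpy sketch of the fast gradient sign method
from that paper, applied to plain logistic regression (the weights and input
are random stand-ins, not a trained model):

    import numpy as np

    np.random.seed(0)

    # Toy logistic regression: P(y=1|x) = sigmoid(w.x + b).
    d = 1000                      # high-dimensional input, e.g. flattened pixels
    w = np.random.randn(d)        # stand-in for trained weights
    b = 0.0
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = np.random.randn(d)        # an input the model scores confidently
    p_clean = sigmoid(w @ x + b)

    # Fast gradient sign method: nudge every coordinate by eps in the
    # direction that increases the loss. For a linear score, the score
    # shifts by eps * ||w||_1, which grows with dimension d even though
    # no single pixel moves by more than eps.
    eps = 0.1
    step = eps * np.sign(w)
    x_adv = x - step if p_clean > 0.5 else x + step
    p_adv = sigmoid(w @ x_adv + b)

    print("clean: %.3f  adversarial: %.3f" % (p_clean, p_adv))
    # A tiny per-pixel change flips a confident prediction.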

------
nl
I don't think this paper is very useful.

The author is from the OpenCog group, who have spent years building a
structured knowledge base as a precursor to an artificial general
intelligence. It isn't clear how AGI emerges from this work.

The OpenCog KB is a useful piece of work, but it's interesting to note that
word/phrase embedding models (word2vec etc) can give similar or better results
on most practical tasks that you'd use OpenCog for.

My view is that Deep Learning techniques are insufficient for an AGI, but are
good candidates for component parts in the same way that the human optical
system does significant preprocessing before hitting the "intelligent" brain.

Also, things like memory networks specifically address some of the episodic
memory issues the author raised.

~~~
Animats
_" I don't think this paper is very useful."_

That's probably correct. The author discusses a known problem; those
mislabeled images are well known and have been discussed on HN before. It's
clear that feature extraction in deep learning sometimes fastens on irrelevant
features that somehow work. Some algorithms produce models where data points
sit too close to at least one decision boundary in a high-dimensional space,
which makes them brittle in the face of small, noise-like changes. I don't
know enough about the subject to know how that will be fixed, but there are
people working on the problem, and they don't seem to be stuck.
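
To make the "too close to a decision boundary" point concrete, here's a
back-of-the-envelope numpy sketch (my own illustration, not from the paper):
for a linear boundary w.x + b = 0, random noise of per-pixel size eps shifts
the score by roughly eps * ||w||_2, while a worst-case perturbation of the
same size shifts it by eps * ||w||_1, about sqrt(d) times more:

    import numpy as np

    np.random.seed(1)
    d = 10000                          # dimensionality of the input space
    w = np.random.randn(d)             # stand-in for a learned linear boundary
    x = np.random.randn(d)
    score = w @ x

    eps = 0.01
    # Random noise of per-coordinate size eps barely moves the score...
    noise = eps * np.sign(np.random.randn(d))
    random_shift = abs(w @ noise)

    # ...but the worst-case perturbation of the same size moves it by
    # eps * ||w||_1, easily enough to cross the boundary.
    worst = -eps * np.sign(w) * np.sign(score)
    worst_shift = abs(w @ worst)

    print("random shift: %.1f  worst-case shift: %.1f" % (random_shift, worst_shift))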

Then the author, who is from OpenCog, goes on to claim, without supporting
evidence, that OpenCog can somehow fix the problem. The paper proposes an
"internal image grammar" but doesn't say much about what that means or how to
do it. Trying to decompose images into some symbolic representation has a
long, disappointing history. The computational neural network people are
getting results without doing that.

I think we're reading a grant proposal here.

~~~
nl
Regarding the internal image grammar:

I know some people working on using everyday knowledge to attempt better
image labeling. The idea is that a tree is much more likely to appear in a
park than in a kitchen, so you can bias the probable interpretations
accordingly.

I guess that's what they could be talking about. But you need a CNN to get
the basic partial labels in the first place.
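
A toy sketch of that kind of biasing (labels, probabilities, and priors all
invented here): take the CNN's label distribution, multiply by a
scene-conditioned prior, and renormalize:

    import numpy as np

    labels = ["tree", "fridge", "dog"]

    # Hypothetical CNN output for an ambiguous green blob.
    cnn_probs = np.array([0.40, 0.35, 0.25])

    # Made-up co-occurrence priors: P(label | scene).
    priors = {
        "park":    np.array([0.70, 0.05, 0.25]),
        "kitchen": np.array([0.05, 0.80, 0.15]),
    }

    def rerank(scene):
        posterior = cnn_probs * priors[scene]   # Bayes-style reweighting
        return posterior / posterior.sum()

    for scene in priors:
        print(scene, dict(zip(labels, rerank(scene).round(2))))
    # In a park the blob reads as "tree"; in a kitchen, "fridge" wins.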

------
bra-ket
Semantic hashing by similarity is one way to create more sensible
representations (Re: Proposition 2):
[http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pdf](http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pdf)

This would also fit well into memory networks:
[http://arxiv.org/abs/1410.3916](http://arxiv.org/abs/1410.3916), SDM:
[http://en.wikipedia.org/wiki/Sparse_distributed_memory](http://en.wikipedia.org/wiki/Sparse_distributed_memory)
or global workspace model:
[http://en.wikipedia.org/wiki/Global_Workspace_Theory](http://en.wikipedia.org/wiki/Global_Workspace_Theory)
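
A minimal sketch of the semantic hashing idea: threshold a learned code into
a binary address, then retrieve neighbors by Hamming distance (random codes
stand in here for an autoencoder's bottleneck):

    import numpy as np

    np.random.seed(0)

    # Stand-ins for the real-valued bottleneck codes an autoencoder
    # would assign to a corpus of documents or images.
    n_items, code_bits = 10000, 32
    codes = np.random.randn(n_items, code_bits)

    # Semantic hashing: threshold each code to a binary address, so
    # semantically similar items land at nearby addresses.
    addresses = (codes > 0).astype(np.uint8)

    def hamming_neighbors(query_idx, radius=2):
        # Items whose binary address differs in at most `radius` bits.
        dists = (addresses ^ addresses[query_idx]).sum(axis=1)
        return np.where(dists <= radius)[0]

    print(len(hamming_neighbors(0)), "items within 2 bits of item 0")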

------
1971genocide
"So one could view them as just being mathematical pathologies found by
computer science geeks with too much time on their hands."

How is this a real paper?

~~~
fiatmoney
It's a valid point - if the failure cases aren't realistic, they're not
relevant. I would rather have an interesting paper with a colloquialism or two
thrown in than yet another tiny methods variation paper written in Proper
Academese. And by the looks of it this is basically a white paper or long,
footnoted blog post in PDF form, not a submission to Nature.

------
teh
I'm not sure I understand how image grammars (whatever those are, exactly)
suddenly pop up as a solution after such a long introduction. I could not
find any evidence or literature about them in relation to learning.

The author states a hypothesis but offers no way to test it. I'm not sure
what point he's trying to make.

~~~
bra-ket
Well, if you take "sparse autoencoders"
([http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf](http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf)),
they basically give you something like the principal components of the data:
from blobs of raw pixels you get a "dictionary" of lines in different
orientations and sizes, like an edge detector. This is similar to how the
famous Fourier transform works
([http://en.wikipedia.org/wiki/Fourier_transform](http://en.wikipedia.org/wiki/Fourier_transform)):
it decomposes a raw signal like speech into its base forms or "harmonics".
If you add another layer (to make a stacked autoencoder), it will extract
higher-level forms (e.g. basic shapes like triangles, ellipses, etc.), and so
on, until you get a dictionary of different forms, with each layer
compressing the signal into a more compact summary. At the highest layer you
can arrive at an abstract "chair" or "cat" representation built from all
these lower-level forms: shapes, lines, and dots.
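
A minimal numpy sketch of that first layer (an L1 penalty stands in for the
KL sparsity term used in the Stanford notes, and random patches stand in for
real image data):

    import numpy as np

    np.random.seed(0)
    n, d, h = 500, 64, 25            # samples, 8x8 patch pixels, hidden units
    X = np.random.randn(n, d)        # stand-in for whitened image patches

    # One-hidden-layer autoencoder with tied weights:
    #   hidden A = sigmoid(X W), reconstruction Xhat = A W^T
    W = 0.1 * np.random.randn(d, h)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    lr, sparsity = 0.01, 0.1
    for step in range(200):
        A = sigmoid(X @ W)
        Xhat = A @ W.T
        err = Xhat - X
        # Gradient of 0.5*||err||^2 + sparsity*||A||_1 w.r.t. W,
        # through both the encoder and decoder paths.
        dA = err @ W + sparsity * np.sign(A)
        dW = X.T @ (dA * A * (1 - A)) + err.T @ A
        W -= lr * dW / n

    # Each column of W is one dictionary element; on real patches these
    # come out as oriented edge detectors, e.g. W[:, 0].reshape(8, 8).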

Then once you've got this dictionary of image "words", the next thing is to
infer how these words interact with each other, i.e. build a grammar (this is
also called "grammar induction" in natural language processing:
[http://techtalks.tv/talks/deep-learning-of-recursive-structure-grammar-induction/58089/](http://techtalks.tv/talks/deep-learning-of-recursive-structure-grammar-induction/58089/)).

By learning a grammar you essentially define a concept like "chair" or "cat"
at a higher level of abstraction, by determining how these forms relate to
their world (i.e. to other forms): e.g. you can determine that "cat sits on a
chair" is a legal phrase in the grammar and "chair sits on a cat" is not.

So extracting a grammar (visual or linguistic) from training data is
equivalent to restricting the system to common-sense reasoning that operates
on concepts in terms of the "production rules" of the grammar:
[http://en.wikipedia.org/wiki/Production_(computer_science)](http://en.wikipedia.org/wiki/Production_\(computer_science\)),
and that is a basic goal of AI.
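
As a toy illustration of such production rules (the vocabulary and rules are
hand-written here, where a real system would induce them from data):

    # Tiny hand-written grammar over visual "words".
    ANIMATE = {"cat", "dog"}
    SITTABLE = {"chair", "sofa"}

    # (subject class, relation, object class) triples the grammar allows.
    RULES = {("animate", "sits_on", "sittable")}

    def classify(word):
        if word in ANIMATE:
            return "animate"
        if word in SITTABLE:
            return "sittable"
        return "unknown"

    def is_legal(subj, rel, obj):
        return (classify(subj), rel, classify(obj)) in RULES

    print(is_legal("cat", "sits_on", "chair"))   # True
    print(is_legal("chair", "sits_on", "cat"))   # False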

------
quonn
Looking at Figure 1, it seems that the classifications "baseball" and
"electric guitar" are not that silly (though the 99.6% certainty is). The
images shown could probably be generated from a guitar and a baseball by
warping and zooming.

------
akyu
So he is basically saying Jeff Hawkins was right all along. At least on SDRs.

