
One-shot Learning with Memory-Augmented Neural Networks - astdb
http://arxiv.org/abs/1605.06065
======
SubiculumCode
I study the neuroscience of episodic memory. Episodic memory requires one-shot
learning that is critically dependent on the hippocampus, which is posited to
use the operations of pattern separation (to reduce interference between
distinct memories with similar features) and pattern completion (to retrieve
the whole from partial input of features) to encode and retrieve bound
representations, respectively. Upstream of the hippocampus, the perirhinal
cortex performs operations that detect feature novelty. In contrast to these
functions stands the slow learning of the neocortex, which supports semantic
memory. Episodic and semantic memory functions are intertwined in complex
ways.
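
For the non-neuroscientists, here is a toy Python sketch of what I mean by
completion; the storage scheme and sizes are made up for illustration, it's
just nearest-neighbor recall over binary patterns:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "hippocampus": a store of binary feature patterns.
    memory = rng.integers(0, 2, size=(10, 64))  # 10 memories, 64 features

    def complete(cue, observed):
        # Pattern completion: compare the cue to each stored pattern on
        # the observed features only, and return the best whole pattern.
        matches = ((memory == cue) & observed).sum(axis=1)
        return memory[matches.argmax()]

    observed = np.zeros(64, dtype=bool)
    observed[:16] = True                  # we only see 16 of 64 features
    recalled = complete(memory[3], observed)
    print((recalled == memory[3]).all())  # typically True: whole from part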

It seems to me that deep learning is the cortex, and the binding function is
the hippocampus. Add some executive control functions to guide learning
strategically, and it will start to sound eerie.

~~~
rybosome
It seems like the binding function (i.e. the hippocampus) is the next major
challenge. I think this is related to what is referred to as "transfer
learning", where the training a model has received in one domain can be
partially applied to a new domain so that training time is reduced; in
particular, "retriev(ing) whole from partial input of features" reads to me as
exactly the missing link that would allow transfer learning to take place. If
memory-augmented neural networks put us down that path, then I'd wager that
executive control functions are less complex (at least to fake passably)...
we could be in for some interesting times.

------
bra-ket
If you're interested in this, you may want to take a look at SDM (sparse
distributed memory), a content-addressable memory that uses neural networks
as address encoders/decoders:
[https://en.wikipedia.org/wiki/Sparse_distributed_memory](https://en.wikipedia.org/wiki/Sparse_distributed_memory)

It was developed by Pentti Kanerva at NASA in the 1980s.
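
A minimal NumPy sketch of the core read/write rule (the sizes and activation
radius here are my own toy choices, not Kanerva's recommended parameters):

    import numpy as np

    rng = np.random.default_rng(1)
    N, D, R = 2000, 256, 111          # hard locations, word size, radius

    addresses = rng.integers(0, 2, size=(N, D))  # fixed random addresses
    counters = np.zeros((N, D), dtype=int)       # contents of each location

    def active(addr):
        # A location fires if its address is within Hamming distance R.
        return (addresses != addr).sum(axis=1) <= R

    def write(addr, word):
        counters[active(addr)] += 2 * word - 1   # store bits as +/-1 votes

    def read(addr):
        # Sum the votes of all active locations and threshold at zero.
        return (counters[active(addr)].sum(axis=0) > 0).astype(int)

    word = rng.integers(0, 2, size=D)
    write(word, word)                      # autoassociative: address == data
    noisy = word ^ (rng.random(D) < 0.08)  # flip roughly 8% of the bits
    print((read(noisy) == word).mean())    # typically 1.0: cue cleaned up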

For related memory-augmented models see:
[https://en.wikipedia.org/wiki/Deep_learning#Networks_with_se...](https://en.wikipedia.org/wiki/Deep_learning#Networks_with_separate_memory_structures)

There was also the "Reasoning, Attention, Memory (RAM)" NIPS workshop in 2015,
organized by Jason Weston, on this topic:
[http://www.thespermwhale.com/jaseweston/ram/](http://www.thespermwhale.com/jaseweston/ram/)

There is a disproportionate amount of work on training learning models while
ignoring memory mechanisms. For some reason this line of research is pursued
by very few machine learning/AI labs, mostly Google DeepMind/Brain, FB, and
Numenta, and maybe Tom Mitchell's 'never-ending learning' project at CMU.

~~~
ericjang
I think part of the reason why memory/attention mechanisms haven't quite
caught on yet is that they add a layer of complexity to reasoning about what
DNNs can and cannot do, which is already challenging for researchers. It takes
a while to prove that these are scalable enough for a "killer app".

Also, industry applications of deep learning are at least 1-2 years behind
academia, so I don't expect to see any use of differentiable memory mechanisms
in production anytime soon.

Anyway, the workshop reading list looks amazing. Thanks for this!

------
colllectorof
Grandiose claims, barely readable and buzzword-saturated language, a really
weird experiment setup, a lack of critical self-examination. These common
features of recent AI research papers have begun to annoy me, especially in
corporate research, where (supposedly) results should matter more than the
kind of linguistic tribalism seen in academia. Is it really that hard to
produce a human-readable description of the architecture and put it in one
place in the paper instead of spreading it all over? Also, it took me quite a
while to understand what the heck they were measuring (and how) in the first
place.

~~~
daveguy
> Is it really that hard to produce a human-readable description of the
> architecture and put it in one place in the paper instead of spreading it
> all over?

That would allow easier reproducibility, which is the opposite of the goal for
papers that come out of industry. This is the purpose of industry papers (in
order of importance):

1) Avoid being clear enough that competing companies could reproduce the
work.

2) Brag about the company's capabilities such that it increases interest from
potential customers.

3) Maintain just enough scientific rigor and clarity that it is still
publishable.

1 is generally the top priority. 2 is the goal. If 1 prevents 3, they just
call it a white paper and publish it only on arXiv or on their website. That
way you still get the exposure for 2 without compromising 1.

------
alexbeloi
Here's the original paper on Neural Turing Machines:
[https://arxiv.org/abs/1410.5401](https://arxiv.org/abs/1410.5401)

IIRC, that paper's results focus on the model's ability to extrapolate
predictions on sequential data to arbitrary lengths after being trained only
on sequences of a fixed length.
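
The core addressing trick is easy to show in a few lines of NumPy; this is
the content-based weighting from the NTM paper (cosine similarity sharpened
by a softmax), with toy sizes of my own choosing:

    import numpy as np

    def content_weights(memory, key, beta):
        # Cosine similarity between the controller's key and each memory
        # row, sharpened by beta and normalized into a soft address.
        m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
        k = key / np.linalg.norm(key)
        logits = beta * (m @ k)
        e = np.exp(logits - logits.max())
        return e / e.sum()

    memory = np.random.randn(128, 20)             # 128 slots, width 20
    key = memory[42] + 0.1 * np.random.randn(20)  # noisy query for slot 42
    w = content_weights(memory, key, beta=10.0)
    print(w.argmax())        # typically 42: the lookup is soft...
    read = w @ memory        # ...so the read stays differentiable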

------
bytefactory
This sounds like a big deal; can somebody with expertise in the field comment
on it?

~~~
sherjilozair
It has been a popular belief that although deep learning has been successful
in a lot of domains, its biggest shortcoming is high sample complexity, i.e.
even for simple problems you need tens of thousands of samples. Deep learning
was thus believed to be unsuitable for one-shot (or low-shot) learning, where
the model needs to learn from one or a handful of samples per class.

Brenden Lake et al. showed that a Bayesian approach outperforms deep-learning-
based methods in this Science article:
[http://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8.p...](http://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8.pdf)

DeepMind took up the challenge and delivered a deep-learning-based method that
uses no feature engineering and surpasses human performance in low-shot
learning, providing strong evidence against the widely held notion that deep
learning cannot work with small amounts of data.
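
The trick is in how the training episodes are set up: class-to-label
assignments are re-shuffled every episode, so the network can't store the
classes in its weights and is forced to bind them in memory within the
episode. Roughly like this (a sketch; the `dataset` interface and sizes are
my own illustration, not the paper's code):

    import random

    def make_episode(dataset, n_classes=5, shots=10):
        # dataset: dict mapping class name -> list of examples.
        classes = random.sample(sorted(dataset), n_classes)
        shuffled = random.sample(classes, n_classes)  # episode-local labels
        label_of = {c: i for i, c in enumerate(shuffled)}
        episode = [(x, label_of[c])
                   for c in classes
                   for x in random.sample(dataset[c], shots)]
        random.shuffle(episode)
        # At step t the model sees (x_t, y_{t-1}) and must predict y_t,
        # so a correct guess requires recalling earlier bindings.
        return episode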

~~~
emcq
Do you have a reference to the work from DeepMind?

~~~
teraflop
[http://arxiv.org/abs/1605.06065](http://arxiv.org/abs/1605.06065)

------
alistproducer2
Can anyone succinctly provide a concrete example of how this technique reduces
the amount of data needed to train a network?

~~~
colllectorof
It doesn't. The claim they're making is that instead of learning labels from a
bazillion similar examples, they use a bazillion diverse examples to train a
more generic classifier capable of "learning" correct labels from just a few
(e.g. 5) examples.

They compare it with human performance, claiming their algorithm is better,
but it's kind of a bullshit claim, since the humans weren't allowed to use
scratch paper or see previous examples. So instead of a pattern-matching test,
they turned it into a memory test. Typical of current AI research. Those
people will do _anything_ to be able to claim "better than human" performance.

~~~
eli_gottlieb
They also didn't do any comparisons with, or even mention, the hierarchical
Bayesian models that were used to attain human-level performance (by trying to
mimic exactly how the human mind does it).

------
Animats
This paper is either really important or total bullshit. Anyone know which?

