One-shot Learning with Memory-Augmented Neural Networks (arxiv.org)
103 points by astdb on May 20, 2016 | 19 comments

I study the neuroscience of episodic memory. Episodic memory requires one-shot learning that is critically dependent on the hippocampus, which is posited to use pattern separation (to reduce interference between distinct memories with similar features) and pattern completion (to retrieve the whole from a partial input of features) to encode and retrieve bound representations, respectively. Upstream of the hippocampus, the perirhinal cortex performs operations that detect feature novelty. In contrast to these functions is the slow learning of the neocortex, which supports semantic memory. Episodic and semantic memory functions are intertwined in a complex way.

It seems to me that deep learning is the cortex, and the binding function is the hippocampus. Add some executive control functions to guide learning strategically and it will start sounding eerie.

It seems like the binding function (i.e. the hippocampus) is the next major challenge. I think this is related to what is referred to as "transfer learning", where the training a model has received in one domain can be partially applied to a new domain so that training time is reduced. In particular, "retriev(ing) whole from partial input of features" reads to me as exactly the missing link that would allow transfer learning to take place. If memory-augmented neural networks put us down that path, then I'd wager that executive control functions are less complex (at least to fake passably). We could be in for some interesting times.

Hook that up to a rat and see what happens?

[5 months later]

We welcome our new AI-Rat overlords? ;)

If you're interested in this you may want to take a look at SDM content-addressable memory, which uses neural networks as address encoders/decoders: https://en.wikipedia.org/wiki/Sparse_distributed_memory

It was developed by Pentti Kanerva at NASA in the 80s.
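
For anyone who wants to poke at the idea, here is a rough sketch of the classic Kanerva-style SDM, assuming binary addresses and a plain Hamming-radius activation rule rather than a learned neural encoder/decoder. The class name and the parameter values (`n_locations`, `dim`, `radius`) are my own picks for illustration, not taken from the Wikipedia article or the paper under discussion.

```python
import random


class SparseDistributedMemory:
    """Toy Kanerva-style SDM over binary vectors (illustrative parameters)."""

    def __init__(self, n_locations=1000, dim=64, radius=26, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.radius = radius
        # Fixed random "hard location" addresses, one bit counter per dimension.
        self.addresses = [[rng.randint(0, 1) for _ in range(dim)]
                          for _ in range(n_locations)]
        self.counters = [[0] * dim for _ in range(n_locations)]

    def _active(self, address):
        # Indices of hard locations within the Hamming radius of the address.
        return [i for i, hard in enumerate(self.addresses)
                if sum(a != b for a, b in zip(hard, address)) <= self.radius]

    def write(self, address, data):
        # Every active location nudges its counters toward the data bits.
        for i in self._active(address):
            row = self.counters[i]
            for j, bit in enumerate(data):
                row[j] += 1 if bit else -1

    def read(self, address):
        # Sum counters over active locations, then threshold at zero.
        totals = [0] * self.dim
        for i in self._active(address):
            for j, count in enumerate(self.counters[i]):
                totals[j] += count
        return [1 if t > 0 else 0 for t in totals]
```

Writing a pattern at its own address and then reading from a copy with a few bits flipped is the usual way to see the content-addressable behaviour: the noisy cue activates many of the same hard locations, so the stored bits dominate the sum.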

For related memory-augmented models see: https://en.wikipedia.org/wiki/Deep_learning#Networks_with_se...

There was also the "Reasoning, Attention, Memory (RAM)" NIPS 2015 workshop organized by Jason Weston on this topic: http://www.thespermwhale.com/jaseweston/ram/

There is a disproportionate amount of work on training learning models while memory mechanics are ignored. For some reason memory research is pursued by very few machine learning/AI labs, mostly Google DeepMind/Brain, FB, and Numenta, and maybe Tom Mitchell's 'never-ending learning' project at CMU.

I think part of the reason why memory / attention mechanisms haven't caught on quite yet is that they add a layer of complexity to reasoning about what DNNs can/cannot do, which is already challenging for researchers. It takes a while to prove that these are scalable enough for a "killer app".

Also, industry applications of deep learning are at least 1-2 years behind academia, so I don't expect to see any use of differentiable memory mechanisms in production anytime soon.

Anyway, the workshop reading list looks amazing. Thanks for this!

Grandiose claims, barely-readable and buzzword-saturated language, really weird experiment setup, lack of critical self-examination. These common features of recent AI research papers are beginning to annoy me, especially when they show up in corporate research, where (supposedly) results should be more important than the kind of lingual tribalism seen in academia. Is it really that hard to produce a human-readable description of the architecture and put it in a particular place of the paper instead of spreading them all over? Also, it took me quite a while to understand what the heck they were measuring (and how) in the first place.

> Is it really that hard to produce a human-readable description of the architecture and put it in a particular place of the paper instead of spreading them all over?

That would allow easier reproducibility, which is the opposite of the goal for papers that come out of industry. This is the purpose of industry papers (in order of importance):

1) Avoid being clear enough that competing companies could reproduce the work.

2) Brag about the company's capabilities such that it increases interest from potential customers.

3) Maintain just enough scientific rigor and clarity that it is still publishable.

1 is generally top priority. 2 is the goal. If 1 prevents 3 then they just call it a white paper and publish only to arxiv or on their website. That way you still get exposure for 2 without compromising 1.


Here's the original paper on Neural Turing machines: https://arxiv.org/abs/1410.5401

IIRC, that paper's results focus on the model's ability to extrapolate prediction of sequential data to arbitrary lengths having only been trained on sequences of a fixed length.
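
As I understand the NTM paper, the memory is read through soft, content-based addressing: a key is compared to every memory row by cosine similarity, the similarities are sharpened and passed through a softmax, and the read vector is the weighted sum of rows. Below is a minimal sketch of just that read step, assuming plain Python lists; `content_read` and `beta` are names I made up, and this leaves out the controller, the write head, and location-based addressing.

```python
import math


def content_read(memory, key, beta=5.0):
    """Soft content-based read: returns (attention weights, read vector)."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v + 1e-8)  # epsilon guards empty rows

    # Similarity of the key to each row, scaled by the sharpness beta.
    scores = [beta * cosine(row, key) for row in memory]

    # Numerically stable softmax over the rows.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The read vector is a convex combination of the memory rows.
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(width)]
    return weights, read
```

Because every step is differentiable, the whole lookup can be trained by gradient descent, which is the point of calling the memory "differentiable".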

This sounds like a big deal. Can somebody with some expertise in the field comment on it?

It has been a popular belief that although deep learning has been successful in a lot of domains, its biggest shortcoming is high sample complexity, i.e. even for simple problems you need tens of thousands of samples. Thus, deep learning was believed to be unsuitable for one-shot (or low-shot) learning, where the model needs to learn from one or a handful of samples per class.

Brenden Lake et al. showed that a Bayesian approach outperforms deep learning based methods in this Science article: http://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8.p...

DeepMind took up the challenge and delivered a deep learning based method that uses no feature engineering and exceeds human-level performance in low-shot learning, thus providing strong evidence against the widely held notion that deep learning cannot work with small amounts of data.

The "no feature engineering" claim would be far more convincing if they demonstrated the network's abilities on several datasets of different types. It's not like those would be hard to generate synthetically. An interesting test would be "logic puzzles" that use shapes, object counts and so on. For example, a test where you have to label a picture based on the number of objects, regardless of their shapes.

As it is, it could be that this particular setup happens to extract features of this particular dataset, while failing miserably on others.


Thinking about this at a higher level, I ask myself what constitutes true one-shot learning. The reason we care about it is that in real life most problems don't involve huge datasets of available solutions. On the other hand, this paper does involve training a model on a large dataset of similarly typed, labeled data. The problem the algorithm solves involves items from the same set.

The first obvious question is the one I asked above: does this approach work for other types of data? The second one is whether the mechanism would work for more diverse datasets. Finally, the most important question is: how well will it perform on tasks that fall far outside of the initial training data? Because that's the true challenge behind one-shot learning.

It's kind of amazing that the paper doesn't try to answer any of those questions. Isn't that the real purpose of research in AI? (Most likely they tried and the results weren't good, but hey, there is no way to tell without re-implementing the whole thing.)

Do you have a reference to the work from DeepMind?

Can anyone succinctly provide a concrete example of how this technique reduces the amount of data needed to train a network?

It doesn't. The claim they're making is that instead of learning labels from a bazillion similar examples, they use a bazillion diverse examples to train a more generic classifier capable of "learning" correct labels from just a few (e.g. 5) examples.
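
To make that concrete, here is a rough sketch of how I read the episodic setup: each training episode draws a few classes, assigns them fresh arbitrary labels, and feeds each sample together with the previous step's label, so the only way to do well is to bind samples to labels inside the episode. `make_episode` and `class_pool` are hypothetical names, the sizes are placeholders, and this is my reading of the setup rather than the authors' code.

```python
import random


def make_episode(class_pool, n_classes=5, shots=2, rng=None):
    """Build one episode as a list of (sample, delayed_label, target) steps.

    class_pool: dict mapping a class name to a list of samples (hypothetical).
    """
    rng = rng or random.Random()

    # Pick the classes for this episode and give them fresh, arbitrary labels,
    # so a label id means nothing outside the current episode.
    chosen = rng.sample(sorted(class_pool), n_classes)
    labels = list(range(n_classes))
    rng.shuffle(labels)
    binding = dict(zip(chosen, labels))

    # A few samples per class, presented in random order.
    steps = [(x, binding[c])
             for c in chosen
             for x in rng.sample(class_pool[c], shots)]
    rng.shuffle(steps)

    # The correct label is revealed one step late: at time t the model sees
    # sample x_t together with the label of x_{t-1}, and must predict y_t.
    episode = []
    previous_label = None
    for x, y in steps:
        episode.append((x, previous_label, y))
        previous_label = y
    return episode
```

The total amount of training data is still large; what shrinks is the number of labeled examples needed per new class at test time.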

They compare it with human performance claiming their algorithm is better, but it's kind of a bullshit claim, since people weren't allowed to use scratch paper or see previous examples. So instead of a pattern-matching test they turned this into a memory test. Typical of current AI research. Those people will do anything to be able to claim "better than human" performance.

They also didn't compare against, or even mention, the hierarchical Bayesian models that were used to attain human-level performance (by trying to mimic exactly how the human mind does it).


This paper is either really important or total bullshit. Anyone know which?
