Agent57: Outperforming the human Atari benchmark (deepmind.com)
87 points by EvgeniyZh on March 31, 2020 | 44 comments



This whole evolution looks more and more like expert systems from the 1980s, where people kept adding more and more complexity to "solve" a specific problem. For RL, we started with a simple DQN that was elegant, but the new algorithms look like a massive hodgepodge of band aids. NGU, as it is, is extraordinarily complex and looks like an ad hoc mix of various patches. Now, on top of NGU, we are also throwing in a meta-controller and even bandits, among other things, to complete the proverbial kitchen sink. Sure, we get to call victory on Atari, but this is far from elegant and beautiful. It would be surprising if this victory generalizes to other problems where folks have built different "expert systems" specific to those problems. So all this feels a lot like a Watson-winning-Jeopardy moment to me...

PS: Kudos to DeepMind for pushing for the median, or even better the bottom percentile, instead of a simplistic average metric, which also hides variance.


You're just witnessing the natural progression of research. Occasionally somebody has a brilliant idea that works a lot better than the state of the art, and then people add bells and whistles to that idea until the improvements don't justify a new paper. Sometimes a new brilliant idea emerges from the bells and whistles, sometimes the field gets stuck for a while, sometimes the field dies.


I agree with you overall, but while it was relatively simple, DQN was never "elegant". The elegant and most important part of DQN was to use a CNN for learning a good image representation - but CNNs are completely unrelated to RL.

The sole contribution of DQN was to introduce two ugly band-aids, experience replay and a target network, to make the CNN training stable in an RL setting.

So, DQN = CNN (elegant and simple) + ugly band-aids
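
For readers who haven't seen the two band-aids spelled out, here's a minimal sketch of what experience replay and a target network add around the CNN. This is simplified Python with placeholder pieces (make_cnn, env_step, num_steps, q_values and update are stand-ins I made up, not DeepMind's actual code):

    import random, collections, copy

    replay_buffer = collections.deque(maxlen=1_000_000)  # band-aid 1: experience replay
    q_net = make_cnn()                                    # placeholder: the CNN Q-network
    target_net = copy.deepcopy(q_net)                     # band-aid 2: frozen target network
    gamma = 0.99

    for step in range(num_steps):
        # act epsilon-greedily, but store the transition instead of learning from it immediately
        s, a, r, s_next, done = env_step(q_net)           # placeholder environment interaction
        replay_buffer.append((s, a, r, s_next, done))

        if len(replay_buffer) >= 32:
            # learn from a decorrelated random batch of old transitions
            for s, a, r, s_next, done in random.sample(replay_buffer, k=32):
                # bootstrap from the frozen target network, not the network being updated
                target = r if done else r + gamma * max(target_net.q_values(s_next))
                q_net.update(s, a, target)                # placeholder gradient step

        # periodically sync the target network so the bootstrap targets stay stable
        if step % 10_000 == 0:
            target_net = copy.deepcopy(q_net)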


Just a side note. It's probably not very fair to attribute the introduction of experience replay to the DQN authors. This technique was popularized in 1992 by Lin [1] and used later in many works to improve performance. The difference, perhaps, is that in DQN, experience replay is not only a way to improve performance but a must for obtaining stability.

[1] http://www.incompleteideas.net/lin-92.pdf


Doesn't the human brain have millions of years of band aids?


Yes and at this pace it may take thousands (or at least hundreds) of years to patch up our A.I. to general A.I.


No, most likely not. Our best theories of the brain are both computationally and evolutionarily parsimonious.


You can make a simple theory of anything: an airplane as a spherical ball of varying friction with steerable propulsion, or as a massive electro-mechanical system. There's a fidelity curve as you progress in complexity. The fidelity curve of a literal ball stabilizes quickly, while the fidelity curve of an airplane extends far.

The brain is massively complex; we still don't understand in detail how it computes and learns, beyond generalities such as its use of electrical and chemical signals and neurotransmitters, and some knowledge of its architecture and region differentiation.

Those models all point to an extremely complex system with quite a number of different specialized parts, with neurons individually having quite amazing complexity (and so on for capsules, neuronal systems).

I think it's unlikely a "simple" system (if you could consider a vanilla CNN simple, for example) will solve human-like AGI efficiently (either data-efficiently or computationally efficiently). There are too many requirements.

Also, while simplicity is good, a researcher shouldn't be caught obsessing over it exclusively, as it is not the end goal; to me, the end goal is the advancement of our understanding and of our capabilities.


> The brain is massively complex; we still don't understand in detail how it computes and learns, beyond generalities such as its use of electrical and chemical signals and neurotransmitters, and some knowledge of its architecture and region differentiation.

Again, not really. Broadly speaking, the AI community is pretty far behind the neuroscience community in terms of understanding the basic principles underlying how brains work. https://mitpress.mit.edu/books/principles-neural-design


The human brain is encoded via a surprisingly limited amount of information in DNA. It’s therefore more elegant than many modern AI systems.

The often-quoted 6×10^9 base pairs overstates the complexity, as a lot of non-coding DNA exists, and much of what’s left is redundant or does not relate to the brain.
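
As a back-of-the-envelope check on how little information that actually is, here's a naive upper bound (it ignores the redundancy and non-coding DNA mentioned above, so the real figure is smaller still):

    base_pairs = 6e9           # the 6x10^9 figure quoted above (both chromosome copies)
    bits_per_base = 2          # 4 possible bases -> 2 bits each
    gigabytes = base_pairs * bits_per_base / 8 / 1e9
    print(f"{gigabytes:.1f} GB")   # ~1.5 GB, before any of those discounts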

PS: We actually know quite a bit about how the brain learns and computes information. There is really interesting work on how the optic nerve encodes information, for example.


DNA by itself doesn't produce a human or a cell. You need another functional cell or human to create a replica. I would compare DNA to instructions for creating a new car: the instructions are incomplete and assume the existence of a car factory. DNA says "now paint the car with one of these colors", but it doesn't say how.


Computer programs don’t include the CPU design. DNA, on the other hand, encodes everything, which then gets leveraged during cell division.

Basically, DNA includes the design for the car, the car factory, all the component factories that feed the car factory, and the mines that feed the component factories. People have dependencies on their diet for things like vitamin C, but that’s a whole other issue.

PS: Mitochondria also have DNA, but that’s just location specific and can simply be added to the total for the cell.


Why do you think DNA includes everything?


Mitosis and RNA.


> The human brain is encoded via a surprisingly limited amount of information in DNA.

Same DNA that is referred to in other discussions as 'the ultimate spaghetti code'.


I find the comparison a bit facile.

Expert systems were tailored to the problem at hand.

The components of the solution here are based on learning and are meant to address what are widely believed to be important and general facets of human cognition and/or fundamental machine learning problems (e.g. credit assignment, episodic and short-term memory, meta-cognition).

It's unclear how general the solutions here actually are, but they certainly don't look very specific to Atari, on the face of it.

It's also worth noting that DeepMind's research has a pretty good track record of not being overly engineered to solve specific tasks. E.g. DQN, AlphaGo and successors (with the possible exception of Alpha*?)...


Contrast with papers like "Learning to Fly via Deep Model-Based Reinforcement Learning" (https://arxiv.org/abs/2003.08876), in which a principled approach was refined over years until it could be used for a real-world task.


To everyone: stop criticizing other people's work over "elegance", both on the internet and in real life. This is not New York or Paris Fashion Week.

Criticism of the aesthetics of an algorithm does not make sense. We have objective metrics to judge code and algorithms, such as Big O complexity, resource usage, epochs, reproducibility, etc. The questions the R&D team tries to answer are:

a) "Is this possible?", and b) "Is it efficient?".

This has nothing to do with aesthetics. Simplicity is not a metric.

Neural networks are not simple to understand or to use. The DeepMind team shows that they are able to use them as building blocks in different places of a bigger algorithm to solve a goal, iteratively. All of these results required time, effort, and resources to obtain. They are not obligated to share their results with the public, so just stop criticizing without even suggesting a better solution.


"The human" is an "average human" from the "Human-level control through deep reinforcement learning" paper.

For example, for Montezuma's Revenge:

- "average human" score is 4753.30

- Agent57 score is 9352.01

- human record is 1219200.0


According to the paper, this takes around 53,418 hours to train (5e10 frames / 260 frames/sec); distributed over 256 machines, that's around 200 hours of wall-clock time.
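
For anyone who wants to sanity-check those numbers, the back-of-the-envelope arithmetic (using the figures quoted above, not independently verified):

    total_frames = 5e10                     # environment frames, as quoted
    frames_per_sec = 260                    # per-actor throughput, as quoted
    machines = 256                          # distributed actors, as quoted

    compute_hours = total_frames / frames_per_sec / 3600
    wall_clock_hours = compute_hours / machines
    print(f"{compute_hours:,.0f} compute-hours")        # ~53,419
    print(f"{wall_clock_hours:,.0f} hours wall clock")  # ~209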


Mind-boggling. They train on each game for a total of 1e11 frames of experience. At 30 FPS that is ~106 calendar years of constant gameplay.


This is how I explain AI to clients...

Think of your best employee. You show them how to operate the videoconferencing system, and that's done.

Now think of your worst employee. You have to show them how to operate the videoconferencing system 3 times because they keep getting it wrong or not understanding how to start a meeting. Eventually, when they've got the hang of it, they're fine too.

AI is like an employee that needs to be shown how to do it a million times. Unless you have the time or the data to show the AI how to do a simple task millions of distinct times, or someone else has already done it, AI can't help you.

Here DeepMind is just showing that for slightly more complex tasks, that ~1e6 factor turns into ~1e11...


That's only the current state of the art, however. The goal is transfer learning, meaning that when the AI learns something somewhere, it will improve its knowledge of other tasks too.

A better analogy would be a baby and an adult. The adult is full of past experiences and can use that experience to learn some other tasks faster.

Currently, our AI technology isn't even at the "baby" stage, as it is not able to transfer knowledge between arbitrary tasks. This is an active research domain.


I think that's a really cogent explanation and fits a narrative that I've seen work when I try to tell non-technical leaders why they probably aren't ready to do ML/DL.


Reading the blog post (haven't read the paper yet) makes it sound like this technique might apply to fuzzing. If this thing seeks out and exploits novel states in a large state space, that's the kind of direction you want in your fuzzer too.


Definitely getting closer; another landmark development, just like AlphaGo.


Not exactly related to Agent57, but to the post itself. Perusing a bunch of those games in the playlist gave me a deep respect for the creativity of Atari game makers. Many things that were "games" then may not be considered a game today (e.g., a gopher digging dirt to get carrots from a farmer). I also recognized many similarities between the game mechanics and NES games; for example, Atari's Road Runner is like Excitebike, Pitfall is like A Boy and His Blob, etc.


Shouldn't a comparison between a single agent and human performance be a better measurement? I mean, OK, it's defined against "human performance", but a human cannot play in parallel on multiple environments or over a tree of possibilities. It just plays again and again, single shot, as a single agent.


Would it be fair to say this is a ResNet moment for RL?


This seems more like overfitting a complex system to a specific problem set.


This feels like the typical HN "less storage than a Nomad" comment. To me this system reads as very important and necessary for the development of AI in general.


Meh, I'll take the bait.

There are no fundamentally new ideas in this paper compared to the preceding papers. What they do is tune hyperparameters (BPTT length, exploration/exploitation tradeoff, and policy parameterization) in a smart fashion so as to fit the bottom 5% of Atari games. Obviously the parameters, or equivalently the architecture choices, are tuned to achieve exactly that: good performance on the bottom 5% of Atari games. None of these choices will generalize outside of this specific set of Atari games.

The reasons we are doing badly at these games are well understood. They typically require "world knowledge" (what is a key? what is a door?) and reasoning (I found a key; it can be used to open a door). That is, the visual representations need to encode such knowledge. Algorithms don't possess this world knowledge, as they are not embodied in our world, so they need to learn it from scratch, i.e. brute-force it. That's exactly what this paper is doing - brute-forcing the solutions by finding just the right hyperparameters with millions of hours' worth of compute.

A good analogy is what would happen if you took the game but flipped the pixels in some deterministic way so that the screen would look like noise to a human. A key would no longer be a key, but the structure would still be the same. If someone asked you to solve Montezuma's Revenge with that representation, you would not be able to. Does that make you stupid or non-human? So, because these games require human world knowledge, solving them in the same way as simpler games is kind of beside the point.


Thanks for explaining your take, but this sounds very reductive. When it comes down to it, every problem is solved by tuning for that problem specifically.

While Never Give Up (NGU) is not fundamentally new, it is an important step in computers learning. You need to be able to generalize solutions to problems where you don't have contextual information. Imagine you were a caveman asked to operate an iPhone. You're not stupid if you don't know how, but if I tell you "never give up" and put you in a room for 5 years, I'd expect some results from a sentient being. This is an important process too.

We would be much closer to good AI if it can figure things out by itself instead of being constantly fed "clean" data.


There's a difference between research and engineering. Is this system impressive? Definitely! It's a complex engineering effort, highly tuned to solve a specific problem: beating the Atari benchmark.

Does it teach us anything fundamentally new? No, it still has horrible sample complexity and does not generalize to anything outside of Atari unless you completely re-tune it. And I don't mean re-train. I mean changing the architecture and assumptions. That's different from projects such as e.g. AlphaZero or MuZero.

IMO this would've been more appropriate to publish as an open-source system, so that it could be applied and tuned to other problems, as opposed to a research paper. As research, nobody outside of DeepMind can ever reproduce this.

You are completely changing the topic with this:

> While Never Give Up (NGU) is not fundamentally new, it is an important step in computers learning. You need to be able to generalize solutions to problems where you don't have contextual information

We're not even talking about NGU, we're talking about the paper linked in this post. This specific paper proposes little new in that regard. It just engineers a system to do this specifically for Atari games by taking a previous paper and changing some parameters. Neat, but it's not some kind of breakthrough.


> Does it teach us anything fundamentally new? No, it still has horrible sample complexity and does not generalize to anything outside of Atari unless you completely re-tune it.

I thought it was interesting that they let the agent learn the exploration/exploitation tradeoff, also combining memory and intrinsic motivation.

Another contribution of this paper would be showing that all these tricks can build on each other and thus are complementary.
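
For the curious, "learning the tradeoff" is essentially a bandit problem: each arm is one member of a family of policies with a different (intrinsic-reward weight, discount) setting, and the controller picks an arm per episode based on the returns it has observed. A minimal UCB-style sketch of that idea (my own simplification, not the paper's exact sliding-window variant; run_episode and the arm values are placeholders):

    import math

    # Hypothetical arms: (intrinsic reward weight beta, discount gamma) pairs,
    # ranging from more exploratory to more exploitative.
    arms = [(0.3, 0.99), (0.1, 0.997), (0.0, 0.9999)]
    counts = [0] * len(arms)
    mean_return = [0.0] * len(arms)

    def pick_arm(t):
        # UCB1: play every arm once, then trade off observed return vs. uncertainty
        for j, n in enumerate(counts):
            if n == 0:
                return j
        return max(range(len(arms)),
                   key=lambda j: mean_return[j] + math.sqrt(2 * math.log(t) / counts[j]))

    for episode in range(1, 1001):
        j = pick_arm(episode)
        beta, gamma = arms[j]
        ep_return = run_episode(beta, gamma)   # placeholder: roll out the chosen policy
        counts[j] += 1
        mean_return[j] += (ep_return - mean_return[j]) / counts[j]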

Human brains are also a bag of tricks, fine-tuned to the goal and requirements of making more humans.


I think you're both right but are pointing at different ways to approach the intelligence problem. This is kind of the Connectionist vs Symbolic debate.

The fundamental question is, is a representational (contextual) bootstrap required in the long run for a contained computational system to perform at human level across a large number of domains? This isn't a solved problem.

So yes, AI would be better if it could "figure things out by itself"; however, humans don't "figure things out by themselves" either: they come pre-wired with a lot out of the box and get a lot of help cleaning the data (parents, teachers, literal labels, etc.).


Wouldn't the 'pre-wiring' in this instance be the code and hardware that the algorithm was running on?


> Algorithms don't possess this world knowledge as they are not embodied in our world, so they need to learn it from scratch, i.e. brute force it.

You mean, almost like a human baby?


It's not really... They make a very nice and clear summary of the current state of RL, then introduce an incremental improvement by combining existing approaches and throwing a lot of compute at the problem.

Keep in mind DeepMind has a lot of money for PR - but nice prose and diagrams shouldn't affect your judgement of whether something is important or not!


Does anyone still have an iPod?


Impressive! Even more impressive would be outperforming a human with the same amount of training...


That comparison would be difficult to do. From the time of first experience to beating Agent57, it takes a lot of time to train the human too.


Ok, maybe it's time to slow down and consider if this is going in a good direction, before just building whatever it is that can be built?


Moloch whose mind is pure machinery! Moloch whose blood is running money! Moloch whose fingers are ten armies! Moloch whose breast is a cannibal dynamo! Moloch whose ear is a smoking tomb!



