
Thank you for taking the time to respond, I appreciate it.

Well I guess my question regarding the expensiveness comes down to wondering about the sample efficiency, i.e. are there not many games that share large similar state trajectories that can be re-used? Are you using any off-policy corrections, e.g. IMPALA style?

Or is that just a source of noise that is too difficult to deal with, and/or is the state space so large and diverse that that many samples are really needed? Maybe my intuition is just way off, it just feels like a very, very large sample size.

Reminds me slightly of the first version of the non-hierarchical TensorFlow device placement work which needed a fair bit of samples, and a large sample efficiency improvement in the subsequent hierarchical placer. So I recognise there is large value in knowing the limits of a non-hierarchical model now and subsequent models should rapidly improve sample efficiency by doing similar task decomposition?




The best way we know to think of it is in terms of variance of the gradient.

In a hard environment, your gradients will be very noisy — but effectively no more than linear in the duration you are optimizing over, provided that you have a reasonable solution for exploration. As you scale your batch size, you can decrease your variance linearly. So you can use good ol' gradient descent if you can scale up linearly in the hardness of the problem.
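
To make the batch-size part of that argument concrete, here is a minimal sketch, not anything from the actual training setup. It assumes a toy one-step problem: a unit-variance Gaussian policy and a made-up 0/1 reward, with a score-function (REINFORCE-style) gradient sample. The names `grad_sample` and `batch_grad_variance` are hypothetical. It estimates the variance of the batch-averaged gradient at a few batch sizes.

```python
import random
import statistics

def grad_sample(theta, rng):
    """One score-function gradient sample for a toy one-step problem."""
    a = rng.gauss(theta, 1.0)           # action drawn from a unit-variance Gaussian policy
    reward = 1.0 if a > 0 else 0.0      # made-up reward, chosen only for illustration
    return reward * (a - theta)         # reward * d/dtheta log N(a; theta, 1)

def batch_grad_variance(batch_size, trials=4000, theta=0.0, seed=0):
    """Variance, across independent trials, of the batch-averaged gradient estimate."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean([grad_sample(theta, rng) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return statistics.variance(estimates)

variances = {b: batch_grad_variance(b) for b in (1, 4, 16)}
for b, v in variances.items():
    # variance * batch should stay roughly constant if variance falls as 1 / batch
    print(f"batch={b:2d}  variance={v:.4f}  variance*batch={v * b:.4f}")
```

Because the samples in a batch are independent, the variance of their mean falls roughly as 1/batch, so the last column should stay about constant. That is the sense in which a noisier problem can be offset by a proportionally larger batch.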

This is a handwavy argument admittedly, but seems to match what we are seeing in practice.

Simulators are nice because it is possible to take lots of samples from them — but there's a limit to how many samples can be taken from the real world. In order to decrease the number of samples needed from the environment, we expect that ideas related to model-based RL — where you spend a huge number of neural network flops to learn a model of the environment — will be the way to go. As a community, we are just starting to get fast enough computers to test out ideas there.


Yo, this probably isn't the type of HN comment you're used to, but I just wanted to say thanks for enriching the dota community. I know that's not really why you're doing any of this, but as someone who's deeply involved with the community, people get super hyped about what you guys have been doing.

They also understand all of the nuances, similar to HN. Last year when you guys beat Arteezy, everyone grokked that 5v5 was a completely different and immensely difficult problem in comparison. There's a lot of talent floating around /r/dota2, amidst all the memes and silliness. And for whatever reason, the community loves programming stories, so people really listen and pay attention.

https://imgur.com/Lh29WuC

So yeah, we're all rooting for you. Regardless of how it turns out this year, it's one of the coolest things to happen to the dota 2 scene period! Many of us grew up with the game, so it's wild to see our little mod suddenly be a decisive factor in the battle for worldwide AI dominance.

Also 1v1 me scrub


Agreed! Can't wait to not have to play Dota 2 with humans :p


> Also 1v1 me scrub

I wanted to play SF against the bot so badly, even knowing I'd get absolutely destroyed over and over again.


EDIT (I work at OpenAI and wrote the statement about the variance of the gradient being linear): Here's a more precise statement: the variance is exponential in the "difficulty" of the exploration problem. The harder the exploration, the worse the gradient. So while it is correct that things become easy if you assume that exploration is easy, the more accurate way of interpreting our result is that the combination of self-play and our shaped reward made the gradient variance manageable at the scale of compute that we've used.


> In order to decrease the number of samples needed from the environment, we expect that ideas related to model-based RL — where you spend a huge number of neural network flops to learn a model of the environment — will be the way to go.

Will those models be introspectable / transferable? One thing I'm curious about is how AIs learn about novel actions / scenarios which are "fatal" in the real world. Humans generally spend a lot of time being taught these things (rather than finding out for themselves, obviously) and eventually come up with a fairly good set of rules about how not to die in stupid ways.


Transferability depends on the way the models are set up, and moves on a scale.

Introspectable: given that you can ask the models unlimited "what if" questions, we should be able to get a lot of insight into how they work internally. You can also often design them to be introspectable at some performance or complexity cost (if that's what you meant by introspectable).


Can you clarify why variance only scales linearly in the duration you are optimizing over? I would have expected it to be exponential, since the size of the space you are searching is exponential in the duration.


Re variance, the argument is not entirely bulletproof, but it goes like this: we know that the variance of the gradient of ES grows linearly with the dimensionality of the action space. Therefore, the variance of the policy gradient (before backprop through the neural net) should similarly be linear in the dimensionality of the combined action space, which is linear in the time horizon. And since backprop through a well-scaled neural net doesn't change the gradient norm too much, the absolute gradient variance of the policy gradient should also be linear in the time horizon.
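
A rough numerical check of just the ES premise, as a sketch under stated assumptions rather than the experiment behind the claim: the objective is a hypothetical toy, `f(x) = x[0]`, whose true gradient has norm 1 in every dimension. The estimator is the basic one-sample form `f(theta + sigma * eps) * eps / sigma` at `theta = 0`. The names `toy_objective` and `es_total_variance` are made up for illustration.

```python
import random

def toy_objective(x):
    """Toy objective; its true gradient is (1, 0, ..., 0) in any dimension."""
    return x[0]

def es_total_variance(dim, trials=5000, sigma=0.1, seed=0):
    """Monte Carlo estimate of the trace of the covariance of a one-sample ES gradient."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    mean_sq_norm = 0.0
    for _ in range(trials):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # ES estimator at theta = 0: f(theta + sigma * eps) * eps / sigma
        scale = toy_objective([sigma * e for e in eps]) / sigma
        g = [scale * e for e in eps]
        mean = [m + gi / trials for m, gi in zip(mean, g)]
        mean_sq_norm += sum(gi * gi for gi in g) / trials
    # trace of covariance = E[|g|^2] - |E[g]|^2
    return mean_sq_norm - sum(m * m for m in mean)

total_var = {d: es_total_variance(d) for d in (10, 20, 40)}
for d, v in total_var.items():
    print(f"dim={d:2d}  total variance ~ {v:.1f}")
```

For this particular objective the exact total variance works out to `dim + 1`, so the estimates should grow roughly in proportion to the dimension while the true gradient stays the same size. It says nothing by itself about the policy-gradient step of the argument.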

This argument is likely accurate in the case where exploration is adequately addressed (for example, with a well chosen reward function, self play, or some kind of an exploration bonus). However, if exploration is truly hard, then it may be possible for the variance of the gradient to be huge relative to the norm of the gradient (which would be exponentially small), even though the absolute variance of the gradient is still linear in the time horizon.
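
A toy illustration of the hard-exploration case, again a sketch with assumptions of my own choosing: the policy picks one of two actions at each of `horizon` steps with probability `p`, and the only reward is 1 when every step picks action 1. The function name `relative_second_moment` is hypothetical. It enumerates all action sequences exactly, so there is no sampling noise, and compares the second moment of the REINFORCE-style estimator with the squared norm of its mean.

```python
from itertools import product

def relative_second_moment(horizon, p=0.5):
    """Exact E[|g|^2] / |E[g]|^2 for a score-function estimator with a sparse reward."""
    mean = [0.0] * horizon
    second_moment = 0.0
    for actions in product((0, 1), repeat=horizon):
        prob = 1.0
        for a in actions:
            prob *= p if a == 1 else 1.0 - p
        reward = 1.0 if all(actions) else 0.0   # reward only if every step picks action 1
        # Score of a Bernoulli(sigmoid(logit)) policy w.r.t. each step's logit is (a - p)
        g = [reward * (a - p) for a in actions]
        mean = [m + prob * gi for m, gi in zip(mean, g)]
        second_moment += prob * sum(gi * gi for gi in g)
    return second_moment / sum(m * m for m in mean)

ratios = {t: relative_second_moment(t) for t in (2, 4, 8)}
for t, r in ratios.items():
    print(f"horizon={t}  E|g|^2 / |E g|^2 = {r:.0f}")
```

With `p = 0.5` the ratio is `2 ** horizon`: the expected gradient shrinks exponentially faster than the noise around it, so the batch needed for a usable signal grows exponentially in this toy setting. A shaped reward or self-play changes the picture by making the reward far less sparse.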



That makes sense, thanks for clarifying!



