Hacker News
ericvanular 7 days ago

Clickbait title. It's "superhuman" only in the sense that almost anything done on a computer is, and, as the article admits in its second-to-last paragraph, it doesn't even use reinforcement learning.

The motivation here was to introduce RL concepts like rewards, Markov decision processes, etc. to the many people who probably haven't heard of them. Appreciate your comment, and I understand that a naive search method might be basic for advanced practitioners such as yourself :)

How do rewards work in games like Wari? https://en.wikipedia.org/wiki/Oware

Sometimes a possible move captures pieces and therefore, on paper, brings you closer to victory, but also puts you in a strategically bad position (one that only becomes apparent after subsequent moves) and will eventually lose you the game.

Nothing prevents you from rewarding 0 for anything other than a win: a win is +1, a loss is -1, and every intermediate step is 0. However, this creates a new problem: your reinforcement learner needs to explore all paths to all possible outcomes, which is the exact thing we are trying to get around. Thus, we can use human-designed heuristics, or neural networks that try to learn these heuristics. The latter approach is deep reinforcement learning and the basis for AlphaZero.
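As a concrete illustration (my own sketch, not from the article), the terminal-only reward scheme described above is just a function that stays silent until the game ends:

```python
# Hypothetical sketch of the sparse reward scheme described above:
# +1 for a win, -1 for a loss, 0 for every intermediate move.
# `game_over`, `winner`, and `player` are illustrative names, not
# identifiers from any particular library.
def sparse_reward(game_over, winner, player):
    """Return the reward seen by `player` after a move."""
    if not game_over:
        return 0  # intermediate steps carry no signal at all
    return 1 if winner == player else -1
```

The point is that capturing pieces mid-game earns nothing under this scheme, which sidesteps the "good on paper, bad strategically" trap, at the cost of a much harder credit-assignment problem.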

This is nothing more than a naive random policy search.

Clickbait aside, this is a gross misrepresentation of what reinforcement learning is.

Thanks for the comment. It is 100% a naive random policy search. The intention is to help those who might not be familiar with RL concepts understand the basic tools and how to start playing around, rather than diving into things like DQL, PPO, etc. right away.
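For anyone wondering what "naive random policy search" means concretely, here is a minimal toy sketch (mine, not the article's code): sample random policy parameters, evaluate each candidate's return on a fixed episode, and keep the best. The toy task and all names here are made up for illustration.

```python
import random

def episode_return(w, steps=20):
    """Return of a linear policy `w` on a toy task where the ideal
    action equals the observation (reward = negative squared error)."""
    rng = random.Random(0)  # fixed episode so candidates are comparable
    total = 0.0
    for _ in range(steps):
        obs = rng.uniform(-1, 1)
        action = w * obs                  # linear "policy"
        total -= (action - obs) ** 2
    return total

# Naive random policy search: no gradients, no value function,
# just sample parameters at random and keep the best performer.
search_rng = random.Random(42)
best_w, best_ret = None, float("-inf")
for _ in range(100):
    w = search_rng.uniform(-2, 2)        # random candidate parameters
    r = episode_return(w)
    if r > best_ret:
        best_w, best_ret = w, r
```

With 100 random candidates the search reliably finds a policy near the optimum (w = 1) on a one-parameter task; the approach degrades quickly as the parameter count grows, which is why it's a teaching tool rather than a practical method.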

Hi everyone, I've received so much from this community over the years. I wanted to start to give back so I decided to start writing about topics that I hope will help people to learn. Please let me know if you have questions!

Have you solved the lunar lander environment? It's been kicking my ass for a few months off and on. I finally solved it with vanilla policy gradient ascent without a baseline; it just took 30,000 episodes.
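Not the parent's solution, but for readers wondering what "vanilla policy gradient ascent without a baseline" looks like, here is a minimal REINFORCE-style sketch on a two-armed bandit with made-up rewards. LunarLander itself needs gym and a function approximator; this only shows the bare update rule.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

rng = random.Random(0)
theta = [0.0, 0.0]          # logits (policy parameters), one per arm
true_reward = [0.2, 1.0]    # illustrative rewards: arm 1 is better
alpha = 0.1                 # learning rate

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if rng.random() < probs[0] else 1   # sample an action
    r = true_reward[a]
    # REINFORCE update, no baseline: theta += alpha * r * grad log pi(a)
    # For a softmax policy, grad log pi(a) w.r.t. theta[i] is
    # (1 if i == a else 0) - probs[i].
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * r * grad_log
```

After training, the policy concentrates on the higher-reward arm. Without a baseline the gradient estimate is high-variance, which is part of why full environments like LunarLander can take tens of thousands of episodes.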

Does what you've created in the blog post solve the lunar lander?

