Since diamonds are surrounded by danger and if it dies, it loses its items and s...

taneq · 2025-04-07T12:21:19 1744028479

In all reinforcement learning there is (explicitly as part of a fitness function, or implicitly as part of the algorithm) some impetus for exploration. That might mean adding a tiny reward per square walked, a small reward for each block broken, and a larger one for each new block type broken. Or it could simply mean forcing a random move every N steps so the agent encounters new situations through “clumsiness”.

kevindamm · 2025-04-07T20:15:16 1744056916

That is right, there is usually a parameter on the action selection function -- the exploitation vs exploration balance.
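For anyone who hasn't seen it, one common form of that parameter is epsilon-greedy action selection. A minimal sketch, not necessarily what the agent discussed in this thread uses: `epsilon_greedy`, `q_values`, and `epsilon` are illustrative names I made up, and `q_values` is assumed to be a plain list of estimated action values.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability `epsilon` a uniformly random action is chosen
    (exploration); otherwise the highest-valued action is chosen
    (exploitation). All names here are hypothetical.
    """
    if rng.random() < epsilon:
        # Explore: ignore the estimates and pick any action.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    values = [0.1, 0.9, 0.3]
    print(epsilon_greedy(values, epsilon=0.0))  # always greedy
    print(epsilon_greedy(values, epsilon=0.2))  # random 20% of the time
```

Setting `epsilon` to 0 makes the agent purely greedy and setting it to 1 makes it purely random; in practice it is often decayed over training.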

danijar · 2025-04-07T22:26:24 1744064784

When it dies it loses all items and the world resets to a new random seed. It learns to stay alive quite well but sometimes falls into lava or gets killed by monsters.

It only gets a +1 for the first iron pickaxe it makes in each world (same for all other items), so it can't hack rewards by repeating a milestone.
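To make the "first time only" rule concrete, here is a rough sketch of how such a reward could be tracked. This is an illustration under my own assumptions, not the actual implementation: the `MilestoneReward` class, its method names, and the item strings are all hypothetical.

```python
class MilestoneReward:
    """Gives +1 the first time each milestone item is obtained per world.

    Hypothetical helper for illustration; item names are made up.
    """

    def __init__(self, milestones):
        self.milestones = frozenset(milestones)
        self.unlocked = set()

    def reset(self):
        # Called when the agent dies and a new world starts:
        # every milestone becomes rewardable again.
        self.unlocked.clear()

    def step(self, items_obtained):
        # Reward only milestones not yet unlocked in this world,
        # so repeating a milestone yields nothing.
        reward = 0.0
        for item in items_obtained:
            if item in self.milestones and item not in self.unlocked:
                self.unlocked.add(item)
                reward += 1.0
        return reward


if __name__ == "__main__":
    tracker = MilestoneReward({"iron_pickaxe", "diamond"})
    print(tracker.step(["iron_pickaxe"]))  # first pickaxe in this world
    print(tracker.step(["iron_pickaxe"]))  # repeat, no reward
    tracker.reset()                        # new world after death
    print(tracker.step(["iron_pickaxe"]))  # rewarded again
```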

Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.

SpaceManNabs · 2025-04-08T02:15:02 1744078502

> Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.

This is such gold. Thanks for sharing. Immediately added to my notes.