The article makes it seem like finding diamonds is some kind of super complicated logical puzzle. In reality, the hardest part is knowing where to look for them and what tool you need to mine them without losing them once you find them. This knowledge was given to the AI by having it watch a video that explains it.
If you watch a guide on how to find diamonds it's really just a matter of getting an iron pickaxe, digging to the right depth and strip mining until you find some.
Hi, author here! Dreamer learns to find diamonds from scratch by interacting with the environment, without access to external data. So there are no explainer videos or internet text here.
It gets a sparse reward of +1 for each of the 12 items that lead to the diamond, so there is a lot it needs to discover by itself. Fig. 5 in the paper shows the progression: https://www.nature.com/articles/s41586-025-08744-2
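In case a concrete picture helps, here is a rough Python sketch of what such a sparse milestone reward could look like. The item names, the `inventory` dict, and the bookkeeping are my own placeholders following the usual Minecraft tech tree toward diamond, not the paper's actual implementation.

```python
# Illustrative sketch of a "+1 per first-time milestone" reward signal.
# The 12 item names roughly follow the Minecraft tech tree toward diamond;
# the exact list and bookkeeping in the paper may differ.

MILESTONES = [
    "log", "plank", "stick", "crafting_table", "wooden_pickaxe",
    "cobblestone", "stone_pickaxe", "iron_ore", "furnace",
    "iron_ingot", "iron_pickaxe", "diamond",
]

def milestone_reward(inventory, achieved):
    """Return +1 for each milestone item obtained for the first time this episode."""
    reward = 0.0
    for item in MILESTONES:
        if item not in achieved and inventory.get(item, 0) > 0:
            achieved.add(item)   # count each milestone only once per world
            reward += 1.0
    return reward

# Usage inside an episode loop (hypothetical observation format):
# achieved = set()                      # reset when the world resets
# r = milestone_reward(obs["inventory"], achieved)
```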
Since diamonds are surrounded by danger, and if it dies it loses its items and such, why would it not be satisfied after discovering the iron pickaxe or some such? Is it in a mode where it doesn't lose its items when it dies? Does it die a lot? Does it ever try digging vertically down? Does it ever discover other items/tools you didn't expect it to? Open world with sparse reward seems like such a hard problem. Also, once it gets the item, does it stop getting reward for it? I assume so. Surprised that it can work with this level of sparse rewards.
In all reinforcement learning there is some impetus for exploration, either explicitly as part of a fitness function or implicitly as part of the algorithm. It might be adding a tiny reward per square walked, a small reward for each block broken, and a larger one for each new block type broken. Or it could be just forcing a random move every N steps so the agent encounters new situations through "clumsiness".
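To make that concrete, here is a toy sketch of both flavours: a small intrinsic bonus that grows when a new block type is broken, plus an occasional forced random action. The reward values, `every_n`, and the function names are made up for illustration; this is not claimed to be what Dreamer itself uses.

```python
import random

def exploration_bonus(block_type, seen_types, per_block=0.01, per_new_type=0.1):
    """Tiny reward for any broken block, a larger one the first time a new type is broken."""
    bonus = per_block
    if block_type not in seen_types:
        seen_types.add(block_type)
        bonus += per_new_type
    return bonus

def maybe_randomize(action, step, action_space, every_n=50):
    """Force a random action every N steps so the agent stumbles into new situations."""
    if step % every_n == 0:
        return random.choice(action_space)
    return action
```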
When it dies it loses all items and the world resets to a new random seed. It learns to stay alive quite well but sometimes falls into lava or gets killed by monsters.
It only gets a +1 for the first iron pickaxe it makes in each world (same for all other items), so it can't hack rewards by repeating a milestone.
Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.
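For anyone wondering what "imagining scenarios in parallel using the world model" means mechanically, here is a heavily simplified sketch: a learned model predicts the next latent state and reward, and returns are computed from rollouts that never touch the real environment. The function names, shapes, and stand-in dynamics are mine; the real DreamerV3 machinery is far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model_step(latent, action):
    """Stand-in dynamics: predict next latent state and reward from (latent, action)."""
    next_latent = np.tanh(latent + 0.1 * action + 0.01 * rng.normal(size=latent.shape))
    reward = float(next_latent.mean())   # placeholder for a learned reward head
    return next_latent, reward

def imagine(policy, start_latents, horizon=15):
    """Roll out many imagined trajectories in parallel, entirely inside the model."""
    latents = list(start_latents)
    total_return = np.zeros(len(latents))
    for _ in range(horizon):
        actions = policy(latents)                        # actor proposes an action per latent
        steps = [world_model_step(z, a) for z, a in zip(latents, actions)]
        latents = [s[0] for s in steps]
        total_return += np.array([s[1] for s in steps])  # imagined returns train actor/critic
    return total_return

# Usage: many imagined rollouts from latent states (here just random vectors);
# a real agent would start from states encoded out of the replay buffer.
# random_policy = lambda zs: [rng.normal(size=z.shape) for z in zs]
# returns = imagine(random_policy, [rng.normal(size=8) for _ in range(1024)])
```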
> Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.
This is such gold. Thanks for sharing. Immediately added to my notes.
I just want to express my condolences for how difficult it must be to correct basic misunderstandings that could be cleared up immediately by reading the fourth paragraph under the section "Diamonds are forever".
It didn't watch 'a video'; it watched many, many hours of video of people playing Minecraft (with another specialised model feeding in predictions of the keyboard and mouse inputs from the footage). It's still a neat trick, but it's far from the implied one-shot learning.
I don't think it was videos. Almost certainly it was replay files, with a bunch of work to transform them into something that could be compared to the model's outputs. (AlphaStar never 'sees' the game's interface, only a transformed version of the information available via an API.)
StarCraft provides replay files that start with the initial game state and then record every action in the game. Not user inputs, but the actions bound to them.
The other replies have observed that the AI didn't get any "videos to watch", but I'd also observe that this is being used as an English colloquialism. The AIs aren't "watching videos"; they're receiving videos as their training data. That's quite different from what "watching a video" brings to mind, as if the AI watched a single YouTube tutorial video once and got the concept.
I feel like you are jumping to conclusions here. I wasn't talking about the achievement or the AI; I was talking about the article and the way it explains finding diamonds in Minecraft to people who don't know how to find them.
I am not American, nor do I live in America, so I don't really have a horse in this race, but the DOGE approach seems to be the classic "move fast and break things" approach. The reactions to it are the classic reactions to that approach: competent people speak out to get broken things fixed, and others are confused about what is happening.
I did talk to a doctor. He quoted me $12,000 for a surgery that sounded excessive and had a long recovery time. I try to get second opinions, but doctors are so busy that I either never get a call back or get scheduled many months out.
Oddly, the wealthier I get, the more I distrust doctors. Why perform a $300 tooth filling, for example, when you can creatively justify a $5,000 root canal and crown? They know I have the money, and their kids' private school ain't cheap.
Before "Information Systems" there was no need for a company/college wide network; computers with printers on the same desk were a replacement for the typewriter.
I don't share my data with anyone so I don't need cloud storage.
SSDs are so cheap at this point that I just buy a new SSD every few years and install a fresh copy of Windows there. Old drive gets unplugged and "archived" on the shelf.
If I need to access the "archive" I use a SATA to USB cable and plug the drive in.
I avoid large-capacity drives (my biggest one is about 200 GB) and use SpaceSniffer so I'm forced to keep the data I store lean.
That's the easy part. The hard part is getting people to admit that the metric has been discovered already. Most problems with automation are organizational problems, not technical ones.