For everyone reading neither the article nor the paper:
- both show neural networks can learn the game of life just fine
- the finding is that to learn the rules reliably, the networks need to be heavily over-parameterised (i.e. many times larger than the minimal size at which hand-crafted weights can solve the problem perfectly; a minimal hand-crafted construction is sketched at the end of this comment)
This is not really a new result nor a surprising one, nor does it say anything about the kinds of functions a neural network can represent.
It's an attempt to understand an existing observation: once we have trained a large overparameterized neural network we can often compress it to a smaller one with very little loss. So why can't we learn the smaller one directly?
One of the theories referred to in the article and paper is the lottery ticket hypothesis, which states that a large network is a superposition of many small networks, and the larger the network, the more likely at least one of those gets a "lucky" set of weights and converges quickly to the right solution. There is already interesting evidence for this.
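For concreteness, the "hand-crafted weights" point in the second bullet can be made explicit: one Game of Life step is exactly computable by a tiny two-layer network (one 3x3 convolution plus two threshold units). The sketch below is my own construction, not necessarily the one used in the paper; the 0.5 centre weight and the 2.25/3.75 thresholds are just one convenient choice.

    import numpy as np

    # Layer 1: one 3x3 convolution with weight 1.0 on the eight neighbours and
    #          0.5 on the centre cell, so s = neighbours + 0.5 * cell.
    # Layer 2: two threshold units, h1 = [s >= 2.25] and h2 = [s >= 3.75].
    # Output : h1 - h2, which is 1 exactly when 2.5 <= s <= 3.5, i.e.
    #          "three live neighbours, or two live neighbours and already alive".
    def life_step(board):
        s = 0.5 * board
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy or dx:
                    s += np.roll(np.roll(board, dy, axis=0), dx, axis=1)
        h1 = (s >= 2.25).astype(int)
        h2 = (s >= 3.75).astype(int)
        return h1 - h2

    # Quick check on a glider (np.roll makes the boundary toroidal).
    board = np.zeros((8, 8), dtype=int)
    board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
    print(life_step(board))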
Isn’t that another way of saying the optimization algorithm used to find the network‘s weights (gradient descent) cannot find the global optimum? I mean, this is nothing new: the curse of dimensionality prevents any numeric optimizer from completely minimizing any complicated error function, and that has been known for decades. AFAIK there is no algorithm that can find the global minimum of an arbitrary function. And this is what currently limits neural network models: they could be much simpler and less resource-hungry if we had better optimizers.
In practice, you don't want the global optimum because you can't put all possible inputs in the training data and need your system to "generalize" instead. Global optimum would mean overfitting.
It's possible, but unlikely. The issue is that your training examples are essentially a noisy representation of the general function you are trying to get it to learn. Generally, any representation that fits too well will be incorporating the noise, and that will distort the general function (in the case of NNs it will usually mean memorising the input data). Most function-fitting approaches are vulnerable to this.
Hm. I see. But, ultimately, overfitting is a consequence of too many parameters absorbing the noise. Perhaps one could fit smaller models and add artificial noise.
The global optimum would be taken in reference to the training data (because that's all you have to set the weights). Unless the training data represents all real world data perfectly, fully optimizing for it will pessimize the model in relation to some set of real world data.
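As a toy illustration of that (my own example, nothing from the article): fit a small, noisy sample of a simple function with a low-degree and a high-degree polynomial. The high-degree fit is much closer to the "global optimum" on the training set, and correspondingly worse on fresh data (exact numbers depend on the random seed).

    import numpy as np

    rng = np.random.default_rng(0)

    # Ground truth is a quadratic; the training set is a small, noisy sample of it.
    def truth(x):
        return 1.0 - 2.0 * x + 0.5 * x ** 2

    x_train = rng.uniform(-3, 3, size=12)
    y_train = truth(x_train) + rng.normal(scale=0.5, size=x_train.size)
    x_test = np.linspace(-3, 3, 200)

    for degree in (2, 9):
        coeffs = np.polyfit(x_train, y_train, degree)   # best fit *to the training set*
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - truth(x_test)) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")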
The entire reason SGD works is because the stochastic nature of updates on minibatches is an implicit regularizer. This one perspective built the foundations for all of modern machine learning.
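For readers who haven't seen it spelled out, here is a minimal sketch of what "stochastic updates on minibatches" means in code (plain NumPy linear regression; the learning rate and batch size are arbitrary). Each step follows the gradient of a random minibatch rather than the full dataset, so every update is a noisy estimate of the true gradient.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(256, 10))
    y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=256)

    w = np.zeros(10)
    lr, batch = 0.01, 16

    for epoch in range(50):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            b = order[start:start + batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient on one noisy minibatch
            w -= lr * grad                              # a noisy step, unlike full-batch GD

    print("training MSE:", np.mean((X @ w - y) ** 2))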
I completely agree that the most effective regularization is inductive bias in the architecture. But bang for buck, given all the memory/compute savings it accomplishes, SGD is the exemplar of implicit regularization techniques.
Maybe it should not be done, but the large neural networks of this decade absolutely rely on this. A network at the global minimum of any of the (regularized) loss functions that are used these days would be waaay overfitted.
In addition to that, the hypothesis asserts that a local minimum is likely not good enough. This is different from a few years ago, when most thought the solution space was full of roughly equivalent local minima, so parameter initialization wouldn't matter much. But that is perhaps because the threshold for acceptable performance is higher now, so luck matters more.
I think you're right, but the issue might be local minima, which a better optimiser wouldn't help with much. A reason a larger network might work better is that there are fewer local minima in higher dimensions, too.
Just reasoning about this from first principles, but intuitively, the more dimensions you have, the more likely you are to find a gradient in some dimension. In an N-dimensional space, a local minimum needs to be a minimum in all N dimensions, right? Otherwise the algorithm will keep exploring down the gradient. (Not an expert on this stuff.) The more dimensions there are, the more likely it seems that a gradient exists down to some deeper minimum from any given point.
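One crude way to poke at that intuition numerically (my own toy experiment, not from the paper): model the Hessian at a random critical point as a random symmetric matrix and ask how often every eigenvalue is positive, i.e. how often there is genuinely no downhill direction left. The fraction collapses quickly as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def frac_with_no_downhill_direction(n, trials=2000):
        """Fraction of random symmetric n x n 'Hessians' with all eigenvalues > 0."""
        hits = 0
        for _ in range(trials):
            a = rng.normal(size=(n, n))
            h = (a + a.T) / 2                       # random symmetric matrix
            if np.all(np.linalg.eigvalsh(h) > 0):   # a minimum in every direction?
                hits += 1
        return hits / trials

    for n in (1, 2, 4, 8, 16):
        print(n, frac_with_no_downhill_direction(n))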
> once we have trained a large overparameterized neural network we can often compress it to a smaller one with very little loss. So why can't we learn the smaller one directly?
I feel something similar goes on in us humans. Interesting to think about.
Yes. My naive intuition about this is that you need the extra parameters precisely to do the learning, because learning a thing is more complicated than doing the thing once you have learned how. There are lots of natural examples that fit this intuition. E.g., in my mind, "junk" DNA is needed because the evolutionary mechanism learns the sequences that work in a similar way. You don't need all that extra DNA once you have it working, but once it works there's little selection pressure to clean up/optimise the DNA sequence, so the junk stays.
Also perhaps why the evolved pattern of death is important: a subnetwork is selected in a brain, one suited to the specific geological, physical, biological and cognitive environment the brain is navigating. But when the environment shifts beneath the organism (as culture does, and the living world in general does), the subnetwork is no longer the correct one and needs to be reinitialized.
Or in other words, even in an information-theoretic sense, it's true: you can't teach an old dog new tricks. You need a new dog.
Neuroplasticity is a thing though, with plenty of cases of brains recovering from pretty significant damage. They also do evolve and adjust over time to gradual changes in environment. Lots of elderly people are keeping up with cultural and technological change.
This reminds me of a hacker news comment that blew my mind - basically "I" am really my genetic code, and this particular body "I" am in is just another computer that the code has been moved to, because the old one is scheduled to be decommissioned. So I am really just the latest instance of a program that has been running continuously since the first DNA/RNA molecules started to replicate.
> The strange thing about all this is that we already have immortality, but in the wrong place. We have it in the germ plasm; we want it in the soma, in the body. We have fallen in love with the body. That’s that thing that looks back at us from the mirror. That’s the repository of that lovely identity that you keep chasing all your life. And as for that potentially immortal germ plasm, where that is one hundred years, one thousand years, ten thousand years hence, hardly interests us.
> I used to think that way, too, but I don’t any longer. You see, every creature alive on the earth today represents an unbroken line of life that stretches back to the first primitive organism to appear on this planet; and that is about three billion years. That really is immortality. For if that line of life had ever broken, how could we be here? All that time, our germ plasm has been living the life of those single-celled creatures, the protozoa, reproducing by simple division, and occasionally going through the process of syngamy -- the fusion of two cells to form one -- in the act of sexual reproduction. All that time, that germ plasm has been making bodies and casting them off in the act of dying. If the germ plasm wants to swim in the ocean, it makes itself a fish; if the germ plasm wants to fly in the air, it makes itself a bird. If it wants to go to Harvard, it makes itself a man. The strangest thing of all is that the germ plasm that we carry around within us has done all those things. There was a time, hundreds of millions of years ago, when it was making fish. Then at a later time it was making amphibia, things like salamanders; and then at a still later time it was making reptiles. Then it made mammals, and now it’s making men. If we only have the restraint and good sense to leave it alone, heaven knows what it will make in ages to come.
> I, too, used to think that we had our immortality in the wrong place, but I don’t think so any longer. I think it’s in the right place. I think that is the only kind of immortality worth having -- and we have it.
If you're interested in such things, then start layering on epigenetics. The "I" is a product not just of genes, but of your environment as you developed. I was just reading about bees' "royal jelly" recently, and how genetically identical larvae can become a queen or a worker based on their exposure to it.
So the program is not just the zeroes and ones, so to speak, but also more nebulous real-time activity, passed on through time. Like a wave on the ocean.
And I think this is because the ideal complex system is one where all the subsystems and parts combine to produce adequate reliability.
Exponentiation means it is more efficient to start by far exceeding the required reliability and then optimize the most expensive subsystems/parts. It is less efficient, and far more frustrating, if multiple things have to be improved to meet requirements.
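A made-up worked example of the exponentiation point: if ten subsystems all have to work, the system reliability is the product of the per-part reliabilities, so modest per-part numbers decay fast and a large margin up front is cheap insurance.

    # Ten parts in series: system reliability = r ** 10.
    for r in (0.99, 0.999, 0.9999):
        print(f"per-part {r}: system {r ** 10:.4f}")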
There is indeed an analogous process in the brain.
"The number of synapses in the brain reaches its peak around ages 2-3, with about 15,000 synapses per neuron. As adolescents, the brain undergoes synaptic pruning. In adulthood, the brain stabilizes at around 7,500 synapses per neuron, roughly half the peak in early childhood.
This figure can vary based on individual experiences and learning." -- written by GPT-4o
The lottery ticket hypothesis intuitively makes sense, but as an outsider I find this concept for evaluating learning methods really interesting: hand-crafting tiny optimal networks for simple yet computationally irreducible problems like GoL as a way to benchmark learning algorithms. Or is it more than that? For a sufficiently small network maybe there aren't that many combinations of "correct" solutions, so perhaps the way the network emerges internally could really be interrogated by comparison.
This may be a silly question, but rather than train a big network and hope a subnetwork wins the lottery, why not just train a smaller network over multiple runs with different starting weights?
The larger network contains exponentially more subnetworks. 10x the size contains far more than 10x as many subnetworks (although it'd also take more than 10x as long to train).
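Rough arithmetic behind "exponentially more" (a simplification that only counts which units are kept, ignoring weights): the number of size-k subsets of n units is C(n, k), and that grows combinatorially rather than linearly with n.

    from math import comb

    # Size-64 "subnetworks" (choices of retained units) in a 128-unit network
    # versus a 1280-unit one: 10x the units, vastly more than 10x the subsets.
    print(comb(128, 64))
    print(comb(1280, 64))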
No, the idea behind dropout is to reduce over-reliance on specific outputs, thereby, in theory and typically in practice, making the network learn more reliable representations and reducing the chance of overfitting.
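For anyone who hasn't met it, a minimal sketch of (inverted) dropout as it's usually described, in plain NumPy; the keep probability here is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, keep_prob=0.8, training=True):
        """Randomly zero units during training so no single unit can be relied on."""
        if not training:
            return activations                    # at test time the layer is a no-op
        mask = rng.random(activations.shape) < keep_prob
        return activations * mask / keep_prob     # rescale so the expected value is unchanged

    h = rng.normal(size=(4, 6))                   # a batch of hidden activations
    print(dropout(h))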
I struggled with the Game of Life too. I was fascinated by it and evolved cell populations on graph paper by hand (yeah, I'm that old). When I got a computer, I checked my drawings, and all of them were wrong.
I wonder if anyone has tried to approach the problem from the other end: start with the hand-tuned network and randomize just some of the weights (or all of the weights a small amount), and see at what point the learning algorithm can no longer get back to the correct formulation of the problem. Map the boundary between almost-solved and failure to converge, instead of starting from a random point trying to get to almost-solved.
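A sketch of what that experiment loop could look like, with XOR standing in for the Game of Life just to keep it self-contained (PyTorch; the hand-tuned weights, noise scales, learning rate and step count are all my own arbitrary choices):

    import torch

    X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])

    def hand_tuned():
        # Hidden units approximate OR and AND; the output computes OR-and-not-AND = XOR.
        net = torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.Sigmoid(),
                                  torch.nn.Linear(2, 1))
        with torch.no_grad():
            net[0].weight.copy_(torch.tensor([[20., 20.], [20., 20.]]))
            net[0].bias.copy_(torch.tensor([-10., -30.]))
            net[2].weight.copy_(torch.tensor([[20., -20.]]))
            net[2].bias.copy_(torch.tensor([-10.]))
        return net

    def recovers_after_noise(sigma, steps=2000, lr=0.5):
        net = hand_tuned()
        with torch.no_grad():
            for p in net.parameters():
                p.add_(torch.randn_like(p) * sigma)   # randomize the weights "a small amount"
        opt = torch.optim.SGD(net.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = torch.nn.functional.binary_cross_entropy_with_logits(net(X), y)
            loss.backward()
            opt.step()
        return ((net(X) > 0).float() == y).all().item()   # back to exact XOR?

    torch.manual_seed(0)
    for sigma in (0.1, 1.0, 5.0, 20.0):
        print(sigma, recovers_after_noise(sigma))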
If we show a neural network some examples from the Game of Life and expect it to master the rules of a cellular automaton, then aren't we asking too much from it? In some ways, this is analogous to expecting that if we show the neural network examples from the physical world, it will automatically derive Newton's three laws. Not every person observing the world around him can independently deduce Newton's laws from scratch, no matter how many examples he sees.
This is exactly what we ask of neural networks, and in the case of the Game of Life the article and paper show that yes, they do derive the rules. Equally, we can expect them to derive the laws of physics by observation: certainly diffusion networks appear to derive some of them as they pertain to light.
Not according to the hype merchants, hucksters, and VCs who think word models are displaying emergence and we're 6 months from AGI, if only we can have more data
Not according to the actual article that you're commenting on, either.
"As the researchers added more layers and parameters to the neural network, the results improved and the training process eventually yielded a solution that reached near-perfect accuracy."
So, no, we aren't asking too much from it. We just need more compute.
We know neural networks cannot solve the halting problem. But isn’t the question whether they can learn the transition table for the Game of Life? Since each cell depends only on its neighbors, this is as easy as memorizing how each 3x3 tile transitions.
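For scale (my own back-of-the-envelope, not from the paper): a 3x3 neighbourhood has only 2^9 = 512 possible states, so the per-cell update is literally a 512-entry lookup table.

    from itertools import product

    # Enumerate all 2^9 = 512 possible 3x3 tiles and record what the centre cell
    # becomes: alive iff it has 3 live neighbours, or 2 and is already alive.
    table = {}
    for tile in product((0, 1), repeat=9):
        centre = tile[4]
        neighbours = sum(tile) - centre
        table[tile] = 1 if neighbours == 3 or (centre == 1 and neighbours == 2) else 0

    print(len(table), "entries,", sum(table.values()), "of which map to a live cell")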
The halting problem doesn't mean you can never decide if something cycles etc, just that you can't always decide.
As it stands, my guess is that the LLM would always confidently make a decision, even if it were wrong, and then politely backtrack if you pushed back, even if it were originally right.
Every other day we see demos of AIs doing things that were thought of as impossible 6 months earlier, but sure, sounds like it's the "hype merchants" who are out of touch with reality.
My read of the comment is: "You are correct, but bear in mind that the world seems infested with people who are far less realistic and honest than you."
The rules also say "Please don't complain that a submission is inappropriate. If a story is spam or off-topic, flag it. Don't feed egregious comments by replying; flag them instead. If you flag, please don't also comment that you did."
I'm not really sure it's the best idea to accuse someone of breaking the rules if in doing so you're also breaking one yourself.
> These findings are in line with “The Lottery Ticket Hypothesis,”
If the fit were due to a lucky subset of weights, you could train smaller networks many times instead of using a many-times-bigger network.
So it must be something more, like an increased opportunity to assemble the best solution out of a large number of random lucky parts.
I think there should be way more research on neural pruning. After all, it's what our brains do to reach the correct architecture and weights during our development.
Philosophically, why should it be the case that aggregations of statistical calculations (one way of viewing the matrix multiplications of ANNs) can approximate intelligence? I think it's because our ability to know reality is inherently statistical.
To be clear, I'm not suggesting that macro-scale (i.e. non-quantum) reality itself is probabilistic, only that our ability to interpret our perception of it and to model it is statistical. That is, an observation or a sensor doesn't actually tell you the state of the world; it is a measurement from which you infer things.
Viewed from this standpoint, maybe the Game of Life and other discrete, fully knowable toy-problem worlds aren't as applicable to the problem of general intelligence as we imagine. A way to put this into practice could be to introduce a level of error in both the hand-tuned and learned networks' ability to accurately measure the input states of the Life tableau (and/or introduce some randomness in the application of the Life rules in the simulation), and see whether the superiority of the hand-tuned network persists, or whether the learned network is more robust in the face of uncertain inputs or fallible rule applications.
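One cheap way to set that up (a sketch; the flip probability is my own arbitrary choice) is to corrupt the board each network observes before asking it to predict the exact next state:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_observation(board, flip_prob=0.02):
        """Simulate an imperfect sensor: each cell is misread with probability flip_prob."""
        flips = rng.random(board.shape) < flip_prob
        return np.where(flips, 1 - board, board)

    # Feed noisy_observation(board) to both the hand-tuned and the learned network,
    # score them against the true next state, and sweep flip_prob upward.
    print(noisy_observation(np.zeros((5, 5), dtype=int)))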
What’s not clear to me is whether it is (a) nontrivial to create a GoL NN that models the game cells directly as neurons (which would seem to be an efficient and effective method), or whether it’s just (b) nontrivial to create a transformer-architecture model that can model the game state n turns in the future.
I would be very surprised if (a) were not effective, but that (b) is difficult is not surprising, since that is a very nontrivial task that requires intermediary modelling tools even for humans (arguably the most advanced NNs we have access to at the moment).
(a) is actually a form of (b) in the form of a modelling tool.
> In machine learning, one of the popular ways to improve the accuracy of a model that is underperforming is to increase its complexity. And this technique worked with the Game of Life.
For those who didn't read the article, the content doesn't support the title.
We could even think of both as collections of 3D structures showing all valid structures possible for a board of size n by n. There are some differences: every single 3D Conway structure has a unique top layer, while Go does not. But that seems like an overall minor difference. There are many more Go shapes than Conway shapes given the same n, but both are already so numerous that I'm not sure that is a difference worth stopping the comparison over.
It’s interesting that you wouldn’t, yet I would. They aren’t isomorphic, for sure.
Go’s complexity comes from two players alternately picking one out of a very large number of options.
GoL’s complexity comes from a very large number of nodes “picking” between two states. That’s not precise, just illustrating that there is some symmetry of simplicity/complexity, at least to my eyes.
From a quick skim, then a string search for "interesting", I'd say that word is fluff, added to keep their audience reading through their dull background intro.