For everyone reading neither the article nor the paper:
- both show neural networks can learn the game of life just fine
- the finding is that to learn the rules reliably, the networks need to be heavily over-parameterised (i.e. many times larger than the minimal size at which hand-crafted weights can solve the problem perfectly; a minimal hand-crafted construction is sketched at the end of this comment)
This is not really a new result nor a surprising one, nor does it say anything about the kinds of functions a neural network can represent.
It's an attempt to understand an existing observation: once we have trained a large overparameterized neural network we can often compress it to a smaller one with very little loss. So why can't we learn the smaller one directly?
One of the theories referred to in the article and paper is the lottery ticket hypothesis, which states that a large network is a superposition of many small networks, and the larger the network, the more likely at least one of those gets a "lucky" set of weights and converges quickly to the right solution. There is already interesting evidence for this.
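For concreteness, the "hand-crafted weights" point in the second bullet can be made explicit: one Game of Life step is exactly computable by a tiny two-layer network (one 3x3 convolution plus two threshold units). The sketch below is my own construction, not necessarily the one used in the paper; the 0.5 centre weight and the 2.25/3.75 thresholds are just one convenient choice.

    import numpy as np

    # Layer 1: one 3x3 convolution with weight 1.0 on the eight neighbours and
    #          0.5 on the centre cell, so s = neighbours + 0.5 * cell.
    # Layer 2: two threshold units, h1 = [s >= 2.25] and h2 = [s >= 3.75].
    # Output : h1 - h2, which is 1 exactly when 2.5 <= s <= 3.5, i.e.
    #          "three live neighbours, or two live neighbours and already alive".
    def life_step(board):
        s = 0.5 * board
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy or dx:
                    s += np.roll(np.roll(board, dy, axis=0), dx, axis=1)
        h1 = (s >= 2.25).astype(int)
        h2 = (s >= 3.75).astype(int)
        return h1 - h2

    # Quick check on a glider (np.roll makes the boundary toroidal).
    board = np.zeros((8, 8), dtype=int)
    board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
    print(life_step(board))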
Isn’t that another way of saying the optimization algorithm used to find the network‘s weights (gradient descent) cannot find the global optimum? I mean, this is nothing new: the curse of dimensionality prevents any numeric optimizer from completely minimizing any complicated error function, and that has been known for decades. AFAIK there is no algorithm that can find the global minimum of an arbitrary function. And this is what currently limits neural network models: they could be much simpler and less resource-hungry if we had better optimizers.
In practice, you don't want the global optimum because you can't put all possible inputs in the training data and need your system to "generalize" instead. Global optimum would mean overfitting.
It's possible, but unlikely. The issue is that your training examples are essentially a noisy representation of the general function you are trying to get it to learn. Generally, any representation that fits too well will be incorporating the noise, and that will distort the general function (in the case of NNs it will usually mean memorising the input data). Most function-fitting approaches are vulnerable to this.
Hm. I see. But, ultimately, overfitting is a consequence of too many parameters absorbing the noise. Perhaps one could fit smaller models and add artificial noise.
The global optimum would be taken in reference to the training data (because that's all you have to set the weights). Unless the training data represents all real world data perfectly, fully optimizing for it will pessimize the model in relation to some set of real world data.
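As a toy illustration of that (my own example, nothing from the article): fit a small, noisy sample of a simple function with a low-degree and a high-degree polynomial. The high-degree fit is much closer to the "global optimum" on the training set, and correspondingly worse on fresh data (exact numbers depend on the random seed).

    import numpy as np

    rng = np.random.default_rng(0)

    # Ground truth is a quadratic; the training set is a small, noisy sample of it.
    def truth(x):
        return 1.0 - 2.0 * x + 0.5 * x ** 2

    x_train = rng.uniform(-3, 3, size=12)
    y_train = truth(x_train) + rng.normal(scale=0.5, size=x_train.size)
    x_test = np.linspace(-3, 3, 200)

    for degree in (2, 9):
        coeffs = np.polyfit(x_train, y_train, degree)   # best fit *to the training set*
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - truth(x_test)) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")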
The entire reason SGD works is because the stochastic nature of updates on minibatches is an implicit regularizer. This one perspective built the foundations for all of modern machine learning.
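For readers who haven't seen it spelled out, here is a minimal sketch of what "stochastic updates on minibatches" means in code (plain NumPy linear regression; the learning rate and batch size are arbitrary). Each step follows the gradient of a random minibatch rather than the full dataset, so every update is a noisy estimate of the true gradient.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(256, 10))
    y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=256)

    w = np.zeros(10)
    lr, batch = 0.01, 16

    for epoch in range(50):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            b = order[start:start + batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient on one noisy minibatch
            w -= lr * grad                              # a noisy step, unlike full-batch GD

    print("training MSE:", np.mean((X @ w - y) ** 2))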
I completely agree that the most effective regularization is inductive bias in the architecture. But bang for buck, given all the memory/compute savings it accomplishes, SGD is the exemplar of implicit regularization techniques.
Maybe it should not be done, but the large neural networks of this decade absolutely rely on this. A network at the global minimum of any of the (regularized) loss functions that are used these days would be waaay overfitted.
In addition to that, the hypothesis asserts that a local minimum is likely not good enough. This is different from a few years ago, when most thought the solution space was full of roughly equivalent local minima, so parameter initialization wouldn't matter much. But that is perhaps because the threshold for acceptable performance is higher now, so luck matters more.
I think you're right, but the issue might be local minima, which a better optimiser wouldn't help with much. A reason a larger network might work better is that there are fewer local minima in higher dimensions, too.
Just reasoning about this from first principles, but intuitively, the more dimensions you have, the more likely you are to find a gradient in some dimension. In an N-dimensional space, a local minimum needs to be a minimum in all N dimensions, right? Otherwise the algorithm will keep exploring down the gradient. (Not an expert on this stuff.) The more dimensions there are, the more likely it seems that a gradient exists down to some deeper minimum from any given point.
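One crude way to poke at that intuition numerically (my own toy experiment, not from the paper): model the Hessian at a random critical point as a random symmetric matrix and ask how often every eigenvalue is positive, i.e. how often there is genuinely no downhill direction left. The fraction collapses quickly as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def frac_with_no_downhill_direction(n, trials=2000):
        """Fraction of random symmetric n x n 'Hessians' with all eigenvalues > 0."""
        hits = 0
        for _ in range(trials):
            a = rng.normal(size=(n, n))
            h = (a + a.T) / 2                       # random symmetric matrix
            if np.all(np.linalg.eigvalsh(h) > 0):   # a minimum in every direction?
                hits += 1
        return hits / trials

    for n in (1, 2, 4, 8, 16):
        print(n, frac_with_no_downhill_direction(n))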
> once we have trained a large overparameterized neural network we can often compress it to a smaller one with very little loss. So why can't we learn the smaller one directly?
I feel something similar goes on in us humans. Interesting to think about.
Yes. My naive intuition about this is that you need the extra parameters precisely to do the learning, because learning a thing is more complicated than doing the thing once you have learned how. There are lots of natural examples that fit this intuition. E.g., in my mind, "junk" DNA is needed because the evolutionary mechanism learns the sequences that work in a similar way. You don't need all that extra DNA once you have it working, but once it works there's little selection pressure to clean up/optimise the DNA sequence, so the junk stays.
Also perhaps why the evolved pattern of death is important: a subnetwork is selected in a brain, one suited to the specific geological, physical, biological and cognitive environment the brain is navigating. But when the environment shifts beneath the organism (as culture does, and the living world in general does), the subnetwork is no longer the correct one and needs to be reinitialized.
Or in other words, even in an information-theoretic sense, it's true: you can't teach an old dog new tricks. You need a new dog.
Neuroplasticity is a thing though, with plenty of cases of brains recovering from pretty significant damage. They also do evolve and adjust over time to gradual changes in environment. Lots of elderly people are keeping up with cultural and technological change.
This reminds me of a hacker news comment that blew my mind - basically "I" am really my genetic code, and this particular body "I" am in is just another computer that the code has been moved to, because the old one is scheduled to be decommissioned. So I am really just the latest instance of a program that has been running continuously since the first DNA/RNA molecules started to replicate.
> The strange thing about all this is that we already have immortality, but in the wrong place. We have it in the germ plasm; we want it in the soma, in the body. We have fallen in love with the body. That’s that thing that looks back at us from the mirror. That’s the repository of that lovely identity that you keep chasing all your life. And as for that potentially immortal germ plasm, where that is one hundred years, one thousand years, ten thousand years hence, hardly interests us.
> I used to think that way, too, but I don’t any longer. You see, every creature alive on the earth today represents an unbroken line of life that stretches back to the first primitive organism to appear on this planet; and that is about three billion years. That really is immortality. For if that line of life had ever broken, how could we be here? All that time, our germ plasm has been living the life of those single-celled creatures, the protozoa, reproducing by simple division, and occasionally going through the process of syngamy -- the fusion of two cells to form one -- in the act of sexual reproduction. All that time, that germ plasm has been making bodies and casting them off in the act of dying. If the germ plasm wants to swim in the ocean, it makes itself a fish; if the germ plasm wants to fly in the air, it makes itself a bird. If it wants to go to Harvard, it makes itself a man. The strangest thing of all is that the germ plasm that we carry around within us has done all those things. There was a time, hundreds of millions of years ago, when it was making fish. Then at a later time it was making amphibia, things like salamanders; and then at a still later time it was making reptiles. Then it made mammals, and now it’s making men. If we only have the restraint and good sense to leave it alone, heaven knows what it will make in ages to come.
> I, too, used to think that we had our immortality in the wrong place, but I don’t think so any longer. I think it’s in the right place. I think that is the only kind of immortality worth having -- and we have it.
If you're interested in such things, then start layering on epigenetics. The "I" is a product not just of genes, but of your environment as you developed. I was just reading about bees' "royal jelly" recently, and how genetically identical larvae can become a queen or a worker based on their exposure to it.
So the program is not just the zeroes and ones, so to speak, but also more nebulous real-time activity, passed on through time. Like a wave on the ocean.
And I think this is because the ideal complex system is one where all the subsystems and parts combine to produce adequate reliability.
Exponentiation means it is more efficient to start by far exceeding the required reliability and then optimize the most expensive subsystems/parts. It is less efficient, and far more frustrating, if multiple things have to be improved to meet requirements.
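A made-up worked example of the exponentiation point: if ten subsystems all have to work, the system reliability is the product of the per-part reliabilities, so modest per-part numbers decay fast and a large margin up front is cheap insurance.

    # Ten parts in series: system reliability = r ** 10.
    for r in (0.99, 0.999, 0.9999):
        print(f"per-part {r}: system {r ** 10:.4f}")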
There is indeed an analogous process in the brain.
"The number of synapses in the brain reaches its peak around ages 2-3, with about 15,000 synapses per neuron. As adolescents, the brain undergoes synaptic pruning. In adulthood, the brain stabilizes at around 7,500 synapses per neuron, roughly half the peak in early childhood.
This figure can vary based on individual experiences and learning." -- written by GPT-4o
The lottery ticket hypothesis intuitively makes sense, but as an outsider I find this concept for evaluating learning methods really interesting: hand-crafting tiny optimal networks for simple yet computationally irreducible problems like GoL as a way to benchmark learning algorithms. Or is it more than that? For a sufficiently small network maybe there aren't that many combinations of "correct" solutions, so perhaps the way the network emerges internally could really be interrogated by comparison.
This may be a silly question, but rather than train a big network and hope a subnetwork wins the lottery, why not just train a smaller network over multiple runs with different starting weights?
The larger network contains exponentially more subnetworks. 10x the size contains far more than 10x as many subnetworks (although it'd also take more than 10x as long to train).
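Rough arithmetic behind "exponentially more" (a simplification that only counts which units are kept, ignoring weights): the number of size-k subsets of n units is C(n, k), and that grows combinatorially rather than linearly with n.

    from math import comb

    # Size-64 "subnetworks" (choices of retained units) in a 128-unit network
    # versus a 1280-unit one: 10x the units, vastly more than 10x the subsets.
    print(comb(128, 64))
    print(comb(1280, 64))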
No, the idea behind dropout is to reduce over-reliance on specific outputs, thereby, in theory and typically in practice, making the network learn more reliable representations and reducing the chance of overfitting.
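For anyone who hasn't met it, a minimal sketch of (inverted) dropout as it's usually described, in plain NumPy; the keep probability here is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, keep_prob=0.8, training=True):
        """Randomly zero units during training so no single unit can be relied on."""
        if not training:
            return activations                    # at test time the layer is a no-op
        mask = rng.random(activations.shape) < keep_prob
        return activations * mask / keep_prob     # rescale so the expected value is unchanged

    h = rng.normal(size=(4, 6))                   # a batch of hidden activations
    print(dropout(h))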
I struggled with the Game of Life too. I was fascinated by it and evolved cell populations on graph paper by hand (yeah, I'm that old). When I got a computer, I checked my drawings, and all of them were wrong.
I wonder if anyone has tried to approach the problem from the other end: start with the hand-tuned network and randomize just some of the weights (or all of the weights a small amount), and see at what point the learning algorithm can no longer get back to the correct formulation of the problem. Map the boundary between almost-solved and failure to converge, instead of starting from a random point trying to get to almost-solved.
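A sketch of what that experiment loop could look like, with XOR standing in for the Game of Life just to keep it self-contained (PyTorch; the hand-tuned weights, noise scales, learning rate and step count are all my own arbitrary choices):

    import torch

    X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])

    def hand_tuned():
        # Hidden units approximate OR and AND; the output computes OR-and-not-AND = XOR.
        net = torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.Sigmoid(),
                                  torch.nn.Linear(2, 1))
        with torch.no_grad():
            net[0].weight.copy_(torch.tensor([[20., 20.], [20., 20.]]))
            net[0].bias.copy_(torch.tensor([-10., -30.]))
            net[2].weight.copy_(torch.tensor([[20., -20.]]))
            net[2].bias.copy_(torch.tensor([-10.]))
        return net

    def recovers_after_noise(sigma, steps=2000, lr=0.5):
        net = hand_tuned()
        with torch.no_grad():
            for p in net.parameters():
                p.add_(torch.randn_like(p) * sigma)   # randomize the weights "a small amount"
        opt = torch.optim.SGD(net.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = torch.nn.functional.binary_cross_entropy_with_logits(net(X), y)
            loss.backward()
            opt.step()
        return ((net(X) > 0).float() == y).all().item()   # back to exact XOR?

    torch.manual_seed(0)
    for sigma in (0.1, 1.0, 5.0, 20.0):
        print(sigma, recovers_after_noise(sigma))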
If we show a neural network some examples from the Game of Life and expect it to master the rules of a cellular automaton, then aren't we asking too much from it? In some ways, this is analogous to expecting that if we show the neural network examples from the physical world, it will automatically derive Newton's three laws. Not every person observing the world around him can independently deduce Newton's laws from scratch, no matter how many examples he sees.
This is exactly what we ask of neural networks, and in the case of the Game of Life the article and paper show that yes, they do derive the rules. Equally, we can expect them to derive the laws of physics by observation: certainly diffusion networks appear to derive some of them as they pertain to light.
Not according to the hype merchants, hucksters, and VCs who think word models are displaying emergence and we're 6 months from AGI, if only we can have more data
Not according to the actual article that you're commenting on, either.
"As the researchers added more layers and parameters to the neural network, the results improved and the training process eventually yielded a solution that reached near-perfect accuracy."
So, no, we aren't asking too much from it. We just need more compute.
We know neural networks cannot solve the halting problem. But isn’t the question whether they can learn the transition table for the Game of Life? Since each cell depends only on its neighbors, this is as easy as memorizing how each 3x3 tile transitions.
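For scale (my own back-of-the-envelope, not from the paper): a 3x3 neighbourhood has only 2^9 = 512 possible states, so the per-cell update is literally a 512-entry lookup table.

    from itertools import product

    # Enumerate all 2^9 = 512 possible 3x3 tiles and record what the centre cell
    # becomes: alive iff it has 3 live neighbours, or 2 and is already alive.
    table = {}
    for tile in product((0, 1), repeat=9):
        centre = tile[4]
        neighbours = sum(tile) - centre
        table[tile] = 1 if neighbours == 3 or (centre == 1 and neighbours == 2) else 0

    print(len(table), "entries,", sum(table.values()), "of which map to a live cell")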
The halting problem doesn't mean you can never decide if something cycles etc, just that you can't always decide.
As it stands, my guess is that the LLM would always confidently make a decision, even if it were wrong, and then politely backtrack if you pushed back, even if it were originally right.
Every other day we see demos of AIs doing things that were thought of as impossible 6 months earlier, but sure, sounds like it's the "hype merchants" who are out of touch with reality.
My read of the comment is: "You are correct, but bear in mind that the world seems infested with people who are far less realistic and honest than you."
The rules also say "Please don't complain that a submission is inappropriate. If a story is spam or off-topic, flag it. Don't feed egregious comments by replying; flag them instead. If you flag, please don't also comment that you did."
I'm not really sure it's the best idea to accuse someone of breaking the rules if in doing so you're also breaking one yourself.
> These findings are in line with “The Lottery Ticket Hypothesis,”
If the fit were due to a lucky subset of weights, you could train smaller networks many times instead of using a many-times-bigger network.
So it must be something more, like an increased opportunity to assemble the best solution out of a large number of random lucky parts.
I think there should be way more research on neural pruning. After all, it's what our brains do to reach the correct architecture and weights during our development.
Philosophically, why should it be the case that aggregations of statistical calculations (one way of viewing the matrix multiplications of ANNs) can approximate intelligence? I think it's because our ability to know reality is inherently statistical.
To be clear, I'm not suggesting that macro-scale (i.e. non-quantum) reality itself is probabilistic, only that our ability to interpret our perception of it and to model it is statistical. That is, an observation or a sensor doesn't actually tell you the state of the world; it is a measurement from which you infer things.
Viewed from this standpoint, maybe the Game of Life and other discrete, fully knowable toy-problem worlds aren't as applicable to the problem of general intelligence as we imagine. A way to put this into practice could be to introduce a level of error in both the hand-tuned and learned networks' ability to accurately measure the input states of the Life tableau (and/or introduce some randomness in the application of the Life rules in the simulation), and see whether the superiority of the hand-tuned network persists, or whether the learned network is more robust in the face of uncertain inputs or fallible rule applications.
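One cheap way to set that up (a sketch; the flip probability is my own arbitrary choice) is to corrupt the board each network observes before asking it to predict the exact next state:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_observation(board, flip_prob=0.02):
        """Simulate an imperfect sensor: each cell is misread with probability flip_prob."""
        flips = rng.random(board.shape) < flip_prob
        return np.where(flips, 1 - board, board)

    # Feed noisy_observation(board) to both the hand-tuned and the learned network,
    # score them against the true next state, and sweep flip_prob upward.
    print(noisy_observation(np.zeros((5, 5), dtype=int)))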
What’s not clear to me is whether it is (a) nontrivial to create a GoL NN that models the game cells directly as neurons (which would seem to be an efficient and effective method), or whether it’s just (b) nontrivial to create a transformer-architecture model that can model the game state n turns in the future.
I would be very surprised if (a) were not effective, but that (b) is difficult is not surprising, since that is a very nontrivial task that requires intermediary modelling tools even for humans (arguably the most advanced NNs we have access to at the moment).
(a) is actually a form of (b) in the form of a modelling tool.
> In machine learning, one of the popular ways to improve the accuracy of a model that is underperforming is to increase its complexity. And this technique worked with the Game of Life.
For those who didn't read the article, the content doesn't support the title.
We could even think of both as collections of 3D structures showing all valid structures possible for a board of size n by n. There are some differences: every single 3D Conway structure has a unique top layer, while Go does not. But that seems like an overall minor difference. There are many more Go shapes than Conway shapes given the same n, but both are already so numerous that I'm not sure that is a difference worth stopping the comparison over.
It’s interesting that you wouldn’t, yet I would. They aren’t isomorphic, for sure.
Go’s complexity comes from two players alternately picking one out of a very large number of options.
GoL’s complexity comes from a very large number of nodes “picking” between two states. That’s not precise, just illustrating that there is some symmetry of simplicity/complexity, at least to my eyes.
From a quick skim, then a string search for "interesting", I'd say that word is fluff, added to keep their audience reading through their dull background intro.