
MuGo: A minimalist Go engine modeled after AlphaGo - luu
https://github.com/brilee/MuGo
======
cr0sh
Something I find interesting is how many of these deep-learning neural nets
use ReLU as their activation function. ReLU is known as a "lazy engineer's"
activation function: very simple to implement, and despite looking like a
hack, it seems to work very well for many tasks.

I tend to wonder whether, beyond "ease of implementation and good 'nuff"
reasons, there are other reasons to use ReLU over activation functions like
tanh or sigmoid.

I'm beginning to suspect that we may be seeing the "engineering side" of
neural networks coming into play: instead of using the more "biologically
accurate" sigmoid activation, we use ReLU (and other ELU derivatives)
because it works well and is easier to understand.

Much like how heavier-than-air flight progressed faster once engineers
realized that flapping wings weren't strictly necessary, and that
lightweight engines turning propellers, paired with fixed wings, worked
better for flying than what nature uses...?

~~~
jimfleming
The reasons to use ReLUs are sparsity and improved gradient flow. ReLUs
encourage sparsity because when the input to the ReLU is less than 0, the
activation becomes exactly 0. This means some fraction of the activations in
a given layer will be zeroed out, which can encourage better
representations. They also have improved gradient flow because the gradients
are either zero or constant, and thus don't suffer from vanishing/exploding
gradients.
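As a quick illustration (a numpy sketch of my own, not from any particular
framework), both properties are easy to see: exact zeros on the negative
side, and a gradient that is exactly 0 or 1 instead of something that
shrinks the signal:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        # Exactly 0 or exactly 1: never shrinks a backpropagated signal.
        return (x > 0).astype(float)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)  # peaks at 0.25, so deep stacks shrink gradients

    x = np.array([-2.0, -0.5, 0.5, 2.0])
    print(relu(x))          # [0.  0.  0.5 2. ]  <- exact zeros: sparsity
    print(relu_grad(x))     # [0. 0. 1. 1.]
    print(sigmoid_grad(x))  # all <= 0.25        <- vanishing-gradient risk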

In deep learning, I would _generally_ not look towards biology for the reasons
behind why things are done as this is usually an after-the-fact explanation.
When in doubt, blame the gradients.

~~~
ma2rten
LeakyReLUs work as well as or better than ReLUs, so it can't be because of
sparsity.

~~~
ekelsen
LeakyReLUs typically have a very small slope on the negative side; this can
help solve the problem of no gradients getting through a layer whose
activations are all 0. But I would say they still have a sparsity effect on
a trained network, because the negative-side slope is usually so small
compared to the slope of 1 on the positive side.
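Concretely, a leaky ReLU looks something like this (a numpy sketch of my
own; alpha = 0.01 is a common default):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # Negative inputs are scaled by a small alpha instead of zeroed out.
        return np.where(x > 0, x, alpha * x)

    print(leaky_relu(np.array([-2.0, 0.5])))  # [-0.02  0.5 ]: small, not 0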

~~~
argonaut
That is not sparsity. In machine learning there is a very strong distinction
between values that are exactly 0 and values that are merely close to 0
(see: the difference between L1 and L2 regularization).

~~~
ekelsen
L1 regularization does not lead to values that are _exactly_ 0 either.

~~~
makeset
Yes, it does. You might have been thrown off by the fact that L1
regularization is not L0 regularization, i.e. it doesn't _explicitly_ limit
the number of nonzero coefficients. Still, the L1 constraint boundary is
piecewise linear, with corners at points where components are exactly zero,
which forces constrained solutions to land where many variables are driven
to exactly zero. See here:

https://en.wikipedia.org/wiki/Lasso_(statistics)#Geometric_interpretation

~~~
ekelsen
If we have a parameter x and some cost function J(x), then with L1
regularization the cost function becomes J(x) + beta * abs(x).

The derivative of that loss with respect to x is J'(x) + beta * sgn(x). So
using some variant of SGD (which is what basically all neural network
training does these days), we would essentially update x as:
x = x - alpha * (J'(x) + beta * sgn(x)). (The specifics depend on the
algorithm, but that doesn't change the result.)

So for x to end up as _exactly_ 0, we would have to be extremely lucky,
which in practice I have never observed. Using L1 regularization definitely
leads to small weights, but not to ones that are _exactly_ 0.
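To make that concrete, here is a toy sketch (my own construction, using a
made-up one-parameter loss J(x) = (x - 0.1)^2 with beta = 0.5, for which the
L1-regularized optimum is exactly 0): plain (sub)gradient descent only
oscillates around the optimum without ever landing on it.

    import numpy as np

    beta, alpha = 0.5, 0.01  # L1 strength and learning rate
    x = 1.0                  # true L1-regularized optimum here is exactly 0
    for _ in range(10000):
        grad = 2.0 * (x - 0.1) + beta * np.sign(x)  # J'(x) + beta * sgn(x)
        x -= alpha * grad
    print(x)         # hovers near 0 ...
    print(x == 0.0)  # ... but is (almost surely) never exactly 0.0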

~~~
argonaut
Your conclusion is theoretically false.

You can _prove_ that L1 regularization is equivalent to taking the optimal
unregularized parameters, setting parameters below a threshold to 0 (the
threshold depends on the regularization parameter), and shrinking the other
parameters.
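For reference, the operation being described is soft-thresholding; the
closed form below is exact for an orthonormal design (an assumption on my
part; the code is my own illustration):

    import numpy as np

    def soft_threshold(z, t):
        # Proximal operator of t * |.|: zeros |z| <= t, shrinks the rest by t.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    print(soft_threshold(np.array([-0.3, 0.05, 0.8]), 0.2))
    # [-0.1  0.   0.6]  <- exact zeros appear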

~~~
ekelsen
Yes, but how do you actually optimize that loss in practice? I'm not saying
that a perfect solution with an L1 penalty wouldn't have weights exactly equal
to 0. I'm saying that with the optimization techniques that are commonly used,
you don't end up with exact zeros.

~~~
argonaut
You're not making sense. If the loss function is convex, it remains convex
after adding L1 regularization. So iterative methods for convex problems (a
class that includes SVMs, linear regression, and logistic regression) will
find the global optimum.

~~~
ekelsen
1) Neural network loss functions are not convex. But that isn't the issue
here.

2) When you use actual numerical optimization techniques with floating-point
arithmetic, you don't find an exact minimum (global or local), and you don't
get exact zeros.

Have you tried this on a real problem? I wouldn't consider MNIST a real
problem, but even there you will not get _exact_ zeros. Try it.

~~~
makeset
If you rolled your own naive numerical approximation to L1 regularization,
you might not have gotten exact zeros. If you use e.g. LARS or cyclic
coordinate descent for the L1-regularized parameter cohort, as suited to the
problem, you will get exact zeros, as prescribed by the mathematics of L1.
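For instance, a quick scikit-learn sketch (the synthetic data here is my
own; sklearn's Lasso uses coordinate descent under the hood) produces exact
zeros:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = 3.0 * X[:, 0] + rng.normal(size=200)  # only feature 0 matters

    coefs = Lasso(alpha=0.5).fit(X, y).coef_
    print((coefs == 0.0).sum(), "of", coefs.size, "coefs are exactly 0")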

~~~
ekelsen
I've never seen anyone optimize a neural network using LARS or cyclic
coordinate descent. I thought that's what this entire discussion was about -
not arbitrary optimization theory.

------
gcp
This doesn't include the data set for the value network.

That's critical, because producing it requires implementing the full
reinforcement-learning pipeline. Even if you skip that and use the policy
network instead, you still face the task of playing a few tens of millions
of games.

Learning a value network as big as AlphaGo's from public data does not work:
you overfit to hell.

Which playout policy is this using? There doesn't seem to be one.

It looks like it's just a neural-network player, and there are dozens of
those already. You don't need to credit AlphaGo if you're only using policy
networks for Go: the critical research for that was done at the University
of Edinburgh.

------
apetresc
Cool. Some students at Rochester are also re-implementing AlphaGo based on
DeepMind's Nature paper: https://github.com/Rochester-NRT/RocAlphaGo

------
brilee
To give an update on the status of the project: the policy net alone
currently plays at 2-3 kyu after a day or so of training. MCTS is
implemented, but Python is slow enough that I don't get a significant number
of playouts, so I don't yet have insight into the scalability of my
implementation. I'm currently working on code to play batches of games in
parallel, which should be useful both for RL and for parallelized MCTS
playouts.
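To sketch the batching idea (a hypothetical illustration only; the classes
and method names below are placeholders, not MuGo's actual API): stepping
many games in lockstep lets each move be computed with one batched network
evaluation instead of one call per game.

    import numpy as np

    class UniformRandomPolicy:
        # Placeholder for a policy network: move probabilities per board.
        def evaluate_batch(self, boards):
            n_moves = 19 * 19 + 1  # board points plus pass
            logits = np.random.randn(len(boards), n_moves)
            e = np.exp(logits - logits.max(axis=1, keepdims=True))
            return e / e.sum(axis=1, keepdims=True)

    class ToyGame:
        # Trivial stand-in for a game object so the sketch runs end to end.
        def __init__(self, length=10):
            self.moves_left = length
        def is_over(self):
            return self.moves_left == 0
        def board(self):
            return np.zeros((19, 19))
        def play(self, move):
            self.moves_left -= 1

    def play_batch(games, policy, max_moves=400):
        for _ in range(max_moves):
            active = [g for g in games if not g.is_over()]
            if not active:
                break
            # One network call for all active games: where batching pays off.
            probs = policy.evaluate_batch([g.board() for g in active])
            for game, p in zip(active, probs):
                game.play(int(np.argmax(p)))  # greedy; sample in practice
        return games

    games = play_batch([ToyGame() for _ in range(8)], UniformRandomPolicy())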

------
GolDDranks
How strong is it? Is there any pre-trained data available?

IIRC, Monte Carlo Tree Search with "dumber" heuristics than NNs yielded
amateur-dan-level AIs for the first time (somewhere around 2006?). Lately
there have also been some AIs that bolt an NN on and get around one stone
stronger (which is still miles away from AlphaGo!).

But since this is specifically modelled after AlphaGo, I wonder how it fares
against other AIs.

------
AsyncAwait
Sadly, it's not written in Go :-)

~~~
marvy
To be fair, neither is AlphaGo :) But I admit that the pun is hard to resist.

------
tonetheman
Super cool stuff

