
Exploring Weight Agnostic Neural Networks - lamchob
https://ai.googleblog.com/2019/08/exploring-weight-agnostic-neural.html
======
elamje
Unpopular quote from my image and video processing professor - “The only
problem with machine learning is that the machine does the learning and you
don’t.”

While I understand that is missing a lot of nuance, it has stuck with me over
the past few years as I feel like I am missing out on the cool machine
learning work going on out there.

There is a ton of learning about calculus, probability, and statistics when
doing machine learning, but I can’t shake the fact that at the end of the day,
the output is basically a black box. As you start toying with AI you realize
that the only way to learn from your architecture and results is by tuning
parameters and trial and error.

Of course there are many applications that only AI can solve, which is all
good and well, but I’m curious to hear from some heavy machine learning
practitioners - what is exciting to you about your work?

This is a serious inquiry because I want to know if it’s worth exploring
again. In the past university AI classes I took, I just got bored writing tiny
programs that leveraged AI libraries to classify images, do some simple
predictions etc.

~~~
ma2rten
There has been a lot of research in understanding neural networks and making
them less of a black box. If you classify cat from dog videos on YouTube, it
doesn't matter if you make a mistake every now and again. But if you want to
build a self-driving car or make a medical diagnosis, you had better be able
to explain why your network made a certain decision.

~~~
bonoboTP
I've been hearing this quite often in the last few years, but I'm not sure
what kind of explanation you mean.

The x-ray image was misdiagnosed because... What kind of thing should come
here?

... it didn't look like the other class. ... it didn't have this weird smudge
thing on the top left in which case usually there should be a little hazier
blob in the middle, except when the pointiness of the thing that is to the
right of the brightest blabla...

You get the idea; my description even exaggerates how nameable and describable
these structures are. At best you'd get a long and complex description,
because simple models don't work for pattern recognition. But even if you made
sure to use only well-understood features like edge thickness, angles, sizes
of connected components, etc., how would a boolean formula of a hundred such
terms be helpful in court, or wherever you want to use these explanations?

~~~
ma2rten
For example, there is a concept called visual attention: you plot which areas
of the image the model pays attention to when making its decision.
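A simple, model-agnostic cousin of attention maps is occlusion sensitivity: gray out patches of the input and measure how much the score drops. A minimal sketch in numpy; the `model` here is a hypothetical stand-in for a real classifier's score function:

```python
import numpy as np

def occlusion_saliency(model, image, patch=4):
    """Occlude each patch in turn and record how much the model's
    score drops; big drops mark the regions the model relies on."""
    base = model(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # gray out one patch
            heat[i // patch, j // patch] = base - model(occluded)
    return heat

# Toy "model": scores an 8x8 image by the brightness of its top-left corner,
# so only the top-left patch should light up in the heatmap.
model = lambda img: img[:4, :4].sum()
heat = occlusion_saliency(model, np.ones((8, 8)), patch=4)
```

The heatmap can then be overlaid on the input to show which regions drove the decision.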

------
baylearn
Previous discussion (about the actual research article at
[https://weightagnostic.github.io/](https://weightagnostic.github.io/) rather
than the blog post):

[https://news.ycombinator.com/item?id=20160693](https://news.ycombinator.com/item?id=20160693)

------
antpls
How is it different than pruning a neural network?

It seems you could train the weights of a state-of-the-art NN, then quantize
it, then prune it. That would remove some weights of the NN, and all the
remaining weights would be set to the same value. Isn't training then pruning
more efficient than using an architecture search algorithm?
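That train-then-quantize-then-prune pipeline is easy to sketch in numpy. The toy version below does magnitude pruning and then collapses the survivors to one shared magnitude; the function name and the 50% keep fraction are illustrative choices, not from the paper:

```python
import numpy as np

def prune_and_share(weights, keep_fraction=0.5):
    """Magnitude pruning followed by extreme quantization: drop the
    smallest weights, then set every survivor to one shared magnitude,
    keeping only its sign."""
    threshold = np.quantile(np.abs(weights), 1.0 - keep_fraction)
    mask = np.abs(weights) >= threshold        # survivors of pruning
    shared = np.abs(weights[mask]).mean()      # the single shared value
    return np.sign(weights) * mask * shared, mask

w = np.array([[0.9, -0.05],
              [-0.8, 0.1]])
pruned, mask = prune_and_share(w)  # survivors collapse to +/- one value
```

The difference from WANNs is the order of operations: here the wiring is a by-product of trained weights, whereas WANNs search for wiring that works before any weight is trained.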

~~~
drewm1980
At the risk of broad oversimplification, pruning trains and then does an
architecture search. This does an architecture search and then trains.

~~~
p1esk
No, here the architecture search is the training.

------
phaedrus
I wrote a series of Markov chat simulators as a teenager. Often I used a
simpler algorithm which ignored the probability weight (all out-links, once
learned, given equal probability). These versions performed subjectively as
well as, if not better than, the versions which tracked the weight of links.
I'm therefore not surprised that weight agnostic neural networks can work,
too.
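A weight-ignoring Markov chat model of that kind might look like the sketch below (hypothetical code, not the original):

```python
import random
from collections import defaultdict

class UnweightedMarkov:
    """Markov chat model that stores only *which* words can follow
    which, not how often: every learned out-link is equally likely."""

    def __init__(self):
        self.links = defaultdict(set)  # word -> set of possible successors

    def learn(self, text):
        words = text.split()
        for a, b in zip(words, words[1:]):
            self.links[a].add(b)       # presence only, no counts

    def generate(self, start, length=10):
        out, word = [start], start
        for _ in range(length - 1):
            successors = self.links.get(word)
            if not successors:
                break                  # dead end: no learned out-links
            word = random.choice(sorted(successors))  # uniform choice
            out.append(word)
        return " ".join(out)

m = UnweightedMarkov()
m.learn("the cat sat on the mat")
# "the" -> {"cat", "mat"}, each equally likely regardless of frequency
```

A weighted version would store counts in the `links` table and sample proportionally; this one deliberately throws that information away.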

~~~
meowface
I think it may not be a great comparison. N-grams (of words) of human
speech/writing are way more deterministic than the kinds of things ML usually
tries to tackle, I think. If you write the word "because", then "of", "the",
or some pronoun are all extremely safe bets for the next word, regardless of
their recorded probabilities. I imagine you could also totally randomize the
probabilities and not see any issues.

But I'm no expert and hardly even an amateur, so maybe it is a similar kind of
thing here with ML. And I know randomized optimization is a big thing in ML,
though I'm not sure to what extent that could be analogized with randomizing
Markov model probabilities.

------
jangid
The analogy given in the article is interesting. Some organisms perform
certain actions even before they start to learn. I myself have seen some
animals start running immediately after birth. Fewer parameters (shared
parameters) could also be thought of as less complexity and hence lower
processing power requirements, which implies faster training. Phew! Too much
similarity.

------
nurettin
[https://github.com/google/brain-tokyo-workshop/tree/master/WANNRelease/prettyNEAT](https://github.com/google/brain-tokyo-workshop/tree/master/WANNRelease/prettyNEAT)

To me, this is the really interesting part of the article. NEAT
(NeuroEvolution of Augmenting Topologies) is an algorithm for GANN. For those
who are looking to implement the algorithm from scratch, see
[http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf](http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf)
for hours of fun.

------
ilaksh
This seems like it has the potential for massive efficiency gains and maybe
could help with better generalization if the much simpler networks could more
easily be reused or recursed or something.

------
kolar
How is this different from genetic programming?

~~~
zwaps
How was early Machine Learning different from statistics?

New names make things exciting for people to pick up. Who wants to estimate a
multinomial regression when you can learn a shallow softmax-activated neural
network!

It's all about creating hype.

~~~
YeGoblynQueenne
>> How was early Machine Learning different from statistics?

Some of the very early work in machine learning, in the 1950s and '60s, was
not statistical. The first "artificial neuron", the McCulloch & Pitts neuron,
from 1943, was a propositional logic circuit. Arthur Samuel's 1952 checkers-
playing programs used a classical minimax search with alpha-beta pruning.

Machine learning in the '70s and '80s was for the most part not statistical,
but logic-based, in keeping with the then-current trend for logic-based AI.
Early algorithms did not use gradient descent or other statistical methods and
the models they learned were sets of logic rules, and not the parameters of
continuous functions.

For instance, a lot of work from that time focused on learning decision lists
and decision trees, the latter of which are best remembered today. The focus
on rules probably followed from the realisation of the problems with knowledge
acquisition for expert systems, that were the first big success of AI.

You can find examples of machine learning research from those times in the
work of researchers like Ryszard Michalski, Ross Quinlan (known for ID3 and
C4.5 and the first-order inductive learner FOIL), (the) Stuart Russell, Tom
Mitchell, and others.

------
s_Hogg
I'm pretty sure this was posted a while back (maybe a month or two)?

~~~
mannykannot
Yes, with some insightful and informative comments:

[https://news.ycombinator.com/item?id=20160693](https://news.ycombinator.com/item?id=20160693)

------
scribu
Discussion from 3 months ago:
[https://news.ycombinator.com/item?id=20160693](https://news.ycombinator.com/item?id=20160693)

------
TekMol
Is each architecture given _one_ set of random weights?

Or is the architecture of the net tested against a bunch of random weights so
that it performs well independently of the weights?

~~~
patresh
Each architecture is tested multiple times against different samples of the
shared weight value.

From the paper:

(1) An initial population of minimal neural network topologies is created

(2) each network is evaluated over multiple rollouts, with a different shared
weight value assigned at each rollout

(3) networks are ranked according to their performance and complexity

(4) a new population is created by varying the highest ranked network
topologies, chosen probabilistically through tournament selection
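The four steps above amount to a standard evolutionary loop. A toy sketch of it, with truncation selection standing in for the paper's tournament selection and the complexity term omitted:

```python
import random

WEIGHT_SAMPLES = (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)

def wann_search(population, evaluate, mutate, generations=60):
    """Toy version of the WANN loop: evaluate each topology over
    several shared-weight rollouts, rank by mean performance, and
    build the next population by varying the top quarter."""
    for _ in range(generations):
        scored = []
        for net in population:
            # Step 2: same topology, one rollout per shared-weight value.
            mean_fit = sum(evaluate(net, w) for w in WEIGHT_SAMPLES)
            mean_fit /= len(WEIGHT_SAMPLES)
            scored.append((mean_fit, net))
        # Step 3: rank by performance (complexity term omitted here).
        scored.sort(key=lambda s: s[0], reverse=True)
        elite = [net for _, net in scored[:max(1, len(scored) // 4)]]
        # Step 4: new population from varied copies of the best topologies.
        population = elite + [mutate(random.choice(elite))
                              for _ in range(len(population) - len(elite))]
    return scored[0][1]

# Stand-in "topology": a single float; fitness peaks at 3 for every weight.
random.seed(0)
best = wann_search([0.0, 1.0, 5.0, 8.0],
                   evaluate=lambda net, w: -(net - 3.0) ** 2,
                   mutate=lambda net: net + random.uniform(-0.5, 0.5))
```

The `evaluate`/`mutate` stand-ins are placeholders; in the real algorithm a "topology" is a network graph and mutation adds nodes, adds connections, or changes activations.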

~~~
TekMol
What is a "shared weight"?

~~~
patresh
Usually in neural networks, each connection has its own weight, and these
weights take on different values that are tuned during training.

A shared weight here means that every connection in the network uses the exact
same weight value.
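A minimal numpy sketch of that: the evolved part is the wiring (here, made-up 0/1 connection masks), and every connection that exists gets the one shared weight value:

```python
import numpy as np

def forward_shared(masks, x, shared_w):
    """Feed-forward pass where every present connection has the same
    weight. `masks` are 0/1 matrices describing the evolved wiring."""
    h = x
    for m in masks:
        h = np.tanh(h @ (m * shared_w))  # one weight value everywhere
    return h

# Made-up irregular wiring for a 3 -> 4 -> 2 network.
masks = [np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]]),
         np.array([[1, 0],
                   [0, 1],
                   [1, 1],
                   [0, 1]])]
x = np.array([1.0, -1.0, 0.5])
# Same wiring, different shared weight magnitudes -> different behaviour,
# with no per-connection parameters ever trained.
out_a = forward_shared(masks, x, shared_w=0.5)
out_b = forward_shared(masks, x, shared_w=2.0)
```

The search in the paper then scores the wiring by how well it performs across many such shared-weight values, not just one.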

------
DoctorOetker
How is this different from boring old evolutionary algorithms?

In my opinion, the big breakthrough that enabled optimization and machine
learning was the discovery of reverse-mode automatic differentiation, since
the space (or family) of all possible decision functions is high-dimensional,
while the goal (survival, reproduction) is low-dimensional. Unless I see a
mathematical proof that evolutionary algorithms are as efficient as
reverse-mode AD, I see little future in them; apparently neither did biology,
since it decided to create brains.
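As an aside, the core of reverse-mode AD fits in a few lines. The sketch below shows the property this argument relies on: one backward pass from a scalar output yields gradients with respect to every input at once (a real implementation would process nodes in topological order rather than recursing):

```python
class Var:
    """Minimal reverse-mode automatic differentiation node."""

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        """Accumulate d(output)/d(self) and push it to the parents."""
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x, y = Var(3.0), Var(4.0)
z = x * y + x   # z = x*y + x, so dz/dx = y + 1 and dz/dy = x
z.backward()    # one pass fills in x.grad and y.grad together
```

The cost of that single pass is proportional to the cost of the forward evaluation, independent of the number of inputs, which is exactly what a low-dimensional goal over a high-dimensional parameter space needs.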

It's not an ideological stance I take here (of nature vs nurture).

For simplicity, let's pretend humans are single-celled organisms. What does
natural selection exert pressure on? Our DNA code: both the actual protein
codes and the promoter regions. I claim that variation in the proteins is
risky (a modification in a protein-coding region could render a protein
useless), while variation in the promoter regions is much less risky: altering
a nucleotide there would slightly change the affinity modulating
transcription, so the cell would behave essentially the same but with
different threshold concentrations. Think of continuous parameters that
describe our body (assuming the same nurture, food, etc.): some people are a
bit taller, some a bit stronger, and so on. So how many of these continuous
parameters do we have? On the order of the total number of promoter regions in
the DNA of the fertilized egg: both in the human DNA and in one mitochondrion
(assuming there isn't a chemical addressing/reading/writing scheme for, say,
10 mitochondria)...

EDIT: Just adding that, for a certain fixed environment, there are local (and
a global) optima of affinity values for each protein, so that near a local
optimum the fitness is roughly shaped like -s(a - a_opt)^2, where s is the
spread and a_opt is the locally optimal affinity value. In other words, it is
not the case that "better affinity" means fitter, not at all: a collection of
genomes from an identical environment will hover around an affinity sweet
spot.

According to Wikipedia [0], that would result in:

about 2x 20412 "floats" for just the protein-coding genes

about 2x 34000 "floats" when also including the pseudo-genes

about 2x 62000 "floats" when also including long ncRNA, small ncRNA, miRNA,
rRNA, snRNA, snoRNA

these "floats" are the variables that allow a species to modulate the reaction
constants in the gene regulatory network, since natural selection can not
directly modulate the laws of physics and chemistry, and modulating the
protein directly instead of the promotor region affinities / reaction rates
risks disfunctional proteins...

so my estimate of _an_ upper limit of the number of "floats" in the genetic
algorithm is ~120000 (and probably much less if not each of the above has a
promotor region).

That's not a lot of information if we think about the number of synaptic
weights in the brain, and many of these "floats" are shared _in utilization_
with the other cell types besides neurons.

I consider the possibility that the sperm cell, egg cell, or fertilized egg
cell performs a kind of POST (power-on self-test) that checks for some of the
genes, although simply reaching the fertilized state may be enough of a
self-test, so no spontaneous-abortion test may be needed (to save time and
avoid spending resources on a probably malformed child).

[0]
[https://en.wikipedia.org/wiki/Human_genome#Molecular_organiz...](https://en.wikipedia.org/wiki/Human_genome#Molecular_organization_and_gene_content)

EDIT2: regarding:

>This makes WANNs particularly well positioned to exploit the Baldwin effect,
the evolutionary pressure that rewards individuals predisposed to learn useful
behaviors, without being trapped in the computationally expensive trap of
‘learning to learn’.

The computationally expensive trap of having to 'learn to learn' could end up
being as mundane as a small number of hormones to which neurons in the brain
globally or collectively respond. That enables learning by reward or
punishment, and from then on anticipating reward or punishment; our individual
end goals stem from this anticipation, and from anticipating the anticipation,
etc...

~~~
buboard
Evolutionary algorithms are perhaps easier to approach theoretically.
Backprop works amazingly well, but how does one even begin to explain why?
There is also an element of backprop in evolution, via epigenetics.

~~~
p1esk
_Evolutionary algorithms are perhaps easier to approach theoretically_

Why?

------
ianamartin
Next thing you know, google will be telling us that the fastest websites are
server side rendered from templates with minimal JavaScript.

How could anyone possibly have known?

