
How much innate knowledge can the genome encode? - hardmaru
http://www.wiringthebrain.com/2020/01/how-much-innate-knowledge-can-genome.html
======
allovernow
I think the article sort of misses the real effect of the genome on "innate"
knowledge.

There's a soon-to-be-hot (I predict) topic in AI/ML called inductive bias. It
turns out that the way your neural net is structured, i.e. the overall shape
and the way the neurons are connected, can strongly bias which information the
net tends to learn, and how quickly. To the point that there was a paper
(can't find it now) where IIRC the authors were able to develop an algorithm
that creates neural nets which can solve basic problems with little to no
training. Point being, you don't need to encode, say, the image of a nipple
and the action of suckling in DNA to create innate suckling knowledge - you
just need to structure the brain such that it is biased to perform and learn
certain sequences of behaviors.

I think this may be the origin of much if not all innate knowledge and I
believe that research in AI/ML will lead to better understanding of human (and
animal) cognition.
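To make that intuition concrete, here's a minimal sketch (my own illustration,
not from any particular paper): an untrained convolutional layer is
translation-equivariant purely by construction, while an untrained dense layer
is not. The structure alone encodes a prior about the data before any learning
happens.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)
dense = nn.Linear(16, 16, bias=False)

x = torch.zeros(1, 1, 16)
x[0, 0, 4] = 1.0                             # an impulse at position 4
x_shifted = torch.roll(x, shifts=3, dims=2)  # the same impulse at position 7

with torch.no_grad():
    # The conv's output shifts along with its input: a built-in prior.
    print(torch.allclose(torch.roll(conv(x), 3, dims=2), conv(x_shifted)))  # True
    # The dense layer has no such bias: shifting the input scrambles the output.
    y, y_shifted = dense(x.flatten(1)), dense(x_shifted.flatten(1))
    print(torch.allclose(torch.roll(y, 3, dims=1), y_shifted))              # False
```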

~~~
ggggtez
eh... Information-theory-wise, at best you can say you've compressed the
knowledge of the action down to the steps required to construct the geometry.
But you've still encoded it on some level, so there is an
information-theoretic limit that is still worth thinking about. I think they
kind of miss the boat in their discussion of Shannon: you've encoded the
structure, and the actual behavior then arises from actually running the
network. I think they got a little confused, to be honest.

And indeed, what people will find is exactly as you say: the DNA encodes how
to grow the network such that it will respond with the evolutionarily
advantageous instinct.

I don't know if I'd quite say that no one is looking into that. Obviously we
know that there are NNs with deep architectures. But we can also see that we
can improve computational efficiency by removing nodes and edges that don't
contribute to the correct answer. In evolution, this is a literal cost
savings, in that you need to feed fewer neurons.
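A quick magnitude-pruning sketch of that idea (my illustration; real pruning
schemes retrain between rounds): zero out the weights closest to zero, and
every zeroed weight is computation - and, metabolically, wiring - you no
longer pay for.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # stand-in for a trained layer's weights

# Magnitude pruning: zero the half of the weights closest to zero.
threshold = np.quantile(np.abs(W), 0.50)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

x = rng.normal(size=8)
print(f"{(W_pruned == 0).mean():.0%} of weights removed")
# On trained nets the drift is typically small; with random weights this
# just shows the mechanics and the compute saved.
print(f"relative output drift: "
      f"{np.linalg.norm(W @ x - W_pruned @ x) / np.linalg.norm(W @ x):.0%}")
```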

~~~
allovernow
I understand your point, and I'm having trouble articulating this thought, but
I think with inductive bias you can structure your net such that it is primed
to naturally learn behaviors more complex than the Shannon entropy of the
encoding DNA would suggest. The minimal structural information, combined with
environmental interaction, can produce far richer information content, past
the limit of the original DNA; but from a design perspective your actual code
is very compact, because the rest of the knowledge required for innate
behaviors is gained naturally from the environment.

~~~
asdff
It's interesting to think about all this. Everything DNA encodes is a response
to the environment.

On the molecular level, proteins are only functional at certain temperatures,
pressures, and pH levels.

On the cell level, certain proteins might only be found in specific areas, and
essential amino acids might be harvested from the environment rather than
produced by the cell itself.

On the organism level, we are predisposed to shiver if it's too cold or sweat
if it's too hot, an uncomfortable feeling, so innately we move to where the
climate is milder and won't kill us. We clothe ourselves and that range is
expanded; suddenly there is a selective advantage to being able to clothe
yourself, just like there is to being a white beetle on a white sand beach
instead of a green one visible to predators, or a cell that's able to just
mooch someone else's amino acids rather than expend energy making its own.

------
mirimir
It's a great article, loaded with information, and very insightful.

But I'm left disappointed that there's no actual estimate of maximum
information content. Just this:

> *[Note from Tony Zador: The length of the genome, which in humans is around
> 3 billion letters, represents an upper bound on the Kolmogorov complexity.
> Only a fraction of that carries functional information, however, so the
> upper KC value may be quite a bit lower than that].

That must be 3 x 10^9 base pairs, in the haploid genome. Wikipedia tells me
that's 800MB.[0] However, only 1%-2% of that codes for proteins.[1] Which is
basically what Zador's note says.
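The arithmetic behind that figure, for anyone checking (assuming the usual 2
bits per base, haploid genome, no compression):

```python
bases = 3.1e9                  # haploid human genome, ~3.1 billion bases
bits = bases * 2               # A/C/G/T -> 2 bits each
megabytes = bits / 8 / 1e6
print(f"~{megabytes:.0f} MB")  # ~775 MB, i.e. roughly the 800MB figure
```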

So what's possible for the rest? While 800MB doesn't seem that much, maybe
it's enough for basic "firmware" and "software". Including instinctual
behaviors.

But even all that is clearly not enough for a full blueprint and assembly
instructions. That, as TFA explains, must flow from interactions during
development.

0)
[https://en.wikipedia.org/wiki/Human_genome](https://en.wikipedia.org/wiki/Human_genome)

1)
[http://sitn.hms.harvard.edu/flash/2012/issue127a/](http://sitn.hms.harvard.edu/flash/2012/issue127a/)

~~~
RosanaAnaDana
An important thing to remember/consider about DNA is that in its somatic state
it is three-dimensional. So while its informational content may be ~3 billion
letters, those letters can be, and are, part of a dynamical system in the
cell. I'm not sure what this means in regards to its total informational
content.

~~~
Palomides
thinking of DNA as some kind of abstract turing tape is wildly misleading.
here's a graphic that summarizes some of the ways DNA is
conformed/stored/managed in the cell:

[https://upload.wikimedia.org/wikipedia/commons/4/4b/Chromati...](https://upload.wikimedia.org/wikipedia/commons/4/4b/Chromatin_Structures.png)

each of these levels has huge implications for what can be done with the DNA

there's tons more information attached to the strands of DNA, too, like
methylation or promoters, etc., that the simple ATGC doesn't cover at all

if we want to stretch the turing tape metaphor, it's more like marking the
symbols as various kinds of magnets on a string, then putting a bunch of them
in a sack and shaking them up. does that change the amount of information?

if you're interested, you really can just pick up an intro bio textbook, skim
it, then do an intro genetics text (and skip the classical genetics crap).

~~~
mirimir
OK, so how much could that increase the 800MB estimate?

How many orders of magnitude?

~~~
Palomides
looking at the first step in that graphic, for example, a histone is made of 4
distinct types of parts, and in humans each of those has a dozen variants, and
the final assembled histone can additionally be tagged with other pieces. and
the physical interaction between the histone and the DNA is influenced by the
local sequence.

I don't know how I would quantify that, but there's a lot going on.
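As a toy upper bound, taking the numbers in that description at face value (4
histone types, ~12 variants each, two copies of each per octamer, ignoring
tags and position effects entirely):

```python
from math import comb, log2

variants = 12                  # ~a dozen variants per histone type
pairs = comb(variants + 1, 2)  # unordered pair of copies: 78 states
states = pairs ** 4            # 4 histone types per octamer
print(states, f"~{log2(states):.0f} bits")  # 37015056, ~25 bits per nucleosome
```

Multiplied over the tens of millions of nucleosomes in a genome, that moves
the ceiling up substantially - though how much of it is usable information is
exactly the open question.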

------
glofish
For me the most enlightening summary regarding information encoding in genomes
is the following quote from the article:

 _Being amazed that you can make a human with only 20,000 genes is thus like
being amazed that Shakespeare could write all those plays with only 26
letters. It’s totally missing where the actual, meaningful information is and
how it is decoded._

~~~
jacquesm
I can do a lot better than that with just 1s and 0s.

The sentence makes exactly the error that it accuses others of making: the
'20,000 genes' is not the encoding. The encoding is the codons, of which there
are 64, some of which encode the same amino acid. The 20,000 genes really are
more like a single play of Shakespeare.

~~~
glofish
Genes are regions on the genome; whether or not they are translated three
bases at a time (as codons) does not seem relevant.

The quote is a criticism of a common way of thinking in biology, where many
(if not most) practitioners think in terms of "one gene does one thing" (or
perhaps a few things at most).

When one follows that mentality, it becomes hard to explain how just 20K genes
could make a living organism.

~~~
Real_S
The analogy I often use is houses. Think of all the components used to build a
complete house. One can use the same bricks, wood, tile, etc., to build many
different houses. But the location of these components in the house is not
enough; the order in which they are added is critical.

The genome must encode all the components used to make an organism, as well as
the instructions on how and when to use those components.

------
chrchang523
I agree with the author re: the relative importance of coding vs. non-coding
regions, but disagree with the section on Shannon information.

Shannon information provides a trivial _upper_ bound on Kolmogorov complexity:
imagine a tiny program with a huge data segment, which just copies the
contents of the data segment to standard output. In the case of the human
genome, this value is on the order of 6 billion bits. The key takeaway is that
a much larger fraction of these bits than one might expect from the phrase
"20000 genes" are potentially relevant, not that Shannon information somehow
underestimates complexity.
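A concrete (if silly) witness to that bound, K(x) <= |x| + c: the "tiny
program with a huge data segment" can literally be a print statement whose
data segment is the string itself.

```python
# The program "print(<data>)" outputs x, so the Kolmogorov complexity of x
# cannot exceed len(x) plus the constant-size wrapper around it.
x = "ACGT" * 10               # stand-in for the ~3e9-letter genome
program = f"print({x!r})"     # a complete Python program that outputs x
print(len(program) - len(x))  # the constant c: 9 bytes of wrapper
```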

~~~
downerending
Perhaps along these lines, when considering compression, there is a tradeoff
of sorts between the size of the compressed result and the size of the
decompressor. If I get to choose the decompressor for a file, the answer to
the question of how small I can make the compressed file is "all the way"
(i.e. zero bits).

It seems likely that the amount of information in the DNA "decompressor" (in
mammals, the mother and her environment) dwarfs the amount actually present in
the DNA itself.
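The degenerate limit of that tradeoff, as a sketch: if the decompressor may be
chosen after seeing the file, it can simply contain the file, and the
"compressed" input shrinks to nothing.

```python
data = b"any file contents at all"

def decompressor(compressed: bytes) -> bytes:
    # All of the information lives here, inside the decompressor itself.
    return data

assert decompressor(b"") == data  # a zero-bit "compressed file"
```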

~~~
DoctorOetker
except that the decompressor is itself encoded in the genome, just as real
compressor comparisons count the combined size of the competing compressor +
data

~~~
downerending
It is indeed proper to use that sum as a metric. It's not necessarily the case
that the decompressor is encoded in the genome, though. It may very well be
that crucial biological information is passed from mother to child in other
ways, and it could turn out to be a _lot_ of information.

------
sebastianconcpt
_a purely random sequence of letters has greater Shannon information than a
string of words that make up a sentence, because the words and the sentence
have some higher-order patterns in them (like statistics of letters that
typically follow each other, such as a “u” following a “q”), which can be used
to compress the message. A random sequence has no such patterns and thus
cannot be compressed._

_Thinking in those terms naturally leads to the kind of “counting arguments”
that Bengio makes. These seem to take each gene as a bit of information, and
ask whether there are enough such bits to specify all the bits in the brain,
usually taken as the number of connections. Obviously the answer is there are
not enough such bits. (There aren’t even enough bits if you take individual
bases of the genome as your units of information)._

------
TTPrograms
How much confidence do we have that the literal base-pair encoding of DNA
carries the entirety (or even the majority) of the information needed to
construct an individual, anyway? As the author mentions, you have things like
histone dynamics, as well as other potentially relevant variation in the many
organelles that make up the cell. Eggs in particular are chock full of
proteins passed down directly from the parent that could have important
implications for gene expression. These effects would either have to be
self-propagating in some way, via further protein synthesis, or their effect
would be primarily early-developmental, but they could still have large
effects on the grown individual.

As evidence for this, we know there are all sorts of heritable epigenetic
effects. It would also partially explain why GWAS has failed to pinpoint all
of an individual's features, as had been hoped before the Human Genome Project
completed.

There's a process called somatic cell nuclear transfer that might make it
possible to evaluate aspects of this, but it doesn't appear to have been
performed enough times with different individuals to draw major conclusions.

------
stanfordkid
Interesting article, but I think the "information" metaphor sort of misses the
point. DNA is not a store of information so much as the result of computation
to solve a hard problem (survival/reproduction in a given environment). Think
about it this way: it is really hard to crack a SHA256 hash, but the
information contained in the solution is very small. DNA is the same thing;
"information" is not really the appropriate metaphor for it. Geneticists and
biologists have a concept called conservation [1] that sort of gets at this
point: if we think of an environment as something like a one-way function,
then the bits in the "genomes" that provide suitable solutions tend to be
stable over time.

[1]
[https://en.wikipedia.org/wiki/Conserved_sequence](https://en.wikipedia.org/wiki/Conserved_sequence)
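A toy version of that asymmetry (my sketch; a 20-bit truncated target stands
in for a real preimage, which would be computationally infeasible): the search
burns on the order of a million hashes, but the answer it produces fits in a
few bytes.

```python
import hashlib

target = hashlib.sha256(b"environment").hexdigest()[:5]  # 20-bit "puzzle"
i = 0
while hashlib.sha256(str(i).encode()).hexdigest()[:5] != target:
    i += 1
# ~2**20 hashes of work; the solution itself is only a few bytes long
print(i)
```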

------
hirenj
Given two genes, A and B, you could list out the information encoded by them
as “A”, “B”, and “AB”. Which is basically the Shannon estimate of complexity,
right?

However, in reality the gene dose makes a difference - so it’s more like the
genome (plus genome architecture) encodes the capacity to produce different
amounts of A and B. So, with high A expression you will see things like
“AAAAB”, “AAAAAA”, etc.

The point is that rather than encoding for different functions, the genome
encodes for the diversity of different message queues - and it is the length
of these queues and rate of message firing that encodes (as much as you can in
a probabilistic sense) for our cell programming. The messages could be “build
a transcription factor” or “put a receptor on the surface”, or “make a crank
that increases the capacity of these queues”.
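A hypothetical toy of that framing (genes and rates invented for
illustration): the "program" fixes only per-gene firing rates, and what the
cell actually sees is the sampled message stream.

```python
import random

random.seed(0)
rates = {"A": 5.0, "B": 1.0}  # "gene dose": A fires five times as often as B
stream = "".join(random.choices(list(rates), weights=list(rates.values()), k=12))
# e.g. 'AAAABAAABAAA': the same two genes, but the queue statistics,
# not the gene list, carry the program
print(stream)
```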

------
teekert
The brain is the computer, not the genome, no doubt about that. But the genome
was not designed; it evolved, and as a consequence it contains many strange
and overly complex mechanisms that just settled into some minimum in a highly
complex landscape. It would be better to ask what it would take to store
everything we learn, and to act upon it, in the simplest way.

If we are looking for the innate information, we should be asking: how much
information does it take to store the computer, plus all the machinery that
builds it from basic components (molecules, basically)? It then becomes easy
to imagine that there is some kind of suckling response; a computer also
starts doing things when it is powered up.

------
alfonsodev
very interesting article, and an intriguing conclusion towards the end - makes
me want to know more:

> You don’t need to specify where every synapse is to get the machine to work
> right. You just need to specify roughly the numbers of different cell types,
> their relative positions, the other types of neurons they tend to connect
> to, etc. The job of building the brain is accomplished statistically, and,
> crucially, probabilistically. This is why there is lots of variation in
> brain structure and function even between monozygotic twins and why
> intrinsic developmental variation is such a crucial (and overlooked) source
> of differences in people’s psychology and behaviour (the subject of my book
> Innate).
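A hedged sketch of that quoted idea (cell-type names, counts, and
probabilities invented): a genome-like spec fixes only cell-type counts and
connection probabilities; each developmental "run" samples a different
concrete wiring, so even identical specs ("monozygotic twins") yield different
brains.

```python
import numpy as np

counts = {"excit": 80, "inhib": 20}                          # cells per type
spec = {("excit", "inhib"): 0.10, ("inhib", "excit"): 0.30}  # wiring rules

def develop(seed):
    # Sample a concrete connectome from the statistical spec.
    rng = np.random.default_rng(seed)
    return {(a, b): rng.random((counts[a], counts[b])) < p
            for (a, b), p in spec.items()}

twin_a, twin_b = develop(1), develop(2)  # same "genome", different noise
diff = (twin_a[("excit", "inhib")] != twin_b[("excit", "inhib")]).mean()
print(f"{diff:.0%} of potential excit->inhib synapses differ between twins")
```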

------
dekhn
the scientific community has discussed this for some time. A lot of
discussions I see on HN about this start from "let's understand this from
de-novo first-order principles". But to make rapid progress, scientists in
these fields use evidence from many different sources rather than first-order
principles.

One of the interesting things we've learned over the years, painfully and
repeatedly, is that fairly simple encodings of fairly simple data can generate
extremely complex visible outcomes. A great example of this is patterning in
animal fur: there isn't a section of the genome that encodes a pixel map of
the resulting spots or their X/Y locations. Instead, a collection of genes
operates in a feedback loop to generate the patterns. Turing's theory of
morphogenesis has turned out to be a highly prescient application of very
simple theory that does a good job of explaining observed patterning.

What I've seen repeatedly over the years is that the proteins themselves
aren't that interesting; all the interesting, complex, repeatable phenotypes
come from careful regulation of the expression of proteins, with feedback
loops and spatial gradients. The best work on this has been done in the
evo-devo field on model organisms (not humans). C. elegans was an
exceptionally good choice of model organism for many reasons, not the least of
which is that every instance of the organism has exactly the same cells in
exactly the same locations, and that appears to be completely genetically
encoded (again, not as a pixel map, but as generative rules).
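Since Turing patterns came up: here's a minimal Gray-Scott reaction-diffusion
sketch (my illustration, with standard demo parameters) where two interacting
"chemicals" plus diffusion produce spotted patterns, with no pixel map
anywhere in the rules.

```python
import numpy as np

n, Du, Dv, F, k = 128, 0.16, 0.08, 0.035, 0.065
U, V = np.ones((n, n)), np.zeros((n, n))
U[54:74, 54:74], V[54:74, 54:74] = 0.50, 0.25  # a small initial perturbation

def lap(Z):
    # 5-point Laplacian with periodic boundaries
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

for _ in range(10_000):
    uvv = U * V * V
    U += Du * lap(U) - uvv + F * (1 - U)
    V += Dv * lap(V) + uvv - (F + k) * V

# Threshold V to "see" the emergent spots as ASCII art.
for row in V[::8, ::4] > 0.2:
    print("".join("#" if c else "." for c in row))
```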

------
hmmmhmmmhmmm
Thinking of the genetically encoded information developing over time as simply
"training" the neural network of the brain doesn't quite capture development;
proteins are manufactured and move through space as the brain grows over time.

It feels more like training a neural network is rewiring an adult brain that
has already fully grown. Are there neural architectures that grow over time,
or "age", emulating the early development of the brain?

------
benlivengood
I've wondered in the past whether humans have traded off hard-coded instincts
for more cerebral cortex, or more connectivity between brain regions. Culture
and communication are now part of our environment, so our genes don't have to
carry as much information about the environment itself; our parents/society
can add it later.

------
russfink
Trivial sci-fi comment here - would love to see the 10,000-year endgame of
this research: near death, all your memories are encoded into a new genome,
which is cloned into a new person that is you and also retains all of your
life experiences. You die, a new you regenerates, and picks up where you left
off.

~~~
n4r9
To piggyback on the sci-fi theme, some of the best sci-fi from the past few
years has been Adrian Tchaikovsky's novel Children of Time and its follow-up
Children of Ruin, which explore the idea of an artificial virus that interacts
with genomes and allows for a sort of genetic memory.

------
cyorir
The information stored in DNA is important, but so is the state/information
given by extant proteins. Reading up on epigenetics may prove interesting:

[https://en.wikipedia.org/wiki/Epigenetics](https://en.wikipedia.org/wiki/Epigenetics)

------
ggggtez
I personally had not seen the eyeless gene information. While I don't find the
outcome surprising, I do find the images breathtaking. It's also very
interesting that including the similar (but not identical) gene from the mouse
has the same effect of inducing eye growth.

------
grenoire
This is likely akin to the question of how many individual songs you can write
with the chromatic scale
([https://youtu.be/DAcjV60RnRw](https://youtu.be/DAcjV60RnRw)).

------
AtlasBarfed
The genome gets gene-expression cues from the environment, which means it has
externalized some of its information, implicitly, outside of its own
structure.

One form of this is culture and species-taught behaviors.

------
awinter-py
is there any science on how reflexes / behaviors are encoded in the genome?

have labs successfully changed these? I'm familiar with a prenatal brain
surgery that made chickens behave like ducks, but it wasn't genetic

~~~
xg15
(Not a biologist by any stretch, so the following is speculation.)

As for reflexes, don't we know how they work on an anatomical/neural level in
a grown adult? From my high school biology lessons, I remember that for
certain reflexes there are nerves that directly connect "sensors" to motor
neurons in the spinal cord, bypassing the brain altogether.

If you stick to the "genes as a program" metaphor, "encoding" this reflex in
the genome would mean that the genome contains the correct triggers and
"subroutines" (i.e. cascades of gene expression) so that a particular group of
cells will form that nerve and connect it to the correct other neurons.
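Running with that metaphor, a deliberately naive sketch (gene names and rules
invented): each "gene" is a boolean function of upstream genes, and a trigger
propagates through the cascade until the circuit gets built.

```python
# Toy gene-regulatory cascade: rules fire repeatedly until the state
# settles and the "reflex arc" has been constructed.
state = {"trigger": True, "geneA": False, "geneB": False, "reflex_arc": False}
rules = {
    "geneA": lambda s: s["trigger"],                   # trigger switches on A
    "geneB": lambda s: s["geneA"],                     # A switches on B
    "reflex_arc": lambda s: s["geneA"] and s["geneB"],
}
for _ in range(len(rules)):  # enough passes for the cascade to settle
    state = {**state, **{g: f(state) for g, f in rules.items()}}
print(state["reflex_arc"])   # True: the circuit is "built"
```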

------
kleer001
I kept expecting references to Steven Pinker, or at the very least fractals.
Was disappointed.

Otherwise an excellent, if short, article.

