
Biological Function Emerges from Unsupervised Learning on 250M Protein Sequences - smhx
https://www.biorxiv.org/content/10.1101/622803v1
======
gigantum
Like some of the other ML/AI posts that made it to the front page today, this
research gives no clear way to reproduce its results. I looked through the
pre-print page as well as the full manuscript itself.

Without reproducibility and transparency in the code and data, the impact of
this research is ultimately limited. No one else can recreate, iterate on, or
refine the results, nor can anyone rigorously evaluate the methodology used
(beyond guessing after reading the manuscript).

The year is 2019; many are finally realizing it's time to back up your results
with code, data, and some kind of specification of the computing environment
you're using. Science is about sharing your work for others in the research
community to build upon. Leave the manuscript as the pretty formality.

~~~
Havoc
>any clear way to reproduce the results.

Given that it's evolved, I'd imagine this is a given? Or, more accurately, you
could probably duplicate some kind of emergent behaviour, but it would be
different given different randomized parameters.

~~~
lysium
Usually you use an RNG for which you can publish the seed. So, although it’s
random, you can reproduce the results.

~~~
tastroder
Glancing through the paper, it seems like they use the recent Transformer
model. Does whatever underlying stack they use expose a way to share RNG seeds
and the exact hardware optimizations your environment applies during training?
Otherwise, "publishing the seed" sounds nice but might not be as trivial as the
phrase suggests.
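
For what it's worth, in a typical PyTorch setup (assuming that's what they
used; the paper doesn't say), "publishing the seed" would look roughly like
the sketch below, and even then determinism isn't guaranteed across GPU
models, driver versions, or library releases:

    import random
    import numpy as np
    import torch

    def seed_everything(seed: int = 42) -> None:
        # Pin every RNG the training loop might touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Ask cuDNN for deterministic kernels; this usually costs speed,
        # and some ops still lack a deterministic implementation.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    seed_everything(42)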

~~~
arthur_pryor
reproducibility should be something that's baked into an experiment's design.

so, if their experiment was designed such that reproduction is inherently
difficult, they should have designed it in a better way, and they should've
used a toolset that wouldn't run into that problem.

a non-reproducible experiment isn't necessarily completely without value, but
it's a thing that everyone should look askance at till it proves its worth.

(apologies if my comments don't apply to this experiment and if it is
reproducible -- i didn't have time to read through the OP, but i thought this
reply was still a worthwhile response to its specific parent comment)

~~~
tastroder
No, that's absolutely a fair point; my comment was aimed more at the RNG
aspect. I haven't looked into this specific paper either, but one would hope
people don't publish their best randomly achieved run if the system cannot
reproduce it, or similar results.

That being said, the paper in question doesn't seem to reference any open
source code anyway, so I guess my point was kind of moot; apologies.

------
andbberger
I find this paper so steeped in hype and dogma as to be nearly
incomprehensible.

Which is a shame, because it's a reasonable approach. I just wish they'd
frickin' describe what they did instead of spending the whole paper
monologuing and showcasing unconvincing experiments. No need to justify what
you're doing, just do it.

------
ArtWomb
Fergus Lab at NYU. I believe he's across the hall from Yann LeCun as well ;)

Still a long way from a Theory of Biogenesis. But a good next step is using a
differentiable model to predict novel proteins which have no analogue in
Nature. Much like Materials Genome researchers searching for stable phases of
matter!

"Training ever bigger convnets and LSTMs on ever bigger datasets gets us
closer to Strong AI -- in the same sense that building taller towers gets us
closer to the moon." \--François Chollet

~~~
visarga
> "Training ever bigger convnets and LSTMs on ever bigger datasets gets us
> closer to Strong AI -- in the same sense that building taller towers gets us
> closer to the moon." \--François Chollet

The Transformer layer is a radical leap beyond LSTMs and CNNs. While LSTMs can
model sequences and CNNs regular grids, neither has an efficient long-range
interaction mechanism; the Transformer does. It's a huge leap, similar to the
one computer vision saw a few years ago.

What is needed, besides spatial translation invariance (CNNs) and temporal
invariance (LSTMs), is permutation invariance. Whenever the problem can be
described as a graph, the ordering of the vertices and edges should not
matter. You can't do that with CNNs and LSTMs, but you can with graph neural
nets and Transformers.

Apparently Transformers are the best for language modelling (GPT-2), playing
games (OpenAI's Dota 2 bot), composing music, and possibly now modelling
proteins. I assume they will play a huge role in working with graph-structured
data, with multiple entities and relations.
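
To make the long-range interaction point concrete, here's a minimal NumPy
sketch of self-attention (illustrative only, not the paper's code): every
position interacts with every other position in a single layer, and without
positional encodings the operation doesn't care about token order.

    import numpy as np

    def self_attention(x):
        # x: (seq_len, d) token embeddings; queries = keys = values = x
        d = x.shape[-1]
        scores = x @ x.T / np.sqrt(d)    # all pairwise interactions at once
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
        return weights @ x               # each output mixes every input

    tokens = np.random.randn(5, 8)       # 5 tokens, 8-dim embeddings
    out = self_attention(tokens)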

~~~
nl
It's not really as clear cut as that.

Transformers work well on sequence tasks because they compare well in terms of
accuracy and also scale better than RNNs like LSTMs or GRUs. That means they
can be trained on more data.

This isn't really the same as CNNs, which model images by operating at
different scales. I'm not aware of any cases of Transformers being used
particularly successfully on images.

They can be used on graphs, of course, by translating the problem into a
graph-walk problem (a la DeepWalk).

All the examples you gave (language modelling, Dota 2, music and protein
modelling) are set up as sequence prediction problems, so they are perfect for
Transformers.

~~~
p1esk
[https://arxiv.org/abs/1904.09925](https://arxiv.org/abs/1904.09925)

~~~
nl
Nice. I guess I'm 8 days behind on the SOTA...

But I'd note that it is built on top of a CNN base (ResNet or RetinaNet) and
that the attention-only system performed slightly worse than the one including
the CNN layers.

Also, this isn't really a Transformer architecture, even though it uses
Attention.

But maybe this is too much nitpicking? I agree that Attention is a useful
primitive - my point is that the Transformer architecture is too specific.

(Also, this is a really nice paper in that it lays out the hyperparameters and
training schedules they used. And that Appendix is amazing!)

------
obviuosly
> The resulting model maps raw sequences to representations of biological
> properties without labels or prior domain knowledge.

A couple of questions:

1. What are those representations?

2. Also what is "biological function"?

3. What kind of information does the learned representation extract that is
not already in the "biological properties" it is trained to map to?

------
tepal
This blog post seems to anticipate this happening:
[https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/](https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/)

~~~
dnautics
> It does a surprisingly good job of predicting protein function across a
> diverse set of tasks, including ones structural in nature, like the
> induction of a single neuron that is able, with some degree of accuracy (ρ =
> 0.33) to distinguish between α helices and β strands (I suspect the network
> as a whole is far more performant at this task than the single neuron we’ve
> identified, but we didn’t push this aspect of the analysis as the problem is
> well tackled using specialized approaches.)

I hate to be that guy, but distinguishing between alpha helices and beta
strands is not really that hard.

It's a good start though. I would propose the following test: see if we can
use the activations from the neurons to predict the luminosity of a 'base' GFP
molecule (under a fixed set of experimental conditions). Train on 10,000
mutations (this could maybe be done at very high throughput by tethering the
XNA to a bead, synthesizing, and then measuring the beads one by one), and see
if it can extrapolate the effects of 10k more, or heck, just do it by brute
force; we've got high-throughput robots, right?
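
A rough sketch of what that probe could look like, assuming you already have
one embedding per mutant from the pretrained model and a measured brightness
value for each (the file names and shapes here are hypothetical):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Hypothetical inputs: one embedding vector and one measured brightness
    # per GFP mutant, all under fixed experimental conditions.
    embeddings = np.load("gfp_mutant_embeddings.npy")   # shape (20000, d)
    brightness = np.load("gfp_mutant_brightness.npy")   # shape (20000,)

    # Train on 10k mutants, test extrapolation on the other 10k.
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, brightness, test_size=0.5, random_state=0)

    probe = Ridge(alpha=1.0).fit(X_train, y_train)
    print("held-out R^2:", probe.score(X_test, y_test))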

~~~
jostmey
And predicting protein function is not that hard either. The ground-truth
labels are often determined by sequence-alignment similarity, not by
experiment, so the results are far from profound.

~~~
jerven
Doing it right is quite hard. Doing it usefully is even harder [1]. Getting a
good training set without too many biases is the really hard part. Generating
a ground truth that is actually a truth is very expensive.

I have to read the paper carefully again, but for the contact-point prediction
I think the training set will cover most of the data used in the validation,
due to the way PDB "sequences" are distributed over UniParc as well as how PDB
3D structures are generated experimentally. I.e. there are 120,000 PDB-related
sequences in UniParc, but they cover 45,000 entries in UniProtKB, because
PDB-derived sequences are rarely full length, often mutated, and highly
duplicative in coverage.

[1] Predicting the root GO terms will give you an insane TP/FP rate but is
completely useless.

------
superfx
See also:
[https://www.biorxiv.org/content/10.1101/589333v1](https://www.biorxiv.org/content/10.1101/589333v1)

------
shpongled
This is cool, but would be significantly cooler if they did some kind of
biological follow up. Perhaps getting their model to output an "ideal"
sequence for a desired enzymatic function and then swapping that domain into
an existing protein lacking the new function.

~~~
inciampati
Bingo. That would be really interesting. And useful.

There are probably already enzymes in this data set that have measurements of
their behavior. Could this modelling approach be coaxed to find the one with
the highest processivity? Or do we need more labeled data?

~~~
shpongled
I'm sure they have a bunch of enzymes in their dataset for which kinetic
measurements have been published. Another interesting follow-up study would be
attempting to improve kinetic behavior. They could, for instance, analyze some
of the catalytically perfect enzymes out there (TIM, SOD, catalase, etc.) and
see if the model could project improvements onto existing orthogonal protein
classes.

~~~
jerven
Not in a structured way that is easily usable. Swiss-Prot has most of this
data, but it is not quite normalized in units. If you did this annoying work,
I would like to talk to you so we can plug it into Swiss-Prot.

------
lucidrains
Language, music, and now amino acid sequences. Attention is all you need.

~~~
mfatica
I would say you also need a fair bit of data too...

~~~
return1
and transformers

~~~
nl
The _Attention Is All You Need_ paper is where Transformers were introduced:

 _We propose a new simple network architecture, the Transformer, based solely
on attention mechanisms, dispensing with recurrence and convolutions
entirely._

~~~
return1
Yup, and the state-of-the-art BERT and GPT-2 are both based on Transformers.

------
a_bonobo
Here's a very cool GitHub repository which uses unsupervised learning (ULMFiT)
in the genomics space:
[https://github.com/kheyer/Genomic-ULMFiT](https://github.com/kheyer/Genomic-ULMFiT)

Very impressive accuracies on hard tasks, and it's open source!

------
cellular
I find these emergent behaviours fascinating:
[https://youtu.be/gaFKqOBTj9w](https://youtu.be/gaFKqOBTj9w)

~~~
jakeogh
FPGAs do interesting things when allowed to exploit side channels/analog
effects:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.9691&rep=rep1&type=pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.9691&rep=rep1&type=pdf)

