
Structural Propensity Database of Proteins - ktamiola
https://doi.org/10.1101/144840
======
jfarlow
Interesting approach to getting at some dark spaces in our current
understanding. What kind of computation time does it take to run against a
single amino acid chain? Can you get much (nice disorder predictions?) out of
an amino acid sequence alone?

Our protein design work comes from a different angle, but we've had some
interesting thoughts about using (well, dreaming up) technologies like yours
to quickly make up a lot of ground between structural prediction and
empirical functional design.

Starting to see such 21st century data standards for bio data! Yay for
everyone!

~~~
ktamiola
Thank you for the very kind comment. We are now finishing a predictor that
uses protein propensity data for mass-scale disorder and order predictions.

The training times obviously depend on the network architecture, software and
hardware. I can safely say you can process 7200+ protein sequences with an
average sequence length of 120 amino acids in 2 h on 2 x NVIDIA Titan XP.
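
For a rough sense of scale, the quoted figures can be turned into throughput
numbers. This is only arithmetic on the numbers above; it assumes "process"
means one complete run over the whole set and that the load splits evenly
across both GPUs, neither of which the comment specifies.

```python
# Figures quoted in the comment (7200+ is treated as exactly 7200 here).
n_sequences = 7200
avg_length = 120      # amino acids per sequence, on average
hours = 2
n_gpus = 2            # assumption: work is split evenly across both cards

total_residues = n_sequences * avg_length
seconds = hours * 3600

seqs_per_second = n_sequences / seconds
residues_per_second = total_residues / seconds
residues_per_gpu_second = residues_per_second / n_gpus

print(f"total residues:         {total_residues}")
print(f"sequences per second:   {seqs_per_second:.2f}")
print(f"residues per second:    {residues_per_second:.1f}")
print(f"residues per GPU per s: {residues_per_gpu_second:.1f}")
```

So the claim amounts to roughly one 120-residue sequence per second across
the two cards, under those assumptions.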

------
cing
Nice work lowering the entry barrier for machine learning in this space, which
appears to be the aim of your company, but it's a bit of a tease to claim your
data representation is great for supporting ML and stop short of doing any of
that in the manuscript. I take it that's the next step?

On that note, is there any reason why the propensity classes alpha/beta/coil
are still so widely used? Especially coil/turn/"other". It seems to me that
these are ancient artifacts of structural biology that could definitely
restrict human understanding of protein dynamics. Perhaps there is nuance in
the chemical shift data that may help the design of better structural classes
using ML.

~~~
chiggins
All models are wrong but some are useful. My PhD was in empirical protein
dynamics (solution NMR, CD, DSF, etc.) and the long and short of it is that
disordered states are particularly difficult to distinguish from one another.
When you consider that an even partially disordered ensemble has essentially
an infinite number of nearly degenerate conformations inter-converting on
timescales ranging from picoseconds to milliseconds, lumping them into
coil/turn/other turns out to be just the law of large numbers in action
(reversion to the mean, etc.).
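
The averaging argument can be illustrated with a toy simulation. This is a
sketch under loud assumptions, not a model of any real protein: each conformer
gets a hypothetical secondary chemical shift drawn from a zero-mean Gaussian,
all conformers are equally populated, and exchange is fast enough that the
observed shift is the plain average. The function name and the 1 ppm spread
are made up for illustration.

```python
import random
import statistics

def observed_shift(n_conformers, rng):
    """Average shift (ppm) seen for one residue under fast exchange.

    Assumption: per-conformer secondary shifts are independent draws
    from a Gaussian with mean 0 and standard deviation 1 ppm.
    """
    shifts = [rng.gauss(0.0, 1.0) for _ in range(n_conformers)]
    return statistics.fmean(shifts)

rng = random.Random(0)  # fixed seed so the run is repeatable

# For each ensemble size, simulate many residues and measure how much
# their observed (averaged) shifts still differ from one another.
spread = {}
for n in (1, 10, 100, 1000):
    samples = [observed_shift(n, rng) for _ in range(500)]
    spread[n] = statistics.pstdev(samples)

for n, s in spread.items():
    print(f"{n:>5} conformers -> spread of observed shift: {s:.3f} ppm")
```

The spread shrinks roughly as one over the square root of the number of
conformers, so residues in large ensembles all end up looking alike, which is
the sense in which they get lumped into one coil-like class.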

That and the biophysical properties conferred by partially disordered proteins
make them a motherfucker to work with outside of some archetypal domains. I
liked to explain it like this. Imagine you have a piece of string three feet
long. Along the length of that string you have ~1 inch segments consisting of
velcro (both kinds), zippers, magnets, balloons, strawberry jello, and
marshmallows--all randomly distributed along the length. Now try to fold all
of that up so the jello, velcro, and balloons are on the inside. That's a
simple model of a protein. Now make it start opening and closing. Now put 5 of
them next to each other.

~~~
cing
The traditional structural classifications just have very low information
content in the context of protein dynamics. Coil especially. You've given the
example of a disordered region interconverting on different timescales, but
these timescales can, purportedly, be predicted from chemical shift data, etc.
[1], so why not call it "fast coil" or "slow coil"? It's not only about
timescales either, because you may need to do extra experiments for that data.
It's about finding the highest information content descriptors for an amino
acid and using machine learning to do it. Your descriptors (jello, velcro,
balloons) are actually much better at conveying dynamics than the static
descriptors used in structural biology.

[1]
[http://pubs.acs.org/doi/abs/10.1021/ct501085y](http://pubs.acs.org/doi/abs/10.1021/ct501085y)

~~~
ktamiola
Excellent reference!

------
tiplus
If I understand correctly, the product is the database and its TensorFlow API?

I am wondering how this compares to the TALOS-N [1] server from Ad Bax (NIH)
with 9000+ proteins in its DB? This, too, uses machine learning to 'fit' a
predictor for secondary structure (dihedral angles) for backbone and side
chain torsions based on chemical shifts.

[1]
[https://spin.niddk.nih.gov/bax/software/TALOS-N/](https://spin.niddk.nih.gov/bax/software/TALOS-N/)

~~~
ktamiola
Yes, the product right now is the Keras and TensorFlow database integration
plus the interactive database interface. The surrounding tools are currently
under stringent testing.

