Structural Propensity Database of Proteins (doi.org)
49 points by ktamiola 175 days ago | 12 comments

Interesting approach to getting at some dark spaces in our current understanding. What kind of computation time does it take to run against a single amino acid chain? Can you get much (nice disorder predictions?) out of an amino acid sequence alone?

Our protein design work comes from a different angle, but we've had some interesting thoughts about using (well, dreaming up) technologies like yours to quickly make up a lot of ground between structural prediction and empirical functional design.

Starting to see such 21st century data standards for bio data! Yay for everyone!

Thank you for the very kind comment. We are now finishing a predictor that uses protein propensity data for mass-scale disorder and order predictions.

The training times obviously vary with the network architecture, software, and hardware. I can safely say you can process 7200+ protein sequences with an average length of 120 amino acids in 2 h on 2 x NVIDIA Titan XP.
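For a rough sense of scale, the figures quoted above can be turned into a throughput estimate. This is only back-of-the-envelope arithmetic: it treats "7200+" as exactly 7200, assumes the two GPUs share the work evenly, and does not say whether the 2 h covers training, inference, or both. The variable names are illustrative.

```python
# Back-of-the-envelope throughput from the figures quoted in the comment.
n_sequences = 7200        # "7200+" in the comment, taken here as a lower bound
mean_length = 120         # average number of amino acids per sequence
wall_time_s = 2 * 3600    # 2 hours in seconds
n_gpus = 2                # 2 x NVIDIA Titan XP

total_residues = n_sequences * mean_length
seqs_per_s = n_sequences / wall_time_s
residues_per_s = total_residues / wall_time_s
# Assumption: work is split evenly across the GPUs (ideal scaling).
residues_per_s_per_gpu = residues_per_s / n_gpus

print(f"total residues:       {total_residues}")
print(f"sequences per second: {seqs_per_s:.2f}")
print(f"residues per second:  {residues_per_s:.1f}")
print(f"residues/s per GPU:   {residues_per_s_per_gpu:.1f}")
```

So the quoted numbers work out to about one sequence per second across both cards, at least as a lower bound.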

Nice work lowering the entry barrier for machine learning in this space, which appears to be the aim of your company, but it's a bit of a tease to claim your data representation is great for supporting ML and stop short of doing any of that in the manuscript. I take it that's the next step?

On that note, is there any reason why the propensity classes alpha/beta/coil are still so widely used? Especially coil/turn/"other". It seems to me that these are ancient artifacts of structural biology that could definitely restrict human understanding of protein dynamics. Perhaps there is nuance in the chemical shift data that may help the design of better structural classes using ML.

All models are wrong but some are useful. My PhD was in empirical protein dynamics (solution NMR, CD, DSF, etc.) and the long and short of it is that disordered states are particularly difficult to distinguish from one another. When you consider that an even partially disordered ensemble has essentially an infinite number of nearly degenerate conformations inter-converting on timescales ranging from picoseconds to milliseconds, lumping them into coil/turn/other turns out to be just the law of large numbers in action (reversion to the mean, etc.).

That, and the biophysical properties of partially disordered proteins make them a motherfucker to work with outside of some archetypal domains. I liked to explain it like this. Imagine you have a piece of string three feet long. Along the length of that string you have ~1 inch segments consisting of velcro (both kinds), zippers, magnets, balloons, strawberry jello, and marshmallows--all randomly distributed along the length. Now try to fold all of that up so the jello, velcro, and balloons are on the inside. That's a simple model of a protein. Now make it start opening and closing. Now put 5 of them next to each other.
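The law-of-large-numbers point can be sketched with a toy simulation. This is not a model of any real NMR observable: each "conformer" simply contributes a random value drawn uniformly from [-1, 1], and the function names and parameters are made up for illustration. The only claim is that an average over more interconverting conformers fluctuates less, so different disordered ensembles end up looking alike.

```python
import random
import statistics

def ensemble_average(n_conformers, rng):
    """Average a toy per-conformer observable over one ensemble."""
    # Assumption: each conformer contributes an independent uniform value.
    values = [rng.uniform(-1.0, 1.0) for _ in range(n_conformers)]
    return statistics.fmean(values)

def spread_of_averages(n_conformers, n_repeats=200, seed=0):
    """Standard deviation of the ensemble average across repeated ensembles."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    averages = [ensemble_average(n_conformers, rng) for _ in range(n_repeats)]
    return statistics.pstdev(averages)

for n in (10, 100, 1000):
    print(f"{n:>5} conformers: spread of averaged signal = {spread_of_averages(n):.3f}")
```

The spread shrinks roughly as one over the square root of the number of conformers, which is why the averaged signal carries so little information about any individual state.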

The traditional structural classifications just have very low information content in the context of protein dynamics. Coil especially. You've given the example of a disordered region interconverting on different timescales, but these timescales can, purportedly, be predicted from chemical shift data, etc. [1], so why not call it "fast coil" or "slow coil"? It's not only about timescales either, because you may need to do extra experiments for that data. It's about finding the highest information content descriptors for an amino acid and using machine learning to do it. Your descriptors (jello, velcro, balloons) are actually much better at conveying dynamics than the static descriptors used in structural biology.

[1] http://pubs.acs.org/doi/abs/10.1021/ct501085y

Excellent reference!

BTW, greetings from the GROMACS group in Groningen :) I happened to do my PhD in NMR and Molecular Dynamics.

Coming back to your comments about the canonical secondary structures: I couldn't agree more with you. The problem is quite simple. How are we going to convince the >90% of the structural biochemistry community to simply accept the fact that proteins are bloody dynamic, and that X-ray / eye candy structures may have quite little to do with the "real" picture at room temperature?

Crystallographic structures are very useful in determining the functions of proteins. Structural biochemists are very aware that proteins are dynamic, and I believe the concept is covered in most introductory courses.

Cing! Thank you for the very flattering comment. Obviously, this is a database-only paper. Please bear in mind that the vast majority, perhaps even >95%, of protein structure prediction methods deal with canonical secondary structure classes. We want to provide a coherent data set as a benchmark + source of information.

We have in "stock" a network (obviously another paper) that will aim at propensity prediction, still in trivial alpha/coil/beta phase space.

If I understand correctly, the product is the database and its TensorFlow API?

I am wondering how this compares to the TALOS-N [1] server from Ad Bax (NIH) with 9000+ proteins in its DB? This, too, uses machine learning to 'fit' a predictor for secondary structure (dihedral angles) for backbone and side chain torsions based on chemical shifts.

[1] https://spin.niddk.nih.gov/bax/software/TALOS-N/

And yes, the product right now is the Keras and TensorFlow database integration + the interactive database interface. The surrounding tools are currently under stringent testing.

This is an excellent question. When we have our network up and running we will compare against Ad's methodology. Ad is known for top-notch solutions.
