Ask HN: Can we write a programming language for biology?
113 points by hsikka 47 days ago | 51 comments
I’m specifically talking about an abstracted language that compiles down to nucleic acids. I know projects like Cello exist to generate plasmids and genetic circuits, but do we need ML? Do we know any deterministic rules?

The demon lurking in the corner here is the a priori assumption that we know enough about nucleic acids to be confident in our knowledge of how they'll behave once compiled.

I.e., I write some code, hit F5, and yeast comes out the other side of the compiler-machine. How confident can I be while debugging that the changes in behavior between the compiled output (yeast) and my code are bugs versus as-yet-unknown behaviors of complex molecules bumping into each other in a very crowded bag of other complex molecules sitting in a dish somewhere? As of right now, and I posit, as a biological researcher, for a very long time to come, the answer is "A lot less than 100%". Simple stuff like gene-toggle switches [0] has been known, and has pretty much worked, since the late 90s, but scaling is deeply nonlinear.

This doesn't mean that synbio is useless, but it means that the error bars are a lot higher than they are when you're running code on printed circuits and can have a high degree of confidence that your computational substrate actually does what you think it does.

[0]: https://www.nature.com/articles/35002131
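To make the toggle switch in [0] concrete: it is usually described as two genes that repress each other, which gives two stable states. Below is a minimal sketch, assuming the dimensionless two-repressor model from that paper; the parameter values, step size, and function name are illustrative assumptions, not measured values.

```python
def simulate_toggle(u0, v0, alpha1=10.0, alpha2=10.0, beta=2.0, gamma=2.0,
                    dt=0.01, steps=5000):
    """Forward-Euler integration of two mutually repressing genes.

    u and v are the (dimensionless) repressor concentrations:
        du/dt = alpha1 / (1 + v**beta)  - u
        dv/dt = alpha2 / (1 + u**gamma) - v
    All parameter defaults are illustrative assumptions.
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha1 / (1.0 + v ** beta) - u
        dv = alpha2 / (1.0 + u ** gamma) - v
        u += dt * du
        v += dt * dv
    return u, v


# Same circuit, two different starting points: each settles into a
# different stable state, which is what makes it a "switch".
state_a = simulate_toggle(u0=5.0, v0=0.1)
state_b = simulate_toggle(u0=0.1, v0=5.0)
print("started u-high:", state_a)
print("started v-high:", state_b)
```

Even this two-gene model already depends on parameters that have to be measured in the actual cell, which is part of why scaling up gets hard.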

A great example of this is Green Fluorescent Protein (GFP), one of the most important and foundational fluorescent proteins for biology research.

The fluorescent molecule at the center is not actually a nucleic acid; it forms several hours after the protein has folded, through a secondary reaction in which some of the atoms at the center of the protein find a lower energy state. This seems to me (not a protein engineer) to be virtually impossible to design a priori.

Except this exact thing was accomplished and published a few months ago [1, 2]. The Baker lab at the University of Washington designed a synthetic fluorescent protein, de novo.

[1] Meta-Meta-Article: https://www.bakerlab.org/index.php/2019/01/02/nature-article...

[2] Actual paper: https://www.nature.com/articles/s41586-018-0509-0

eh, it still took ~56 real-world attempts to get two functional designs. That's not even quantifying how strongly they fluoresce, and other qualitative/quantitative aspects (eg, how good of a GFP were these proteins). It's very impressive, and the Baker lab has done amazing stuff, but we are a long ways off from rational design.
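To put numbers on that, here is the design funnel as plain arithmetic, using only the counts given in the paper excerpt quoted below (56 designs, 38 soluble, 20 monomeric, 2 functional). The stage labels are paraphrases.

```python
# Counts taken from the paper excerpt quoted in this comment.
funnel = [
    ("designs ordered", 56),
    ("well-expressed and soluble", 38),
    ("monomeric beta-sheet", 20),
    ("activate DFHBI fluorescence", 2),
]

total = funnel[0][1]
# Fraction of the original 56 designs surviving each stage.
rates = {stage: count / total for stage, count in funnel}

for stage, count in funnel:
    print(f"{stage:30s} {count:3d}  {rates[stage]:6.1%}")
```

The last row works out to 2/56, or roughly 3.6% of the designs that were actually synthesized.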

From the paper: "Synthetic genes encoding the 56 designs were obtained and the proteins were expressed in E. coli. Thirty-eight of the proteins were well-expressed and soluble; SEC and far-ultraviolet CD spectroscopy showed that 20 were monomeric β-sheet proteins (Supplementary Table 3). Four of the oligomer-forming designs became monomeric upon incorporation of a disulfide bond between the N-terminal 3–10 helix and the barrel β-strands. The crystal structure of one of the monomeric designs (b10) was solved to 2.1 Å, and was found to be very close to the design model (0.57 Å backbone r.m.s.d., Fig. 3c)."

"Two of the 20 monomeric designs—b11 and b32—were found to activate DFHBI fluorescence by 12- and 8-fold with binding dissociation constants (KD) values of 12.8 and 49.8 μM, respectively (Extended Data Fig. 6f)"

Nice! More recently (Jan 2019) from the same group:

De novo Design of Potent and Selective Mimics of IL-2 and IL-15: https://www.nature.com/articles/s41586-018-0830-7

Lab's accompanying media post: https://www.bakerlab.org/index.php/2019/01/09/potent-anti-ca...

" Potent anti-cancer proteins with fewer side effects. Today we report in Nature the first de novo designed proteins with anti-cancer activity. These compact molecules were designed to stimulate the same receptors as IL-2, a powerful immunotherapeutic drug, while avoiding unwanted off-target receptor interactions. ..."

A remarkable and powerful achievement.

Incredibly cool! Even if this took lots of iteration, to me this represents an important step (if my understanding of the details is correct).

As a side note, this is why I appreciate HN, never would have seen this paper otherwise.

This is an awesome detailed response from an informed perspective, thank you! I’ll check out the article.

Do you think that despite the high error resulting from the sheer complexity of those interactions, design tools would be useful to researchers? Or is it mostly just long shot proof of concepts with no practical purpose right now?

Design tools are absolutely useful to researchers. However, the economics of building/selling software to researchers don't work. The state of the art DNA design app that is the gold-standard for every molecular biologist is not much changed over the past 15 years [1]. And it's free.

[1] http://jorgensen.biology.utah.edu/wayned/ape/

One could argue that the state of the art is now Benchling. But I have similar thoughts on the economics of building software for researchers.

The domain is complex, the customers require a high level of correctness, and features are often implemented to try things out in the lab and then abandoned once it's a dead-end in the real world. Oh, and most academics don't have money to pay for software, nor do they respect the software engineering process.

I'm in cell biology, and many of the programs used in research these days are run from the command line and are open source and freely available. I don't even have a desktop powerful enough to work with genomic sequencing data; I need to ssh to a supercomputer. There's a lot of documentation, so it's really not hard to go from understanding Unix command line syntax to writing short scripts, then full-on Python or Perl, with biological examples every step of the way. There are also a lot of very powerful web tools, such as the UCSC Genome Browser, which are freely available to anyone.

Commercial gui-based software is usually sold along with proprietary equipment like a microscope or something, and universally loathed even by people who managed to wrap their heads around it. It's much easier to work with different software from different people or groups if it's all the same syntax, rather than relearning clunky proprietary gui after clunky proprietary gui.

Well, if you were creating the language you would have to test that the expected output was repeatable. And presumably whoever was creating the language would advise end users not to use it in production until the 1.0 version. Basically you would wait until someone smarter than you or me posts it on Hacker News and says it's production ready.

You may be interested in the NSF-funded Molecular Programming Project -- http://molecular-programming.org/. Projects listed there include "gro: the cell programming language" and "DNA origami" -- designing synthetic DNA that folds together into recognizable 2D and 3D shapes.

I also encourage you to look at the Qian Lab. Lulu has created neural networks out of DNA -- https://www.nature.com/articles/nature10262. Her research is entirely about programming molecular systems -- http://www.qianlab.caltech.edu/research.html.

I was fortunate enough to collaborate with Lulu and with Erik Winfree's DNA computation lab at Caltech a while back.

My work was about building switching circuits using DNA. We designed DNA sequences that would bind together probabilistically. Using these, we created "probabilistic switches" -- analogous to Shannon's original on/off switches. Then, we used some earlier work of mine to design (and build!) circuits of "pswitches" that realize certain probability distributions.

We published an open-access paper last year here -- https://www.pnas.org/content/115/5/903. I suggest looking at Figure 1 to get a good idea how it works.

The semantics of such a language would be the primary challenge. If you want to be able to compile a statement like "implement a protein that cleaves proinsulin to produce insulin" there is SO much context that the real challenge is not in managing the compilation to ACTG, but the surrounding environment.

At what temperatures? In what pH range? In what species? With what codon bias? When should it be expressed? At what rate should it mutate? Do you want any other nearby 'functions' from mutation? Is this a membrane protein? Do you want restriction sites? Is this for insertion into a plasmid or for insertion directly into a genome? What bacterial cell line will you use to maintain the plasmid? If you are expressing the protein for purification what cell line or bacterial strain will you be amplifying? Do you want a single sequence that will work for all of these or are you willing to compile a different version for each combination? etc. etc. etc. The list goes on and on.

The sequence you compile to will depend on those environmental parameters and the semantics of that 'environmental ISA' will likely be highly specific for many high level descriptions. I imagine you could produce sequences that were more robust, but the simulation time required to generate and validate them would grow accordingly.

All of this not even mentioning the fact that you also absolutely must specify _all_ the things it should not do, such as cleave a bunch of other proteins, or bind non-specifically and form aggregates in the cytoplasm, etc. A language level list of defaults here would certainly be a requirement, and that means the space that you will be optimizing in is absolutely massive. I don't even want to imagine how slow it would be. Probably faster to synthesize a bunch of variants and test them all in real cells.

The simpler case of taking an NCBIGene identifier and smashing it into an Addgene plasmid identifier and setting codon biases and optimizing for expression is a much more manageable task, and would probably be a building block for the more complex version.
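As a rough illustration of that simpler building block, here is a sketch of back-translating a protein into DNA with a per-host codon preference and one forbidden restriction site. The host names and the preference orderings are hypothetical placeholders (real tables come from measured codon-usage data and cover all 20 amino acids), and the greedy strategy is an assumption chosen for brevity, not how any particular tool works.

```python
# Hypothetical codon preferences, most preferred first. The codons are
# valid standard-code codons, but the orderings are made up.
PREFERRED_CODONS = {
    "host_a": {"M": ["ATG"], "K": ["AAA", "AAG"], "E": ["GAA", "GAG"],
               "F": ["TTC", "TTT"], "G": ["GGC", "GGT"], "*": ["TAA", "TGA"]},
    "host_b": {"M": ["ATG"], "K": ["AAG", "AAA"], "E": ["GAG", "GAA"],
               "F": ["TTT", "TTC"], "G": ["GGT", "GGC"], "*": ["TGA", "TAA"]},
}


def back_translate(protein, host, forbidden_sites=("GAATTC",)):
    """Greedy back-translation: take the most preferred codon that does
    not create a forbidden site (default: the EcoRI site GAATTC)."""
    table = PREFERRED_CODONS[host]
    dna = ""
    for aa in protein + "*":          # append a stop codon at the end
        for codon in table[aa]:
            candidate = dna + codon
            if not any(site in candidate for site in forbidden_sites):
                dna = candidate
                break
        else:
            raise ValueError(f"no allowed codon for {aa!r} in {host}")
    return dna


# Same protein, different 'target environment', different output.
for host in PREFERRED_CODONS:
    print(host, back_translate("MKEFG", host))
```

In `host_a` the preferred codons for E and F would spell out GAATTC, so the sketch falls back to the second choice for F. That is one environmental constraint out of the long list above, which gives a feel for how quickly the search space grows.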

Great considerations. Also what chaperone proteins are necessary for folding

Do you have suggestions on learning the basics of all that you discussed here? I know nothing about it but it sounds fascinating.

If you search on "protein folding" you should be able to find some basics.

That was my interest when I started my career, but didn’t do much on it.

Bio-informatics deals with the stuff you mentioned in-silico. There’re a whole bunch of libraries for many different languages, but I’ve not come across any specific language for bio-informatics.

You might find https://youtu.be/8X69_42Mj-g interesting, they developed a DSL in common lisp to generate C++ code using LLVM.

Hey thanks, I’ll check out the video! That seems super cool

Microsoft Research Cambridge has been doing this for a long while.

"It turns out that there are lots of similarities between modelling concurrent systems and biological systems. Just like a computer, biological systems perform information processing, which determines how they grow, reproduce and survive in a hostile environment. Understanding this biological information processing is key to our understanding of life itself.

It’s probably easier to understand some of the output of this work – specifically the Stochastic Pi Machine, or SPiM as it’s often referred to. SPiM is a programming language for designing and simulating models of biological processes. The language features a simple, graphical notation for modelling a range of biological systems – meaning a biologist does not have to write code to create a model, they just draw pictures.

You can think of SPiM as a visual programming language for biology. In addition, SPiM can be used to model large systems incrementally, by directly composing simpler models of subsystems. Historically, the field of biology has struggled with systems so complex they become unwieldy to analyse. The modular approach that is often used in computer programming is directly applicable to this challenge."
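To give a feel for what "simulating a model of a biological process" means at the smallest scale, here is a minimal sketch. It is not SPiM syntax and does not use the pi-calculus; it swaps in a plain Gillespie-style stochastic simulation of one molecule species that is produced at a constant rate and degraded in proportion to its count. The rate constants and names are illustrative assumptions.

```python
import random


def gillespie_birth_death(k_make, k_decay, t_end, seed=0):
    """Stochastic simulation of:  0 -> X  (rate k_make)
                                  X -> 0  (rate k_decay * n)
    Returns a list of (time, molecule_count) pairs."""
    rng = random.Random(seed)
    t, n = 0.0, 0
    trajectory = [(t, n)]
    while True:
        rate_make = k_make
        rate_decay = k_decay * n
        total = rate_make + rate_decay
        # Waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Pick which reaction fired, in proportion to its rate.
        if rng.random() * total < rate_make:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


trajectory = gillespie_birth_death(k_make=10.0, k_decay=0.1, t_end=200.0)
print("events:", len(trajectory) - 1, "final count:", trajectory[-1][1])
```

The count fluctuates around k_make / k_decay rather than sitting at a fixed value, which is the kind of behavior deterministic models miss at low molecule numbers.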



Read this with an open mind: consider a dialect of Lisp. There is the saying “Lisp is not the language that will solve your problem, but rather, Lisp is the language that will allow you to create the language that will solve your problem.” Perhaps what you need is a biology DSL. Lisp is excellent for creating DSLs.

This seems to roughly be what this Schafmeister fellow is doing with clasp, no?

This is a really interesting perspective, another HNer said something similar. I’ll spend some time noodling on this.

There is already a complete Lisp tool being implemented, called Clasp, that aims to take advantage of C++ libraries. You can also take a look at some major Clisp/Scheme libraries like biolisp (I think that's the name). I'd never do a group project in Common Lisp unless the project involved either a finite-state machine or a DSL (rewriting a Prolog interpreter in Lisp is a breeze), and in this case the language fits perfectly, imho.

Yes, that's possible and a subject of active research. Have a look at Luca Cardelli's website [1]. He's probably the leading researcher in this field. He's got quite a few talks online e.g. [2].

[1] http://lucacardelli.name/

[2] https://www.youtube.com/watch?v=o8q7kFeGUTM

Awesome, I hadn’t come across Luca Cardelli’s work before, I’ll check this out! Thank you :)

Computer programming languages were developed to overcome technical challenges when people had a very hard time writing pure 0s and 1s to code CPU instructions.

They knew what those instructions will do and they had 100% certainty about it.

This is not the case in biology, however. People still struggle to understand how proteins interact on a molecular level.

A few years ago I took part in a project that aimed to engineer a new, synthetic metabolic pathway in yeast. A key enzyme in this pathway had to be introduced from an exotic bacterium. But no matter how much we tried, the enzyme worked very poorly (two orders of magnitude slower) when introduced to cerevisiae.

The problems we have in biology today aren't technical, they are still fundamental... Creating a programming language dedicated for wiring genetic circuits is nice, but won't be a game changer.

Does it have to be nucleic acids? You can probably get a dataflow-like language using concentrations of various enzymes and substrates. And there are pre-existing models for molecular pathways that may help figure out the "key" for such a code. (In fact, IIRC analysis of metabolite concentration variation tends to be modeled using systems of differential equations, so you could use neural nets to find optimal input states for simple or well characterized networks.)
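As a small example of the differential-equation style of model mentioned above, here is a sketch of a single enzymatic step converting substrate S to product P with Michaelis-Menten kinetics, integrated with forward Euler. The parameter values are illustrative assumptions, and a real pathway model would chain many such terms together.

```python
def simulate_conversion(s0, vmax, km, dt=0.01, steps=2000):
    """Integrate dS/dt = -vmax * S / (km + S), dP/dt = +vmax * S / (km + S).

    Returns a list of (time, substrate, product) tuples.
    """
    s, p = s0, 0.0
    history = [(0.0, s, p)]
    for i in range(1, steps + 1):
        rate = vmax * s / (km + s)   # Michaelis-Menten rate law
        s -= rate * dt
        p += rate * dt
        history.append((i * dt, s, p))
    return history


history = simulate_conversion(s0=5.0, vmax=1.0, km=0.5)
t_end, s_end, p_end = history[-1]
print(f"t={t_end:.1f}  S={s_end:.4f}  P={p_end:.4f}")
```

In a dataflow reading, the concentrations are the "values" and each rate law is an "operation" connecting them.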

There are some problems with using nucleic acids, chief among which is that the central dogma of molecular biology isn't nearly as true (or as central) as we used to think. Don't know as much about RNA -- there certainly are RNA "machines" where the nucleic acid itself has a function and maybe those could be coded for specific functionality. RNA isn't as stable though (because it's single stranded and RNAses are everywhere) so it sounds hard to deal with experimentally, but I'm no expert.

The closest thing to this is probably DNA origami for nanoscale structures.

Beyond that, we're not quite at the place where we can compute desired phenotype -> genetic code, except for when single enzymatic reactions are all you are doing. Going bigger requires understanding the nonlinear complexities of the system in much deeper detail.

We can and do go from phenotype -> genetic code as long as the phenotypic ask is something we have some familiarity with. There are enough examples out there that even though the toolbox is still small, there's plenty of new things we can rationally design - even if our success rate is not 100%.

Transcription Factor + GFP -> green transcription factor -> codon-optimized plasmid

Membrane tag + light-activated conformational change + cytoskeleton -> optogenetic mechanotransduction

Small-molecule-activated degradation domain + transcription factor -> chemically-induced genetic switch.

And though there's still a bit of trial and error to debug which/how those components go together, it's a common enough thing to be reasonably accessible.
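A toy sketch of what that kind of composition looks like in code: named parts are joined in-frame with a linker between a start and a stop codon. The part sequences are placeholders, not real domains (each is just three arbitrary codons), and the names and the `fuse` helper are hypothetical.

```python
# Placeholder coding sequences, NOT real parts: each is a short in-frame
# stretch of codons standing in for a full protein domain.
PARTS = {
    "transcription_factor": "GCTAAAGAA",
    "gfp": "GGTTTTAAA",
    "linker": "GGTGGTAGC",   # Gly-Gly-Ser, a typical flexible linker motif
}


def fuse(*names, linker="linker"):
    """Join parts in-frame: start codon, parts separated by a linker,
    then a stop codon."""
    for name in names + (linker,):
        if len(PARTS[name]) % 3 != 0:
            raise ValueError(f"part {name!r} is not a whole number of codons")
    body = PARTS[linker].join(PARTS[name] for name in names)
    return "ATG" + body + "TAA"


# "Transcription factor + GFP -> green transcription factor"
green_tf = fuse("transcription_factor", "gfp")
print(green_tf, len(green_tf))
```

The string handling is the easy part. The trial and error mentioned above is in finding out which order, which linker, and which host actually give a working protein.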

Do you work in this field? It's one of the most fascinating ones I've seen.

We do this at Serotiny [1]. And others are working on this as well for different specific purposes. We use some very basic ML (the data is very sparse, wide, but any hit above a low baseline is very valuable).

The curious part, from our perspective is that biology has massive surface area - and the surface area is 3D. It not only scales between species/functions, but it also scales up and down, from atoms to organs. And the expertise/abstraction layers that work at one scale become complicated if you try too hard to account for all variation at a different scale. HIV's genome is a backwards, upside-down, mirrored, fugue of an engineering design that uses exotic molecules, exotic regulation, exotic proteins, and exotic physics. We're starting at a different place, just trying to write very simple scales.

In our case, we've chosen a single size scale to work with - proteins, but are wide enough to look across every species & discipline to understand those proteins as common tools. We compile all of our designs for a particular function that we're interested in, down to DNA, literally. And finding the niche where we do not have to deal with all of the DNA-regulation, or cellular regulation, or tissue synthesis, etc. allows us to expand and build in complexity at the protein level - while keeping other parts of biology constant. And that also allows us to interact and work with others who are working at different scales.

And there are others that build in complexity at other biological levels (gene regulation, pathway flux, etc.). Companies like Asimov [2] are involved in similar work at some of those abstraction layers. The open-source design language, SBOL is an attempt to standardize the DNA layer [3]. And this contributes to the challenge in that a lot of people/companies/labs have projects to build an abstraction layer that compiles down to DNA - but they might be talking past each other and be doing separate projects.

We've built an entire API of 'high-level' commands at an abstraction layer above DNA, where the output compiles down to literally, a JSON file specifying the DNA sequence to be manufactured by a 3rd party, as well as human-level citations to enable turning the new designs into intellectual property.

There is still a LOT of data missing, and there's a lot of empirical work to do - and you need to keep your compiling system constant enough that when you make changes at your abstraction layer and hit a roadblock, you know it's because of a change you made, and not just a bug in the system.

[1] https://serotiny.bio

[2] https://www.asimov.io/

[3] http://sbolstandard.org/

Wow very cool. I’m actually a graduate student at Harvard getting my master’s in biology and CS, so this is right up my alley. I’m going to dive deeper into Serotiny and check out some of the awesome work you guys are doing. Do you offer internships for folks like me?

I’d heard about Asimov from their original work on Cello at MIT, but SBOL is news to me. I’ll check it out as well.

It sounds like this space is becoming pretty competitive, which is interesting.

I think the fun part is that it's not necessarily super competitive, yet. A lot of these tools are still complementary. But they're also not quite integrated yet either.

Always looking for good work. The field is growing rapidly right now. You've got a good combination of talents to help.

I'm not OP, but what about the realm outside of rational protein design? DNA base pairing rules are pretty well understood and we should be able to build useful tools using them. Is there any work out there using only base pairing for computation?

Yep - and because DNA's base pairing rules are so well-studied, so predictable, and information-carrying, we can use DNA for its material properties in addition to or even separate from its genetic properties. In terms of software, Shawn Douglas built CADNano [1] - software to do precisely that. By using DNA as a material it can be useful in its own right - with all sorts of interesting 3D structures, 3D logic, and with atomic precision, built into the encoded base pairs. But these structures generally do not interact with DNA at a genetic level in an organism.
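The base pairing rules themselves fit in a few lines. Here is a minimal sketch of Watson-Crick complementarity and of locating where a short strand would bind on a longer one; it considers exact matches only and ignores thermodynamics, and the sequences and function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Partner strand, written 5'->3' like the input."""
    return strand.translate(COMPLEMENT)[::-1]


def binding_site(scaffold, staple):
    """Index on the scaffold where the staple pairs perfectly,
    or -1 if there is no exact complementary stretch."""
    return scaffold.find(reverse_complement(staple))


scaffold = "ATGCGTACCGGATTAGC"      # made-up long strand
staple = "AATCCGG"                  # made-up short strand

print(reverse_complement(staple))   # the stretch the staple pairs with
print(binding_site(scaffold, staple))
```

Design tools for DNA structures build on exactly this predictability, choosing many short strands so that each one pairs with the intended stretches of a long strand.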

In terms of protein design at that atomic level, the computation traditionally has relied on knowing or guessing at the structure (atomic arrangement) of the protein. And without that, there's not much to do (that's where our work picks up). A lot of that kind of protein design computation work is being done with software like Rosetta [2].

[1] https://cadnano.org/

[2] https://www.rosettacommons.org/

Pray tell, what exotic physics is involved?

For instance, though not specific to HIV, and not exotic from a physicist's perspective, but certainly exotic in terms of taking physics into account at a different abstraction layer when compiling down to DNA:

DNA seems to be able to detect lesions and mis-matches based on conductivity of electrons down the double-strands themselves [1].

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2741234/

Ah yes, my advisor used to work on some of this stuff about 10-15 years ago around the same time as her.

Vaguely related but check out http://ceqea.sourceforge.net/, paper: https://arxiv.org/abs/1811.02478

This is a very interesting area of research, which I'm personally fascinated by. There are several in active development.

Check out:


For one example.

I know this might be a bit off-topic, but what is the current de facto standard either open-source or paid for drug design/discovery?

OP here, the only open source drug discovery library/framework I’ve seen is DeepChem https://github.com/deepchem/deepchem

Another question is: could we beat (progress in) Quantum Computing by exploiting massive parallelism in biological computing?

I’m highly interested in this, and in the capacity of biocomputing in general!

you are thinking the wrong way...it's the embryology sequence of genes that turn on and off combined with dna from other organelles such as the mitochondria

not one of the end products nucleic acid by itself...this is why creating a dinosaur from blood, etc is such a fictional dream

there's this language called gro that i'm hoping to look into when i get some time - http://depts.washington.edu/soslab/gro/

Just don't code an AI in it...

Despite an oft-cited testimonial to the contrary, the Word was nowhere to be found in the Beginning.

Sure you could devise something to take written instructions and produce peptides according to it, but it won't be a "programming language for Biology".

Is the genome not the programming language of God, and the junk DNA his inline comments?

If so, it is fit only for Him, and it behooves us not to try to write in it.

"Language" and "code" are shitty metaphors for what DNA does in living organisms, and it misses almost everything about what's needed to go from the molecule complex to the phenotype. Divine beings (and possibly Real Programmers, who mastered the use and internals of C-x M-c M-butterfly [https://xkcd.com/378/]) may get away with doing their work despite thinking about it that way. Everyone else need not apply.

Incidentally a good deal of the "junk" DNA turns out not to be.
