
Ask HN: Can we write a programming language for biology? - hsikka
I’m specifically talking about an abstracted language that compiles down to nucleic acids. I know projects like Cello exist to generate plasmids and genetic circuits, but do we need ML? Do we know any deterministic rules?
======
hprotagonist
The demon lurking in the corner here is the _a priori_ assumption that we know
enough about nucleic acids to be confident in our knowledge of how they'll
behave once compiled.

I.e., I write some code, hit F5, and yeast comes out the other side of the
compiler-machine. How confident can I be while debugging that the changes in
behavior between the compiled output (yeast) and my code are bugs versus as-
yet-unknown behaviors of complex molecules bumping into each other in a very
crowded bag of other complex molecules sitting in a dish somewhere? As of
right now, and I posit, as a biological researcher, for a very long time to
come, the answer is "A lot less than 100%". Simple stuff like gene-toggle
switches [0] have been known and pretty much work since the late 90s, but
scaling is deeply nonlinear.

This doesn't mean that synbio is useless, but it means that the error bars are
a lot higher than they are when you're running code on printed circuits and
can have a high degree of confidence that your computational substrate
actually does what you think it does.

[0]:
[https://www.nature.com/articles/35002131](https://www.nature.com/articles/35002131)
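
To make "simple stuff" concrete: the toggle switch in [0] is usually described by a two-variable model in which each repressor inhibits the other's synthesis. Below is a minimal sketch, assuming the dimensionless form du/dt = a1/(1 + v^beta) - u, dv/dt = a2/(1 + u^gamma) - v. The parameter values and function names are illustrative assumptions, not fitted to any experiment.

```python
def toggle_step(u, v, dt, a1=10.0, a2=10.0, beta=2.0, gamma=2.0):
    """One forward-Euler step of the two-repressor toggle model."""
    du = a1 / (1.0 + v ** beta) - u   # synthesis of u is repressed by v
    dv = a2 / (1.0 + u ** gamma) - v  # synthesis of v is repressed by u
    return u + dt * du, v + dt * dv


def settle(u0, v0, steps=20000, dt=0.01):
    """Integrate from (u0, v0) long enough for the model to settle."""
    u, v = u0, v0
    for _ in range(steps):
        u, v = toggle_step(u, v, dt)
    return u, v


# Same model, two starting points: whichever repressor starts ahead wins,
# so the circuit "remembers" its initial state.
high_u = settle(2.0, 1.0)
high_v = settle(1.0, 2.0)
print(high_u, high_v)
```

The model is tidy and bistable on paper; the point above is that the living cell adds behavior this kind of equation does not capture.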

~~~
etrautmann
A great example of this is Green Fluorescent Protein (GFP), one of the most
important and foundational fluorescent proteins for biology research.

The fluorescent molecule at the center is not actually a nucleic acid, but is
formed several hours after the protein has folded, when a secondary reaction
between some of the atoms at the center of the protein finds a lower energy
state. This seems to me (not a protein engineer) to be virtually impossible to
design a priori.

~~~
jfarlow
Except this exact thing was accomplished and published a few months ago [1,
2]. The Baker lab at the University of Washington designed a synthetic
fluorescent protein, de novo.

[1] Meta-Meta-Article: [https://www.bakerlab.org/index.php/2019/01/02/nature-
article...](https://www.bakerlab.org/index.php/2019/01/02/nature-article-ipd-
work-voted-2018-readers-choice/)

[2] Actual paper:
[https://www.nature.com/articles/s41586-018-0509-0](https://www.nature.com/articles/s41586-018-0509-0)

~~~
iso1337
eh, it still took ~56 real-world attempts to get two functional designs.
That's not even quantifying how strongly they fluoresce, and other
qualitative/quantitative aspects (eg, how good of a GFP were these proteins).
It's very impressive, and the Baker lab has done amazing stuff, but we are a
long ways off from rational design.

From the paper: "Synthetic genes encoding the 56 designs were obtained and the
proteins were expressed in E. coli. Thirty-eight of the proteins were
well expressed and soluble; SEC and far-ultraviolet CD spectroscopy showed that
20 were monomeric β-sheet proteins (Supplementary Table 3). Four of the
oligomer-forming designs became monomeric upon incorporation of a disulfide
bond between the N-terminal 3–10 helix and the barrel β-strands. The crystal
structure of one of the monomeric designs (b10) was solved to 2.1 Å, and was
found to be very close to the design model (0.57 Å backbone r.m.s.d., Fig.
3c)."

"Two of the 20 monomeric designs—b11 and b32—were found to activate DFHBI
fluorescence by 12- and 8-fold with binding dissociation constants (KD) values
of 12.8 and 49.8 μM, respectively (Extended Data Fig. 6f)"

------
psyklic
You may be interested in the NSF-funded Molecular Programming Project --
[http://molecular-programming.org/](http://molecular-programming.org/).
Projects listed there include "gro: the cell programming language" and "DNA
origami" -- designing synthetic DNA that folds together into recognizable 2D
and 3D shapes.

I also encourage you to look at the Qian Lab. Lulu has created neural networks
out of DNA --
[https://www.nature.com/articles/nature10262](https://www.nature.com/articles/nature10262).
Her research is entirely about programming molecular systems --
[http://www.qianlab.caltech.edu/research.html](http://www.qianlab.caltech.edu/research.html).

I was fortunate enough to collaborate with Lulu and with Erik Winfree's DNA
computation lab at Caltech a while back.

My work was about building switching circuits using DNA. We designed DNA
sequences that would bind together probabilistically. Using these, we created
"probabilistic switches" -- analogous to Shannon's original on/off switches.
Then, we used some earlier work of mine to design (and build!) circuits of
"pswitches" that realize certain probability distributions.

We published an open-access paper last year here --
[https://www.pnas.org/content/115/5/903](https://www.pnas.org/content/115/5/903).
I suggest looking at Figure 1 to get a good idea how it works.
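
For intuition about how circuits of pswitches can realize a target probability, here is a minimal sketch of the series/parallel composition rule for independent switches. It is the generic rule only, not the DNA design from the paper; the function names are assumptions made for this example.

```python
from fractions import Fraction


def series(p, q):
    """Two independent switches in series conduct only if both are closed."""
    return p * q


def parallel(p, q):
    """Two independent switches in parallel conduct if at least one is closed."""
    return p + q - p * q


half = Fraction(1, 2)  # a single pswitch that closes with probability 1/2

quarter = series(half, half)            # 1/2 * 1/2
five_eighths = parallel(half, quarter)  # 1/2 + 1/4 - 1/8
print(quarter, five_eighths)
```

Starting from one kind of switch, nesting these two operations reaches a growing set of closure probabilities.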

------
hyperion2010
The semantics of such a language would be the primary challenge. If you want
to be able to compile a statement like "implement a protein that cleaves
proinsulin to produce insulin" there is SO much context that the real
challenge is not in managing the compilation to ACTG, but the surrounding
environment.

At what temperatures? In what pH range? In what species? With what codon
bias? When should it be expressed? At what rate should it mutate? Do you want
any other nearby 'functions' from mutation? Is this a membrane protein? Do you
want restriction sites? Is this for insertion into a plasmid or for insertion
directly into a genome? What bacterial cell line will you use to maintain the
plasmid? If you are expressing the protein for purification what cell line or
bacterial strain will you be amplifying? Do you want a single sequence that
will work for all of these or are you willing to compile a different version
for each combination? etc. etc. etc. The list goes on and on.

The sequence you compile to will depend on those environmental parameters and
the semantics of that 'environmental ISA' will likely be highly specific for
many high level descriptions. I imagine you could produce sequences that were
more robust, but the simulation time required to generate and validate them
would grow accordingly.
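
To give a feel for how quickly "a different version for each combination" multiplies, here is a small sketch that enumerates compilation targets over just four of the questions above. The `Target` fields and the option lists are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Target:
    """One hypothetical environment a sequence would be compiled for."""
    species: str
    temperature_c: int
    ph: float
    delivery: str  # "plasmid" or "genome"


# Illustrative option lists only; a real project would have many more axes.
SPECIES = ["E. coli", "S. cerevisiae", "CHO cells"]
TEMPERATURES_C = [30, 37]
PH_VALUES = [6.5, 7.4]
DELIVERY = ["plasmid", "genome"]

targets = [
    Target(species, temp, ph, delivery)
    for species, temp, ph, delivery in product(
        SPECIES, TEMPERATURES_C, PH_VALUES, DELIVERY
    )
]
print(len(targets))  # 3 * 2 * 2 * 2 = 24 separate compilation targets
```

Every extra question in the list multiplies that count again.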

All of this not even mentioning the fact that you also absolutely must specify
_all_ the things it should not do, such as cleave a bunch of other proteins,
or bind non-specifically and form aggregates in the cytoplasm, etc. A language
level list of defaults here would certainly be a requirement, and that means
the space that you will be optimizing in is absolutely massive. I don't even
want to imagine how slow it would be. Probably faster to synthesize a bunch of
variants and test them all in real cells.

The simpler case of taking an NCBIGene identifier and smashing it into an
Addgene plasmid identifier and setting codon biases and optimizing for
expression is a much more manageable task, and would probably be a building
block for the more complex version.
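
A minimal sketch of that simpler codon-bias step: back-translate a protein by picking, for each residue, the codon a given host prefers. The codon assignments are the standard genetic code, but the two hosts and their preference weights are made-up placeholders, not measured codon-usage data.

```python
# Standard genetic code, restricted to the residues used in this example.
SYNONYMOUS_CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "*": ["TAA", "TAG", "TGA"],
}

# HYPOTHETICAL relative preferences for two invented hosts (placeholders).
HOST_PREFERENCE = {
    "host_a": {"AAA": 0.7, "AAG": 0.3, "TTT": 0.6, "TTC": 0.4,
               "GGT": 0.4, "GGC": 0.3, "GGA": 0.2, "GGG": 0.1,
               "TAA": 0.6, "TAG": 0.1, "TGA": 0.3},
    "host_b": {"AAA": 0.2, "AAG": 0.8, "TTT": 0.3, "TTC": 0.7,
               "GGT": 0.2, "GGC": 0.5, "GGA": 0.2, "GGG": 0.1,
               "TAA": 0.4, "TAG": 0.1, "TGA": 0.5},
}


def back_translate(protein, host):
    """Pick the host's most preferred codon for every residue."""
    prefs = HOST_PREFERENCE[host]
    codons = []
    for residue in protein:
        options = SYNONYMOUS_CODONS[residue]
        # Codons absent from the table (ATG, the only choice for M) get 1.0.
        codons.append(max(options, key=lambda codon: prefs.get(codon, 1.0)))
    return "".join(codons)


def translate(dna):
    """Translate back to protein to confirm the product is unchanged."""
    lookup = {c: aa for aa, cs in SYNONYMOUS_CODONS.items() for c in cs}
    return "".join(lookup[dna[i:i + 3]] for i in range(0, len(dna), 3))


protein = "MKFG*"
dna_a = back_translate(protein, "host_a")
dna_b = back_translate(protein, "host_b")
print(dna_a, dna_b)
```

Both sequences encode the same toy peptide; only the codon choice differs. Real optimizers also weigh things this sketch ignores, such as restriction sites and expression level.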

~~~
xrd
Do you have suggestions on learning the basics of all that you discussed here?
I know nothing about it but it sounds fascinating.

~~~
DoreenMichele
If you search on "protein folding" you should be able to find some basics.

------
girishso
That was my interest when I started my career, but didn’t do much on it.

Bio-informatics deals with the stuff you mentioned in-silico. There’re a whole
bunch of libraries for many different languages, but I’ve not come across any
specific language for bio-informatics.

You might find [https://youtu.be/8X69_42Mj-g](https://youtu.be/8X69_42Mj-g)
interesting, they developed a DSL in common lisp to generate C++ code using
LLVM.

~~~
hsikka
Hey thanks, I’ll check out the video! That seems super cool

------
Kip9000
Microsoft Research Cambridge has been doing this for a long while.

"It turns out that there are lots of similarities between modelling concurrent
systems and biological systems. Just like a computer, biological systems
perform information processing, which determines how they grow, reproduce and
survive in a hostile environment. Understanding this biological information
processing is key to our understanding of life itself.

It’s probably easier to understand some of the output of this work –
specifically the Stochastic Pi Machine, or SPiM as it’s often referred to.
SPiM is a programming language for designing and simulating models of
biological processes. The language features a simple, graphical notation for
modelling a range of biological systems – meaning a biologist does not have to
write code to create a model, they just draw pictures.

You can think of SPiM as a visual programming language for biology. In
addition, SPiM can be used to model large systems incrementally, by directly
composing simpler models of subsystems. Historically, the field of biology has
struggled with systems so complex they become unwieldy to analyse. The modular
approach that is often used in computer programming is directly applicable to
this challenge."

[https://www.microsoft.com/en-us/research/group/biological-
co...](https://www.microsoft.com/en-us/research/group/biological-computation/)

[http://research.microsoft.com/en-
us/projects/spim/](http://research.microsoft.com/en-us/projects/spim/)
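
To be clear, the following is not SPiM notation. It is only a rough sketch, in plain Python, of the kind of stochastic simulation such modelling tools are built around: a Gillespie-style run of a single species that is produced at a constant rate and decays in proportion to its count. The function name and rate values are assumptions for illustration.

```python
import random


def simulate_birth_death(k_make, k_decay, t_end, seed=0):
    """Simulate 0 -> X (rate k_make) and X -> 0 (rate k_decay * n)."""
    rng = random.Random(seed)
    t, n = 0.0, 0
    trajectory = [(t, n)]
    while True:
        rate_make = k_make
        rate_decay = k_decay * n
        total = rate_make + rate_decay
        # Waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose which reaction fired, in proportion to its rate.
        if rng.random() * total < rate_make:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


trajectory = simulate_birth_death(k_make=10.0, k_decay=0.1, t_end=50.0)
print(len(trajectory), trajectory[-1])
```

Composing larger models amounts to adding more species and reactions to the same loop.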

------
lkrubner
Read this with an open mind: consider a dialect of Lisp. There is the saying
“Lisp is not the language that will solve your problem, but rather, Lisp is
the language that will allow you to create the language that will solve your
problem.” Perhaps what you need is a biology DSL. Lisp is excellent for
creating DSLs.

~~~
hsikka
This is a really interesting perspective, another HNer said something similar.
I’ll spend some time noodling on this.

~~~
orwin
There is already a complete Lisp implementation in progress, called Clasp,
that aims to take advantage of C++ libraries. You can also take a look at some
major Clisp/Scheme libraries like biolisp (I think that's the name). I'd never
do a group project in Common Lisp unless the project involves either a
finite-state machine or a DSL (rewriting a Prolog interpreter in Lisp is a
breeze), and in this case the language fits perfectly, imho.

------
YorkshireSeason
Yes, that's possible and a subject of active research. Have a look at Luca
Cardelli's website [1]. He's probably the leading researcher in this field.
He's got quite a few talks online e.g. [2].

[1] [http://lucacardelli.name/](http://lucacardelli.name/)

[2]
[https://www.youtube.com/watch?v=o8q7kFeGUTM](https://www.youtube.com/watch?v=o8q7kFeGUTM)

~~~
hsikka
Awesome, I hadn’t come across Luca Cardelli’s work before, I’ll check this
out! Thank you :)

------
Protostome
Computer programming languages were developed to overcome technical challenges
when people had a very hard time writing pure 0s and 1s to code CPU
instructions.

They knew what those instructions would do, and they had 100% certainty about
it.

This is not the case in biology, however. People still struggle to understand
how proteins interact on a molecular level.

A few years ago I took part in a project that aimed to engineer a new, synthetic
metabolic pathway in yeast. A key enzyme in this pathway had to be introduced
from an exotic bacterium. But no matter how much we tried, the enzyme worked
very poorly (two orders of magnitude slower) when introduced into S. cerevisiae.

The problems we have in biology today aren't technical, they are still
fundamental... Creating a programming language dedicated for wiring genetic
circuits is nice, but won't be a game changer.

------
arandr0x
Does it have to be nucleic acids? You can probably get a dataflow-like
language using concentrations of various enzymes and substrates. And there are
pre-existing models for molecular pathways that may help figure out the "key"
for such a code. (In fact, IIRC analysis of metabolite concentration variation
tends to be modeled using systems of differential equations, so you could use
neural nets to find optimal input states for simple or well characterized
networks.)
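
As a minimal sketch of that differential-equation view, here is the simplest case: one substrate converted to one product under Michaelis-Menten kinetics, integrated with forward Euler. The parameter values are arbitrary and chosen only for illustration; a real pathway would couple many such equations.

```python
def michaelis_menten(s0, vmax, km, dt, steps):
    """Integrate dS/dt = -vmax*S/(km + S) and dP/dt = +vmax*S/(km + S)."""
    s, p = s0, 0.0
    history = [(0.0, s, p)]
    for i in range(1, steps + 1):
        rate = vmax * s / (km + s)  # reaction velocity at the current S
        s -= rate * dt
        p += rate * dt
        history.append((i * dt, s, p))
    return history


history = michaelis_menten(s0=10.0, vmax=1.0, km=2.0, dt=0.01, steps=2000)
t_final, s_final, p_final = history[-1]
print(t_final, s_final, p_final)
```

The "program" in a dataflow reading would be the choice of enzymes and starting concentrations; the equations are the runtime.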

There are some problems with using nucleic acids, chief among which is that
the central dogma of molecular biology isn't nearly as true (or as central) as
we used to think. Don't know as much about RNA -- there certainly are RNA
"machines" where the nucleic acid itself has a function and maybe those could
be coded for specific functionality. RNA isn't as stable though (because it's
single stranded and RNAses are everywhere) so it sounds hard to deal with
experimentally, but I'm no expert.

------
subcosmos
The closest thing to this is probably DNA origami for nanoscale structures.

Beyond that, we're not quite at the place where we can compute desired
phenotype -> genetic code, except for when single enzymatic reactions are all
you are doing. Going bigger requires understanding the nonlinear complexities
of the system in much deeper detail.

~~~
jfarlow
We can and do go from phenotype -> genetic code as long as the phenotypic ask
is something we have some familiarity with. There are enough examples out
there that even though the toolbox is still small, there's plenty of new
things we can rationally design - even if our success rate is not 100%.

Transcription Factor + GFP -> green transcription factor -> codon-optimized
plasmid

Membrane tag + light-activated conformational change + cytoskeleton ->
optogenetic mechanotransduction

Small-molecule-activated degradation domain + transcription factor ->
chemically-induced genetic switch.

And though there's still a bit of trial and error to debug which/how those
components go together, it's a common enough thing to be reasonably
accessible.
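
One of the mechanical checks behind a "part + part" fusion can be sketched in a few lines: join two coding sequences so the downstream part stays in reading frame. The sequences below are toy placeholders, not real genes, and `fuse_in_frame` is a hypothetical helper rather than any existing tool's API.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def fuse_in_frame(upstream, downstream, linker=""):
    """Join two coding sequences so the downstream part stays in frame."""
    for name, seq in (("upstream", upstream),
                      ("downstream", downstream),
                      ("linker", linker)):
        if len(seq) % 3 != 0:
            raise ValueError(f"{name} length is not a multiple of 3")
    # Drop the upstream stop codon so translation continues into the fusion.
    if upstream[-3:] in STOP_CODONS:
        upstream = upstream[:-3]
    return upstream + linker + downstream


# Toy placeholder sequences, NOT real genes.
toy_factor = "ATGAAATTTTAA"    # M K F stop
toy_reporter = "ATGGGTGGCTGA"  # M G G stop
toy_linker = "GGTGGC"          # G G

fusion = fuse_in_frame(toy_factor, toy_reporter, toy_linker)
print(fusion)
```

Getting the frame right is the easy part; whether the fused protein folds and works is where the trial and error comes in.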

------
jfarlow
We do this at Serotiny [1]. And others are working on this as well for
different specific purposes. We use some very basic ML (the data is very
sparse, wide, but any hit above a low baseline is _very_ valuable).

The curious part, from our perspective is that biology has massive surface
area - and the surface area is 3D. It not only scales between
species/functions, but it also scales up and down, from atoms to organs. And
the expertise/abstraction layers that work at one scale become complicated if
you try too hard to account for all variation at a different scale. HIV's
genome is a backwards, upside-down, mirrored, fugue of an engineering design
that uses exotic molecules, exotic regulation, exotic proteins, and exotic
physics. We're starting at a different place, just trying to write very simple
scales.

In our case, we've chosen a single size scale to work with - proteins, but are
wide enough to look across every species & discipline to understand those
proteins as common tools. We compile all of our designs for a particular
function that we're interested in, down to DNA, literally. And finding the
niche where we do not have to deal with all of the DNA-regulation, or cellular
regulation, or tissue synthesis, etc. allows us to expand and build in
complexity at the protein level - while keeping other parts of biology
constant. And that also allows us to interact and work with others who are
working at different scales.

And there are others that build in complexity at other biological levels (gene
regulation, pathway flux, etc.). Companies like Asimov [2] are involved in
similar work at some of those abstraction layers. The open-source design
language, SBOL is an attempt to standardize the DNA layer [3]. And this
contributes to the challenge in that a _lot_ of people/companies/labs have
projects to build an abstraction layer that compiles down to DNA - but they
might be talking past each other and be doing separate projects.

We've built an entire API of 'high-level' commands at an abstraction layer
above DNA, where the output compiles down to literally, a JSON file specifying
the DNA sequence to be manufactured by a 3rd party, as well as human-level
citations to enable turning the new designs into intellectual property.
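
To illustrate the general shape of "compiles down to a JSON file" (this is not our actual API or schema), here is a hypothetical sketch that assembles ordered parts into one record holding the final sequence and its citations. All field names, part labels, DNA strings, and sources are invented placeholders.

```python
import json


def design_record(name, parts):
    """Assemble a manufacturable design record from ordered parts."""
    return {
        "name": name,
        "sequence": "".join(part["dna"] for part in parts),
        "parts": [part["label"] for part in parts],
        "citations": [part["source"] for part in parts],
    }


# Placeholder parts: DNA strings and sources are invented for illustration.
parts = [
    {"label": "toy_domain_a", "dna": "ATGAAATTT",
     "source": "placeholder reference A"},
    {"label": "toy_domain_b", "dna": "GGTGGCTGA",
     "source": "placeholder reference B"},
]

record = design_record("toy_fusion", parts)
payload = json.dumps(record, indent=2)
print(payload)
```

Keeping the citations next to the sequence is what lets a person trace each piece of a design back to its source.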

There is still a LOT of data missing, and there's a lot of empirical work to
do - and you need to keep your compiling system constant enough that when you
make changes at your abstraction layer you know that when you hit a roadblock
you know it's because of a change you made, and not just a bug in the system.

[1] [https://serotiny.bio](https://serotiny.bio)

[2] [https://www.asimov.io/](https://www.asimov.io/)

[3] [http://sbolstandard.org/](http://sbolstandard.org/)

~~~
hsikka
Wow very cool. I’m actually a graduate student at Harvard getting my master’s
in biology and CS, so this is right up my alley. I’m going to dive deeper into
Serotiny and check out some of the awesome work you guys are doing. Do you
offer internships for folks like me?

I’d heard about Asimov from their original work on Cello at MIT, but SBOL is
news to me. I’ll check it out as well.

It sounds like this space is becoming pretty competitive, which is
interesting.

~~~
jfarlow
I think the fun part is that it's not necessarily super competitive, yet. A
lot of these tools are still complementary. But they're also not quite
integrated yet either.

Always looking for good work. The field _is_ growing rapidly right now. You've
got a good combination of talents to help.

------
cmrx64
Vaguely related but check out
[http://ceqea.sourceforge.net/](http://ceqea.sourceforge.net/), paper:
[https://arxiv.org/abs/1811.02478](https://arxiv.org/abs/1811.02478)

------
yaseer
This is a very interesting area of research, which I'm personally fascinated
by. There are several in active development.

Check out:

[https://kappalanguage.org/](https://kappalanguage.org/)

For one example.

------
grawprog
There's this.

[http://www.cs.cmu.edu/~phoenix/nsc1/paper/3-1.pdf](http://www.cs.cmu.edu/~phoenix/nsc1/paper/3-1.pdf)

------
bkovacev
I know this might be a bit off-topic, but what is the current de facto
standard either open-source or paid for drug design/discovery?

~~~
hsikka
OP here, the only open source drug discovery library/framework I’ve seen is
DeepChem
[https://github.com/deepchem/deepchem](https://github.com/deepchem/deepchem)

------
amelius
Another question is: could we beat (progress in) Quantum Computing by
exploiting massive parallelism in biological computing?

~~~
hsikka
I’m highly interested in this, and in the capacity of biocomputing in general!

------
fredgrott
you are thinking the wrong way...it's the embryology sequence of genes that
turn on and off combined with dna from other organelles such as the
mitochondria

not one of the end products nucleic acid by itself...this is why creating a
dinosaur from blood, etc is such a fictional dream

------
calebwin
there's this language called gro that i'm hoping to look into when i get some
time -
[http://depts.washington.edu/soslab/gro/](http://depts.washington.edu/soslab/gro/)

------
thedevindevops
Just don't code an AI in it...

------
etiam
Despite an oft-cited testimonial to the contrary, the Word was nowhere to be
found in the Beginning.

Sure you could devise something to take written instructions and produce
peptides according to it, but it won't be a "programming language for
Biology".

~~~
aetherspawn
Is the genome not the programming language of God, and the junk DNA his inline
comments?

~~~
etiam
If so, it is fit only for Him, and it behooves us not to try to write in it.

"Language" and "code" are shitty metaphors for what DNA does in living
organisms, and it misses almost everything about what's needed to go from the
molecule complex to the phenotype. Divine beings (and possibly Real
Programmers, who mastered the use and internals of C-x M-c M-butterfly
[[https://xkcd.com/378/](https://xkcd.com/378/)]) may get away with doing
their work despite thinking about it that way. Everyone else need not apply.

Incidentally a good deal of the "junk" DNA turns out not to be.

