Moderna mRNA sequence released to GitHub [pdf] (github.com/naalytics)
1248 points by aty268 24 days ago | 367 comments

I'm sure it's more complex than I grasp as a layperson, but I'm utterly amazed at how simple this _appears_. I get the feeling that this is something I have a better chance of understanding than the average SaaS Terms and Conditions.

I expected to have to scroll through pages upon pages of indecipherable text. Instead it's no bigger than a large paragraph of text, and I can easily fit it on my screen.

The protein they're trying to manufacture is indeed quite simple - AFAIU both BioNTech and Moderna put together their sequences in a weekend. (Though there was a more involved process of winnowing down the sequences for the most effective ones.)

The technically challenging parts are:

- delivery mechanism: you need to take a very unstable molecule, protect it from the environment - both external, and when inside the patient - and insert it into a human cell. (This is called the "platform", and is usually developed independently from the specific payload.)

- manufacturing: both producing the mRNA itself at a large scale, and inserting it into the delivery mechanism, at a large scale and in low-temperature conditions

- testing: the newly-developed payload and the existing platform were integrated at small scales within weeks, but testing the thing for safety and efficacy took months

EDIT: As schoen pointed out, this was not actually released by Moderna, but reverse engineered by third-party researchers. Original text was: "Hence they feel safe releasing this. Their moat is not the gene sequence, their moat is everything else."

The sequence was actually released by Moderna in their patent:


though they do present multiple sequences, so I guess you'd have to go to the FDA application to figure out exactly which one got used.

Reading the primary claim is fascinating: "A composition, comprising: a messenger ribonucleic acid (mRNA) comprising an open reading frame encoding a betacoronavirus (BetaCoV) S protein or S protein subunit formulated in a lipid nanoparticle."

I have a "I'm sure that means something to somebody" feeling. It's also surprising that the remaining claims seem to describe the resulting bits of the sequence, and that that primary claim can stand on its own. Of course, I'm by no means an expert.

> I have a "I'm sure that means something to somebody" feeling.

Break it down! It's not so bad:

> A composition

A bunch of stuff

> a messenger ribonucleic acid (mRNA)

mRNA are cellular instructions on how to make proteins that are read by ribosomes that make those proteins as they read along.

> an open reading frame

This is something that starts with a "start codon" and ends with a "stop codon" and encodes valid instructions to make a protein in between.

> encoding a betacoronavirus (BetaCoV) S protein or S protein subunit

The instructions refer to the spike protein of a betacoronavirus, or a fragment thereof, because this is what we want the immune system to pay attention to (and make antibodies to bind to and neutralize).

> in a lipid nanoparticle

The immune system gets pissed off about mRNA floating around, because that's one of the things that happens with active infection. So if you want this to get into cells and tell them to make your protein, you need to encase it so that it mostly escapes immune notice itself.
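To make the "open reading frame" part of the breakdown above concrete, here is a rough sketch of a toy ORF finder. The sequence is made up for illustration (it is not from any vaccine), and `find_orf` is a hypothetical helper name; real tools also handle all reading frames, both strands, and minimum lengths.

```python
# Toy open-reading-frame (ORF) finder on an RNA string.
# Assumption: the standard start codon AUG and the three standard stop codons.
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}

def find_orf(rna):
    """Return the first ORF (start codon through stop codon, inclusive), or None."""
    rna = rna.upper()
    start = rna.find(START)
    while start != -1:
        # Walk codon by codon (steps of 3) from the start codon.
        for i in range(start, len(rna) - 2, 3):
            if rna[i:i + 3] in STOPS:
                return rna[start:i + 3]
        # No in-frame stop codon found; try the next AUG.
        start = rna.find(START, start + 1)
    return None

toy = "GGAUGGCUUUUUAAGG"  # made-up sequence, purely illustrative
print(find_orf(toy))
```

Everything between the start and stop codon is what the ribosome would read as protein instructions.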

Why haven’t viruses or bacteria evolved to use a lipid nanoparticle to evade the immune system? Seems like a big security flaw in our cells?

I remember someone coming up with a song to encourage people to wash their hands more thoroughly much earlier in this pandemic.


According to the song lyric, "novel coronavirus / has a lipid outer shell". So it seems like some viruses have taken advantage of this.

Yeah I think that's the reason for the 20-30 second wash since it takes time for the soap to break down the fatty layer.

That stupid ebola song came to mind. It's very catchy. https://www.youtube.com/watch?v=XGltVAJ4JCk

And it's pretty common; the flu viruses have one.

If you squint you could say that viruses are already doing that. A lipid bilayer is a key component of several kinds of membranes. All of our cells have lipid bilayers separating the inside from the outside. The coronavirus is also built from these same lipids, which is why it's vulnerable to soap.

Anyway, I think the important thing that the other commenter was saying is that mRNA needs to be carefully packaged to be medically useful. You can't inject pure RNA as a vaccine: not only would the mRNA be quickly degraded before it got anywhere, but if the immune system sees any RNA floating around, it kicks itself into a frenzy, because free-floating RNA is usually a sign that something nasty is afoot.

Don't give them any good ideas

They have. That spike protein is part of their envelope ("nanoparticle").

The envelopes used in an RNA vaccine are generally simpler, because they're working under different constraints than viruses. For example, their envelopes don't need to be easily manufactured in a cell.

But some RNA and DNA vaccines do use viruses as their delivery mechanisms, eg the J&J COVID vaccine.

Because some have. Phages for example.

Maybe those lipid particles are too unstable if they are not spiked with various proteins?

The Pfizer/BioNTech vaccine initially had to be kept at around -70 C (Moderna's at around -20 C). Viruses that can survive for any extended period of time only at such temperatures won't find many hosts.

Also maybe the process of creating virus shell can't naturally be done without protein scaffold.

Doesn't this describe almost ANY vaccine? I think it's probably bad public policy to allow anyone to patent ALL COVID vaccines. Patenting a particular vaccine (a particular mRNA string) should be allowed, but not effective wildcards in the RNA.

No. It doesn't describe the viral vector vaccines (J&J/AstraZeneca). It doesn't describe the inactivated virus vaccines. It doesn't include the viral fragment (Novavax) vaccines. And it wouldn't describe some possible mRNA vaccines because of differences in formulation or differences in targeting.

While this particular Moderna claim would likely affect BioNTech/Pfizer's mRNA vaccine, it's not clear whether it would survive in litigation, too.

As to a "specific string"-- if you could just pad a few codons onto the end and not be violating the patent, that's not too worthwhile.

Technically you wouldn't even have to do that. Just change a few codons to alternative codons, it wouldn't even change the amino acids being produced.
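A quick sketch of why swapping codons leaves the protein unchanged: the genetic code is redundant, so several codons map to the same amino acid. The table below is only a small excerpt of the standard genetic code, and both sequences are invented for illustration.

```python
# Excerpt of the standard genetic code (RNA codon -> one-letter amino acid).
# "*" marks a stop codon. Only the codons used below are included.
CODON_TABLE = {
    "AUG": "M",
    "GCU": "A", "GCC": "A",
    "UUU": "F", "UUC": "F",
    "CUG": "L", "CUC": "L",
    "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(rna):
    """Translate an RNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

original = "AUGGCUUUUCUGUAA"  # made-up sequence
recoded = "AUGGCCUUCCUCUGA"   # same protein, different codons

differences = sum(a != b for a, b in zip(original, recoded))
print(translate(original), translate(recoded), differences)
```

The two strings differ at several nucleotide positions yet translate to the same amino acids, which is the loophole a claim on one literal "specific string" would leave open.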

This is a great animation of the life cycle of an HIV virus. It’s not exactly what happens with the pandemic virus but it gives you a good idea of the complexity of the process of viral reproduction (vaccine or immune response isn’t covered here):


Animation of translation from mRNA: https://youtu.be/TfYf_rPWUdY

The "SARS-CoV-2 entry" video was really interesting too, and I recommend watching them both to see how they are different.


Nice!!! I didn’t even notice that was uploaded! The differences between HIV and SARS-CoV-2 make me wonder what reliability there is in this process, and if some viruses are more reliably able to enter the cell than others (presumably those that are more infectious?)

It blows my mind to see how life (and death) work at the molecular level--it's almost like some kind of manmade machine, but far more subtle and complex.

These might interest you then: 3d visualizations of cellular processes in real time. I was shown them in my intro to biology class, which filled me with the same interest.

Transcriptase: https://youtu.be/5MfSYnItYvg
DNA polymerase: https://www.youtube.com/watch?v=bee6PWUgPo8
The Ribosome: https://youtu.be/TfYf_rPWUdY

It's been decades since I was in high school but I really hope these videos, or something similarly realistic and mind-bending, are in the modern biology curriculum. Learning about Darwin, Mendel, Watson and Crick and the experiments they did to develop an understanding of biology was informative, but it wasn't compelling to me. These and the work of Drew Berry and WEHI are just amazing:


I like the WEHI videos because they take the effort to make the molecular motions appear random, a result of stuff blundering about: https://youtu.be/7Hk9jct2ozY

Just remember that it's not really orchestrated. All of the molecules involved are kind of randomly blundering about and it's one-in-a-million collisions that are responsible for getting shit done (on membranes it's more like one in a thousand and on ropelike structures like DNA or actin it's one in a hundred).

As a biochemist, that's one of the things that has kept me interested in the work - it's truly mind boggling what is going on at the molecular level every second of our lives.

I love reading articles around biology, micro/molecular biology in particular, and looking for references to agency in the text. 'selects', 'filters', 'checks', 'seeks', etc when the reality is that the whole thing is just a massive chemical reaction.

Yes. One must know all the programming of molecules and all the laws at work in the universe: a perfect hierarchy from atoms to cells, from cells to plants and animals, from animals to the earth, from the earth to the stars.

Coronavirus replication is pretty dramatically different from HIV replication-- coronaviruses are not retroviruses, and do not have a step where viral RNA is converted into DNA and integrated into the host cell's genome. Instead, coronavirus RNA is directly interpreted by the cell's ribosomes to make the proteins that ultimately build and comprise the replicated viruses.

The mRNA vaccines work in much the same way-- it's just that the mRNA vaccines only include the code for the spike protein and not the rest of the virus's machinery. So you get the vaccine and your body produces a bunch of spike protein by itself, which gives your immune system the opportunity to learn how to identify and react to the spike protein before it sees it on a real virus.

Claims are read in the context of the body of the patent and generally known and/or cited knowledge in the field in question. As long as you define precisely what you mean by your terms in the body, you can be as succinct as you want in the claims.

It's an interesting structure, but probably partly a consequence of patent law. To wit, SARS-CoV-2 is non-patentable, being of natural occurrence.

But a specific composition that encodes its spike protein, encapsulated in a lipid nanoparticle? That's much more of a creative work.

> That's much more of a creative work.

Probably worth noting that patents are not required to be creative. That's copyright terminology.

I think the terminology used for patents is something like "inventive and useful".

> It's an interesting structure...

I think that's the question. I'm very much used to reading "A machine-readable medium, comprising..." I'm curious what bits are unique to COVID-19, and what bits are generally protecting the idea of using a carrier to send a specific protein. In this case, is it the "S" protein phrasing that protects the specific embodiment of COVID19's spike?

Here what is claimed is encoding for any betacoronavirus's spike protein.

> Here what is claimed is encoding for any betacoronavirus's spike protein.

Aha! That's what is surprising to me - I assumed that that would have been done previously / protected previously. That explains it.

If it's just the spike protein encapsulated in a lipid nanoparticle (for isolation and transportation?), that looks like something not creative and quite established for people in the field of genetic material transport.

My layperson's understanding is that the actual spike protein and/or mRNA are modified from the natural versions. Both for stabilization (if either falls apart quickly, they're of no use) and for response optimization.

So somewhat like how a fishing fly differs from the insect it represents.

Usually you can't patent facts. If it is a natural (not synthetic) gene sequence, they shouldn't be able to patent it...

> put together their sequences in a weekend

"Meh, I could do that over a weekend" never sounded so scary, or so impressive, at the same time. That weekend just so happened to stand on the shoulders of prior decades of research though.

i guess this is big pharma's version of `apt-get install`

Not only did Moderna have a decade of experience with mRNA as 'drug', but the mechanism of coronavirus infection was well-understood from SARS research, namely the importance of the spike protein. All the parts were in place and just waiting for the specific SARS-CoV-2 sequence. They designed it as soon as they had access to the Wuhan sequence.

Moderna has been working on mRNA since 2010 and mRNA vaccines since 2012. They have the process down pretty solid, but vaccines do not bring in the bacon.

The big money’s going to be in cancer treatments. If they can use this to target tumors, they’ll do quite well.

Bad news, it’s not looking very successful right now. Moderna is making back all the lost revenue with the emergency authorization tho

Can you share some of the bad news links? I've only seen the positive ones.

Now that they've spent a year developing the mass-production methods and infrastructure, it might bring in the bacon! e.g. there's a steady annual market for a flu vaccine for whatever the latest strain is, and being able to get that to market faster than the competition could give them an edge.

I just don't think it was ever profitable enough for them to put in this enormous capital expenditure.

When you can have Uncle Sam cover that capex, it makes it even more financially appealing.

And Biontech since 2008 and CureVac was founded in 2000 (after their initial CEO made a discovery that enabled Biontech + Moderna).


" but vaccines do not bring in the bacon."

As in, "do not generate enough income"? Really? Now?

Yes, really. Typical vaccines are like $5 to $200 or so. And worst of all, usually just one or two doses.

For all the horror HIV has wrought, global spending on vaccine development for HIV has been around $1 billion a year for the last few years. In contrast, the USA federal government spends $3 billion a year for HIV antiviral drugs for low-income Americans. $20,000 per patient per year for life. Unsurprisingly, new antivirals are where most of the research is.

It can sound almost like a conspiracy if I put it like that, but it's just the economic incentives. Especially since the developed countries, where most of the market for charging a decent markup is, have the smallest market for most new vaccines, while having the largest markets for therapeutics for chronic conditions.

Now HIV is genuinely devilish to develop a vaccine for despite our attempts. But vaccines for hepatitis C, gonorrhea, HSV, among others appear to be possible. We almost certainly could develop effective vaccines for these with existing techniques, if someone coughed up the funding. Maybe all the buzz about mRNA vaccines will spur some progress here.

> It can sound almost like a conspiracy if I put it like that, but it's just the economic incentives

Talk about market failures! It's completely obvious that this economic system is not placing the good of the whole human species as its first priority.

"Yes, really. Typical vaccines are like $5 to $200 or so"

But since the demand is 14 000 000 000 doses, there should still be a little bit of money in it?

Yes, clearly. With COVID-19, there is pretty much guaranteed market for about ten billion doses. Along with direct investment by governments in wealthier countries. Most people, including politicians, want a COVID-19 vaccine real bad.

The parent poster was describing the situation with vaccine development in general, to which COVID-19 is quite the exception. A potential hepatitis C vaccine for example has very different economics, as it would not be deployed anywhere nearly as widely or quickly. Consider that, 40 years after hepatitis B immunization became available, the majority of Americans haven't been jabbed with it.

Yes, but a pandemic only comes around every hundred years or so. Moderna happened to be in the right place at the right time for this one, but delivering vaccines for a pandemic is not much of a solid business plan.

Current pricing for mRNA vaccines is somewhere in the $4-7 range (that's $7.00, not $7,000.00). Compare that to one of the Hepatitis C treatments, which costs north of $350,000 by several accounts. Even remdesivir is something like $3000 for a course.

There are roughly 75 million HCV infections in the world.

That translates to a total cure cost of:

75M * $350k = $26.25T

There are other ARV treatments available which cost now roughly $50-100k and cure in 3 months.

Whereas immunizing everyone against coronaviruses currently costs:

8B * $7 = $56B
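For anyone who wants to check the back-of-the-envelope numbers above, here is a small sketch. The inputs are just the figures quoted in this comment (75M infections, $350k per cure, 8B people, $7 per dose), not independently sourced data, and it assumes one dose per person as the comment does.

```python
# Figures taken from the comment above; treat them as rough assumptions.
hcv_infections = 75_000_000       # estimated HCV infections worldwide
hcv_cost_per_cure = 350_000       # USD, upper-end price quoted above
world_population = 8_000_000_000  # rounded
covid_dose_price = 7              # USD per dose, upper end of the $4-7 range

hcv_total = hcv_infections * hcv_cost_per_cure
covid_total = world_population * covid_dose_price

print(f"HCV cure for everyone infected: ${hcv_total / 1e12:.2f}T")
print(f"One COVID dose for everyone:    ${covid_total / 1e9:.0f}B")
print(f"Ratio: {hcv_total / covid_total:.2f}x")
```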

Clearly, the costs of the HCV cure are predatory and unreasonable because it doesn't lead to eradication and it's inaccessible to the poor and the third-world.

It's also worth noting that if it's $7 for the payer, there's a lot less than $7 of profit.

The true unit cost is probably $6 per dose. The COVID vaccines may break-even short-term. Long-term, it's probably worth keeping an extra 120M potential customers alive for what will probably result in a small profit.

The true unit cost of HCV cures is unknowable but possibly half of the current price.

"Hepatitis C treatments"

We don't really need the same quantities of that, though.


It can be said that hepatitis care is more necessary than covid.

The world is not in lockdown because of hepatitis, though.

Hepatitis isn't airborne, though, so it doesn't have an exponent threatening to blow up in everyone's face.

Now is a very atypical situation :-P

You've never been to Florida have you?

Does Florida typically throw billions of dollars at vaccine development?

Vaccines can be very lucrative. Pfizer has billions in sales for Pentacel.

There was already good work previously on SARS-1 and MERS spike proteins for use in vaccines. This is what enabled the "in a weekend" speed. https://pubmed.ncbi.nlm.nih.gov/28807998/

From what I've gathered, the rate-limiting step for production so far is creating the lipid vesicles and getting the RNA inside of them. Only a few companies have a process for this, and the supply chain for the precursors is limited as well.

Could be wrong but AFAIU the Pfizer one doesn't encapsulate in a lipid, hence why it needs lower temperatures.

RNA would get thrashed by your immune system if it isn't encapsulated by something: liposome delivery of therapeutic RNA is really next-generation tech, and the fact that the RNA does what it's supposed to inside your cells is no small feat either.

From the CDC "ingredients" for the BioNTech/Pfizer vaccine, along with cholesterol (which modulates the stability of lipid membranes), they report using this molecule, which would form a phospholipid bilayer, just like our own cells use: https://www.sigmaaldrich.com/catalog/product/avanti/850365P?...

Pfizer one does contain 4 kinds of lipids to encase the RNA. The encapsulation percentage is however unknown. https://www.technologyreview.com/2020/12/09/1013538/what-are...

They're both encased in lipids. Pfizer initially didn't have data on long-term storage at standard freezer temperatures, but has since confirmed their vaccine can be stored under similar conditions to Moderna's.

> delivery mechanism: you need to take a very unstable molecule, protect it from the environment - both external, and when inside the patient - and insert it into a human cell. (This is called the "platform", and is usually developed independently from the specific payload.)

Of note, the immune system is pretty good at destroying foreign mRNA so you also need to evade it.

This article is pretty good: https://berthub.eu/articles/posts/reverse-engineering-source...

I wouldn't even say the immune system: your body has a ton of nonspecific RNA-digesting enzymes floating around to patrol for exactly this sort of thing happening, even by accident, as cells can sometimes rupture. It's enough of a problem that good RNA researchers have a reputation for being clean freaks. Some RNA labs I've been in had a lingering, slightly sweet smell; that's the nonspecific RNase inhibitor that gets sprayed on everything.

RNA is also just generally fantastically unstable and reactive. You don't want any surface to be too alkaline, for example. There's a reason that basically every life form switched to DNA.

(Though RNA may have been more stable under the high-UV-exposure conditions of the early Earth.)

Though AFAIU once you've gotten the RNA inside the cell, you're home free.

> - delivery mechanism: you need to take a very unstable molecule, protect it from the environment - both external, and when inside the patient - and insert it into a human cell. (This is called the "platform", and is usually developed independently from the specific payload.)

The most amazing thing is that now that the platform is proven safe in tens of millions of people, it should be very easy and fast to get approval for other payloads. Biontech for example wants to go after cancers - a platform that can deliver payloads targeted to an individual's cancer is nothing short of a game changer in cancer treatment because the current standard of blasting the patient's body with a lot of highly toxic chemicals is archaic compared to letting the body's immune system do the cleanup.

Even if the platform is safe, the payload itself needs to have its safety proven. Remember, the payload is just instructions, and those instructions make your cells pump out oodles of arbitrary proteins. That in itself can cause health problems. see e.g. the AstraZeneca vaccine's safety issues, which were caused IIUC by immune responses to the manufactured proteins. DNA vaccine, not mRNA, but the principle is the same.

re: cancers, that is actually what this technology was originally developed for! Moderna has been spending about a decade getting this tested and proven out for the cancer role, and they're quite close. From my quick reading of the literature, there seems to be some regulatory confusion about how exactly to run approval for this kind of personalized drug design (testing the method of generating the individual drugs?), but the bar is usually much lower for cancers with high mortality rates.

> Hence they feel safe releasing this. Their moat is not the gene sequence, their moat is everything else.

One or more of the vaccine developers may have released such details, but this particular file is a reverse engineering effort by unaffiliated scientists based on analyzing the dregs of used vaccine vials (!).

Edit: See https://news.ycombinator.com/item?id=26628594 for more substantive discussion about this.

Ah - thanks for pointing this out! Edited to make sure readers see it.

> but testing the thing for safety and efficacy took months

What kind of tweaks were made from "the version they threw together in a weekend" to "the version that is in production now"? What's a typical "mRNA" feedback iteration loop like?

I'm not sure whether any changes to the sequence were necessary during testing. However, if you align the sequences given in that link from Biontech and Moderna, you see they encode the exact same protein (which is of course necessary). The RNA sequences, though, contain quite a few differences between the companies: they often use different codons. This could be to make the translation more efficient, and can be a thing to optimize.

Would it be possible to use the same delivery mechanism for other mRNA sequences?

Almost certainly - e.g. the J&J vaccine (a DNA vaccine, not mRNA, but same principle) is using a viral delivery platform that they'd had sitting on the shelf for years and have used for other vaccines.

After the massive capex that has gone into mass-production of encapsulated mRNA delivery systems, I suspect this new technology will be very cost-competitive for the big markets like the annual flu vaccine.

> - delivery mechanism: you need to take a very unstable molecule, protect it from the environment - both external, and when inside the patient - and insert it into a human cell. (This is called the "platform", and is usually developed independently from the specific payload.)

Sounds like a problem you solve once and for all, for any vaccine. And also that this problem was already solved decades ago (e.g. viral vectors).

> - testing: the newly-developed payload and the existing platform were integrated at small scales within weeks, but testing the thing for safety and efficacy took months

And so many people have been killed by this overly conservative testing, phase ~<2.5 was enough

> Sounds like a problem you solve once and for all, for any vaccine.

I strongly doubt it. It's more like a problem you solve once for a particular class of payload and particular destination. Biology doesn't do packet switching - everything is just rapidly bumping into everything else at random, so your envelope needs to be designed in a way that's ignored by everything else than molecules at your target site, and it needs to not react with the payload it's carrying.

> And so many people have been killed by this overly conservative testing, phase ~<2.5 was enough

Overly conservative? That's what super-accelerated testing looks like. We're lucky it went well; had they screwed up, it would scare a lot more people away from vaccinating, lengthening the pandemic and increasing death toll.

A bad vaccine could kill many more than that. Remember that RNA vaccines have been in development for 10+ years now, and COVID-19 is the first time they actually worked without side effects.

Additional reading (was posted here some time ago):


Why manufacturing of these vaccines is a hard part.

Liken it to the 4kb demoscene: it's amazing what can be done with a little bit of information, as long as you don't have to describe the machine running it.

Or the distribution method, or even really invent the thing, since you're mostly just copying someone else's work. Plus it doesn't have to even do anything. In fact, doing anything might be a problem, so best to just sit there and look menacing (and spikey).

> Liken it to the 4kb demoscene

Coincidentally, the mRNA sequences for both vaccines are about 4kb (kilobase) long.

It really is that “simple.”

Getting it designed and building it is more difficult.

At its core, it’s a piece of mRNA that creates a protein. That code gets translated into a protein (often those are relatively short). That protein then triggers your body’s immune response, which trains it to attack covid19.

Inject this mRNA into a cell and it’ll create the protein. Anything can be injected at this point once the mechanism for injection is developed

Which makes me wonder. Could you place the entire virus genome in these liposomes and get them to hijack the machinery to make an entire virus? Like plasmid but for viral structures?

Yes, that's one of the concerns many have about this technology. Literally, anything can be injected and done at this point.

Not sure exactly where the technology is at, but I suspect we're no more than 5 years from a major incident related to this.

Even this vaccine, we really don't know the long-term impacts or risks involved with this. For instance, this vaccine does appear more risky than the standard flu vaccine:


Presumably this is due to increased inflammation. It's not hard to imagine that we'll be doing genetic editing soon enough with this (if we aren't already).

Do you happen to have a more verifiable source for your claim that COVID-19 vaccines are more risky than the flu vaccine?

Excerpt from the disclaimer in your source:

> VAERS accepts reports of adverse events and reactions that occur following vaccination. Healthcare providers, vaccine manufacturers, and the public can submit reports to VAERS. While very important in monitoring vaccine safety, VAERS reports alone cannot be used to determine if a vaccine caused or contributed to an adverse event or illness. The reports may contain information that is incomplete, inaccurate, coincidental, or unverifiable. Most reports to VAERS are voluntary, which means they are subject to biases. This creates specific limitations on how the data can be used scientifically. Data from VAERS reports should always be interpreted with these limitations in mind.

Given the incentives, no. However, it should be noted that VAERS is only going to undercount, not overcount. Physicians are required to fill it out if there are adverse side effects at the hospital.

Wish there was more but the incentives aren’t really aligned for open research on this.

We can get the real CDC ICD-10 (billing) results in several months.

How do you edit genes with an mRNA vaccine? You'd need DNA, enzymes (maybe requiring post-translational modification) to splice them in, etc.

Also, you might not even be able to print full viruses with this platform. manufacturing mRNA is different from manufacturing all the random types of RNA in a virus, isn't it?

There’s really no reason you can’t have a multitude of mRNAs deployed; you definitely don’t need DNA to do editing.

There are people already doing gene editing with mRNA methods:


I’m not sure about these particular platforms, but I wouldn’t be surprised if we see gene editing technology deployed live in the next 3 years.

Why would you want to do that if you can use the virus itself as the delivery mechanism?

Maybe I only have the sequence and not the virus (or smallpox), or maybe I want to create a novel virus (respiratory-spread Ebola).

Sequencing technologies have improved immensely over the last decade and a half. And, in this particular case, getting the sample RNA is incredibly easy, since its purity and integrity in the vial is quite high.

I didn't look at the details of how they sequenced it, but given that there are chemically modified bases in the mRNA vaccines, there is a chance the normal methods for sequencing (and the first step of reverse-transcribing to DNA) don't work. Well, I guess in practice they did.

While not completely equal to the naturally occurring bases, the modified bases in the vaccine mRNA need to be able to pair with the non-modified ones present in tRNA anticodons during translation. If they can pair with their corresponding natural bases, then the chemically modified RNA can also be used as a template by the reverse transcriptase to generate the complementary DNA needed for the sequencing reaction.

Generally I agree, but it could be the case that the modified bases work just well enough for tRNA matching in the ribosome, but not with the reverse transcriptase.

The mechanism of base complementarity is identical in both cases. If a modified uracil complements an adenine in tRNA, it will complement an adenine in the RT primer or an adenine being added to it.
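A toy model of that argument, under the assumption stated above that the modified base (written Ψ here) pairs with adenine exactly like uracil does. `reverse_transcribe` is a hypothetical helper and the sequence is made up; real reverse transcription involves primers, enzymes, and error rates that this ignores.

```python
# Watson-Crick pairing from an RNA template to the complementary DNA base.
# Assumption: the modified base Ψ pairs with A, the same as U.
RNA_TO_CDNA = {"A": "T", "U": "A", "Ψ": "A", "G": "C", "C": "G"}

def reverse_transcribe(rna):
    """Return the complementary DNA strand, written 5'->3'."""
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna))

modified = "AΨGGCΨ"                   # made-up modified RNA
natural = modified.replace("Ψ", "U")  # the same sequence with plain U

print(reverse_transcribe(modified))
print(reverse_transcribe(modified) == reverse_transcribe(natural))
```

Under that pairing assumption, the cDNA read by the sequencer is identical whether the template carried U or Ψ, so the modification is invisible in the sequencing output.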

I think it's a bit like a private key- the difficulty is in finding some combination that works in an absolutely massive space of possible proteins, not necessarily in the length of the protein.
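To put a rough number on the "massive space" in the private-key analogy: with 20 standard amino acids, there are 20^n possible sequences of length n. The sketch below assumes a length of 1273 amino acids for the SARS-CoV-2 spike protein; treat that figure as an assumption for illustration.

```python
AMINO_ACIDS = 20  # number of standard amino acids

def sequence_space(length):
    """Number of distinct amino-acid sequences of the given length."""
    return AMINO_ACIDS ** length

spike_length = 1273  # assumed spike protein length, in amino acids
space = sequence_space(spike_length)

# Compare against the keyspace of a 256-bit private key.
print(len(str(space)), "decimal digits in the protein sequence space")
print(len(str(2 ** 256)), "decimal digits in a 256-bit keyspace")
print(space > 2 ** 256)
```

The analogy is loose, of course: unlike a key, many nearby protein sequences still "work" to some degree, which is what makes the search tractable.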

Check out this video by The Thought Emporium to see how far we’ve come in these matters:


This should hopefully provide you with some useful perspective.

"but I'm utterly amazed at how simple this _appears_."

Biology is a funny old thing. You can look at that concise description - the orange and so on blocks of a few letters and a few short groupings.

Now ATCG are basic building blocks but they consist of quite a lot of stuff. I think it's a bit more complex than that because this is RNA not DNA so ATCG might not be quite right. Each of those bases is horrifically complicated depending on scale. Search "ATCG" - this is a good start: https://en.wikipedia.org/wiki/Nucleobase

Now dive into one of those bases and decompose it to its constituent atoms. Now look at the maths around this stuff. It gets quite complicated, quite quickly.

That said, the fact that a bloody complicated thingie can be described so concisely is absolutely amazing and as you say it looks so simple.

It'd be cool to make an easy-to-use interface, still.

> This is somewhat of a problem for our vaccine - it needs to sneak past our immune system. Over many years of experimentation, it was found that if the U in RNA is replaced by a slightly modified molecule, our immune system loses interest. For real.

> So in the BioNTech/Pfizer vaccine, every U has been replaced by 1-methyl-3’-pseudouridylyl, denoted by Ψ. The really clever bit is that although this replacement Ψ placates (calms) our immune system, it is accepted as a normal U by relevant parts of the cell.


In case others don't know this, the reason this is abbreviated Ψ (psi) is that Ψ is the first letter of Greek ψευδής 'false, lying', the origin of the prefix pseudo-.

Umm, isn't that kind of scary? Like could you create a virus with this Ψ that our immune system can't fight at all?

It's part of an instruction to cells to make something. Viruses replicate by instructing cells to make viruses. Our cells don't know how to make Ψ, so the replicated virus would have the normal instruction.

Thanks, that is reassuring and makes sense. *at least until we figure out how to put the instructions to make it in the virus code :)

No. For a number of reasons. First of all, the virus uses the nucleotides (A, G, C and U, for which Ψ is used as a substitute) produced by the attached cell to create a copy of its genome (RNA). The nucleotides are produced by the cell; the virus does not instruct the cell to produce them. It just tells the cell to produce and assemble the proteins AND the RNA.

Second, our immune system doesn't just attack and recognize free floating RNA, but the virus itself. And different parts of the virus. First and foremost it will recognize the surface proteins (like the spike protein) because those are the things that it can see while the virus is outside of the cell. Also these are the things that the infected cells present on their surface (MHC II sites, if I'm not mistaken) to the immune system. (As far as I can understand, cells have to present the proteins they produce to the immune system otherwise they get killed. If they produce alien virus proteins that get recognized by the immune system, they also get killed.)

Interestingly enough, the immune system somehow also recognizes the so-called nucleocapsid protein, which is the one used to wrap the viral RNA inside the virus. (But it gets produced by the cells, so I guess it gets presented on the cell surface so the immune system can learn to recognize and counter it.) I didn't look into the details too much, but as far as I can understand it's not clear yet how those antibodies (the ones created against this protein) work, because antibodies are supposed to be used outside of the cells, but the nucleocapsids are only present inside the cells and then inside the virus.

To sum it up: the immune system is much more complex, has several recognition mechanisms, the viral RNA is mostly packed into the viruses (or are inside the cells) and the viruses don't have any way to produce Ψ (or any of the other nucleotides).

> I didn't look into the details too much, but as far as I can understand it's not clear yet how those antibodies (the ones created against this protein) work, because antibodies are supposed to be used outside of the cells, but the nucleocapsids are only present inside the cells and then inside the virus.

You are correct. The antibodies are made, and they end up not recognizing the nucleocapsid protein while it is in the virus, but rather when the infected cell displays internal proteins with MHC I, which helps T cells target the infected cell. MHC I displays self and MHC II displays proteins that have been "eaten" by the surveillance cells of the immune system.

Maybe I'm thinking too simplified here, but wouldn't this only work on the first iteration? After all the virus would replicate with Us in your cells and then the replicas wouldn't have the advantage anymore.

By that point the cell would be producing the associated protein though. Getting it inside the cell is the goal here from what I've read.

But that's what I mean, viruses need to replicate. They do this by injecting their RNA into your cells and hijacking your ribosomes for their replication. So the first viruses would definitely get past your immune system, but the replicated viruses would then be produced by your cells, so they wouldn't have the U->psi replacement that the first generation had. So every subsequent generation of the virus could be fought by your immune system. Effectively giving the virus a head start of one replication cycle. I'm guessing this wouldn't change much. But I'm not a biologist.

1) As someone else pointed out, this molecule substitution would not persist during replication. New viruses being produced in your cells would be made with a normal "U".

2) Your immune system does not usually attack the RNA housed inside a virus, but rather protein fixtures on its "body".

As denoted in the linked article [1]:

"Many people have asked, could viruses also use the Ψ technique to beat our immune systems? In short, this is extremely unlikely. Life simply does not have the machinery to build 1-methyl-3’-pseudouridylyl nucleotides. Viruses rely on the machinery of life to reproduce themselves, and this facility is simply not there. The mRNA vaccines quickly degrade in the human body, and there is no possibility of the Ψ-modified RNA replicating with the Ψ still in there. “No, Really, mRNA Vaccines Are Not Going To Affect Your DNA[2]“ is also a good read."

As far as I could tell, this would work well for getting a synthetic virus into the human body, but without the necessary mechanics within our cell, the special Ψ chemical won't be reproduced by the virus. That'd mean the replicated virus would get snatched up by the immune system as soon as it'd get released from the cell.

Theoretically, a complex enough RNA string could be used to have our cells build the necessary cellular machinery to properly reproduce the virus, but that's a kind of altering DNA that's a whole different can of worms. There are probably cheaper and easier ways of defeating the immune system, for example by simply "enhancing" ebola or HIV to make them more infectious and more resistant to our current drugs.

[1]: https://berthub.eu/articles/posts/reverse-engineering-source... [2]: https://www.deplatformdisease.com/blog/no-really-mrna-vaccin...

RNA is genetic material, but it encodes instructions to make proteins, which form the physical shell of the virus crucial to its function. As a very rough analogy, the RNA is source code and the proteins are the compiled program.

It's often the protein molecules that the immune system learns to recognise and attack.

RNA vaccines work because your body automatically translates them into some recognisable part of the viral protein, and then develops an immune reaction to that.

If a virus had Ψ instead of U in its RNA, it's still going to be making the same type of proteins. I can't see why it would be more likely to evade an immune response.

My exact question when I read that.

We are really quite fortunate that there was a ton of work done on coronaviruses, mRNA vaccines, adenovirus vaccines, etc prior to the pandemic. It seems like a pandemic even a year or three prior would have made the vaccine rollout considerably slower.

No wonder some people have an allergic reaction. Those people's immune response is more sensitive to this change.

No, they have allergies to the polyethylene glycol (PEG) compound in the lipid nanoparticles. It is also used in skin creams, toothpastes, condom lubricants and in larger quantities as a laxative. Some people are just allergic to it.

Any individual protein doesn't seem that complex since it's just a combination of some 20 amino acids, but the variations are endless:

"Since each of the 20 amino acids is chemically distinct and each can, in principle, occur at any position in a protein chain, there are 20 × 20 × 20 × 20 = 160,000 different possible polypeptide chains four amino acids long, or 20^n different possible polypeptide chains n amino acids long. For a typical protein length of about 300 amino acids, more than 10^390 (20^300) different polypeptide chains could theoretically be made. This is such an enormous number that to produce just one molecule of each kind would require many more atoms than exist in the universe."

proteins are also unique in that not just their sequence matters, but also their physical shape. 2 proteins can have the same sequence but a different physical shape, and therefore have different impacts on the body's chemistry. I started a PhD researching DSP methods for matching protein sequences and locations of amino acids. Fun stuff.

Then there are also post translational modifications, like addition of acetyl or phosphate groups, and sugars to the protein (glycoproteins).

I mean, I can understand how an eye or a brain can evolve by natural selection, but I’m still stunned by abiogenesis. I guess we’ll never know for sure how it all started.

The exponentiation signs got lost in your quote. Would you mind adding them back in?
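
For reference, the counts in the quoted passage are plain powers of 20, which Python's arbitrary-precision integers can check directly. This is only a sketch of the arithmetic; the function names are made up for illustration.

```python
AMINO_ACIDS = 20  # number of standard amino acids, as in the quoted passage


def chain_count(length: int) -> int:
    """Number of distinct chains of `length` residues, i.e. 20 ** length."""
    return AMINO_ACIDS ** length


def order_of_magnitude(length: int) -> int:
    """Largest k such that 10 ** k <= 20 ** length (decimal digits minus one)."""
    return len(str(chain_count(length))) - 1


print(chain_count(4))           # chains four residues long
print(order_of_magnitude(300))  # power of ten for a 300-residue chain
```

Counting digits avoids floating-point overflow, which a direct `float(20 ** 300)` would hit.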

People tend to think of genetic code as a sort of assembly language which is very verbose, but I wonder if the correct way to view it is in fact a very terse domain-specific language, because it actually depends on the entire complex machinery of the cell to be present in order to work, which in itself contains a lot of information?

> I wonder if the correct way to view it is in fact a very terse domain-specific language

Honestly, na. It's pretty verbose. There are a lot of weird ass things in there like "Skip basepairs until you find the matching terminating sequence" (I think it's AG .* GA but it's been a decade since my bioinformatics course), but you still have to include the non-AA-coding basepairs in the middle of that.

Compensating for that is the fact that there are like, multiple independent programs; if a ribosome is offset by a single base pair, the result is entirely different. If it runs the other strand, the result is different. And instead of crashing like any program would, biology just learns to use all of those possible encodings. In part, this works because there are 64 possible codons but only 20 amino acids, and the redundancy allows a substitution to affect only some of the offsets.
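
A rough sketch of the "offset by a single base" point, using the standard genetic code (NCBI translation table 1) and a toy sequence that is not taken from either vaccine. The `translate` helper is hypothetical and ignores start/stop logic; it just reads triplets from a given offset.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str, frame: int = 0) -> str:
    """Read `seq` in triplets starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")  # accept RNA or cDNA notation
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


toy = "ATGGCCATTGTAATGGGCCGC"  # made-up example sequence
for frame in range(3):
    print(frame, translate(toy, frame))
```

The same letters give three unrelated peptides depending on where reading starts; 64 codons map onto 20 amino acids plus stop, which is where the redundancy comes from.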

Yes. Another important metaphor is that the common idea of DNA as blueprints is entirely wrong. It's not blueprints, it's a recipe. A blueprint describes what something is. A recipe describes the steps needed to make something, making use of a lot of complex existing machinery and parts with only a reference to them.

The nucleotide sequence is obviously important, but people also sometimes forget that DNA and RNA are real things with 3D structure too. That matters too: it’s as if builders make errors where the blueprint rolls up or pages stick together.

The whole thing is absolutely fascinating and wild.

Interesting reasoning. But isn't it true to say that the "complex existing machinery and parts" which interprets the DNA was itself put together from instructions found in other DNA? I suppose that metaphors are rarely entirely comparable.

Some of that machinery and parts isn't directly represented by DNA. As an example, DNA codes some proteins that help extend cell walls, but those only work if you already have cell walls. If you have only the full DNA for a cell, and no other knowledge, you cannot build that cell out of that.

> and I can easily fit it on my screen.

...with GATACCA right in the middle, but unfortunately with no GATTACA that I could find.

Heh. Technically, there isn't even GATTACA in there since it's RNA and hence all the T's are actually U's. It's just convention to use the T's. GAUUACA doesn't have the same ring to it.

I'm estimating roughly 90-ish characters in a row, roughly 40 rows encoding the spike protein. So about 3600 bases. There are 3 bases per amino acid, so that's 1200 amino acids.

For comparison, the smallest chain that they technically call a protein is 100 amino acids; that's an arbitrary limit to separate proteins from enzymes. So this thing isn't tiny tiny.

But Titin (also called connectin), a giant protein responsible for passive elasticity in muscles, is ~27,000-35,000 amino acids. So this thing isn't even close to the biggest proteins out there.
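
The back-of-envelope estimate above, spelled out. The row and column counts are the commenter's eyeballed figures, not measured values, and the helper name is made up.

```python
CHARS_PER_ROW = 90   # eyeballed from the PDF, per the comment above
ROWS = 40            # likewise an estimate
BASES_PER_CODON = 3  # one codon (three bases) encodes one amino acid


def estimate_residues(chars_per_row: int, rows: int) -> tuple[int, int]:
    """Return (approximate bases, approximate amino acids)."""
    bases = chars_per_row * rows
    return bases, bases // BASES_PER_CODON


bases, residues = estimate_residues(CHARS_PER_ROW, ROWS)
print(bases, residues)
```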

> that's an arbitrary limit to separate proteins from enzymes

Do you mean “to separate polypeptides from proteins”? Enzymatic activity has nothing to do with size. For example, one of the smallest enzymes in humans has 62 amino acid residues. And, under certain conditions, even single amino acids can be catalytic.

But yeah, the polypeptide-protein threshold can get fuzzy, especially with the recent advances in miniprotein characterization.

yes, that is what I meant. It's been a long time since I've used that info.

The story I remember was that Insulin was the first protein that was sequenced, which is funny because it was before they made the distinction. It's actually too small to be considered a protein now.

> "Instead it's no bigger than a large paragraph of text, and I can easily fit it on my screen."

When I saw it, I thought that it could almost fit in a tweet, so I just did it:


The sequence takes 16 tweets, 15 if you don't split at line endings and remove spaces (4175 nucleobases / 280 nucleobases/tweet ~ 14.9 tweets).

Or you can use base2048 [1] to compress it down to 3 tweets (4175 nucleobases * 2 bits per nucleobase / 3080 bits per base2048 tweet = 2.7 tweets).

[1] https://github.com/qntm/base2048/
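
The tweet arithmetic above, as a small sketch. It assumes every one of the 280 characters carries 11 bits in base2048 (2048 = 2^11), which is what the 3080-bit figure implies; the function names are hypothetical.

```python
import math

TWEET_CHARS = 280
BITS_PER_BASE = 2            # four nucleobases fit in two bits
BITS_PER_BASE2048_CHAR = 11  # 2048 == 2 ** 11


def tweets_plain(n_bases: int) -> int:
    """Tweets needed at one nucleobase per character."""
    return math.ceil(n_bases / TWEET_CHARS)


def tweets_base2048(n_bases: int) -> int:
    """Tweets needed when bases are packed into base2048 characters."""
    bits_per_tweet = TWEET_CHARS * BITS_PER_BASE2048_CHAR  # 3080 bits
    return math.ceil(n_bases * BITS_PER_BASE / bits_per_tweet)


print(tweets_plain(4175), tweets_base2048(4175))
```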

"but I'm utterly amazed at how simple this _appears_."

Reminds me of the joke about the consultant engineer who knows where to mark the X with the chalk. LOL

Not Moderna, but this [1] was a very useful primer on grokking how the Pfizer vaccine works, especially for computer programmers.

[1] https://berthub.eu/articles/posts/reverse-engineering-source...

the way I see it we're just at the beginning, and we're mainly copy/pasting a lot of code, we understand some small parts, and we're generally in the teenage years of genome programming.

I don't know how long it will be before we get a bit more serious with it, but geneticists have a big obstacle in their understanding: any change might need a thousand-strong lifelong population study to be understood. That's way crappier than dumping the assembly or only having the documentation in Chinese.

I will add that moreover the developers might have been even more conservative in their code because they knew it was going for large scale deployment, they probably avoided the cutting edge as much as they could.

Bravo! Nice execution of the tips from yesterday's article!


Great quote from Maurice Hilleman, creator of many (most?) of our childhood vaccines goes something like “Don’t be smart. Instead be careful and accurate”

Lots of these things aren’t complicated. It’s the careful systematic testing and public trust building that’s the hard part.

The genetic code itself is reasonably comparable to ASCII in complexity - every 6 bits is the code for one amino acid in a string, which will fold itself into the required protein.
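
A minimal sketch of the "6 bits per codon" observation: with four bases, each base is 2 bits, so a three-base codon is a 6-bit number in the range 0-63. The particular bit assignment below is arbitrary and chosen only for illustration.

```python
# Arbitrary 2-bit assignment for the four RNA bases (an assumption, not a standard).
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "U": 0b11}


def codon_to_int(codon: str) -> int:
    """Pack a three-base codon into a 6-bit integer."""
    if len(codon) != 3:
        raise ValueError("a codon has exactly three bases")
    value = 0
    for base in codon:
        value = (value << 2) | BASE_BITS[base]
    return value


print(codon_to_int("AUG"), codon_to_int("UUU"))
```

Unlike ASCII, the mapping from these 64 values to amino acids is many-to-one: several codons encode the same amino acid.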

I remember a lot of features and especially bug fixes where I had to change one line of code, it took hours to figure out how exactly though. I guess this is kinda similar?

The way it reads like source code, truly makes me circle back to the idea we're all living in a simulation.

Mathematical truths about abstract notions of string theory fit in a line.

The New York Times published an article last year with the entire genome of the SARS-CoV-2 virus, with a breakdown of different sections to explain what protein the RNA codes for and what that protein does. Like you said it was amazing that it all fit within an [albeit long] newspaper article. It doesn't surprise me that the RNA for the vaccine, which only targets a single protein, is even smaller than that. Here's the NY Times article I was referring to:


It appears simple, but a whole lot of work went into producing that string even pre-COVID. Some of it is generic in the sense that it might apply to any mRNA vaccine. Some is quite specific:


There’s also (IIRC, no citation right now) prior work suggesting that coronavirus vaccines against the spike are likely to be effective and that vaccines against the N protein might be counterproductive.

each one of those letters represents a ~15 atom molecule, so in a way it is a compressed representation

See RadVac: https://radvac.org/

Make your own, open-source. Really cool.

A user on lesswrong made their own (with no prior experience): https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vac...

It's not really that simple.

Only two companies in the world succeeded; the French company Sanofi, which also tried making an mRNA vaccine, failed.

True, most pharmaceuticals can't do it now but given the right knowledge, which is known, it can be done relatively fast. I suspect in the next few years there will be many companies that will be able to replicate and advance the process.

It’s like looking at the binary file and saying “that’s pretty simple” while ignoring the massive amount of machinery that allows us to run that file and use it (CPUs, Motherboards, computers, etc).

I presume a whole bunch goes into making a vaccine, and this is just the tip of the iceberg.

so, explain it to me ?

Cool, but it's the lipid delivery system that is the secret sauce. This is equivalent to giving the source code without a compiler to build it.

Wouldn't the "compiler" be the bioreactor used to mass-produce it and the "installer" be the lipid encapsulation? :)

Maybe the booster shot can be done with a simple apt-get update.

Serious. Unless it requires a separate, native-compiled adaptive immunity package.

Got the “GPG Error: No Public Key” error. Probably for the best!

Fawwwwwk. Gotta edit the /etc/apt/apt.conf.d/covid-19-vaccine.conf

Anyone know if sudo works?

a fellow linux user. ubuntu?

In this case the bioreactor “compiler” is actually our own cells which read out the mRNA “source code” and translate it into protein. The lipid encapsulation delivers the mRNA to our cells, so perhaps it’s more analogous to a network protocol that delivers source code intact across firewalls and other defenses.

InstallGene by Flexera

Just waiting for first copy protection mechanism for mRNA. SafeGene and SecuGene here we come :o

"The authorization server is down, no longer signing this version of this medication, or this medication has expired. Please contact your supplier for more information."

Shit, I'm going to have to google for a working hex edit or look for a bpatch.

Can you imagine, the greedy folk would want some sort of Widevine in humans.

"Drink Brawndo, The Thirst Mutilator." in your DNA too.

Sony is gonna sneak in some DRM and Mark Russinovich is gonna put out a fix.

The vaccine will check your implanted payment chip receipt number before activating the payload.

Can't wait for the DNA "chip art" by gene designers. Considering we have so much HERV, they might use that as an equivalence rationalization to do it.

But wasn't the whole Pfizer/BioNTech "secret sauce" leaked online after the EMA was hacked?

Meh, they probably just used lipofectamine (which has been around since the 90s) or something very similar.



lipofectamine is used for in vitro transfection, not in vivo gene delivery. The vaccines use lipid nanoparticles rather than liposomes

I absolutely do not have a link, but I remember reading that the lipid nanoparticles are actually created by mechanical action (possibly fluid dynamics/turbulence). I thought that was pretty neat.

If they didn't use lipofectamine, what did they most likely use?

My first thought was `wdiff pfizer moderna`. It's short enough to post here in its entirety, but I guess I had better not; anyway, it's easy enough to extract from the PDF. Add a space after every letter and wdiff can find the common sequences nicely.

Short excerpt for flavor, this is from near the beginning:


A pairwise sequence alignment done with `needle` starts like this:

                                  |||||.|.|..||||                |||   ||
  Moderna            1 GGGAAATAAGAGAGAAAAGAAGAGTA----------------AGA---AG     31
                       |.|.|    ||       ||||||||||||||||||||||||||||||||
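
For anyone without `wdiff` or `needle` at hand, a rough stand-in is Python's standard `difflib`, which compares character by character and so needs no "space after every letter" trick. This is only a sketch: it is not a proper sequence aligner, and the two strings below are made-up toy sequences, not the vaccine sequences.

```python
from difflib import SequenceMatcher


def common_blocks(a: str, b: str, min_len: int = 4) -> list[tuple[int, int, str]]:
    """Return (index in a, index in b, shared text) for each matching block."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (m.a, m.b, a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also drops the zero-length sentinel block
    ]


seq_a = "AAACCCGGGTTTACGT"  # toy data
seq_b = "AAACCCTTTGGGACGT"  # toy data
for start_a, start_b, text in common_blocks(seq_a, seq_b, min_len=3):
    print(start_a, start_b, text)
```

A real comparison would use an alignment tool with a scoring model, as `needle` does; `difflib` only finds shared runs heuristically.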

Knowing nothing about biotech – if Moderna and Pfizer were working from the same sequencing data, why would their resulting vaccine mRNA sequences be different? Even slightly?

Edit: I guess what I'm asking is: presumably these vaccines both target the spike protein. Do both of these sequences express the same protein? Or is there a "close enough!" thing in the immune system, where it can be a little different and still be targeted by the immune system?

The sequence can be changed and optimized for several reasons:

* There are untranslated regions (UTR) that could influence the regulation or stability of the mRNA.

* Since most amino acids are encoded by more than one codon, the coding region for the spike protein can be codon optimized. Altering the codon composition can improve protein expression.

* Likewise, enrichment of G:C content in the mRNA sequence might result in increased mRNA and expressed protein yields in vivo.

See https://www.nature.com/articles/nrd.2017.243#Sec4 for more information.
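
To make the G:C point concrete, here is a minimal sketch that measures the G+C fraction of a sequence. The helper name is made up, and the example strings are toy codons rather than real vaccine sequence.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


# GCT and GCC both encode alanine; the second choice raises the G:C fraction.
print(gc_content("GCTGCT"), gc_content("GCCGCC"))
```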


> Do both of these sequences express the same protein?

In this case both vaccines express exactly the same amino acid sequence.

> Or is there a "close enough!" thing in the immune system, where it can be a little different and still be targeted by the immune system?

It depends on how different the sequence is. For instance, if it is a little different the immune response should be very similar because, for example, the three-dimensional conformation of the spike protein chain should remain very similar as well. This is why the vaccines can be effective against several SARS-CoV-2 variants.

Both sequences express the same protein.

Sequences are different because they are differently codon optimized. See https://en.wikipedia.org/wiki/Codon_usage_bias, especially "Effect on transcription or gene expression" section.

This very cool article explains the mRNA sequence chunk-by-chunk which might give you a flavour of why differences exist: https://berthub.eu/articles/posts/reverse-engineering-source...

That is a super interesting article, thank you for posting it!

But, I guess my question is more about why the abstraction of "protein chunks" doesn't fall apart when there are relatively significant "diffs" in the RNA sequence.

The most significant diffs between both vaccines occur in the untranslated regions located around the protein coding sequence and will never be present in the actual spike protein.

Regarding the protein coding region, because of the degeneracy/redundancy of the genetic code, all changes within it are synonymous and code for identical amino acids.
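
A small sketch of what "synonymous" means here, using the standard genetic code (NCBI translation table 1). The two sequences are toy examples, not excerpts from either vaccine, and `translate` is a hypothetical helper.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


variant_1 = "ATGTTTGTTTTT"  # toy: Met-Phe-Val-Phe
variant_2 = "ATGTTCGTGTTC"  # toy: same peptide, different codons
differences = sum(x != y for x, y in zip(variant_1, variant_2))
print(translate(variant_1), translate(variant_2), differences)
```

The nucleotide "diff" is non-empty, yet the translated peptide is identical, which is why the protein-level abstraction survives.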

That is a fascinating read (and the perfect level of depth in this field for me). How did you happen across it? Always looking to add a good source to my RSS feed list

dont know about op, but the link is present in the posted git repo readme.

> Our body runs a powerful antivirus system (“the original one”).


Moderna is better at codon golf.

> if Moderna and Pfizer were working from the same sequencing data, why would their resulting vaccine mRNA sequences be different? Even slightly?

Ever tried to compile the same source with different compilers?

Us folks in biotech have a special tool just for this :) https://blast.ncbi.nlm.nih.gov/Blast.cgi

Unfortunately, the core algorithm dates back to 1990, so it can be real slow in some cases. Biotech takes a while to improve :(

You can also run blast locally if you need to throw more hardware at it.

The thinking behind attaching a PDF with colors and not a Genbank file is why we can't have nice things in biotechnology.

Wait, you mean you don't extract genomic data from Excel? The MARCH1 gene brings many interesting surprises.

Excel finally has a facility for manipulating data that keeps it where you put it. It also incorporates a fairly decent functional programming language. It's called Power Query, not to be confused with all the other things that MS has named starting with "Power" and have no relationship at all and are mostly awful.

The only real annoyance I have with it is that the editor window is modal, like it blocks all the spreadsheets you have open on your machine, and it's primitive even compared to VBA, especially for debugging.

It's not just that it's given me the experience of "this is the way a spreadsheet or BI tool should work" but also "this is the way SQL should work". It's a little cumbersome to do the standard SQL-type operations, but the clean integration of functions means you can implement anything that's missing. Like say, Oracle has grouping sets - you can, and I did, just write a function to do that. I always felt that having a separate procedural language in your database was wrong, but I'd never seen the alternative until now. And I've been falling in love with higher order functions.

Power query is one of the best things to be added to Excel in recent years. I especially like how it makes import/ cleanups easier to reproduce vs the old ways.

I am fond of September 2, myself.

For those not in the know:


Now SEPTIN2! (and MARCHF1)

Exactly. FAIR (Findable, Accessible, Interoperable and Reusable) principles are at a loss here [1]. The "Reusable" part seems to be especially problematic as the sequence is buried in a PDF file though all aspects of FAIR are compromised here. Edit: It looks like there is now a PR to address this issue [2]

[1] https://www.nature.com/articles/sdata201618

[2] https://github.com/NAalytics/Assemblies-of-putative-SARS-CoV...

Things are getting better, but it's still so, so bad. The funny thing about that Nature article is that I recently had to parse an HTML table from a recent Nature article. Thank god pd.read_html did a decent job and I then only needed another hour to hunt down all the typos and weird text issues.

If there is no annotation or metadata FASTA format is usually preferred ;)
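
For readers who haven't met it, FASTA is just a `>` header line followed by the sequence wrapped over several lines. A minimal writer might look like this; the record name and the 60-column width are illustrative choices, not requirements of the format.

```python
def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


print(to_fasta("toy_record", "ACGT" * 20), end="")
```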

Do you have a list of all popular formats?

My thoughts exactly!

Somewhere, Margaret O. Dayhoff is weeping.

Despite how complex this really is, and how many "gotchas" there might be when using this repository, it's nice that it gets a shitload of attention. As a united humanity we should strive to solve our common problems.

If my little knowledge from biology class serves me correctly, RNA uses Udenine instead of Thymine. But in this document it uses T.

Can somebody explain to me why?

The convention of genomic research is to present all RNA sequences as equivalent cDNA sequences, as this will be the output of most common sequencing platforms.
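
Going between the two notations is a one-character substitution. A sketch, with made-up helper names:

```python
def cdna_to_rna(seq: str) -> str:
    """Rewrite a sequence given in DNA letters (T) using RNA letters (U)."""
    return seq.upper().replace("T", "U")


def rna_to_cdna(seq: str) -> str:
    """Inverse of cdna_to_rna: present an RNA sequence in DNA letters."""
    return seq.upper().replace("U", "T")


print(cdna_to_rna("GATTACA"))
```

This only changes the notation; it does not compute a complementary strand.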


DNA is way more stable than RNA. Since you can easily synthesize RNA from DNA, and DNA synthesis technology is much more mature, folks normally synthesize DNA and then derive/make the RNA from it. That makes most researchers default to DNA 5' to 3', even when talking about RNA.

You probably mean uracil, not udenine (which doesn't exist AFAIK).

Yeah, English is my second language. I just thought of a Translation that sounded reasonable rather than looking it up.

Note that independently of the notation used the mRNA of those vaccines use even more "weird" bases, such as 1-methyl-3’-pseudouridylyl, to make the vaccine mRNA not be detected by the immune system [1].

[1] https://berthub.eu/articles/posts/reverse-engineering-source...

DNA uses base pairs [A,T] and [G,C]; this code is for a piece of DNA. If you keep a DNA sequence in vials for later use, that is much more stable and easier to manipulate, and to repair when corrupted.

Normally RNA in vivo is complexed with proteins that prevent it from folding and annealing into structures that are not compatible with translation to protein. In the vaccine this isn't happening, which is why RNA is hard to work with and the vaccine must be kept so cold.

This is not to say that DNA is simple to work with, but it solves problems if you don't need direct access to RNA.

RNA uses uracil/uridine rather than thymine, but uridine is actually quite immunogenic. That's what has prevented people from using mRNA as a therapy until recently, when Katalin Karikó and Drew Weissman figured out that they could use pseudouridine (abbreviated as Ψ) instead. See [1] for more information.


Wow, looks like it is analogous to having a header on a TCP packet. [0] Here is an animation of mRNA encoding translated to proteins inside a ribosome. [1]

"The ribosome is composed of one large and one small subunit that assemble around the messenger RNA, which then passes through the ribosome like a computer tape. The amino acid building blocks, that's the small glowing red molecules, are carried into the ribosome attached to specific transfer RNAs; that's the larger green molecules also referred to as tRNA. The small subunit of the ribosome positions the mRNA so that it can be read in groups of three letters known as a codon."

Very analogous indeed.

[0] https://xerocrypt.wordpress.com/2014/07/22/how-to-read-almos...

[1] https://www.youtube.com/watch?v=TfYf_rPWUdY

Some parts of gene transcription are so straightforward one can almost be tricked into thinking it has the logic of a computer program. It may be an illusion. To stretch the metaphor, TCP parsers don't match probabilistically along the entire length of the packet in parallel, and they don't interpret the same part of a packet as data in some contexts, and a header in others.

I ended up majoring in biochemistry and molecular biology in my undergrad because I was browsing on Wikipedia one day and came across an article written on an E. Coli variant that had sentences like:

01J3 E. coli has a DNA Polymerase that contains 3’-5’ proofreading capability and 5’-3’ error correcting with a polymerisation rate of 50bps

I’ve made the above up because I have never been able to find a Wikipedia page since then that as succinctly pointed out to me that biology was a machine, and I was hooked

Rather disappointingly, neither sequence includes the string 'GATTACA'

A given combination of 7 bases has a probability of 1/16,384 of occurring at any given position. Since the COVID genome is about 30k bases long, I guess you have a pretty good chance of it appearing in there somewhere. This assumes uniformity, which of course is not true. COVID’s genome is under crazy intense selection pressure!
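
A quick sketch of that estimate under the same uniformity assumption (each base equally likely and independent, which the comment already notes is not true). The genome length used is that of the NC_045512 reference, 29,903 bases; the Poisson approximation for "at least one hit" is an added simplification, and the function names are made up.

```python
import math


def expected_hits(genome_len: int, k: int) -> float:
    """Expected occurrences of one fixed k-mer in a uniform random sequence."""
    positions = genome_len - k + 1  # number of places a k-mer can start
    return positions / 4 ** k


def prob_at_least_one(genome_len: int, k: int) -> float:
    """Poisson approximation to the chance of seeing the k-mer at all."""
    return 1 - math.exp(-expected_hits(genome_len, k))


GENOME_LEN = 29_903  # NC_045512 reference length
print(4 ** 7)                                   # number of distinct 7-mers
print(round(expected_hits(GENOME_LEN, 7), 2))   # roughly 1.8 expected hits
print(round(prob_at_least_one(GENOME_LEN, 7), 2))
```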

The sequence GATTACA appears 4 times in the reference version of the COVID genome :) (Go to https://www.ncbi.nlm.nih.gov/nuccore/NC_045512, pick "Find in this Sequence" on the right)

AWESOME! I just did a text search for it on github. Maybe it didn't pick any up that had a line break in the middle.
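
The line-break problem is easy to sidestep: strip all whitespace before searching. A minimal sketch, with a made-up helper name and toy input; it counts overlapping matches too.

```python
import re


def count_motif(text: str, motif: str) -> int:
    """Count occurrences of `motif` in `text`, ignoring whitespace and case."""
    seq = re.sub(r"\s+", "", text).upper()
    motif = motif.upper()
    return sum(
        1
        for i in range(len(seq) - len(motif) + 1)
        if seq.startswith(motif, i)
    )


wrapped = "ACGGAT\nTACAGG"  # toy text with the motif split across a line break
print("GATTACA" in wrapped, count_motif(wrapped, "GATTACA"))
```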

I'm much happier now.

Yep, the usual coding tools aren't ideal for bioinformatics. We have our own set of tools that work well with the various "standard" formats for sequence data.

That would have been a killer easter egg (possibly literally).

What's special about this string?

It's the title of a cult film: https://www.imdb.com/title/tt0119177/

Very clever title. It was only much later I understood what it meant.

The Human Genome Project was completed almost two decades ago, and somebody solved the protein folding problem recently.

Why are we still doing genetics at the machine code level? Shouldn't we have some compilers, assemblers and linkers by now?

Protein folding is not solved; that headline overstated the actual achievement of Google's protein folding solution.

If I remember correctly, "solving" protein folding was essentially a high-probability prediction that state A transforms to state B, on a big dataset. Or something like that anyway. It's as far from high-level work with genetics as manually creating nanotubes a few molecules long in a lab is from industrial production.

The most fundamental reason for that is that it's just not amenable to the human mind. We are quite primitive actually, being able to hold only a handful of "things" in our mind at any one time and relying on abstraction to think of more complex things. However, you can't abstract much in biology; there is no locality or separation of concerns, everything affects everything.

Take that piece of RNA. An intuitive mental model is that it's some form of "instruction" or a bunch of instructions, isn't it? It's also wrong, because it just encodes a protein that acts the way it does only because of its shape (that is, one of its potential energy local minima) and the shape of other proteins around it. That shape is only weakly local; it can be affected by far-away sections of the peptide sequence. So it's almost impossible to systematically break it down: you have to consider and model things as a whole, which is insanely complex both computationally and cognitively.

If you want a good mental model of how it works, imagine you assemble a thing from metal balls and springs. You take a few thousand balls and connect most of them with springs of different strengths. You then take this thing and throw it on the floor; it will assume a shape that is implicitly encoded in the spring strengths, its environment, and the way you've assembled it. You can even make it change shape if you poke it the right way. That's how biology works in a nutshell; it's a nightmare to design anything for systems like that. Again, you can't simplify and break down and encapsulate and abstract like you do in programming.

Because the problem is significantly more complicated than sequencing and folding.

I feel this XKCD describes the situation particularly well: https://xkcd.com/1831/

Maybe after 4 billion years of evolving our code we will get it right.

My thought exactly. So this thing is like a VM with a bunch of primitive opcodes; why can’t someone write a higher-level language, or at least some gadgets?

The problem with trying to program genetics is that there is a bunch of code already running on the system and every variable is a global. You can't just start up a new program with minimal impact on the stuff that is already running, like you can in most human-made computers. Also don't forget that the extremely simplified version of the running system looks like this: https://www.sigmaaldrich.com/technical-documents/articles/bi...

I wouldn’t call it extremely simplified; you can go simpler: https://science.sciencemag.org/content/351/6280/aad6253

Because it's a harder problem than it seems at face value.

I’m a little confused by the title? Looking at the document, it seems to me (knowing next to nothing about this field) it includes both Pfizer and Moderna’s protein spike sequence in figures 1 and 2, respectively. Is that correct?

It’s also interesting the way it’s worded: that the sequence was “assembled from $vaccine”. Does that mean whoever published this has backed into these sequences rather than having gathered this information directly from the source(s)?

You are correct. The researchers here sequenced each vaccine starting with the bit of vaccine left in the vial after administration. The goal was to get a raw sequence of the Moderna mRNA component so it can be easily filtered out as being a signal of therapeutic origin. Pfizer's sequence has already been published; it's included here to confirm that the result achieved experimentally matches the published sequence.

The authors reverse engineered the sequences of the vaccines, obtaining them from the remaining mRNA present in the vials.

“Assembly” in this case means that they merged several short sequences they obtained, each representing a fragment of the whole mRNA sequence.
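To make the idea of merging overlapping fragments concrete, here is a toy greedy sketch in Python. It is not the method the authors used: real assemblers deal with sequencing errors, uneven coverage, repeats and reverse complements, none of which are handled here. The fragments and function names are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps enough; return separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    reads = ["TAGCGA", "ATGGCCT", "GCCTTAGC"]  # toy reads, not real data
    print(greedy_assemble(reads))
```

Each merge glues two reads together at their shared end, so three short reads collapse into a single longer sequence.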

They sequenced the vaccine left over in used vials.

So reverse engineering basically.

And reverse engineering only sounds dramatic until you take a step back and acknowledge that it's literally what they do all the time. It's just that usually the sequences they read are not the outcome of some human development effort but of naturally occurring evolutionary processes.

We are simply programmable machines. It's pretty interesting that all of human life can be reduced down to 30k editable microservices.

That gives me the feeling that those reflexion models could help improve our understanding of those microservices.

"It's pretty interesting that all of human life can be reduced down to 30k editable microservices."

I don't know much about DNA and co, but it sounds like microservice is not the right metaphor. Rather just 30k of source code?

Because a microservice is something that is already compiled and running.

Was looking at it as each gene is a microservice and performs a role. Those microservices can be added to, edited / eliminated or swapped out.

Sure, but if you took that 30k of data and dropped it on a planet just like Earth, it would still take 10k years or so for us to build civilization as we know it again.

Not 10k years, as it needs to go through the million-year scale: RNA, heat, UV, then DNA ... from no oxygen to oxygen, etc. Then millions of years of evolving ... the scale is a bit off.

He's saying if you wiped out human civilization it would take O(10k) years to rebuild, because knowledge/culture isn't stored in the genes.

Yep exactly, but also that's a fun problem to think about how long it would take if we sent our DNA on an asteroid/space probe to another earth-like planet. :)
