
Diffing Coronaviruses - jamesfisher
https://jameshfisher.com/2020/02/09/diffing-coronaviruses/
======
glofish
diffing is not the right approach, it can't detect smaller blocks moving
about.

since the genome is made out of only A,T,G,C for any change beyond the trivial
diffing may generate the incorrect picture, it lines up the "wrong" bases.

Here is a much simpler solution that does a much better job and you don't have
to muck around so much. First install some tools:

    
    
      conda install -c bioconda entrez-direct emboss
    

Now do a:

    
    
      # Get the data.
      efetch -db nuccore -id MG772933 -format fasta > MG772933.fa
      efetch -db nuccore -id MN988713 -format fasta > MN988713.fa
      
      # Align globally.
      stretcher -filter  MG772933.fa MN988713.fa  > aligned.txt
    

A much simpler explanation of what is going on (I picked a variable region
here):

    
    
      ...
      MG7729 CACTCCTCACTATTCATAGAGGA-----GAC-CCT----ATGCCTAATAA
             :  : ::  :: : :::::: :      ::: :::    :: : :  : :
      MN9887 CTTTACTTGCTTTACATAGAAGTTATTTGACTCCTGGTGATTCTTCTTCA
      ...
    

it also reports:

    
    
      # Length:           29922
      # Identity:   26272/29922 (87.8%)
      # Gaps:         160/29922 ( 0.5%)

~~~
SubiculumCode
An example how domain knowledge makes data wrangling more effective and/or
valid.

------
robertelder
I did something similar recently using my custom terminal diff tool. Here is a
more colourful terminal-based view of the differences between MN908947.3 and
sars NC_004718.3:

[https://twitter.com/RobertElderSoft/status/12226906294923796...](https://twitter.com/RobertElderSoft/status/1222690629492379649/photo/1)

To produce that image I used this tool:

[https://github.com/RobertElderSoftware/roberteldersoftwaredi...](https://github.com/RobertElderSoftware/roberteldersoftwarediff)
With these two genomes:

[https://www.ncbi.nlm.nih.gov/nuccore/MN908947](https://www.ncbi.nlm.nih.gov/nuccore/MN908947)

[https://www.ncbi.nlm.nih.gov/nuccore/NC_004718.3](https://www.ncbi.nlm.nih.gov/nuccore/NC_004718.3)

with the leading numbers, spaces and newlines removed from each genome file.

My tool (as well as unix diff with --minimal) uses the myers diff algorithm
which just compute a mathematically minimal diff. More advanced algorithms
exist for computing phylogenetic trees that take into account the biological
likelihood of certain sequence change (deletions vs. additions vs.
translations etc.).

------
hprotagonist
diff is a special case of a more or less obligate preprocessing step for
sequence alignment algorithms.

You need to go further than diff to make meaningful phylogenetic comparisons,
though!

[https://en.wikipedia.org/wiki/Longest_common_subsequence_pro...](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem)
has much more.

------
thanatosmin
FYI—sequence alignment algorithms are really just a specialized subset of
diffing algorithms that take into account the properties of biological
sequences. (For example, certain edits are more or less likely to happen.)

[https://en.wikipedia.org/wiki/Sequence_alignment](https://en.wikipedia.org/wiki/Sequence_alignment)

~~~
glofish
I would say that perhaps diff is a special case of alignments, but not in
reverse.

Alignments differ from a "diff" because they are assigning a score to each
operation (match, mismatch, gap). The "optimal" alignment only cares about
maximizing the score, does not "care" about the objects that are aligned.

If one chooses the scores incorrectly they could end up with optimal (relative
to the score) alignments that visually seem completely "wrong".

I can't see how one could tune a diff, to produce obviously "incorrect"
output.

~~~
thanatosmin
Maybe most correct is to say diff and seq alignments are specializations of
some superset, given the specializations for both situations.

Diff algorithms absolutely do assign a "score" to edits, just in a very simple
way. For example, by playing with the Myers diff algorithm, you can get an
edit distance of one where you simply rewrite the entire file, or a large edit
distance but with the smallest edits possible.

------
gravelc
Definitely not a great approach.

It's worth pointing toward the actual algorithms commonly used for local and
global alignments (global is the one to use in the coronavirus genome case).

Local alignments: Smith-Waterman [1]. Global alignments: Needleman–Wunsch [2]

[1]
[https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorit...](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)
[2]
[https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algor...](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)

------
yters
I've been using a kmer approach, just a few lines in Python, and it's
interesting comparing coronavirus to many different genomes.

The neat thing about kmers is that it is alignment free, so can find
similarities between genomes even if genes have been moved around.

Interestingly, it finds HIV is quite close to cov compared to a random
sampling of refseq viruses. Maybe there is credence to that Indian paper after
all.

------
willyt
I read once that we share 99% of our DNA with Chimpanzees but we also share
95% of DNA with fruit flies. So if the viruses are 89% similar does that mean
the difference between SARs and nCOV is bigger than the difference between a
human and fruit fly? Who knows what this analysis might be telling us.

~~~
thanatosmin
That's not unreasonable to say, but virus sequences drift much faster and have
a much shorter lifecycle. This means a 1% difference can be huge for us and
insignificant for them.

------
et2o
This is a good start. However when trying to build phylogenies and understand
evolutionary relationshios there are many next steps beyond sequence identity.

------
scribu
Interesting, in light of a recent discussion about viruses having very high
gene variability:
[https://news.ycombinator.com/item?id=22276105](https://news.ycombinator.com/item?id=22276105)

------
djvu9
It seems NCBI's index is not updated. You should try MN996532, which is also
from bats and the sequence was uploaded Jan 29. It shows 96% similarity.

