
Genomics – A programmer’s guide - andy-thomason
https://gist.github.com/andy-thomason/f304850bdf20d2cd2ecbb042d81b5e54
======
heuermh
Along with the annual BOSC (Bioinformatics Open Source Conference), the OBF
(Open Bioinformatics Foundation) hosts a free and welcoming collaborative
event called CollaborationFest (CoFest).

CoFest is a collaborative two-day working session. The only requirement for
attendance is that you have an interest in open source software and solving
scientific problems. We will have contributors to open source bioinformatics
tools present to collaborate with, and we welcome new attendees who want to
learn and contribute to open source code, documentation, workflows, or
training.

This year CoFest is July 26-27 in Basel, Switzerland. There is no registration
fee to attend in person or virtually.

Disclosure: I'm on the organizing committee for BOSC and CollaborationFest

[1] [https://www.open-bio.org/events/bosc/collaborationfest/](https://www.open-bio.org/events/bosc/collaborationfest/)

------
glofish
For a more realistic and comprehensive guide to genomics from a data science
perspective see the "Biostar Handbook"

[https://www.biostarhandbook.com/](https://www.biostarhandbook.com/)

The book is inspired by "Biostars", the StackOverflow-like Q&A site for genomics:

[https://www.biostars.org/](https://www.biostars.org/)

~~~
chrisamiller
I often recommend Aaron Quinlan's excellent course if people want to learn a
more hands-on and practical set of skills related to computational genomics
and bioinformatics.

[https://github.com/quinlan-lab/applied-computational-
genomic...](https://github.com/quinlan-lab/applied-computational-genomics)

------
deugtniet
More genetics than genomics, but a cool introduction [1]!

There's a lot of subtlety in genomics analysis, and learning about all of it
is a deep dive into chemistry, biology, and statistics, as well as a lot of
literature search.

One fun fact is that most of the data in genomics is some form of plain-text
table, as this makes interoperability between different programs easier. Many
people have tried to design more efficient formats, but these are usually
limited to very specific use cases.
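To make the "plain text table" point concrete, here is a minimal sketch that reads a small tab-separated, VCF-like table with nothing but the standard library. The example rows are made up, the column set is trimmed for brevity, and `read_table` is a hypothetical helper rather than part of any genomics toolkit.

```python
# Made-up, VCF-like example data: "##" lines are metadata, the single "#" line
# is the column header, and every other line is one tab-separated record.
VCF_EXAMPLE = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t12345\t.\tA\tG\n"
    "2\t67890\t.\tC\tT\n"
)


def read_table(text):
    """Return one dict per data row, keyed by the header column names."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata lines
        fields = line.split("\t")
        if line.startswith("#"):
            # Header line: drop the leading "#" from the first column name.
            header = [fields[0].lstrip("#")] + fields[1:]
        elif header is None:
            raise ValueError("data row found before the header line")
        else:
            rows.append(dict(zip(header, fields)))
    return rows


rows = read_table(VCF_EXAMPLE)
for row in rows:
    print(row["CHROM"], row["POS"], row["REF"], row["ALT"])
```

Every value comes back as a string, so positions need an explicit `int(...)` before any arithmetic.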

[1]
[https://en.wikipedia.org/wiki/Genomics](https://en.wikipedia.org/wiki/Genomics)

~~~
andy-thomason
Yes, this is a deep subject and "Genetics" is indeed the correct term. Our
company is called "Genomics" because of the origins of the founders in the
Oxford stats department and Big Data Institute.

Both Gil and Peter have done some incredible things in this space, but the
language of scientists can be difficult for a lay audience to understand.

------
dannykwells
The hardest part of genomics for me has honestly been figuring out which
poorly maintained open source tool I should use for a particular problem, which
options should be set, and how the data need to be preprocessed beforehand.

I mean has anyone ever actually read the documentation of the GATK? It is
famously dreadful. And that's professionally maintained.

Honestly a nice addition here would be a "so you want to" with snippets of raw
FASTQ or VCF data and working code for various operations, maybe with an
accompanying Docker container.
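As a rough sketch of what one such "so you want to read a FASTQ file" snippet might look like: each record is four lines (an `@` name line, the sequence, a `+` separator, and a quality string). The reads below are made up, `parse_fastq` is a hypothetical helper, and the quality decoding assumes the common Phred+33 encoding.

```python
# Two made-up reads in FASTQ layout: @name, sequence, "+", quality string.
FASTQ_EXAMPLE = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!II\n"


def parse_fastq(text, offset=33):
    """Return (name, sequence, quality_scores) tuples, assuming Phred+offset."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    # Step through four lines at a time; splitting on "@" would be wrong
    # because "@" is also a legal quality character.
    for i in range(0, len(lines), 4):
        name, seq, separator, quality = lines[i:i + 4]
        if not name.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {name}")
        scores = [ord(ch) - offset for ch in quality]
        records.append((name[1:], seq, scores))
    return records


records = parse_fastq(FASTQ_EXAMPLE)
for name, seq, scores in records:
    print(name, seq, sum(scores) / len(scores))
```

This reads everything into memory, which is fine for a toy example but not for real sequencing runs, where a streaming parser or an established library would be the safer choice.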

~~~
xyhopguy
Have you ever looked at the test suites for Picard? All regression tests, and
the library is OO hell, lol.

I was taught a decade ago that rolling your own in genomics isn't as bad of a
decision as it seems.

~~~
searine
>I was taught a decade ago that rolling your own in genomics isn't as bad of a
decision as it seems.

Famous last words.

------
bluejellybean
Genetics is an extremely fun topic but the learning curve for entry has been
non-trivial. It reminds me of the struggles I had when I was first learning to
program recursive functions. With that said, it's been far, far more
complicated than recursion. So, thank you for the resource, I'm always happy
to find content sources that are helpful on the subject. If you enjoy hard CS
problems, bioinformatics has been booming and I don't really see it slowing
down anytime soon.

As for some resources, these books have been the most helpful for me.

[0] [https://www.amazon.com/Molecular-Biology-Cell-Bruce-Alberts-...](https://www.amazon.com/Molecular-Biology-Cell-Bruce-Alberts-ebook/dp/B00PWDH4RW) // I've linked this before and have seen it mentioned here in the past;
a great intro to cell biology.

[1] [https://www.amazon.com/System-Modeling-Cellular-Biology-
Conc...](https://www.amazon.com/System-Modeling-Cellular-Biology-
Concepts/dp/0262514222) // Modeling with math. I would consider this high-
yield as it gave me a great deal of insight into what different code bases are
actually attempting to do.

[2] [https://www.amazon.com/DNA-Nanoscience-Prebiotic-Emerging-
Na...](https://www.amazon.com/DNA-Nanoscience-Prebiotic-Emerging-
Nanotechnology/dp/1498750125) // Blew my mind when I first read it but the
content is pretty standard as far as genetics course material goes.

------
ghgr
For those of you interested in bioinformatics I can recommend the Rosalind
Project [1]. It's like Project Euler, but for bioinformatics.

[1]
[http://rosalind.info/problems/locations/](http://rosalind.info/problems/locations/)

------
blopker
If anyone is interested in playing with a full 23andMe raw data file (VCF), I
have mine on GitHub:
[https://github.com/blopker/DNA](https://github.com/blopker/DNA) PRs welcome!

If you're also interested in working on this stuff, shoot me an email ;)
blopker@23andme.com

~~~
magnamerc
How far are companies like 23andMe from entire genome sequencing? That's kind
of what I'm waiting for. Can you still get valuable data from genotyping?

~~~
jghn
You can get valuable data from genotyping. SNPs contain the bulk of variation
between you & me.

For your first question it depends on your definition of "companies like
23andMe". There are numerous companies that'll do a whole genome for you, but
I don't know if any of them do the writeup about it that 23andMe provides.
23andMe did at one time offer an exome product, but stopped that a while back.

The largest hurdle is cost. Whole genomes, even exomes, are significantly more
expensive than a SNP chip. As most would-be users don't know enough to care, it
doesn't make much economic sense to offer those to the masses at the moment.

~~~
astazangasta
Actually about half of variation is private (not common) and commercial
services will only look for common SNPs. So you will have some unique variants
that would show up in a whole genome but not a SNP test.

~~~
jghn
True. I was trying to say that your average layperson is unlikely to know
enough about the difference for it to be a big deal.

------
codeulike
That's a useful guide. A description of how sets of three letters translate to
amino acids or stop commands would be handy, because that bit is quite mind-
blowing and also quite reminiscent of machine code. And from there you can
explain different sorts of mutation, like truncation, substitution and
frameshift.
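A minimal sketch of that three-letters-to-amino-acid lookup, assuming the standard genetic code, uppercase DNA on the coding strand, and a reading frame that starts at the first base. The sequences are made up for illustration and `translate` is a hypothetical helper; the three variants show a substitution, a truncation (a premature stop) and a frameshift (the "phase shift" case).

```python
from itertools import product

# Standard genetic code, listed with bases in T, C, A, G order; "*" is a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate codons from index 0 until a stop codon or the end of input."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: translation ends here
        protein.append(amino_acid)
    return "".join(protein)


reference = "ATGGCCGAATGGTAA"     # ATG GCC GAA TGG TAA
substitution = "ATGGCCGACTGGTAA"  # GAA -> GAC changes one amino acid
truncation = "ATGGCCGAATGATAA"    # TGG -> TGA introduces an early stop
frameshift = "ATGCCGAATGGTAA"     # one base deleted, every later codon shifts

for label, seq in [("reference", reference), ("substitution", substitution),
                   ("truncation", truncation), ("frameshift", frameshift)]:
    print(label, translate(seq))
```

The lookup table really is the whole "instruction set": 64 three-base codons mapping to 20 amino acids plus stop, which is why a single deleted base scrambles everything downstream of it.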

Also a guide to how to use all that to interpret medical nomenclature of
mutations, like c.345G>E, would be handy.

~~~
dasmoth
_Also a guide to how to use all that to interpret medical nomenclature of
mutations, like c.345G>E, would be handy_

Those mutation descriptions are called HGVS (Human Genome Variation Society)
nomenclature. In the example you give, "c." means that it's in a (protein)
coding region, 345 is the position within the region, and G>E would be the
change (although E isn't a valid "letter" in DNA sequence, even if you allow
ambiguity codes -- you'd normally see something like G>T there instead).

Complications include:

1) You need to know which gene this is relative to.

2) The "coding sequence" for the gene isn't always perfectly defined, due to
splice variation and different versions of the annotation. Ideally, you'd see
this code relative to a specific splice variant (which might have an ENST
identifier, from [http://www.ensembl.org/](http://www.ensembl.org/)). But it
depends...

More at [http://varnomen.hgvs.org/](http://varnomen.hgvs.org/) if you're
curious.
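For the simplest case described above (a single-base substitution such as c.345G>T), the description can be pulled apart with a regular expression. This is only a sketch: `parse_coding_substitution` is a hypothetical helper, and it deliberately ignores the gene or transcript prefix and every other form the full HGVS nomenclature allows.

```python
import re

# Matches only the simplest form: "c.", a position, one DNA base, ">", one base.
SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_coding_substitution(text):
    """Return (position, ref, alt) for a simple substitution, else None."""
    match = SUBSTITUTION.match(text)
    if match is None:
        return None
    position, ref, alt = match.groups()
    return int(position), ref, alt


print(parse_coding_substitution("c.345G>T"))  # a well-formed substitution
print(parse_coding_substitution("c.345G>E"))  # "E" is not a DNA base
```

The parsed position is still meaningless on its own: as noted in the complications above, it only makes sense relative to a specific gene and splice variant.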

~~~
andy-thomason
How to represent variants is a whole can of worms. There are a number of
competing systems.

* RSIDs (from dbSNP, [https://www.ncbi.nlm.nih.gov/snp/](https://www.ncbi.nlm.nih.gov/snp/)).
* HGVS, as mentioned.
* Ensembl chrom-pos-ref-alt (CPRA).
* Variant key (Nicola Asuni).

As dasmoth says, there is no fixed coding sequence for a gene or location in
the genome.

------
brianzelip
NCBI (National Center for Biotechnology Information) and other related hackathons:
[https://biohackathons.github.io/](https://biohackathons.github.io/)

------
anderspitman
If you're a programmer/CS person interested in genomics/bioinformatics, I
can't recommend UCSD's Coursera courses[0] enough.

[0]
[https://www.coursera.org/specializations/bioinformatics](https://www.coursera.org/specializations/bioinformatics)

------
zubyak
Maybe it's off topic, but anyway :

I'm a CS student, and in my thesis I'll be working on an NGS C++ application. I
need at least a brief introduction to "basic" sequencing but I'm struggling to
find something accessible. Every book I find seems superspecialized. Now I'm
reading "Insect Molecular Genetics : An introduction to principles and
applications" but I'd like to read just a book chapter a little bit more
advanced than the contents shown in this video
[https://youtu.be/ONGdehkB8jU](https://youtu.be/ONGdehkB8jU)

Any suggestions?

~~~
alextheparrot
I studied Biochemistry/Comp Sci and the foundational biochemistry book imo is
the Lehninger Principles of Biochemistry. It goes over the basic biochemistry
and once you understand that, things just start to "make sense". Once you have
those basics you can read the Wikipedia article and things start to click.

On the other hand, as a person who’s worked on sequencing software I’ve found
the biochemistry knowledge to only be incidentally useful - though I may be
underestimating some of the “basic” assumptions that were used day to day.

~~~
zubyak
>as a person who’s worked on sequencing software I’ve found the biochemistry
knowledge to only be incidentally useful

I have the same feeling but I'm uncomfortable working on something knowing so
little about it. I'll check out the book, thanks!

------
aBioGuy
This is a nice intro. For a good collection of "worked out" "pipelines" to
analyze different kinds of genomic data types (RNA-seq, ChIP-seq) in the R
environment (the concepts are universal, even if you don't use R), take a look at
Bioconductor:

[https://www.bioconductor.org/packages/release/BiocViews.html...](https://www.bioconductor.org/packages/release/BiocViews.html#___Workflow)

------
virtuabhi
There are many people looking for an introduction to genomics and NGS
applications. One of the books I found to be extremely useful is Genomic
Quirks ([https://www.amazon.com/Genomic-Quirks-Search-Spelling-
Errors...](https://www.amazon.com/Genomic-Quirks-Search-Spelling-Errors-
ebook/dp/B01MYG6U6N)). This book explains genomic concepts with several case
studies.

Here is a video by the author -
[https://www.youtube.com/watch?v=BfVo8EkeDVI](https://www.youtube.com/watch?v=BfVo8EkeDVI)

------
Hackbraten
Unfortunately, the article handwaves the one thing I’ve been struggling with
the most: chromosomes.

What is the relationship between chromosomes and the human genome?

I sort of get that a chromosome is a kind of partition of the genome, but
how do the "two copies" phenomenon of the human genome and the "two copies"
thing of chromosomes fit together? Are those one and the same concept?

Are there two copies of the XY chromosome, too?

~~~
madhadron
A chromosome is a physical object. You can see them under a microscope. In
eukaryotes, they consist of a DNA double helix supercoiled and wrapped around
big protein structures called nucleosomes, along with a bunch of chemical
modifications of certain bits sticking off the nucleosome and of the DNA
structure which are involved in a bunch of different functions in the cell. In
bacteria and archaea, it's still supercoiled, and there are some proteins that
are similar to nucleosomes in function, but the picture is much more diverse.

Animals tend to have the same number of chromosomes at all times, and they
tend to come in pairs that are nearly identical. There are various ways of
mapping chromosomes that yield unique fingerprints that are stable under the
level of variation we typically see in a species (see restriction mapping for
example), so we can take a particular fingerprint and call it chromosome 1 or
2 or whatever. Animals have two copies of chromosome 1 and two copies of
chromosome 2, etc. There are individuals who don't, who have a single copy of
one or extra copies, and this causes problems, such as Turner syndrome.
Similarly, when animals reproduce, each parent produces a germ cell (sperm or
egg) that has one of each of the chromosome pairs. One of the reasons that
many hybrids like mules are sterile or nearly infertile is that their
chromosomes, coming from different species, aren't in pairs, so when they pass
on half of them, there may be necessary hunks of DNA that just aren't passed
on.

Other species have different numbers of copies of chromosomes, and may vary.
Depending on point in lifecycle and conditions, some plants range from two
copies to hundreds. Dinoflagellates tend to have four for interesting reasons
that I believe are related to the Byzantine generals problem.

There is no XY chromosome. Females in mammals follow the normal pattern with
the X chromosome: they have two of them and they are passed on like any other
chromosome. Males are weird. They have one copy of X and a copy of a shrunken
chromosome called Y. Note that this is only mammals. Birds and reptiles have a
totally different set of chromosomes for sex determination, and in some
vertebrate orders the chromosomes don't fully determine sex. Incubation
temperature often changes it.

------
terrykfwong
Thank you for this detailed and useful guide, Andy.

~~~
andy-thomason
Thanks, Terry.

I hope it was useful. It is just an introductory jargon-free guide. You can
find more on Ensembl and Wikipedia.

~~~
inciampati
There are a lot of caveats in representing a genome as a set of point diffs from a
reference. I worry that your description might promote a way of thinking about
genomes that ignores the more complex things that can and do happen.

