
As DNA reveals its secrets, scientists are assembling a new picture of humanity - yawz
https://www.statnews.com/2016/10/07/dna-genome-sequencing-new-maps/
======
apathy
Google is using graph learning to build representations in domains where
obtaining a huge, representative, "labeled" corpus is hard.

As many in the field can tell you, "labeling" genomes is hard, too. You can
stick labels like "died at 39 of AVCR" or "had 5 primary tumors by age 28" or
"insane multi-substance abuser with extra toes" on a person, but that doesn't
really encapsulate all their traits. I claim that genome analysis is a great
candidate for semi-supervised learning. Looks like Ben Paten already had
_that_ thought...

Haussler, Paten, McVean, and the usual suspects are working on a tractable
graph representation to replace (say) hg39, i.e. instead of a new "reference
genome", there should be thousands. This makes more sense when you look at how
common structural variants (say, inversions) are, and then when you do
something like ask Haussler at ASHG "how do we represent inversions in this
thing?" you realize how f _cking hard it really is to get it right.

McVean & co. made it work for the major histocompatibility complex, which
immunologists can explain better than I can, but that's perhaps the most
diverse bit of genome there is. It's riddled with ancient repeat element
insertions and generally fascinating. It's also a source of Little Problems
like organ transplant rejections, graft vs. host disease in stem cell
transplants, maybe bits of schizophrenia and neurodegenerative disease... in
short, if it can be done for the MHC, it's probably doable for the whole
enchilada. It's not easy, though. In fact it's so difficult to get right that
a sub-field of accurate HLA typing and immunotyping evolved in parallel to
large-scale genome analysis, because it's really, really important not to fuck
this up. So that's a note of caution from historical precedent.

Nonetheless, it turns out that your pals at Google spearheaded the effort to
have a Hadoop-like "data lake" for genomes (sign up for the GA4GH mailing
lists if you like to watch professors bikeshed an API, and occasionally
produce incredible insights by accident). Maybe this is going to converge in
an interesting way. It won't happen overnight, but it will happen, and the
_mathematicians_ will be vindicated.

* Seven Bridges is a little genomics company with a funny name. Unless you're familiar with Eulerian paths and the Seven Bridges of Konigsberg, in which case it's sort of obvious why they chose their name. Nice people, other than the patent, which infuriated most everyone else.

~~~
niels_olson
Hi, I'm a pathology resident and I would like to present a journal club on
this. My undergrad is in physics and I have taught myself some bioinformatics,
like Durbin's use of Markov models, Ukkonen's suffix tree, BWT, BLAST, but
clearly I'm no informatics expert. Two questions:

1) in the absence of a reference genome, does Burrows-Wheeler not apply?

2) Could you recommend any good articles to start from?

~~~
apathy
There's probably, in spirit, a graph version of the BWT, but I'm not familiar
enough with it to say. The approach is less straightforward because you're not
just modifying the BWT to allow for indels/errors; you need a graph
compression/indexing scheme that supports fragment searching along the
vertices of a sequence graph. That said, it has to happen eventually. It's
been a while since I took comp bio (sequences & graphs), but what you
mentioned is what I learned (and I took it from Waterman, so what I'm saying
is that I think you've got it).
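
For anyone mapping the terminology onto code, here's a minimal, deliberately
naive sketch of the classic linear-reference BWT (sorted rotations, last
column). Real aligners build it from a suffix array; the graph variants have
to generalize this from one string to many paths:

    def bwt(text, sentinel="$"):
        """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

        Real implementations build this from a suffix array in O(n log n) or better;
        this O(n^2 log n) version only exists to show the idea.
        """
        s = text + sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    print(bwt("GATTACA"))  # last column of the sorted rotation matrix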

You can look at Pall's work to see how the current approaches may evolve into
something compressible:

[https://github.com/GFA-spec/GFA-spec](https://github.com/GFA-spec/GFA-spec)
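
To make "graph in text form" concrete, here's a toy fragment in the GFA 1
style from that spec (segment names and sequences are invented), plus a few
lines of Python that read it into segments and links; a sketch, not a full
parser:

    # A toy sequence graph in GFA 1 syntax: S = segment, L = link, P = path.
    # Segment names and sequences below are made up for illustration.
    toy_gfa = "\n".join([
        "H\tVN:Z:1.0",
        "S\ts1\tACGTAC",
        "S\ts2\tTTG",
        "S\ts3\tGGA",
        "L\ts1\t+\ts2\t+\t0M",
        "L\ts1\t+\ts3\t+\t0M",
        "P\tp1\ts1+,s2+\t*",
    ])

    segments, links = {}, []
    for line in toy_gfa.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":        # segment: name -> sequence
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":      # link: (from, orient, to, orient)
            links.append((fields[1], fields[2], fields[3], fields[4]))

    print(segments)  # {'s1': 'ACGTAC', 's2': 'TTG', 's3': 'GGA'}
    print(links)     # [('s1', '+', 's2', '+'), ('s1', '+', 's3', '+')]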

As far as articles? McVean's, without a doubt!

[http://www.nature.com/ng/journal/v47/n6/full/ng.3257.html](http://www.nature.com/ng/journal/v47/n6/full/ng.3257.html)

An implementation paper for graph assembly HLA typing is at:

[http://biorxiv.org/content/early/2015/12/24/035253](http://biorxiv.org/content/early/2015/12/24/035253)

Interesting times ahead, with people recognizing that GxE and GxExE matter
far more than G alone.

~~~
volodia
Here is a paper that presents a graph version of the BWT:

[http://bioinformatics.oxfordjournals.org/content/29/13/i361....](http://bioinformatics.oxfordjournals.org/content/29/13/i361.full)

~~~
apathy
It's not 100% clear to me that this is directly compatible with the graph
reference, but if it isn't, that seems like a simple matter for Batzoglou &
co. Those guys are really, really good, and the multi-reference target is a
type of graph anyway, so I imagine it's just a bit more generalization (if
any) to make BWBBLE work on a graph-structured reference assembly as
implemented (there seem to be some inconsistencies between implementations in
how structural variants are represented).

The other thing that would be neat is that then you'd have a direct tie-in to
ancestral recombination graphs and could, in principle, get IBS/IBD for the
same cost as high-confidence genotyping for any two individuals. Come to think
of it, there's probably a way to recast this as shortest paths and get all
admissible traversals between a population of genotyped individuals (given an
ARG) for the same price as any two. Hmmm. This is a little disturbing.

------
WhitneyLand
There's got to be a lot more to the story that I don't understand.

Why wouldn't it have been obvious 16 years ago that a thoughtfully designed
data model was necessary, possibly using graphs, to account for variability
and other attributes of the genome?

Surely it was foreseeable that tooling would be crucial and that a solid
software foundation would be invaluable to enable efficient and flexible
processing for years to come?

Who were the lead developers supporting the original public genome project and
what were they thinking?

~~~
epistasis
A lot of the same people who worked on the first reference genome are working
on these algorithms, or have mentored (or are mentoring) the scientists in
this article. I would say that there's a lot more to genome analysis than you
might expect; there are many different comparisons that make sense. Initially
the most informative comparisons were to other species, where a genome graph
made less sense given the amount of data and compute available to
bioinformatics scientists at the time.

The rate of technology change for sequencing capabilities in the past 16 years
makes Moore's law look like the rate of change in battery technology.

16 years ago I didn't think it would be possible to sequence individuals in
the clinic before 2050 or so. Now we are building the technology to analyze
the variation in a million genomes.

I guess your question is kind of like, "why didn't computer scientists build
systems like Kubernetes or Mesos in the 80s?" The problems and challenges were
just different 16 years ago, and there was more than enough to work on between
then and now. We don't _need_ genome graphs at this instant, but we will in
the coming years. And it's likely that newer representations will come about,
too, as more math and theory is invented.

------
wrong_variable
I interviewed at Seven Bridges.

Very interesting company. They had a typical whiteboard interview process.

What they were doing didn't seem that technically hard, and they were more
concerned with prior credentials (like most bioinformatics companies) than
with what you were able to do.

They also seemed to think JavaScript is not worth their time :( The problem
they gave me was algorithmic and I just used the tool that was available to
me. They apparently write most of it in C++ due to "Speed".

The only reason I even got a face-to-face, even though I didn't have a
Master's degree, was that I had done the assignment better than their PhD
candidates (their words).

This piece seems like a submarine article for their proprietary platform.

------
nonbel
Genetics question:

How many cells do you need in a population so that there is at least one
variant at each position (i.e., a SNP)? This can be for any species or cell line
for which the info is available.

~~~
toufka
Humans have two copies of ~3 Billion base pairs. A reasonable error rate of
DNA replication (not under stress) is about 1 write error in 1 billion reads.
Many of those errors are actually immediately corrected by post-replication
error correction mechanisms. Also, as cells divide, errors that happen early
will be propagated more times than errors that happen late in development.
Further, mutations are not uniformly distributed through the genome. A human
has > 10^14 cells, so any given human might have a few thousand different
variants among its cells. The error rate of a human polymerase is way lower
than that of a bacterial polymerase. And a human genome is way larger than a bacterial (or
viral) genome. Viruses actually have mechanisms to increase the error rate of
DNA replication.

For a human, you would need a LOT more than a single human to have a variant
at each position. Something like HIV can actually have a small enough genome,
a high enough error rate, and a large enough population in an infected
patient, that it can realistically have a unique variant at each position in
its genome within a single human host.
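
To put rough numbers behind the HIV point, here's a back-of-the-envelope
sketch; the genome size, error rate, and virion production rate are ballpark
literature values I'm assuming, not figures from the comment above:

    # Back-of-the-envelope check of the "HIV samples every point mutation" idea.
    # All three inputs are rough literature values, i.e. assumptions, not data.
    genome_len = 1e4          # HIV genome, roughly 10 kb
    err_per_base = 3e-5       # reverse-transcriptase errors per base per cycle
    virions_per_day = 1e10    # new virions per day in an untreated host

    muts_per_genome = genome_len * err_per_base       # ~0.3 mutations per virion
    muts_per_day = muts_per_genome * virions_per_day  # ~3e9 mutations per day
    possible_point_variants = genome_len * 3          # 3 alternative bases per site

    print(muts_per_day / possible_point_variants)  # each variant hit ~1e5 times/day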

~~~
nonbel
Thanks, I later found this paper that claims something different. Can you
explain where this logic goes wrong? They seem to start from a similar error
rate (~10^-9 per site per division):

"For example, the intestinal epithelium contains approximately 10^6
independent stem cells, each of which generates transient daughter cells every
week or two. Thus, the intestinal epithelium of a 60-y-old is expected to
harbor >10^9 independent mutations. This implies that, not far beyond the age
of 60 y, nearly every genomic site is likely to have acquired a mutation in at
least one cell in this single organ."
[http://www.pnas.org/content/107/3/961.full](http://www.pnas.org/content/107/3/961.full)

_Edit:_

> "A reasonable error rate of DNA replication (not under stress) is about 1
> write error in 1 billion reads"

I think the issue is that you are using the 10^-9 value as _per genome_ while
that reference uses it as _per base pair_.
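
For what it's worth, here's the arithmetic behind the two readings, using the
paper's ~10^6 stem cells and assuming roughly weekly divisions over 60 years
and a ~6x10^9 bp diploid genome (my assumptions, for illustration only):

    # Two readings of "~1e-9 error rate", applied to the intestinal example.
    genome_bp = 6e9          # diploid human genome, ~6 billion bp (assumed)
    stem_cells = 1e6         # intestinal stem cells, from the quoted paper
    divisions = 52 * 60      # roughly weekly divisions over 60 years (assumed)

    # Reading A (the paper's): 1e-9 per base pair per division.
    muts_per_bp_reading = stem_cells * divisions * genome_bp * 1e-9
    print(muts_per_bp_reading)      # ~1.9e10, comfortably > 1e9 and > genome size

    # Reading B: 1e-9 per genome per division.
    muts_per_genome_reading = stem_cells * divisions * 1e-9
    print(muts_per_genome_reading)  # ~3 mutations in the whole organ over 60 years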

~~~
apathy
Write error == (fixed) base pair mutation. A classic example is a methylated
cytosine spontaneously deaminating to yield thymine. Our DNA repair enzymes
can't tell the difference (in terms of which is the "right" base) between the
thymine and the guanine left behind, but since they don't match, one of them
has got to go. Thus there is a 50-50 chance that the mutation will be fixed.
That's the easiest example because it's not an "error" per se (rather a
thermodynamics problem) but genuine proofreading errors also occur. The net
rate is about one in a billion bases.

These estimates ignore indels and SVs, but empirical evidence supports the
rule of thumb that "everyone over 50 has a 50/50 chance of at least one adult
stem cell harboring a mutation in at least one interesting gene." My personal
favorite is

[http://www.nature.com/nature/journal/v518/n7540/abs/nature13...](http://www.nature.com/nature/journal/v518/n7540/abs/nature13968.html)

but Druley's follow up was equally awesome:

[http://www.nature.com/articles/ncomms12484](http://www.nature.com/articles/ncomms12484)

and the recent survey of adult stem cell mutations is nice:

[http://www.nature.com/nature/journal/vaop/ncurrent/full/natu...](http://www.nature.com/nature/journal/vaop/ncurrent/full/nature19768.html)

One take-away from all this is that, while 95% of a sensitively surveyed
population of 50-60 year olds had at least one stem cell with at least one
known preleukemic mutation, it is equally clear that most people aren't
walking around with anything resembling an acute leukemia. The natural
conclusion is that in individuals with a competent immune system and diverse
enough pools of healthy stem cells, it's not that big of an issue. Only when
bad luck and/or stresses to which the mutants are adapted (e.g. TP53 mutations
in therapy-related leukemia) afflict people, or the natural diversity of their
stem cell populations collapses (as with really old people and individuals
whose immune system actively attacks their stem cells, as in severe aplastic
anemia) do you see the sort of massive, life-endangering takeover that we
recognize clinically as disease.

Furthermore, nearly all of us are born with 5-10 predicted-to-be-lethal
variants in our genomes. Clearly, we're also not dead, so our conception of
"lethal" can't be quite right. There is an enormous amount of complexity in
how real live multicellular organisms deal with variation and mutation,
something we're really only just starting to grasp, and of course all of that
then interacts with the person's environment to manifest (or not) their
genetic tendencies. We build models of reality because the actual thing is too
complicated to be tractable; it's important never to confuse the two :-)

~~~
nonbel
>" The natural conclusion is that in individuals with a competent immune
system and diverse enough pools of healthy stem cells, it's not that big of an
issue."

Thanks, I have been thinking along those lines for a few years now after
looking at the age-specific incidence of a bunch of different cancers from
SEER. You see that many cancers peak consistently year after year at a given
age, while the height of the curve may change drastically. The same was true
when I looked at some data from other countries, although I never followed up
very much on that aspect.

Then if you read the paper that spawned the widely adopted multi-stage model
of cancer[1], you see they made some assumptions for computational reasons
that are unnecessary in these days of cheap computing power:

    
    
      p*t ~ 1 - (1-p)^t,  if p << 1
      where
      p = probability a required mutation occurs during a given time interval
      t = number of elapsed time intervals (i.e., age)
    

Then by the product rule of probability they derive that, if cancer is due to
accumulation of errors (usually considered to be mutations), the incidence at
a given age would be:

    
    
      I(t) = k*p1*p2*...*pn*t^n = k*(p'*t)^n
      where
      I(t) = incidence at age t
      n    = number of required mutations
      p'   = geometric mean of the probabilities for mutations 1:n
      k    = a constant determined by the number of cells in each tissue,
             the proportion of times that a detectable tumor forms from 
             the carcinogenic cell, and possibly the sequence in which the
             mutations occur
    

If you use the non-simplified version of their theory you would instead get:

    
    
      I(t) = k*(1 - q^t)^n
      where
      q = 1 - p'
    

In contrast to the model that was simplified for computational reasons, this
has a turnover. By setting the second derivative of I(t) to zero (i.e.,
finding where the incidence rate dI/dt peaks), you can get the age at which
peak incidence should occur as a function of the number of required mutations
(n) and of q = 1 - p' (one minus the geometric mean of the per-mutation
probabilities):

    
    
      t_peak = log(1/n, base = q)
    

From this you will see that either the multi-stage model is totally wrong,
the error rate must be much higher than commonly thought, and/or the cell
division rate of the error-accumulating cells must be much higher than
commonly thought (age is usually taken as a stand-in for the number of
divisions). The last two possibilities suggest that we are constantly
generating these cancerous cells and that they are being cleared somehow.

[1]
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2007940/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2007940/)
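
A quick numerical sketch of the full model and the t_peak formula above, with
made-up values of n and p' (nothing here is fitted to SEER or any other data):

    import math

    # Full (non-simplified) model from the comment above:
    #   I(t) = k * (1 - q**t)**n,  with q = 1 - p'
    # The incidence *rate* dI/dt peaks at the inflection of I(t), i.e. where the
    # second derivative vanishes: q**t = 1/n, hence t_peak = log(1/n, base=q).

    def incidence(t, k, p_prime, n):
        q = 1.0 - p_prime
        return k * (1.0 - q ** t) ** n

    def t_peak(p_prime, n):
        q = 1.0 - p_prime
        return math.log(1.0 / n, q)

    # Illustrative values only -- not fitted to anything.
    n, p_prime, k = 6, 1e-2, 1.0
    tp = t_peak(p_prime, n)
    print(tp)  # ~178 time units for these made-up parameters

    # Numerically confirm the rate dI/dt is maximal near t_peak.
    def rate(t, h=1e-3):
        return (incidence(t + h, k, p_prime, n) - incidence(t - h, k, p_prime, n)) / (2 * h)

    print(rate(tp) > rate(0.8 * tp) and rate(tp) > rate(1.2 * tp))  # True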

~~~
apathy
Armitage-Doll was a nice advance over previous models, but it's also
incomplete (and probably flat wrong in some cases, although working on
pediatric malignancies has convinced me that a second cooperating event
usually is mandatory).

In normal stem cells, it appears that attrition and immune clearance get rid
of damaged cells when they cycle, and of senescent cells all the time (subject
to some variation, not entirely age related, at least in our volunteers). We
might expect higher rates in filter organs, but liver cancer isn't too common,
and the paper I referenced earlier shows that this can't just be an issue of
fewer divisions (I despise the oversimplified Tomasetti & Vogelstein paper
because the facts simply don't support it). Colorectal is probably more common
because the crypts are "facing out" a la melanocytes, and thus prone to
accumulating lots of environmental damage.

Anyways, the latter of your possibilities (proliferative mutants divide faster
and make errors more often than their normal counterparts) makes the most sense -- the
eventual "winner" in a tumor is the cell that produces the most progeny and
resists apoptosis due to stress the best. It's probably not a coincidence that
these are traits which adapt a mutated cell to survive chemotherapy as well.
However, spawning nonself mutations willy-nilly is a great way to attract
immune attention -- particularly if you haven't blown the immune system away
by nuking it with chemotherapy. :-/

~~~
nonbel
>"pediatric malignancies"

Maybe, or you can use the full Armitage-Doll model I described above and
replace t with N(t), the number of divisions since the zygote as a function of
time, modeled as something like a discrete exponential decay:
N(t) = N0*(1 - k)^t + 1, where N0 = N_birth - N_adult.

That is, take the difference between the division rate at birth and the
division rate as an adult, and fit a constant k between zero and one. It is
just a first approximation at best, because data on division rate by age in
various tissues doesn't seem to be available...

[https://s18.postimg.org/9cn5vi8t5/div_Rate.jpg](https://s18.postimg.org/9cn5vi8t5/div_Rate.jpg)
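
If I'm reading the suggestion right, one way to wire it up (a sketch with
invented rates, not fitted to any tissue) is to let the per-interval division
rate decay from its birth value to its adult value, accumulate it into total
divisions, and feed that into I(.) in place of raw age:

    # One reading of the suggestion above; all rates are invented placeholders.

    def division_rate(t, n_birth=20.0, n_adult=1.0, k=0.1):
        # N(t) = N0*(1-k)^t + N_adult, with N0 = N_birth - N_adult
        n0 = n_birth - n_adult
        return n0 * (1.0 - k) ** t + n_adult

    def cumulative_divisions(age):
        return sum(division_rate(year) for year in range(int(age)))

    def incidence(divs, k_const=1.0, p_prime=1e-3, n=6):
        q = 1.0 - p_prime
        return k_const * (1.0 - q ** divs) ** n

    for age in (1, 10, 40, 70):
        d = cumulative_divisions(age)
        print(age, round(d, 1), incidence(d))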

~~~
apathy
The majority of pediatric malignancies are either germline related, in utero (
_de novo_ mutation/SV, potentially caused or facilitated by maternal
environmental exposures), or a reverse lottery winner. There simply isn't
enough time for somatic mutation to cause the sort of devastating fallout that
you see in DIPG or infant leukemias. (Furthermore, even the point mutations
seen in pediatric cases are characteristic and rare or absent in adults; some
structural variants are also observed in adults, but they are much rarer and
accompanied by fewer cooperating events)

~~~
nonbel
>"There simply isn't enough time for somatic mutation to cause the sort of
devastating fallout that you see in DIPG or infant leukemias."

This depends on the error + division rates (along with number of cells) in
that tissue at that age though. From my research there not really such data
available on any of those terms. Also, the errors need not be somatic
mutation. For example, chromosomal missegregation may be much more common and
potent since it can mess up the expression of many genes at once:

"Nevertheless, the rate of chromosome missegregation in untreated RPE-1 and
HCT116 cells is  0.025% per chromosome"
[https://www.ncbi.nlm.nih.gov/pubmed/18283116](https://www.ncbi.nlm.nih.gov/pubmed/18283116)

I'm just saying there are a number of other assumptions being made here, and
if we get rid of the standard ones the Armitage-Doll model is capable of
fitting the data surprisingly well.

------
veli_joza
Is there an open source project that tries to replicate the Seven Bridges
proprietary graphing technology? It seems like a logical next step.

~~~
a_bonobo
There are many people working on various aspects of DNA graphs - for example,
there's now the FASTG format, a proposed replacement for the FASTA format that
is essentially a graph in text form:
[http://fastg.sourceforge.net/](http://fastg.sourceforge.net/)

Some assemblers (SPAdes?) have started to support this format, but most
downstream software only uses the FASTA format (the non-graph representation
of a single genome).

Heng Li (somewhat now the godfather of bioinformatics) wrote a blog post about
various implementations here:
[https://lh3.github.io/2014/07/25/on-the-graphical-representation-of-sequences/](https://lh3.github.io/2014/07/25/on-the-graphical-representation-of-sequences/)

~~~
a_bonobo
Just realised that most of the above is from 2014 - here are some recent
works:

A novel data structure to store the whole graph:
[https://almob.biomedcentral.com/articles/10.1186/s13015-016-...](https://almob.biomedcentral.com/articles/10.1186/s13015-016-0083-7)
Software is here:
[https://www.uni-ulm.de/in/theo/research/seqana.html](https://www.uni-ulm.de/in/theo/research/seqana.html)

This one maps the whole DNA space using markers:
[http://www.nature.com/articles/ncomms7914](http://www.nature.com/articles/ncomms7914)

Here's a paper that looks at the total genetic space of several individuals,
but with read mapping alone, no graphs:
[http://genomebiology.biomedcentral.com/articles/10.1186/s130...](http://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0757-3)

Most of this is based on open source or openly available software. A big
company that's closed source is NRGene from Israel; I've read good things
about their DeNovoMAGIC/PanMAGIC, but I'm unsure how that stuff works exactly
(apart from massive short-read coverage).

------
rdabane
I'm highly interested in learning about genomics. What's the best way for an
electrical engineer(signal processing,information theory,graph theory) to
start gaining more insights in this field?

