
How Big Data Can Help Fight Cancer - DarkContinent
http://cancer.nautil.us/article/167/how-big-data-can-help-fight-cancer
======
dkural
The biggest bottleneck to realizing this vision, unbelievably, is lack of
funding for large-scale, high-quality whole-genome cancer sequencing, plus
the high-quality treatment, family, and phenotypic data to go along with it.

As nice as Foundation's data is, there are few or no phenotypes to go along
with it, which is a bottleneck to extracting any useful information from it.
Keep in mind that genomics is only one half of genetics (the other half being
the phenotype) and is meaningless without knowing more about the patient.

On the phenotype side, EMRs/EHRs are practically useless as scientific tools
(which doesn't stop academics from publishing about how they extracted
meaningful data from them; unfortunately, publication has little correlation
with whether something works or is reproducible), and I don't foresee this
being fixed in the current US healthcare ecosystem. The UK and other European
governments, with good data plus national healthcare systems, represent a
better chance.

If you don't believe me, I challenge you to point me to a single study
comparing 1000 metastatic sites vs. primary tumors in one cancer type, with
WGS. Given that metastasis causes > 90% of cancer deaths (ref: Weinberg's
cancer textbook), you'd think we'd have done this study by now.

A dx company has no hope of reimbursement for doing cancer whole genomes, and
does not receive patient data in sufficient detail (how did the patient fare
after treatment? what drugs were they given?) to undertake such a study.

------
pasbesoin
I would embrace big-data epidemiological studies (in the U.S.) if society
would legally and practically-irreversibly guarantee me that the results
would not be used to discriminate against me or against others: in health
care insurance and health care delivery, in employment, etc.

As it is, I fear any and every bit of data I provide the system may well be
used against me at a future point.

Right now, I'm going through some extensive testing, and I've decided to
provide further historical data in my possession for the sake of a better
analysis and diagnosis. However, that same data -- or rather, one datum
within it -- was used a few decades ago as the basis to deny my application
to purchase individual health care insurance.

With the ongoing attacks on the Affordable Care Act, I have the distinct
feeling of traveling back in time.

If we are going to have cooperative buy-in on big data, we are going to need
to ensure that the resulting benefits are shared across the population and are
not used to discriminate against subjects having data "on the left side of the
bell curve".

------
nonbel
Well, they need to make sure they are looking at the right type of data. I
know it is blasphemous, but why not include the aneuploidy/chromosomal data
as well? These error rates appear to be much higher than those for point
mutations, etc.:

 _" Nevertheless, the rate of chromosome missegregation in untreated RPE-1 and
HCT116 cells is  0.025% per chromosome and increases to 0.6 – 0.8% per
chromosome upon the induction of merotely through mitotic recovery from either
monastrol or nocodazole treatment ( Fig. 3 C ). These basal and induced rates
of chromosome missegregation are similar to those previously measured in
primary human fibroblasts ( Cimini et al., 1999 ). Assuming all chromosomes
behave equivalently, RPE-1 and HCT116 cells missegregate a chromosome every
100 cell divisions unless merotely is experimentally elevated, whereupon they
missegregate a chromosome every third cell division. Chromosome missegregation
rates in three aneuploid tumor cell lines with CIN range from  0.3 to  1.0%
per chromosome (Fig. 3 C ). Depending on the modal chromosome number in each
cell line, these cells missegregate a chromosome every cell division (Caco2),
every other cell division (MCF-7), or every fifth cell division (HT29)."_
[https://www.ncbi.nlm.nih.gov/pubmed/18283116](https://www.ncbi.nlm.nih.gov/pubmed/18283116)
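
To put numbers on that, a quick back-of-the-envelope in Python (assuming, as
the quote does, that all chromosomes behave equivalently, and taking 46
chromosomes as the modal number):

```python
# Expected cell divisions between missegregation events, given a
# per-chromosome missegregation rate (assumes all chromosomes behave
# equivalently and a modal chromosome number of 46).

def divisions_per_missegregation(rate_per_chromosome, n_chromosomes=46):
    per_division = rate_per_chromosome * n_chromosomes
    return 1.0 / per_division

# Basal rate in untreated RPE-1/HCT116 cells: ~0.025% per chromosome
print(divisions_per_missegregation(0.00025))  # ~87, i.e. roughly every 100 divisions

# Induced merotely: 0.6-0.8% per chromosome
print(divisions_per_missegregation(0.007))    # ~3.1, i.e. roughly every third division
```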

Many people claim that aneuploidy is found in nearly all cancer cells:

[https://www.ncbi.nlm.nih.gov/pubmed/17046232](https://www.ncbi.nlm.nih.gov/pubmed/17046232)

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4443636/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4443636/)

[https://www.ncbi.nlm.nih.gov/pubmed/10687734](https://www.ncbi.nlm.nih.gov/pubmed/10687734)

------
digitalzombie
This is my peer's thesis area. I may be doing work of this type or along
these lines.

I think the problem is that "big data", in the non-statistician sense, isn't
really available: you cannot get a large number of observations from
patients, either because of legal hurdles and/or the cost of recruiting
enough patients for experiments and trials. This applies to trials.

Even a phase III clinical trial has fewer than 200 observations. That is not
big data in the non-statistician world. In our world, big data means tons of
predictors. Medical data is usually high-dimensional: fewer observations but
tons of predictors, more columns than rows.
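
To make "more columns than rows" concrete, here is a minimal sketch (all
numbers hypothetical: n = 200 patients, p = 20,000 genomic features) of the
kind of regularized model people reach for when p >> n:

```python
# Minimal sketch of the p >> n regime: n = 200 patients (a large phase
# III trial), p = 20,000 genomic features. Synthetic data; a real
# pipeline would add batch correction, feature QC, etc.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 20_000
X = rng.normal(size=(n, p))     # stand-in for expression/variant features
y = rng.integers(0, 2, size=n)  # stand-in outcome (e.g. surgery vs. chemo)

# The L1 penalty shrinks most of the 20,000 coefficients to exactly
# zero, which is what makes fitting possible with only 200 rows.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
print(cross_val_score(model, X, y, cv=5).mean())
```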

Also, hospitals are wary of giving out data, either because of legal issues
or because they know it's valuable and don't want to share it. These two
issues are compounded by the fact that the infrastructure is not there to
share the data in one spot; it's fragmented across many databases with
different schemas and whatnot.

But my peers and I have theses involving cancer that use genetic data. It's
very promising: one recent thesis predicted, based on genetic data, whether a
patient should take the surgery route or the chemo route, and the model had
an 80% accuracy rate and a nice sensitivity rate (I forgot what it was). The
prediction target is survival.

I also saw another comment about genetic data being used against people. I
think this is FUD, because we have a law in place: GINA (the Genetic
Information Nondiscrimination Act).

------
Gatsky
Sequencing DNA in archival specimens, as Foundation does, has significant
limitations, and the ability to predict drug responses from this data alone
seems quite limited. For example, the most common alteration,
loss-of-function variants in TP53, is not at present druggable.

Additionally, one of the greatest revolutions in cancer therapy -
immunotherapy - does not have a great genomics-based predictive biomarker
(Foundation Medicine has created a surrogate test using mutation burden, but
it is not really good enough and has not been validated prospectively).
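
For what it's worth, mutation-burden surrogates of this kind boil down to
simple arithmetic over the panel's footprint; a hypothetical sketch (the
panel size and the 10 mut/Mb cutoff are illustrative round numbers, not
Foundation's actual assay parameters):

```python
# Illustrative tumor mutation burden (TMB) calculation. Panel size and
# cutoff below are hypothetical, not Foundation Medicine's parameters.

def tmb(n_somatic_mutations, panel_megabases):
    """Somatic mutations per megabase of sequenced panel territory."""
    return n_somatic_mutations / panel_megabases

burden = tmb(n_somatic_mutations=12, panel_megabases=1.1)
print(f"TMB = {burden:.1f} mut/Mb")
print("TMB-high" if burden >= 10 else "TMB-low")  # hypothetical cutoff
```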

So DNA sequencing is only part of the story. Some tumours in particular seem
driven by epigenetic changes, or structural variants that Foundation cannot
detect.

Sequencing has reached technical maturity and collided with big-data hype,
but in reality the benefit to patients will be incremental. There is at
present zero good-quality evidence that panel-based genomic screening of the
kind Foundation Medicine provides improves patient outcomes as a general
strategy.

------
aradhakrishnan
As whole-exome sequencing has now dipped below $1,000 [1], it really should
become a diagnostic assay of first resort. That said, further improvements
are needed, as it appears the majority of cancer-causing sequence variants
are found in non-coding regions of the genome [2], suggesting that greater
sequencing coverage is tremendously valuable.

[1]
[https://www.genome.gov/sequencingcosts/](https://www.genome.gov/sequencingcosts/)

[2]
[http://www.nature.com/nrg/journal/v17/n2/abs/nrg.2015.17.html](http://www.nature.com/nrg/journal/v17/n2/abs/nrg.2015.17.html)

~~~
chrisamiller
Cancer genomics researcher here. I agree wholeheartedly about getting
sequencing done if you have cancer - it's what I would do for myself or my
family. Two minor quibbles about your thoughts:

1) Exome sequencing is below $1000, but analyzing that data adds a non-
negligible cost. Still, even 2 or 3 grand is way cheaper than wasting time on
treatments that won't work. Whole-genome sequencing is even better (for a
little more cost) because of the extra information it adds about structural
variants and copy-number changes.

2) We're reasonably sure that most cancer-causing variants are in the coding
space, but there are undoubtedly some in non-coding regions (and classes of
large structural events, like duplications or deletions, that affect both).

It's the best time in the history of the world to have cancer, and it is only
getting better. Survival curves are slowly bending, and new classes of
treatments like immunotherapies are helping to bend them even more.

The bottom line is: if you get cancer, fight like hell to get your tumor
sequenced. Most insurers cover at least some kind of genomic test for cancer
these days.

~~~
JangoSteve
> 1) Exome sequencing is below $1000, but analyzing that data adds a non-
> negligible cost.

That is by far the most significant factor we've seen slowing adoption of
more widespread whole-exome or whole-genome sequencing. It takes less than a
day and costs less than $1,000 to sequence your exome (or even your whole
genome), but the backlog for analyzing the sequencing results at the labs can
be 9 months or more.

The really sad part is that there are patients who could be treated based on
analysis of their genome, but who aren't even considered for it if their
prognosis is shorter than the time it will take to get clinically actionable
results from the lab. This problem is actually what led us to create our
company, Genomenon, to make the analysis much faster and alleviate the
bioinformatics bottleneck at the sequencing labs.

~~~
bozoUser
> It takes less than a day and costs less than $1000 to sequence your exome
> (and even your whole genome), but the backlog for analysis of the sequencing
> results in labs can be 9 months or more.

Layman question: what is stopping the labs from quickly analyzing the genome?
Computational power, or too few labs doing this kind of work?

~~~
jfarlow
In general, it's a (computationally) hard problem to stitch a genome back
together. Even today, when you 'get your genome sequenced', you are not
getting one continuous read-through of your entire genome.

Imagine you want to reconstruct the data on two RAIDs that are mostly, but
importantly not exactly, mirrors of each other. Each RAID has 23 drives, each
with ~1Gb or so of data. Much of the data is not only mirrored between the
two RAIDs, but is also mirrored between the 23 drives, and much of that
mirrored data is 'off by one' in very important ways (both 'must' and 'must
not' scenarios). Further, some of the data contains very long sections of
highly repetitive data, and some of it is mechanically biased to be harder to
read than the rest.

You must now reconstruct those two RAIDs with single-bit accuracy, as a
single bit-flip in certain sections determines whether or not you get cancer.
The data you are given to do the reconstruction is a 200Gb single-column CSV
file in which each row holds 12 bytes of data.

Go.
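
For a sense of why this is hard even in toy form, here's a minimal greedy
overlap assembler; real genomes defeat exactly this approach with repeats,
sequencing errors, and two near-identical haplotypes:

```python
# Toy greedy assembler: repeatedly merge the pair of reads with the
# longest suffix/prefix overlap. Repeats and errors break this badly.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        k_best, i_best, j_best = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > k_best:
                    k_best, i_best, j_best = overlap(a, b), i, j
        if i_best is None:
            break  # no overlaps left; contigs stay fragmented
        merged = reads[i_best] + reads[j_best][k_best:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (i_best, j_best)] + [merged]
    return reads

print(assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]))
# -> ['ATTAGACCTGCCGGAATAC']
```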

~~~
bozoUser
Hmm, I don't think I fathom the complete complexity of the process, but with
so many powerful GPUs out there, is there a possibility of reconstruction in
a matter of days, if not hours?

~~~
virtuabhi
The steps in genomic analysis pipelines are not always embarrassingly
parallel; the degree of parallelization cannot be increased to an arbitrary
number. In addition, parallel executions can give slightly different results
than the serial output, so we need "safe" data-partitioning schemes and
rigorous error control.

If you are further interested in parallelization schemes for genomic
pipelines, please have a look at our paper on the strengths and limitations of
big data technology for genomic analysis (published last week) -
[https://people.cs.umass.edu/~aroy/sigmod17-roy.pdf](https://people.cs.umass.edu/~aroy/sigmod17-roy.pdf)
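
As a concrete (hypothetical) illustration of what "safe" partitioning has to
deal with: split a chromosome into fixed windows for parallel processing and
reads spanning a window boundary get mishandled, which is why pipelines use
overlapping ("padded") windows and then deduplicate calls made twice in the
padding:

```python
# Hypothetical sketch of padded genomic windows for parallel variant
# calling. Disjoint windows would split reads that span a boundary;
# the padding keeps them, at the cost of deduplicating the overlap.
from concurrent.futures import ProcessPoolExecutor

WINDOW = 1_000_000  # window size in bp (illustrative)
PAD = 1_000         # overlap so boundary-spanning reads land in both windows

def windows(chrom_len, window=WINDOW, pad=PAD):
    for start in range(0, chrom_len, window):
        yield max(0, start - pad), min(chrom_len, start + window + pad)

def call_variants(region):
    start, end = region
    # Placeholder for running a real caller on reads in [start, end).
    return {(pos, "A>T") for pos in range(start, end, 750_000)}

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = pool.map(call_variants, windows(10_000_000))
    calls = set().union(*results)  # dedupe calls from overlapping padding
    print(len(calls), "unique variant calls")
```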

------
joshenglish
It's crazy how many ailments are now being heavily data-mined.

~~~
j6m8
I think it's a little crazy how many are _not._

