
Game of Genomes - dsr_
https://www.statnews.com/feature/game-of-genomes/season-one/
======
biomcgary
I strongly recommend any biology articles by Carl Zimmer. He is very good at
providing an approachable perspective for non-scientists, even on complex
topics. In my experience, he is one of the rare journalists covering science
who do not make scientists cringe.

~~~
eggie
I did cringe. This is my field. (I know virtually all of the people he's been
talking with.)

The descriptions are accurate, but murky. One huge missing piece is a
description of the structure of the genome. There is one mention of the word
chromosome. The words haplotype and allele never make an appearance. There is
no mention of the fact that we are diploid, nor that one copy of our
chromosomes comes from each of our parents. Nor is there any hint as to the
way that we share our genomes.

Maybe I'm allergic to assertions of novelty, but I don't enjoy his claim that
he's the first journalist to get access to their "raw" genomic data. That's
just being fancy. Is it necessary to say that? In any case, what's "raw"
changes every few years, so someone else will soon get the chance to say the
same thing. (I gripe, but I have to admit that it's nerdy/cool to see someone
writing so much about BAM files :) !)

I'm still waiting for the first journalist to mention genome structure in a
popular article on genomics. As badly as we need to understand what's going on
in the genome, the public, and particularly professionals whose work brings
them to articles like this, need to develop an intuition about how genomes
work.

~~~
noiv
All the points you say are missing I learned at secondary school decades ago.
What I did not learn is how people might judge the overwhelming complexity of
their own genome in terms of fitness. I found his views appropriate, if not
foreshadowing.

~~~
lunula
You did, but if others did they promptly forgot their lessons. These things
need to be repeated many times for people to appreciate them. This is an
opportunity to do so, and the author apparently found it unimportant.

~~~
noiv
Well, what do you think he found important?

------
arca_vorago
I was a sysadmin at a genetics company for a while, and learned quite a bit
while there. This is an accurate article on the current state of things.

I have two main takeaways for anyone curious about this:

1) Sequencing costs are getting lower, yes, but computation and data
complexity keep growing as the scientists want to analyse more and more,
especially those smart enough to realize the microbiome is really where it's
at for a holistic approach, which means massive amounts of data. We are
talking about petabytes over time, with 200 GB+ of data generated per day from
just a few sequencers.

2) The end goal, I think, is that eventually there will be a sequencer in
every hospital, and it will start catching tons of diseases before the patient
even experiences any effects. It's going to be great for healthcare, not just
practically but financially: healthcare has tons of money floating around.
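The volume figure in point 1 is easy to sanity-check with back-of-envelope
arithmetic (a sketch using the commenter's own 200 GB/day number; decimal
units assumed):

```python
# How long does a small site generating 200 GB/day take to reach a petabyte?
GB_PER_DAY = 200
PB_IN_GB = 1_000_000  # 1 PB = 10^6 GB in decimal units

days_to_petabyte = PB_IN_GB / GB_PER_DAY
print(days_to_petabyte)           # 5000.0 days
print(days_to_petabyte / 365.25)  # ~13.7 years for a single small site
```

So "petabytes over time" is a claim about many sites and growing throughput,
not one lab's sequencers.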

I think the really interesting developments will be between the sequencing
machine manufacturers. Illumina, Roche 454, and Ion Torrent are all doing
great things, and keeping the competition in innovation mode is great for the
industry.

I can't wait to see what comes of it all, plus my non-compete is finally over.
There is a huge gap to be bridged between IT and the scientists, and I don't
think very many companies have done it very well.

~~~
danieltillett
Yes, the problem is that Moore's law is not keeping up with the sequencing
data deluge. The cost limitations of genomics are now caused by the data
analysis, not the data generation.

Personally I think the unusual genetic history of humans is going to make it
really difficult to reach the full potential of genomics. The recent
population explosion and changes in selective pressures will make it
impossible to really tease out the true linkages between human genes and
phenotypes like disease risk.

~~~
dekhn
There really isn't any crisis in the storage and analysis of sequencing data.
The data isn't really that large: the sum total of all sequencer data is in
the exabytes per year, but most of that data isn't retained, nor does it need
to be retained after processing.

I founded the Google Cloud Genomics project, and at the time people were
citing some very scary stats about the growth of sequencing data. After doing
some analysis, it turned out most people were concerned about retaining the
original raw data files from the sequencer, usually saying that they wanted to
go back and reanalyze the data when the algorithms got better. But nobody
wanted to pay the (commodity) storage costs to keep all that data around.

Well, the first thing to realize is that the quality scores (which represent
about 75% of the sequencer data post-compression) were overly precise. Simple
quantization techniques shrink the value space of quality scores tremendously
with no relevant effect on the resulting variant calls.
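For a feel of what such quantization looks like, here is a minimal Python
sketch. The bin edges are loosely modeled on Illumina's published 8-bin
scheme and are an illustrative assumption, not the exact scheme used on
Google's pipeline:

```python
# Quality-score quantization: map each Phred score onto its bin's
# representative value. Long runs of a few distinct values compress
# far better than 40+ distinct values.
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 60, 40)]

def quantize(q: int) -> int:
    """Collapse a Phred quality score to its bin's representative."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"quality {q} out of range")

# Only 8 distinct values survive across the whole 0-40 range:
print(sorted({quantize(q) for q in range(41)}))
# [0, 6, 15, 22, 27, 33, 37, 40]
```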

The second thing to realize is that keeping tons of old BAM files or FASTQ
files around is typically wasteful. Few people go back and reanalyze, and even
if they do, they get only marginally better results; in those cases, it's
better not to spend the money to bank exabytes of raw data.

Next, once people get away from thinking they need to store exabytes of raw
reads, the costs shift from "$X/year to store a bunch of archival data, plus a
smaller amount for data analysis" to "nearly all the money is spent analyzing
a relatively small (tens of petabytes) data set". And that shifts things from
storage to CPUs. I can assure you, Moore's law is still quite a bit ahead of
sequencing data analysis: the sum of all genomic data processing is just a
tiny fraction of what it takes to run a medium-sized Google service. Most
processing algorithms are fairly naive and unoptimized, but when I built
processing algorithms that ran on Google's BigQuery, I got results in minutes.

As to your second point, I honestly don't know what the outcome of large-scale
sequencing and disease correlation analysis will be. It seems that, as was
suggested quite some time ago, sequence data is very limited in its ability to
produce actionable medical data for human populations, and will continue to be
an ancillary tool, rather than a silver bullet, for the foreseeable future.

------
rgejman
Generally I thought this was a great piece. I was a little annoyed that he
kept referring to "BAM" files as holding his genomic information. e.g.: "He
wanted to get his own BAM file and study it."

BAM is just a format that describes how short chunks of DNA are aligned
against a reference sequence. It would be a BAM file whether you are talking
about someone's genome or an experiment whose readout was a sequencing assay.

To hear how off that sounds, consider the following: "He wanted to get his own
m4a file to listen to it." "She wanted to get her own xlsx file to calculate
her expenses"

Perhaps it's a small gripe, but why not just say "raw data" or "genomic
sequence", either of which would be more accurate and not cause some of us to
cringe!

~~~
JangoSteve
Also, the BAM file isn't really the raw file that comes out of the sequencer;
that would be FASTA or FASTQ. The BAM file is, as you stated, what you get
after comparing/aligning the reads to another sequence such as the human
reference genome.
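To make the distinction concrete: FASTQ is the near-raw, four-lines-per-read
text format, while a BAM file only exists after alignment. A minimal
illustrative parser for FASTQ's record structure (a sketch, not a production
reader; the function name is ours):

```python
import io

def fastq_records(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ stream.
    FASTQ stores each read as exactly four lines:
    @name / sequence / '+' separator / per-base quality string."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # end of stream
        name, seq, plus, qual = (l.rstrip("\n") for l in lines)
        if not (name.startswith("@") and plus.startswith("+")):
            raise ValueError("malformed FASTQ record")
        yield name[1:], seq, qual

raw = "@read1\nACGTACGT\n+\nIIIIIIII\n"
print(list(fastq_records(io.StringIO(raw))))
# [('read1', 'ACGTACGT', 'IIIIIIII')]
```

Note there is no reference coordinate anywhere in the record; that information
only appears once an aligner has produced a SAM/BAM file.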

~~~
a_bonobo
Some people have been pushing to use BAM files for unaligned reads too. The
upsides are that (a) the file is compressed, and (b) you can tell when it is
incomplete because the BAM end-of-file marker is missing; you don't get that
with FASTQ. Most notably, the new versions of the PacBio software produce BAM
files with the sequencing reads instead of FASTQ (or h5, as previous versions
did).
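That completeness check is mechanical: per the SAM/BAM specification, a
well-formed BAM file ends with a fixed 28-byte empty BGZF block, so truncation
is detectable by inspecting the last 28 bytes. A sketch (the function name is
ours; verify the marker bytes against the spec before relying on this):

```python
# The 28-byte end-of-file marker that terminates a complete BAM file:
# an empty BGZF block, as defined in the SAM/BAM specification.
BAM_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def bam_is_complete(path):
    """True if the file ends with the BGZF EOF block. A truncated
    download or copy (the failure mode described above) will lack it."""
    with open(path, "rb") as fh:
        fh.seek(-len(BAM_EOF), 2)  # whence=2: offset from end of file
        return fh.read() == BAM_EOF
```

This is essentially what `samtools quickcheck` does, alongside a header check.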

------
breck
I'm enjoying the article, but the dancing dollar signs and other animations in
the gutters are killing me. They're not only distracting, but slowing down my
machine. Why oh why? And sadly the page also breaks Safari Reader View, with
only the first page visible in that view.

~~~
noiv
Your reply is equivalent to a remark concerned about the font size of an
unreadable lorem ipsum paragraph on a site mostly presenting advanced CSS
animations, viewed on a C64. But now I'm wondering whether a genetic
disposition exists capable of steering focus, so there is that.

