
23andMe Pulls Off Massive Crowdsourced Depression Study - nkurz
https://www.technologyreview.com/s/602052/23andme-pulls-off-massive-crowdsourced-depression-study/
======
Mtinie
I'm very conflicted about trying a service like the one that 23andMe offers.
Part of me would love to see what comes back, the other (currently louder)
part of my brain is worried about throwing away any ideas of future genetic
privacy if I do.

The worst part is, I'm not sure which is the more rational desire. Is my
privacy concern undue paranoia?

~~~
maga
> Is my privacy concern undue paranoia?

And are there genes responsible for it?

On a more serious note, I remember hearing, a couple of years ago, about
microchips the size of USB sticks that were supposed to sequence a genome in
hours and cost "only" a few grand. I'm surprised they haven't flooded the
market yet, considering the privacy concerns around services like 23andMe.

Sure, those services do more than just sequencing, but for those who would
just like to check for a few hereditary diseases running in their family while
sticking to their tinfoil hat, that would be a bargain. In my case, though, my
body is so tough that I'm afraid if my DNA is revealed it may be used for
making a clone army one day.

~~~
yread
Sequencing is the easy part. Going from an unordered set of 100-character
reads (with a ~10% error rate) to a list of somatic variants (mutations,
insertions, and deletions that have a biological effect) is the difficult and
costly part.

~~~
maga
> Going from an unordered set of 100-character reads (with 10% error rate) to
> a list of somatic variants (mutations, insertions and deletions that have a
> biological effect) is the difficult and costly part

Isn't it a software/data science problem? And if so, what makes it difficult?
Computational complexity, the availability of reference data, or both?

~~~
dekhn
If you assume the hardware is fixed, then yes, it's a data problem.

Several things make it difficult. First: the genome itself has many
repetitive regions whose lengths are greater than 100 characters. If your
sequencing technology produces reads of 100 characters or fewer, you don't
have enough information to place those reads at their proper location in the
genome.

Next: there are regions that aren't exact duplicates of each other, but are
very similar. If you have a 100-character read, and you know it probably
contains errors (a necessary consequence of the sequencing technology), it can
be hard to assign it precisely to a single region, as the read will match
numerous regions equally well. A huge amount of effort currently goes into
mapping these reads to the most appropriate location.
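That ambiguity can be sketched in a few lines: slide a read along a reference
that contains two near-identical regions and count mismatches. All names and
sequences below are invented for illustration; real mappers use indexed,
gapped alignment rather than this brute-force scan.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_placements(reference, read):
    """Brute-force scan: try every placement of the read along the
    reference and return all positions tied for fewest mismatches."""
    scores = {i: hamming(reference[i:i + len(read)], read)
              for i in range(len(reference) - len(read) + 1)}
    best = min(scores.values())
    return [i for i, s in scores.items() if s == best], best

# Two paralogous regions differing by a single base, joined by filler.
region_a = "ACGTACGTTT"
region_b = "ACGTACCTTT"
reference = region_a + "GGGGG" + region_b

# A read with one sequencing error at the very base that distinguishes
# the regions: it now matches BOTH equally well (one mismatch each),
# so its true origin cannot be recovered.
read = "ACGTACATTT"
print(best_placements(reference, read))  # ([0, 15], 1): a tie
```

With error-free reads the tie disappears, but at a 10% per-base error rate
ties like this are routine, which is why real mappers report a mapping-quality
score rather than a single certain location.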

Next, because of the high cost of doing the assembly properly, heuristics are
used. Often, the heuristics are based on greedy algorithms that try to tile
overlapping reads to extend them into longer segments. However, due to the
read error rate, you might accidentally tile two unrelated reads; this will
prevent you from finding the true, optimal solution. To correct for this
ambiguity, you typically have to sample a large combinatorial set of possible
solutions to find the best-ranking one. This is a major area of research.
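A toy version of that greedy tiling strategy, assuming exact suffix/prefix
overlaps and distinct reads (real assemblers must tolerate mismatches, which
is exactly how two unrelated reads can get tiled together); the reads below
are made up:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`,
    ignoring overlaps shorter than `min_len`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of (distinct) reads with the longest
    overlap until nothing overlaps any more."""
    reads = list(reads)
    while len(reads) > 1:
        n, a, b = max(((overlap(x, y), x, y)
                       for x in reads for y in reads if x != y),
                      key=lambda t: t[0])
        if n == 0:
            break  # no remaining overlap; stop merging
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])  # tile b onto the end of a
    return reads

print(greedy_assemble(["ACGTAC", "GTACGG", "ACGGTT"]))
# ['ACGTACGGTT']
```

Being greedy, it commits to the locally best merge at each step; with repeats
or erroneous overlaps that local choice can lock in a globally wrong layout,
which is the failure mode described above.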

Mapping reads to reference data is often done (instead of wholesale assembly),
but it suffers from the same problems. The reference is highly biased: it was
based on a small number of individuals. When you try to map reads from an
individual who is genetically distinct from the reference individual, there
will be large regions that don't map (for example, I believe the reference was
a European-American; if you try to map African genomes to that reference,
you'll find entire regions present in one that are missing from the other).

I don't know a lot of CS, but there is probably a term for taking a bunch of
reads and mapping them in a way that maximizes the probability of the mapped
reads representing the true solution (i.e., what is actually in the person's
genome). Most people come up with heuristics that (IMHO) have serious
deficiencies. When I ran Exacycle at Google we used it to do a "real" assembly
(a full n^2 comparison of all read pairs) and we found it was far more
accurate than previously assembled genomes. Subsequent to that, Gene Myers
(who designed shotgun assemblers) used these results to find a heuristic that
produced assemblies that were nearly as good but at a much lower cost. I'm
personally still skeptical.

If sequencers produced 10 kb reads with a 0.01% error rate, I'd be happy.

~~~
maga
Thanks for the detailed reply! This really puts things into perspective.

From the NLP side of the data-munging aisle, these sound like much more
complicated variations on language-processing problems.

Comparing reads to each other is like the approximate string matching (or
fuzzy matching) we use to account for spelling errors in words, but genome
chunks are longer and you don't have dictionaries to check against.
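That analogy can be made concrete: the classic Levenshtein edit distance below
is the same dynamic program that, with biology-specific scoring, becomes
sequence alignment. This is a generic textbook sketch, not code from any
bioinformatics toolkit:

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions turning `a` into `b` (row-by-row dynamic programming
    over an implicit (len(a)+1) x (len(b)+1) table)."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

print(edit_distance("ACGTTA", "ACGTA"))   # 1: one deletion
print(edit_distance("kitten", "sitting")) # 3: the textbook example
```

The difference in the genomic setting is scale: instead of one word against a
dictionary entry, it's hundreds of millions of error-laden reads against a
multi-billion-character reference, so exact DP like this is only run on
candidate regions found by indexing.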

And finding the best arrangement of those reads is akin to language modeling,
where we assign probabilities to word sequences which can later be used to
predict the most likely sequence. In the case of genomes, though, with so
little data, no standard "words", and the error rates in reads, it's like
trying to reassemble a shredded book by an illiterate author, with only a few
other books by illiterate authors as reference.

~~~
dekhn
My introduction to DNA sequence analysis was "The Linguistics of DNA" by David
Searls:
[https://www.jstor.org/stable/29774782?seq=1#page_scan_tab_contents](https://www.jstor.org/stable/29774782?seq=1#page_scan_tab_contents)

It applies the Chomsky hierarchy to sequence analysis. There is in fact a ton
of interesting literature around graphical models like HMMs; see
[https://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713](https://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713)
for more details.

------
erdevs
I believe this is the underlying paper:
[http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3623.html](http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3623.html)

GWA studies in general should be treated with great caution. The way they
generally work is based on a simple p-value test of association between an
outcome (in this case, depression) and all genes, based on SNPs. There is a
high degree of mere chance association and false positives. Most GWA studies
leave a lot to be desired.

This one looks more solid than most. The p-value threshold appears to be 10^-5
and they ran a replication data set as well. Many GWASs report much less
stringent p-values, and many don't run replications.
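The reason thresholds like these matter is plain multiple-testing arithmetic.
As a back-of-the-envelope sketch (the one-million figure is the conventional
rough count of independent common-SNP tests, not a number from this study):

```python
# Expected number of purely-by-chance "hits" at a given p-value
# threshold, if roughly a million independent SNP tests are run.
n_tests = 1_000_000

for alpha in (1e-5, 5e-8):
    print(f"threshold {alpha:g}: ~{n_tests * alpha:g} chance hits")
# At 1e-5, ~10 spurious associations are expected genome-wide even with
# no real signal; 5e-8 is just Bonferroni: 0.05 / 1_000_000.
```

This is why a per-SNP p-value that would look overwhelming in a single-test
setting can still be consistent with pure noise in a genome-wide scan.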

One interesting aside about GWA studies: this may have changed recently, but I
believe any GWA study involving NIH funding used to be required to post its
data to a public, freely available database. That always seemed like a good
practice to me, and one that should be emulated in other sorts of
government-funded studies. I wonder if this study is subject to that
requirement.

~~~
jessriedel
> There is a high degree of mere chance association and false positives.

Are you really suggesting that most GWAS studies don't calculate genome-wide
significance? This is wrong. If that's not what you're suggesting, I don't
know what you're saying.

~~~
erdevs
You should try reading up on the subject before being blithely dismissive.

False discovery rates, false positive rates, and family-wise error rates, and
how best to control for them, are ongoing areas of interest and research in
GWAS. There are calls for p-value requirements in the 10^-8 range to help
avoid this. There are calls to address stratification (which can affect both
type I and type II errors). There has been a _lot_ of research and debate on
this topic over the past several years. Who are you, exactly, to dismiss all
of this scientific inquiry? It's great if you have expertise in another field,
but it seems odd to dismiss scientific questions and ongoing research in this
particular one.
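As one concrete instance of the error-rate controls mentioned above, here is a
minimal sketch of the Benjamini-Hochberg step-up procedure for bounding the
false discovery rate (the p-values are invented for illustration):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Indices of hypotheses rejected at FDR level q: sort the
    p-values, find the largest rank k with p_(k) <= (k/m) * q, and
    reject the k smallest."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, 1):
        if pvalues[i] <= rank / m * q:
            k_max = rank  # step-up: remember the LARGEST passing rank
    return sorted(order[:k_max])

pvals = [0.001, 0.010, 0.021, 0.024, 0.11, 0.27, 0.5, 0.9]
print(benjamini_hochberg(pvals))  # [0, 1, 2, 3]
```

Note the step-up behavior: index 2 is rejected even though 0.021 exceeds its
own per-rank threshold (3/8 × 0.05), because a later rank still passes. FDR
control is deliberately less conservative than family-wise (Bonferroni-style)
control, and which one is appropriate for GWAS is part of the debate.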

You'll find plenty of information if you actually seek it out rather than
simply making a knee-jerk, snarky comment, but here is one example article
which articulates some of the issues that have been under consideration in
recent years:
[http://m.ije.oxfordjournals.org/content/41/1/273.full](http://m.ije.oxfordjournals.org/content/41/1/273.full)

Note, there, how the level of significance is discussed. A p-value of 10^-7 to
10^-8 is suggested (as compared to this study's 10^-5 level of significance...
and, believe me, many GWA studies have been published with much less
significant p-values).

This is an ongoing and active area of discussion in the field. I'm not an
expert, but some of my colleagues are, and it's a topic they sometimes discuss
and brainstorm over lunch, etc.

Actually, even the _Wikipedia page_ on GWAS mentions some of this inquiry and
debate, as well as the erroneous publications that have plagued this nascent
field. It will all be worked out over time, and great discussions are
happening. Vast improvement in processes and standards has been made over the
past few years in particular. But we don't move the ball forward by dismissing
questions or incorrectly assuming all must simply be right and well.

~~~
jessriedel
My comment was not intended to have the barbs on it that appear to have been
read into it. ("Who are you exactly to dismiss all of this scientific
inquiry?" "...knee-jerk, snarky comment...") I'm sorry that it came off that
way. In any case, I still don't know what your concrete claim is.

> here is one example article which articulates some of the issues that have
> been under consideration in recent years:
> [http://m.ije.oxfordjournals.org/content/41/1/273.full](http://m.ije.oxfordjournals.org/content/41/1/273.full)

From the abstract of that article: "Currently, associations of common variants
reaching P ≤ 5 × 10^-8 are considered replicated. However, there is some
ambiguity about the most suitable threshold for claiming genome-wide
significance." So, people do calculate genome-wide significance, and there is
some ambiguity over where exactly that line should be. This is in line with my
understanding of the situation.

An exactly analogous statement can be made within particle physics ("there is
some ambiguity about the most suitable threshold for claiming significance"),
where it is often called the "look-elsewhere effect". But this prudent caution
doesn't cause people to say FUD like "particle physics studies should be
treated with great caution" or "Most particle physics studies leave a lot to
be desired". _Such statements may absolutely in fact be true for GWAS studies,
but you didn't give any good reasons for it._

Indeed the very article you link to ends this way: "Conclusion: A substantial
proportion, but not all, of the associations with borderline genome-wide
significance represent replicable, possibly genuine associations. Our
empirical evaluation suggests a possible relaxation in the current GWS
threshold."

How should one square that conclusion with your original comment?

~~~
erdevs
Thanks for the reply, and my turn to apologize for misinterpreting your tone.

You asked initially what exactly I was saying, and here again how to square
what I'm saying with my original comment. So, let me try to explain, and I
would hope we're not on different sides of this as I think what I'm saying is
reasonable, given the context.

I originally said: "GWA studies in general should be treated with great
caution. The way they generally work is based on a simple p-value test of
association between an outcome (in this case, depression) and all genes, based
on SNPs. There is a high degree of mere chance association and false
positives. Most GWA studies leave a lot to be desired."

For context: genome-association studies have had a history of being blown out
of proportion in the press. And often for outcomes which greatly affect
people's lives. Depression is one such issue and it'd be a shame if people
were led into thinking there is _necessarily_ a great breakthrough here in
understanding possible gene-linkages to depression outcomes. It'd also be a
shame if the result was ignored. I tried to provide some praise to this paper
for being fairly rigorous, but also note that GWAS studies should be treated
with caution generally.

Why should GWAS studies be treated with caution generally? (Besides that
results of _any_ study should be treated with some degree of caution.)

Well, firstly, GWAS is a fairly nascent field. Unlike physics or even particle
physics, it hasn't had that much time to mature. This is doubly so when
applying GWAS to mental health. I'm sure the paper covers some of these risks
(or at least I hope it does)... but relying on self-reporting introduces
potential selection bias into the population sample, as does studying people
who sought help rather than the general population. From a quick read, the
paper appears to rely on self-reports and to analyze only people who'd been
diagnosed with major depression (meaning they'd sought help). We should be
cautious about over-generalizing based on this.

Secondly, GWAS has had a rough history of overstating results and misapplying
analyses. It's _much_ better today than it was even, say, 5 years ago.
Ioannidis and others made some heroic efforts to convince the field to clean
up its act, starting roughly 7-8 years ago.

Thirdly, there is a historical pattern of results in this field being
overhyped in the press.

Finally, there are _active_ and sometimes heated debates in the field about
how best to do GWAS. These are intensifying as high-throughput, low-cost
full-genome sequencing comes online to a greater and greater degree and SNP-
based data sets fall by the wayside. Some question taking a frequentist
approach at all in the face of such a huge degree of multiple testing. Others
call for much, much higher requirements for holdout data sets, cross-
validation, and replication before a study is published or considered final,
especially when dealing with things like mental health (and the likely
applications to the field of pharmacology).

This is serious stuff that could end up affecting people's mental health
treatments and lives, so caution is warranted. Especially given the field's
relative nascence, self-admitted history of publishing low-quality results,
the rapidly changing techniques, and the fact that there are ongoing debates
within the field of how best to do GWAS analysis and how to effectively
replicate results.

~~~
erdevs
Here is an example of the sort of press reaction I'm referring to (this isn't
a particularly egregious example, but it demonstrates the point):
[https://www.washingtonpost.com/news/to-your-health/wp/2016/08/01/large-dna-study-using-23andme-data-finds-15-sites-linked-to-depression/](https://www.washingtonpost.com/news/to-your-health/wp/2016/08/01/large-dna-study-using-23andme-data-finds-15-sites-linked-to-depression/)

Look at the language there. Scientists " _pinpointed_ 15 locations in our DNA
that _are_ associated with depression..." [emphases added]. There is no sense
of nuance or any caution in conclusion here. Even reading the whole article,
which perhaps most people won't even bother with, you don't get much sense
that there's any degree of uncertainty here. There is no indication that the
field at large is still wrestling with how best to analyze SNP studies at all,
despite publishing them "practically every week". There is little
qualification around the data set here (it only briefly mentions that
23andme's data is based on saliva, not blood or other cells... which may be
problematic. But also it should be noted that disease outcomes in this study
were _self-reported_ and from a _self-selected_ subset of the general
population who sought professional help... and there are many, many other
nuances to consider). Instead, we get the fairly flat-out impression that this
is a definitive discovery and, not only that, but it is of huge significance
in the field as well.

It may well be. But caution is warranted until time and replication hopefully
do their work. Unfortunately, this is a pattern in GWAS studies and their
public relations. And while the underlying methodology in the field has
improved tremendously over the past 5 years, there is still a lot of debate
within the field about how much more improvement needs to be made (on data
sources, on analytic techniques, on review standards), and how to adapt to
forthcoming changes in available study data.

I think _this_ study appears to be more robust than many others. So, I'm not
picking on it. In fact I gave it particular praise for being relatively robust
compared to some other GWASs. But I do think generally people should be more
cautious about interpreting GWA studies or over-generalizing them.

------
argonaut
Most people in this thread are missing the fact that use of your genetic info
for research purposes by 23andme is _opt-in_. And their privacy statements
indicate the same would hold for other uses.

~~~
batbomb
Just because something is opt-in, that doesn't mean it is necessarily ethical.

~~~
merpnderp
They clearly state, in short, easy-to-read words, exactly what is going to be
done with your data. They also offer, at any time, to destroy the remains of
your existing sample and disassociate the data from your account.

They may be unethical, but I have seen zero evidence of it.

~~~
posterboy
That presumes the usage of the data can be understood exactly.

------
lips
This article has zero detail. It's like an empty box of cookies.

------
shaqbert
23andMe's biz model in a nutshell:

\- you give us your DNA (spit) and consent to do whatever we goddamn like
with it

\- we give you a few dozen mostly uninteresting genetic markers' worth of
analysis (wet earwax, anyone?)

With genetically filtered pharma studies in the vicinity, 23andMe sits on a
goldmine.

~~~
jonlucc
As someone who works in pre-clinical pharma research, thank god someone is
doing this. Projects get killed because they would lead to a drug that would
require very expensive clinical trials. Often, this is because the patient
population is hard to identify before they get very sick.

Also, I think 23andMe should be _very_ clear that this is what they do, but it
isn't inherently problematic. People shouldn't ever be surprised to be asked
to join a clinical trial based on the information they've stored at 23andMe.

