

A Cryptologist Takes a Crack at Deciphering DNA’s Deep Secrets (2006) - bhickey
http://www.nytimes.com/2006/12/12/science/12prof.html?pagewanted=all

======
dekhn
I met Nick when I visited the Broad Institute a few years ago. Fun guy. He
told a shaggy mosquito joke that somehow ended up with the challenge of
extracting DNA from tiny mosquito testicles.

I question just how well-founded our belief is that evolution and development
can be understood just by treating DNA as an information science, and I say
that as a person who has been doing precisely that for 20+ years. I'm finding,
more and more, that there are variables that contribute a lot to evo devo that
have nothing to do with the specific sequence of coding regions, and much more
to do with the more cryptic (heh) regions. Although we keep scaling up DNA
sequencing and applying more and more information theory, our rate of
knowledge discovery appears to be flattening out.

I've realigned my research interests around cellular biology, specifically
high throughput imaging of induced pluripotent stem cells, because it appears
that those systems can provide us with a lot more direct experimental
evidence, and that evidence is more consistent with the behavior of systems,
than what can be extracted from DNA using the best statistical techniques.

------
blazespin
I enjoy a lot of the speculative fiction around junk DNA containing an ancient
message hidden by our progenitors. What a great way to talk to your
descendants (seeded by, say, bacteria lodged in comets) through time when they
are ready to hear what you have to say.

------
cakebrewery
I am very interested in the benefits of DNA studies and their application to
cryptography. I don't know much about biology, but it would be interesting to
know if there's a relation between code breaking through quantum computing and
DNA, and if quantum computing would ever help in figuring out DNA.

~~~
csirac2
This is a pretty old article, and well before this time Markov/Bayesian stuff
was already in full swing for bioinformatics, AFAIK. Biology draws from many
disciplines, and it's taken a lot of computer science to get to where it is
now. But I think that the infosec community generally is pretty underwhelming
in its application of statistical methods to big data. People seem to have
limited imaginations when it comes to pattern matching, and too quickly jump
to heuristics and "machine learning" without properly understanding the
limitations of the data and techniques they're working with.

    
    
        To paraphrase provocatively, 'machine learning is statistics
        minus any checking of models and assumptions'.
        ~Brian Ripley, 2004 [1]
    

We should really check out some of the stuff bioinformatics people have been
doing. That's not to say that _cryptography_ has more to give to biology,
quite the contrary. But I think the broader infosec community should try
harder when it comes to doing stats properly. I think there is a lot more in
common between bioinformatics and infosec analytics than most (on either side)
realize. Huge amounts of messy, un/semi-structured data being the common
theme.

In my (admittedly limited) experience, I find that wetlab biologists tend to
have a better understanding of the limitations of the experiments they're
setting up, in terms of interpreting the significance, limitations and meaning
of their results at least - than the computer science types playing in
biology. But good stats is the one universal skill that all good scientists
should master.

I spent a few years where part of my work was helping evolutionary
geneticists manage and make sense of their data. Enormous amounts of data. So
much raw data that the very large scientific institution I'm sure you've heard
of couldn't accommodate it in their normal data repositories (at least not if
every researcher started doing so). Seeing multi-million dollar research
projects conclude and then get archived to a handful of redundant sets of HDDs
curated by teams and departments that no longer exist is unnerving
(supposedly things are better these days).

In any case, it's a fascinating area to work in for budding computer
scientists. There are many problems to sink your teeth into, even basic stuff
like repeatable analyses on research projects running for more than 10-15
years. "Yes, there are faster/more robust/more appropriate MCMC (Markov Chain
Monte Carlo) solutions out there but we don't feel like re-running a decade's
worth of work to ensure they're comparable". But I digress.

It amused me that each research team seemed to be either computer-science
heavy (and ignorant about the quality or nature or "ground truth" of their
data and the possibly embarrassing impacts on the signals they thought they
were seeing), almost wilfully ignorant or at least dismissive of "old-school"
biology methods and data...

... or were very wet-lab savvy and grounded in the raw biology side of things
but limited their research questions according to the types of analyses they
were comfortable doing themselves, using well-established methods, tools and
software that would run on their biggest $20k PC under someone's desk.
Although that was seriously changing by 2013 when I left (partly due to nice
big web-UIs in front of HPC clusters, partly due to better mixing of team
capabilities).

Now, it's easy to dismiss the latter group as just being not "up to date" or
unskilled in "big data" (recent biology grad programmes are hopefully
improving). But what surprised me were the obvious opportunities that
highly CS-capable teams (inside and outside of the organisation I worked for)
seemed to be deliberately staying away from. On more than one occasion I asked
some of these people why they were ignoring the rich, machine-readable,
highly-curated datasets from more traditional biology science (e.g.
taxonomy/ID/phenotype/ecology info).

The answers varied: they weren't even aware this data existed in
machine-readable form; they seemed unfamiliar with the very biology discipline
they were playing in (they were more obsessed with algorithms); but a
disturbing trend seemed to be a generic distrust of manually curated data. I
guess because there are no repeatable algorithms to reproduce human curators :-)

Now, I'm not saying everyone should go out and use that data as a source of
truth, but it should at least be a reference to sanity-check or compare to.
And imagine if you actually did augment your sequence data with this stuff -
one (admittedly naive) thought was that there might be a fun way to look for
at least some phenotype influences in a given genome, "for free", without
having to start new wetlab work - i.e. just repurpose existing data.
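
To make the "sanity-check" idea a bit more concrete, here is a minimal sketch
in Python. Everything in it is hypothetical: the sample IDs, the species
names, and the `flag_mismatches` helper are made up for illustration and do
not come from any real dataset or tool. It only shows the shape of the
comparison, i.e. treating the curated record as a reference to check against
rather than as ground truth.

```python
def flag_mismatches(sequence_calls, curated_records):
    """Return (sample_id, sequence_label, curated_label) for every sample
    whose sequence-derived species label disagrees with the curated one.

    Samples with no curated record are skipped rather than flagged.
    """
    flagged = []
    for sample_id in sorted(sequence_calls):
        called = sequence_calls[sample_id]
        expected = curated_records.get(sample_id)
        if expected is None:
            continue  # nothing to compare against
        if called != expected:
            flagged.append((sample_id, called, expected))
    return flagged


# Hypothetical toy data: species assigned from sequence data per sample.
SEQUENCE_CALLS = {
    "s1": "Aedes aegypti",
    "s2": "Homo sapiens",       # e.g. a contaminated sample
    "s3": "Anopheles gambiae",
    "s4": "Culex pipiens",      # no curated record exists for this one
}

# Hypothetical toy data: what a manually curated collection says.
CURATED_RECORDS = {
    "s1": "Aedes aegypti",
    "s2": "Aedes albopictus",
    "s3": "Anopheles gambiae",
}

if __name__ == "__main__":
    for sample_id, called, expected in flag_mismatches(SEQUENCE_CALLS, CURATED_RECORDS):
        print(f"{sample_id}: sequence says {called!r}, curated record says {expected!r}")
```

A flagged sample doesn't tell you which side is wrong, only that someone
should look at it before trusting either.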

When countering the aversion to manually curated data, I tried to point out
that the very "garbage in" being dealt with - the genome sequences, spatial
records, environmental layers etc. - all involve an element of human curation.
I mean, that's how we end up with nearly 20% of non-human genomes containing
human DNA [2] in the first place. And that's before we even consider whether
stuff in the "good" 80% is even representative of the species group(s) of
interest - you wouldn't believe how easy it is for people in the field (or the
lab!) to get species identification completely and utterly wrong. Or scrape up
DNA samples of some infection or parasite instead of the actual thing
itself...

... and yet even with all these imperfections in the molecular data, we find
that some of the old-school morphological (phenotype-driven) taxonomies are
still 80%+ unchanged (hazy on the numbers now) when reconstructed with purely
DNA-derived phylogenies. And that's on groups where we're still fighting over
what the best DNA markers are to build those molecular phylogenies in the
first place! A lot of people don't realize that the same DNA evidence can give
different pictures of the evolutionary tree, depending on what genes you're
picking on...
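
A toy illustration of that last point, as a minimal Python sketch. The
sequences below are invented, the three taxa "A", "B" and "C" are
placeholders, and plain Hamming distance on a closest pair is a deliberately
crude stand-in for real phylogenetic inference. With only three taxa, the
closest pair is the grouping a simple distance method would pick, and the two
made-up "genes" disagree about it.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_pair(alignment):
    """Return the pair of taxon names with the smallest Hamming distance."""
    return min(
        combinations(sorted(alignment), 2),
        key=lambda pair: hamming(alignment[pair[0]], alignment[pair[1]]),
    )


# Invented alignments for two hypothetical marker genes.
GENE_1 = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACTT"}
GENE_2 = {"A": "GGCATTCA", "B": "GACTTTGA", "C": "GGCATTCC"}

if __name__ == "__main__":
    print("gene 1 groups:", closest_pair(GENE_1))
    print("gene 2 groups:", closest_pair(GENE_2))
```

Same three organisms, two markers, two different answers about who is most
closely related to whom.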

Apologies for the over-sized rant, it's been a while since I had to think of
this stuff.

[1] [http://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning](http://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning)

[2] [http://www.nytimes.com/2011/02/17/science/17genome.html](http://www.nytimes.com/2011/02/17/science/17genome.html)

~~~
danieltillett
This certainly was a rant :)

You do bring up a very good point, which is the difficulty that biologists and
CS people have talking to each other. As someone with feet in both camps, I
constantly see both sides talking past each other and completely missing the
basic assumptions each side makes about the same data. Biologists generate the
wrong data because they don’t know what the CS people need, and the CS people
don’t know what to ask the biologists to generate. It does make a lot of
low-hanging fruit available if you can understand both fields.

