

How Perl Saved the Human Genome Project (1996) - bsima
http://www.bioperl.org/wiki/How_Perl_saved_human_genome

======
codex
Perl and the human genome are almost perfectly matched; both are almost
incomprehensible, with no central design, accreted haphazardly over a long
time.

~~~
kbenson
> with no central design

Any supporting evidence for that? If you look into it, you may be surprised.

Here's a hint: Perl doesn't necessarily optimize for the same things other
languages optimize for.

~~~
nijk
Perl 5 has no spec, just a bunch of features that seem helpful. It has a
philosophy, not a design.

------
jgrahamc
Nice. Also, The Perl Script That Powered The Alan Turing Petition:
<http://blog.jgc.org/2012/07/perl-script-powered-alan-turing.html>

------
bane
Honest question from somebody who doesn't know any better: the HGP data has
been available for quite some time now, but to me (as a layperson) it doesn't
appear to have had the promised impact of suddenly providing a genetic map
that would let us quickly find and target genetic diseases and other
undesirable traits. Would anybody knowledgeable in the field be able to
provide some insight into what kind of impact the HGP data has had?

~~~
ben1040
Here's something from the news last year. Without the reference genome this
wouldn't be possible.

<http://www.nytimes.com/2012/07/08/health/in-gene-sequencing-treatment-for-leukemia-glimpses-of-the-future.html?pagewanted=all&_r=0>

This was just one person, sure, but there are a lot of smart people working
very hard on making this scale.

~~~
bane
So I guess my question is (and I've seen lots of articles and breakthroughs in
genome sequencing): what use has the actual data from the HGP been? It seems
to me that things like this are about as based on the HGP data as velcro is on
NASA. It's an enormously beneficial spinoff technology that happened to
develop as a side benefit of the main work. I don't know if I'd go so far as
to say the sequencing or velcro would never have been developed without the
main research foci, but it didn't hurt.

~~~
ben1040
The HGP reference genome is pretty much essential to the "whole genome"
analysis done on humans and that's the big direction in research right now. I
work in cancer and disease genomics doing data analysis software and all of
the analysis methodology goes back to this reference in some way.

Sequencing technology has gotten to a point where it's blown Moore's Law
absolutely out of the water, and we can't just throw more compute at the
analysis problem; we have to make it smarter. The reference genome is central
to how it's been made smarter.

It helps to discuss a bit how the HGP reference was produced, and why
producing it took 10 years and three billion dollars.

The HGP process started by making a map, breaking the genome into lots of
smaller segments. The idea was that this reduced the problem space: any DNA
sequenced from one of those segments was known to come from that region of
the genome. Each segment was then broken into lots and lots of smaller chunks
and read on the sequencing machines in 600-800 base pieces. By the time that
sequencing technology reached "max level," the state-of-the-art machine could
generate 96 of those reads in an hour.

Then you'd calculate overlaps and assemble those smaller "reads" back into a
sequence of that chunk you chose from the map. Then someone would audit the
computer-generated assembly by hand, possibly ordering up more lab work to
fill any gaps or resolve areas of crummy data. Repeat for the next chunk from
the map.

Now here's how things work, when we need to do any sort of genomic analysis on
an individual:

New technology has the ability to sequence human genomes at deep coverage in
11 days[1], and cranks out 6 billion reads 100bp long from places all over the
genome. Computationally, this is an absolutely different animal. You can't
feasibly try to re-assemble these reads into a human. So, what we do is use
string matching algorithms to "map" a 100bp read back to where it most likely
came from, using the HGP genome as a reference.[2] Since obviously your DNA
does not match the HGP reference base-for-base, and
mismatches/insertions+deletions are really where the interesting data is
anyway, there's some leeway for mismatches in the mapping.

At that point, by mapping reads back to where they came from, we end up with a
data file that represents an individual's genome. You're able to walk across
the genome base for base and ask "So, base 347 of Chromosome 7 is a T in the
reference, what is the most likely base on Joe's genome at this point given
the reads we have that span this base?"

Mapping things to the reference also allows us to attempt to find really
interesting stuff that can cause disease, such as structural variations in the
genome. These are instances where large segments are removed, duplicated,
inverted, or picked up and moved somewhere else relative to where they "should
be."

[1] <http://www.illumina.com/systems/hiseq_comparison.ilmn>

[2] <http://bio-bwa.sourceforge.net/> is the tool that's most popular these
days.

------
micro_cam
And set us back years as well.

Too much perl code is essentially write once and forget. It gets results quick
but it is a disaster for repeatability which is an essential part of science.
I've worked on bioinformatics Perl projects where bugs canceled each other out
(i.e. code that was supposed to clear an array and repopulate it with
corrected values did neither, so the original values were returned). And I've
spent far too many hours trying to figure out what a Perl script that is the
reference implementation for a certain procedure actually does.

There certainly are still scientists who use it, but Python and R are gaining
ground for good reason.

Wiring together analysis pipelines with pipes as they describe is, however, an
excellent technique regardless of language.

~~~
Mithaldu
> Too much perl code is essentially write once and forget.

Please stop repeating this misconception. People who put little effort into
learning how to program write "write-once" code in any language. Perl had the
"misfortune" of being the only dynamic language on the block for a long time,
leading to many people reaching for it to get things done without bothering to
actually learn the language, thus creating a vast corpus of low quality code.

(It does not help that the definitive resource on Perl for bioinformatics
people, which I've seen in libraries like those of the Genome Campus in
Cambridge, isn't worthy of being used as toilet paper, yet influenced a whole
generation of scientists.)

> I've spent far too many hours trying to figure out what a perl script does

How often do you reach for perltidy when you do this?

~~~
bsima
What is this "definitive resource," and what resource would you recommend for
learning Perl? I'm learning now (self-teaching) and don't want to learn the
wrong way.

Here's the book I've been using: <http://bixsolutions.net/>

~~~
Mithaldu
I do not remember the exact title anymore. It was a book that taught how to
build web applications by interleaving data retrieval and html printing,
instead of building it up in proper MVC fashion, to name one example of the
damage it did.

As for learning Perl, the best resource is <http://perl-tutorial.org>

You will find an explanation there of how to judge a learning resource and a
list of the most current texts that teach modern Perl.

~~~
nijk
"Proper" MVC fashion.

~~~
Mithaldu
Please elaborate. :)

------
sciurus
17 years later, Perl still seems to be the go-to tool in bioinformatics.

~~~
epistasis
In my personal experience, I see no new Perl scripts these days. Python has
completely replaced Perl for new code.

~~~
Mithaldu
If you say something like this you also need to quantify your personal
experience. Does it consist of a 3-person department? One school with a
pretty big bioinformatics department? Or do you contract out to many
different schools?

In my experience as a global contractor, there are at least some big players
who use Perl exclusively, and in Perl IRC channels I see bioinformatics
people very regularly.

~~~
epistasis
I would hope that any one person's "personal experience" is not taken so
seriously that it needs to be quantified.

UCSC has switched from teaching Perl to teaching Python in intro
bioinformatics classes (they get 5-10 new PhD students a year, plus at least
that many masters students, I think?). Nobody I encountered at my postdoc
institution used Perl; there were maybe 60 people who programmed, and I knew
the habits of ~20 of them well enough to know their programming choices. R,
Python, and C/C++ were commonplace, with a few weirdos using Ruby, and of
course lots of shell, awk, and sed glue to keep it all together. The Broad has
been somewhat successful at shoving Java into people's pipelines with GATK and
Picard, but it's not a welcome addition to many people's habits, and I haven't
encountered any significantly used next-gen sequencing project that is based
on Perl. For example, in the RNA world, common tools like Tophat, Mapsplice,
and ChimeraScan all use Python in ways that Perl would have been used 10-15
years ago.

I have active collaborations with about five different labs that also have in-
house informatics for everything from microarrays, to ChIP-Seq, to RNA-seq,
exome to whole-genome resequencing, and nobody uses Perl for any of it.

That said, it's really easy to collaborate with lots of informatics people and
never even need to know what they use internally. What's this about
contracting in bioinformatics? I've never encountered such a thing.

~~~
Mithaldu
It seems the main difference here is that you're mainly talking to people in
the USA, while I talk mainly to people in Europe.

As for contracting: bioinformatics people will often realize that they're not
good programmers at all and call in outside help to make their code less of a
rat's nest, learning from them at the same time. It happens quite often here
in Europe.

------
bsima
I'm learning Perl now, specifically for bioinformatics. I appreciated this
article.

------
draegtun
Here's an HN post (with comments) from a couple of years ago on the same
article (though it was via Dr. Dobb's): <https://news.ycombinator.com/item?id=1568109>

------
mrmagooey
The PUG that I'm a member of had a very interesting presentation on PyCogent
(<http://pycogent.org/>), which is meant to be a Python-based successor to
BioPerl. IANA bioinformatics researcher, so I have no idea as to the actual
relative strengths of each, but the PyCogent guys appear to have put the hard
yards in (~8 years of development and still going).

------
manish_gill
Is there any point for new programmers to learn Perl, when they have the
choice of Ruby (which is supposedly inspired by Perl)?

~~~
timr
Perl is very different from Ruby. It's also very different from Python. It's
also faster and more concise than both.

You should learn Perl because you're likely to encounter a lot of Perl code in
the wild, and because you can learn something from it. Knowing how to generate
a Perl one-liner that does something incredible will take your CLI skills to a
new level in a way that knowing Ruby will not.

Perl is still one of my go-to languages for sysadmin scripts, because it's so
concise and powerful. It's a long-beard language.

~~~
protospork
Thank you, that's a better-written answer than anything I could get down. I'd
just like to add that (as the article does a pretty good job of saying) perl
is /unparalleled/ as a text-processing language. There is no other language
that even comes close to perl's ease and utility for string manipulation.

~~~
timr
That's a good point. When I think over what I use Perl for in practice, it's
usually some form of text processing. Perl is just the swiss army knife of I/O
munging.

------
abraininavat
Every article about Perl leads to the same pattern of comments. Most people
think Perl is horrible and lends itself to incomprehensible code. And Perl
people have their backs against a wall, furiously defending their language
with prevarications like _Perl had the "misfortune" of being the only dynamic
language on the block for a long time, leading to many people reaching for it
to get things done without bothering to actually learn the language, thus
creating a vast corpus of low quality code._

Sure, couldn't have anything to do with the language. The whole rest of the
world just doesn't get it.

~~~
kbenson
I think it's more a case of people _think_ they understand Perl, or can make
assumptions because of their existing C/PHP/Shell programming experience and
apply it to Perl without problem, and that is not always the case. The fact
is, Perl is fundamentally different, but looks just similar enough to fool
people.

If it looked like Lisp, people would be less likely to think that it's just a
matter of applying their C experience, but alas, it generally looks pretty
familiar, if a bit messy, to users of other imperative languages.

If you are trying to understand, write or change a Perl script, and you don't
know what context a statement takes place in, or what I mean by context in
this case, then _you don't know what you are doing._ (I mean _you_ in the
general sense, not as an indictment against the parent).

