

How Perl Saved the Human Genome Project - p3ll0n
http://www.drdobbs.com/184410424;jsessionid=2R3FPIQUDQR4PQE1GHRSKH4ATMY32JVN

======
draegtun
Slightly related video presentations _Curing Cancer with Perl_ by David
Dooling of the Washington University Genome Center:

* part 1 - <http://blip.tv/file/1997719/>

* part 2 - <http://blip.tv/file/1998152>

* part 3 - <http://blip.tv/file/2000983/>

~~~
ben1040
Wow, I filmed that video!

I cannot speak officially for the Genome Center, but I'll throw out there that
the ORM that powers much of the GC's analysis platform is out on Github and
CPAN.

It's actually more than an ORM in that it also supports features like
automated creation/smart rewriting of class files based on database tables,
quick and easy command modules that get turned into hierarchical command-line
tools for free, and an automated test harness that can even parallelize onto
an LSF cluster if you've got one.

Github <http://github.com/sakoht/UR>

CPAN w/ documentation <http://search.cpan.org/dist/UR/lib/UR.pm>

~~~
draegtun
Thanks for recording the talk. I enjoyed watching it.

------
westbywest
Many moons ago, I worked on an FPGA-based platform that was among several
research projects targeted at the Genome Project. The general idea was to
offload BLAST-style sequence alignment to purpose-compiled FPGAs, such that
sequencing across the entire dataset could be performed in an order of magnitude
less time. It really wasn't all that complex (I just implemented Smith-
Waterman directly, as a demonstration), only intended to perform fuzzy matches
at Gbps speeds to winnow the working dataset down to a size more palatable to
a desktop workstation.
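
For readers who have not seen it, the Smith-Waterman recurrence mentioned above can be sketched in a few lines of software. This is only an illustrative Python version, not the FPGA implementation; the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the function name are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring parameters are illustrative assumptions, not values from the thread.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best local alignment score ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a fresh local alignment
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("GGGACGTCCC", "TTACGTAA"))
```

The table fill is quadratic in the sequence lengths, which is why pushing it into hardware as a pre-filter was attractive.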

My understanding is that all these projects (mine included) were cast adrift
when the funding for them evaporated in the post-9/11 climate. In the
intervening years, I was aware that Perl was being picked up rapidly at the
Genomics labs in the nearby university hospital (i.e. since we never delivered
them the FPGA platform), and I'm happy to read Perl has risen to fill this
niche.

------
dstorrs
The part that made me smile was when he said "In all, between one and
TERAbytes of data would be generated!!!!" [exaggerated emphasis mine]

I've got 3-4 terabytes of storage within a dozen feet of me as I type this; it
really drives home the pace of change in computing.

------
pasbesoin
In the same vein, though sometimes with less detail:

<http://oreilly.com/pub/a/oreilly/perl/news/success_stories.html>

O'Reilly also published some of these in at least two folded/stapled pamphlets
that were handed out for free e.g. at conferences. I recall a finance-centered
application where the Perl prototype far outperformed the subsequent
implementation and ended up taking over the production role.

It looks like maintenance at that URL stopped in about 2004, but in googling
"perl success stories" I saw a few more recent articles that might qualify.

~~~
rmoriz
iirc: O'Reilly stopped maintaining perl.com content around that time and
finally handed the domain over to the perl foundation a few weeks ago…

<http://www.perl.com/pub/2010/07/relaunching-perlcom.html>

~~~
chromatic
Minor nit: the abandoning occurred in 2008.

------
blahedo
(1997)

------
hackermom
Alternate title: How it happened to be Perl instead of any other just as
capable language that saved the Human Genome Project (in the land of Dangling
Participles and Allusion Errors).

~~~
jbert
Python, perl and ruby are roughly the same language. The differences between
them are primarily cultural, rather than technical.

I suspect the reason perl flourished here was a combination of luck and the
cultural fit. Culture here includes the newbie-friendly online help (e.g.
perlmonks) and the ease of "publish and re-use components" (CPAN).

~~~
adorton
Also, remember that when the project started, Python and Ruby didn't exist
yet. Perl still wasn't the only dynamic scripting language on the block, but
it was probably the most mature and best-suited to this problem domain.

I wonder if perl would still be used if the project was started today.

~~~
c1sc0
Perl _was_ the only language on the block with strong built-in text-processing
capabilities. For many a biologist the Camel book was the only programming
book they read before moving on to solve _real_ biological problems instead of
fiddling with programs.

~~~
elblanco
It's also performant enough that it wasn't worth the time to learn a faster
performing language.

~~~
hackermom
Without going out on a limb: back then, if you knew Perl, you knew C.

~~~
hartror
I learnt Perl and used it on projects long before I learnt C.

------
p3ll0n
In addition to Lincoln's thoughts, I think one of the main reasons
bioinformaticians are attracted to Perl is because it is forgiving. Biological
data is often incomplete, fields can be missing, or a field that is expected
to be present once occurs several times (because, for example, an experiment
was run in duplicate), or the data was entered by hand and doesn't quite fit
the expected format. Perl doesn't particularly mind if a value is empty or
contains odd characters. Regular expressions can be written to pick up and
correct a variety of common errors in data entry. Of course this flexibility
can also be a curse.
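
As a rough illustration of that kind of tolerant parsing, here is a minimal sketch in Python rather than Perl. The `key=value;...` record format, the field names, and the `clean_record` helper are all hypothetical, invented for this example; the point is only that missing values, repeated fields, and small hand-entry slips can be absorbed by a couple of regular expressions.

```python
import re

# Hypothetical record format: "key=value" pairs separated by semicolons.
PAIR = re.compile(r"\s*(\w+)\s*[=:]\s*(.*?)\s*$")
DECIMAL_COMMA = re.compile(r"^(\d+),(\d+)$")


def clean_record(line):
    """Parse one hand-entered record into {field: [values...]}.

    Tolerates stray whitespace, ':' used instead of '=', empty values,
    repeated fields, and a decimal comma typed instead of a point.
    """
    record = {}
    for chunk in line.split(";"):
        m = PAIR.match(chunk)
        if not m:
            continue  # skip fragments that are not key/value pairs
        key, value = m.group(1).lower(), m.group(2)
        value = DECIMAL_COMMA.sub(r"\1.\2", value)  # "0,79" becomes "0.79"
        # Repeated fields (e.g. replicate measurements) accumulate in a list;
        # an empty value is kept as None so the gap stays visible.
        record.setdefault(key, []).append(value if value else None)
    return record


if __name__ == "__main__":
    print(clean_record("Gene = BRCA1; reading=0.82; reading: 0,79 ;reading=;notes"))
```

The curse is visible here too: the bare `notes` fragment is silently dropped, so a stricter pipeline would want to log what it skips.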

~~~
chronomex
A paragraph very similar to this one occurs in the article.

~~~
pjscott
From the article:

"Perl is forgiving. Biological data is often incomplete, fields can be
missing, a field that is expected to be present once occurs several times
(because, for example, an experiment was run in triplicate) or the data gets
entered by hand and doesn't quite fit the expected format. Perl doesn't
particularly mind if a value is empty or contains odd characters. Regular
expressions can be written to detect and correct a variety of common errors in
data entry. Of course, this flexibility can also be a curse, as I'll discuss
in more detail later."

A few words are different. The article says triplicate, and p3ll0n says
duplicate, for example. But they are similar enough to use as testing input to
a diff algorithm.
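
A quick sketch of that comparison, using Python's standard `difflib` on short excerpts of the two paragraphs quoted above. The `word_diff` helper is a name made up for this example, and splitting on whitespace is a simplifying assumption.

```python
import difflib


def word_diff(old, new):
    """Return (tag, old_words, new_words) for each run of words that differs."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Excerpts taken from the comment and from the article quote above.
comment = "an experiment was run in duplicate), or the data was entered by hand"
article = "an experiment was run in triplicate) or the data gets entered by hand"

if __name__ == "__main__":
    for change in word_diff(comment, article):
        print(change)
```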

EDIT: Also from this guy's comment history:

<http://news.ycombinator.com/item?id=1456105>

Some of the phrasing looks to have been copied and pasted from this article by
Jonathan Ellis:

<http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/>

I bet if you could make a bot to do this -- go out and find relevant
information, and summarize it -- you could actually provide a serious public
service. As long as you cited your sources, so it's not a plagiarism-bot.

~~~
chromatic
That bot is easy to write in Perl! I have a document summarizer written
already.

