
Bioinformatics and the joy of Perl6 - kamaal
http://perl6advent.wordpress.com/2014/12/15/day-15-bioinformatics-and-the-joy-of-perl6/
======
coriny
Just one technical comment - that parser will break on many FASTA files. They
can also include '.' and '*' characters ('.' is often output by sequence
aligners to indicate sections outside the model; '*' is defined for amino
acids but not DNA/RNA, and I have seen it in sequencer output for uncertain
positions). I think this kind of edge-case variance is possibly why the
various Bio-* tools have not been very successful overall: you end up spending
hours trying to work out why they aren't working with your particular inputs.

On a personal viewpoint: "Take a look at the REPL dump below and see if you
can work out where we are implicitly using $seq as a Str rather than Seq:" -
this section pretty much sums up why I've always hated working with Other
People's Perl. To accurately follow someone else's Perl requires either an
encyclopedic knowledge of the language and its edge cases, or very heavy
documentation and a set standard of coding practices. It's a shame that this
looks true for Perl 6 as well.

Great for personal use, I guess. But by and large I'm glad to see more
projects switching over to Python.

~~~
Ultimatt
Sure, but the nice thing is you can just add an extra catch-all to the
Grammar. Or better yet, create a new Grammar for that specific sequencing
machine and override the token.

For example just adding:

    rule sequence:sym<seq> { <-[\>]>+ }

Would mean you can just grab anything without worrying about whether it's
correct. Just because a tonne of people abuse FASTA doesn't mean you shouldn't
first attempt to enforce something valid. The nice part of that is you know
within the parser whether you have clean data or something with a badly
defined character, like '*' in DNA when it should be X or N or one of the
other ambiguity characters.
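
As a minimal sketch of the machine-specific grammar idea (assuming the post's
grammar is named FASTA and declares `proto rule sequence {*}`; the grammar and
variable names here are invented):

    grammar MachineX-FASTA is FASTA {
        # Catch-all override: accept any run of non-'>' characters,
        # so this machine's quirky output parses without complaint.
        rule sequence:sym<seq> { <-[\>]>+ }
    }

    # Parse with the lenient grammar in place of the strict one:
    my $match = MachineX-FASTA.parse($fasta-text);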

Also, it's an advent post, not some formal public codebase ;P The focus was on
showing some of the more obscure features of Perl 6 that aren't like other
languages. Python can look as crazy as you like; it's just that the majority
of people using it fall into the camp of avoiding the crazy. For example,
here's
([https://gist.github.com/MattOates/6195743](https://gist.github.com/MattOates/6195743))
some crazy Python I've written. The six-line list comprehension is especially
readable :P

~~~
dalke
There are a couple of problems with starting with a strict approach. I predict
the performance is going to be worse than something which allows (almost) all
characters, because of validation overhead.

More seriously, I wrote a full grammar-based parser engine for Biopython along
these lines. The Achilles heel was error recovery. Suppose record #123456 in a
database contains a syntax error (I can list dozens I found, like a copyright
date of 19995 breaking my four-digit definition, or a SWISS-PROT record with
the copyright lines in the wrong spot). How do I have the parser ignore that
record, report the problem, and continue parsing with the next record?

Error recovery in context free grammars is hard, but doable.

Then how do I say "if the ID is XYZ and the version number is 12, then use
this work-around, because that release had a bug which has since been fixed"?
In practice, and unlike with a programming language, the format spec ends up
being less important than the actual data, so you end up with lots of tweaks
to handle vendor errors.

P.S.: Wouldn't

    names = [SQLiteMagician.types.get(type(d).__name__) for d in data[0]]
    types = ["NONE" if t == "NoneType" else t for t in names]

have been less crazy, and preferable? Also, I'm not sure what that code is
trying to do, but sys._getframe() and linecache from the standard library
should help, as might this example, which I think does what you want without
using the ast module:

    import sys

    def show_me():
        # Look one frame up the stack: the function that called show_me().
        caller = sys._getframe(1)
        f_code = caller.f_code
        # The first co_argcount entries of co_varnames are the argument names.
        names = f_code.co_varnames[:f_code.co_argcount]
        local_vars = caller.f_locals
        for name in names:
            print repr(name), "=", repr(local_vars[name])

    def spam(x, y, z=4):
        a = "%s %s %s" % (x, y, z)
        show_me()

    spam("Hello", 2)

This gives the output:

    'x' = 'Hello'
    'y' = 2
    'z' = 4

~~~
Ultimatt
Great, thanks! I think there are two ways your code differs from mine. You are
getting the names and values of the spam function's arguments with show_me()?
My version goes one level further up and gets the names of the variables
passed in to spam. I believe I tried some of the same standard functions at
first, but they had problems when a function call spans more than one line of
source text, along with some other limitations: you don't necessarily get back
what you expect without properly going into the source and parsing it. I'll
take another look, though, since I might have just missed something obvious.
My Python isn't /that/ strong.

Completely agree with your list comprehension suggestion. The list
comprehensions are crazy by design, to troll monoglot Python people who hold
the false belief that code readability is simply a property of language
syntax. Imagine I handed you that code without the comments and without
sensible variable names, which is frequently what people I work with hand to
me. I've found a myth persists about Python being inherently readable (the
opposite of Perl!), and many scientists who don't have a programming
background just go along with this idea, especially as it gets them off the
hook of learning to program well.

With parsing and grammars I'm still happier having a parse fail and tell me
what the garbage data was and where. It's very rare that any code can recover
from completely arbitrary and unknown problems with a file. With exceptions
being thrown you can at least start processing the part of the file you
already have, and set a state that shows the whole thing didn't run. Asking a
language to do more for you feels a bit much. I'm not convinced by your point
about validation overhead, since you have to check the characters for a record
start/end state anyway. Plus, skipping validation just means garbage data gets
even deeper into your program! At some point you have to check it's valid.
Fail fast. Checking a small set of match states isn't going to kill
performance if the grammar/regex engine is any good. One of the ways I'm
getting around this in my Perl 6 libraries is having a thread that deals with
file IO, parsing and object creation; the user's main program then just deals
with the stream of sequence objects as they come in.
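
As a rough sketch of that pipeline (not the actual library code;
`parse-records` and `process` are hypothetical stand-ins):

    my $records = Channel.new;

    # Worker thread: file IO, parsing and object creation happen here.
    my $worker = start {
        for parse-records('seqs.fasta') -> $seq-object {
            $records.send($seq-object);
        }
        $records.close;
    }

    # Main program: consume sequence objects as they arrive.
    for $records.list -> $seq-object {
        process($seq-object);
    }
    await $worker;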

~~~
dalke
Getting the variable names used for the caller is of course much harder, and
meaningless for a call like f(1+g(b)). py.test, which made a more serious
attempt at handling the f() case, gave up on that approach, preferring to
rewrite the original code behind the scenes rather than infer it backwards.

So you're a troll. Thank you for letting me know. By trolling the uninformed
you seemingly have no meaningful point other than to point your finger and
laugh at the views of others. This action runs counter to your stated goal of
wanting scientists to be better programmers. What good do you think it will do
to the world to troll people using your novice Python skills?

You do realize that when someone searches for "Matt Oates", "bioinformatics"
and "Python" they are likely to find this page in the future, and learn about
your love of trolling, right?

I do not want to live in a troll world, just like I don't want to live in a
world where people use false and undercutting statements disguised as
joviality. Instead, I teach scientists how to be better at the programming
they need to do good science.

Your last paragraph tells me that you have very little experience in writing
parsers. In a record-oriented format, and with hand-written parsers,
resynchronization to the record boundary is easy. In Perl it's almost trivial,
because nearly all record-oriented formats have a boundary which can be
handled with $/; FASTA is an exception, but still easy to handle. On the other
hand, my point is that it's much harder to resynchronize with a context-free
grammar.
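
To illustrate the easy case, a line-oriented FASTA resynchronization sketch
(in Perl 6 for consistency with the thread; `handle-record` stands in for
"parse this record, or report the error and skip it"):

    my @record;
    for 'db.fasta'.IO.lines -> $line {
        # A line starting with '>' begins the next record, so whatever
        # has accumulated is a complete record: parse it or skip it.
        if $line ~~ /^ '>'/ and @record {
            handle-record(@record);
            @record = ();
        }
        @record.push($line);
    }
    handle-record(@record) if @record;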

Here's an example of a failing case with my strict parser. HAS2_CHICK was the
only record in SWISS-PROT 39 with a DT line of:

    DT   30-MAY-2000 (REL. 39, Created)

This failed because the spec said it should be "Rel.", not "REL.", so my
parser stopped. Every other parser effectively said (by lack of full
validation) that it wasn't interested in that field, and ignored it.

You can certainly condemn the entire SWISS-PROT 39 release as "garbage" for
having a single record which didn't match the spec. Effectively everyone who
wants the data will disagree with you. You can also train your parser to
ignore those details, and retrain it for every release, though you'll find it
increasingly difficult to keep it accepting the format as it changes across
five different database releases.

Here's another example. Some of the GENBANK records included a qualifier of
"/primer", which wasn't in the specification. My parser failed, because it
validated the qualifier names. And you know what? No one really cared about
that failure, so I had to tweak my parser to allow the non-standard name. The
same with EMBL: the ID line is supposed to have the molecule type (one of
"DNA|RNA|circular DNA|circular RNA"). In actuality? Record S40706 had a type
of 'XXX'... and so again my validating parser stopped, when in reality almost
no one cares.

As these are further examples of simple differences causing a failure within a
record, your bringing up "completely arbitrary and unknown problems with a
file" suggests to me that you have little real desire to engage with my point,
and would rather troll me by diverting the topic. In case I am wrong, please
assume that Perl's $/ or a similarly simple method is sufficient to determine
the record boundary and resynchronize.

The underlying issue is that a database format spec changes after almost every
release, so a validating parser ends up having to handle a _family_ of
formats, not just one. That is much easier for a line-oriented parser to do,
since those formats are written with line-oriented, hand-written parsers in
mind, and are usually updated in such a way as not to break those parsers.

When you find out that your nice, elegant, validating context-free parser
breaks after nearly every release, due to details that are irrelevant to what
you care about, then you might conclude that the rest of the world is a bunch
of ignorant fools who can't program; or reject the idea of a validating parser
for this sort of data and switch to a line-oriented one; or provide a recovery
mechanism to determine what to do with the records which couldn't be parsed,
and continue if requested. I do the last of these now.

(It's also difficult to use a CFG-based parser that doesn't know about record
boundaries to parse files that don't fit into memory even though each record
does. I don't know if that's still an issue in bioinformatics these days, but
it certainly is in cheminformatics. In any case, a mix of simple record
extraction and a CFG-based grammar suffices, except for records containing an
entire genome, which is outside the scope of this sort of task, and for PIR,
which had header and footer components in the file.)

Regarding performance, don't trust your beliefs or mine. Measure it.
Empiricism trumps faith. Try benchmarking a hand-written, line-oriented FASTA
parser against your grammar. I wrote one just now in Python. The regex which
used "[^>\n].*\n" was about 2% faster than the one checking the [specific
sequence letters]. A 2% here and a 2% there, and those validations quickly add
up to noticeable overhead. By comparison, my line-oriented parser took about
half the time of the regex-based parser.
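
A rough Perl 6 timing harness for such a comparison might look like the
following sketch (`FASTA` being the post's grammar and `parse-fasta-lines` a
hypothetical hand-written, line-oriented parser):

    my $text = 'seqs.fasta'.IO.slurp;

    # Time each candidate parser over the same input.
    for 'grammar' => { FASTA.parse($text) },
        'lines'   => { parse-fasta-lines($text) } -> $candidate {
        my $start = now;
        $candidate.value.();
        say "$candidate.key(): {now - $start} seconds";
    }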

Of course, Perl has a different runtime environment, and the conclusions there
may differ.

Then again, there's already a history in Perl of variable stringency levels
for improved performance. Swissknife (doi: 10.1093/bioinformatics/15.9.771)
was written as a lazy parser because Perl 5 wasn't fast enough for full
validation and parsing of each record. Partial evaluation gave a 4x
performance improvement when full validation wasn't needed.

~~~
raiph
> my point is that it's much harder to resynchronize with a context-free
> grammar.

So, don't use a context-free grammar. [1]

Use something else, e.g. a PEG. [2]

Not only are Perl 6 grammars PEGs rather than CFGs, but individual rules are
closures (of arbitrary code) as well as input matchers, so one can write
arbitrary code to handle anything that happens during a parse. [3]

As a very trivial example, in a grammar I wrote for parsing a CPAN control
file, I added a rule for handling unexpected field types. It prints a warning
to stderr and then keeps on going. [4]
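
The pattern looks roughly like this (a made-up sketch, not the gist's actual
rules):

    grammar ControlFile {
        token TOP { <data-item>+ }
        proto token data-item {*}
        token data-item:sym<name>    { 'Name:' \h* \N+ \n }
        # Catch-all alternative: match any leftover line, warn on
        # stderr via the embedded closure, and keep on parsing.
        token data-item:sym<generic> { (\N+) \n { warn "unexpected field: $0" } }
    }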

Imo Perl 6 is, among its many strengths, an innovative parsing tool with at
least three big problems right now:

* The grammar engine is slow. [5]

* Cat is likely a post Perl 6.0.0 feature. (Among other things, Cat is for enabling practical parsing of arbitrarily long and arbitrarily buffered data streams without having to tweak userland code.) [6]

* Macro, slang and parse tweaking features are still evolving. [7]

[1] [http://en.wikipedia.org/wiki/Context-free_grammar](http://en.wikipedia.org/wiki/Context-free_grammar)

[2]
[http://en.wikipedia.org/wiki/Parsing_expression_grammar](http://en.wikipedia.org/wiki/Parsing_expression_grammar)

[3]
[http://en.wikipedia.org/wiki/Perl_6_rules](http://en.wikipedia.org/wiki/Perl_6_rules)

[4]
[https://gist.github.com/raiph/63508a5f4ade000acc7f](https://gist.github.com/raiph/63508a5f4ade000acc7f)

See the `data-item:dst_GENERIC-MATCH` rule; note the embedded `{ warn ... }`
closure.

[5]
[http://www.reddit.com/r/perl6/comments/2ec2hd/rakudo_perl_6_...](http://www.reddit.com/r/perl6/comments/2ec2hd/rakudo_perl_6_and_moarvm_performance_advances_in/)

[6] [http://perlcabal.org/syn/S03.html](http://perlcabal.org/syn/S03.html)
which includes:

"The Cat type allows you to have an infinitely extensible string. ... Then a
Regex can be used against it as if it were an ordinary string ... The Cat
interface allows the regex to match element boundaries with the <,> assertion
... Strings, arrays, lists, sequences, captures, and tree nodes can all be
pattern matched by regexes or by signatures more or less interchangeably."

[7] [http://perl6advent.wordpress.com/2014/12/17/day-17-three-strange-loops/](http://perl6advent.wordpress.com/2014/12/17/day-17-three-strange-loops/)

------
dalke
I'm a bit surprised at the implied superiority over the other Bio* projects.
None of them started as a copy of BioPerl (and btw, it's "Biopython", not
"BioPython"), and none of them is trying to implement all of the features of
BioPerl. For that matter, the other projects implement features that BioPerl
doesn't have.

It's interesting that the example supports a comment in the FASTA record. I
wonder why the author included it. I've never seen a comment used in the wild,
but then again I haven't worked with FASTA files for some time. My experience
was that nearly every tool would break if you gave it a FASTA file with a
comment. For example, the BioPerl FASTA parser is at
[https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/fasta.pm#L139](https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/fasta.pm#L139),
where you can see it doesn't support a comment.

Since the chance of breakage is so high, my belief 10 years ago was that this
original feature from Pearson was dead, and would never be revived.

~~~
Ultimatt
Update for the superiority, though that is part my flippant language use and
some projection :P The comment section in the grammar isn't comments as in ;
line comments but the per sequence comments usually used to encode extra meta
information such as with GenBank. No idea if they should be called something
else, I've always thought of them as comments.

~~~
cjfields1
Okay, I think I see the discrepancy. It's a nomenclature thing.

The Bio* projects refer to the additional material after the ID as the
sequence 'description'. There is a subtle difference between that and a simple
comment, though I have seen even NCBI use this as the 'junk drawer' for any
additional sequence information (such as the headers in the nr database).

------
jghn
I'm really surprised to see modern discussions of Perl in bioinformatics;
normally when I see any mention I assume that either a) it's an outsider who
is snickering at how backwards we are, or b) it's an old-timer who has never
gotten with the times.

I've been in the bioinformatics space for about 13 years now and I've only
come across a small handful of Perl scripts, all of which have been legacy
programs. Even when I started in the field, Perl was viewed as somewhat
anachronistic. All of the end-user software I've written (read: for
researchers) has been in R, Java and a wee bit of Scala.

These days all of the researchers I know are using some combo of R, Python and
to a lesser extent MATLAB. Many have useful C and/or Java (and/or Scala)
skills as well.

My personal experience is obviously not exhaustive, but I find it hard to
believe that many people are using Perl these days, and IMO that's good
riddance.

~~~
Snoooze
> b) it's an old-timer who has never gotten with the times.

As someone who knows the author of the post, that is mostly true ;)

In all seriousness, while I generally agree with your sentiment, I don't think
that Perl is dead within bioinformatics, not least since major data providers,
e.g. Ensembl, write their tools in Perl.

~~~
jghn
Heh.

Oh I know it still exists, but while this is admittedly anecdotal on my part
(and likely heavily biased by the institutions I've been a part of) I haven't
seen a single _new_ person who defaults to Perl in a very, very, very long
time.

And to be clear, I'm not trying to rip the author or the article, it just
seems ... oddly timed, like writing a contemporary article on how to use COBOL
in the business computing space.

~~~
Ultimatt
Author of the post here. I think my comeback would be Perl 6 is a completely
new language, that's only just becoming usable from a long design and
development process. A lot of people dont think about it this way because they
have zero idea what Perl 6 is about. If you don't like sigils on your
variables you aren't going to like it though :P I use Perl 5, Python and R on
a daily basis. I try to avoid BASH in favour of Perl, as it genuinely is just
less crazy and is perfect for the same tasks. Python isn't a great shell
script replacement. I like Perl 6 mostly because it lets you write nicer
Perlesk code that isn't quite so disposable and gross. At the moment Perl 6
isn't completely viable for daily use, just because of performance. I also
know Java and C quite well but rarely work on anything that needs the perf of
these languages given the overhead for someone who is a researcher rather than
fulltime programmer.

~~~
jghn
Fair point. I haven't looked at Perl 6 since the early aughts and assumed it
was basically exactly what I saw then. I didn't realize that it was only now
starting to blossom.

~~~
Ultimatt
Lol, not sure I'd go as far as to say it's just started to blossom. Some
flowers are weeds to one person and paid-for wildflowers to another. Each to
their own!

------
grondilu
For those who want to test their skills in Bioinformatics:

[http://rosalind.info/](http://rosalind.info/)

The neat thing is that you can use any language you want.

~~~
SilasX
On that note, awk shows up a lot in the shortest solutions because of the ease
with which it parses input files.

I haven't used Perl myself, so can anyone comment on the merits of Perl vs awk
here? They often compete for this use case.

~~~
grondilu
Oversimplifying a bit, I'm going to tell you that Perl was initially conceived
as a mix of C, sed and awk, so there are quite a few awk idioms that are
directly translatable into Perl. As a matter of fact, there is even a section
dedicated to awk in perltrap[1], enumerating the differences between the two
languages.
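
For instance, summing the third column of whitespace-separated input (a
classic idiom rather than a perltrap example; data.txt is a stand-in
filename):

    # awk
    awk '{ s += $3 } END { print s }' data.txt

    # Perl 5: -n adds the implicit input loop, -a autosplits each line into @F
    perl -lane '$s += $F[2]; END { print $s }' data.txt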

1\. [http://perldoc.perl.org/perltrap.html#Awk-Traps](http://perldoc.perl.org/perltrap.html#Awk-Traps)

