
Rosalind: Learn bioinformatics by programming it - mnemonicsloth
http://rosalind.info/problems/locations/
======
xvilka
Would have been nice to have a Julia version too. Some time ago I suggested
[1] creating a Julia flavor of the Biostar Handbook [2]. And now there is an
initiative [3] to create a similar, but open source, book instead. So anyone can
contribute already.

[1] [https://discourse.julialang.org/t/biostar-handbook-computati...](https://discourse.julialang.org/t/biostar-handbook-computational-genomics-and-julia-to-be-or-not-to-be/25732)

[2] [https://www.biostarhandbook.com/](https://www.biostarhandbook.com/)

[3] [https://github.com/BioJulia/biojulia_handbook/issues/1](https://github.com/BioJulia/biojulia_handbook/issues/1)

~~~
computerfriend
Rosalind is language-agnostic, except for the mini tutorial at the start.

------
divbzero
If by chance anyone is not aware, the namesake is Rosalind Franklin [1] who
made seminal contributions in the fields of X-ray crystallography and electron
microscopy.

[1] [https://en.wikipedia.org/wiki/Rosalind_Franklin](https://en.wikipedia.org/wiki/Rosalind_Franklin)

It was her X-ray image that led to the discovery of the molecular structure of
DNA.

------
eesmith
So, I picked one at semi-random -
[http://rosalind.info/problems/prtm/](http://rosalind.info/problems/prtm/) and
found a usability problem (a popup that doesn't work; in FF or Safari) and a
wrong example answer. Here's the description.

> Given: A protein string P of length at most 1000 aa.

> Return: The total weight of P. Consult the monoisotopic mass table.

The "monoisotopic mass table" appears to be a link. I get a pop-up, but
nothing appears in it, other than a spinner. I had to do a web search to find
[http://rosalind.info/glossary/monoisotopic-mass-table/](http://rosalind.info/glossary/monoisotopic-mass-table/).

The page continues:

> Sample Dataset - SKADYEK

> Sample Output - 821.392

Using the monoisotopic mass table I computed:

    
    
        >>> d = {'A': 71.03711, 'C': 103.00919, 'D': 115.02694,
        ...      'E': 129.04259, 'F': 147.06841, 'G': 57.02146,
        ...      'H': 137.05891, 'I': 113.08406, 'K': 128.09496,
        ...      'L': 113.08406, 'M': 131.04049, 'N': 114.04293,
        ...      'P': 97.05276, 'Q': 128.05858, 'R': 156.10111,
        ...      'S': 87.03203, 'T': 101.04768, 'V': 99.06841,
        ...      'W': 186.07931, 'Y': 163.06333}
        >>> sum(d[c] for c in "SKADYEK")
        821.3919199999999
    

This matches the example. BUT!!!!

This is _NOT_ the correct answer because as the expanded text says, "the mass
of a protein is the sum of masses of all its residues plus the mass of a
single water molecule."

The table says "the monoisotopic mass of water is considered to be 18.01056"
so

    
    
        >>> 821.3919199999999 + 18.01056
        839.40248
    

This latter number matches the value given by
[https://web.expasy.org/cgi-bin/compute_pi/pi_tool](https://web.expasy.org/cgi-bin/compute_pi/pi_tool).

Which means the example answer ... is wrong. Yes?

How (in)correct are the other answers? I-am-not-a-bioinformatics-programmer.

~~~
ihunter2839
This site was developed by Pavel Pevzner, who teaches bioinformatics at UCSD.
We used this site as the main curriculum in one of our final bioinformatics
classes, and after solving ~10-15 problems a week for 10 weeks, I don't
recall a single time where the error was in the problem set solutions.

Re: the problem - not a hundred percent on this, but I think the issue is that
they are vague on the fact that this is a theoretical question, not a
practical one. The key is that the question itself does not mention the
addition of the water molecule, just that you have a sequence P with a
dictionary of weights.

Edit 1: If memory serves me correctly, after the initial ionization phase of
mass spectrometry, the additional water molecule is discarded, making it
insignificant in the analysis of your peptide sequences.

Edit 2: If anyone is interested in following through this site, I would highly
recommend using the existing problem tracks:
[http://rosalind.info/problems/list-view/?location=bioinforma...](http://rosalind.info/problems/list-view/?location=bioinformatics-textbook-track)
These will help lay out the problems in a logical order and ensure you have the
skills you need to progress. Alignment problems are a great way to learn
dynamic programming and will allow you to move on to some of these other
problems (like mass spec and HMMs) more reasonably (at least, in my
experience!). Good luck!

~~~
eesmith
A closer reading shows that I got tripped up by what "residue" means. But I'm
not sure the author of the question got it right either? At the very least,
I'm confused by it.

The first paragraph of the expanded question text has: "every pair of adjacent
amino acids has lost one molecule of water, meaning that a polypeptide
containing n amino acids has had n−1 water molecules removed"

The second paragraph has: "Thus, the mass of a protein is the sum of masses of
all its residues plus the mass of a single water molecule."

The fifth paragraph has: "The mass of a protein is the sum of the monoisotopic
masses of its amino acid residues plus the mass of a single water molecule"

And the monoisotopic mass table says "Note: the monoisotopic mass of water is
considered to be 18.01056 Da."

So I thought that the water molecule was important in the calculation.

However, the last paragraph (which I only now closely read) says it isn't
important, with "In the following several problems on applications of mass
spectrometry, we avoid the complication of having to distinguish between
residues and non-residues by only considering peptides excised from the middle
of the protein. This is a relatively safe assumption because in practice,
peptide analysis is often performed in tandem mass spectrometry."

Since it didn't mention "water", and instead used the specialist term
"residue", I missed the connection earlier.

That said, the text seems to use "residue" inconsistently. There's the
definition "a residue is a molecule from which a water molecule has been
removed; every amino acid in a protein are residues except the leftmost and
the rightmost ones."

but there's also the usage: "the mass of a protein is the sum of masses of all
its residues plus the mass of a single water molecule"

Surely that should be "the mass of a protein is the sum of masses of all its
residues plus the mass of its leftmost and rightmost amino acids minus the
mass of a single water molecule", yes?
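
One way to check is to compute the mass under both readings. This is only a
sketch: it assumes the table values quoted above (just the letters in
SKADYEK), treats a free amino acid as "residue + one water", and the function
names are made up for illustration.

```python
# Monoisotopic residue masses (Da) for the letters in SKADYEK,
# copied from the Rosalind table quoted earlier in the thread.
RESIDUE = {'S': 87.03203, 'K': 128.09496, 'A': 71.03711,
           'D': 115.02694, 'Y': 163.06333, 'E': 129.04259}
WATER = 18.01056  # monoisotopic mass of water per the same table


def mass_all_residues(seq):
    """Reading 1: every unit in the chain is a residue, plus one water."""
    return sum(RESIDUE[c] for c in seq) + WATER


def mass_interior_residues(seq):
    """Reading 2: only interior units are residues (needs len(seq) >= 2).

    The two end units are counted as free amino acids (assumed here to be
    residue + water), and one water is subtracted at the end.
    """
    interior = sum(RESIDUE[c] for c in seq[1:-1])
    ends = (RESIDUE[seq[0]] + WATER) + (RESIDUE[seq[-1]] + WATER)
    return interior + ends - WATER


a = mass_all_residues("SKADYEK")
b = mass_interior_residues("SKADYEK")
print(round(a, 5), round(b, 5))
```

Under these assumptions both readings give the same total, so the disagreement
is about what to call a "residue", not about the resulting mass.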

So I looked up the definition of "amino acid residue". It appears to be
[https://goldbook.iupac.org/terms/view/A00279](https://goldbook.iupac.org/terms/view/A00279)
"α-Amino-acid residues are therefore structures that lack a hydrogen atom of
the amino group (–NH–CHR–COOH), or the hydroxyl moiety of the carboxyl group
(NH2–CHR–CO–), or both (–NH–CHR–COO–); all units of a peptide chain are
therefore amino-acid residues".

[https://en.wikipedia.org/wiki/Protein_sequencing#Whole-mass_...](https://en.wikipedia.org/wiki/Protein_sequencing#Whole-mass_determination)
also agrees that "residue" includes the two amino acids at the ends, saying
"The protein’s whole mass is the sum of the masses of its amino-acid residues
plus the mass of a water molecule and adjusted for any post-translational
modifications".

Which means ... I don't think the author uses the term "residue" correctly?

Or, more likely, I'm confused by the specialist terminology. Can someone clear
up my confusion?

~~~
mnemonicsloth
Think of the normal meaning of the word. The residue is what's left over after
whatever was going to happen has finished happening.

For amino acids, the interesting thing is that they get joined into chains
that fold up into proteins (which do all the work). The residues after that
happens look like this:

    
    
                R   O               R   O
            H   |   ||          H   |   ||            H
      ----- N - C - C --------- N - C - C ----------- N etc
                H                   H
    
    

The lines - and | are chemical bonds. The Oxygens are double bonded.
N=Nitrogen, C=Carbon, O=Oxygen. The R is one of 20 different chemical groups
called side chains that make each amino acid different.

When the amino acids are isolated (not chained), the carbon on the right (the
one double-bonded to oxygen) carries an extra OH group. The whole right-hand
carbon is often written COOH. The nitrogen on the left-hand side of each amino
acid carries an extra H. As part of the process that joins the amino acids
together, the OH from the amino acid on the left and the H from the amino acid
on the right pair off to form water. This is called condensation, and it
happens at every junction between two acids. So if there are n amino acids,
there are n-1 junctions and n-1 water molecules that were present (in
aggregate) in the constituent amino acids that don't make it into the final
protein chain.
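
To put a number on that, here is a tiny sketch. It assumes a plain linear
chain and the water mass from the Rosalind table quoted upthread;
`condensation_loss` is a made-up helper name, not anything from the site.

```python
WATER = 18.01056  # Da, monoisotopic mass of water from the Rosalind table


def condensation_loss(n_amino_acids):
    """Return (junctions, mass lost as water in Da) for a linear chain."""
    if n_amino_acids < 1:
        raise ValueError("need at least one amino acid")
    junctions = n_amino_acids - 1  # one junction between each adjacent pair
    return junctions, junctions * WATER


junctions, lost = condensation_loss(len("SKADYEK"))
print(junctions, round(lost, 5))
```

For the 7-letter sample SKADYEK that is 6 junctions, so six waters' worth of
mass is missing from the chain compared to the free amino acids.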

Note, though, that one of the methods of chopping up protein chains is adding
the water molecules back again, so you should know exactly what you're looking
at if you want to count masses precisely.

Jeez. Didn't mean to write all that. Hope it helps.

~~~
ihunter2839
Just a note - the OH and the H+ don't specifically pair up to form a single
water molecule; realistically, there will be a number of water molecules in
play throughout the reaction.

Rather nit-picky, though. Thanks for the diagrams :)

------
dang
A thread from 2012:
[https://news.ycombinator.com/item?id=4761831](https://news.ycombinator.com/item?id=4761831)

(Reposts are fine after a year:
[https://news.ycombinator.com/newsfaq.html](https://news.ycombinator.com/newsfaq.html))

------
fao_
First off, the login page doesn't redirect to the HTTPS version of the page,
so it's sending my password in plaintext. What makes this worse is that when
I manually go to the TLS page, it gives me a PR_END_OF_FILE_ERROR (I'm running
Firefox 72.0.2 on Alpine Linux).

The second thing came up when I picked the first example (the character
counting problem). When I clicked on it, it told me that the important words
are highlighted, and that the words 'figure N' refer to the figures on the
right, which felt unnecessary, because it's something that anyone visiting
Wikipedia, or browsing a book, would know.

------
danielecook
It’s a great site and greatly accelerated my learning of programming.

The form of learning which I call “problem based” learning is a great format
for me. You learn from reading up on a topic. You learn from trying different
solutions. Finally, you learn from seeing other people’s answers once you’ve
solved it.

Also check out:

Hackerrank.com - all around focus

Project Euler - math focus

Leetcode - more oriented towards interview training but still useful and fun.

------
acomjean
We used a version of this site for a bioinformatics algorithms class a couple
of years ago (we used the site for part of the homework assignments; I guess
the auto grading of code saves the instructors time...)

The problems are interesting and fun to solve. They didn’t have a lot of
context, though they seem to have added some at the start of each problem.

------
rjkennedy98
Surprised to see this here as this has been around for quite some time. I used
to do these problems on the weekends in 2013-14.

~~~
killjoywashere
I think some sites are like books: they are timeless. And Rosalind
is one of them. I'd add Philip Greenspun's /books
([http://philip.greenspun.com/books/](http://philip.greenspun.com/books/))

