
Ask HN: Teaching programming in computational biology? - veddox
Hi everyone,

I work at a university institute of computational biology. Besides doing research, we also teach quite a few courses for biology students, many of which include an introduction to programming (mostly with R, but also Python). We have a long-standing debate as to what the best approach to this is, and I would like to hear some of your opinions on the matter.

Basically, we have two factions: the "tools-first" and the "fundamentals-first" approach. The supporters of "tools-first" argue that we are teaching biologists, not software developers. They like to teach the specific tools (languages, libraries, functions, etc.) that our students are actually going to need as quickly as possible. To cover as much ground as possible, they are willing to sacrifice a deeper understanding of programming.

The "fundamentals-first" faction would prefer to do it the other way around: instead of teaching a load of commands to learn by heart, they would rather take more time to develop a programming mindset. They are willing to sacrifice how much content course participants learn, if it means that they learn how to program properly and fully understand how a given language works (with its variables, functions, scoping, etc.). Their argument is that once you have understood the concept, teaching yourself the content is easy (GIYF); whereas the reverse is much harder.

What do you think? I would be especially interested in the opinions of other computational biologists ;-)
======
nekonaute
Hi, I am a university professor in one of Paris universities. I teach
programming to both computer scientists and biologists. Fundamentals-first is
of course my choice for CS students.

As for students in biology I will always go tools-first, at least for the
first course (which is ~30 hours here). Then for further courses I would go
fundamentals.

A lot of students of biology (unfortunately) don't really care about
programming, and may consider it as some kind of black magic. Taking them
quickly to do things that "work" is essential to break the ice. And this is
just what I think the first course is for: to give them the intuition of what
programming is.

Then, for a biologist to take a second course of programming (this is not
mandatory here) means he/she really wants to learn something, so we can raise
the bar and go fundamentals.

Hope this helps.

~~~
cimmanom
I think this is great advice about the psychology of learning. But also wanted
to add an anecdote.

A friend is getting a degree in a field similar to biology. She and everyone
else in her program have been processing their data (for at least the past 5
years) with a script written by the person in the department who is most
competent with the programming tools at their disposal.

Last week, sanity-checking its output, she discovered a bug in the script that
meant that every dataset processed by this program had been deeply flawed.

That means dozens, maybe hundreds of studies published based on meaningless
data.

It might be worthwhile to also introduce a cautionary tale or two and
encourage your students to take followup courses that go deeper into
programming; learn automated testing; etc, so that they can be confident in
the future that the output of their studies is correct.
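
As a minimal sketch of what "automated testing" could look like for a small processing script: the function `normalize_counts` below is a hypothetical stand-in for one step of such a script, not anything taken from the anecdote. The test checks one case small enough to verify by hand, plus one invariant that should hold for any valid input.

```python
def normalize_counts(counts):
    """Scale raw counts so they sum to 1.0 (hypothetical processing step)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize: all counts are zero")
    return [c / total for c in counts]


def test_normalize_counts():
    # A tiny case that can be checked by hand.
    assert normalize_counts([1, 1, 2]) == [0.25, 0.25, 0.5]
    # An invariant: normalized values always sum to 1 (within rounding).
    assert abs(sum(normalize_counts([3, 5, 7, 11])) - 1.0) < 1e-12


# Run the check directly; a test runner such as pytest would also pick it up.
test_normalize_counts()
```

The point is less the specific function than the habit: every processing step gets at least one input whose correct output is known in advance.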

~~~
veddox
Oh my goodness! :-(

Even worse, I heard a very similar story in a science talk only last week. I
wonder how many more of these bugs there are out there?

~~~
breckuh
This is an anecdote but my guess is a lot. I was a software engineer for a
decade and recently transitioned to a bioinformatics job. A cursory look at
many of the repos that supplement published papers (even in top journals like
Nature), will show you that there are often no tests or type documentation of
any kind and quite often you'll find bugs when you use packages with your own
datasets. So to the OP I would recommend either teaching fundamentals first or
concurrently--but don't skip fundamentals!

~~~
arandr0x
To be clear though -- sometimes (often) this is not due to lack of education.
The job of researchers is to produce papers and grants, not working software,
and postdocs and grad students are junior researchers, not junior developers.
It advances their careers to forgo the time to write tests (or, sigh, manually
test other cases than their lab's data).

Big labs with large funding where funding sources allow hiring non-research
full-time support staff as software engineers, places like the Broad, tend to
have slightly better practices. But if you're using a package written by a
graduate student.... his job is not to make yours easier. And since he's paid
less than the guy who manages your local sandwich shop, maybe it's
understandable.

There's a problem in science where we don't value scientists. Telling them
"the world values software engineering" and turning them into software
engineers is only going to make them go into industry faster.

~~~
cimmanom
But surely they (or most of them) value being able to trust their data?

I agree that we shouldn't ask them to be professional software developers, but
then we need to provide them with access to professional developers if we want
there to be such a thing as science that uses data and computing.

It's ridiculous to ask untrained people to write data processing scripts if
you want to be able to rely on the output.

And then people wonder why we have a reproducibility crisis.

~~~
arandr0x
Yes, absolutely. It would make sense either for funding bodies to provide ways
to fund contract developers or for even smaller institutes to be required to
spend a portion of their budget on full-time developers (instead of spending
it all on machines and student staff).

For the actual data science, as opposed to writing new software packages, it
would be reasonable for students to maintain lab notebooks (but for
computational experiments rather than wet lab ones), with exact versions of everything,
exact steps, name, size and checksum of every dataset, and so on. I think
things are definitely moving in that direction but this is a skill that is not
beaten as much into grad students as recording wet lab work is. But in time it
will be.
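
A minimal sketch of the "name, size and checksum of every dataset" part of such a notebook entry. The function name `describe_dataset` and the choice of SHA-256 are assumptions made for illustration; any stable hash would serve the same purpose.

```python
import hashlib
from pathlib import Path


def describe_dataset(path):
    """Return name, size in bytes, and SHA-256 checksum of one file."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
    }
```

Calling it on each input file (for example a hypothetical `reads.fastq`) and pasting the result into the notebook makes it possible to tell later whether an analysis was rerun on exactly the same data.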

Though, my experience in wet labs is, both quality of the lab notebooks and
quality of the labeling of the stuff in the fridge definitely varies.

------
madhadron
I'm the guy no one wants to hear from about computational biology.

I think the distinction is misleading. Teach programming, but do it with
purely biological cases. Strip down the real tasks into simplified forms, and
give them very specific tools where they have to have them.

You can't unwrap BLAST in a first course, but you can have them find exact
matches. That can be used to find restriction sites, and have them generate
simulated electrophoresis gel patterns. You can have them do read depth
counting in RNAseq with exact matching on a single stranded virus genome (so
they don't have to worry about reverse complement matching). It throws away a
lot of reads, but you're not missing anything fundamental.
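
A minimal sketch of the exact-matching exercise, assuming a linear DNA molecule and a single enzyme. The function names and the toy sequence are made up for illustration; the site used is EcoRI (GAATTC, cut after the G).

```python
def find_sites(sequence, site):
    """Return the 0-based start index of every exact occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping occurrences are also found.
        start = sequence.find(site, start + 1)
    return positions


def digest_fragments(sequence, site, cut_offset):
    """Cut a linear sequence at every site and return the fragment lengths."""
    cuts = [p + cut_offset for p in find_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


toy = "AAGAATTCCCCGAATTCTT"  # toy sequence, not real data
fragments = digest_fragments(toy, "GAATTC", cut_offset=1)
# On a gel, longer fragments stay nearer the well, so list them first.
for length in sorted(fragments, reverse=True):
    print(f"{length:>4} bp")
```

The same `find_sites` function could be reused for the read-depth exercise by treating each read as the pattern and the genome as the sequence.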

You can't do full molecular dynamics, but you can do wiggling polymers on a 2D
lattice and relax them. Likewise, you can do a simple version of threading an
amino acid sequence through a known 2D structure on a lattice and relax it.

You can't do sophisticated phylogeny, but you can brute force assemble small
trees with substitutions and no indels, and you can likewise evolve and sample
trees of sequences with substitution, and then have the students evaluate how
well their tree reconstructions match the generating algorithm. Then take a
subregion of a set of real 16S sequences without indels and have them build a
tree of those.
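
A minimal sketch of the "evolve sequences with substitutions and no indels" step, under the assumption that every position mutates independently with the same probability and that each of the three other bases is equally likely. The names `mutate` and `hamming` and the tiny two-level tree are illustrative only; tree reconstruction itself is left out.

```python
import random


def mutate(sequence, rate, rng):
    """Copy a DNA sequence, substituting each base with probability rate."""
    out = []
    for base in sequence:
        if rng.random() < rate:
            # Substitution only: pick one of the three other bases.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(42)  # fixed seed so a class sees the same tree
root = "".join(rng.choice("ACGT") for _ in range(60))
left, right = mutate(root, 0.1, rng), mutate(root, 0.1, rng)
leaves = {
    "left_1": mutate(left, 0.1, rng),
    "left_2": mutate(left, 0.1, rng),
    "right_1": mutate(right, 0.1, rng),
    "right_2": mutate(right, 0.1, rng),
}
# Pairwise distances are the raw material for a reconstruction exercise.
names = sorted(leaves)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, hamming(leaves[first], leaves[second]))
```

Because the generating tree is known, students can compare whatever tree they rebuild from the distances against the true one.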

You can't do full metabolic network modeling, but you can do a simple
stochastic transition model of a couple of pathways and use it to make
predictions about genetic experiments. See Hatzimanikatis's work for some
interesting things to do with this.

I spent some time thinking about this when I was part of the Swiss Institute
for Bioinformatics, and we put together a _course de perfectionnement_ about
programming practice for computational biologists. They finally ran the course
after I left the country (they weren't waiting for me to leave, honestly, it
just didn't get scheduled until then), and I'm told it went well.

~~~
arandr0x
I agree with the idea, but I'm curious -- at the Swiss Institute for
Bioinformatics did undergrads have the fundamental biochemistry knowledge to
model pathways? When I was a third year half my enzymology class were
struggling mentally modeling a single reaction.

I would attend a graduate class that worked like you outline though, and I
kind of wish all of them did.

~~~
madhadron
SIB was a consortium for practitioners at the various universities, but many
of them came out of a biology background, so the ones that would be taking
this course generally had a masters or PhD in biology.

------
ValleyOfTheMtns
Why not both? Sacrifice a little from each so that you can do both, and they
will give a bit of context to each other.

If you only go tools-first you risk treating the tools like "black magic"
without really understanding what's happening.

With only fundamentals-first you lose relevance for biologists.

Give them enough on fundamentals (i.e. variables, functions, loops,
conditions, scope, basic data structures) so that they can appreciate how the
tools do what they do.

Beware false dichotomies.

------
dekhn
Realistically, you have limited time and resources and your job is not to make
computational biologists who are world class programmers (I wish that were the
case, but that's not what the incentive system wants).

I would suggest teaching the basics of language with the expectation that
everybody can implement control flow (enough to parse a file and output a
converted form of the data), and maybe enough to understand one or two
efficient algorithms, and then move on to how to use tools effectively.
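
A minimal sketch of the "parse a file and output a converted form" baseline, assuming FASTA as the input format and tab-separated text as the output. Both choices, and the function names, are assumptions for illustration; the comment above does not name a format.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_tsv(in_handle, out_handle):
    """Write one line per record: name, sequence length, sequence."""
    for name, seq in read_fasta(in_handle):
        out_handle.write(f"{name}\t{len(seq)}\t{seq}\n")


example = io.StringIO(">seq1\nACGT\nAC\n>seq2\nTTTT\n")  # toy input
result = io.StringIO()
fasta_to_tsv(example, result)
print(result.getvalue(), end="")
```

It only needs a loop, a conditional, and string handling, which is roughly the level of control flow described above.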

My perspective is from somebody who was a hacker, then a computational
biologist, then a software engineer. None of my training prepared me for any
of that- I had to do extensive self-research to figure out how, and later,
work with good programmers for years, before I could call myself a decent
software engineer.

------
gravelc
I had a molecular biology research background before switching to
bioinformatics via a Masters degree some time ago. The degree involved
subjects that teach tools, and other subjects that teach programming
fundamentals (with the undergrad CS and engineering students).

By far the most helpful for me were the software engineering subjects. I could
immediately see where I could apply pretty much everything I was learning each
day to a biological problem I was trying to solve. Learning individual tools
was trivial in comparison, and with programming fundamentals I could wrangle
large datasets and thus string tools together into workflows and pipelines.

What was beneficial for me may not of course be broadly applicable, and I
suspect different students will appreciate different approaches.

------
eljost
I'm a computational/theoretical chemist and use Python extensively.

To me the question seems a bit poorly phrased. What is the goal of the course?
Are they supposed to learn how to program? Then it's of course fundamentals
first, especially the stuff you cited (variables, functions, scoping). You
can't solve any problem without knowing this stuff.

If they are supposed to learn a specific library/tool then, in my opinion, it
depends on the libs/tools you want to teach them. Maybe this library is not
that relevant for comp. biologists, but take scikit-learn. You still have to
know how to program when you want to use this library properly. On the other
hand if they are to learn some GUI tools then I guess they don't have to know
how to program. You can ask yourself: Can you anticipate what tools/libraries
your students will be using in a few years? If you can answer yes to this,
then teach them a bit about these tools. If your field is constantly
evolving/changing then it would be better to invest in the basics, so your
students can adapt easier.

Considering the fact that you are (hopefully) and have to be constantly
learning as a practicing scientist it is of great importance to get the basics
right.

------
justinnhli
You might be interested in Harvey Mudd's CS5 Green syllabus [1], which "is
designed to give you the foundations of computer science in the context of
solving real and important problems in the biological sciences."

[1]:
[https://www.cs.hmc.edu/twiki/bin/view/CS6/CourseSyllabus](https://www.cs.hmc.edu/twiki/bin/view/CS6/CourseSyllabus)

~~~
veddox
Sounds fantastic! Unfortunately, 4 1/2 hrs a week for a whole semester is
_way_ more than we have available for most of our courses :-(

------
mjfl
Tools first, 100%. People learn concrete things, and then they learn the
abstractions. The abstractions are nearly useless by themselves. It's the
reason that people learn addition before algebra before group theory. The
people that say learn the "concept" first are just flat wrong.

~~~
veddox
The problem is a) that observation shows that many people _don't_ learn the
abstractions after being taught the application (and thus can never teach
themselves more applications) and b) that (perhaps) unlike in mathematics, an
understanding of the concept makes the application a _lot_ easier...

------
sprior
Only slightly offtopic, a really great book I'd recommend is "Vehicles,
Experiments in Synthetic Psychology" by Valentino Braitenberg. He uses the
term "Vehicles" to represent simple biologically inspired robots and how we
perceive them. Coding simulated vehicles from his book would be a really cool
programming exercise. Even better would be implementing them by programming
Arduino based devices.

------
silverlake
My brother is a biology researcher who uses Python scripts for DNA analysis. I
recommended he hire a programmer for his group to set up the tools and write
programs. Everyone should do what they do best. Outsource the rest.

------
arandr0x
I have a graduate degree in biochemistry. (This is not biology and there is
some difference in e.g. the makeup of the people who choose each course. So
please take my experience with the required amount of salt). Today I am a
programmer.

Here are my tips on how biological sciences students work.

* They enjoy concepts. Biology is somewhat more abstract than the intro programming courses in the way it's taught in the early years, and biology students tend to be good at retaining and linking information.

* However, they can be very cautious. They tend to request clear instruction, and specific goals (given by the instructor).

* They're motivated by getting good grades

With that in mind, for an intro course, I would recommend a short intro on the
use of programming/data science in real life science problems. Explain, for
example, that optimizing foodstuffs now requires advanced genetic
understanding, and that even breeding is targeted. If they're biochem students
more than just biology you can bring up antibiotic resistance. This should be
the very first class, because most biology students don't give a hoot about
learning Python, but they care about doing good in the world. Plus it will
give them material for when they become computational biology researchers and
have to rehash this stuff for grant committees.

Next, some fundamentals. Variables, iteration. Do not introduce functions yet.
For some reason I cannot fathom, even some quite advanced scientists seem to
hate functions, to the benefit of really long scripts. They'll need to learn
it eventually, but no need to put them off.

Build exercises. The exercises should be instructor-led and graded. They
should include some practical exercises, however, if graded programming
exercises are included students should either be given sample similar
exercises (solved) during the lectures as ungraded practice or access to a TA
for questions. Most of these people really care about their grades, so graded
exercises with no practice intimidate them. The exercises can be small
applications of fundamentals at first, however, keep them biology-related (The
Game of Life grabbed me personally in my first CS course, otherwise you can do
things like make them write an R program to display a scatterplot and then fit
a line on a set of data from lab experiments that will then reveal a chemical
or biological law they have seen in their other courses. Counts of bacteria
after N days, that sort of thing.)
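
A minimal sketch of the line-fitting half of that exercise, written in plain Python rather than R and without the scatterplot. The bacterial counts are made-up numbers chosen to double every day, and `fit_line` is an illustrative name for an ordinary least-squares fit.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


days = [0, 1, 2, 3, 4]
counts = [100, 200, 400, 800, 1600]  # invented data: doubling each day
# Exponential growth becomes a straight line on a log scale.
log_counts = [math.log2(c) for c in counts]
slope, intercept = fit_line(days, log_counts)
print(f"doublings per day: {slope:.2f}")
print(f"estimated starting count: {2 ** intercept:.0f}")
```

With real lab data the points would scatter around the line, and the slope would be the growth rate the students have already met in their other courses.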

Introduce some things that are not fundamentals for most CS students, but that
should be for computational biology. Like regular expressions. Regexp are more
fundamental to biology than e.g., scoping and functions. Actually, basic data
structure instruction (lists and tables/hashtable, concepts of databases and
records) are also very important.
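
A minimal sketch of why regular expressions fit biological sequence data, assuming a simplified definition of an open reading frame: a start codon ATG, any number of complete codons, then the first in-frame stop codon. The function name `find_orfs` and the toy sequence are illustrative, and only the forward strand is searched.

```python
import re

# ATG, then as few whole codons as possible, then a stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def find_orfs(sequence):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in ORF_PATTERN.finditer(sequence)]


for start, orf in find_orfs("CCATGAAATGATT"):  # toy sequence
    print(start, orf)
```

Note that `finditer` reports non-overlapping matches only, so nested or overlapping reading frames would need a different search loop.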

For each concept, provide lists of solved problems or an automatic interface
where they can type code and see solutions if they're stuck, like all natural
science students get in problem sets. Programming instruction typically
requires a lot more initiative than science students are comfortable with, and
gives a lot less guidance; students are left on their own with the machine,
and they have to "figure out" what works and why it doesn't. In science,
students have textbooks with toy problem sets and solutions, and those help
build confidence. Jupyter notebooks can be great for this but just make sure
to stay within a concept and not include too many packages/APIs/problem
definitions at once.

A lot of comp bio instructors think their students have to know Linux, know
how to use a DNA aligner, learn Python and R, and be able to independently
write programs from scratch to use CS in biology. They don't. They need linear
regression and databases. If you get students that take several courses in the
sequence, then you know they are more "tech-savvy" and interested in tech for
its own sake, and you can build up all the tools. But often tools-based
approaches suffer from introducing too many tools, for too few reasons.

So I guess I am advocating a hybrid tools-fundamentals approach that takes a
very focused, very thin slice of fundamentals, and the tools they apply to,
and solves a number of very small, "easy" quantitative real-world problems. And
then in follow-up courses, build on that. Build bridges as well with your CS
department so that you can direct students who want to learn deeper, more
technical topics, and if you have research talks at your institute that talk
about real research facilitated by computational methods, and how those are
used (esp given by grad students) then recommend those to your more interested
undergrads.

Edited to add: This applies to my observations of biological science undergrad
and grad students in the US and Canada. I haven't been to uni in Europe but
know people who have been and instruction there (for all degree programs) is
so different that recommendations on course formats are useless. For example
European and Asian students in STEM are usually pretty good at math, even if
their degree program doesn't have math courses, and less turned off by
formulas and heavier fundamentals.

~~~
veddox
Thank you for this detailed response! I think that is a very thorough and
well-thought out concept.

> Do not introduce functions yet. For some reason I cannot fathom, even some
> quite advanced scientists seem to hate functions, to the benefit of really
> long scripts. They'll need to learn it eventually, but no need to put them
> off.

Yes, I have observed that myself... I really don't understand that - functions
would make so many things so much easier. Plus, for many students, some of
their biggest comprehension problems can be traced back to an incomplete
understanding of functions. (Especially the simple fact that builtin system
"commands" are nothing but pre-written functions...) (Having said that, I
ought to add the disclaimer that I cut my programming teeth on Lisp and
Python, so I think very heavily in terms of functions.)

> not include too many packages/APIs/problem definitions at once

Good point ;-) I have seen that too in some courses.

