Ask HN: Teaching programming in computational biology?
34 points by veddox 8 months ago | 30 comments
Hi everyone,

I work at a university institute of computational biology. Besides doing research, we also teach quite a few courses for biology students, many of which include an introduction to programming (mostly with R, but also Python). We have a long-standing debate as to what the best approach to this is, and I would like to hear some of your opinions on the matter.

Basically, we have two factions: the "tools-first" and the "fundamentals-first" approach. The supporters of "tools-first" argue that we are teaching biologists, not software developers. They like to teach the specific tools (languages, libraries, functions, etc.) that our students are actually going to need as quickly as possible. To cover as much ground as possible, they are willing to sacrifice a deeper understanding of programming.

The "fundamentals-first" faction would prefer to do it the other way around: instead of teaching a load of commands to learn by heart, they would rather take more time to develop a programming mindset. They are willing to sacrifice how much content course participants learn, if it means that they learn how to program properly and fully understand how a given language works (with its variables, functions, scoping, etc.). Their argument is that once you have understood the concept, teaching yourself the content is easy (GIYF); whereas the reverse is much harder.

What do you think? I would be especially interested in the opinions of other computational biologists ;-)

Hi, I am a university professor at one of the Paris universities. I teach programming to both computer scientists and biologists. Fundamentals-first is of course my choice for CS students.

As for students in biology I will always go tools-first, at least for the first course (which is ~30 hours here). Then for further courses I would go fundamentals.

A lot of students of biology (unfortunately) don't really care about programming, and may consider it some kind of black magic. Getting them quickly to do things that "work" is essential to break the ice. And this is just what I think the first course is for: to give them the intuition of what programming is.

Then, for a biologist to take a second course of programming (this is not mandatory here) means he/she really wants to learn something, so we can raise the bar and go fundamentals.

Hope this helps.

I think this is great advice about the psychology of learning. But also wanted to add an anecdote.

A friend is getting a degree in a field similar to biology. She and everyone else in her program have been processing their data (for at least the past 5 years) with a script written by the person in the department who is most competent with the programming tools at their disposal.

Last week, sanity-checking its output, she discovered a bug in the script that meant that every dataset processed by this program had been deeply flawed.

That means dozens, maybe hundreds of studies published based on meaningless data.

It might be worthwhile to also introduce a cautionary tale or two and encourage your students to take followup courses that go deeper into programming; learn automated testing; etc, so that they can be confident in the future that the output of their studies is correct.
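To make "automated testing" concrete for students, here is a minimal sketch in Python. The function `normalize_counts` is a hypothetical stand-in for one step of a lab's processing script; it and the numbers are invented for illustration and have nothing to do with the script in the anecdote.

```python
def normalize_counts(counts):
    """Scale raw counts so they sum to 1.0 (hypothetical processing step)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize: all counts are zero")
    return [c / total for c in counts]


def test_normalize_counts():
    # Hand-computed case: 1 + 1 + 2 = 4, so the fractions are known exactly.
    assert normalize_counts([1, 1, 2]) == [0.25, 0.25, 0.5]
    # Property that must hold for any valid input: fractions sum to 1.
    assert abs(sum(normalize_counts([3, 5, 7, 11])) - 1.0) < 1e-12
    # Degenerate input should fail loudly instead of returning garbage.
    try:
        normalize_counts([0, 0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for all-zero counts")


# Run the check directly; a test runner could also collect this function.
test_normalize_counts()
```

The point is less the specific function than the habit: one small input whose answer was worked out by hand, one property that must always hold, and one bad input that must be rejected.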

Oh my goodness! :-(

Even worse, I heard a very similar story in a science talk only last week. I wonder how many more of these bugs there are out there?

This is an anecdote but my guess is a lot. I was a software engineer for a decade and recently transitioned to a bioinformatics job. A cursory look at many of the repos that supplement published papers (even in top journals like Nature), will show you that there are often no tests or type documentation of any kind and quite often you'll find bugs when you use packages with your own datasets. So to the OP I would recommend either teaching fundamentals first or concurrently--but don't skip fundamentals!

To be clear though -- sometimes (often) this is not due to lack of education. The job of researchers is to produce papers and grants, not working software, and postdocs and grad students are junior researchers, not junior developers. It advances their careers to forgo the time to write tests (or, sigh, manually test other cases than their lab's data).

Big labs with large funding where funding sources allow hiring non-research full-time support staff as software engineers, places like the Broad, tend to have slightly better practices. But if you're using a package written by a graduate student.... his job is not to make yours easier. And since he's paid less than the guy who manages your local sandwich shop, maybe it's understandable.

There's a problem in science where we don't value scientists. Telling them "the world values software engineering" and turning them into software engineers is only going to make them go into industry faster.

But surely they (or most of them) value being able to trust their data?

I agree that we shouldn't ask them to be professional software developers, but then we need to provide them with access to professional developers if we want there to be such a thing as science that uses data and computing.

It's ridiculous to ask untrained people to write data processing scripts if you want to be able to rely on the output.

And then people wonder why we have a reproducibility crisis.

Yes, absolutely. It would make sense either for funding bodies to provide ways to fund contract developers or for even smaller institutes to be required to spend a portion of their budget on full-time developers (instead of spending it all on machines and student staff).

For the actual data science, as opposed to writing new software packages, it would be reasonable for students to maintain lab notebooks (for computational experiments rather than wet lab ones), with exact versions of everything, exact steps, and the name, size and checksum of every dataset, and so on. I think things are definitely moving in that direction, but this is a skill that is not beaten into grad students as much as recording wet lab work is. But in time it will be.
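A rough sketch of what one such computational notebook entry could record, using only the Python standard library. The entry layout and field names (`note`, `datasets`, and so on) are assumptions made up for this example, not an established format.

```python
import hashlib
import json
import os
import platform
import sys
import tempfile


def describe_dataset(path):
    """Return name, size in bytes, and SHA-256 checksum of one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": digest.hexdigest(),
    }


def notebook_entry(dataset_paths, note):
    """Build one log entry; the field names are illustrative assumptions."""
    return {
        "note": note,
        "python_version": platform.python_version(),
        "command": sys.argv,
        "datasets": [describe_dataset(p) for p in dataset_paths],
    }


# Demo on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"sample,count\nA,3\n")
print(json.dumps(notebook_entry([tmp.name], "toy example"), indent=2))
os.remove(tmp.name)
```

Appending entries like this to a text file next to the analysis would give roughly the record described above; package versions could be added in the same way.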

Though, my experience in wet labs is that both the quality of the lab notebooks and the quality of the labeling of the stuff in the fridge definitely vary.

This is why the push to open science is so important. Making data sets and code accessible for more in depth peer review should be required. Maybe it is a side effect of our antiquated publishing industry though.

A frightening number, I'm sure. Consider the fact that even highly experienced programmers introduce bugs all the time. And that the vast majority of people writing data processing scripts for scientific use are not experienced programmers.

Any recommendations for data analysis testing?

...I don't think there's a good book available on that.

I'm curious, what do you consider "fundamentals" for CS and "tools" for bio?

Thank you for this great teaching insight!

I'm the guy no one wants to hear from about computational biology.

I think the distinction is misleading. Teach programming, but do it with purely biological cases. Strip down the real tasks into simplified forms, and give them very specific tools where they have to have them.

You can't unwrap BLAST in a first course, but you can have them find exact matches. That can be used to find restriction sites, and have them generate simulated electrophoresis gel patterns. You can have them do read depth counting in RNAseq with exact matching on a single stranded virus genome (so they don't have to worry about reverse complement matching). It throws away a lot of reads, but you're not missing anything fundamental.
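As a sketch of the restriction-site exercise, assuming Python and a made-up toy sequence: exact matching finds the sites, and the cut positions give fragment lengths for a crude text "gel". EcoRI (site GAATTC, cutting after the first G) is used as the example enzyme; the function names are invented for this illustration.

```python
def find_sites(sequence, site):
    """Return 0-based start positions of every exact match, overlaps included."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def digest_linear(sequence, site, cut_offset):
    """Cut a linear sequence at each site and return the fragment lengths.

    cut_offset is how many bases into the site the enzyme cuts
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = [p + cut_offset for p in find_sites(sequence, site)]
    edges = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(edges, edges[1:])]


def gel_lane(fragment_lengths):
    """Crude text 'gel': longest fragments first, i.e. nearest the well."""
    return [f"{length:>5} bp" for length in sorted(fragment_lengths, reverse=True)]


# Made-up sequence with two EcoRI sites, built piecewise for readability.
toy = "AAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
print(find_sites(toy, "GAATTC"))
for band in gel_lane(digest_linear(toy, "GAATTC", 1)):
    print(band)
```

This only scans one strand, which is enough for a palindromic site like GAATTC; non-palindromic sites would also need the reverse complement.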

You can't do full molecular dynamics, but you can do wiggling polymers on a 2D lattice and relax them. Likewise, you can do a simple version of threading an amino acid sequence through a known 2D structure on a lattice and relax it.

You can't do sophisticated phylogeny, but you can brute force assemble small trees with substitutions and no indels, and you can likewise evolve and sample trees of sequences with substitution, and then have the students evaluate how well their tree reconstructions match the generating algorithm. Then take a subregion of a set of real 16S sequences without indels and have them build a tree of those.

You can't do full metabolic network modeling, but you can do a simple stochastic transition model of a couple of pathways and use it to make predictions about genetic experiments. See Hatzimanikatis's work for some interesting things to do with this.

I spent some time thinking about this when I was part of the Swiss Institute for Bioinformatics, and we put together a course de perfectionnement about programming practice for computational biologists. They finally ran the course after I left the country (they weren't waiting for me to leave, honestly, it just didn't get scheduled until then), and I'm told it went well.

This is absolutely my view as well. I ran a course along the same lines for MD and PhD students last summer and it went quite well.

One aspect that I think should be emphasized more is feeling comfortable with the command line. A lot of science-focused courses put you in an IDE and never have you muck around in bash, which I think is a big mistake for the long term usefulness of the course.

I agree with the idea, but I'm curious -- at the Swiss Institute for Bioinformatics did undergrads have the fundamental biochemistry knowledge to model pathways? When I was a third year half my enzymology class were struggling mentally modeling a single reaction.

I would attend a graduate class that worked like you outline though, and I kind of wish all of them did.

SIB was a consortium for practitioners at the various universities, but many of them came out of a biology background, so the ones that would be taking this course generally had a masters or PhD in biology.

Why not both? Sacrifice a little from each so that you can do both, and they will give a bit of context to each other.

If you only go tools-first you risk treating the tools like "black magic" without really understanding what's happening.

With only fundamentals-first you lose relevance for biologists.

Give them enough on fundamentals (i.e. variables, functions, loops, conditions, scope, basic data structures) so that they can appreciate how the tools do what they do.

Beware false dichotomies.

Realistically, you have limited time and resources and your job is not to make computational biologists who are world class programmers (I wish that were the case, but that's not what the incentive system wants).

I would suggest teaching the basics of language with the expectation that everybody can implement control flow (enough to parse a file and output a converted form of the data), and maybe enough to understand one or two efficient algorithms, and then move on to how to use tools effectively.
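For a sense of scale, "parse a file and output a converted form" might look like the following Python sketch, which turns FASTA records into tab-separated rows. The choice of FASTA and of the output columns (name, length, GC fraction) is an assumption for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_tsv(in_handle, out_handle):
    """Write one 'name<TAB>length<TAB>GC fraction' row per record."""
    for header, seq in read_fasta(in_handle):
        seq = seq.upper()
        gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
        out_handle.write(f"{header}\t{len(seq)}\t{gc:.3f}\n")


# In-memory demo; real use would pass open file objects instead.
example = io.StringIO(">seq1\nACGT\nGGCC\n>seq2\nATAT\n")
result = io.StringIO()
fasta_to_tsv(example, result)
print(result.getvalue(), end="")
```

It needs only loops, conditionals, strings and one generator, which is about the level of control flow described above.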

My perspective is from somebody who was a hacker, then a computational biologist, then a software engineer. None of my training prepared me for any of that; I had to do extensive self-research to figure out how, and later, work with good programmers for years, before I could call myself a decent software engineer.

I had a molecular biology research background before switching to bioinformatics via a Masters degree some time ago. The degree involved subjects that teach tools, and other subjects that teach programming fundamentals (with the undergrad CS and engineering students).

By far the most helpful for me were the software engineering subjects. I could immediately see where I could apply pretty much everything I was learning each day to a biological problem I was trying to solve. Learning individual tools was trivial in comparison, and with programming fundamentals I could wrangle large datasets and thus string tools together into workflows and pipelines.

What was beneficial for me may not of course be broadly applicable, and I suspect different students will appreciate different approaches.

I'm a computational/theoretical chemist and use Python extensively.

To me the question seems a bit poorly phrased. What is the goal of the course? Are they supposed to learn how to program? Then it's of course fundamentals first, especially the stuff you cited (variables, functions, scoping). You can't solve any problem without knowing this stuff.

If they are supposed to learn a specific library/tool then, in my opinion, it depends on the libs/tools you want to teach them. Maybe this library is not that relevant for comp. biologists, but take scikit-learn. You still have to know how to program when you want to use this library properly. On the other hand, if they are to learn some GUI tools, then I guess they don't have to know how to program. You can ask yourself: Can you anticipate what tools/libraries your students will be using in a few years? If you can answer yes to this, then teach them a bit about these tools. If your field is constantly evolving/changing then it would be better to invest in the basics, so your students can adapt more easily.

Considering the fact that you are (hopefully) and have to be constantly learning as a practicing scientist, it is of great importance to get the basics right.

You might be interested in Harvey Mudd's CS5 Green syllabus [1], which "is designed to give you the foundations of computer science in the context of solving real and important problems in the biological sciences."

[1]: https://www.cs.hmc.edu/twiki/bin/view/CS6/CourseSyllabus

Sounds fantastic! Unfortunately, 4 1/2 hrs a week for a whole semester is way more than we have available for most of our courses :-(

Tools first, 100%. People learn concrete things, and then they learn the abstractions. The abstractions are nearly useless by themselves. It's the reason that people learn addition before algebra before group theory. The people that say learn the "concept" first are just flat wrong.

The problem is a) that observation shows that many people don't learn the abstractions after being taught the application (and thus can never teach themselves more applications) and b) that (perhaps) unlike in mathematics, an understanding of the concept makes the application a lot easier...

I agree but I would say mix the tool teaching with fundamentals. Let them bang away for an hour trying to make some simple change to their tool script, and then show them the principle that had they known about they could have solved their problem in 30 seconds.

Stoke in them the desire to invest regular time learning the fundamentals.

Only slightly offtopic, a really great book I'd recommend is "Vehicles, Experiments in Synthetic Psychology" by Valentino Braitenberg. He uses the term "Vehicles" to represent simple biologically inspired robots and how we perceive them. Coding simulated vehicles from his book would be a really cool programming exercise. Even better would be implementing them by programming Arduino based devices.

My brother is a biology researcher who uses Python scripts for DNA analysis. I recommended he hire a programmer for his group to set up the tools and write programs. Everyone should do what they do best. Outsource the rest.

I have a graduate degree in biochemistry. (This is not biology and there is some difference in e.g. the makeup of the people who choose each course. So please take my experience with the required amount of salt). Today I am a programmer.

Here are my tips on how biological sciences students work.

* They enjoy concepts. Biology is somewhat more abstract than the intro programming courses in the way it's taught in the early years, and biology students tend to be good at retaining and linking information.

* However, they can be very cautious. They tend to request clear instruction, and specific goals (given by the instructor).

* They're motivated by getting good grades

With that in mind, for an intro course, I would recommend a short intro on the use of programming/data science in real life science problems. Explain that, for example, optimizing foodstuffs now requires advanced genetic understanding, and that even breeding is targeted. If they're biochem students more than just biology you can bring up antibiotic resistance. This should be the very first class, because most biology students don't give a hoot about learning Python, but they care about doing good in the world. Plus it will give them material for when they become computational biology researchers and have to rehash this stuff for grant committees.

Next, some fundamentals. Variables, iteration. Do not introduce functions yet. For some reason I cannot fathom, even some quite advanced scientists seem to hate functions, to the benefit of really long scripts. They'll need to learn it eventually, but no need to put them off.

Build exercises. The exercises should be instructor-led and graded. They should include some practical exercises; however, if graded programming exercises are included, students should either be given similar solved sample exercises during the lectures as ungraded practice, or have access to a TA for questions. Most of these people really care about their grades, so graded exercises with no practice intimidate them. The exercises can be small applications of fundamentals at first, but keep them biology-related. (The Game of Life grabbed me personally in my first CS course; otherwise you can do things like have them write an R program that displays a scatterplot and fits a line to a set of data from lab experiments, revealing a chemical or biological law they have seen in their other courses. Counts of bacteria after N days, that sort of thing.)
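A sketch of the bacteria-count exercise, written here in Python with only the standard library (the same idea carries over to R). The counts are made up for illustration, and the plotting step is left out to keep the example dependency-free.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up counts for illustration: a culture that doubles every day.
days = [0, 1, 2, 3, 4]
counts = [100, 200, 400, 800, 1600]

# Exponential growth is a straight line on a log scale.
slope, intercept = fit_line(days, [math.log(c) for c in counts])
doubling_time = math.log(2) / slope
print(f"growth rate per day: {slope:.3f}")
print(f"doubling time in days: {doubling_time:.2f}")
```

The "law" students recover here is exponential growth: the slope of the log-scale line is the growth rate, and the doubling time follows from it.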

Introduce some things that are not fundamentals for most CS students, but that should be for computational biology. Like regular expressions. Regular expressions are more fundamental to biology than, e.g., scoping and functions. Actually, basic data structure instruction (lists and tables/hash tables, concepts of databases and records) is also very important.
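One small example of why regular expressions matter here, sketched in Python: finding a sequence motif. It uses the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P}; the protein sequence is made up and the function name is invented for illustration.

```python
import re

# PROSITE-style N-glycosylation motif N-{P}-[ST]-{P}:
# N, then anything but P, then S or T, then anything but P.
# The lookahead (?=...) lets overlapping occurrences be reported too.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motifs(protein):
    """Return (1-based position, matched text) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(protein.upper())]


toy_protein = "MKNATQLNPSANGTV"  # made-up sequence for illustration
print(find_motifs(toy_protein))
```

The second N in the toy sequence is followed by P, so it is skipped, which is exactly the kind of rule that is tedious to write with plain string searches.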

For each concept, provide lists of solved problems or an automatic interface where they can type code and see solutions if they're stuck, like all natural science students get in problem sets. Programming instruction typically requires a lot more initiative than science students are comfortable with, and gives a lot less guidance; students are left on their own with the machine, and they have to "figure out" what works and why it doesn't. In science, students have textbooks with toy problem sets and solutions, and those help build confidence. Jupyter notebooks can be great for this but just make sure to stay within a concept and not include too many packages/APIs/problem definitions at once.

A lot of comp bio instructors think their students have to know Linux, know how to use a DNA aligner, learn Python and R, and be able to independently write programs from scratch to use CS in biology. They don't. They need linear regression and databases. If you get students that take several courses in the sequence, then you know they are more "tech-savvy" and interested in tech for its own sake, and you can build up all the tools. But often tools-based approaches suffer from introducing too many tools, for too few reasons.

So I guess I am advocating a hybrid tools-fundamentals approach that takes a very focused, very thin slice of fundamentals and the tools they apply to, and solves a number of very small, "easy" quantitative real-world problems. And then in follow-up courses, build on that. Build bridges as well with your CS department so that you can direct students who want to learn deeper, more technical topics there, and if your institute has research talks about real research facilitated by computational methods and how those are used (especially talks given by grad students), recommend those to your more interested undergrads.

Edited to add: This applies to my observations of biological science undergrad and grad students in the US and Canada. I haven't been to uni in Europe but know people who have been and instruction there (for all degree programs) is so different that recommendations on course formats are useless. For example European and Asian students in STEM are usually pretty good at math, even if their degree program doesn't have math courses, and less turned off by formulas and heavier fundamentals.

Thank you for this detailed response! I think that is a very thorough and well-thought out concept.

> Do not introduce functions yet. For some reason I cannot fathom, even some quite advanced scientists seem to hate functions, to the benefit of really long scripts. They'll need to learn it eventually, but no need to put them off.

Yes, I have observed that myself... I really don't understand that - functions would make so many things so much easier. Plus, for many students, some of their biggest comprehension problems can be traced back to an incomplete understanding of functions. (Especially the simple fact that builtin system "commands" are nothing but pre-written functions...) (Having said that, I ought to add the disclaimer that I cut my programming teeth on Lisp and Python, so I think very heavily in terms of functions.)

> not include too many packages/APIs/problem definitions at once

Good point ;-) I have seen that too in some courses.
