I'm a computer scientist working in a Parkinson's lab that is very closely tied to the names in the article. My main duty is data analysis, simply because there is so much information to process. If you want more information about state-of-the-art treatments, our website is http://neurosurgery.ucsf.edu/bankiewicz .
I also maintain our website, so if you see anything that needs fixing, drop me a line :)
I'm interested in learning more about bio-tech research, genetics and synthetic biology. I hope to finish my doctorate in CS soon, and after that I'm going to take some time off to travel and read. Any advice for a computer geek who wants to learn how to program biology for the good of mankind? What do you, or others who work in the industry, think would be the best way for me to bootstrap myself in the area?
My lab brought me on because they realized that they needed to use computers more effectively with the amount of data they generated. I came in with little bio knowledge, but they did a good job of filling me in very quickly.
I'd suggest getting involved with any lab that does any kind of computational work. Lab meetings are a great place to find out what problems people are having, and most of the time you'll be able to use your skills to solve them. After taking on a few projects, people will start to realize what you can and can't do, and they'll start to give you a lot of related work.
This can't be overstated. The sheer amount of data available for processing is huge. Google-scale. I'm not very familiar with GWAS (I'm more familiar with high-throughput sequencing), but these are O(N^m)-scale problems (where m >= 2).
Let's see, how do I put this gently: the data generated by high-throughput sequencing blows the living hell out of GWAS data. One exome requires reading 30-60x of what's read in a standard GWA. (Not to mention that most people get GWA data on their next-gen sequenced samples anyway.) You're in the most data-intensive part of the field... congrats! As you probably know better than I do, we're still in the age of targeted sequencing and exome sequencing. Whole-genome sequencing is just beginning to dawn.
Where I work, filesystem I/O is the rate-limiting step for most of my (next-gen sequencing) experiments...
Yeah, we haven't even had to deal with data directly from the instrument yet. So far, we've been getting data from collaborators for analysis via terabyte USB drives (FedEx throughput can't be beat). For the actual analysis, we've found the same thing... disk I/O is a limiting factor. Well, that and the 16GB human genome indexes in RAM. And we aren't on an Isilon system yet (and probably won't be, either).
However, we just got our own instrument, so this will definitely become an issue; fortunately, our university knows a thing or two about dealing with big data (http://kb.iu.edu/data/avvh.html).
I've only dealt with a few GWAS-style datasets, and the next-gen stuff dwarfs the GWAS data in terms of size. But when looking for linkages between variants, we're still talking about more computation time than the age of the universe once you go beyond 3-way combinations. Which is really scary, because like you said, all the genetics people are going to be using sequencing for most things from here on out, so it's like you have complexity on top of complexity...
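To put rough numbers on that, here's a back-of-envelope sketch. The SNP count, tests-per-second rate, and age-of-universe figure are ballpark values I assumed for illustration, not numbers from any real pipeline:

    # Back-of-envelope: cost of an exhaustive m-way interaction scan.
    # All constants are assumed ballpark figures, not measurements.
    from math import comb

    N_SNPS = 1_000_000           # assumed number of variants after QC
    TESTS_PER_SEC = 1e6          # assumed throughput of one statistical test
    AGE_OF_UNIVERSE_S = 4.3e17   # ~13.8 billion years, in seconds

    for m in range(2, 6):
        n_tests = comb(N_SNPS, m)           # number of m-way combinations
        seconds = n_tests / TESTS_PER_SEC
        years = seconds / 3.15e7
        print(f"m={m}: {n_tests:.2e} tests, ~{years:.2e} years "
              f"({seconds / AGE_OF_UNIVERSE_S:.2e} x age of universe)")

With these (fairly generous) assumptions, pairwise scans take days, 3-way takes thousands of years, and by m=5 you're orders of magnitude past the age of the universe, which is exactly the complexity-on-top-of-complexity problem.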
OK, now that I've thought about it, the easiest way around this is probably throwing money at hardware (more disks) or optimizing the processing. However, this only applies in the case of a true disk I/O bottleneck. If you're optimizing correctly, the disks should be streaming 8GB blocks directly into memory and the CPU should be spitting them right out again. At the very minimum, you should be using an optimized filesystem with large pages enabled in your kernel.
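For what it's worth, here's a minimal sketch of that streaming pattern in Python. The chunk size and filename are placeholders rather than recommendations, and a real pipeline would hand each block to the actual analysis instead of just counting bytes:

    # Minimal sketch of the "read big blocks, keep the CPU fed" pattern.
    # CHUNK_SIZE and the file path are placeholders, not tuned values.
    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per read; tune to your storage

    def stream_process(path):
        total = 0
        # buffering=0 gives raw file access, so each read() goes to the
        # filesystem as one large sequential request instead of many small ones.
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                total += len(chunk)   # real code would parse/align/filter here
        return total

    if __name__ == "__main__":
        print(stream_process("reads.fastq"))  # placeholder filename

The point is large sequential reads so the disks stream instead of seek; whether 8GB blocks (or large pages) are actually the right knobs depends entirely on the filesystem and hardware underneath.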
I don't think I have enough low-level knowledge to answer that intelligently at the moment. What I can say is that we're hitting these problems despite being on Isilon drives. (I think that's orthogonal to your suggestion, but again, I'm not all that familiar with the subject.)
Not really... these datasets start to saturate 10Gb network connections very easily, so it's a question of volume. With some instruments, you can generate 10-20 terabytes at a time.
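Quick sketch of the arithmetic, assuming (optimistically) that the link sustains full line rate the whole time:

    # Idealized transfer time: assumes the full line rate is sustained.
    def transfer_hours(terabytes, link_gbps):
        bits = terabytes * 1e12 * 8          # decimal terabytes -> bits
        return bits / (link_gbps * 1e9) / 3600

    for tb in (10, 20):
        print(f"{tb} TB over 10 Gb/s: ~{transfer_hours(tb, 10):.1f} h at line rate")

That's roughly 2-4.5 hours per run in the best case; at a realistic fraction of line rate, on a shared link, with multiple runs queued up, a box of drives in a FedEx truck starts looking pretty competitive.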
Just DNA, on my end (human lipoprotein genetics). I know that the cancer folks have a lot of interest in RNA, however. What is your focus? (Edit: your profile pretty much explains it!)
"Systematic brain delivery", is your lab simply jacking up L-DOPA or L-tyrosine in the substantia nigra? Also by massive information is your lab doing machine learning directly on fMRI output or do you also run statistical analysis on all published Parkinson's research in general?
Would it be possible for you to run a meta-analysis on Parkinson's studies involving alpha-7 nicotinic receptor agonists?
Our lab's main focus is targeted delivery to the brain, which can be used to treat many neurodegenerative diseases. Specifically for Parkinson's, we infuse an AADC vector into the putamen, where the expressed enzyme converts L-DOPA into dopamine. Further down the road, we are looking at using GDNF to save the dying neurons that convert L-DOPA into dopamine. The specifics are here: http://neurosurgery.ucsf.edu/bankiewicz/parkinsons.html
Regarding massive information, I analyze a lot of the data we generate in order to optimize our delivery platform. Since there are few labs that do this kind of work, all the data I work with comes from us.