I'm a computer scientist working in a Parkinson's lab that is very closely tied to the names in the article. My main duty is data analysis, simply because there is so much information to process. If you want more information about state-of-the-art treatments, our website is http://neurosurgery.ucsf.edu/bankiewicz .
I also maintain our website, so if you see anything that needs fixing, drop me a line :)
I'm interested in learning more about bio-tech research, genetics and synthetic biology. I hope to finish my doctorate in CS soon, and after that I'm going to take some time off to travel and read. Any advice for a computer geek who wants to learn how to program biology for the good of mankind? What do you, or others who work in the industry, think would be the best way for me to bootstrap myself in the area?
My lab brought me on because they realized that they needed to use computers more effectively with the amount of data they generated. I came in with little bio knowledge, but they did a good job of filling me in very quickly.
I'd suggest getting involved with any lab that does any kind of computational work. Lab meetings are a great place to find out what problems people are having, and most of the time you'll be able to use your skills to solve them. After taking on a few projects, people will start to realize what you can and can't do, and they'll start to give you a lot of related work.
This can't be overstated. The sheer amount of data available for processing is huge. Google-scale. I'm not very familiar with GWAS (I'm more familiar with high-throughput sequencing), but these are O(N^m)-scale problems (where m >= 2).
Let's see, how do I put this gently: the data generated by high-throughput sequencing blows the living hell out of GWAS data. One exome requires reading 30-60x of what's read in a standard GWA. (Not to mention that most people get GWA data on their next-gen sequenced samples anyway.) You're in the most data-intensive part of the field... congrats! As you probably know better than I do, we're still in the age of targeted sequencing and exome sequencing. Whole-genome sequencing is just beginning to dawn.
Where I work, filesystem I/O is the rate-limiting step for most of my (next-gen sequencing) experiments...
Yeah, we haven't even had to deal with data directly from the instrument yet. So far, we've been getting data from collaborators for analysis via terabyte USB drives (FedEx throughput can't be beat). For the actual analysis, we've found the same thing... disk I/O is a limiting factor. Well, that and the 16GB human genome indexes in RAM. And we aren't on an Isilon system yet (and probably won't be, either).
However, we just got our own instrument, so this will definitely become an issue; fortunately, our university knows a thing or two about dealing with big data (http://kb.iu.edu/data/avvh.html).
I've only dealt with a few GWAS-style datasets, and the next-gen stuff dwarfs the GWAS data in terms of size. But when looking for linkages between variants, we're still talking about more computation time than the age of the universe once you go beyond 3-way combinations. Which is really scary, because like you said, all the genetics people are going to be using sequencing for most things from here on out, so it's like you have complexity on top of complexity...
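To put rough numbers on that, here's a back-of-envelope sketch. The SNP count, tests-per-second rate, and age-of-universe figure are ballpark values I assumed for illustration, not numbers from any real pipeline:

    # Back-of-envelope: cost of an exhaustive m-way interaction scan.
    # All constants are assumed ballpark figures, not measurements.
    from math import comb

    N_SNPS = 1_000_000           # assumed number of variants after QC
    TESTS_PER_SEC = 1e6          # assumed throughput of one statistical test
    AGE_OF_UNIVERSE_S = 4.3e17   # ~13.8 billion years, in seconds

    for m in range(2, 6):
        n_tests = comb(N_SNPS, m)           # number of m-way combinations
        seconds = n_tests / TESTS_PER_SEC
        years = seconds / 3.15e7
        print(f"m={m}: {n_tests:.2e} tests, ~{years:.2e} years "
              f"({seconds / AGE_OF_UNIVERSE_S:.2e} x age of universe)")

With these (fairly generous) assumptions, pairwise scans take days, 3-way takes thousands of years, and by m=5 you're orders of magnitude past the age of the universe, which is exactly the complexity-on-top-of-complexity problem.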
OK, now that I've thought about it, the easiest way around this is probably throwing money at hardware (more disks) or optimizing the processing. However, this only applies in the case of a true disk I/O bottleneck. If you're optimizing correctly, the disks should be streaming 8GB blocks directly into memory and the CPU should be spitting them right out again. At the very minimum, you should be using an optimized filesystem with large pages enabled in your kernel.
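For what it's worth, here's a minimal sketch of that streaming pattern in Python. The chunk size and filename are placeholders rather than recommendations, and a real pipeline would hand each block to the actual analysis instead of just counting bytes:

    # Minimal sketch of the "read big blocks, keep the CPU fed" pattern.
    # CHUNK_SIZE and the file path are placeholders, not tuned values.
    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per read; tune to your storage

    def stream_process(path):
        total = 0
        # buffering=0 gives raw file access, so each read() goes to the
        # filesystem as one large sequential request instead of many small ones.
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                total += len(chunk)   # real code would parse/align/filter here
        return total

    if __name__ == "__main__":
        print(stream_process("reads.fastq"))  # placeholder filename

The point is large sequential reads so the disks stream instead of seek; whether 8GB blocks (or large pages) are actually the right knobs depends entirely on the filesystem and hardware underneath.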
I don't think I have enough low-level knowledge to answer that intelligently at the moment. What I can say is that we're hitting these problems despite being on Isilon drives. (I think that's orthogonal to your suggestion, but again, I'm not all that familiar with the subject.)
Not really... these datasets start to saturate 10Gb network connections very easily, so it's a question of volume. With some instruments, you can generate 10-20 terabytes at a time.
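Quick sketch of the arithmetic, assuming (optimistically) that the link sustains full line rate the whole time:

    # Idealized transfer time: assumes the full line rate is sustained.
    def transfer_hours(terabytes, link_gbps):
        bits = terabytes * 1e12 * 8          # decimal terabytes -> bits
        return bits / (link_gbps * 1e9) / 3600

    for tb in (10, 20):
        print(f"{tb} TB over 10 Gb/s: ~{transfer_hours(tb, 10):.1f} h at line rate")

That's roughly 2-4.5 hours per run in the best case; at a realistic fraction of line rate, on a shared link, with multiple runs queued up, a box of drives in a FedEx truck starts looking pretty competitive.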
Just DNA, on my end (human lipoprotein genetics). I know that the cancer folks have a lot of interest in RNA, however. What is your focus? (Edit: your profile pretty much explains it!)
"Systematic brain delivery", is your lab simply jacking up L-DOPA or L-tyrosine in the substantia nigra? Also by massive information is your lab doing machine learning directly on fMRI output or do you also run statistical analysis on all published Parkinson's research in general?
Would it be possible for you to run a meta-analysis on Parkinson's studies involving alpha-7 nicotinic receptor agonists?
Our lab's main focus is targeted delivery to the brain, which can be used to treat many neurodegenerative diseases. Specifically for Parkinson's, we infuse an AADC vector into the putamen, where the expressed enzyme converts L-DOPA into dopamine. Further down the road, we are looking at using GDNF to save the dying neurons that convert L-DOPA into dopamine. The specifics are here: http://neurosurgery.ucsf.edu/bankiewicz/parkinsons.html
Regarding massive information, I analyze a lot of the data we generate in order to optimize our delivery platform. Since there are few labs that do this kind of work, all the data I work with comes from us.