The problem in cancer is not that we don't understand individual proteins or the way that drugs bind to them (the problem being solved in this article). It is that the biology of cancer is a crazy web of highly complex interactions and feedback loops of which we have a pathetically rudimentary understanding. So even when you think you're hitting a target that should kill the cancer, you find out there's some side pathway that spools up and limits the drug's efficacy (or worse, the cancer cells actively pump your compound out). If I recall, something like >95% of new cancer therapies fail in clinical trials. Most of the failure in this drug category is due to lack of efficacy (even though they hit a target, they don't do jack for treating the cancer). If you could make that number 85%, you'd probably be a Nobel contender.
Unfortunately, there's no digital route here. What needs to be done is a lot of slow, messy "wet bench" biology. There is no electronic shortcut to understanding how living cells work.
If we had good computational models of how perturbations would affect transcription networks, for example, we could predict these "side pathways" that so often occur in humans but not in mice.
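To make that concrete, here's a deliberately toy sketch of the "side pathway" effect (the equations and parameters are invented for illustration; nothing biological about the numbers): a drug inhibits target T, but a compensatory pathway C is normally repressed by T, so inhibition de-represses C and the downstream growth signal G partially recovers.

```python
# Toy model, not real biology: T drives growth G and represses a
# compensatory pathway C that also drives G. Inhibiting T de-represses C.
# All rate constants are made up for illustration.

def simulate(drug, steps=10000, dt=0.01):
    C, G = 0.0, 0.0
    T_active = 1.0 - drug                      # fraction of target left active
    for _ in range(steps):
        dC = 1.0 / (1.0 + 5.0 * T_active) - C  # C spools up when T is inhibited
        dG = (T_active + 0.8 * C) - G          # both pathways feed the growth signal
        C += dC * dt
        G += dG * dt
    return G

baseline = simulate(drug=0.0)   # no drug: growth driven mostly by T
treated  = simulate(drug=1.0)   # full target inhibition: C takes over
```

Even with the target 100% inhibited, the growth signal only drops by about 30% in this toy, which is the qualitative shape of the "lack of efficacy" failures described above. A predictive version of this for real transcription networks is exactly what we don't have.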
But you're right, the reality is complex, and you have to deal not only with transcription but also translation, post-translational modification, non-coding RNAs; the list goes on... And most current models of this type don't delve into 3D simulation. Some people are working on whole-cell modeling, but that's in its infancy.
Ultimately I believe the breakthroughs will come fastest if we can "close the feedback loop" by automating a lot of bench biology, and then have computers both generate and test hypotheses.
I agree, the totality of the interactions for a single cell is so many orders of magnitude above what we are capable of currently modeling that I fear these computational approaches are dangerously over-hyped. Having been privy to state-of-the-art projects in a lab with a ~5,000-node cluster, I was still disappointed by how rough the whole-cell modeling approaches were. It's really tough just to model a small corner of the cytoplasm and get the diffusion of different proteins and metabolites right, let alone address organelles or chromosomal folding and surface availability. It's a mess.
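To put rough numbers on why even "a small corner of the cytoplasm" is hard (these are assumed, order-of-magnitude figures, not measurements from any particular project): consider one well-mixed metabolite, explicit finite-difference diffusion, one cubic micron, nanometer resolution.

```python
# Back-of-envelope cost of explicit finite-difference diffusion for ONE
# small metabolite in 1 cubic micron of cytoplasm at 1 nm resolution.
# D and the grid spacing are illustrative, typical-textbook values.

dx = 1e-3            # grid spacing in microns (1 nm)
side = 1.0           # domain edge in microns
D = 100.0            # diffusion coefficient, um^2/s (small metabolite)

points = (side / dx) ** 3                 # grid points for one species
dt_max = dx**2 / (6 * D)                  # explicit stability limit, seconds
steps_per_second = 1.0 / dt_max
updates = points * steps_per_second       # grid updates per simulated second

print(f"{points:.0e} points, dt <= {dt_max:.1e} s, "
      f"{updates:.0e} updates per simulated second")
```

That's on the order of 10^17 grid updates per simulated second, for a single species with no reactions, no crowding, no organelles. Multiply by thousands of species and the hopelessness of brute force becomes obvious.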
For instance: De novo protein folding is not a solved problem, so how can a simulation predict dynamics for a protein whose conformation isn't even known?
I'm sure in the year 2150 when my grandchildren go to the doctor to be scanned by the tricorder, the results will go to the full-cell (and full-body) simulator...but for now, I think bioinformatics is better served by sticking to higher levels of abstraction like transcript and protein counts (for disease modeling purposes).
"Ultimately I believe the breakthroughs will come fastest if we can "close the feedback loop" by automating a lot of bench biology, and then have computers both generate and test hypotheses."
I don't know anything about bioinformatics, so I'm trying to see how this differs from vanilla automated model selection. I'm really interested, so please feel free to send me an email if you feel that's more appropriate.
They will in no way eliminate or reduce the need for wet-lab biology, but hopefully they will couple with improvements in high-throughput experimental technology to help us design and make sense of experiments targeted at understanding the whole phenomenon.
The only scientifically defensible way forward I see is to do large, data-rich network studies the way Eric Schadt does them. Individual wet-bench work is good to use as prior data, but it's just a hint at a part of a large, complex system, and overly reductive approaches are going to completely miss the big picture.
Furthermore, high-throughput assays are more expensive, so people often cut corners on sample size.
Unfortunately not everything is done on a massive scale so not everything is automated.
This isn't actually true. There's initially an instance limit of 20, and you have to contact Amazon to get it lifted. You could probably order just as many servers at Softlayer or purchase them at Dell without talking to anyone (except the guy who confirms your credit card).
After all, they're not going to let anyone run up a $3 million bill on the hope that it'll be paid at the end of the month!
This has huge implications for the availability of supercomputers for smaller organizations and use cases.
Ah, if only this were true for everything. It'd make everyone's lives a lot easier. :-) Unfortunately there are a lot of applications out there where inter-node communication is incredibly important: computational fluid dynamics, in all its varied forms, is a good example of a latency-bound application. This covers weather forecasting, aerodynamics, dynamic mechanical modeling, and so on.
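A crude cost model shows why (all numbers here are invented for illustration): a 3D stencil code like CFD pays a fixed per-step message latency to its six neighbors no matter how small the subdomains get, so speedup stalls well below the core count, and stalls earlier the worse the interconnect latency.

```python
# Crude cost model of one timestep of a tightly coupled 3D stencil code:
# perfectly parallel compute plus a 6-neighbor halo exchange.
# All parameters (cell count, FLOP rates, bandwidth) are made up.

def step_time(cores, latency, cells=1e9, flop_per_cell=100,
              flops_per_core=1e10, bw=1e9, bytes_per_cell=8):
    compute = cells * flop_per_cell / (flops_per_core * cores)
    edge = (cells / cores) ** (1 / 3)          # local subdomain edge length
    halo = 6 * (latency + edge**2 * bytes_per_cell / bw)
    return compute + halo

def speedup(cores, latency):
    return step_time(1, latency) / step_time(cores, latency)

cloud = speedup(10_000, latency=100e-6)   # cloud-ish message latency
fast  = speedup(10_000, latency=1e-6)     # InfiniBand-ish latency
```

In this toy model neither case reaches the ideal 10,000x, and the high-latency cluster falls noticeably further behind; that gap is the whole business case for dedicated low-latency interconnects, and it's why "51,000 cloud cores" only makes headlines for embarrassingly parallel workloads.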
What I find interesting is that to get to 51,000 cores, they had to use AWS datacenters all over the world. I'd love to know what kind of resources are actually available in any given datacenter. It will vary at any given time, but it would be useful to know how many cores are "close" to each other in a networking sense, for applications where latency matters.
The actual feat accomplished here would not surprise anyone involved in supercomputing, and companies have done very similar things with non-scientific tasks; a 2008 article, for instance, describes how the New York Times itself did a similar computation.
What struck me is the question of why Google (or Amazon) hasn't put out the cure for cancer. The actual extent of Google's infrastructure is classified, but from open sources it's clear that putting together even half a million 'cores' is not a huge project for them. (Think of it this way: they spent a billion dollars building a data center whose actual building/land/etc. estimates out at about $200M.)
Part of what makes it a great soundbite is that Schrodinger, the company that is nominally the topic of the story, goes on about how their 1,500-core cluster can't give them the resolution they need, but a 50,000-core cluster makes everything clear. Understand that a 'Westmere' processor is potentially 12 cores (if you use Intel's definition of threads, and I'm sure they do), and the typical motherboard is 2 CPUs, so that is 24 'cores' per machine. A 1,500-core cluster is 62 machines: a couple of cabinets' worth if you're using Supermicro boxes, less than a cabinet if you're using Open Compute-type cabinets. And at maybe $3K each, that is an investment of $186K, maybe $250K if you include switches. Which is about 1/3 what it would have cost a pharmaceutical company to buy a VAX minicomputer back in the day.
My point is that if you're in a multi-billion-dollar marketplace, you can afford to spend more on your hardware. And even at approximately $5K/hr, a 50,000-core cluster is only about 2,100 Open Compute servers in 24 of their 'triplet' cabinets.
That is about 1 MW of critical compute power (1 MW being the power commitment you would have to buy from a colocation center to run it), and even that is a fairly small footprint at Amazon, Google, Facebook, or even Apple with its new $1B data center.
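Redoing that back-of-envelope in code (the per-box price and per-server wattage are my rough assumptions from the comment above, not vendor quotes):

```python
# Back-of-envelope cluster sizing, matching the rough numbers above.
# $3K/box and ~480 W/server are assumptions, not quoted figures.

cores_per_cpu = 12        # 'Westmere', counting Intel's threads as cores
cpus_per_board = 2
cores_per_machine = cores_per_cpu * cpus_per_board   # 24

small_cluster = 1500 // cores_per_machine            # machines for 1,500 cores
small_cost = small_cluster * 3000                    # at ~$3K per box

big_cluster = 50_000 // cores_per_machine            # machines for 50,000 cores
power_kw = big_cluster * 0.48                        # ~480 W per server

print(small_cluster, small_cost, big_cluster, round(power_kw))
```

The arithmetic lands right where the comment does: 62 machines and ~$186K for the small cluster, and roughly 2,100 servers drawing about 1 MW for the big one.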
So when I read the story, I was left thinking, "Gee, if using a cluster that was 33x bigger got them such great results, why not use one that's 3,000x bigger? Wouldn't that just answer the question?" And of course I took a moment to analyze that thought and asked the question every critical reader has to ask: "What is this article trying to say anyway? And do I believe it?"

That was when it became obvious what the article is really saying: Cycle, the company that makes a living creating virtual supercomputers for embarrassingly parallel problems out of EC2 instances, has reached the point where they can get 50K cores running the same problem. Which is great and all, but like a long story that is just a setup for a bad pun, it leaves me feeling jaded. Hence my snarky remark that if all it takes is more cores, Google should stop trying to be a great advertising company and switch to being a pharmaceutical company. When you say that out loud, you realize it couldn't possibly be that easy. And yes, it's a snarky way of expressing irritation that I was led to believe there was something newsworthy here when there wasn't.
In terms of speed, I would assume Amazon to be faster than most clusters, since they probably offer the latest and greatest hardware. Lastly, this MD workload is an embarrassingly parallel problem (don't ask me how), so latency isn't a major issue.