$4,829-per-hour supercomputer built on Amazon cloud to fuel cancer research (arstechnica.com)
81 points by evo_9 1646 days ago | 35 comments

While it's nice from a technical perspective, this is unlikely to lead to a cancer cure. Having worked in cancer drug development, I can tell you, there is no shortage of cancer targets. Researchers have a list of targets they want to hit, and chemists are pretty darn good at designing small molecule compounds to hit them.

The problem in cancer is not that we don't understand individual proteins or the way that drugs bind to them (the problem being solved in this article). It is that the biology of cancer is a crazy web of highly complex interactions and feedback loops of which we have a pathetically rudimentary understanding. So even when you think you're hitting a target that should kill the cancer, you find out there's some side pathway that spools up and limits the drug's efficacy (or worse, the cancer cells actively pump your compound out). If I recall, something like >95% of new cancer therapies fail in clinical trials. Most of the failure in this drug category is due to lack of efficacy (even though they hit a target, they don't do jack for treating the cancer). If you could make that number 85%, you'd probably be a Nobel contender.

Unfortunately, there's no digital route here. What needs to be done is a lot of slow, messy "wet bench" biology. There is no electronic shortcut to understanding how living cells work.

(Bioinformatician here). Although I think bench work is the most obvious route and the most likely way these problems will be solved, there are in principle some computational ways they could be addressed.

If we had good computational models of how perturbations would affect transcription networks, for example, we could predict these "side pathways" that so often occur in humans but not in mice.

But you're right, the reality is complex, and you have to deal not only with transcription, but translation, post-translational modification, non-coding RNAs, the list goes on... And most current models of this type don't delve into 3D simulation. Some people are working on whole-cell modeling but that's in its infancy.

Ultimately I believe the breakthroughs will come fastest if we can "close the feedback loop" by automating a lot of bench biology, and then have computers both generate and test hypotheses.

(former bioinformatician here)

I agree, the totality of the interactions for a single cell is so many orders of magnitude above what we are currently capable of modeling that I fear these computational approaches are dangerously over-hyped. Having been privy to the state-of-the-art projects in a lab with a ~5,000 node cluster, it was still disappointing to see how rough the whole-cell modeling approaches were. It's really tough just to model a small corner of the cytoplasm and get the diffusion of different proteins and metabolites right, let alone address organelles or chromosomal folding and surface availability. It's a mess.

Over-hyped? In my neck of the woods these approaches are treated with extreme skepticism for just the reasons you mention.

For instance: De novo protein folding is not a solved problem, so how can a simulation predict dynamics for a protein whose conformation isn't even known?

I'm sure in the year 2150 when my grandchildren go to the doctor to be scanned by the tricorder, the results will go to the full-cell (and full-body) simulator...but for now, I think bioinformatics is better served by sticking to higher levels of abstraction like transcript and protein counts (for disease modeling purposes).

Sorry to be late to the party-- xaa, could you explain a little more what you mean by

"Ultimately I believe the breakthroughs will come fastest if we can "close the feedback loop" by automating a lot of bench biology, and then have computers both generate and test hypotheses."

I don't know anything about bioinformatics, so I'm trying to see how this differs from vanilla automated model selection. I'm really interested, so please feel free to send me an email if you feel that's more appropriate.

The main message I got from talking to people involved in systems biology is that we don't even get qualitative agreement between simulation and experiment, never mind any sort of quantitative data, on extremely well studied systems like the lac operon. It seems like we're at least 25 years away from decent models.

I agree mostly with your post but there are computational approaches being developed that may help improve our understanding of these cancer networks.

They will in no way eliminate or reduce the need for wet lab biology, but hopefully they will couple with improvements in high-throughput experimental technology to help us design and make sense of experiments targeted at understanding the whole phenomenon.

I am well aware of these approaches, having worked on some of them myself in prior work I've done. You always run into limitations of what was known about the biology and how various things interact. The network models will probably work some day, but my point above was that this will only happen after a lot of hard wet bench work happens. There's just too much we don't know right now to use these techniques to develop deep understanding. Sometimes, when we were lucky, they would support an existing hypothesis. But that was only for activity in a single cancer cell line in a specific experiment, not a whole organism which is what matters for a drug.

I work on computational approaches to using known, wet-bench validated interactions along with high-throughput cancer data, and agree that much more high-throughput data is needed, more so than small-scale approaches. Quantitative measurements of interaction rates in vitro are no more trustworthy than computational methods on high-throughput data, because you never know what other cofactors may be affecting the interaction, or what compartmentalization or localization you missed in your model system that's different in a real system.

The only scientifically defensible way forward I see is to build large, data-rich networks the way Eric Schadt does. Individual wet-bench work is good to use as prior data, but it's just a hint at a part of a large, complex system, and overly reductive approaches are going to completely miss the big picture.

Unfortunately, for non-computational peers and reviewers, "high-throughput" is often a synonym for "fishing expedition". This perception is gradually changing, though.

Furthermore high-throughput assays are more expensive so people often cut corners on sample size.

I think it's important to note that although biological systems are still well out of our reach, the modeling of chemical systems (using quantum chemical methods) is getting steadily better, and we can learn a lot about various systems, ranging from graphene to organic semiconductors to metal clusters in enzymes, using computational methods. So one day, we'll slowly get there, and we shouldn't write it off. Though yes, the current state of the art is hopeless for cancer drug development--but the field is dynamically changing.

(Naive layperson here). Could the manual microscopes-and-pipettes work being done by lab biologists be mechanized, so that you're generating drug candidates in software, testing them in living cells, and using automatically-gathered observations to generate new candidates?

It is already done to an extent (high-throughput machines); however, there are still a lot of old-timers in biology who spent too long pipetting and haven't invested yet.

While I agree with you that there is a ton of automation, it's a bit flippant to suggest the manual work in biology is because of old-timers. You can't exactly automate necropsy on a rat liver to see if the compound you just gave it caused liver failure. There are a ton of experiments that are not automatable with current robotics technologies. Plus, even when you can automate, biology can't be rushed. If you're waiting for a tumor to grow in a mouse model, you have to wait real wall clock time.

I'm in no position to argue about the mechanics of rat autopsy. But I do know how to get high throughput when you have high latency: shotgun parallelism. Try tons of things, the vast majority of which you assume will turn up with nothing, starting at the same time, in parallel.
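To make the high-latency, high-throughput idea concrete, here is a minimal Python sketch. Everything in it is a made-up stand-in: `run_experiment`, the 0.1 second delay, and the 1-in-50 hit rule are assumptions for illustration, not a model of any real assay. The point is only that when every trial has a fixed wait, starting them all at once costs about one wait instead of a hundred.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_experiment(candidate: int) -> bool:
    """Hypothetical slow experiment; almost every candidate fails."""
    time.sleep(0.1)  # fixed latency that cannot be shortened
    return candidate % 50 == 0  # arbitrary rule: 1 in 50 counts as a "hit"


def screen(candidates, workers):
    """Start all experiments together and keep only the hits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_experiment, candidates))
    return [c for c, hit in zip(candidates, results) if hit]


candidates = list(range(1, 101))
start = time.perf_counter()
hits = screen(candidates, workers=len(candidates))
elapsed = time.perf_counter() - start
print(hits, f"{elapsed:.2f}s")
```

Run one after another, the same 100 trials would take about 10 seconds of waiting. The catch raised above still applies: this only helps for experiments that can actually be run side by side.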

It would be good even for not-old-timers. I know a couple of bioinformatics people and they're definitely experts on one topic: RSI.

Unfortunately not everything is done on a massive scale so not everything is automated.

> As impressive as it sounds, such a cluster can be spun up by anyone with the proper expertise, without talking to a single employee of Amazon.

This isn't actually true. There's initially an instance limit of 20, and you have to contact Amazon to get it lifted. You could probably order just as many servers at Softlayer or purchase them at Dell without talking to anyone (except the guy who confirms your credit card).

After all, they're not going to let anyone run up a $3 million bill on the hope that it'll be paid at the end of the month!

Bump ... I've had personal experience where I needed a few 100 VMs for a short experiment. A quick message to Amazon got the job done. Took less than a day to get approval.

$4,829 per hour isn't the most impressive metric here. It ran for only 3 hours, at a total cost of $14,486 - compared to the build cost and lead time involved in building a "$20-25m data center".
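A quick back-of-the-envelope check of those figures, as a Python sketch. The inputs are just the numbers quoted in this thread; the comparison uses the low end of the "$20-25m" range and ignores power, staff, and every other running cost of a real data center, so treat it as a rough ratio and nothing more.

```python
HOURLY_RATE_USD = 4829             # per-hour figure from the headline
RUN_HOURS = 3                      # run length quoted above
QUOTED_TOTAL_USD = 14486           # total quoted above
DATA_CENTER_LOW_USD = 20_000_000   # low end of the "$20-25m data center"

# Rate times hours, to compare against the quoted total.
estimated_total = HOURLY_RATE_USD * RUN_HOURS
rounding_gap = estimated_total - QUOTED_TOTAL_USD

# How many 3-hour runs the build budget alone would pay for.
equivalent_runs = DATA_CENTER_LOW_USD // QUOTED_TOTAL_USD

print(estimated_total, rounding_gap, equivalent_runs)
```

The product comes out a dollar above the quoted total, presumably because the hourly rate is rounded, and the build budget alone would cover roughly 1,380 such runs.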

This has huge implications for the availability of supercomputers for smaller organizations and use cases.

The cluster used a mix of 10 Gigabit Ethernet and 1 Gigabit Ethernet interconnects. However, the workload was what’s often known as "embarrassingly parallel," meaning that the calculations are independent of each other. As such, the speed of the interconnect didn’t really matter.

Ah, if only this were true for everything. It'd make everyone's lives a lot easier. :-) Unfortunately there are a lot of applications out there where inter-node communication is incredibly important: computational fluid dynamics, in all its varied forms, is a good example of a latency-bound application. This covers weather forecasting, aerodynamics, dynamic mechanical modeling, and so on.
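A toy Python sketch of the difference, under loudly stated assumptions: `score` is a hypothetical stand-in for an independent task, the "nodes" are just lists in one process, and the message count for the grid case assumes a 1D domain split into contiguous slabs with one boundary value swapped per neighbour per step. Real CFD codes are far more involved.

```python
def score(ligand_id: int) -> int:
    """Hypothetical independent task: the result depends only on its own input."""
    return (ligand_id * 37) % 101


def split(items, nodes):
    """Deal items round-robin onto `nodes` simulated machines."""
    return [items[i::nodes] for i in range(nodes)]


def independent_run(items, nodes):
    """Each node scores its own share; no node needs data from another."""
    messages = 0  # nothing is exchanged between nodes during the computation
    results = {}
    for share in split(items, nodes):
        for item in share:
            results[item] = score(item)
    return results, messages


def halo_exchange_messages(nodes, steps):
    """Messages for a 1D grid in contiguous slabs: every time step, each pair
    of neighbouring nodes swaps one boundary value in each direction."""
    return 2 * (nodes - 1) * steps


items = list(range(1000))
one_node, _ = independent_run(items, nodes=1)
many_nodes, sent = independent_run(items, nodes=64)
print(one_node == many_nodes, sent, halo_exchange_messages(64, 1000))
```

The independent workload gives the same answer however it is split and never talks between nodes, while the grid-style workload has to wait on neighbour exchanges at every single step, which is where interconnect latency starts to dominate.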

What I find interesting is that to get to 51,000 cores, they had to use AWS datacenters all over the world. I'd love to know what kind of resources are actually available in any given datacenter. It will vary at any given time, but it would be useful to know how many cores are "close" to each other in a networking sense, for applications where latency matters.

Wow. This press release even got covered in the NYT!


The actual feat accomplished here would not surprise anyone involved in supercomputing, and companies have done very similar things with non-scientific tasks, such as this article from 2008:


describing how the New York Times itself did a similar computation.

Stowe was just speaking at AWS summit, lots of AWS goodness there: http://aws.amazon.com/live/

Man, and I thought it was bad when I set in motion a 3 hour computation and lose it due to a bug in the code that writes the results to disk. At least it doesn't set me back $15k to redo the computation...

I was in the audience at the NY AWS summit ... quite a loud round of applause when they gave their per-hour cost number. I was also impressed. However, I found it a bit surprising that it took 2 hours to acquire the VMs, and 1 hour to do the actual work.

I was there as well. Pretty cool stuff!

Two things struck me about that article, the pharmaceutical guy who felt like with enough cores they could find the cure to cancer, and Cycle's challenge of moving past 50,000 cores.

What struck me is that I wonder why Google (or Amazon) hasn't put out the cure for cancer. The actual extent of Google's infrastructure is classified, but using open sources it's clear that putting together even half a million 'cores' is not a huge project for them. (Think of it this way, they spent a billion dollars building a data center where the actual building/land/etc estimates out at about 200M.)

Google does invest in life sciences startups: http://www.googleventures.com/portfolio#life-sciences

Perhaps I should be a bit more clear and less snarky. I think Cycle has had a great press release, it has advertised their product well, and like any good press release it doesn't so much read like an advertisement for a particular company as it does as real news. This defines great execution of the PR technique known as 'article placement.'

One of the qualities that makes it a great soundbite is that Schrodinger, the company that is nominally the topic of the story, goes on about how their 1,500 core cluster can't give them the resolution they need but a 50,000 core cluster makes everything clear. Understand that a 'westmere' processor is potentially 12 cores (if you use Intel's definition of threads and I'm sure they do), and the typical motherboard is 2 CPUs so that is 24 'cores' per machine [1]. A 1,500 core cluster is 62 machines, that is a couple of cabinets' worth if you're using Supermicro boxes, less than a cabinet if you're using OpenCompute type cabinets [2]. And at maybe $3K each that is an investment of $186K, maybe $250K if you included switches. Which is about 1/3 what it would have cost a pharmaceutical company to buy a VAX minicomputer back in the day.

My point is that if you're in a multi-billion dollar marketplace, you can afford to spend more on your hardware. And even at approximately $5K/hr a 50,000 core cluster is only 2100 Open Compute servers in 24 of their 'triplet' cabinets.
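For anyone who wants to check the sizing arithmetic, here is a small Python sketch. It uses only the figures given here (12 'cores' per CPU, 2 CPUs per board, roughly $3K per machine); those are estimates from this comment, not vendor numbers.

```python
import math

CORES_PER_MACHINE = 12 * 2      # 12 'cores' per CPU, 2 CPUs per motherboard
COST_PER_MACHINE_USD = 3_000    # the "maybe $3K each" estimate

# 1,500 cores: 62.5 machines, which the comment rounds down to 62.
small_machines = 1_500 / CORES_PER_MACHINE
small_cost = 62 * COST_PER_MACHINE_USD

# 50,000 cores: just under 2,100 machines once rounded up to whole servers.
large_machines = math.ceil(50_000 / CORES_PER_MACHINE)

# How much bigger the large cluster is than the small one.
scale_factor = 50_000 / 1_500

print(small_machines, small_cost, large_machines, round(scale_factor, 1))
```

So the 62 machines and $186K hold up, and the large cluster works out to 2,084 whole servers, about 33 times the size of the small one.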

That is about 1MW of critical power for compute (1MW being the power commitment you would have to buy from a colocation center to power it), and even that is a fairly small footprint at Amazon, Google, Facebook, or even Apple with its new $1B data center [3].

So when I read the story, I was left thinking "Gee, if using a cluster that was 33x bigger got them such great results, why not use one that's 3000x bigger? Wouldn't that just answer the question?" And of course I took a moment to analyze that thought and asked the question every critical reader has to ask, which is, "What is this article trying to say anyway? And do I believe it?" And that was when it became obvious that what the article is saying is "Cycle, the company that makes a living creating virtual supercomputers for embarrassingly parallel problems out of EC2 instances, has reached the point where they can get 50K cores running the same problem." Which is great and all, but like a long story that is just a setup for a bad pun, it leaves me feeling jaded; hence my snarky remark that if all it takes is more cores, Google should stop trying to be a great advertising company and switch to being a pharmaceutical company. Which, when you say it out loud, you realize couldn't possibly be that easy, and yes, it's a snarky way of expressing irritation that I was led to believe there was something newsworthy here when there wasn't.

[1] http://opencompute.org/projects/intel-motherboard/

[2] http://opencompute.org/projects/triplet-racks/

[3] http://gigaom.com/apple/apples-new-north-carolina-data-cente...

How efficient is this compared with running your own cluster? 2x slower? 10x? In addition to latency between nodes, surely you're paying some kind of performance penalty for virtualization?

Our department which currently runs a 1000+ core machine and 2500+ core machine for MD simulations is still waiting on jumping onto the Amazon/Cloud bandwagon mostly because it's still not cheaper than owning a cluster for a few years. Granted, being at a large research university means they're not the only clusters on campus and there's an infrastructure already in place to maintain it.

In terms of speed, I would assume Amazon to be faster than most clusters since they probably offer the latest and greatest computers. Lastly, MD is an embarrassingly parallel problem (don't ask me how it is) so latency isn't a major issue.

In what programming language would this cross-CPU simulation software be built?

The actual application used was Glide, http://www.schrodinger.com/products/14/5/, which is likely Fortran or C++ given the type of app. Don't know what Cycle's core software for managing the whole infrastructure is written in, but they use a lot of Chef and schedulers like Torque or Condor in addition to Boto to orchestrate the EC2 side.

And yet, no results from their massive computation. Schrodinger is well known for being a company of liars and frauds, and unfriendly to open source and other ideals of our community.

To be fair, Gaussian is orders of magnitude worse. See <http://www.bannedbygaussian.org/>.

Doesn't Schrodinger own PyMOL, which is open source?
