The performance isolation is good, but I wouldn't chase high utilization unless compute costs are significant to your business. We've seen some crazy cases where nominally non-interfering jobs cause significant performance degradation to other jobs on the same node. There's still work to do here.
"According to Wilkes, Google plans to publish a research paper on Borg (though he still won’t use the name). "
Wilkes won’t even call it Borg: “I prefer to call it the system that will not be named.”
(I work with John and have contributed to Borg)
Nevertheless, there are many open questions in large-scale cluster management for researchers and developers to address. Here are some of my favorites:
- The curse of overprovisioning: Borg and many other systems rely on reservations, which are systematically exaggerated by users. Right-sizing these reservations is one way to go beyond the 40-50% utilization shown in the Borg paper (see Fig. 12). A promising approach is Christina Delimitrou's work using classification techniques (see http://goo.gl/vFf8oN)
- Oversubscription using better isolation mechanisms: this is what the Borg paper calls resource reclamation. Take unused (but reserved) resources from high-priority jobs and use them for best-effort analytics. David Lo (http://web.stanford.edu/~davidlo/) has a very interesting paper coming up on how to coordinate cpusets, cache partitioning, Linux TC, and RAPL/DVFS (power management) to run websearch clusters at >90% utilization by packing them with analytics, without causing ANY glitch on search. And that is Google search.
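To make the resource-reclamation idea concrete, here's a toy sketch in Python. The function name, the usage-sampling scheme, and the safety-margin policy are all illustrative assumptions of mine, not what Borg actually does; the point is just that the gap between a task's reservation and its observed peak usage is what gets lent out to best-effort work.

```python
def reclaimable_cpu(reservation, usage_samples, safety_margin=1.2):
    """Estimate CPU cores that could be lent to best-effort jobs.

    reservation: cores the production task reserved.
    usage_samples: recently observed CPU usage of that task (cores).
    safety_margin: pad the usage estimate so a spike by the production
    task doesn't starve it -- it always has first claim on its reservation.
    """
    estimated_peak = max(usage_samples) * safety_margin
    return max(0.0, reservation - estimated_peak)

# A task that reserved 4 cores but peaks at 1.5 cores leaves
# roughly 4 - 1.5*1.2 = 2.2 cores for best-effort analytics.
print(reclaimable_cpu(4.0, [0.8, 1.5, 1.2]))  # ≈ 2.2
```

In a real system this estimate would be recomputed continuously and the best-effort jobs preempted the moment the production task wants its reservation back.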
There are definitely more interesting ones. Exciting times.
The shortened URL goes to: http://web.stanford.edu/~cdel/2014.asplos.quasar.pdf
It's my understanding that URL shorteners are frowned upon in HN posts or comments.
Edit: removed an error in my comment re: Borg/Omega order.
I've implemented something similar at my current job, as that sort of naming is very convenient. http://www.boxever.com/using-google-apps-openid-connect-with... has a sketch of how to do this with Apache as a reverse proxy with Google Auth, though we're now using a PAC file pointing at an HTTPS forward proxy to avoid the limitations of SSL wildcard certs.
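For anyone unfamiliar with the PAC-file approach: a PAC file is just a JavaScript function the browser calls to pick a proxy per request. A minimal sketch of the setup described above might look like this; the hostnames and proxy address are placeholders, not the actual config.

```javascript
// Proxy auto-config (PAC) sketch: route internal service names through
// an HTTPS forward proxy (which can enforce auth), everything else direct.
// "internal.example.com" and "proxy.example.com" are placeholder names.
function FindProxyForURL(url, host) {
  if (dnsDomainIs(host, ".internal.example.com")) {
    return "HTTPS proxy.example.com:443";
  }
  return "DIRECT";
}
```

Because the browser connects to the forward proxy by its own hostname, the proxy's certificate only needs to cover that one name, sidestepping the wildcard-cert problem for the many internal hostnames behind it.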
When I contracted at Google in 2013 I loved their infrastructure. For my task I had to run huge Borg jobs, and the job submission, monitoring, and logging systems were very easy to use. I really liked the summary of hardware failures that occurred - hardware really is not very reliable when running at scale.
After not using AppEngine for a few years I have started using it recently for two personal projects. Using AppEngine's logging and scaling features is a tiny bit like using Google's internal infrastructure - makes me a little nostalgic.
The title implied to me that Google and Cisco were working together.
I'm amazed how often it gets ignored. There's even StarCluster so you can automatically set up a cluster on EC2.