It feels like we've gone in one big circle: first we moved the DB onto a separate machine for performance, yet now more computation is going back to being done nearer the data (like Hadoop), and we'll try to pretend it's all just one giant computer again.
Actually, I think this is not exactly a circle. The reason we moved databases to separate machines for performance was that we did not understand how to schedule tasks or jobs.
Also, it comes from the age-old debate over whether an operating system is sufficient to perform a certain type of resource management, or whether the application, which knows its own semantics, should build its own resource manager. One popular incarnation of this is Michael Stonebraker's paper "Operating system support for database management" - http://www.eecs.harvard.edu/~margo/cs165/papers/stonebraker-... where he discusses whether database management systems should build their own buffering/caching mechanisms or whether the mechanisms provided by filesystems are good enough. This can be extended to the problem in question now, which is about scheduling jobs on a huge cluster seen as a single computer.
One of the major reasons (among many others) why we moved databases to separate machines was that OSes did a bad job of scheduling tasks. Rather than "bad", I should say that it is difficult for an operating system to understand the types of workloads it is running. However, in this particular case, Google knows a lot more about its workloads and hence can do a better job of scheduling them on the same machine.
Also, it is very important to note that Google has been able to do this because they run thousands of machines next to each other and everything is distributed. If they can't schedule a task on a particular machine because it is overloaded, they always have the choice of scheduling that task on the next machine in the rack. This may not be the case for most people outside Google. If you are using a single machine to host both the database and the computation itself, and for some reason that machine is already running at its peak, there is no option other than making the pending tasks wait, because there is nowhere else to schedule them.
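To make the rack-versus-single-machine point concrete, here is a minimal sketch of first-fit placement. It is not how Google's scheduler works; the `Machine` class, the `schedule` function, and the "cores" unit are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    """Hypothetical machine with a fixed CPU capacity (in cores)."""
    name: str
    capacity: float
    used: float = 0.0

    def can_fit(self, demand: float) -> bool:
        return self.used + demand <= self.capacity


def schedule(machines: List[Machine], demand: float) -> Optional[str]:
    """First-fit: place the task on the first machine with enough headroom.

    Returns the chosen machine's name, or None if the task has to wait.
    """
    for machine in machines:
        if machine.can_fit(demand):
            machine.used += demand
            return machine.name
    return None  # nowhere to run it right now


# A rack: the first machine is saturated, so the task lands on the next one.
rack = [Machine("rack1-a", capacity=4, used=4), Machine("rack1-b", capacity=4, used=1)]
placed_on_rack = schedule(rack, demand=2)

# A single box hosting both DB and computation: the task can only wait.
single = [Machine("solo", capacity=4, used=4)]
placed_on_single = schedule(single, demand=2)
```

With many machines, a saturated box just means the loop moves on; with one machine, the same condition turns into queueing delay.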
Moving DBs to separate machines was really to keep the utilization at low levels so that there would be cycles available when needed.
If you look at it from the perspective of CPU cores it lines up more:
~2002: move everything off to its own dual-cpu server
~2007: move everything to its own dual core vm on an 8-core server
~2012: put everything on each 16/32 core box but manage its resources via cgroups (or just overprovisioning)
I have manually operated multiple JVMs per server, maintaining /etc/services to track the ports, with Apache + mod_proxy as the front end: start/stop services and let Apache detect the active state. With Mesos and cgroups we get a more dynamic assurance that services have their minimum guaranteed resources and can scale across the entire compute cloud. We will soon be able to scale up/down on demand, maintaining 85% or greater utilization while reducing allocation. With cgroups we also remove the overhead of the multiple OSes that traditional VMs bring.
Do you use ZooKeeper for addressing and assigning ports? One of the things most lacking in the datacenter world outside Google is an addressing scheme, like DNS, that lets you find a resource by name, including its port.
Using well-known ports severely restricts flexibility of scheduling.
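As a rough illustration of what name-based addressing with ports could look like, here is an in-memory sketch. It is not ZooKeeper's API or any real system's; `ServiceRegistry`, `assign`, `resolve`, and the port range are hypothetical, and a real deployment would keep this state in a replicated store.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ServiceRegistry:
    """Hypothetical registry: hands out ports per host and resolves names."""

    def __init__(self, port_range: range = range(31000, 32000)) -> None:
        self._port_range = port_range
        self._used: Dict[str, Set[int]] = defaultdict(set)  # host -> ports in use
        self._instances: Dict[str, List[Tuple[str, int]]] = defaultdict(list)

    def assign(self, name: str, host: str) -> int:
        """Pick any free port on `host` for service `name` and record it."""
        for port in self._port_range:
            if port not in self._used[host]:
                self._used[host].add(port)
                self._instances[name].append((host, port))
                return port
        raise RuntimeError(f"no free port left on {host}")

    def resolve(self, name: str) -> List[Tuple[str, int]]:
        """Return every (host, port) where `name` is currently running."""
        return list(self._instances.get(name, []))


registry = ServiceRegistry(port_range=range(31000, 31002))
db_port = registry.assign("db", "host1")    # first free port on host1
web_port = registry.assign("web", "host1")  # same host, different port
registry.assign("db", "host2")              # second db instance elsewhere
db_endpoints = registry.resolve("db")
```

Because clients look up "db" rather than a fixed port number, the scheduler is free to put two services on one host or move an instance elsewhere without colliding on a well-known port.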