The outcome is pretty grim.
Even if there were a 0% performance penalty from virtualization, you'd still see suboptimal allocation of hardware resources just from trying to take an abstracted view of the hardware. Different applications have different performance profiles. You either end up with overbuilt hardware to support the virtualization environment and the different performance profiles of the different applications, or with multiple VMs for the same application on the same hardware, which is totally unnecessary overhead. Virtualization is just not meant for large scale.
Jump to the results section. The more relevant bullets here are 'single guest' (either virtio or using direct mapping).
Highlights (or lowlights, depending on your perspective):
kernbench - 9.5% overhead
SPECjbb - 7.6% overhead
I don't agree with your point about suboptimal allocation of hardware resources. Virtualization does not require you to divide a machine in a different way than processes do (you could easily have one VM consume nearly all CPU cycles, another consuming nearly all I/O capacity, etc.). IMO, the key difference is that virtualization lets you more easily establish hard, enforceable limits and concrete policies around resource usage (not to mention the ability to account for usage across all kinds of different applications and users). And it lets you do that for arbitrary applications on arbitrary operating systems, so users don't have to write to one particular framework/language/runtime/OS, whatever. That's all pretty important for large scale.
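To make the "hard, enforceable limits" point concrete, here's a minimal sketch using the libvirt Python bindings on a KVM host. The domain name "guest1" and the quota values are hypothetical; vcpu_quota/vcpu_period are libvirt's scheduler parameters for the QEMU driver (backed by cgroups on Linux):

    import libvirt

    # Connect to the local QEMU/KVM hypervisor (assumes libvirtd is
    # running; "guest1" is a hypothetical domain name).
    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("guest1")

    # Hard CPU cap: each vCPU may run at most 50ms out of every 100ms
    # of wall time, i.e. 50% of one physical core, enforced by the host
    # scheduler regardless of what the guest does.
    dom.setSchedulerParameters({"vcpu_period": 100000,  # microseconds
                                "vcpu_quota": 50000})   # microseconds

The guest doesn't need to cooperate or even know the cap exists, which is exactly what you can't get by politely asking arbitrary applications to share.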
Is there a system that would allow for mapping part of an I/O device (such as a block range or a LUN) to a guest, with that same level of overhead, when multiple guests are running?
The distinction being made here isn't single guest vs. multiple guests, it's a single guest OS vs. nested guests (i.e. a VM running another VM). To expose the hardware virtualization extensions to the guest VMM, they must be emulated by the privileged domain (host). There are software tricks that allow this emulation to happen pretty efficiently (and map an arbitrary level of guests onto the single level provided by the actual hardware). It's not a common use-case, but for a few very specific things it's very useful.
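As an aside, whether a Linux/KVM host exposes those extensions to its guests at all is just a module parameter. A quick sketch to check (the sysfs paths are KVM-specific, and older kernels report 1/0 where newer ones report Y/N):

    from pathlib import Path

    # KVM exposes nested-virtualization support as a module parameter;
    # Intel hosts use kvm_intel, AMD hosts use kvm_amd.
    def nested_enabled() -> bool:
        for mod in ("kvm_intel", "kvm_amd"):
            p = Path(f"/sys/module/{mod}/parameters/nested")
            if p.exists():
                return p.read_text().strip() in ("1", "Y")
        return False

    print("nested virtualization enabled:", nested_enabled())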
There are a few different ways to map I/O devices directly into domains. Some definitely allow for part of an I/O device. For example, many new network devices support SR-IOV -- which effectively lets you poke the device and create new virtual devices (which may be constrained in some way) that can be mapped directly into guests.
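For a rough idea of what that looks like on Linux: creating VFs is a sysfs write on the physical function, and each VF then shows up as its own PCI device that can be handed to a guest. A minimal sketch (needs root; the PCI address 0000:01:00.0 is hypothetical, and the kernel requires resetting sriov_numvfs to 0 before setting a new non-zero count):

    from pathlib import Path

    # Physical function (PF) of an SR-IOV capable NIC; the PCI address
    # here is hypothetical.
    pf = Path("/sys/bus/pci/devices/0000:01:00.0")

    # Instantiate 4 virtual functions (VFs). The kernel rejects the
    # write if VFs already exist, hence the reset to 0 first.
    (pf / "sriov_numvfs").write_text("0")
    (pf / "sriov_numvfs").write_text("4")

    # Each VF appears as a virtfnN symlink to a full PCI device that
    # can be mapped directly into a guest (e.g. via VFIO passthrough).
    for vf in sorted(pf.glob("virtfn*")):
        print(vf.name, "->", vf.resolve().name)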
Parties that care much about fine performance margins apparently need to be using Xen or KVM or Illumos then.
For instance, it's easy to make a benchmark showing huge throughput to any given storage solution (and many NAS providers sell on this basis), but your IOPS might be terrible because getting that throughput maxes out the CPU (etc.). Likewise, you can change your benchmark to show high IOPS, but then the throughput is 'terrible'.
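A toy illustration of why the two numbers have to be reported together. This only measures the benchmarking process's own CPU time against a local file (the path and sizes are arbitrary), but the principle carries over to real storage benchmarks:

    import os, time, resource

    BLOCK = b"\0" * (1 << 20)  # 1 MiB of zeroes

    t0 = time.monotonic()
    r0 = resource.getrusage(resource.RUSAGE_SELF)
    with open("/tmp/bench.dat", "wb") as f:
        for _ in range(1024):      # write 1 GiB total
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())       # force it out to stable storage
    t1 = time.monotonic()
    r1 = resource.getrusage(resource.RUSAGE_SELF)

    wall = t1 - t0
    # User + system CPU time burned by this process alone.
    cpu = (r1.ru_utime - r0.ru_utime) + (r1.ru_stime - r0.ru_stime)
    print(f"throughput: {1024 / wall:.0f} MiB/s, "
          f"CPU busy: {100 * cpu / wall:.0f}%")

Quote one line of that output without the other and you can 'prove' the storage is either great or awful.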