Jump to the results section. The more relevant bullets here are 'single guest' (either virtio or using direct mapping).
Highlights (or lowlights, depending on your perspective):
kernbench - 9.5% overhead
SPECjbb - 7.6% overhead
I don't agree with your point about suboptimal allocation of hardware resources. Virtualization does not require you to divide a machine in a different way than processes do (you could easily have one VM consume nearly all CPU cycles, one consuming nearly all I/O capacity, etc.) IMO, the key difference is that virtualization lets you easier establish hard, enforceable limits and concrete policies around resource usage (not to mention the ability to account for usage across all kinds of different applications and users). And, it lets you do that for arbitrary applications on arbitrary operating systems. So users don't have to write to one particular framework/language/runtime/OS whatever. That's all pretty important for large scale.
Is there a system that would allow for mapping part of an IO device (such as a block range or a LUN) to a guest when multiple guests are running with the same level of overhead?
The distinction being made here isn't for a single guest or multiple guests, it's for a single guest OS or nested guests (i.e. a VM running another VM). To expose the hardware virtualization extensions to the guest VMM, then they must be emulated by the privileged domain (host). There are software tricks that allow this emulation to happen pretty efficiently (and map an arbitrary level of guests onto the single level provided by the actual hardware). It's not a common use-case, but for a few very specific things it's very useful.
There are a few different ways to map I/O devices directly into domains. Some definitely allow for part of an I/O device. For example, many new network devices support SR-IOV -- which effectively allows you to poke it and create new virtual devices (which may be constrainted in some way) which can be mapped directly into guests.