This may be a dumb question, but the experiments show latencies on the order of milliseconds. How would it work when your median latencies are on the order of 100-200 microseconds? At that scale, the effect of the arbiter would be more pronounced, right?

Am I missing something here, or is this not meant for that use case?

It should still reduce your tail latency.

Dubious. The question is what the tail response time of the arbiter is at full load. The key measurement (which is conspicuously absent from the paper) is the impact on the end-to-end delay at varying load. A distribution graph of this would answer the question immediately. My suspicion is that it is no better, because essentially the same amount of "scheduling work" is being done regardless of where it is done.
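
To make the concern concrete, here is a rough sketch of the kind of measurement I mean. It is not from the paper: it assumes the arbiter behaves like a single FIFO server with Poisson arrivals and exponentially distributed service times (an M/M/1 queue), and the names `simulate_response_times`, `percentile` and `tail_by_load` are made up for illustration. A real arbiter will not match this model, but it shows how the tail of the response time grows as load approaches capacity.

```python
import random


def simulate_response_times(load, mean_service_us=1.0, n=100_000, seed=0):
    """Response times (queueing wait + service) for a single FIFO server.

    Assumption: Poisson arrivals and exponential service times (M/M/1).
    `load` is the utilisation, i.e. arrival rate times mean service time.
    """
    rng = random.Random(seed)
    arrival_rate = load / mean_service_us
    wait = 0.0  # time the current request spends queued before service
    responses = []
    for _ in range(n):
        service = rng.expovariate(1.0 / mean_service_us)
        responses.append(wait + service)
        # Lindley recursion: the next request waits for whatever work is
        # still outstanding when it arrives.
        interarrival = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - interarrival)
    return responses


def percentile(values, p):
    """Simple empirical percentile (0 <= p <= 100) of a list of values."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
    return ordered[index]


def tail_by_load(loads=(0.5, 0.7, 0.9, 0.95)):
    """Map each load to its (median, 99th percentile) response time."""
    result = {}
    for load in loads:
        responses = simulate_response_times(load)
        result[load] = (percentile(responses, 50), percentile(responses, 99))
    return result


if __name__ == "__main__":
    for load, (p50, p99) in tail_by_load().items():
        print(f"load={load:.2f}  p50={p50:.2f}us  p99={p99:.2f}us")
```

Under these assumptions the 99th percentile grows much faster than the median as load rises, which is why a distribution plot at several load levels would say more than a single average.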

