The first thing to note is that this is not "done and dusted" or to be accepted as gospel truth. The work may have been accepted for publication at a top-tier conference (and that's fantastic!), but it means that only a small fraction of researchers have actually read the paper. Research publications are part of a conversation in the research community, an argument, and they should be read critically.
In that light, reading the work carefully, several things should be noted:
First the good stuff:
1) This work is incredibly well written. It's easy to read, it's sexy and it sells. This makes it easy for reviewers who often get a lot of poorly written work. A job well done by the writers.
2) In my (qualified) opinion, the reason this work has been accepted for publication is that many in the community, including myself, would not have believed that a centralized arbiter could be built AT ALL. This is quite a novel thing. The idea is mad, and it seems to have worked.
Now the not so good stuff:
1) This paper is carefully crafted to be slippery. You should be very careful about the claims the authors actually make, in contrast to what you might assume from the language used.
2) Fastpass is not "zero queue". It simply moves queuing into other places. First, into the end host: when a host wants to transmit data, it must queue those packets, send a message to the arbiter to request a slot, wait for a response, and then wait for the slot. Second, into the arbiter itself, which must be able to keep up. This is easy at low load, but these delays get much longer at high load and in larger systems.
3) The authors never measure or demonstrate the impact of this extra queuing at high load (or at any load) on end-to-end latency. Using "zero queue" in the title implies that latency in the system is better, and indeed the MIT sound bite assumes this, but it's not measured or demonstrated anywhere. My guess is that the results just aren't better, so they have focused on the things that look good and not the dirty details. This seems disingenuous to me.
4) The Facebook implementation uses only 1 rack (at most about 100 machines, probably more like 40), which means that there is only ever 1 path that packets can take. This means that a significant part of their algorithm (calculating the right path) is never run, reducing the cost.
5) Despite this, the Facebook implementation shows almost no benefit. They manage to reduce the number of TCP retransmits from 4 per second down to 2 per second. They never discuss or demonstrate that this has any useful benefit, and, frankly, I'd be surprised if it did.
6) The headline number of reducing latency by thousands of percent comes only from a contrived experiment with ping and iperf, extreme ends of the latency-throughput spectrum with little relation to the real world. The same result could easily have been achieved by simply setting ping to a high network priority.
7) There is no mention of tail latencies under realistic workloads, which is the real problem the solution implicitly targets but never demonstrates an improvement on.
8) Scalability is a serious issue for this work, which the authors acknowledge, and it will limit deployability.
9) Ultimately, this work is sexy and interesting, but nowhere in the paper is any tangible benefit demonstrated.
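The "queuing has moved, not vanished" argument in point 2 can be made concrete with a rough back-of-envelope sketch. All the numbers below are invented for illustration (the paper reports none of them); the point is only that the arbiter round trip and the slot wait are latency terms that simply don't exist in a plain switched network, and that both grow with load.

```python
# Illustrative model of the per-transmission delay a centralized arbiter adds.
# Every number here is a made-up assumption for illustration, not a measurement.

def arbiter_added_delay_us(rtt_to_arbiter_us, arbiter_service_us, slot_wait_us):
    """Extra latency before the first packet can leave the host:
    request/response round trip to the arbiter, plus the arbiter's own
    processing (its internal queue), plus the wait for the granted
    timeslot. Under load the last two terms grow: the queuing hasn't
    disappeared, it has moved."""
    return rtt_to_arbiter_us + arbiter_service_us + slot_wait_us

# Lightly loaded fabric: cheap round trip, near-immediate slot.
low_load = arbiter_added_delay_us(rtt_to_arbiter_us=15,
                                  arbiter_service_us=5,
                                  slot_wait_us=10)

# Heavily loaded fabric: the arbiter's queue and the slot wait dominate.
high_load = arbiter_added_delay_us(rtt_to_arbiter_us=15,
                                   arbiter_service_us=50,
                                   slot_wait_us=200)

print(low_load, high_load)  # 30 265
```

Even with these charitable made-up figures, every single transmission pays a fixed floor that a conventional network doesn't, which is exactly the cost the paper never quantifies end to end.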
HPC has been doing this stuff for decades; how to achieve "zero queuing" is a fairly well-understood problem.
It's worth mentioning that Infiniband is incredibly affordable for datacenter networking and has excellent tooling on both Linux and Windows these days.
Fibre Channel is about the last example of a circuit-switched network, and it's pretty much dying.
What really makes the big difference, though, is the presence of the subnet manager, which pre-programs the routing information into each switch at fabric bring-up time. This is what causes Infiniband to act like a circuit-switched network despite, of course, being VCT (virtual cut-through) at the PHY layer.
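For intuition, here is a toy sketch of the subnet-manager idea: a central entity walks the topology once at bring-up and hands each switch a static destination-to-port table, so no routing decision happens at packet time. The topology, names and table format are all invented for illustration; this is not Infiniband's actual data model (real fabrics forward on destination LIDs via linear forwarding tables programmed by the SM).

```python
# Conceptual sketch of subnet-manager-style routing: a central entity
# precomputes each switch's forwarding table at fabric bring-up, so
# switches never make a routing decision per packet. Topology and names
# are invented for illustration; this is not Infiniband's real API.

# Fabric links: switch -> {neighbor: egress port}
topology = {
    "sw1": {"hostA": 1, "sw2": 2},
    "sw2": {"sw1": 1, "hostB": 2},
}

def compute_forwarding_tables(topology, hosts):
    """BFS from each switch to every host, recording the first-hop
    egress port. The returned tables are static: destination -> port."""
    tables = {sw: {} for sw in topology}
    for sw in topology:
        for dest in hosts:
            frontier = [(sw, None)]   # (node, first egress port taken)
            seen = {sw}
            while frontier:
                node, first_port = frontier.pop(0)
                if node == dest:
                    tables[sw][dest] = first_port
                    break
                for nbr, port in topology.get(node, {}).items():
                    if nbr not in seen:
                        seen.add(nbr)
                        frontier.append((nbr, first_port if first_port else port))
    return tables

tables = compute_forwarding_tables(topology, ["hostA", "hostB"])
print(tables["sw1"]["hostB"])  # 2: packets for hostB leave sw1 on port 2
```

Once these tables are installed, the data plane is a pure lookup, which is why the fabric behaves as if circuits had been set up in advance.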
List prices alone are very attractive:
36-port QDR (40Gbit/s) switch http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1...
Dual port QDR HBA http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1...
The cost benefits compound when you go up to FDR (56Gbit/s) and make use of RDMA-aware protocols like iSER, SRP or SMB3.
Am I missing something here, or is this not meant for that use case?
It would be good if they had more info about their testing methodology and also something like a haproxy implementation.
Also, I don't see any mention of how failures are handled if the arbiter falls over.