Fastpass: A Centralized “Zero-Queue” Datacenter Network (mit.edu)
87 points by jonbaer 1016 days ago | 28 comments

The failure modes of the master-to-secondary arbiter failover process need to be analyzed a bit more. In particular, if you have a packet of doom that takes out both the master and the secondary, what happens to all the network traffic when both are gone? Does it degrade to normal TCP? (It didn't look like it.)

As context, I am a researcher in datacenter networks at a world renowned university. I've read the full research paper in detail, several times (not just the media summary) and have reviewed it carefully with other researchers in this area in our weekly reading group.

The first thing to note is that this is not "done and dusted" or "accepted as gospel truth". The work may have been accepted for publication at a top-tier conference (and that's fantastic!), but this means that only a small fraction of researchers have actually read that paper. Research publications are part of a conversation in the research community, an argument, and they should be read critically.

In that light, reading the work carefully, several things should be noted:

First the good stuff:

1) This work is incredibly well written. It's easy to read, it's sexy and it sells. This makes it easy for reviewers who often get a lot of poorly written work. A job well done by the writers.

2) In my (qualified) opinion, the reason that this work has been accepted for publication is that many in the community, including myself, would not have believed that a centralized arbiter could be built, AT ALL. This is quite a novel thing. The idea is mad, and it seems to have worked.

Now the not so good stuff:

1) This paper is carefully crafted to be slippery. You should be very careful about the claims that the authors actually make, in contrast to what you assume from the language used.

2) Fastpass is not "zero queue". It simply moves queuing to other places. First, to the end host: when a host wants to transmit data, it must queue those packets and send a message to the arbiter to get a slot, then wait for a response, and then wait for the slot. Second, to the arbiter itself, which must be able to keep up. This is easy at low load, but these waits get much longer at high load and in larger systems.

3) The authors never measure or demonstrate the impact of this extra queuing at high load (or at any load) on the end-to-end latency. Using "zero queue" in the title implies that latency in the system is better, and indeed the MIT sound bite assumes this, but it's not measured or demonstrated anywhere. My guess is that the results just aren't better. So they have focused on the things that look good, and not the dirty details. This seems disingenuous to me.

4) The Facebook implementation uses only 1 rack (at most about 100 machines, probably more like 40), which means that there is only ever 1 path that packets can take. This means that a significant part of their algorithm (calculating the right path) is never run, reducing the cost.

5) Despite this, the Facebook implementation shows almost no benefit. They manage to reduce the number of TCP retransmits from 4 per second down to 2 per second. They never discuss or demonstrate that this has any useful benefit, and, frankly, I'd be surprised if it did.

6) The headline number of reducing latency by thousands of percent comes only from a contrived experiment with ping and iperf, the extreme ends of the latency/throughput spectrum, with little relation to the real world. The same result could easily have been achieved by simply setting ping to a high network priority.

7) There is no mention of tail latencies for realistic workloads, which is the real problem that is suggested in the solution but never demonstrated.

8) Scalability is a serious issue for this work. The authors acknowledge this, but it will limit deployability.

9) Ultimately, this work is sexy and interesting, but the paper never demonstrates any tangible benefit.
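
To make point 2 concrete, here is a rough sketch of where a packet's time could go under this kind of arbitration. The component names and every number are my own hypothetical placeholders, not measurements from the paper; the only point is that the in-network time can stay flat while the waiting before the packet ever leaves the host grows.

```python
from dataclasses import dataclass


@dataclass
class DelayBudget:
    """Hypothetical per-packet delay components, all in microseconds."""
    request_to_arbiter_us: float  # host sends its slot request to the arbiter
    arbiter_wait_us: float        # request waits for and receives arbiter service
    reply_to_host_us: float       # allocation travels back to the host
    wait_for_slot_us: float       # host holds the packet until its slot starts
    network_transit_us: float     # time in the fabric once the packet is sent

    def host_side_wait_us(self) -> float:
        # Everything that happens before the packet leaves the NIC.
        return (self.request_to_arbiter_us + self.arbiter_wait_us
                + self.reply_to_host_us + self.wait_for_slot_us)

    def end_to_end_us(self) -> float:
        return self.host_side_wait_us() + self.network_transit_us


# Made-up values: same fabric time, different arbiter and slot waits.
light = DelayBudget(10.0, 5.0, 10.0, 3.0, 20.0)
heavy = DelayBudget(10.0, 80.0, 10.0, 40.0, 20.0)

for name, budget in (("light load", light), ("heavy load", heavy)):
    print(name, budget.host_side_wait_us(), budget.end_to_end_us())
```

A measurement that only looks at `network_transit_us` would report the same number in both cases, which is why the end-to-end figure is the one that matters.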

Great response but you didn't need to start with the chest thumping ("world renowned university", "(qualified) opinion"). I almost completely skipped reading it because it makes you sound like a blowhard.

Thanks. I guess I'm trying to give some sort of indication that I know what I'm talking about. Too many people have opinions without any factual basis to make them on. I'll try to be more toned down in the future.

Cool, but if you really want to do this sort of stuff (and don't mind being tied to the concept of some sort of arbitrator or route server) then you should be looking at circuit switched networks like Infiniband, Myrinet, DolphinLink, NUMALink etc.

HPC has been doing this stuff for decades, it's a fairly well understood problem how to achieve "zero queuing".

It's worth mentioning that Infiniband is incredibly affordable for datacenter networking and has excellent tooling on both Linux and Windows these days.

None of the networks you mention are circuit switched. They are all packet switched.

Fibre Channel is about the last example of a circuit-switched network, and it's pretty much dying.

Technically correct: Infiniband is actually a variant of virtual cut-through switching. If one wants to be even more pedantic, one can also point out that there are implementations of Ethernet that employ the same technology at the switching level.

What really makes the big difference, though, is the presence of the subnet manager, which pre-programs the routing information into each switch at fabric bring-up time. This is what causes Infiniband to act like a circuit-switched network despite, of course, being VCT at the PHY layer.

Uh, no. You aren't using those words the way other people do.

At the scale of these networks, Infiniband isn't "incredibly affordable".

It's surprisingly affordable, I think you will find.

List prices alone are very attractive: 36-port QDR (40 Gbit/s) switch http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1... Dual-port QDR HBA http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1...

The cost benefits compound when you go up to FDR (56 Gbit/s) and make use of RDMA-aware protocols like iSER, SRP or SMB3.

"At the scale of these networks". These are warehouse-scale data centers.

When I saw "Zero-Queue" it scared me. Is it really zero? What kind of transport-layer protocol does this network framework use (D2TCP or ...)?

The queues are still there, but scheduling ensures that they don't fill up.

Incorrect. The scheduling only ensures that the queues in the network (from the NIC onwards) don't fill up. However, queues are still there in the host and in the arbiter. The authors never measure or demonstrate that these queues are any shorter, that the tail latencies are improved for any real workload, or that there is any actual benefit in the approach for a real-world scenario.

This may be a dumb question - but the experiments show latencies of the order of milliseconds. How would it work when your median latencies are of the order of 100 - 200 microseconds? At that scale, the effect of the arbiter would be more pronounced right?

Am I missing something here, or is this not meant for that use case?

It should still reduce your tail latency.

Dubious. The question is what the tail response time of the arbiter is at full load. The key measurement (which is conspicuously absent from the paper) is the impact on the end-to-end delay at varying load. A distribution graph of this would answer this question immediately. My suspicion is that it is no better, because essentially the same amount of "scheduling work" is being done regardless of where it is done.
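
As a back-of-the-envelope illustration of why load matters, here is a sketch that treats the arbiter as a single M/M/1 queue. That model is purely my assumption (the real arbiter is batched and multi-core, and nothing here comes from the paper), and the 1 microsecond service time is a placeholder. For M/M/1, time in system is exponentially distributed with mean service_time / (1 - utilization), so both the mean and the tail blow up as utilization approaches 1.

```python
import math


def mm1_sojourn_us(service_us, utilization, percentile=0.99):
    """Return (mean, percentile) time in system for an M/M/1 queue.

    service_us:  mean service time per request, in microseconds (hypothetical)
    utilization: offered load as a fraction of capacity, in [0, 1)
    percentile:  which tail to report, e.g. 0.99 for p99
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    if not 0.0 < percentile < 1.0:
        raise ValueError("percentile must be in (0, 1)")
    mean = service_us / (1.0 - utilization)
    # Sojourn time is exponential, so the quantile is -ln(1 - p) * mean.
    tail = -math.log(1.0 - percentile) * mean
    return mean, tail


for load in (0.1, 0.5, 0.9, 0.99):
    mean, p99 = mm1_sojourn_us(service_us=1.0, utilization=load)
    print(f"load={load:.2f}  mean={mean:.1f}us  p99={p99:.1f}us")
```

This is not a claim about Fastpass's actual numbers. It only shows the shape of curve that a delay-versus-load graph would need to rule out.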

I don't normally vote articles up but this looks really cool.

It would be good if they had more info about their testing methodology and also something like a haproxy implementation.

Also I don't see any mention of failure if the arbiter falls over.

Yes, the paper talks about having a secondary arbiter that does watchdog pings, and if the primary dies, the secondary waits for the queues to flush then takes over, statelessly. It's a huge hole in the paper.
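
Roughly, the scheme as I describe it above looks like the toy state machine below. The class name, the timeout and the flush delay are all hypothetical, and this is my reading of the description, not the authors' code. Time is passed in explicitly so the sketch runs without a real clock.

```python
class SecondaryArbiter:
    """Toy model of a standby arbiter: watchdog, flush wait, stateless takeover."""

    def __init__(self, watchdog_timeout, flush_delay):
        self.watchdog_timeout = watchdog_timeout  # silence allowed before failover
        self.flush_delay = flush_delay            # time allowed for queues to drain
        self.last_heartbeat = 0.0
        self.dead_since = None                    # when the primary was declared dead
        self.active = False

    def on_heartbeat(self, now):
        # A reply from the primary resets the watchdog while on standby.
        if not self.active:
            self.last_heartbeat = now
            self.dead_since = None

    def tick(self, now):
        """Advance to time `now` and return 'standby', 'flushing' or 'active'."""
        if self.active:
            return "active"
        if self.dead_since is None:
            if now - self.last_heartbeat > self.watchdog_timeout:
                self.dead_since = now
                return "flushing"
            return "standby"
        if now - self.dead_since >= self.flush_delay:
            # Stateless takeover: start allocating with no inherited state.
            self.active = True
            return "active"
        return "flushing"


secondary = SecondaryArbiter(watchdog_timeout=1.0, flush_delay=0.5)
secondary.on_heartbeat(0.0)
for t in (0.5, 1.5, 1.75, 2.0):
    print(t, secondary.tick(t))
```

In this sketch nobody grants timeslots between the primary's last heartbeat and the moment the secondary goes active, and nothing at all happens if the secondary is also down. That window is what the comments above are asking about.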

Is it just me, or don't the graphs at the bottom left appear to indicate that Fastpass achieves "improved fairness" through considerably diminished performance, especially under low contention?

It took me a while, but I believe this might be one of the few times that a stacked line chart would better serve the purpose. Total throughput is the sum of the 5 per-connection throughputs. Thus, during the peak, we see ~2 Gbps for each of the 5 servers, totaling ~10 Gbps. In the top graph, we see 1-3 Gbps for the 5 servers, totaling ~10 Gbps. It would be clearer if a "total throughput" line were added or the author used a stacked line chart, since we're looking at both total throughput and fairness.
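
The arithmetic can be sketched as follows. The per-flow values are hypothetical numbers I picked to match the ranges described above, not data read off the graphs, and Jain's fairness index is just one common way to put a number on fairness; I am not claiming the paper uses it.

```python
def total_throughput(flows_gbps):
    """Aggregate throughput is the sum of the per-connection throughputs."""
    return sum(flows_gbps)


def jain_fairness(flows_gbps):
    """Jain's index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly equal shares."""
    n = len(flows_gbps)
    sum_sq = sum(x * x for x in flows_gbps)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one flow with non-zero throughput")
    return sum(flows_gbps) ** 2 / (n * sum_sq)


# Hypothetical per-flow throughputs in Gbps during the 5-flow peak.
even_split = [2.0, 2.0, 2.0, 2.0, 2.0]    # roughly 2 Gbps each
uneven_split = [1.0, 3.0, 2.0, 1.5, 2.5]  # somewhere between 1 and 3 Gbps each

for name, flows in (("even", even_split), ("uneven", uneven_split)):
    print(name, total_throughput(flows), round(jain_fairness(flows), 3))
```

Both cases add up to the same total, so only the fairness figure separates them, which is why the two quantities need to be shown together.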

Right, but I mean the start and end of those graphs. That shows 6+ Gbit for one flow without Fastpass, and around 4 Gbit with it.

Yeah, that's fairly poor. It should be 10 Gbps for the first 30 seconds.

It's not just you. If you read it carefully you'll notice that the only improvements the evaluation shows are that TCP retransmits are reduced from 4 to 2 per second, which probably has no impact, and that latency is better only in a contrived example with iperf and ping. The paper really demonstrates no tangible benefit.

Just curious: Is Amy Ousterhout in any way related to John Ousterhout of Stanford?

Is there no queue for the arbiter?

Fastpass is not "zero queue". It simply moves queuing to other places. First, to the end host: when a host wants to transmit data, it must queue those packets and send a message to the arbiter to get a slot, then wait for a response, and then wait for the slot. Second, to the arbiter itself, which must be able to keep up. This is easy at low load, but the waits will get much longer at high load and in larger systems.
