Thanks for the clarification. gRPC is slow. We have in-house experiments showing...

dgacmu · on June 3, 2017

Publish those results? It'd be very interesting to see. And it sounds like you think there are benchmarks missing from the common set of things people are measuring. What's a very specific network you'd like to see added to the mix? VGG16 doesn't register on my radar as "modern and applicable" in the days of ResNet.

Using NCCL is great; TF now supports it, as of about a month and a half ago (though I don't know how tightly integrated it is): https://github.com/tensorflow/tensorflow/blob/master/tensorf...

From the benchmarks available, and not knowing what your in-house experiments show, I don't believe that the "internal network stack" is key to scaling. The scalability numbers shown on tensorflow.org/performance are very reasonable: from 902 images/sec to 1783 (1.97x) going from 32->64 K80 GPUs on Amazon for Inception v3, and 565->981 (1.7x) for ResNet-152. I'd love to be proved wrong.

That 1.7x scaling on ResNet-152 would be a great point of comparison, for example. From my student Hyeontaek's results, I actually suspect that scheduling improvements, rather than networking improvements, could make up some of that difference.
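For anyone who wants to redo the arithmetic behind those ratios, here is a minimal sketch. The throughput figures are the ones quoted above; the function names and the "scaling efficiency" measure (observed speedup divided by the ideal linear speedup) are my own illustration, not anything taken from the TensorFlow benchmark code.

```python
# Throughput in images/sec at 32 and 64 GPUs, as quoted in the comment above.
# The dictionary keys are illustrative labels, not official benchmark names.
THROUGHPUT = {
    "inception_v3": {32: 902, 64: 1783},
    "resnet": {32: 565, 64: 981},
}


def speedup(throughput_small, throughput_large):
    """Observed speedup when moving from the smaller to the larger cluster."""
    return throughput_large / throughput_small


def scaling_efficiency(gpus_small, gpus_large, throughput_small, throughput_large):
    """Observed speedup as a fraction of the ideal (linear) speedup."""
    ideal = gpus_large / gpus_small
    return speedup(throughput_small, throughput_large) / ideal


if __name__ == "__main__":
    for name, t in THROUGHPUT.items():
        s = speedup(t[32], t[64])
        e = scaling_efficiency(32, 64, t[32], t[64])
        print(f"{name}: speedup {s:.2f}x, efficiency {e:.0%}")
```

By this measure Inception v3 runs at roughly 99% of linear scaling and the ResNet run at roughly 87%, which is the gap I'd want a comparison to explain.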

As I'm sure you know, of course, and are just fishing for, the reason that code links against gRPC externally is that trying to extract Google's internal networking code from the full internal software codebase would be ridiculous. I think it's far more likely to go the other direction, with everything settling on gRPC. gRPC is actually newer and, in general, more feature-ful than Stubby: https://cloudplatform.googleblog.com/2016/08/gRPC-a-true-Int...