I was not aware of the situation with TensorFlow described below.
MS contributions to OSS, at least in this instance, appear a lot more transparent and less self-centered than Google's.
"...It was made very clear from the first day of TensorFlow’s announcement, that Google created two TensorFlow versions: a public version and an internal version. As a TensorFlow user, one either must tolerate the slow speed of the public version, or pay to run the TensorFlow job on Google’s cloud.
..."
There is one TensorFlow. The differences between using TF internally and externally have primarily to do with which RPC bindings it uses (the external one uses gRPC, which is open-source, and the internal one uses the internal RPC framework, which is tied in with all of the internal cluster stuff and authentication and whatnot), and things like filesystems that only exist in Google. The other difference is that there are linkages to use TPUs (hardware that doesn't exist outside of Google) instead of just GPUs. The final differences are just in how the BUILD files link against library files -- the external version downloads protobuf for you, the internal version assumes it's there to use. Yadda yadda yadda.
You can see all of this in the code. It leaks out in places, such as:
Yes, it's that _super secret_ use of a different integral_types.h header. (/sarcasm). If you look through for things like PLATFORM_GOOGLE in the defines, you'll see a lot of the things that differ, and they're incredibly boring. The core of TensorFlow performance-related stuff is Eigen (or, thanks to Intel's recent contributions, Intel MKL) for executing Tensor ops on CPU, or cuDNN for executing Tensor ops on GPU. Just like every other freakin' framework out there. There's a reason that all of these things tend to reduce to the performance of cuDNN...
("we use almost exactly the same code base inside Google that we make available on GitHub").
(Source: I'm a part-time hanger-on on the Brain team, which develops TensorFlow. I'm also a Carnegie Mellon professor most of the time, and I despise marketing getting in the way of truth.)
Scalability is part of TensorFlow's claimed advantages. If someone adopts TF on their own cluster, would they get the same scalability story as marketed?
are generated using GCP, AWS, and an NVIDIA DGX-1, all using exactly the capabilities any ordinary user has on those platforms. The K80 distributed training results are from AWS.
There's also a very useful set of suggestions for how to tune TensorFlow for best performance, along with scripts that reproduce the benchmarking results: https://www.tensorflow.org/performance/
I see that since my comment, Microsoft has updated the claims in the cited page. It's still not true that there are two versions, but I'm glad you're trying to provide more detail. I'd like to stick a big [citation needed] on the claim that the internal version is much faster.
At the time Mu Li did his performance analysis of MXNet vs TensorFlow, we hypothesized that gRPC overhead was one of the reasons that MXNet was showing better scaling numbers than TF. That turned out not to be quite right: the TF team identified several things that narrowed the scalability gap considerably around the 1.0 release. I don't feel confident that gRPC is much of an impediment to scalability. (I'm also not saying that it isn't -- just that I don't think there's a lot of evidence one way or another.)
I'd love it if the CNTK team or someone else were to publish high-quality, head-to-head scalability numbers using the best practices and scripts identified in the TensorFlow performance guide, and using the equivalent CNTK best practices. It benefits everyone when Microsoft and Google work hard to out-do each other. :) (And throw in MXNet as well, with Amazon's best guidance.)
Thanks for the clarification. gRPC is slow. We have in-house experiments showing on RDMA-capable networks an optimized implementation can achieve significant speed up over gRPC. And I bet Google's internal version is even faster.
MXNet has a highly efficient network stack that's open source; Caffe2 uses Gloo, which is open source; CNTK primarily uses Open MPI, NCCL and soon NCCL 2.0. I think it's fair to ask that Google also open-source its internal network stack, because it is the key to scaling.
Most convolutional networks are not a stress test for scaling because the ratio of model size to computation is too low. With a speech model that has many fully connected layers, or with VGG16/19, the communication cost will dominate, and that's when CNTK's 1-bit SGD and Block Momentum really shine.
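To make the model size/computation ratio concrete, here is a rough back-of-the-envelope sketch in Python. It compares the number of weights in a layer (roughly what has to be exchanged between workers each step in plain data-parallel SGD) with the multiply-accumulates needed per sample. The layer shapes are made-up illustrative values, not taken from any of the networks mentioned, and biases are ignored.

```python
def conv_layer(k, c_in, c_out, h_out, w_out):
    """Return (weights, MACs per sample) for a k x k convolution."""
    weights = k * k * c_in * c_out
    # Every weight is reused once per output position.
    macs = weights * h_out * w_out
    return weights, macs


def fc_layer(n_in, n_out):
    """Return (weights, MACs per sample) for a fully connected layer."""
    weights = n_in * n_out
    # Every weight is used exactly once per sample.
    macs = weights
    return weights, macs


def comm_to_compute(weights, macs):
    """Weights to synchronize per multiply-accumulate of work."""
    return weights / macs


# Hypothetical shapes, chosen only for illustration.
conv = conv_layer(k=3, c_in=256, c_out=256, h_out=28, w_out=28)
fc = fc_layer(n_in=4096, n_out=4096)

conv_ratio = comm_to_compute(*conv)
fc_ratio = comm_to_compute(*fc)

print(f"conv: {conv[0]:,} weights, ratio {conv_ratio:.5f}")
print(f"fc:   {fc[0]:,} weights, ratio {fc_ratio:.5f}")
```

Under these assumptions the fully connected layer has a ratio of 1, while the convolution's ratio is 1/(h_out * w_out), so the more of a model's weights sit in fully connected layers, the more gradient traffic there is per unit of compute.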
Publish those results? It'd be very interesting to see. And, it sounds like you think there are benchmarks missing from the existing common set of things people are measuring -- what's a very specific network you'd like to see added to the mix? VGG16 isn't on my radar as "modern and applicable" in the days of ResNet.
From the benchmarks available, and not knowing what your in-house experiments show, I don't believe that the "internal network stack" is key to scaling. The scalability numbers shown on tensorflow.org/performance are very reasonable: from 902 images/sec to 1783 (1.97x) going from 32->64 K80 GPUs on Amazon for Inception v3, and 565->981 (1.7x) for ResNet-152. I'd love to be proved wrong.
That 1.7x scaling on ResNet-152 would be a great point of comparison, for example. From my student Hyeontaek's results, I actually suspect that there are scheduling improvements that could make up some of that difference, not networking improvements.
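For anyone who wants to check the arithmetic, a minimal Python sketch of the speedup and scaling-efficiency calculation follows. The throughput figures are the images/sec numbers quoted above; the function name and the dictionary labels are just illustrative choices.

```python
def scaling_efficiency(rate_small, rate_large, gpus_small, gpus_large):
    """Return (speedup, efficiency) when growing from gpus_small to gpus_large.

    Efficiency is the measured speedup divided by the ideal (linear) speedup.
    """
    speedup = rate_large / rate_small
    ideal = gpus_large / gpus_small
    return speedup, speedup / ideal


# images/sec at 32 and 64 K80 GPUs, as quoted in the comment above
results = {
    "Inception v3": (902, 1783),
    "ResNet": (565, 981),
}

summary = {}
for model, (rate_32, rate_64) in results.items():
    summary[model] = scaling_efficiency(rate_32, rate_64, 32, 64)
    speedup, efficiency = summary[model]
    print(f"{model}: {speedup:.2f}x speedup, {efficiency:.0%} of linear scaling")
```

By this measure the Inception v3 run is at roughly 99% of ideal scaling and the ResNet run at roughly 87%.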
As I'm sure you know, of course, and are just fishing for, the reason that code links against gRPC externally is because trying to extract Google's internal networking code from the full internal software codebase would be ridiculous. I think it's far more likely to see the other direction, with everything settling on gRPC -- gRPC is actually newer, and in general, more feature-ful, than Stubby: https://cloudplatform.googleblog.com/2016/08/gRPC-a-true-Int...
I think that part is misleading. Vijay Vasudevan (a member of the TensorFlow team) has repeatedly put down the notion that the internal TensorFlow code is significantly different than what we see:
Obviously, this is taking the word of someone who is incentivized to get as many people using TF as possible, but I haven't been given a reason to not believe him.
I believe that the public TensorFlow repo is very close to what they use internally. That said, I'm sure there is a huge amount of internal tooling (for easily spinning up clusters internally, profiling, probably an automatic device placer) that we don't get to see. But that has more to do with the fact that it's designed with Google's specific infrastructure in mind than it has to do with "hoarding the good stuff".
Thanks for mentioning that Sam, I appreciate it. Speaking for just myself:
I personally don't care what tools or frameworks people use to get work done and have repeatedly suggested people use whatever works best for them.
It also wouldn't make any sense to hoard any good stuff internally if we wanted to provide a useful framework that people wanted to use externally.
We actually don't have a huge amount of tooling internally that we hold back. The only tooling I really use is the [timeline](https://github.com/tensorflow/tensorflow/blob/f488419cd6d925...) for debugging performance (not the EEG tool in the whitepaper, I've never used it). That's available externally though, as you can see.
On some of the comments in this thread in general, I'm pretty sad at the lack of scientific rigor in the community, and that goes for any person who publishes code that differs from the results they claim, regardless of affiliation. I am happy about projects like OpenAI's RL baselines, as I know others are too.
Most of the papers and articles comparing performance of frameworks have lots of bugs and aren't even comparing the same model computation between frameworks. In fact, CNTK's article points to an external benchmark showing CNTK in a good light, but those benchmarks have bugs in them rendering the comparison incorrect (we've been sending PRs to fix them). I find it disappointing that the culture of the organization promotes calling others out for being 'irresponsible' except when it suits them.
The TensorFlow team hasn't published many articles comparing performance directly to others because it's honestly a lot of hard work to verify that you are comparing fairly. Of course, I do think TensorFlow needs to improve performance out of the box for people, and the team is working on that.
Thanks for the response, Vijay. I didn't mean to insinuate that I thought the team was holding back tooling for the public (rather that such tooling wouldn't make sense to release), but it's reassuring to hear that the total TensorFlow experience is pretty much the same internally and externally.
I think part of the conspiracy theorizing is due to the misconceived notion that Google has some "secret sauce" that allows it to do what it does, as opposed to many talented engineers spending a lot of man-hours on a problem. There has also been a fair amount of negative Google sentiment in the community recently, and the story that Google is holding out on developers feeds into this narrative.
Benchmarking has always been low-hanging fruit for community members to latch onto for the sake of attacking/defending a particular framework. However, the practical differences between these frameworks (assuming each is configured properly) seem to be within a margin of error and are constantly changing (not to mention the inconsistencies you mentioned), so choosing a framework solely on its benchmarking scores is narrow-minded.
From what I've seen, benchmarking has been more useful as a discovery mechanism for areas in a codebase that can be improved. The TensorFlow team has done an excellent job of using various benchmarks to guide development, and I imagine other frameworks are doing the same.
Thx for all the follow ups and clarifications.
I am a user of neither toolkit yet, but following their progress.
Perhaps submitting a bug asking MS to correct the above misinformation is appropriate:
"...It was made very clear from the first day of TensorFlow’s announcement, that Google created two TensorFlow versions: a public version and an internal version. As a TensorFlow user, one either must tolerate the slow speed of the public version, or pay to run the TensorFlow job on Google’s cloud. ..."