Building the Future of TensorFlow (tensorflow.org)
107 points by tosh on Oct 21, 2022 | 48 comments



Maybe “TF 3” or whatever they call it will be ergonomic and a pleasure to use, but that was the promise of TF 2, and unless you wanted to use Keras it was anything but. I’m glad I have been able to work only in PyTorch and Jax. Maybe TF will get better but I’m not holding my breath. On the other hand, XLA is very nice, and I hope they continue to develop it.

I hope torch has a convincing distributed tensor API coming soon. Their development on ShardedTensor seems to have slowed or stopped recently, so TF’s DTensor is definitely ahead, which is a shame. And of course TF’s ecosystem for the whole lifecycle is more mature with TFX, TF.js, etc, but torch is slowly closing those gaps, and hopefully that will continue.


The ergonomics of TensorFlow are probably always going to be behind PyTorch (or Keras, for that matter). The API hasn't been stable for the past 6 years, and that has burned me one too many times not to flinch at using it. It is basically an internal Google tool that has been made available to the public, and like most internal tools at Google the deprecated/developmental dichotomy applies (https://goomics.net/50/). That said, the deployment of TensorFlow models onto mobile devices or the browser is really good, so sometimes the pain is necessary.


That comic reminds me of the current state of Azure Machine Learning. What a mess. I'd actually love to use GCP at this point; I've never dealt with such an insane tangle of SDKs, even with Google Cloud.

Though I agree that it is very weird that Google is treating TF as an internal product, versus something more akin to GCP. There's no reason for them to do so, especially after they had the chance to break and redo tons of stuff for TF2.


That Goomics comic was shown in my Noogler training. I thought it was a joke, but I've learned it's definitely not a joke.


PyTorch does have issues with both distributed tensor (which is easier to solve) and deployment (which is harder to solve, but solvable).

I also think Keras' Functional API is superior to PyTorch's OOP model in terms of composability, but I am biased as a software engineer. It does feel like the community thinks the OOP model is much more hackable and thus easier to use.
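To show what I mean (just a toy sketch; the layer sizes are arbitrary), the Functional API composes a model by wiring tensors through callables, while PyTorch has you subclass nn.Module:

    # Keras Functional API: compose by calling layers on symbolic tensors.
    import tensorflow as tf

    inputs = tf.keras.Input(shape=(32,))
    hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
    outputs = tf.keras.layers.Dense(10)(hidden)
    model = tf.keras.Model(inputs, outputs)

    # PyTorch OOP style: subclass nn.Module and define forward().
    import torch.nn as nn

    class MLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(32, 64)
            self.fc2 = nn.Linear(64, 10)

        def forward(self, x):
            return self.fc2(self.fc1(x).relu())

The Functional version is just a dataflow graph you can slice and recombine; the Module version is arbitrary Python, which I think is what people mean by "more hackable".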

All in all, it is still early days. We didn't have a competent all-in-one OSS SQL database until the late 2000s, which was 20-ish years after the theory was ready and being taught extensively in school. And even after that, the 2010s brought plenty of innovation around databases for new use cases. Frameworks for differentiable programming have a long way to go.


> Their development on ShardedTensor seems to have slowed or stopped recently

What makes you say that?


Just looking at commit frequency on the main branch, and activity on the relevant RFCs in GitHub issues. I have no visibility into what's going on at Meta, of course, so it could just be that there's a lot of currently internal development, or a lot of work going into a different branch at the moment, or that they're waiting for things to happen in torch's RPC support; I'm not sure. Or it could be they're just waiting to release what they have as a beta in the 1.13 or 1.14 release?

I'll note that, checking right now, there was a commit 15 hours ago, but the last commit before that seems to be from 28 days earlier. So some work is still going on at least, thankfully :)


ShardedTensor was merged / subsumed into DTensor: https://dev-discuss.pytorch.org/t/rfc-pytorch-distributedten...

Lots of development and traffic happening here: https://github.com/pytorch/tau/


So is DTensor a Google thing like XLA or more of an open standard?


Looks like it's just a name collision. It's a tensor used in distributed models, thus Distributed Tensor, or DTensor for short.


Awesome, thanks for pointing this out!


How do you envision a distributed tensor API working? (Perhaps a code snippet of an ideal API?)


Sorry, on my phone at the moment so I don’t think I can really type some decent code right now!

I actually think that torch's ShardedTensor looks very promising. Essentially you can initialize a sharded tensor from an already-initialized tensor, or initialize one on a meta device where it's not allocated locally and each shard gets initialized on the specified remote devices (useful for extremely large tensors).

The sharding is described by a ShardingSpec: you can either let it create equally sized shards across the requested devices, with the split along a single dimension, or do grid sharding along multiple dimensions. They also have a more general sharding spec that lets you choose explicitly which indices go on which devices, if you need non-uniform shards.
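Since I can't type real code right now, here's roughly the shape of API I have in mind (a hypothetical sketch: ChunkSpec, shard_tensor, and the placement strings are made-up stand-ins, not torch's actual ShardedTensor interface):

    # Hypothetical sketch of a chunk-style sharding API -- illustrative only.
    from dataclasses import dataclass
    import torch

    @dataclass
    class ChunkSpec:
        dim: int          # dimension to split along
        placements: list  # e.g. ["rank:0/cuda:0", "rank:1/cuda:1"]

    def shard_tensor(tensor, spec):
        # Split an already-initialized tensor into equal chunks along spec.dim.
        # A real implementation would ship each chunk to the device named in
        # spec.placements; here we just return the local pieces.
        return list(torch.chunk(tensor, len(spec.placements), dim=spec.dim))

    weight = torch.randn(1024, 1024)
    spec = ChunkSpec(dim=0, placements=["rank:0/cuda:0", "rank:1/cuda:1"])
    shards = shard_tensor(weight, spec)  # two 512 x 1024 shards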

I think once these are implemented (along with some special cases like cloned tensors, and things like that), and once the distributed autograd engine has full support for CUDA, it should be pretty easy to start building out distributed versions of common neural net operations.

The one thing I haven't thought about a ton, to be frank (and I'm sure smarter people have :)), is that you'll end up with a sharding spec for the weights as well as one for the inputs, and it isn't obvious how best to make sure everything matches up. Is the best way to handle that custom logic for each operation? And do you have each operation just reshard the input automatically? That seems like a potentially big performance pitfall.


Merging Keras into TF and trying to copy PyTorch by adopting their model was the downfall of TF. The initial TF releases were great. It was simple, easy to reason about, and solved a clear problem. But then Google wanted TF to appeal to everyone and solve all problems for industry and research and beginners and experts at the same time because people need to get promoted internally. And that never goes well. Nobody I know wants to use TF these days (but some are forced to).

Let's hope JAX won't suffer the same fate.


> The initial TF releases were great.

I object to this statement. The earlier releases of TF (ca. 2015) were impossible to debug, and the documentation was always broken - if you tried to follow the Seq2Seq tutorial you know what I'm talking about. I'd argue that those releases were great for the 50 people who already knew how to use it, but they were aggressively unhelpful for beginners.

PyTorch won my lab (and certainly others) because you could add prints to check your dimensions while TF forced you to build a correct computation graph all at once. Performance? Sure, TF is probably faster. But I'd argue that TF's big mistake was not taking their new users into consideration.
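For anyone who never lived through TF1, the difference looked roughly like this (a minimal sketch with arbitrary shapes): in eager PyTorch you can print intermediate shapes as you go, while in graph-mode TF1 nothing had a value until you ran the session.

    # PyTorch (eager): just print intermediate shapes while debugging.
    import torch

    x = torch.randn(8, 32)
    w = torch.randn(32, 10)
    h = x @ w
    print(h.shape)  # torch.Size([8, 10]) -- visible immediately

    # TF1 (graph mode), sketched as comments since it needs the v1 compat API:
    #   x = tf.placeholder(tf.float32, [None, 32])
    #   w = tf.get_variable("w", [32, 10])
    #   h = tf.matmul(x, w)   # symbolic tensor, no value to print yet
    #   with tf.Session() as sess:
    #       sess.run(tf.global_variables_initializer())
    #       print(sess.run(h, feed_dict={x: batch}).shape)  # errors surface here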


I agree. I truly miss the graph-mode API, especially coming from Theano before TF, but it wasn’t as beginner friendly and Google wanted to capture market share for their cloud.

At least with JAX the core library isn't adopting any of the framework-level stuff, so those can evolve independently.


Yup. For a lot of models I preferred the graph mode. It was explicit, with no magic. I think they should've just stuck with that, even if it meant not everyone and their mom could use it.

Agree on the framework stuff. Please just be a library, not a collection of opinionated frameworks where I need to read the source code anyway to understand what it actually does. Once, after something wasn't working and hours of debugging, I remember looking at the number of weights in the model and thinking: wait, something can't be right here. Then I dug into the framework layers and figured out it had added slightly different things than I thought it would. It would've been much faster to just write the graph myself.
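That parameter-count sanity check is worth doing up front, by the way (a tiny sketch; the layer sizes are arbitrary): compare the framework's reported count against what the shapes you think you built should give.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])

    expected = (32 * 64 + 64) + (64 * 10 + 10)  # weights + biases per layer
    print(model.count_params(), expected)       # mismatch => hidden layers or magic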


The original TF was truly horrendous :) It was extremely unintuitive and slow to program - that's why so many people switched to PyTorch. The merge was a good idea, it just came too late. TF is as popular as it is just because it was first to market, and because of Google's megaphone. Even inside Google researchers don't want to use it, and a large fraction have switched to JAX.


The downfall was fragmentation. They merged Keras after TF started having multiple competing application libraries, each managed by different people (and each fighting for promotion, as you say). Keras just happened to be the most popular. Even after Keras, various teams (e.g., DeepMind) have decided to make their own libraries.


On Hugging Face, almost 85% of models are exclusive to PyTorch, and even those that are not exclusive have about a 50% chance of being available in PyTorch as well. In contrast, only about 16% of all models are available for TensorFlow, with only about 8% being TensorFlow-exclusive.

(source: https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-202...)


That matches my experience reading a lot of ML research papers. Most papers that include code are PyTorch-only. Some have both PyTorch and TF. Few are TF-only. Then there is JAX and it's been growing.


I feel like the comments on backwards compatibility are due to the absolute shitshow of TF2 compatibility for TF1 code and models.

Also, the threat of PyTorch can be read between the lines, especially since it's now run by a foundation and is the darling of the diffusion-model developments.


PyTorch has been a darling of almost every noteworthy open source model for the past 3-4 years (BLOOM, GPT-J, StyleGAN3, detectron, etc). Personally, I've only seen people use TensorFlow/XLA if they got free TPU credits from Google (gpt-neo), or if it was released by Google (t5).


GPT-J is actually JAX


My mistake, I was thinking of gpt-neox. Thanks for the correction.


I just wish they would put some focus on making tensorflow reliable in a production environment.

It can easily be wildly non-deterministic across different CPUs or GPUs, or even in the same session, with the same input.

Performance seems to get worse with new releases, and there are frequent subtle breaking changes when using models built with old versions on newer releases.

TensorFlow Serving is barely controllable and requires insane tuning to make it perform the same as PyTorch, but there is little to no documentation.

The vast majority of models people build just don't work in TensorFlow Serving either, as you can't reach in with hacky Python to mess with internal state.

If you use a custom host instead, then you have to deal with literal gigabytes of Python dependencies, making your Docker images huge.

Memory usage is uncontrollable and causes terrible performance or instant host death. Results vary depending on CPU count, and automatic parallelism can reduce performance.
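For what it's worth, the knobs I end up reaching for only go so far (a sketch; enable_op_determinism needs TF 2.8+ and has to be called before any ops run, and none of this fixes Serving itself):

    import tensorflow as tf

    # Opt into deterministic kernels (TF 2.8+): slower, but reproducible.
    tf.config.experimental.enable_op_determinism()

    # Pin thread pools so behavior doesn't shift with the host's CPU count.
    tf.config.threading.set_intra_op_parallelism_threads(4)
    tf.config.threading.set_inter_op_parallelism_threads(2)

    # Stop TF from grabbing all GPU memory up front.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)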

I just don't understand how Google uses TensorFlow internally for real-world services.


> Tensorflow serving is barely controllable, and requires insane tuning to make it perform the same as pytorch

What are you using for serving PyTorch models?


I went to PyTorch from Keras/TensorFlow and couldn't be happier. Good to see they are still trying, though. Competition is always good.


I had my last-straw moment with TensorFlow some time ago. JAX has been a pleasure and I'm never looking back. It got to the point on my team where it was just easier to rewrite our entire distributed training infrastructure in JAX with pmap than to coerce TF2 into doing what I wanted it to.
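For anyone who hasn't seen it, a data-parallel step with pmap is pretty compact (a minimal sketch with a toy squared-error loss; it assumes the leading axis of each array is one shard per local device):

    import functools
    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    # One data-parallel SGD step: each device holds its own x/y shard, and
    # gradients are averaged across devices with pmean.
    @functools.partial(jax.pmap, axis_name="batch")
    def train_step(w, x, y):
        g = jax.grad(loss)(w, x, y)
        g = jax.lax.pmean(g, axis_name="batch")
        return w - 0.1 * g

    n = jax.local_device_count()
    w = jnp.zeros((n, 32, 1))   # replicated weights, one copy per device
    x = jnp.ones((n, 8, 32))    # per-device input shards
    y = jnp.ones((n, 8, 1))
    w = train_step(w, x, y)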


What was TF unable to do that you could do in JAX?


TF needs to stop violating the principle of least surprise, and stop the spooky action at a distance.


Reason about my program's behavior.


Fix the damn abstraction bugs! Models of models only work in a limited fashion. It's a damn graph; I can't understand why they can't get this to work. For me the TF ship has sailed.


Was honestly expecting this to be a deprecation announcement…

Not because it's Google, but because JAX has so much momentum lately.


I believe it is, whether they realize it or not. JAX/PyTorch + XLA is just so much better in so many ways. Development on XLA will (thankfully) continue, and JAX and PyTorch will continue to cannibalize the TF userbase.


The blog post doesn't mention DeepMind. I assume that's because they exclusively use JAX?

https://www.deepmind.com/blog/using-jax-to-accelerate-our-re...


Is it just me, or does this suggest between the lines that the current approach to TensorFlow is not working?


Yes. If you see a blog post from Google that follows this title pattern and says what this one says, it means there's an explicit acknowledgement that something serious wasn't working and the leadership decided to course-correct.

TF will continue to have a place at Google for prod work, but its application base is going to continue to shrink. I'm just blown away they're rejiggering the distributed model again.


It's just that TensorFlow is big and bloated, has quite a few quirks, and isn't the cool thing to use anymore, so they're doing some work to address those issues. TensorFlow is still very popular and you can get work done with it.


Looks good. Does adopting the numpy API standard while maintaining backwards compatibility mean that they'll have to duplicate a lot of functionality?


Nope. It’s a different namespace and the numpy methods just wrap existing TF methods: https://github.com/tensorflow/tensorflow/blob/v2.10.0/tensor...
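Concretely, the NumPy-style surface lives under its own namespace and interoperates with ordinary TF tensors (a small example; exactly how much of the NumPy API is covered depends on the TF version):

    import tensorflow as tf
    import tensorflow.experimental.numpy as tnp

    tnp.experimental_enable_numpy_behavior()  # opt in to NumPy-style semantics

    a = tnp.ones((2, 3))                      # ndarray-style, backed by a tf.Tensor
    b = tf.constant([[1.0, 2.0, 3.0]])
    c = tnp.matmul(a, tnp.transpose(b))       # mixes freely with regular TF ops
    print(c.shape)                            # (2, 1)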


It would be nice if they made it easier to build at all.


Hey - it's using Bazel, that makes things great! Well, Bazel won't work on your system, because protobuf needs a patch for a new libc. Why didn't you use Ubuntu? Now it builds! Oooh, wait, they're vendoring protobuf in TensorFlow.....

(All that, while depending on a single-file, 3-release Python library...)


I think F. Chollet's attitude has really kept Keras from taking off. I really like the API, but it's brittle in many ways, like models in models.


> The future of TensorFlow will be 100% backwards-compatible

I love to hear that. But does Python itself even guarantee 100% backwards compatibility?


Guaranteeing backwards compatibility at the level of a general-purpose language like Python is how you end up with stagnation (see CPP).

For domain-specific languages like TF, I guess they were motivated to commit to it to ensure adoption of the new versions.


Assuming you are talking about C++, I honestly don't see any stagnation. In terms of backwards compatibility, I'm very happy that my 20-year-old C/C++/Perl code all still works correctly.


> how you end up with stagnation (see CPP).

C++ isn't stagnant; if anything it should slow down a bit, as it moves faster than an overcaffeinated hamster on a rocket.



