Tensorflow 2.0 (github.com/tensorflow)
465 points by Gimpei on Sept 30, 2019 | 106 comments

As someone who uses tensorflow a lot, I predict an enormous clusterfuck of a transition. Tensorflow has turned into a multiheaded monster, supporting many things and approaches but none of them very well.

I mean, when relying on third party code, things like `tf.enable_control_flow_v2()` and `tf.disable_control_flow_v2()` can and will go horribly wrong. It looks like some operations are changing behaviour depending on a global flag being set. And not just some operations, but control flow operations! That will lead to some very hard-to-figure-out bugs.
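To make the failure mode concrete, here is a toy sketch in plain Python, not TensorFlow code. Every name in it (`cond`, `enable_v2`, `disable_v2`, `third_party_model`, `v2_scope`) is hypothetical and invented for illustration. It shows how a process-wide flag changes the code path of a library call the caller never touched, and how a scoped context manager would at least limit the damage.

```python
# Hypothetical mini-library; none of these names are TensorFlow APIs.
from contextlib import contextmanager

_USE_V2_CONTROL_FLOW = False  # process-wide switch, like a global enable/disable flag


def enable_v2():
    global _USE_V2_CONTROL_FLOW
    _USE_V2_CONTROL_FLOW = True


def disable_v2():
    global _USE_V2_CONTROL_FLOW
    _USE_V2_CONTROL_FLOW = False


def cond(pred, true_fn, false_fn):
    """Pretend control-flow op that records which implementation ran."""
    branch = true_fn if pred else false_fn
    tag = "v2" if _USE_V2_CONTROL_FLOW else "v1"
    return (tag, branch())


def third_party_model(x):
    """Library code written and tested with the v1 behaviour in mind."""
    return cond(x > 0, lambda: x * 2, lambda: -x)


@contextmanager
def v2_scope():
    """Scoped alternative: restore the previous flag value on exit."""
    global _USE_V2_CONTROL_FLOW
    previous = _USE_V2_CONTROL_FLOW
    _USE_V2_CONTROL_FLOW = True
    try:
        yield
    finally:
        _USE_V2_CONTROL_FLOW = previous


before = third_party_model(3)   # flag is off: v1 path
enable_v2()                     # flipped somewhere else in the program
after = third_party_model(3)    # same call, different code path
disable_v2()
with v2_scope():
    scoped = third_party_model(3)
restored = third_party_model(3)  # flag is back to its old value
```

In the toy version only a tag changes. The complaint above is that in a real framework the two paths can differ in ways that are much harder to spot.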

In my opinion there are some architectural problems with TF, which have not been addressed in this update. There is still global state in TF2. There is still a difference in behaviour between eager and non-eager mode. There is still the control flow as a second class citizen.

If you need to transition from TF1 to TF2, consider doing the TF1 to pytorch transition instead.

Not only is upgrading hard, but so is installation (on Windows at least). For each Tensorflow version you need a specific python version, a specific CUDA version, a specific tensorflow-gpu version, and many other easy-to-get-wrong things. The problem is not the requirements, but that it's very hard to know what versions are compatible. There are endless threads on Github of people trying to use Tensorflow but failing after spending days trying to install it.

Try using these containers[1] from my peer team at IBM. They run on a variety of architectures (x86, ppc64le, with and without GPUs in most cases).

In addition if you don't want to fiddle with the containers, there's also a conda channel[2] that lines stuff up. I work on a peer team in machine vision, and use these for personal and professional projects.

[1] https://hub.docker.com/r/ibmcom/powerai/

[2] https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.6.1/...

Tensorflow sounds like an ideal candidate for running in a container. List out all your approved compatible versions in the Dockerfile and distribute it with your source code and anyone can reproduce your results with the exact same setup.

Yes, but for personal use, unless someone else already made those containers, I will still have to go myself through the trial-and-error process of finding the right combination of versions. Yes, second installation will be easier, but if I just want it on my PC it doesn't really help.

You can `docker pull username/mytensorflowcontainer` and start from someone else's work. Looks like Tensorflow has a how-to on the site: https://www.tensorflow.org/install/docker including working cpu-only and gpu-enabled examples.

I've been using containers for this, and mounting the source code.

I've been using Anaconda and it finally made Python work without endless rounds of 'incompatible library bingo'.

It also works well for machine learning projects.

Anaconda may work well as a virtual environment for some ml projects, but it is by no means a solution for getting a gpu-working installation of tensorflow on Windows.

> getting a gpu-working installation of tensorflow on Windows

I think part of your problem is the last word there. Windows is a bad match for such an environment. On Ubuntu it is pretty much painless.

I agree, and prefer Linux myself, but some clients only allow for solutions based in Windows (and no containers) :/

+1 for Anaconda, worth trying for anyone who has a problem with package versions.

As a software engineer and non-data scientist, I hate Anaconda because it feels like it's a tool that tries to be the be-all, end-all package management tool for everyone in the data science field, yet it feels like a sloppily built, bloated whale. It's even managed to overwrite PATH on some of my Linux machines, which is where I drew the line.

I vastly prefer creating hermetic environments with either venv or Docker. They're much cleaner and easier to work with. I wish data scientists would adopt these tools instead.

Sadly, many of the ML models I investigate on Github don't even have their package requirements frozen. It's an uphill battle...
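For what it's worth, checking an environment against frozen requirements does not take much. A minimal sketch, assuming a requirements file that only uses `name==version` lines: `parse_pins` and `check_pins` are hypothetical helpers, and the only library call relied on is the standard `importlib.metadata.version`.

```python
from importlib import metadata


def parse_pins(lines):
    """Return {name: version} for lines of the form 'name==version'."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue                          # skip unpinned or empty lines
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins, installed_version=metadata.version):
    """Return {name: (wanted, found)} for every pin that does not match."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None                      # package is not installed at all
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


# Example input; in practice the lines would come from a requirements file.
example = ["tensorflow==1.14.0  # pinned", "numpy", ""]
pins = parse_pins(example)
mismatches = check_pins(pins)  # empty dict only if the pinned version is installed
```

Note that the unpinned `numpy` line is silently skipped here, which is exactly the kind of entry that makes a repo hard to reproduce.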

> I vastly prefer creating hermetic environments with either venv or Docker. They're much cleaner and easier to work with. I wish data scientists would adopt these tools instead.

I suspect you have a lot of time on your hands. But for me the 'batteries included' approach really nails it, why repeat the headache over and over again when a single entity can take care of that in such a way that incompatibilities are almost impossible to create? The hardest time I've had was to re-create an environment that ran some python code from a while ago, with Anaconda it was super easy.

I'm sure it has its limitations and just like every other tool there are situations where it is best to avoid it but for now it suits me very well.

I would suggest you try out Miniconda (https://docs.conda.io/en/latest/miniconda.html). It comes with just the basics, and lets you install TF with GPU support by simply doing:

conda install -c anaconda tensorflow-gpu

it's incredibly slow.

What’s incredibly slow? Installing things with conda?

conda install for our environment.yml: about 3-5 minutes solving, then 5-10 minutes installing.

pip install with almost exactly the same set of packages: 3-5 minutes total.

They're (slowly) making this better. Starting with 1.15, there is only one tensorflow pip package, no tensorflow/tensorflow-gpu hell any longer.

Really? https://www.tensorflow.org/install/gpu says to `pip install tensorflow-gpu`.


Ah I guess it's on the way but not fully there yet

> If you need to transition from TF1 to TF2, consider doing the TF1 to pytorch transition instead.

Or consider https://github.com/google/jax !

Just looked at Jax. At first glance, it seems to be a GPU/TPU based NumPy?

The thing is, TF has more than tensor ops. It has pre-defined NN layers, data loading/serialization, distributed training, metrics, and model serving.

It seems like a bit of a step backwards, that's all.

Edit: "matrix ops" -> "tensor ops"

Have you ever used any of those features of tensorflow though? They're all, er, idiosyncratic. If you're a decent software engineer and are following the mafs well from a book, I couldn't recommend Jax highly enough. (I work on big tf RL projects every day).

That’s true. I’m still a student and haven’t really shipped a large ML project. Most of my ETL is done using scrapers, manual Pandas transformations, and storage in flat files. But yeah, I see your point, and frankly I don’t think there’s too big of a user base around these specialized features. Not to mention that some can be, erm, difficult to use. A friend tried using the YouTube-8M dataset in TFRecords format for a project and was extremely annoyed at the complexity.

> At first glance, it seems to be a GPU/TPU based NumPy?

Yes, with a compiler to make this fast.

> The thing is, TF has more than tensor ops. It has pre-defined NN layers, data loading/serialization, distributed training, metrics, and model serving.

Yes, it is a simpler and smaller API.

For things like data loading, you can use the tool of your choice -- TF, pytorch, whatever. For pre-defined NN layers, there are libraries that build this as a very thin wrapper around JAX's low-level API, see e.g. lax, which is included in JAX.

> see e.g. lax, which is included in JAX.

I think jleber means stax here, for pre-defined NN layers.

Yes, mb!

I know. I don't want to go as far, but if I had to choose, I would also go for Jax, and help make it feature complete. However, it is not very feature-complete yet, and thus probably not as useful for everyone yet.

Agreed. Tensorflow kinda reminds me of OpenGL in that its dependence on global flags causes some really annoying bugs, especially when you're using third-party libraries or pre-trained models. `enable_eager_execution`, `enable_tensor_equality`, and `enable_v2_tensorshape` have all completely broken my code at one point or another.

»If you need to transition from TF1 to TF2, consider doing the TF1 to pytorch transition instead.»

- that’s exactly what we did and we don’t regret that decision

Is it production ready for serving PyTorch models? How about if I wanted to use something like Go to serve those models? That's fairly straightforward with Keras (Python) trained TF models.

On our team, we serve PyTorch models in production using libtorch, a C++ library for loading models. You can easily call the C++ code in Go if you wrap it in a C interface.

Last I checked there was basically zero serving story for pytorch. The trade-off seems all too common: tensorflow optimizes for enormous production applications first while pytorch optimizes for developer ease first.

If you are trying to apply one of these libraries to a production system that doesn’t get a lot of throughput you probably shouldn’t be using them (try a linear model first). If you have a high throughput application you probably want tensorflow and just deal with the shittiness.

I'm actually looking to learn PyTorch. Where's the best place to start?

If you have any existing ML/DL experience, picking up PyTorch is a breeze. You could get a pretty solid understanding of the framework with an MNIST handwritten digit recognition model over the course of an afternoon, so don’t sweat looking for the “right” tutorial.

Awesome, I've done some ML before. I'll just dive into the docs.

Highly recommend it! I love pytorch so much, it's basically numpy with automatic backprop and CUDA support. It evaluates eagerly by default, which makes debugging a lot easier since you can just print your tensors, and IMO it's much simpler to jump between high-level and low-level details in pytorch than in tensorflow+keras. Just as one example, activation functions in pytorch are applied by calling a python function on your layer, instead of passing a string argument with the function name to the layer constructor, so you write

  layer = F.relu(Linear(in_features, out_features)(input))

instead of

  layer = Dense(units, activation='relu')(input)

As a result, it's a lot more straightforward to try out custom activation functions in PyTorch.
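A framework-free toy sketch of the two API styles being contrasted, for anyone who has not used either library. This is not PyTorch or Keras code: `dense_by_name`, `linear`, `leaky_relu`, and the `ACTIVATIONS` registry are hypothetical names, and a single neuron on plain Python lists stands in for a real layer.

```python
import math

# String-based style: the layer looks the activation up by name,
# so a new activation has to be registered before it can be used.
ACTIVATIONS = {
    "relu": lambda v: max(0.0, v),
    "tanh": math.tanh,
}


def dense_by_name(inputs, weights, bias, activation):
    """One neuron whose activation is chosen by a string key."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return ACTIVATIONS[activation](z)


# Functional style: the layer is linear only, and the activation
# is just a Python function applied to its output.
def linear(inputs, weights, bias):
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def leaky_relu(v, slope=0.1):
    """Custom activation: no registry entry needed."""
    return v if v > 0 else slope * v


x, w, b = [1.0, -2.0], [0.5, 1.0], 0.0  # pre-activation is 0.5 - 2.0 = -1.5
by_name = dense_by_name(x, w, b, "relu")   # relu clamps the negative value to 0.0
functional = leaky_relu(linear(x, w, b))   # leaky relu keeps a scaled negative value
```

In the functional style, swapping in `leaky_relu` is a one-token change at the call site, which is the convenience the comment is describing.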

Pytorch website has a “blitz” tutorial that was fantastic.

The problem is that tensorflow is an umbrella name for a bunch of related technologies: it's a matrix calculation engine, graph definition language, distributed graph calculation engine, ML algorithmic libraries, ML training libraries. On top of that it's extremely poorly documented. At the end of the day, when you use it for anything beyond the most trivial stuff, the pieces turn out to be incompatible with each other (this operation is not implemented for TPUs or GPUs, this API doesn't work with that API) and most of development is cut-n-paste trial and error. Then you go to read the source, but creative Python renaming and importing leads you on a multi-hour wild goose chase.

If you switch to PyTorch, what are you going to use for prod deployment? Is there any way to use TPUs?

> If you switch to PyTorch, what are you going to use for prod deployment? Is there any way to use TPUs?

PyTorch has an optional XLA device that lets you use TPUs: https://github.com/pytorch/xla

I'm still not sure what the difference between tf.keras and the other keras repo is!

And that's before Sonnet, the Estimator API (?) and TFLearn (I'm probably forgetting a bunch).

> This is also the last major release of multi-backend Keras. Going forward, we recommend that users consider switching their Keras code to tf.keras in TensorFlow 2.0


The intent of having separate repos for tf.keras and keras was to support third-party platforms in the latter.

But since the supported third-party platforms (CNTK and Theano) have stopped development due to TensorFlow, well...

Cannot but agree. On top of that, the 1.14 documentation was simply deleted from the tensorflow website and now we are scratching our heads at what to do when it comes to model maintenance. We serve our models via the TF Java API since our system is Java/Scala based. We can't even update the existing TF Java API because it is incompatible with anything prior to 1.15. It's an utter mess.

Is the friction a result of trying to support eager execution?

Sorry, maybe a stupid question, but wouldn't MXNet be better for what you do in TF?

Agreed, MXNet is faster and considerably easier to use than TF.

There's absolutely no evidence that MXNet is faster than TF. At the high end, all three (TF/PyTorch/MXNet) are similarly performant. The reality is that implementation matters more than framework when you are talking about performance.

My question was more about usability or interchangeability. Could TF get replaced by MXNet in a typical deep learning project?

In terms of features and functionality, they are very much interchangeable. A new project could be written in any of the three major frameworks and be equally good. The only standout feature I'm aware of is that TF has the best support for doing inference on devices, but that won't be true forever. In terms of actually migrating a codebase from one to the other, the APIs are different enough in small ways that it would be a large amount of effort.

Recently I found that a lot of TF2.0 Keras functionality does not support eager execution. This makes Pytorch still significantly easier to prototype with than TF2.0.

If you miss Keras' way of defining NNs, you can use PyWarm: https://github.com/blue-season/pywarm which offers a fully functional NN building API for pytorch.

Does pytorch have a Tensorboard equivalent? For rapid prototyping, I find Tensorboard a lot more useful than, say, print statements and such that you can get through eager execution. Tensorboard is also crucial for post-hoc analysis, and the Summary format is clean enough to use as a primary data artifact (e.g. use loss recorded in summaries versus some alternative hand-crafted text file).

It has TensorBoard integration itself (which can be installed independently).

"Visdom" [0] isn't well known, but it's powerful and easy to use. You can even centralize multiple remote experiments in the same dashboard. Very useful for following in real-time what is happening to your networks.

[0] - https://github.com/facebookresearch/visdom

There is also tensorboardX:


I decided a few weeks ago to transition to PyTorch (was using Keras before) and I must say that I really love it! How PyTorch is structured gives me the right balance between ease of use and the ability to make customisations. Further, using DistributedDataParallel, dividing the work over multiple processes, where each process uses one GPU, is very fast and GPU memory efficient.

Before my switch I tried out Keras for Tensorflow, and even got a lot of support from Google in my endeavours to resolve the issues I encountered (kudos to Google for that!). In the end I felt it was still not mature enough. Further, although I do believe TF and Keras are moving in the right direction, I still felt that in some cases the way the software was set up just didn't sit well with me.

Maybe it is worth trying again in a year or so, or by then I will try out Swift for Tensorflow, which I think has a great future ahead.

The most important change in terms of usability, IMO, is the use of tf.keras as the recommended interface to TensorFlow. There hasn't been a case yet where I've needed to dip outside of Keras into raw TensorFlow, but the option is there and is easy to do.

That said, TF 2.0 changes a lot. Many repos might break, so expect to see lots of tensorflow==1.14 in requirements.txt files from now on.

Disclosure: I'm a big BIG tensorflow fan.

I've been using the rc's for a while now and I must admit, it's a big step up for projects you are starting from scratch. Migrating... Probably not as clean as I would like to admit but it does the job. Overall tf 2.0 removes a lot of the boilerplate code, which is awesome.

Might be slightly off-topic, what’s the latest state-of-play for AMD/OpenCL? I hear in various places that AMD is fantastic for compute, but everyone seems to be using CUDA.

I’ve got a very expensive Bitcoin mining rig paperweight at the moment with two Vega 64s (along with another Vega 64 in my main rig) — it’d be great to re-purpose them for something potentially useful.

AMD is great when it comes to raw compute power for the money.

However, AMD has a history of shipping poor OpenCL drivers, so everyone just went with what actually worked - NVIDIA.

I don't know too much about it myself, but you could check out https://rocm.github.io/dl.html

Edit : directly linked to the ROCm deep learning page.

Still pretty terrible compared to PyTorch _unless_ you need deployment to device, in which case it's basically the only game in town. Or at least the only _viable_ game.

Case in point: people are still trying to figure out on their github how to apply global weight decay when training a model, and to get a "correct" resize for segmentation you have to fall back to the legacy 1.x API and specify align_corners=True there. These bugs existed for many years, and nobody gives a damn. That said if your choice is between 1.x and 2.0, 2.0 is much easier to work with, especially if you use something other than TF (e.g. PyTorch) for data pipeline and augmentation. You can hook that up pretty seamlessly if you train in eager mode.

I think it’s getting easier by the day to port a PyTorch model to something that can be production ready, like Tensorflow Lite. It’s cumbersome, but doable. For me, I like to optimize my workbench and just deal with the final steps of pain to get it to prod.

There might be some additional help on the way directly in PyTorch for on-device :)

- iOS: https://github.com/pytorch/pytorch/pulls?utf8=%E2%9C%93&q=is...

- Android: https://github.com/pytorch/pytorch/pulls?utf8=%E2%9C%93&q=is...

Has ONNX not caught fire on mobile yet?

It has, but the support is not as good as tensorflow's. This is the one thing I truly miss in pytorch. Otherwise pytorch is wonderful.

It has? Where? Who uses ONNX for anything? I doubt it even can work, period. The moment you do anything other than a bare bones classifier (which nobody really runs on devices - you need more complex models to solve real world problems) you run into ops unsupported by your inference framework, and that's if ONNX is supported by its tooling in the first place. In fact you could also run into unsupported ops during export as well: that is, it is somewhat likely that you won't even be able to export your model unless it consists entirely of the ops ONNX standard implements. The rest can be exported as opaque ops, but your inference tooling will not know what to do with those for sure.

It can work for more than a bare bones classifier. It's certainly not painless and sometimes you need some manual work to translate your model but work it does ...

For mobile, the situation seems worse than a year ago. TF Mobile mostly worked. The current situation with TF lite is a joke .. tried a real-world Pytorch -> ONNX -> TF/TF lite and it has been weeks of misery.

I agree with you, only basic models work!

Doesn't CoreML abstract 90% of the actual model format/training source away? Last time I played with it in Xcode, it was painless to pull a pretrained TF model and use as-is.

The number of operations supported by CoreML or ONNX is limited.

Core ML 3 adds a ton of new operations, including control flow support.


What about Android?

Android has its own incompatible world with NN and NDK fun.


Or you get to use Tensorflow Lite and target both platforms, also with a smaller feature set than its big brother.


But, crucially, larger feature set than the likes of CoreML etc. You don't get access to the NPU that way though, at least not on iOS. The only acceleration option there seems to be Metal. Which isn't bad, but also not the most power efficient thing the hardware supports.

Still though, it's the only game in town if you don't want to have insane un-debuggable headaches everywhere you deploy to device. Plus it also supports embedded Linux boards, and pretty much all current TPU-like things available there.

What's an example of a missing operator?

I haven't explored yet but has anyone tried TVM for PyTorch? https://github.com/pytorch/tvm

The reality is that most production ML workflows still use 1.x (not even the latest 1.14[5]). Migration to 2.x does not make sense in many existing use cases unless you want to build your new algorithms from scratch. From previous experience with TF, the API changes so much that it is impossible to keep up from an enterprise perspective. I expect the TF team to do some type of LTS version for enterprise and continue improving 2.x, which clearly was a response to Pytorch and is an evolution of tf.keras and eager execution plus all the cleanup.

Is there something which has opinionated defaults where you can hack on projects without getting into all the boilerplate?

Like there is a mostly finite set of typical solved things you would want to use ML for like image classification, object detection, etc.

I find myself spending my time copying an example verbatim and replacing the csv / images with my own.

Are you looking for advanced models that you can train on your data pretty much out-of-the-box or simple, easy-to-read models that help you learn the underlying concepts?

If it's the first case, what you want to do is find the best GitHub repos for the task(s) you are trying to do. Make sure the GitHub repo has a model zoo and good support and start from there. In CV, if you are trying to do high-end work, the repos to check out are:

- https://github.com/facebookresearch/maskrcnn-benchmark (don't be misled by the name, it has support for lots of the high-end modern CV models)

- https://github.com/open-mmlab/mmdetection

- https://github.com/TuSimple/simpledet (haven't explored this one as much, but it looks very solid)

If you want easy-to-read code for non-trivial tasks, I would suggest taking a look at Gluon (GluonCV - https://gluon-cv.mxnet.io/ and GluonNLP - https://gluon-nlp.mxnet.io/). I haven't worked much with the fast.ai library, but that's probably also a good suggestion.

I think this is what TF's Estimators are about. The idea is that they're going to reduce boilerplate on the assumption that most users of Tensorflow actually only use the same handful of models (ResNet, VGG, Inception, etc).

I've been learning Tensorflow recently for a side project, and the style transfer work I'm doing means I need to build my own Tensorflow graphs, so I haven't had much use for this kind of thing. But it sounds like it was made for you, not me.

Keras maybe?

fast.ai might do the job - https://docs.fast.ai

I think that I will take the effort to update most of my side projects to TF 2, or mark the github repos as no longer supported for really old code.

Some may complain about big API changes but I think it is occasionally healthy to tag old stable versions and do massive code refactoring.

I assume that TensorFlow.js is also getting updated - I find it almost equally nice to prototype with (the bundled examples are first rate).

On a different note, what are some good resources to get started with deep learning using TF (besides the stuff Google put on YT)?

The 2nd edition of Aurélien Géron's book[1] was written for Tensorflow 2.

[1]: https://www.oreilly.com/library/view/hands-on-machine-learni...

The first edition of the book is fantastic!

Aurelien's writing is clear and clean compared to most other books on ML.

Any online courses using TF 2.0, such as Coursera? Or books coming out soon? That's key for proper adoption.

If I were to start literally at ground zero, with zero knowledge of programming or tensorflow, where should I go?

Install jupyter notebook and play around with numpy/scipy. Neural networks are not really for outright beginners, and tensorflow is doubly not.

I'm also new to programming and I think you can start with Github:) A lot of free tutorials are available in Github like https://github.com/30-seconds/30-seconds-of-python https://github.com/fonnesbeck/statistical-analysis-python-tu...

Linear regression might be the simplest thing to do with Tensorflow's autodifferentiation, if you want to jump straight in...
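A framework-free sketch of what that exercise looks like, using plain Python instead of Tensorflow so it runs anywhere. The gradients are written out by hand here; the point of an autodiff framework is that it would derive them from the loss for you. `fit_line`, the learning rate, and the step count are illustrative choices, not anything from the Tensorflow docs.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean((w*x + b - y)**2), derived by hand here;
        # an autodiff framework would compute these from the loss.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # noise-free data on the line y = 2x + 1
w, b = fit_line(xs, ys)       # should recover a slope near 2 and intercept near 1
```

Once this makes sense, the same loop in a real framework replaces the two `grad_` lines with a call to its gradient machinery.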

So if you want to work in deep learning should you learn pytorch and tensorflow? Or just one, which one?


As far as production deployments of ML models, what percentage use TF vs PyTorch?

So which Cuda version and cudnn is this compatible with? Same as beta?

This documentation seems to describe 2.0:


It says CUDA 10.0 and cuDNN >= 7.4.1.

Yeah that's same as beta.

Was incredibly fussy when I tried it & tends to silently fail over to CPU if it doesn't like something (and, worse, there are multiple builds of each version, e.g. 7.4.1 comes in a CUDA 10.0 flavour and a 10.1 flavour).

Just use julia.

I like julia a lot (use it everyday, it's my primary language right now), but this isn't really a reasonable recommendation, imho. Julia seems like it could be really great for ML, but I'm not sure if the current libraries are mature enough to wholeheartedly recommend.

I'd love to be proven wrong about that, though.

What are the mature libraries? Use PyCall and whatever your heart desires.

My heart desires python then

Julia consumes python and gives you so much more.
