I mean, when relying on third-party code, things like `tf.enable_control_flow_v2()` and `tf.disable_control_flow_v2()` can and will go horribly wrong. It looks like some operations change behaviour depending on whether a global flag is set. And not just some operations, but control flow operations! That will lead to some very hard-to-diagnose bugs.
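A toy sketch of why this kind of global toggle is dangerous (the names mimic the TF flags but this is not TF code, just an illustration): any library you call can flip the flag behind your back, silently changing what your own code does afterwards.

```python
# Hypothetical illustration -- mimics the shape of the TF flags, not the real internals.
_CONTROL_FLOW_V2 = False

def enable_control_flow_v2():
    global _CONTROL_FLOW_V2
    _CONTROL_FLOW_V2 = True

def cond(pred, true_fn, false_fn):
    # The same call behaves differently depending on hidden global state.
    branch = true_fn() if pred else false_fn()
    return ("v2_cond", branch) if _CONTROL_FLOW_V2 else ("v1_cond", branch)

def third_party_helper():
    # Some imported library flips the switch without you noticing.
    enable_control_flow_v2()

before = cond(True, lambda: 1, lambda: 0)  # built with v1 semantics
third_party_helper()
after = cond(True, lambda: 1, lambda: 0)   # same call, now v2 semantics
```

The two identical `cond` calls produce different results purely because of state mutated elsewhere, which is exactly the debugging trap described above.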
In my opinion there are some architectural problems with TF which have not been addressed in this update. There is still global state in TF2. There is still a difference in behaviour between eager and non-eager mode. Control flow is still a second-class citizen.
If you need to transition from TF1 to TF2, consider doing the TF1 to pytorch transition instead.
In addition, if you don't want to fiddle with the containers, there's also a conda channel that lines stuff up. I work on a peer team in machine vision, and use these for personal and professional projects.
It also works well for machine learning projects.
I think part of your problem is the last word there. Windows is a bad match for such an environment. On Ubuntu it is pretty much painless.
I vastly prefer creating hermetic environments with either venv or Docker. They're much cleaner and easier to work with. I wish data scientists would adopt these tools instead.
Sadly, many of the ML models I investigate on Github don't even have their package requirements frozen. It's an uphill battle...
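For anyone publishing a model repo, freezing requirements is cheap; a minimal sketch of the workflow (the commented-out install line is just an example pin, substitute your actual dependencies):

```shell
# Create an isolated environment for the project.
python3 -m venv .venv
. .venv/bin/activate

# Install whatever the model needs, e.g.:
#   pip install tensorflow==1.14.0
# Then record the exact versions so others can reproduce the run:
pip freeze > requirements.txt

# Consumers restore the environment with:
#   pip install -r requirements.txt
```

A `requirements.txt` with exact `==` pins is the difference between "works on my machine" and a repo someone can actually run a year later.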
I suspect you have a lot of time on your hands. But for me the 'batteries included' approach really nails it: why repeat the headache over and over when a single entity can take care of it in a way that makes incompatibilities almost impossible to create? The hardest time I've had was re-creating an environment that ran some Python code from a while ago; with Anaconda it was super easy.
I'm sure it has its limitations and just like every other tool there are situations where it is best to avoid it but for now it suits me very well.
conda install -c anaconda tensorflow-gpu
pip install with almost exactly the same set of packages: 3-5 minutes total.
Ah, I guess it's on the way but not fully there yet.
Or consider https://github.com/google/jax !
The thing is, TF has more than tensor ops. It has pre-defined NN layers, data loading/serialization, distributed training, metrics, and model serving.
It seems like a bit of a step backwards, that's all.
Edit: "matrix ops" -> "tensor ops"
Yes, with a compiler to make this fast.
> The thing is, TF has more than tensor ops. It has pre-defined NN layers, data loading/serialization, distributed training, metrics, and model serving.
Yes, it is a simpler and smaller API.
For things like data loading, you can use the tool of your choice -- TF, PyTorch, whatever. For pre-defined NN layers, there are libraries that build this as a very thin wrapper around JAX's low-level API; see e.g. lax, which is included in JAX.
I think jleber means stax here, for pre-defined NN layers.
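For reference, a minimal stax sketch; note that in recent JAX releases stax lives under `jax.example_libraries` (older code imported it from `jax.experimental`):

```python
import jax.numpy as jnp
from jax import random
from jax.example_libraries import stax

# A small MLP declared as a composition of layer constructors --
# no classes, just an (init_fn, apply_fn) pair of pure functions.
init_fn, apply_fn = stax.serial(
    stax.Dense(128), stax.Relu,
    stax.Dense(10), stax.LogSoftmax,
)

rng = random.PRNGKey(0)
out_shape, params = init_fn(rng, (-1, 784))    # initialise parameters
logits = apply_fn(params, jnp.ones((4, 784)))  # plain function call
```

The whole "framework" here is a couple hundred lines over JAX's primitives, which is what "very thin wrapper" means in practice.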
- that’s exactly what we did and we don’t regret that decision
If you are trying to apply one of these libraries to a production system that doesn’t get a lot of throughput, you probably shouldn’t be using them (try a linear model first). If you have a high-throughput application, you probably want TensorFlow and just deal with the shittiness.
layer = F.relu(Linear()(input))
layer = Dense(activation_fn='relu')(input)
If you switch to PyTorch, what are you going to use for prod deployment? Is there any way to use TPUs?
PyTorch has an optional XLA device that lets you use TPUs: https://github.com/pytorch/xla
And that's before Sonnet, the Estimator API (?) and TFLearn (I'm probably forgetting a bunch).
But since the supported third-party platforms (CNTK and Theano) have stopped development due to TensorFlow, well...
If you miss Keras' way of defining NNs, you can use PyWarm: https://github.com/blue-season/pywarm which offers a fully functional NN building API for pytorch.
 - https://github.com/facebookresearch/visdom
Before my switch I tried out Keras for Tensorflow, and even got a lot of support from Google in my endeavours to resolve the issues I encountered (kudos to Google for that!). In the end I felt it was still not mature enough. Further, although I do believe TF and Keras are moving in the right direction, I still felt that in some cases the way the software was set up just didn't sit well with me.
Maybe it is worth trying again in a year or so, or by then I will try out Swift for TensorFlow, which I think has a great future ahead.
That said, TF 2.0 changes a lot. Many repos might break, so expect to see lots of tensorflow==1.14 in requirements.txt files from now on.
I've been using the RCs for a while now and I must admit it's a big step up for projects you are starting from scratch. Migrating... probably not as clean as I would like, but it does the job. Overall, TF 2.0 removes a lot of the boilerplate code, which is awesome.
I’ve got a very expensive Bitcoin mining rig paperweight at the moment with two Vega 64s (along with another Vega 64 in my main rig) — it’d be great to re-purpose them for something potentially useful.
However, AMD has a history of shipping poor OpenCL drivers, so everyone just went with what actually worked - NVIDIA.
Edit: linked directly to the ROCm deep learning page.
Case in point: people are still trying to figure out on their GitHub how to apply global weight decay when training a model, and to get a "correct" resize for segmentation you have to fall back to the legacy 1.x API and specify `align_corners=True` there. These bugs have existed for many years, and nobody gives a damn. That said, if your choice is between 1.x and 2.0, 2.0 is much easier to work with, especially if you use something other than TF (e.g. PyTorch) for the data pipeline and augmentation. You can hook that up pretty seamlessly if you train in eager mode.
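For the curious, the two resize conventions differ in how an output pixel coordinate maps back to an input coordinate. A pure-Python sketch of the arithmetic (not TF code, just the two mapping formulas):

```python
def src_coord_align_corners(dst, in_size, out_size):
    # align_corners=True: the first and last pixels of input and output line up.
    return dst * (in_size - 1) / (out_size - 1)

def src_coord_half_pixel(dst, in_size, out_size):
    # Half-pixel centres (the TF 2.x tf.image.resize default):
    # pixel centres line up instead of pixel corners.
    return (dst + 0.5) * in_size / out_size - 0.5

# Upsampling a 4-pixel row to 8 pixels:
# align_corners maps the last output pixel exactly onto the last input pixel...
assert src_coord_align_corners(7, 4, 8) == 3.0
# ...while the half-pixel mapping shifts everything, sampling "outside" pixel 0.
assert src_coord_half_pixel(0, 4, 8) == -0.25
```

For segmentation masks, that fractional shift is exactly the kind of subtle misalignment that makes people reach back for the 1.x API.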
- iOS: https://github.com/pytorch/pytorch/pulls?utf8=%E2%9C%93&q=is...
- Android: https://github.com/pytorch/pytorch/pulls?utf8=%E2%9C%93&q=is...
Or you get to use Tensorflow Lite and target both platforms, also with a smaller feature set than its big brother.
Still though, it's the only game in town if you don't want insane, un-debuggable headaches on every device you deploy to. Plus it also supports embedded Linux boards, and pretty much all the current TPU-like accelerators available there.
There is a mostly finite set of typical, solved tasks you would want to use ML for, like image classification, object detection, etc.
I find myself spending my time copying an example verbatim and replacing the csv / images with my own.
If it's the first case, what you want to do is find the best GitHub repos for the task(s) you are trying to do. Make sure the GitHub repo has a model zoo and good support and start from there. In CV, if you are trying to do high-end work, the repos to check out are:
- https://github.com/facebookresearch/maskrcnn-benchmark (don't be misled by the name, it has support for lots of the high-end modern CV models)
- https://github.com/TuSimple/simpledet (haven't explored this one as much, but it looks very solid)
If you want easy-to-read code for non-trivial tasks, I would suggest taking a look at Gluon (GluonCV - https://gluon-cv.mxnet.io/ and GluonNLP - https://gluon-nlp.mxnet.io/). I haven't worked much with the fast.ai library, but that's probably also a good suggestion.
I've been learning Tensorflow recently for a side project, and the style transfer work I'm doing means I need to build my own Tensorflow graphs, so I haven't had much use for this kind of thing. But it sounds like it was made for you, not me.
Some may complain about big API changes but I think it is occasionally healthy to tag old stable versions and do massive code refactoring.
I assume that TensorFlow.js is also getting updated - I find it almost equally nice to prototype with (the bundled examples are first rate).
Aurelien's writing is clear and clean compared to most other books on ML.
It says CUDA 10.0 and cuDNN >= 7.4.1.
It was incredibly fussy when I tried it and tends to silently fail over to the CPU if it doesn't like something (and, worse, specific builds of versions -- e.g. cuDNN 7.4.1 comes in a CUDA 10.0 flavour and a 10.1 one).
I'd love to be proven wrong about that, though.