
Standardizing OpenAI’s deep learning framework on PyTorch - pesenti
https://openai.com/blog/openai-pytorch/
======
cs702
At work, we switched over from TensorFlow to PyTorch when 1.0 was released,
both for R&D and production... and our productivity and _happiness_ with
PyTorch noticeably, significantly improved.

Back when we were using TensorFlow, whenever we wanted to try something new
that wasn't already provided out-of-the-box by existing APIs, sooner or later
we would find ourselves _wrestling_ with its machinery, especially for models
with more complex control flow.
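The control-flow point is easy to show concretely. As a rough sketch (the model and names are made up for illustration, assuming PyTorch is installed), a PyTorch module can use ordinary Python loops and branches whose behavior depends on the data at runtime, which static-graph TF 1.x handled awkwardly:

```python
import torch
import torch.nn as nn

class AdaptiveDepthNet(nn.Module):
    """Toy model whose effective depth depends on the input --
    ordinary Python control flow, no graph tracing required."""
    def __init__(self, dim=16):
        super().__init__()
        self.layer = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, x):
        # Keep applying the layer until activations settle (or a cap is hit):
        # a data-dependent loop that static graphs express poorly.
        for _ in range(10):
            new_x = torch.relu(self.layer(x))
            if (new_x - x).abs().mean() < 1e-3:
                break
            x = new_x
        return self.out(x)

model = AdaptiveDepthNet()
y = model(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 1])
```

The `break` and the convergence test are just Python; in graph-mode TF 1.x the same logic would need `tf.while_loop` and `tf.cond` plumbing.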

TensorFlow _feels_ like it was built from the ground up to scale up to
billions of users and all kinds of devices, with developer productivity and
happiness a secondary priority. PyTorch _feels_ like it was built the other
way around, prioritizing developer productivity and happiness; other
considerations were secondary.

That said, we are keeping an eye on Swift + MLIR + TensorFlow. We think it
could unseat PyTorch for R&D and eventually, production, due to (a) the
promise of automatic creation of high-performance GPU/TPU kernels without
hassle, (b) Swift's easy learning curve, and (c) Swift's fast performance and
type safety. Jeremy Howard has a good post about this:
[https://www.fast.ai/2019/03/06/fastai-swift/](https://www.fast.ai/2019/03/06/fastai-swift/)

~~~
jorlow
> TensorFlow feels like it was built from the ground up to scale up to
> billions of users and all kinds of devices, with developer productivity and
> happiness a secondary priority. PyTorch feels like it was built the other
> way around, prioritizing developer productivity and happiness; other
> considerations were secondary.

I recently moved from Google to Facebook and this is how I'd characterize most
of the differences I see: Facebook optimizes for your ability to make progress
above everything else. Google, not so much.

~~~
kabes
Seems to hold up for most Google stuff I've tried, e.g. Kubernetes and Angular.

------
stabbles
I've started working with Flux [1] in Julia, and it's so elegant and such a
great experience :). Just look at this definition of a U-net model for image
segmentation:
[https://gist.github.com/haampie/bceb1d59fd9a44f092f913062e58d482](https://gist.github.com/haampie/bceb1d59fd9a44f092f913062e58d482).
Apart from that, you can write your own custom loss functions in pure Julia
that run efficiently on the GPU, and you get language-level automatic
differentiation and proper integration with other packages. If people are
moving away from Tensorflow, then Flux could be a solid alternative as well.

[1] [https://github.com/FluxML/Flux.jl](https://github.com/FluxML/Flux.jl)

~~~
chillee
IMO, these kinds of functional abstractions look nice on paper, but are a pain
in the ass to actually use. In practice, you'll want to print out things in
between each layer, you'll want to log each layer's activations, you might
want to redirect a layer into another network, etc.

Both PyTorch and Tensorflow have purely functional abstractions, but they're
relegated to super basic functionalities.
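For what it's worth, PyTorch's escape hatch here is forward hooks: you can keep a "functional-looking" `nn.Sequential` and still log every layer's activations without rewriting the model. A minimal sketch (toy layer sizes, assuming PyTorch is installed):

```python
import torch
import torch.nn as nn

# A purely "functional-style" stack of layers...
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

# ...instrumented layer-by-layer with forward hooks, which is the
# kind of inspection the comment above is asking for.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in net.named_children():
    module.register_forward_hook(make_hook(name))

out = net(torch.randn(4, 8))
print({k: tuple(v.shape) for k, v in activations.items()})
```

Hooks also cover the logging use case without touching `forward`; redirecting a layer into another network, though, does push you toward subclassing `nn.Module` instead.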

~~~
eigenspace
Sorry, can you clarify what you count as a purely functional abstraction?

Flux is incredibly flexible and is not limited to purely functional code; it
is capable of many things that are straight-up impossible or infeasible in
PyTorch or TensorFlow (with or without their 'purely functional'
abstractions).

~~~
chillee
Super late reply, so it's likely you won't see this... (Too bad HN doesn't
notify on replies).

I'm not complaining about Flux in general, I'm talking about the specific
example (the UNet) he brought up that he uses to claim that Julia is so
elegant.

Can you elaborate on what Flux can do that Pytorch can't?

~~~
ChrisRackauckas
At this point, the DiffEqFlux neural differential equation library fits the
neural ODE example from the original paper in 29 seconds [1]. The forward pass
of torchdiffeq on trivial ODEs without neural networks takes 47 seconds [2]
(and of course adding in neural networks makes it a lot more expensive). This
is a massive real-world difference. It means that with the Julia packages you
can build animations by watching fitting plots in real time, while it's an
hours-long ordeal in PyTorch. Being able to use optimized packages instead of
hardcoding a simple version of things really pays off in the long run; here,
using a real ODE solver suite is not a small difference but multiple orders
of magnitude. That's the real benefit of differentiable programming.

    
    
[1] [https://github.com/JuliaDiffEq/DiffEqFlux.jl#training-a-neural-ordinary-differential-equation](https://github.com/JuliaDiffEq/DiffEqFlux.jl#training-a-neural-ordinary-differential-equation)

[2] [https://gist.github.com/ChrisRackauckas/cc6ac746e2dfd285c28e0584a2bfd320](https://gist.github.com/ChrisRackauckas/cc6ac746e2dfd285c28e0584a2bfd320)

------
antome
As someone who has used both PyTorch and TensorFlow for a couple of years now,
I can attest to the faster research iteration times for PyTorch. TensorFlow
has always felt like it was designed for some mythical researcher who could
come up with a complete architecture ahead of time, based on off-the-shelf
parts.

~~~
cillaway
Indeed, no wonder PyTorch has beaten Tensorflow so thoroughly in the last 3
years, going up from 1% of the papers to ~50% of the papers (TensorFlow is now
down to only 23% of the papers):

[https://paperswithcode.com/trends](https://paperswithcode.com/trends)

~~~
nl
According to the methodology on that page that would classify the standalone
version of Keras (using _from keras.models_ imports as recommended by the
Keras docs) as "Other". (I tried finding source code to verify this, but
couldn't find it)

And if that is correct, then I'd be astonished if the vast majority of the
"Other" papers aren't Keras. I work in ML and I don't think I've seen a paper
that didn't use PyTorch, TensorFlow or Keras in years.

And if that's the case, then almost certainly more papers use TF (counting
Keras) than PyTorch: PyTorch is 42%, TF is 23%, but Other is 36%.

(In terms of biases, I _hate_ working in Tensorflow, and much prefer PyTorch
and Keras. But numbers are numbers).

~~~
ma2rten
Jax?

~~~
nl
Are there any papers that use it for things other than demonstrating Jax? I
can't think of one off the top of my head.

Perhaps I should have specified "papers outside those introducing new
frameworks, or around speed benchmarking".

There are a bunch of interesting papers using custom libraries for distributed
training, and ones targeted at showing off the performance of specific
hardware (NVidia has a bunch of interesting work in this space, and Intel and
other smaller vendors have done things too).

~~~
alevskaya
It's still early days for JAX, but there's neural tangents
[https://arxiv.org/abs/1912.02803](https://arxiv.org/abs/1912.02803) and
reformer [https://arxiv.org/abs/2001.04451](https://arxiv.org/abs/2001.04451)
from iclr.

~~~
nl
I agree about it being early days.

Reformer is a good example that I'd missed.

Neural Tangents is another paper demoing a framework.

------
theferalrobot
Happy to see PyTorch get some love. The company I am at made the same switch
and everyone has loved PyTorch. It has more expressive power than Tensorflow
1.x (there are models that cannot be expressed with static graphs) and is
simultaneously much easier to use.

------
sbrother
Is there any equivalent of TF Serving for PyTorch? We have been thrilled with
how robust and easy it is to deploy our models to production on the TF stack,
and it worries me that the inertia in the deep learning community seems to be
toward PyTorch.

~~~
calebkaiser
Have you checked out Cortex? It's an open source platform for deploying
PyTorch models easily. I wrote an article for the PyTorch blog about it:
[https://medium.com/pytorch/how-to-build-production-software-with-pytorch-9a8725382f2a](https://medium.com/pytorch/how-to-build-production-software-with-pytorch-9a8725382f2a)

GitHub:
[https://github.com/cortexlabs/cortex](https://github.com/cortexlabs/cortex)

Full disclosure/shameless plug: I work on Cortex

~~~
sbrother
Thanks! I was not aware of this and it looks fantastic. Is it in the roadmap
to target GCP, or even just a generic Kubernetes cluster?

~~~
calebkaiser
GCP is on the immediate short-term roadmap, and we're investigating
on-premises deployments, but we don't quite have a firm timeline yet (we're
still a small team).

------
sandGorgon
This is the second large organization making the switch to PyTorch.

[https://medium.com/syncedreview/japanese-unicorn-preferred-networks-migrates-its-dl-platform-to-pytorch-a509ac8f4ba0](https://medium.com/syncedreview/japanese-unicorn-preferred-networks-migrates-its-dl-platform-to-pytorch-a509ac8f4ba0)

------
m0zg
If PyTorch had a viable way to convert models to run on a mobile GPU or DSP,
that's all I'd ever use. Currently I have to do my research in PyTorch and
then laboriously port to TF to convert to TFLite, which kinda sucks because TF
is full of bugs, and there are gotchas due to differences in how ops are
implemented.

------
sillysaurusx
This is a surprisingly unintelligent move from OpenAI. It adds corporate
inertia to something as mundane as choice of DL framework.

Imagine you worked at OpenAI. Imagine you wanted to experiment with Jax, and
that it turned out to be the best solution for the problem. Now you can't ship
without a solid technical justification.

Except, it's not really a technical justification that you need. You need
corporate clout. You can't just be a junior engineer and make a decision that
goes against corporate policy. That's the point of having a corporate policy.

I can hear a thousand people about to type "C'mon, OpenAI isn't a normal
corporation." But it is. Every corporation is a normal corporation. And having
policies against specific tech should make productive programmers pause.

People get jobs at companies based on whether they use React or Vue, for
example. And in DL, a programming library is basically a programming language,
so it's one step more powerful than that.

Here's an example. Pytorch, as far as I can tell, doesn't support running code
on a TPU's CPU. (I could be wrong about this!) When you enumerate the list of
accelerators available after connecting to a TPU, you get a list of 8 entries.
That means they only support executing code on the _cores_ of a TPU, not the
TPU's CPU. This is a huge difference. It means you're restricted to 8GB on
TPUv2-8's (which you get on Colab) instead of 300GB.

Does that count as a solid technical justification to use Tensorflow for a
research project instead of Pytorch? Who knows. But who wants to be the odd
one out on corporate politics? Especially if a project doesn't generate any
tangible results, which is often the case for research.

~~~
Erlich_Bachman
Or they see this problem and that's why the policy is sanely phrased as
follows:

    
    
        Going forward we’ll primarily use PyTorch as our 
        deep learning framework but sometimes use other 
        ones when there’s a specific technical reason 
        to do so.

~~~
sillysaurusx
It never works out this way in practice. You need corporate clout to go
against corporate policy. That's the point of having a corporate policy.

 _Of course_ they added that caveat. That's probably how this idea got through
in the first place. Just point at the caveat and say "But we're not _really_
throwing all the other frameworks under the bus. If everyone decides it's a
good idea to use something else, we'll use something else."

Except that likely won't happen, because now as a junior engineer you need to
convince N other people that using Jax was a decent choice. And it's against
your company's culture to use anything but Pytorch.

This battle of Tensorflow vs Pytorch is bad for everybody involved. OpenAI
released a lot of cool and important code related to Tensorflow. They did
GPT-2 (tensorflow 1.x), blocksparse (also tensorflow), memory saving gradients
(tensorflow 1.x), and now they're announcing they'll likely never be releasing
such tooling again. Memory saving gradients have been hugely helpful to us for
scaling our models beyond the normal limits.

~~~
gbear605
What you’re ignoring is that the switch isn’t from nothing to Pytorch, it’s
from Tensorflow to Pytorch. It’s only favoring one library over another. Your
scenario with Jax hasn’t changed, and such tooling is going to be released for
Pytorch instead of for Tensorflow. I suspect you’re only against this because
you prefer Tensorflow to Pytorch.

------
zackmorris
Just FYI I looked at PyTorch for the first time now, and unfortunately they
require Mac OS users to build it from source in order to get CUDA support:

[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)

Please if someone at PyTorch is reading this, put in a request to make CUDA
support the default on Mac OS.

Also, it looks like PyTorch doesn't currently support OpenCL:

[https://github.com/pytorch/pytorch/issues/488](https://github.com/pytorch/pytorch/issues/488)

I can't tell by the issue comments if it's been added yet or if they plan to
use Intel's oneAPI or similar.

To me, these are prerequisites for switching to PyTorch. Hopefully someone can
clarify the state of these thanks!

~~~
smhx
Hi I am a PyTorch maintainer.

NVIDIA has dropped CUDA support for macOS:
[http://www.cgchannel.com/2019/11/nvidia-drops-macos-support-for-cuda/](http://www.cgchannel.com/2019/11/nvidia-drops-macos-support-for-cuda/)

This had been evident for a few years, and it's one of the top reasons for us
not to provide official binaries with CUDA support -- the maintainer overhead
was way too much. We did work to make sure PyTorch still builds with CUDA
support from source (with a continuous build), but once CUDA 10.3 or 11 is
released, we'll have to drop that too.

~~~
zackmorris
Ah, thanks for that. One of my biggest concerns right now is that, since SIMD
won out in the performance wars and has come to be dominated by the video
game industry and proprietary players like NVIDIA, we are missing out on a
whole possible branch of evolution in computer science.

For one, we don't have easy access to MIMD, so we can't easily or cheaply
experiment with our own simulations for things like genetic algorithms.

20 years ago I wanted to go into AI research and make a multicore FPGA (say
1000+ cores) where each one could run its own instance of an OS, or at the
very least an isolated runtime for something like Lisp. But the world has gone
a completely different direction, and that's great and everything with all the
recent advances in machine learning, but it's like comparing rasterization
(what we have) to ray tracing (what we could have had). Current
implementations are orders of magnitude more complex than they need to be.
I've written about this a bunch:

[https://news.ycombinator.com/item?id=17759391](https://news.ycombinator.com/item?id=17759391)

[https://news.ycombinator.com/item?id=17419917](https://news.ycombinator.com/item?id=17419917)

So, short of that, I hope PyTorch can at least provide a cross-platform,
performant SIMD implementation. That's what I had hoped OpenCL would be, but
maybe it's too much like OpenGL, and we need something a level of abstraction
higher for easier vector processing, without all the worrying about buffers
and moving data between CPU and GPU.

------
minimaxir
It's somewhat disappointing that research is the primary motivator for the
switch. PyTorch still has a ways to go in tooling for toy usage of models and
_deployment_ of models to production compared to TensorFlow (incidentally,
GPT-2, the most public of OpenAI's released models, uses TensorFlow 1.X as a
base). For AI newbies, I've seen people recommend PyTorch over TensorFlow just
because "all the big players are using it," without listing the caveats.

The future of AI research will likely be interoperability between multiple
frameworks to support both needs (e.g. HuggingFace Transformers which started
as PyTorch-only but now also supports TF 2.X with relative feature parity).

~~~
Voloskaya
> It's somewhat disappointing that research is the primary motivator for the
> switch.

They are a research organization, how is that disappointing?

~~~
minimaxir
Making AI more open is synonymous with making AI more _accessible_, which
(IMO) is much better facilitated with TensorFlow/Keras versus PyTorch.

Many AI tutorials imply that the more complicated an AI approach is, the more
effective it is, which isn't practical, especially for newbies without a deep
background.

~~~
eachro
Accessible to whom? What makes tensorflow/keras more accessible than pytorch?

~~~
minimaxir
Accessible to _non-researchers_, especially those with a programming
background but not an AI background.

The TF/Keras approach advocates the _minimum_ amount of code necessary and
effort needed to make model changes, with sensible default configurations and
layer architectures.
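As a concrete illustration of that style (a hedged sketch, assuming TensorFlow 2.x is installed; the layer sizes here are arbitrary), a Keras model can be defined, compiled with optimizer/loss/metrics chosen by name from defaults, and inspected in a handful of lines:

```python
import tensorflow as tf

# Sensible-defaults workflow: declare the architecture...
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# ...then one line picks the optimizer, loss, and metrics by name,
# with default configurations for each.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training would then be a single `model.fit(x, y, epochs=...)` call; the trade-off, as discussed downthread, is how much of the machinery this hides.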

~~~
octbash
Minimum ≠ lowest effort.

Especially with the caveat of "with a programming background", it is far
easier to reason and debug through PyTorch with just Python knowledge,
compared to TensorFlow/Keras, which sooner or later requires you to learn a
condensed history of TensorFlow/Keras development to understand why things are
the way they are.

In my opinion,

    
    
      import lib
      lib.train("imagenet", "resnet50", epochs=10)
      lib.eval()
    

is NOT a good example of a beginner friendly library. It's a thin wrapper
facade that hides all of the actual complexity behind "Train ImageNet in 3
lines of code!"

~~~
minimaxir
Fair; maybe minimum isn't the right word. More like "minimum without full
abstraction."

The Keras examples are a good reference (e.g.
[https://www.tensorflow.org/tutorials/keras/classification](https://www.tensorflow.org/tutorials/keras/classification));
even without an AI background, you get a sense of both what's going on and
how to tweak the model to improve it.

~~~
tastyminerals
The reason Keras became so popular is that it borrowed a lot of concepts from
Lua Torch (which predates even Theano), and anyone who worked with Torch
immediately sees it when reading Keras code. But Torch was Lua-based, and it
naturally received less recognition than it deserved. You will not lose
anything by simply moving to PyTorch.

------
tastyminerals
I think it was just a matter of time until TF got superseded by PyTorch. The
only reason we kept TF in prod was the Java API, which allowed us to quickly
load and serve TF models. I spent so many sleepless nights back in the day
trying to port a Torch model to TF and make it work the same as the Lua-based
prototype. The whole TF "experience" made us switch to a plain Python
services model, throwing away all the boilerplate Scala/Java code for TF. It
doesn't happen often in tech that a better-engineered product eventually gets
more traction and recognition, and I am glad that PyTorch did.

~~~
lasagnaphil
PyTorch actually got an experimental Java API in version 1.4 (about two weeks
ago), if you're interested.

------
bitL
I believe these days one has to know both TensorFlow (Keras) and PyTorch;
most new research is in PyTorch, and most deployments are in TensorFlow.
Academia can afford to run on PyTorch only, and stable businesses on
TensorFlow only, but individual developers need to know both.

------
klowrey
For folks interested in Julia and RL, I've been involved in
[https://www.lyceum.ml/](https://www.lyceum.ml/) a set of tools for continuous
control problems like robotics.

It's pretty quick.

~~~
julialover
Yeah!! Let's switch to Lyceum!

------
syntaxing
Has anyone taken the course mentioned, "Spinning Up in Deep RL"? I've been
meaning to learn some Deep RL, and I was wondering if this is the best first
step.

~~~
rishy
Start with the lecture series and reference material by David Silver:
[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html).
You can then supplement that with "Spinning Up in Deep RL" for more hands-on
experiments.

