Growing open source from Torch to PyTorch (soumith.ch)
78 points by plinkplonk on Aug 4, 2021 | 23 comments



It was necessary to move away from Lua to stay relevant within the machine learning community. Python was a natural choice because Theano and TensorFlow were already there.

PyTorch could make use of the best API ideas from the other frameworks (including higher-level ones like Keras), and it was executed well. The core principles of easy debuggability are indeed very important for winning over developers. Clean code, understandable code, flexibility: these are all closely related to that, or mostly the same thing.

It's easy for a successful framework to get bloated, complex, and complicated, though. I wonder how PyTorch will look in a few years. I also remember the first TensorFlow releases, where the whole source code was quite easy to understand. Then TensorFlow added more and more things and many different types of APIs, started deprecating earlier ones, etc. PyTorch's internal code is also already much more complex than it was initially.

One reason JAX is now popular is that it again started with a fresh API. It's also based on a new kind of idea, code transformations, which seems nice and powerful.

When looking at these developments, I really wonder what the future will look like. It's good to have new ideas and new or improved APIs. It's also good to adapt things for new kinds of hardware (GPUs, TPUs, maybe neuromorphic hardware later).


Keras was a copy of the Torch API. If you read the original Keras README, it literally says so.


> It was necessary to move away from Lua

Why? I sort of became disillusioned with Torch after they abandoned Lua.


As a Julia user, thanks for this! Inspiring and packed with pearls. There's a lot we can learn from the Python community.


PyTorch is amazing, and the article was a good read. Although I'm confused: how can an ML framework not be obsessed with speed/performance?


Author here. Being conscious of speed and performance is different from making it your competitive advantage or USP.

Our main focus is usability, and one of our secondary focuses is to not look like clowns in the performance department.

So we try to make more decisions that trade off performance for usability than vice versa.


You are doing a good job balancing the two. Julia's Flux did the opposite, and it has severe performance problems compared to PyTorch despite being more usable and easier to install.

Installing PyTorch with Poetry is next to impossible. Flux got this right by bundling the GPU drivers. Its installation is also standardized and does not require the weird pip -f flag for CPU-only installations.


> it has severe performance problems

It had. It's now around parity with PyTorch.

And no, it wasn't about a usability tradeoff.

It was about being more general: a more general compiler, more general code, more composable code.

Then the team kept optimizing that, adding compiler optimizations to the language that benefit all code; ML-type code just stresses the compiler in a particular way. PyTorch handles ML's array-heavy workloads as a special case.

Julia will be doing the same, but it's setting the groundwork for domain-specific optimizations to be done in package and user space. A different sort of philosophy.
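
To make "more composable code" concrete, here's a minimal sketch (standard Flux API, written from memory; details may vary by version): a model is plain Julia code, so an arbitrary function slots straight into it, and the same AD differentiates through everything.

    using Flux

    # A model is just function composition; the anonymous function is
    # ordinary Julia code, not a special graph op.
    model = Chain(Dense(10, 32, relu), x -> x .^ 2, Dense(32, 1))

    x = rand(Float32, 10, 8)   # a batch of 8 inputs
    y = rand(Float32, 1, 8)

    # Zygote differentiates through the whole composition.
    grads = gradient(m -> Flux.Losses.mse(m(x), y), model)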

It was about being greedier: setting the groundwork for a more powerful tool in general, at some short-term cost.

They could have just written a framework that baked in fp32/64/16 with CUDA kernels, tracing, and operator-overloading computational graphs, and gotten more speedup over PyTorch (in fact, Avalon.jl takes that approach), with better usability.

But they didn't, and now there's a burgeoning ecosystem that does things no other framework can. It's not quite as marginally beneficial for current vanilla ML, because that is stuck in a local optimum, but I think that is going to change: https://www.stochasticlifestyle.com/useful-algorithms-that-a...

In the meantime, places like MIT, Moderna, NASA, etc. are reaping the benefits.


Some specific steps that will push it past JAX/PyTorch for chunky, array-heavy GPU code (it can already beat or match OpenBLAS/MKL for kernels written in scalar form; see the sketch below):

1. Better compile-time memory management (https://github.com/aviatesk/EscapeAnalysis.jl)

2. Linear-algebra passes built on a generic, composable compiler ecosystem: https://youtu.be/IlFVwabDh6Q?t=818

3. Metatheory.jl e-graph-based symbolic optimization interleaved with the abstract interpreter: https://github.com/0x0f0f0f/Metatheory.jl

4. Partial evaluation mixing concrete and abstract interpretation

5. Compiler-based autoparallelism with Dagger.jl

6. New compiler-integrated AD (as a package) that isn't based on an accidental lispy compiler hack like Zygote: https://github.com/JuliaDiff/Diffractor.jl

7. Changes to array semantics, which will include generic immutability/ownership concepts.

And many more. The key is that all the initial groundwork that traded off specific speed for fundamental flexibility will then feed back into making the ML use case faster than if it had been the focus from the start. People can do all kinds of crazy yet composable things, in pure Julia, without modifying the base compiler.

Bonus: being able to modify the type lattice to track custom program properties. This means you aren't stuck with the global tradeoffs of a static type system, and you can do things like opt in to compile-time array-shape tracking per module: https://twitter.com/KenoFischer/status/1407810981338796035 Other packages, such as those for quantum computing, are planning their own analyses. It's generic, and the use cases and compositions aren't frozen at the outset (unlike, for example, the Swift "tensors fitting perfectly" proposal).
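
On the scalar-kernel claim above, a minimal sketch with LoopVectorization.jl (this follows the package's standard @turbo matmul example; written from memory):

    using LoopVectorization

    # A textbook triple-loop matmul; @turbo handles SIMD,
    # unrolling, and tiling of the loop nest at compile time.
    function mygemm!(C, A, B)
        @turbo for m in axes(A, 1), n in axes(B, 2)
            Cmn = zero(eltype(C))
            for k in axes(A, 2)
                Cmn += A[m, k] * B[k, n]
            end
            C[m, n] = Cmn
        end
        return C
    end

    A, B = rand(128, 128), rand(128, 128)
    C = similar(A)
    mygemm!(C, A, B)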


> In the meantime, places like MIT, Moderna, NASA, etc. are reaping the benefits.

Can you elaborate more? MIT is well known, but it would be interesting to know how Moderna and NASA are using Flux.


Sure!

NASA: https://www.youtube.com/watch?v=tQpqsmwlfY0

Moderna: https://pumas.ai/ https://discourse.julialang.org/t/has-moderna-used-pumas-ai-...

There are many, many more. These unique and sought-after capabilities are what got Julia Computing its $24M Series A (https://twitter.com/Viral_B_Shah/status/1417128416206376960).


> It had. It's now around parity with PyTorch.

In some cases, it is much faster.

Consider neural stochastic differential equations, where Flux is literally over 70,000x faster than Google's PyTorch-based implementation:

https://gist.github.com/ChrisRackauckas/6a03e7b151c86b32d74b...
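
For context on what's being solved, here is the canonical scalar SDE example from the DifferentialEquations.jl tutorials (sketched from memory); the neural SDE case replaces the drift f and diffusion g with neural networks:

    using DifferentialEquations

    # Geometric Brownian motion: du = 1.01*u*dt + 0.87*u*dW
    f(u, p, t) = 1.01 * u
    g(u, p, t) = 0.87 * u
    prob = SDEProblem(f, g, 0.5, (0.0, 1.0))
    sol = solve(prob, SOSRI())   # adaptive, high strong-order method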


Yeah, I meant parity for vanilla ML models. For anything off that beaten path, it's much, much faster.


We ship everything needed for userland, including the parts of CUDA/cuBLAS and cuDNN that we need (which is why our binaries are so fat).

GPU drivers would be kernel-land, and I don't think we can actually install GPU drivers as part of a `pip install`. Will look into what Flux is doing, but I doubt they ship GPU drivers.

Separately, thanks for flagging the Poetry issue; we might prioritize it, especially if the fix is easy.


You might want to take a look at https://discuss.python.org/t/what-to-do-about-gpus-and-the-b... if you haven't seen it; there's a practical problem in that hosting built GPU code on PyPI is very difficult.


Yes, Flux doesn't ship GPU drivers. It ships everything else (like the CUDA toolkit, etc.) as needed, using the artifact/package system, for all mainstream OSes, without interfering with system libraries.

https://julialang.org/blog/2019/11/artifacts/
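
In practice that looks like this with CUDA.jl, which Flux uses for GPU support (a minimal sketch; the first `using CUDA` resolves and downloads a toolkit artifact matched to the system driver):

    using CUDA

    CUDA.versioninfo()      # reports the toolkit/driver pairing that was resolved

    x = CUDA.rand(Float32, 1024)
    y = 2f0 .* x .+ 1f0     # broadcasts compile to GPU kernels on the fly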


I'll hijack your presence on a post about CUDA/cuBLAS: is there any news about OpenCL support?


Thanks for the post.

One question: one of the advantages of a clean design is that performance is easier to optimize, since the 80/20 rule of performance becomes much more obvious. How true was this in your experience? Were there any major performance-related design changes, or was performance optimization a matter of tuning a few selected functions?


> So, over the years, I absorbed and appreciated that Torch was a user-centric product, which stood for immediate-mode, easy-to-debug, stay-out-of-the-way explicitness. It was targeted at people somewhat familiar with programming matters, and who could reason about things like performance, and if needed, write a C function and bind it in quickly.

This paragraph sort of surprises me. In my experience, if you want to do anything other than call out to numeric libraries, you can do it in Lua and it will work, whereas in Python your machine learning pipeline will suddenly spend 95% of its time running Python while your GPU idles. So the need to be able to drop down to C is much more severe in Python, and the difficulty of calling out to C is much greater.


Whether you use Lua or Python for GPU-based scientific computing, the need to drop into a C call is the same. The overhead of Python vs. Lua never really mattered.

While we were based on top of LuaJIT, we couldn't use the JIT for anything, because we always had to call into the C library for GPU kernels (and LuaJIT can't JIT through an opaque C call, in case that C call changes the interpreter stack).

Where Python really helps is with its ecosystem. The entire data science and ML ecosystem is in Python.

The difficulty of calling out to C is not much greater in Python; things like PyBind11 make it pretty natural.


People don't usually write the rules of a game, or a Monte Carlo tree search (MCTS), to run on the GPU. They write them in Python, Lua, or C. If they write them in Python, the GPU will idle all the time; if they write them in Lua or C, it will not.


In Julia it's easy to run the entire thing on CUDA: https://github.com/fabricerosay/AlphaGPU

No C or any other language required because Julia has GPU codegen.

You can also keep the MCTS on the CPU and be competitive with C++, despite the code being higher-level, easier to read, and more generic and composable. See: https://github.com/jonathan-laurent/AlphaZero.jl
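
A minimal sketch of what "Julia has GPU codegen" means, using CUDA.jl (a hypothetical saxpy kernel for illustration, not code from either repo): a plain Julia function is compiled straight to a GPU kernel by @cuda, with no C/CUDA-C layer.

    using CUDA

    # An ordinary Julia function, compiled to a GPU kernel by @cuda.
    function saxpy!(y, a, x)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] += a * x[i]
        end
        return nothing
    end

    x = CUDA.rand(Float32, 4096)
    y = CUDA.zeros(Float32, 4096)
    @cuda threads=256 blocks=cld(length(y), 256) saxpy!(y, 2f0, x)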


This article does a good job explaining how PyTorch gained an advantage over TensorFlow. The 1.0 release of TensorFlow, with graphs and feed_dicts, was a little clunky but made sense. After 1.0, the second-system effect took hold quickly: eager mode, Keras, TFX... it all started to look like a mess.



