It compiles your model ahead of time using TensorRT and lets you use the compiled model through torch.jit.load("your_trtorch_model.ts") in your application.
Once compiled, you no longer need to keep your model's source code in the application (as with regular TorchScript models).
The inference time is on par with TensorRT's, and it does the optimizations for you as well.
You can also quantize your model to FP16 or INT8 using post-training quantization (PTQ), which should give you an additional inference speed-up.
Here is a tutorial to leverage TRTorch.
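A minimal sketch of that flow, with MyModel as a placeholder; the exact compile-spec keys vary across TRTorch / Torch-TensorRT versions, so treat the argument names as approximate:

  import torch
  import trtorch  # the project now lives on as torch_tensorrt

  model = MyModel().eval().cuda()    # MyModel is a placeholder
  scripted = torch.jit.script(model)

  # Ahead-of-time compilation with TensorRT; FP16 here, INT8 additionally
  # needs a PTQ calibrator.
  trt_model = trtorch.compile(scripted, {
      "inputs": [trtorch.Input((1, 3, 224, 224))],
      "enabled_precisions": {torch.half},
  })
  torch.jit.save(trt_model, "your_trtorch_model.ts")

  # Later, in the application -- no model source code required:
  loaded = torch.jit.load("your_trtorch_model.ts")
  out = loaded(torch.randn(1, 3, 224, 224, device="cuda").half())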
We managed to get up to a 10x speed-up at very low resolutions (160) with a ResNet-101, but it usually plateaus at a 1.7~1.9x speed-up for high resolutions (above 896x896).
Although INT8 gives even higher speed-ups (~3.6x for 896x896 input), it degrades accuracy too much for some tasks.
I will definitely try your setup :)
Have a nice day internet stranger.
But if you are using Nvidia hardware, then TensorRT should give you the best performance possible, especially if you change the precision level. Don't forget to simplify your ONNX model before converting it to TensorRT, though: https://github.com/daquexian/onnx-simplifier
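For what it's worth, a small sketch of that simplification step using onnx-simplifier's Python API (the CLI `python -m onnxsim in.onnx out.onnx` does the same); the file names are placeholders:

  import onnx
  from onnxsim import simplify

  model = onnx.load("model.onnx")
  simplified, ok = simplify(model)  # folds constants, removes redundant ops
  assert ok, "simplified model failed validation"
  onnx.save(simplified, "model_simplified.onnx")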
I think you meant that there are optimizers for the ONNX format, ONNX Runtime being one of them.
If you have a lot of pre- and post-processing logic in your model, it can be hard to export it for ONNX Runtime or Triton, so I usually recommend starting with Ray Serve (https://docs.ray.io/en/latest/serve/index.html) and using an actor that runs inference with a quantized model, or one optimized with TensorRT (https://github.com/NVIDIA-AI-IOT/torch2trt).
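A rough sketch of that pattern, assuming a TorchScript artifact named model_trt.ts and Ray Serve's deployment API (names per recent Ray releases; older versions used .deploy() instead of serve.run):

  import torch
  from ray import serve

  @serve.deployment
  class InferenceActor:
      def __init__(self):
          # e.g. a quantized or TensorRT-optimized module saved earlier
          self.model = torch.jit.load("model_trt.ts").eval()

      async def __call__(self, request):
          payload = await request.json()
          x = torch.tensor(payload["inputs"])
          # arbitrary pre/post-processing can live here as plain Python
          with torch.no_grad():
              y = self.model(x)
          return {"outputs": y.tolist()}

  serve.run(InferenceActor.bind())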
Having got a working Torch model on CPU, what's the best path to actually making it run as fast as I feel it has the potential to?
But at some point it became a model export format for production environments that can’t use CPython for performance reasons.
I’m surprised that you’re seeing worse performance with jit. It sometimes takes 20-ish iterations for the jit to “settle down” but I’d expect roughly equal performance at worst. If you can share a repro, I’d be happy to take a quick look if you file an issue on GitHub. (I’m @bertmaher there)
We’re working on eliminating the dependence on shape specialization right now, since it’s kind of an unfortunate limitation for some workloads.
Staying within PyTorch, we recently added torch.jit.optimize_for_inference (I think it's in 1.10, though not entirely sure), which can apply a bunch of standard optimizations to a model and often provides some nice wins.
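Usage is nearly a one-liner on a scripted module; MyModel and example_input are placeholders:

  import torch

  model = MyModel().eval()
  scripted = torch.jit.script(model)
  # Freezes the module (if needed) and applies standard inference fusions:
  optimized = torch.jit.optimize_for_inference(scripted)

  with torch.no_grad():
      out = optimized(example_input)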
The nice thing about it, though, is that you can embed native Python code (compiled to TorchScript and executed by the C++ runtime) into the model artifact. It has allowed us to write almost all of our models' serving logic very close to the model code itself, giving a better overview than keeping the serving logic in a separate repo.
The server we use on top of this can be pretty "dumb" and just funnel all inputs to the model; the embedded Python code determines what to do with them.
As for model speedups, maybe you should look into quantization? I also find that there's usually lots of low-hanging fruit if you go over the code and rewrite things as mathematically equivalent but quicker ops that allocate less memory or do fewer operations.
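For instance, dynamic quantization is nearly a one-liner and a common first piece of low-hanging fruit (model here is your trained float module):

  import torch

  quantized = torch.quantization.quantize_dynamic(
      model,
      {torch.nn.Linear},  # layer types to quantize
      dtype=torch.qint8,
  )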
It also makes it possible to sidestep the GIL and remove the overhead of launching kernels from python, which only really makes a noticeable difference with models that queue up a lot of small operations on the GPU. (LSTMs are an example of where this would make a difference https://pytorch.org/blog/optimizing-cuda-rnn-with-torchscrip...)
We were doing millions of inferences, with a specific target of a couple thousand per second, so it's a specific case for sure, but that's my two cents.
This is very interesting. Can someone talk about the roadmap of PyTorch here? It seems everyone is kinda rolling their own -
PyTorch has a very confusing distributed-training story:
- OpenAI runs Pytorch on Kubernetes with handrolled MPI+SSH
- PyTorch-BigGraph is specifically using torch.distributed with gloo (with an MPI backend).
So here's the question - if you're a two-person startup that wants to do PyTorch distributed training using one of the cloud-managed EKS/AKS/GKE services... what should you use?
it is far, far more efficient (as a proportion of time-to-market) to rent and build on top of services.
Kubernetes is where the wider ecosystem is. I don't like it... but it is what it is.
So Grid.ai is something like AWS SageMaker. I wanted to figure out what someone can use on a ready-made Kubernetes cluster.
I definitely don't want to dispute your calculations... However, I'm still puzzled about what to use.
Disclaimer: I work for Determined.
Is that published somewhere?
It's not totally clear what "Jax's direction" means to you, but I'd consider its defining characteristics as 1. composable transformations, and 2. a functional way of programming (related to its function transformations).
I'd say that PyTorch is moving towards the first (see https://github.com/pytorch/functorch) but not the second.
Disclaimer: I work on PyTorch, and Functorch more specifically, although my opinions here aren't on behalf of PyTorch.
Stuff like vmap and grad and pmap and all the rest have been a huge boon in simplifying some of my work, so I'm glad to see it's expanding into pytorch!
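A tiny example of the kind of composition being described, using functorch's names (per-sample gradients via grad composed with vmap):

  import torch
  from functorch import grad, vmap

  def loss(w, x):
      return (w * x).sum()

  # grad differentiates w.r.t. the first argument; vmap maps over a batch of x.
  per_sample_grads = vmap(grad(loss), in_dims=(None, 0))

  w = torch.randn(3)
  xs = torch.randn(8, 3)
  print(per_sample_grads(w, xs).shape)  # torch.Size([8, 3])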
This can include things like doing operator fusion/lowering to a backend compiler, but can also include things like inserting profiling instrumentation (https://pytorch.org/tutorials/intermediate/fx_profiling_tuto...) or extracting intermediate features (https://github.com/pytorch/vision/releases/tag/v0.11.0).
Basically, if you want a graph representation of a PyTorch module that's really easy to modify, use torch.fx :)
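In its simplest form:

  import torch
  import torch.fx

  class M(torch.nn.Module):
      def forward(self, x):
          return torch.relu(x) + 1

  gm = torch.fx.symbolic_trace(M())
  print(gm.graph)  # the captured graph, easy to iterate over and rewrite
  print(gm.code)   # the Python code FX regenerates from that graph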
torch.fx seems different in that it's primarily aimed at being a platform for users, whereas JAX (as far as I know) hides the logic behind @jit annotations.
Both trace eager code to a graph, which is then rewritten. JAX's @jit notably goes eager -> graph -> XLA: https://jax.readthedocs.io/en/latest/notebooks/How_JAX_primi...
So (to me) it seems that they are similar backends/library primitives with different front-ends. There doesn’t seem to be a difference in representational power, since both hit a graph representation. The main exception I could see would be something like timers, which would perhaps require a graph-mode equivalent for JAX.
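You can see the graph JAX captures directly; this tiny example just prints the jaxpr that @jit would then lower to XLA:

  import jax
  import jax.numpy as jnp

  def f(x):
      return jnp.sin(x) * 2.0

  print(jax.make_jaxpr(f)(1.0))  # the traced graph (a jaxpr)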
For one, FX is extremely unopinionated about what it can trace, and the trace itself is extremely customizable. For example, if a subfunction/module has control flow (i.e. is untraceable), it's easy to mark it as a "leaf" in FX's tracer, while that concept doesn't really make sense in JAX's tracing system.
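A sketch of that leaf idea, assuming a hypothetical DynamicBlock with data-dependent control flow:

  import torch
  import torch.fx

  class DynamicBlock(torch.nn.Module):
      def forward(self, x):
          # data-dependent control flow: symbolic_trace would choke on this
          return x * 2 if x.sum() > 0 else x

  class Net(torch.nn.Module):
      def __init__(self):
          super().__init__()
          self.block = DynamicBlock()

      def forward(self, x):
          return self.block(x) + 1

  class LeafTracer(torch.fx.Tracer):
      def is_leaf_module(self, m, qualname):
          # treat DynamicBlock as an opaque call instead of tracing into it
          return isinstance(m, DynamicBlock) or super().is_leaf_module(m, qualname)

  net = Net()
  gm = torch.fx.GraphModule(net, LeafTracer().trace(net))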
Another example of a difference is that JAX traces out into its own IR called a jaxpr, while FX is explicitly a Python => Python translation. This has some upsides and some downsides - for example, you can insert arbitrary Python functions into your FX graph (breakpoints, print statements, etc.), while jaxprs don't allow that.
Is this a good thing? Well, if your main goal is to lower to XLA, definitely not lol. But for FX it works quite well.
TL;DR: The general principles of doing graph capture are similar, but the details matter, and the details end up being quite different.
But due to Python's dynamic nature, this is already possible. AllenNLP is a great example of that.
For example, say you had a model full of F.relu calls and wanted to replace them all with F.gelu - doing it by hand would mean touching every call site.
With FX, though, you can simply trace out the graph, substitute F.relu with F.gelu, and be done in <10 lines of code!
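Something along these lines (a sketch, not the only way to write it):

  import torch
  import torch.fx
  import torch.nn.functional as F

  def swap_relu_for_gelu(model: torch.nn.Module) -> torch.fx.GraphModule:
      gm = torch.fx.symbolic_trace(model)
      for node in gm.graph.nodes:
          if node.op == "call_function" and node.target is F.relu:
              node.target = F.gelu
      gm.recompile()  # regenerate the module's Python from the edited graph
      return gm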
Essentially, it gives you the freedom to perform transformations on your code (although it places limitations on what your code can contain, like no control flow).
There might also be other things you want to do (like adding profiling after each op) that would be tedious to do manually, but can easily be automated with FX (https://pytorch.org/tutorials/intermediate/fx_profiling_tuto...).
Another example is the recent support from torchvision for extracting intermediate feature activations (https://github.com/pytorch/vision/releases/tag/v0.11.0). Like, sure, it was probably possible to refactor all of their code to enable users to specify extracting an intermediate feature, but it's much cleaner to do with FX.
It also allows you to plug in other ops to, for example, make quantization or profiling easier (see https://pytorch.org/tutorials/intermediate/fx_profiling_tuto...).
Another example is a feature that was just added to torchvision which can take a classification model and extract the backbone for generating embeddings.
> It makes it possible to automate optimizing python models, adding things like conv and batch norm fusion for inference.
By "optimize", do you mean "reduce computational load", or "use Adam/SGD/whatever to minimize a loss function"? What is "conv and batch norm fusion"? How does FX help with any of this?
> It also allows you to plug in other ops to for example make quantization or profiling easier.
I can indeed see how it could make profiling easier. I'd love to get pointers/links as to quantization methods that would necessitate adding new ops.
> Another example is a feature that was just added to torchvision which can take a classification model and extract the backbone for generating embeddings.
Hasn't it always been possible to extract a certain set of weights from some `nn.Module`?
Essentially, during inference, batch norm is simply a multiply-and-add operation. If it occurs after a convolution, then you can simply fold (i.e. "fuse") the batch norm into the convolution by modifying the convolution's weights. See https://pytorch.org/tutorials/intermediate/fx_conv_bn_fuser.... for more details.
What the FX pass ends up looking like is the following (a sketch of the weight folding comes after the list):
1. Look for a convolution followed by a batch norm (where the convolution output is not used anywhere else).
2. Modify the convolution's weights to reflect the batch norm.
3. Change all users of the batch norm's outputs to use the convolution's output.
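Here's a minimal sketch of step 2, the weight folding itself (ignoring groups/dilation for brevity; the linked tutorial has the full version):

  import torch

  def fuse_conv_bn(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
      # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta is affine per
      # channel, so it folds into the conv's weight and bias.
      fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels,
                              conv.kernel_size, conv.stride, conv.padding,
                              bias=True)
      scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
      fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
      b = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
      fused.bias.data = bn.bias.data + scale * (b - bn.running_mean)
      return fused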
> Hasn't it always been possible to extract a certain set of weights from some `nn.Module`?
IIRC, what torchvision allows you to do now is to get a new model that simply computes the embedding.
For example, something like this (not actual code)
model = resnet18()
# Returns a new model that takes in the same inputs and returns the output at layer 3
model_backbone = create_feature_extractor(model, return_nodes=['layer3'])
The FX-based API allows you to simply specify what outputs you want, and it'll create a new model that provides those outputs.
Do you think it also gets in the way of understanding the paper and learning how to implement it?
Don't get me wrong. It is useful & concise, like you mentioned. But your target audience is beginners & adopters, and it makes it no different from another framework such as Fastai (I have major gripes with them; it's a very bottled-in experience).
To stay true to walkthroughs, please consider designing helper functions rather than using your framework. Admittedly it may not be as beautiful, but eventually your users will be more appreciative of the extra mile you go to make things transparent & similar to the PyTorch docs.
If you are already working in Deep Learning, and super comfortable in some other framework, then-
-> Head over to the PyTorch website, and go through the introductory tutorials.
-> Go to Papers with Code, and start reading and trying to implement relatively easier research papers.
If you are a beginner, then you can go through resources that teach Deep Learning through PyTorch. I recommend-
DL professionals wanting to learn PyTorch will get less bang for their buck from these resources, though.
I recommend the PyTorch website tutorials for everyone, though.
I went through the first version of FastAI (when it was Keras? torch? tensorflow?) and forgot most of it because I never did anything with Deep Learning. Then I did the FastAI course again, the version where they use the FastAI V2 library.
I really liked the first version of the course because I felt like Jeremy did an awesome job of balancing understanding the guts of using a DL library with getting stuff done. It was a tough course, but I felt like I really understood things.
I felt like the version with the FastAI V2 library went too far into "Here are some commands you can use in the FastAI V2 library to do this sexy thing with Deep Learning." I completed that course and really felt it should have been titled "A Course on the FastAI V2 Library".
I recently purchased "Deep Learning with PyTorch" by Eli Stevens. I've been working through this book and feel like it explains things a lot more. I haven't finished it yet, but I do like it so far.
A brief post that shows just how thin the FastAI layer can be (if you want!) is here: https://muellerzr.github.io/fastblog/2021/02/14/Pytorchtofas...