
Hi! I'm Mark from the PyTorch team at Meta, and I work on torchao. If you have any questions about the library, or really anything at all about performance, don't hesitate to ask!





Great stuff!

A minor nitpick on the copy (and even then, it might just be me): I find "97% speedup" and "50% speedup" really hard to parse — a "30x speedup" or a "97% reduction in time taken" immediately tells me what is being achieved!

Great results once I get my head around them, though!


Fwiw I'm pretty sure 97% speedup is 197% of the speed of the baseline, so roughly double.

That's why it's confusing: "2x speedup" would clearly indicate 200% of the current speed, so "97% speedup" is ambiguous: it could be a multiple (probably not, since that would be a slowdown), a reduction in time taken (which was my assumption), or an increase in speed (something per unit of time).

I guess you are right and it's probably the latter, but obviously better language would have avoided any doubt.


I understand it as "the speed increases by 97%".

Yeah, indeed the choice of language might not be ideal; it seems like the 2x phrasing is clearest to folks? I can make some quick edits to the article.

Hi Mark, the library looks cool, excited to try it out. Coincidentally I am starting work on a project that is investigating a lot of post-training quantization (PTQ) methods. I read the blog and I am curious to understand what kind of overheads are involved in quantizing a layer?

There's a bunch of overhead associated with PTQ, but the TL;DR is that much of that overhead goes away when you're using `torch.compile()` and `torchao.autoquant()`.

Essentially the latency overhead comes from quantizing and dequantizing weights and activations. For large layers this overhead is small, because quantizing your weights, for example, reduces memory bandwidth pressure; but for small layers the overhead of potentially looking up a table, reading scaling factors, quantizing/dequantizing, and handling zero points might not be worth it.

However, even if such overhead exists you can still quantize your model and make it smaller; the problem is it might not be faster. We solve the speed problem in two ways: `torch.compile()` will fuse operations like a dequant and a matmul into a single kernel, and `torchao.autoquant()` will do kernel-level profiling to see whether a layer is actually made faster by quantizing it, and if not it skips quantizing that layer.
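To make that concrete, here's roughly what the end-user flow looks like — a sketch following the usage shown in the torchao README around this release (exact import paths and options may differ in your version, and the toy model here is just for illustration):

```python
import torch
import torchao

# A toy model; in practice this would be e.g. a transformer's linear layers on CUDA.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda().eval()

# Compile first so ops like dequant + matmul can be fused into a single kernel,
# then autoquant profiles each layer and only quantizes the ones that actually get faster.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

# Running real inputs triggers the per-layer profiling / quantization decisions.
out = model(torch.randn(8, 1024, device="cuda"))
```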


I see, thank you for the explanation!

First off, well done, this looks exciting. I haven't had a chance to interact with the library yet — should torchao be seen as a dev-friendly quantization interface? I.e., if my team was working on new quantization techniques, does the API provide easy tooling for implementing and benchmarking new quantization algorithms? Or is this closer to a "toolbox of (generally) finished products"?

It's both! For this blog we decided to discuss our best end-user-facing numbers to keep things simple. We briefly hint at our contributor guide here https://github.com/pytorch/ao/issues/391 which gives a tour of the APIs we provide for developers implementing new algorithms.

But we have had quantization algorithm developers such as HQQ or AutoRound merge their code in to get composability and serialization for free. We view quantization algorithms as the top layer; going down you have quantized tensors, quant primitives like quant/dequant, and finally basic dtypes like uint1-7 and float3-8. Personally, the reason I spent so much time on AO was that I was hoping we could make it easier for people to express their quantization algorithms in easy-to-read PyTorch code, and if they must use custom kernels we also have some tutorials on how to integrate custom CUDA and Triton ops.
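As a flavor of what "expressing a quantization algorithm in plain PyTorch" can look like at the quant-primitive layer, here's a minimal sketch of per-channel symmetric int8 weight quant/dequant (an illustration only, not torchao's actual primitives):

```python
import torch

def quantize_int8_symmetric(w: torch.Tensor):
    # One scale per output channel: map each row's max |value| to 127.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_int8, scale

def dequantize_int8_symmetric(w_int8: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original fp32 weights.
    return w_int8.to(torch.float32) * scale

w = torch.randn(256, 512)
w_int8, scale = quantize_int8_symmetric(w)
w_hat = dequantize_int8_symmetric(w_int8, scale)
print((w - w_hat).abs().max())  # small quantization error
```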

Most of those discussions have been happening in #torchao on discord.gg/gpumode, so if you need to chat back and forth feel free to reach out to the team there; otherwise GitHub also works.


Thanks for the hard work! Any idea what the roadmap is for MPS support?

Most of our performance relies on leveraging torch.compile, which generates Triton kernels that run fast on CPU and GPU but not on MPS, since Triton does not support generating Metal kernels. So there you lose the nice story of writing low-bit code in pure PyTorch and still getting it to run fast.

In these cases the only path forward we have is writing custom Metal kernels and plugging those in. That work is still ongoing and we'll hopefully have more to share soon.


This might not be the right place for this question, but as someone who has made a couple of very modest MPS backend contributions, I'm curious: why not add Metal support to Triton (or a fork, if OpenAI won't allow it) rather than maintain a whole separate backend?

Mostly it comes down to what's fastest to develop; it's faster to write a few custom kernels than it is to develop a new compiler backend.

Granted, once you get past the upfront effort, compilers are just such a significant UX boost that you are indeed making me question why I don't spend more time working on this myself lol


Hi Mark, wanted to know if float4 training is possible with torchao, as we're trying to fit a large model on a single GPU for training.

We have experimental support for float4 training with the MX formats: https://github.com/pytorch/ao/tree/main/torchao/prototype/mx...

But that's waiting for Blackwell to be released so we get the hardware support. So the recommendation for now would be to use either fp8 training or int8 training.
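If you go the fp8 route, the flow looks roughly like the sketch below. This is based on my reading of torchao's float8 docs; `convert_to_float8_training` and its exact import location may differ in your version, so treat the names as assumptions and check the repo:

```python
import torch
# Assumed entry point; check your torchao version's float8 docs.
from torchao.float8 import convert_to_float8_training

# Toy stand-in for a large model on a single GPU.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# Swap eligible nn.Linear layers over to float8 training.
convert_to_float8_training(model)

# Compile to recover performance, then train as usual.
model = torch.compile(model)
```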



Why don't they merge this into PyTorch? Why so many packages?

There are different tradeoffs; spinning up a separate repo is what we call "out of core", vs. having everything in PyTorch, which is "in core".

Basically, PyTorch is a large library where CI takes a long time to run, which means merging code is hard, adding new dependencies is challenging, and there are stringent constraints on BC-breaking changes.

Instead, what torchao (and many other repos like torchtune, torchchat, and torchtitan) did was move out of core. It helps keep the core PyTorch library leaner with a smaller binary size, and it really lets the team that is out of core focus on optimizing for their needs.

Unfortunately the argument for which approach is better changes over time. For example, torch.compile initially started as a new repo called torchdynamo, built out of core to move fast, but it eventually merged back because everyone wanted to use it. Now torch.compile dev velocity is still quite fast, so we have to tell people to use nightlies instead of official stable releases, to which some people have asked me why we don't move torch.compile out of core.

My 2c is that the ecosystem will be much stronger and teams can move faster if they develop out of core, so that's the tradeoff we picked for torchao. We managed, for example, to merge a few custom CPP kernels like fp6 or Marlin that would have been challenging to motivate in core, since those are still quite experimental and need to stand the test of time.



