Faster Quantized Neural Network Inference with XNNPack (tensorflow.org)
18 points by Marat_Dukhan 8 days ago | 15 comments

Looking at posts from a couple of years back on HN/Reddit/SO about TF vs PyTorch, the only plus side of using TF was the ease of deployment, especially on the mobile side with TensorFlow Lite.

But I imagine that story is changing with the advent of PyTorch Mobile, ONNX, and the fact that PyTorch itself supports XNNPack.

If anyone has any tips or insights as to ease of mobile deployment using TF vs using Pytorch, please share!

Can it perform fixed-point arithmetic with an arbitrary number of bits?

Both training-aware and post training.

It performs fixed-point arithmetic on 8-bit integers. You can mimic lower-than-8-bit precision by using the output_min/output_max parameters in XNNPACK operators, but keep in mind that:

1. This functionality is experimental and not exposed in TFLite; you'd need to call XNNPACK APIs directly from C/C++ code.

2. Computations would still be done on 8-bit numbers.
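To illustrate the idea in plain NumPy (a conceptual sketch of the clamping trick, not the actual XNNPACK API; the helper names and the symmetric range around the zero point are assumptions made here for illustration):

```python
import numpy as np

def quantize_uint8(x, scale, zero_point):
    """Affine-quantize float values to uint8: q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def clamp_to_n_bits(q, bits, zero_point):
    """Mimic lower-than-8-bit precision by clamping an 8-bit result to the
    range an n-bit quantizer could represent (the effect of output_min /
    output_max). The arithmetic itself still happens on 8-bit values."""
    lo = max(0, zero_point - 2 ** (bits - 1))
    hi = min(255, zero_point + 2 ** (bits - 1) - 1)
    return np.clip(q, lo, hi)

x = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
q8 = quantize_uint8(x, scale=0.1, zero_point=128)  # full 8-bit result
q4 = clamp_to_n_bits(q8, bits=4, zero_point=128)   # clamped to a 4-bit range
```

The clamp only restricts the output range; it does not make the multiply-accumulates any cheaper, which is point 2 above.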


Author here, happy to take your questions.

Do I understand correctly that using XNNPACK and mobile acceleration is mutually exclusive? I.e. it's either XNNPACK or NNAPI/CoreML?

Should I consider XNNPACK for a modern mobile phone?

If by acceleration you mean offloading inference to a different IP block (GPU/DSP/NPU), then yes. XNNPACK is the inference engine for CPU.

CPU is the default backend in TensorFlow Lite, and CPU inference always works and produces correct results. GPU/DSP/NPU inference can be faster, particularly for large models on high-end SoCs, but generally you need to make sure that the model is supported on the IP block, the result is correct, and performance is better than the CPU baseline. And that quickly gets very complicated:

1. NN API and the TFLite GPU/DSP backends support a limited subset of all TensorFlow Lite operators, and if a model is only partially offloaded to GPU/DSP/NPU, part of it will still run on CPU, and commonly the synchronization overhead kills all potential speedups of the specialized hardware. The situation is even worse in CoreML, as CoreML doesn't even provide an API to learn which operators failed to offload to GPU/NPU.

2. Bugs in GPU shader compilers and NN API drivers do happen, and unless your model is a standard MobileNet, you're likely to hit them at least on some mobile phones. Then you'd need an infrastructure to detect this situation and disable offloading the model to this IP block on particular phones.

3. Low-end SoCs usually completely lack DSP and NPU, and their GPU is often slower than CPU even in nominal peak performance. This happens because CPU cores in low-end SoCs are typically just downclocked versions of the CPU cores in high-end SoCs, but low-end GPUs have 8-16 times fewer GPU cores than their high-end counterparts.

Wow! Thanks for such a detailed answer. It's much clearer now.

Is this a drop-in solution that works with every existing tflite model?

Yes, these optimizations work with existing tflite models, so long as the quantized operators they use are supported in XNNPACK.

I see; in order to benefit, the model has to be quantized. It is not super clear which kinds of quantization are supported. Both FP16 and INT8?

In order to benefit from optimizations in *this blog post* the model needs to be quantized to 8-bit integers. However, XNNPACK supports floating-point inference as well (including with FP16 weights), see https://blog.tensorflow.org/2020/07/accelerating-tensorflow-...
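The two weight representations can be contrasted in a few lines of NumPy (a conceptual sketch of what each scheme stores, not how TFLite lays out its buffers):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)

# FP16 "quantization": weights are stored as half precision and
# converted back to (or computed on in) float at inference time.
w_fp16 = w.astype(np.float16).astype(np.float32)

# INT8 affine quantization: q = round(w / scale) + zero_point, with
# scale and zero_point chosen so the weight range maps onto [0, 255].
scale = np.float32(np.ptp(w) / 255.0)
zero_point = np.round(-w.min() / scale)
q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
w_int8 = (q.astype(np.float32) - zero_point) * scale
```

FP16 halves the model size with near-float accuracy; INT8 quarters it and, per the post, also enables the faster fixed-point kernels.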


Do the same optimizations apply to tensorflow/tensorflow serving?

TensorFlow doesn't support quantized inference (it supports only mimicking quantization in floating-point for quantization-aware training), so it can't immediately benefit from these optimizations.
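That float-side mimicking is usually called "fake quantization": values stay in floating point but are rounded and clamped exactly as an 8-bit model would round and clamp them. A minimal sketch (a hypothetical helper, not TensorFlow's actual implementation):

```python
import numpy as np

def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    """Simulate 8-bit quantization in float: quantize, clamp, dequantize.
    The output is still a float tensor, so it flows through ordinary
    float ops during training, but it only takes values that an 8-bit
    integer model could produce."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return ((q - zero_point) * scale).astype(x.dtype)

x = np.array([0.06, 0.14, 9.99], dtype=np.float32)
y = fake_quant(x, scale=0.1, zero_point=0)  # snapped to multiples of 0.1
```

Training against these snapped values lets the model adapt to quantization error, but the arithmetic itself is still floating point, which is why the integer-kernel speedups in the post don't carry over.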
