Last time I took a look at TensorFlow Lite, they had a vision where you would export your model into a .tflite file (which is a FlatBuffer-encoded execution graph with weights) and then use it on mobile for inference like this, in pseudo-code:
model = tflite.interpreter().load_model("my_model.tflite")
model.set_input_data(my_input_buffer)
model.execute()
model.get_result(my_output_buffer)
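For reference, the real Python interpreter API that ships with TensorFlow is very close to that sketch. A minimal example (the model path and zeroed input are placeholders):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="my_model.tflite")
interpreter.allocate_tensors()
input_info = interpreter.get_input_details()[0]
output_info = interpreter.get_output_details()[0]

# feed an input of the expected shape/dtype (zeros as a placeholder)
interpreter.set_tensor(input_info["index"],
                       np.zeros(input_info["shape"], dtype=input_info["dtype"]))
interpreter.invoke()
result = interpreter.get_tensor(output_info["index"])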
Which is nice, since you can easily update the model by simply distributing a new "my_model.tflite" file. The TFLite interpreter library would use whatever capabilities (SIMD instructions, DSP cores, etc.) are available on the device to accelerate inference, so the application developer doesn't have to worry about writing different code for different platforms, or even understand how the prediction works under the hood.
Is QNNPACK a library directly competing with TFLite? Are the model file formats the same between tflite and this? Does it support the TensorFlow Cores created by Google for inference, and/or more generally specialized cores like Qualcomm's DSP?
QNNPACK directly competes with the CPU backend of TensorFlow Lite and the gemmlowp library. The Caffe2 backend of PyTorch 1.0 integrates QNNPACK, and directly competes with TensorFlow Lite. QNNPACK targets only mobile CPUs, but Caffe2 integrates other backends for non-CPU targets, e.g. Apple's MPSCNN library for iPhone GPUs, Qualcomm's Snapdragon NPE for Qualcomm GPUs and DSPs, and ARM Compute Library for Android GPUs. Not sure what you mean by TensorFlow Cores: NVIDIA has Tensor Cores and TensorRT, and Google has Tensor Processing Units (TPU), but neither of these technologies is for mobile.
I was referring to TensorRT from Nvidia and TPUs from Google.
One of the strengths of the TFLite API is that the same exported tflite model can run on both mobile devices and servers. It may make less sense to run lite models on servers because of the loss of precision, but it may also have its own use cases, e.g. very big models on cheap servers.
Nvidia sells Android devices and embedded boards for robotics, which will surely have some sort of TensorRT-derived cores if they don't already. Google could one day integrate their specialized cores (security and TPUs) into their phones too, or into AI-oriented IoT devices.
From the look of the first paragraphs, it would seem that QNNPACK is a library similar to Intel's MKL/MKL-DNN. So you get "compiled" functions/kernels that accelerate a particular (compute-intensive) task.
With regards to TensorFlow Lite, this means that Google could possibly build tflite with QNNPACK and (maybe) get better performance out of the resulting binary on the set of mobile platforms supported by QNNPACK.
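The "particular compute-intensive task" in QNNPACK's case is 8-bit quantized matrix multiplication (convolutions essentially reduce to it). The arithmetic being accelerated is roughly the standard affine-quantization scheme; here is a plain numpy sketch of the math, not QNNPACK's actual code:

import numpy as np

def quantized_matmul(a_q, b_q, a_zero, b_zero, a_scale, b_scale, c_scale, c_zero):
    # uint8 operands, int32 accumulation: real_value = scale * (q - zero_point)
    acc = (a_q.astype(np.int32) - a_zero) @ (b_q.astype(np.int32) - b_zero)
    # requantize the int32 accumulator back to uint8 for the next layer
    c_real = acc * (a_scale * b_scale)
    return np.clip(np.round(c_real / c_scale) + c_zero, 0, 255).astype(np.uint8)

The whole game is making that inner accumulation fast on a mobile CPU (NEON, cache-aware packing), which is what the article is about.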
Edit: by the end of the article, they say how they built TensorFlow Lite with QNNPACK and got substantial speedups across a range of different phones.
I didn't understand it that way: they didn't build TensorFlow Lite with a QNNPACK "backend". They compared both versions on the same benchmarks, but they didn't "merge" the solutions.
So, theoretically QNNPACK could be used to implement a TensorFlow Lite interpreter. However, it seems the most interesting implementations will use hardware-specific acceleration, such as TensorRT from Nvidia or Google's TPUs, whereas QNNPACK seems to only target SIMD optimizations on CPUs.
That's still a good amount of work to identify the optimizable building blocks, or to validate other approaches such as TFLite, but each mobile processor vendor (Qualcomm, ARM, Intel) already provides an implementation of the Android NN API that maximizes the usage of its hardware.
That's why I'm not sure how QNNPACK integrates with the entire ecosystem.
Edit: as I see it, to consume a model in an application, the diagram looks like this:
developer <-> tflite interpreter API <-> Android NN API (if target is Android) <-> vendor provided accelerated implementation (blackbox/binary blob, that's where most of the acceleration is supposed to happen)
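In TFLite terms, that hop from the interpreter to the Android NN API (and from there to the vendor blob) goes through a "delegate". The Python API can load a delegate from a shared library; here is a sketch of the wiring, with a hypothetical delegate library name (on Android/Java it is usually just an interpreter option):

import tensorflow as tf

# "libnnapi_delegate.so" is a hypothetical name; in practice you pass
# whatever delegate .so the platform/vendor provides (NNAPI, GPU, DSP, ...)
delegate = tf.lite.experimental.load_delegate("libnnapi_delegate.so")

interpreter = tf.lite.Interpreter(
    model_path="my_model.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
# ops the delegate supports run on the vendor implementation,
# everything else falls back to the built-in CPU kernels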
Edit 2: now that I think about it, it doesn't make sense to benchmark TensorFlow Lite itself. TensorFlow Lite is only an API and a file format spec, not a specific implementation, from what I understand.
Replying to your other comments (about how QNNPACK integrates and about implementations of the Android NN API):
I'm not entirely sure what they're aiming for there. Usually when you see talk about "kernels" it's more about how particular filters/convolutions/low-level operations are optimized, and it is implied that kernels run on the GPU (most of the time). They do talk a lot about microarchitectural details, cache sizes and ARM/NEON operations, so it seems to be all implemented on the CPU, but I don't really grasp how it ties in with the vendor-specific implementations that you mention.
It could be that these are new algorithms/implementations that play to the strengths of the system (not just the CPU or the microarchitecture) and try to "go easy" on memory bandwidth, for example, to get better performance out of otherwise equivalent (maybe?) code.
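Cache blocking/tiling is the textbook example of that kind of optimization: exactly the same multiply-adds, but operands get reused while they are still hot in cache instead of being streamed from RAM over and over. A toy sketch of the idea (nothing to do with QNNPACK's actual tiling):

import numpy as np

def tiled_matmul(a, b, tile=64):
    # same arithmetic as a @ b, but each (tile x tile) block of a and b
    # is reused many times while it still fits in cache
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):
        for p in range(0, k, tile):
            a_blk = a[i:i+tile, p:p+tile]
            for j in range(0, n, tile):
                c[i:i+tile, j:j+tile] += a_blk @ b[p:p+tile, j:j+tile]
    return c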
This reminds me a bit of the numexpr[0] project, which accelerates numpy computations in Python by breaking the work into cache-friendly chunks in memory.
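The effect is easy to see if numexpr is installed (the expression here is arbitrary):

import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

plain = 2 * a + 3 * b                # numpy: full-size temporaries, several passes over memory
fast = ne.evaluate("2 * a + 3 * b")  # numexpr: evaluated in cache-sized chunks, no temporaries
assert np.allclose(plain, fast)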
You're right; I was skimming over parts of the text and didn't read it carefully the first time. They're using QNNPACK+Caffe2 to outperform TensorFlow Lite.
"Without repacking, the microkernel would have to read rows of A separated by potentially large stride. If this stride happens to be a multiple of a large power of 2, elements from different rows of A in the panel may fall into the same cache set. If the number of colliding rows exceeds cache associativity, they evict each other and performance falls off a cliff. Fortunately, this situation cannot happen when the panel fits into L1, as with the models for which QNNPACK is optimized."
I only very roughly understand the concepts of caching and eviction algorithms, but I wish processors and OSes (which presumably can configure cache behaviour) would expose the current cache configuration, so that compilers and library designers could take it into account in a more automated/uniform way. Alternatively, how accurately do emulators model cache behaviour? It would be nice to see where a bottleneck is, or how long ago a needed value was evicted, for profiling purposes.
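On Linux the kernel does expose the cache geometry through sysfs, and valgrind's cachegrind tool simulates a configurable two-level cache and reports misses per source line, which is close to the kind of profiling you describe. A small sketch that reads the L1D parameters and redoes the power-of-two-stride arithmetic from the quote above (standard sysfs paths, example stride; values obviously vary per machine):

# read the L1 data cache geometry Linux exposes under sysfs
# (index0 is normally the L1 data cache)
def read(name):
    with open("/sys/devices/system/cpu/cpu0/cache/index0/" + name) as f:
        return f.read().strip()

size = int(read("size").rstrip("K")) * 1024   # e.g. "32K"
ways = int(read("ways_of_associativity"))     # e.g. 8
line = int(read("coherency_line_size"))       # e.g. 64
sets = size // (line * ways)

# the set index comes from the low address bits, so rows whose stride is a
# multiple of (line * sets) all map to the same set, and only `ways` of them
# can coexist: the performance cliff the quoted paragraph describes
stride = line * sets * 4                      # an example power-of-two stride
print([(row * stride // line) % sets for row in range(8)])   # all identical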