
Qnnpack: PyTorch-integrated open source library for mobile deep learning - jimarcey
https://code.fb.com/ml-applications/qnnpack-open-source-library-for-optimized-mobile-deep-learning/
======
antpls
That's an informative article.

Last time I took a look at TensorFlow Lite, they had a vision where you would
export your model into a .tflite file (a FlatBuffer-encoded execution graph
with weights) and then use it on mobile for inference, roughly like this with
the Python tf.lite.Interpreter API:

    
      import tensorflow as tf
    
      # Load the FlatBuffer model, run inference, read back the result
      interpreter = tf.lite.Interpreter(model_path="my_model.tflite")
      interpreter.allocate_tensors()
      interpreter.set_tensor(interpreter.get_input_details()[0]["index"], my_input_buffer)
      interpreter.invoke()
      my_output_buffer = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
    

Which is nice, since you can easily update the model by simply distributing a
new file "my_model.tflite". The TFLite interpreter library would use whatever
capability (SIMD instructions, DSP cores, etc.) is available on the device to
accelerate inference, so the application developer doesn't have to worry about
writing different code for different platforms or even understand how the
prediction works under the hood.

Is QNNPACK a library directly competing with TFLite? Are the file formats used
for the model the same between tflite and this? Does it support TensorFlow
Cores created by Google for inference, and/or more generally specialized cores
like Qualcomm's DSP?

~~~
Marat_Dukhan
QNNPACK directly competes with the CPU backend of TensorFlow Lite and the
gemmlowp library. The Caffe2 backend of PyTorch 1.0 integrates QNNPACK, and
directly competes with TensorFlow Lite. QNNPACK targets only mobile CPUs, but
Caffe2 integrates other backends for non-CPU targets, e.g. Apple's MPSCNN
library for iPhone GPUs, Qualcomm's Snapdragon NPE for Qualcomm GPUs and DSPs,
and the ARM Compute Library for Android GPUs. Not sure what you mean by
TensorFlow Cores: NVIDIA has Tensor Cores and TensorRT, and Google has Tensor
Processing Units (TPUs), but neither of these technologies is for mobile.

~~~
antpls
Thanks for the clarifications!

I was referring to TensorRT from Nvidia and TPUs from Google.

One of the strengths of the TFLite API is that the same exported .tflite model
can run on both mobile devices and servers. It may make less sense to run lite
models on servers because of the loss of precision, but it may also have its
own use cases, such as very big models on cheap servers.

Nvidia sells Android devices and embedded boards for robotics, which will
surely get some sort of TensorRT-derived cores if they don't already. Google
could one day integrate their specialized cores (security and TPUs) into their
phones too, or into AI-oriented IoT devices.

------
DoctorOetker
"Without repacking, the microkernel would have to read rows of A separated by
potentially large stride. If this stride happens to be a multiple of a large
power of 2, elements from different rows of A in the panel may fall into the
same cache set. If the number of colliding rows exceeds cache associativity,
they evict each other and performance falls off a cliff. Fortunately, this
situation cannot happen when the panel fits into L1, as with the models for
which QNNPACK is optimized."

I only very roughly understand the concepts of caching and eviction
algorithms, but I wish processors and OSes (which presumably can configure
cache behaviour) would expose the current cache configuration so that
compilers and library designers could take it into account in a more
automated/uniform way. Alternatively, how accurately do emulators model cache
behaviour? It would be nice to see where a bottleneck is, or how long ago a
needed value was evicted, for profiling purposes.
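
For what it's worth, the collision condition in the quoted paragraph is easy
to sketch numerically. Here is a back-of-the-envelope example; the L1 geometry
(32 KiB, 8-way set-associative, 64-byte lines, hence 64 sets) is an assumption
for illustration, not a statement about any particular core:

    
      # Hypothetical L1 data cache: 32 KiB, 8-way, 64-byte lines -> 64 sets
      LINE_SIZE = 64
      NUM_WAYS = 8
      NUM_SETS = 32 * 1024 // (LINE_SIZE * NUM_WAYS)
    
      def cache_set(addr):
          # Which cache set a byte address maps to
          return (addr // LINE_SIZE) % NUM_SETS
    
      # Rows of A separated by a stride that is a multiple of LINE_SIZE * NUM_SETS
      stride = 4096
      row_starts = [r * stride for r in range(12)]
      print({cache_set(a) for a in row_starts})  # {0}: 12 rows contend for 8 ways
    

Twelve rows all mapping to one set exceed the 8 ways, so they evict each
other, which is the cliff the article describes; as the quote notes, this
situation cannot occur when the panel fits into L1.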

