
XLA: linear algebra library for TensorFlow - mud_dauber
https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
======
Marat_Dukhan
>>> Softmax can be implemented as a composition of primitive TensorFlow ops
(exponent, reduction, elementwise division, etc.): softmax = exp(logits) /
reduce_sum(exp(logits), dim)

No, it cannot be implemented this way: it is numerically unstable and will
produce NaNs if any input is greater than ~88.7 (where expf overflows to
infinity). Luckily, that is also not how it's implemented in TensorFlow:
[https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...](https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a246f54c506aae4587dbce5d3bcf0/tensorflow/core/kernels/softmax_op_functor.h#L43)

For a clean (and more efficient) C version of this algorithm, take a look at
the NNPACK reference implementation:
[https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/so...](https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/softmax-output.c#L30)
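
To make the instability concrete, here is a minimal C sketch (not the
TensorFlow or NNPACK code) of the quoted naive formula: once any logit exceeds
~88.7, expf overflows to +inf and the final division yields inf/inf = NaN.

    #include <math.h>
    #include <stdio.h>

    /* Naive softmax exactly as in the quoted formula: exp(x) / sum(exp(x)).
       Illustrative sketch only. expf(90.0f) overflows to +inf in float. */
    int main(void)
    {
        float logits[3] = {90.0f, 90.0f, 0.0f};
        float e[3], sum = 0.0f;
        for (int i = 0; i < 3; ++i) sum += (e[i] = expf(logits[i]));
        for (int i = 0; i < 3; ++i) printf("%f\n", e[i] / sum);  /* nan, nan, 0.000000 */
        return 0;
    }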

~~~
acmj
In the nnpack implementation, the same exponential (i.e. expf) is computed
twice for each element, which is a waste of time. A faster implementation
should save each expf result to output[sample][channel] first, compute the sum
and then rescale output[sample][channel] by the sum.

~~~
Marat_Dukhan
If you have an efficient vectorized implementation of expf (NNPACK does),
softmax is a memory/cache bandwidth-bound kernel, and storing the intermediate
results is less efficient than recomputing them.
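
A rough C sketch of that structure (illustrative only, not the NNPACK source):
expf is evaluated twice per element, but the output array is written exactly
once, so with a fast vectorized expf the kernel stays bandwidth-bound.

    #include <float.h>
    #include <math.h>

    /* Recompute variant: two expf passes, but y[] is written only once.
       Illustrative sketch, not the NNPACK source. */
    void softmax_recompute(int n, const float *x, float *y)
    {
        float max = -FLT_MAX, sum = 0.0f;
        for (int i = 0; i < n; ++i) max = max > x[i] ? max : x[i];
        for (int i = 0; i < n; ++i) sum += expf(x[i] - max);          /* first expf pass */
        const float scale = 1.0f / sum;
        for (int i = 0; i < n; ++i) y[i] = scale * expf(x[i] - max);  /* second expf pass */
    }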

~~~
acmj
Where is this vectorized expf implemented? I only see softmax calling the
standard expf. Let's suppose expf from libm is vectorized. Is there any
benchmark showing NNPACK's implementation is really faster? I doubt it,
actually. Exp is quite expensive even if vectorized.

~~~
Marat_Dukhan
This is the reference implementation of softmax, i.e., the implementation used
as a reference in the unit tests. It is designed to be simple, readable, and
correct, which is why I linked it here.

The optimized implementation is in assembly (PeachPy). See
[https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...](https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64-fma/exp.py)
for the vectorized expf.

~~~
Marat_Dukhan
Oops, here is the actually used vectorized expf (similar to the one I linked,
but with optional unrolling):
[https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...](https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64-fma/vecmath/exp.py)
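
Conceptually (a scalar illustration only; the linked PeachPy kernels vectorize
this with AVX2/FMA and handle the edge cases), such expf kernels split the
argument as x = n*ln2 + r and approximate exp(r) with a short polynomial:

    #include <math.h>

    /* Scalar sketch of the usual SIMD expf scheme: exp(x) = 2^n * exp(r),
       with n = round(x / ln2) and |r| <= ln2/2. Illustration only, not the
       NNPACK/PeachPy code. */
    static float expf_sketch(float x)
    {
        const float ln2 = 0.69314718f;
        float n = nearbyintf(x / ln2);
        float r = x - n * ln2;
        /* degree-4 Taylor polynomial for exp(r); real kernels use a minimax fit */
        float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f / 6.0f + r * (1.0f / 24.0f))));
        return ldexpf(p, (int)n);
    }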

~~~
acmj
Thanks for the pointer. The full softmax implementation is here [1]. I have
not read the code, but I can trust the developer to have a very fast
implementation. Nonetheless, I don't think the reference implementation in
your original link is optimal. Exp is expensive and should not be called twice
(EDIT: unless you can show a benchmark to prove me wrong).

[1]:
[https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...](https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64-fma/softmax.py)

~~~
barrkel
FWIW, you're replying to the developer of the file you linked to.

~~~
acmj
I later realized he is the developer, but this does not change this
discussion. Here is a micro benchmark, computing softmax 1 million times over
a random vector of size 1000. On an old Linux server, calling the libm expf
once takes 11.76 CPU seconds; calling it twice takes 25.15s. The
implementation for calling expf once:

    
    
      #include <float.h>  /* FLT_MAX */
      #include <math.h>   /* expf */

      /* One-expf version: store each expf result in y[], then rescale by 1/sum. */
      void softmax1(int n, const float *x, float *y)
      {
          int i;
          float s, max = -FLT_MAX;
          for (i = 0; i < n; ++i) max = max > x[i] ? max : x[i];            /* max for numerical stability */
          for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max)); /* store expf, accumulate sum */
          for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;                  /* rescale in place */
      }
    

This micro-benchmark proves my point: using expf from libm, the reference
implementation in NNPACK is suboptimal. It is possible that a vectorized expf
would change that, but the developer needs to prove it with numbers.
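
For completeness, a hypothetical harness along the lines of that measurement
(timings will vary by machine; swap softmax1 for a two-expf variant, as in the
NNPACK reference, to get the second number):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    void softmax1(int n, const float *x, float *y);  /* defined above */

    /* Hypothetical harness: 1e6 softmax evaluations over a random vector of
       length 1000, timed with clock(). */
    int main(void)
    {
        enum { N = 1000, ITERS = 1000000 };
        static float x[N], y[N];
        for (int i = 0; i < N; ++i) x[i] = (float)rand() / RAND_MAX;
        clock_t t0 = clock();
        for (int it = 0; it < ITERS; ++it) softmax1(N, x, y);
        printf("%.2f CPU seconds\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }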

------
theCricketer
Chris Leary, a compiler engineer at Google, gave a talk about XLA at the recent
TensorFlow Dev Summit:

[https://www.youtube.com/watch?v=kAOanJczHA0](https://www.youtube.com/watch?v=kAOanJczHA0)

------
jakekovoor
Thank you OP, this is really helpful. :)

If you need to install TensorFlow on Windows 10, you can follow this:

[http://saintlad.com/install-tensorflow-on-windows/](http://saintlad.com/install-tensorflow-on-windows/)

~~~
alok-g
I would like to see the guide for the GPU version. Is it coming anytime soon?
:-) Thanks.

~~~
ska
Anaconda has GPU and non-GPU packages, if that helps.

------
visarga
It would seem Torch/PyTorch are faster than TF. TF applies static optimizations
to the computation graph, while Torch builds a dynamic computation graph.
Logically, static optimization should be faster, since the data sizes are known
beforehand.

So, why is TF slower?

~~~
pmalynin
TensorFlow is getting dynamic JIT optimization too. I think part of the reason
some dynamic optimizations can perform better is that the results of the
optimization can be cached and reused for most other batches, and they can
specialize to exploit batch-, shape-, and input-specific properties.

~~~
killjoywashere
I'm kind of getting the sense that TF is presently being optimized for its own
massive development: that is, the engineers are yanking out chunks of code and
replacing them, quickly, and at varying scales.

------
shoshin23
I've been looking around in a few places, but I can't find a way to use XLA to
compile TensorFlow models for mobile devices. Is there a tutorial/blog post by
Google (or anyone, for that matter) about it? Thanks!

~~~
learyg
Did you see the "using tfcompile" section of the docs?
[https://www.tensorflow.org/versions/master/experimental/xla/...](https://www.tensorflow.org/versions/master/experimental/xla/tfcompile#using_tfcompile)

If you're looking for more detailed information that's missing from the docs,
please do file a Github issue about it. Thanks!

------
ndesaulniers
Even if you're not interested in machine learning or AI, XLA, and particularly
its Python bindings, are a great and easy way to do GPGPU programming.

------
probdist
Why does this support JIT but not AOT for NVIDIA GPUs?

~~~
learyg
AOT for GPUs is doable. Do you have a killer use case?

For CPU, mobile code footprint reduction was the driving force.

~~~
puzzle
Not having to ship a toolchain (nvcc, gpucc, or whatever equivalent is linked
as a library)?

~~~
learyg
It's gpucc, so it builds from LLVM when you enable XLA in the TF configure
step.

Trying to understand: do you not want to ship a compiler library on principle,
is it some kind of product requirement, or something else? There's lots of cool
work to be done in the compiler space, so use cases help to prioritize. :-)
Thanks!

