
We improved Tensorflow Serving performance by over 70% - craigkerstiens
https://mux.com/blog/tuning-performance-of-tensorflow-serving-pipeline/
======
ajtulloch
FWIW, I believe that the current state of the art for batch-size 1, fp32
inference for ResNet-50 on Intel CPUs is AWS's work in
[https://arxiv.org/abs/1809.02697](https://arxiv.org/abs/1809.02697). After
the low-hanging fruit outside of model execution has been picked, this kind of
work is probably quite relevant.

~~~
human_afterall
Hey! Author here, thanks for linking the paper. The article was from an
infrastructure perspective, but we're definitely diving deeper into graph
execution optimizations after this :)

~~~
rahulun
Are there any particular optimizations you are looking into?

------
ss7pro
Have a look here: [https://github.com/IntelAI/OpenVINO-model-
server/blob/master...](https://github.com/IntelAI/OpenVINO-model-
server/blob/master/docs/benchmark.md) You can replace TF Serving with OpenVINO
to get even better performance and lower latency when running on CPUs.

~~~
londons_explore
What useful models run at decent speed on a CPU these days?

Even basic image classifiers tend to be 100x faster on a GPU or TPU...

~~~
bitL
Inference is not that slow on CPUs, especially for network requests that
already carry quite a bit of latency, so plenty of companies use cloud CPUs
for lambda/flexible loads where GPUs aren't available.

------
greesil
[https://www.microsoft.com/en-
us/research/publication/deepcpu...](https://www.microsoft.com/en-
us/research/publication/deepcpu-serving-rnn-based-deep-learning-
models-10x-faster/)

TensorFlow has some known inefficiencies.

------
solidasparagus
Cool work! It feels like the improvement is a little overstated because of how
you're measuring - your measurements include import/setup time, so you get big
gains by improving imports. But in reality, you won't be creating a new client
for each request, and client import/setup time is unrelated to TF Serving
performance. TF Serving performance is really about the time elapsed between
request received and response returned.
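As a rough illustration of the distinction (names and timings are hypothetical
stand-ins, not the post's actual benchmark code): create the client once
outside the timed region, then time only the request/response round trip.

```python
import time

def make_client():
    # Hypothetical stand-in for channel/stub creation; in a real
    # benchmark this would import the libs and open the gRPC channel once.
    time.sleep(0.05)  # simulate one-time import/setup cost
    return lambda request: request  # echo "predict" placeholder

# Setup happens once, outside the timed region.
predict = make_client()

latencies = []
for i in range(10):
    start = time.perf_counter()  # timer starts only when the request is ready
    predict({"input": i})
    latencies.append(time.perf_counter() - start)

# Only request->response time is reported, so the one-time setup
# cost never inflates the measured serving latency.
print(f"median latency: {sorted(latencies)[len(latencies) // 2] * 1e3:.3f} ms")
```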

------
bwasti
> containers are run on a 4 core, 15GB, Ubuntu 16.04 host machine

What CPU is being used?

Assuming the benchmark is done with something like an EC2 C5 instance, the
results in this post are quite slow. Somewhere around 14x slower than
benchmarks from a year ago on EC2 C5 instances. [1]

[1]
[https://dawn.cs.stanford.edu/benchmark/ImageNet/inference.ht...](https://dawn.cs.stanford.edu/benchmark/ImageNet/inference.html),
using the c5.2xlarge benchmark and assuming linear scaling

~~~
human_afterall
Hi bwasti, the host's CPU platform is Intel Broadwell. While the CPU
architecture of our production hosts is the same, the resources allocated are
much higher than 4 cores. This post gives an overview of the relative
improvements that can be made from a vanilla setup :)

-masroor (author)

~~~
bwasti
You may want to check out Intel's optimized version of TensorFlow Serving[1]
for further improvements (on the order of 2x for ResNet-50[2]).

As an aside, I took the resource allocation into account in the parent
comment. The c5.2xlarge has 8 cores and 16GB RAM [3] and does a single fp32
inference in ~17ms. If we chop that down to 4 cores and assume linear scaling,
we can fathom running ResNet-50 in ~35ms compared to the ~500ms achieved here.
I'd recommend comparing to a known baseline rather than a "vanilla setup" to
ensure you aren't missing any simple changes that may dramatically improve
performance.

[1]
[https://github.com/IntelAI/models/blob/master/docs/general/t...](https://github.com/IntelAI/models/blob/master/docs/general/tensorflow_serving/InstallationGuide.md)

[2] [https://www.intel.ai/improving-tensorflow-inference-
performa...](https://www.intel.ai/improving-tensorflow-inference-performance-
on-intel-xeon-processors/)

[3] [https://aws.amazon.com/ec2/instance-
types/c5/](https://aws.amazon.com/ec2/instance-types/c5/)
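The back-of-envelope arithmetic above can be sketched as follows (assuming,
as stated, that latency scales inversely with core count; the 17ms figure is
the DAWNBench c5.2xlarge result cited in [1] of the parent comment):

```python
# Linear-scaling estimate: fewer cores -> proportionally higher latency.
c5_2xlarge_cores = 8
c5_2xlarge_latency_ms = 17.0  # single fp32 ResNet-50 inference on c5.2xlarge

target_cores = 4
# Throughput assumed linear in cores, so latency scales by the core ratio.
estimated_latency_ms = c5_2xlarge_latency_ms * (c5_2xlarge_cores / target_cores)

print(f"estimated 4-core latency: ~{estimated_latency_ms:.0f} ms")  # ~34 ms
```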

~~~
human_afterall
@bwasti, really good points - this is something we look forward to evaluating!
Our post does indeed outline the optimizations from the tensorflow/serving
image to the tensorflow/serving:*-devel image [1]. The next logical
improvement (given the Intel architecture and the docs linked) is to start
building on top of the *-devel-mkl image.

-masroor (author)

[1]
[https://github.com/tensorflow/serving/tree/master/tensorflow...](https://github.com/tensorflow/serving/tree/master/tensorflow_serving/tools/docker)

------
mehrdada
The grpc.beta code elements are deprecated and may go away at any time. (gRPC
1.0.0 is also very old and unsupported.)

~~~
human_afterall
Good point - we're still in the process of migrating to >= 1.17. The gRPC
connection and client stub should still translate, with a few semantic
updates:

```
import grpc
from tensorflow_serving.apis.prediction_service_pb2_grpc import (
    PredictionServiceStub,
)

# Open an insecure channel to the model server and create the stub.
channel = grpc.insecure_channel('0.0.0.0:9000')
stub = PredictionServiceStub(channel)
```

-masroor (author)

------
rahulun
There is an optimized version of TensorFlow based on Clear Linux and MKL-DNN:
[https://clearlinux.org/stacks](https://clearlinux.org/stacks). It would be
interesting to see the performance difference between the natively compiled
version and this one.

~~~
human_afterall
Hey! That's super interesting - so far we've gone with TensorFlow's
Ubuntu-based official Docker devel image, but a clearlinux base definitely
looks worth looking into!

-masroor (author)

------
naturalwarren
This is amazing.

