
Vectorized execution brings a 10x performance increase for expression evaluation - Lilian_Lee
https://pingcap.com/blog/10x-performance-improvement-for-expression-evaluation-made-possible-by-vectorized-execution/
======
gameswithgo
I have an 'art/game' project in Rust that uses vectorized expression
evaluation to draw random images. It turns the expression tree into a stack
machine then evaluates the stack machine with SIMD instructions:
[https://github.com/jackmott/evolution/blob/master/src/stack_...](https://github.com/jackmott/evolution/blob/master/src/stack_machine.rs)

The speedup is really big for this case sometimes, because not only do you do
the math 4x/8x/16x faster (depending on instruction set), you also traverse
the stack machine (or tree, if you are purely interpreting) 4x/8x/16x less
often. The improvement when traversing a tree is especially big because of
reduced memory hops.

I used a SIMD library I made in Rust, which lets me write the stack machine
once, and then run it in SSE2/SSE41 or AVX2 mode. You can select either at
runtime or compile time:

[https://github.com/jackmott/simdeez](https://github.com/jackmott/simdeez)
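As a rough illustration of why batch evaluation pays off, here is a minimal stack-machine sketch in C (the op names and layout are invented for this example, not taken from the evolution/simdeez crates). One opcode dispatch covers a whole batch of lanes, and the inner per-lane loops are plain contiguous float loops that gcc/clang can auto-vectorize:

```c
#include <stddef.h>

/* Hypothetical opcodes for a tiny postfix expression program. */
enum op { OP_X, OP_CONST, OP_ADD, OP_MUL };

struct instr { enum op op; float k; };

#define BATCH 8  /* one opcode dispatch per 8 lanes instead of per element */

/* Evaluate a postfix program for a batch of x values. The per-lane
   loops are contiguous float loops, so the compiler can emit SSE/AVX
   for them automatically. */
void eval_batch(const struct instr *prog, size_t n,
                const float *x, float *out)
{
    float stack[64][BATCH];
    size_t sp = 0;
    for (size_t i = 0; i < n; i++) {
        switch (prog[i].op) {
        case OP_X:     /* push the input variable */
            for (int l = 0; l < BATCH; l++) stack[sp][l] = x[l];
            sp++;
            break;
        case OP_CONST: /* push a constant, splatted across lanes */
            for (int l = 0; l < BATCH; l++) stack[sp][l] = prog[i].k;
            sp++;
            break;
        case OP_ADD:   /* pop two, push their lane-wise sum */
            sp--;
            for (int l = 0; l < BATCH; l++) stack[sp - 1][l] += stack[sp][l];
            break;
        case OP_MUL:   /* pop two, push their lane-wise product */
            sp--;
            for (int l = 0; l < BATCH; l++) stack[sp - 1][l] *= stack[sp][l];
            break;
        }
    }
    for (int l = 0; l < BATCH; l++) out[l] = stack[0][l];
}
```

Compared to evaluating the expression once per pixel, the switch and the stack traversal run once per batch, which is where the extra speedup beyond the raw SIMD math comes from.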

~~~
dahart
Sounds cool, do you have a gallery? I had a similar sounding project inspired
by Karl Sims’ genetic images that I haven’t touched for many years now, but
I’ve been meaning to revive and compile to GLSL or CUDA to get (I expect) much
faster eval than is possible with CPU AVX or SSE instructions. Have you
considered that route?

~~~
gameswithgo
the readme has some examples:
[https://github.com/jackmott/evolution](https://github.com/jackmott/evolution)

Not really enough to show off the full range of possibilities, but it is a
start. The program is still a WIP; I'll release it eventually.

~~~
dahart
Very cool, thanks for the link! The cell & fbm primitives look really useful.
The only complex primitive like that I included so far was Perlin turbulence.
If you’re interested, here’s a sampling of mine,
[https://flickr.com/photos/biv4b/albums/576553](https://flickr.com/photos/biv4b/albums/576553)
I haven’t gotten around to any vectorized evaluation yet, it’s pure CPU
recursive eval, the closest thing I did was to use multiple threads, but it’s
not particularly fast compared to what a GPU would do.

~~~
gameswithgo
those look really nice, I'd be interested to know what primitives you have and
how you compose them

~~~
dahart
Sure, I happen to have written those things up in a paper; it’s a fairly
simple & constrained set. The composing part is also fairly standard though I
came up with a couple of new mutation strategies that seemed useful.
[http://dahart.com/paper/hart_evomusart_2007_paper.pdf](http://dahart.com/paper/hart_evomusart_2007_paper.pdf)
You can ignore most of that unless you’re interested in animating evolved
expressions; the primitives & mutations are summed up in the first few pages.

------
snidane
I always wondered why people assume that the higher-level the abstractions or
programming languages you use, the less performance you should expect: C is
faster than Java, Java will be faster than Python, etc. The 'interpretation
overhead' is supposed to kill your performance.

But there is one weird exception: the APL family, which sits at a very high
level of abstraction yet can perform even faster than C, especially because of
vectorized processing. When you work on vectors of 1000s of items in one
instruction, you amortize the language interpretation cost away, and since
working with vectors is actually the only natural way to work with computers,
you get massive performance from those vectorized instructions or even from
running on a GPU. (All memory access is naturally linear in computing.
Random-access memory is an unnatural computing myth which comes at enormous
cost and has to be hardware accelerated to even be usable.)

The same can be said about databases and SQL, especially in OLAP processing,
where you can linearize your data tables and columns and vectorize your
processing. Because it is near impossible to overcome the von Neumann
bottleneck in traditional single-computer languages like C or Java, any SQL or
APL will beat the crap out of them if you spread the processing over multiple
cores and machines.

Days of single-machine processing are over and clusters of computers are the
future. AWS (and potentially other clouds) is essentially an operating system
for such environments. It'd be nice for open source to catch up though.

~~~
gameswithgo
>The 'interpretation overhead' is supposed to kill your performance.

Java isn't interpreted

>You have the APL family, which are a very high at level of abstraction but
perform even faster than C.

No they don't. If you process an array of numbers in contiguous order in C (or
anything else compiled with LLVM or GCC or similar), it gets vectorized too.
APL may push the programmer down the right data-layout path for vectorization
more often, though, which would certainly be a good quality on modern
hardware.
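For instance, a plain contiguous loop like the following (a made-up function, just to illustrate the point) is auto-vectorized by gcc and clang at -O2/-O3 with no intrinsics:

```c
#include <stddef.h>

/* A contiguous, aliasing-free loop over float arrays: with -O2/-O3,
   gcc and clang emit SSE/AVX for this automatically. */
void scale_add(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * 2.0f + b[i];
}
```

Checking the assembly (e.g. with `gcc -O3 -S` or on Compiler Explorer) shows packed `mulps`/`addps`-style instructions rather than scalar ones.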

~~~
colejohnson66
> Java isn't interpreted

Correct. Not sure why you’re being downvoted.

Java and C# are JIT’d, but only hotspots are aggressively optimized (at least
in C#; not sure about Java). So in short runs, prevectorized C code will run
faster, but once the optimizer kicks in (with long runs), they’ll be neck and
neck.

Side note: C# also supports AOT compilation with optimization. Not sure about
Java

~~~
daxfohl
C compilers frequently cannot vectorize without explicit hints. This is
because, with pointer arithmetic, what you're writing to could lie within the
array you're reading from. So the order of operations could matter, and
vectorization could break it.
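A minimal C sketch of the problem (function names are made up for illustration): without a no-aliasing promise the compiler must assume `dst` might overlap `src`, so it either keeps the scalar order or emits a runtime overlap check. The `restrict` qualifier is the explicit hint that lets it vectorize unconditionally:

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst could overlap src
   (e.g. dst == src + 1), in which case each store would change a
   later load and the scalar order matters. restrict promises the
   ranges don't overlap, so the loop can be vectorized outright. */
void add_one(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1.0f;
}
```

Calling it with overlapping pointers would be undefined behavior; the qualifier shifts that responsibility to the caller in exchange for the optimization.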

~~~
colejohnson66
I guess, cause in C#, you can’t have pointers to the middle of an array (when
not using `unsafe` code). So I guess one would just check if the arrays’
pointers were equal, and if they aren’t, the arrays don’t overlap.

Side note: C# 8’s ranges and indices return an IEnumerable, so I wonder how
those vectorize. Can IEnumerable’s even be vectorized?

~~~
daxfohl
The only vectorization I'm aware of in C# is the few classes in
System.Numerics that are provided specifically for this optimization. In
theory, the C# compiler and/or JIT _could_ automatically issue SIMD ops for
some array operations, if it could determine that order of operations doesn't
matter (as could just about any other language), but I think as a general rule
this is too expensive at compile time or runtime to be worth it, and most
languages force you to make it explicit. Except a couple languages like
FORTRAN and apparently APL. (Though I'd love to be shown wrong in the C#
case).

------
tom_mellior
> As the table reveals, every time this function performs a multiplication,
> only 8 out of 82 (9+30+28+8+7=82) instructions are doing the “real”
> multiplication. That's only about 10% of the total instructions. The other
> 90% are considered interpretation overhead. Once we vectorized this
> function, its performance was improved by nearly nine times. See PR #12543.

This is a misleading way to present this data. If I understand this correctly,
most of the 90% "interpretation overhead" are time spent evaluating the
operands to the multiplication, and this is _also_ vectorized. So it's not
just that vectorizing the 10% can give you a 9x speedup overall, although in
my opinion the text tries to suggest this.

In any case, there must be even more going on here. The data being processed
here seem to be Float64. On an AVX2 processor like most of us have, you can
only fit 4 64-bit floats into a vector register. This means that, even if your
entire computation vectorizes very nicely, you should only expect a 4x maximum
speedup. Even if they have an AVX-512 server (they don't say) with twice the
vector width, 8x would be the expected limit. In practice it would be
considerably less, because the processor reduces its frequency to avoid
overheating on AVX-512-heavy computations. I'm not aware of hardware that uses
even wider vectors.

So an end-to-end 9x improvement for the entire function here seems impossible
to achieve using vectorization alone. I question both the measurement and the
suggestion that vectorization is the _only_ thing that changes here. Maybe
they accidentally (? they don't seem to understand in detail what's going on)
stumbled upon a much more cache friendly version of the computation they were
trying to do, or maybe previously they caused the GC to interfere, or...
something. But 9x due to vectorization of a Float64 computation? I'm not
buying it.

~~~
mping
I believe that in db context, vectorization means batch processing column
wise. Quoting the CockroachDB blog post:

> Using vectorized processing in an execution engine makes more efficient use
> of modern CPUs by changing the data orientation (from rows to columns) to
> get more out of the CPU cache and deep instruction pipelines by operating on
> batches of data at a time.

Here is the link: [https://www.cockroachlabs.com/blog/how-we-built-a-vectorized-sql-engine/](https://www.cockroachlabs.com/blog/how-we-built-a-vectorized-sql-engine/)
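A tiny sketch of what that change of orientation means at the memory level (the table layout here is invented for illustration): summing one column of a row-oriented layout strides through memory and drags whole rows into cache, while the column-oriented version reads one contiguous array that caches and vectorizes well.

```c
#include <stddef.h>

/* Row-oriented: one struct per row. Summing just the price column
   touches every row's full cache line (id and qty come along for
   the ride), and the strided access defeats vectorization. */
struct row { long id; double price; double qty; };

double sum_price_rows(const struct row *rows, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += rows[i].price;
    return s;
}

/* Column-oriented: the price column is one contiguous array, so a
   batch-at-a-time operator reads full cache lines of useful data
   and the loop is trivially vectorizable. */
double sum_price_column(const double *price, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += price[i];
    return s;
}
```

Both compute the same aggregate; the columnar one simply gives the CPU cache and the vector units much more to work with per memory access.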

~~~
senderista
That’s right, I believe it was MonetDB that introduced this abuse of
terminology into the DB lit and it's stuck. I remember being very confused as
well, looking for any mention of SIMD instructions in the MonetDB paper and
finding none :) ("Pipelining" is another term that the DB community uses
differently from everyone else--nobody else would refer to single-threaded
execution as "pipelined".)

The other typical optimization for expression evaluation is code generation,
which usually targets LLVM bitcode or Java bytecode. That's pretty standard
for column-oriented databases now and they're not exactly state of the art if
they don't implement some equivalent of the above.

------
qeternity
I’ve seen a fair bit about TiDB but not much from actual users. Can someone
who uses this in production explain why and what alternatives they evaluated
(we use Citus so would be curious to hear).

~~~
jinqueeny
You can check out the case studies from our users here:
[https://pingcap.com/success-stories/](https://pingcap.com/success-stories/)

------
baybal2
Well, what to say. You are just throwing potential performance away if you
don't use SSE when you can.

There is an idea that "the renaming and reordering engine can make non-SSE
code just as fast without extra hassle." At least on x86 that can't be true,
as you physically can't access all execution ports with non-vector
instructions.

~~~
pdpi
Vector instructions come at a cost, though: there are implications for the
power budget and for thermals, so it's very much a "just because you can
doesn't mean you should" piece of tech. Specifically, sustained use of AVX
instructions on 256- and 512-bit registers will cost you a chunk of clock
frequency across the whole core, so you have to clear the hurdle of having
enough AVX instructions that the downclocking is worth it overall.

~~~
dr_zoidberg
AVX(1) and AVX2 don't cause throttling, just AVX-512. Even so, on code that
can make the best of them[0], they still result in better performance.
Eventually they will be implemented in a way that doesn't require throttling,
maybe when AMD finally supports them or when Intel's famed 7nm[1] process
arrives.

[0] numerical code, usually, but go ahead and use them if you can express your
problem in a way that makes sense in AVX-512

[1] 7 as in "seven", since 10nm is so troublesome that apparently no
desktop/server chips will be made on it and Intel just skips to 7, which
should be comparable to TSMC's 5nm

~~~
gameswithgo
>AVX(1) and AVX2 don't cause throttling,

yes they do, on Intel CPUs anyway. On Ryzen, anything that makes more heat
will cause throttling, rather than a step change in mode.

> Eventually they will be implemented in a way that doesn't require throttling

Probably not; there are fundamental physical limitations. I mean, you can put
a good cooler on the CPU and disable the throttling, but then you could raise
the non-SIMD clock rate too.

------
notbigdata
i'm quite surprised how many of the HN comments are focused on nitpicking the
frontend implementation rather than the content itself.

it's not very likely that the author who works on vectorized execution also
implemented the blog system.

------
userbinator
You can also get a 10x accessibility increase for viewing your site, by not
doing shit like this:

    
    
    <div class="center-element" id="page-loader">
    <svg id="hexagon" viewbox="0 0 129.78 150.37"
    ...
    </div>
    <div id="page-content" style="display:none">
    (actual page content here)
    

This is another one of those pages that shouldn't require JS, but
_deliberately hides the content_ and then uses JS to un-hide it. _WTF!?_ I
know this is a little off-topic, but I found it ironic that a post about
optimising performance would be presented so outrageously inefficiently and
inaccessibly. (I just turned off the CSS to read it.)

~~~
iamxy
An SRE at PingCAP here. Got it. Thanks! We found that using JS to toggle page
visibility has some usability issues. We just removed the JS loading animation
and will deliver content with good compatibility and progressive enhancement.
Sorry for the issues with the display!

~~~
withdavidli
Nice to see a quick response.

Page is loading on mobile for me, using the Brave browser.

