
Yeppp: A High-Performance SIMD-Optimized Math Library for X86, ARM, and MIPS - desdiv
http://www.yeppp.info
======
sonium
I would be interested in seeing some real-world figures for HPC applications
(e.g. VASP). I think most people use Intel MKL these days. A 30% performance
increase over MKL as in the first figure would be huge.

------
Marat_Dukhan
Author here. AMA.

Note to mods: it would be fair to add (2013) to the link, as the site hasn't
been updated since the Yeppp 1.0.0 release

~~~
nkurz
Hi Marat --

Looks like a great approach!

Do you have thoughts on how this could be extended to work with Xeon Phi and
graphics coprocessors? The particular grail I'm searching for would do what
you have done, but also allow some operations to be offloaded.

Also, I see you have some preliminary R bindings:
[https://bitbucket.org/MDukhan/yeppp/src/7830144789416f9fbed3...](https://bitbucket.org/MDukhan/yeppp/src/7830144789416f9fbed3998a4c711147533cb546/bindings/R/?at=default)

Are these expected to work?

~~~
Marat_Dukhan
Yeppp! supports Xeon Phi, and includes binaries for it, albeit without hand-
tuned kernels. Support for GPUs is unlikely: Yeppp! is a single-threaded
library by design, and GPU programming models are intrinsically multi-
threaded.

The R bindings work and do provide a speedup, but they are nowhere near
release quality. The officially supported bindings (.Net, JVM, and FORTRAN)
are auto-generated. Additionally, the Julia team maintains Yeppp.jl, bindings
for Julia
([https://github.com/JuliaLang/Yeppp.jl](https://github.com/JuliaLang/Yeppp.jl))

------
kale
I looked over the types defined. I'm guessing it uses built-in types for
scalar floats, doubles, and integers? It looks like the only defined types are
for complex floats and 128-bit integers.

EDIT: I looked at the functions listed for multiplication. Although it has a
64-bit x 64-bit multiply, the result is not in the internal 128-bit format.
Also, there was a warning that the 64b x 64b multiply was not yet optimized.

Still very cool though.

~~~
Marat_Dukhan
Yes, on most platforms types map to intX_t/uintX_t.

------
yellowapple
A comparison with (and/or bindings for) Julia would be really interesting,
seeing as it makes similar claims about high-level performant programming. I
think Julia was doing some work on SIMD-exploiting compiler optimizations, but
I don't know off the top of my head whether or not that ended up coming
through.

~~~
Marat_Dukhan
There are bindings for Julia -
[https://github.com/JuliaLang/Yeppp.jl](https://github.com/JuliaLang/Yeppp.jl)

------
ris
Far more interesting to me is the PeachPy library that underlies this,
generating all the assembly in python:
[https://github.com/Maratyszcza/PeachPy](https://github.com/Maratyszcza/PeachPy)

------
kungfooman
What's the use case? I can't even see classes for Vector3, Quaternion, etc.

~~~
yoklov
This library doesn't seem to be focused on matrix math for computer graphics.

FWIW, you don't tend to see a significant gain from writing those individual
types as SIMD. The gain comes from operations over large arrays of numbers,
where each SIMD vector holds a single component across many elements (e.g. you
have vectors of [x0, x1, x2, x3] and [y0, y1, y2, y3] instead of [x0, y0, z0,
w0], etc.).
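
To make the layout difference concrete, here is a minimal C-style sketch (the
names point_aos/points_soa/add_soa are made up for illustration):

    // AoS: one struct per point. Convenient, but a 4-wide SIMD load
    // grabs four *different* components (x, y, z, w) of one point.
    struct point_aos { float x, y, z, w; };
    
    // SoA: one array per component. Four consecutive floats are four
    // x's, so one SIMD load fills a register with [x0, x1, x2, x3].
    struct points_soa { float *x, *y, *z, *w; };
    
    // Scalar SoA addition: a vectorizing compiler (or intrinsics) can
    // process 4, 8, or 16 iterations of this loop per instruction.
    void add_soa(points_soa* r, const points_soa* a,
                 const points_soa* b, int n) {
        for (int i = 0; i < n; i++) {
            r->x[i] = a->x[i] + b->x[i];
            r->y[i] = a->y[i] + b->y[i];
            r->z[i] = a->z[i] + b->z[i];
            r->w[i] = a->w[i] + b->w[i];
        }
    }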

Here's a set of slides that elaborate further on what I mean and explain why:
[https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afr...](https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)

------
ant6n
Why aren't there types like int32x4, float64x2, and int8x8 built into
languages like C/C++ that just compile to SSE or NEON?

Writing asm or using intrinsics seems like such a pain, and it's non-portable.

~~~
Marat_Dukhan
Because there is more to performance optimization than just using SIMD types.
Well-optimized assembly code tries to hide instruction latency and balance the
load across different execution units. Both instruction latencies and the set
of execution units depend on the processor microarchitecture, so you can't do
this in a portable way.
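
For example, a dot product is usually split across several independent
accumulators so that each add doesn't have to wait on the previous one's
latency. A rough SSE sketch of the idea (not Yeppp!'s actual code; tail
elements are ignored for brevity):

    #include <xmmintrin.h>
    
    float dot(const float* a, const float* b, int n) {
        // Two independent accumulators: while one addps is still in
        // flight, the other can issue, hiding instruction latency.
        // The optimal number of accumulators depends on the
        // microarchitecture, which is why this isn't portable.
        __m128 acc0 = _mm_setzero_ps();
        __m128 acc1 = _mm_setzero_ps();
        for (int i = 0; i + 8 <= n; i += 8) {
            acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i),
                                               _mm_loadu_ps(b + i)));
            acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),
                                               _mm_loadu_ps(b + i + 4)));
        }
        float t[4];
        _mm_storeu_ps(t, _mm_add_ps(acc0, acc1));
        return t[0] + t[1] + t[2] + t[3];
    }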

~~~
phkahler
>> Because there is more to performance optimization than just using SIMD
types.

But that doesn't explain the lack of language support for 3- and 4-element
vectors (or even 2D). I think GCC's vector extensions are nice: they allow you
to pass vectors by value and are cross-platform across x86, ARM NEON, and
AltiVec.
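
A minimal sketch of what that looks like (the GCC/Clang vector_size attribute;
not part of standard C/C++):

    // A generic 16-byte vector of 4 floats: compiles to SSE on x86,
    // NEON on ARM, and VMX on PowerPC, with no intrinsics in sight.
    typedef float v4sf __attribute__((vector_size(16)));
    
    v4sf scale_and_add(v4sf a, v4sf b, float s) {
        v4sf vs = {s, s, s, s};  // broadcast the scalar
        return a + b * vs;       // element-wise, standard operators
    }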

So many people think SIMD is somehow about parallel computation, but to me the
obvious use is small vector math with basic vector types and standard
operators.

~~~
theresistor
> So many people think SIMD is somehow about parallel computation, but to me
> the obvious use is small vector math with basic vector types and standard
> operators.

The fact that you think that indicates that you have no experience with SIMD
programming or instruction sets. Using SIMD for small vector math is at best a
wash, and typically a poor idea for performance. You only get a significant
performance benefit out of SIMD instructions when you can vectorize an entire
inner loop (moving between scalar and vector ALUs is fairly expensive on some
microarchitectures), and when data layout and movement are arranged to
maximize throughput (think SoA instead of AoS). Using SIMD for short vector
math generally moves in the opposite direction of those goals, resulting in a
neutral-to-negative performance delta compared to scalar code.

~~~
ant6n
I'm not sure what you are trying to say -- are you saying it's faster to use
x87 instructions rather than SSE for, say, vector3d operations?

~~~
yoklov
Not the person you were replying to, but no.

Basically, instead of trying to squeeze a bit of performance out of one [x, y,
z, w], you group your data so that you process many of them at once: you load
them as [x0, x1, x2, x3], [y0, y1, y2, y3], [z0, z1, z2, z3], [w0, w1, w2,
w3], and then do 4 (or 8, or 16) steps of the loop at a time.
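
Something like this, say, to compute vector lengths (a minimal SSE sketch,
assuming the arrays are already in SoA layout and n is a multiple of 4):

    #include <xmmintrin.h>
    
    void lengths(float* out, const float* x, const float* y,
                 const float* z, int n) {
        for (int i = 0; i < n; i += 4) {      // 4 lengths per iteration
            __m128 vx = _mm_loadu_ps(x + i);  // [x0, x1, x2, x3]
            __m128 vy = _mm_loadu_ps(y + i);  // [y0, y1, y2, y3]
            __m128 vz = _mm_loadu_ps(z + i);  // [z0, z1, z2, z3]
            __m128 s  = _mm_add_ps(_mm_mul_ps(vx, vx),
                        _mm_add_ps(_mm_mul_ps(vy, vy),
                                   _mm_mul_ps(vz, vz)));
            _mm_storeu_ps(out + i, _mm_sqrt_ps(s));
        }
    }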

This ends up giving roughly a 4x (or 8x, or 16x) speedup -- more if your data
wasn't already in that format, since the layout is also inherently cache-
friendly.

I linked this elsewhere in these comments, but here's a good overview of SIMD
techniques (as well as what not to do, which is basically what you had
suggested)
[https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afr...](https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)

~~~
ant6n
I still don't get it. Doing exactly what you describe would be much easier if
we had [x, y, z, w] types built into the language. I just don't get the
downside of having packed types built in -- so far it has been said that it
would be faster to do some operations in scalar. If there is actually some way
to express certain operations faster in scalar than with SIMD operations, then
the compiler could probably recognize that as well and express my float32x4
operations in scalar. Like the dot product example from the linked slides.

I see this line of code, and I want to be able to write the comment as code,
rather than the code itself:

    __m128 dw = _mm_mul_ps(aw, bw); // dw = aw * bw

~~~
tbirdz
You could use Agner Fog's vector class, but keep in mind that just using a
vector class isn't always the best idea:
[http://www.agner.org/optimize/#vectorclass](http://www.agner.org/optimize/#vectorclass)
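
With that vector class, the _mm_mul_ps example upthread becomes plain operator
syntax. A sketch based on the VCL's documented Vec4f type (untested here):

    #include "vectorclass.h"  // Agner Fog's VCL (C++)
    
    void mul4(float* d, const float* a, const float* b) {
        Vec4f aw, bw;
        aw.load(a);           // load 4 floats from a
        bw.load(b);
        Vec4f dw = aw * bw;   // dw = aw * bw -- the comment is the code
        dw.store(d);          // store the 4 products
    }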

