
SIMD instructions - z0mbie42
https://opensourceweekly.org/issues/7
======
dang
This is a list of articles—probably a good one, but HN is itself a list of
articles, so this is too much indirection.

Lists don't make good HN submissions, because the only thing to discuss about
them is the lowest common denominator of the items on the list [1], leading to
generic discussion, which isn't as interesting as specific discussion [2].

It's better to pick the most interesting item from the list and submit that.
You can always do it more than once, if there is more than one interesting
item—but it's best to wait a while between such submissions, to let the
hivemind caches clear.

[1]
[https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...](https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=by%3Adang%20denominator%20list&sort=byDate&type=comment)

[2]
[https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...](https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=by%3Adang%20generic%20discussion&sort=byDate&type=comment)

------
Twinklebear
SIMD is used a ton in rendering applications and is starting to see more use
in games too (through ISPC, for example).

I'd add to the list:

\- Embree: [https://www.embree.org/](https://www.embree.org/) Open source
high-performance ray tracing kernels for CPUs using SIMD.

\- OpenVKL: [https://www.openvkl.org/](https://www.openvkl.org/) Similar to
Embree (high-performance ray tracing kernels), but for volume traversal and
sampling.

\- ISPC: [https://ispc.github.io/](https://ispc.github.io/) An open source
compiler for an SPMD language that generates efficient SIMD code.

\- OSPRay: [http://www.ospray.org/](http://www.ospray.org/) A large project
using SIMD throughout (via ISPC) for real time ray tracing for scientific
visualization and physically based rendering.

\- Open Image Denoise:
[https://openimagedenoise.github.io/](https://openimagedenoise.github.io/) An
open-source image denoiser using SIMD (via ISPC) for some image processing and
denoising.

\- (my own project) ChameleonRT:
[https://github.com/Twinklebear/ChameleonRT](https://github.com/Twinklebear/ChameleonRT)
has an Embree + ISPC backend, using Embree for SIMD ray traversal and ISPC for
vectorizing the rest of the path tracer (shading, texture sampling).

~~~
z0mbie42
Hi, thank you for the pointers!

I try not to include C or C++ projects other than for educational purposes
(like the Mandelbrot set), because one of my life's goals is to help the world
transition to a C- and C++-free world (other than for kernels...).

I believe that my role is to promote projects which are "building the new
world", and thus we need to abandon and port away from all forms of insecure
code.

~~~
sk0g
So in an article about high/extreme performance systems, you're ignoring the
vast majority of them because you don't agree with the tool used to achieve
said performance? What..?

~~~
pjmlp
I guess because using other programming languages proves the point that there
are other approaches, instead of reinforcing the status quo.

~~~
z0mbie42
Exactly this

------
burntsushi
ripgrep does, and it's a big reason why it edges out GNU grep in a lot of
common cases, especially for case insensitive searches. The most significant
use of SIMD is the Teddy algorithm, which I copied from the Hyperscan project.
I wrote up how it works here: [https://github.com/BurntSushi/aho-
corasick/blob/66f581583b69...](https://github.com/BurntSushi/aho-
corasick/blob/66f581583b6921ad1e5731d0dd4f192436af0e36/src/packed/teddy/README.md)

~~~
z0mbie42
I wasn't aware! Added it.

Thank you very much for all your work. Your CLI tools are really making a
positive impact on the world of development.

~~~
mastax
The intended connotation of ripgrep wasn't "RIP Grep" but that it rips through
searches, i.e. it is fast. I can't find the comment where he said this but
burntsushi can confirm.

~~~
burntsushi
Right, yes. It's a common enough mistake that it has its own FAQ entry now.
:-)

[https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#int...](https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#intentcountsforsomething)

------
reikonomusha
A Common Lisp project that uses SIMD (specifically AVX2) is the Quantum
Virtual Machine [1]. It’s a quantum computer simulator. Here [2] is part of
the source that has the SIMD instructions.

It’s cool that, using SBCL, an implementation of Common Lisp, you can write
compartmentalized assembly very easily in an otherwise extremely high-level
language.

[1] [https://github.com/rigetti/qvm](https://github.com/rigetti/qvm)

[2] [https://github.com/rigetti/qvm/blob/master/src/impl/sbcl-
avx...](https://github.com/rigetti/qvm/blob/master/src/impl/sbcl-avx-
vops.lisp#L53)

------
corysama
The megahertz-scaling "Free Lunch" was declared dead 15 years ago
[[http://www.gotw.ca/publications/concurrency-
ddj.htm](http://www.gotw.ca/publications/concurrency-ddj.htm)] and it's only
been getting deader. People are finally, grudgingly accepting that they must
go parallel unless we want to see software performance stagnate permanently.
For most people here, the issue has been obvious since before they learned to
program. But still, they are putting off learning how to deal with it. The
first, obvious answer is threading. But, in my experience, SIMD is a bigger
bang for the buck, for two reasons: 1) No synchronization problems. 2) Better
cache utilization. It's not just that SIMD forces you to work in large,
contiguous blocks. Fun fact: when you aren't using SIMD, you are only using a
fraction of your L1 cache bandwidth!

A big challenge is that SIMD intrinsic-function APIs are weird. They have
inscrutable function names and sometimes difficult semantics. What helped me
greatly was going through the effort of writing #define wrappers for myself
that just gave each function in SSE1-3 names that made sense to me. I don't
expect many people to put in that effort. And, unfortunately, I don't have go-
to recommendations for pre-existing libraries. Best I can do is:

[https://github.com/VcDevel/Vc](https://github.com/VcDevel/Vc) is working on
being standardized into C++. It's great for processing medium-to-large arrays.

[https://ispc.github.io/](https://ispc.github.io/) is great for writing large,
complicated SIMD features.

[https://github.com/microsoft/DirectXMath](https://github.com/microsoft/DirectXMath)
is not actually tied to DirectX. It has a huge library of small-vector linear
algebra (3D graphics math) functions. It used to be pretty tied to MS's
compiler, but I believe they've been cleaning it up to be cross-compiler
lately.

~~~
CyberDildonics
Can you say more about non-SIMD instructions not making full use of the L1
bandwidth? Is it just that even keeping all the integer units busy still
doesn't equate to using all the bandwidth? I suppose that makes sense when
adding up the numbers for clock cycles and bytes. I'm guessing this isn't
commonly pointed out because being limited by L1 cache bandwidth is so
unlikely to be a program's main bottleneck.

~~~
corysama
Intel's scalar pipelining does do an amazing job of keeping pipes busy. And,
well-pipelined code can approximate SIMD performance. But, in practice to
solidly get that kind of pipelining you need to pretty much write your scalar
code as if you were emulating SIMD.

But, the point is that a 4-byte load instruction leaves 12 bytes of bandwidth
on the table on many architectures, even with a perfect L1 cache hit.

I point it out because I usually get rebuttals that everything is memory bound
(true) and that using the cache well is more important (true, but it turns
out...).

------
TazeTSchnitzel
> SIMD […] is a good alternative to multithreading

They are not alternatives to each other; they are orthogonal, unless you're
using a GPU.

~~~
z0mbie42
They can be combined, but as explained in one of the articles (by Cloudflare,
"On the dangers of Intel's frequency scaling"), SIMD in a multithreaded
environment can cause performance problems due to CPU throttling.

So generally, SIMD is used for single-threaded algorithms.

~~~
gameswithgo
I'm not sure it is fair to say "generally". Sometimes you maybe don't want to
multithread it. When I've used it, multithreading was still useful by huge
margins, despite the downclocks.

And on AMD CPUs, the downclock issue is nowhere near as bad.

------
poorman
Surprised Apache Arrow isn't on this list.

[https://arrow.apache.org/](https://arrow.apache.org/) > Apache Arrow™ enables
execution engines to take advantage of the latest SIMD (Single instruction,
multiple data) operations included in modern processors, for native vectorized
optimization of analytical data processing.

------
singhrac
Pretty much every neural network framework is aggressively SIMD-optimized
(after all, that's kind of the point, besides autodiff), so I'm not sure why
Tencent's framework was picked...

If you know of any, I'd like to hear about more fast SIMD-based CLI tools that
can replace my existing workflow (e.g. burntsushi's ripgrep or xsv).

~~~
z0mbie42
Thank you for the feedback.

I picked Chinese technology because it is rarely promoted but really great.

Regarding the CLI tools it's a great question and I have opened a ticket for a
future issue:
[https://gitlab.com/bloom42/open_source_weekly/-/issues/14](https://gitlab.com/bloom42/open_source_weekly/-/issues/14)

------
nickysielicki
Plus many, many more, if they use the right compiler flags and aligned types.

I wrote this a couple days ago: [https://sielicki.github.io/posts/playing-
around-with-autovec...](https://sielicki.github.io/posts/playing-around-with-
autovectorization/)

~~~
mynegation
Exactly this! I am glad this list exists, but an even more interesting
question is why a list like this needs to exist at all. Ideally, it would be
up to the compiler to use the target architecture to its maximum potential.

Every item on this list is a library, compiler optimization, or an idiomatic
abstraction waiting to happen.

~~~
saagarjha
Most compilers will by default generate portable code.

~~~
sharpneli
SSE2 is mandatory on x86-64, and ARM64 has mandatory NEON instructions.

So even portable code has SIMD instructions available.

~~~
saagarjha
…which compilers will emit when possible?

------
gameswithgo
I have a couple of videos introducing intel SIMD intrinsics and how to use
them well.

For Rust/C/C++:
[https://www.youtube.com/watch?v=4Gs_CA_vm3o](https://www.youtube.com/watch?v=4Gs_CA_vm3o)

For C#:
[https://www.youtube.com/watch?v=8RcjQPbvvRU](https://www.youtube.com/watch?v=8RcjQPbvvRU)

------
dmos62
To someone just hearing about SIMD, anyone care to give an experience infused
introduction? Is it worth the hassle only in rare cases?

~~~
nickysielicki
> Is it worth the hassle only in rare cases?

Vectorization is not always faster. It's important to understand that modern
out-of-order processors can have more than 100 instructions in flight at a
given time, and not all instructions take equal amounts of time. So reducing a
dozen instructions to a single instruction doesn't necessarily mean that the
single instruction is going to be faster.

~~~
vardump
> So reducing a dozen instructions to a single instruction doesn't necessarily
> mean that the single instruction is going to be faster.

Are you saying a dozen scalar instructions can be faster than one vector
instruction? That's wrong 100% of the time on modern CPUs.

~~~
jcranmer
One corner case that exists is that using AVX instructions imposes a frequency
limit, although this isn't the case for SSE instructions.

There exist some vector instructions that are going to be slower than a non-
equivalent sequence of scalar instructions: VPGATHER is going to be an easy
such case.

However, I doubt there are going to be any cases where a vector instruction
will take more clock cycles than its equivalent scalarized instructions.
There are some where it might be equivalent--a vector of 2 elements performing
an operation that can be issued twice a cycle is an easy example--but I can't
think of any where it would be worse. If that were the case, then you should
just implement the operation in hardware by scalarizing the uops (and some
instructions appear to be so implemented--e.g., gather/scatter).

~~~
vardump
VPGATHER is actually significantly faster than scalar loads at least on
Skylake, possibly on some earlier CPUs as well (Broadwell?).

On Haswell where it was first introduced... yeah, not very fast, like you
mentioned.

------
haolez
I found this curious regarding QuestDB[1]:

> Java 8 64-bit. We recommend Oracle Java 8, but OpenJDK8 will also work
> (although a little slower).

Anyone have an idea why?

[1] [https://github.com/questdb/questdb](https://github.com/questdb/questdb)

~~~
bluestreak
This is because OracleJDK has more intrinsics than OpenJDK.

------
veselin
I was wondering: generally, is SIMD a good idea for general-purpose CPUs?
Imagine if current high-end CPUs had double the number of cores, no SIMD, but
possibly higher frequency, and the algorithms that benefit from SIMD were all
run on integrated accelerators instead.

At least as an outside observer, it looks like a huge number of very large
registers take up a large portion of a core, surely consuming a lot of power
as well, just to sit idle while the core is running JavaScript. Can somebody
with CPU architecture experience say what the real tradeoff here is?

~~~
zamadatix
Adding SIMD takes less space than adding cores and the use case where you need
double the cores on a many core chip but aren't doing the same thing many
times is pretty rare.

SIMD units don't need to consume power or limit the frequency of the rest of
the chip while not being used, the same as when JavaScript is running on one
boosted core and the other 63 are in powersave. While being used, SIMD units
are more efficient than running 2x or 4x entire cores just to get the
additional operations per clock.

------
jzelinskie
Reminder that nothing is a panacea: I've heard from game engine authors and
cryptographers that on Intel chips _over-using_ SIMD can actually heat up the
chips too much such that it'll cause the system to then adjust the clockrate
lower to cool down and you can degrade performance beyond not using SIMD at
all. Before hearing that, I had never considered thermal properties of
particular instructions.

~~~
corysama
It's not a problem for SSE and AVX1. But, with AVX2/AVX-512, the deal is that
you should not just dip your toe with an occasional call to a small SIMD task
using such heavy-hitting features. Either do enough SIMD work to overcome the
down-clock, or use a lower-end SIMD functionality for smaller tasks.

And, even within AVX2/512 there are huge sets of added functionality that are
really "AVX1-enhanced" without going wider. Those are fine to use without
worrying about downclocking.

~~~
calaphos
The AVX ALUs also go into power saving when not used and take a couple of
cycles to switch back on, delaying the first AVX instruction. There is, afaik,
even a paper on a side-channel attack that exploits this.

------
GordonS
As a point of interest, you can even use SIMD (and other hardware intrinsics)
in dotnet core, since 3.0, e.g.
[https://medium.com/@alexyakunin/geting-4x-speedup-with-
net-c...](https://medium.com/@alexyakunin/geting-4x-speedup-with-net-
core-3-0-simd-intrinsics-5c9c31c47991)

------
tarr11
Apache Lucene has recently started using SIMD to decode postings lists (Java)

[https://issues.apache.org/jira/browse/LUCENE-9027](https://issues.apache.org/jira/browse/LUCENE-9027)

------
FZ1
Adding the obvious: numpy vectorization - I presume that counts as an 'open
source project'?

Or maybe this is limited to little personal projects, and not major
libraries?

------
truth_seeker
The JVM generates SIMD to a certain extent; I wish other runtimes like V8
(Browser/NodeJS), Go, BEAM (Elixir/Erlang) etc. did the same.

~~~
gameswithgo
The JVM does it to an extremely limited extent. Anything jitted doesn't have a
lot of time to do autovectorization. Even GCC/LLVM are pretty limited at this,
as it is just a hard problem, and doing it with floating point is problematic
as it usually changes the result.

~~~
pjmlp
To a limited extent that also takes advantage of AVX thanks to Intel
contributions.

All modern JVMs take advantage of code caches and possibly PGO (depending on
the configuration) between runs.

ART also does the same for NEON.

------
andrea_s
Yandex ClickHouse also should be on the list!

------
vmchale
gcc can do SIMD on its own, quite surprisingly (to me):

[https://github.com/vmchale/ats-
codecount/blob/master/DATS/wc...](https://github.com/vmchale/ats-
codecount/blob/master/DATS/wc.dats)

------
capableweb
Looks like a blog post, so shouldn't really be a Show HN. Take a look at the
guidelines:
[https://news.ycombinator.com/showhn.html](https://news.ycombinator.com/showhn.html)

> This week has been particularly bad regarding security

Seems like a weird way to phrase it, when talking about security fixes. If it
was about security vulnerabilities, I would understand you say it's bad, but
in this case it's about fixing vulnerabilities, that's good right?

~~~
z0mbie42
Oops, Fixed!

I prefixed the title with "Show HN" because I started this newsletter 7 weeks
ago and wanted to 'show it' to HN.

You are right regarding the phrasing (I'm not a native English speaker, so any
feedback is welcome). Fixed :)

------
The_rationalist
The easiest way to benefit from SIMD is to use the OpenMP SIMD directive on
for loops.

