
.NET JIT and SIMD are getting married - ebrenes
http://blogs.msdn.com/b/dotnet/archive/2014/04/07/the-jit-finally-proposed-jit-and-simd-are-getting-married.aspx
======
kvb
This is great to see. While Mono already had a method for using SIMD
intrinsics, this tweet[1] indicates that this approach is better. Can anyone
elaborate?

[1]
[https://twitter.com/migueldeicaza/status/452099923157065728](https://twitter.com/migueldeicaza/status/452099923157065728)

~~~
migueldeicaza
Our implementation is too tightly coupled to the Intel SIMD approach, while
this approach works for SIMD implementations on other CPU architectures.

Mono.SIMD has a few extra features that are missing from the current design,
but they are not that important.

Microsoft has found a more pleasant and easier-to-use API than we did. We have
shared our feedback with them, and hopefully it will keep improving.

~~~
runfaster2000
We shared our design with Miguel for feedback ahead of the release. He gave us
a boatload of very high quality and informed feedback. Thanks! We haven't been
able to act on most of it yet, since it relates more to what to do next than
what we released just now.

We're hoping to see this NuGet package supported on top of Mono, too. As we've
done previously, we'll share our tests with Xamarin to ensure consistent
behavior between implementations.

------
sergiosgc
The article starts with a falsehood. Moore's law is not finished. Much like
Mark Twain's death, reports of its demise are greatly exaggerated. Transistor
count is still, surprisingly, following Moore's curve, and single-threaded
performance, while it has taken a hit, is still growing at a healthy 25% per
year.

Oh well, opening lines are what opening lines need to be. I'm just pedantic
about standing on solid ground.

~~~
chinpokomon
The article was also careful to frame it in terms of clock speed. I remember
fondly the ramp-up from 4.77 MHz to our current clock rates, and we have really
plateaued around 3.4 GHz at the top end. [1] There have been challenges in
reaching higher speeds, and I think the highest clock rate achieved is an
impressive 8+ GHz, but that isn't off-the-shelf stock you just pick up at
Newegg. That blog post has some interesting and relevant data; it is a few
years old now, but still applies.

[1] [http://csgillespie.wordpress.com/2011/01/25/cpu-and-gpu-trends-over-time/](http://csgillespie.wordpress.com/2011/01/25/cpu-and-gpu-trends-over-time/)

~~~
jmnicolas
To my knowledge the fastest (in GHz) commercial processor is a POWER7 at 5.5
GHz.

I wonder how it compares to an x86, but benchmarks are hard to come by.

------
ihnorton
Lots of interesting news in the .NET world lately. (Kind of hard to believe it
took this long to get SIMD though!)

SIMD can be a huge boost for numerical workloads; for example, SIMD auto-
vectorization was just added to Julia, and people saw 8x boosts on some
workloads (and in some cases another 4x from inlining improvements merged the
next day).

~~~
protomyth
> Kind of hard to believe it took this long to get SIMD though!

Does the JVM have SIMD?

~~~
rbanffy
If what's needed is speed, neither Java nor C# are optimal choices. As for
SSE/AVX from Java,
[http://stackoverflow.com/a/10809123/158026](http://stackoverflow.com/a/10809123/158026)
shows a nice example of how it can be done.

OTOH, I find this specific C# implementation neat and portable. I like it a
lot.

~~~
bunderbunder
_If what's needed is speed, neither Java nor C# are optimal choices._

True, but if you're just trying to get some numerical speed out of a few
specific routines in a larger .NET application, then having access to SSE like
this could still be very helpful. Calling into native libraries from .NET
isn't necessarily a performant option, because of the cost of marshaling data
back and forth between the managed and unmanaged memory spaces.

The end result can be pretty significant; from my own experience I'm usually
pretty hard-pressed to come up with a C++ implementation that can beat the C#
code it intends to replace outside of a microbenchmark. If the C# code now has
the option of banging on SSE then I'm not sure it'll even be worth trying to
trot out C++.

~~~
Locke1689
You can get rid of a fair amount of marshaling overhead if you use unsafe code
(if that's acceptable in your org).

~~~
bunderbunder
You can, and I've generally had better luck with unsafe code than with C++
code. Unsafe code creates GC overhead, though, so it can also end up doing
more harm than good if you're not careful. It's another spot where I've found
that microbenchmarks can be misleading - the performance cost that pinning
incurs is insidious and hard to measure.

~~~
magic_haze
That sounds interesting, could you elaborate? (what is pinning in this
context?)

~~~
ygra
Usually in the managed world you have references that point to an object. The
objects themselves don't live in the same place forever, they may be moved by
the GC (to consolidate "holes" in memory). References reflect that movement so
you don't notice. However, when using unsafe code (which has pointers and
pointer arithmetic) you need to keep the objects in place. That's pinning and
it essentially forces the GC to work around those islands of pinned objects.

------
fragmer
Although this technology preview only works on 64-bit Windows 8.1, Microsoft's
Kevin Frei promised that this will be released for "all platforms that .NET
supports" in the future:
[http://blogs.msdn.com/b/dotnet/archive/2013/09/30/ryujit-the-next-generation-jit-compiler.aspx?PageIndex=2#comments](http://blogs.msdn.com/b/dotnet/archive/2013/09/30/ryujit-the-next-generation-jit-compiler.aspx?PageIndex=2#comments)

------
bunderbunder
As someone who was recently contemplating pushing some functionality out into
C++ code purely for the sake of SIMD, I can only describe this development as
pants-wettingly exciting.

------
_random_
If JS and Node are going to eat the enterprise, I might as well move into .NET
game programming :).

~~~
bananas
That will _never_ happen.

Then again, the NHS here in the UK just moved a ton of stuff off Oracle onto
RabbitMQ, Riak and Erlang...

~~~
kevingadd
Sorry to break it to you, but .NET has been devouring game programming for
years. Unity uses it extensively, XNA was until recently a big choice for
indie development, MonoGame is growing, etc...

~~~
npizzolato
I still don't understand why XNA was discontinued. Is there a suitable
replacement for creating games rather easily in C#?

~~~
_random_
MonoGame is meant to be an OSS XNA replacement. It's very good for indies with
solid programming skills; Fez and Bastion were implemented using XNA/MonoGame.
If one is more of a game designer/scripter, or is business-oriented, then
Unity3D is also a good choice.

------
spyder
I was curious about SIMD support in JavaScript, and it looks like Firefox
Nightly already has an API for it, and Chrome is getting it too.

[http://www.2ality.com/2013/12/simd-js.html](http://www.2ality.com/2013/12/simd-js.html)

[https://01.org/blogs/tlcounts/2014/bringing-simd-javascript](https://01.org/blogs/tlcounts/2014/bringing-simd-javascript)

------
dantiberian

      For performance reasons, we’ve defined those types as immutable value types.
    

Immutability strikes again.

~~~
GyrosOfWar
How would immutability help with performance? (I'm not trying to ask a leading
question, just curious) Value types are obviously nice for better cache
performance (less pointer chasing, more linear data) but I generally associate
immutability with better correctness but worse performance (not much worse,
but worse).

edit: On topic, this is really, really cool. I was playing around with SIMD
intrinsics in C++ a few days ago (and realized that the compiler was in most
cases generating equal or better code with the auto-vectorizer than me using
intrinsics) so I have kind of a new-found interest in the topic.

~~~
bunderbunder
Two guesses:

First, making them immutable means you don't have to worry about memory
barriers. That could be huge for data that's being shared by multiple threads.
While these types logically have multiple elements, in reality they all fit
within a single SSE register. Meaning the performance cost associated with
having to worry about mutability could easily annihilate any potential
performance boost you might get from being able to fiddle with the vector
unit.

(Guess one-and-a-half is that, since these values are meant to fit in a single
CPU register, they're really more analogous to atomic types than they are to
objects, anyway.)

Second guess is that it's more of a "pit of success" thing than a performance
thing. Mutable value types in .NET are really problematic. I've seen so many
bugs resulting from them that nowadays I consider them grounds for automatic
rejection in code reviews.

~~~
freikev
The immutable design came from the class library folks (not my team [the JIT
folks]). I believe the analogy to atomic types (integers aren't mutable) is
pretty sound. The API really was cleaner by making them immutable. If you want
to allow mutation, the API surface area really explodes, resulting in
dramatically more JIT work to make them perform well.

------
millstone
There's not much in the way of contrary opinion here, so let me offer some.
The approach of not tying you to a particular architecture is fundamentally
wrong. The right way is to expose APIs for each processor architecture.

Here's why. SIMD offers no new capabilities (1), only more speed, and not much
more at that, maybe 4x if you're lucky. It's also hard to use: it requires
unnatural data layouts, and lacks many operations (e.g. integer division).
None of this is specific to the .NET implementation: it's just the nature of
the beast.

So successfully exploiting SIMD is not easy, and requires thinking at the
level where instruction counts matter. And because the amount of parallelism
is so limited, high level languages (by which I include C!) can very easily
blow away any gains with suboptimal codegen. Just a handful of additional
instructions can ruin your performance (2).

Here's what will go wrong with an architecture-independent SIMD API:

1. Say you invoke an operation without an underlying native instruction. The
compiler is forced to implement this by transferring the data to scalar
registers, performing the operation, and then transferring the result back.
Game over: this exercise is likely to eat up any performance benefit.

2. To avoid this, say you limit the API to some "common subset" of all extant
SIMD ISAs. The problem is, many algorithms admit vectorization only through
the exotic instructions, such as POPCNT on SSE4, or the legendary vec_perm on
Altivec. If this instruction is not exposed, you can't vectorize the
algorithm. Game over again.

That's why software that takes advantage of SIMD invariably has separate
implementations for each supported ISA. .NET should have followed suit: expose
an API for each ISA (or a mega-API that covers all ISAs), and then provide
rich information about which operations are implemented efficiently, and which
are not, to allow apps to choose an optimal implementation at runtime. This
API would demo and market very poorly, but the engineers will love it, because
it's the one that enables the most benefit from SIMD.

1: with rare exceptions, such as the new fused multiply-add support in x86

2: Several years back, VC++ generated an all-bits-1 register by loading it
from memory, instead of issuing a pcmpeqd, which caused my vector
implementation to underperform my scalar one. This is my fear for the .NET
implementation.

~~~
aktau
This made me think of clang/gcc's vector extensions [1], which, together with
__builtin_shuffle, can be used to get some real "ok" cross-platform
(SSE/NEON/...) SIMD code going. An example of this in use is [2].

That said, you're right: usually the best performance can only be obtained by
using really specific instructions. But in my experience, a decent performance
increase can be obtained with the generic vector extensions alone.

Moreover, if you can use the vector extensions for a large part of the code,
you have to write a lot less platform-specific stuff. I.e., you increase
portability anyway, since now you only have to rewrite 5 out of 20 functions
instead of 20 out of 20. Even better, they allow one to write v3 = v1 + v2
instead of v3 = _mm_add_ps(v1, v2). The first is clearer, more portable (it
will generate addps or the equivalent NEON instruction, ...), and plain nicer
to read.

Your pcmpeqd example is a good example of an optimizer flaw. In my opinion
this is orthogonal to whether or not to expose a specific or generic API. The
compiler should've use the most efficient instruction for that simple idiom,
period (without you telling it to use pcmpeqd). If we continue your line of
reasoning, we're back to assembly for everything.

[1]: [https://vec.io/posts/gcc-and-clang-vector-extensions](https://vec.io/posts/gcc-and-clang-vector-extensions) (The vector
extensions allow +, -, *, /, <, >, ==, != to be used naturally on SIMD types)

[2]: [https://github.com/rikusalminen/threedee-simd](https://github.com/rikusalminen/threedee-simd)

------
dbaupp
I don't see any mention of shuffles, which I believe are regarded as the most
important part of SIMD for high-performance code.

Does anyone happen to know more about how/if shuffles are going to be exposed?

~~~
freikev
How would you like shuffles exposed? One of the things we really tried to do
with this design is make sure it's NOT tied to one particular hardware
implementation of SIMD.

~~~
zvrba
Couldn't you expose machine-dependent stuff in a subclass?

SSE has a bunch of other useful instructions like PMOVMSKB (useful for
fetching the result of vectorized comparisons, yay!), then there are string
instructions (sometimes also useful outside of string processing), etc.

New versions (AVX-512) will also have mask registers for masked operations.

~~~
chinpokomon
It sounds like they're trying to maintain platform independence while still
abstracting platform-specific low-level operations. If you utilize machine-
specific code in subclasses, how do you generate MSIL that remains
independent?

If you really needed something like that, couldn't you use C++/CLI and expose
those operations in your own unmanaged library? You would of course lose
portability, but that seems like a possible workaround.

------
deadc0de
And we had vectorization in the JVM years ago.

~~~
jodrellblank
And we've had straight marriage for years so why is gay marriage news, right?

