

Future instruction set: AVX-512 - przemoc
http://agner.org/optimize/blog/read.php?i=288

======
carterschonwald
The AVX / SIMD part of the x86 instruction set is probably the most understandable
subset to focus on learning! And I'm very excited about AVX-512

I'm actually going to be spending some of my time over the next year adding
proper SIMD support (including all the shuffles) to the main Haskell compiler,
GHC!

There are some really interesting constraints on the SIMD shuffle primops that
need some type-system cleverness to compile correctly!

Namely, you need to know, statically at code-gen time, the shuffle constants
that are given as "immediates" to the instructions! Normal values don't quite
have the right semantics, and accordingly the SIMD intrinsics in C compilers
kinda lie about the types they expect (i.e. if you give them a variable of the
right type, they'll give you an error saying they need an actual constant
literal).

tl;dr I'm going to make sure that GHC (and Haskell) can support AVX-512 by the
time thusly equipped CPUs are made available.

~~~
nkurz

      > Namely, you need to know "statically, at code gen time",
      > the shuffle constants that are given as "immediates" to 
      > the instructions!
    

Can you clarify where the extra difficulty is?

I'm ignorant of GHC, but I'd think that from the compiler's POV all that
matters is that the operand is a constant. Then it's just a matter of putting
in the value instead of the name. In GCC inline assembly, you can use the
poorly documented '%c' prefix to have a number treated as an immediate, so I'd
guess this must be possible internally too. Also possibly worth noting is
that, unlike the others, PSHUFB takes its shuffle control from a register
rather than from a value encoded into the instruction.

~~~
carterschonwald
There's more than one shuffle instruction; in fact there are quite a few!
You're right that some can take register args, but those aren't the ones I
care about as much.

You're right, there are hacks in C that handle that. My goal for GHC is to
actually have a systematic solution for handling any sort of constant literal
expression at compile time. This includes making it easy to add new primops
that require compile-time literal data.

There are some interesting implications if you want that restriction to be
typecheckable! This includes having a "static data" kind. Part of why you want
that is also because GHC is great at common subexpression elimination, and I
consider any implementation strategy that could be broken by a compiler
optimization to be unacceptable.

[edit to clarify: just naively using normal literal values would likely be
subject to CSE, and having code gen need to look around to look up a variable,
rather than being a localized process, is somewhat horrifying]

One particular end goal of mine is this: SIMD isn't that complicated, and it's
really easy to experiment with (but only if you can cope with C). I want to
make experimenting with SIMD much more accessible.

Interestingly enough, the notion of static data I want seems like it might be
an especially strong version of the notion of static data that Cloud Haskell
(the distributed-process lib) would like. So there may be some trickle-down
there!

One really cool optimization that a proper notion of static literals might
enable is making it much easier to generate things like static lookup tables
and related data structures that are small and perf-sensitive.

Edit: also if you want to try and stare at the source for a serious compiler,
ghc (while huge) is pretty darn readable. Just pick a piece you want to
understand and stare at the code for a while!

Edit: I should add that Geoff Mainland has some great preliminary work adding
experimental SIMD support to GHC that's in HEAD, pending 7.8. That said, GHC
support for interesting SIMD won't be ready for prime time till 7.10, in a
yearish.

------
tachyonbeam
Seems to me like Intel and AMD aren't very forward-thinking, adding new
instructions and registers every year or two. As if the x86 instruction set
weren't bloated enough, now they're going to have instructions with 4-byte
prefixes, and new registers you can only access with AVX. What's next after
that, AVX-1024 with 6-byte prefixes? Meanwhile, this renders MMX and SSE sort
of redundant. Seems to me we might be better served by some kind of vector
coprocessor and instructions that can operate directly on large vectors in
memory, instead of doubling the size of the vector registers all the time and
making x86 an ever harder target to generate efficient code for.

Maybe this is another area where ARM can beat x86: have a better-planned-out
vector instruction set that can be expanded without adding hundreds of new
instructions all the time, and more compact machine code.

~~~
corresation
_Meanwhile this renders MMX and SSE sort of redundant._

It's an evolution of the same SIMD ideas. Yes, the newer variants do render
the older ones redundant, but they hang around because existing code might
still use them.

MMX - SIMD, integer only. It reused the existing floating-point registers,
making it a PITA that often didn't even pay off because of the expensive
state switching between MMX and floating point.

SSE - started as floating-point SIMD with its own 128-bit registers. Evolved
through SSE4.2 with more instructions (functionality in hardware), more
flexibility (e.g. operate on 4 singles or 2 doubles or...) and the addition of
integer functionality.

AVX - doubled the size of the vector registers, added more of them, and
brought lots of new instructions and functionality. The successor of SSE, at
least on the floating-point side. AVX2 brought the integer functionality.

AVX-512 - doubles the size of the vector registers again: 16 single-float
operations in one go.

 _Maybe this is another area where ARM can beat x86. Have a better planned out
vector instruction set that can be expanded without adding hundreds of new
instructions all the time, and more compact machine code._

This kind of sounds like baseless griping. Unless you write a compiler, why do
you care? Do you really sweat the instruction prefixes?

~~~
ChuckMcM
I wish that Intel would have spent some transistors on an arbitrary-precision
decimal floating-point unit. That would have helped scientific processing, but
in the past it has been 'too expensive' in terms of transistors to implement.
Now that we have more transistors than we know what to do with, it seems like
that should be revisited.

Then generalize the vector coprocessing abilities of the GPU, and that would
be a pretty flexible base to work from.

~~~
jey
Why does scientific computing care about decimal arithmetic?

~~~
ChuckMcM
Same reason accountants do: numbers like 0.1, which are repeating fractions
in binary, are exact in decimal representation.

~~~
acqq
Citation needed, please. I don't know where nature is decimal.

~~~
ChuckMcM
Actually it's more fun if you just "do the math", and it's a fun computer
science problem:

0.1 is 1/10; fractional binary bits are 1/2, 1/4, 1/8, 1/16... Find a
combination of bits that represents 0.1 exactly.

If you want a very long treatise about how binary sucks for doing arithmetic
wander over to
[http://speleotrove.com/decimal/](http://speleotrove.com/decimal/)

~~~
jey
Right, and I can see why that's a problem in accounting, but why does it
matter for scientific computing? I do a fair amount of stuff that could be
called scientific computing, and I just use doubles. If I need to keep track
of uncertainty or propagate errors, I normally use Gaussians as the
representation (not "significant figures" as would be implied by using
decimal).

~~~
repsilat
It almost never matters in scientific computing. Doubles give us the
equivalent of almost 16 digits of accuracy, and that's more precision than we
know any physical constant to. You're right that the world isn't decimal, and
switching to decimal encodings actually reduces the effective precision of any
computation.

~~~
foobarbazqux
There's a reason they're called the natural numbers. Nature doesn't have to be
decimal for decimals to be useful (the question that started this debate), it
just has to be rational. Many many many parts of nature are rational, and
sometimes we need to deal with them in scientific computing. DNA sequence
processing comes to mind.

------
3pt14159
Is there a good place to begin to learn about this stuff from the ground up?
Maybe a user friendly compiler written for the purpose of education?

~~~
minimax
The shallow end of the pool is simple RISC CPUs, e.g. the Atmel 8-bit AVR.
The complete instruction manual is something like 160 pages (compare to x86
at 3k+) and there are tons of beginner resources for doing assembly on those
chips.

~~~
xradionut
There's more structured material for MIPS than for AVR. Plus the AVR has some
funky memory maps and modes.

~~~
StephenFalken
This all-time classic from MIPS is a good starting point: _MIPS R4000
Microprocessor User's Manual_ [1]

[1]
[http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_boo...](http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_book_Ed2.pdf)

------
WhitneyLand
And yet phones are already doing real-time 4k video encoding.

In other words, since a lot of compute intensive scenarios are already being
served by less general purpose hardware, what are the most recognizable
scenarios where AVX-512 will make a big difference over AVX-256?

I know video encoding can be an integer-only algorithm, while AVX seems to
help floats more, but still...

~~~
protopete
Which phone can do realtime 4k encoding? Can you tell me which SoC it uses?

Edit: It's the Acer Liquid S2, with Qualcomm Snapdragon 800 SoC

~~~
devx
All Snapdragon 800-based phones, and there are quite a few of them, with more
coming in the next few months (and of course later even more chips doing the
same). Right now that's the one you named, the LG G2, the Sony Xperia Z1, the
Samsung Galaxy Note 3, and soon the Nexus 5.

Now I'm not 100 percent sure if every single one of them has that as a _user-
centric_ feature (maybe they didn't enable it), but the chip supports 4k video
recording.

------
noahl
The mask registers that can turn off operations on individual elements of a
vector reminded me of CUDA. It might be possible to emulate individual
"threads" on these pretty easily.

Does OpenCL have a similar threading concept? I don't know much about it,
sadly.

------
Symmetry
I really wonder what this is for. Isn't AVX-256 code usually already limited
by memory bandwidth? There are design tradeoffs between memory latency and
memory bandwidth and in order for Intel CPUs to keep their advantage in high
single threaded performance they have to lean towards the low latency side of
things.

Is this aimed at Larrabee?

~~~
reitzensteinm
SKUs of Haswell with the GT3e graphics configuration include 128 MB of
on-package DRAM, which is intended to give more memory bandwidth to the GPU,
but it also acts as a cache for the CPU.

Based on what you're saying, it seems as though AVX-512 and future models with
larger, faster embedded DRAM might play very nicely together.

~~~
Symmetry
I should have been clearer. I was talking about the cache hierarchy and CPU
memory pipes. I suppose there's a divergence in main memory too, with CPUs
using DDR3 and GPUs using GDDR5, but as you say you can just throw cache at
that problem.

------
erichocean
_Floating point vector instructions have options for specifying the rounding
mode and for suppressing exceptions._

That's huge.

------
Aardwolf
Is there any x86 instruction set that supports quadruple precision floating
point numbers? If not, why not? Is it not useful enough?

~~~
ansgri
Where exactly is it useful? Double is already overkill in all multimedia
operations, and finance folks use fixed point anyway.

~~~
fdej
Quadruple (or higher) precision is needed in many scientific applications.

~~~
alayne
Do you know of any specific applications? "many" seems like an extraordinary
claim. I'm under the impression that quad precision is fringe. Double
precision support on GPUs is fairly recent.

~~~
fdej
Sure. Just for a start, some of David Bailey's papers
([http://www.davidhbailey.com/dhbpapers/](http://www.davidhbailey.com/dhbpapers/)),
e.g. "High-Precision Arithmetic: Progress and Challenges" and "High-Precision
Floating-Point Arithmetic in Scientific Computation" show some concrete
examples. You will find many others by searching for papers containing the
words "quad double arithmetic", "high precision arithmetic", "multiple
precision arithmetic" or similar terms. Most applications are probably in
physics, and of course in pure mathematics.

------
waterlesscloud
This might well be a dumb question, but why don't we have 128 bit processors?

No advantage? Crazy expense?

~~~
lambda
At some point, you need to ask what you mean by 128 bit. When people talk
about an 8 bit, 16 bit, 32 bit, or 64 bit processor, they are actually
generally conflating two or more things. There's the size of the general-purpose
registers, the size of the data bus (how much you can load from memory in a
single transfer), the size of the address bus (how many lines you have for
addressing RAM), and the size of pointers. In many machines, these have been
the same, though for example, 8 bit processors frequently had 14 or 16 bit
addresses and busses so they could access up to 16 or 64k of memory; but
there's also, for example, the 68008 with 32 bit registers, a 16 bit address
bus, and an 8 bit data bus.

So, when people talk about 32 or 64 bits, they generally mean two things: the
size of general purpose registers, and the size of addresses.

There's basically no need for addresses beyond 64 bits, at least for quite
some time. With 64 bits, you can address 16,384 petabytes (16 exabytes) in a
single process. Since the biggest single machines I can find these days
support a maximum of 4 TB of RAM (if you filled it with 32 GB DIMMs that
aren't yet available), we have a long way to go before you will need more than
64 bits of address space.

Furthermore, increasing address size can hurt performance. If your pointers
are all 128 bits, they take up twice the space of 64-bit pointers. There have
already been plenty of workloads that show a reduction in performance when
ported to 64 bit machines, just because the 64 bit pointers fill up so much
valuable cache space. In fact, for this reason, Linux even has support for the
x32 ABI, which uses an x86-64 processor in 64 bit mode but only uses 32 bit
pointers, so they can take advantage of extra registers available to x86-64
without paying the price for the larger pointers.
[https://en.wikipedia.org/wiki/X32_ABI](https://en.wikipedia.org/wiki/X32_ABI)

So, there's no benefit to 128 bit addresses and lots of potential downside, so
it's not going to happen for quite some time. How about for data, though?

Well, most software doesn't really need to work with integers or floating
point numbers larger than 64 bits, anyhow. For lots of applications, 64 or
even 32 bits is sufficient. Public-key crypto can frequently take advantage of
large integers, though it generally needs even bigger ones, like 2048 bits, so
you have to do bignum arithmetic anyhow.

Lots of the gains that you get from working with larger types come from
working on vectors of smaller types. But for those purposes, chips have had
128-bit registers for quite some time. SSE, introduced in 1999, included
128-bit vector registers, which could be treated as four 32-bit integers
(AltiVec on PowerPC had introduced the same idea a few years earlier; the idea
of SIMD has been around in supercomputers for many years). Later extensions
like SSE2 expanded their use to allow you to treat them as two 64-bit floats,
two 64-bit integers, eight 16-bit shorts, or sixteen 8-bit bytes.

So, for the only use case for which it's particularly valuable, working on
vectors in aggregate, we've had 128 bit registers for quite some time. We've
had 256 bit registers for a couple of years now in the form of AVX. Now this
promises to expand those to 512 bits. There's no good reason to expand your
addresses in the same way; at that point, you're just wasting space.

~~~
josephlord
Upvoted. Although 16 exabytes is less headroom if you are memory-mapping
persistent storage rather than just RAM, which makes increasing sense with
SSDs. 64-bit addressing is still plenty for most scenarios for some time to
come, though, even if this approach is taken.

