
The x86 Advanced Matrix Extension - rbanffy
https://fuse.wikichip.org/news/3600/the-x86-advanced-matrix-extension-amx-brings-matrix-operations-to-debut-with-sapphire-rapids/
======
PaulHoule
I wonder what Charlie Demerjian is going to say about this.

It is an awful lot of registers for a feature that few programs may use.
There's the risk that bfloat16 is a fad and five years from now it is hardly
used at all. At best it ends up powering a full-stack perception-and-synthesis
feature about as good as

[https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video](https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video)

Most of all it drives me nuts that Intel is going down the fixed-length SIMD
route (ever wider instructions), where the code you write is specific to the
vector width of the processor you run on. Machines like the ILLIAC, the Cray,
and the vector facility for the 3090 mainframe would automatically chunk the
work so that you didn't need to rewrite your code for a future, wider model.

ARM is doing it the right way:

[https://alastairreid.github.io/papers/sve-ieee-micro-2017.pd...](https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf)
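
For a flavor of that approach, here's a minimal vector-length-agnostic loop
using the ACLE SVE intrinsics (a sketch, assuming an SVE-capable compiler,
e.g. gcc -march=armv8-a+sve); the same binary adapts to whatever vector width
the hardware implements:

    #include <arm_sve.h>
    #include <stdint.h>

    // Add two float arrays without ever hard-coding the vector width.
    void vla_add(float *dst, const float *a, const float *b, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntw()) {   // elements per vector
            svbool_t pg = svwhilelt_b32_s64(i, n);    // predicate covers the tail
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
        }
    }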

~~~
zamalek
> It is an awful lot of registers for a feature that few programs may use

I was just wondering what you could do with 8KiB of register memory if you
ignored the intent of the registers outright (assuming you can move data
between these and the classical registers).

~~~
api
Cache expanded cryptographic keys and tables entirely in registers for one...

~~~
fuoqi
AFAIK SIMD-accelerated cipher implementations already mostly do this. For
example, AES-NI-based implementations of AES-128 and AES-192 keep the round
keys in XMM registers without reloading them while processing blocks.
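
A sketch of that pattern with the AES-NI intrinsics (assuming the round-key
schedule rk[0..10] was expanded beforehand; in a multi-block loop the compiler
can keep the whole schedule resident in XMM registers):

    #include <wmmintrin.h>   // AES-NI intrinsics

    // Encrypt one AES-128 block; rk[] is the pre-expanded key schedule.
    static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
        block = _mm_xor_si128(block, rk[0]);          // initial AddRoundKey
        for (int i = 1; i < 10; i++)
            block = _mm_aesenc_si128(block, rk[i]);   // rounds 1-9
        return _mm_aesenclast_si128(block, rk[10]);   // last round, no MixColumns
    }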

------
api
Intel is going to start thrashing around now, adding a million features to try
to beat ARM and AMD on various microbenchmarks and special use cases.
Meanwhile ARM and AMD will keep winning on throughput, price/performance, and
(for ARM particularly) performance/watt.

Any win all these features bring can also be achieved on ARM or Zen by adding
more cores, except in those few cases where you have huge computational tasks
that cannot be efficiently parallelized and there are too few discrete jobs to
allow for coarse-grained (job/task) parallelism. There are not many of these.

Meanwhile all these features are going to make Intel chips even more complex,
making bugs more likely and making iteration more costly.

My read is that Intel is shooting for maximum possible single-threaded
performance because they can't compete on power efficiency or core count. They
can't compete in those areas because their process nodes are not as small as
what TSMC can offer, and most ARM chips, as well as AMD's, are fabricated by
TSMC at 7nm and soon 5nm. (Yes, I know nm node names are no longer directly
comparable, but TSMC is ahead of Intel and likely to stay ahead unless Intel
can really push hard on fab engineering.)

------
TinkersW
It sounded interesting until I saw that the only float type it supports is
Brain float :(

~~~
cesaref
I'm kind of interested to hear what bfloat16 sounds like (for audio DSP). I'd
expect it to be good enough for a large number of algorithms as long as they
are numerically stable, and if we get decent performance and reduced power
use, I'm all for that!

~~~
klodolph
I’m sure you could come up with an application for it, but if you want audio
output at some point, half-float sounds like quite a challenge.

\- The -66dB noise floor is pretty bad, and it accumulates at every step.

\- 11 bits is probably not enough for filter coefficients. So your filters
would likely be running with single precision floats, at least. Even low-cost
DSP chips tend to give you a big chunk of bits for your filters.

\- Naïve oscillator designs would accumulate a lot of error. By a
back-of-the-envelope calculation, if you wanted an oscillator at C4, you'd
likely be around 1/4 tone sharp or flat unless you ran the oscillator at
higher precision.

I’m definitely of the mind that bit depth is overrated in music. 16 bits is
great for mastered music and simple tweaks. By contrast, from my experience
writing DSP code, it often makes your code simpler and faster to run at higher
bit depths and sample rates, and then convert to e.g. 16 bits as the very last
step. The problem is that squeezing good output from low precision or low
sample rates requires more complicated and slower algorithms.
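
The oscillator point is easy to test. A toy C simulation (hypothetical: it
fakes half-float rounding by masking a float down to 11 significand bits, and
it ignores fp16's narrower exponent range) of a naïve phase accumulator at C4:

    #include <math.h>
    #include <stdio.h>

    // Crude stand-in for fp16: round a float to 11 significand bits.
    static float round_to_11_bits(float x) {
        union { float f; unsigned u; } v = { x };
        v.u = (v.u + (1u << 12)) & ~((1u << 13) - 1u);
        return v.f;
    }

    int main(void) {
        const double fs = 44100.0, f0 = 261.63;   // C4 at 44.1 kHz
        float inc = round_to_11_bits((float)(f0 / fs));
        float phase = 0.0f;
        long wraps = 0;                           // completed cycles
        for (long i = 0; i < (long)fs; i++) {     // run one second
            phase = round_to_11_bits(phase + inc);
            if (phase >= 1.0f) { phase -= 1.0f; wraps++; }
        }
        printf("requested %.2f Hz, got ~%ld Hz (%+.1f cents)\n",
               f0, wraps, 1200.0 * log2(wraps / f0));
        return 0;
    }

Whatever it prints on your machine, the error comes from the phase increment
losing bits as the accumulator approaches 1.0.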

------
innocenat
Those tile registers are crazy. But I wonder how long it will take to actually
be performant, considering the present problems with AVX512 frequency
switching.

~~~
google234123
AVX512 is already performant.

~~~
jfkebwjsbx
Only in a subset of cases, which is the problem: you cannot simply always use
it like the previous extensions.

~~~
klodolph
None of the previous extensions could be used blindly, either. It was a while
before people figured out how to use MMX or SSE well, and people still often
find that the scalar version of an algorithm beats their vector version.

~~~
jfkebwjsbx
I am not talking about ease of use, but about the downclocking.

The other extensions do not trigger it, not even 256-bit AVX.

With AVX512 it is not always a win, and you don't even know until you try on
the particular hardware.

~~~
MaxBarraclough
It also has a 'warm up' phase, if I understand correctly.

[https://news.ycombinator.com/item?id=17936810](https://news.ycombinator.com/item?id=17936810)

[https://www.agner.org/optimize/blog/read.php?i=415](https://www.agner.org/optimize/blog/read.php?i=415)

~~~
Const-me
I don’t think that applies to modern AMD processors, though.

Agner’s microarchitecture.pdf says about Ryzen “There is no penalty for mixing
AVX and non-AVX vector instructions on this processor.”

Not sure if it applies to Zen 2, but I’ve been using one for a year for my
work, AVX 1 & 2 included, and I think I would have noticed.

~~~
jcranmer
AMD processors used to implement AVX instructions by double-pumping them,
using only 128-bit vector ALUs. This means there's no clock penalty, but
there's also no speedup over an SSE instruction by doing so. I don't know if
this is still the case with the newest µarchs though.

~~~
Const-me
> but there's also no speedup over an SSE instruction by doing so

Just because they are split doesn’t mean they run sequentially. Zen 1 can
handle up to 4 floating-point micro-ops/cycle, and there are 4 floating-point
execution units, each 128 bits wide (that’s excluding load/store; these 4 EUs
only compute).

Native 256-bit is even faster due to fewer micro-ops and potentially more in-
flight instructions, but I’m pretty sure even on Zen 1 AVX is faster than SSE.

~~~
innocenat
It depends. If the EUs are actually the bottleneck, then SSE vs. AVX wouldn't
make any difference in speed.

However, when instruction decode/retire is the bottleneck, AVX can be faster.
I remember this could be the case on Intel Sandy Bridge (first-gen AVX,
double-pumped, retiring 3 instructions/cycle), where AVX was sometimes faster
(usually not by much).

With recent CPUs from both Intel and AMD able to decode/retire at least 4
instructions per cycle, this has largely ceased to be the case.

~~~
Const-me
> AVX can be faster

Yes. Another possible reason is instructions without SSE equivalents. I
remember working on some software where the AVX2 broadcast-load instruction
helped substantially.
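
A hypothetical illustration of the idea (here with AVX1's vbroadcastss, which
splats a scalar from memory in one instruction; AVX2 extends broadcasts to
integer types and register sources):

    #include <immintrin.h>

    // Scale an array by a constant held in memory.
    void scale8(float *x, long n, const float *k) {
        __m256 vk = _mm256_broadcast_ss(k);       // one-instruction splat
        long i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(x + i);
            _mm256_storeu_ps(x + i, _mm256_mul_ps(v, vk));
        }
        for (; i < n; i++)                        // scalar tail
            x[i] *= *k;
    }

With plain SSE, the same splat takes a load plus a shuffle.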

------
unwind
Meta: I'm not a native speaker, but that lonely 'W' in the title really irks
me. A suggested alternate title would be something like "Intel's Sapphire
Rapids debuts x86 Advanced Matrix Extension" (60 chars).

~~~
messe
Or how about removing “the ” from the start of the sentence and just writing
“with”?

~~~
stefan_
Or we just replace the w/ with "in".

------
waynesonfire
Where are these extensions being used? I didn't look hard, but are there open-
source libraries / compilers that will take advantage of them?

It's sort of amazing how much performance you can squeeze out when you fix
your OS and CPU architecture.

There are so many extensions:
[https://software.intel.com/sites/landingpage/IntrinsicsGuide](https://software.intel.com/sites/landingpage/IntrinsicsGuide)
\-- are we supposed to write our own libraries to leverage these, or do we
need to file tickets with our favorite compilers for them to develop these
optimizations?

Oh, one more question: how do these overlap with AMD?

~~~
mratsim
BLAS libraries, oneDNN, OpenCV, Eigen, Tensorflow, PyTorch, LLVM MLIR, ...

AMD usually implements them but with a couple of years of delay.

For example, AVX512 is not implemented, and we had to wait for Ryzen 3000 to
get the same AVX capabilities as Intel (2 AVX units per core instead of one).

------
sradman
OK, this is a new Intel SIMD-like instruction set: AVX for vectors, now AMX
for matrices. I guess this is an alternative to Nvidia GPUs, Google TPUs, the
Apple Neural Engine, etc.
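
For a flavor of the programming model, a hypothetical sketch built from the
AMX intrinsics in current compiler headers (assumes -mamx-tile and -mamx-bf16,
an OS that permits AMX state, and B pre-interleaved into the required pair
layout; the tile shapes are illustrative):

    #include <immintrin.h>
    #include <stdint.h>

    // One 16x16 fp32 tile C = A*B with bf16 inputs.
    void amx_matmul_tile(float *c, const void *a, const void *b) {
        uint8_t cfg[64] = {0};       // 64-byte tile configuration
        cfg[0] = 1;                  // palette_id 1
        for (int t = 0; t < 3; t++) {
            cfg[16 + 2 * t] = 64;    // colsb[t]: 64 bytes per row
            cfg[48 + t] = 16;        // rows[t]: 16 rows
        }
        _tile_loadconfig(cfg);

        _tile_loadd(1, a, 64);       // A: 16x32 bf16 elements
        _tile_loadd(2, b, 64);       // B: 16x32 bf16, pre-interleaved
        _tile_zero(0);               // fp32 accumulator tile
        _tile_dpbf16ps(0, 1, 2);     // C += A*B, bf16 in, fp32 out
        _tile_stored(0, c, 64);      // write 16x16 fp32 results
        _tile_release();             // drop tile state
    }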

~~~
nabla9
It's not a general alternative. It's good for some subset of inference tasks.

Intel will deploy their own GPUs some time in the future.

------
deltasquared
I am wondering why I would want this on a CPU when this kind of processing is
already available on a GPU.

~~~
chrisseaton
Where is your data? Is it in the CPU cache or is it in the GPU? Computing
where your data is, rather than moving your data to where your compute is, can
often be the best option.

~~~
emcq
For small networks it's often a win to stay on-chip, at least on the power
side. But if you do need to go off-chip for memory, it's hard to beat the
memory bandwidth you get on a GPU.

------
mratsim
Looks very interesting, but... AVX512 is already problematic cooling-wise;
this seems even worse.

------
im3w1l
How big of an issue is context switching in the middle of a sequence of AMX
operations?

~~~
jcranmer
The AMX extensions drop 2 more components into XSAVE: XTILECFG (which is 64
bytes) and XTILEDATA (8192 bytes).

Interestingly, there does seem to be a new extension (see §3.2.6 of
[https://software.intel.com/content/www/us/en/develop/downloa...](https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html))
that, on first glance, looks to be a per-thread enable/disable bit for these
registers. That suggests an OS could expose a per-process capability to
enable/disable AMX and thereby skip saving these registers on context switches
into processes that don't use it.
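
For reference, the 64-byte XTILECFG layout sketched as a C struct (field
offsets follow the programming reference linked above; treat the exact layout
as an assumption until checked against the latest revision):

    #include <stdint.h>

    // XTILECFG: tile configuration state, saved/restored via XSAVE.
    typedef struct {
        uint8_t  palette_id;     // which tile palette is in use
        uint8_t  start_row;      // restart row after a faulted tile load/store
        uint8_t  reserved[14];
        uint16_t colsb[16];      // bytes per row, for each tile register
        uint8_t  rows[16];       // row count, for each tile register
    } tilecfg_t;

    _Static_assert(sizeof(tilecfg_t) == 64, "XTILECFG is 64 bytes");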

------
snvzz
In other news, bloated CISC architecture becomes further bloated.

I'm looking forward to RISC-V's V extension, which is due to be ratified
around September. Unlike AVX512 and friends, this one is vector-length
agnostic.

~~~
monocasa
This isn't built on the AVX512-style register file, or even on vector
registers at all. It's a set of huge matrix registers, so it's pretty
orthogonal to both AVX512 and RV-V.

~~~
jabl
I haven't followed the RV-V extension in a while, but IIRC it has acquired
features to configure the vector registers as matrix tiles, and matrix
multiplication instructions.

~~~
monocasa
It hasn't as of the 0.9 draft, but maybe there's something new I don't know
about.

~~~
jabl
I tried to look it up again, but I couldn't find anything. It might have been
just an offhand remark in some presentation about future improvements on top
of the base vector ISA.

