
AVX-512 Mask Registers, Again - ingve
https://travisdowns.github.io/blog/2020/05/26/kreg2.html
======
wtallis
I knew that the server variant of the Skylake core tacked on a few extra
sections to enable AVX-512 and the extra L2 cache. [0] But I wasn't aware that
the base consumer Skylake core already had a blank spot reserved for the
AVX-512 register file.

In hindsight it makes sense, but it raises the question of why, in 5 years of
iterating on the Skylake core, Intel hasn't tried to fill in that blank and
implement at least one AVX-512 execution unit on their consumer chips. It
seems like they could have gone for something running at half rate to avoid
the severe down-clocking the server cores require when powering up the full
AVX-512 unit, and they'd gain the benefits of the mask registers and other new
AVX-512 features and stay years ahead of AMD (expected to deliver AVX-512 in
2022?). Instead, this seems to be another feature that has been delayed by
Intel's 10nm failures.

[0]
[https://images.anandtech.com/doci/11544/skylakedie_changevsc...](https://images.anandtech.com/doci/11544/skylakedie_changevsconsumer.png)

~~~
londons_explore
Why would you ever leave a blank spot for something on a chip?

Die area is expensive, yet transistors on it are free. Might as well take a
crack at implementing it, and later disable it with fuses on models that
'shouldn't' have the feature.

Smells to me like perhaps corporate inefficiency making its way into the
silicon - space was reserved in the layout for this feature, but then the team
was late shipping a design, but they wouldn't allow anyone else to use that
area either, because "we'll be ready any day now".

~~~
BeeOnRope
Modern chips are full of "logically blank" spots. That is, chips are often
sold with 2 cores enabled when they physically contain 4 cores, sold with
certain parts of the ISA disabled (AVX) even though the execution units are
present, sold with less cache than physically present, etc.

Sometimes this is a necessary result of binning (e.g., those cache slices
didn't work), but mostly it is a result of pricing strategy: you want to
charge more to the customers who are willing to pay more, while still
capturing those who will pay less.

Building chips is expensive, but not _that_ expensive on a unit basis, so
whether you sell a die for $100 or $20,000 you are still making a (marginal)
profit.

Given that context, it's not surprising that chips can also have _actually
blank_ spots which are not enabled on any chip: in this case because the SKX
and SKL designs are tightly bound, so the floorplan is almost the same for
both chips.

This is probably either a result of co-design, where SKL was designed with
the future SKX in mind, meaning that SKX needed minimal changes to the core
relative to SKL, or a result of a change of strategy: perhaps most SKL parts
were originally slated to have AVX-512, on 10nm, but when 10nm was repeatedly
delayed, the power or other impact was too high for most of the 14nm line and
so AVX-512 was relegated to the SKX family. Who knows.

What is clear is that SKL was definitely laid out with AVX-512 in mind.

------
zbjornson
> Who is going to be making heavy use of x87 or MMX (both obsolete) along with
> AVX-512

Edge case and still not likely to cause contention: x87 fadd has 3-cycle
latency while AVX FP add has 4-cycle latency through Ice Lake (ICL). If you
need to prepare some constant with a chain of high-precision additions to
broadcast into an AVX reg, x87 is faster. (Especially if the alternative is
compensated addition.)
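
To make the two alternatives concrete, here is a rough C sketch, not taken from the article. The function names are made up for illustration. It assumes a compiler and ABI where `long double` is the 80-bit x87 format (GCC or Clang on x86 Linux, for example); where `long double` is the same as `double`, `sum_extended` gains nothing. It also assumes the code is not built with `-ffast-math`, which can optimize the compensation away.

```c
#include <stddef.h>

/* Plain double accumulation: terms below half an ulp of the running
   sum are rounded away entirely. */
double sum_naive(const double *xs, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += xs[i];
    }
    return acc;
}

/* Accumulate in long double. ASSUMPTION: long double is the 80-bit x87
   format, so each step is a single extended-precision add. */
double sum_extended(const double *xs, size_t n) {
    long double acc = 0.0L;
    for (size_t i = 0; i < n; i++) {
        acc += xs[i];
    }
    return (double)acc; /* round once, at the end */
}

/* Compensated (Kahan) summation in plain double: c carries the
   rounding error of the previous step into the next one. */
double sum_kahan(const double *xs, size_t n) {
    double sum = 0.0;
    double c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = xs[i] - c;   /* apply the stored correction */
        double t = sum + y;     /* low bits of y may be lost here */
        c = (t - sum) - y;      /* recover what was lost */
        sum = t;
    }
    return sum;
}
```

Summing `1.0` followed by ten copies of `1e-16` shows the difference: the naive loop stays at `1.0`, while the compensated loop lands within an ulp of `1.0 + 1e-15`. The compensated version needs several dependent floating-point operations per term, where the extended-precision version needs one add.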

~~~
BeeOnRope
Yes, that is interesting. fadd also uses a different port (p5) as compared to
SSE/AVX FP stuff which uses p01, so it seems likely that there is separate
dedicated hardware for the x87 stuff, probably in the slice between the main
vector pipes, and this can also handle the 80-bit stuff.

------
vt240
> Who is going to be making heavy use of x87 or MMX (both obsolete) along with
> AVX-512 mask registers? It seems extremely unlikely.

Wouldn't this be a performance issue if you're linking an executable against
libraries compiled for MMX and AVX?

~~~
chrisseaton
x87 and MMX are both so incredibly old that I think it's pretty unlikely that
you will be linking code that combines these, unless you have a 'dusty'
library that nobody has the source code for anymore.

~~~
acqq
x87 still provides some functionality that is more convenient to use
in some specific scenarios. I've used it intentionally even for some 64-bit
code, where it was a perfect fit to all other requirements of the whole
environment and the goals of the project.

When you don't need it, of course you should use your defaults. But there are
still some specific scenarios where it definitely has its uses.

Just as an example of the advantages, x87 code is very compact and the numeric
manipulations happen on the implicit hardware floating point stack, allowing
for complex formulas fitting in only a few bytes (as the addresses are
implicit). Also, as some ABIs still depend on it, targeting such ABIs makes
the use of x87 unavoidable.

~~~
BeeOnRope
It also provides the 80-bit precision stuff which I guess could be useful for
something.

