
New Intel Instructions for Alder Lake, Also BF16 for Sapphire Rapids - dataking
https://www.anandtech.com/show/15686/intel-updates-isa-manual-new-instructions-for-alder-lake-also-bf16-for-sapphire-rapids
======
chx
This doesn't mean ADL will actually see a wide release.

I am still dead convinced we will never see a wide release of a 10nm desktop
chip. It will be delayed then cancelled in favor of 7nm.

Server-wise, same: limited releases, more smoke and mirrors, etc.

Intel will limp along with 14nm until 7nm. 10nm is utterly broken and they
can't fix it.

~~~
reitzensteinm
I think that's true if 7nm progress is independent of 10nm, and the mistakes
made with 10nm aren't also delaying 7nm.

My understanding is that Intel's 14->10nm shrink was the most aggressive in
the industry, promising to yield a greater increase in density than the
geometry would imply, when usually there's a loss factor and the density
increase isn't as good as you'd naively expect.

Even after 10nm was delayed by quite some time, Intel pointed to this and
declared they weren't lagging the industry as the shrink was closer to a 1.5
node shrink.

If the 10nm delay was the result of this aggressiveness and Intel was well into
developing 7nm in a similar fashion before it became obvious how 10nm was
going to turn out, this may not be a matter of skipping over a single bad
apple.

~~~
chx
> I think that's true if 7nm progress is independent of 10nm, and the mistakes
> made with 10nm aren't also delaying 7nm

That is exactly what's happening.

~~~
keanebean86
7nm was planned for 2017 as of 2014. Ouch.

[https://www.tweaktown.com/news/41582/intel-to-hit-10nm-in-2016-with-7nm-cpus-arriving-in-2018/index.html](https://www.tweaktown.com/news/41582/intel-to-hit-10nm-in-2016-with-7nm-cpus-arriving-in-2018/index.html)

~~~
chx
That's just early optimism. As far as these things can be known from the
outside, the 7nm process is fine, very much unlike 10nm, which was known to be
broken for years even before Cannon Lake.

------
throw0101a
Possibly a meta question on instructions: is there such a thing as having 'too
many'?

I know transistors are cheap, but at what point is it diminishing returns on
new instructions? How many different use cases need to be handled? Certainly
there are new situations that need to be handled (e.g., H.264/5/6, AV1 video),
but will there ever be a point where we can say "this is enough"?

~~~
throwaway-9320
I remember reading somewhere that nowadays a significant chunk of the
instruction set isn't actually implemented directly in hardware, but via CPU
microcode that sort of emulates these instructions by combining existing ones.
Someone correct me if I'm wrong.

~~~
jcranmer
Micro-ops are the actual things that can be executed by the hardware. A
floating-point FMA unit is going to support a floating point addition,
subtraction, fused multiply add (with various intermediate sign twiddles), and
integer multiplication and wide multiplication--all without adding much more
hardware: you're adding a few xors or muxes to the big, fat multiplier in the
middle of it all. Each of these might have distinct micro-ops, or you might be
able to separate the processing stages and use a single multiplier micro-op
with distinct preprocessing micro-ops for the different instructions.
Realistically, though, you are adding new micro-ops, although the overall
hardware burden may be light.
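
A loose Python sketch of that point (purely illustrative; note that real FP
hardware rounds a fused multiply-add once, whereas the plain `a * b + c` below
rounds the multiply and the add separately):

```python
# Illustrative sketch: one fused-multiply-add primitive subsumes
# addition, subtraction, and multiplication with only "sign twiddles"
# and constant operands -- no extra multiplier hardware needed.
def fma(a, b, c):
    return a * b + c

def fadd(x, y):
    return fma(1.0, x, y)   # x + y  (multiply by 1)

def fsub(x, y):
    return fma(-1.0, y, x)  # x - y  (flip one operand's sign)

def fmul(x, y):
    return fma(x, y, 0.0)   # x * y  (accumulate onto zero)
```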

The motivation of adding new instructions is generally to get higher
performance, so there's going to be pressure to have hardware to execute it
well, as opposed to a more naive emulation. But sometimes people add support
without making it fast--AMD chips used to (still do? I'm not sure) implement
the 256-bit AVX instructions by sending the 128-bit halves through their units
in sequence, so that they technically supported AVX instructions but didn't
see much performance benefit from them.

------
throw0101a
Is there a rationale behind Intel's codename scheme?

With (e.g.) Ubuntu they go through the English alphabet, so you can get some
idea of the order things have/will come out, but it seems that Intel is using
a random word generator for their _X_ Lake names.

~~~
auvi
I think all the X Lake names are taken from real lakes in the state of Oregon.

~~~
throw0101a
Sure, but how are the names chosen?

With Ubuntu _et al._ it is alphabetical order, so you can generally tell the
timeline of releases. But how can one really tell what the current product
from Intel is, what came before, and what the upcoming releases are?

It's not like they're going through Oregon lakes in some kind of order, or are
they? By discovery, by size/volume, other?

~~~
barkingcat
As with most code names, there is no order; if you want one, maybe it's in
terms of favourite to least favourite parks/lakes of the Intel staff.

What you are looking for are monotonically increasing version numbers, which
code names are definitely not.

If you are looking for details about generations of chips,
[https://ark.intel.com/](https://ark.intel.com/) is your friend.

~~~
throw0101a
> _What you are looking for are monotonically increasing version numbers,
> which code names are definitely not._

They are not, but given that Intel seems to put them in their public
information / marketing material, it seems like the "codenames" are being used
as version numbers. If they were strictly internal-to-Intel I could see that
POV, but that doesn't seem to be happening.

And Intel's model numbers / SKUs also seem to be created by a random number
generator. :)

~~~
barkingcat
What's wrong with having public names that don't match a version number?

Like you said, even model/SKU numbers are mostly random. Why the expectation
of a monotonically increasing version number if Intel has almost never done it
before?

~~~
msla
> Why the expectation of a monotonically increasing version number if Intel
> has almost never done it before?

"Almost never" except for 8086/80286/80386/80486, you mean.

But who remembers those obscurities?

~~~
throw0101a
Pentium 2/3/4:

* [https://en.wikipedia.org/wiki/Pentium](https://en.wikipedia.org/wiki/Pentium)

~~~
msla
And the "Pentium" itself, arguably, being a "penta-" name coming after the
80486.

But the Pentium broke the numbering scheme to an extent, with the Pentium Pro
wedged between the Pentium and Pentium II.

------
throwaway_pdp09
The article says "LBRs (Last Branch Recording) in order to speed up branches".
Reading the PDF, it seems to be more about recording them (for profiling?). If
it really is about speedups, can someone summarise how, and how that differs
from using branch predictors?

~~~
chrisseaton
I think it's for tooling like profilers, not directly for performance. But you
could use that tool to improve performance in your code, so maybe that's what
they mean.

~~~
zbjornson
That's correct. Info on LBR in general:
[https://lwn.net/Articles/680985/](https://lwn.net/Articles/680985/)
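
For what it's worth, LBR is what Linux `perf` taps into when you ask for
branch sampling; a typical session looks something like this (`./my_app` is a
placeholder, and this assumes a CPU and kernel with LBR support):

```shell
# Sample taken branches via the CPU's Last Branch Record registers
perf record -b -- ./my_app

# Show where sampled branches came from and went to
perf report --sort symbol_from,symbol_to

# LBR can also back lightweight call-graph profiling
perf record --call-graph lbr -- ./my_app
```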

------
gok
Is it clear to anyone if the BFloat16 support in these chips means they have
extra-low precision multipliers internally, or are the dot products still
implemented with the single precision guts? Does it actually increase compute
over FP32 or is it just a bandwidth win?

~~~
jcranmer
The new instructions are, I believe:

* VCVTNE2PS2BF16 — Convert Two Packed Single Data to One Packed BF16 Data [i.e., float32 -> bfloat16 conversion]

* VCVTNEPS2BF16 — Convert Packed Single Data to Packed BF16 Data [ditto]

* VDPBF16PS — Dot Product of BF16 Pairs Accumulated into Packed Single Precision [i.e., multiply two v2bf16 with each other, and then sum the two results into a f32.]

The last instruction is the only computational one, and it can be implemented
by splitting the 24-bit significand multiplier of an FP32 FMA unit into two
8-bit multipliers (bfloat16 keeps 8 significand bits, counting the implicit
leading one), one for each bfloat16, as well as duplicating the normalization
logic beforehand.
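
For intuition on the semantics (not the datapath): bfloat16 is just the top 16
bits of a float32, so the conversion and the per-lane dot product can be
modelled in a few lines of Python. The function names here are mine, the
rounding is the round-to-nearest-even truncation that the "NE" in the
mnemonics refers to, and this is a behavioural sketch rather than a bit-exact
description of the hardware (NaN special cases are ignored):

```python
import struct

def f32_to_bf16(x):
    # Round-to-nearest-even truncation of a float32 to 16 bfloat16 bits.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # round half to even
    return (bits + rounding_bias) >> 16

def bf16_to_f32(b):
    # A bfloat16 widens to float32 exactly: just append 16 zero bits.
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]

def vdpbf16ps_lane(acc, a, b):
    # One f32 lane of the dot product: acc += a[0]*b[0] + a[1]*b[1],
    # with the inputs first squeezed through bfloat16.
    for ai, bi in zip(a, b):
        acc += bf16_to_f32(f32_to_bf16(ai)) * bf16_to_f32(f32_to_bf16(bi))
    return acc
```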

