
Sandy bridge, Ivy bridge, Haswell and the mess of Intel processors feature list - shodanshok
http://www.ilsistemista.net/index.php/hardware-analysis/39-sandy-bridge-ivy-bridge-haswell-and-the-mess-of-intel-processors-feature-list.html
======
jandrewrogers
There is a pervasive issue in that most code written today is tacitly
designed around the assumptions of microarchitectures from a decade ago, and
most code in use is even older and rarely rewritten.

Generally speaking, compilers have a limited ability to improve the basic
design of code or deviate from the assumptions about hardware embedded in that
code. More often than not, taking advantage of different microarchitectures is
not just about the code generation, it is also about algorithm design and
selection, which falls directly on the programmer. I currently design software
that assumes the Haswell microarchitecture. While the code is written in C++
and the high level architecture is generally the same, the lower level idioms
are different because they reflect the strength of the microarchitecture. If
the code was for the latest ARM core, it would need to be quite different to
be efficient. And I am not even talking about things like AVX2. Idiom changes
to match microarchitectures are not "micro-optimizations" in the traditional
sense; they will often buy a 2x performance improvement overall versus well-
designed generic code tacitly optimized for another microarchitecture.

The reality is that almost no one writes code for e.g. the Haswell
microarchitecture. The idea of writing code for a microarchitecture at the
level of, say, C++ does not even enter most programmers' minds. Broadly
speaking, a compiler will not allow you to take advantage of advancements in
microarchitecture except by accident, which works sometimes but is not
efficient in terms of return on the new microarchitecture. Compilers learn
some micro-idioms, but most of the macro-idioms escape them (like designing
algorithms for the ALU parallelism of a particular microarchitecture).
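
As a toy illustration of what I mean by a macro-idiom (a hypothetical sketch,
not production code): a reduction written as one dependency chain serializes
on a single ALU, while splitting it into independent accumulators lets a wide
out-of-order core like Haswell keep several ports busy:

    #include <cstddef>
    #include <cstdint>

    // Naive reduction: every add depends on the previous one, so the loop
    // is limited by the latency of a single dependency chain.
    uint64_t sum_naive(const uint64_t* v, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; ++i)
            s += v[i];
        return s;
    }

    // The same algorithm restructured for ALU parallelism: four independent
    // chains let an out-of-order core retire several adds per cycle.
    uint64_t sum_ilp(const uint64_t* v, size_t n) {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
            s3 += v[i + 3];
        }
        for (; i < n; ++i)
            s0 += v[i];  // scalar tail
        return (s0 + s1) + (s2 + s3);
    }

A compiler will rarely make this transformation for you, because how many
accumulators are worthwhile depends on the port count and latencies of the
specific microarchitecture.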

The last big microarchitecture change for Intel was Nehalem, the original
"i7". A lot of code optimization assumptions from prior CPUs went out the
window with that microarchitecture. The latest, Haswell, added more usable ALU
parallelism and the BMI2 extensions, which are quite useful if you know how to
exploit them but nothing Earth-shattering. AVX2 was nice, but too limited to
have much value for most normal code.
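
For a flavor of what BMI2 offers (a minimal sketch; the function names are
mine, and it needs a BMI2-capable chip plus `-mbmi2`): PDEP and PEXT do
arbitrary bit scatter/gather in one instruction, something that previously
took a loop or a chain of shift-and-mask tricks:

    #include <cstdint>
    #include <immintrin.h>

    // PEXT: extract the bits of x selected by mask and pack them into the
    // low bits of the result, in a single instruction on Haswell.
    uint64_t pack_selected_bits(uint64_t x, uint64_t mask) {
        return _pext_u64(x, mask);
    }

    // PDEP: scatter the low bits of x to the positions set in mask.
    // Useful for e.g. Morton/Z-order coordinate interleaving.
    uint64_t interleave_with_zeros(uint32_t x) {
        return _pdep_u64(x, 0x5555555555555555ULL);
    }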

That said, I am pretty excited about AVX-512, which will be a major
microarchitecture extension once it is available. The giant caveat is
that no one will be able to really take advantage of it without, once again,
redesigning their code for that microarchitecture. A compiler won't be able to
do that for you, which has been the real stumbling block for the adoption of
advanced features in new microarchitectures. What constitutes an optimal
algorithm or data structure is microarchitecture dependent.

~~~
WallWextra
It's worth pointing out that Sandy Bridge was a massive microarchitectural
change from its predecessors, using a totally different out-of-order engine.

I don't know why the performance characteristics are so similar. I guess once
you have a big fat memory subsystem on-die, everything else is diminishing
returns?

~~~
nkurz
_I don't know why the performance characteristics are so similar. I guess
once you have a big fat memory subsystem on-die, everything else is
diminishing returns?_

I don't find that to be true. As Andrew is saying, the performance
characteristics only look similar because the software is rarely written to
optimize performance on modern processors. I have a 5-year old overclocked
Nehalem that can beat most current Haswells on single-threaded code of this
sort just because of the higher clock speed. But for algorithms designed for
the capabilities of the newer instruction sets, the different generations
really start to distinguish themselves.

Instead of ~5% improvements between Intel generations on standard benchmarks,
you can sometimes get 50% or more for architecture specific algorithms. From
Nehalem to Sandy Bridge to Haswell, the maximum per-cycle read bandwidth from
L1 has gone from 16B to 32B to 64B. This means that approaches that would have
been
silly 5 years ago (like 16KB lookup tables from which you need to read 32B
every cycle to sustain throughput) can be practical now.
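
As a rough sketch of what such a table-driven approach might look like (the
table layout and names here are hypothetical): a 16KB table of 32-byte
entries fits in Haswell's 32KB L1, and the two load ports can feed one full
32B table read per cycle:

    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>

    // Hypothetical 16KB lookup table: 512 entries of 32 bytes each.
    // It fits in L1, so each entry is one aligned 32B vector load that
    // a single load port can service every cycle.
    alignas(32) static uint8_t table[512][32];

    void translate(const uint16_t* idx, uint8_t* out, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            // One full 32B vector load from the table per element.
            __m256i v = _mm256_load_si256(
                reinterpret_cast<const __m256i*>(table[idx[i] & 511]));
            _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + 32 * i), v);
        }
    }

On Nehalem the same structure would bottleneck on L1 bandwidth; on Haswell it
runs at full throughput.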

------
azurelogic
The author is complaining out of context on a couple of points here:

1) The B820 came out 1.25 years later than the B940. So, it's not hard to see
that maybe Intel decided in that timeframe to add VT-x support.

2) None of the K-series CPUs support VT-d. For some reason, VT-d and
overclocking are not compatible. This would have been more noticeable if, for
example, the author had compared the i7-4770K to the i7-4770, not the i5-4570.
The same distinction is there between the i5-4670 and i5-4670K.

3) The table is organized in a misleading fashion. Why put the weakest CPUs in
the center, except to intentionally disrupt the proper trend?

After fixing or striking these issues, this isn't even an article worth
writing because the features suddenly make sense.

~~~
wtallis
> _" None of the K-series CPUs support VT-d. For some reason, VT-d and
> overclocking are not compatible."_

No, _compatible_ usually implies there's a technological or engineering
justification. This is pure marketing. And the _Devil's Canyon_ update to
_Haswell_ introduced some -K parts that have VT-d.

Edit: And _support_ likewise implies that even if it's not a purely technical
issue, there's at least some tradeoff. Enabling virtualization has a price to
the consumer, but doesn't cost Intel anything extra.

------
nkurz
As someone excited by the improvements that are possible moving from AVX to
AVX2, I was recently surprised to learn that Intel has chips released this
year that do not support AVX. AVX-512 is still supposedly scheduled for
release next year with Skylake.

How long do you suppose it will be until it's reasonable to presume that most
processors will support it? SSE4.2 came out in 2008 and appears to be the
baseline now, so 5 years seems like a reasonable guess.

It's hard to believe this degree of ISA segmentation is going to benefit Intel
in the end.
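
Until then, the usual workaround is runtime dispatch. A minimal sketch using
GCC/Clang's `__builtin_cpu_supports` (the kernels here are hypothetical
stand-ins):

    #include <cstddef>

    // Hypothetical kernels, each compiled for a different ISA level.
    static void kernel_avx2(float* v, size_t n) {
        for (size_t i = 0; i < n; ++i) v[i] *= 2.0f;
    }
    static void kernel_sse42(float* v, size_t n) {
        for (size_t i = 0; i < n; ++i) v[i] *= 2.0f;
    }

    // Check at runtime what the CPU we actually landed on supports, so a
    // single binary can carry code paths for several microarchitectures.
    void kernel(float* v, size_t n) {
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2(v, n);
        else
            kernel_sse42(v, n);
    }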

~~~
wtallis
I think that the SIMD instructions are well past the point of diminishing
returns, both for the kinds of instructions provided and the width of the
vectors. I'm more interested in how VT-d and TSX allow different software
architectures, if and only if you can design your OS around the assumption
that they're available. By only including them on select models, Intel's
making them much less useful even for the customers who want to pay extra to
have those features.
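
For the curious, a rough sketch of what TSX lock elision looks like with the
RTM intrinsics (assumes a TSX-enabled part and `-mrtm`; the lock fallback
path is mandatory because a transaction can always abort):

    #include <atomic>
    #include <immintrin.h>

    std::atomic<bool> locked{false};  // fallback spinlock
    long counter = 0;

    void increment() {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Reading the lock puts it in the transaction's read set: if a
            // fallback thread holds it (or takes it later), this transaction
            // aborts instead of racing with the lock holder.
            if (locked.load(std::memory_order_relaxed))
                _xabort(0xff);
            ++counter;                // speculative, no lock taken
            _xend();
        } else {
            // Transaction aborted (conflict, interrupt, capacity, ...):
            // fall back to the real lock.
            while (locked.exchange(true, std::memory_order_acquire)) {}
            ++counter;
            locked.store(false, std::memory_order_release);
        }
    }

The design point stands: this only pays off if you can assume TSX exists and
architect the locking around it, which segmented availability undermines.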

~~~
thechao
I used to write software rasterizers in a past life. AVX-512 is straight-up an
observed 8x performance gain over SSE: a well-written software rasterizer is a
very clever thread-scheduling algorithm wrapped around long lists of FMA-
load/store sequences.
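
A minimal sketch of the flavor of code I mean (illustrative, not from an
actual rasterizer): long runs of load/FMA/store over 16-wide vectors, here as
a SAXPY-style loop with AVX-512F intrinsics:

    #include <cstddef>
    #include <immintrin.h>

    // y[i] = a * x[i] + y[i], 16 floats per iteration with AVX-512F:
    // two 64B loads, one fused multiply-add, one 64B store.
    void saxpy512(float a, const float* x, float* y, size_t n) {
        const __m512 va = _mm512_set1_ps(a);
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            vy = _mm512_fmadd_ps(va, vx, vy);  // fused multiply-add
            _mm512_storeu_ps(y + i, vy);
        }
        for (; i < n; ++i)  // scalar tail
            y[i] = a * x[i] + y[i];
    }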

I know that ray-tracers, databases, and other 'big data' workloads all gain equally well
from AVX-512. The IO-request depth on these new parts is such that, for ALU-
heavy work, with good IO-spread, memory latency is completely hidden---fetch
time is still ~250-300c, but you just no longer see it.

The only thing I miss is 'addsets', which was dropped from LRBni, as this
instruction was 'the rasterizer function': it now requires a fairly involved
sequence to replicate in AVX-512.

Pining for other lost things: if we had the LRBni up- and down-convert
instructions, a software texture sampler would be a lot more feasible.

~~~
wtallis
Wider SIMD units may provide a linear performance increase for many workloads,
but they have a super-linear impact on the up-front cost of the chip,
especially when you take into account the opportunity cost of those
transistors: they could have been dedicated to something that might have also
helped non-numerical workloads. Wide SIMD is great to have, but it doesn't
come free, or else we wouldn't have GPUs.

------
hayksaakian
I've noticed that CPU companies are focusing more and more on power
efficiency, potentially to compete with ARM. Maybe turning off the features
described in the OP makes lower-end CPUs run faster?

That might be significant since I also presume most laptops use those same
lower-end CPUs, laptops being the ones that need to save energy much of the
time. (Desktops are obviously plugged in.)

This is probably most significant in the Microsoft Surface (it uses a real
Intel CPU, if I recall correctly).

I wonder which features it has marked "off", and what impact that has on
battery power.

~~~
wtallis
Remember, within a generation these chips are almost all made from the same
set of masks. None of these disabled features save any transistors or die
area; the disabling just leaves you with a dead spot on the chip. So the real
question is: if you've got the hardware in place to provide these special
features, can you save any power by not using them?

The answer is that you only save power by not doing the things those features
accelerate at all. It's never more efficient to implement AES through
general-purpose instructions than using the fixed-function hardware. If you've
got a workload that can benefit from the wider SIMD units, it won't save any
energy to only use the narrower instructions. If you're trying to secure or
virtualize your OS, doing it with VT-x and VT-d will be so much faster than
emulating it in software that even if those features added several watts to
the chip's power consumption, you would still save energy.
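
To make the AES point concrete, here's a rough sketch of the fixed-function
path (AES-NI intrinsics; the key schedule is omitted and the round keys are
assumed given): each `_mm_aesenc_si128` performs an entire AES round that a
table-based software implementation needs dozens of loads and XORs to match:

    #include <immintrin.h>

    // Encrypt one 16-byte block with AES-128 using AES-NI.
    // rk[] holds the 11 expanded round keys.
    __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
        block = _mm_xor_si128(block, rk[0]);         // initial AddRoundKey
        for (int i = 1; i < 10; ++i)
            block = _mm_aesenc_si128(block, rk[i]);  // rounds 1 through 9
        return _mm_aesenclast_si128(block, rk[10]);  // final round
    }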

Doing less can save power, but doing the same amount of stuff the hard way
doesn't help anything.

~~~
tanderson92
That's only true if you are using those features 100% of the time though,
right? By which I mean, there is no doubt a trade-off in play when you only
use the features for 15% of your workload. This is because the chip must be
powered on all the time but, say, the software implementation of AES doesn't
always have to be running.

~~~
wtallis
No, special-purpose functional units are very well power-gated on modern
chips. They don't draw appreciably more power when not in use than when fused
off.

~~~
rasz_pl
You say that, but in practice Intel is still 5-10x off ARM when it comes to
idle power.

~~~
wtallis
Register renaming and instruction reordering and things like that can't be
selectively powered down, and are very power-hungry. But if you were talking
about when the entire core is suspended, then that's largely unrelated to the
high-level feature set of the processor. It's about things like the analog
characteristics of the transistors used: Intel builds their mainstream CPUs
using circuits that can run at 4-5GHz without trouble, while none of the
mobile SoCs can come close regardless of exotic cooling.

------
ChuckMcM
The author makes a point: when CPUs are "good enough", it's a real problem for
CPU manufacturers. They don't price them to have a service life of 15 years;
they expect them to be replaced every 2-3 years. And the option there is to
featurize them. It is interesting to watch in the ARM space as well.

------
DSingularity
I don't think a lot of people understand that some of these features take
more than flipping a chicken bit to enable. Some chips can be incompatible
with features due to physical reasons.

Others of these features make little sense for some chips and some markets.
For example, some of the VT-d features: the majority of personal users aren't
scheduling more than one or two VMs, max. Do we really need VPID and larger
TLBs for such workloads? No. So why include the feature?

If binary size becomes an issue, then we need a better solution. Maybe it's
delaying some or all optimization until install time. Maybe it's providing
individual binaries for different targets.

~~~
nkurz
_Some chips can be incompatible with features due to physical reasons_

I think this is a good point. So far, I think the AVX/AVX2 distinction is based
only on release date --- the newer chips support the newer instruction set.
This is a good sort of improvement. And it's probably only practical to
support AVX and AVX2 if you have 256-bit registers to work with.

Does the lower-end mobile Pentium-branded line even have full physical support
for 256-bit vectors? That is, are they supporting 128-bit SSE instructions in
a full 256-bit register but not providing any instructions that use the upper
lane?

~~~
DSingularity
There is a strong advantage to a homogeneous ISA. I am glad to see that newer
chips have more complete support for these instructions. Maybe they are still
emulating them with micro-ops; even without real acceleration, that's still
convenient.

Regarding the mobile lineup and AVX2, honestly I wouldn't know. It is very
much possible that they disabled it to avoid competing with higher lineups.
But a more likely (IMO) reason is that adding 256-bit support added enough
cost that it was either unacceptable or simply wasn't justified by the target
workloads when evaluated in simulations.

The physical reasons I was alluding to revolve around power, area, and
capacitance (delay). These inform the cost ($$$) and performance of chips.
Some of these big features (like AVX3) push the limits of what you can do
without having to use high performance transistors or deal with intense power
draws.

Another physical reason is that sometimes parts suffer defects during
manufacturing. Modern chip design is inherently modular, so sometimes the
only option is to disable certain parts of the chip.

------
darklajid
Trying to buy myself a present atm: a new desktop machine.

This is actually one of my pain points, because none of the sites I visit list
CPU features, and I am interested in some (notably the virtualization-related
extensions).

~~~
ivank
Intel publishes this information for every CPU they make at
[http://ark.intel.com/](http://ark.intel.com/)

------
antonios
Strangely enough, ECC is supported on some Core i3 models.

------
billions
Moore's law is over. Pentium 4s reached 3GHz over a decade ago. CPUs try
things like parallelizing while the human brain thinks sequentially. CPUs are
still getting faster, but at an incrementally slower rate. A 4-year-old
MacBook Air is half as performant as the latest. In previous decades, the
performance gains were 2-5x per year.

A paradigm shift (quantum computing?) needs to happen soon so computation can
continue scaling.

~~~
nemothekid
Moore's law was never about performance.

~~~
billions
Moore's law is about the number of transistors on a chip, which is reaching a
limit because heat cannot be dissipated fast enough. Heat is the primary limit
on performance.

~~~
sliverstorm
More transistors are still better, even if they need to be off most of the
time. You can spend more silicon on advanced features like AVX, spending
transistor count for performance at a neutral power cost.

