
Intel’s 6th Generation Skylake Processors Scheduled for 2H 2015 - nkurz
http://wccftech.com/intels-6th-generation-skylake-processors-scheduled-2h-2015-5th-generation-broadwell-spring-15-updates-2015-2016-mobility-roadmap/
======
femto
Some interesting numbers:

In 1971, the 4004 microprocessor had 2300 transistors fabricated with a 10um
process. If the 4004 was fabricated with the same 14nm process as the Skylake,
over two hundred 4004s would fit in the area that a single transistor occupied
in the original 4004.

"Breakeven" occurred with the 68000:

In 1979, the 68000 microprocessor had 68000 transistors fabricated with a
3.5um process. If the 68000 was fabricated with the same 14nm process as the
Skylake, an entire 68000 would fit in the area that a single transistor
occupied in the original 68000.

Today's transistors are _really_ small compared to what they used to be! The
mind boggles at how many simple processor cores could be crammed into a
current die.
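
For anyone who wants to check the arithmetic, here's a quick
back-of-the-envelope sketch in C (feature sizes and transistor counts as
above; it assumes area scales with the square of the linear feature size,
which ignores real-world layout differences):

    #include <stdio.h>

    int main(void) {
        /* Feature sizes in nanometers; transistor counts per Wikipedia. */
        double node_4004 = 10000.0, node_68000 = 3500.0, node_new = 14.0;
        double shrink_4004  = (node_4004 / node_new) * (node_4004 / node_new);
        double shrink_68000 = (node_68000 / node_new) * (node_68000 / node_new);

        /* How many whole shrunk chips fit in one original transistor? */
        printf("4004:  ~%.0f\n", shrink_4004 / 2300.0);   /* ~222 */
        printf("68000: ~%.2f\n", shrink_68000 / 68000.0); /* ~0.92 */
        return 0;
    }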

The numbers came from:

[https://en.wikipedia.org/wiki/Transistor_count](https://en.wikipedia.org/wiki/Transistor_count)

~~~
swalsh
What blows my mind is that Intel continues to do it with what seems like
really only one real competitor.

~~~
higherpurpose
ARM chip makers _are_ competitors to Intel now. Whether you believe it or not
is less important; Intel certainly does. Their entire chip strategy over the
past few years has been a _response_ to ARM: with each new generation they
barely increase the performance of their _mainstream_ chips, or even _drop_
it, as they did with Core-M, in order to bring power consumption closer to
ARM's range. I'm not talking about IPC here, but about the overall
performance of the mainstream chips they're pushing in the market.

First it was -M chips in laptops (30W-level performance), then -U chips
(15W-level performance), and now -Y chips (5W-level performance - or so they
say). Yes, a good chunk of that difference comes from moving to new nodes and
from optimization, but not all of it. Another big part comes from lowering
performance and/or throttling their chips far more aggressively than before
in order to achieve that drastic drop in power consumption.

They don't care what AMD is doing these days, but they do care what ARM chip
makers are doing. They even bought a 4G modem company so they can integrate
modems into their chips the way Qualcomm does. Speaking of Qualcomm, I
believe it has just announced that it's going into server chips too.

Intel feels so threatened by ARM that it's losing _$1 billion per quarter_
essentially giving away its Atom chips for free to no-name tablet makers and
less important OEMs. They are the only ones desperate enough to take Intel's
deal, because the others know that as soon as Intel becomes a big player, it
will charge much higher rates than even the ARM competition - Intel likes big
profits on its chips, and that's the only way it can sustain the company.

They even went into the business of IP licensing, for crying out loud, just
like ARM (they're licensing Atom IP to Rockchip and Spreadtrum). Intel has
essentially admitted that it can't be competitive _building its own mobile
chips_. They have to license the IP so that other companies can build them -
on TSMC's foundry, on a process node that is old even by current ARM
standards.

~~~
gsnedders
Intel's big problem is that they invest so heavily in fabs (which is how they
stay roughly a process node ahead of everyone else!) that their costs are
much higher than most other manufacturers' - and that's why they charge so
much more for their chips than most of the generic ARM SoCs. Yes, some of it
is profit margin, but they've been pushing process development so hard that
much of it is just recouping the massive investments that give them their
competitive advantage.

------
vardump
Warning: link auto-plays video with sound.

Edit/addition:

AVX-512 in Skylake is pretty exciting; 32 FLOPs per clock cycle per core is
amazing. It's also good to see 64 or 128 MB of eDRAM being included in more
configurations - it should help with algorithms that have larger working
sets, not just with graphics. The integration is interesting too: latency
between the GPU and CPU should be low, which may open up better sharing of
work between the two units.
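
For context on where the 32 comes from: a 512-bit register holds 16
single-precision floats, and a fused multiply-add counts as 2 FLOPs, so one
FMA instruction retired per clock gives 16 x 2 = 32. A minimal sketch using
the public AVX-512F intrinsics (the function name is just illustrative):

    #include <immintrin.h>
    #include <stddef.h>

    /* y[i] += a * x[i], 16 floats per instruction; each FMA is
       16 lanes x 2 ops = 32 FLOPs. Requires -mavx512f. */
    void saxpy512(float *y, const float *x, float a, size_t n) {
        __m512 va = _mm512_set1_ps(a);
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
        }
        /* A real version would also handle the n % 16 tail. */
    }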

1 Gbps Ethernet is starting to become a limitation; it's barely enough for
current internet connections in some parts of the world. Nearly any SSD can
sustain 5 Gbps, and most new mechanical disks push 2 Gbps. It's time for
10GBASE-T to become a standard, with a power-saving option to run at lower
speeds. Nowadays even low-end hardware, such as an $80 Asus Wi-Fi router with
USB 3.0, can push 500+ Mbps when serving files.

~~~
bhouston
It is hard to get mainstream usage of CPU extensions until they are on the
large majority of chips, because creating multiple code paths is a pain.
Thus, while AVX-512 is nice, I do not expect to see it used outside video
codecs for at least 3 years, and it likely won't be widely used for 5-7
years. The problem is compounded by the slow upgrade cycles of PCs now that
their performance no longer increases drastically per generation.
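
For what it's worth, the usual way to keep the multiple code paths manageable
is runtime dispatch: compile several variants and pick one by querying the
CPU once. A minimal sketch using the GCC/Clang builtins (the function names
are made up for illustration):

    #include <stddef.h>

    /* Baseline path: works on any x86-64 CPU. */
    static float sum_scalar(const float *x, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }

    /* Same source compiled with AVX2 enabled; the compiler can
       auto-vectorize it (the float reduction needs -ffast-math). */
    __attribute__((target("avx2")))
    static float sum_avx2(const float *x, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }

    /* Resolved once at startup; callers go through the pointer. */
    static float (*sum)(const float *, size_t);

    static void init_dispatch(void) {
        __builtin_cpu_init();
        sum = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_scalar;
    }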

~~~
vardump
It's worse than that. It's still problematic to even assume SSE4.2, not to
mention AVX2, which would help significantly. Virtualization makes the
problem worse, because hypervisors set CPUID bits to the lowest common
denominator within a vMotion / teleportation domain.

I think (specialized) JITting is the answer to that: generate code on the
fly. I've lately been playing around with JITted [de]serializers and the
like; I think order-of-magnitude gains might be possible in JSON, XML,
msgpack, etc. processing.

Even *printf can be much, much faster. If the format specifier is static, the
first run just generates code for that specifier. Subsequent invocations can
be about 10-20x faster than the usual format-string scanner.

Edit: When I say JIT, I'm not talking about the JVM or any other mainstream
implementation specifically, but about the concept in general - that is,
just-in-time generation of (optimal) machine-specific code.
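
To make the idea concrete without writing an actual code generator, here is
roughly what the specialized routine for a fixed "%d\n" format would boil
down to, written in plain C rather than emitted machine code - no format
scanning, no varargs, just digit conversion and one write (the function name
is hypothetical):

    #include <stdio.h>
    #include <stdint.h>

    /* Specialized equivalent of printf("%d\n", v): no format scan. */
    static void print_i32_newline(int32_t v) {
        char buf[16];
        char *p = buf + sizeof buf;
        uint32_t u = v < 0 ? -(uint32_t)v : (uint32_t)v;
        *--p = '\n';
        do { *--p = (char)('0' + u % 10); u /= 10; } while (u);
        if (v < 0) *--p = '-';
        fwrite(p, 1, (size_t)(buf + sizeof buf - p), stdout);
    }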

~~~
tormeh
JIT and AOT on-device is the future, I agree. How good are the JVMs at taking
advantage of new instructions?

~~~
bhouston
My understanding is that JVMs do not generate any advanced instructions from
standard Java code. The only time those instructions get used is when the JVM
links in hand-coded C++ libraries that make use of them.

Edit: Stackoverflow answer says that in rare cases the JVM can vectorize
simple loops: [http://stackoverflow.com/questions/10784951/do-any-jvms-
jit-...](http://stackoverflow.com/questions/10784951/do-any-jvms-jit-
compilers-generate-code-that-uses-vectorized-floating-point-ins)

~~~
vardump
Just take a look at the code they generate; it's easy to check. In short,
they do emit SIMD instructions, but mostly scalar ones - they very rarely
take advantage of vectorized execution.

~~~
tormeh
Why aren't the processor manufacturers making their own JITs and AOT
compilers? Surely that would be a competitive advantage.

~~~
vardump
Intel makes a famous AOT compiler, ICC. [https://software.intel.com/en-
us/intel-compilers](https://software.intel.com/en-us/intel-compilers)

~~~
tormeh
Yes, but it's for developer-compiled languages. It's not as if I can switch
from an AMD to an Intel processor and suddenly all the C code on my machine
uses the new instructions; compiled code stays the way it is.

That's not what I want. What I want is either JIT or AOT on the end device.

~~~
vardump
Binary (compiled-code) JIT compilation could very well be faster than direct
execution.

I'm not aware of anyone currently using these techniques for running native
code faster. Yet.

You can also call these techniques binary translation if you like.

At the simplest level, one could, for example, dynamically replace relevant
static library calls with higher-performance versions: perform
function-signature-based matching, and when a certain known common function
is found, simply replace it with a faster version.
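
A crude existing analogue of that first idea is dynamic-linker interposition:
build a shared object that defines a known libc function and preload it, and
every call in the target binary resolves to the replacement. A sketch (the
body is a stand-in; imagine the "faster version" there):

    /* fast_memset.c
       Build: gcc -shared -fPIC -o fast_memset.so fast_memset.c
       Run:   LD_PRELOAD=./fast_memset.so ./target_binary */
    #include <stddef.h>

    void *memset(void *s, int c, size_t n) {
        /* Placeholder byte loop; a real replacement would use
           wide vector stores tuned to the running CPU. */
        unsigned char *p = s;
        while (n--) *p++ = (unsigned char)c;
        return s;
    }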

More advanced versions could, for example, inline functions - even those
reached through function pointers, given a guard condition - and vectorize
suitable serial code.

Superoptimization techniques could also be applied to the instruction stream.
[http://en.wikipedia.org/wiki/Superoptimization](http://en.wikipedia.org/wiki/Superoptimization)

Even complex optimizations are possible, like memory-access-pattern
translation. The JIT compiler could perform a cache simulation and find a
remapping of memory accesses that increases true hardware-level locality of
reference, then generate access-transformation code wherever the target
binary touches that memory. For example, it could remap image-processing
accesses from row-major order to a Z-order curve to increase the cache hit
probability. Unless the buffer is memory-mapped (to another process or to
physical hardware), the data would only need to be translated back to
row-major for system calls. A lazy page-fault mechanism could also work in
some cases.
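
For reference, the row-major -> Z-order mapping mentioned above is just bit
interleaving of the x/y coordinates; the classic bit-twiddling version for
16-bit coordinates:

    #include <stdint.h>

    /* Spread the low 16 bits of v so each lands in an even position. */
    static uint32_t spread_bits(uint32_t v) {
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    }

    /* Z-order index: nearby (x, y) pixels get nearby indices, so 2D
       neighborhoods land on the same cache lines far more often. */
    static uint32_t morton2d(uint32_t x, uint32_t y) {
        return spread_bits(x) | (spread_bits(y) << 1);
    }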

------
bhouston
It is hard to get excited about this update if the core count stays the same
(2/4) in the low end chips.

The CPU throughput per core does not change that much per generation now:

2010-era Intel Core i7 980X (6 cores with HT) has a performance rating of
~9000:
[https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7+X+980...](https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7+X+980+%40+3.33GHz&id=866)

2014-era Intel Core i7 5930K (6 cores with HT) has a performance rating of
~13500:
[https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-5930K...](https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-5930K+%40+3.50GHz&id=2336)

That works out to a CPU performance increase of roughly 1,100 points, or
~10%, per year/generation. I guess that is impressive given no clock speed
increase and reduced power consumption, but it doesn't turn into more
powerful results for consumers in terms of raw performance. If Intel does not
increase the core count generally available to consumers, we are basically
stuck at the same level of performance we were at 4 years ago -- this isn't
the case with GPUs or mobile CPUs, which are improving rapidly.
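
The per-year figure is the compounded rate over the 4-year gap between those
two scores; a one-liner to check it:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Compounded annual growth implied by the two benchmark scores. */
        double rate = pow(13500.0 / 9000.0, 1.0 / 4.0) - 1.0;
        printf("~%.1f%% per year\n", rate * 100.0); /* ~10.7% */
        return 0;
    }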

~~~
phaemon
Isn't the 2010-era Intel Core i7 980X about twice the price of the 2014-era
Intel Core i7 5930K?

In other words, aren't you comparing a high end CPU from 2010 to a medium
range CPU from 2014?

~~~
bhouston
The Intel Core i7 5930K is the fastest 6-core Intel Core i7 5xxx CPU, and the
second fastest Intel Core i7 5xxx CPU available overall. To call that
mid-range is not correct.

~~~
phaemon
Fair enough, but the 5960X is still a better comparison, surely?

~~~
bhouston
I was trying to make the point that per-core speed hasn't increased. Both the
980X and the 5930K are top-of-the-line 6-core consumer-grade CPUs.

This ties in with my point that the consumer-grade Skylake CPUs are all 2-
and 4-core parts, which isn't that different from low-end consumer-grade CPUs
from 2010/2011, because per-core performance hasn't increased significantly.

My solution was for Intel to increase the core count of consumer-grade CPUs,
because you can get real performance increases that way - as you showed by
comparing a 6-core CPU with an 8-core CPU.

~~~
Sanddancer
At the high end, per-core performance hasn't increased significantly.
However, at the 2-4 core level, it has. For example, the i3-4360T is 72%
faster than the i3-550, even though the 4360T uses half the power of the 550.
Plus, you need to keep in mind other things that affect the perception of
performance, like the PCIe lanes moving on-die in the most recent generations
of chips. While the high end has plateaued to some extent in recent
generations, Intel has put a significant amount of work into making their
other lines considerably more performant.

~~~
vardump
> At the high end, per-core performance hasn't increased significantly.

When comparing the aforementioned CPUs (Intel Core i7 980X and Intel Core i7
5930K), I'd call 4 -> 16 FLOPs per clock (128-bit SSE mul+add versus two
256-bit FMA units) and 25.6 GB/s -> 68 GB/s memory bandwidth significant. I
think the integer rate has at least quadrupled as well.

Of course, because of high latencies, there's little improvement when running
pointer-chasing or branchy code; that kind of software just can't take
advantage of the hardware's capabilities.

What kind of software are you considering when making that statement?

------
higherpurpose
Just like Broadwell was scheduled for the first half of 2014?

~~~
sandGorgon
I don't know why you are being downvoted; it is a valid question. Broadwell
was delayed by half a year and is still not widely available. MacBooks (which
are a pretty important predictor of demand) are not going to have Broadwell
until 2015 [1], so the question is whether Intel is going to ship Broadwell
at all or skip directly to Skylake.

Plus, Broadwell was delayed because of manufacturing yield issues... so
unless that is cleared up, Skylake will be similarly affected.

[1] [http://www.macrumors.com/2014/07/09/broadwell-early-to-
mid-2...](http://www.macrumors.com/2014/07/09/broadwell-early-to-mid-2015/)

~~~
gsnedders
As I understand it, the issues are fundamental to the 14nm process - and one
might expect that once they can manufacture 14nm Broadwell parts with
reasonable yields, they will be able to manufacture 14nm Skylake parts with
reasonable yields as well. Last I heard, the plan was for Broadwell to be
available for only something like six months.

