
Intel Launches Next Gen Itanium Monster Processor - MojoKid
http://hothardware.com/Reviews/Intel-Previews-RecordBreaking-32nm-Itanium-Poulson-Processor/
======
aidenn0
1) The article says "thread level parallelism" when they mean "instruction
level parallelism"

2) There is way more ILP available at run-time than at compile time, and the
ILP that is visible to both is much more tractable at run-time. An
out-of-order CPU is constantly filling a buffer with instructions (or
micro-ops), and hardware determines the dependencies dynamically to issue
them to the ALUs. This is a much more tractable problem than trying to guess
the control flow at compile time. A large enough instruction window can
overcome a really dumb compiler.

3) Requiring software to be aware of the details of your hardware
implementation is a really tempting idea, but historically it has fared much
worse than the opposite. Consider that a modern x86 is nothing like an 80386,
yet it runs the same software, often at a higher IPC than the original 386.
Now compare MIPS, whose branch delay slot often just gets filled with a NOP.
Furthermore, on modern MIPS cores, which have longer pipelines and branch
prediction, that slot is more or less useless!

4) Assuming IA64 doesn't die out, by the time compiler writers figure out how
to make code that runs fast on an Itanium of today, Intel will be performing
hardware gymnastics to make that code run fast on the hardware of tomorrow.
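The dynamic dependency tracking in point 2 can be sketched in a few lines. This is a toy model under invented names, not any real microarchitecture:

```python
# Toy model of dynamic ILP extraction: hardware fills a window with
# instructions and issues each one as soon as the registers it reads
# are ready. Register renaming is assumed, so only true
# (read-after-write) dependencies matter.

def schedule(instrs):
    """instrs: list of (dest, srcs) pairs in program order.
    Returns a list of issue groups; each group can run in parallel."""
    ready_in = {}                      # register -> group that produces it
    groups = []
    for dest, srcs in instrs:
        # Issue one step after the latest producer of our inputs.
        group = max((ready_in.get(s, -1) for s in srcs), default=-1) + 1
        while len(groups) <= group:
            groups.append([])
        groups[group].append(dest)
        ready_in[dest] = group
    return groups

program = [
    ("r1", ["r0"]),        # r1 = f(r0)
    ("r2", ["r0"]),        # independent of r1: same issue group
    ("r3", ["r1", "r2"]),  # must wait for both r1 and r2
]
```

A compiler has to make the same grouping decision statically, without knowing which branches will actually be taken at run-time.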

~~~
sb
Regarding point 2: Last time I checked (IIRC 2006-ish), there seemed to be a
common sentiment among researchers working on programming language
implementation that there is just too little ILP for successful widespread
VLIW adoption (modulo some special use cases).

AFAI(K|R), Hennessy and Patterson's canonical text (CA-AQA [1]) reflects
this: going from the 3rd to the 4th edition, we find a new chapter, "Limits
on ILP", and the VLIW/EPIC material has been moved from the main text to the
CD-ROM, too (which probably is not a good indicator, though: the 3rd edition
was just too heavy to carry around a lot ;)

[1]: [http://www.amazon.com/Computer-Architecture-Quantitative-App...](http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901)

~~~
DarkShikari
There _is_ plenty of ILP for VLIW in many real-world cases -- but the most
common case is that where all the instructions in the VLIW are identical. This
of course reduces to SIMD, which makes the VLIW unnecessary.

------
kenjackson
The IDC prediction chart is a thing of beauty. I'd love to have a webpage just
for seeing data like this (predictions vs reality). Must be one of the best
charts I've seen in a while.

~~~
JoeAltmaier
So, is THIS Itanium going to sell? Intel sure has a boatload of patience.

------
psykotic
When the Itanium first launched, we got some machines from Intel so we could
port our software. After excitedly unpacking one machine, we plugged it into
the office power outlet and flipped the switch. Suddenly all the lights on our
floor went out. Turns out you were supposed to feed it only from a
data-center-strength power grid.

~~~
kenjackson
The nice thing, though, is that once it's installed, you save money because
you can turn off the heat in your building.

------
joshu
I didn't realize Intel still made this stuff.

~~~
Andys
You can thank ongoing long-term enterprise and government contracts for that,
to the tune of over a billion a year.

------
rbanffy
What current OS options exist for Itanium processors? I would count HP-UX,
Linux, and, of course, NetBSD. Microsoft has already stated that Windows
Server 2008 R2 will be the last OS they make for IA64.

That said, looks like an impressive processor.

------
jey
What's the main customer/application/market for Itanic, er, Itanium?

~~~
sfk
Not sure if it is the main market, but you can still run OpenVMS on HP Itanium
servers:

<http://h71000.www7.hp.com/index.html?jumpid=/go/openvms>

~~~
yuhong
Yep, HP customers are the main market for Itanium nowadays, running HP-UX or
OpenVMS.

------
VladRussian
The monster is back.

>Itanium relies on the compiler to optimize code at run-time

That sums it up for me :)

But seriously: in cases where the compiler is able to parallelize, an NVIDIA
GPU seems to be a better target: cheaper, more accessible, and more
performant.

~~~
scott_s
Not all parallelism is the same. In this article, parallelism usually means
instruction level parallelism
(<http://en.wikipedia.org/wiki/Instruction_level_parallelism>). GPUs are
fantastic at data parallelism
(<http://en.wikipedia.org/wiki/Data_parallelism>). Being able to exploit one
says nothing about the other.

Where Itanium differs from other processor architectures is in how it handles
instruction level parallelism. The processor in the computer in front of you
probably uses out-of-order execution
(<http://en.wikipedia.org/wiki/Out-of-order_execution>) to exploit ILP. This
happens on the fly, as a program executes. Itanium depends on the compiler to
determine where the ILP is.
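That difference can be sketched concretely: an EPIC/VLIW-style machine executes instruction bundles the compiler already proved independent, and does no dependency analysis of its own. The bundle format and register names below are invented for illustration:

```python
# Sketch of compiler-scheduled ILP: the "hardware" blindly runs each
# bundle's slots in parallel (simulated by reading all inputs before
# committing any writes), trusting that the compiler bundled correctly.

regs = {"r0": 5, "r1": 0, "r2": 0, "r3": 0}

def execute(bundles):
    for bundle in bundles:
        # Evaluate every slot against the same register state...
        results = [(dest, fn(*[regs[s] for s in srcs]))
                   for dest, fn, srcs in bundle]
        # ...then commit all of the bundle's writes at once.
        for dest, value in results:
            regs[dest] = value

# The "compiler" proved these two ops independent at build time, so it
# placed them in one bundle; the dependent op goes in the next bundle.
bundles = [
    [("r1", lambda x: x + 1, ["r0"]),
     ("r2", lambda x: x * 2, ["r0"])],
    [("r3", lambda a, b: a + b, ["r1", "r2"])],
]
execute(bundles)   # r3 = (r0 + 1) + (r0 * 2)
```

If the compiler guesses wrong about what is independent, this hardware has no reorder machinery to fall back on.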

~~~
VladRussian
>Being able to exploit one says nothing about the other.

Data parallelism is a special case of instruction level parallelism: N
instances of the same instruction run on different pieces of data. That's a
very frequent case in the high performance computing and enterprise data
crunching tasks supposedly targeted by the Itanic.

~~~
scott_s
What you described is not what we mean by the term "instruction level
parallelism." Yes, it is parallelism that involves instructions, but it is not
ILP. Instruction level parallelism (ILP), data level parallelism (DLP) and
task (sometimes thread) level parallelism (TLP) are all orthogonal. I can have
data level parallelism that is _not_ at the instruction level.

~~~
VladRussian
So, according to you, the data level parallelism at the instruction level I
described isn't a case of instruction level parallelism?

<http://en.wikipedia.org/wiki/Instruction-level_parallelism>

and returning to the original specific context of Itanium vs. GPU :

[http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter35...](http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter35.html)

Take your stab at what can be classified as what :)

~~~
scott_s
_> so, according to you, the data level parallelism at the instruction level i
described isn't a case of instruction level parallelism?_

Correct. The GPU article on NVIDIA's website uses ILP incorrectly. They are
describing SIMD operations - single instruction, multiple data - which is data
parallelism at the instruction level. This is inherently different from ILP,
which is when you extract parallelism from a sequential stream of instructions
by executing them out-of-order. Of course, it's possible to exploit ILP on a
stream of SIMD instructions.

~~~
VladRussian
>The GPU article on NVIDIA's website uses ILP incorrectly.

That depends on the implementation. Let's say there is a program

f3=OP1(f1)

f4=OP1(f2)

OP2(f3)

OP2(f4)

Some possible ILP forms:

OP1(f1)OP1(f2)

OP2(f3)OP2(f4)

or

OP1(f1)

OP1(f2)OP2(f3)

OP2(f4)

A DLP form:

OP1(f1, f2)

OP2(f3, f4)

or maybe their DLP is implemented as

OP1(f1)OP1(f2)

OP2(f3)OP2(f4)

and if that is the case, I don't see why they can't call it ILP.

~~~
scott_s
The concepts are _orthogonal_. Which means you can apply both at the same
time. If you have a stream of SIMD instructions - which is data parallelism -
and you can determine that you can execute some of them in parallel, then you
are able to extract instruction level parallelism out of a stream of data
parallel instructions.

ILP means something very specific. This is a discussion of semantics, but
semantics are important so that we can communicate easily. If I stick a
Hershey's bar in the oven, it is literally hot chocolate, but it is not what
we normally mean when we say "hot chocolate." You are talking about
parallelism at the instruction level. I'm trying to explain that that is not
what people mean when they say "instruction level parallelism."

~~~
VladRussian
>You are talking about parallelism at the instruction level. I'm trying to
explain that that is not what people mean when they say "instruction level
parallelism."

OK, just point out which of the two ILP executions I mentioned above isn't
really ILP, and which of the two DLP executions isn't really DLP.

~~~
scott_s
I'm going to have to make assumptions on your semantics. I assume that OP(x)
means that instruction OP uses data element x. Further, I assume that
instructions on the same line are executed in parallel and before instructions
on the next line.

 _OP1(f1)OP1(f2)

OP2(f3)OP2(f4)_

Instruction level parallelism because OP1(f1) executes in parallel with
OP1(f2) and OP2(f3) executes in parallel with OP2(f4).

 _OP1(f1)

OP1(f2)OP2(f3)

OP2(f4)_

Instruction level parallelism because OP1(f2) executes in parallel with
OP2(f3). (But, the schedule isn't as good as the first one.)

 _OP1(f1, f2)

OP2(f3, f4)_

Data level parallelism because OP1 is applied to both f1 and f2. While this
has the same _result_ as the first example, a processor would achieve both in
different ways. In the first example, the processor would have to fetch two
instructions. It just so happens that both of those instructions are OP1. Then
it would have to schedule both of those instructions, and it was luckily able
to schedule them both at the same time.

The third example is different. In this case, the processor would fetch _one_
instruction, but execute it on both f1 and f2 at the same time. That's why
it's called SIMD: single instruction, multiple data. One instruction executes,
but it modifies multiple data elements. In the first case, you had to fetch
and execute an instruction for each data element.

Why bother distinguishing between them? Because this may also be possible:

OP1(f1, f2)OP2(f3, f4)

That is both data level parallelism _and_ instruction level parallelism.
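The distinction in this subthread can be made concrete in a few lines of Python, with `op1`/`op2` and the `simd` helper as invented stand-ins for machine instructions:

```python
# Two ways to compute the same results, mirroring the examples above.

def op1(x): return x + 1
def op2(x): return x * 2

f1, f2 = 10, 20

# ILP form: two *separate* op1 instructions. A dynamic scheduler may
# issue them side by side because they touch different registers, but
# it still fetches and schedules two instructions.
f3 = op1(f1)
f4 = op1(f2)

# DLP (SIMD) form: *one* instruction applied to a vector of operands.
def simd(op, operands):
    return [op(x) for x in operands]   # one fetch, many lanes

f3_v, f4_v = simd(op1, [f1, f2])

assert (f3, f4) == (f3_v, f4_v)   # same result, different mechanism
```

And the combined case from the last example, data level parallelism plus instruction level parallelism, would correspond to issuing `simd(op1, ...)` and `simd(op2, ...)` in the same cycle.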

------
jacques_chester
That's a massive transistor budget.

You could fit ~440 MIPS R10k processors on that thing.

------
anonymous246
Last sentence of the article has the phrase "sufficiently intelligent
compilers". :) Intentional in-joke? Wikipedia's definition: "Sufficiently
Smart Compiler, any of a family of theoretically possible compilers able to
perform sophisticated but unrealistic code optimizations"

> Given sufficiently intelligent compilers, Itanium could begin to make
> economic sense in fields that couldn't previously justify the high cost of
> optimizing for the chip.

<https://duckduckgo.com/?q=%22sufficiently+smart+compiler%22>

~~~
kd0amg
I would probably read it differently depending on the background of the person
who wrote it (mostly based on how aware I'd expect the writer to be of the
problems involved). I've seen people with a basic understanding of compilers
stumble over it and not know why others in the room chuckled. In this
particular case, I don't know enough about the author to say, but EPIC/VLIW
architectures are kind of known for making things difficult for the compiler
(meaning the joke would be very appropriate here).

~~~
jacques_chester
The entire bet for EPIC was that a sufficiently smart compiler would let you
free up die space for processing transistors by ditching branch prediction,
prefetch logic, speculative execution, caches, etc.

As you point out, the SSC has yet to appear. Just look at that layout: it's
_dominated_ by cache.

~~~
scott_s
I think even in the ideal SSC case, you'd still want as much cache as you can
get. Even if the compiler can insert perfect prefetching instructions, the
prefetched data has to go _somewhere_. And the more somewhere you have, the
more aggressively you can prefetch.

I think the main benefit would be much simpler instruction pipelines, which
would include the points you mentioned (branch prediction, prefetch) but also
all of the logic needed to keep track of dependencies in an out-of-order
processor.

~~~
jacques_chester
Absolutely. I studied the Itanium design philosophy back in 2000 and this is
exactly what they were aiming to do: drop all the complex logic devoted to
keeping the pipelines full and all the units busy.

True about data, though I vaguely recall EPIC had advantages there too:
without needing to do branch prediction, you didn't need to speculatively
fetch from multiple memory addresses, meaning the same D-cache went further.

