
Buckle Up, Intel Preps 8-Core Nehalem-EX Chips for March Launch - Flemlord
http://hothardware.com/News/Buckle-Up-Intel-Preps-8Core-NehalemEX-Chips-for-March-Launch/
======
sketerpot
I remember when 24 MB was a respectable size for a hard drive. Now we've got
CPU caches that big. It's downright ridiculous.

~~~
Andys
CPU caches will soon be big enough, and flash fast enough, that there won't be
any system RAM anymore. Your programs will basically run directly out of
flash, and be made performant thanks to a huge CPU cache.

~~~
djcapelis
That's just not going to end up true for a general system. For some embedded
systems that's a possibility, but the type of write wear system RAM gets is
very different than the type of write wear hard drives get. A few people had
visions of this, but any engineer who's run the numbers and looked at the
trends in flash technology knows it's extremely unlikely that flash will
replace RAM before it eats dust because of minimum theoretical densities.

STTRAM[1] might be capable of this when it comes online and in production, but
it seems flash is unlikely to reach this point and STTRAM is still in
development.

[1] <http://en.wikipedia.org/wiki/Spin_torque_transfer>

------
ggruschow
I should start publishing this news, since my source seems faster and more
reliable, though a little hard to understand. I get an email at least a few
hours before the announcement that reads "Your new absolute-top-of-the-line-
hardware has shipped. Tracking number YC23JK."

While I don't think it's technically illegal, I do find it strange that the
manufacturers are allowed to collude to ensure I personally overspend by at
least 20% and never have top-of-the-line hardware for more than a month. I've
come to accept that I'll never know how they picked me, but I'd still really
like to know how they get my email passwords.

I think it'll work the other way too: A cheap way for hardware manufacturers
to make their deadline is to ship me their current generation a couple weeks
before they need to ship the next generation junk.

~~~
gjm11
The canonical solution to this problem is not to buy absolute top-of-the-line
hardware; it's seldom good value for money.

(Yeah, I know, sometimes you just gotta have the best. But still.)

~~~
hga
I've generally found myself satisfied by looking at a price-performance graph
and buying the processor just before the knee. Sure, I'd like more CPU, but up
to now more memory and disk have been the better tradeoffs.

Especially memory: there are some things that just aren't practical when your
working set gets too large (e.g. you start constantly paging), whereas a
somewhat slower CPU just requires a bit of patience for what I do (which isn't
number crunching or anything else that keeps my CPU pegged).

All that said, the next major machine I build should be in 2011-12 and I'm
really looking forward to having a bunch of cores with more cache than, say, the
16MB of RAM I bought in 1991 for my 486 Windows machine.

As it is now, wouldn't it be hard to buy a "normal" CPU that doesn't have at
least as much cache as the address space of a PDP-10 (18 bits of 36-bit
words, or a megabyte of 9-bit bytes)? And cache has become the new RAM, RAM
the new disk, and disk the new tape.
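
For anyone who wants to check the PDP-10 arithmetic above, here is a quick sketch in Python. The function and constant names are made up for illustration; the only inputs are the 18-bit address and 36-bit word sizes quoted in the comment.

```python
ADDRESS_BITS = 18   # PDP-10 address width, as quoted above
WORD_BITS = 36      # PDP-10 word size, as quoted above

def address_space(address_bits, word_bits):
    """Return (words, total bits) for a word-addressed machine."""
    words = 2 ** address_bits
    return words, words * word_bits

words, total_bits = address_space(ADDRESS_BITS, WORD_BITS)
nine_bit_bytes = total_bits // 9    # four 9-bit bytes per 36-bit word
eight_bit_bytes = total_bits // 8   # same storage in modern 8-bit bytes

print(words, nine_bit_bytes, eight_bit_bytes)
```

That works out to 262,144 words, which is exactly 2^20 nine-bit bytes, or a bit over a megabyte (1,179,648) when counted in 8-bit bytes.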

Wild times for someone who started out with punched card FORTRAN on an IBM
1130 (64KB max, more likely a lot less, 1-10 MB disk, probably closer to the
former).

------
djcapelis
Neat, but wow look at the bus on that thing! Is Intel still using a basic
crossbar for an interconnect? When are they going to roll out their new
interconnect technologies? (They clearly must be working on them... they know
just as well as the rest of us do that the on-chip network they have now isn't
going to scale.)

These appear to be still built on the old 45nm process. It should be
interesting to see what the 32nm Westmere shrink looks like. (The ones out so
far are mostly only dual cores. They plan to do the rest of the line later in
the year...)

~~~
wmf
Actually, this is the first Intel chip with a ring bus.

~~~
djcapelis
Ah, interesting. I assume there's multiple rings though, one can't really be
satisfied with a 1x-4x difference in contacting other cores? (Assuming
bidirectional.)

Also rings still don't scale, so they're really going to have to get more
clever. I wonder what the perf hit for the ring is going to be.
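
The 1x-4x figure above can be checked with a small sketch. It assumes a single bidirectional ring with 8 stops, one per core, and counts hops only; the real chip's stop count, extra stops for cache or memory controllers, and per-hop timing are not given in the thread, so treat the names and numbers here as hypothetical.

```python
def ring_hops(src, dst, stops):
    """Shortest hop count between two stops on a bidirectional ring."""
    d = abs(src - dst) % stops
    return min(d, stops - d)  # go whichever way around is shorter

STOPS = 8  # assumption: one ring stop per core, nothing else on the ring

# Hop counts from core 0 to every other core.
hops_from_core0 = [ring_hops(0, dst, STOPS) for dst in range(1, STOPS)]
best = min(hops_from_core0)
worst = max(hops_from_core0)
average = sum(hops_from_core0) / len(hops_from_core0)

print(hops_from_core0, best, worst, round(average, 2))
```

Under those assumptions the nearest neighbour is 1 hop away and the farthest core is 4 hops away, which is where the 1x-4x spread comes from. The worst case grows as stops // 2, which is the scaling worry.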

~~~
Andys
[http://www.theregister.co.uk/2009/11/16/sgi_altix_uv_preview...](http://www.theregister.co.uk/2009/11/16/sgi_altix_uv_preview/)

~~~
djcapelis
That's an off-chip interconnect, we're talking about on-chip interconnects. 8
cores is probably pushing the limits of Intel's current on-chip network.

One option is to move closer (but probably not actually adopt) something like
a scalar operand network used on some of the tile architectures.

~~~
sparky
It depends on the data sharing and memory access patterns you want to support.
For instance, the 16-32-core Larrabee has a ring bus. So do GPUs with many,
many cores (<http://en.wikipedia.org/wiki/Radeon_R600#Memory_controllers>).
What rings give up in terms of nonuniform point-to-point latency and
diversity, they partially make up for in physical design simplicity. They are
easily routed and verified, giving the circuit guys more time to go nuts on it
and crank up the frequency more than you might be able to on a mesh, torus, or
other more complex topology (
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.1909&rep=rep1&type=pdf)
). You're quite right that they're not infinitely scalable though; Intel came
out with two chips at ISSCC using 2D meshes (
[http://hothardware.com/News/Intel-Unveils-48Core-SingleChip-...](http://hothardware.com/News/Intel-Unveils-48Core-SingleChip-Cloud-Computer/)
<http://www.theregister.co.uk/2010/01/28/isscc_chip_preview/>
) for the 48-64-core regime. They also used a mesh for their Polaris test chip
a couple years ago (
[http://www.eetimes.com/news/latest/showArticle.jhtml?article...](http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=197004697)
). Beyond that, there is a ton of published work (network-on-chip, or NoC, is
the keyword that will help you find it) but not too many chips to prove or disprove
their ideas.

For what it's worth, most of the many-core chips that do exist keep it fairly
simple. Ambric's MPPA had a mesh with configurable routing
(<http://www.nethra.us.com/technologies_mppa.php>), GPUs have a combination of
pretty straightforward hierarchical interconnect and rings, and Azul's
interconnect is a tree-like structure as well.

One thing to consider is that, while more complex topologies can buy you
something in a supercomputer, give you more bisection bandwidth, and better
support point to point communication, it has turned out to be pretty hard to
actually write correct programs in that style (lots of arbitrary peer-to-peer
communication); this partially motivates simpler interconnects, where the
programmer can more easily reason about what's going on. Another concern is
that topologies that work great in a server room (3D) are nigh-unroutable on a
chip (2D). The wraparound links in a torus are a great example of this; Blue
Gene/L used a torus to great effect
([http://www.google.com/url?sa=t&source=web&ct=res&...](http://www.google.com/url?sa=t&source=web&ct=res&cd=3&ved=0CBUQFjAC&url=https%3A%2F%2Fasc.llnl.gov%2Fcomputing_resources%2Fbluegenel%2Ftalks%2Fheidelberger.pdf&ei=p22WS9ysApTSNbyhqKgP&usg=AFQjCNGH-zFZu37JKAIQCY4KBdYrA-xabA&sig2=at2kAwIukcbbHFC2B3MjPA)), and the wraparound
links drastically reduce worst-case and average point to point latency over a
mesh, but those links mean giving up a large fraction of a metal layer on a
chip, as opposed to a long cable in a machine room.
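
To put rough numbers on the mesh-versus-torus latency point, here is a brute-force sketch. It assumes a hypothetical 8x8 grid of nodes, counts every hop as equal cost, and ignores contention and wire length, so it only illustrates the hop-count argument and says nothing about any real chip.

```python
from itertools import product

def hop_distance(a, b, k, wrap):
    """Hops between grid coordinates a and b on a k x k mesh (wrap=False)
    or torus (wrap=True), using shortest per-dimension routing."""
    total = 0
    for x, y in zip(a, b):
        d = abs(x - y)
        if wrap:
            d = min(d, k - d)  # wraparound link may be the shorter way
        total += d
    return total

def latency_stats(k, wrap):
    """Return (worst-case hops, average hops) over all distinct node pairs."""
    nodes = list(product(range(k), repeat=2))
    dists = [hop_distance(a, b, k, wrap)
             for a in nodes for b in nodes if a != b]
    return max(dists), sum(dists) / len(dists)

K = 8  # assumption: 64 nodes arranged 8 x 8
mesh_worst, mesh_avg = latency_stats(K, wrap=False)
torus_worst, torus_avg = latency_stats(K, wrap=True)
print("mesh :", mesh_worst, round(mesh_avg, 2))
print("torus:", torus_worst, round(torus_avg, 2))
```

With these assumptions the worst case drops from 14 hops on the mesh to 8 on the torus, and the average drops too. What the hop count leaves out is exactly the cost described above: each wraparound link has to span the whole die.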

Also, I would hazard a guess that the workloads these chips are intended for
are mostly multiprogrammed, not multithreaded; if they don't share much data,
the network's bisection bandwidth is not as much of an issue as the off-chip
bandwidth.
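
For a feel of how bisection bandwidth differs between the topologies mentioned in this thread, the sketch below builds each one as a set of links and counts how many cross a cut that splits the nodes into two equal halves. The 64-node size, the helper names, and the choice of cut are all assumptions for illustration; it counts links, not actual bandwidth.

```python
from itertools import product

def ring_edges(n):
    """Links of a bidirectional ring with n stops."""
    return {frozenset((i, (i + 1) % n)) for i in range(n)}

def grid_edges(k, wrap):
    """Links of a k x k mesh (wrap=False) or torus (wrap=True)."""
    edges = set()
    for r, c in product(range(k), repeat=2):
        for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
            nr, nc = r + dr, c + dc
            if wrap:
                nr, nc = nr % k, nc % k
            elif nr >= k or nc >= k:
                continue  # a mesh has no link past the edge
            edges.add(frozenset(((r, c), (nr, nc))))
    return edges

def crossing_links(edges, in_first_half):
    """Count links whose two endpoints fall on opposite sides of the cut."""
    return sum(1 for e in edges if len({in_first_half(v) for v in e}) == 2)

N, K = 64, 8  # assumption: 64 nodes, as a ring or as an 8 x 8 grid
ring_cut = crossing_links(ring_edges(N), lambda v: v < N // 2)
mesh_cut = crossing_links(grid_edges(K, wrap=False), lambda v: v[1] < K // 2)
torus_cut = crossing_links(grid_edges(K, wrap=True), lambda v: v[1] < K // 2)
print(ring_cut, mesh_cut, torus_cut)
```

For this cut the ring has 2 crossing links however many stops it has, the mesh has 8, and the torus has 16. If the workloads really are mostly multiprogrammed with little sharing, that gap matters less than the off-chip bandwidth.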

Regarding scalar operand networks as found in Raw/Tilera64: The specific idea
(register-mapped networks) probably changes the programming model too much for
Intel to adopt directly, but the Single-chip Cloud Computer (SCC) linked
above, presented at ISSCC 2010, uses something sort of similar for message
passing over their 2D mesh. The communication channels are memory-mapped
rather than register-mapped, but the mechanism is similar.

~~~
djcapelis
Oh sure, you could do a ring bus and it'll scale for workloads that aren't the
ones Intel targets for their CPUs. (General purpose, not vector, not SIMD,
just straight-up random computations requesting all kinds of data from all
kinds of places.)

I think they'll drop the rings soon. But it will keep them going until they
figure out how to solve their interconnect problems.

I agree with you that the SON as on RAW/Tilera is unlikely to make the leap to
Intel. I just hope they'll move in that direction. Though clearly they
may also choose to move in a completely different direction, they will need
a more sensible strategy than they've got now, and I really doubt it's going to
be rings for the kind of straight-up, make-no-assumptions-about-your-workload
general computing Intel must be good at.

We don't agree that on-chip networks aren't important though. While right now
no one's got a good programming model for these things, we're going to need
one and there's likely going to be some data sharing involved, which means a
good on-chip network is going to be a hell of a thing. Also I simply don't
envision a cache architecture that makes sense that doesn't have a lot of
unfortunate on-chip communication, and that needs to not be annoyingly NUMA.
(Though it may have to be to some degree...)

I guess we'll find out. :)

And as an aside, thanks for taking the time to provide one of the more
informative and responsive posts attached to this thread. I think HN could use
more architecture folks.

------
patrickgzill
The economics of virtualization certainly point to this being a hit. A 2U
chassis with 64GB of RAM or so could easily replace half a rack of slower
systems (add some flash drives to speed up IO if needed).

------
lallysingh
I don't have a use for a car, but a quad-socket machine may be pretty fun in
terms of those "let the computer search for answers over the weekend" type
problems.....

~~~
jacquesm
If you do that just on the weekend that's an excellent use case for renting a
bunch of boxes in the cloud. In fact, it's one of the few use cases for which
the numbers really work out well.

------
MikeCapone
I'd love to have a few of these babies to crunch for Rosetta@home.

