
AMD's Bulldozer server benchmarks are here, and they're a catastrophe - evo_9
http://arstechnica.com/business/news/2011/11/bulldozer-server-benchmarks-are-here-and-theyre-a-catastrophe.ars
======
zdw
AMD's problem is that they can't compete in either the low-power,
high-performance market (the 45 W and under laptop chips) or the high-power,
high-performance market (like the high-end Xeons).

So they build massively parallel chips, and try new stuff like the Bulldozer
platform, which creates a bunch of trade offs. It's very similar in concept to
the Sun/Oracle Sparc T-series, with lots of great performing integer cores,
and (at least on the first models) less than stellar floating point.

The best thing they ever did was to acquire ATI, which gave them great
integrated graphics and allowed fairly balanced and inexpensive platforms.
Their long term problem is that the really lucrative parts of the business are
all being taken by Intel.

~~~
bad_user
They did the same thing back in the days of AMD K6 ... fast integer ops, poor
floating point. Then AMD Athlon came, built on the same foundation and Intel
never knew what hit them.

Don't dismiss Bulldozer just yet. AMD has this uncanny ability to make short
term compromises that benefit them in the long term.

~~~
gjm11
I worry that you're saying they have "this uncanny ability" on the basis of
_one_ instance where they've done it. The other possibility would be that they
just got lucky once (helped by Intel's terrible, horrible, no good, very bad
P4 design) and haven't had similar luck before or since.

------
saulrh

      Again, one can't help but feel that a hypothetical 16-core
      Magny-Cours would have been a better option.
    

I think that this quote demonstrates exactly the reason that AMD went forward
with their catastrophe - current architectures can't get much more parallel.
AMD is betting (heavily) that Intel's architecture will max out in the near
future while AMD keeps improving Bulldozer. Should be interesting.

~~~
wmf
I don't see evidence of this. Intel has moved to a more scalable ring bus
while AMD is still using a crossbar (although the Bulldozer module design
should allow each pair of cores to share a crossbar port).

~~~
zdw
The HyperTransport (HT) crossbar offered more options, although people didn't
tend to use it like that. I can only think of a handful of examples (the Sun
X4500/X4540 for example) that took full advantage of the expanded I/O
functionality it offered, and there were even a few FPGA solutions that could
have been used to create really innovative custom logic designs.

The problem was that, outside of totally custom solutions like the above, it
didn't offer enough of a benefit to AMD, especially once Intel jumped on the
memory directly attached to CPU bandwagon.

------
onenine
This article is impressively bad.

While the 8-module chip does share a few things (mainly a vector processing
unit, that becomes two when doing the 128-bit SSE operations) they really can
run 16 threads on 16 ALUs. But, they'll have SSE contention if they schedule
more than 8 256-bit vector operations (sadly Intel won't bring this
instruction set to market for a bit). Bulldozer is pretty cool, but sadly the
tech press decides to shit on the underdog in a market where multiple companies
have successfully sued the monopolist for anti-competitive behavior. :(

~~~
DrPizza
> This article is impressively bad.

Cheers!

> While the 8-module chip does share a few things (mainly a vector processing
> unit, that becomes two when doing the 128-bit SSE operations)

A few things? No, it shares a lot of things. The entire floating point and
SIMD unit. The entire front-end. The branch predictor is, I believe, a weird
hybrid of shared and non-shared. The I-cache, and the L2 cache, also both
shared.

The front-end is particularly troublesome. The entire decoder can either
service one thread or the other. If both threads need instructions, the best
it can do is round robin between them. This averages to allow just two
instructions per cycle: less decode bandwidth than K10.
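The round-robin arithmetic above can be sketched roughly as follows. The decode widths are assumptions added for illustration (a 4-wide decoder shared per module, a 3-wide private decoder per K10 core); the comment itself only states the resulting average of two instructions per cycle. The function name is hypothetical.

```python
# Assumed figures, not stated in the thread: a 4-wide decoder shared by the
# two threads of a module, and a 3-wide private decoder per K10 core.
SHARED_DECODE_WIDTH = 4
K10_DECODE_WIDTH = 3


def decoded_per_thread(cycles, width=SHARED_DECODE_WIDTH, threads=2):
    """Hand the whole decoder to one thread per cycle, alternating threads."""
    decoded = [0] * threads
    for cycle in range(cycles):
        # Only one thread is served in any given cycle.
        decoded[cycle % threads] += width
    return decoded


cycles = 1000
per_thread = decoded_per_thread(cycles)
avg_per_cycle = [count / cycles for count in per_thread]

print(avg_per_cycle)  # average decoded instructions per cycle, per thread
print(all(avg < K10_DECODE_WIDTH for avg in avg_per_cycle))
```

This models only the worst case the comment describes, where both threads always have instructions waiting; a thread running alone would get the full decoder.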

Likewise the integer units: there are fewer ALUs and AGUs per thread than in
K10. Likewise the floating point unit. There's lots of sharing, and even the
private, non-shared parts are resource-starved.

> But, they'll have SSE contention if they schedule more than 8 256-bit vector
> operations (sadly Intel won't bring this instruction set to market for a
> bit).

SSE contention will occur if a thread can issue more than two SSE operations
per cycle, or one AVX operation per cycle.

> Bulldozer is pretty cool, but sadly the tech press decides to shit on the
> underdog in a market where multiple companies have successfully sued the
> monopolist for anti-competitive behavior. :(

I don't care about "the underdog" or which is "cool" or which multi-billion
dollar corporation you might prefer. I care about which works better. It ain't
Bulldozer.

~~~
dman
Benchmarks at
[http://www.phoronix.com/scan.php?page=article&item=amd_f...](http://www.phoronix.com/scan.php?page=article&item=amd_fx8150_bulldozer&num=13)
appear to paint a picture of a much more balanced performance profile for
Bulldozer chips. It does well in threaded applications and where code is
recompiled for it.

I cringed a bit when I saw this on Ars Technica - the linkbaitey headline, the
image of a burning bulldozer, the lack of any benchmarks that you ran yourself
and the fact that data is presented in a lopsided fashion. Here are a few
examples -

a) If you look at the actual prices for the Xeon system and the AMD system, you
can see that the price of the system is entirely dominated by the cost of the
SSD drive. Of the ~1.5 million pre-discount price of the AMD system, nearly
1.2 million is for the SSD, while in the Xeon system 485k of the roughly 740k
price is the SSD. Penalising AMD for that seems unfair. It also remains
unclear what the SSD in the AMD system, at double the cost of the Xeon SSD,
does for performance.

b) In the SPEC JBB2005 section, where the Bulldozer 6200 scores 1.25 million
bops, the 6100 gets 0.981 million, and the Xeon has 0.975 million, you explain
away the high performance by saying that it exists only because of a higher
number of cores.

c) For the SAP section - "the 6200 scores 31,720 SAPS, the 6100 scores 24,020,
and the Xeon gets 28,480. The 6200 system, with 33 percent more processors
than the 6100 system, gets 32 percent more performance." Here's a test that
clearly contradicts your "Bulldozer is abysmal" narrative.

d) In the end you write - "AMD is boasting that Opteron 6200 is the "first and
only" 16-core x86 processor on the market. Not only is this not really true
(equating threads and cores is playing fast and loose with the truth), it just
doesn't matter." - except in the SPEC JBB2005 test, where you yourself said
that "But these results are still cause for some concern. The 6200 part has 33
percent more cores than the 6100 part, as well as a minor clock speed
advantage. Its performance in this CPU-stressing benchmark is only 27 percent
greater than that of the 6100."

e) Next time please run some benchmarks of your own.
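As a rough check of point a), the SSD share of each system price can be worked out from the approximate figures quoted there. This is only a sketch: the numbers are the rounded ones from the comment, and the dictionary layout and function names are hypothetical.

```python
# Approximate pre-discount figures as quoted in point a) above.
systems = {
    "AMD": {"total": 1_500_000, "ssd": 1_200_000},
    "Xeon": {"total": 740_000, "ssd": 485_000},
}


def ssd_share(system):
    """Fraction of the system price that goes to the SSD."""
    return system["ssd"] / system["total"]


def non_ssd_cost(system):
    """What is left of the system price once the SSD is taken out."""
    return system["total"] - system["ssd"]


for name, system in systems.items():
    print(f"{name}: SSD share {ssd_share(system):.0%}, "
          f"non-SSD cost {non_ssd_cost(system):,}")
```

On these rounded figures the SSD is about 80% of the AMD system price and about 66% of the Xeon system price, so the two systems are much closer in cost once storage is excluded.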

~~~
DrPizza
The Phoronix benchmarks, like most others, suggest that the only area where
Bulldozer appears at all competent is HPC. To describe this as niche is an
understatement.

a) I agree it remains unclear how much difference the SSD makes. That's why I
don't think it's a useful demonstration of Bulldozer's performance _even
though AMD is citing it as such_.

b) Yes, I do. That 1/3 more cores gives 1/3 more performance in a test that
scales almost perfectly means that the per-core performance has stood still.
A 32 nm K10.5 chip with 1/3 more cores would perform just as well, cost less
to build, use less power, and eliminate the performance regressions. So what
is the point of Bulldozer?

c) No, it reinforces the "Bulldozer performs no better than a scaled up K10.5
system would and hence is pointless" narrative.

d) @_@

e) No. I don't have a half million dollars of equipment just lying around so
that I can run TPC-C (etc.) myself.
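The per-core argument in this exchange can be checked against the scores quoted earlier in the thread. A minimal sketch, assuming 16 cores for the 6200 and 12 for the 6100 (the thread only says "16-core" and "33 percent more cores"), and ignoring the minor clock speed difference; the function and variable names are hypothetical.

```python
# Core counts: 16 is quoted in the thread, 12 is inferred from "33 percent more".
CORES_6200 = 16
CORES_6100 = 12
core_ratio = CORES_6200 / CORES_6100

# (6200 score, 6100 score) as quoted in the thread.
scores = {
    "SPEC JBB2005 (bops)": (1_250_000, 981_000),
    "SAP (SAPS)": (31_720, 24_020),
}


def scaling(new_score, old_score):
    """Return (overall speedup, speedup relative to the increase in cores)."""
    speedup = new_score / old_score
    return speedup, speedup / core_ratio


for name, (new_score, old_score) in scores.items():
    speedup, per_core = scaling(new_score, old_score)
    print(f"{name}: {100 * (speedup - 1):.0f}% more throughput, "
          f"per-core ratio {per_core:.2f}")
```

A per-core ratio near 1.0 means throughput grew only in line with the core count, which is the "per-core performance has stood still" reading; a ratio below 1.0 means each core delivered slightly less than before.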

~~~
dman
My reading of the Phoronix article suggests that Bulldozer does fairly well on
the following tests. a) ffmpeg encoding b) parallel io c) x264 encoding d)
compression e) mp3 encoding f) c-ray rendering g) smallpt

I will concede that I know virtually nothing of which workloads are
representative of what percentage of the market.

~~~
DrPizza
I don't think most server systems are doing much in the way of MP3 or H.264
encoding.

Rendering is more or less equivalent to HPC. Different markets, but similar
problem sets (lots of computation, minimal communication or dependencies
between threads).

None of those are particularly relevant to typical server workloads; servers
are doing things like querying databases, spitting out Web pages, running Java
VMs, running virtualization software, that kind of thing.

------
jsnell
This doesn't exactly come as a surprise. It was already obvious from the
desktop parts that something had gone horribly wrong. There just weren't any
bright spots there.

Worse single-threaded performance despite much larger cores (even counting
each "module" as two "cores" rather than two "threads"). No improvement in
clock speeds despite a newer design and deeper pipelines. No improvement in
multi-threaded workloads core-for-core. And it seems hard to give credit for
the improved multi-threaded performance per-socket, given the huge increase in
both transistor count and die size.

It would really be interesting to know what actually went wrong. The new
process? The rumored switch to automated design methodology? Unfortunately the
answer to the question posed by the article seems obvious... They had to
persevere with this and ship it, since they didn't have the resources for a
plan B.

------
iFire
Who here wants to play with 64 cores?

That's 4 sockets of Opteron 6272, which is around 540 dollars each on Newegg.

~~~
Jach
How about 144? ;P <http://www.greenarraychips.com/home/products/index.html>

~~~
iFire
DB003 Evaluation Board Reference for EVB001 seems to have low interconnect
bandwidth.

I don't see anything like HyperTransport.

