
Cell (microprocessor) - dmmalam
https://en.wikipedia.org/wiki/Cell_(microprocessor)
======
scott_s
I got a PhD dissertation out of how difficult it was to use the Cell
processor: "Shared Memory Abstractions for Heterogeneous Multicore
Processors", [http://www.scott-
a-s.com/files/scott_dissertation.pdf](http://www.scott-
a-s.com/files/scott_dissertation.pdf). Code for the project here:
[https://github.com/scotts/cellgen/](https://github.com/scotts/cellgen/)

Cell made a big splash in the high-performance computing community because it
gave software the kind of fine-grained control that people with a detailed
knowledge of processor architecture and their own algorithms could exploit.
But the way that kind of control was exposed made it all or nothing: you
either optimized your code to use the SPEs correctly and efficiently, or you
didn't use the SPEs at all.

I had the opportunity to talk with one of the main processor architects a
while back, and I asked why they did not include any hardware-level caching
for the SPEs at all - which I think would have allowed people to get that kind
of incremental performance improvement. His answer was that they wanted to
devote all of the chip's real estate and transistors to the fine-grained
control approach. Fair enough, but I think it would have been a lot more
usable otherwise.

I think that Roadrunner
([https://en.wikipedia.org/wiki/IBM_Roadrunner](https://en.wikipedia.org/wiki/IBM_Roadrunner))
was the only major deployment of the Cell processor, outside of the
PlayStation 3. I always wanted to talk to videogame developers who programmed
the PS3 to ask them what techniques and patterns they used for the Cell. I
suspect that for some games, they ended up running much of the main game logic
on the PPE, didn't bother using the SPEs, and used the GPU for all
acceleration.

Disclaimer: I actually work for IBM Research now, but not on anything related
to Cell.

~~~
bri3d
There's a bit of public information about PS3 game development that's trickled
out over the years. A couple of game developers have released slides about how
they used SPU, for example in Killzone 2: [http://twvideo01.ubm-
us.net/o1/vault/gdc09/slides/GDC2009-vd...](http://twvideo01.ubm-
us.net/o1/vault/gdc09/slides/GDC2009-vdLeeuw-KZ2SPUsCaseStudy.pdf) .

I think a lot of PS3 games used middleware to leverage the SPEs rather than
writing the code themselves. The PS3 SDK shipped with an SPE task scheduler
called SPURS and a set of libraries called "Edge" that offloaded a lot of
common tasks onto SPEs. Probably the biggest Edge component was Edge Animation
/ Edge Geometry, which basically re-implemented a mesh skinning / geometry
pipeline on SPEs instead of the RSX:
[http://www.jonolick.com/uploads/7/9/2/1/7921194/gdc_edge_07_...](http://www.jonolick.com/uploads/7/9/2/1/7921194/gdc_edge_07_final.pdf)

~~~
angersock
Insomniac R&D has a great handful of presentations on the crackheadedness of
programming the Cell.

One of the most annoying things that we ran into was that it was basically
impossible to buy a system to learn Cell programming on--you either had to
somehow root a PS3 (which still didn't let you get all cores), or you had to
buy something from Mercury Computer Systems (I think it was?) that was super
expensive.

It was a niche embedded processor, and hobbled because they didn't make it
easier to develop for. :(

~~~
zurn
This was before launch I guess? The consumer PS3 did support Linux in the
early days, Sony still has a page about it:
[https://www.playstation.com/ps3-openplatform/index.html](https://www.playstation.com/ps3-openplatform/index.html)

Then they decided to put a stop to it and just removed it in a software
update...

I heard speculation that it was about getting some preferential tax/customs
treatment for it because it could pass for a "computer" then.

WP article:
[https://en.wikipedia.org/wiki/OtherOS](https://en.wikipedia.org/wiki/OtherOS)

------
FullyFunctional
I worked on the Sony Cell tool support before it was released. Besides the
great point scott_s mentions, the Cell never lived up to the hype. The
(original) hype that got people excited was: a dual-thread, dual-issue 4.0 GHz
PowerPC core + 8 SPUs. Reality was: a 3.2 GHz processor with
performance-crippling bugs (load-hit-store was one), and 7 SPEs but really
only 5 usable (one was hardwired to the OS, another to sound processing). The
SPE had even worse bugs that made branches absurdly expensive. If that wasn't
enough, developers balked at the cache penalty of 64-bit code, so Sony imposed
a truly mindbogglingly terrible 32-bit mode that _didn't_ use the physical
32-bit mode but instead effectively emulated it (so effectively all arithmetic
had to truncate to 32 bits). Finally, the whole set of inter-SPU communication
primitives was way more expensive than it should have been.

There's more, but I have willfully flushed it from memory. Cell was a failed
design that deserved to die. I will forever after be skeptical of
"heterogeneous" designs like these.

~~~
FullyFunctional
Forgot a great bit: they originally thought the SPUs were good enough that
they didn't need a GPU. That NVIDIA GPU was the only thing that saved the PS3.

------
protomyth
"The Race for a New Game Machine"[1] is a pretty good read on the development
of the Cell and Xenon[2] processors. Giving up out-of-order execution seems
like a really bad decision, and the internal fighting at IBM, if accurate, is
really sad.

A side question: What in the POWER architecture makes it hard to implement? I
was told the addressing modes are complicated enough that an implementation
will always be slower and harder to create than for other processors. I'm
wondering if this is urban myth or has some basis in reality?

1) [http://www.amazon.com/The-Race-New-Game-
Machine/dp/080653101...](http://www.amazon.com/The-Race-New-Game-
Machine/dp/0806531010) with a lot of articles written about the book such as
[http://www.wsj.com/articles/SB123069467545545011](http://www.wsj.com/articles/SB123069467545545011)

2)
[https://en.wikipedia.org/wiki/Xenon_(processor)](https://en.wikipedia.org/wiki/Xenon_\(processor\))

~~~
trsohmers
It wasn't due to the POWER architecture, it was the SPEs, which were not based
on POWER. The biggest problem for compilers and developers of the day was the
fact that each of the SPE cores had its own local memory ("scratchpad"), which
was similar to a cache but not hardware-managed.

My company (REX Computing) is working on a new processor architecture that is
built around scratchpad memory... the main benefits of a scratchpad are that
it is lower latency (which basically means faster), uses less area and power,
and can be built into larger arrays than regular hardware-managed caches. The
big drawback, as seen with Cell, is that it is difficult for most programmers
to manage memory manually.

Our big advancement is that we've developed new compiler techniques (which
have been DARPA funded) that would not only make it a lot easier to program
the Cell architecture, but also allow us to go even further than Cell did in
making a highly parallelized and energy-efficient processor.

~~~
protomyth
Actually, I wasn't referring to the SPEs, I specifically heard the POWER
architecture had some stuff that made it hard to implement and slower than
other RISCs.

~~~
cbsmith
As was aptly demonstrated by x86 systems (both now and in that era),
transistor budgets had grown enough by the time Cell came of age that the
impact an ISA could have on overall performance was comparatively trivial.
POWER had a lot of funny things about it, but it wasn't that hard to get POWER
architectures to move quickly.

~~~
protomyth
You still have to implement the spec, and my question is about the spec. Yes,
we can overcome crap; that's how we ended up with the least-liked ISA as the
winner.

~~~
cbsmith
POWER inherited a lot of baggage from the IBM R/T, the first real go at RISC,
so it was the most un-RISC-like of the RISC ISAs. That inheritance included
all the addressing modes and a surprisingly rich instruction set, which were
all considered no-no's for unlocking the advantages of RISC.

~~~
protomyth
Thanks. Is there a document / blog post that tells exactly what the baggage
and addressing modes were? I've heard the problem but never found the actual
specifics.

~~~
cbsmith
Well, a fairly handy reference might be the PowerPC 601 ISA docs:
[http://www.freescale.com/files/32bit/doc/user_guide/MPC601UM...](http://www.freescale.com/files/32bit/doc/user_guide/MPC601UM.pdf)

Chapter 3 touches on memory addressing modes.

The 601 was more compatible with the full POWER ISA that came before it, but
looking at the docs for the POWER ISA itself would probably be more
instructive.

------
oppositelock
This thing was very hard to use from a practical perspective.

The PS3 was a popular console, and game developers had to get the most out of
it. I was one of those unfortunate people at times. The PS2's Emotion Engine
was already a pain to use (the emotion refers to anger, btw), but the Cell
raised my emotions to new levels.

The PPC core was slow and main memory latency was terrible, so you had to
tune code for memory access. No problem: this was a well-understood problem,
and workarounds were known, if painful.

Then, you had this wonderful array of SPEs. You could access five of them,
but what are these things? They're really fast vector units with a little bit
of local memory (256k, if I remember correctly) and no direct access to main
memory.

To write code for these things, you used a different compiler, and had to
manage the DMA engine yourself to keep their local memory filled with useful
input and to pull out the output. If you were doing streaming computation,
you basically had to schedule close to 100% utilization of the DMA engine and
double-buffer data in and out of these things (reducing useful memory to
128k/SPE). DMA had high bandwidth, but also high latency: from the time you
initiated a DMA, it was something like 1800 cycles before the first block
arrived. This turns into a really complicated memory and latency scheduling
problem very quickly. You have to identify specific workloads you can run
asynchronously, and then figure out how to chunk them up for DMA transfers.
All this bookkeeping work was actually taxing on the anemic PPC core, which
had a really difficult time with the performance of conditionals.

Bah! Good riddance.

------
api
I wrote some code for this beast once:

[http://adam.ierymenko.name/files/MersenneTwister32_spe.cpp](http://adam.ierymenko.name/files/MersenneTwister32_spe.cpp)

Was working on genetic algorithms at the time and managed to port one to run
on the Cell. It's a good "embarrassingly parallel" work load for that kind of
chip. The MT code above was the PRNG for the genetic algorithm.

Makes me want to hack GAs again... boy are those things barrels of fun.

~~~
Gigablah
I implemented Gaussian elimination on the Cell for my uni coursework... that
was fun.

SPE portion:
[https://gist.github.com/gigablah/1c72acbe718844310b10](https://gist.github.com/gigablah/1c72acbe718844310b10)

