
NVIDIA Announces Tesla Personal Supercomputer - jaydub
http://www.hpcwire.com/offthewire/NVIDIA_Announces_Tesla_Personal_Supercomputer.html
======
bootload
I took a look at the Tesla after watching an Nvidia demo with the Mythbusters
Adam & Jamie doing a simple comparison of CPU vs. GPU, which you can see here ~
<http://www.youtube.com/watch?v=fKK933KK6Gg>

Firstly, you can run the thing as either a card (cheaper, slower) or a standalone
machine (expensive, nobody lists the price). The CUDA toolkit, which is C based,
is Win/Lin 32/64-bit compatible and available for most mainstream distros ~
<http://www.nvidia.com/object/cuda_learn.html>

_"... The Tesla architecture is built around a scalable array of multithreaded
Streaming Multiprocessors (SMs) ..."_

If you think this is similar to mainstream development, think again.
Programming this machine is right down to the metal and reminds me of
programming the PS2 and other specialised consoles. You need to spend some
time reading the hardware manuals and understanding the architecture to use
the machine to its full capability.

~~~
SirWart
I had some friends who tried to port the Linpack benchmark to a small cluster
of computers each with 8 NVIDIA GPUs using CUDA, and they found that the
biggest bottleneck was the bandwidth to and from the GPUs. It's just hard to
keep the GPUs fed with enough data. They confirmed that the GPUs are both hard
to program and incredibly powerful.
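
Which makes sense: PCIe moves data at a few GB/s while the card's own memory is an order of magnitude faster, so a kernel that does little work per byte spends most of its wall time in the copies. A rough way to see it for yourself (a sketch, not their benchmark code; names are made up):

    // Compare time spent copying data over the bus with time spent computing on it.
    #include <stdio.h>
    #include <string.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= 2.0f;   // trivial amount of work per byte moved
    }

    int main(void)
    {
        const int n = 1 << 24;             // ~64 MB of floats
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        memset(h, 0, bytes);
        float *d;
        cudaMalloc(&d, bytes);

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0, 0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // bus transfer
        cudaEventRecord(t1, 0);
        scale<<<(n + 255) / 256, 256>>>(d, n);             // on-card compute
        cudaEventRecord(t2, 0);
        cudaEventSynchronize(t2);

        float copy_ms, kernel_ms;
        cudaEventElapsedTime(&copy_ms, t0, t1);
        cudaEventElapsedTime(&kernel_ms, t1, t2);
        // On low arithmetic-intensity kernels like this the copy dominates.
        printf("copy: %.2f ms, kernel: %.2f ms\n", copy_ms, kernel_ms);
        return 0;
    }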

~~~
DarkShikari
I tried CUDA programming as well and found that the threading model and such
is an utter nightmare. For certain tasks it's quite easy, such as upscaling or
filtering an image. However, such a task is _exactly the kind of task where
you'll end up bandwidth-limited anyways_. The kind of complex tasks where you
actually fully use the GPU processors are often the exact kind of situations
where the API will work against you every step of the way.

The "960 cores" moniker is also very misleading, as last I recall there were
really only (960/8) cores, with each core being able to run 8 instructions at
the same if all the instructions were exactly the same.
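
Put differently, the hardware is a much smaller number of wide SIMD processors, and threads that are grouped together but branch differently get serialized. A toy sketch (purely illustrative, made-up names) of code that looks parallel but stalls on divergence:

    // Grouped threads execute in lockstep; when they branch differently
    // the hardware runs each path serially with the other threads masked off.
    __global__ void divergent(const int *iters, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = 1.0f;

        // If iters[] differs between neighbouring threads, the whole group
        // waits for whichever thread loops longest.
        for (int k = 0; k < iters[i]; ++k)
            x = x * 1.0001f + 0.5f;

        if (i % 2 == 0)           // even/odd split inside a group: both
            x = x * x;            // branches are executed back to back,
        else                      // not in parallel
            x = -x;

        out[i] = x;
    }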

~~~
miloshh
What are some of these complex tasks you had in mind? I think that almost any
computation-bound task, however complex, could benefit from GPU acceleration.
But I would like to hear about exceptions.

~~~
DarkShikari
A motion search, for video compression, is what I was trying.

It has some huge disadvantages:

1. The threading model is completely unsuited to a search that takes a
different number of iterations per block (since it wants all the threads doing
the same thing).

2. CPUs already have the PSADBW instruction, which allows an absurd effective
throughput: it's literally an instruction dedicated to this kind of task. It's
also a purely 8-bit integer problem, so it doesn't benefit from the
high-performance floating point units on the GPU.

Due to 1), you're pretty much restricted to either crippling your performance,
using a very simplified search, or using an exhaustive search. And if you're
using an exhaustive search, it turns out that there's a mathematically
equivalent and vastly faster way to do it called "sequential elimination"...
which is completely impractical to implement on a GPU as well due to its
linear nature, and which allows a Core 2 to vastly outperform a GPU and
possibly even be competitive with a dedicated FPGA doing a normal exhaustive
search.
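
For reference, the inner loop is just a sum of absolute differences over a block of pixels, and PSADBW does 8 (MMX) or 16 (SSE2) of those bytes per instruction on the CPU. A naive exhaustive-search kernel looks something like this sketch (the candidate-per-thread layout and names are illustrative, not anyone's real code), and it is all small-integer work that never touches the GPU's floating point units:

    // Naive exhaustive block matching: one thread per candidate motion vector.
    // Launched e.g. as sad_search<<<2*range+1, 2*range+1>>>(...).
    #define BLK 16

    __global__ void sad_search(const unsigned char *ref, const unsigned char *cur,
                               int stride, int range, int *sads)
    {
        int mvx = (int)threadIdx.x - range;   // candidate vector, x
        int mvy = (int)blockIdx.x  - range;   // candidate vector, y

        // Assumption for this sketch: ref points at the current block's position
        // inside a padded reference frame, so negative offsets stay in bounds.
        int sad = 0;
        for (int y = 0; y < BLK; ++y)
            for (int x = 0; x < BLK; ++x) {
                int d = cur[y * stride + x] - ref[(y + mvy) * stride + (x + mvx)];
                sad += d < 0 ? -d : d;
            }

        sads[blockIdx.x * blockDim.x + threadIdx.x] = sad;
        // A reduction to pick the minimum SAD would follow; on a CPU the inner
        // loops collapse into a handful of PSADBW instructions instead, and an
        // adaptive early-terminating search would diverge badly here.
    }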

~~~
miloshh
Hmmm.. OK. You mind posting a link to the pseudo-code of the algorithm? The
problem of a different number of iterations per thread is quite common, but it
can often be fixed.

If all threads in a block have the same number of iterations, you're fine;
different blocks can take different numbers of iterations. As long as the
number of blocks is much higher than the number of SMs, the machine will
dynamically schedule the blocks to available SMs.

If each thread takes a different number of iterations, it's more difficult,
but you can still do dynamic allocation yourself, by running as many blocks as
you have SMs, and having each thread pick up work as it needs. You can also
randomize the assignment of work to threads, so the expected amount of work is
roughly the same. This all depends on the particular problem and data,
though...
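
"Having each thread pick up work as it needs" is essentially a global work queue: launch roughly as many blocks as there are SMs and let threads grab items off an atomic counter. A rough sketch of the idea (names are illustrative):

    // Persistent-block work queue: a global counter hands out items so that
    // uneven per-item cost doesn't leave whole blocks idle.
    __device__ int next_item = 0;   // reset between launches

    __global__ void worker(const int *items, int n_items, float *results)
    {
        while (true) {
            // Each thread grabs the next unprocessed item.
            int i = atomicAdd(&next_item, 1);
            if (i >= n_items)
                break;

            // Per-item work; cost may vary wildly between items.
            float acc = 0.0f;
            for (int k = 0; k < items[i]; ++k)
                acc += 1.0f / (k + 1);
            results[i] = acc;
        }
    }
    // Launched with roughly (number of SMs) blocks rather than one per item,
    // e.g. worker<<<num_sms, 128>>>(d_items, n, d_results);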

As for the 8-bit nature of the problem - this is true, you're not utilizing
the floating point units that are the biggest advantage of the GPU. How large
are the vectors you need to do PSADBW over?

~~~
wmf
_If all threads in a block have the same number of iterations, you're fine;
different blocks can take different numbers of iterations. As long as the
number of blocks is much higher than the number of SMs, the machine will
dynamically schedule the blocks to available SMs._

At this point the programmer's head has already exploded. MIMD systems (like
OpenMP on Larrabee) don't have any of this BS.
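
For contrast, the MIMD version of "each item takes a different amount of work" is a single pragma; the OpenMP runtime's dynamic schedule just hands iterations to whichever core is free. Plain host C sketch (function and array names are made up):

    // Plain C + OpenMP: variable work per iteration is handled by the runtime.
    #include <omp.h>

    void process_all(const int *items, int n, float *results)
    {
        // schedule(dynamic) doles out iterations to idle cores, so one slow
        // item doesn't stall the others; no blocks, warps or fibers involved.
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < items[i]; ++k)
                acc += 1.0f / (k + 1);
            results[i] = acc;
        }
    }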

~~~
miloshh
No, Larrabee will have exactly the same issues. Except threads will be called
fibers, and blocks will be called threads. Read the Larrabee paper from
Siggraph 08.

The way to get maximum performance out of a given chip area is to use SIMD,
and that's here to stay, with all the associated issues.

~~~
wmf
_Except threads will be called fibers, and blocks will be called threads._

That's how their rasterizer works, but I'm talking about using it as a regular
x86.

 _The way to get maximum performance out of a given chip area is to use SIMD,
and that's here to stay_

There's a big difference between MIMD+narrow SIMD and super wide SIMD.

~~~
miloshh
The Larrabee SIMD width will be 16 and NVIDIA's is currently 32, so that's
almost the same.

Not just the rasterizer, but any application that wants to take full advantage
of Larrabee will have to use the SIMD vector units to the max.

Larrabee might turn out to have great performance (which I hope), but if it
does, the reason will not be black magic or breaking laws of physics. The
reason will be SIMD.

~~~
wmf
For some reason it's easier for me to wrap my head around one thread driving a
16-wide SIMD unit than 32 threads that execute in lockstep. I know it ends up
being equivalent but it feels different.
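
One way to see the equivalence (purely a sketch, with made-up names): the same axpy loop written as one thread marching over the data for the vectorizer to widen, versus one CUDA thread per element that the hardware then runs in lockstep groups of 32 anyway.

    // View 1: one thread drives a wide SIMD unit; a vectorizing compiler
    // turns this loop into wide ops (16-wide on Larrabee-class hardware).
    void axpy_simd(float a, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    // View 2: one logical thread per element; the GPU groups 32 of them
    // into a warp and runs the warp in lockstep, so it's the same wide op.
    __global__ void axpy_cuda(float a, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] += a * x[i];
    }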

Also, on Larrabee you can execute a different kernel on each core, while on
GPUs you can't.

------
noonespecial
I like Tesla. He was under-appreciated in his time, brilliant, and did things
just for the joy of the science.

I just wish they'd stop naming products, rock bands, breakfast cereals, etc.
after him. It's starting to wear thin.

------
jm4
This sounds cool, but the page linked to is only a standard press release and
the link to the interesting stuff is buried in the marketing-speak. The fun
stuff is here: <http://www.nvidia.com/object/personal_computing.html>

------
light3
It'll be interesting to see how much these will cost. 4 teraflops of single-
precision calculation is very impressive; however, if you need double
precision, the speed is not quite as impressive at 400 Gflops - although still
very good compared to my laptop with 20 Gflops :)

The main drawback with these is that you have to use CUDA, which certainly
takes a while to wrap your mind around. I played with CUDA for a while but
decided it was too much effort for something which is very specialised and
might not become mainstream. Still, there seem to be many people using CUDA,
and lots of research roles - see the CUDA forums.

~~~
kahseng
It says "Available from VARs worldwide for under $10,000" in here
<http://www.nvidia.com/object/personal_computing.html>

Now I know what it felt like in the '80s when people were looking at the
mainframes/desktops of the time and wondering... can I afford this $10K
machine? :)

------
tsally
I guess I am a little skeptical about the market. First, I have to believe it
is rare that an _individual_ would need a supercomputer. Besides, any
individual that actually needs one is probably going to build his/her own.
Businesses might want these on a large scale, but then why market it as a
personal computer?

~~~
wmf
Lots of scientists, engineers, and finance people could use these. It's
personal in the sense that it's used by one person, not that people would buy
it for personal use.

Also, it doesn't really matter whether you build or buy; either way all the
cost is in the Tesla card and NVidia gets their money.

------
riobard
Reading the title, I initially thought it was something related to the
electric car ...

Anyway, I guess the problem is that most programmers have no clue how to
program a GPU. Programming a multi-core CPU is already very hard, and now
comes the GPU stuff ...

------
jcromartie
960 cores. Wow. I thought that I was in over my head trying to program 2 at a
time!

~~~
jodrellblank
What was it Joel Spolsky said about programmers counting? "It must work for 1.
Oh, there's more than 1? Then it must work for _any number_."

;)

