

C++ Library for Linear Algebra on Supercomputers - poulson
http://libelemental.org
Elemental is an open-source C++ library meant for performing dense linear algebra on tens of thousands of processes (via the Message Passing Interface). I have been the primary developer for the past four years, but a community is starting to emerge. In addition to the website, the project is also hosted on GitHub: http://github.com/elemental/Elemental
======
poulson
I'm happy to answer any questions about the pros and cons of the library (and
to remain quite objective). The main feature that the library currently lacks
that is available in ScaLAPACK is a parallel Schur decomposition, but one is
in the works.

I'm also happy to answer general questions about parallel linear algebra and
to point people to the appropriate literature.

~~~
dmlorenzetti
Do you see this getting put into the ACTS toolkit
([http://acts.nersc.gov/tools.html](http://acts.nersc.gov/tools.html))? I see
some on your project team are at Argonne, which would seem like a natural fit.

~~~
poulson
Honestly, I doubt it. Getting added to such a list has both political and
merit-based components.

~~~
dmlorenzetti
Interesting. I had the impression it was just meant to be a repository for
DOE-related numerical tools.

That's a shame, because having collections of useful tools like that makes it
a lot easier to keep track of options.

~~~
jeff_science
Given that a few dozen US-based students attend ACTS every year, I hardly see
how it is a game-changer w.r.t. the widespread acceptance of Elemental. I
think the internet is doing just fine so far.

------
temp453463343
Great... another C++ linear algebra library...

    
    
      boost::uBLAS
      eigen
      armadillo
      a dozen other...
    

Why not contribute to an existing project?

The recurring bifurcation of talent and resources in the open source
community is really disheartening. Can't we focus on one or two libraries and
make them actually good? Or at least fork off of something that already exists
and add your own features. I look at benchmarks of the existing tools and one
library will do one operation very efficiently, while another will do well on
something else. Often the differences in speed are huge (more than a factor of
10). So I end up having to flip a coin when choosing which library to use.

~~~
poulson
With all due respect, did you even read the title of the post? This is a
distributed-memory library, unlike all of the ones you just mentioned. This is
a fundamental difference in design and capability. The only related libraries
are ScaLAPACK, PLAPACK, and DPLASMA.

~~~
temp453463343
And why exactly can't that be made part of an existing library?

~~~
poulson
Essentially all distributed dense linear algebra libraries are built on top of
sequential dense linear algebra libraries (mine as well, though against the
sequential interface rather than any particular implementation). Distributed
libraries are at least an order of magnitude more complex than their sequential
counterparts.

~~~
temp453463343
I guess I don't get what you're saying:

The C++ linear algebra libraries are basically syntactic sugar for interacting
with optimized Fortran libraries like BLAS, LAPACK, etc. (I'm sure I'm
overlooking some complexity here, because the C++ libraries, while linking to
the same Fortran libraries, have very different run times.)
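
Roughly speaking, a high-level C = A * B in one of those libraries ends up
dispatching to a BLAS call along these lines (an illustrative sketch only;
each library's actual dispatch path and storage conventions differ):

    // Illustrative only: the kind of BLAS (CBLAS) call that C = A * B
    // ultimately reaches when a C++ wrapper library is linked against an
    // optimized BLAS.
    #include <cblas.h>
    #include <vector>
    
    int main()
    {
        const int n = 4;
        std::vector<double> A( n * n, 1.0 ), B( n * n, 1.0 ), C( n * n, 0.0 );
        // C := 1.0 * A * B + 0.0 * C (column-major, no transposes)
        cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                     n, n, n, 1.0, A.data(), n, B.data(), n,
                     0.0, C.data(), n );
        return 0;
    }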

Yours interacts with ScaLAPACK and other distributed memory libraries.

Why would it not be possible to simply extend, say, armadillo or eigen to
interact with distributed-memory libraries? If you need more syntax, then
extend the interface.

~~~
poulson
Elemental _implements_, as opposed to wraps, the distributed-memory
algorithms. In other words, it does not call any library like ScaLAPACK; it is
an alternative approach. Elemental builds on top of BLAS/LAPACK/MPI in order
to provide a nice interface to dense linear algebra on
clusters/supercomputers.

The other major difference is that sequential libraries can usually get away
with not making users think about where data resides. Data placement is of
fundamental importance in distributed libraries, and, for this reason, it is
usually a bad idea to think in terms of simply modifying a sequential API.
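
To make that concrete, here is a rough sketch of the flavor of driving a
distributed factorization with Elemental (the exact header, namespace, and
routine names vary across releases, so treat the names below as approximate
rather than the library's literal API):

    // Rough sketch only: names are approximate and may differ between
    // Elemental releases. The point is that the user works with a process
    // grid and distributed matrices, while BLAS/LAPACK/MPI do the work
    // underneath -- no ScaLAPACK involved.
    #include "elemental.hpp"
    using namespace elem;
    
    int main( int argc, char* argv[] )
    {
        Initialize( argc, argv );
        {
            Grid grid( mpi::COMM_WORLD );      // 2D process grid
            DistMatrix<double> A( grid );      // matrix distributed over the grid
            HermitianUniformSpectrum( A, 1000, 1, 2 ); // random HPD test matrix
            Cholesky( LOWER, A );              // distributed Cholesky factorization
        }
        Finalize();
        return 0;
    }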

~~~
temp453463343
So why not contribute to (or extend), say, ScaLAPACK to make it do what you
need, and add a wrapper to an existing linear algebra library?

Looking at ScaLAPACK, it's been developed since 1995. I've never touched it,
but it's probably many, many lines of code (and maybe a few PhD theses) with
all sorts of kinks worked out that would take you decades to iron out yourself.
Throwing out all that knowledge/work/man-hours and starting from scratch seems
like a waste.

~~~
jedbrown
Because ScaLAPACK has a cumbersome interface and poor internal design, relies
on incorrect premises that affect performance (block-cyclic vs. elemental-cyclic
distribution), and is buggy. In contrast, Elemental is a joy to use and is almost
always faster, often by a large margin (see the Elemental paper). Elemental is
not a toy project by any means; it is already the basis for a large share of
the interesting parallel linear algebra research and is used by many important
applications. Jack is currently the best researcher/implementer in the direct
linear algebra world. Anyone who knows me knows that I am a critical bastard
who does not throw such praise around lightly.
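
(For readers unfamiliar with the distinction mentioned above: a block-cyclic
distribution deals out nb-by-nb blocks of the matrix round-robin over the
process grid, while an elemental-cyclic distribution deals out individual
entries. A tiny sketch of the resulting owner computation, with made-up grid
dimensions and block size:)

    // Minimal sketch of the two layouts; the process-grid dimensions and
    // block size below are made up for illustration.
    #include <cstdio>
    
    int main()
    {
        const int r = 2, c = 3;       // hypothetical 2 x 3 process grid
        const int nb = 32;            // hypothetical block size (block-cyclic only)
        const int i = 1000, j = 777;  // some global entry A(i,j)
    
        // Elemental-style cyclic distribution: entries dealt out one at a time.
        const int ownerRowElem = i % r;
        const int ownerColElem = j % c;
    
        // ScaLAPACK-style block-cyclic distribution: nb x nb blocks dealt out.
        const int ownerRowBlock = ( i / nb ) % r;
        const int ownerColBlock = ( j / nb ) % c;
    
        std::printf( "elemental cyclic owner: (%d,%d)\n", ownerRowElem, ownerColElem );
        std::printf( "block-cyclic owner:     (%d,%d)\n", ownerRowBlock, ownerColBlock );
        return 0;
    }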

~~~
poulson
I appreciate the compliments, but I disagree with a few of your points:

1\. The question of what granularity to use when distributing the entries of
the matrix is a long and subtle one, and it doesn't have a clear winner for
every operation (the current conclusions are different for LU with partial
pivoting vs. reduction to tridiagonal form). I would by no means say that the
approach used by ScaLAPACK is _wrong_, but only that it is unnecessarily
complex and that only one of its operations purposefully targets the finest
granularity case.

2\. Again, I appreciate the compliment, but I don't think that arguments from
authority are valid, nor do I think that one can be the "best". I have a large
number of colleagues doing wonderful work, much of which I find extremely
impressive.

Also, it would be good to disclose that you're affiliated with the project you
were promoting.

------
n00b101
Without GPU support I don't see how this is a good idea, if we are talking
about HPC applications. Based on my understanding, a large number of dense
matrix operations can be significantly accelerated on GPUs.

~~~
dmlorenzetti
This targets distributed-memory applications, so it's not clear why failing to
support GPUs in the first iteration makes it a bad idea.

I imagine the authors could make a single node of the cluster use a GPU, if
they wanted.

~~~
n00b101
Maybe I should have just said that I think GPU support would be a great
addition.

Distributed memory and GPUs are not mutually exclusive. Multi-GPU clusters are
extremely common. In fact, the latest devices (e.g., the Tesla K10) have multiple
GPU processors packaged in a single card, so it is necessary for applications
to target multiple GPUs. There is explicit support for distributed-memory
applications in GPUs through the "GPUDirect" technology that allows peer-to-
peer DMA and RDMA transfers between GPUs.

Given that reports of 30-50x GPU performance gains (versus CPUs) are common,
the issue is important because it means solving a problem with (say) $10,000
of kit instead of $500,000.

~~~
jedbrown
30x to 50x claims are the result of a terrible CPU implementation. When
normalized by power consumption (e.g., TDP) or by acquisition cost (assume
high-end rather than consumer versions, as with all supercomputing centers),
you'll find about a 2x advantage for operations like DGEMM and much less for
other operations. Strong scaling is important for many supercomputing
applications, and accelerators like GPUs and Intel MIC are much slower for that
purpose.

For example, look at the MKL benchmarks for Xeon Phi
([http://software.intel.com/en-us/intel-mkl/](http://software.intel.com/en-us/intel-mkl/))
and normalize as 3 Sandy Bridge sockets per Xeon Phi (common configurations
pair one SB socket with one Xeon Phi; the Phi has more than twice the TDP and
costs more than twice as much). Don't forget to look at QR and Cholesky, for
which the Phi at best breaks even, but only for enormous matrix sizes.

~~~
m_mueller
It may sound strange, but one of the advantages of GPUs is actually the
programming model, which is why many applications suddenly perform 20-50x faster
instead of only the theoretical 5-7x.

Let me explain: one of the most common issues with x86 HPC applications coming
from the scientific crowd is a lack of vector optimization such as loop
unrolling. Even getting the right compiler flags for this kind of thing is
rather difficult. Another reason is a lack of understanding of how to program
for memory-bandwidth optimization. GPU programming, on the other hand,
especially with CUDA, is hard to get into at first, but once you have the right
formula you can apply it pretty easily to most common tasks. Getting to, say,
70% of model performance on a GPU is _much_ easier than on x86. One reason is
the implicitly bandwidth-optimized, idiomatic way of writing CUDA/OpenCL
programs as a set of scalar kernels applied over a whole data region - this
allows the programmer to think of block dimensions in an abstract way, with no
need to fiddle around manually with loops to achieve it. There is also no need
to use any intrinsics; plain C in idiomatic CUDA is enough.
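
As a CPU-side caricature of that idiom (all names below are made up for
illustration): the work is written as a scalar "kernel" for a single index,
exactly as the body of a CUDA/OpenCL kernel reads, and a separate launcher
decides how the index range is carved up, rather than the programmer
hand-tuning one big loop:

    // Caricature of the idiom in plain C++; names are made up.
    #include <cstddef>
    #include <vector>
    
    // "Kernel": scalar work for one element -- no loops, no intrinsics.
    inline void axpyKernel( std::size_t idx, double a,
                            const double* x, double* y )
    {
        y[idx] += a * x[idx];
    }
    
    // "Launcher": applies the kernel over the whole data region. On a GPU
    // this role is played by the <<<blocks,threads>>> launch configuration.
    template <typename Kernel, typename... Args>
    void launch( std::size_t n, Kernel kernel, Args... args )
    {
        for( std::size_t idx = 0; idx < n; ++idx )
            kernel( idx, args... );
    }
    
    int main()
    {
        const std::size_t n = 1 << 20;
        std::vector<double> x( n, 1.0 ), y( n, 2.0 );
        launch( n, axpyKernel, 3.0, x.data(), y.data() );
        return 0;
    }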

So, to wrap it up, there is more to GPU programming than just the hardware
itself: the software model actually makes a lot more sense than traditional
OOP/procedural programming for HPC, often resulting in higher-than-expected
speedups when going from idiomatic x86 to idiomatic GPGPU (since there is no
such thing as an easy-to-learn idiom for HPC x86 programming).

And btw. Xeon Phi is the result of Intel not understanding exactly this
interrelation, since it doesn't even have OpenCL support as of now.

~~~
jedbrown
This is actually very subtle. Intel's OpenCL stack is currently atrocious, but
if they ever make it decent, it will be able to vectorize. The problem is that
vectorization is only part of the equation and many (likely most) applications
today are limited in some way by memory and/or communication rather than by
vectorization. Meanwhile, vectorization is generally at odds with memory
bandwidth. CPU cache, shared memory on a GPU, and registers on a GPU are
limited and vectorization usually leads to less effective use of these
precious resources. Additionally, many kernels have inconvenient dependencies
(e.g., a 30-term equation of state, a 70-species reaction mechanism) that
force extremely low occupancy for typical GPU implementations (e.g., one
thread per quadrature point). If you try to make the parallelism finer grained
to reduce the register demands (break the quadrature point into several
parts), you either need physics-dependent decomposition within a thread block
or multiple kernel invocations (higher latency and the need to reload from
memory). All of these things are counterproductive for strong scaling. Recall
that a lot of applications are run near the limit of strong scaling and that
CPU network latencies are in the microsecond range while GPU kernel launch
overheads start at 20 microseconds (and you likely need several of them per
inter-node communication).

While it's true that there are kernels that are easier to optimize for a GPU,
I still think most claims of huge speedup are due to neglecting the CPU
implementation. See "Debunking the 100x GPU vs CPU Myth" and other papers on
this. And amidst all of that, there are a lot of applications that utterly
fail on GPUs, despite having lots of parallelism, just not at quite the right
granularity.

------
ssawyer06
Nice work, this is important stuff. Next step... sparse linear algebra!

~~~
poulson
Thanks! I actually work on a lot of fast/sparse linear algebra. For example,
see Clique:
[http://github.com/poulson/Clique](http://github.com/poulson/Clique)

~~~
ssawyer06
Interesting, a direct sparse solver (for structured sparse matrices?). The
name "clique" implies graph theory, so I was expecting to see distributed
iterative SVD. I've yet to see a good distributed SVD for huge real-
world/power-law graphs.

~~~
knepley
SLEPc, [http://www.grycap.upv.es/slepc/](http://www.grycap.upv.es/slepc/), has
distributed SVD for sparse matrices.

~~~
poulson
Yes, I should have mentioned this. SLEPc has the only distributed
implementation of partial reorthogonalization (the key component of high-
performance Krylov SVD and Hermitian eigensolvers) that I'm aware of.

